Normalization
Unicode Normalization
Apply Unicode normalization forms for consistent string representation:

use Cline\Babel\Babel;
use Normalizer;

// Default: NFC (Canonical Decomposition, followed by Canonical Composition)
Babel::from('café')->normalize();

// NFD: Canonical Decomposition
Babel::from('café')->normalize(Normalizer::NFD);

// NFKC: Compatibility Decomposition, followed by Canonical Composition
Babel::from('ﬁ')->normalize(Normalizer::NFKC); // "fi"

// NFKD: Compatibility Decomposition
Babel::from('①')->normalize(Normalizer::NFKD); // "1"

Normalization Forms

| Form | Description | Use Case |
|---|---|---|
| NFC | Composed characters | Default, web content |
| NFD | Decomposed characters | Sorting, searching |
| NFKC | Compatibility composed | Search normalization |
| NFKD | Compatibility decomposed | Maximum decomposition |
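
To see why the form matters, the sketch below calls PHP's intl `Normalizer` class directly rather than Babel. It assumes the intl extension is loaded, and the variable names are illustrative only.

```php
<?php
// Assumption: the intl extension is available (it provides Normalizer).

$composed   = "caf\u{00E9}";  // 'é' as a single code point (U+00E9)
$decomposed = "cafe\u{0301}"; // 'e' followed by a combining acute accent (U+0301)

// Both strings render the same but differ byte for byte.
var_dump($composed === $decomposed); // bool(false)

// After normalizing both to NFC they compare equal.
$a = Normalizer::normalize($composed, Normalizer::FORM_C);
$b = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($a === $b); // bool(true)

// NFD splits the accent off again: 4 ASCII letters plus a 2-byte combining mark.
$nfdLength = strlen(Normalizer::normalize($composed, Normalizer::FORM_D));
var_dump($nfdLength); // int(6)
```

Comparing or hashing strings without agreeing on a form first can treat visually identical input as different values, which is the main reason to normalize at the boundary.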
Remove BOM
Strip byte-order marks from the beginning of strings:
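
As a rough picture of what BOM removal involves, the sketch below checks for the three byte sequences used in this section with plain PHP. `stripLeadingBom()` is a hypothetical helper, not part of Babel, and the library may recognize more encodings than this.

```php
<?php
// Hypothetical helper for illustration: removes one leading BOM if present.
function stripLeadingBom(string $bytes): string
{
    $boms = [
        "\xEF\xBB\xBF", // UTF-8
        "\xFE\xFF",     // UTF-16 big-endian
        "\xFF\xFE",     // UTF-16 little-endian
    ];

    foreach ($boms as $bom) {
        // Compare only the first strlen($bom) bytes of the input.
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return substr($bytes, strlen($bom));
        }
    }

    return $bytes; // No BOM found: return the input unchanged.
}

var_dump(stripLeadingBom("\xEF\xBB\xBFHello")); // string(5) "Hello"
var_dump(stripLeadingBom('Hello'));             // string(5) "Hello"
```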
// UTF-8 BOM
Babel::from("\xEF\xBB\xBFHello")->removeBom()->value(); // "Hello"

// UTF-16 BE BOM
Babel::from("\xFE\xFFHello")->removeBom()->value(); // "Hello"

// UTF-16 LE BOM
Babel::from("\xFF\xFEHello")->removeBom()->value(); // "Hello"

// No BOM (unchanged)
Babel::from('Hello')->removeBom()->value(); // "Hello"

Remove Non-Printable Characters
Strip characters that don’t render visibly (preserves tabs, newlines, carriage returns):
// Null byte
Babel::from("Hello\x00World")->removeNonPrintable()->value(); // "HelloWorld"

// Bell character
Babel::from("Hello\x07World")->removeNonPrintable()->value(); // "HelloWorld"

// Preserves whitespace
Babel::from("Hello\tWorld\n")->removeNonPrintable()->value(); // "Hello\tWorld\n"

Remove Control Characters
Strip all ASCII control characters (including tabs and newlines):
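
The difference from removeNonPrintable() comes down to whether tab, line feed, and carriage return are part of the set being removed. The sketch below contrasts the two sets with plain `preg_replace()` calls. The character classes are assumptions chosen for illustration (ASCII only) and may not match the ranges Babel uses internally.

```php
<?php
$input = "A\tB\x00C\n";

// Every C0 control character plus DEL, tabs and newlines included.
$allControls = preg_replace('/[\x00-\x1F\x7F]/', '', $input);

// The same class with tab (\x09), line feed (\x0A) and carriage return (\x0D) left out.
$keepWhitespace = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $input);

var_dump($allControls);            // string(3) "ABC"
var_dump(strlen($keepWhitespace)); // int(5): the tab and the newline are kept
```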
// Removes all control chars
Babel::from("Hello\tWorld\n")->removeControlChars()->value(); // "HelloWorld"

// Null and bell
Babel::from("Hello\x00\x07World")->removeControlChars()->value(); // "HelloWorld"

Remove Invisible Characters
Strip zero-width and invisible Unicode characters:
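
These characters are easy to miss because they take up bytes without changing what is displayed. The following sketch, in plain PHP without Babel, shows the effect on string length. The pattern lists only the five code points used in this section's examples, which is an assumption; the library may cover more.

```php
<?php
// Code points from this section: ZWSP, ZWNJ, ZWJ, word joiner, inline BOM.
$pattern = '/[\x{200B}\x{200C}\x{200D}\x{2060}\x{FEFF}]/u';

$input = "Hello\u{200B}World";
var_dump(strlen($input)); // int(13): the zero-width space occupies 3 bytes in UTF-8

$clean = preg_replace($pattern, '', $input);
var_dump($clean);         // string(10) "HelloWorld"
```

Without this step, two strings that look identical on screen can fail an equality check or a database lookup.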
// Zero-width space
Babel::from("Hello\u{200B}World")->removeInvisible()->value(); // "HelloWorld"

// Zero-width non-joiner
Babel::from("Hello\u{200C}World")->removeInvisible()->value(); // "HelloWorld"

// Zero-width joiner
Babel::from("Hello\u{200D}World")->removeInvisible()->value(); // "HelloWorld"

// Byte order mark (inline)
Babel::from("Hello\u{FEFF}World")->removeInvisible()->value(); // "HelloWorld"

// Word joiner
Babel::from("Hello\u{2060}World")->removeInvisible()->value(); // "HelloWorld"

Remove Emoji
Strip emoji characters from strings:

Babel::from('Hello 👋 World 🌍')->removeEmoji()->value();
// "Hello World "

Babel::from('Great job! 🎉👏')->removeEmoji()->value();
// "Great job! "

// No emoji (unchanged)
Babel::from('Hello World')->removeEmoji()->value();
// "Hello World"

Remove Script
Strip all characters from a specific Unicode script:

// Remove Cyrillic
Babel::from('Hello Привет World')->removeScript('Cyrillic')->value();
// "Hello World"

// Remove Han (Chinese)
Babel::from('Hello 世界 World')->removeScript('Han')->value();
// "Hello World"

// Remove Arabic
Babel::from('Hello مرحبا World')->removeScript('Arabic')->value();
// "Hello World"

Remove Diacritics
Strip accents and other diacritical marks from characters:
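
A common way to do this, sketched here with plain intl and PCRE calls, is to decompose the text to NFD and then drop the combining marks. This illustrates the general technique and is not necessarily how removeDiacritics() is implemented. `stripMarks()` is a hypothetical name, and the intl extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: decompose, remove combining marks, recompose.
function stripMarks(string $text): string
{
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);

    // \p{Mn} matches nonspacing marks such as U+0301 (combining acute accent).
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);

    return Normalizer::normalize($stripped, Normalizer::FORM_C);
}

var_dump(stripMarks('café')); // string(4) "cafe"
var_dump(stripMarks('Żółć')); // "Zołc": 'ł' has no canonical decomposition, so it stays
```

The second call shows why letters such as Polish 'ł' are untouched: the stroke is part of the code point itself, so there is no separate mark to remove.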
// Accented characters
Babel::from('café')->removeDiacritics()->value(); // "cafe"
Babel::from('Ñoño')->removeDiacritics()->value(); // "Nono"
Babel::from('naïve')->removeDiacritics()->value(); // "naive"

// Note: some characters like Polish 'ł' are distinct letters, not diacritics
Babel::from('Żółć')->removeDiacritics()->value(); // "Zołc"

// Plain ASCII unchanged
Babel::from('Hello')->removeDiacritics()->value(); // "Hello"

Collapse Whitespace
Collapse runs of whitespace into single spaces and trim the ends:
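
The behavior is close to a regular-expression replace followed by a trim. The sketch below shows that equivalent in plain PHP. `collapseSpaces()` is a hypothetical helper, and it handles ASCII whitespace only, which may be narrower than what Babel treats as whitespace.

```php
<?php
// Hypothetical helper: one space per whitespace run, no leading or trailing space.
function collapseSpaces(string $text): string
{
    // preg_replace() returns null only on a regex error; fall back to the input then.
    $collapsed = preg_replace('/\s+/', ' ', $text) ?? $text;

    return trim($collapsed);
}

var_dump(collapseSpaces("  Hello\t\n\tWorld  ")); // string(11) "Hello World"
```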
// Multiple spaces
Babel::from('Hello    World')->collapseWhitespace()->value();
// "Hello World"

// Mixed whitespace (tabs, newlines)
Babel::from("Hello\t\n\tWorld")->collapseWhitespace()->value();
// "Hello World"

// Trims leading/trailing whitespace
Babel::from(' Hello World ')->collapseWhitespace()->value();
// "Hello World"

Custom Transliteration
Apply ICU transliteration rules for advanced transformations:

// Default: Any-Latin; Latin-ASCII
Babel::from('Żółć')->transliterate()->value(); // "Zolc"
Babel::from('北京')->transliterate()->value(); // "bei jing"
Babel::from('Москва')->transliterate()->value(); // "Moskva"

// Case conversion
Babel::from('HELLO')->transliterate('Upper; Lower')->value(); // "hello"
Babel::from('hello')->transliterate('Lower; Title')->value(); // "Hello"

// Custom rules
Babel::from('café')->transliterate('NFD; [:Nonspacing Mark:] Remove; NFC')->value();
// "cafe"

Error Handling
use Cline\Babel\Exceptions\TransliterationException;

try {
    Babel::from('text')->transliterate('Invalid-Rules');
} catch (TransliterationException $e) {
    // Handle invalid transliteration rules
}

Chaining Transformations
Combine multiple cleaning operations:

$cleaned = Babel::from($dirtyInput)
    ->removeBom()
    ->removeInvisible()
    ->removeNonPrintable()
    ->normalize()
    ->value();

Use Cases
File Content Processing

function cleanFileContent(string $content): string
{
    return Babel::from($content)
        ->removeBom()
        ->removeNonPrintable()
        ->normalize()
        ->value() ?? '';
}

User Input Sanitization

function sanitizeUserInput(string $input): string
{
    return Babel::from($input)
        ->removeInvisible()
        ->removeControlChars()
        ->normalize()
        ->value() ?? '';
}

Emoji-Free Content

function stripEmoji(string $text): string
{
    return Babel::from($text)
        ->removeEmoji()
        ->value() ?? '';
}

Preparing for Search

function normalizeForSearch(string $query): string
{
    return Babel::from($query)
        ->normalize(Normalizer::NFKC)
        ->transliterate('Any-Latin; Latin-ASCII; Lower')
        ->value() ?? '';
}
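
For comparison, roughly the same search pipeline can be written against the intl extension alone. The sketch assumes that transliterate() accepts standard ICU transform IDs, as the examples in this document suggest. `normalizeForSearchWithIntl()` is a hypothetical name, not a Babel function.

```php
<?php
// Hypothetical intl-only counterpart to normalizeForSearch(); requires ext-intl.
function normalizeForSearchWithIntl(string $query): string
{
    // Fold compatibility characters (ligatures, circled digits) first.
    $folded = Normalizer::normalize($query, Normalizer::FORM_KC);
    if ($folded === false) {
        return '';
    }

    // Transliterator::create() returns null when the transform ID is invalid.
    $transliterator = Transliterator::create('Any-Latin; Latin-ASCII; Lower');
    if ($transliterator === null) {
        return '';
    }

    $result = $transliterator->transliterate($folded);

    return $result === false ? '' : $result;
}

var_dump(normalizeForSearchWithIntl('Żółć')); // string(4) "zolc"
```

Applying the same function to both the indexed text and the incoming query lets differently typed variants of a word land on the same key.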