Intl.Segmenter: Unicode segmentation in JavaScript

Translator's preface



This is a translation of the explainer part of the Intl.Segmenter proposal, which will probably be added to the next ECMAScript specification.



The proposal is already implemented in V8 and can be used without a flag as of version 8.7 (more precisely, 8.7.38 and above), so it can be tested in Google Chrome Canary (starting with version 87.0.4252.0) or in Node.js V8 Canary (starting with version v15.0.0-v8-canary202009025a2ca762b8; Windows binaries are available as of v15.0.0-v8-canary202009173b56586162).



If you test in earlier versions with the --harmony-intl-segmenter flag, be careful: the specification has changed, and the implementation behind the flag may be outdated. Check it against the output shown in the code examples.



After the translation you will find links to materials on the problems this proposal addresses.






Intl.Segmenter: Unicode segmentation in JavaScript



The proposal is at Stage 3, championed by Richard Gibson.



Motivation



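A code point is not a "letter" or a unit displayed on screen. That role belongs to the grapheme, which can consist of several code points (for example, when it includes accent marks or conjoining Korean characters). Unicode defines a grapheme segmentation algorithm that finds the boundaries between graphemes. It can be useful for implementing advanced editors and input methods, or for other kinds of text processing.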



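Unicode also defines an algorithm for finding breaks between words and sentences, which CLDR (Common Locale Data Repository) tailors per locale. These boundaries are useful, for example, when implementing a text editor with commands for jumping between or highlighting words and sentences.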



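Grapheme, word, and sentence segmentation is defined in UAX 29. Web browsers already need an implementation of this segmentation to function, and exposing it to JavaScript saves memory and network bandwidth compared with having developers implement it themselves in JavaScript.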



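Chrome has long shipped its own nonstandard segmentation API, Intl.v8BreakIterator. For several reasons that API does not seem suitable for standardization. This explainer outlines a new API that tries to follow modern JavaScript API design as practiced since ES2015.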







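Objects returned by the segment() method of an Intl.Segmenter instance find boundaries and expose the segments between them through the Iterable interface.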



// Create a locale-specific word segmenter.
let segmenter = new Intl.Segmenter("fr", {granularity: "word"});

// Use it to segment a string.
let input = "Moi?  N'est-ce pas.";
let segments = segmenter.segment(input);

// Iterate over the segments!
for (let {segment, index, isWordLike} of segments) {
  console.log("segment at code units [%d, %d): «%s»%s",
    index, index + segment.length,
    segment,
    isWordLike ? " (word-like)" : ""
  );
}

// console.log output:
// segment at code units [0, 3): «Moi» (word-like)
// segment at code units [3, 4): «?»
// segment at code units [4, 6): «  »
// segment at code units [6, 11): «N'est» (word-like)
// segment at code units [11, 12): «-»
// segment at code units [12, 14): «ce» (word-like)
// segment at code units [14, 15): « »
// segment at code units [15, 18): «pas» (word-like)
// segment at code units [18, 19): «.»
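
As a small illustration of what the isWordLike flag makes possible, here is a minimal sketch of a word counter. The helper name countWords is hypothetical and not part of the proposal; it only relies on the iteration shown above.

```javascript
// Hypothetical helper (not part of the proposal): counts the word-like
// segments of `text` using a word-granularity segmenter for `locale`.
function countWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const { isWordLike } of segmenter.segment(text)) {
    // Spaces and punctuation are segments too, but they are not word-like.
    if (isWordLike) count++;
  }
  return count;
}

// Same input as the example above: «Moi», «N'est», «ce», «pas».
console.log(countWords("Moi?  N'est-ce pas.", "fr")); // → 4
```

Counting segments where isWordLike is true avoids the usual pitfalls of splitting on whitespace, such as punctuation glued to words or runs of several spaces.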


For more complex use cases, a lower-level API is also available.



// ┃0 1 2 3 4 5┃6┃7┃8┃9
// ┃A l l o n s┃-┃y┃!┃
let input = "Allons-y!";

let segmenter = new Intl.Segmenter("fr", {granularity: "word"});
let segments = segmenter.segment(input);
let current = undefined;

current = segments.containing(0)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segments.containing(5)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segments.containing(6)
// → { index: 6, segment: "-", isWordLike: false }

current = segments.containing(current.index + current.segment.length)
// → { index: 7, segment: "y", isWordLike: true }

current = segments.containing(current.index + current.segment.length)
// → { index: 8, segment: "!", isWordLike: false }

current = segments.containing(current.index + current.segment.length)
// → undefined


API






new Intl.Segmenter(locale, options)



Creates a new locale-dependent Segmenter.



If options is provided, it is treated as an object whose granularity property sets the segmenter granularity: "grapheme", "word", or "sentence". The default is "grapheme".
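
To make the three granularities concrete, the following sketch segments short strings at each level. The helper split is hypothetical. resolvedOptions() is not described in the text above; it is used here on the assumption that the engine ships the final Intl.Segmenter API.

```javascript
// Hypothetical helper: returns the segment strings of `text`
// for the given granularity, using the "en" locale.
function split(text, granularity) {
  const segmenter = new Intl.Segmenter("en", { granularity });
  return Array.from(segmenter.segment(text), (data) => data.segment);
}

const text = "Hi there. Bye!";
console.log(split(text, "sentence")); // → ["Hi there. ", "Bye!"]
console.log(split(text, "word"));     // → ["Hi", " ", "there", ".", " ", "Bye", "!"]

// "e" followed by a combining acute accent: two code units, one grapheme.
const accented = "e\u0301";
console.log(accented.length);                    // → 2
console.log(split(accented, "grapheme").length); // → 1

// Without options, the granularity defaults to "grapheme".
console.log(new Intl.Segmenter("en").resolvedOptions().granularity); // → "grapheme"
```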



Intl.Segmenter.prototype.segment(string)



Creates a new Iterable %Segments% instance for the input string, using the Segmenter's locale and granularity.





Segments are described by plain objects with the following data properties:



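  • segment: the string of the segment.
  • index: the code unit index in the input string at which the segment begins.
  • input: the string being segmented.
  • isWordLike: true when the granularity is "word" and the segment is word-like (made up of letters, numbers, ideographs, and so on); false when the granularity is "word" and the segment is not word-like (made up of spaces, punctuation, and so on); undefined when the granularity is not "word".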
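
The properties listed above can be inspected directly. This is a minimal sketch that assumes an engine shipping the final API; the variable names are illustrative only.

```javascript
const input = "Hi there";

// Word granularity: every segment data object carries isWordLike.
const words = new Intl.Segmenter("en", { granularity: "word" });
const rows = [];
for (const data of words.segment(input)) {
  rows.push([data.index, data.segment, data.isWordLike, data.input === input]);
}
console.log(rows);
// → [ [0, "Hi", true, true], [2, " ", false, true], [3, "there", true, true] ]

// Grapheme granularity: isWordLike reads as undefined.
const graphemes = new Intl.Segmenter("en", { granularity: "grapheme" });
const [first] = graphemes.segment(input);
console.log(first.segment, first.index, first.isWordLike); // → "H" 0 undefined
```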


Methods of %Segments%.prototype:



%Segments%.prototype.containing(index)



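Returns a segment data object describing the segment of the string that includes the code unit at the given index, or undefined if the index is out of bounds.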
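
One possible use of containing() is looking up the segment under a cursor position, or stepping backwards from a known segment. The sketch below assumes word granularity; the helper name segmentAt is hypothetical.

```javascript
// Hypothetical helper: returns the word-granularity segment data object that
// covers the code unit at `position`, or undefined when it is out of bounds.
function segmentAt(text, position, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return segmenter.segment(text).containing(position);
}

const text = "Allons-y!";
console.log(segmentAt(text, 3, "fr").segment);    // → "Allons"
console.log(segmentAt(text, -1, "fr"));           // → undefined (before the start)
console.log(segmentAt(text, text.length, "fr"));  // → undefined (past the end)

// Stepping backwards: the previous segment contains the code unit
// just before the current segment's start index.
const current = segmentAt(text, 7, "fr");               // «y», index 7
const previous = segmentAt(text, current.index - 1, "fr");
console.log(previous.segment);                          // → "-"
```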



%Segments%.prototype[Symbol.iterator]



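Creates a new %SegmentIterator% instance that finds segments in the input string lazily (on demand), using the Segmenter's locale and granularity, and keeps track of its position in the string.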



Methods of %SegmentIterator%.prototype:



%SegmentIterator%.prototype.next()



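The next() method implements the Iterator interface: it finds the next segment and returns an IteratorResult object whose value property is a segment data object as described above.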
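
The iterator protocol can also be driven by hand, which shows what a for...of loop does behind the scenes. This is a minimal sketch with illustrative variable names, assuming an engine that ships the final API.

```javascript
const segments = new Intl.Segmenter("en", { granularity: "word" }).segment("a b");

// Ask the Segments object for an iterator and pull results one at a time.
const iterator = segments[Symbol.iterator]();
const seen = [];
let result = iterator.next();
while (!result.done) {
  seen.push(result.value.segment);
  result = iterator.next();
}
console.log(seen);   // → ["a", " ", "b"]
console.log(result); // → { value: undefined, done: true }

// Each call to [Symbol.iterator]() starts a fresh iterator at the beginning.
const again = segments[Symbol.iterator]();
console.log(again.next().value.index); // → 0
```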



FAQ



? ?



— , . . . CLDR. , CLDR/ICU , .



API ?



, 3- , . TC39 . ; , , .



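Why is line breaking not included?



Line breaking was included in an earlier version of this API but was removed, because a line-breaking API at this level would not be sufficient on its own: line breaking is mainly needed for text layout, which calls for a higher-level API. Such an API is expected to be provided by CSS Houdini.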



?



API:



  • .
  • .
  • , (.. Web API (Web Platform), ECMAScript).
  • , . CLDR ICU . CSS, . . , , , ; .


?



%SegmentIterator%.prototype, (, seek([inclusiveStartIndex = thisIterator.index + 1]) seekBefore([exclusiveLastIndex = thisIterator.index]), . ECMA-262 ( ). , , .



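Why is this an Intl API rather than String methods?



All of these break types are locale-dependent, and some allow complex options. The result of the segment() method is a SegmentIterator. For many non-trivial cases like this one, analogous APIs live in the Intl object of ECMA-402. This lets the work done at each instantiation be shared, which improves performance. A convenience method on String could be added in a follow-up proposal.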
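
To illustrate the point about sharing instantiation work, here is a hedged sketch of a user-land convenience function that reuses Segmenter instances. Both getSegmenter and segmentsOf are hypothetical names; nothing like them is defined by the proposal.

```javascript
// Hypothetical cache of Segmenter instances, keyed by locale and granularity,
// so the setup cost of each configuration is paid only once.
const segmenterCache = new Map();

function getSegmenter(locale, granularity) {
  const key = `${locale}|${granularity}`;
  let segmenter = segmenterCache.get(key);
  if (!segmenter) {
    segmenter = new Intl.Segmenter(locale, { granularity });
    segmenterCache.set(key, segmenter);
  }
  return segmenter;
}

// Hypothetical stand-in for a String convenience method:
// returns the segment strings of `text`.
function segmentsOf(text, locale = "en", granularity = "grapheme") {
  return Array.from(getSegmenter(locale, granularity).segment(text), (d) => d.segment);
}

console.log(segmentsOf("Allons-y!", "fr", "word")); // → ["Allons", "-", "y", "!"]
console.log(getSegmenter("fr", "word") === getSegmenter("fr", "word")); // → true
```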



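What exactly does the index refer to?



An index n refers to a code unit index at which a segment may start. For example, when the string "Hello, world\u{1F499}" (the last character is a blue heart emoji) is iterated by words in English, segments start at indexes 0, 5, 6, 7, and 12. That is, the string is segmented as ┃Hello┃,┃ ┃world┃\u{1F499}┃, where the final segment is two code units (a surrogate pair) that encode a single code point.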
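
The index arithmetic for this string can be checked directly. This is a minimal sketch with illustrative variable names, assuming an engine that ships the final API.

```javascript
const input = "Hello, world\u{1F499}";
const segmenter = new Intl.Segmenter("en", { granularity: "word" });

// Start index (in code units) of every word-granularity segment.
const starts = Array.from(segmenter.segment(input), (data) => data.index);
console.log(starts); // → [0, 5, 6, 7, 12]

// The final segment is one code point stored as two code units.
const last = segmenter.segment(input).containing(12);
console.log(last.segment.length);      // → 2 (code units)
console.log([...last.segment].length); // → 1 (code point)

// Index 13 is the second half of the surrogate pair, so it maps
// to the same segment, which still starts at index 12.
console.log(segmenter.segment(input).containing(13).index); // → 12
```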



What happens when iteration reaches the end of the string?



Iteration finishes, and next() reports it as done.



What happens if the index passed is not a number, or not an integer?



, - QA ;)



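The argument is converted to a Number: null becomes 0, booleans become 0 or 1, strings are parsed as numbers, objects are converted to primitives, and Symbol and BigInt values, undefined, and NaN fail*. After that, any fractional part is truncated.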



* Translator's note: it is not clear what "fail" means here. In Chrome Canary, Symbol and BigInt throw a TypeError, while undefined and NaN are treated as 0.
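
The conversions can be observed with a short experiment. This sketch reflects the behavior reported in the note above for Chrome Canary and assumes other engines shipping the final API behave the same way. The input string is chosen so that every code unit index starts its own segment.

```javascript
const segments = new Intl.Segmenter("en", { granularity: "word" }).segment("a b c");
// Segments start at indexes 0 ("a"), 1 (" "), 2 ("b"), 3 (" "), and 4 ("c").

console.log(segments.containing(null).index);      // → 0 (null becomes 0)
console.log(segments.containing(true).index);      // → 1 (true becomes 1)
console.log(segments.containing("2").index);       // → 2 (string parsed as a number)
console.log(segments.containing(3.9).index);       // → 3 (fraction truncated)
console.log(segments.containing(undefined).index); // → 0 (treated as 0)
console.log(segments.containing(NaN).index);       // → 0 (treated as 0)

// Symbol and BigInt arguments throw instead of being converted.
const thrown = [];
for (const bad of [Symbol("x"), 1n]) {
  try {
    segments.containing(bad);
    thrown.push(false);
  } catch (error) {
    thrown.push(error instanceof TypeError);
  }
}
console.log(thrown); // → [true, true]
```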








Further reading on Unicode in JavaScript.



  1. Joel Spolsky. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
  2. Dmitri Pavlutin. What every JavaScript developer should know about Unicode
  3. Dr. Axel Rauschmayer. JavaScript for impatient programmers: 17. Unicode – a brief introduction
  4. Dr. Axel Rauschmayer. JavaScript for impatient programmers: 18.6. Atoms of text: Unicode characters, JavaScript characters, grapheme clusters
  5. Jonathan New. "\u{1F4A9}".length === 2
  6. Nicolás Bevacqua. ES6 Strings (and Unicode, ❤) in Depth
  7. Mathias Bynens. JavaScript has a Unicode problem
  8. Mathias Bynens. Unicode-aware regular expressions in ECMAScript 6
  9. Mathias Bynens. Unicode property escapes in JavaScript regular expressions
  10. Mathias Bynens. Unicode sequence property escapes
  11. Awesome Unicode: a curated list of delightful Unicode tidbits, packages and resources


