8 lines
337 B
Plaintext
8 lines
337 B
Plaintext
[[analysis-letter-tokenizer]]
|
|
=== Letter Tokenizer
|
|
|
|
A tokenizer of type `letter` that divides text at non-letters. That's to
|
|
say, it defines tokens as maximal strings of adjacent letters. Note,
|
|
this does a decent job for most European languages, but does a terrible
|
|
job for some Asian languages, where words are not separated by spaces.
|