=== Char Group Tokenizer
The `char_group` tokenizer breaks text into terms whenever it encounters
a
character which is in a defined set. It is mostly useful for cases where
a simple
custom tokenization is desired, and the overhead of use of the
<<analysis-pattern-tokenizer, `pattern` tokenizer>>
is not acceptable.
=== Configuration
The `char_group` tokenizer accepts one parameter:
`tokenize_on_chars`::
A string containing a list of characters to tokenize the string on.
Whenever a character
from this list is encountered, a new token is started. Also supports
escaped values like `\\n` and `\\f`,
and in addition `\\s` to represent whitespace, `\\d` to represent
digits and `\\w` to represent letters.
Defaults to an empty list.
=== Example output
```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2```
When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the
above sentence would produce the following terms:
```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone,
for, $2 ]```
Expose the experimental simplepattern and
simplepatternsplit tokenizers in the common
analysis plugin. They provide tokenization based
on regular expressions, using Lucene's
deterministic regex implementation that is usually
faster than Java's and has protections against
creating too-deep stacks during matching.
Both have a not-very-useful default pattern of the
empty string because all tokenizer factories must
be able to be instantiated at index creation time.
They should always be configured by the user
in practice.
* Docs: Improved tokenizer docs
Added descriptions and runnable examples
* Addressed Nik's comments
* Added TESTRESPONSEs for all tokenizer examples
* Added TESTRESPONSEs for all analyzer examples too
* Added docs, examples, and TESTRESPONSES for character filters
* Skipping two tests:
One interprets "$1" as a stack variable - same problem exists with the REST tests
The other because the "took" value is always different
* Fixed tests with "took"
* Fixed failing tests and removed preserve_original from fingerprint analyzer
Add `irish` analyzer
Add `sorani` analyzer (Kurdish)
Add `classic` tokenizer: specific to english text and tries to recognize hostnames, companies, acronyms, etc.
Add `thai` tokenizer: segments thai text into words.
Add `classic` tokenfilter: cleans up acronyms and possessives from classic tokenizer
Add `apostrophe` tokenfilter: removes text after apostrophe and the apostrophe itself
Add `german_normalization` tokenfilter: umlaut/sharp S normalization
Add `hindi_normalization` tokenfilter: accounts for hindi spelling differences
Add `indic_normalization` tokenfilter: accounts for different unicode representations in Indian languages
Add `sorani_normalization` tokenfilter: normalizes kurdish text
Add `scandinavian_normalization` tokenfilter: normalizes Norwegian, Danish, Swedish text
Add `scandinavian_folding` tokenfilter: much more aggressive form of `scandinavian_normalization`
Add additional languages to stemmer tokenfilter: `galician`, `minimal_galician`, `irish`, `sorani`, `light_nynorsk`, `minimal_nynorsk`
Add support access to default Thai stopword set "_thai_"
Fix some bugs and broken links in documentation.
Closes#5935