[[analysis-edgengram-tokenizer]]
=== Edge NGram Tokenizer

A tokenizer of type `edgeNGram`.

This tokenizer is very similar to the `nGram` tokenizer, but it only keeps
n-grams that start at the beginning of a token.

The following settings can be set for an `edgeNGram` tokenizer
type:

[cols="<,<,<",options="header",]
|=======================================================================
|Setting |Description |Default value
|`min_gram` |Minimum size in codepoints of a single n-gram |`1`

|`max_gram` |Maximum size in codepoints of a single n-gram |`2`

|`token_chars` |Character classes to keep in the
tokens. Elasticsearch will split on characters that don't belong to any
of these classes. |`[]` (Keep all characters)
|=======================================================================

`token_chars` accepts the following character classes:

[horizontal]
`letter`::      for example `a`, `b`, `ï` or `京`
`digit`::       for example `3` or `7`
`whitespace`::  for example `" "` or `"\n"`
`punctuation`:: for example `!` or `"`
`symbol`::      for example `$` or `√`

[float]
==== Example

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/test' -d '
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_edge_ngram_analyzer" : {
                    "tokenizer" : "my_edge_ngram_tokenizer"
                }
            },
            "tokenizer" : {
                "my_edge_ngram_tokenizer" : {
                    "type" : "edgeNGram",
                    "min_gram" : "2",
                    "max_gram" : "5",
                    "token_chars": [ "letter", "digit" ]
                }
            }
        }
    }
}'

curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_edge_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, Scha, Schal, 04
--------------------------------------------------
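
To make the output above easier to follow, here is a minimal Python sketch of the splitting and prefix logic. It is an illustration only, not the Elasticsearch or Lucene implementation: the function names `char_class` and `edge_ngrams` are hypothetical, and the mapping from Unicode general categories to the `token_chars` classes is an assumption that approximates the classes listed earlier.

```python
import unicodedata


def char_class(ch):
    """Approximate the token_chars class of one character (assumed mapping)."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "letter"
    if category == "Nd":
        return "digit"
    if ch.isspace():
        return "whitespace"
    if category.startswith("P"):
        return "punctuation"
    if category.startswith("S"):
        return "symbol"
    return None


def edge_ngrams(text, min_gram=1, max_gram=2, token_chars=()):
    """Split text on characters outside token_chars, then emit prefixes."""
    keep = set(token_chars)
    runs, current = [], []
    for ch in text:
        # An empty token_chars list keeps every character.
        if not keep or char_class(ch) in keep:
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))

    grams = []
    for run in runs:
        # Only n-grams anchored at the start of each run are kept.
        for size in range(min_gram, min(max_gram, len(run)) + 1):
            grams.append(run[:size])
    return grams


print(edge_ngrams("FC Schalke 04", 2, 5, ["letter", "digit"]))
# ['FC', 'Sc', 'Sch', 'Scha', 'Schal', '04']
```

With the default settings (`min_gram` 1, `max_gram` 2, empty `token_chars`), the sketch treats the whole input as a single run and returns only `F` and `FC`, which is why setting `token_chars` usually matters in practice.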

[float]
==== `side` deprecated

There used to be a `side` parameter up to `0.90.1`, but it is now deprecated. To
emulate the behavior of `"side" : "BACK"`, a
<<analysis-reverse-tokenfilter,`reverse` token filter>> should be used together
with the <<analysis-edgengram-tokenfilter,`edgeNGram` token filter>>. The
`edgeNGram` filter must be enclosed in `reverse` filters like this:

[source,js]
--------------------------------------------------
    "filter" : ["reverse", "edgeNGram", "reverse"]
--------------------------------------------------

which essentially reverses the token, builds front edge n-grams, and reverses
the ngram again. This has the same effect as the previous `"side" : "BACK"` setting.
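
The reverse, edge n-gram, reverse chain can be sketched in a few lines of Python. This is only an illustration of the idea under simple assumptions, not the actual filter code: `front_edge_ngrams` and `back_edge_ngrams` are hypothetical helper names, and each token is treated as a plain string.

```python
def front_edge_ngrams(token, min_gram=1, max_gram=2):
    """Return prefixes of token that are min_gram to max_gram characters long."""
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


def back_edge_ngrams(token, min_gram=1, max_gram=2):
    """Emulate "side" : "BACK" as reverse -> front edge n-grams -> reverse."""
    reversed_token = token[::-1]
    grams = front_edge_ngrams(reversed_token, min_gram, max_gram)
    # Reversing each n-gram again restores the original character order.
    return [gram[::-1] for gram in grams]


print(back_edge_ngrams("Schalke", 2, 5))
# ['ke', 'lke', 'alke', 'halke']
```

The result contains suffixes of the token rather than prefixes, which is the behavior the old `"side" : "BACK"` setting provided.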