[[analysis-simplepattern-tokenizer]]
=== Simple Pattern Tokenizer

experimental[]

The `simplepattern` tokenizer uses a regular expression to capture matching
text as terms. The set of regular expression features it supports is more
limited than the <<analysis-pattern-tokenizer,`pattern`>> tokenizer, but the
tokenization is generally faster.

This tokenizer does not support splitting the input on a pattern match, unlike
the <<analysis-pattern-tokenizer,`pattern`>> tokenizer. To split on pattern
matches using the same restricted regular expression subset, see the
<<analysis-simplepatternsplit-tokenizer,`simplepatternsplit`>> tokenizer.
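
For a concrete contrast, here is a minimal, untested sketch that passes an
inline tokenizer definition to the `_analyze` API: splitting on `-` keeps the
text between the dashes, whereas capturing `-` with `simplepattern` would
return only the dashes themselves.

[source,js]
----------------------------
POST _analyze
{
  "tokenizer": {
    "type": "simplepatternsplit",
    "pattern": "-"
  },
  "text": "fd-786-335-514-x"
}
----------------------------
// NOTCONSOLE

This sketch would be expected to produce the terms `[ fd, 786, 335, 514, x ]`.
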
This tokenizer uses {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
For an explanation of the supported features and syntax, see <<regexp-syntax,Regular Expression Syntax>>.

The default pattern is the empty string, which produces no terms. This
tokenizer should always be configured with a non-default pattern.
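
As a quick illustration of this point, analyzing any text with the
unconfigured, built-in `simplepattern` tokenizer should return an empty token
list, since the empty pattern matches nothing. A minimal sketch:

[source,js]
----------------------------
POST _analyze
{
  "tokenizer": "simplepattern",
  "text": "fd-786-335-514-x"
}
----------------------------
// NOTCONSOLE
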
[float]
=== Configuration

The `simplepattern` tokenizer accepts the following parameters:

[horizontal]
`pattern`::

    {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.

[float]
=== Example configuration

This example configures the `simplepattern` tokenizer to produce terms that are
three-digit numbers.
[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simplepattern",
          "pattern": "[0123456789]{3}"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "fd-786-335-514-x"
}
----------------------------
// CONSOLE
/////////////////////

[source,js]
----------------------------
{
  "tokens" : [
    {
      "token" : "786",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "335",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "514",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "word",
      "position" : 2
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces these terms:

[source,text]
---------------------------
[ 786, 335, 514 ]
---------------------------
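
Since character ranges are part of the supported Lucene syntax, the pattern
`[0123456789]{3}` in the example above could presumably also be written as
`[0-9]{3}` with identical results.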