[[analysis-pattern-analyzer]]
=== Pattern Analyzer

The `pattern` analyzer uses a regular expression to split the text into terms.
The regular expression should match the *token separators*, not the tokens
themselves. The regular expression defaults to `\W+` (all non-word characters).

[WARNING]
.Beware of Pathological Regular Expressions
========================================

The pattern analyzer uses
http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java Regular Expressions].

A badly written regular expression could run very slowly or even throw a
StackOverflowError and cause the node it is running on to exit suddenly.

Read more about http://www.regular-expressions.info/catastrophic.html[pathological regular expressions and how to avoid them].

========================================
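
As a purely illustrative sketch (the pattern and input below are invented for
this example, not taken from any real configuration), nested quantifiers are a
classic source of such catastrophic backtracking:

[source,regex]
--------------------------------------------------
(a+)+$    # on an input like "aaaaaaaaaaaaaaaaaaaaX", the engine retries
          # exponentially many ways of splitting the "a"s before failing
--------------------------------------------------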

[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "quick",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "word",
      "position": 3
    },
    {
      "token": "foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "word",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "word",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 8
    },
    {
      "token": "dog",
      "start_offset": 45,
      "end_offset": 48,
      "type": "word",
      "position": 9
    },
    {
      "token": "s",
      "start_offset": 49,
      "end_offset": 50,
      "type": "word",
      "position": 10
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 11
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
---------------------------

[float]
=== Configuration

The `pattern` analyzer accepts the following parameters:

[horizontal]
`pattern`::

    A http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html[Java regular expression], defaults to `\W+`.

`flags`::

    Java regular expression http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary[flags].
    Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`.

`lowercase`::

    Whether terms should be lowercased. Defaults to `true`.

`stopwords`::

    A pre-defined stop words list like `_english_` or an array containing a
    list of stop words. Defaults to `\_none_`.

`stopwords_path`::

    The path to a file containing stop words.

See the <<analysis-stop-tokenfilter,Stop Token Filter>> for more information
about stop word configuration.
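
As a hedged sketch of how these parameters combine (the index and analyzer
names here are invented for illustration), a single analyzer definition might
look like this:

[source,js]
----------------------------
PUT my_settings_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "pattern",
          "pattern": "\\W+",
          "flags": "CASE_INSENSITIVE|COMMENTS",
          "lowercase": true,
          "stopwords": "_english_"
        }
      }
    }
  }
}
----------------------------
// CONSOLE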

[float]
=== Example configuration

In this example, we configure the `pattern` analyzer to split email addresses
on non-word characters or on underscores (`\W|_`), and to lower-case the result:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_", <1>
          "lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
----------------------------
// CONSOLE

<1> The backslashes in the pattern need to be escaped when specifying the
    pattern as a JSON string.

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "john",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "smith",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "foo",
      "start_offset": 11,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "bar",
      "start_offset": 15,
      "end_offset": 18,
      "type": "word",
      "position": 3
    },
    {
      "token": "com",
      "start_offset": 19,
      "end_offset": 22,
      "type": "word",
      "position": 4
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ john, smith, foo, bar, com ]
---------------------------

[float]
==== CamelCase tokenizer

The following more complicated example splits CamelCase text into tokens:

[source,js]
--------------------------------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}
--------------------------------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "moose",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "x",
      "start_offset": 5,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "ftp",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "class",
      "start_offset": 11,
      "end_offset": 16,
      "type": "word",
      "position": 3
    },
    {
      "token": "2",
      "start_offset": 16,
      "end_offset": 17,
      "type": "word",
      "position": 4
    },
    {
      "token": "beta",
      "start_offset": 18,
      "end_offset": 22,
      "type": "word",
      "position": 5
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces the following terms:

[source,text]
---------------------------
[ moose, x, ftp, class, 2, beta ]
---------------------------

The regex above is easier to understand as:

[source,regex]
--------------------------------------------------
  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )
--------------------------------------------------

[float]
=== Definition

The `pattern` analyzer consists of:

Tokenizer::
* <<analysis-pattern-tokenizer,Pattern Tokenizer>>

Token Filters::
* <<analysis-lowercase-tokenfilter,Lower Case Token Filter>>
* <<analysis-stop-tokenfilter,Stop Token Filter>> (disabled by default)

If you need to customize the `pattern` analyzer beyond the configuration
parameters then you need to recreate it as a `custom` analyzer and modify
it, usually by adding token filters. This would recreate the built-in
`pattern` analyzer and you can use it as a starting point for further
customization:

[source,js]
----------------------------------------------------
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+" <1>
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase" <2>
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE
// TEST[s/\n$/\nstartyaml\n  - compare_analyzers: {index: pattern_example, first: pattern, second: rebuilt_pattern}\nendyaml\n/]

<1> The default pattern is `\W+` which splits on non-word characters
    and this is where you'd change it.
<2> You'd add other token filters after `lowercase`.
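
For example, a minimal sketch (the index and analyzer names below are
hypothetical) that extends the rebuilt analyzer with the `stop` token filter,
which the built-in `pattern` analyzer leaves disabled by default:

[source,js]
----------------------------------------------------
PUT /pattern_example_with_stop
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "analyzer": {
        "rebuilt_pattern_with_stop": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase",
            "stop"
          ]
        }
      }
    }
  }
}
----------------------------------------------------
// CONSOLE

The `stop` filter uses the `_english_` stop word list by default; configure a
custom `stop` filter if you need a different list.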