OpenSearch/docs/reference/analysis/analyzers/pattern-analyzer.asciidoc

129 lines
3.2 KiB
Plaintext
Raw Normal View History

[[analysis-pattern-analyzer]]
=== Pattern Analyzer
An analyzer of type `pattern` that can flexibly separate text into terms
via a regular expression. Accepts the following settings:
The following are settings that can be set for a `pattern` analyzer
type:
[horizontal]
`lowercase`:: Should terms be lowercased or not. Defaults to `true`.
`pattern`:: The regular expression pattern, defaults to `\W+`.
`flags`:: The regular expression flags.
`stopwords`:: A list of stopwords to initialize the stop filter with.
Defaults to an 'empty' stopword list Check
<<analysis-stop-analyzer,Stop Analyzer>> for more details.
*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.
Flags should be pipe-separated, eg `"CASE_INSENSITIVE|COMMENTS"`. Check
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
Pattern API] for more details about `flags` options.
[float]
==== Pattern Analyzer Examples
In order to try out these examples, you should delete the `test` index
before running each example.
[float]
===== Whitespace tokenizer
[source,js]
--------------------------------------------------
DELETE test
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"whitespace": {
"type": "pattern",
"pattern": "\\s+"
}
}
}
}
}
GET /test/_analyze?analyzer=whitespace&text=foo,bar baz
# "foo,bar", "baz"
--------------------------------------------------
// AUTOSENSE
[float]
===== Non-word character tokenizer
[source,js]
--------------------------------------------------
DELETE test
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"nonword": {
"type": "pattern",
"pattern": "[^\\w]+" <1>
}
}
}
}
}
GET /test/_analyze?analyzer=nonword&text=foo,bar baz
# "foo,bar baz" becomes "foo", "bar", "baz"
GET /test/_analyze?analyzer=nonword&text=type_1-type_4
# "type_1","type_4"
--------------------------------------------------
// AUTOSENSE
[float]
===== CamelCase tokenizer
[source,js]
--------------------------------------------------
DELETE test
PUT /test?pretty=1
{
"settings": {
"analysis": {
"analyzer": {
"camel": {
"type": "pattern",
"pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
}
}
}
}
}
GET /test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
# "moose","x","ftp","class","2","beta"
--------------------------------------------------
// AUTOSENSE
The regex above is easier to understand as:
[source,js]
--------------------------------------------------
([^\p{L}\d]+) # swallow non letters and numbers,
| (?<=\D)(?=\d) # or non-number followed by number,
| (?<=\d)(?=\D) # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]]) # or lower case
(?=\p{Lu}) # followed by upper case,
| (?<=\p{Lu}) # or upper case
(?=\p{Lu} # followed by upper case
[\p{L}&&[^\p{Lu}]] # then lower case
)
--------------------------------------------------