130 lines
4.1 KiB
Plaintext
130 lines
4.1 KiB
Plaintext
[[analysis-pattern-analyzer]]
|
|
=== Pattern Analyzer
|
|
|
|
An analyzer of type `pattern` that can flexibly separate text into terms
|
|
via a regular expression. Accepts the following settings:
|
|
|
|
The following are settings that can be set for a `pattern` analyzer
|
|
type:
|
|
|
|
[cols="<,<",options="header",]
|
|
|===================================================================
|
|
|Setting |Description
|
|
|`lowercase` |Should terms be lowercased or not. Defaults to `true`.
|
|
|`pattern` |The regular expression pattern, defaults to `\W+`.
|
|
|`flags` |The regular expression flags.
|
|
|`stopwords` |A list of stopwords to initialize the stop filter with.
|
|
Defaults to an 'empty' stopword list coming[1.0.0.RC1, Previously
|
|
defaulted to the English stopwords list]
|
|
|===================================================================
|
|
|
|
*IMPORTANT*: The regular expression should match the *token separators*,
|
|
not the tokens themselves.
|
|
|
|
Flags should be pipe-separated, eg `"CASE_INSENSITIVE|COMMENTS"`. Check
|
|
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
|
|
Pattern API] for more details about `flags` options.
|
|
|
|
[float]
|
|
==== Pattern Analyzer Examples
|
|
|
|
In order to try out these examples, you should delete the `test` index
|
|
before running each example:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XDELETE localhost:9200/test
|
|
--------------------------------------------------
|
|
|
|
[float]
|
|
===== Whitespace tokenizer
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
curl -XPUT 'localhost:9200/test' -d '
|
|
{
|
|
"settings":{
|
|
"analysis": {
|
|
"analyzer": {
|
|
"whitespace":{
|
|
"type": "pattern",
|
|
"pattern":"\\\\s+"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}'
|
|
|
|
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz'
|
|
# "foo,bar", "baz"
|
|
--------------------------------------------------
|
|
|
|
[float]
|
|
===== Non-word character tokenizer
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
|
|
curl -XPUT 'localhost:9200/test' -d '
|
|
{
|
|
"settings":{
|
|
"analysis": {
|
|
"analyzer": {
|
|
"nonword":{
|
|
"type": "pattern",
|
|
"pattern":"[^\\\\w]+"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}'
|
|
|
|
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz'
|
|
# "foo,bar baz" becomes "foo", "bar", "baz"
|
|
|
|
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4'
|
|
# "type_1","type_4"
|
|
--------------------------------------------------
|
|
|
|
[float]
|
|
===== CamelCase tokenizer
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
|
|
curl -XPUT 'localhost:9200/test?pretty=1' -d '
|
|
{
|
|
"settings":{
|
|
"analysis": {
|
|
"analyzer": {
|
|
"camel":{
|
|
"type": "pattern",
|
|
"pattern":"([^\\\\p{L}\\\\d]+)|(?<=\\\\D)(?=\\\\d)|(?<=\\\\d)(?=\\\\D)|(?<=[\\\\p{L}&&[^\\\\p{Lu}]])(?=\\\\p{Lu})|(?<=\\\\p{Lu})(?=\\\\p{Lu}[\\\\p{L}&&[^\\\\p{Lu}]])"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}'
|
|
|
|
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d '
|
|
MooseX::FTPClass2_beta
|
|
'
|
|
# "moose","x","ftp","class","2","beta"
|
|
--------------------------------------------------
|
|
|
|
The regex above is easier to understand as:
|
|
|
|
[source,js]
|
|
--------------------------------------------------
|
|
|
|
([^\\p{L}\\d]+) # swallow non letters and numbers,
|
|
| (?<=\\D)(?=\\d) # or non-number followed by number,
|
|
| (?<=\\d)(?=\\D) # or number followed by non-number,
|
|
| (?<=[ \\p{L} && [^\\p{Lu}]]) # or lower case
|
|
(?=\\p{Lu}) # followed by upper case,
|
|
| (?<=\\p{Lu}) # or upper case
|
|
(?=\\p{Lu} # followed by upper case
|
|
[\\p{L}&&[^\\p{Lu}]] # then lower case
|
|
)
|
|
--------------------------------------------------
|