[[analysis-pattern-analyzer]]
=== Pattern Analyzer

An analyzer of type `pattern` that can flexibly separate text into terms
via a regular expression.

The following settings can be set for a `pattern` analyzer type:

[horizontal]
`lowercase`::   Should terms be lowercased or not. Defaults to `true`.
`pattern`::     The regular expression pattern. Defaults to `\W+`.
`flags`::       The regular expression flags.
`stopwords`::   A list of stopwords to initialize the stop filter with.
                Defaults to an 'empty' stopword list. Check <> for more details.

*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.

Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`. Check
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java Pattern API]
for more details about `flags` options.

[float]
==== Pattern Analyzer Examples

In order to try out these examples, you should delete the `test` index
before running each example.
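The separator rule and the `flags` setting described above can be tried outside Elasticsearch with plain `java.util.regex`. The following is a minimal sketch, not the analyzer's actual implementation: the class `SeparatorSplitSketch` and its `tokenize` helper are hypothetical names, and dropping empty pieces is an assumption made for readability. It shows that the pattern is applied to the gaps between terms, and that a pipe-separated flag string such as `"CASE_INSENSITIVE|COMMENTS"` corresponds to OR-ing the `Pattern` constants of the same names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

final class SeparatorSplitSketch {

    // Hypothetical helper: split on the separator regex, optionally
    // lowercase each piece, and drop empty pieces (an assumption of this sketch).
    static List<String> tokenize(String text, String separatorRegex, int flags, boolean lowercase) {
        Pattern separators = Pattern.compile(separatorRegex, flags);
        List<String> terms = new ArrayList<>();
        for (String piece : separators.split(text)) {
            if (piece.isEmpty()) {
                continue;
            }
            terms.add(lowercase ? piece.toLowerCase(Locale.ROOT) : piece);
        }
        return terms;
    }

    public static void main(String[] args) {
        // The default-style pattern \W+ matches the separators between terms.
        System.out.println(tokenize("Foo,bar baz", "\\W+", 0, true));
        // prints [foo, bar, baz]

        // Flags combined with bitwise OR, mirroring "CASE_INSENSITIVE|COMMENTS".
        int flags = Pattern.CASE_INSENSITIVE | Pattern.COMMENTS;
        System.out.println(tokenize("fooXbarxbaz", "x", flags, true));
        // prints [foo, bar, baz]

        // A pattern that matches the tokens themselves (\w+) removes them,
        // so only the punctuation survives: "," and " ".
        System.out.println(tokenize("foo,bar baz", "\\w+", 0, true).size());
        // prints 2
    }
}
```

In the last call the words are treated as separators and discarded, which is why the pattern should describe what lies between terms.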
[float]
===== Whitespace tokenizer

[source,js]
--------------------------------------------------
DELETE test

PUT /test
{
    "settings": {
        "analysis": {
            "analyzer": {
                "whitespace": {
                    "type": "pattern",
                    "pattern": "\\s+"
                }
            }
        }
    }
}

GET /test/_analyze?analyzer=whitespace&text=foo,bar baz
# "foo,bar", "baz"
--------------------------------------------------
// AUTOSENSE

[float]
===== Non-word character tokenizer

[source,js]
--------------------------------------------------
DELETE test

PUT /test
{
    "settings": {
        "analysis": {
            "analyzer": {
                "nonword": {
                    "type": "pattern",
                    "pattern": "[^\\w]+"
                }
            }
        }
    }
}

GET /test/_analyze?analyzer=nonword&text=foo,bar baz
# "foo,bar baz" becomes "foo", "bar", "baz"

GET /test/_analyze?analyzer=nonword&text=type_1-type_4
# "type_1","type_4"
--------------------------------------------------
// AUTOSENSE

[float]
===== CamelCase tokenizer

[source,js]
--------------------------------------------------
DELETE test

PUT /test?pretty=1
{
    "settings": {
        "analysis": {
            "analyzer": {
                "camel": {
                    "type": "pattern",
                    "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
                }
            }
        }
    }
}

GET /test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
# "moose","x","ftp","class","2","beta"
--------------------------------------------------
// AUTOSENSE

The regex above is easier to understand as:

[source,js]
--------------------------------------------------
  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )
--------------------------------------------------
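To experiment with the CamelCase pattern without an index, the same regular expression can be run through `java.util.regex` directly. This is a rough sketch under stated assumptions, not how the analyzer is implemented internally: `CamelCaseSplitSketch` and `tokenize` are hypothetical names, lowercasing stands in for the `lowercase` default of `true`, and empty pieces are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

final class CamelCaseSplitSketch {

    // The pattern from the "camel" analyzer above, one alternative per line.
    static final Pattern CAMEL = Pattern.compile(
          "([^\\p{L}\\d]+)"                                   // run of non-letters and non-digits
        + "|(?<=\\D)(?=\\d)"                                  // boundary: non-digit, then digit
        + "|(?<=\\d)(?=\\D)"                                  // boundary: digit, then non-digit
        + "|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})"             // boundary: lower case, then upper case
        + "|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])");    // boundary: upper, then upper + lower

    // Hypothetical helper: split on the separator pattern, lowercase each
    // piece, and skip empty pieces (an assumption of this sketch).
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : CAMEL.split(text)) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("MooseX::FTPClass2_beta"));
        // prints [moose, x, ftp, class, 2, beta]
    }
}
```

Four of the five alternatives are zero-width lookarounds, so they mark split points without consuming any characters; only the first alternative removes text (here `::` and `_`).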