OpenSearch/docs/reference/analysis/analyzers/pattern-analyzer.asciidoc

[[analysis-pattern-analyzer]]
=== Pattern Analyzer

An analyzer of type `pattern` that can flexibly separate text into terms
via a regular expression. Accepts the following settings:

The following are settings that can be set for a `pattern` analyzer
type:

[cols="<,<",options="header",]
|===================================================================
|Setting |Description
|`lowercase` |Should terms be lowercased or not. Defaults to `true`.
|`pattern` |The regular expression pattern, defaults to `\W+`.
|`flags` |The regular expression flags.
|`stopwords` |A list of stopwords to initialize the stop filter with.
Defaults to an 'empty' stopword list coming[1.0.0.RC1, Previously 
defaulted to the English stopwords list]
|===================================================================

*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.

Flags should be pipe-separated, eg `"CASE_INSENSITIVE|COMMENTS"`. Check
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
Pattern API] for more details about `flags` options.

[float]
==== Pattern Analyzer Examples

In order to try out these examples, you should delete the `test` index
before running each example:

[source,js]
--------------------------------------------------
    curl -XDELETE localhost:9200/test
--------------------------------------------------

[float]
===== Whitespace tokenizer

[source,js]
--------------------------------------------------
    curl -XPUT 'localhost:9200/test' -d '
    {
        "settings":{
            "analysis": {
                "analyzer": {
                    "whitespace":{
                        "type": "pattern",
                        "pattern":"\\\\s+"
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz'
    # "foo,bar", "baz"
--------------------------------------------------

[float]
===== Non-word character tokenizer

[source,js]
--------------------------------------------------

    curl -XPUT 'localhost:9200/test' -d '
    {
        "settings":{
            "analysis": {
                "analyzer": {
                    "nonword":{
                        "type": "pattern",
                        "pattern":"[^\\\\w]+"
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz'
    # "foo,bar baz" becomes "foo", "bar", "baz"

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4'
    # "type_1","type_4"
--------------------------------------------------

[float]
===== CamelCase tokenizer

[source,js]
--------------------------------------------------

    curl -XPUT 'localhost:9200/test?pretty=1' -d '
    {
        "settings":{
            "analysis": {
                "analyzer": {
                    "camel":{
                        "type": "pattern",
                        "pattern":"([^\\\\p{L}\\\\d]+)|(?<=\\\\D)(?=\\\\d)|(?<=\\\\d)(?=\\\\D)|(?<=[\\\\p{L}&&[^\\\\p{Lu}]])(?=\\\\p{Lu})|(?<=\\\\p{Lu})(?=\\\\p{Lu}[\\\\p{L}&&[^\\\\p{Lu}]])"
                    }
                }
            }
        }
    }'

    curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d '
        MooseX::FTPClass2_beta
    '
    # "moose","x","ftp","class","2","beta"
--------------------------------------------------

The regex above is easier to understand as:

[source,js]
--------------------------------------------------

      ([^\\p{L}\\d]+)                 # swallow non letters and numbers,
    | (?<=\\D)(?=\\d)                 # or non-number followed by number,
    | (?<=\\d)(?=\\D)                 # or number followed by non-number,
    | (?<=[ \\p{L} && [^\\p{Lu}]])    # or lower case
      (?=\\p{Lu})                    #   followed by upper case,
    | (?<=\\p{Lu})                   # or upper case
      (?=\\p{Lu}                     #   followed by upper case
        [\\p{L}&&[^\\p{Lu}]]          #   then lower case
      )
--------------------------------------------------
Migrated documentation into the main repo 2013-08-29 01:24:34 +02:00			`[[analysis-pattern-analyzer]]`
			`=== Pattern Analyzer`

			An analyzer of type `pattern` that can flexibly separate text into terms
			`via a regular expression. Accepts the following settings:`

			The following are settings that can be set for a `pattern` analyzer
			`type:`

			`[cols="<,<",options="header",]`
			`\|===================================================================`
			`\|Setting \|Description`
			\|`lowercase` \|Should terms be lowercased or not. Defaults to `true`.
			\|`pattern` \|The regular expression pattern, defaults to `\W+`.
			\|`flags` \|The regular expression flags.
Default stopwords list should be `_none_` for all but language-specific analyzers `standard_html_strip` and `pattern` analyzer support stopwords which are set to the default `english` stopwords by default. Those analyzers should not use stopwords by default since they are language neutral Closes #4699 2014-01-13 14:21:26 +01:00			\|`stopwords` \|A list of stopwords to initialize the stop filter with.
			`Defaults to an 'empty' stopword list coming[1.0.0.RC1, Previously`
			`defaulted to the English stopwords list]`
Migrated documentation into the main repo 2013-08-29 01:24:34 +02:00			`\|===================================================================`

			`IMPORTANT: The regular expression should match the token separators,`
			`not the tokens themselves.`

			Flags should be pipe-separated, eg `"CASE_INSENSITIVE\|COMMENTS"`. Check
			`http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java`
			Pattern API] for more details about `flags` options.

			`[float]`
			`==== Pattern Analyzer Examples`

			In order to try out these examples, you should delete the `test` index
			`before running each example:`

			`[source,js]`
			`--------------------------------------------------`
			`curl -XDELETE localhost:9200/test`
			`--------------------------------------------------`

			`[float]`
			`===== Whitespace tokenizer`

			`[source,js]`
			`--------------------------------------------------`
			`curl -XPUT 'localhost:9200/test' -d '`
			`{`
			`"settings":{`
			`"analysis": {`
			`"analyzer": {`
			`"whitespace":{`
			`"type": "pattern",`
			`"pattern":"\\\\s+"`
			`}`
			`}`
			`}`
			`}`
			`}'`

			`curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz'`
			`# "foo,bar", "baz"`
			`--------------------------------------------------`

			`[float]`
			`===== Non-word character tokenizer`

			`[source,js]`
			`--------------------------------------------------`

			`curl -XPUT 'localhost:9200/test' -d '`
			`{`
			`"settings":{`
			`"analysis": {`
			`"analyzer": {`
			`"nonword":{`
			`"type": "pattern",`
			`"pattern":"[^\\\\w]+"`
			`}`
			`}`
			`}`
			`}`
			`}'`

			`curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz'`
			`# "foo,bar baz" becomes "foo", "bar", "baz"`

			`curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4'`
			`# "type_1","type_4"`
			`--------------------------------------------------`

			`[float]`
			`===== CamelCase tokenizer`

			`[source,js]`
			`--------------------------------------------------`

			`curl -XPUT 'localhost:9200/test?pretty=1' -d '`
			`{`
			`"settings":{`
			`"analysis": {`
			`"analyzer": {`
			`"camel":{`
			`"type": "pattern",`
			`"pattern":"([^\\\\p{L}\\\\d]+)\|(?<=\\\\D)(?=\\\\d)\|(?<=\\\\d)(?=\\\\D)\|(?<=[\\\\p{L}&&[^\\\\p{Lu}]])(?=\\\\p{Lu})\|(?<=\\\\p{Lu})(?=\\\\p{Lu}[\\\\p{L}&&[^\\\\p{Lu}]])"`
			`}`
			`}`
			`}`
			`}`
			`}'`

			`curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d '`
			`MooseX::FTPClass2_beta`
			`'`
			`# "moose","x","ftp","class","2","beta"`
			`--------------------------------------------------`

			`The regex above is easier to understand as:`

			`[source,js]`
			`--------------------------------------------------`

			`([^\\p{L}\\d]+) # swallow non letters and numbers,`
			`\| (?<=\\D)(?=\\d) # or non-number followed by number,`
			`\| (?<=\\d)(?=\\D) # or number followed by non-number,`
			`\| (?<=[ \\p{L} && [^\\p{Lu}]]) # or lower case`
			`(?=\\p{Lu}) # followed by upper case,`
			`\| (?<=\\p{Lu}) # or upper case`
			`(?=\\p{Lu} # followed by upper case`
			`[\\p{L}&&[^\\p{Lu}]] # then lower case`
			`)`
			`--------------------------------------------------`