[[analysis-pattern-analyzer]]
=== Pattern Analyzer

An analyzer of type `pattern` that splits text into terms using a regular
expression. It accepts the following settings:

[horizontal]
`lowercase`::   Whether terms should be lowercased. Defaults to `true`.
`pattern`::     The regular expression pattern. Defaults to `\W+`.
`flags`::       The regular expression flags.
`stopwords`::   A list of stopwords to initialize the stop filter with.
                Defaults to an empty stopword list. See
                <<analysis-stop-analyzer,Stop Analyzer>> for more details.

*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.
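
The following Java sketch illustrates the separator semantics outside of a cluster. It is only an approximation under stated assumptions: the class `PatternSplitSketch` and its `analyze` helper are hypothetical names, and the split, lowercase, stopword order is assumed here rather than taken from the analyzer's source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

public class PatternSplitSketch {

    // Hypothetical helper: approximates the analyzer by splitting the text
    // wherever the separator regex matches, then applying the other settings.
    static List<String> analyze(String separatorRegex, boolean lowercase,
                                Set<String> stopwords, String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : Pattern.compile(separatorRegex).split(text)) {
            if (piece.isEmpty()) {
                continue; // skip empty pieces, e.g. when the text starts with a separator
            }
            String term = lowercase ? piece.toLowerCase(Locale.ROOT) : piece;
            if (!stopwords.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> noStopwords = Collections.emptySet();

        // The regex matches the separators (whitespace), so the terms are kept.
        System.out.println(String.join(" | ",
                analyze("\\s+", true, noStopwords, "foo,bar baz")));

        // A regex that matches the tokens themselves removes them instead:
        // only the punctuation and whitespace between them survive.
        System.out.println(String.join(" | ",
                analyze("\\w+", true, noStopwords, "foo,bar baz")));
    }
}
```

The second call shows the mistake the note above warns about: a pattern such as `\w+` describes the tokens, so splitting on it discards exactly the text that should have been indexed.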

Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`. See the
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java Pattern API]
for details about the available `flags` options.
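
As a rough illustration of what a pipe-separated flag string means in terms of the Java Pattern API, the sketch below resolves each name to the matching `java.util.regex.Pattern` constant and combines them into one bitmask. The class `PatternFlagsSketch` and its `parseFlags` helper are hypothetical and are not the analyzer's own parser.

```java
import java.util.regex.Pattern;

public class PatternFlagsSketch {

    // Hypothetical helper: looks up each flag name as a public constant on
    // java.util.regex.Pattern and ORs the values together.
    static int parseFlags(String flags) {
        int mask = 0;
        for (String name : flags.split("\\|")) {
            try {
                mask |= Pattern.class.getField(name.trim()).getInt(null);
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Unknown flag: " + name, e);
            }
        }
        return mask;
    }

    public static void main(String[] args) {
        int mask = parseFlags("CASE_INSENSITIVE|COMMENTS");

        // COMMENTS lets the regex carry whitespace and a trailing "# ..." note;
        // CASE_INSENSITIVE makes "x" match "X" as well.
        Pattern separator = Pattern.compile("x+  # one or more x, in either case", mask);

        // prints: foo | bar | baz
        System.out.println(String.join(" | ", separator.split("fooXbarxxbaz")));
    }
}
```

The names in the flag string are the constant names listed in the field summary of the Java Pattern API, so a misspelled name fails here with an `IllegalArgumentException`.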

[float]
==== Pattern Analyzer Examples

To try out these examples, delete the `test` index before running each one.

[float]
===== Whitespace tokenizer

[source,js]
--------------------------------------------------
DELETE test

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace": {
          "type": "pattern",
          "pattern": "\\s+"
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=whitespace&text=foo,bar baz

# "foo,bar", "baz"
--------------------------------------------------
// AUTOSENSE

[float]
===== Non-word character tokenizer

[source,js]
--------------------------------------------------
DELETE test

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nonword": {
          "type": "pattern",
          "pattern": "[^\\w]+" <1>
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=nonword&text=foo,bar baz
# "foo,bar baz" becomes "foo", "bar", "baz"

GET /test/_analyze?analyzer=nonword&text=type_1-type_4
# "type_1","type_4"
--------------------------------------------------
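// AUTOSENSE
<1> The regex `[^\w]+` (written as `[^\\w]+` in JSON) matches one or more
non-word characters. The underscore counts as a word character, so `type_1`
is kept as a single term.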


[float]
===== CamelCase tokenizer

[source,js]
--------------------------------------------------
DELETE test

PUT /test?pretty=1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET /test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
# "moose","x","ftp","class","2","beta"
--------------------------------------------------
// AUTOSENSE

The regex above is easier to understand as:

[source,js]
--------------------------------------------------

  ([^\p{L}\d]+)                 # swallow anything that is not a letter or number,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )
--------------------------------------------------
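
The behavior of this regex can also be checked locally with plain Java, since the `pattern` setting uses Java regular expression syntax. The sketch below is an approximation, not the analyzer itself: the class `CamelCaseSplitSketch` and its `split` helper are hypothetical, and lowercasing is applied by hand to mimic the default `lowercase` setting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class CamelCaseSplitSketch {

    // Same regex as in the "camel" analyzer definition; the backslash
    // escaping in a Java string literal matches the JSON escaping.
    static final Pattern CAMEL = Pattern.compile(
            "([^\\p{L}\\d]+)"
            + "|(?<=\\D)(?=\\d)"
            + "|(?<=\\d)(?=\\D)"
            + "|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})"
            + "|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])");

    // Hypothetical helper: split on the separator regex, drop empty pieces,
    // and lowercase each term.
    static List<String> split(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : CAMEL.split(text)) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints: moose | x | ftp | class | 2 | beta
        System.out.println(String.join(" | ", split("MooseX::FTPClass2_beta")));
    }
}
```

Most of the alternatives are zero-width lookarounds, so they mark a boundary between two characters without consuming either of them. Only the first alternative removes text (here `::` and `_`).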