[[analysis-pattern-analyzer]]
=== Pattern Analyzer

An analyzer of type `pattern` that uses a regular expression to split
text into terms. The following settings can be set for a `pattern`
analyzer:

[horizontal]
`lowercase`::   Whether terms should be lowercased. Defaults to `true`.
`pattern`::     The regular expression pattern. Defaults to `\W+`.
`flags`::       The regular expression flags.
`stopwords`::   A list of stopwords to initialize the stop filter with.
                Defaults to an empty stopword list. See
                <<analysis-stop-analyzer,Stop Analyzer>> for more details.

*IMPORTANT*: The regular expression should match the *token separators*,
not the tokens themselves.
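
As a minimal sketch of this point (plain `java.util.regex`, not the
analyzer's actual implementation), the following splits text on the
default `\W+` separator pattern and lowercases each piece, roughly
mirroring the `pattern` and `lowercase` settings. The class and method
names are hypothetical.

[source,java]
--------------------------------------------------
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class SeparatorSplitSketch {
    // Hypothetical helper: splits on matches of separatorRegex and
    // optionally lowercases each resulting term.
    public static List<String> terms(String text, String separatorRegex, boolean lowercase) {
        List<String> out = new ArrayList<>();
        for (String piece : Pattern.compile(separatorRegex).split(text)) {
            if (piece.isEmpty()) {
                continue; // a separator at the start of the text yields an empty piece
            }
            out.add(lowercase ? piece.toLowerCase(Locale.ROOT) : piece);
        }
        return out;
    }

    public static void main(String[] args) {
        // The pattern matches the separators, so the terms survive.
        System.out.println(terms("Foo,Bar baz", "\\W+", true)); // prints [foo, bar, baz]

        // A pattern that matches the tokens themselves leaves only
        // the separators "," and " " behind.
        System.out.println(terms("Foo,Bar baz", "\\w+", true));
    }
}
--------------------------------------------------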

Flags should be pipe-separated, e.g. `"CASE_INSENSITIVE|COMMENTS"`. See the
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary[Java
Pattern API] for more details about the `flags` options.
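
To make the flag names concrete, here is a hedged Java sketch of how a
pipe-separated string could map onto the integer constants of
`java.util.regex.Pattern`. The `parseFlags` helper is hypothetical and
covers only a few constants. It is not the analyzer's own parsing code.

[source,java]
--------------------------------------------------
import java.util.regex.Pattern;

public class FlagsSketch {
    // Hypothetical helper: turns "CASE_INSENSITIVE|COMMENTS" into the int
    // bitmask accepted by Pattern.compile(String, int).
    public static int parseFlags(String flags) {
        int mask = 0;
        for (String name : flags.split("\\|")) {
            switch (name.trim()) {
                case "CASE_INSENSITIVE": mask |= Pattern.CASE_INSENSITIVE; break;
                case "COMMENTS":         mask |= Pattern.COMMENTS;         break;
                case "MULTILINE":        mask |= Pattern.MULTILINE;        break;
                case "DOTALL":           mask |= Pattern.DOTALL;           break;
                case "UNICODE_CASE":     mask |= Pattern.UNICODE_CASE;     break;
                default:
                    throw new IllegalArgumentException("Unknown flag: " + name);
            }
        }
        return mask;
    }

    public static void main(String[] args) {
        int mask = parseFlags("CASE_INSENSITIVE|COMMENTS");
        // COMMENTS ignores the whitespace and the trailing "# ..." comment;
        // CASE_INSENSITIVE lets "foo" match "FOO".
        Pattern p = Pattern.compile("foo   # a literal word", mask);
        System.out.println(p.matcher("FOO").matches()); // prints true
    }
}
--------------------------------------------------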

[float]
==== Pattern Analyzer Examples

To try out these examples, delete the `test` index before running each
one.

[float]
===== Whitespace tokenizer

[source,js]
--------------------------------------------------
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace": {
          "type": "pattern",
          "pattern": "\\s+"
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

GET test/_analyze?analyzer=whitespace&text=foo,bar baz
# "foo,bar", "baz"
--------------------------------------------------
// CONSOLE

[float]
===== Non-word character tokenizer

[source,js]
--------------------------------------------------
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "nonword": {
          "type": "pattern",
          "pattern": "[^\\w]+"
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

GET test/_analyze?analyzer=nonword&text=foo,bar baz
# "foo,bar baz" becomes "foo", "bar", "baz"

GET test/_analyze?analyzer=nonword&text=type_1-type_4
# "type_1","type_4"
--------------------------------------------------
// CONSOLE


[float]
===== CamelCase tokenizer

[source,js]
--------------------------------------------------
PUT test?pretty=1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET _cluster/health?wait_for_status=yellow

GET test/_analyze?analyzer=camel&text=MooseX::FTPClass2_beta
# "moose","x","ftp","class","2","beta"
--------------------------------------------------
// CONSOLE

The regex above is easier to understand as:

[source,js]
--------------------------------------------------

  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )
--------------------------------------------------
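
The same expression can be tried outside a cluster. The sketch below is
only an approximation of the analyzer. It compiles the compact form of
the CamelCase pattern with plain `java.util.regex`, splits on it, and
lowercases the pieces. The class and method names are hypothetical.

[source,java]
--------------------------------------------------
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class CamelCaseSplitSketch {
    // The regular expression from the "camel" analyzer, one alternative per line.
    static final Pattern CAMEL = Pattern.compile(
          "([^\\p{L}\\d]+)"                                          // non letters and numbers
        + "|(?<=\\D)(?=\\d)"                                         // non-number, then number
        + "|(?<=\\d)(?=\\D)"                                         // number, then non-number
        + "|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})"                    // lower case, then upper case
        + "|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"             // upper, then upper + lower
    );

    // Hypothetical helper: split on the separators and lowercase each term.
    public static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : CAMEL.split(text)) {
            if (!piece.isEmpty()) {
                out.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(terms("MooseX::FTPClass2_beta"));
        // prints [moose, x, ftp, class, 2, beta]
    }
}
--------------------------------------------------

Most alternatives are zero-width lookarounds, so they mark a boundary
between two characters without consuming either of them.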