= Tokenizers
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

Tokenizers are responsible for breaking field data into lexical units, or _tokens_.

You configure the tokenizer for a text field type in `schema.xml` with a `<tokenizer>` element, as a child of `<analyzer>`:

[source,xml]
----
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
----

The `class` attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the `org.apache.solr.analysis.TokenizerFactory` interface. A TokenizerFactory's `create()` method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer, it passes a Reader object that provides the content of the text field.

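For illustration, here is a minimal Java sketch of the consuming side of that contract, using a `StandardTokenizer` directly the way a factory's `create()` would supply one. The class choice and input text are just examples, and exact factory signatures vary across Lucene/Solr versions:

[source,java]
----
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerDemo {
  public static void main(String[] args) throws Exception {
    // A factory's create() hands back a Tokenizer like this one;
    // the Reader supplies the field content, just as Solr does.
    Tokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader("Please, email john.doe@foo.com by 03-09."));

    // The standard TokenStream consumption cycle: reset, iterate, end, close.
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
----
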
Arguments may be passed to tokenizer factories by setting attributes on the `<tokenizer>` element.

[source,xml]
----
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
  </analyzer>
</fieldType>
----

The following sections describe the tokenizer factory classes included in this release of Solr.

For user tips about Solr's tokenizers, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

== Standard Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

* Periods (dots) that are not followed by whitespace are kept as part of the token, so Internet domain names remain intact.
* The "@" character is among the set of token-splitting punctuation, so email addresses are *not* preserved as single tokens.

Note that words are split at hyphens.

The Standard Tokenizer supports http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundaries with the following token types: `<ALPHANUM>`, `<NUM>`, `<SOUTHEAST_ASIAN>`, `<IDEOGRAPHIC>`, and `<HIRAGANA>`.

*Factory class:* `solr.StandardTokenizerFactory`

*Arguments:*

`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
----

*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

*Out:* "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"

== Classic Tokenizer

The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and earlier. It does not use the http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundary rules that the Standard Tokenizer uses. This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

* Periods (dots) that are not followed by whitespace are kept as part of the token.

* Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.

* Recognizes Internet domain names and email addresses and preserves them as single tokens.

*Factory class:* `solr.ClassicTokenizerFactory`

*Arguments:*

`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer>
----

*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

*Out:* "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"

== Keyword Tokenizer

This tokenizer treats the entire text field as a single token.

*Factory class:* `solr.KeywordTokenizerFactory`

*Arguments:* None

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
----

*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

*Out:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

== Letter Tokenizer

This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.

*Factory class:* `solr.LetterTokenizerFactory`

*Arguments:* None

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
----

*In:* "I can't."

*Out:* "I", "can", "t"

== Lower Case Tokenizer

Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase. Whitespace and non-letters are discarded.

*Factory class:* `solr.LowerCaseTokenizerFactory`

*Arguments:* None

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
----

*In:* "I just \*LOVE* my iPhone!"

*Out:* "i", "just", "love", "my", "iphone"

== N-Gram Tokenizer

Reads the field text and generates n-gram tokens of sizes in the given range.

*Factory class:* `solr.NGramTokenizerFactory`

*Arguments:*

`minGramSize`: (integer, default 1) The minimum n-gram size, must be > 0.

`maxGramSize`: (integer, default 2) The maximum n-gram size, must be >= `minGramSize`.

*Example:*

Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As a result, the space character can appear inside the generated n-grams.

[source,xml]
----
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>
----

*In:* "hey man"

*Out:* "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"

*Example:*

With an n-gram size range of 4 to 5:

[source,xml]
----
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>
----

*In:* "bicycle"

*Out:* "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"

== Edge N-Gram Tokenizer

Reads the field text and generates edge n-gram tokens of sizes in the given range.

*Factory class:* `solr.EdgeNGramTokenizerFactory`

*Arguments:*

`minGramSize`: (integer, default is 1) The minimum n-gram size, must be > 0.

`maxGramSize`: (integer, default is 1) The maximum n-gram size, must be >= `minGramSize`.

*Example:*

Default behavior (min and max default to 1):

[source,xml]
----
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer>
----

*In:* "babaloo"

*Out:* "b"

*Example:*

Edge n-gram range of 2 to 5:

[source,xml]
----
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
</analyzer>
----

*In:* "babaloo"

*Out:* "ba", "bab", "baba", "babal"

== ICU Tokenizer

This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute.

You can customize this tokenizer's behavior by specifying http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules[per-script rule files]. To add per-script rules, add a `rulefiles` argument, which should contain a comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"), you would enter `Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi`.

The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word break rules tokenization (like `solr.StandardTokenizer`), but also includes custom tailorings for Hebrew (specialized handling of double and single quotation marks), syllable tokenization for Khmer, Lao, and Myanmar, and dictionary-based word segmentation for CJK characters.

*Factory class:* `solr.ICUTokenizerFactory`

*Arguments:*

`rulefiles`: a comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path.

*Example:*

[source,xml]
----
<analyzer>
  <!-- no customization -->
  <tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>
----

[source,xml]
----
<analyzer>
  <tokenizer class="solr.ICUTokenizerFactory"
             rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>
----

[IMPORTANT]
====
To use this tokenizer, you must add additional .jars to Solr's classpath (as described in the section <<resource-and-plugin-loading.adoc#resources-and-plugins-on-the-filesystem,Resources and Plugins on the Filesystem>>). See `solr/contrib/analysis-extras/README.txt` for information on which jars you need to add.
====

== Path Hierarchy Tokenizer

This tokenizer creates synonyms from file path hierarchies.

*Factory class:* `solr.PathHierarchyTokenizerFactory`

*Arguments:*

`delimiter`: (character, no default) Specifies the file path delimiter used in the incoming text, which will be replaced in the output by the `replace` character. This can be useful for working with backslash delimiters.

`replace`: (character, no default) Specifies the delimiter character Solr uses in the tokenized output.

*Example:*

[source,xml]
----
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/>
  </analyzer>
</fieldType>
----

*In:* "c:\usr\local\apache"

*Out:* "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"

== Regular Expression Pattern Tokenizer

This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the `pattern` argument can be interpreted either as a delimiter that separates tokens, or as a pattern matching text that should be extracted from the input as tokens.

See {java-javadocs}java/util/regex/Pattern.html[the Javadocs for `java.util.regex.Pattern`] for more information on Java regular expression syntax.

*Factory class:* `solr.PatternTokenizerFactory`

*Arguments:*

`pattern`: (Required) The regular expression, as defined by `java.util.regex.Pattern`.

`group`: (Optional, default -1) Specifies which regex group to extract as the token(s). The value -1 means the regex should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate that character sequences matching that regex group should be converted to tokens. Group zero refers to the entire regex, groups greater than zero refer to parenthesized sub-expressions of the regex, counted from left to right.

*Example:*

A comma-separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more spaces.

[source,xml]
----
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>
----

*In:* "fee,fie, foe , fum, foo"

*Out:* "fee", "fie", "foe", "fum", "foo"

*Example:*

Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either case is extracted as a token.

[source,xml]
----
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*" group="0"/>
</analyzer>
----

*In:* "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."

*Out:* "Hello", "My", "Inigo", "Montoya", "You", "Prepare"

*Example:*

Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups are numbered by counting left parentheses from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens.

[source,xml]
----
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="(SKU|Part(\sNumber)?):?\s([0-9-]+)" group="3"/>
</analyzer>
----

*In:* "SKU: 1234, Part Number 5678, Part: 126-987"

*Out:* "1234", "5678", "126-987"

== Simplified Regular Expression Pattern Tokenizer

This tokenizer is similar to the `PatternTokenizerFactory` described above, but uses Lucene {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] pattern matching to construct distinct tokens for the input stream. The syntax is more limited than `PatternTokenizerFactory`, but the tokenization is quite a bit faster.

*Factory class:* `solr.SimplePatternTokenizerFactory`

*Arguments:*

`pattern`: (Required) The regular expression, as defined in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters to include in tokens. The matching is greedy, such that the longest token matching at a given point is created. Empty tokens are never created.

`maxDeterminizedStates`: (Optional, default 10000) The limit on total state count for the determinized automaton computed from the regexp.

*Example:*

To match tokens delimited by simple whitespace characters:

[source,xml]
----
<analyzer>
  <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
</analyzer>
----

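With this pattern, each run of non-whitespace characters becomes a token, so punctuation stays attached to the adjacent word. For instance (an illustrative pair reasoned from the pattern, not taken from reference output):

*In:* "To be, or what?"

*Out:* "To", "be,", "or", "what?"
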
== Simplified Regular Expression Pattern Splitting Tokenizer

This tokenizer is similar to the `SimplePatternTokenizerFactory` described above, but uses Lucene {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] pattern matching to identify sequences of characters that should be used to split tokens. The syntax is more limited than `PatternTokenizerFactory`, but the tokenization is quite a bit faster.

*Factory class:* `solr.SimplePatternSplitTokenizerFactory`

*Arguments:*

`pattern`: (Required) The regular expression, as defined in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters that should split tokens. The matching is greedy, such that the longest token separator matching at a given point is matched. Empty tokens are never created.

`maxDeterminizedStates`: (Optional, default 10000) The limit on total state count for the determinized automaton computed from the regexp.

*Example:*

To split tokens on runs of simple whitespace characters:

[source,xml]
----
<analyzer>
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
</analyzer>
----

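Note the symmetry with the previous example: there the pattern named the characters to *keep*, while here it names the characters to *split on*. For whitespace-separated input, the two configurations should yield the same tokens (again reasoning from the patterns rather than from reference output), e.g., "To be, or what?" becomes "To", "be,", "or", "what?".
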
== UAX29 URL Email Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

* Periods (dots) that are not followed by whitespace are kept as part of the token.

* Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.

* Recognizes and preserves as single tokens the following:
** Internet domain names containing top-level domains validated against the white list in the http://www.internic.net/zones/root.zone[IANA Root Zone Database] when the tokenizer was generated
** email addresses
** `file://`, `http(s)://`, and `ftp://` URLs
** IPv4 and IPv6 addresses

The UAX29 URL Email Tokenizer supports http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundaries with the following token types: `<ALPHANUM>`, `<NUM>`, `<URL>`, `<EMAIL>`, `<SOUTHEAST_ASIAN>`, `<IDEOGRAPHIC>`, and `<HIRAGANA>`.

*Factory class:* `solr.UAX29URLEmailTokenizerFactory`

*Arguments:*

`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>
----

*In:* "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"

*Out:* "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "e", "mail", "bob.cratchet@accarol.com"

== White Space Tokenizer

Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation _will_ be included in the tokens.

*Factory class:* `solr.WhitespaceTokenizerFactory`

*Arguments:*

`rule`::
Specifies how to define whitespace for the purpose of tokenization. Valid values:

* `java`: (Default) Uses {java-javadocs}java/lang/Character.html#isWhitespace-int-[Character.isWhitespace(int)]
* `unicode`: Uses Unicode's WHITESPACE property

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java" />
</analyzer>
----

*In:* "To be, or what?"

*Out:* "To", "be,", "or", "what?"

== OpenNLP Tokenizer and OpenNLP Filters

See <<language-analysis.adoc#opennlp-integration,OpenNLP Integration>> for information about using the OpenNLP Tokenizer, along with information about available OpenNLP token filters.