= Tokenizers
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
Tokenizers are responsible for breaking field data into lexical units, or _tokens_.
You configure the tokenizer for a text field type in `schema.xml` with a `<tokenizer>` element, as a child of `<analyzer>`:
[source,xml]
----
<fieldType name="text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
----
The `class` attribute names a factory class that instantiates a tokenizer object when needed. Tokenizer factory classes implement `org.apache.solr.analysis.TokenizerFactory`. A TokenizerFactory's `create()` method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer, it passes a Reader object that provides the content of the text field.
Arguments may be passed to tokenizer factories by setting attributes on the `<tokenizer>` element.
[source,xml]
----
<fieldType name="semicolonDelimited" class="solr.TextField">
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
</analyzer>
</fieldType>
----
The following sections describe the tokenizer factory classes included in this release of Solr.
For user tips about Solr's tokenizers, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.
== Standard Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
* Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.
* The "@" character is among the set of token-splitting punctuation, so email addresses are *not* preserved as single tokens.
Note that words are split at hyphens.
The Standard Tokenizer supports http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundaries with the following token types: `<ALPHANUM>`, `<NUM>`, `<SOUTHEAST_ASIAN>`, `<IDEOGRAPHIC>`, and `<HIRAGANA>`.
*Factory class:* `solr.StandardTokenizerFactory`
*Arguments:*
`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
----
*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."
*Out:* "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"
== Classic Tokenizer
The Classic Tokenizer preserves the behavior of the Standard Tokenizer in Solr versions 3.1 and earlier. It does not use the http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundary rules that the Standard Tokenizer uses. This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
* Periods (dots) that are not followed by whitespace are kept as part of the token.
* Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
* Recognizes Internet domain names and email addresses and preserves them as a single token.
*Factory class:* `solr.ClassicTokenizerFactory`
*Arguments:*
`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer>
----
*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."
*Out:* "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"
== Keyword Tokenizer
This tokenizer treats the entire text field as a single token.
*Factory class:* `solr.KeywordTokenizerFactory`
*Arguments:* None
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
----
*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."
*Out:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."
== Letter Tokenizer
This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.
*Factory class:* `solr.LetterTokenizerFactory`
*Arguments:* None
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
----
*In:* "I can't."
*Out:* "I", "can", "t"
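The behavior above can be approximated outside Solr with a few lines of Python. This is only an illustrative sketch, not Solr code: `letter_tokenize` is a hypothetical helper, and Python's `str.isalpha()` is assumed to be a close enough stand-in for the letter test that Lucene applies.

```python
from itertools import groupby


def letter_tokenize(text):
    """Return maximal runs of letters, dropping every non-letter character."""
    # Assumption: str.isalpha() approximates the letter test used by Lucene.
    return [
        "".join(run)
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    ]


print(letter_tokenize("I can't."))  # ['I', 'can', 't']
```

Lowercasing each token returned by this helper gives a rough model of the Lower Case Tokenizer described in the next section.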
== Lower Case Tokenizer
Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase. Whitespace and non-letters are discarded.
*Factory class:* `solr.LowerCaseTokenizerFactory`
*Arguments:* None
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
----
*In:* "I just \*LOVE* my iPhone!"
*Out:* "i", "just", "love", "my", "iphone"
== N-Gram Tokenizer
Reads the field text and generates n-gram tokens of sizes in the given range.
*Factory class:* `solr.NGramTokenizerFactory`
*Arguments:*
`minGramSize`: (integer, default 1) The minimum n-gram size, must be > 0.
`maxGramSize`: (integer, default 2) The maximum n-gram size, must be >= `minGramSize`.
*Example:*
Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As a result, the space character is included in the n-grams.
[source,xml]
----
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>
----
*In:* "hey man"
*Out:* "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"
*Example:*
With an n-gram size range of 4 to 5:
[source,xml]
----
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>
----
*In:* "bicycle"
*Out:* "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"
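To make the size range concrete, the following Python sketch generates the same n-grams as the two examples above. It is an approximation for illustration only: `ngram_tokenize` is a hypothetical helper, and the position-by-position output order is an assumption (the order in which Solr emits n-grams may differ).

```python
def ngram_tokenize(text, min_gram=1, max_gram=2):
    """Emit every substring of length min_gram..max_gram, position by position."""
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("require 0 < min_gram <= max_gram")
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            # Skip sizes that would run past the end of the field.
            if start + size <= len(text):
                tokens.append(text[start:start + size])
    return tokens


print(ngram_tokenize("bicycle", 4, 5))
# ['bicy', 'bicyc', 'icyc', 'icycl', 'cycl', 'cycle', 'ycle']
```

Calling `ngram_tokenize("hey man")` with the defaults yields the same thirteen tokens as the first example, including the ones that contain the space character.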
== Edge N-Gram Tokenizer
Reads the field text and generates edge n-gram tokens of sizes in the given range.
*Factory class:* `solr.EdgeNGramTokenizerFactory`
*Arguments:*
`minGramSize`: (integer, default is 1) The minimum n-gram size, must be > 0.
`maxGramSize`: (integer, default is 1) The maximum n-gram size, must be >= `minGramSize`.
*Example:*
Default behavior (min and max default to 1):
[source,xml]
----
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer>
----
*In:* "babaloo"
*Out:* "b"
*Example:*
Edge n-gram range of 2 to 5:
[source,xml]
----
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
</analyzer>
----
*In:* "babaloo"
*Out:* "ba", "bab", "baba", "babal"
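An edge n-gram is a prefix of the field text, so the examples above can be reproduced with a short Python sketch. `edge_ngram_tokenize` is a hypothetical helper written for illustration, not part of Solr, and it models only n-grams taken from the front of the text.

```python
def edge_ngram_tokenize(text, min_gram=1, max_gram=1):
    """Emit the prefixes of text whose lengths fall in min_gram..max_gram."""
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("require 0 < min_gram <= max_gram")
    # A prefix cannot be longer than the text itself.
    longest = min(max_gram, len(text))
    return [text[:size] for size in range(min_gram, longest + 1)]


print(edge_ngram_tokenize("babaloo"))        # ['b']
print(edge_ngram_tokenize("babaloo", 2, 5))  # ['ba', 'bab', 'baba', 'babal']
```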
== ICU Tokenizer
This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute.
You can customize this tokenizer's behavior by specifying http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules[per-script rule files]. To add per-script rules, add a `rulefiles` argument, which should contain a comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"), you would enter `Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi`.
The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word break rules tokenization (like `solr.StandardTokenizer`), but also includes custom tailorings for Hebrew (specialized handling of double and single quotation marks), syllable tokenization for Khmer, Lao, and Myanmar, and dictionary-based word segmentation for CJK characters.
*Factory class:* `solr.ICUTokenizerFactory`
*Arguments:*
`rulefiles`: a comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path.
*Example:*
[source,xml]
----
<analyzer>
<!-- no customization -->
<tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>
----
[source,xml]
----
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"
rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>
----
[IMPORTANT]
====
To use this tokenizer, you must add additional .jars to Solr's classpath (as described in the section <<resource-and-plugin-loading.adoc#resources-and-plugins-on-the-filesystem,Resources and Plugins on the Filesystem>>). See the `solr/contrib/analysis-extras/README.txt` for information on which jars you need to add.
====
== Path Hierarchy Tokenizer
This tokenizer creates synonyms from file path hierarchies.
*Factory class:* `solr.PathHierarchyTokenizerFactory`
*Arguments:*
`delimiter`: (character, no default) You can specify the file path delimiter and replace it with a delimiter you provide. This can be useful for working with backslash delimiters.
`replace`: (character, no default) Specifies the delimiter character Solr uses in the tokenized output.
*Example:*
[source,xml]
----
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/>
</analyzer>
</fieldType>
----
*In:* "c:\usr\local\apache"
*Out:* "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"
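The example above can be modeled as "split on the delimiter, then emit each cumulative prefix joined with the replacement character". The Python sketch below does exactly that. `path_hierarchy_tokenize` is a hypothetical helper, not Solr code, and it does not try to model edge cases such as leading or repeated delimiters.

```python
def path_hierarchy_tokenize(path, delimiter, replace):
    """Emit one token per ancestor path, using replace in place of delimiter."""
    parts = path.split(delimiter)
    # Join the first 1, 2, ..., n components to build each level of the hierarchy.
    return [replace.join(parts[:depth]) for depth in range(1, len(parts) + 1)]


print(path_hierarchy_tokenize("c:\\usr\\local\\apache", delimiter="\\", replace="/"))
# ['c:', 'c:/usr', 'c:/usr/local', 'c:/usr/local/apache']
```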
== Regular Expression Pattern Tokenizer
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens.
See {java-javadocs}java/util/regex/Pattern.html[the Javadocs for `java.util.regex.Pattern`] for more information on Java regular expression syntax.
*Factory class:* `solr.PatternTokenizerFactory`
*Arguments:*
`pattern`: (Required) The regular expression, as defined by `java.util.regex.Pattern`.
`group`: (Optional, default -1) Specifies which regex group to extract as the token(s). The value -1 means the regex should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate that character sequences matching that regex group should be converted to tokens. Group zero refers to the entire regex; groups greater than zero refer to parenthesized sub-expressions of the regex, counted from left to right.
*Example:*
A comma-separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more spaces.
[source,xml]
----
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>
----
*In:* "fee,fie, foe , fum, foo"
*Out:* "fee", "fie", "foe", "fum", "foo"
*Example:*
Extract simple, capitalized words. A capital letter followed by zero or more letters of either case is extracted as a token.
[source,xml]
----
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*" group="0"/>
</analyzer>
----
*In:* "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
*Out:* "Hello", "My", "Inigo", "Montoya", "You", "Prepare"
*Example:*
Extract part numbers that are preceded by "SKU", "Part" or "Part Number" (case-sensitive), with an optional colon separator. Part numbers must consist of numeric digits and optional hyphens. Regex capture groups are numbered by counting left parentheses from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens.
[source,xml]
----
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="(SKU|Part(\sNumber)?):?\s([0-9-]+)" group="3"/>
</analyzer>
----
*In:* "SKU: 1234, Part Number 5678, Part: 126-987"
*Out:* "1234", "5678", "126-987"
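The two modes selected by `group` can be tried out with Python's `re` module, whose syntax agrees with `java.util.regex.Pattern` for the simple patterns used in these examples. The sketch below is an approximation for experimenting with patterns, not Solr's implementation: `pattern_tokenize` is a hypothetical helper, and dropping empty tokens is an assumption.

```python
import re


def pattern_tokenize(text, pattern, group=-1):
    """Tokenize text by splitting on pattern (group=-1) or extracting a group."""
    regex = re.compile(pattern)
    if group >= 0:
        # Extraction mode: keep the requested group of every match.
        return [m.group(group) for m in regex.finditer(text) if m.group(group)]
    # Delimiter mode: keep the non-empty text between consecutive matches.
    tokens, start = [], 0
    for m in regex.finditer(text):
        if m.start() > start:
            tokens.append(text[start:m.start()])
        start = m.end()
    if start < len(text):
        tokens.append(text[start:])
    return tokens


print(pattern_tokenize("fee,fie, foe , fum, foo", r"\s*,\s*"))
# ['fee', 'fie', 'foe', 'fum', 'foo']
print(pattern_tokenize("SKU: 1234, Part Number 5678, Part: 126-987",
                       r"(SKU|Part(\sNumber)?):?\s([0-9-]+)", group=3))
# ['1234', '5678', '126-987']
```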
== Simplified Regular Expression Pattern Tokenizer
This tokenizer is similar to the `PatternTokenizerFactory` described above, but uses Lucene {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] pattern matching to construct distinct tokens for the input stream. The syntax is more limited than `PatternTokenizerFactory`, but the tokenization is quite a bit faster.
*Factory class:* `solr.SimplePatternTokenizerFactory`
*Arguments:*
`pattern`: (Required) The regular expression, as defined in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters to include in tokens. The matching is greedy such that the longest token matching at a given point is created. Empty tokens are never created.
`maxDeterminizedStates`: (Optional, default 10000) The limit on the total state count for the determinized automaton computed from the regexp.
*Example:*
To match tokens delimited by simple whitespace characters:
[source,xml]
----
<analyzer>
<tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
</analyzer>
----
== Simplified Regular Expression Pattern Splitting Tokenizer
This tokenizer is similar to the `SimplePatternTokenizerFactory` described above, but uses Lucene {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] pattern matching to identify sequences of characters that should be used to split tokens. The syntax is more limited than `PatternTokenizerFactory`, but the tokenization is quite a bit faster.
*Factory class:* `solr.SimplePatternSplitTokenizerFactory`
*Arguments:*
`pattern`: (Required) The regular expression, as defined in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters that should split tokens. The matching is greedy such that the longest token separator matching at a given point is matched. Empty tokens are never created.
`maxDeterminizedStates`: (Optional, default 10000) The limit on the total state count for the determinized automaton computed from the regexp.
*Example:*
To split tokens on simple whitespace characters:
[source,xml]
----
<analyzer>
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
</analyzer>
----
== UAX29 URL Email Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
* Periods (dots) that are not followed by whitespace are kept as part of the token.
* Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
* Recognizes and preserves as single tokens the following:
** Internet domain names containing top-level domains validated against the white list in the http://www.internic.net/zones/root.zone[IANA Root Zone Database] when the tokenizer was generated
** email addresses
** `file://`, `http(s)://`, and `ftp://` URLs
** IPv4 and IPv6 addresses
The UAX29 URL Email Tokenizer supports http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundaries with the following token types: `<ALPHANUM>`, `<NUM>`, `<URL>`, `<EMAIL>`, `<SOUTHEAST_ASIAN>`, `<IDEOGRAPHIC>`, and `<HIRAGANA>`.
*Factory class:* `solr.UAX29URLEmailTokenizerFactory`
*Arguments:*
`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>
----
*In:* "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"
*Out:* "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "e", "mail", "bob.cratchet@accarol.com"
== White Space Tokenizer
Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation _will_ be included in the tokens.
*Factory class:* `solr.WhitespaceTokenizerFactory`
*Arguments:*
`rule`::
Specifies how to define whitespace for the purpose of tokenization. Valid values:
* `java`: (Default) Uses {java-javadocs}java/lang/Character.html#isWhitespace-int-[Character.isWhitespace(int)]
* `unicode`: Uses Unicode's WHITESPACE property
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" rule="java" />
</analyzer>
----
*In:* "To be, or what?"
*Out:* "To", "be,", "or", "what?"
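The two `rule` values differ on only a handful of characters. One well-known case is the non-breaking space (U+00A0), which `Character.isWhitespace(int)` does not treat as whitespace even though it has the Unicode whitespace property. The Python sketch below illustrates that difference. It is an approximation, not Solr code: the helper names are hypothetical, and Python's `str.isspace()` is assumed to be a close enough stand-in for both definitions apart from the non-breaking spaces listed.

```python
# Non-breaking spaces: whitespace under the Unicode property, but rejected
# by Java's Character.isWhitespace(int).
NON_BREAKING_SPACES = {"\u00a0", "\u2007", "\u202f"}


def is_whitespace(ch, rule="java"):
    """Approximate the whitespace test selected by the rule argument."""
    if rule == "java":
        return ch.isspace() and ch not in NON_BREAKING_SPACES
    if rule == "unicode":
        return ch.isspace()
    raise ValueError(f"unknown rule: {rule!r}")


def whitespace_tokenize(text, rule="java"):
    """Split text on whitespace, keeping punctuation inside the tokens."""
    tokens, current = [], []
    for ch in text:
        if is_whitespace(ch, rule):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


print(whitespace_tokenize("To be, or what?"))  # ['To', 'be,', 'or', 'what?']
print(len(whitespace_tokenize("a\u00a0b", rule="java")))     # 1
print(len(whitespace_tokenize("a\u00a0b", rule="unicode")))  # 2
```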
== OpenNLP Tokenizer and OpenNLP Filters
See <<language-analysis.adoc#opennlp-integration,OpenNLP Integration>> for information about using the OpenNLP Tokenizer, along with information about available OpenNLP token filters.