= Tokenizers
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

Tokenizers are responsible for breaking field data into lexical units, or _tokens_.

You configure the tokenizer for a text field type in `schema.xml` with a `<tokenizer>` element, as a child of `<analyzer>`:

[source,xml]
----
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
----

The `class` attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the `org.apache.solr.analysis.TokenizerFactory` interface. A TokenizerFactory's `create()` method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer, it passes a Reader object that provides the content of the text field.

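For illustration, here is a minimal Java sketch of the consuming side of that contract, using a `StandardTokenizer` directly the way a factory's `create()` would supply one. The class choice and input text are just examples, and exact factory signatures vary across Lucene/Solr versions:

[source,java]
----
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerDemo {
  public static void main(String[] args) throws Exception {
    // A factory's create() hands back a Tokenizer like this one;
    // the Reader supplies the field content, just as Solr does.
    Tokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader("Please, email john.doe@foo.com by 03-09."));

    // The standard TokenStream consumption cycle: reset, iterate, end, close.
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
----
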
Arguments may be passed to tokenizer factories by setting attributes on the `<tokenizer>` element.

[source,xml]
----
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
  </analyzer>
</fieldType>
----

The following sections describe the tokenizer factory classes included in this release of Solr.

For user tips about Solr's tokenizers, see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

== Standard Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

* Periods (dots) that are not followed by whitespace are kept as part of the token, so Internet domain names remain intact.
* The "@" character is among the set of token-splitting punctuation, so email addresses are *not* preserved as single tokens.

Note that words are split at hyphens.

The Standard Tokenizer supports http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundaries with the following token types: `<ALPHANUM>`, `<NUM>`, `<SOUTHEAST_ASIAN>`, `<IDEOGRAPHIC>`, and `<HIRAGANA>`.

*Factory class:* `solr.StandardTokenizerFactory`

*Arguments:*

`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
----

*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

*Out:* "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"

== Classic Tokenizer

The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and earlier. It does not use the http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundary rules that the Standard Tokenizer uses. This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

* Periods (dots) that are not followed by whitespace are kept as part of the token.

* Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.

* Recognizes Internet domain names and email addresses and preserves them as single tokens.

*Factory class:* `solr.ClassicTokenizerFactory`

*Arguments:*

`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer>
----

*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

*Out:* "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"

== Keyword Tokenizer

This tokenizer treats the entire text field as a single token.

*Factory class:* `solr.KeywordTokenizerFactory`

*Arguments:* None

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
----

*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

*Out:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

== Letter Tokenizer

This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.

*Factory class:* `solr.LetterTokenizerFactory`

*Arguments:* None

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
----

*In:* "I can't."

*Out:* "I", "can", "t"

== Lower Case Tokenizer

Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase. Whitespace and non-letters are discarded.

*Factory class:* `solr.LowerCaseTokenizerFactory`

*Arguments:* None

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
----

*In:* "I just \*LOVE* my iPhone!"

*Out:* "i", "just", "love", "my", "iphone"

== N-Gram Tokenizer

Reads the field text and generates n-gram tokens of sizes in the given range.

*Factory class:* `solr.NGramTokenizerFactory`

*Arguments:*

`minGramSize`: (integer, default 1) The minimum n-gram size, must be > 0.

`maxGramSize`: (integer, default 2) The maximum n-gram size, must be >= `minGramSize`.

*Example:*

Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As a result, the space character can appear inside the generated n-grams.

[source,xml]
----
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>
----

*In:* "hey man"

*Out:* "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"

*Example:*

With an n-gram size range of 4 to 5:

[source,xml]
----
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>
----

*In:* "bicycle"

*Out:* "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"

== Edge N-Gram Tokenizer

Reads the field text and generates edge n-gram tokens of sizes in the given range.

*Factory class:* `solr.EdgeNGramTokenizerFactory`

*Arguments:*

`minGramSize`: (integer, default is 1) The minimum n-gram size, must be > 0.

`maxGramSize`: (integer, default is 1) The maximum n-gram size, must be >= `minGramSize`.

*Example:*

Default behavior (min and max default to 1):

[source,xml]
----
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer>
----

*In:* "babaloo"

*Out:* "b"

*Example:*

Edge n-gram range of 2 to 5:

[source,xml]
----
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
</analyzer>
----

*In:* "babaloo"

*Out:* "ba", "bab", "baba", "babal"

== ICU Tokenizer

This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute.

You can customize this tokenizer's behavior by specifying http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules[per-script rule files]. To add per-script rules, add a `rulefiles` argument, which should contain a comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"), you would enter `Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi`.

The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word break rules tokenization (like `solr.StandardTokenizer`), but also includes custom tailorings for Hebrew (specialized handling of double and single quotation marks), syllable tokenization for Khmer, Lao, and Myanmar, and dictionary-based word segmentation for CJK characters.

*Factory class:* `solr.ICUTokenizerFactory`

*Arguments:*

`rulefiles`: a comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path.

*Example:*

[source,xml]
----
<analyzer>
  <!-- no customization -->
  <tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>
----

[source,xml]
----
<analyzer>
  <tokenizer class="solr.ICUTokenizerFactory"
             rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>
----

[IMPORTANT]
====
To use this tokenizer, you must add additional .jars to Solr's classpath (as described in the section <<resource-and-plugin-loading.adoc#resources-and-plugins-on-the-filesystem,Resources and Plugins on the Filesystem>>). See `solr/contrib/analysis-extras/README.txt` for information on which jars you need to add.
====

== Path Hierarchy Tokenizer

This tokenizer creates synonyms from file path hierarchies.

*Factory class:* `solr.PathHierarchyTokenizerFactory`

*Arguments:*

`delimiter`: (character, no default) Specifies the file path delimiter used in the incoming text, which will be replaced in the output by the `replace` character. This can be useful for working with backslash delimiters.

`replace`: (character, no default) Specifies the delimiter character Solr uses in the tokenized output.

*Example:*

[source,xml]
----
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/>
  </analyzer>
</fieldType>
----

*In:* "c:\usr\local\apache"

*Out:* "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"

== Regular Expression Pattern Tokenizer

This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the `pattern` argument can be interpreted either as a delimiter that separates tokens, or as a pattern matching text that should be extracted from the input as tokens.

See {java-javadocs}java/util/regex/Pattern.html[the Javadocs for `java.util.regex.Pattern`] for more information on Java regular expression syntax.

*Factory class:* `solr.PatternTokenizerFactory`

*Arguments:*

`pattern`: (Required) The regular expression, as defined by `java.util.regex.Pattern`.

`group`: (Optional, default -1) Specifies which regex group to extract as the token(s). The value -1 means the regex should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate that character sequences matching that regex group should be converted to tokens. Group zero refers to the entire regex, groups greater than zero refer to parenthesized sub-expressions of the regex, counted from left to right.

*Example:*

A comma-separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more spaces.

[source,xml]
----
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>
----

*In:* "fee,fie, foe , fum, foo"

*Out:* "fee", "fie", "foe", "fum", "foo"

*Example:*

Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either case is extracted as a token.

[source,xml]
----
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*" group="0"/>
</analyzer>
----

*In:* "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."

*Out:* "Hello", "My", "Inigo", "Montoya", "You", "Prepare"

*Example:*

Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups are numbered by counting left parentheses from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens.

[source,xml]
----
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="(SKU|Part(\sNumber)?):?\s([0-9-]+)" group="3"/>
</analyzer>
----

*In:* "SKU: 1234, Part Number 5678, Part: 126-987"

*Out:* "1234", "5678", "126-987"

== Simplified Regular Expression Pattern Tokenizer

This tokenizer is similar to the `PatternTokenizerFactory` described above, but uses Lucene {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] pattern matching to construct distinct tokens for the input stream. The syntax is more limited than `PatternTokenizerFactory`, but the tokenization is quite a bit faster.

*Factory class:* `solr.SimplePatternTokenizerFactory`

*Arguments:*

`pattern`: (Required) The regular expression, as defined in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters to include in tokens. The matching is greedy, such that the longest token matching at a given point is created. Empty tokens are never created.

`maxDeterminizedStates`: (Optional, default 10000) The limit on total state count for the determinized automaton computed from the regexp.

*Example:*

To match tokens delimited by simple whitespace characters:

[source,xml]
----
<analyzer>
  <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
</analyzer>
----

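With this pattern, each run of non-whitespace characters becomes a token, so punctuation stays attached to the adjacent word. For instance (an illustrative pair reasoned from the pattern, not taken from reference output):

*In:* "To be, or what?"

*Out:* "To", "be,", "or", "what?"
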
== Simplified Regular Expression Pattern Splitting Tokenizer

This tokenizer is similar to the `SimplePatternTokenizerFactory` described above, but uses Lucene {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] pattern matching to identify sequences of characters that should be used to split tokens. The syntax is more limited than `PatternTokenizerFactory`, but the tokenization is quite a bit faster.

*Factory class:* `solr.SimplePatternSplitTokenizerFactory`

*Arguments:*

`pattern`: (Required) The regular expression, as defined in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters that should split tokens. The matching is greedy, such that the longest token separator matching at a given point is matched. Empty tokens are never created.

`maxDeterminizedStates`: (Optional, default 10000) The limit on total state count for the determinized automaton computed from the regexp.

*Example:*

To split tokens on runs of simple whitespace characters:

[source,xml]
----
<analyzer>
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
</analyzer>
----

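Note the symmetry with the previous example: there the pattern named the characters to *keep*, while here it names the characters to *split on*. For whitespace-separated input, the two configurations should yield the same tokens (again reasoning from the patterns rather than from reference output), e.g., "To be, or what?" becomes "To", "be,", "or", "what?".
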
== UAX29 URL Email Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

* Periods (dots) that are not followed by whitespace are kept as part of the token.

* Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.

* Recognizes and preserves as single tokens the following:
** Internet domain names containing top-level domains validated against the white list in the http://www.internic.net/zones/root.zone[IANA Root Zone Database] when the tokenizer was generated
** email addresses
** `file://`, `http(s)://`, and `ftp://` URLs
** IPv4 and IPv6 addresses

The UAX29 URL Email Tokenizer supports http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundaries with the following token types: `<ALPHANUM>`, `<NUM>`, `<URL>`, `<EMAIL>`, `<SOUTHEAST_ASIAN>`, `<IDEOGRAPHIC>`, and `<HIRAGANA>`.

*Factory class:* `solr.UAX29URLEmailTokenizerFactory`

*Arguments:*

`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>
----

*In:* "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"

*Out:* "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "e", "mail", "bob.cratchet@accarol.com"

== White Space Tokenizer

Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation _will_ be included in the tokens.

*Factory class:* `solr.WhitespaceTokenizerFactory`

*Arguments:*

`rule`::
Specifies how to define whitespace for the purpose of tokenization. Valid values:

* `java`: (Default) Uses {java-javadocs}java/lang/Character.html#isWhitespace-int-[Character.isWhitespace(int)]
* `unicode`: Uses Unicode's WHITESPACE property

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java" />
</analyzer>
----

*In:* "To be, or what?"

*Out:* "To", "be,", "or", "what?"

== OpenNLP Tokenizer and OpenNLP Filters

See <<language-analysis.adoc#opennlp-integration,OpenNLP Integration>> for information about using the OpenNLP Tokenizer, along with information about available OpenNLP token filters.