SOLR-13691: Add example field type configurations using name attributes to Ref Guide

Tomoko Uchida 2019-09-01 01:32:10 +09:00
parent 77c1ed7d16
commit 66d7dffc79
9 changed files with 2421 additions and 140 deletions

@ -64,7 +64,7 @@ Upgrade Notes
* SOLR-11266: default Content-Type override for JSONResponseWriter from _default configSet is removed. Example has been * SOLR-11266: default Content-Type override for JSONResponseWriter from _default configSet is removed. Example has been
provided in sample_techproducts_configs to override content-type. (Ishan Chattopadhyaya, Munendra S N, Gus Heck) provided in sample_techproducts_configs to override content-type. (Ishan Chattopadhyaya, Munendra S N, Gus Heck)
* SOLR-13593 SOLR-13690: Allow to look up analyzer components by their SPI names in field type configuration. (Tomoko Uchida) * SOLR-13593 SOLR-13690 SOLR-13691: Allow to look up analyzer components by their SPI names in field type configuration. (Tomoko Uchida)
Other Changes Other Changes
---------------------- ----------------------

@ -22,6 +22,25 @@ A filter may also do more complex analysis by looking ahead to consider multiple
Because filters consume one `TokenStream` and produce a new `TokenStream`, they can be chained one after another indefinitely. Each filter in the chain in turn processes the tokens produced by its predecessor. The order in which you specify the filters is therefore significant. Typically, the most general filtering is done first, and later filtering stages are more specialized. Because filters consume one `TokenStream` and produce a new `TokenStream`, they can be chained one after another indefinitely. Each filter in the chain in turn processes the tokens produced by its predecessor. The order in which you specify the filters is therefore significant. Typically, the most general filtering is done first, and later filtering stages are more specialized.
[.dynamic-tabs]
--
[example.tab-pane#byname-filterexample]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="englishPorter"/>
</analyzer>
</fieldType>
----
====
[example.tab-pane#byclass-filterexample]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<fieldType name="text" class="solr.TextField"> <fieldType name="text" class="solr.TextField">
@ -32,6 +51,8 @@ Because filters consume one `TokenStream` and produce a new `TokenStream`, they
</analyzer> </analyzer>
</fieldType> </fieldType>
---- ----
====
--
This example starts with Solr's standard tokenizer, which breaks the field's text into tokens. All the tokens are then set to lowercase, which will facilitate case-insensitive matching at query time. This example starts with Solr's standard tokenizer, which breaks the field's text into tokens. All the tokens are then set to lowercase, which will facilitate case-insensitive matching at query time.
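Because order matters, a case-sensitive filter behaves differently depending on whether it runs before or after the lowercase step. A minimal sketch, assuming a hypothetical `stopwords.txt` that lists stop words in lowercase (the file name and the `text_stop` field type name are illustrative and do not come from this commit):

[source,xml]
----
<fieldType name="text_stop" class="solr.TextField">
  <analyzer>
    <tokenizer name="standard"/>
    <!-- Lowercase first, so that "The" becomes "the" before stop words are checked -->
    <filter name="lowercase"/>
    <!-- stopwords.txt is a hypothetical file of lowercase stop words -->
    <filter name="stop" words="stopwords.txt"/>
  </analyzer>
</fieldType>
----

If the two filters were swapped, a capitalized "The" would likely pass through the stop filter, which is case-sensitive unless `ignoreCase="true"` is set, and only then be lowercased.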

@ -20,6 +20,23 @@ The job of a <<tokenizers.adoc#tokenizers,tokenizer>> is to break up a stream of
Characters in the input stream may be discarded, such as whitespace or other delimiters. They may also be added to or replaced, such as mapping aliases or abbreviations to normalized forms. A token contains various metadata in addition to its text value, such as the location at which the token occurs in the field. Because a tokenizer may produce tokens that diverge from the input text, you should not assume that the text of the token is the same text that occurs in the field, or that its length is the same as the original text. It's also possible for more than one token to have the same position or refer to the same offset in the original text. Keep this in mind if you use token metadata for things like highlighting search results in the field text. Characters in the input stream may be discarded, such as whitespace or other delimiters. They may also be added to or replaced, such as mapping aliases or abbreviations to normalized forms. A token contains various metadata in addition to its text value, such as the location at which the token occurs in the field. Because a tokenizer may produce tokens that diverge from the input text, you should not assume that the text of the token is the same text that occurs in the field, or that its length is the same as the original text. It's also possible for more than one token to have the same position or refer to the same offset in the original text. Keep this in mind if you use token metadata for things like highlighting search results in the field text.
[.dynamic-tabs]
--
[example.tab-pane#byname-tok]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer name="standard"/>
</analyzer>
</fieldType>
----
====
[example.tab-pane#byclass-tok]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<fieldType name="text" class="solr.TextField"> <fieldType name="text" class="solr.TextField">
@ -28,6 +45,8 @@ Characters in the input stream may be discarded, such as whitespace or other del
</analyzer> </analyzer>
</fieldType> </fieldType>
---- ----
====
--
The class named in the tokenizer element is not the actual tokenizer, but rather a class that implements the `TokenizerFactory` API. This factory class will be called upon to create new tokenizer instances as needed. Objects created by the factory must derive from `Tokenizer`, which indicates that they produce sequences of tokens. If the tokenizer produces tokens that are usable as is, it may be the only component of the analyzer. Otherwise, the tokenizer's output tokens will serve as input to the first filter stage in the pipeline. The class named in the tokenizer element is not the actual tokenizer, but rather a class that implements the `TokenizerFactory` API. This factory class will be called upon to create new tokenizer instances as needed. Objects created by the factory must derive from `Tokenizer`, which indicates that they produce sequences of tokens. If the tokenizer produces tokens that are usable as is, it may be the only component of the analyzer. Otherwise, the tokenizer's output tokens will serve as input to the first filter stage in the pipeline.
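The stages described above can be sketched as a single analyzer. This is an illustrative configuration, not part of the commit: the `text_html` field type name is hypothetical, while the `htmlStrip`, `standard`, and `lowercase` names are the ones used elsewhere in this change.

[source,xml]
----
<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <!-- Char filters see the raw character stream before tokenization -->
    <charFilter name="htmlStrip"/>
    <!-- The tokenizer turns the filtered characters into tokens -->
    <tokenizer name="standard"/>
    <!-- Filters then process the tokenizer's output in the order listed -->
    <filter name="lowercase"/>
  </analyzer>
</fieldType>
----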

@ -35,6 +35,27 @@ Even the most complex analysis requirements can usually be decomposed into a ser
For example: For example:
[.dynamic-tabs]
--
[example.tab-pane#byname]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="nametext" class="solr.TextField">
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="stop"/>
<filter name="englishPorter"/>
</analyzer>
</fieldType>
----
Tokenizer and filter factory classes are referred to by their symbolic names (SPI names). Here, `name="standard"` refers to `org.apache.lucene.analysis.standard.StandardTokenizerFactory`.
====
[example.tab-pane#byclass]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<fieldType name="nametext" class="solr.TextField"> <fieldType name="nametext" class="solr.TextField">
@ -46,8 +67,9 @@ For example:
</analyzer> </analyzer>
</fieldType> </fieldType>
---- ----
Note that classes in the `org.apache.lucene.analysis` package may be referred to here with the shorthand `solr.` prefix.
Note that classes in the `org.apache.solr.analysis` package may be referred to here with the shorthand `solr.` prefix. ====
--
In this case, no Analyzer class was specified on the `<analyzer>` element. Rather, a sequence of more specialized classes are wired together and collectively act as the Analyzer for the field. The text of the field is passed to the first item in the list (`solr.StandardTokenizerFactory`), and the tokens that emerge from the last one (`solr.EnglishPorterFilterFactory`) are the terms that are used for indexing or querying any fields that use the "nametext" `fieldType`. In this case, no Analyzer class was specified on the `<analyzer>` element. Rather, a sequence of more specialized classes are wired together and collectively act as the Analyzer for the field. The text of the field is passed to the first item in the list (`solr.StandardTokenizerFactory`), and the tokens that emerge from the last one (`solr.EnglishPorterFilterFactory`) are the terms that are used for indexing or querying any fields that use the "nametext" `fieldType`.
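For contrast, here is a sketch of the other form, where a single Analyzer class is named on the `<analyzer>` element and no tokenizer or filters are listed. `org.apache.lucene.analysis.core.WhitespaceAnalyzer` is used only as an example of a Lucene analyzer class, and the `wstext` field type name is hypothetical.

[source,xml]
----
<fieldType name="wstext" class="solr.TextField">
  <!-- The named class performs the whole analysis; no child elements are needed -->
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>
----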
@ -65,6 +87,30 @@ In many cases, the same analysis should be applied to both phases. This is desir
If you provide a simple `<analyzer>` definition for a field type, as in the examples above, then it will be used for both indexing and queries. If you want distinct analyzers for each phase, you may include two `<analyzer>` definitions distinguished with a type attribute. For example: If you provide a simple `<analyzer>` definition for a field type, as in the examples above, then it will be used for both indexing and queries. If you want distinct analyzers for each phase, you may include two `<analyzer>` definitions distinguished with a type attribute. For example:
[.dynamic-tabs]
--
[example.tab-pane#byname-phases]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="nametext" class="solr.TextField">
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="keepWord" words="keepwords.txt"/>
<filter name="synonym" synonyms="syns.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer name="standard"/>
<filter name="lowercase"/>
</analyzer>
</fieldType>
----
====
[example.tab-pane#byclass-phases]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<fieldType name="nametext" class="solr.TextField"> <fieldType name="nametext" class="solr.TextField">
@ -80,6 +126,8 @@ If you provide a simple `<analyzer>` definition for a field type, as in the exam
</analyzer> </analyzer>
</fieldType> </fieldType>
---- ----
====
--
In this theoretical example, at index time the text is tokenized, the tokens are set to lowercase, any that are not listed in `keepwords.txt` are discarded and those that remain are mapped to alternate values as defined by the synonym rules in the file `syns.txt`. This essentially builds an index from a restricted set of possible values and then normalizes them to values that may not even occur in the original text. In this theoretical example, at index time the text is tokenized, the tokens are set to lowercase, any that are not listed in `keepwords.txt` are discarded and those that remain are mapped to alternate values as defined by the synonym rules in the file `syns.txt`. This essentially builds an index from a restricted set of possible values and then normalizes them to values that may not even occur in the original text.
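As a hedged variation, the same components could be arranged so that synonym mapping happens at query time instead, leaving the indexed terms restricted but unmapped. The sketch below reuses the `keepwords.txt` and `syns.txt` file names from the example. The `nametext_qsyn` field type name is hypothetical.

[source,xml]
----
<fieldType name="nametext_qsyn" class="solr.TextField">
  <analyzer type="index">
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
    <!-- Index only the tokens listed in keepwords.txt -->
    <filter name="keepWord" words="keepwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
    <!-- Apply the synonym rules to query terms rather than indexed terms -->
    <filter name="synonym" synonyms="syns.txt"/>
  </analyzer>
</fieldType>
----

Whether index-time or query-time mapping is preferable depends on the data and the synonym rules. This is only one possible arrangement.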
@ -103,14 +151,14 @@ For most use cases, this provides the best possible behavior, but if you wish fo
---- ----
<fieldType name="nametext" class="solr.TextField"> <fieldType name="nametext" class="solr.TextField">
<analyzer type="index"> <analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/> <tokenizer name="standard"/>
<filter class="solr.LowerCaseFilterFactory"/> <filter name="lowercase"/>
<filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/> <filter name="keepWord" words="keepwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/> <filter name="synonym" synonyms="syns.txt"/>
</analyzer> </analyzer>
<analyzer type="query"> <analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/> <tokenizer name="standard"/>
<filter class="solr.LowerCaseFilterFactory"/> <filter name="lowercase"/>
</analyzer> </analyzer>
<!-- No analysis at all when doing queries that involved Multi-Term expansion --> <!-- No analysis at all when doing queries that involved Multi-Term expansion -->
<analyzer type="multiterm"> <analyzer type="multiterm">

@ -28,6 +28,23 @@ This filter requires specifying a `mapping` argument, which is the path and name
Example: Example:
[.dynamic-tabs]
--
[example.tab-pane#byname-charfilter]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<charFilter name="mapping" mapping="mapping-FoldToASCII.txt"/>
<tokenizer ...>
[...]
</analyzer>
----
====
[example.tab-pane#byclass-charfilter]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
@ -36,6 +53,8 @@ Example:
[...] [...]
</analyzer> </analyzer>
---- ----
====
--
Mapping file syntax: Mapping file syntax:
@ -101,6 +120,23 @@ The table below presents examples of HTML stripping.
Example: Example:
[.dynamic-tabs]
--
[example.tab-pane#byname-charfilter-htmlstrip]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<charFilter name="htmlStrip"/>
<tokenizer ...>
[...]
</analyzer>
----
====
[example.tab-pane#byclass-charfilter-htmlstrip]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
@ -109,6 +145,8 @@ Example:
[...] [...]
</analyzer> </analyzer>
---- ----
====
--
== solr.ICUNormalizer2CharFilterFactory == solr.ICUNormalizer2CharFilterFactory
@ -124,6 +162,23 @@ Arguments:
Example: Example:
[.dynamic-tabs]
--
[example.tab-pane#byname-charfilter-icunormalizer2]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<charFilter name="icuNormalizer2"/>
<tokenizer ...>
[...]
</analyzer>
----
====
[example.tab-pane#byclass-charfilter-icunormalizer2]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
@ -132,6 +187,8 @@ Example:
[...] [...]
</analyzer> </analyzer>
---- ----
====
--
== solr.PatternReplaceCharFilterFactory == solr.PatternReplaceCharFilterFactory
@ -145,6 +202,24 @@ Arguments:
You can configure this filter in `schema.xml` like this: You can configure this filter in `schema.xml` like this:
[.dynamic-tabs]
--
[example.tab-pane#byname-charfilter-patternreplace]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<charFilter name="patternReplace"
pattern="([nN][oO]\.)\s*(\d+)" replacement="$1$2"/>
<tokenizer ...>
[...]
</analyzer>
----
====
[example.tab-pane#byclass-charfilter-patternreplace]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
@ -154,6 +229,8 @@ You can configure this filter in `schema.xml` like this:
[...] [...]
</analyzer> </analyzer>
---- ----
====
--
The table below presents examples of regex-based pattern replacement: The table below presents examples of regex-based pattern replacement:

File diff suppressed because it is too large

File diff suppressed because it is too large

@ -333,13 +333,13 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{
"positionIncrementGap":"100", "positionIncrementGap":"100",
"analyzer" : { "analyzer" : {
"charFilters":[{ "charFilters":[{
"class":"solr.PatternReplaceCharFilterFactory", "name":"patternReplace",
"replacement":"$1$1", "replacement":"$1$1",
"pattern":"([a-zA-Z])\\\\1+" }], "pattern":"([a-zA-Z])\\\\1+" }],
"tokenizer":{ "tokenizer":{
"class":"solr.WhitespaceTokenizerFactory" }, "name":"whitespace" },
"filters":[{ "filters":[{
"class":"solr.WordDelimiterFilterFactory", "name":"wordDelimiter",
"preserveOriginal":"0" }]}} "preserveOriginal":"0" }]}}
}' http://localhost:8983/solr/gettingstarted/schema }' http://localhost:8983/solr/gettingstarted/schema
---- ----
@ -361,11 +361,11 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{
"class":"solr.TextField", "class":"solr.TextField",
"indexAnalyzer":{ "indexAnalyzer":{
"tokenizer":{ "tokenizer":{
"class":"solr.PathHierarchyTokenizerFactory", "name":"pathHierarchy",
"delimiter":"/" }}, "delimiter":"/" }},
"queryAnalyzer":{ "queryAnalyzer":{
"tokenizer":{ "tokenizer":{
"class":"solr.KeywordTokenizerFactory" }}} "name":"keyword" }}}
}' http://localhost:8983/solr/gettingstarted/schema }' http://localhost:8983/solr/gettingstarted/schema
---- ----
==== ====
@ -383,11 +383,11 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{
"class":"solr.TextField", "class":"solr.TextField",
"indexAnalyzer":{ "indexAnalyzer":{
"tokenizer":{ "tokenizer":{
"class":"solr.PathHierarchyTokenizerFactory", "name":"pathHierarchy",
"delimiter":"/" }}, "delimiter":"/" }},
"queryAnalyzer":{ "queryAnalyzer":{
"tokenizer":{ "tokenizer":{
"class":"solr.KeywordTokenizerFactory" }}} "name":"keyword" }}}
}' http://localhost:8983/api/cores/gettingstarted/schema }' http://localhost:8983/api/cores/gettingstarted/schema
---- ----
==== ====
@ -446,7 +446,7 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{
"positionIncrementGap":"100", "positionIncrementGap":"100",
"analyzer":{ "analyzer":{
"tokenizer":{ "tokenizer":{
"class":"solr.StandardTokenizerFactory" }}} "name":"standard" }}}
}' http://localhost:8983/solr/gettingstarted/schema }' http://localhost:8983/solr/gettingstarted/schema
---- ----
==== ====
@ -463,7 +463,7 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{
"positionIncrementGap":"100", "positionIncrementGap":"100",
"analyzer":{ "analyzer":{
"tokenizer":{ "tokenizer":{
"class":"solr.StandardTokenizerFactory" }}} "name":"standard" }}}
}' http://localhost:8983/api/cores/gettingstarted/schema }' http://localhost:8983/api/cores/gettingstarted/schema
---- ----
==== ====
@ -565,13 +565,13 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{
"positionIncrementGap":"100", "positionIncrementGap":"100",
"analyzer":{ "analyzer":{
"charFilters":[{ "charFilters":[{
"class":"solr.PatternReplaceCharFilterFactory", "name":"patternReplace",
"replacement":"$1$1", "replacement":"$1$1",
"pattern":"([a-zA-Z])\\\\1+" }], "pattern":"([a-zA-Z])\\\\1+" }],
"tokenizer":{ "tokenizer":{
"class":"solr.WhitespaceTokenizerFactory" }, "name":"whitespace" },
"filters":[{ "filters":[{
"class":"solr.WordDelimiterFilterFactory", "name":"wordDelimiter",
"preserveOriginal":"0" }]}}, "preserveOriginal":"0" }]}},
"add-field" : { "add-field" : {
"name":"sell_by", "name":"sell_by",

@ -20,6 +20,24 @@ Tokenizers are responsible for breaking field data into lexical units, or _token
You configure the tokenizer for a text field type in `schema.xml` with a `<tokenizer>` element, as a child of `<analyzer>`: You configure the tokenizer for a text field type in `schema.xml` with a `<tokenizer>` element, as a child of `<analyzer>`:
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="text" class="solr.TextField">
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="lowercase"/>
</analyzer>
</fieldType>
----
====
[example.tab-pane#byclass-tokenizer]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<fieldType name="text" class="solr.TextField"> <fieldType name="text" class="solr.TextField">
@ -29,11 +47,30 @@ You configure the tokenizer for a text field type in `schema.xml` with a `<token
</analyzer> </analyzer>
</fieldType> </fieldType>
---- ----
====
--
The class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the `org.apache.solr.analysis.TokenizerFactory`. A TokenizerFactory's `create()` method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer it passes a Reader object that provides the content of the text field. The name/class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the `org.apache.lucene.analysis.util.TokenizerFactory`. A TokenizerFactory's `create()` method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer it passes a Reader object that provides the content of the text field.
Arguments may be passed to tokenizer factories by setting attributes on the `<tokenizer>` element. Arguments may be passed to tokenizer factories by setting attributes on the `<tokenizer>` element.
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-args]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="semicolonDelimited" class="solr.TextField">
<analyzer type="query">
<tokenizer name="pattern" pattern="; "/>
</analyzer>
</fieldType>
----
====
[example.tab-pane#byclass-tokenizer-args]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<fieldType name="semicolonDelimited" class="solr.TextField"> <fieldType name="semicolonDelimited" class="solr.TextField">
@ -42,6 +79,8 @@ Arguments may be passed to tokenizer factories by setting attributes on the `<to
</analyzer> </analyzer>
</fieldType> </fieldType>
---- ----
====
--
The following sections describe the tokenizer factory classes included in this release of Solr. The following sections describe the tokenizer factory classes included in this release of Solr.
@ -66,12 +105,29 @@ The Standard Tokenizer supports http://unicode.org/reports/tr29/#Word_Boundaries
*Example:* *Example:*
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-standard]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="standard"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-standard]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/> <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer> </analyzer>
---- ----
====
--
*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq." *In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."
@ -95,12 +151,29 @@ The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of S
*Example:* *Example:*
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-classic]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="classic"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-classic]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/> <tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer> </analyzer>
---- ----
====
--
*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq." *In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."
@ -116,12 +189,29 @@ This tokenizer treats the entire text field as a single token.
*Example:* *Example:*
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-keyword]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="keyword"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-keyword]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/> <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer> </analyzer>
---- ----
====
--
*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq." *In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."
@ -137,12 +227,29 @@ This tokenizer creates tokens from strings of contiguous letters, discarding all
*Example:* *Example:*
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-letter]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="letter"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-letter]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
<tokenizer class="solr.LetterTokenizerFactory"/> <tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer> </analyzer>
---- ----
====
--
*In:* "I can't." *In:* "I can't."
@ -158,12 +265,29 @@ Tokenizes the input stream by delimiting at non-letters and then converting all
*Example:* *Example:*
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-lowercase]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="lowercase"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-lowercase]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/> <tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer> </analyzer>
---- ----
====
--
*In:* "I just \*LOVE* my iPhone!" *In:* "I just \*LOVE* my iPhone!"
@ -185,12 +309,29 @@ Reads the field text and generates n-gram tokens of sizes in the given range.
Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As a result, the space character is included in the encoding. Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As a result, the space character is included in the encoding.
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-ngram]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="nGram"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-ngram]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
<tokenizer class="solr.NGramTokenizerFactory"/> <tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer> </analyzer>
---- ----
====
--
*In:* "hey man" *In:* "hey man"
@ -200,12 +341,29 @@ Default behavior. Note that this tokenizer operates over the whole field. It doe
With an n-gram size range of 4 to 5: With an n-gram size range of 4 to 5:
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-ngram-args]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="nGram" minGramSize="4" maxGramSize="5"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-ngram-args]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/> <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer> </analyzer>
---- ----
====
--
*In:* "bicycle" *In:* "bicycle"
@ -227,12 +385,29 @@ Reads the field text and generates edge n-gram tokens of sizes in the given rang
Default behavior (min and max default to 1): Default behavior (min and max default to 1):
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-edgengram]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="edgeNGram"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-edgengram]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory"/> <tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer> </analyzer>
---- ----
====
--
*In:* "babaloo" *In:* "babaloo"
@ -242,12 +417,29 @@ Default behavior (min and max default to 1):
Edge n-gram range of 2 to 5 Edge n-gram range of 2 to 5
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-edgengram-args]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="edgeNGram" minGramSize="2" maxGramSize="5"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-edgengram-args]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/> <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
</analyzer> </analyzer>
---- ----
====
--
*In:* "babaloo" *In:* "babaloo"
@ -269,6 +461,22 @@ The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word br
*Example:* *Example:*
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-icu]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<!-- no customization -->
<tokenizer name="icu"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-icu]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
@ -276,7 +484,25 @@ The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word br
<tokenizer class="solr.ICUTokenizerFactory"/> <tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer> </analyzer>
---- ----
====
--
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-icu-rule]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="icu"
rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-icu-rule]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
@ -284,6 +510,8 @@ The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word br
rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/> rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer> </analyzer>
---- ----
====
--
[IMPORTANT] [IMPORTANT]
==== ====
@ -306,6 +534,23 @@ This tokenizer creates synonyms from file path hierarchies.
*Example:* *Example:*
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-pathhierarchy]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer name="pathHierarchy" delimiter="\" replace="/"/>
</analyzer>
</fieldType>
----
====
[example.tab-pane#byclass-tokenizer-pathhierarchy]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100"> <fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
@ -314,6 +559,8 @@ This tokenizer creates synonyms from file path hierarchies.
</analyzer> </analyzer>
</fieldType> </fieldType>
---- ----
====
--
*In:* "c:\usr\local\apache" *In:* "c:\usr\local\apache"
@ -337,12 +584,29 @@ See {java-javadocs}java/util/regex/Pattern.html[the Javadocs for `java.util.rege
A comma separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more spaces. A comma separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more spaces.
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-pattern]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="pattern" pattern="\s*,\s*"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-pattern]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/> <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer> </analyzer>
---- ----
====
--
*In:* "fee,fie, foe , fum, foo" *In:* "fee,fie, foe , fum, foo"
@ -352,12 +616,29 @@ A comma separated list. Tokens are separated by a sequence of zero or more space
Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either case is extracted as a token. Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either case is extracted as a token.
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-pattern-words]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="pattern" pattern="[A-Z][A-Za-z]*" group="0"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-pattern-words]
====
[.tab-label]*With class name (legacy)*
[source,xml] [source,xml]
---- ----
<analyzer> <analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*" group="0"/> <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*" group="0"/>
</analyzer> </analyzer>
---- ----
====
--
*In:* "Hello. My name is Inigo Montoya. You killed my father. Prepare to die." *In:* "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
@ -367,12 +648,29 @@ Extract simple, capitalized words. A sequence of at least one capital letter fol
Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional semi-colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups are numbered by counting left parenthesis from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens. Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional semi-colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups are numbered by counting left parenthesis from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens.
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-pattern-sku]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="pattern" pattern="(SKU|Part(\sNumber)?):?\s([0-9-]+)" group="3"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-pattern-sku]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="(SKU|Part(\sNumber)?):?\s([0-9-]+)" group="3"/>
</analyzer>
----
====
--
*In:* "SKU: 1234, Part Number 5678, Part: 126-987"
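To make the group numbering concrete, here is a minimal sketch that runs the expression over the input above with `java.util.regex` and keeps only group 3. It assumes the group 3 subexpression is written as `[0-9-]+`, as the description states; the class name `PartNumberSketch` and the method `partNumbers` are hypothetical, and this is not Solr's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartNumberSketch {

    // Groups, counted by opening parenthesis from left to right:
    //   1 = (SKU|Part(\sNumber)?)   the prefix
    //   2 = (\sNumber)              the optional " Number" suffix
    //   3 = ([0-9-]+)               the part number itself
    private static final Pattern SKU =
            Pattern.compile("(SKU|Part(\\sNumber)?):?\\s([0-9-]+)");

    // Hypothetical helper: returns group 3 of every match.
    public static List<String> partNumbers(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = SKU.matcher(input);
        while (m.find()) {
            tokens.add(m.group(3));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(partNumbers("SKU: 1234, Part Number 5678, Part: 126-987"));
        // prints [1234, 5678, 126-987]
    }
}
```

Because matching is case sensitive, an input such as "sku: 1234" produces no match in this sketch.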
@ -394,12 +692,29 @@ This tokenizer is similar to the `PatternTokenizerFactory` described above, but
To match tokens delimited by simple whitespace characters:
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-simplepattern]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="simplePattern" pattern="[^ \t\r\n]+"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-simplepattern]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
<tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
</analyzer>
----
====
--
== Simplified Regular Expression Pattern Splitting Tokenizer
@ -417,12 +732,29 @@ This tokenizer is similar to the `SimplePatternTokenizerFactory` described above
To match tokens delimited by simple whitespace characters:
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-simplepatternsplit]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="simplePatternSplit" pattern="[ \t\r\n]+"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-simplepatternsplit]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
</analyzer>
----
====
--
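The two whitespace patterns are complements: `[^ \t\r\n]+` describes the tokens themselves, while `[ \t\r\n]+` describes the gaps between them. The sketch below shows both approaches producing the same tokens. It uses `java.util.regex` purely as an analogy, whereas these tokenizers use Lucene's own regular expression syntax, so treat it as an illustration of the idea rather than of Solr's behavior. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchVersusSplitSketch {

    // Matching approach: the pattern describes what a token looks like.
    public static List<String> byMatching(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("[^ \\t\\r\\n]+").matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Splitting approach: the pattern describes what separates tokens.
    public static List<String> bySplitting(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.split("[ \\t\\r\\n]+")) {
            // String.split can yield a leading empty string; drop it here.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String in = " To be,\tor\r\nwhat? ";
        System.out.println(byMatching(in));   // prints [To, be,, or, what?]
        System.out.println(bySplitting(in));  // prints [To, be,, or, what?]
    }
}
```

Which form is more convenient depends on whether the tokens or the delimiters are easier to describe.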
== UAX29 URL Email Tokenizer
@ -448,12 +780,29 @@ The UAX29 URL Email Tokenizer supports http://unicode.org/reports/tr29/#Word_Bou
*Example:*
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-uax29urlemail]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="uax29URLEmail"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-uax29urlemail]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
<tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>
----
====
--
*In:* "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"
@ -475,12 +824,29 @@ Specifies how to define whitespace for the purpose of tokenization. Valid values
*Example:*
[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-whitespace]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
<tokenizer name="whitespace" rule="java" />
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-whitespace]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" rule="java" />
</analyzer>
----
====
--
*In:* "To be, or what?"