mirror of https://github.com/apache/lucene.git
SOLR-13691: Add example field type configurations using name attributes to Ref Guide
This commit is contained in:
parent
77c1ed7d16
commit
66d7dffc79
@ -64,7 +64,7 @@ Upgrade Notes
* SOLR-11266: default Content-Type override for JSONResponseWriter from _default configSet is removed. Example has been
  provided in sample_techproducts_configs to override content-type. (Ishan Chattopadhyaya, Munendra S N, Gus Heck)

-* SOLR-13593 SOLR-13690: Allow to look up analyzer components by their SPI names in field type configuration. (Tomoko Uchida)
+* SOLR-13593 SOLR-13690 SOLR-13691: Allow to look up analyzer components by their SPI names in field type configuration. (Tomoko Uchida)

Other Changes
----------------------

@ -22,6 +22,25 @@ A filter may also do more complex analysis by looking ahead to consider multiple

Because filters consume one `TokenStream` and produce a new `TokenStream`, they can be chained one after another indefinitely. Each filter in the chain in turn processes the tokens produced by its predecessor. The order in which you specify the filters is therefore significant. Typically, the most general filtering is done first, and later filtering stages are more specialized.

[.dynamic-tabs]
--
[example.tab-pane#byname-filterexample]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
    <filter name="englishPorter"/>
  </analyzer>
</fieldType>
----
====
[example.tab-pane#byclass-filterexample]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<fieldType name="text" class="solr.TextField">
@ -32,6 +51,8 @@ Because filters consume one `TokenStream` and produce a new `TokenStream`, they
  </analyzer>
</fieldType>
----
====
--

This example starts with Solr's standard tokenizer, which breaks the field's text into tokens. All the tokens are then set to lowercase, which will facilitate case-insensitive matching at query time.
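The tokenizer-then-filters pipeline above can be sketched outside Solr as plain function chaining (a hedged illustration only, not Lucene's API; the stemming stage is omitted):

```python
def tokenize(text):
    # stand-in for the standard tokenizer: split on whitespace, drop punctuation
    return [t.strip(".,!?") for t in text.split() if t.strip(".,!?")]

def lowercase(tokens):
    return [t.lower() for t in tokens]

def analyze(text, filters):
    # each filter consumes the previous stage's tokens, in configured order
    tokens = tokenize(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

print(analyze("Please, email John by Friday.", [lowercase]))
# ['please', 'email', 'john', 'by', 'friday']
```

Swapping the order of filters in the list changes the result, which is why the order of `<filter>` elements matters.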

@ -20,6 +20,23 @@ The job of a <<tokenizers.adoc#tokenizers,tokenizer>> is to break up a stream of

Characters in the input stream may be discarded, such as whitespace or other delimiters. They may also be added to or replaced, such as mapping aliases or abbreviations to normalized forms. A token contains various metadata in addition to its text value, such as the location at which the token occurs in the field. Because a tokenizer may produce tokens that diverge from the input text, you should not assume that the text of the token is the same text that occurs in the field, or that its length is the same as the original text. It's also possible for more than one token to have the same position or refer to the same offset in the original text. Keep this in mind if you use token metadata for things like highlighting search results in the field text.
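As a hedged illustration of why token text and field text can diverge (the `Token` class here is hypothetical, not Solr's), consider a chain that lowercases and stacks a synonym at the same position:

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str       # may differ from the original field text
    start: int      # character offset into the original field value
    end: int
    position: int   # term position; stacked synonyms share a position

field = "NYC subway"
tokens = [
    Token("nyc", 0, 3, 0),            # lowercased: text differs from "NYC"
    Token("new york city", 0, 3, 0),  # synonym: same offsets, same position
    Token("subway", 4, 10, 1),
]

# Highlighting must rely on offsets, not token text, to find the original span
print(field[tokens[1].start:tokens[1].end])
# NYC
```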

[.dynamic-tabs]
--
[example.tab-pane#byname-tok]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer name="standard"/>
  </analyzer>
</fieldType>
----
====
[example.tab-pane#byclass-tok]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<fieldType name="text" class="solr.TextField">
@ -28,6 +45,8 @@ Characters in the input stream may be discarded, such as whitespace or other del
  </analyzer>
</fieldType>
----
====
--

The class named in the tokenizer element is not the actual tokenizer, but rather a class that implements the `TokenizerFactory` API. This factory class will be called upon to create new tokenizer instances as needed. Objects created by the factory must derive from `Tokenizer`, which indicates that they produce sequences of tokens. If the tokenizer produces tokens that are usable as is, it may be the only component of the analyzer. Otherwise, the tokenizer's output tokens will serve as input to the first filter stage in the pipeline.
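The factory arrangement described above can be sketched in Python (a loose analogy to the Java `TokenizerFactory` API, with a hypothetical whitespace tokenizer; the real interface lives in Lucene):

```python
import io

class WhitespaceTokenizerFactory:
    """Parsed once from configuration; create() is called per field value."""
    def __init__(self, args):
        self.args = args

    def create(self, reader):
        # accepts a Reader-like object and returns a stream of tokens
        return iter(reader.read().split())

factory = WhitespaceTokenizerFactory({})
stream = factory.create(io.StringIO("a token stream"))
print(list(stream))
# ['a', 'token', 'stream']
```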

@ -35,6 +35,27 @@ Even the most complex analysis requirements can usually be decomposed into a ser

For example:

[.dynamic-tabs]
--
[example.tab-pane#byname]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="nametext" class="solr.TextField">
  <analyzer>
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
    <filter name="stop"/>
    <filter name="englishPorter"/>
  </analyzer>
</fieldType>
----
Tokenizer and filter factory classes are referred to by their symbolic names (SPI names). Here, name="standard" refers to `org.apache.lucene.analysis.standard.StandardTokenizerFactory`.
====
[example.tab-pane#byclass]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<fieldType name="nametext" class="solr.TextField">
@ -46,8 +67,9 @@ For example:
  </analyzer>
</fieldType>
----

-Note that classes in the `org.apache.solr.analysis` package may be referred to here with the shorthand `solr.` prefix.
+Note that classes in the `org.apache.lucene.analysis` package may be referred to here with the shorthand `solr.` prefix.
====
--

In this case, no Analyzer class was specified on the `<analyzer>` element. Rather, a sequence of more specialized classes is wired together and collectively acts as the Analyzer for the field. The text of the field is passed to the first item in the list (`solr.StandardTokenizerFactory`), and the tokens that emerge from the last one (`solr.EnglishPorterFilterFactory`) are the terms that are used for indexing or querying any fields that use the "nametext" `fieldType`.

@ -65,6 +87,30 @@ In many cases, the same analysis should be applied to both phases. This is desir

If you provide a simple `<analyzer>` definition for a field type, as in the examples above, then it will be used for both indexing and queries. If you want distinct analyzers for each phase, you may include two `<analyzer>` definitions distinguished with a type attribute. For example:

[.dynamic-tabs]
--
[example.tab-pane#byname-phases]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="nametext" class="solr.TextField">
  <analyzer type="index">
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
    <filter name="keepWord" words="keepwords.txt"/>
    <filter name="synonym" synonyms="syns.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
  </analyzer>
</fieldType>
----
====
[example.tab-pane#byclass-phases]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<fieldType name="nametext" class="solr.TextField">
@ -80,6 +126,8 @@ If you provide a simple `<analyzer>` definition for a field type, as in the exam
  </analyzer>
</fieldType>
----
====
--

In this theoretical example, at index time the text is tokenized, the tokens are set to lowercase, any that are not listed in `keepwords.txt` are discarded, and those that remain are mapped to alternate values as defined by the synonym rules in the file `syns.txt`. This essentially builds an index from a restricted set of possible values and then normalizes them to values that may not even occur in the original text.
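The index-time behavior described above can be simulated in plain Python (a sketch under assumed file contents: a hypothetical keep list and one hypothetical synonym rule, not Solr's actual filter implementations):

```python
def tokenize(text):
    return text.lower().split()

def keep_word_filter(tokens, keep):
    # discard tokens not present in the keep list (cf. keepwords.txt)
    return [t for t in tokens if t in keep]

def synonym_filter(tokens, synonyms):
    # map surviving tokens to their normalized values (cf. syns.txt)
    return [synonyms.get(t, t) for t in tokens]

keep = {"funny", "jocular"}
synonyms = {"jocular": "funny"}  # hypothetical rule: jocular => funny

tokens = synonym_filter(keep_word_filter(tokenize("A Jocular story"), keep), synonyms)
print(tokens)
# ['funny']
```

Note how the indexed term "funny" never occurs verbatim in the original text.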

@ -103,14 +151,14 @@ For most use cases, this provides the best possible behavior, but if you wish fo
----
<fieldType name="nametext" class="solr.TextField">
  <analyzer type="index">
-   <tokenizer class="solr.StandardTokenizerFactory"/>
-   <filter class="solr.LowerCaseFilterFactory"/>
-   <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
-   <filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
+   <tokenizer name="standard"/>
+   <filter name="lowercase"/>
+   <filter name="keepWord" words="keepwords.txt"/>
+   <filter name="synonym" synonyms="syns.txt"/>
  </analyzer>
  <analyzer type="query">
-   <tokenizer class="solr.StandardTokenizerFactory"/>
-   <filter class="solr.LowerCaseFilterFactory"/>
+   <tokenizer name="standard"/>
+   <filter name="lowercase"/>
  </analyzer>
  <!-- No analysis at all when doing queries that involved Multi-Term expansion -->
  <analyzer type="multiterm">

@ -28,6 +28,23 @@ This filter requires specifying a `mapping` argument, which is the path and name

Example:

[.dynamic-tabs]
--
[example.tab-pane#byname-charfilter]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <charFilter name="mapping" mapping="mapping-FoldToASCII.txt"/>
  <tokenizer ...>
  [...]
</analyzer>
----
====
[example.tab-pane#byclass-charfilter]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
@ -36,6 +53,8 @@ Example:
  [...]
</analyzer>
----
====
--

Mapping file syntax:

@ -101,6 +120,23 @@ The table below presents examples of HTML stripping.

Example:

[.dynamic-tabs]
--
[example.tab-pane#byname-charfilter-htmlstrip]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <charFilter name="htmlStrip"/>
  <tokenizer ...>
  [...]
</analyzer>
----
====
[example.tab-pane#byclass-charfilter-htmlstrip]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
@ -109,6 +145,8 @@ Example:
  [...]
</analyzer>
----
====
--

== solr.ICUNormalizer2CharFilterFactory

@ -124,6 +162,23 @@ Arguments:

Example:

[.dynamic-tabs]
--
[example.tab-pane#byname-charfilter-icunormalizer2]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <charFilter name="icuNormalizer2"/>
  <tokenizer ...>
  [...]
</analyzer>
----
====
[example.tab-pane#byclass-charfilter-icunormalizer2]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
@ -132,6 +187,8 @@ Example:
  [...]
</analyzer>
----
====
--

== solr.PatternReplaceCharFilterFactory

@ -145,6 +202,24 @@ Arguments:

You can configure this filter in `schema.xml` like this:

[.dynamic-tabs]
--
[example.tab-pane#byname-charfilter-patternreplace]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <charFilter name="patternReplace"
              pattern="([nN][oO]\.)\s*(\d+)" replacement="$1$2"/>
  <tokenizer ...>
  [...]
</analyzer>
----
====
[example.tab-pane#byclass-charfilter-patternreplace]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
@ -154,6 +229,8 @@ You can configure this filter in `schema.xml` like this:
  [...]
</analyzer>
----
====
--

The table below presents examples of regex-based pattern replacement:

File diff suppressed because it is too large
Load Diff
File diff suppressed because it is too large
Load Diff
@ -333,13 +333,13 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{
    "positionIncrementGap":"100",
    "analyzer" : {
      "charFilters":[{
-       "class":"solr.PatternReplaceCharFilterFactory",
+       "name":"patternReplace",
        "replacement":"$1$1",
        "pattern":"([a-zA-Z])\\\\1+" }],
      "tokenizer":{
-       "class":"solr.WhitespaceTokenizerFactory" },
+       "name":"whitespace" },
      "filters":[{
-       "class":"solr.WordDelimiterFilterFactory",
+       "name":"wordDelimiter",
        "preserveOriginal":"0" }]}}
}' http://localhost:8983/solr/gettingstarted/schema
----

@ -361,11 +361,11 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{
    "class":"solr.TextField",
    "indexAnalyzer":{
      "tokenizer":{
-       "class":"solr.PathHierarchyTokenizerFactory",
+       "name":"pathHierarchy",
        "delimiter":"/" }},
    "queryAnalyzer":{
      "tokenizer":{
-       "class":"solr.KeywordTokenizerFactory" }}}
+       "name":"keyword" }}}
}' http://localhost:8983/solr/gettingstarted/schema
----
====
@ -383,11 +383,11 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{
    "class":"solr.TextField",
    "indexAnalyzer":{
      "tokenizer":{
-       "class":"solr.PathHierarchyTokenizerFactory",
+       "name":"pathHierarchy",
        "delimiter":"/" }},
    "queryAnalyzer":{
      "tokenizer":{
-       "class":"solr.KeywordTokenizerFactory" }}}
+       "name":"keyword" }}}
}' http://localhost:8983/api/cores/gettingstarted/schema
----
====

@ -446,7 +446,7 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{
    "positionIncrementGap":"100",
    "analyzer":{
      "tokenizer":{
-       "class":"solr.StandardTokenizerFactory" }}}
+       "name":"standard" }}}
}' http://localhost:8983/solr/gettingstarted/schema
----
====
@ -463,7 +463,7 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{
    "positionIncrementGap":"100",
    "analyzer":{
      "tokenizer":{
-       "class":"solr.StandardTokenizerFactory" }}}
+       "name":"standard" }}}
}' http://localhost:8983/api/cores/gettingstarted/schema
----
====

@ -565,13 +565,13 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{
    "positionIncrementGap":"100",
    "analyzer":{
      "charFilters":[{
-       "class":"solr.PatternReplaceCharFilterFactory",
+       "name":"patternReplace",
        "replacement":"$1$1",
        "pattern":"([a-zA-Z])\\\\1+" }],
      "tokenizer":{
-       "class":"solr.WhitespaceTokenizerFactory" },
+       "name":"whitespace" },
      "filters":[{
-       "class":"solr.WordDelimiterFilterFactory",
+       "name":"wordDelimiter",
        "preserveOriginal":"0" }]}},
  "add-field" : {
    "name":"sell_by",

@ -20,6 +20,24 @@ Tokenizers are responsible for breaking field data into lexical units, or _token

You configure the tokenizer for a text field type in `schema.xml` with a `<tokenizer>` element, as a child of `<analyzer>`:

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    <tokenizer name="standard"/>
    <filter name="lowercase"/>
  </analyzer>
</fieldType>
----
====
[example.tab-pane#byclass-tokenizer]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<fieldType name="text" class="solr.TextField">
@ -29,11 +47,30 @@ You configure the tokenizer for a text field type in `schema.xml` with a `<token
  </analyzer>
</fieldType>
----
====
--

-The class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the `org.apache.solr.analysis.TokenizerFactory`. A TokenizerFactory's `create()` method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer it passes a Reader object that provides the content of the text field.
+The name/class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the `org.apache.lucene.analysis.util.TokenizerFactory`. A TokenizerFactory's `create()` method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer it passes a Reader object that provides the content of the text field.

Arguments may be passed to tokenizer factories by setting attributes on the `<tokenizer>` element.

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-args]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer name="pattern" pattern="; "/>
  </analyzer>
</fieldType>
----
====
[example.tab-pane#byclass-tokenizer-args]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<fieldType name="semicolonDelimited" class="solr.TextField">
@ -42,6 +79,8 @@ Arguments may be passed to tokenizer factories by setting attributes on the `<to
  </analyzer>
</fieldType>
----
====
--

The following sections describe the tokenizer factory classes included in this release of Solr.

@ -66,12 +105,29 @@ The Standard Tokenizer supports http://unicode.org/reports/tr29/#Word_Boundaries

*Example:*

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-standard]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="standard"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-standard]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
----
====
--

*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

@ -95,12 +151,29 @@ The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of S

*Example:*

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-classic]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="classic"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-classic]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer>
----
====
--

*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

@ -116,12 +189,29 @@ This tokenizer treats the entire text field as a single token.

*Example:*

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-keyword]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="keyword"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-keyword]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
----
====
--

*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

@ -137,12 +227,29 @@ This tokenizer creates tokens from strings of contiguous letters, discarding all

*Example:*

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-letter]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="letter"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-letter]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
----
====
--

*In:* "I can't."

@ -158,12 +265,29 @@ Tokenizes the input stream by delimiting at non-letters and then converting all

*Example:*

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-lowercase]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="lowercase"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-lowercase]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
----
====
--

*In:* "I just \*LOVE* my iPhone!"

@ -185,12 +309,29 @@ Reads the field text and generates n-gram tokens of sizes in the given range.

Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As a result, the space character is included in the encoding.

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-ngram]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="nGram"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-ngram]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>
----
====
--

*In:* "hey man"
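A hedged sketch of the n-gram idea (gram sizes 1 to 2, matching the documented defaults; emission order here is by gram length, which need not match Lucene's exact ordering):

```python
def ngrams(text, min_size=1, max_size=2):
    # every substring of each allowed length; whitespace is not special
    return [text[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(text) - n + 1)]

grams = ngrams("hey man")
print("y " in grams and " m" in grams)
# True  (grams that span the space character)
```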

@ -200,12 +341,29 @@ Default behavior. Note that this tokenizer operates over the whole field. It doe

With an n-gram size range of 4 to 5:

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-ngram-args]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="nGram" minGramSize="4" maxGramSize="5"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-ngram-args]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>
----
====
--

*In:* "bicycle"

@ -227,12 +385,29 @@ Reads the field text and generates edge n-gram tokens of sizes in the given rang

Default behavior (min and max default to 1):

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-edgengram]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="edgeNGram"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-edgengram]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer>
----
====
--

*In:* "babaloo"

@ -242,12 +417,29 @@ Default behavior (min and max default to 1):

Edge n-gram range of 2 to 5:

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-edgengram-args]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="edgeNGram" minGramSize="2" maxGramSize="5"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-edgengram-args]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
</analyzer>
----
====
--

*In:* "babaloo"
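The edge n-grams for this configuration can be sketched as follows (a simulation of the idea, not Lucene's tokenizer):

```python
def edge_ngrams(text, min_size, max_size):
    # n-grams anchored at the start of the field value
    return [text[:n] for n in range(min_size, max_size + 1) if n <= len(text)]

print(edge_ngrams("babaloo", 2, 5))
# ['ba', 'bab', 'baba', 'babal']
```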

@ -269,6 +461,22 @@ The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word br

*Example:*

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-icu]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <!-- no customization -->
  <tokenizer name="icu"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-icu]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
@ -276,7 +484,25 @@ The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word br
  <tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>
----
====
--

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-icu-rule]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="icu"
             rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-icu-rule]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
@ -284,6 +510,8 @@ The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word br
             rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>
----
====
--

[IMPORTANT]
====

@ -306,6 +534,23 @@ This tokenizer creates synonyms from file path hierarchies.

*Example:*

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-pathhierarchy]
====
[.tab-label]*With name*
[source,xml]
----
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer name="pathHierarchy" delimiter="\" replace="/"/>
  </analyzer>
</fieldType>
----
====
[example.tab-pane#byclass-tokenizer-pathhierarchy]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
@ -314,6 +559,8 @@ This tokenizer creates synonyms from file path hierarchies.
  </analyzer>
</fieldType>
----
====
--

*In:* "c:\usr\local\apache"

@ -337,12 +584,29 @@ See {java-javadocs}java/util/regex/Pattern.html[the Javadocs for `java.util.rege

A comma-separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more spaces.

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-pattern]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="pattern" pattern="\s*,\s*"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-pattern]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>
----
====
--

*In:* "fee,fie, foe , fum, foo"
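Since the tokenizer splits on the configured pattern, the same result can be reproduced with Python's `re.split` (assuming, as holds for this pattern, that it behaves the same in the Java and Python regex dialects):

```python
import re

# split on a comma with optional surrounding whitespace
tokens = re.split(r"\s*,\s*", "fee,fie, foe , fum, foo")
print(tokens)
# ['fee', 'fie', 'foe', 'fum', 'foo']
```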

@ -352,12 +616,29 @@ A comma separated list. Tokens are separated by a sequence of zero or more space

Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either case is extracted as a token.

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-pattern-words]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="pattern" pattern="[A-Z][A-Za-z]*" group="0"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-pattern-words]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*" group="0"/>
</analyzer>
----
====
--

*In:* "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
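With `group="0"` the whole match becomes the token; the same extraction can be checked with Python's `re.findall` (this pattern has identical semantics in the Java and Python dialects):

```python
import re

text = "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
tokens = re.findall(r"[A-Z][A-Za-z]*", text)
print(tokens)
# ['Hello', 'My', 'Inigo', 'Montoya', 'You', 'Prepare']
```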

@ -367,12 +648,29 @@ Extract simple, capitalized words. A sequence of at least one capital letter fol

Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups are numbered by counting left parentheses from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens.

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-pattern-sku]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="pattern" pattern="(SKU|Part(\sNumber)?):?\s(\[0-9-\]+)" group="3"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-pattern-sku]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="(SKU|Part(\sNumber)?):?\s(\[0-9-\]+)" group="3"/>
</analyzer>
----
====
--

*In:* "SKU: 1234, Part Number 5678, Part: 126-987"
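The `group="3"` behavior can be checked in Python (note the `\[0-9-\]` in the example is AsciiDoc escaping for a plain `[0-9-]` character class; the pattern semantics match Java's here):

```python
import re

pattern = re.compile(r"(SKU|Part(\sNumber)?):?\s([0-9-]+)")
text = "SKU: 1234, Part Number 5678, Part: 126-987"

# keep only the third capture group -- the digits and hyphens
tokens = [m.group(3) for m in pattern.finditer(text)]
print(tokens)
# ['1234', '5678', '126-987']
```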

@ -394,12 +692,29 @@ This tokenizer is similar to the `PatternTokenizerFactory` described above, but

To match tokens delimited by simple whitespace characters:

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-simplepattern]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="simplePattern" pattern="[^ \t\r\n]+"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-simplepattern]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
</analyzer>
----
====
--

== Simplified Regular Expression Pattern Splitting Tokenizer

@ -417,12 +732,29 @@ This tokenizer is similar to the `SimplePatternTokenizerFactory` described above

To match tokens delimited by simple whitespace characters:

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-simplepatternsplit]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="simplePatternSplit" pattern="[ \t\r\n]+"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-simplepatternsplit]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
</analyzer>
----
====
--

== UAX29 URL Email Tokenizer

@ -448,12 +780,29 @@ The UAX29 URL Email Tokenizer supports http://unicode.org/reports/tr29/#Word_Bou

*Example:*

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-uax29urlemail]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="uax29URLEmail"/>
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-uax29urlemail]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>
----
====
--

*In:* "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"

@ -475,12 +824,29 @@ Specifies how to define whitespace for the purpose of tokenization. Valid values

*Example:*

[.dynamic-tabs]
--
[example.tab-pane#byname-tokenizer-whitespace]
====
[.tab-label]*With name*
[source,xml]
----
<analyzer>
  <tokenizer name="whitespace" rule="java" />
</analyzer>
----
====
[example.tab-pane#byclass-tokenizer-whitespace]
====
[.tab-label]*With class name (legacy)*
[source,xml]
----
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java" />
</analyzer>
----
====
--

*In:* "To be, or what?"