diff --git a/solr/CHANGES.txt b/solr/CHANGES.txt index a74eb91c901..bacb6372c35 100644 --- a/solr/CHANGES.txt +++ b/solr/CHANGES.txt @@ -64,7 +64,7 @@ Upgrade Notes * SOLR-11266: default Content-Type override for JSONResponseWriter from _default configSet is removed. Example has been provided in sample_techproducts_configs to override content-type. (Ishan Chattopadhyaya, Munendra S N, Gus Heck) -* SOLR-13593 SOLR-13690: Allow to look up analyzer components by their SPI names in field type configuration. (Tomoko Uchida) +* SOLR-13593 SOLR-13690 SOLR-13691: Allow to look up analyzer components by their SPI names in field type configuration. (Tomoko Uchida) Other Changes ---------------------- diff --git a/solr/solr-ref-guide/src/about-filters.adoc b/solr/solr-ref-guide/src/about-filters.adoc index dbb10a6fe0a..a0577b94546 100644 --- a/solr/solr-ref-guide/src/about-filters.adoc +++ b/solr/solr-ref-guide/src/about-filters.adoc @@ -22,6 +22,25 @@ A filter may also do more complex analysis by looking ahead to consider multiple Because filters consume one `TokenStream` and produce a new `TokenStream`, they can be chained one after another indefinitely. Each filter in the chain in turn processes the tokens produced by its predecessor. The order in which you specify the filters is therefore significant. Typically, the most general filtering is done first, and later filtering stages are more specialized. +[.dynamic-tabs] +-- +[example.tab-pane#byname-filterexample] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + + +---- +==== +[example.tab-pane#byclass-filterexample] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -32,6 +51,8 @@ Because filters consume one `TokenStream` and produce a new `TokenStream`, they ---- +==== +-- This example starts with Solr's standard tokenizer, which breaks the field's text into tokens. All the tokens are then set to lowercase, which will facilitate case-insensitive matching at query time. 
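As a hedged sketch of the chain described above (standard tokenizer first, then lowercasing, then a more specialized stage), the by-name form might look like the following. The SPI names `standard`, `lowercase`, and `porterStem` are assumed to resolve to `StandardTokenizerFactory`, `LowerCaseFilterFactory`, and `PorterStemFilterFactory`; the field type name `text_chain_example` and the choice of a stemmer as the last stage are illustrative assumptions, not content recovered from the patch.

[source,xml]
----
<!-- Hypothetical field type; each filter consumes the TokenStream of the element above it -->
<fieldType name="text_chain_example" class="solr.TextField">
  <analyzer>
    <!-- Legacy equivalent: class="solr.StandardTokenizerFactory" -->
    <tokenizer name="standard"/>
    <!-- General normalization first. Legacy: class="solr.LowerCaseFilterFactory" -->
    <filter name="lowercase"/>
    <!-- More specialized stage last. Legacy: class="solr.PorterStemFilterFactory" -->
    <filter name="porterStem"/>
  </analyzer>
</fieldType>
----

Because order is significant, swapping the two `<filter>` elements would hand mixed-case tokens to the stemmer, which is usually not what is intended.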
diff --git a/solr/solr-ref-guide/src/about-tokenizers.adoc b/solr/solr-ref-guide/src/about-tokenizers.adoc index 77898d7ee27..f94ef5ee0f0 100644 --- a/solr/solr-ref-guide/src/about-tokenizers.adoc +++ b/solr/solr-ref-guide/src/about-tokenizers.adoc @@ -20,6 +20,23 @@ The job of a <> is to break up a stream of Characters in the input stream may be discarded, such as whitespace or other delimiters. They may also be added to or replaced, such as mapping aliases or abbreviations to normalized forms. A token contains various metadata in addition to its text value, such as the location at which the token occurs in the field. Because a tokenizer may produce tokens that diverge from the input text, you should not assume that the text of the token is the same text that occurs in the field, or that its length is the same as the original text. It's also possible for more than one token to have the same position or refer to the same offset in the original text. Keep this in mind if you use token metadata for things like highlighting search results in the field text. +[.dynamic-tabs] +-- +[example.tab-pane#byname-tok] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-tok] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -28,6 +45,8 @@ Characters in the input stream may be discarded, such as whitespace or other del ---- +==== +-- The class named in the tokenizer element is not the actual tokenizer, but rather a class that implements the `TokenizerFactory` API. This factory class will be called upon to create new tokenizer instances as needed. Objects created by the factory must derive from `Tokenizer`, which indicates that they produce sequences of tokens. If the tokenizer produces tokens that are usable as is, it may be the only component of the analyzer. Otherwise, the tokenizer's output tokens will serve as input to the first filter stage in the pipeline. 
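A minimal sketch of an analyzer whose only component is a tokenizer, written in both styles. It assumes that `name="standard"` is the SPI name of `solr.StandardTokenizerFactory`; the field type names `text_byname` and `text_byclass` are hypothetical.

[source,xml]
----
<!-- By SPI name: the factory is looked up from the symbolic name -->
<fieldType name="text_byname" class="solr.TextField">
  <analyzer>
    <tokenizer name="standard"/>
  </analyzer>
</fieldType>

<!-- By class name (legacy): the element names the TokenizerFactory implementation directly -->
<fieldType name="text_byclass" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
----

In both forms the element identifies a factory, not the tokenizer itself; the factory creates `Tokenizer` instances as needed.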
diff --git a/solr/solr-ref-guide/src/analyzers.adoc b/solr/solr-ref-guide/src/analyzers.adoc index 998f50f8982..36ca4c57326 100644 --- a/solr/solr-ref-guide/src/analyzers.adoc +++ b/solr/solr-ref-guide/src/analyzers.adoc @@ -35,6 +35,27 @@ Even the most complex analysis requirements can usually be decomposed into a ser For example: +[.dynamic-tabs] +-- +[example.tab-pane#byname] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + + + +---- +Tokenizer and filter factory classes are referred to by their symbolic names (SPI names). Here, name="standard" refers to `org.apache.lucene.analysis.standard.StandardTokenizerFactory`. +==== +[example.tab-pane#byclass] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -46,8 +67,9 @@ For example: ---- - -Note that classes in the `org.apache.solr.analysis` package may be referred to here with the shorthand `solr.` prefix. +Note that classes in the `org.apache.lucene.analysis` package may be referred to here with the shorthand `solr.` prefix. +==== +-- In this case, no Analyzer class was specified on the `<analyzer>` element. Rather, a sequence of more specialized classes are wired together and collectively act as the Analyzer for the field. The text of the field is passed to the first item in the list (`solr.StandardTokenizerFactory`), and the tokens that emerge from the last one (`solr.EnglishPorterFilterFactory`) are the terms that are used for indexing or querying any fields that use the "nametext" `fieldType`. @@ -65,6 +87,30 @@ In many cases, the same analysis should be applied to both phases. This is desir If you provide a simple `<analyzer>` definition for a field type, as in the examples above, then it will be used for both indexing and queries. If you want distinct analyzers for each phase, you may include two `<analyzer>` definitions distinguished with a type attribute. 
For example: +[.dynamic-tabs] +-- +[example.tab-pane#byname-phases] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + + + + + + + +---- +==== +[example.tab-pane#byclass-phases] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -80,6 +126,8 @@ If you provide a simple `` definition for a field type, as in the exam ---- +==== +-- In this theoretical example, at index time the text is tokenized, the tokens are set to lowercase, any that are not listed in `keepwords.txt` are discarded and those that remain are mapped to alternate values as defined by the synonym rules in the file `syns.txt`. This essentially builds an index from a restricted set of possible values and then normalizes them to values that may not even occur in the original text. @@ -103,14 +151,14 @@ For most use cases, this provides the best possible behavior, but if you wish fo ---- - - - - + + + + - - + + diff --git a/solr/solr-ref-guide/src/charfilterfactories.adoc b/solr/solr-ref-guide/src/charfilterfactories.adoc index cd81b3e94e0..031706cdb18 100644 --- a/solr/solr-ref-guide/src/charfilterfactories.adoc +++ b/solr/solr-ref-guide/src/charfilterfactories.adoc @@ -28,6 +28,23 @@ This filter requires specifying a `mapping` argument, which is the path and name Example: +[.dynamic-tabs] +-- +[example.tab-pane#byname-charfilter] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + [...] + +---- +==== +[example.tab-pane#byclass-charfilter] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -36,6 +53,8 @@ Example: [...] ---- +==== +-- Mapping file syntax: @@ -101,6 +120,23 @@ The table below presents examples of HTML stripping. Example: +[.dynamic-tabs] +-- +[example.tab-pane#byname-charfilter-htmlstrip] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + [...] + +---- +==== +[example.tab-pane#byclass-charfilter-htmlstrip] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -109,6 +145,8 @@ Example: [...] 
---- +==== +-- == solr.ICUNormalizer2CharFilterFactory @@ -124,6 +162,23 @@ Arguments: Example: +[.dynamic-tabs] +-- +[example.tab-pane#byname-charfilter-icunormalizer2] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + [...] + +---- +==== +[example.tab-pane#byclass-charfilter-icunormalizer2] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -132,6 +187,8 @@ Example: [...] ---- +==== +-- == solr.PatternReplaceCharFilterFactory @@ -145,6 +202,24 @@ Arguments: You can configure this filter in `schema.xml` like this: +[.dynamic-tabs] +-- +[example.tab-pane#byname-charfilter-patternreplace] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + [...] + +---- +==== +[example.tab-pane#byclass-charfilter-patternreplace] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -154,6 +229,8 @@ You can configure this filter in `schema.xml` like this: [...] ---- +==== +-- The table below presents examples of regex-based pattern replacement: diff --git a/solr/solr-ref-guide/src/filter-descriptions.adoc b/solr/solr-ref-guide/src/filter-descriptions.adoc index eedfbe9fdb1..f59a366a5b1 100644 --- a/solr/solr-ref-guide/src/filter-descriptions.adoc +++ b/solr/solr-ref-guide/src/filter-descriptions.adoc @@ -20,20 +20,58 @@ Filters examine a stream of tokens and keep them, transform them or discard them You configure each filter with a `<filter>` element in `schema.xml` as a child of `<analyzer>`, following the `<tokenizer>` element. Filter definitions should follow a tokenizer or another filter definition because they take a `TokenStream` as input. For example: +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + +---- +==== +[example.tab-pane#byclass-filter] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- - ... + ---- +==== +-- -The class attribute names a factory class that will instantiate a filter object as needed. 
Filter factory classes must implement the `org.apache.solr.analysis.TokenFilterFactory` interface. Like tokenizers, filters are also instances of TokenStream and thus are producers of tokens. Unlike tokenizers, filters also consume tokens from a TokenStream. This allows you to mix and match filters, in any order you prefer, downstream of a tokenizer. +The name/class attribute names a factory class that will instantiate a filter object as needed. Filter factory classes must implement the `org.apache.lucene.analysis.util.TokenFilterFactory` interface. Like tokenizers, filters are also instances of TokenStream and thus are producers of tokens. Unlike tokenizers, filters also consume tokens from a TokenStream. This allows you to mix and match filters, in any order you prefer, downstream of a tokenizer. Arguments may be passed to tokenizer factories to modify their behavior by setting attributes on the `<filter>` element. For example: +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter2] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + +---- +==== +[example.tab-pane#byclass-filter-2] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -43,6 +81,8 @@ Arguments may be passed to tokenizer factories to modify their behavior by setti ---- +==== +-- The following sections describe the filter factories that are included in this release of Solr. 
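To illustrate passing arguments as attributes, here is a hedged sketch of a by-name filter definition. It assumes the SPI names `pattern` and `length` resolve to `PatternTokenizerFactory` and `LengthFilterFactory`; the field type name `semicolon_delimited_example` and the argument values are chosen only for illustration.

[source,xml]
----
<!-- Hypothetical field type: split on "; " and keep only tokens of 2 to 7 characters -->
<fieldType name="semicolon_delimited_example" class="solr.TextField">
  <analyzer type="query">
    <!-- Legacy equivalent: class="solr.PatternTokenizerFactory" -->
    <tokenizer name="pattern" pattern="; "/>
    <!-- min and max are passed to the filter factory as arguments.
         Legacy equivalent: class="solr.LengthFilterFactory" -->
    <filter name="length" min="2" max="7"/>
  </analyzer>
</fieldType>
----

The arguments are the same in the by-name and by-class forms; only the attribute that selects the factory (`name` versus `class`) differs.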
@@ -77,6 +117,22 @@ This filter converts alphabetic, numeric, and symbolic Unicode characters which *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-asciifolding] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-asciifolding] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -84,6 +140,8 @@ This filter converts alphabetic, numeric, and symbolic Unicode characters which ---- +==== +-- *In:* "á" (Unicode character 00E1) @@ -112,6 +170,23 @@ BeiderMorseFilter changed its behavior in Solr 5.0 due to an update to version 3 *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-beidermorse] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-filter-beidermorse] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -120,6 +195,8 @@ BeiderMorseFilter changed its behavior in Solr 5.0 due to an update to version 3 ---- +==== +-- == Classic Filter @@ -131,6 +208,22 @@ This filter takes the output of the < + + + +---- +==== +[example.tab-pane#byclass-filter-classic] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -138,6 +231,8 @@ This filter takes the output of the < ---- +==== +-- *In:* "I.B.M. 
cat's can't" @@ -161,6 +256,22 @@ This filter creates word shingles by combining common tokens such as stop words *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-commongrams] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-commongrams] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -168,6 +279,8 @@ This filter creates word shingles by combining common tokens such as stop words ---- +==== +-- *In:* "the Cat" @@ -191,6 +304,22 @@ Implements the Daitch-Mokotoff Soundex algorithm, which allows identification of *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-daitchmokotoffsondex] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-daitchmokotoffsondex] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -198,6 +327,8 @@ Implements the Daitch-Mokotoff Soundex algorithm, which allows identification of ---- +==== +-- == Double Metaphone Filter @@ -215,6 +346,22 @@ This filter creates tokens using the http://commons.apache.org/proper/commons-co Default behavior for inject (true): keep the original token and add phonetic token(s) at the same position. +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-doublemetaphone] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-doublemetaphone] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -222,6 +369,8 @@ Default behavior for inject (true): keep the original token and add phonetic tok ---- +==== +-- *In:* "four score and Kuczewski" @@ -238,8 +387,8 @@ Discard original token (`inject="false"`). [source,xml] ---- - - + + ---- @@ -267,6 +416,22 @@ This filter generates edge n-gram tokens of sizes within the given range. Default behavior. 
+[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-edgengram] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-edgengram] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -274,6 +439,8 @@ Default behavior. ---- +==== +-- *In:* "four score and twenty" @@ -288,8 +455,8 @@ A range of 1 to 4. [source,xml] ---- - - + + ---- @@ -306,8 +473,8 @@ A range of 4 to 6. [source,xml] ---- - - + + ---- @@ -327,6 +494,22 @@ This filter stems plural English words to their singular form. *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-englishminimalstem] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-englishminimalstem] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -334,6 +517,8 @@ This filter stems plural English words to their singular form. ---- +==== +-- *In:* "dogs cats" @@ -351,6 +536,22 @@ This filter removes singular possessives (trailing *'s*) from words. Note that p *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-englishpossessive] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-englishpossessive] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -358,6 +559,8 @@ This filter removes singular possessives (trailing *'s*) from words. 
Note that p ---- +==== +-- *In:* "Man's dog bites dogs' man" @@ -379,6 +582,22 @@ This filter outputs a single token which is a concatenation of the sorted and de *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-fingerprint] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-fingerprint] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -386,6 +605,8 @@ This filter outputs a single token which is a concatenation of the sorted and de ---- +==== +-- *In:* "the quick brown fox jumped over the lazy dog" @@ -423,6 +644,26 @@ Be aware that your results will vary widely based on the quality of the provided *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-hunspellstem] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-hunspellstem] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -434,6 +675,8 @@ Be aware that your results will vary widely based on the quality of the provided strictAffixParsing="true" /> ---- +==== +-- *In:* "jump jumping jumped" @@ -453,6 +696,22 @@ Note that for this filter to work properly, the upstream tokenizer must not remo *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-hyphenatedwords] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-hyphenatedwords] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -460,6 +719,8 @@ Note that for this filter to work properly, the upstream tokenizer must not remo ---- +==== +-- *In:* "A hyphen- ated word" @@ -481,6 +742,22 @@ To use this filter, you must add additional .jars to Solr's classpath (as descri *Example without a filter:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-icufolding] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-icufolding] +==== +[.tab-label]*With class name 
(legacy)* [source,xml] ---- @@ -488,14 +765,16 @@ To use this filter, you must add additional .jars to Solr's classpath (as descri ---- +==== +-- *Example with a filter to exclude Swedish/Finnish characters:* [source,xml] ---- - - + + ---- @@ -523,6 +802,22 @@ This filter factory normalizes text according to one of five Unicode Normalizati *Example with NFKC_Casefold:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-icunormalizer2] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-icunormalizer2] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -530,14 +825,16 @@ This filter factory normalizes text according to one of five Unicode Normalizati ---- +==== +-- *Example with a filter to exclude Swedish/Finnish characters:* [source,xml] ---- - - + + ---- @@ -557,6 +854,22 @@ This filter applies http://userguide.icu-project.org/transforms/general[ICU Tran *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-icutransform] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-icutransform] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -564,6 +877,8 @@ This filter applies http://userguide.icu-project.org/transforms/general[ICU Tran ---- +==== +-- For detailed information about ICU Transforms, see http://userguide.icu-project.org/transforms/general. 
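For the ICU Transform filter, a by-name configuration might look like the sketch below. The SPI name `icuTransform` (for `ICUTransformFilterFactory`), the transform ID `Traditional-Simplified`, and the field type name `text_icu_transform_example` are assumptions made for illustration; the values used in the patch's own example are not recoverable from this diff.

[source,xml]
----
<!-- Hypothetical field type; requires the ICU analysis jars on Solr's classpath -->
<fieldType name="text_icu_transform_example" class="solr.TextField">
  <analyzer>
    <tokenizer name="standard"/>
    <!-- id selects the ICU transform to apply.
         Legacy equivalent: class="solr.ICUTransformFilterFactory" -->
    <filter name="icuTransform" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
----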
@@ -589,6 +904,22 @@ Where `keepwords.txt` contains: `happy funny silly` +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-keepword] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-keepword] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -596,6 +927,8 @@ Where `keepwords.txt` contains: ---- +==== +-- *In:* "Happy, sad or funny" @@ -610,8 +943,8 @@ Same `keepwords.txt`, case insensitive: [source,xml] ---- - - + + ---- @@ -628,9 +961,9 @@ Using LowerCaseFilterFactory before filtering for keep words, no `ignoreCase` fl [source,xml] ---- - - - + + + ---- @@ -652,13 +985,31 @@ KStem is an alternative to the Porter Stem Filter for developers looking for a l *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-kstem] +==== +[.tab-label]*With name* [source,xml] ---- - + + + +---- +==== +[example.tab-pane#byclass-filter-kstem] +==== +[.tab-label]*With class name (legacy)* +[source,xml] +---- + + ---- +==== +-- *In:* "jump jumping jumped" @@ -682,6 +1033,22 @@ This filter passes tokens whose length falls within the min/max limit specified. *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-length] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-length] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -689,6 +1056,8 @@ This filter passes tokens whose length falls within the min/max limit specified. 
---- +==== +-- *In:* "turn right at Albuquerque" @@ -712,6 +1081,23 @@ By default, this filter ignores any tokens in the wrapped `TokenStream` once the *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-limittokencount] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-limittokencount] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -720,6 +1106,8 @@ By default, this filter ignores any tokens in the wrapped `TokenStream` once the consumeAllTokens="false" /> ---- +==== +-- *In:* "1 2 3 4 5 6 7 8 9 10 11 12" @@ -743,6 +1131,23 @@ By default, this filter ignores any tokens in the wrapped `TokenStream` once the *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-limittokenoffset] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-limittokenoffset] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -751,6 +1156,8 @@ By default, this filter ignores any tokens in the wrapped `TokenStream` once the consumeAllTokens="false" /> ---- +==== +-- *In:* "0 2 4 6 8 A C E" @@ -774,6 +1181,23 @@ By default, this filter ignores any tokens in the wrapped `TokenStream` once the *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-limittokenposition] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-limittokenposition] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -782,6 +1206,8 @@ By default, this filter ignores any tokens in the wrapped `TokenStream` once the consumeAllTokens="false" /> ---- +==== +-- *In:* "1 2 3 4 5" @@ -799,6 +1225,22 @@ Converts any uppercase letters in a token to the equivalent lowercase token. 
All *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-lowercase] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-lowercase] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -806,6 +1248,8 @@ Converts any uppercase letters in a token to the equivalent lowercase token. All ---- +==== +-- *In:* "Down With CamelCase" @@ -825,6 +1269,22 @@ This is specialized version of the <> tha //TODO: make this show an actual API call. With this configuration the set of words is named "english" and can be managed via `/solr/collection_name/schema/analysis/stopwords/english` +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-managedstop] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-managedstop] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -832,6 +1292,8 @@ With this configuration the set of words is named "english" and can be managed v ---- +==== +-- See <> for example input/output. @@ -865,6 +1327,27 @@ NOTE: Although this filter produces correct token graphs, it cannot consume an i //TODO: make this show an actual API call With this configuration the set of mappings is named "english" and can be managed via `/solr/collection_name/schema/analysis/synonyms/english` +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-managedsynonymgraph] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + + + + +---- +==== +[example.tab-pane#byclass-filter-managedsynonymgraph] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -877,6 +1360,8 @@ With this configuration the set of mappings is named "english" and can be manage ---- +==== +-- See <> below for example input/output. @@ -896,6 +1381,22 @@ Generates n-gram tokens of sizes in the given range. Note that tokens are ordere Default behavior. 
+[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-ngram] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-ngram] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -903,6 +1404,8 @@ Default behavior. ---- +==== +-- *In:* "four score" @@ -917,8 +1420,8 @@ A range of 1 to 4. [source,xml] ---- - - + + ---- @@ -935,8 +1438,8 @@ A range of 3 to 5. [source,xml] ---- - - + + ---- @@ -960,6 +1463,22 @@ This filter adds a numeric floating point payload value to tokens that match a g *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-numericpayload] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-numericpayload] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -967,6 +1486,8 @@ This filter adds a numeric floating point payload value to tokens that match a g ---- +==== +-- *In:* "bing bang boom" @@ -992,6 +1513,22 @@ This filter applies a regular expression to each token and, for those that match Simple string replace: +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-patternreplace] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-patternreplace] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -999,6 +1536,8 @@ Simple string replace: ---- +==== +-- *In:* "cat concatenate catycat" @@ -1013,8 +1552,8 @@ String replacement, first occurrence only: [source,xml] ---- - - + + ---- @@ -1031,8 +1570,8 @@ More complex pattern with capture group reference in the replacement. Tokens tha [source,xml] ---- - - + + ---- @@ -1060,6 +1599,22 @@ This filter creates tokens using one of the phonetic encoding algorithms in the Default behavior for DoubleMetaphone encoding. 
+[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-phonetic] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-phonetic] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1067,6 +1622,8 @@ Default behavior for DoubleMetaphone encoding. ---- +==== +-- *In:* "four score and twenty" @@ -1083,8 +1640,8 @@ Discard original token. [source,xml] ---- - - + + ---- @@ -1101,8 +1658,8 @@ Default Soundex encoder. [source,xml] ---- - - + + ---- @@ -1122,6 +1679,22 @@ This filter applies the Porter Stemming Algorithm for English. The results are s *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-porterstem] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-porterstem] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1129,6 +1702,8 @@ This filter applies the Porter Stemming Algorithm for English. The results are s ---- +==== +-- *In:* "jump jumping jumped" @@ -1154,6 +1729,25 @@ This filter enables a form of conditional filtering: it only applies its wrapped All terms except those in `protectedTerms.txt` are truncated at 4 characters and lowercased: +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-protectedterm] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-protectedterm] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1164,6 +1758,8 @@ All terms except those in `protectedTerms.txt` are truncated at 4 characters and truncate.prefixLength="4"/> ---- +==== +-- *Example:* @@ -1174,8 +1770,8 @@ For all terms except those in `protectedTerms.txt`, synonyms are added, terms ar [source,xml] ---- - - + + + + + + +---- +==== +[example.tab-pane#byclass-filter-removeduplicates] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1215,6 +1829,8 @@ When used in the following configuration: ---- +==== +-- *In:* "Watch TV" @@ -1246,6 
+1862,23 @@ This filter reverses tokens to provide faster leading wildcard and prefix querie *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-reversedwildcard] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-reversedwildcard] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1254,6 +1887,8 @@ This filter reverses tokens to provide faster leading wildcard and prefix querie maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/> ---- +==== +-- *In:* "*foo *bar" @@ -1283,6 +1918,22 @@ This filter constructs shingles, which are token n-grams, from the token stream. Default behavior. +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-shingle] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-shingle] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1290,6 +1941,8 @@ Default behavior. ---- +==== +-- *In:* "To be, or what?" @@ -1304,8 +1957,8 @@ A shingle size of four, do not include original token. [source,xml] ---- - - + + ---- @@ -1335,6 +1988,22 @@ Solr contains Snowball stemmers for Armenian, Basque, Catalan, Danish, Dutch, En Default behavior: +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-snowball] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-snowball] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1342,6 +2011,8 @@ Default behavior: ---- +==== +-- *In:* "flip flipped flipping" @@ -1356,8 +2027,8 @@ French stemmer, English words: [source,xml] ---- - - + + ---- @@ -1374,8 +2045,8 @@ Spanish stemmer, Spanish words: [source,xml] ---- - - + + ---- @@ -1405,6 +2076,22 @@ This filter discards, or _stops_ analysis of, tokens that are on the given stop Case-sensitive matching, capitalized words not stopped. Token positions skip stopped words. 
+[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-stop] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-stop] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1412,6 +2099,8 @@ Case-sensitive matching, capitalized words not stopped. Token positions skip sto ---- +==== +-- *In:* "To be or what?" @@ -1424,8 +2113,8 @@ Case-sensitive matching, capitalized words not stopped. Token positions skip sto [source,xml] ---- - - + + ---- @@ -1459,6 +2148,24 @@ By contrast, a query like "`find the popsicle`" would remove '`the`' as a stopwo *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-suggeststop] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-filter-suggeststop] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1468,6 +2175,8 @@ By contrast, a query like "`find the popsicle`" would remove '`the`' as a stopwo words="stopwords.txt" format="wordset"/> ---- +==== +-- *In:* "The The" @@ -1535,6 +2244,27 @@ small => tiny,teeny,weeny *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-stop-synonymgraph] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + + + + +---- +==== +[example.tab-pane#byclass-filter-stop-synonymgraph] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1547,6 +2277,8 @@ small => tiny,teeny,weeny ---- +==== +-- *In:* "teh small couch" @@ -1572,6 +2304,22 @@ This filter adds the numeric character offsets of the token as a payload value f *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-stop-tokenoffsetpayload] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-stop-tokenoffsetpayload] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1579,6 +2327,8 @@ This filter adds the numeric character offsets of the token as a payload value f ---- +==== +-- *In:* "bing bang boom" 
@@ -1600,6 +2350,22 @@ This filter trims leading and/or trailing whitespace from tokens. Most tokenizer The PatternTokenizerFactory configuration used here splits the input on simple commas, it does not remove whitespace. +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-trim] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-trim] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1607,6 +2373,8 @@ The PatternTokenizerFactory configuration used here splits the input on simple c ---- +==== +-- *In:* "one, two , three ,four " @@ -1624,6 +2392,22 @@ This filter adds the token's type, as an encoded byte sequence, as its payload. *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-typeaspayload] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-typeaspayload] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1631,6 +2415,8 @@ This filter adds the token's type, as an encoded byte sequence, as its payload. ---- +==== +-- *In:* "Pay Bob's I.O.U." 
@@ -1652,6 +2438,22 @@ This filter adds the token's type, as a token at the same position as the token, With the example below, each token's type will be emitted verbatim at the same position: +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-typeassynonym] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-typeassynonym] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1659,9 +2461,27 @@ With the example below, each token's type will be emitted verbatim at the same p ---- +==== +-- With the example below, for a token "example.com" with type ``, the token emitted at the same position will be "\_type_": +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-typeassynonym-args] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-typeassynonym-args] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1669,6 +2489,8 @@ With the example below, for a token "example.com" with type ``, the token e ---- +==== +-- == Type Token Filter @@ -1686,12 +2508,29 @@ This filter blacklists or whitelists a specified list of token types, assuming t *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-typetoken] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-filter-typetoken] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- == Word Delimiter Filter @@ -1770,6 +2609,27 @@ $ => DIGIT Default behavior. The whitespace tokenizer is used here to preserve non-alphanumeric characters. +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-worddelimitergraph] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + + + + +---- +==== +[example.tab-pane#byclass-filter-worddelimitergraph] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1777,12 +2637,13 @@ Default behavior. 
The whitespace tokenizer is used here to preserve non-alphanum - ---- +==== +-- *In:* "hot-spot RoboBlaster/9000 100XL" @@ -1797,8 +2658,8 @@ Do not split on case changes, and do not generate number parts. Note that by not [source,xml] ---- - - + + ---- @@ -1815,8 +2676,8 @@ Concatenate word parts and number parts, but not word and number parts that occu [source,xml] ---- - - + + ---- @@ -1833,8 +2694,8 @@ Concatenate all. Word and/or number parts are joined together. [source,xml] ---- - - + + ---- @@ -1851,8 +2712,8 @@ Using a protected words list that contains "AstroBlaster" and "XL-5000" (among o [source,xml] ---- - - + + ---- diff --git a/solr/solr-ref-guide/src/language-analysis.adoc b/solr/solr-ref-guide/src/language-analysis.adoc index b32e6be0f7e..d71b8f5ba8e 100644 --- a/solr/solr-ref-guide/src/language-analysis.adoc +++ b/solr/solr-ref-guide/src/language-analysis.adoc @@ -31,6 +31,25 @@ Protects words from being modified by stemmers. A customized protected word list A sample Solr `protwords.txt` with comments can be found in the `sample_techproducts_configs` <> directory: +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-keywordmarker] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + + +---- +==== +[example.tab-pane#byclass-filter-keywordmarker] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -41,6 +60,8 @@ A sample Solr `protwords.txt` with comments can be found in the `sample_techprod ---- +==== +-- == KeywordRepeatFilterFactory @@ -52,6 +73,26 @@ To configure, add the `KeywordRepeatFilterFactory` early in the analysis chain. 
A sample fieldType configuration could look like this: +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-keywordrepeat] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + + + +---- +==== +[example.tab-pane#byclass-filter-keywordrepeat] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -63,6 +104,8 @@ A sample fieldType configuration could look like this: ---- +==== +-- IMPORTANT: When adding the same token twice, it will also score twice (double), so you may have to re-tune your ranking rules. @@ -72,7 +115,25 @@ Overrides stemming algorithms by applying a custom mapping, then protecting thes A customized mapping of words to stems, in a tab-separated file, can be specified to the `dictionary` attribute in the schema. Words in this mapping will be stemmed to the stems from the file, and will not be further changed by any stemmer. - +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-stemmeroverride] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + + +---- +==== +[example.tab-pane#byclass-filter-stemmeroverride] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -83,6 +144,8 @@ A customized mapping of words to stems, in a tab-separated file, can be specifie ---- +==== +-- A sample `stemdict.txt` file is shown below: @@ -117,6 +180,22 @@ Compound words are most commonly found in Germanic languages. Assume that `germanwords.txt` contains at least the following words: `dumm kopf donau dampf schiff` +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-dictionarycompoundword] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-dictionarycompoundword] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -124,6 +203,8 @@ Assume that `germanwords.txt` contains at least the following words: `dumm kopf ---- +==== +-- *In:* "Donaudampfschiff dummkopf" @@ -330,6 +411,22 @@ This can increase recall by causing more matches. 
On the other hand, it can redu *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-lang-asciifolding] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-lang-asciifolding] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -337,6 +434,8 @@ This can increase recall by causing more matches. On the other hand, it can redu ---- +==== +-- *In:* "Björn Ångström" @@ -356,6 +455,22 @@ This can increase recall by causing more matches. On the other hand, it can redu *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-decimaldigit] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-filter-decimaldigit] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -363,6 +478,8 @@ This can increase recall by causing more matches. On the other hand, it can redu ---- +==== +-- == OpenNLP Integration @@ -386,6 +503,23 @@ The OpenNLP Tokenizer takes two language-specific binary model files as paramete *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-opennlp] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-opennlp] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -394,6 +528,8 @@ The OpenNLP Tokenizer takes two language-specific binary model files as paramete tokenizerModel="en-tokenizer.bin"/> ---- +==== +-- === OpenNLP Part-Of-Speech Filter @@ -427,6 +563,26 @@ $ Index the POS for each token as a payload: +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-opennlppos] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + +---- +==== +[example.tab-pane#byclass-filter-opennlppos] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -438,18 +594,20 @@ Index the POS for each token as a payload: ---- +==== +-- Index the POS for each token as a synonym, after prefixing the POS with "@" (see the <>): [source,xml] ---- - - - 
- + + + ---- @@ -458,11 +616,11 @@ Only index nouns - the `keep.pos.txt` file contains lines `NN`, `NNS`, `NNP` and [source,xml] ---- - - - + + ---- @@ -484,6 +642,26 @@ NOTE: Lucene currently does not index token types, so if you want to keep this i Index the phrase chunk label for each token as a payload: +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-opennlpchunker] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + +---- +==== +[example.tab-pane#byclass-filter-opennlpchunker] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -495,18 +673,20 @@ Index the phrase chunk label for each token as a payload: ---- +==== +-- Index the phrase chunk label for each token as a synonym, after prefixing it with "#" (see the <>): [source,xml] ---- - - - - + + + ---- @@ -528,6 +708,28 @@ Either `dictionary` or `lemmatizerModel` must be provided, and both may be provi Perform dictionary-based lemmatization, and fall back to model-based lemmatization for out-of-vocabulary tokens (see the <> section above for information about using `TypeTokenFilter` to avoid indexing punctuation): +[.dynamic-tabs] +-- +[example.tab-pane#byname-filter-opennlplemmatizer] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + +---- +==== +[example.tab-pane#byclass-filter-opennlplemmatizer] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -541,18 +743,20 @@ Perform dictionary-based lemmatization, and fall back to model-based lemmatizati ---- +==== +-- Perform dictionary-based lemmatization only: [source,xml] ---- - - - - + + + ---- @@ -561,14 +765,14 @@ Perform model-based lemmatization only, preserving the original token and emitti [source,xml] ---- - - - - - - + + + + + ---- @@ -626,6 +830,23 @@ This algorithm defines both character normalization and stemming, so these are s *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-arabic] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== 
+[example.tab-pane#byclass-lang-arabic] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -634,6 +855,8 @@ This algorithm defines both character normalization and stemming, so these are s ---- +==== +-- === Bengali @@ -645,6 +868,23 @@ There are two filters written specifically for dealing with Bengali language. Th *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-bengali] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-bengali] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -652,8 +892,9 @@ There are two filters written specifically for dealing with Bengali language. Th - ---- +==== +-- *Normalisation* - `মানুষ` \-> `মানুস` @@ -670,6 +911,22 @@ This is a Java filter written specifically for stemming the Brazilian dialect of *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-brazilian] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-lang-brazilian] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -677,6 +934,8 @@ This is a Java filter written specifically for stemming the Brazilian dialect of ---- +==== +-- *In:* "praia praias" @@ -694,6 +953,23 @@ Solr includes a light stemmer for Bulgarian, following http://members.unine.ch/j *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-bulgarian] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-bulgarian] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -702,6 +978,8 @@ Solr includes a light stemmer for Bulgarian, following http://members.unine.ch/j ---- +==== +-- === Catalan @@ -715,6 +993,25 @@ Solr can stem Catalan using the Snowball Porter Stemmer with an argument of `lan *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-catalan] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + +---- +==== 
+[example.tab-pane#byclass-lang-catalan] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -725,6 +1022,8 @@ Solr can stem Catalan using the Snowball Porter Stemmer with an argument of `lan ---- +==== +-- *In:* "llengües llengua" @@ -742,6 +1041,23 @@ The default configuration of the <> *Examples:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-trd-chinese] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-trad-chinese] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -750,14 +1066,16 @@ The default configuration of the <> ---- +==== +-- [source,xml] ---- - - - - + + + + ---- @@ -797,6 +1115,26 @@ Also useful for Chinese analysis: *Examples:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-chinese] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + + +---- +==== +[example.tab-pane#byclass-lang-chinese] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -808,15 +1146,17 @@ Also useful for Chinese analysis: ---- +==== +-- [source,xml] ---- - - - + + - + ---- @@ -846,6 +1186,23 @@ Solr includes a light stemmer for Czech, following https://dl.acm.org/citation.c *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-czech] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-czech] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -854,6 +1211,8 @@ Solr includes a light stemmer for Czech, following https://dl.acm.org/citation.c ---- +==== +-- *In:* "prezidenští, prezidenta, prezidentského" @@ -875,6 +1234,23 @@ Also relevant are the <>. *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-danish] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-danish] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -883,6 +1259,8 @@ Also relevant are the <>. 
---- +==== +-- *In:* "undersøg undersøgelse" @@ -902,6 +1280,23 @@ Solr can stem Dutch using the Snowball Porter Stemmer with an argument of `langu *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-dutch] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-dutch] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -910,6 +1305,8 @@ Solr can stem Dutch using the Snowball Porter Stemmer with an argument of `langu ---- +==== +-- *In:* "kanaal kanalen" @@ -929,6 +1326,23 @@ Solr can stem Estonian using the Snowball Porter Stemmer with an argument of `la *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-estonian] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-estonian] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -937,6 +1351,8 @@ Solr can stem Estonian using the Snowball Porter Stemmer with an argument of `la ---- +==== +-- *In:* "Taevani tõustes" @@ -954,6 +1370,22 @@ Solr includes support for stemming Finnish, and Lucene includes an example stopw *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-finnish] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-lang-finnish] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -961,6 +1393,8 @@ Solr includes support for stemming Finnish, and Lucene includes an example stopw ---- +==== +-- *In:* "kala kalat" @@ -985,6 +1419,24 @@ Removes article elisions from a token stream. This filter can be useful for lang *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-french] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-lang-french] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -994,6 +1446,8 @@ Removes article elisions from a token stream. 
This filter can be useful for lang articles="lang/contractions_fr.txt"/> ---- +==== +-- *In:* "L'histoire d'art" @@ -1014,22 +1468,22 @@ Solr includes three stemmers for French: one in the `solr.SnowballPorterFilterFa [source,xml] ---- - - - + + - + ---- [source,xml] ---- - - - + + - + ---- @@ -1050,6 +1504,23 @@ Solr includes a stemmer for Galician following http://bvg.udc.es/recursos_lingua *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-galician] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-galician] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1058,6 +1529,8 @@ Solr includes a stemmer for Galician following http://bvg.udc.es/recursos_lingua ---- +==== +-- *In:* "felizmente Luzes" @@ -1075,27 +1548,45 @@ Solr includes four stemmers for German: one in the `solr.SnowballPorterFilterFac *Examples:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-german] +==== +[.tab-label]*With name* [source,xml] ---- - - + + ---- - +==== +[example.tab-pane#byclass-lang-german] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- - + + +---- +==== +-- + +[source,xml] +---- + + + ---- [source,xml] ---- - - + + ---- @@ -1120,6 +1611,22 @@ Use of custom charsets is no longer supported as of Solr 3.1. If you need to ind *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-greek] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-lang-greek] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1127,6 +1634,8 @@ Use of custom charsets is no longer supported as of Solr 3.1. 
If you need to ind ---- +==== +-- === Hindi @@ -1138,6 +1647,24 @@ Solr includes support for stemming Hindi following http://computing.open.ac.uk/S *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-hindi] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + +---- +==== +[example.tab-pane#byclass-lang-hindi] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1147,6 +1674,8 @@ Solr includes support for stemming Hindi following http://computing.open.ac.uk/S ---- +==== +-- === Indonesian @@ -1158,6 +1687,23 @@ Solr includes support for stemming Indonesian (Bahasa Indonesia) following http: *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-indonesian] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-indonesian] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1166,6 +1712,8 @@ Solr includes support for stemming Indonesian (Bahasa Indonesia) following http: ---- +==== +-- *In:* "sebagai sebagainya" @@ -1183,6 +1731,25 @@ Solr includes two stemmers for Italian: one in the `solr.SnowballPorterFilterFac *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-italian] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + +---- +==== +[example.tab-pane#byclass-lang-italian] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1193,6 +1760,8 @@ Solr includes two stemmers for Italian: one in the `solr.SnowballPorterFilterFac ---- +==== +-- *In:* "propaga propagare propagamento" @@ -1212,6 +1781,25 @@ Solr can stem Irish using the Snowball Porter Stemmer with an argument of `langu *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-irish] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + +---- +==== +[example.tab-pane#byclass-lang-irish] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1222,6 +1810,8 @@ Solr can stem Irish using the Snowball Porter Stemmer with an argument of 
`langu ---- +==== +-- *In:* "siopadóireacht síceapatacha b'fhearr m'athair" @@ -1321,6 +1911,31 @@ Folds fullwidth ASCII variants into the equivalent Basic Latin forms, and folds Example: +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-japanese] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + + + + + + + + +---- +==== +[example.tab-pane#byclass-lang-japanese] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1337,6 +1952,8 @@ Example: ---- +==== +-- [[hebrew-lao-myanmar-khmer]] === Hebrew, Lao, Myanmar, Khmer @@ -1355,6 +1972,25 @@ Solr includes support for stemming Latvian, and Lucene includes an example stopw *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-latvian] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + + +---- +==== +[example.tab-pane#byclass-lang-latvian] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1365,6 +2001,8 @@ Solr includes support for stemming Latvian, and Lucene includes an example stopw ---- +==== +-- *In:* "tirgiem tirgus" @@ -1411,6 +2049,26 @@ The second pass is to pick up -dom and -het endings. Consider this example: *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-norweigian] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + + + +---- +==== +[example.tab-pane#byclass-lang-norweigian] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1422,6 +2080,8 @@ The second pass is to pick up -dom and -het endings. 
Consider this example: ---- +==== +-- *In:* "Forelskelsen" @@ -1449,10 +2109,10 @@ The `NorwegianMinimalStemFilterFactory` stems plural forms of Norwegian nouns on ---- - - - - + + + + ---- @@ -1475,14 +2135,33 @@ Solr includes support for normalizing Persian, and Lucene includes an example st *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-persian] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-persian] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- - + ---- +==== +-- === Polish @@ -1494,6 +2173,23 @@ Solr provides support for Polish stemming with the `solr.StempelPolishStemFilter *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-polish] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-polish] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1502,13 +2198,15 @@ Solr provides support for Polish stemming with the `solr.StempelPolishStemFilter ---- +==== +-- [source,xml] ---- - - - + + + ---- @@ -1534,6 +2232,23 @@ Solr includes four stemmers for Portuguese: one in the `solr.SnowballPorterFilte *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-portuguese] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-portuguese] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1542,22 +2257,24 @@ Solr includes four stemmers for Portuguese: one in the `solr.SnowballPorterFilte ---- +==== +-- [source,xml] ---- - - - + + + ---- [source,xml] ---- - - - + + + ---- @@ -1579,6 +2296,23 @@ Solr can stem Romanian using the Snowball Porter Stemmer with an argument of `la *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-romanian] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-romanian] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1587,6 
+2321,8 @@ Solr can stem Romanian using the Snowball Porter Stemmer with an argument of `la ---- +==== +-- === Russian @@ -1600,6 +2336,23 @@ Solr includes two stemmers for Russian: one in the `solr.SnowballPorterFilterFac *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-russian] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-russian] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1608,6 +2361,8 @@ Solr includes two stemmers for Russian: one in the `solr.SnowballPorterFilterFac ---- +==== +-- === Scandinavian @@ -1633,6 +2388,23 @@ It's a semantically less destructive solution than `ScandinavianFoldingFilter`, *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-scandinavian] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-scandinavian] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1641,6 +2413,8 @@ It's a semantically less destructive solution than `ScandinavianFoldingFilter`, ---- +==== +-- *In:* "blåbærsyltetøj blåbärsyltetöj blaabaarsyltetoej blabarsyltetoj" @@ -1663,9 +2437,9 @@ It's a semantically more destructive solution than `ScandinavianNormalizationFil [source,xml] ---- - - - + + + ---- @@ -1694,6 +2468,23 @@ See the Solr wiki for tips & advice on using this filter: https://wiki.apache.or *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-serbian] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-serbian] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1702,6 +2493,8 @@ See the Solr wiki for tips & advice on using this filter: https://wiki.apache.or ---- +==== +-- === Spanish @@ -1713,6 +2506,23 @@ Solr includes two stemmers for Spanish: one in the `solr.SnowballPorterFilterFac *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-spanish] +==== +[.tab-label]*With name* 
+[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-spanish] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1721,6 +2531,8 @@ Solr includes two stemmers for Spanish: one in the `solr.SnowballPorterFilterFac ---- +==== +-- *In:* "torear toreara torearlo" @@ -1743,6 +2555,23 @@ Also relevant are the <>. *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-swedish] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-lang-swedish] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1751,6 +2580,8 @@ Also relevant are the <>. ---- +==== +-- *In:* "kloke klokhet klokheten" @@ -1768,6 +2599,22 @@ This filter converts sequences of Thai characters into individual Thai words. Un *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-thai] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-lang-thai] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1775,6 +2622,8 @@ This filter converts sequences of Thai characters into individual Thai words. 
Un ---- +==== +-- === Turkish @@ -1786,6 +2635,24 @@ Solr includes support for stemming Turkish with the `solr.SnowballPorterFilterFa *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-turkish] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + +---- +==== +[example.tab-pane#byclass-lang-turkish] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1795,19 +2662,21 @@ Solr includes support for stemming Turkish with the `solr.SnowballPorterFilterFa ---- +==== +-- *Another example, illustrating diacritics-insensitive search:* [source,xml] ---- - - - - - - - + + + + + + + ---- @@ -1825,6 +2694,24 @@ Lucene also includes an example Ukrainian stopword list, in the `lucene-analyzer *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-lang-ukranian] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + +---- +==== +[example.tab-pane#byclass-lang-ukranian] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -1834,5 +2721,7 @@ Lucene also includes an example Ukrainian stopword list, in the `lucene-analyzer ---- +==== +-- The Morfologik `dictionary` parameter value is a constant specifying which dictionary to choose. The dictionary resource must be named `path/to/_language_.dict` and have an associated `.info` metadata file. See http://morfologik.blogspot.com/[the Morfologik project] for details. If the dictionary attribute is not provided, the Polish dictionary is loaded and used by default. 
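The `dictionary` behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the guide's own example: the SPI name `morfologik` is assumed from the factory class name, and `path/to/ukrainian.dict` is a hypothetical placeholder that follows the `path/to/_language_.dict` naming pattern. The tokenizer name `standard` is taken from the Schema API changes in this diff.

[source,xml]
----
<!-- No dictionary attribute: per the paragraph above, the Polish dictionary is used by default -->
<analyzer>
  <tokenizer name="standard"/>
  <filter name="morfologik"/> <!-- "morfologik" is an assumed SPI name -->
</analyzer>

<!-- Explicit dictionary: hypothetical resource path, which needs a matching .info file alongside it -->
<analyzer>
  <tokenizer name="standard"/>
  <filter name="morfologik" dictionary="path/to/ukrainian.dict"/>
</analyzer>

<!-- Legacy class-based equivalent of the second chain -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.MorfologikFilterFactory" dictionary="path/to/ukrainian.dict"/>
</analyzer>
----

Replace the placeholder path with the dictionary resource that is actually on the classpath before using either explicit form.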
diff --git a/solr/solr-ref-guide/src/schema-api.adoc b/solr/solr-ref-guide/src/schema-api.adoc index 68f865aae8f..96de55e794a 100644 --- a/solr/solr-ref-guide/src/schema-api.adoc +++ b/solr/solr-ref-guide/src/schema-api.adoc @@ -333,13 +333,13 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{ "positionIncrementGap":"100", "analyzer" : { "charFilters":[{ - "class":"solr.PatternReplaceCharFilterFactory", + "name":"patternReplace", "replacement":"$1$1", "pattern":"([a-zA-Z])\\\\1+" }], "tokenizer":{ - "class":"solr.WhitespaceTokenizerFactory" }, + "name":"whitespace" }, "filters":[{ - "class":"solr.WordDelimiterFilterFactory", + "name":"wordDelimiter", "preserveOriginal":"0" }]}} }' http://localhost:8983/solr/gettingstarted/schema ---- @@ -361,11 +361,11 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{ "class":"solr.TextField", "indexAnalyzer":{ "tokenizer":{ - "class":"solr.PathHierarchyTokenizerFactory", + "name":"pathHierarchy", "delimiter":"/" }}, "queryAnalyzer":{ "tokenizer":{ - "class":"solr.KeywordTokenizerFactory" }}} + "name":"keyword" }}} }' http://localhost:8983/solr/gettingstarted/schema ---- ==== @@ -383,11 +383,11 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{ "class":"solr.TextField", "indexAnalyzer":{ "tokenizer":{ - "class":"solr.PathHierarchyTokenizerFactory", + "name":"pathHierarchy", "delimiter":"/" }}, "queryAnalyzer":{ "tokenizer":{ - "class":"solr.KeywordTokenizerFactory" }}} + "name":"keyword" }}} }' http://localhost:8983/api/cores/gettingstarted/schema ---- ==== @@ -446,7 +446,7 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{ "positionIncrementGap":"100", "analyzer":{ "tokenizer":{ - "class":"solr.StandardTokenizerFactory" }}} + "name":"standard" }}} }' http://localhost:8983/solr/gettingstarted/schema ---- ==== @@ -463,7 +463,7 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{ "positionIncrementGap":"100", "analyzer":{ "tokenizer":{ - 
"class":"solr.StandardTokenizerFactory" }}} + "name":"standard" }}} }' http://localhost:8983/api/cores/gettingstarted/schema ---- ==== @@ -565,13 +565,13 @@ curl -X POST -H 'Content-type:application/json' --data-binary '{ "positionIncrementGap":"100", "analyzer":{ "charFilters":[{ - "class":"solr.PatternReplaceCharFilterFactory", + "name":"patternReplace", "replacement":"$1$1", "pattern":"([a-zA-Z])\\\\1+" }], "tokenizer":{ - "class":"solr.WhitespaceTokenizerFactory" }, + "name":"whitespace" }, "filters":[{ - "class":"solr.WordDelimiterFilterFactory", + "name":"wordDelimiter", "preserveOriginal":"0" }]}}, "add-field" : { "name":"sell_by", diff --git a/solr/solr-ref-guide/src/tokenizers.adoc b/solr/solr-ref-guide/src/tokenizers.adoc index db32c784130..c883342debe 100644 --- a/solr/solr-ref-guide/src/tokenizers.adoc +++ b/solr/solr-ref-guide/src/tokenizers.adoc @@ -20,6 +20,24 @@ Tokenizers are responsible for breaking field data into lexical units, or _token You configure the tokenizer for a text field type in `schema.xml` with a `` element, as a child of ``: +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + + +---- +==== +[example.tab-pane#byclass-tokenizer] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -29,11 +47,30 @@ You configure the tokenizer for a text field type in `schema.xml` with a ` ---- +==== +-- -The class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the `org.apache.solr.analysis.TokenizerFactory`. A TokenizerFactory's `create()` method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer it passes a Reader object that provides the content of the text field. +The name/class attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the `org.apache.lucene.analysis.util.TokenizerFactory`. 
A TokenizerFactory's `create()` method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer it passes a Reader object that provides the content of the text field. Arguments may be passed to tokenizer factories by setting attributes on the `` element. +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-args] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-tokenizer-args] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -42,6 +79,8 @@ Arguments may be passed to tokenizer factories by setting attributes on the ` ---- +==== +-- The following sections describe the tokenizer factory classes included in this release of Solr. @@ -66,12 +105,29 @@ The Standard Tokenizer supports http://unicode.org/reports/tr29/#Word_Boundaries *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-standard] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-standard] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq." @@ -95,12 +151,29 @@ The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of S *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-classic] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-classic] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq." @@ -116,12 +189,29 @@ This tokenizer treats the entire text field as a single token. 
*Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-keyword] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-keyword] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq." @@ -137,12 +227,29 @@ This tokenizer creates tokens from strings of contiguous letters, discarding all *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-letter] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-letter] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "I can't." @@ -158,12 +265,29 @@ Tokenizes the input stream by delimiting at non-letters and then converting all *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-lowercase] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-lowercase] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "I just \*LOVE* my iPhone!" @@ -185,12 +309,29 @@ Reads the field text and generates n-gram tokens of sizes in the given range. Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As a result, the space character is included in the encoding. +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-ngram] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-ngram] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "hey man" @@ -200,12 +341,29 @@ Default behavior. Note that this tokenizer operates over the whole field. 
It doe With an n-gram size range of 4 to 5: +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-ngram-args] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-ngram-args] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "bicycle" @@ -227,12 +385,29 @@ Reads the field text and generates edge n-gram tokens of sizes in the given rang Default behavior (min and max default to 1): +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-edgengram] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-edgengram] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "babaloo" @@ -242,12 +417,29 @@ Default behavior (min and max default to 1): Edge n-gram range of 2 to 5 +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-edgengram-args] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-edgengram-args] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "babaloo" @@ -269,6 +461,22 @@ The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word br *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-icu] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + +---- +==== +[example.tab-pane#byclass-tokenizer-icu] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -276,7 +484,25 @@ The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word br ---- +==== +-- +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-icu-rule] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-icu-rule] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -284,6 +510,8 @@ The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word br 
rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/> ---- +==== +-- [IMPORTANT] ==== @@ -306,6 +534,23 @@ This tokenizer creates synonyms from file path hierarchies. *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-pathhierarchy] +==== +[.tab-label]*With name* +[source,xml] +---- + + + + + +---- +==== +[example.tab-pane#byclass-tokenizer-pathhierarchy] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- @@ -314,6 +559,8 @@ This tokenizer creates synonyms from file path hierarchies. ---- +==== +-- *In:* "c:\usr\local\apache" @@ -337,12 +584,29 @@ See {java-javadocs}java/util/regex/Pattern.html[the Javadocs for `java.util.rege A comma separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more spaces. +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-pattern] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-pattern] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "fee,fie, foe , fum, foo" @@ -352,12 +616,29 @@ A comma separated list. Tokens are separated by a sequence of zero or more space Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either case is extracted as a token. +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-pattern-words] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-pattern-words] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "Hello. My name is Inigo Montoya. You killed my father. Prepare to die." @@ -367,12 +648,29 @@ Extract simple, capitalized words. A sequence of at least one capital letter fol Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional semi-colon separator. Part numbers must be all numeric digits, with an optional hyphen. 
Regex capture groups are numbered by counting left parenthesis from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens. +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-pattern-sku] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-pattern-sku] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "SKU: 1234, Part Number 5678, Part: 126-987" @@ -394,12 +692,29 @@ This tokenizer is similar to the `PatternTokenizerFactory` described above, but To match tokens delimited by simple whitespace characters: +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-simplepattern] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-simplepattern] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- == Simplified Regular Expression Pattern Splitting Tokenizer @@ -417,12 +732,29 @@ This tokenizer is similar to the `SimplePatternTokenizerFactory` described above To match tokens delimited by simple whitespace characters: +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-simplepatternsplit] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-simplepatternsplit] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- == UAX29 URL Email Tokenizer @@ -448,12 +780,29 @@ The UAX29 URL Email Tokenizer supports http://unicode.org/reports/tr29/#Word_Bou *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-uax29urlemail] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-uax29urlemail] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com" @@ -475,12 +824,29 @@ Specifies how to define whitespace for 
the purpose of tokenization. Valid values *Example:* +[.dynamic-tabs] +-- +[example.tab-pane#byname-tokenizer-whitespace] +==== +[.tab-label]*With name* +[source,xml] +---- + + + +---- +==== +[example.tab-pane#byclass-tokenizer-whitespace] +==== +[.tab-label]*With class name (legacy)* [source,xml] ---- ---- +==== +-- *In:* "To be, or what?"