Makes the following changes to the `remove_duplicates` token filter
docs:
* Rewrites description and adds Lucene link
* Adds detailed analyze example
* Adds custom analyzer example
This change removes the Lucene's experimental flag from the documentations of the following
tokenizer/filters:
* Simple Pattern Split Tokenizer
* Simple Pattern tokenizer
* Flatten Graph Token Filter
* Word Delimiter Graph Token Filter
The flag is still present in Lucene codebase but we're fully supporting these tokenizers/filters
in ES for a long time now so the docs flag is misleading.
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Makes the following changes to the `word_delimiter` token filter docs:
* Adds a warning admonition recommending the `word_delimiter_graph`
filter instead. This warning includes a link to the deprecated Lucene
`WordDelimiterFilter`.
* Updates the description
* Adds detailed analyze snippet
* Adds custom analyzer and custom filter snippets
* Reorganizes and updates parameter documentation
In a tip admonition, we recommend using the `keyword` tokenizer with the
`word_delimiter_graph` token filter. However, we only use the
`whitespace` tokenizer in the example snippets. This updates those
snippets to use the `keyword` tokenizer instead.
Also corrects several spacing issues for arrays in these docs.
Makes the following changes to the `word_delimiter_graph` token filter
docs:
* Updates the Lucene experimental admonition.
* Updates description
* Adds analyze snippet
* Adds custom analyzer and custom filter snippets
* Reorganizes and updates parameter list
* Expands and updates section re: differences between `word_delimiter`
and `word_delimiter_graph`
Per the [Asciidoctor docs][0], Asciidoctor replaces the following
syntax with double arrows in the rendered HTML:
* => renders as ⇒
* <= renders as ⇐
This escapes several unintended replacements, such as in the Painless
docs.
Where appropriate, it also replaces some double arrow instances with
single arrows for consistency.
[0]: https://asciidoctor.org/docs/user-manual/#replacements
Makes the following changes to the `stop` token filter docs:
* Updates description
* Adds a link to the related Lucene filter
* Adds detailed analyze snippet
* Updates custom analyzer and custom filter snippets
* Adds a list of predefined stop words by language
Co-authored-by: ScottieL <36999642+ScottieL@users.noreply.github.com>
Makes the following changes to the `trim` token filter docs:
* Updates description
* Adds a link to the related Lucene filter
* Adds tip about removing whitespace using tokenizers
* Adds detailed analyze snippets
* Adds custom analyzer snippet
* [DOCS] Rewrite analysis intro. Move index/search analysis content.
* Rewrites 'Text analysis' page intro as high-level definition.
Adds guidance on when users should configure text analysis
* Rewrites and splits index/search analysis content:
* Conceptual content -> 'Index and search analysis' under 'Concepts'
* Task-based content -> 'Specify an analyzer' under 'Configure...'
* Adds detailed examples for when to use the same index/search analyzer
and when not.
* Adds new example snippets for specifying search analyzers
* clarifications
* Add toc. Decrement headings.
* Reword 'When to configure' section
* Remove sentence from tip
Adds a 'Configure text analysis' page to house tutorial content for the
analysis topic.
Also relocates the following pages as children as this new page:
* 'Test an analyzer'
* 'Configuring built-in analyzers'
* 'Create a custom analyzer'
I plan to add a tutorial for specifying index-time and search-time
analyzers to this section as part of a future PR.
This helps the topic better match the structure of
our machine learning docs, e.g.
https://www.elastic.co/guide/en/machine-learning/7.5/ml-concepts.html
This PR only includes the 'Anatomy of an analyzer' page as a 'Concepts'
child page, but I plan to add other concepts, such as 'Index time vs.
search time', with later PRs.
* Changes titles to sentence case.
* Appends pages with 'reference' to differentiate their content from
conceptual overviews.
* Moves the 'Normalizers' page to end of the Analysis topic pages.
Adds a 'text analysis overview' page to the analysis topic docs.
The goals of this page are:
* Concisely summarize the analysis process while avoiding in-depth concepts, tutorials, or API examples
* Explain why analysis is important, largely through highlighting problems with full-text searches missing analysis
* Highlight how analysis can be used to improve search results
* Adds a title abbreviation
* Updates the description and adds a Lucene link
* Reformats the parameters section
* Adds analyze, custom analyzer, and custom filter snippets
Relates to #44726.
* Adds a title abbreviation
* Relocates the older name deprecation warning
* Updates the description and adds a Lucene link
* Adds a note to explain payloads and how to store them
* Adds analyze and custom analyzer snippets
* Adds a 'Return stored payloads' example
Reformats the edge n-gram and n-gram token filter docs. Changes include:
* Adds title abbreviations
* Updates the descriptions and adds Lucene links
* Reformats parameter definitions
* Adds analyze and custom analyzer snippets
* Adds notes explaining differences between the edge n-gram and n-gram
filters
Additional changes:
* Switches titles to use "n-gram" throughout.
* Fixes a typo in the edge n-gram tokenizer docs
* Adds an explicit anchor for the `index.max_ngram_diff` setting
Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers
only allows for a list of predefined character classes, which might not fit
every use case. For example, including underscore "_" in a token would currently
require the `punctuation` class which comes with a lot of other characters.
This change adds an additional "custom" option to the `token_chars` setting,
which requires an additional `custom_token_chars` setting to be present and
which will be interpreted as a set of characters to inlcude into a token.
Closes#25894
The `edge_ngram` tokenizer limits tokens to the `max_gram` character
length. Autocomplete searches for terms longer than this limit return
no results.
To prevent this, you can use the `truncate` token filter to truncate
tokens to the `max_gram` character length. However, this could return irrelevant results.
This commit adds some advisory text to make users aware of this limitation and outline the tradeoffs for each approach.
Closes#48956.
The first example of splitting rules for the `word_delimiter` token filter was spread across two bullet points. This makes it look like they are two separate splitting rules.
The sample code is wrong. Field type is required for the sample field.
I guess the intention was to give the sample field the name ```fingerprint```, mapping it as ```text``` using the custom analyzer ```my_analyzer```
Currently changing resources (like dictionaries, synonym files etc...) of search
time analyzers is only possible by closing an index, changing the underlying
resource (e.g. synonym files) and then re-opening the index for the change to
take effect.
This PR adds a new API endpoint that allows triggering reloading of certain
analysis resources (currently token filters) that will then pick up changes in
underlying file resources. To achieve this we introduce a new type of custom
analyzer (ReloadableCustomAnalyzer) that uses a ReuseStrategy that allows
swapping out analysis components. Custom analyzers that contain filters that are
markes as "updateable" will automatically choose this implementation. This PR
also adds this capability to `synonym` token filters for use in search time
analyzers.
Relates to #29051
We should throw an exception at construction time if a list of
articles is not provided, otherwise we can get random NPEs during
indexing.
Relates to #43002
In #30209 we deprecated the camel case `nGram` filter name in favour of `ngram` and
did the same for `edgeNGram` and `edge_ngram` and we are removing those names in
8.0. This change disallows using the deprecated names for new indices created in 7.0 by
throwing an error if these filters are used.
Relates to #38911
Make substitution of \u200C with a space explicit
The problem with this symbol `\u200C` in a test string,
that **SHOULD** be substituted with space in the rebuilt Persian analyzer, but it is not.
Correcting this line `"mappings": [ "\\u200C=> "] <1>` to
`"mappings": [ "\\u200C=>\\u0020"] <1>` in solves the problem.
This change explicitly says to substitute ZWNJ with a space.
Closes#38188
The "include_type_name" parameter was temporarily introduced in #37285 to facilitate
moving the default parameter setting to "false" in many places in the documentation
code snippets. Most of the places can simply be reverted without causing errors.
In this change I looked for asciidoc files that contained the
"include_type_name=true" addition when creating new indices but didn't look
likey they made use of the "_doc" type for mappings. This is mostly the case
e.g. in the analysis docs where index creating often only contains settings. I
manually corrected the use of types in some places where the docs still used an
explicit type name and not the dummy "_doc" type.
* Default include_type_name to false for get and put mappings.
* Default include_type_name to false for get field mappings.
* Add a constant for the default include_type_name value.
* Default include_type_name to false for get and put index templates.
* Default include_type_name to false for create index.
* Update create index calls in REST documentation to use include_type_name=true.
* Some minor clean-ups around the get index API.
* In REST tests, use include_type_name=true by default for index creation.
* Make sure to use 'expression == false'.
* Clarify the different IndexTemplateMetaData toXContent methods.
* Fix FullClusterRestartIT#testSnapshotRestore.
* Fix the ml_anomalies_default_mappings test.
* Fix GetFieldMappingsResponseTests and GetIndexTemplateResponseTests.
We make sure to specify include_type_name=true during xContent parsing,
so we continue to test the legacy typed responses. XContent generation
for the typeless responses is currently only covered by REST tests,
but we will be adding unit test coverage for these as we implement
each typeless API in the Java HLRC.
This commit also refactors GetMappingsResponse to follow the same appraoch
as the other mappings-related responses, where we read include_type_name
out of the xContent params, instead of creating a second toXContent method.
This gives better consistency in the response parsing code.
* Fix more REST tests.
* Improve some wording in the create index documentation.
* Add a note about types removal in the create index docs.
* Fix SmokeTestMonitoringWithSecurityIT#testHTTPExporterWithSSL.
* Make sure to mention include_type_name in the REST docs for affected APIs.
* Make sure to use 'expression == false' in FullClusterRestartIT.
* Mention include_type_name in the REST templates docs.
The first example given is missing the two single-token cases for "is" and "a".
The later usage example is slightly wrong in that custom analyzers should
go under `settings.analysis.analyzer`.
This commit adds an adjust_offsets parameter to the word_delimiter_graph token filter, defaulting
to true. Most of the time you'd want sub-tokens emitted by this filter to have offsets that are
adjusted to their real position in the token stream; however, some token filters can change the
length or starting position of a token (eg trim) without changing their offset attributes, and this
can lead to word_delimiter_graph emitting illegal offsets. Setting adjust_offsets to false in these
cases will allow indexing again.
Fixes#34741, #33710
This commit changes the format of the `hits.total` in the search response to be an object with
a `value` and a `relation`. The `value` indicates the number of hits that match the query and the
`relation` indicates whether the number is accurate (in which case the relation is equals to `eq`)
or a lower bound of the total (in which case it is equals to `gte`).
This change also adds a parameter called `rest_total_hits_as_int` that can be used in the
search APIs to opt out from this change (retrieve the total hits as a number in the rest response).
Note that currently all search responses are accurate (`track_total_hits: true`) or they don't contain
`hits.total` (`track_total_hits: true`). We'll add a way to get a lower bound of the total hits in a
follow up (to allow numbers to be passed to `track_total_hits`).
Relates #33028
A number of tokenfilters can produce multiple tokens at the same position. This
is a problem when using token chains to parse synonym files, as the SynonymMap
requires that there are no stacked tokens in its input.
This commit ensures that when used to parse synonyms, these tokenfilters either produce
a single version of their input token, or that they throw an error when mappings are
generated. In indexes created in elasticsearch 6.x deprecation warnings are emitted in place
of the error.
* asciifolding and cjk_bigram produce only the folded or bigrammed token
* decompounders, synonyms and keyword_repeat are skipped
* n-grams, word-delimiter-filter, multiplexer, fingerprint and phonetic throw errors
Fixes#34298
The ICU plugin provides the building blocks of an analysis chain, but doesn't actually have a prebuilt analyzer. It would be a better for users if there was a simple analyzer that they could use out of the box, and also something we can point to from the CJK Analyzer docs as a superior alternative.
Relates to #34285
* Replace custom type names with _doc in REST examples.
* Avoid using two mapping types in the percolator docs.
* Rename doc -> _doc in the main repository README.
* Also replace some custom type names in the HLRC docs.
We currently special-case SynonymFilterFactory and SynonymGraphFilterFactory, which need to
know their predecessors in the analysis chain in order to correctly analyze their synonym lists. This
special-casing doesn't work with Referring filter factories, such as the Multiplexer or Conditional
filters. We also have a number of filters (eg the Multiplexer) that will break synonyms when they
appear before them in a chain, because they produce multiple tokens at the same position.
This commit adds two methods to the TokenFilterFactory interface.
* `getChainAwareTokenFilterFactory()` allows a filter factory to rewrite itself against its preceding
filter chain, or to resolve references to other filters. It replaces `ReferringFilterFactory` and
`CustomAnalyzerProvider.checkAndApplySynonymFilter`, and by default returns `this`.
* `getSynonymFilter()` defines whether or not a filter should be applied when building a synonym
list `Analyzer`. By default it returns `true`.
Fixes#33609
This allows users to filter out tokens from a TokenStream using painless scripts,
instead of having to write specialised Java code and packaging it up into a plugin.
The commit also refactors the AnalysisPredicateScript.Token class so that it wraps
and makes read-only an AttributeSource.
The main benefit of the upgrade for users is the search optimization for top scored documents when the total hit count is not needed. However this optimization is not activated in this change, there is another issue opened to discuss how it should be integrated smoothly.
Some comments about the change:
* Tests that can produce negative scores have been adapted but we need to forbid them completely: #33309Closes#32899