
[discrete]
[[breaking_70_analysis_changes]]
=== Analysis changes

//NOTE: The notable-breaking-changes tagged regions are re-used in the
//Installation and Upgrade Guide

//tag::notable-breaking-changes[]

// end::notable-breaking-changes[]

[discrete]
[[limit-number-of-tokens-produced-by-analyze]]
==== Limiting the number of tokens produced by _analyze

To safeguard against out-of-memory errors, the number of tokens that the `_analyze`
endpoint can produce is now limited to 10000. A request that generates more tokens than
the limit returns an error. This default limit can be changed
for a particular index with the index setting `index.analyze.max_token_count`.
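
As a minimal sketch, the limit could be raised when creating an index and then
exercised through that index's `_analyze` endpoint. The index name `my-index`,
the value `20000`, and the sample text are placeholders chosen for illustration.

[source,console]
----
PUT /my-index
{
  "settings": {
    "index.analyze.max_token_count": 20000
  }
}

GET /my-index/_analyze
{
  "analyzer": "standard",
  "text": "some text to analyze"
}
----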

[discrete]
==== Limiting the length of an analyzed text during highlighting

Highlighting text that was indexed without offsets or term vectors
requires analyzing that text in memory, in real time, during the search request.
For large texts, this analysis can take a substantial amount of time and memory.
To protect against this, the maximum number of characters that will be analyzed is now
limited to 1000000. This default limit can be changed
for a particular index with the index setting `index.highlight.max_analyzed_offset`.
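
For example, an index that holds very large text fields might raise the limit
at creation time. This is only a sketch: `my-index` and the value `2000000`
are placeholders, not recommendations.

[source,console]
----
PUT /my-index
{
  "settings": {
    "index.highlight.max_analyzed_offset": 2000000
  }
}
----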

[discrete]
[[delimited-payload-filter-renaming]]
==== `delimited_payload_filter` renaming

The `delimited_payload_filter` was deprecated and renamed to `delimited_payload` in 6.2.
Using it in indices created before 7.0 will issue deprecation warnings. Using the old
name in new indices created in 7.0 will throw an error. Use the new name `delimited_payload`
instead.
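
As an illustration, a custom filter that previously declared
`"type": "delimited_payload_filter"` would declare the new type name instead.
The index, filter, and analyzer names below are placeholders, and the
`delimiter` and `encoding` values are illustrative only.

[source,console]
----
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_payload_filter": {
          "type": "delimited_payload",
          "delimiter": "|",
          "encoding": "float"
        }
      },
      "analyzer": {
        "my_payload_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [ "my_payload_filter" ]
        }
      }
    }
  }
}
----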

[discrete]
[[standard-filter-removed]]
==== `standard` filter has been removed

The `standard` token filter has been removed because it does not change anything in the token stream.

[discrete]
==== Deprecated standard_html_strip analyzer

The `standard_html_strip` analyzer has been deprecated and should be replaced
with a combination of the `standard` tokenizer and the `html_strip` char_filter.
Indices created using this analyzer will still be readable in Elasticsearch 7.0,
but it will not be possible to create new indices using it.
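
One possible replacement is a custom analyzer along the following lines. The
index and analyzer names are placeholders. The `lowercase` token filter is an
assumption about the desired behavior, so adjust the filter list to match what
your mappings expect.

[source,console]
----
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
----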

[discrete]
[[deprecated-ngram-edgengram-token-filter-cannot-be-used]]
==== The deprecated `nGram` and `edgeNGram` token filters cannot be used on new indices

The `nGram` and `edgeNGram` token filter names were deprecated in an earlier 6.x version.
Indices created using these token filters will still be readable in Elasticsearch 7.0, but indexing
documents using those filter names will issue a deprecation warning. Using the deprecated names on
new indices created in version 7.0.0 or later is prohibited and throws an error when indexing
or analyzing documents. Replace the names with `ngram` and `edge_ngram`, respectively.
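
A sketch of filter definitions that use the new names follows. The index and
filter names, and the `min_gram` and `max_gram` values, are placeholders.

[source,console]
----
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        },
        "my_edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5
        }
      }
    }
  }
}
----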

[discrete]
==== Limit to the difference between max_size and min_size in NGramTokenFilter and NGramTokenizer

To safeguard against creating too many index terms, the difference between `max_gram` and
`min_gram` in `NGramTokenFilter` and `NGramTokenizer` has been limited to 1. This default
limit can be changed with the index setting `index.max_ngram_diff`. Note that if the limit is
exceeded, an error is thrown only for new indices. For existing pre-7.0 indices, a deprecation
warning is logged.
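
For instance, an n-gram tokenizer whose gram sizes span from 2 to 5 has a
difference of 3, so the index would need a matching `index.max_ngram_diff`.
The names and values in this sketch are placeholders.

[source,console]
----
PUT /my-index
{
  "settings": {
    "index": {
      "max_ngram_diff": 3
    },
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 5
        }
      }
    }
  }
}
----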

[discrete]
==== Limit to the difference between max_shingle_size and min_shingle_size in ShingleTokenFilter

To safeguard against creating too many tokens, the difference between `max_shingle_size` and
`min_shingle_size` in `ShingleTokenFilter` has been limited to 3. This default
limit can be changed with the index setting `index.max_shingle_diff`. Note that if the limit is
exceeded, an error is thrown only for new indices. For existing pre-7.0 indices, a deprecation
warning is logged.
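
Similarly, a shingle filter that spans sizes 2 through 6 has a difference of 4
and would need the limit raised to at least that value. The index and filter
names in this sketch are placeholders.

[source,console]
----
PUT /my-index
{
  "settings": {
    "index": {
      "max_shingle_diff": 4
    },
    "analysis": {
      "filter": {
        "my_shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 6
        }
      }
    }
  }
}
----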