Commit Graph

320 Commits

Author SHA1 Message Date
James Rodewig b302b09b85
[DOCS] Reformat snippets to use two-space indents (#59973) (#59994) 2020-07-21 15:49:58 -04:00
malpani 0555fef799 Support ignore_keywords flag for word delimiter graph token filter (#59563)
This commit allows customizing the word delimiter token filters to skip processing 
tokens tagged as keyword through the `ignore_keywords` flag Lucene's 
WordDelimiterGraphFilter already exposes.

Fix for #59491
2020-07-21 16:11:55 +01:00
James Rodewig 82a8d9aa0c
[DOCS] Fix keyword marker docs (#59834) (#59863)
Co-authored-by: Rui Almeida <ruial@outlook.com>
2020-07-20 09:27:42 -04:00
James Rodewig da85a40e7e
[DOCS] Reformat `predicate_token_filter` tokenfilter (#57705) (#59714) 2020-07-16 13:35:09 -04:00
James Rodewig 6ed356ffc3
[DOCS] Replace `datatype` with `data type` (#58972) (#59184) 2020-07-07 14:59:35 -04:00
James Rodewig 6436792aac
[DOCS] Fix headings for simple analyzer docs (#58910) (#58918) 2020-07-02 09:52:05 -04:00
James Rodewig 28717d1e02
[DOCS] Fix analyzer page titles (#58362) (#58603)
Changes the titles for analyzer pages to sentence case.

Also changes the 'Pattern character filter' page title to sentence case.
2020-06-26 10:17:01 -04:00
James Rodewig ab29162ab3
[DOCS] Fix tokenizer page titles (#58361) (#58598)
Changes the titles for tokenizer pages to sentence case.

Also moves the 'Path hierarchy tokenizer examples' page within the
'Path hierarchy tokenizer' page and adds a related redirect.
2020-06-26 09:24:41 -04:00
James Rodewig c36df27730
[DOCS] Reformat `pattern_replace` token filter (#57699) (#57995)
Changes:

* Rewrites description and adds Lucene link
* Adds analyze example
* Adds parameter definitions
* Adds custom analyzer example
2020-06-11 12:19:38 -04:00
James Rodewig 24a50eb3af
[DOCS] Reformat `mapping` charfilter (#57818) (#57885)
Changes:

* Adds title abbreviation
* Adds Lucene link to description
* Adds standard headings
* Simplifies analyze example
* Simplifies analyzer example and adds contextual text
2020-06-09 12:43:01 -04:00
James Rodewig 63e962c99a [DOCS] Fix typo in `html_strip` char filter docs 2020-06-08 10:37:35 -04:00
James Rodewig 2121eb528c
[DOCS] Reformat `html_strip` charfilter (#57764) (#57810)
Changes:

* Converts title to sentence case
* Adds a title abbreviation
* Adds Lucene link to description
* Reformat sections
2020-06-08 08:44:43 -04:00
Tomasz Elendt a7c36c8af5 Support multiple tokens on LHS in stemmer_override rules (#56113) (#56484)
This commit adds support for rules with multiple tokens on LHS, also
known as "contraction rules", into stemmer override token
filter. Contraction rules are handy into translating multiple
inflected words into the same root form. One side effect of this change is
that it brings stemmer override rules format closer to synonym rules
format so that it makes it easier to translate one into another.

This change also makes stemmer override rules parser more strict so
that it should catch more errors which were previously accepted.

Closes #56113
2020-05-29 22:34:31 +02:00
James Rodewig a0ca0325fe
[DOCS] Reformat `min_hash` token filter docs (#57181) (#57246)
Changes:

* Rewrites description and adds a Lucene link
* Reformats the configurable parameters as a definition list
* Changes the `Theory` heading to `Using the min_hash token filter for
  similarity search`
* Adds some additional detail to the analyzer example
2020-05-27 15:08:23 -04:00
James Rodewig a2de43d468
[DOCS] Reformat `shingle` token filter (#57040)
Changes:

* Rewrites description and adds Lucene link
* Adds analyze example
* Rewrites parameter documentation
* Updates custom analyzer and filter examples
* Adds anchor to `index.max_shingle_diff` index-level setting
2020-05-21 13:56:54 -04:00
James Rodewig 5cb34d9a6e
[DOCS] Reformat `hunspell` token filter (#56955)
Changes:

* Rewrites description and adds Lucene link
* Adds analyze example
* Rewrites parameter documentation
* Updates custom analyzer example
* Rewrites related setting documentation
2020-05-20 14:47:53 -04:00
Andrei Balici 19a336e8d3 Add `max_token_length` setting to the CharGroupTokenizer (#56860)
Adds `max_token_length` option to the CharGroupTokenizer.
Updates documentation as well to reflect the changes.

Closes #56676
2020-05-20 14:28:40 +02:00
James Rodewig 342e713e2a
[DOCS] Fix fingerprint token filter's analyzer example (#56811) (#56943)
Co-authored-by: Abhilash Bolla <2282894+ivssh@users.noreply.github.com>
2020-05-19 09:30:00 -04:00
James Rodewig 4faf5a7916
[DOCS] Reformat `porter_stem` token filter (#56053)
Makes the following changes to the `porter_stem` token filter docs:

* Rewrites description and adds a Lucene link
* Adds detailed analyze example
* Adds an analyzer example
2020-05-04 10:39:17 -04:00
markharwood e197b6c45b
Analysis enhancement - add preserve_original setting in ngram-token-filter (#55432) (#56100)
Authored-by: Amit Khandelwal <amitmbm87@gmail.com>
2020-05-04 11:31:28 +01:00
James Rodewig bbf68de446 [DOCS] Correct Lucene link in `kstem` token filter docs 2020-04-29 09:30:37 -04:00
James Rodewig 767836c367
[DOCS] Reformat `kstem` token filter (#55823)
Makes the following changes to the `kstem` token filter docs:

* Rewrite description and adds a Lucene work
* Adds detailed analyze example
* Adds an analyzer example
2020-04-29 08:52:55 -04:00
Amit Khandelwal 126e4acca8 Expose `preserve_original` in `edge_ngram` token filter (#55766)
The Lucene `preserve_original` setting is currently not supported in the `edge_ngram`
token filter. This change adds it with a default value of `false`.

Closes #55767
2020-04-28 10:24:27 +02:00
James Rodewig 8df5cff9c1 [DOCS] Correct stemmer token filters anchor 2020-04-27 14:57:59 -04:00
James Rodewig 5b8a18c756 [DOCS] Correct stemmer token filter anchor 2020-04-27 14:51:51 -04:00
James Rodewig e0a8adb5b2
[DOCS] Reformat `stemmer` token filter (#55693)
Makes the following changes to the `stemmer` token filter docs:

* Adds detailed analyze example
* Rewrites parameter definitions
* Adds custom analyzer example
* Adds a `language` value for the `estonian` stemmer
* Reorders the `language` values to show recommended algorithms first,
  followed by other values alphabetically
2020-04-24 11:25:01 -04:00
James Rodewig 96285b90c1
[DOCS] Add stemming concept docs (#55156)
Adds conceptual documentation for stemming, including:

* An overview of why stemming is helpful in search
* Algorithmic vs. dictionary stemming
* Token filters used to control stemming, such as `stemmer_override`, `keyword_marker`, and `conditional`
2020-04-24 11:01:28 -04:00
James Rodewig f0b9be8b1b [DOCS] Reformat `flatten_graph` token filter (#54268)
* [DOCS] Reformat `flatten_graph` token filter

Makes the following changes to the `flatten_graph` token filter docs:

* Rewrites description and adds Lucene link
* Adds detailed analyze example
* Adds analyzer example
2020-04-16 08:35:08 -04:00
James Rodewig d5a609a2e5 [DOCS] Add token filter reference docs template (#52290)
Creates a reusable template for token filter reference documentation.

Contributors can make a copy of this template and customize it when
documenting new token filters.
2020-04-10 08:45:10 -04:00
markharwood 2da2305587
Backport of lowercase normalizer PR #53882
A pre-configured normalizer for lower-casing.
Closes #53872
2020-04-03 11:43:40 +01:00
James Rodewig 21abc311fd
[DOCS] Reformat `keyword_repeat` token filter (#54428) 2020-04-01 11:56:05 -04:00
James Rodewig ed1edb4964 [DOCS] Add missing word to keyword marker token filter docs 2020-03-30 10:52:14 -04:00
James Rodewig 21f362a2a8
[DOCS] Add a lowercase email example to keyword tokenizer docs (#53257) 2020-03-30 09:06:04 -04:00
James Rodewig 74051d68a8
[DOCS] Reformat `keyword_marker` token filter (#54076)
Makes the following changes to the `keyword_marker` token filter docs:

* Rewrites description and adds Lucene link
* Adds detailed analyze example
* Rewrites parameter definitions
* Adds custom analyzer and filter example
2020-03-25 09:26:06 -04:00
James Rodewig 43199a8c82 [DOCS] Remove double space in WDG docs 2020-03-23 17:18:04 -04:00
James Rodewig 553d8a9ca9 [DOCS] Fix "letter case" typo
Changes "lettercase" to "letter case" in the `uppercase` token filter
docs.
2020-03-23 17:11:59 -04:00
lgypro be3090138e [Docs] Fix typo in _analyze api docs (#53837) 2020-03-20 12:02:40 +01:00
James Rodewig 8f4a3eb07f [DOCS] Add token graph concept docs (#53339)
Adds conceptual docs for token graphs.
These docs cover:

* How a token graph is constructed from a token stream
* How synonyms and multi-position tokens impact token graphs
* How token graphs are used during search
* Why some token filters produce invalid token graphs

Also makes the following supporting changes:
* Adds anchors to the 'Anatomy of an Analyzer' docs for cross-linking
* Adds several SVGs for token graph diagrams
2020-03-19 07:43:18 -04:00
James Rodewig e4b7af11ab [DOCS] Remove `light_bengali` stemmer (#53697)
Only the `bengali` stemmer is available in Lucene and surfaced through
Elasticsearch. This removes the incorrect `light_bengali` link in our
docs.
2020-03-18 08:34:27 -04:00
James Rodewig e1eebea846
[DOCS] Reformat `remove_duplicates` token filter (#53608)
Makes the following changes to the `remove_duplicates` token filter
docs:

* Rewrites description and adds Lucene link
* Adds detailed analyze example
* Adds custom analyzer example
2020-03-16 11:37:06 -04:00
Jim Ferenczi 97621e7f65 Removes old Lucene's experimental flag from analyzer documentations (#53217)
This change removes the Lucene's experimental flag from the documentations of the following
tokenizer/filters:
  * Simple Pattern Split Tokenizer
  * Simple Pattern tokenizer
  * Flatten Graph Token Filter
  * Word Delimiter Graph Token Filter

The flag is still present in Lucene codebase but we're fully supporting these tokenizers/filters
in ES for a long time now so the docs flag is misleading.

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2020-03-12 21:18:19 +01:00
James Rodewig 933a9c6fca
[DOCS] Reformat `word_delimiter` token filter (#53387)
Makes the following changes to the `word_delimiter` token filter docs:

* Adds a warning admonition recommending the `word_delimiter_graph`
  filter instead. This warning includes a link to the deprecated Lucene
  `WordDelimiterFilter`.
* Updates the description
* Adds detailed analyze snippet
* Adds custom analyzer and custom filter snippets
* Reorganizes and updates parameter documentation
2020-03-11 09:03:57 -04:00
James Rodewig a9dd7773d2 [DOCS] Use keyword tokenizer in word delimiter graph examples (#53384)
In a tip admonition, we recommend using the `keyword` tokenizer with the
`word_delimiter_graph` token filter. However, we only use the
`whitespace` tokenizer in the example snippets. This updates those
snippets to use the `keyword` tokenizer instead.

Also corrects several spacing issues for arrays in these docs.
2020-03-11 04:46:33 -04:00
James Rodewig 166b5a92f6 [DOCS] Correct anchor in word delimiter graph token filter docs 2020-03-10 10:32:42 -04:00
James Rodewig 28cb4a167d
[DOCS] Reformat `word_delimiter_graph` token filter (#53170) (#53272)
Makes the following changes to the `word_delimiter_graph` token filter
docs:

* Updates the Lucene experimental admonition.
* Updates description
* Adds analyze snippet
* Adds custom analyzer and custom filter snippets
* Reorganizes and updates parameter list
* Expands and updates section re: differences between `word_delimiter`
  and `word_delimiter_graph`
2020-03-09 06:45:44 -04:00
James Rodewig 9bb9f63364 [DOCS] Note that `trim` filter doesn't change offsets (#53220)
The [word delimiter graph token filter docs][0] note that the `trim`
filter changes the length of tokens without changing their offsets.

This explicitly mentions that in the `trim` filter docs.

[0]: https://www.elastic.co/guide/en/elasticsearch/reference/master/analysis-word-delimiter-graph-tokenfilter.html
2020-03-06 07:32:35 -05:00
James Rodewig 4bc6d2dbec [DOCS] Correct link for Lucene StopFilter 2020-03-05 14:52:25 -05:00
James Rodewig 0c4bf64095 [DOCS] Fix several Asciidoctor double arrow replacements (#52827)
Per the [Asciidoctor docs][0], Asciidoctor replaces the following
syntax with double arrows in the rendered HTML:

* => renders as ⇒
* <= renders as ⇐

This escapes several unintended replacements, such as in the Painless
docs.

Where appropriate, it also replaces some double arrow instances with
single arrows for consistency.

[0]: https://asciidoctor.org/docs/user-manual/#replacements
2020-03-04 08:43:19 -05:00
James Rodewig cf87724ff6
[DOCS] Reformat `stop` token filter (#53059)
Makes the following changes to the `stop` token filter docs:

* Updates description
* Adds a link to the related Lucene filter
* Adds detailed analyze snippet
* Updates custom analyzer and custom filter snippets
* Adds a list of predefined stop words by language

Co-authored-by: ScottieL <36999642+ScottieL@users.noreply.github.com>
2020-03-03 13:22:52 -05:00
James Rodewig d336faa0b0 [DOCS] Reformat trim token filter docs (#51649)
Makes the following changes to the `trim` token filter docs:

* Updates description
* Adds a link to the related Lucene filter
* Adds tip about removing whitespace using tokenizers
* Adds detailed analyze snippets
* Adds custom analyzer snippet
2020-03-02 07:48:23 -05:00