Commit Graph

264 Commits

Author SHA1 Message Date
James Rodewig 7ef906fde8 [DOCS] Add tutorials section to analysis topic (#50809)
Adds a 'Configure text analysis' page to house tutorial content for the
analysis topic.

Also relocates the following pages as children of this new page:

* 'Test an analyzer'
* 'Configuring built-in analyzers'
* 'Create a custom analyzer'

I plan to add a tutorial for specifying index-time and search-time
analyzers to this section as part of a future PR.
2020-01-16 13:12:06 -05:00
James Rodewig ef26763ca9 [DOCS] Add concepts section to analysis topic (#50801)
This helps the topic better match the structure of
our machine learning docs, e.g.
https://www.elastic.co/guide/en/machine-learning/7.5/ml-concepts.html

This PR only includes the 'Anatomy of an analyzer' page as a 'Concepts'
child page, but I plan to add other concepts, such as 'Index time vs.
search time', with later PRs.
2020-01-16 13:00:39 -05:00
James Rodewig 1edaf2b101 [DOCS] Retitle analysis reference pages (#51071)
* Changes titles to sentence case.

* Appends pages with 'reference' to differentiate their content from
  conceptual overviews.

* Moves the 'Normalizers' page to end of the Analysis topic pages.
2020-01-16 12:30:51 -05:00
PND 1d391f7113 [Docs] Fix example output of edge n-gram token filter. (#51085) 2020-01-16 11:34:00 +01:00
James Rodewig 78c9eee5ea [DOCS] Add section ID to analysis overview page 2020-01-08 14:43:41 -06:00
James Rodewig 9d1567b13b [DOCS] Add overview page to analysis topic (#50515)
Adds a 'text analysis overview' page to the analysis topic docs.

The goals of this page are:

* Concisely summarize the analysis process while avoiding in-depth concepts, tutorials, or API examples
* Explain why analysis is important, largely through highlighting problems with full-text searches missing analysis
* Highlight how analysis can be used to improve search results
2020-01-08 12:54:00 -06:00
James Rodewig 20eba1e410 [DOCS] Reformat reverse token filter docs (#50672)
* Updates the description and adds a Lucene link
* Adds analyze and custom analyzer snippets
2020-01-07 11:01:55 -06:00
James Rodewig 8009b07ccb [DOCS] Reformat truncate token filter docs (#50687)
* Updates the description and adds a Lucene link
* Adds analyze, custom analyzer, and custom filter snippets
* Adds parameter documentation
2020-01-07 10:33:57 -06:00
James Rodewig e6a469cc74 [DOCS] Reformat uppercase token filter docs (#50555)
* Updates the description and adds a Lucene link
* Adds analyze and custom analyzer snippets
2020-01-03 08:39:08 -05:00
James Rodewig 7a14607a25 [DOCS] Abbreviate token filter titles (#50511) 2019-12-27 11:01:52 -05:00
Nik Everett 01293ebad5 Fix docs typos (#50365) (#50464)
Fixes a few typos in the docs.

Co-authored-by: Xiang Dai <764524258@qq.com>
2019-12-23 12:38:17 -05:00
James Rodewig cd04021961 [DOCS] Reformat token count limit filter docs (#49835) 2019-12-13 08:44:39 -05:00
James Rodewig 1186a5dc09 [DOCS] Reformat lowercase token filter docs (#49935) 2019-12-12 09:50:12 -05:00
James Rodewig 87a73b6bdf [DOCS] Reformat length token filter docs (#49805)
* Adds a title abbreviation
* Updates the description and adds a Lucene link
* Reformats the parameters section
* Adds analyze, custom analyzer, and custom filter snippets

Relates to #44726.
2019-12-04 09:59:08 -05:00
James Rodewig ade72b97b7 [DOCS] Reformat keep types and keep words token filter docs (#49604)
* Adds title abbreviations
* Updates the descriptions and adds Lucene links
* Reformats parameter definitions
* Adds analyze and custom analyzer snippets
* Adds explanations of token types to keep types token filter and tokenizer docs
2019-12-02 09:40:50 -05:00
James Rodewig 2fd58bb845 [DOCS] Add missing "_type" to delimited payload token filter docs 2019-11-25 16:16:05 -05:00
James Rodewig c40449ac22 [DOCS] Reformat delimited payload token filter docs (#49380)
* Adds a title abbreviation
* Relocates the older name deprecation warning
* Updates the description and adds a Lucene link
* Adds a note to explain payloads and how to store them
* Adds analyze and custom analyzer snippets
* Adds a 'Return stored payloads' example
2019-11-25 15:40:05 -05:00
James Rodewig d06c71eb82 [DOCS] Fix edge n-gram tokenizer nav
Adds a missing float tag to the edge n-gram tokenizer docs. This tag
ensures the edge n-gram tokenizer docs display on the same page.
2019-11-22 15:54:07 -05:00
James Rodewig 562607d3f5 [DOCS] Reformat n-gram token filter docs (#49438)
Reformats the edge n-gram and n-gram token filter docs. Changes include:

* Adds title abbreviations
* Updates the descriptions and adds Lucene links
* Reformats parameter definitions
* Adds analyze and custom analyzer snippets
* Adds notes explaining differences between the edge n-gram and n-gram
  filters

Additional changes:
* Switches titles to use "n-gram" throughout.
* Fixes a typo in the edge n-gram tokenizer docs
* Adds an explicit anchor for the `index.max_ngram_diff` setting
2019-11-22 10:38:50 -05:00
Christoph Büscher 4ffa050735 Allow custom characters in token_chars of ngram tokenizers (#49250)
Currently the `token_chars` setting in both `edgeNGram` and `ngram` tokenizers
only allows for a list of predefined character classes, which might not fit
every use case. For example, including underscore "_" in a token would currently
require the `punctuation` class which comes with a lot of other characters.
This change adds an additional "custom" option to the `token_chars` setting,
which requires an additional `custom_token_chars` setting to be present and
which will be interpreted as a set of characters to include in a token.

Closes #25894
2019-11-20 10:37:12 +01:00
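
The effect of the "custom" option described in 4ffa050735 can be illustrated with a simplified pure-Python model. This is only a sketch, not the Lucene or Elasticsearch implementation: `split_tokens` is a hypothetical helper, and it models only the `letter`, `digit`, and `custom` entries of `token_chars`.

```python
def split_tokens(text, token_chars, custom_token_chars=""):
    """Split text on every character that is not in an allowed class.

    Hypothetical helper: models only "letter", "digit", and "custom".
    """
    # The commit message says "custom" requires custom_token_chars to be present.
    if "custom" in token_chars and not custom_token_chars:
        raise ValueError("'custom' in token_chars requires custom_token_chars")

    def keep(ch):
        return (
            ("letter" in token_chars and ch.isalpha())
            or ("digit" in token_chars and ch.isdigit())
            or ("custom" in token_chars and ch in custom_token_chars)
        )

    tokens, current = [], []
    for ch in text:
        if keep(ch):
            current.append(ch)
        elif current:
            # A disallowed character ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# Without "custom", the underscore splits the token.
print(split_tokens("user_name-1", ["letter", "digit"]))
# ['user', 'name', '1']

# With "custom" and "_" allowed, the underscore stays inside the token.
print(split_tokens("user_name-1", ["letter", "digit", "custom"], "_"))
# ['user_name', '1']
```

Allowing only "_" this way avoids pulling in every other character covered by the `punctuation` class.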
James Rodewig a26916cc23 [DOCS] Reformat elision token filter docs (#49262) 2019-11-19 10:55:22 -05:00
James Rodewig 8639ddab5e [DOCS] Reformat fingerprint token filter docs (#49311) 2019-11-19 10:55:21 -05:00
gpaimla 7d20b50f45 Implement Lucene EstonianAnalyzer, Stemmer (#49149)
This PR adds a new analyzer and stemmer for the Estonian language.

Closes #48895
2019-11-18 17:24:21 +01:00
James Rodewig 095c34359f [DOCS] Note limitations of `max_gram` parm in `edge_ngram` tokenizer for index analyzers (#49007)
The `edge_ngram` tokenizer limits tokens to the `max_gram` character
length. Autocomplete searches for terms longer than this limit return
no results.

To prevent this, you can use the `truncate` token filter to truncate
tokens to the `max_gram` character length. However, this could return irrelevant results.

This commit adds some advisory text to make users aware of this limitation and outline the tradeoffs for each approach.

Closes #48956.
2019-11-13 14:28:12 -05:00
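
The tradeoff described in 095c34359f can be sketched with a toy model of prefix indexing. The helpers `edge_ngrams` and `truncate` below are hypothetical stand-ins for the `edge_ngram` tokenizer and `truncate` token filter, not their real implementations, and the value `MAX_GRAM = 3` is chosen only for illustration.

```python
def edge_ngrams(term, min_gram=1, max_gram=3):
    """Return the prefixes of term that are min_gram..max_gram characters long."""
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]


def truncate(term, length):
    """Cut term down to at most length characters."""
    return term[:length]


MAX_GRAM = 3

# Index time: only prefixes up to MAX_GRAM characters are stored.
indexed = set(edge_ngrams("apple", max_gram=MAX_GRAM))  # {'a', 'ap', 'app'}

# Search time, no truncation: the full term is longer than any indexed token.
no_match = "apple" in indexed  # False

# Search time, truncated to MAX_GRAM: the term now matches.
truncated_match = truncate("apple", MAX_GRAM) in indexed  # True

# The cost: a different word with the same prefix also matches.
irrelevant_match = truncate("apply", MAX_GRAM) in indexed  # True
```

In this model, truncating the search term restores matches for long terms, but every term that shares the first `MAX_GRAM` characters becomes indistinguishable.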
James Rodewig 838af15d29 [DOCS] Reformat compound word token filters (#49006)
* Separates the compound token filters doc pages into separate token
  filter pages:
  * Dictionary decompounder token filter
  * Hyphenation decompounder token filter

* Adds analyze API examples for each compound token filter

* Adds a redirect for the removed compound token filters page

Co-Authored-By: debadair <debadair@elastic.co>
2019-11-13 09:36:52 -05:00
James Rodewig dd92830801 [DOCS] Reformat condition token filter (#48775) 2019-11-11 08:49:44 -05:00
Julian Simioni 5e4501eb3f [Docs] Consolidate single example into a single line (#48904)
The first example of splitting rules for the `word_delimiter` token filter was spread across two bullet points. This makes it look like they are two separate splitting rules.
2019-11-08 15:12:45 -05:00
James Rodewig 700a316bb3 [DOCS] Reformat decimal digit token filter docs (#48722) 2019-11-01 12:38:14 -04:00
Peter Johnson 3f7aafa421 [DOCS] Fix typo in synonym token filter docs (#48691) 2019-10-31 09:12:24 -04:00
James Rodewig 3d5b1725a9 [DOCS] Remove unneeded filter from common grams analyze ex (#48748) 2019-10-31 09:08:14 -04:00
James Rodewig 77acbc4fa9 [DOCS] Reformat common grams token filter (#48426) 2019-10-30 08:40:56 -04:00
James Rodewig 06dc1fbd96 [DOCS] Reformat ASCII folding token filter docs (#48143) 2019-10-23 15:06:55 -05:00
James Rodewig 9c75f14a9f [DOCS] Reformat classic token filter docs (#48314) 2019-10-23 10:14:25 -05:00
James Rodewig a66bb2c7ed [DOCS] Reformat CJK bigram and CJK width token filter docs (#48210) 2019-10-21 08:44:49 -05:00
James Rodewig 8677653c5b [DOCS] Reformat apostrophe token filter docs (#48076) 2019-10-16 08:51:14 -04:00
Wilder Pereira 8c73e215b2 [DOCS] Remove unneeded spaces from custom analyzer snippet (#47332) 2019-10-15 15:53:16 -04:00
James Rodewig 601a88bede [DOCS] Sort analyzers, tokenizers, and token filters alphabetically (#48068) 2019-10-15 15:47:25 -04:00
James Rodewig af7aba18d4 Fixed sample code for minhash (#46385)
The sample code is wrong. Field type is required for the sample field.
I guess the intention was to give the sample field the name `fingerprint`, mapping it as `text` using the custom analyzer `my_analyzer`.
2019-09-12 13:29:44 -04:00
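
The mapping that af7aba18d4 describes might look like the following request body, written here as a Python dictionary. The field name, type, and analyzer name come from the commit message; the surrounding `mappings`/`properties` layout is an assumption based on the typeless mapping format of Elasticsearch 7.x, and the definition of `my_analyzer` itself is not shown.

```python
import json

# Assumed request-body layout; only the three values below come from the commit.
index_body = {
    "mappings": {
        "properties": {
            "fingerprint": {               # field name from the commit message
                "type": "text",            # the field type that was missing
                "analyzer": "my_analyzer"  # custom analyzer, defined elsewhere
            }
        }
    }
}

print(json.dumps(index_body, indent=2))
```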
Abhilash Bolla 20e93bca6b Fixed grammar in pattern replace char filter docs. (#46546)
Minor grammar fix in the pattern replace char filter docs.
2019-09-10 11:04:07 -07:00
James Rodewig b59ecde041 [DOCS] [2 of 5] Change // CONSOLE comments to [source,console] (#46353) (#46502) 2019-09-09 13:38:14 -04:00
James Rodewig f04573f8e8 [DOCS] [5 of 5] Change // TESTRESPONSE comments to [source,console-results] (#46449) (#46459) 2019-09-06 16:09:09 -04:00
James Rodewig bb7bff5e30 [DOCS] Replace "// TESTRESPONSE" magic comments with "[source,console-result]" (#46295) (#46418) 2019-09-06 09:22:08 -04:00
James Rodewig 3e62cf9d74 [DOCS] Correct custom analyzer callouts (#46030) 2019-08-29 10:08:18 -04:00
James Rodewig d46545f729 [DOCS] Update anchors and links for Elasticsearch API relocation (#44500) 2019-07-19 09:18:23 -04:00
Christoph Büscher 2cc7f5a744 Allow reloading of search time analyzers (#43313)
Currently changing resources (like dictionaries, synonym files, etc.) of search
time analyzers is only possible by closing an index, changing the underlying
resource (e.g. synonym files) and then re-opening the index for the change to
take effect.

This PR adds a new API endpoint that allows triggering reloading of certain
analysis resources (currently token filters) that will then pick up changes in
underlying file resources. To achieve this we introduce a new type of custom
analyzer (ReloadableCustomAnalyzer) that uses a ReuseStrategy that allows
swapping out analysis components. Custom analyzers that contain filters that are
marked as "updateable" will automatically choose this implementation. This PR
also adds this capability to `synonym` token filters for use in search time
analyzers.

Relates to #29051
2019-06-28 09:55:40 +02:00
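
The idea behind 2cc7f5a744, swapping an analysis component in place instead of closing and re-opening the index, can be shown with a toy model. This is a conceptual sketch only: `ReloadableAnalyzer`, its `reload` method, and the in-memory `resource` dictionary are hypothetical and do not reflect how `ReloadableCustomAnalyzer` or its ReuseStrategy is actually implemented.

```python
class ReloadableAnalyzer:
    """Toy analyzer whose synonym table can be swapped without rebuilding it."""

    def __init__(self, load_synonyms):
        # load_synonyms stands in for reading a synonym file from disk.
        self._load_synonyms = load_synonyms
        self._synonyms = load_synonyms()

    def reload(self):
        """Pick up changes in the underlying resource."""
        self._synonyms = self._load_synonyms()

    def analyze(self, text):
        tokens = []
        for token in text.lower().split():
            tokens.append(token)
            tokens.extend(self._synonyms.get(token, []))
        return tokens


# The dictionary plays the role of a synonym file that changes over time.
resource = {"tv": ["television"]}
analyzer = ReloadableAnalyzer(lambda: dict(resource))

before = analyzer.analyze("TV stand")  # ['tv', 'television', 'stand']

resource["couch"] = ["sofa"]           # the "file" changes
stale = analyzer.analyze("couch")      # ['couch'] (old table still in use)

analyzer.reload()                      # analogous to calling the reload endpoint
after = analyzer.analyze("couch")      # ['couch', 'sofa']
```

Restricting this to search-time analyzers means already-indexed tokens never depend on a resource that has since changed.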
Alan Woodward 05a7333eca Require [articles] setting in elision filter (#43083)
We should throw an exception at construction time if a list of
articles is not provided, otherwise we can get random NPEs during
indexing.

Relates to #43002
2019-06-27 09:02:36 +01:00
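
The fail-fast behavior argued for in 05a7333eca can be sketched as follows. `ElisionFilterSketch` is a hypothetical Python class, not the Elasticsearch code; it matches articles case-sensitively and only handles the plain ASCII apostrophe.

```python
class ElisionFilterSketch:
    """Toy elision filter that validates its settings when it is constructed."""

    def __init__(self, settings):
        articles = settings.get("articles")
        if not articles:
            # Fail here, once, instead of failing unpredictably during indexing.
            raise ValueError("elision filter requires [articles] to be set")
        self.articles = frozenset(articles)

    def filter(self, token):
        """Strip a leading article and apostrophe, e.g. l'avion -> avion."""
        head, sep, tail = token.partition("'")
        if sep and head in self.articles:
            return tail
        return token


elision = ElisionFilterSketch({"articles": ["l", "d"]})
print(elision.filter("l'avion"))  # avion
print(elision.filter("avion"))    # avion

try:
    ElisionFilterSketch({})       # missing articles: rejected immediately
except ValueError as err:
    print(err)
```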
Sachin Frayne 44aedcf97a Correct the description of generate_word_parts (#43026) 2019-06-10 11:36:31 +01:00
James Rodewig 5342616a23 [DOCS] Add explicit `articles_case` parameter to Elision Token Filter example (#42987) 2019-06-07 11:24:43 -04:00
Mayya Sharipova 5a76f46ac6 Fix error with mapping in docs
Related to #39630
2019-05-30 10:28:09 -04:00
Peter Dyson b84b5525e1 [DOCS] path_hierarchy tokenizer examples (#39630)
Closes #17138
2019-05-30 09:17:55 -04:00