From 9d1567b13b25ee11c94b608cadae1437729e17e2 Mon Sep 17 00:00:00 2001
From: James Rodewig
Date: Wed, 8 Jan 2020 12:53:08 -0600
Subject: [PATCH] [DOCS] Add overview page to analysis topic (#50515)

Adds a 'text analysis overview' page to the analysis topic docs.

The goals of this page are:

* Concisely summarize the analysis process while avoiding in-depth
  concepts, tutorials, or API examples
* Explain why analysis is important, largely through highlighting
  problems with full-text searches missing analysis
* Highlight how analysis can be used to improve search results
---
 docs/reference/analysis.asciidoc          |  8 ++-
 docs/reference/analysis/overview.asciidoc | 78 +++++++++++++++++++++++
 2 files changed, 83 insertions(+), 3 deletions(-)
 create mode 100644 docs/reference/analysis/overview.asciidoc

diff --git a/docs/reference/analysis.asciidoc b/docs/reference/analysis.asciidoc
index 2ff88a38c65..aaa40c01aea 100644
--- a/docs/reference/analysis.asciidoc
+++ b/docs/reference/analysis.asciidoc
@@ -1,11 +1,11 @@
 [[analysis]]
-= Analysis
+= Text analysis
 
 [partintro]
 --
 
-_Analysis_ is the process of converting text, like the body of any email, into
-_tokens_ or _terms_ which are added to the inverted index for searching.
+_Text analysis_ is the process of converting text, like the body of any email,
+into _tokens_ or _terms_ which are added to the inverted index for searching.
 Analysis is performed by an <> which can be either a built-in
 analyzer or a <> analyzer defined per index.
 
@@ -142,6 +142,8 @@ looking for:
 
 --
 
+include::analysis/overview.asciidoc[]
+
 include::analysis/anatomy.asciidoc[]
 
 include::analysis/testing.asciidoc[]
diff --git a/docs/reference/analysis/overview.asciidoc b/docs/reference/analysis/overview.asciidoc
new file mode 100644
index 00000000000..19f39a548ba
--- /dev/null
+++ b/docs/reference/analysis/overview.asciidoc
@@ -0,0 +1,78 @@
+
+== Text analysis overview
+++++
+Overview
+++++
+
+Text analysis enables {es} to perform full-text search, where the search returns
+all _relevant_ results rather than just exact matches.
+
+If you search for `Quick fox jumps`, you probably want the document that
+contains `A quick brown fox jumps over the lazy dog`, and you might also want
+documents that contain related words like `fast fox` or `foxes leap`.
+
+[discrete]
+[[tokenization]]
+=== Tokenization
+
+Analysis makes full-text search possible through _tokenization_: breaking a text
+down into smaller chunks, called _tokens_. In most cases, these tokens are
+individual words.
+
+If you index the phrase `the quick brown fox jumps` as a single string and the
+user searches for `quick fox`, it isn't considered a match. However, if you
+tokenize the phrase and index each word separately, the terms in the query
+string can be looked up individually. This means they can be matched by searches
+for `quick fox`, `fox brown`, or other variations.
+
+[discrete]
+[[normalization]]
+=== Normalization
+
+Tokenization enables matching on individual terms, but each token is still
+matched literally. This means:
+
+* A search for `Quick` would not match `quick`, even though you likely want
+either term to match the other.
+
+* Although `fox` and `foxes` share the same root word, a search for `foxes`
+would not match `fox` or vice versa.
+
+* A search for `jumps` would not match `leaps`. While they don't share a root
+word, they are synonyms and have a similar meaning.
+
+To solve these problems, text analysis can _normalize_ these tokens into a
+standard format. This allows you to match tokens that are not exactly the same
+as the search terms, but similar enough to still be relevant. For example:
+
+* `Quick` can be lowercased: `quick`.
+
+* `foxes` can be _stemmed_, or reduced to its root word: `fox`.
+
+* `jump` and `leap` are synonyms and can be indexed as a single word: `jump`.
+
+To ensure search terms match these words as intended, you can apply the same
+tokenization and normalization rules to the query string. For example, a search
+for `Foxes leap` can be normalized to a search for `fox jump`.
+
+[discrete]
+[[analysis-customization]]
+=== Customize text analysis
+
+Text analysis is performed by an <>, a set of rules
+that govern the entire process.
+
+{es} includes a default analyzer, called the
+<>, which works well for most use
+cases right out of the box.
+
+If you want to tailor your search experience, you can choose a different
+<> or even
+<>. A custom analyzer gives you
+control over each step of the analysis process, including:
+
+* Changes to the text _before_ tokenization
+
+* How text is converted to tokens
+
+* Normalization changes made to tokens before indexing or search
\ No newline at end of file
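The Tokenization and Normalization sections in the page above can be tried out directly with the `_analyze` API. The request below is not part of the patch; it is a minimal sketch that runs the page's example sentence through the built-in `standard` analyzer.

[source,console]
----
GET /_analyze
{
  "analyzer": "standard",
  "text": "A quick brown fox jumps over the lazy dog"
}
----

The `standard` analyzer splits the text on word boundaries and lowercases each token, so the response lists the terms `a`, `quick`, `brown`, `fox`, `jumps`, `over`, `the`, `lazy`, and `dog`, each with its position and character offsets. It does not stem or apply synonyms; those steps come from additional token filters, as sketched next.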
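The "Customize text analysis" section could likewise be illustrated with a custom analyzer. The request below is a sketch, not part of the patch: the index name `my-index` and the names `my_analyzer` and `my_synonyms` are made up for illustration, and the example assumes the built-in `standard` tokenizer plus the `lowercase`, `synonym`, and `stemmer` token filters.

[source,console]
----
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": [ "leaps, leap, jumps => jump" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "my_synonyms", "stemmer" ]
        }
      }
    }
  }
}
----

Applied to a text field, this chain mirrors the overview's walkthrough: lowercasing, the synonym rule, and stemming reduce both `A quick brown fox jumps over the lazy dog` at index time and a query for `Foxes leap` at search time to shared terms such as `fox` and `jump`.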