[DOCS] Reorganized the highlighting topic so it's less confusing.

2025-03-24 17:09:48 +00:00 · 2017-07-11 21:15:35 -07:00 · 2017-07-11 21:15:35 -07:00 · b5e81132cf
commit b5e81132cf
parent e165c405ac
1 changed files with 239 additions and 215 deletions
--- a/docs/reference/search/request/highlighting.asciidoc
+++ b/docs/reference/search/request/highlighting.asciidoc
@@ -1,9 +1,22 @@
 [[search-request-highlighting]]
 === Highlighting

-Highlighters allow you to produce highlighted snippets from one or more fields
-in your search results.
-The following is an example of the search request body:
+Highlighters enable you to get highlighted snippets from one or more fields
+in your search results so you can show users where the query matches are.
+When you request highlights, the response contains an additional `highlight`
+element for each search hit that includes the highlighted fields and the
+highlighted fragments.
+
+Highlighting requires the actual content of a field. If the field is not
+stored (the mapping does not set `store` to `true`), the actual `_source` is
+loaded and the relevant field is extracted from `_source`.
+
+NOTE: The `_all` field cannot be extracted from `_source`, so it can only
+be used for highlighting if it is explicitly stored.
+
+For example, to get highlights for the `content` field in each search hit
+using the default highlighter, include a `highlight` object in
+the request body that specifies the `content` field:

 [source,js]
 --------------------------------------------------
@@ -22,63 +35,207 @@ GET /_search
 // CONSOLE
 // TEST[setup:twitter]

-In the above case, the `comment` field will be highlighted for each
-search hit (there will be another element in each search hit, called
-`highlight`, which includes the highlighted fields and the highlighted
-fragments).
-
-[NOTE]
-==================================
-In order to perform highlighting, the actual content of the field is
-required. If the field in question is stored (has `store` set to `true`
-in the mapping) it will be used, otherwise, the actual `_source` will
-be loaded and the relevant field will be extracted from it.
-
-The `_all` field cannot be extracted from `_source`, so it can only
-be used for highlighting if it mapped to have `store` set to `true`.
-==================================
-
-The field name supports wildcard notation. For example, using `comment_*`
-will cause all <<text,text>> and <<keyword,keyword>> fields that match the expression to be highlighted.
-Note that all other fields will not be highlighted. If you use a custom mapper and want to
-highlight on a field anyway, you have to provide the field name explicitly.
+{es} supports three highlighters:

 [[unified-highlighter]]
-==== Unified Highlighter
+* The `unified` highlighter uses the Lucene Unified Highlighter. This
+highlighter breaks the text into sentences and uses the BM25 algorithm to score
+individual sentences as if they were documents in the corpus. It also supports
+accurate phrase and multi-term (fuzzy, prefix, regex) highlighting. This is the
+default highlighter.

-The unified highlighter (which is used by default if no highlighter type is specified)
-uses the Lucene Unified Highlighter.
-This highlighter breaks the text into sentences and scores individual sentences as
-if they were documents in this corpus, using the BM25 algorithm.
-It also supports accurate phrase and multi-term (fuzzy, prefix, regex) highlighting.
+[[plain-highlighter]]
+* The `plain` highlighter uses the standard Lucene highlighter. It attempts to
+reflect the query matching logic in terms of understanding word importance and
+any word positioning criteria in phrase queries.
+
+[WARNING]
+The `plain` highlighter works best for highlighting simple query matches in a
+single field. To accurately reflect query logic, it creates a tiny in-memory
+index and re-runs the original query criteria through Lucene's query execution
+planner to get access to low-level match information for the current document.
+This is repeated for every field and every document that needs to be highlighted.
+If you want to highlight a lot of fields in a lot of documents with complex
+queries, we recommend using one of the other highlighters.

-===== Offsets Strategy
+[[fast-vector-highlighter]]
+* The `fvh` highlighter uses the Lucene Fast Vector highlighter.
+This highlighter can be used on fields with `term_vector` set to
+`with_positions_offsets` in the mapping. The fast vector highlighter:

-In order to create meaningful search snippets from the terms being queried,
-a highlighter needs to know the start and end character offsets of each word
-in the original text.
-These offsets can be obtained from:
+** Is faster, especially for large fields (> `1MB`)
+** Can be customized with a <<boundary-scanners,`boundary_scanner`>>
+** Requires setting `term_vector` to `with_positions_offsets` which
+  increases the size of the index
+** Can combine matches from multiple fields into one result.  See
+  `matched_fields`
+** Can assign different weights to matches at different positions allowing
+  for things like phrase matches being sorted above term matches when
+  highlighting a Boosting Query that boosts phrase matches over term matches

-* The postings list (fields mapped as "index_options": "offsets").
-* Term vectors (fields mapped as "term_vectors": "with_positions_offsets").
-* The original field, by reanalysing the text on-the-fly.
+To create meaningful search snippets from the terms being queried,
+the highlighter needs to know the start and end character offsets of each word
+in the original text. These offsets can be obtained from:

-====== Plain highlighting
+* The postings list. If `index_options` is set to `offsets` in the mapping,
+the `unified` highlighter uses this information to highlight documents without
+re-analyzing the text. It re-runs the original query directly on the postings
+and extracts the matching offsets from the index, limiting the collection to
+the highlighted documents. This is important if you have large fields because
+it doesn't require re-analyzing the text to be highlighted. It also requires less
+disk space than using term vectors.

-This mode is picked when there is no other alternative.
+* Term vectors. If `term_vector` information is provided by setting 
+`term_vector` to `with_positions_offsets` in the mapping, the `unified`
+highlighter automatically uses the `term_vector` to highlight the field.
+Term vector highlighting is faster for highlighting multi-term queries like
+`prefix` or `wildcard` because it can access the dictionary of terms for
+each document, but it can be slower than using the postings list. The `fvh`
+highlighter always uses term vectors.
+
+* Plain highlighting. This mode is used when there is no other alternative.
 It creates a tiny in-memory index and re-runs the original query criteria through
-Lucene's query execution planner to get access to low-level match information on the current document.
-This is repeated for every field and every document that needs highlighting.
+Lucene's query execution planner to get access to low-level match information on
+the current document. This is repeated for every field and every document that
+needs highlighting. The `plain` highlighter always uses plain highlighting.

-====== Postings
+You can specify the highlighter `type` you want to use
+for each field.

-If `index_options` is set to `offsets` in the mapping the `unified` highlighter
-will use this information to highlight documents without re-analyzing the text.
-It re-runs the original query directly on the postings and extracts the matching offsets
-directly from the index limiting the collection to the highlighted documents.
-This mode is faster on large fields since it doesn't require to reanalyze the text to be highlighted
-and requires less disk space than term_vectors, needed for the fast vector
-highlighting.
+[[highlighting-settings]]
+==== Highlighting Settings
+
+Highlighting settings can be set on a global level and overridden at
+the field level.
+
+boundary_chars:: A string that contains each boundary character.
+Defaults to `.,!? \t\n`.
+
+boundary_max_scan:: How far to scan for boundary characters. Defaults to `20`.
+
+[[boundary-scanners]]
+boundary_scanner:: Specifies how to break the highlighted fragments: `chars`,
+`sentence`, or `word`. Only valid for the `unified` and `fvh` highlighters.
+Defaults to `sentence` for the `unified` highlighter. Defaults to `chars` for
+the `fvh` highlighter.
+
+* `chars` Use the characters specified by `boundary_chars` as highlighting
+boundaries.  The `boundary_max_scan` setting controls how far to scan for
+boundary characters. Only valid for the `fvh` highlighter.
+* `sentence` Break highlighted fragments at the next sentence boundary, as
+determined by Java's 
+https://docs.oracle.com/javase/8/docs/api/java/text/BreakIterator.html[BreakIterator].
+You can specify the locale to use with `boundary_scanner_locale`.
+
+NOTE: When used with the `unified` highlighter, the `sentence` scanner splits
+sentences bigger than `fragment_size` at the first word boundary next to
+`fragment_size`. You can set `fragment_size` to 0 to never split any sentence.
+
+* `word` Break highlighted fragments at the next word boundary, as determined
+by Java's https://docs.oracle.com/javase/8/docs/api/java/text/BreakIterator.html[BreakIterator].
+You can specify the locale to use with `boundary_scanner_locale`.
+
+boundary_scanner_locale:: Controls which locale is used to search for sentence
+and word boundaries.
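+
+As a rough, untested sketch of how these settings combine, the following
+request asks the `unified` highlighter to break fragments at sentence
+boundaries. The `en-US` locale, the `content` field, and the query text are
+assumptions chosen for illustration:
+
+[source,js]
+--------------------------------------------------
+GET /_search
+{
+    "query" : {
+        "match": { "content": "kimchy" }
+    },
+    "highlight" : {
+        "boundary_scanner" : "sentence",
+        "boundary_scanner_locale" : "en-US",
+        "fields" : {
+            "content" : { "type" : "unified" }
+        }
+    }
+}
+--------------------------------------------------
+// NOTCONSOLE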
+
+encoder:: Indicates if the highlighted text should be HTML encoded:
+`default` (no encoding) or `html` (escapes HTML highlighting tags).
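+
+For instance, a request along these lines asks {es} to HTML-encode the
+highlighted text. This is an untested sketch, and the `content` field and the
+query text are placeholders:
+
+[source,js]
+--------------------------------------------------
+GET /_search
+{
+    "query" : {
+        "match": { "content": "kimchy" }
+    },
+    "highlight" : {
+        "encoder" : "html",
+        "fields" : {
+            "content" : {}
+        }
+    }
+}
+--------------------------------------------------
+// NOTCONSOLE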
+
+fields:: Specifies the fields to retrieve highlights for. You can use wildcards
+to specify fields. For example, you could specify `comment_*` to
+get highlights for all <<text,text>> and <<keyword,keyword>> fields
+that start with `comment_`.
+
+NOTE: Only text and keyword fields are highlighted when you use wildcards.
+If you use a custom mapper and want to highlight on a field anyway, you
+must explicitly specify that field name.
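+
+A wildcard request might look like the following untested sketch. It assumes
+an index with several text or keyword fields whose names start with
+`comment_`; the query text is a placeholder:
+
+[source,js]
+--------------------------------------------------
+GET /_search
+{
+    "query" : {
+        "multi_match": {
+            "query": "elasticsearch",
+            "fields": [ "comment_*" ]
+        }
+    },
+    "highlight" : {
+        "fields" : {
+            "comment_*" : {}
+        }
+    }
+}
+--------------------------------------------------
+// NOTCONSOLE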
+
+force_source:: Highlight based on the source even if the field is
+stored separately. Defaults to `false`.
+
+fragmenter:: Specifies how text should be broken up in highlight
+snippets: `simple` or `span`. Only valid for the `plain` highlighter.
+Defaults to `span`.
+
+* `simple` Breaks up text into same-sized fragments.
+* `span` Breaks up text into same-sized fragments, but tries to avoid
+breaking up text between highlighted terms. This is helpful when you're
+querying for phrases. Default.
+
+fragment_offset:: Controls the margin from which you want to start
+highlighting. Only valid when using the `fvh` highlighter.
+
+fragment_size:: The size of the highlighted fragment in characters. Defaults
+to 100.
+
+highlight_query:: Highlight matches for a query other than the search
+query. This is especially useful if you use a rescore query because
+those are not taken into account by highlighting by default.
+
+IMPORTANT: {es} does not validate that `highlight_query` contains
+the search query in any way so it is possible to define it so
+legitimate query results are not highlighted. Generally, you should
+include the search query as part of the `highlight_query`.
+
+matched_fields:: Combine matches on multiple fields to highlight a single field.
+This is most intuitive for multifields that analyze the same string in different
+ways.  All `matched_fields` must have `term_vector` set to
+`with_positions_offsets`, but only the field to which
+the matches are combined is loaded so only that field benefits from having
+`store` set to `true`. Only valid for the `fvh` highlighter.
+
+no_match_size:: The amount of text you want to return from the beginning
+of the field if there are no matching fragments to highlight. Defaults
+to 0 (nothing is returned).
+
+number_of_fragments:: The maximum number of fragments to return. If the
+number of fragments is set to 0, no fragments are returned. Instead,
+the entire field contents are highlighted and returned. This can be
+handy when you need to highlight short texts such as a title or
+address, but fragmentation is not required. If `number_of_fragments`
+is 0, `fragment_size` is ignored. Defaults to 5.
+
+order:: Sorts highlighted fragments by score when set to `score`. Only valid for
+the `unified` highlighter.
+
+phrase_limit:: Controls the number of matching phrases in a document that are
+considered. Prevents the `fvh` highlighter from analyzing too many phrases
+and consuming too much memory. When using `matched_fields`, `phrase_limit`
+phrases per matched field are considered. Raising the limit increases query
+time and consumes more memory. Only supported by the `fvh` highlighter.
+Defaults to 256.
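+
+As an untested sketch, raising the limit for a single field could look like
+this. It assumes `comment` is mapped with `term_vector` set to
+`with_positions_offsets`; the value `512` and the query text are arbitrary:
+
+[source,js]
+--------------------------------------------------
+GET /_search
+{
+    "query" : {
+        "match_phrase": { "comment": "quick brown fox" }
+    },
+    "highlight" : {
+        "fields" : {
+            "comment" : {
+                "type" : "fvh",
+                "phrase_limit" : 512
+            }
+        }
+    }
+}
+--------------------------------------------------
+// NOTCONSOLE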
+
+pre_tags:: Use in conjunction with `post_tags` to define the HTML tags
+to use for the highlighted text. By default, highlighted text is wrapped
+in `<em>` and `</em>` tags. Specify as an array of strings.
+
+post_tags:: Use in conjunction with `pre_tags` to define the HTML tags
+to use for the highlighted text. By default, highlighted text is wrapped
+in `<em>` and `</em>` tags. Specify as an array of strings.
+
+require_field_match:: By default, only fields that contain a query match are
+highlighted. Set `require_field_match` to `false` to highlight all fields.
+Defaults to `true`.
+
+tags_schema:: Set to `styled` to use the built-in tag schema. The `styled`
+schema defines `post_tags` as `</em>` and defines the following `pre_tags`:
+
+[source,html]
+--------------------------------------------------
+<em class="hlt1">, <em class="hlt2">, <em class="hlt3">,
+<em class="hlt4">, <em class="hlt5">, <em class="hlt6">,
+<em class="hlt7">, <em class="hlt8">, <em class="hlt9">,
+<em class="hlt10">
+--------------------------------------------------
+
+
+[[highlighter-type]]
+type:: The highlighter to use: `unified`, `plain`, or `fvh`. Defaults to
+`unified`.
+
+[[highlighting-examples]]
+==== Highlighting Examples

 Here is an example of setting the `comment` field in the index mapping to allow for
 highlighting using the postings:
@@ -101,15 +258,6 @@ PUT /example
 --------------------------------------------------
 // CONSOLE

-====== Term Vectors
-
-If `term_vector` information is provided by setting `term_vector` to
-`with_positions_offsets` in the mapping then the `unified` highlighter
-will automatically use the `term_vector` to highlight the field.
-The `term_vector` highlighting is faster to highlight multi-term queries like
-`prefix` or `wildcard` because it can access the dictionary of term for each document
-but it is also usually more costly than using the `postings` directly.
-
 Here is an example of setting the `comment` field to allow for
 highlighting using the `term_vectors` (this will cause the index to be bigger):

@@ -131,59 +279,8 @@ PUT /example
 --------------------------------------------------
 // CONSOLE

-[[plain-highlighter]]
-==== Plain highlighter

-This highlighter of type `plain` uses the standard Lucene highlighter.
-It tries hard to reflect the query matching logic in terms of understanding word importance and any word positioning criteria in phrase queries.
-
-[WARNING]
-If you want to highlight a lot of fields in a lot of documents with complex queries this highlighter will not be fast.
-In its efforts to accurately reflect query logic it creates a tiny in-memory index and re-runs the original query criteria through
-Lucene's query execution planner to get access to low-level match information on the current document.
-This is repeated for every field and every document that needs highlighting. If this presents a performance issue in your system consider using an alternative highlighter.
-
-[[fast-vector-highlighter]]
-==== Fast vector highlighter
-
-This highlighter of type `fvh` uses the Lucene Fast Vector highlighter.
-This highlighter can be used on fields with `term_vector` set to
-`with_positions_offsets` in the mapping.
-The fast vector highlighter:
-
-* Is faster especially for large fields (> `1MB`)
-* Can be customized with `boundary_scanner` (see <<boundary-scanners,below>>)
-* Requires setting `term_vector` to `with_positions_offsets` which
-  increases the size of the index
-* Can combine matches from multiple fields into one result.  See
-  `matched_fields`
-* Can assign different weights to matches at different positions allowing
-  for things like phrase matches being sorted above term matches when
-  highlighting a Boosting Query that boosts phrase matches over term matches
-
-Here is an example of setting the `comment` field to allow for
-highlighting using the fast vector highlighter on it (this will cause
-the index to be bigger):
-
-[source,js]
--------------------------------------------------
-PUT /example
-{
-  "mappings": {
-    "doc" : {
-      "properties": {
-        "comment" : {
-          "type": "text",
-          "term_vector" : "with_positions_offsets"
-        }
-      }
-    }
-  }
-}
--------------------------------------------------
-// CONSOLE
-
-==== Force highlighter type
+===== Force highlighter type

 The `type` field allows to force a specific highlighter type.
 The allowed values are: `unified`, `plain` and `fvh`.
@@ -206,10 +303,10 @@ GET /_search
 // CONSOLE
 // TEST[setup:twitter]

-==== Force highlighting on source
+===== Force highlighting on source

-Forces the highlighting to highlight fields based on the source even if fields are
-stored separately. Defaults to `false`.
+Forces the highlighting to highlight fields based on the source even if fields
+are stored separately. Defaults to `false`.

 [source,js]
 --------------------------------------------------
@@ -229,7 +326,7 @@ GET /_search
 // TEST[setup:twitter]

 [[tags]]
-==== Highlighting Tags
+===== Configure highlighting tags

 By default, the highlighting will wrap highlighted text in `<em>` and
 `</em>`. This can be controlled by setting `pre_tags` and `post_tags`,
@@ -254,8 +351,8 @@ GET /_search
 // CONSOLE
 // TEST[setup:twitter]

-Using the fast vector highlighter there can be more tags, and the "importance"
-is ordered.
+When using the fast vector highlighter, you can specify additional tags, listed
+in order of "importance".

 [source,js]
 --------------------------------------------------
@@ -276,20 +373,7 @@ GET /_search
 // CONSOLE
 // TEST[setup:twitter]

-There are also built in "tag" schemas, with currently a single schema
-called `styled` with the following `pre_tags`:
-
-[source,html]
--------------------------------------------------
-<em class="hlt1">, <em class="hlt2">, <em class="hlt3">,
-<em class="hlt4">, <em class="hlt5">, <em class="hlt6">,
-<em class="hlt7">, <em class="hlt8">, <em class="hlt9">,
-<em class="hlt10">
--------------------------------------------------
-
-and `</em>` as `post_tags`. If you think of more nice to have built in tag
-schemas, just send an email to the mailing list or open an issue. Here
-is an example of switching tag schemas:
+You can also use the built-in `styled` tag schema:

 [source,js]
 --------------------------------------------------
@@ -309,13 +393,8 @@ GET /_search
 // CONSOLE
 // TEST[setup:twitter]

-==== Encoder

-An `encoder` parameter can be used to define how highlighted text will
-be encoded. It can be either `default` (no encoding) or `html` (will
-escape html, if you use html highlighting tags).
-
-==== Highlighted Fragments
+===== Controlling highlighted fragments

 Each field highlighted can control the size of the highlighted fragment
 in characters (defaults to `100`), and the maximum number of fragments
@@ -414,17 +493,10 @@ GET /_search
 // CONSOLE
 // TEST[setup:twitter]

-==== Fragmenter
+===== Specifying a fragmenter for the plain highlighter

-WARNING: This option is not supported by the `unified` highlighter
-
-Fragmenter can control how text should be broken up in highlight snippets.
-However, this option is applicable only for the Plain Highlighter.
-There are two options:
-
-[horizontal]
-`simple`:: Breaks up text into same sized fragments.
-`span`:: Same as the simple fragmenter, but tries not to break up text between highlighted terms (this is applicable when using phrase like queries). This is the default.
+When using the `plain` highlighter, you can choose between the `simple` and
+`span` fragmenters:

 [source,js]
 --------------------------------------------------
@@ -539,19 +611,13 @@ Response:

 If the `number_of_fragments` option is set to `0`,
 `NullFragmenter` is used which does not fragment the text at all.
-This is useful for highlighting the entire content of a document or field.
+This is useful for highlighting the entire contents of a document or field.

-==== Highlight query
+===== Specifying a highlight query

-It is also possible to highlight against a query other than the search
-query by setting `highlight_query`.  This is especially useful if you
-use a rescore query because those are not taken into account by
-highlighting by default.  Elasticsearch does not validate that
-`highlight_query` contains the search query in any way so it is possible
-to define it so legitimate query results aren't highlighted at all.
-Generally it is better to include the search query in the
-`highlight_query`.  Here is an example of including both the search
+Here is an example of including both the search
 query and the rescore query in `highlight_query`.
+
 [source,js]
 --------------------------------------------------
 GET /_search
@@ -613,11 +679,8 @@ GET /_search
 // CONSOLE
 // TEST[setup:twitter]

-[[highlighting-settings]]
-==== Global Settings
-
-Highlighting settings can be set on a global level and then overridden
-at the field level.
+[[overriding-global-settings]]
+===== Overriding global settings

 [source,js]
 --------------------------------------------------
@@ -642,12 +705,10 @@ GET /_search
 // TEST[setup:twitter]

 [[field-match]]
-==== Require Field Match
+===== Highlighting in all fields

-`require_field_match` can be set to `false` which will cause any field to
-be highlighted regardless of whether the query matched specifically on them.
-The default behaviour is `true`, meaning that only fields that hold a query
-match will be highlighted.
+By default, only fields that contain a query match are highlighted. Set
+`require_field_match` to `false` to highlight all fields.

 [source,js]
 --------------------------------------------------
@@ -667,43 +728,19 @@ GET /_search
 // CONSOLE
 // TEST[setup:twitter]

-[[boundary-scanners]]
-==== Boundary Scanners
-
-When highlighting a field using the unified highlighter or the fast vector highlighter,
-you can specify how to break the highlighted fragments using `boundary_scanner`, which accepts
-the following values:
-
-* `chars` (default mode for the FVH): allows to configure which characters (`boundary_chars`)
-constitute a boundary for highlighting. It's a single string with each boundary
-character defined in it (defaults to `.,!? \t\n`). It also allows configuring
-the `boundary_max_scan` to control how far to look for boundary characters
-(defaults to `20`). Works only with the Fast Vector Highlighter.
-
-* `sentence` and `word`: use Java's https://docs.oracle.com/javase/8/docs/api/java/text/BreakIterator.html[BreakIterator]
-to break the highlighted fragments at the next  _sentence_ or _word_ boundary.
-You can further specify `boundary_scanner_locale` to control which Locale is used
-to search the text for these boundaries.
-
-[NOTE]
-When used with the `unified` highlighter, the `sentence` scanner splits sentence
-bigger than `fragment_size` at the first word boundary next to `fragment_size`.
-You can set `fragment_size` to 0 to never split any sentence.
-
 [[matched-fields]]
-==== Matched Fields
+===== Combining matches on multiple fields

 WARNING: This is only supported by the `fvh` highlighter

 The Fast Vector Highlighter can combine matches on multiple fields to
-highlight a single field using `matched_fields`.  This is most
-intuitive for multifields that analyze the same string in different
-ways.  All `matched_fields` must have `term_vector` set to
-`with_positions_offsets` but only the field to which the matches are
-combined is loaded so only that field would benefit from having
+highlight a single field.  This is most intuitive for multifields that
+analyze the same string in different ways.  All `matched_fields` must have
+`term_vector` set to `with_positions_offsets` but only the field to which
+the matches are combined is loaded so only that field would benefit from having
 `store` set to `yes`.

-In the following examples `comment` is analyzed by the `english`
+In the following examples, `comment` is analyzed by the `english`
 analyzer and `comment.plain` is analyzed by the `standard` analyzer.

 [source,js]
@@ -826,26 +863,13 @@ to
 // NOTCONSOLE
 ===================================================================

-[[phrase-limit]]
-==== Phrase Limit

-WARNING: this is only supported by the `fvh` highlighter
-
-The fast vector highlighter has a `phrase_limit` parameter that prevents
-it from analyzing too many phrases and eating tons of memory.  It defaults
-to 256 so only the first 256 matching phrases in the document scored
-considered.  You can raise the limit with the `phrase_limit` parameter but
-keep in mind that scoring more phrases consumes more time and memory.
-
-If using `matched_fields` keep in mind that `phrase_limit` phrases per
-matched field are considered.
-
-[float]
 [[explicit-field-order]]
-=== Field Highlight Order
-Elasticsearch highlights the fields in the order that they are sent.  Per the
-json spec objects are unordered but if you need to be explicit about the order
-that fields are highlighted then you can use an array for `fields` like this:
+===== Explicitly ordering highlighted fields
+Elasticsearch highlights the fields in the order that they are sent, but per the
+JSON spec, objects are unordered.  If you need to be explicit about the order
+in which fields are highlighted, specify the `fields` as an array:
+
 [source,js]
 --------------------------------------------------
 GET /_search
@@ -862,4 +886,4 @@ GET /_search
 // TEST[setup:twitter]

 None of the highlighters built into Elasticsearch care about the order that the
-fields are highlighted but a plugin may.
+fields are highlighted but a plugin might.