[DOCS] Incorporated feedback on the highlighting changes.

Deb Adair 2017-07-12 16:36:07 -07:00
parent 70b2897bdf
commit ded9f55263
1 changed files with 406 additions and 366 deletions


@ -35,20 +35,24 @@ GET /_search
// CONSOLE
// TEST[setup:twitter]
{es} supports three highlighters:
{es} supports three highlighters: `unified`, `plain`, and `fvh` (fast vector
highlighter). You can specify the highlighter `type` you want to use
for each field.
[[unified-highlighter]]
* The `unified` highlighter uses the Lucene Unified Highlighter. This
==== Unified highlighter
The `unified` highlighter uses the Lucene Unified Highlighter. This
highlighter breaks the text into sentences and uses the BM25 algorithm to score
individual sentences as if they were documents in the corpus. It also supports
accurate phrase and multi-term (fuzzy, prefix, regex) highlighting. This is the
default highlighter.
[[plain-highlighter]]
* The `plain` highlighter uses the standard Lucene highlighter. It attempts to
==== Plain highlighter
The `plain` highlighter uses the standard Lucene highlighter. It attempts to
reflect the query matching logic in terms of understanding word importance and
any word positioning criteria in phrase queries.
+
[WARNING]
The `plain` highlighter works best for highlighting simple query matches in a
single field. To accurately reflect query logic, it creates a tiny in-memory
@ -59,20 +63,23 @@ If you want to highlight a lot of fields in a lot of documents with complex
queries, we recommend using one of the other highlighters.
[[fast-vector-highlighter]]
* The `fvh` highlighter uses the Lucene Fast Vector highlighter.
==== Fast vector highlighter
The `fvh` highlighter uses the Lucene Fast Vector highlighter.
This highlighter can be used on fields with `term_vector` set to
`with_positions_offsets` in the mapping. The fast vector highlighter:
** Is faster especially for large fields (> `1MB`)
** Can be customized with a <<boundary-scanners,`boundary_scanner`>>.
** Requires setting `term_vector` to `with_positions_offsets` which
* Is faster especially for large fields (> `1MB`)
* Can be customized with a <<boundary-scanners,`boundary_scanner`>>.
* Requires setting `term_vector` to `with_positions_offsets` which
increases the size of the index
** Can combine matches from multiple fields into one result. See
* Can combine matches from multiple fields into one result. See
`matched_fields`
** Can assign different weights to matches at different positions allowing
* Can assign different weights to matches at different positions allowing
for things like phrase matches being sorted above term matches when
highlighting a Boosting Query that boosts phrase matches over term matches
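
As a minimal sketch of requesting this highlighter, the following search sets
`type` to `fvh` for a single field. The `comment` field and the query text are
illustrative assumptions, not part of a tested example.

[source,js]
--------------------------------------------------
GET /_search
{
    "query" : {
        "match": { "comment": "foo bar" }
    },
    "highlight" : {
        "fields" : {
            "comment" : { "type" : "fvh" } <1>
        }
    }
}
--------------------------------------------------
// NOTCONSOLE
<1> Assumes that `comment` is mapped with `term_vector` set to
`with_positions_offsets`.
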
[[offsets-strategy]]
==== Offsets Strategy
To create meaningful search snippets from the terms being queried,
the highlighter needs to know the start and end character offsets of each word
in the original text. These offsets can be obtained from:
@ -99,9 +106,6 @@ Lucene's query execution planner to get access to low-level match information on
the current document. This is repeated for every field and every document that
needs highlighting. The `plain` highlighter always uses plain highlighting.
You can specify the highlighter `type` you want to use
for each field.
[[highlighting-settings]]
==== Highlighting Settings
@ -118,11 +122,10 @@ boundary_scanner:: Specifies how to break the highlighted fragments: `chars`,
`sentence`, or `word`. Only valid for the `unified` and `fvh` highlighters.
Defaults to `sentence` for the `unified` highlighter. Defaults to `chars` for
the `fvh` highlighter.
+
* `chars` Use the characters specified by `boundary_chars` as highlighting
`chars`::: Use the characters specified by `boundary_chars` as highlighting
boundaries. The `boundary_max_scan` setting controls how far to scan for
boundary characters. Only valid for the `fvh` highlighter.
* `sentence` Break highlighted fragments at the next sentence boundary, as
`sentence`::: Break highlighted fragments at the next sentence boundary, as
determined by Java's
https://docs.oracle.com/javase/8/docs/api/java/text/BreakIterator.html[BreakIterator].
You can specify the locale to use with `boundary_scanner_locale`.
@ -131,7 +134,7 @@ NOTE: When used with the `unified` highlighter, the `sentence` scanner splits
sentences bigger than `fragment_size` at the first word boundary next to
`fragment_size`. You can set `fragment_size` to 0 to never split any sentence.
* `word` Break highlighted fragments at the next word boundary, as determined
`word`::: Break highlighted fragments at the next word boundary, as determined
by Java's https://docs.oracle.com/javase/8/docs/api/java/text/BreakIterator.html[BreakIterator].
You can specify the locale to use with `boundary_scanner_locale`.
@ -156,9 +159,9 @@ stored separately. Defaults to `false`.
fragmenter:: Specifies how text should be broken up in highlight
snippets: `simple` or `span`. Only valid for the `plain` highlighter.
Defaults to `span`.
+
* `simple` Breaks up text into same-sized fragments.
* `span` Breaks up text into same-sized fragments, but tries to avoid
`simple`::: Breaks up text into same-sized fragments.
`span`::: Breaks up text into same-sized fragments, but tries to avoid
breaking up text between highlighted terms. This is helpful when you're
querying for phrases. Default.
@ -207,7 +210,7 @@ Defaults to 256.
pre_tags:: Use in conjunction with `post_tags` to define the HTML tags
to use for the highlighted text. By default, highlighted text is wrapped
in `<em>` and </em>` tags. Specify as an array of strings.
in `<em>` and `</em>` tags. Specify as an array of strings.
post_tags:: Use in conjunction with `pre_tags` to define the HTML tags
to use for the highlighted text. By default, highlighted text is wrapped
@ -229,7 +232,6 @@ schema defines the following `pre_tags` and defines `post_tags` as
<em class="hlt10">
--------------------------------------------------
[[highlighter-type]]
type:: The highlighter to use: `unified`, `plain`, or `fvh`. Defaults to
`unified`.
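
To illustrate how several of these settings can be combined, here is a sketch
of a request that asks the `unified` highlighter to break fragments at sentence
boundaries for a specific locale. The `comment` field, the query text, and the
`en-US` locale are assumptions chosen for illustration.

[source,js]
--------------------------------------------------
GET /_search
{
    "query" : {
        "match": { "comment": "foo bar" }
    },
    "highlight" : {
        "fields" : {
            "comment" : {
                "type" : "unified",
                "boundary_scanner" : "sentence",
                "boundary_scanner_locale" : "en-US", <1>
                "fragment_size" : 150
            }
        }
    }
}
--------------------------------------------------
// NOTCONSOLE
<1> An assumed locale, given as a language tag. Substitute the locale of your
own text.
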
@ -237,19 +239,108 @@ type:: The highlighter to use: `unified`, `plain`, or `fvh`. Defaults to
[[highlighting-examples]]
==== Highlighting Examples
Here is an example of setting the `comment` field in the index mapping to allow for
highlighting using the postings:
* <<override-global-settings, Override global settings>>
* <<specify-highlight-query, Specify a highlight query>>
* <<set-highlighter-type, Set highlighter type>>
* <<configure-tags, Configure highlighting tags>>
* <<highlight-source, Highlight source>>
* <<highlight-all, Highlight all fields>>
* <<matched-fields, Combine matches on multiple fields>>
* <<explicit-field-order, Explicitly order highlighted fields>>
* <<control-highlighted-frags, Control highlighted fragments>>
* <<highlight-postings-list, Highlight using the postings list>>
* <<specify-fragmenter, Specify a fragmenter for the plain highlighter>>
[[override-global-settings]]
[float]
=== Override global settings
You can specify highlighter settings globally and selectively override them for
individual fields.
[source,js]
--------------------------------------------------
PUT /example
GET /_search
{
"mappings": {
"doc" : {
"properties": {
"query" : {
"match": { "user": "kimchy" }
},
"highlight" : {
"number_of_fragments" : 3,
"fragment_size" : 150,
"fields" : {
"_all" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] },
"blog.title" : { "number_of_fragments" : 0 },
"blog.author" : { "number_of_fragments" : 0 },
"blog.comment" : { "number_of_fragments" : 5, "order" : "score" }
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
[float]
[[specify-highlight-query]]
=== Specify a highlight query
You can specify a `highlight_query` to take additional information into account
when highlighting. For example, the following query includes both the search
query and rescore query in the `highlight_query`. Without the `highlight_query`,
highlighting would only take the search query into account.
[source,js]
--------------------------------------------------
GET /_search
{
"stored_fields": [ "_id" ],
"query" : {
"match": {
"comment": {
"type": "text",
"index_options" : "offsets"
"query": "foo bar"
}
}
},
"rescore": {
"window_size": 50,
"query": {
"rescore_query" : {
"match_phrase": {
"comment": {
"query": "foo bar",
"slop": 1
}
}
},
"rescore_query_weight" : 10
}
},
"highlight" : {
"order" : "score",
"fields" : {
"comment" : {
"fragment_size" : 150,
"number_of_fragments" : 3,
"highlight_query": {
"bool": {
"must": {
"match": {
"comment": {
"query": "foo bar"
}
}
},
"should": {
"match_phrase": {
"comment": {
"query": "foo bar",
"slop": 1,
"boost": 10.0
}
}
},
"minimum_should_match": 0
}
}
}
}
@ -257,30 +348,11 @@ PUT /example
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
Here is an example of setting the `comment` field to allow for
highlighting using the `term_vectors` (this will cause the index to be bigger):
[source,js]
--------------------------------------------------
PUT /example
{
"mappings": {
"doc" : {
"properties": {
"comment" : {
"type": "text",
"term_vector" : "with_positions_offsets"
}
}
}
}
}
--------------------------------------------------
// CONSOLE
===== Force highlighter type
[float]
[[set-highlighter-type]]
=== Set highlighter type
The `type` field lets you force a specific highlighter type.
The allowed values are: `unified`, `plain` and `fvh`.
@ -303,30 +375,9 @@ GET /_search
// CONSOLE
// TEST[setup:twitter]
===== Force highlighting on source
Forces the highlighting to highlight fields based on the source even if fields
are stored separately. Defaults to `false`.
[source,js]
--------------------------------------------------
GET /_search
{
"query" : {
"match": { "user": "kimchy" }
},
"highlight" : {
"fields" : {
"comment" : {"force_source" : true}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
[[tags]]
===== Configure highlighting tags
[[configure-tags]]
[float]
=== Configure highlighting tags
By default, the highlighting will wrap highlighted text in `<em>` and
`</em>`. This can be controlled by setting `pre_tags` and `post_tags`,
@ -393,13 +444,12 @@ GET /_search
// CONSOLE
// TEST[setup:twitter]
[float]
[[highlight-source]]
=== Highlight on source
===== Controlling highlighted fragments
Each field highlighted can control the size of the highlighted fragment
in characters (defaults to `100`), and the maximum number of fragments
to return (defaults to `5`).
For example:
Forces the highlighting to highlight fields based on the source even if fields
are stored separately. Defaults to `false`.
[source,js]
--------------------------------------------------
@ -410,7 +460,7 @@ GET /_search
},
"highlight" : {
"fields" : {
"comment" : {"fragment_size" : 150, "number_of_fragments" : 3}
"comment" : {"force_source" : true}
}
}
}
@ -418,294 +468,10 @@ GET /_search
// CONSOLE
// TEST[setup:twitter]
On top of this it is possible to specify that highlighted fragments need
to be sorted by score:
[source,js]
--------------------------------------------------
GET /_search
{
"query" : {
"match": { "user": "kimchy" }
},
"highlight" : {
"order" : "score",
"fields" : {
"comment" : {"fragment_size" : 150, "number_of_fragments" : 3}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
If the `number_of_fragments` value is set to `0` then no fragments are
produced, instead the whole content of the field is returned, and of
course it is highlighted. This can be very handy if short texts (like
document title or address) need to be highlighted but no fragmentation
is required. Note that `fragment_size` is ignored in this case.
[source,js]
--------------------------------------------------
GET /_search
{
"query" : {
"match": { "user": "kimchy" }
},
"highlight" : {
"fields" : {
"_all" : {},
"blog.title" : {"number_of_fragments" : 0}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
When using `fvh` one can use `fragment_offset`
parameter to control the margin to start highlighting from.
In the case where there is no matching fragment to highlight, the default is
to not return anything. Instead, we can return a snippet of text from the
beginning of the field by setting `no_match_size` (default `0`) to the length
of the text that you want returned. The actual length may be shorter or longer than
specified as it tries to break on a word boundary.
[source,js]
--------------------------------------------------
GET /_search
{
"query" : {
"match": { "user": "kimchy" }
},
"highlight" : {
"fields" : {
"comment" : {
"fragment_size" : 150,
"number_of_fragments" : 3,
"no_match_size": 150
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
===== Specifying a fragmenter for the plain highlighter
When using the `plain` highlighter, you can choose between the `simple` and
`span` fragmenters:
[source,js]
--------------------------------------------------
GET twitter/tweet/_search
{
"query" : {
"match_phrase": { "message": "number 1" }
},
"highlight" : {
"fields" : {
"message" : {
"type": "plain",
"fragment_size" : 15,
"number_of_fragments" : 3,
"fragmenter": "simple"
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
Response:
[source,js]
--------------------------------------------------
{
...
"hits": {
"total": 1,
"max_score": 1.601195,
"hits": [
{
"_index": "twitter",
"_type": "tweet",
"_id": "1",
"_score": 1.601195,
"_source": {
"user": "test",
"message": "some message with the number 1",
"date": "2009-11-15T14:12:12",
"likes": 1
},
"highlight": {
"message": [
" with the <em>number</em>",
" <em>1</em>"
]
}
}
]
}
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,/]
[source,js]
--------------------------------------------------
GET twitter/tweet/_search
{
"query" : {
"match_phrase": { "message": "number 1" }
},
"highlight" : {
"fields" : {
"message" : {
"type": "plain",
"fragment_size" : 15,
"number_of_fragments" : 3,
"fragmenter": "span"
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
Response:
[source,js]
--------------------------------------------------
{
...
"hits": {
"total": 1,
"max_score": 1.601195,
"hits": [
{
"_index": "twitter",
"_type": "tweet",
"_id": "1",
"_score": 1.601195,
"_source": {
"user": "test",
"message": "some message with the number 1",
"date": "2009-11-15T14:12:12",
"likes": 1
},
"highlight": {
"message": [
"some message with the <em>number</em> <em>1</em>"
]
}
}
]
}
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,/]
If the `number_of_fragments` option is set to `0`,
`NullFragmenter` is used which does not fragment the text at all.
This is useful for highlighting the entire contents of a document or field.
===== Specifying a highlight query
Here is an example of including both the search
query and the rescore query in `highlight_query`.
[source,js]
--------------------------------------------------
GET /_search
{
"stored_fields": [ "_id" ],
"query" : {
"match": {
"comment": {
"query": "foo bar"
}
}
},
"rescore": {
"window_size": 50,
"query": {
"rescore_query" : {
"match_phrase": {
"comment": {
"query": "foo bar",
"slop": 1
}
}
},
"rescore_query_weight" : 10
}
},
"highlight" : {
"order" : "score",
"fields" : {
"comment" : {
"fragment_size" : 150,
"number_of_fragments" : 3,
"highlight_query": {
"bool": {
"must": {
"match": {
"comment": {
"query": "foo bar"
}
}
},
"should": {
"match_phrase": {
"comment": {
"query": "foo bar",
"slop": 1,
"boost": 10.0
}
}
},
"minimum_should_match": 0
}
}
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
[[overriding-global-settings]]
===== Overriding global settings
[source,js]
--------------------------------------------------
GET /_search
{
"query" : {
"match": { "user": "kimchy" }
},
"highlight" : {
"number_of_fragments" : 3,
"fragment_size" : 150,
"fields" : {
"_all" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] },
"blog.title" : { "number_of_fragments" : 0 },
"blog.author" : { "number_of_fragments" : 0 },
"blog.comment" : { "number_of_fragments" : 5, "order" : "score" }
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
[[field-match]]
===== Highlighting in all fields
[[highlight-all]]
[float]
=== Highlight in all fields
By default, only fields that contain a query match are highlighted. Set
`require_field_match` to `false` to highlight all fields.
@ -729,7 +495,8 @@ GET /_search
// TEST[setup:twitter]
[[matched-fields]]
===== Combining matches on multiple fields
[float]
=== Combine matches on multiple fields
WARNING: This is only supported by the `fvh` highlighter
@ -865,7 +632,8 @@ to
[[explicit-field-order]]
===== Explicitly ordering highlighted fields
[float]
=== Explicitly order highlighted fields
Elasticsearch highlights the fields in the order that they are sent, but per the
JSON spec, objects are unordered. If you need to be explicit about the order
in which fields are highlighted, specify the `fields` as an array:
@ -887,3 +655,275 @@ GET /_search
None of the highlighters built into Elasticsearch care about the order in which
the fields are highlighted, but a plugin might.
[float]
[[control-highlighted-frags]]
=== Control highlighted fragments
Each field highlighted can control the size of the highlighted fragment
in characters (defaults to `100`), and the maximum number of fragments
to return (defaults to `5`).
For example:
[source,js]
--------------------------------------------------
GET /_search
{
"query" : {
"match": { "user": "kimchy" }
},
"highlight" : {
"fields" : {
"comment" : {"fragment_size" : 150, "number_of_fragments" : 3}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
On top of this it is possible to specify that highlighted fragments need
to be sorted by score:
[source,js]
--------------------------------------------------
GET /_search
{
"query" : {
"match": { "user": "kimchy" }
},
"highlight" : {
"order" : "score",
"fields" : {
"comment" : {"fragment_size" : 150, "number_of_fragments" : 3}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
If `number_of_fragments` is set to `0`, no fragments are produced. Instead,
the whole content of the field is returned and highlighted. This is handy
when short texts (like a document title or address) need to be highlighted
but no fragmentation is required. Note that `fragment_size` is ignored in
this case.
[source,js]
--------------------------------------------------
GET /_search
{
"query" : {
"match": { "user": "kimchy" }
},
"highlight" : {
"fields" : {
"_all" : {},
"blog.title" : {"number_of_fragments" : 0}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
When using the `fvh` highlighter, you can use the `fragment_offset`
parameter to control the margin from which highlighting starts.
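
A minimal sketch of `fragment_offset` in a request follows. The offset of `10`
and the `comment` field are arbitrary, illustrative choices.

[source,js]
--------------------------------------------------
GET /_search
{
    "query" : {
        "match": { "comment": "foo bar" }
    },
    "highlight" : {
        "fields" : {
            "comment" : {
                "type" : "fvh", <1>
                "fragment_size" : 150,
                "fragment_offset" : 10 <2>
            }
        }
    }
}
--------------------------------------------------
// NOTCONSOLE
<1> Assumes that `comment` is mapped with `term_vector` set to
`with_positions_offsets`, as the `fvh` highlighter requires.
<2> An assumed value. Tune it to the amount of leading margin you want.
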
In the case where there is no matching fragment to highlight, the default is
to not return anything. Instead, we can return a snippet of text from the
beginning of the field by setting `no_match_size` (default `0`) to the length
of the text that you want returned. The actual length may be shorter or longer than
specified as it tries to break on a word boundary.
[source,js]
--------------------------------------------------
GET /_search
{
"query" : {
"match": { "user": "kimchy" }
},
"highlight" : {
"fields" : {
"comment" : {
"fragment_size" : 150,
"number_of_fragments" : 3,
"no_match_size": 150
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
[float]
[[highlight-postings-list]]
=== Highlight using the postings list
Here is an example of setting the `comment` field in the index mapping to
allow for highlighting using the postings:
[source,js]
--------------------------------------------------
PUT /example
{
"mappings": {
"doc" : {
"properties": {
"comment" : {
"type": "text",
"index_options" : "offsets"
}
}
}
}
}
--------------------------------------------------
// CONSOLE
Here is an example of setting the `comment` field to allow for
highlighting using the `term_vectors` (this will cause the index to be bigger):
[source,js]
--------------------------------------------------
PUT /example
{
"mappings": {
"doc" : {
"properties": {
"comment" : {
"type": "text",
"term_vector" : "with_positions_offsets"
}
}
}
}
}
--------------------------------------------------
// CONSOLE
[float]
[[specify-fragmenter]]
=== Specify a fragmenter for the plain highlighter
When using the `plain` highlighter, you can choose between the `simple` and
`span` fragmenters:
[source,js]
--------------------------------------------------
GET twitter/tweet/_search
{
"query" : {
"match_phrase": { "message": "number 1" }
},
"highlight" : {
"fields" : {
"message" : {
"type": "plain",
"fragment_size" : 15,
"number_of_fragments" : 3,
"fragmenter": "simple"
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
Response:
[source,js]
--------------------------------------------------
{
...
"hits": {
"total": 1,
"max_score": 1.601195,
"hits": [
{
"_index": "twitter",
"_type": "tweet",
"_id": "1",
"_score": 1.601195,
"_source": {
"user": "test",
"message": "some message with the number 1",
"date": "2009-11-15T14:12:12",
"likes": 1
},
"highlight": {
"message": [
" with the <em>number</em>",
" <em>1</em>"
]
}
}
]
}
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,/]
[source,js]
--------------------------------------------------
GET twitter/tweet/_search
{
"query" : {
"match_phrase": { "message": "number 1" }
},
"highlight" : {
"fields" : {
"message" : {
"type": "plain",
"fragment_size" : 15,
"number_of_fragments" : 3,
"fragmenter": "span"
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[setup:twitter]
Response:
[source,js]
--------------------------------------------------
{
...
"hits": {
"total": 1,
"max_score": 1.601195,
"hits": [
{
"_index": "twitter",
"_type": "tweet",
"_id": "1",
"_score": 1.601195,
"_source": {
"user": "test",
"message": "some message with the number 1",
"date": "2009-11-15T14:12:12",
"likes": 1
},
"highlight": {
"message": [
"some message with the <em>number</em> <em>1</em>"
]
}
}
]
}
}
--------------------------------------------------
// TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,/]
If the `number_of_fragments` option is set to `0`,
`NullFragmenter` is used, which does not fragment the text at all.
This is useful for highlighting the entire contents of a document or field.
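
For instance, a request along the following lines would ask the `plain`
highlighter to return the `message` field without fragmenting it. This sketch
reuses the `twitter` index and `match_phrase` query from the fragmenter
examples. It has not been run as a tested snippet, so no response is shown.

[source,js]
--------------------------------------------------
GET twitter/tweet/_search
{
    "query" : {
        "match_phrase": { "message": "number 1" }
    },
    "highlight" : {
        "fields" : {
            "message" : {
                "type": "plain",
                "number_of_fragments" : 0 <1>
            }
        }
    }
}
--------------------------------------------------
// NOTCONSOLE
<1> With `0`, `fragment_size` and `fragmenter` are left out because the text
is not fragmented.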