OpenSearch/docs/reference/search/request/highlighting.asciidoc

209 lines
6.5 KiB
Plaintext

[[search-request-highlighting]]
=== Highlighting
Allows to highlight search results on one or more fields. The
implementation uses either the lucene `fast-vector-highlighter` or
`highlighter`. The search request body:
[source,js]
--------------------------------------------------
{
"query" : {...},
"highlight" : {
"fields" : {
"content" : {}
}
}
}
--------------------------------------------------
In the above case, the `content` field will be highlighted for each
search hit (there will be another element in each search hit, called
`highlight`, which includes the highlighted fields and the highlighted
fragments).
In order to perform highlighting, the actual content of the field is
required. If the field in question is stored (has `store` set to `yes`
in the mapping), it will be used, otherwise, the actual `_source` will
be loaded and the relevant field will be extracted from it.
If `term_vector` information is provided by setting `term_vector` to
`with_positions_offsets` in the mapping then the fast vector
highlighter will be used instead of the plain highlighter. The fast vector highlighter:
* Is faster especially for large fields (> `1MB`)
* Can be customized with `boundary_chars`, `boundary_max_scan`, and
`fragment_offset` (see below)
* Requires setting `term_vector` to `with_positions_offsets` which
increases the size of the index
Here is an example of setting the `content` field to allow for
highlighting using the fast vector highlighter on it (this will cause
the index to be bigger):
[source,js]
--------------------------------------------------
{
"type_name" : {
"content" : {"term_vector" : "with_positions_offsets"}
}
}
--------------------------------------------------
The field name supports wildcard notation, for example,
using `comment_*` which will cause all fields that match the expression
to be highlighted.
==== Highlighting Tags
By default, the highlighting will wrap highlighted text in `<em>` and
`</em>`. This can be controlled by setting `pre_tags` and `post_tags`,
for example:
[source,js]
--------------------------------------------------
{
"query" : {...},
"highlight" : {
"pre_tags" : ["<tag1>", "<tag2>"],
"post_tags" : ["</tag1>", "</tag2>"],
"fields" : {
"_all" : {}
}
}
}
--------------------------------------------------
There can be a single tag or more, and the "importance" is ordered.
There are also built in "tag" schemas, with currently a single schema
called `styled` with `pre_tags` of:
[source,js]
--------------------------------------------------
<em class="hlt1">, <em class="hlt2">, <em class="hlt3">,
<em class="hlt4">, <em class="hlt5">, <em class="hlt6">,
<em class="hlt7">, <em class="hlt8">, <em class="hlt9">,
<em class="hlt10">
--------------------------------------------------
And post tag of `</em>`. If you think of more nice to have built in tag
schemas, just send an email to the mailing list or open an issue. Here
is an example of switching tag schemas:
[source,js]
--------------------------------------------------
{
"query" : {...},
"highlight" : {
"tags_schema" : "styled",
"fields" : {
"content" : {}
}
}
}
--------------------------------------------------
An `encoder` parameter can be used to define how highlighted text will
be encoded. It can be either `default` (no encoding) or `html` (will
escape html, if you use html highlighting tags).
==== Highlighted Fragments
Each field highlighted can control the size of the highlighted fragment
in characters (defaults to `100`), and the maximum number of fragments
to return (defaults to `5`). For example:
[source,js]
--------------------------------------------------
{
"query" : {...},
"highlight" : {
"fields" : {
"content" : {"fragment_size" : 150, "number_of_fragments" : 3}
}
}
}
--------------------------------------------------
On top of this it is possible to specify that highlighted fragments are
order by score:
[source,js]
--------------------------------------------------
{
"query" : {...},
"highlight" : {
"order" : "score",
"fields" : {
"content" : {"fragment_size" : 150, "number_of_fragments" : 3}
}
}
}
--------------------------------------------------
Note the score of text fragment in this case is calculated by Lucene
highlighting framework. For implementation details you can check
`ScoreOrderFragmentsBuilder.java` class.
If the `number_of_fragments` value is set to 0 then no fragments are
produced, instead the whole content of the field is returned, and of
course it is highlighted. This can be very handy if short texts (like
document title or address) need to be highlighted but no fragmentation
is required. Note that `fragment_size` is ignored in this case.
[source,js]
--------------------------------------------------
{
"query" : {...},
"highlight" : {
"fields" : {
"_all" : {},
"bio.title" : {"number_of_fragments" : 0}
}
}
}
--------------------------------------------------
When using `fast-vector-highlighter` one can use `fragment_offset`
parameter to control the margin to start highlighting from.
==== Global Settings
Highlighting settings can be set on a global level and then overridden
at the field level.
[source,js]
--------------------------------------------------
{
"query" : {...},
"highlight" : {
"number_of_fragments" : 3,
"fragment_size" : 150,
"tag_schema" : "styled",
"fields" : {
"_all" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] },
"bio.title" : { "number_of_fragments" : 0 },
"bio.author" : { "number_of_fragments" : 0 },
"bio.content" : { "number_of_fragments" : 5, "order" : "score" }
}
}
}
--------------------------------------------------
==== Require Field Match
`require_field_match` can be set to `true` which will cause a field to
be highlighted only if a query matched that field. `false` means that
terms are highlighted on all requested fields regardless if the query
matches specifically on them.
==== Boundary Characters
When highlighting a field that is mapped with term vectors,
`boundary_chars` can be configured to define what constitutes a boundary
for highlighting. It's a single string with each boundary character
defined in it. It defaults to `.,!? \t\n`.
The `boundary_max_scan` allows to control how far to look for boundary
characters, and defaults to `20`.