Commit Graph

55 Commits

Author SHA1 Message Date
Mayya Sharipova cbd271e497
Limit the analyzed text for highlighting (#27934)
* Limit the analyzed text for highlighting

- Introduce index level settings to control the max number of character
to be analyzed for highlighting
- Throw an error if analysis is required on a larger text

Closes #27517
2017-12-21 10:19:58 -05:00
Adrien Grand 1b660821a2
Allow `_doc` as a type. (#27816)
Allowing `_doc` as a type will enable users to make the transition to 7.0
smoother since the index APIs will be `PUT index/_doc/id` and `POST index/_doc`.
This also moves most of the documentation to `_doc` as a type name.

Closes #27750
Closes #27751
2017-12-14 17:47:53 +01:00
Jim Ferenczi 401f4ba2ce Fix percolator highlight sub fetch phase to not highlight query twice (#26622)
* Fix percolator highlight sub fetch phase to not highlight query twice

The PercolatorHighlightSubFetchPhase does not override hitExecute and since it extends HighlightPhase the search hits
are highlighted twice (by the highlight phase and then by the percolator). This does not alter the results, the second highlighting
just overrides the first one but this slow down the request because it duplicates the work.
2017-09-14 09:31:14 +02:00
Jim Ferenczi 86d97971a4 Remove the _all metadata field (#26356)
* Remove the _all metadata field

This change removes the `_all` metadata field. This field is deprecated in 6
and cannot be activated for indices created in 6 so it can be safely removed in
the next major version (e.g. 7).
2017-08-28 17:43:59 +02:00
Jim Ferenczi fe383b7c27 More clarifications on the unified highlighter being the new default (#25668)
* More clarifications on the unified highlighter being the new default
2017-07-13 15:38:58 +02:00
Deb Adair ded9f55263 [DOCS] Incorporated feedback on the highlighting changes. 2017-07-12 16:36:33 -07:00
Deb Adair b5e81132cf [DOCS] Reorganized the highlighting topic so it's less confusing. 2017-07-11 21:16:14 -07:00
Clinton Gormley 0170e0e8d3 Remove usage of multi-types from the docs and added a page explaining type removal (#25543)
Closes #25401
2017-07-05 12:30:19 +02:00
Adrien Grand 0c117145f6 Upgrade to lucene-7.0.0-snapshot-92b1783. (#25222)
This snapshot has faster range queries on range fields (LUCENE-7828), more
accurate norms (LUCENE-7730) and the ability to use fake term frequencies
(LUCENE-7854).
2017-06-15 09:52:07 +02:00
Jim Ferenczi 5e8b569255 fix highlighting docs 2017-06-09 14:42:08 +02:00
Jim Ferenczi 8250aa4267 Remove the postings highlighter and make unified the default highlighter choice (#25028)
This change removes the `postings` highlighter. This highlighter has been removed from Lucene master (7.x) because it behaves
exactly like the `unified` highlighter when index_options is set to `offsets`:
https://issues.apache.org/jira/browse/LUCENE-7815

It also makes the `unified` highlighter the default choice for highlighting a field (if `type` is not provided).
The strategy used internally by this highlighter remain the same as before, it checks `term_vectors` first, then `postings` and ultimately it re-analyzes the text.
Ultimately it rewrites the docs so that the options that the `unified` highlighter cannot handle are clearly marked as such.
There are few features that the `unified` highlighter is not able to handle which is why the other highlighters (`plain` and `fvh`) are still available.
I'll open separate issues for these features and we'll deprecate the `fvh` and `plain` highlighters when full support for these features have been added to the `unified`.
2017-06-09 14:09:57 +02:00
Suhas Karanth f97d8bc78d Update reference docs for Highlighter fragmenter (#23754)
Explain the fragmenter and add examples.
2017-04-17 14:00:24 -04:00
Nik Everett 048191ceb6 CONSOLEify highlighting a function_score docs
Converts many of the partial examples into full search requests.

Relates #18160
2017-04-06 08:13:56 -04:00
Jim Ferenczi b8c352fc3f Add support for fragment_length in the unified highlighter (#23431)
* Add support for fragment_length in the unified highlighter

This commit introduce a new break iterator (a BoundedBreakIterator) designed for the unified highlighter
 that is able to limit the size of fragments produced by generic break iterator like `sentence`.
The `unified` highlighter now supports `boundary_scanner` which can `words` or `sentence`.
The `sentence` mode will use the bounded break iterator in order to limit the size of the sentence to `fragment_length`.
When sentences bigger than `fragment_length` are produced, this mode will break the sentence at the next word boundary **after**
 `fragment_length` is reached.
2017-03-17 18:10:13 +01:00
Shai Erera eeac6d27f2 Add BreakIteratorBoundaryScanner support for FVH (#23248)
This commit adds a boundary_scanner property to the search highlight
request so the user can specify different boundary scanners:

* `chars` (default,  current behavior)
* `word` Use a WordBreakIterator
* `sentence` Use a SentenceBreakIterator

This commit also adds "boundary_scanner_locale" to define which locale
should be used when scanning the text.
2017-02-23 23:32:22 +01:00
Jim Ferenczi f6d38d480a Integrate UnifiedHighlighter (#21621)
* Integrate UnifiedHighlighter

This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES.
This highlighter can extract offsets from either postings, term vectors, or via re-analyzing text.
The best strategy is picked automatically at query time and depends on the field and the query to highlight.
2017-01-31 19:06:03 +01:00
Clinton Gormley 401438819e Docs: Fix the first highlighting example to work
Closes #22642
2017-01-17 12:20:03 +01:00
Joshua Rich e06a40ccbd [DOCS] Use a better name for fields in examples to avoid ambiguity
Previously, this doc was using a field called "content". This is
confusing, especially when the doc starts talking about the content of
the content field.  This change makes the field name "comment" which
is less ambiguous and also changes some related field names in the doc
to make a consistent example theme of editing docs around blog posts.
2016-10-07 14:46:55 +11:00
Nik Everett 1e587406d8 Fail yaml tests and docs snippets that get unexpected warnings
Adds `warnings` syntax to the yaml test that allows you to expect
a `Warning` header that looks like:
```
    - do:
        warnings:
            - '[index] is deprecated'
            - quotes are not required because yaml
            - but this argument is always a list, never a single string
            - no matter how many warnings you expect
        get:
            index:    test
            type:    test
            id:        1
```

These are accessible from the docs with:
```
// TEST[warning:some warning]
```

This should help to force you to update the docs if you deprecate
something. You *must* add the warnings marker to the docs or the build
will fail. While you are there you *should* update the docs to add
deprecation warnings visible in the rendered results.
2016-08-04 15:23:05 -04:00
Jim Ferenczi afe99fcdcd Restore reverted change now that alpha4 is out:
Rename `fields` to `stored_fields` and add `docvalue_fields`

`stored_fields` parameter will no longer try to retrieve fields from the _source but will only return stored fields.
`fields` will throw an exception if the user uses it.
Add `docvalue_fields` as an adjunct to `fielddata_fields` which is deprecated. `docvalue_fields` will try to load the value from the docvalue and fallback to fielddata cache if docvalues are not enabled on that field.

Closes #18943
2016-07-04 10:39:49 +02:00
Jim Ferenczi eb1e231a63 Revert "Rename `fields` to `stored_fields` and add `docvalue_fields`"
This reverts commit 2f46f53dc8.
2016-06-27 17:20:32 +02:00
Jim Ferenczi 2f46f53dc8 Rename `fields` to `stored_fields` and add `docvalue_fields`
`stored_fields` parameter will no longer try to retrieve fields from the _source but will only return stored fields.
`fields` will throw an exception if the user uses it.
Add `docvalue_fields` as an adjunct to `fielddata_fields` which is deprecated. `docvalue_fields` will try to load the value from the docvalue and fallback to fielddata cache if docvalues are not enabled on that field.

Closes #18943
2016-06-22 17:38:30 +02:00
Isabel Drost-Fromm 394a60f3fd Switch to more match query for better illustration 2016-05-18 15:52:08 +02:00
Isabel Drost-Fromm 5753bcca83 Add Console to highlighting docs
... in order to execute the snippets through rest tests.

Relates to #18160
2016-05-17 21:00:15 +02:00
Britta Weber ddebbb9536 add string to documentation 2016-05-06 16:50:34 +02:00
Britta Weber d3c5f865be Exclude all but string fields from highlighting if wildcards are used in fieldname
We should prevent highlighting if a field is anything but a text or keyword field.
However, someone might implement a custom field type that has text and still want to
highlight on that. We cannot know in advance if the highlighter will be able to
highlight such a field and so we do the following:
If the field is only highlighted because the field matches a wildcard we assume
it was a mistake and do not process it.
If the field was explicitly given we assume that whoever issued the query knew
what they were doing and try to highlight anyway.

closes #17537
2016-05-06 13:41:16 +02:00
Jeff ba34faa1ef Call out where we are making a setting change.
IMHO the original text here was incomplete. Adding the simple words 'in the index mapping' makes this sentence more clear. Perhaps a be more clear to make this a link.
2016-04-05 13:51:31 -06:00
Clinton Gormley 380ecd7604 Merge pull request #16777 from radar/patch-1
Add example for require_field_match to highlighting docs
2016-02-28 21:33:47 +01:00
javanna 49f5757ae2 Remove support for multiple highlighter names
The only way to refer to the plain highlighter is now `plain`, the only way to refer to the fast vector highlighter is `fvh` and the only way to refer to the postings highlighter is `postings`. The name variants like `highlighter`, `postings-highlighter` and `fast-vector-highlighter` have been removed.
2015-10-28 10:50:29 +01:00
markharwood 52fb3c3a09 Docs fix- added performance note about plain highlighter
Closes #11442
2015-07-15 14:28:28 +01:00
Clinton Gormley f19a748d3c Docs: Move field highlight order to the highlight page 2015-06-26 17:36:48 +02:00
javanna a843008b17 Highlighting: require_field_match set to true by default
The default `false` for `require_field_match` is a bit odd and confusing for users, given that field names get ignored by default and every field gets highlighted if it contains terms extracted out of the query, regardless of which fields were queries. Changed the default to `true`, it can always be changed per request.

Closes #10627
Closes #11067
2015-05-15 21:38:45 +02:00
javanna 46c521f7ec Highlighting: nuke XPostingsHighlighter
Our own fork of the lucene PostingsHighlighter is not easy to maintain and doesn't give us any added value at this point. In particular, it was introduced to support the require_field_match option and discrete per value highlighting, used in case one wants to highlight the whole content of a field, but get back one snippet per value. These two features won't
 make it into lucene as they slow things down and shouldn't have been supported from day one on our end probably.

One other customization we had was support for a wider range of queries via custom rewrite etc. (yet another way to slow
 things down), which got added to lucene and works much much better than what we used to do (instead of or rewrite, term
s are pulled out of the automata for multi term queries).

Removing our fork means the following in terms of features:
- dropped support for require_field_match: the postings highlighter will only highlight fields that were queried
- some custom es queries won't be supported anymore, meaning they won't be highlighted. The only one I found up until now is the phrase_prefix. Postings highlighter rewrites against an empty reader to avoid slow operations (like the ones that we were performing with the fork that we are removing here), thus the prefix will not be expanded to any term. What the postings highlighter does instead is pulling the automata out of multi term queries, but this is not supported at the moment with our MultiPhrasePrefixQuery.

Closes #10625
Closes #11077
2015-05-15 20:41:33 +02:00
Clinton Gormley 6a180d1803 Docs: Update highlighting.asciidoc
Added note about how to highlight on the `_all` field

Closes #7991
2014-10-15 13:45:56 +02:00
Clinton Gormley cb00d4a542 Docs: Removed all the added/deprecated tags from 1.x 2014-09-26 21:04:42 +02:00
Clinton Gormley 64a4acc49b Docs: Added IDs to the highlighters for linking 2014-06-22 16:46:42 +02:00
Nik Everett 3573822b7e Highlight fields in request order
Because json objects are unordered this also adds an explicit order syntax
that looks like
    "highlight": {
        "fields": [
            {"title":{ /*params*/ }},
            {"text":{ /*params*/ }}
        ]
    }

This is not useful for any of the builtin highlighters but will be useful
in plugins.

Closes #4649
2014-05-22 16:44:14 +02:00
Hannes Korte c11293ad78 Fix some typos in documentation. 2014-03-31 13:48:17 +02:00
Luca Cavanna 4e6610a798 Fixed multi term queries support in postings highlighter for non top-level queries
In #4052 we added support for highlighting multi term queries using the postings highlighter. That worked only for top-level queries though, and not for multi term queries that are nested for instance within a bool query, or filtered query, or a constant score query.

The way we make this work is by walking the query structure and temporarily overriding the query rewrite method with a method that allows for multi terms extraction.

Closes #5102
2014-02-21 21:43:40 +01:00
Clinton Gormley 93930d6dc7 Removed 0.90.* deprecation and addition notifications
Closes #5052
2014-02-07 20:52:49 +01:00
Clinton Gormley 2e4b70d40f [DOCS] Fixed duplicate ID in highlighting 2014-01-09 00:37:18 +01:00
Nik Everett 8bd9e34e39 Stop FVH from throwing away some query boosts
The FVH was throwing away some boosts on queries stopping a number of
ways to boost phrase matches to the top of the list of fragments from
working.

The plain highlighter also doesn't work for this but that is because it
doesn't support the concept of the same term having a different score at
different positions.

Also update documentation claiming that FHV is nicer for weighing terms
found by query combinations.

Closes #4351
2014-01-08 11:51:48 +01:00
Nik Everett 522d620eb6 Use FHV's phraseLimit
This prevents poisoning the FVH with documents that contain TONS of matches
which take tons of memory and time to highlight.

Closes #4645
2014-01-08 11:27:58 +01:00
Martijn van Groningen 10e2528cce Added the `force_source` option to highlighting that enforces to use of the _source even if there are stored fields.
The percolator uses this option to deal with the fact that the MemoryIndex doesn't support stored fields,
this is possible b/c the _source of the document being percolated is always present.

Closes #4348
2013-12-13 13:39:53 +01:00
Nik Everett 8e34057bc0 Add support for combining fields to the FVH
The Fast Vector Highlighter can combine matches on multiple fields to
highlight a single field using `matched_fields`.  This is most
intuitive for multifields that analyze the same string in different
ways.  Example:
{
    "query": {
        "query_string": {
            "query": "content.plain:running scissors",
            "fields": ["content"]
        }
    },
    "highlight": {
        "order": "score",
        "fields": {
            "content": {
                "matched_fields": ["content", "content.plain"],
                "type" : "fvh"
            }
        }
    }
}

Closes #3750
2013-12-03 11:10:01 +01:00
Clinton Gormley 3465e69e83 [DOCS] Changed all store:yes/no to store:true/false
which is how this setting is stored internally
2013-11-07 16:57:18 +01:00
Luca Cavanna 48ac9747a8 Added third highlighter type based on lucene postings highlighter
Requires field index_options set to "offsets" in order to store positions and offsets in the postings list.
Considerably faster than the plain highlighter since it doesn't require to reanalyze the text to be highlighted: the larger the documents the better the performance gain should be.
Requires less disk space than term_vectors, needed for the fast_vector_highlighter.
Breaks the text into sentences and highlights them. Uses a BreakIterator to find sentences in the text. Plays really well with natural text, not quite the same if the text contains html markup for instance.
Treats the document as the whole corpus, and scores individual sentences as if they were documents in this corpus, using the BM25 algorithm.

Uses forked version of lucene postings highlighter to support:
- per value discrete highlighting for fields that have multiple values, needed when number_of_fragments=0 since we want to return a snippet per value
- manually passing in query terms to avoid calling extract terms multiple times, since we use a different highlighter instance per doc/field, but the query is always the same

The lucene postings highlighter api is  quite different compared to the existing highlighters api, the main difference being that it allows to highlight multiple fields in multiple docs with a single call, ensuring sequential IO.
The way it is introduced in elasticsearch in this first round is a compromise trying not to change the current highlight api, which works per document, per field. The main disadvantage is that we lose the sequential IO, but we can always refactor the highlight api to work with multiple documents.

Supports pre_tag, post_tag, number_of_fragments (0 highlights the whole field), require_field_match, no_match_size, order by score and html encoding.

Closes #3704
2013-10-24 23:38:00 +02:00
Luca Cavanna e981e411d7 [DOCS] rephrased docs for highlight no_match_size parameter
(removed 0.90.6 coming tag as it's needed only in 0.90 branch)
2013-10-24 14:38:32 +02:00
Nik Everett 14a709f563 Highlighting can return excerpt with no highlights
You can configure the highlighting api to return an excerpt of a field
even if there wasn't a match on the field.

The FVH makes excerpts from the beginning of the string to the first
boundary character after the requested length or the boundary_max_scan,
whichever comes first.  The Plain highlighter makes excerpts from the
beginning of the string to the end of the last token before the requested
length.

Closes #1171
2013-10-24 14:38:32 +02:00
Clinton Gormley b2d82d7e75 [DOCS] Reorganised the highlight_query docs and added a version flag 2013-10-18 18:03:31 +02:00