This aggregation computes unique term counts using the hyperloglog++ algorithm
which uses linear counting to estimate low cardinalities and hyperloglog on
higher cardinalities.
Since this algorithm works on hashes, it is useful for high-cardinality fields
to store the hash of values directly in the index, which is the purpose of
the new `murmur3` field type. This is less necessary on low-cardinality
string fields because the aggregator is smart enough to only compute the hash
once per unique value per segment thanks to ordinals, or on numeric fields
since hashing them is very fast.
Close#5426
================
This commit extends the `CompletionSuggester` by context
informations. In example such a context informations can
be a simple string representing a category reducing the
suggestions in order to this category.
Three base implementations of these context informations
have been setup in this commit.
- a Category Context
- a Geo Context
All the mapping for these context informations are
specified within a context field in the completion
field that should use this kind of information.
Supports sorting on sub-aggs down the current hierarchy. This is supported as long as the aggregation in the specified order path are of a single-bucket type, where the last aggregation in the path points to either a single-bucket aggregation or a metrics one. If it's a single-bucket aggregation, the sort will be applied on the document count in the bucket (i.e. doc_count), and if it is a metrics type, the sort will be applied on the pointed out metric (in case of a single-metric aggregations, such as avg, the sort will be applied on the single metric value)
NOTE: this commit adds a constraint on what should be considered a valid aggregation name. Aggregations names must be alpha-numeric and may contain '-' and '_'.
Closes#5253
In #4052 we added support for highlighting multi term queries using the postings highlighter. That worked only for top-level queries though, and not for multi term queries that are nested for instance within a bool query, or filtered query, or a constant score query.
The way we make this work is by walking the query structure and temporarily overriding the query rewrite method with a method that allows for multi terms extraction.
Closes#5102
* Mostly minor things like typos and grammar stuff
* Some clarifications
* The note on the deprecation was ambiguous. I've removed the problematic part so that it now definitely says it's deprecated
Detects if rescores arrive as an array instead of a plain object. If so
then parse each element of the array as a separate rescore to be executed
one after another. It looks like this:
"rescore" : [ {
"window_size" : 100,
"query" : {
"rescore_query" : {
"match" : {
"field1" : {
"query" : "the quick brown",
"type" : "phrase",
"slop" : 2
}
}
},
"query_weight" : 0.7,
"rescore_query_weight" : 1.2
}
}, {
"window_size" : 10,
"query" : {
"score_mode": "multiply",
"rescore_query" : {
"function_score" : {
"script_score": {
"script": "log10(doc['numeric'].value + 2)"
}
}
}
}
} ]
Rescores as a single object are still supported.
Closes#4748
Terms aggregations return up to `size` terms, so up to now, the way to get all
matching terms back was to set `size` to an arbitrary high number that would be
larger than the number of unique terms.
Terms aggregators already made sure to not allocate memory based on the `size`
parameter so this commit mostly consists in making `0` an alias for the
maximum integer value in the TermsParser.
Close#4837
Adds a new FetchSubPhase, FieldDataFieldsFetchSubPhase, which loads the
field data cache for a field and returns an array of values for the
field.
Also removes `doc['<field>']` and `_source.<field>` workaround no longer
needed in field name resolving.
Closes#4492
* Make it clearer that `aggs` is an allowed synomym
for the `aggregations` key
* Fix broken example in for datehistogram, `1.5M` is
not an allowed interval
* Make use of colon before examples consistent
* Fix typos
- Removed "ok": true from response examples
- Added "created" flag to index response examples
- Replaced exists flag with found in delete response examples
Java Builder apis drop old “len” methods in favour of new “length”
Rest APIs support both old “len: and new “length” forms using new ParseField class to a) provide compiler-checked consistency between Builder and Parser classes and
b) a common means of handling deprecated syntax in the DSL.
Documentation and rest specs only document the new “*length” forms
Closes#4083
`min_doc_count` is the minimum number of hits that a term or histogram key
should match in order to appear in the response.
`min_doc_count=0` replaces `compute_empty_buckets` for histograms and will
behave exactly like facets' `all_terms=true` for terms aggregations.
Close#4662
When upgrading to ES 1.0 the existing mappings with a multi-field type automatically get replaced to a core field with the new `fields` option.
If a `multi_field` type-ed field doesn't have a main / default field, a default field will be chosen for the multi fields syntax. The new main field type
will be equal to the first `multi_field` fields' field or type string if no fields have been configured for the `multi_field` field and in both cases
the default index will not be indexed (`index=no` is set on the default field).
If a `multi_field` typed field has a default field, that field will replace the `multi_field` typed field.
Closes to #4521
============
The default unit for measuring distances is *MILES* in most cases. This commit moves ES
over to the *International System of Units* and make it work on a default which relates
to *METERS* . Also the current structures of the `GeoBoundingBox Filter` changed in
order to define the *Bounding* by setting abitrary corners.
Distances
---------
Since the default unit for measuring distances has changed to a default unit
`DistanceUnit.DEFAULT` relating to *meters*, the **REST API** has changed at the
following places:
* `ScriptDocValues.factorDistance()` returns *meters* instead of *miles*
* `ScriptDocValues.factorDistanceWithDefault()` returns *meters* instead of *miles*
* `ScriptDocValues.arcDistance()` returns *meters* instead of *miles*
one might use `ScriptDocValues.arcDistanceInMiles()`
* `ScriptDocValues.arcDistanceWithDefault()` returns *meters* instead of *miles*
* `ScriptDocValues.distance()` returns *meters* instead of *miles*
one might use `ScriptDocValues.distanceInMiles()`
* `ScriptDocValues.distanceWithDefault()` returns *meters* instead of *miles*
one might use `ScriptDocValues.distanceInMilesWithDefault()`
* `GeoDistanceFilter` default unit changes from *kilometers* to *meters*
* `GeoDistanceRangeFilter` default unit changes from *miles* to *meters*
* `GeoDistanceFacet` default unit changes from *miles* to *meters*
Geo Bounding Box Filter
-----------------------
The naming of the GeoBoundingBoxFilter properties allows to set arbitrary corners
(see #4084) namely `top_right`, `top_left`, `bottom_right` and `bottom_left`. This
change also includes the fields `topRight` and `bottomLeft` Also it is be possible to
set the single values by using just `top`, `bottom`, `left` and `right` parameters.
Closes#4515, #4084
A lot of different API's currently use different names for the
same logical parameter. Since lucene moved away from the notion
of a `similarity` and now uses an `fuzziness` we should generalize
this and encapsulate the generation, parsing and creation of these
settings across all queries.
This commit adds a new `Fuzziness` class that handles the renaming
and generalization in a backwards compatible manner.
This commit also added a ParseField class to better support deprecated
Query DSL parameters
The ParseField class allows specifying parameger that have been deprecated.
Those parameters can be more easily tracked and removed in future version.
This also allows to run queries in `strict` mode per index to throw
exceptions if a query is executed with deprected keys.
Closes#4082
The FVH was throwing away some boosts on queries stopping a number of
ways to boost phrase matches to the top of the list of fragments from
working.
The plain highlighter also doesn't work for this but that is because it
doesn't support the concept of the same term having a different score at
different positions.
Also update documentation claiming that FHV is nicer for weighing terms
found by query combinations.
Closes#4351
Added a long-based representation of GeoHashes to GeoHashUtils for fast evaluation in aggregations.
The new BucketUtils provides a common heuristic for determining the number of results to obtain from each shard in "top N" type requests.
* Clean up s/ElasticSearch/Elasticsearch on docs/*
* Clean up s/ElasticSearch/Elasticsearch on src/* bin/* & pom.xml
* Clean up s/ElasticSearch/Elasticsearch on NOTICE.txt and README.textile
Closes#4634
* `ignore_unavailable` - Controls whether to ignore if any specified indices are unavailable, this includes indices that don't exist or closed indices. Either `true` or `false` can be specified.
* `allow_no_indices` - Controls whether to fail if a wildcard indices expressions results into no concrete indices. Either `true` or `false` can be specified. For example if the wildcard expression `foo*` is specified and no indices are available that start with `foo` then depending on this setting the request will fail. This setting is also applicable when `_all`, `*` or no index has been specified.
* `expand_wildcards` - Controls to what kind of concrete indices wildcard indices expression expand to. If `open` is specified then the wildcard expression if expanded to only open indices and if `closed` is specified then the wildcard expression if expanded only to closed indices. Also both values (`open,closed`) can be specified to expand to all indices.
Closes to #4436
When the ValuesSource has ordinals, terms ordinals are used as a cache key to
bucket ordinals. This can make terms aggregations on String terms significantly
faster.
Close#4350
The percolator uses this option to deal with the fact that the MemoryIndex doesn't support stored fields,
this is possible b/c the _source of the document being percolated is always present.
Closes#4348
This contribution is based on the feedback given in issue #4254 and
issue #4255, and should clear things up, when suggestions are being
removed and not displayed anymore after deletion of data.
The Fast Vector Highlighter can combine matches on multiple fields to
highlight a single field using `matched_fields`. This is most
intuitive for multifields that analyze the same string in different
ways. Example:
{
"query": {
"query_string": {
"query": "content.plain:running scissors",
"fields": ["content"]
}
},
"highlight": {
"order": "score",
"fields": {
"content": {
"matched_fields": ["content", "content.plain"],
"type" : "fvh"
}
}
}
}
Closes#3750
* Minor alignments (like setter to ctor)
* FuzzySuggester has a unicode aware flag, which is not exposed in the fuzzy completion request parameters
* Made XAnalyzingSuggester flags (PAYLOAD_SEP, END_BYTE, SEP_LABEL) to be written into the postings format, so we can retain backwards compatibility
* The above change also implies, that these flags can be set per instantiated XAnalyzingSuggester
* CompletionPostingsFormatTest now uses a randomProvider for writing data to check for bwc
Use .percolator as the internal (hidden) type name for percolators within the index. Seems nicer name to represent "hidden" types within an index.
closes#4090
Requires field index_options set to "offsets" in order to store positions and offsets in the postings list.
Considerably faster than the plain highlighter since it doesn't require to reanalyze the text to be highlighted: the larger the documents the better the performance gain should be.
Requires less disk space than term_vectors, needed for the fast_vector_highlighter.
Breaks the text into sentences and highlights them. Uses a BreakIterator to find sentences in the text. Plays really well with natural text, not quite the same if the text contains html markup for instance.
Treats the document as the whole corpus, and scores individual sentences as if they were documents in this corpus, using the BM25 algorithm.
Uses forked version of lucene postings highlighter to support:
- per value discrete highlighting for fields that have multiple values, needed when number_of_fragments=0 since we want to return a snippet per value
- manually passing in query terms to avoid calling extract terms multiple times, since we use a different highlighter instance per doc/field, but the query is always the same
The lucene postings highlighter api is quite different compared to the existing highlighters api, the main difference being that it allows to highlight multiple fields in multiple docs with a single call, ensuring sequential IO.
The way it is introduced in elasticsearch in this first round is a compromise trying not to change the current highlight api, which works per document, per field. The main disadvantage is that we lose the sequential IO, but we can always refactor the highlight api to work with multiple documents.
Supports pre_tag, post_tag, number_of_fragments (0 highlights the whole field), require_field_match, no_match_size, order by score and html encoding.
Closes#3704
You can configure the highlighting api to return an excerpt of a field
even if there wasn't a match on the field.
The FVH makes excerpts from the beginning of the string to the first
boundary character after the requested length or the boundary_max_scan,
whichever comes first. The Plain highlighter makes excerpts from the
beginning of the string to the end of the last token before the requested
length.
Closes#1171
The clear scroll api allows clear all resources associated with a `scroll_id` by deleting the `scroll_id` and its associated SearchContext.
Closes#3657
Restrict the size of the input length to a reasonable size otherwise very
long strings can cause StackOverflowExceptions deep down in lucene land.
Yet, this is simply a saftly limit set to `50` UTF-16 codepoints by default.
This limit is only present at index time and not at query time. If prefix
completions > 50 UTF-16 codepoints are expected / desired this limit should be raised.
Critical string sizes are beyone the 1k UTF-16 Codepoints limit.
Closes#3596