This aggregation computes unique term counts using the hyperloglog++ algorithm
which uses linear counting to estimate low cardinalities and hyperloglog on
higher cardinalities.
Since this algorithm works on hashes, it is useful for high-cardinality fields
to store the hash of values directly in the index, which is the purpose of
the new `murmur3` field type. This is less necessary on low-cardinality
string fields because the aggregator is smart enough to only compute the hash
once per unique value per segment thanks to ordinals, or on numeric fields
since hashing them is very fast.
Close#5426
`omit_term_freq_and_positions` was deprecated in `0.20` and
is not documented anymore. We should reject indices that are
created with this option in the future.
Closes#4722
Put mapping now supports either of these formats:
POST foo/doc/_mapping
{
"doc": {
"_routing": {"required": true},
"properties": {
"body": {"type": "string"}
}
}
}
or
POST foo/doc/_mapping
{
"_routing": {"required": true},
"properties": {
"body": {"type": "string"}
}
}
Closes#4483
When upgrading to ES 1.0 the existing mappings with a multi-field type automatically get replaced to a core field with the new `fields` option.
If a `multi_field` type-ed field doesn't have a main / default field, a default field will be chosen for the multi fields syntax. The new main field type
will be equal to the first `multi_field` fields' field or type string if no fields have been configured for the `multi_field` field and in both cases
the default index will not be indexed (`index=no` is set on the default field).
If a `multi_field` typed field has a default field, that field will replace the `multi_field` typed field.
Closes to #4521
Also, fix all the problems it brought up in tests.
Removed OverrideTypeMappingTests as it is no longer relevant.
Better naming for the default percolator mapping and change it's content use _default_ as root node.
Closes#4038
This commit allows for using Lucene doc values as a backend for field data,
moving the cost of building field data from the refresh operation to indexing.
In addition, Lucene doc values can be stored on disk (partially, or even
entirely), so that memory management is done at the operating system level
(file-system cache) instead of the JVM, avoiding long pauses during major
collections due to large heaps.
So far doc values are supported on numeric types and non-analyzed strings
(index:no or index:not_analyzed). Under the hood, it uses SORTED_SET doc values
which is the only type to support multi-valued fields. Since the field data API
set is a bit wider than the doc values API set, some operations are not
supported:
- field data filtering: this will fail if doc values are enabled,
- field data cache clearing, even for memory-based doc values formats,
- getting the memory usage for a specific field,
- knowing whether a field is actually multi-valued.
This commit also allows for configuring doc-values formats on a per-field basis
similarly to postings formats. In particular the doc values format of the
_version field can be configured through its own field mapper (it used to be
handled in UidFieldMapper previously).
Closes#3806
This commit introduces near realtime suggestions. For more information about
its usage refer to github issue #3376
From the implementation point of view, a custom AnalyzingSuggester is used
in combination with a custom postingsformat (which is not exposed to the user
anywhere for him to use).
Closes#3376
This feature adds the option to configure a `PostingsFormat` and assign it to a field in the mapping. This feature is very expert and in almost all cases Elasticsearch's defaults will suite your needs.
## Configuring a postingsformat per field
There're several default postings formats configured by default which can be used in your mapping:
a* `direct` - A codec that wraps the default postings format during write time, but loads the terms and postinglists into memory directly in memory during read time as raw arrays. This postings format is exceptional memory intensive, but can give a substantial increase in search performance.
* `memory` - A codec that loads and stores terms and postinglists in memory using a FST. Acts like a cached postingslist.
* `bloom_default` - Maintains a bloom filter for the indexed terms, which is stored to disk and builds on top of the `default` postings format. This postings format is useful for low document frequency terms and offers a fail fast for seeks to terms that don't exist.
* `bloom_pulsing` - Similar to the `bloom_default` postings format, but builds on top of the `pulsing` postings format.
* `default` - The default postings format. The default if none is specified.
On all fields it possible to configure a `postings_format` attribute. Example mapping:
```
{
"person" : {
"properties" : {
"second_person_id" : {"type" : "string", "postings_format" : "pulsing"}
}
}
}
```
## Configuring a custom postingsformat
It is possible the instantiate custom postingsformats. This can be specified via the index settings.
```
{
"codec" : {
"postings_format" : {
"my_format" : {
"type" : "pulsing40"
"freq_cut_off" : "5"
}
}
}
}
```
In the above example the `freq_cut_off` is set the 5 (defaults to 1). This tells the pulsing postings format to inline the postinglist of terms with a document frequency lower or equal to 5 in the term dictionary.
Closes#2411