[[index-modules-fielddata]]
== Field data

The field data cache is used mainly when sorting on or faceting on a
field. It loads all the field values into memory in order to provide fast
document-based access to those values. The field data cache can be
expensive to build for a field, so it's recommended to have enough memory
to allocate it, and to keep it loaded.

The amount of memory used for the field data cache can be controlled using
`indices.fielddata.cache.size`. Note: reloading field data that does not fit
into your cache will be expensive and perform poorly.

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`indices.fielddata.cache.size` |The max size of the field data cache,
e.g. `30%` of node heap space, or an absolute value, e.g. `12GB`. Defaults
to unbounded.

|`indices.fielddata.cache.expire` |A time based setting that expires
field data after a certain time of inactivity. Defaults to `-1`. For
example, can be set to `5m` for a 5 minute expiry.
|=======================================================================
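
To make the two accepted forms of `indices.fielddata.cache.size` concrete, here
is a minimal sketch of how a percentage or an absolute value could be turned
into a byte count. The helper `resolveCacheSize`, the unit table, and the use of
binary multiples (1GB taken as 1024^3 bytes) are assumptions made for
illustration; this is not Elasticsearch's own parsing code.

[source,js]
--------------------------------------------------
// Hypothetical helper, for illustration only.
const UNIT_BYTES = { b: 1, kb: 1024, mb: 1024 ** 2, gb: 1024 ** 3 };

function resolveCacheSize(setting, heapBytes) {
    // Percentage form, e.g. "30%": a share of the node heap.
    const percent = /^(\d+(?:\.\d+)?)%$/.exec(setting);
    if (percent) {
        return Math.floor(heapBytes * Number(percent[1]) / 100);
    }
    // Absolute form, e.g. "12GB": a number followed by a unit.
    const absolute = /^(\d+(?:\.\d+)?)(b|kb|mb|gb)$/i.exec(setting);
    if (absolute) {
        const unit = absolute[2].toLowerCase();
        return Math.floor(Number(absolute[1]) * UNIT_BYTES[unit]);
    }
    throw new Error("unrecognised size: " + setting);
}

console.log(resolveCacheSize("30%", 1000));
console.log(resolveCacheSize("12GB", 0));
--------------------------------------------------

The heap size only matters for the percentage form; an absolute value is
independent of it.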

[float]
[[fielddata-circuit-breaker]]
=== Field data circuit breaker
The field data circuit breaker allows Elasticsearch to estimate the amount of
memory a field will require to be loaded into memory. It can then prevent the
field data loading by raising an exception. By default the limit is configured
to 60% of the maximum JVM heap. It can be configured with the following
parameters:

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`indices.fielddata.breaker.limit` |Maximum size of estimated field data
to allow loading. Defaults to 60% of the maximum JVM heap.
|`indices.fielddata.breaker.overhead` |A constant that all field data
estimations are multiplied with to determine a final estimation. Defaults to
1.03
|=======================================================================
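
The interaction between the limit and the overhead can be sketched as follows.
This is only an illustration of the arithmetic described above, not the
Elasticsearch implementation: the function name `wouldTripBreaker` is
hypothetical, and the assumption that the estimate is added to the field data
already accounted for is mine.

[source,js]
--------------------------------------------------
// Hypothetical helper, not Elasticsearch code.
// rawEstimateBytes: estimated size of the field data about to be loaded
// alreadyUsedBytes: field data the breaker already accounts for (assumption)
// limitBytes:       the configured limit, resolved to bytes
// overhead:         the constant every estimation is multiplied with
function wouldTripBreaker(rawEstimateBytes, alreadyUsedBytes, limitBytes, overhead) {
    const finalEstimate = rawEstimateBytes * overhead;
    return alreadyUsedBytes + finalEstimate > limitBytes;
}

// A 100-byte raw estimate becomes roughly 103 bytes with an overhead of 1.03.
console.log(wouldTripBreaker(100, 0, 102, 1.03)); // true
console.log(wouldTripBreaker(100, 0, 104, 1.03)); // false
--------------------------------------------------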

Both the `indices.fielddata.breaker.limit` and
`indices.fielddata.breaker.overhead` settings can be changed dynamically using the
cluster update settings API.
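
As an illustration, a request body along these lines could be sent to the
cluster update settings API (`PUT /_cluster/settings`). The values `40%` and
`1.05` are arbitrary examples, and the choice of `transient` rather than
`persistent` is an assumption; pick whichever suits your cluster.

[source,js]
--------------------------------------------------
{
    "transient": {
        "indices.fielddata.breaker.limit": "40%",
        "indices.fielddata.breaker.overhead": 1.05
    }
}
--------------------------------------------------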

[float]
[[fielddata-monitoring]]
=== Monitoring field data

You can monitor memory usage for field data as well as the field data circuit
breaker using the
<<cluster-nodes-stats,Nodes Stats API>>.
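
A small sketch of how such a response might be summarised per node is shown
below. The sample object is a hypothetical, heavily trimmed response, and the
field names used (`indices.fielddata.memory_size_in_bytes` and a
`fielddata_breaker` section with `estimated_size_in_bytes` and
`maximum_size_in_bytes`) are assumptions that may differ between Elasticsearch
versions; check the actual Nodes Stats output before relying on them.

[source,js]
--------------------------------------------------
// Hypothetical, trimmed-down nodes stats response (field names are assumptions).
const sampleStats = {
    nodes: {
        node_a: {
            indices: { fielddata: { memory_size_in_bytes: 1200 } },
            fielddata_breaker: { estimated_size_in_bytes: 1500, maximum_size_in_bytes: 6000 }
        },
        node_b: {
            indices: { fielddata: { memory_size_in_bytes: 300 } },
            fielddata_breaker: { estimated_size_in_bytes: 3000, maximum_size_in_bytes: 6000 }
        }
    }
};

// Returns one entry per node: field data bytes in use and how close the
// breaker estimate is to its maximum (0 = empty, 1 = at the limit).
function summarizeFieldData(stats) {
    return Object.keys(stats.nodes).map(function (id) {
        const node = stats.nodes[id];
        const breaker = node.fielddata_breaker;
        return {
            node: id,
            fielddataBytes: node.indices.fielddata.memory_size_in_bytes,
            breakerUsedRatio: breaker.estimated_size_in_bytes / breaker.maximum_size_in_bytes
        };
    });
}

console.log(summarizeFieldData(sampleStats));
--------------------------------------------------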

[[fielddata-formats]]
== Field data formats

The field data format controls how field data should be stored.

Depending on the field type, there might be several field data formats
available. In particular, string and numeric types support the `doc_values`
format which allows for computing the field data data-structures at indexing
time and storing them on disk. Although it will make the index larger and may
be slightly slower, this implementation will be more near-realtime-friendly
and will require much less memory from the JVM than other implementations.

Here is an example of how to configure the `tag` field to use the `fst` field
data format.

[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "fielddata": {
            "format": "fst"
        }
    }
}
--------------------------------------------------

It is possible to change the field data format (and the field data settings
in general) on a live index by using the update mapping API. When doing so,
field data that has already been loaded for existing segments will remain
alive while new segments will use the new field data configuration. Thanks to
the background merging process, all segments will eventually use the new
field data format.
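
For instance, a mapping update body like the following could switch the `tag`
field to the `paged_bytes` format. The type name `my_type` is a placeholder,
and the exact request path (something like `PUT /my_index/_mapping/my_type`) is
an assumption that depends on your index and Elasticsearch version.

[source,js]
--------------------------------------------------
{
    "my_type": {
        "properties": {
            "tag": {
                "type": "string",
                "fielddata": {
                    "format": "paged_bytes"
                }
            }
        }
    }
}
--------------------------------------------------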

[float]
==== String field data types

`paged_bytes` (default)::
    Stores unique terms sequentially in a large buffer and maps documents to
    the indices of the terms they contain in this large buffer.

`fst`::
    Stores terms in an FST. Slower to build than `paged_bytes` but can help lower
    memory usage if many terms share common prefixes and/or suffixes.

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.
    Lowers memory usage but only works on non-analyzed strings (`index`: `no` or
    `not_analyzed`) and doesn't support filtering.

[float]
==== Numeric field data types

`array` (default)::
    Stores field values in memory using arrays.

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.
    Doesn't support filtering.

[float]
==== Geo point field data types

`array` (default)::
    Stores latitudes and longitudes in arrays.

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.

[float]
==== Global ordinals

coming[1.2.0]

Global ordinals are a data structure on top of field data that maintains an
incremental numbering for all the terms in field data, in lexicographic order.
Each term has a unique number, and the number of term 'A' is lower than the number
of term 'B' because 'A' sorts first. Global ordinals are only supported on string fields.

Field data on strings also has ordinals, which are a unique numbering for all terms
in a particular segment and field. Global ordinals build on top of this
by providing a mapping between the segment ordinals and the global ordinals,
the latter being unique across the entire shard.
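
The relationship between segment ordinals and global ordinals can be sketched
as follows. This is a toy model under stated assumptions: each segment is
represented as a sorted array of its unique terms (so a term's array index
plays the role of its segment ordinal), and the function name
`buildGlobalOrdinals` is hypothetical. The real data structure is compressed
and does not use plain arrays or maps.

[source,js]
--------------------------------------------------
// Toy model: segments is an array of sorted, de-duplicated term lists.
function buildGlobalOrdinals(segments) {
    // Collect every distinct term across all segments, in lexicographic order.
    const allTerms = Array.from(new Set([].concat(...segments))).sort();
    const globalOrdinalOf = new Map(allTerms.map((term, ord) => [term, ord]));
    // For each segment, map segment ordinal (array index) to global ordinal.
    const segmentToGlobal = segments.map(
        terms => terms.map(term => globalOrdinalOf.get(term))
    );
    return { terms: allTerms, segmentToGlobal: segmentToGlobal };
}

const segments = [
    ["apple", "cherry"],
    ["banana", "cherry", "date"]
];
const ordinals = buildGlobalOrdinals(segments);
console.log(ordinals.terms);
console.log(ordinals.segmentToGlobal);
--------------------------------------------------

In this model `cherry` has segment ordinal 1 in both segments but a single
global ordinal, so per-segment counts for it can be summed by number alone,
without comparing the term itself.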

Global ordinals can improve the execution time of search features that already
use segment ordinals, such as the terms aggregator. Often these search features
need to merge the segment ordinal results into a cross-segment terms result. With
global ordinals this mapping happens at field data load time instead of during each
query execution. Search features then only need to resolve the actual
term when building the (shard) response; during execution there is no need
to use the actual terms, because the unique numbering that global ordinals provide is
sufficient.

Global ordinals for a specified field are tied to all the segments of a shard (Lucene index),
which differs from field data for a specific field, which is tied to a single segment.
For this reason global ordinals need to be rebuilt in their entirety once new segments
become visible. This one-time cost would be incurred anyway without global ordinals, but
then it would be incurred on each search execution instead.

The loading time of global ordinals depends on the number of terms in a field, but in general
it is low, since the source field data has already been loaded. The memory overhead of global
ordinals is small because they are very efficiently compressed. Eager loading of global ordinals
can move the loading time from the first search request to the refresh itself.

[float]
=== Fielddata loading

By default, field data is loaded lazily, i.e. the first time that a query that
requires it is executed. However, this can make the first requests that
follow a merge operation quite slow since fielddata loading is a heavy
operation.

It is possible to force field data to be loaded and cached eagerly through the
`loading` setting of fielddata:

[source,js]
--------------------------------------------------
{
    "category": {
        "type": "string",
        "fielddata": {
            "loading": "eager"
        }
    }
}
--------------------------------------------------

Global ordinals can also be eagerly loaded:

[source,js]
--------------------------------------------------
{
    "category": {
        "type": "string",
        "fielddata": {
            "loading": "eager_global_ordinals"
        }
    }
}
--------------------------------------------------

With the above setting, both field data and global ordinals for a specific field
are eagerly loaded.

[float]
==== Disabling field data loading

Field data can take a lot of RAM so it makes sense to disable field data
loading on the fields that don't need field data, for example those that are
used for full-text search only. In order to disable field data loading, just
change the field data format to `disabled`. When disabled, all requests that
try to load field data, e.g. when they include aggregations and/or sorting,
will return an error.

[source,js]
--------------------------------------------------
{
    "text": {
        "type": "string",
        "fielddata": {
            "format": "disabled"
        }
    }
}
--------------------------------------------------

The `disabled` format is supported by all field types.

[float]
[[field-data-filtering]]
=== Filtering fielddata

It is possible to control which field values are loaded into memory,
which is particularly useful for string fields. When specifying the
<<mapping-core-types,mapping>> for a field, you
can also specify a fielddata filter.

Fielddata filters can be changed using the
<<indices-put-mapping,PUT mapping>>
API. After changing the filters, use the
<<indices-clearcache,Clear Cache>> API
to reload the fielddata using the new filters.

[float]
==== Filtering by frequency

The frequency filter allows you to only load terms whose frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number or as a percentage (e.g. `0.01` is `1%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "fielddata": {
            "filter": {
                "frequency": {
                    "min": 0.001,
                    "max": 0.1,
                    "min_segment_size": 500
                }
            }
        }
    }
}
--------------------------------------------------
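
The per-segment decision made by a frequency filter can be sketched roughly as
below. This is an illustration rather than the actual implementation, and it
rests on several assumptions: the function name `filterByFrequency` is
hypothetical, values below `1` are treated as fractions and values of `1` or
more as absolute counts, the bounds are inclusive, and a segment smaller than
`min_segment_size` is simply left unfiltered.

[source,js]
--------------------------------------------------
// docFreqs:        Map of term -> number of docs in the segment containing it
// docsWithValue:   docs in the segment that have a value for the field
// segmentDocCount: total number of docs in the segment
// opts:            { min, max, min_segment_size } as in the mapping above
function filterByFrequency(docFreqs, docsWithValue, segmentDocCount, opts) {
    // Assumption: small segments are excluded from filtering altogether.
    if (opts.min_segment_size !== undefined && segmentDocCount < opts.min_segment_size) {
        return Array.from(docFreqs.keys());
    }
    // Assumption: values below 1 are fractions of the docs that have a value.
    const toCount = v => (v < 1 ? v * docsWithValue : v);
    const min = opts.min === undefined ? 0 : toCount(opts.min);
    const max = opts.max === undefined ? Infinity : toCount(opts.max);
    return Array.from(docFreqs.entries())
        .filter(([, freq]) => freq >= min && freq <= max)
        .map(([term]) => term);
}

const docFreqs = new Map([["rare", 3], ["common", 400], ["everywhere", 9000]]);
const opts = { min: 0.001, max: 0.1, min_segment_size: 500 };

// 10000 docs have a value, so the thresholds are about 10 and 1000 docs.
console.log(filterByFrequency(docFreqs, 10000, 12000, opts));
--------------------------------------------------

With these sample numbers only `common` is kept: `rare` falls below the lower
threshold and `everywhere` exceeds the upper one.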

[float]
==== Filtering by regex

Terms can also be filtered by regular expression - only values which
match the regular expression are loaded. Note: the regular expression is
applied to each term in the field, not to the whole field value. For
instance, to only load hashtags from a tweet, we can use a regular
expression which matches terms beginning with `#`:

[source,js]
--------------------------------------------------
{
    "tweet": {
        "type": "string",
        "analyzer": "whitespace",
        "fielddata": {
            "filter": {
                "regex": {
                    "pattern": "^#.*"
                }
            }
        }
    }
}
--------------------------------------------------
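
To show what "applied to each term" means in practice, here is a minimal sketch
that splits a tweet on whitespace and keeps only the terms matching the
pattern. The function name `loadableTerms` is hypothetical, the whitespace
split is only a rough stand-in for the `whitespace` analyzer, and JavaScript's
regular expression engine is used in place of whatever engine Elasticsearch
applies, so edge cases may behave differently.

[source,js]
--------------------------------------------------
// Hypothetical helper: returns the terms of `text` that match `pattern`.
function loadableTerms(text, pattern) {
    const regex = new RegExp(pattern);
    // Rough stand-in for the whitespace analyzer: split on runs of whitespace.
    const terms = text.split(/\s+/).filter(term => term.length > 0);
    // The pattern is tested against each term, never against the whole text.
    return terms.filter(term => regex.test(term));
}

console.log(loadableTerms("Loving #elasticsearch and #lucene today", "^#.*"));
--------------------------------------------------

The whole tweet does not start with `#`, yet the two hashtag terms are still
selected because each term is matched on its own.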

[float]
==== Combining filters

The `frequency` and `regex` filters can be combined:

[source,js]
--------------------------------------------------
{
    "tweet": {
        "type": "string",
        "analyzer": "whitespace",
        "fielddata": {
            "filter": {
                "regex": {
                    "pattern": "^#.*"
                },
                "frequency": {
                    "min": 0.001,
                    "max": 0.1,
                    "min_segment_size": 500
                }
            }
        }
    }
}
--------------------------------------------------