[[index-modules-fielddata]]
== Field data

The field data cache is used mainly when sorting on or computing aggregations
on a field. It loads all the field values into memory in order to provide fast
document-based access to those values. The field data cache can be
expensive to build for a field, so it's recommended to have enough memory
to allocate it, and to keep it loaded.

The amount of memory used for the field
data cache can be controlled using `indices.fielddata.cache.size`. Note:
reloading field data that does not fit into your cache is expensive
and performs poorly.

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`indices.fielddata.cache.size` |The max size of the field data cache,
e.g. `30%` of node heap space, or an absolute value, e.g. `12GB`. Defaults
to unbounded.

|`indices.fielddata.cache.expire` |A time-based setting that expires
field data after a certain time of inactivity. Defaults to `-1`. For
example, can be set to `5m` for a 5 minute expiry.
|=======================================================================

[float]
[[circuit-breaker]]
=== Circuit Breaker

Elasticsearch contains multiple circuit breakers used to prevent operations from
causing an OutOfMemoryError. Each breaker specifies a limit for how much memory
it can use. Additionally, there is a parent-level breaker that specifies the
total amount of memory that can be used across all breakers.

The parent-level breaker can be configured with the following setting:

`indices.breaker.total.limit`::
Starting limit for overall parent breaker, defaults to 70% of JVM heap.

All circuit breaker settings can be changed dynamically using the cluster update
settings API.

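
As a minimal sketch, the parent breaker limit could be changed by sending a
body like the following to the cluster update settings endpoint
(`PUT /_cluster/settings`). The `persistent` scope and the `65%` value are
arbitrary choices made for illustration, not recommendations.

[source,js]
--------------------------------------------------
{
    "persistent": {
        "indices.breaker.total.limit": "65%"
    }
}
--------------------------------------------------
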
[float]
[[fielddata-circuit-breaker]]
==== Field data circuit breaker

The field data circuit breaker allows Elasticsearch to estimate the amount of
memory a field will require to be loaded into memory. It can then prevent the
field data loading by raising an exception. By default the limit is configured
to 60% of the maximum JVM heap. It can be configured with the following
parameters:

`indices.breaker.fielddata.limit`::
Limit for fielddata breaker, defaults to 60% of JVM heap.

`indices.breaker.fielddata.overhead`::
A constant that all field data estimations are multiplied with to determine a
final estimation. Defaults to 1.03.

[float]
[[request-circuit-breaker]]
==== Request circuit breaker

The request circuit breaker allows Elasticsearch to prevent per-request data
structures (for example, memory used for calculating aggregations during a
request) from exceeding a certain amount of memory.

`indices.breaker.request.limit`::
Limit for request breaker, defaults to 40% of JVM heap.

`indices.breaker.request.overhead`::
A constant that all request estimations are multiplied with to determine a
final estimation. Defaults to 1.

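
The field data and request breakers can be tuned through the same cluster
update settings API. The sketch below assumes the `transient` scope, so the
values would not survive a full cluster restart. The `50%` and `30%` limits
are arbitrary illustrative values; the overheads shown are the documented
defaults.

[source,js]
--------------------------------------------------
{
    "transient": {
        "indices.breaker.fielddata.limit": "50%",
        "indices.breaker.fielddata.overhead": 1.03,
        "indices.breaker.request.limit": "30%",
        "indices.breaker.request.overhead": 1
    }
}
--------------------------------------------------
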
[float]
[[fielddata-monitoring]]
=== Monitoring field data

You can monitor memory usage for field data as well as the field data circuit
breaker using the <<cluster-nodes-stats,Nodes Stats API>>.

[[fielddata-formats]]
== Field data formats

The field data format controls how field data should be stored.

Depending on the field type, there might be several field data types
available. In particular, string and numeric types support the `doc_values`
format which allows for computing the field data data-structures at indexing
time and storing them on disk. Although it will make the index larger and may
be slightly slower, this implementation will be more near-realtime-friendly
and will require much less memory from the JVM than other implementations.

Here is an example of how to configure the `tag` field to use the `fst` field
data format.

[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "fielddata": {
            "format": "fst"
        }
    }
}
--------------------------------------------------

It is possible to change the field data format (and the field data settings
in general) on a live index by using the update mapping API. When doing so,
field data which had already been loaded for existing segments will remain
alive while new segments will use the new field data configuration. Thanks to
the background merging process, all segments will eventually use the new
field data format.

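
As a sketch of such a live change, the body below could be sent to the update
mapping API to switch the `tag` field to the `paged_bytes` format. The index
and type names in the request path (for example
`PUT /my_index/_mapping/my_type`) and the `my_type` key are hypothetical
placeholders.

[source,js]
--------------------------------------------------
{
    "my_type": {
        "properties": {
            "tag": {
                "type": "string",
                "fielddata": {
                    "format": "paged_bytes"
                }
            }
        }
    }
}
--------------------------------------------------
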
[float]
==== String field data types

`paged_bytes` (default)::
Stores unique terms sequentially in a large buffer and maps documents to
the indices of the terms they contain in this large buffer.

`fst`::
Stores terms in an FST. Slower to build than `paged_bytes` but can help lower
memory usage if many terms share common prefixes and/or suffixes.

`doc_values`::
Computes and stores field data data-structures on disk at indexing time.
Lowers memory usage but only works on non-analyzed strings (`index`: `no` or
`not_analyzed`) and doesn't support <<field-data-filtering,filtering>>.

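
Because `doc_values` only works on non-analyzed strings, a mapping that uses
it would typically set `index` explicitly. The following is a minimal sketch
that reuses the `tag` field name from the earlier example:

[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "index": "not_analyzed",
        "fielddata": {
            "format": "doc_values"
        }
    }
}
--------------------------------------------------
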
[float]
==== Numeric field data types

`array` (default)::
Stores field values in memory using arrays.

`doc_values`::
Computes and stores field data data-structures on disk at indexing time.
Doesn't support <<field-data-filtering,filtering>>.

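
For a numeric field the format is selected in the same way. In this sketch the
`view_count` field name and the `long` type are hypothetical, chosen only for
illustration:

[source,js]
--------------------------------------------------
{
    "view_count": {
        "type": "long",
        "fielddata": {
            "format": "doc_values"
        }
    }
}
--------------------------------------------------
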
[float]
==== Geo point field data types

`array` (default)::
Stores latitudes and longitudes in arrays.

`doc_values`::
Computes and stores field data data-structures on disk at indexing time.

[float]
==== Global ordinals

Global ordinals are a data structure on top of field data that maintains an
incremental numbering for all the terms in field data in lexicographic order.
Each term has a unique number, and the number of term 'A' is lower than the number
of term 'B' when 'A' sorts before 'B'. Global ordinals are only supported on string fields.

Field data on strings also has ordinals, which are a unique numbering for all terms
in a particular segment and field. Global ordinals just build on top of this
by providing a mapping between the segment ordinals and the global ordinals,
the latter being unique across the entire shard.

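
To make the mapping concrete, consider a hypothetical shard with two segments.
The structure below is purely illustrative: it is not an Elasticsearch API
response or storage format, and all key names are invented for this example.
Each segment numbers its own terms from zero, while the global ordinals number
the union of terms in lexicographic order. The `segment_to_global` arrays are
indexed by segment ordinal.

[source,js]
--------------------------------------------------
{
    "segment_0": { "terms": ["apple", "cherry"], "segment_ordinals": [0, 1] },
    "segment_1": { "terms": ["banana", "cherry"], "segment_ordinals": [0, 1] },
    "global_ordinals": { "apple": 0, "banana": 1, "cherry": 2 },
    "segment_to_global": {
        "segment_0": [0, 2],
        "segment_1": [1, 2]
    }
}
--------------------------------------------------

Here `cherry` has segment ordinal `1` in both segments, and both map to the
same global ordinal `2`, so per-segment counts for it can be combined without
comparing the term bytes.
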
Global ordinals can be beneficial in search features that use segment ordinals already,
such as the terms aggregator, to improve the execution time. Often these search features
need to merge the segment ordinal results to a cross-segment terms result. With
global ordinals this mapping happens during field data load time instead of during each
query execution. With global ordinals, search features only need to resolve the actual
term when building the (shard) response. During the execution there is no need
at all to use the actual terms: the unique numbering that global ordinals provide is
sufficient and improves the execution time.

Global ordinals for a specified field are tied to all the segments of a shard (Lucene index),
which is different from field data for a specific field, which is tied to a single segment.
For this reason global ordinals need to be rebuilt in their entirety once new segments
become visible. This one-time cost would happen anyway without global ordinals, but
then it would happen for each search execution instead!

The loading time of global ordinals depends on the number of terms in a field, but in general
it is low, since the source field data has already been loaded. The memory overhead of global
ordinals is small because they are very efficiently compressed. Eager loading of global ordinals
can move the loading time from the first search request to the refresh itself.

[float]
=== Fielddata loading

By default, field data is loaded lazily, i.e. the first time that a query that
requires it is executed. However, this can make the first requests that
follow a merge operation quite slow since fielddata loading is a heavy
operation.

It is possible to force field data to be loaded and cached eagerly through the
`loading` setting of fielddata:

[source,js]
--------------------------------------------------
{
    "category": {
        "type": "string",
        "fielddata": {
            "loading": "eager"
        }
    }
}
--------------------------------------------------

Global ordinals can also be eagerly loaded:

[source,js]
--------------------------------------------------
{
    "category": {
        "type": "string",
        "fielddata": {
            "loading": "eager_global_ordinals"
        }
    }
}
--------------------------------------------------

With the above setting both field data and global ordinals for a specific field
are eagerly loaded.

[float]
==== Disabling field data loading

Field data can take a lot of RAM so it makes sense to disable field data
loading on the fields that don't need field data, for example those that are
used for full-text search only. In order to disable field data loading, just
change the field data format to `disabled`. When disabled, all requests that
try to load field data, e.g. when they include aggregations and/or sorting,
will return an error.

[source,js]
--------------------------------------------------
{
    "text": {
        "type": "string",
        "fielddata": {
            "format": "disabled"
        }
    }
}
--------------------------------------------------

The `disabled` format is supported by all field types.

[float]
[[field-data-filtering]]
=== Filtering fielddata

It is possible to control which field values are loaded into memory,
which is particularly useful for string fields. When specifying the
<<mapping-core-types,mapping>> for a field, you
can also specify a fielddata filter.

Fielddata filters can be changed using the
<<indices-put-mapping,PUT mapping>>
API. After changing the filters, use the
<<indices-clearcache,Clear Cache>> API
to reload the fielddata using the new filters.

[float]
==== Filtering by frequency

The frequency filter allows you to only load terms whose frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number or as a percentage (e.g. `0.01` is `1%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "fielddata": {
            "filter": {
                "frequency": {
                    "min": 0.001,
                    "max": 0.1,
                    "min_segment_size": 500
                }
            }
        }
    }
}
--------------------------------------------------

[float]
==== Filtering by regex

Terms can also be filtered by regular expression: only values which
match the regular expression are loaded. Note: the regular expression is
applied to each term in the field, not to the whole field value. For
instance, to only load hashtags from a tweet, we can use a regular
expression which matches terms beginning with `#`:

[source,js]
--------------------------------------------------
{
    "tweet": {
        "type": "string",
        "analyzer": "whitespace",
        "fielddata": {
            "filter": {
                "regex": {
                    "pattern": "^#.*"
                }
            }
        }
    }
}
--------------------------------------------------

[float]
==== Combining filters

The `frequency` and `regex` filters can be combined:

[source,js]
--------------------------------------------------
{
    "tweet": {
        "type": "string",
        "analyzer": "whitespace",
        "fielddata": {
            "filter": {
                "regex": {
                    "pattern": "^#.*"
                },
                "frequency": {
                    "min": 0.001,
                    "max": 0.1,
                    "min_segment_size": 500
                }
            }
        }
    }
}
--------------------------------------------------