[[index-modules-fielddata]]
== Field data

The field data cache is used mainly when sorting or faceting on a
field. It loads all the field values into memory to provide fast
document-based access to those values. Building the field data cache for a
field can be expensive, so it's recommended to have enough memory
to allocate it and to keep it loaded.

The amount of memory used for the field
data cache can be controlled using `indices.fielddata.cache.size`. Note:
reloading field data that does not fit into your cache is expensive
and performs poorly.

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`indices.fielddata.cache.size` |The maximum size of the field data cache,
e.g. `30%` of node heap space, or an absolute value, e.g. `12GB`. Defaults
to unbounded.

|`indices.fielddata.cache.expire` |A time-based setting that expires
field data after a certain time of inactivity. Defaults to `-1`. For
example, can be set to `5m` for a 5-minute expiry.
|=======================================================================

=== Field data formats

The field data format controls how field data should be stored.

Depending on the field type, several field data formats might be
available. In particular, string and numeric types support the `doc_values`
format, which computes the field data data structures at indexing
time and stores them on disk. Although it makes the index larger and may
be slightly slower, this implementation is more near-realtime-friendly
and requires much less memory from the JVM than the other implementations.

Here is an example of how to configure the `tag` field to use the `fst` field
data format.

[source,js]
--------------------------------------------------
{
    tag: {
        type:      "string",
        fielddata: {
            format: "fst"
        }
    }
}
--------------------------------------------------

It is possible to change the field data format (and the field data settings
in general) on a live index by using the update mapping API. When doing so,
field data which had already been loaded for existing segments will remain
alive while new segments will use the new field data configuration. Thanks to
the background merging process, all segments will eventually use the new
field data format.
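
As a rough sketch, the body of such an update mapping request might look like
the following, here switching the `tag` field to `paged_bytes`. The type name
`my_type` is hypothetical, and the wrapper around the field definition is an
assumption that depends on the mapping API in use; only the `tag` field
definition follows the example above.

[source,js]
--------------------------------------------------
{
    // `my_type` is a hypothetical type name used for illustration only
    my_type: {
        properties: {
            tag: {
                type:      "string",
                fielddata: {
                    format: "paged_bytes"
                }
            }
        }
    }
}
--------------------------------------------------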

[float]
==== String field data types

`paged_bytes` (default)::
    Stores unique terms sequentially in a large buffer and maps documents to
    the indices of the terms they contain in this large buffer.

`fst`::
    Stores terms in an FST. Slower to build than `paged_bytes` but can help lower
    memory usage if many terms share common prefixes and/or suffixes.

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.
    Lowers memory usage but only works on non-analyzed strings (`index`: `no` or
    `not_analyzed`) and doesn't support filtering.

[float]
==== Numeric field data types

`array` (default)::
    Stores field values in memory using arrays. 

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.
    Doesn't support filtering.

[float]
==== Geo point field data types

`array` (default)::
    Stores latitudes and longitudes in arrays.

`doc_values`::
    Computes and stores field data data-structures on disk at indexing time.

[float]
=== Fielddata loading

By default, field data is loaded lazily, i.e. the first time that a query that
requires it is executed. However, this can make the first requests that
follow a merge operation quite slow, since field data loading is a heavy
operation.

It is possible to force field data to be loaded and cached eagerly through the
`loading` setting of fielddata:

[source,js]
--------------------------------------------------
{
    category: {
        type:      "string",
        fielddata: {
            loading: "eager"
        }
    }
}
--------------------------------------------------

[float]
==== Disabling field data loading

Field data can take a lot of RAM, so it makes sense to disable field data
loading on fields that don't need it, for example those that are
used for full-text search only. To disable field data loading,
change the field data format to `disabled`. When disabled, any request that
tries to load field data, e.g. one that includes aggregations and/or sorting,
returns an error.

[source,js]
--------------------------------------------------
{
    text: {
        type:      "string",
        fielddata: {
            format: "disabled"
        }
    }
}
--------------------------------------------------

The `disabled` format is supported by all field types.

[float]
[[field-data-filtering]]
=== Filtering fielddata

It is possible to control which field values are loaded into memory,
which is particularly useful for string fields. When specifying the
<<mapping-core-types,mapping>> for a field, you
can also specify a fielddata filter.

Fielddata filters can be changed using the
<<indices-put-mapping,PUT mapping>>
API. After changing the filters, use the
<<indices-clearcache,Clear Cache>> API
to reload the fielddata using the new filters.

[float]
==== Filtering by frequency

The frequency filter allows you to load only terms whose frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number or as a percentage (e.g. `0.01` is `1%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
{
    tag: {
        type:      "string",
        fielddata: {
            filter: {
                frequency: {
                    min:              0.001,
                    max:              0.1,
                    min_segment_size: 500
                }
            }
        }
    }
}
--------------------------------------------------
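
To make the percentage rule concrete, here is a minimal sketch of how a single
term could be checked against the `min` and `max` thresholds within one
segment. The helper names are hypothetical. The sketch also assumes that
thresholds below `1` are fractions of the docs that have a value for the field,
that thresholds of `1` or more are absolute counts, and that both bounds are
inclusive. It is an illustration, not the actual implementation.

[source,js]
--------------------------------------------------
// Hypothetical helper: converts a threshold into a document count.
// Assumption: values below 1 are fractions of the docs that have a value
// for the field; values of 1 or more are absolute document counts.
function toDocCount(threshold, docsWithValue) {
    return threshold < 1
        ? Math.floor(docsWithValue * threshold)
        : Math.floor(threshold);
}

// Hypothetical helper: true if a term's document frequency lies between
// min and max (bounds treated as inclusive, which is an assumption).
function passesFrequencyFilter(termDocFreq, docsWithValue, filter) {
    const min = toDocCount(filter.min, docsWithValue);
    const max = toDocCount(filter.max, docsWithValue);
    return termDocFreq >= min && termDocFreq <= max;
}

// A segment in which 10,000 docs have a value for the field:
// min 0.001 and max 0.1 correspond to 10 and 1,000 docs.
const frequency = { min: 0.001, max: 0.1 };
console.log(passesFrequencyFilter(3, 10000, frequency));    // false, too rare
console.log(passesFrequencyFilter(250, 10000, frequency));  // true
console.log(passesFrequencyFilter(4000, 10000, frequency)); // false, too common
--------------------------------------------------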

[float]
==== Filtering by regex

Terms can also be filtered by regular expression: only values that
match the regular expression are loaded. Note: the regular expression is
applied to each term in the field, not to the whole field value. For
instance, to only load hashtags from a tweet, we can use a regular
expression which matches terms beginning with `#`:

[source,js]
--------------------------------------------------
{
    tweet: {
        type:      "string",
        analyzer:  "whitespace",
        fielddata: {
            filter: {
                regex: {
                    pattern: "^#.*"
                }
            }
        }
    }
}
--------------------------------------------------
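
The per-term behaviour can be illustrated with a small sketch. The function
name is hypothetical, splitting on whitespace only approximates the
`whitespace` analyzer, and JavaScript regular expressions stand in for
whatever regex engine is actually used, so treat it as an illustration only.

[source,js]
--------------------------------------------------
// Hypothetical helper: returns the terms of a field value that match the
// pattern. Splitting on whitespace approximates the `whitespace` analyzer.
function loadedTerms(fieldValue, pattern) {
    const regex = new RegExp(pattern);
    return fieldValue
        .split(/\s+/)
        .filter(term => term.length > 0 && regex.test(term));
}

// The pattern is tested against each term, not against the whole tweet,
// so only the two hashtags are kept.
const tweet = "Upgraded the cluster today #search #ops";
console.log(loadedTerms(tweet, "^#.*"));
--------------------------------------------------

Applied to the whole field value instead, the pattern would not match at all,
because the tweet as a whole does not begin with `#`.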

[float]
==== Combining filters

The `frequency` and `regex` filters can be combined:

[source,js]
--------------------------------------------------
{
    tweet: {
        type:      "string",
        analyzer:  "whitespace",
        fielddata: {
            filter: {
                regex: {
                    pattern:          "^#.*"
                },
                frequency: {
                    min:              0.001,
                    max:              0.1,
                    min_segment_size: 500
                }
            }
        }
    }
}
--------------------------------------------------

[float]
[[field-data-monitoring]]
=== Monitoring field data

You can monitor memory usage for field data using the
<<cluster-nodes-stats,Nodes Stats API>>.