[[index-modules-fielddata]]
== Field data

The field data cache is used mainly when sorting on or faceting on a
field. It loads all the field values into memory in order to provide fast
document-based access to those values. The field data cache can be
expensive to build for a field, so it is recommended to have enough memory
to allocate it, and to keep it loaded.

The amount of memory used for the field
data cache can be controlled using `indices.fielddata.cache.size`. Note:
reloading field data that does not fit into your cache is expensive
and performs poorly.

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`indices.fielddata.cache.size` |The maximum size of the field data cache,
e.g. `30%` of node heap space, or an absolute value, e.g. `12GB`. Defaults
to unbounded.

|`indices.fielddata.cache.expire` |A time-based setting that expires
field data after a certain time of inactivity. Defaults to `-1`. For
example, it can be set to `5m` for a 5 minute expiry.
|=======================================================================
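
To make the two forms of `indices.fielddata.cache.size` concrete, the sketch
below resolves a setting value to a byte count. It is only an illustration:
`resolveCacheSize` is a hypothetical helper, not part of any API, and it
assumes binary units (1KB = 1024 bytes) and a small set of suffixes.

[source,js]
--------------------------------------------------
// Hypothetical helper: resolves a cache-size setting to bytes.
// Assumes binary units (1KB = 1024 bytes); this is not the server's parser.
function resolveCacheSize(setting, heapBytes) {
    const units = { B: 1, KB: 1024, MB: 1024 ** 2, GB: 1024 ** 3 };

    // Percentage form, e.g. "30%": relative to the node's heap.
    const pct = /^(\d+(?:\.\d+)?)%$/.exec(setting);
    if (pct) {
        return Math.floor(heapBytes * parseFloat(pct[1]) / 100);
    }

    // Absolute form, e.g. "12GB": independent of the heap size.
    const abs = /^(\d+(?:\.\d+)?)(B|KB|MB|GB)$/i.exec(setting);
    if (abs) {
        return Math.floor(parseFloat(abs[1]) * units[abs[2].toUpperCase()]);
    }

    throw new Error("Unrecognized size: " + setting);
}
--------------------------------------------------

A percentage scales with the heap given to the node, while an absolute value
stays fixed when the heap changes.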

[float]
=== Filtering fielddata

It is possible to control which field values are loaded into memory,
which is particularly useful for string fields. When specifying the
<<mapping-core-types,mapping>> for a field, you
can also specify a fielddata filter.

Fielddata filters can be changed using the
<<indices-put-mapping,PUT mapping>>
API. After changing the filters, use the
<<indices-clearcache,Clear Cache>> API
to reload the fielddata using the new filters.

[float]
==== Filtering by frequency

The frequency filter allows you to load only terms whose frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number or as a percentage (e.g. `0.01` is `1%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
{
    tag: {
        type:      "string",
        fielddata: {
            filter: {
                frequency: {
                    min:              0.001,
                    max:              0.1,
                    min_segment_size: 500
                }
            }
        }
    }
}
--------------------------------------------------
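
The per-segment decision described above can be sketched roughly as follows.
This is an illustration rather than the actual implementation:
`filterByFrequency` and the shape of its arguments are hypothetical, and it
assumes that bounds below `1` are fractions, that both bounds are inclusive,
and that segments smaller than `min_segment_size` are left unfiltered.

[source,js]
--------------------------------------------------
// Sketch of the frequency filter for a single segment.
// termDocFreqs: { term: number of docs in the segment containing the term }
// segment:      { docCount, docsWithValue } for that segment
function filterByFrequency(termDocFreqs, segment, opts) {
    const { min = 0, max = Infinity, min_segment_size = 0 } = opts;

    // Assumption: segments below min_segment_size skip the filter entirely.
    if (segment.docCount < min_segment_size) {
        return Object.keys(termDocFreqs);
    }

    // Assumption: values below 1 are fractions of the docs that have a
    // value for the field; anything else is an absolute doc count.
    const toCount = (v) => (v < 1 ? v * segment.docsWithValue : v);
    const lo = toCount(min);
    const hi = toCount(max);

    // Keep only terms whose doc frequency lies within [lo, hi].
    return Object.keys(termDocFreqs).filter((term) => {
        const freq = termDocFreqs[term];
        return freq >= lo && freq <= hi;
    });
}

// With 1000 docs holding a value, 0.001 and 0.1 become 1 and 100 docs.
filterByFrequency(
    { the: 900, search: 50 },
    { docCount: 1000, docsWithValue: 1000 },
    { min: 0.001, max: 0.1, min_segment_size: 500 }
);
--------------------------------------------------

In this call the very common term `the` falls above the upper bound and is
dropped, while `search` is kept.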

[float]
==== Filtering by regex

Terms can also be filtered by regular expression - only values which
match the regular expression are loaded. Note: the regular expression is
applied to each term in the field, not to the whole field value. For
instance, to only load hashtags from a tweet, we can use a regular
expression which matches terms beginning with `#`:

[source,js]
--------------------------------------------------
{
    tweet: {
        type:      "string",
        analyzer:  "whitespace",
        fielddata: {
            filter: {
                regex: {
                    pattern: "^#.*"
                }
            }
        }
    }
}
--------------------------------------------------
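
The difference between matching each term and matching the whole field value
can be seen in a small sketch. `termsMatching` is a hypothetical helper, and
splitting on whitespace only approximates what the `whitespace` analyzer
produces.

[source,js]
--------------------------------------------------
// Sketch: the pattern is tested against each term, not the whole text.
// Splitting on whitespace stands in for the `whitespace` analyzer.
function termsMatching(text, pattern) {
    const regex = new RegExp(pattern);
    return text
        .split(/\s+/)
        .filter((term) => term.length > 0 && regex.test(term));
}

// Only the hashtag terms match the pattern.
termsMatching("loving #opensearch and #search today", "^#.*");
--------------------------------------------------

The whole tweet does not begin with `#`, so the same pattern applied to the
full field value would match nothing. Applied per term, it selects only the
hashtags.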

[float]
==== Combining filters

The `frequency` and `regex` filters can be combined:

[source,js]
--------------------------------------------------
{
    tweet: {
        type:      "string",
        analyzer:  "whitespace",
        fielddata: {
            filter: {
                regex: {
                    pattern:          "^#.*"
                },
                frequency: {
                    min:              0.001,
                    max:              0.1,
                    min_segment_size: 500
                }
            }
        }
    }
}
--------------------------------------------------

[float]
=== Monitoring field data

You can monitor memory usage for field data using the
<<cluster-nodes-stats,Nodes Stats API>>.