OpenSearch/docs/reference/mapping/params/fielddata.asciidoc

[[fielddata]]
=== `fielddata`

Most fields are <<mapping-index,indexed>> by default, which makes them
searchable. The inverted index allows queries to look up the search term in
unique sorted list of terms, and from that immediately have access to the list
of documents that contain the term.

Sorting, aggregations, and access to field values in scripts requires a
different data access pattern.  Instead of lookup up the term and finding
documents, we need to be able to look up the document and find the terms that
it has in a field.

Most fields can use index-time, on-disk <<doc-values,`doc_values`>> to support
this type of data access pattern, but `text` fields do not support `doc_values`.

Instead, `text` strings use a query-time data structure called
`fielddata`.  This data structure is built on demand the first time that a
field is used for aggregations, sorting, or is accessed in a script.  It is built
by reading the entire inverted index for each segment from disk, inverting the
term ↔︎ document relationship, and storing the result in memory, in the
JVM heap.

Loading fielddata is an expensive process so it is disabled by default. Also,
when enabled, once it has been loaded, it remains in memory for the lifetime of
the segment.

[WARNING]
.Fielddata can fill up your heap space
==============================================================================
Fielddata can consume a lot of heap space, especially when loading high
cardinality `text` fields.  Most of the time, it doesn't make sense
to sort or aggregate on `text` fields (with the notable exception
of the
<<search-aggregations-bucket-significantterms-aggregation,`significant_terms`>>
aggregation).  Always think about whether a <<keyword,`keyword`>> field (which can
use `doc_values`) would be  a better fit for your use case.
==============================================================================

TIP: The `fielddata.*` settings must have the same settings for fields of the
same name in the same index.  Its value can be updated on existing fields
using the <<indices-put-mapping,PUT mapping API>>.


[[global-ordinals]]
.Global ordinals
*****************************************

Global ordinals is a data-structure on top of fielddata and doc values, that
maintains an incremental numbering for each unique term in a lexicographic
order. Each term has a unique number and the number of term 'A' is lower than
the number of term 'B'. Global ordinals are only supported on string fields.

Fielddata and doc values also have ordinals, which is a unique numbering for all terms
in a particular segment and field. Global ordinals just build on top of this,
by providing a mapping between the segment ordinals and the global ordinals,
the latter being unique across the entire shard.

Global ordinals are used for features that use segment ordinals, such as
sorting and the terms aggregation, to improve the execution time. A terms
aggregation relies purely on global ordinals to perform the aggregation at the
shard level, then converts global ordinals to the real term only for the final
reduce phase, which combines results from different shards.

Global ordinals for a specified field are tied to _all the segments of a
shard_, while fielddata and doc values ordinals are tied to a single segment.
which is different than for field data for a specific field which is tied to a
single segment. For this reason global ordinals need to be entirely rebuilt
whenever a once new segment becomes visible.

The loading time of global ordinals depends on the number of terms in a field, but in general
it is low, since it source field data has already been loaded. The memory overhead of global
ordinals is a small because it is very efficiently compressed. Eager loading of global ordinals
can move the loading time from the first search request, to the refresh itself.

*****************************************

[[field-data-filtering]]
==== `fielddata_frequency_filter`

Fielddata filtering can be used to reduce the number of terms loaded into
memory, and thus reduce memory usage. Terms can be filtered by _frequency_:

The frequency filter allows you to only load terms whose term frequency falls
between a `min` and `max` value, which can be expressed an absolute
number (when the number is bigger than 1.0) or as a percentage
(eg `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tag": {
          "type": "text",
          "fielddata": true,
          "fielddata_frequency_filter": {
            "min": 0.001,
            "max": 0.1,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}
--------------------------------------------------
// AUTOSENSE
Docs: Mapping docs completely rewritten for 2.0 2015-08-06 11:24:29 -04:00			`[[fielddata]]`
			=== `fielddata`

			`Most fields are <<mapping-index,indexed>> by default, which makes them`
			`searchable. The inverted index allows queries to look up the search term in`
			`unique sorted list of terms, and from that immediately have access to the list`
			`of documents that contain the term.`

			`Sorting, aggregations, and access to field values in scripts requires a`
			`different data access pattern. Instead of lookup up the term and finding`
			`documents, we need to be able to look up the document and find the terms that`
Merge pull request #16842 from anhlqn/patch-1 Fix minor spelling 2016-02-28 19:32:12 -05:00			`it has in a field.`
Docs: Mapping docs completely rewritten for 2.0 2015-08-06 11:24:29 -04:00
			Most fields can use index-time, on-disk <<doc-values,`doc_values`>> to support
Document 5.0 mapping changes. 2016-03-18 12:01:27 -04:00			this type of data access pattern, but `text` fields do not support `doc_values`.
Docs: Mapping docs completely rewritten for 2.0 2015-08-06 11:24:29 -04:00
Document 5.0 mapping changes. 2016-03-18 12:01:27 -04:00			Instead, `text` strings use a query-time data structure called
Docs: Mapping docs completely rewritten for 2.0 2015-08-06 11:24:29 -04:00			`fielddata`. This data structure is built on demand the first time that a
			`field is used for aggregations, sorting, or is accessed in a script. It is built`
			`by reading the entire inverted index for each segment from disk, inverting the`
			`term ↔︎ document relationship, and storing the result in memory, in the`
			`JVM heap.`

Document 5.0 mapping changes. 2016-03-18 12:01:27 -04:00			`Loading fielddata is an expensive process so it is disabled by default. Also,`
			`when enabled, once it has been loaded, it remains in memory for the lifetime of`
			`the segment.`
Docs: Mapping docs completely rewritten for 2.0 2015-08-06 11:24:29 -04:00
			`[WARNING]`
			`.Fielddata can fill up your heap space`
			`==============================================================================`
			`Fielddata can consume a lot of heap space, especially when loading high`
Document 5.0 mapping changes. 2016-03-18 12:01:27 -04:00			cardinality `text` fields. Most of the time, it doesn't make sense
			to sort or aggregate on `text` fields (with the notable exception
Docs: Mapping docs completely rewritten for 2.0 2015-08-06 11:24:29 -04:00			`of the`
			<<search-aggregations-bucket-significantterms-aggregation,`significant_terms`>>
Document 5.0 mapping changes. 2016-03-18 12:01:27 -04:00			aggregation). Always think about whether a <<keyword,`keyword`>> field (which can
Docs: Mapping docs completely rewritten for 2.0 2015-08-06 11:24:29 -04:00			use `doc_values`) would be a better fit for your use case.
			`==============================================================================`

Documented the update_all_types setting on PUT mapping Added docs to each mapping param to specify which ones can be updated when 2015-08-12 15:21:37 -04:00			TIP: The `fielddata.*` settings must have the same settings for fields of the
			`same name in the same index. Its value can be updated on existing fields`
			`using the <<indices-put-mapping,PUT mapping API>>.`


Docs: Mapping docs completely rewritten for 2.0 2015-08-06 11:24:29 -04:00			`[[global-ordinals]]`
			`.Global ordinals`
			`*****************************************`

			`Global ordinals is a data-structure on top of fielddata and doc values, that`
			`maintains an incremental numbering for each unique term in a lexicographic`
			`order. Each term has a unique number and the number of term 'A' is lower than`
			`the number of term 'B'. Global ordinals are only supported on string fields.`

			`Fielddata and doc values also have ordinals, which is a unique numbering for all terms`
			`in a particular segment and field. Global ordinals just build on top of this,`
			`by providing a mapping between the segment ordinals and the global ordinals,`
			`the latter being unique across the entire shard.`

			`Global ordinals are used for features that use segment ordinals, such as`
			`sorting and the terms aggregation, to improve the execution time. A terms`
			`aggregation relies purely on global ordinals to perform the aggregation at the`
			`shard level, then converts global ordinals to the real term only for the final`
			`reduce phase, which combines results from different shards.`

			`Global ordinals for a specified field are tied to _all the segments of a`
			`shard_, while fielddata and doc values ordinals are tied to a single segment.`
			`which is different than for field data for a specific field which is tied to a`
			`single segment. For this reason global ordinals need to be entirely rebuilt`
			`whenever a once new segment becomes visible.`

			`The loading time of global ordinals depends on the number of terms in a field, but in general`
			`it is low, since it source field data has already been loaded. The memory overhead of global`
			`ordinals is a small because it is very efficiently compressed. Eager loading of global ordinals`
			`can move the loading time from the first search request, to the refresh itself.`

			`*****************************************`

			`[[field-data-filtering]]`
Document 5.0 mapping changes. 2016-03-18 12:01:27 -04:00			==== `fielddata_frequency_filter`
Docs: Mapping docs completely rewritten for 2.0 2015-08-06 11:24:29 -04:00
			`Fielddata filtering can be used to reduce the number of terms loaded into`
Document 5.0 mapping changes. 2016-03-18 12:01:27 -04:00			`memory, and thus reduce memory usage. Terms can be filtered by _frequency_:`
Docs: Mapping docs completely rewritten for 2.0 2015-08-06 11:24:29 -04:00
			`The frequency filter allows you to only load terms whose term frequency falls`
			between a `min` and `max` value, which can be expressed an absolute
			`number (when the number is bigger than 1.0) or as a percentage`
			(eg `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated
			`per segment. Percentages are based on the number of docs which have a`
			`value for the field, as opposed to all docs in the segment.`

			`Small segments can be excluded completely by specifying the minimum`
			number of docs that the segment should contain with `min_segment_size`:

			`[source,js]`
			`--------------------------------------------------`
			`PUT my_index`
			`{`
			`"mappings": {`
			`"my_type": {`
			`"properties": {`
			`"tag": {`
Document 5.0 mapping changes. 2016-03-18 12:01:27 -04:00			`"type": "text",`
Generate and run tests from the docs Adds infrastructure so `gradle :docs:check` will extract tests from snippets in the documentation and execute the tests. This is included in `gradle check` so it should happen on CI and during a normal build. By default each `// AUTOSENSE` snippet creates a unique REST test. These tests are executed in a random order and the cluster is wiped between each one. If multiple snippets chain together into a test you can annotate all snippets after the first with `// TEST[continued]` to have the generated tests for both snippets joined. Snippets marked as `// TESTRESPONSE` are checked against the response of the last action. See docs/README.asciidoc for lots more. Closes #12583. That issue is about catching bugs in the docs during build. This catches some bugs in the docs during build which is a good start. 2016-04-29 10:42:03 -04:00			`"fielddata": true,`
			`"fielddata_frequency_filter": {`
			`"min": 0.001,`
			`"max": 0.1,`
			`"min_segment_size": 500`
Docs: Mapping docs completely rewritten for 2.0 2015-08-06 11:24:29 -04:00			`}`
			`}`
			`}`
			`}`
			`}`
			`}`
			`--------------------------------------------------`
			`// AUTOSENSE`