OpenSearch/docs/reference/mapping/params/fielddata.asciidoc

[[fielddata]]
=== `fielddata`

Most fields are <<mapping-index,indexed>> by default, which makes them
searchable. Sorting, aggregations, and accessing field values in scripts,
however, require a different access pattern from search.

Search needs to answer the question _"Which documents contain this term?"_,
while sorting and aggregations need to answer a different question: _"What is
the value of this field for **this** document?"_.

Most fields can use index-time, on-disk <<doc-values,`doc_values`>> for this
data access pattern, but <<text,`text`>> fields do not support `doc_values`.

Instead, `text` fields use a query-time *in-memory* data structure called
`fielddata`.  This data structure is built on demand the first time that a
field is used for aggregations, sorting, or in a script.  It is built by
reading the entire inverted index for each segment from disk, inverting the
term ↔︎ document relationship, and storing the result in memory, in the JVM
heap.

==== Fielddata is disabled on `text` fields by default

Fielddata can consume a *lot* of heap space, especially when loading high
cardinality `text` fields.  Once fielddata has been loaded into the heap, it
remains there for the lifetime of the segment. Also, loading fielddata is an
expensive process which can cause users to experience latency hits.  This is
why fielddata is disabled by default.
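
As a rough way to see how much heap fielddata is currently using, the cat
fielddata and nodes stats APIs can report memory usage per field. This is a
minimal sketch: the `fields=*` wildcard is only an illustrative choice, and
the figures returned depend entirely on your cluster.

[source,js]
--------------------------------------------------
GET _cat/fielddata?v

GET _nodes/stats/indices/fielddata?fields=*
--------------------------------------------------
// CONSOLE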

If you try to sort, aggregate, or access values from a script on a `text`
field, you will see this exception:

[quote]
--
Fielddata is disabled on text fields by default.  Set `fielddata=true` on
[`your_field_name`] in order to load  fielddata in memory by uninverting the
inverted index. Note that this can however use significant memory.
--

[[before-enabling-fielddata]]
==== Before enabling fielddata

Before you enable fielddata, consider why you are using a `text` field for
aggregations, sorting, or in a script.  It usually doesn't make sense to do
so.

A `text` field is analyzed before indexing so that a value like
`New York` can be found by searching for `new` or for `york`.  A `terms`
aggregation on this field will return a `new` bucket and a `york` bucket, when
you probably want a single bucket called `New York`.

Instead, you should have a `text` field for full-text searches, and an
unanalyzed <<keyword,`keyword`>> field with <<doc-values,`doc_values`>>
enabled for aggregations, as follows:

[source,js]
---------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": { <1>
          "type": "text",
          "fields": {
            "keyword": { <2>
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
---------------------------------
// CONSOLE
<1> Use the `my_field` field for searches.
<2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts.
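
For example, assuming the mapping above, a single request might search the
analyzed field while aggregating and sorting on the `keyword` sub-field. The
query text `york` and the aggregation name `cities` are illustrative
placeholders, not part of the mapping:

[source,js]
---------------------------------
GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "york"
    }
  },
  "aggs": {
    "cities": {
      "terms": {
        "field": "my_field.keyword"
      }
    }
  },
  "sort": [
    { "my_field.keyword": "asc" }
  ]
}
---------------------------------
// CONSOLE
// TEST[continued]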

==== Enabling fielddata on `text` fields

You can enable fielddata on an existing `text` field using the
<<indices-put-mapping,PUT mapping API>> as follows:

[source,js]
-----------------------------------
PUT my_index/_mapping/my_type
{
  "properties": {
    "my_field": { <1>
      "type":     "text",
      "fielddata": true
    }
  }
}
-----------------------------------
// CONSOLE
// TEST[continued]

<1> The mapping that you specify for `my_field` should consist of the existing
    mapping for that field, plus the `fielddata` parameter.

TIP: The `fielddata` parameter must have the same setting for fields of the
same name in the same index.  Its value can be updated on existing fields
using the <<indices-put-mapping,PUT mapping API>>.
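
Once fielddata is enabled, a `terms` aggregation that targets `my_field`
directly should be accepted instead of raising the exception shown earlier.
In the sketch below the aggregation name `my_field_terms` is a placeholder.
Note that the buckets are the individual analyzed terms, such as `new` and
`york`, not the original field values:

[source,js]
-----------------------------------
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "my_field_terms": {
      "terms": {
        "field": "my_field"
      }
    }
  }
}
-----------------------------------
// CONSOLE
// TEST[continued]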


[[field-data-filtering]]
==== `fielddata_frequency_filter`

Fielddata filtering can be used to reduce the number of terms loaded into
memory, and thus reduce memory usage. Terms can be filtered by _frequency_:

The frequency filter allows you to load only terms whose document frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number (when the number is greater than 1.0) or as a percentage
(e.g. `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "tag": {
          "type": "text",
          "fielddata": true,
          "fielddata_frequency_filter": {
            "min": 0.001,
            "max": 0.1,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
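
To make the thresholds concrete, consider a hypothetical segment in which
10,000 docs have a value for `tag` (the figure is an assumption chosen for
easy arithmetic). That segment is large enough not to be excluded by the
`min_segment_size` of 500, and the fractional bounds above would translate to
the following document frequencies:

[source,text]
--------------------------------------------------
min: 0.001 * 10,000 = 10 docs
max: 0.1   * 10,000 = 1,000 docs
--------------------------------------------------

Terms whose document frequency in that segment falls outside this range would
not be loaded. Treat the numbers as approximate, since the handling of values
exactly on a boundary and of fractional results may differ.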