diff --git a/docs/reference/mapping/params/fielddata.asciidoc b/docs/reference/mapping/params/fielddata.asciidoc
index 2e6b63698c7..6795b0f5b9b 100644
--- a/docs/reference/mapping/params/fielddata.asciidoc
+++ b/docs/reference/mapping/params/fielddata.asciidoc
@@ -2,42 +2,105 @@
 === `fielddata`
 
 Most fields are <> by default, which makes them
-searchable. The inverted index allows queries to look up the search term in
-unique sorted list of terms, and from that immediately have access to the list
-of documents that contain the term.
+searchable. Sorting, aggregations, and accessing field values in scripts,
+however, require a different access pattern from search.
 
-Sorting, aggregations, and access to field values in scripts requires a
-different data access pattern. Instead of lookup up the term and finding
-documents, we need to be able to look up the document and find the terms that
-it has in a field.
+Search needs to answer the question _"Which documents contain this term?"_,
+while sorting and aggregations need to answer a different question: _"What is
+the value of this field for **this** document?"_.
 
-Most fields can use index-time, on-disk <> to support
-this type of data access pattern, but `text` fields do not support `doc_values`.
+Most fields can use index-time, on-disk <> for this
+data access pattern, but <> fields do not support `doc_values`.
 
-Instead, `text` strings use a query-time data structure called
+Instead, `text` fields use a query-time *in-memory* data structure called
 `fielddata`. This data structure is built on demand the first time that a
-field is used for aggregations, sorting, or is accessed in a script. It is built
-by reading the entire inverted index for each segment from disk, inverting the
-term ↔︎ document relationship, and storing the result in memory, in the
-JVM heap.
+field is used for aggregations, sorting, or in a script. It is built by
+reading the entire inverted index for each segment from disk, inverting the
+term ↔︎ document relationship, and storing the result in memory, in the JVM
+heap.
 
-Loading fielddata is an expensive process so it is disabled by default. Also,
-when enabled, once it has been loaded, it remains in memory for the lifetime of
-the segment.
+==== Fielddata is disabled on `text` fields by default
 
-[WARNING]
-.Fielddata can fill up your heap space
-==============================================================================
-Fielddata can consume a lot of heap space, especially when loading high
-cardinality `text` fields. Most of the time, it doesn't make sense
-to sort or aggregate on `text` fields (with the notable exception
-of the
-<>
-aggregation). Always think about whether a <> field (which can
-use `doc_values`) would be a better fit for your use case.
-==============================================================================
+Fielddata can consume a *lot* of heap space, especially when loading high
+cardinality `text` fields. Once fielddata has been loaded into the heap, it
+remains there for the lifetime of the segment. Also, loading fielddata is an
+expensive process which can cause users to experience latency hits. This is
+why fielddata is disabled by default.
 
-TIP: The `fielddata.*` settings must have the same settings for fields of the
+If you try to sort, aggregate, or access values from a script on a `text`
+field, you will see this exception:
+
+[quote]
+--
+Fielddata is disabled on text fields by default. Set `fielddata=true` on
+[`your_field_name`] in order to load fielddata in memory by uninverting the
+inverted index. Note that this can however use significant memory.
+--
+
+[[before-enabling-fielddata]]
+==== Before enabling fielddata
+
+Before you enable fielddata, consider why you are using a `text` field for
+aggregations, sorting, or in a script. It usually doesn't make sense to do
+so.
+
+A text field is analyzed before indexing so that a value like
+`New York` can be found by searching for `new` or for `york`. A `terms`
+aggregation on this field will return a `new` bucket and a `york` bucket, when
+you probably want a single bucket called `New York`.
+
+Instead, you should have a `text` field for full text searches, and an
+unanalyzed <> field with <>
+enabled for aggregations, as follows:
+
+[source,js]
+---------------------------------
+PUT my_index
+{
+  "mappings": {
+    "my_type": {
+      "properties": {
+        "my_field": { <1>
+          "type": "text",
+          "fields": {
+            "keyword": { <2>
+              "type": "keyword"
+            }
+          }
+        }
+      }
+    }
+  }
+}
+---------------------------------
+// CONSOLE
+<1> Use the `my_field` field for searches.
+<2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts.
+
+==== Enabling fielddata on `text` fields
+
+You can enable fielddata on an existing `text` field using the
+<> as follows:
+
+[source,js]
+-----------------------------------
+PUT my_index/_mapping/my_type
+{
+  "properties": {
+    "my_field": { <1>
+      "type": "text",
+      "fielddata": true
+    }
+  }
+}
+-----------------------------------
+// CONSOLE
+// TEST[continued]
+
+<1> The mapping that you specify for `my_field` should consist of the existing
+    mapping for that field, plus the `fielddata` parameter.
+
+TIP: The `fielddata.*` parameter must have the same settings for fields of the
 same name in the same index. Its value can be updated on existing fields
 using the <>.
 
@@ -49,12 +112,13 @@ using the <>.
 Global ordinals is a data-structure on top of fielddata and doc values, that
 maintains an incremental numbering for each unique term in a lexicographic
 order. Each term has a unique number and the number of term 'A' is lower than
-the number of term 'B'. Global ordinals are only supported on string fields.
+the number of term 'B'. Global ordinals are only supported on <>
+and <> fields.
 
-Fielddata and doc values also have ordinals, which is a unique numbering for all terms
-in a particular segment and field. Global ordinals just build on top of this,
-by providing a mapping between the segment ordinals and the global ordinals,
-the latter being unique across the entire shard.
+Fielddata and doc values also have ordinals, which is a unique numbering for
+all terms in a particular segment and field. Global ordinals just build on top
+of this, by providing a mapping between the segment ordinals and the global
+ordinals, the latter being unique across the entire shard.
 
 Global ordinals are used for features that use segment ordinals, such as
 sorting and the terms aggregation, to improve the execution time. A terms
@@ -68,10 +132,11 @@ which is different than for field data for a specific field which is tied to a
 single segment. For this reason global ordinals need to be entirely rebuilt
 whenever a once new segment becomes visible.
 
-The loading time of global ordinals depends on the number of terms in a field, but in general
-it is low, since it source field data has already been loaded. The memory overhead of global
-ordinals is a small because it is very efficiently compressed. Eager loading of global ordinals
-can move the loading time from the first search request, to the refresh itself.
+The loading time of global ordinals depends on the number of terms in a field,
+but in general it is low, since its source field data has already been loaded.
+The memory overhead of global ordinals is small because it is very
+efficiently compressed. Eager loading of global ordinals can move the loading
+time from the first search request, to the refresh itself.
 
 *****************************************
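
Reviewer's illustration (not part of the patch): the documentation changed above describes two ideas that may be easier to follow with a toy model. The Python sketch below is only an assumption-laden illustration, not how Elasticsearch or Lucene implement fielddata. It models an inverted index as a plain dictionary, "uninverts" it per segment into a document-to-term-ordinal structure, and then derives shard-wide global ordinals from the per-segment ordinals. The names `uninvert`, `build_global_ordinals`, `seg0`, and `seg1` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy model: an inverted index maps term -> list of doc IDs.
InvertedIndex = Dict[str, List[int]]


def uninvert(inverted: InvertedIndex, num_docs: int) -> Tuple[List[str], List[List[int]]]:
    """Build a fielddata-like structure for one segment.

    Returns (terms, doc_ords): `terms` is the segment's sorted term list, and
    doc_ords[doc] holds the segment ordinals of the terms in that document.
    """
    terms = sorted(inverted)
    doc_ords: List[List[int]] = [[] for _ in range(num_docs)]
    for ordinal, term in enumerate(terms):
        for doc in inverted[term]:
            doc_ords[doc].append(ordinal)
    return terms, doc_ords


def build_global_ordinals(segment_terms: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Map every segment ordinal to a shard-wide ordinal in lexicographic order."""
    global_terms = sorted({t for terms in segment_terms for t in terms})
    global_ord = {term: i for i, term in enumerate(global_terms)}
    seg_to_global = [[global_ord[t] for t in terms] for terms in segment_terms]
    return global_terms, seg_to_global


# Two toy segments of an analyzed `text` field ("New York" -> `new`, `york`).
seg0 = {"new": [0, 1], "york": [0]}
seg1 = {"jersey": [0], "new": [0]}

terms0, ords0 = uninvert(seg0, num_docs=2)
terms1, ords1 = uninvert(seg1, num_docs=1)
global_terms, seg_to_global = build_global_ordinals([terms0, terms1])
```

In this toy model, adding a segment can shift every global ordinal, which is consistent with the patched text's remark that global ordinals must be rebuilt whenever a new segment becomes visible.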