[[fielddata]]
=== `fielddata`

Most fields are <<mapping-index,indexed>> by default, which makes them
searchable. Sorting, aggregations, and accessing field values in scripts,
however, require a different access pattern from search.

Search needs to answer the question _"Which documents contain this term?"_,
while sorting and aggregations need to answer a different question: _"What is
the value of this field for **this** document?"_.

Most fields can use index-time, on-disk <<doc-values,`doc_values`>> for this
data access pattern, but <<text,`text`>> fields do not support `doc_values`.

Instead, `text` fields use a query-time *in-memory* data structure called
`fielddata`. This data structure is built on demand the first time that a
field is used for aggregations, sorting, or in a script. It is built by
reading the entire inverted index for each segment from disk, inverting the
term ↔ document relationship, and storing the result in memory, in the JVM
heap.

==== Fielddata is disabled on `text` fields by default

Fielddata can consume a *lot* of heap space, especially when loading
high-cardinality `text` fields. Once fielddata has been loaded into the heap,
it remains there for the lifetime of the segment. Loading fielddata is also an
expensive process that can cause users to experience latency hits. This is
why fielddata is disabled by default.
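
As a rough way to see how much heap fielddata is currently using, you can
query the cat fielddata API or the node stats API. This is only a sketch:
`my_field` is a placeholder field name, and the sizes reported depend entirely
on your cluster and data.

[source,js]
--------------------------------------------------
# Per-node fielddata size for a single field, in tabular form
GET _cat/fielddata?v&fields=my_field

# The same information as part of the node stats
GET _nodes/stats/indices/fielddata?fields=my_field
--------------------------------------------------
// NOTCONSOLE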

If you try to sort, aggregate, or access values from a script on a `text`
field, you will see this exception:

[quote]
--
Fielddata is disabled on text fields by default. Set `fielddata=true` on
[`your_field_name`] in order to load fielddata in memory by uninverting the
inverted index. Note that this can however use significant memory.
--
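
For illustration, a request like the following would be expected to fail with
that exception. It assumes a hypothetical index `my_index` in which `my_field`
is mapped as `text` with default settings. The aggregation name `my_terms` is
arbitrary.

[source,js]
--------------------------------------------------
# Terms aggregation directly on a text field with fielddata disabled
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "my_terms": {
      "terms": {
        "field": "my_field"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE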

[[before-enabling-fielddata]]
==== Before enabling fielddata

Before you enable fielddata, consider why you are using a `text` field for
aggregations, sorting, or in a script. It usually doesn't make sense to do
so.

A `text` field is analyzed before indexing so that a value like
`New York` can be found by searching for `new` or for `york`. A `terms`
aggregation on this field will return a `new` bucket and a `york` bucket, when
you probably want a single bucket called `New York`.

Instead, you should have a `text` field for full-text searches, and an
unanalyzed <<keyword,`keyword`>> field with <<doc-values,`doc_values`>>
enabled for aggregations, as follows:

[source,js]
---------------------------------
PUT my_index
{
  "mappings": {
    "properties": {
      "my_field": { <1>
        "type": "text",
        "fields": {
          "keyword": { <2>
            "type": "keyword"
          }
        }
      }
    }
  }
}
---------------------------------
// CONSOLE
<1> Use the `my_field` field for searches.
<2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts.
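
A sketch of how the two fields might be used together, assuming the mapping
above: the `match` query searches the analyzed `my_field`, while the `terms`
aggregation (named `my_terms` here purely for illustration) buckets on the
unanalyzed `my_field.keyword`.

[source,js]
--------------------------------------------------
# Full-text search on the text field, aggregation on the keyword sub-field
GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "york"
    }
  },
  "aggs": {
    "my_terms": {
      "terms": {
        "field": "my_field.keyword"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

For a document whose value is `New York`, this should match on `york` and
still return a single `New York` bucket.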

==== Enabling fielddata on `text` fields

You can enable fielddata on an existing `text` field using the
<<indices-put-mapping,PUT mapping API>> as follows:

[source,js]
-----------------------------------
PUT my_index/_mapping
{
  "properties": {
    "my_field": { <1>
      "type": "text",
      "fielddata": true
    }
  }
}
-----------------------------------
// CONSOLE
// TEST[continued]
<1> The mapping that you specify for `my_field` should consist of the existing
mapping for that field, plus the `fielddata` parameter.
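
Because the request has to repeat the existing definition of the field, it can
help to look that definition up first. One possible way, shown here as a
sketch, is the get field mapping API:

[source,js]
--------------------------------------------------
# Retrieve the current mapping of my_field before adding "fielddata": true
GET my_index/_mapping/field/my_field
--------------------------------------------------
// NOTCONSOLE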

[[field-data-filtering]]
==== `fielddata_frequency_filter`

Fielddata filtering can be used to reduce the number of terms loaded into
memory, and thus reduce memory usage. Terms can be filtered by _frequency_:

The frequency filter allows you to load only terms whose document frequency
falls between a `min` and a `max` value, which can be expressed as an absolute
number (when the number is greater than 1.0) or as a percentage
(e.g. `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs that have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,js]
--------------------------------------------------
PUT my_index
{
  "mappings": {
    "properties": {
      "tag": {
        "type": "text",
        "fielddata": true,
        "fielddata_frequency_filter": {
          "min": 0.001,
          "max": 0.1,
          "min_segment_size": 500
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
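
To make the percentages concrete: in a segment where 10,000 docs have a value
for `tag`, a `min` of `0.001` corresponds to 10 docs and a `max` of `0.1`
corresponds to 1,000 docs, so only terms whose document frequency lies between
those two values are loaded. The same bounds can also be written as absolute
numbers, as in the sketch below. The values are illustrative only, and because
absolute thresholds do not scale with segment size, they match the percentages
only for a segment of that size.

[source,js]
--------------------------------------------------
# Same filter expressed with absolute document counts (values greater than 1.0)
PUT my_index
{
  "mappings": {
    "properties": {
      "tag": {
        "type": "text",
        "fielddata": true,
        "fielddata_frequency_filter": {
          "min": 10,
          "max": 1000,
          "min_segment_size": 500
        }
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE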