[[fielddata]]
=== `fielddata`

Most fields are <<mapping-index,indexed>> by default, which makes them
searchable. Sorting, aggregations, and accessing field values in scripts,
however, require a different access pattern from search.

Search needs to answer the question _"Which documents contain this term?"_,
while sorting and aggregations need to answer a different question: _"What is
the value of this field for **this** document?"_.

Most fields can use index-time, on-disk <<doc-values,`doc_values`>> for this
data access pattern, but <<text,`text`>> fields do not support `doc_values`.

Instead, `text` fields use a query-time *in-memory* data structure called
`fielddata`. This data structure is built on demand the first time that a
field is used for aggregations, sorting, or in a script. It is built by
reading the entire inverted index for each segment from disk, inverting the
term ↔ document relationship, and storing the result in memory, in the JVM
heap.

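The uninverting step can be pictured with a small Python sketch. It is a conceptual illustration only, not how Lucene or Elasticsearch build fielddata: the `inverted_index` dictionary and the `uninvert` helper are hypothetical names, and a real segment keeps its terms and postings in compressed on-disk structures.

```python
from collections import defaultdict

# Hypothetical inverted index for one segment: term -> IDs of docs containing it.
inverted_index = {
    "new": [0, 2],
    "york": [0],
    "jersey": [2],
    "boston": [1],
}


def uninvert(index):
    """Flip term -> docs into doc -> terms, visiting every posting once."""
    doc_to_terms = defaultdict(list)
    for term in sorted(index):  # walk the whole term dictionary
        for doc_id in index[term]:
            doc_to_terms[doc_id].append(term)
    return dict(doc_to_terms)


fielddata = uninvert(inverted_index)
print(fielddata[0])
```

Every term in the segment has to be visited to build the per-document view, which is why the cost grows with the number of distinct terms in the field.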
[[fielddata-disabled-text-fields]]
==== Fielddata is disabled on `text` fields by default

Fielddata can consume a *lot* of heap space, especially when loading
high-cardinality `text` fields. Once fielddata has been loaded into the heap, it
remains there for the lifetime of the segment. Also, loading fielddata is an
expensive process that can cause latency spikes for users. This is
why fielddata is disabled by default.

If you try to sort, aggregate, or access values from a script on a `text`
field, you will see this exception:

[literal]
Fielddata is disabled on text fields by default. Set `fielddata=true` on
[`your_field_name`] in order to load fielddata in memory by uninverting the
inverted index. Note that this can however use significant memory.

[[before-enabling-fielddata]]
==== Before enabling fielddata

Before you enable fielddata, consider why you are using a `text` field for
aggregations, sorting, or in a script. It usually doesn't make sense to do
so.

A `text` field is analyzed before indexing so that a value like
`New York` can be found by searching for `new` or for `york`. A `terms`
aggregation on this field will return a `new` bucket and a `york` bucket, when
you probably want a single bucket called `New York`.

Instead, you should have a `text` field for full text searches, and an
unanalyzed <<keyword,`keyword`>> field with <<doc-values,`doc_values`>>
enabled for aggregations, as follows:

[source,console]
---------------------------------
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_field": { <1>
        "type": "text",
        "fields": {
          "keyword": { <2>
            "type": "keyword"
          }
        }
      }
    }
  }
}
---------------------------------

<1> Use the `my_field` field for searches.
<2> Use the `my_field.keyword` field for aggregations, sorting, or in scripts.

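The difference between aggregating on the two fields can be sketched in plain Python. This is an illustration under simplifying assumptions, not Elasticsearch code: `analyze` is a hypothetical stand-in that only lowercases and splits on whitespace, and `terms_buckets` mimics a `terms` aggregation by counting the documents that contain each term.

```python
from collections import Counter

# Hypothetical values of `my_field` in three documents.
docs = ["New York", "New York", "New Jersey"]


def analyze(value):
    """Simplified stand-in for text analysis: lowercase, split on whitespace."""
    return value.lower().split()


def terms_buckets(per_doc_terms):
    """Count how many documents contain each term."""
    counts = Counter()
    for terms in per_doc_terms:
        counts.update(set(terms))  # one count per document, not per occurrence
    return dict(counts)


text_buckets = terms_buckets(analyze(v) for v in docs)  # analyzed `text` field
keyword_buckets = terms_buckets([v] for v in docs)      # unanalyzed `keyword` sub-field

print(text_buckets)
print(keyword_buckets)
```

The analyzed view produces separate `new`, `york`, and `jersey` buckets, while the unanalyzed view keeps `New York` and `New Jersey` as whole values.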
[[enable-fielddata-text-fields]]
==== Enabling fielddata on `text` fields

You can enable fielddata on an existing `text` field using the
<<indices-put-mapping,PUT mapping API>> as follows:

[source,console]
-----------------------------------
PUT my-index-000001/_mapping
{
  "properties": {
    "my_field": { <1>
      "type": "text",
      "fielddata": true
    }
  }
}
-----------------------------------
// TEST[continued]

<1> The mapping that you specify for `my_field` should consist of the existing
mapping for that field, plus the `fielddata` parameter.

[[field-data-filtering]]
==== `fielddata_frequency_filter`

Fielddata filtering can be used to reduce the number of terms loaded into
memory, and thus reduce memory usage. Terms can be filtered by _frequency_:

The frequency filter allows you to only load terms whose document frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number (when the number is bigger than 1.0) or as a percentage
(for example, `0.01` is `1%` and `1.0` is `100%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:

[source,console]
--------------------------------------------------
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "tag": {
        "type": "text",
        "fielddata": true,
        "fielddata_frequency_filter": {
          "min": 0.001,
          "max": 0.1,
          "min_segment_size": 500
        }
      }
    }
  }
}
--------------------------------------------------
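As a rough sketch of how the settings in the request above might play out for a single segment, the following Python mirrors the rule described in this section. It is not the Elasticsearch implementation: `threshold`, `load_terms`, and the sample numbers are hypothetical, both bounds are assumed to be inclusive, and the sketch assumes that a segment smaller than `min_segment_size` is left unfiltered. Verify both assumptions against the version you run.

```python
def threshold(value, docs_with_value):
    """Turn a `min`/`max` setting into an absolute document count."""
    if value > 1.0:  # bigger than 1.0: absolute number
        return int(value)
    return int(value * docs_with_value)  # otherwise: fraction of docs with a value


def load_terms(doc_freqs, docs_with_value, segment_size,
               min_freq=0.001, max_freq=0.1, min_segment_size=500):
    """Return the terms of one segment that would be loaded into fielddata."""
    if segment_size < min_segment_size:
        return sorted(doc_freqs)  # assumption: small segments are not filtered
    low = threshold(min_freq, docs_with_value)
    high = threshold(max_freq, docs_with_value)
    # assumption: both bounds are inclusive
    return sorted(t for t, df in doc_freqs.items() if low <= df <= high)


# Hypothetical segment: 10,000 docs, 8,000 of which have a value for `tag`.
doc_freqs = {"the": 7500, "elasticsearch": 400, "lucene": 12, "typo": 3}
loaded = load_terms(doc_freqs, docs_with_value=8000, segment_size=10000)
print(loaded)
```

With these numbers the bounds work out to 8 and 800 documents, so the very common term `the` and the very rare term `typo` are skipped, which is where the memory saving comes from.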