[[eager-global-ordinals]]
=== `eager_global_ordinals`

==== What are global ordinals?

To support aggregations and other operations that require looking up field
values on a per-document basis, Elasticsearch uses a data structure called
<<doc-values, doc values>>. Term-based field types such as `keyword` store
their doc values using an ordinal mapping for a more compact representation.
This mapping works by assigning each term an incremental integer or 'ordinal'
based on its lexicographic order. The field's doc values store only the
ordinals for each document instead of the original terms, with a separate
lookup structure to convert between ordinals and terms.

When used during aggregations, ordinals can greatly improve performance. As an
example, the `terms` aggregation relies only on ordinals to collect documents
into buckets at the shard-level, then converts the ordinals back to their
original term values when combining results across shards.

Each index segment defines its own ordinal mapping, but aggregations collect
data across an entire shard. So to be able to use ordinals for shard-level
operations like aggregations, Elasticsearch creates a unified mapping called
'global ordinals'. The global ordinal mapping is built on top of segment
ordinals, and works by maintaining a map from global ordinal to the local
ordinal for each segment.
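
As a purely illustrative sketch (the terms and the two-segment layout below are
made up for this example and do not come from a real index), consider a shard
whose two segments index the same `keyword` field:

[source,text]
------------
segment 0 terms (segment ordinal):  apple=0   cherry=1
segment 1 terms (segment ordinal):  banana=0  cherry=1

global ordinals for the shard:      apple=0   banana=1  cherry=2

global ordinal -> (segment 0 ordinal, segment 1 ordinal)
0 (apple)      -> (0, -)
1 (banana)     -> (-, 0)
2 (cherry)     -> (1, 1)
------------

Segment ordinal `0` means `apple` in one segment and `banana` in the other, so
segment ordinals alone cannot identify a bucket across the shard. The global
ordinal gives every term a single shard-wide number.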

Global ordinals are used if a search contains any of the following components:

* Certain bucket aggregations on `keyword`, `ip`, and `flattened` fields. This
includes `terms` aggregations as mentioned above, as well as `composite`,
`diversified_sampler`, and `significant_terms`.
* Bucket aggregations on `text` fields that require <<fielddata, `fielddata`>>
to be enabled.
* Operations on parent and child documents from a `join` field, including
`has_child` queries and `parent` aggregations.

NOTE: The global ordinal mapping is an on-heap data structure. When measuring
memory usage, Elasticsearch counts the memory from global ordinals as
'fielddata'. Global ordinals memory is included in the
<<fielddata-circuit-breaker, fielddata circuit breaker>>, and is returned
under `fielddata` in the <<cluster-nodes-stats, node stats>> response.
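
One way to check how much memory this takes is to ask the node stats API for
fielddata statistics only. The request below is a minimal sketch that assumes a
field named `tags`, matching the mapping examples in the next section;
substitute your own field names.

[source,console]
------------
GET _nodes/stats/indices/fielddata?fields=tags
------------

The response reports `memory_size_in_bytes` for fielddata on each node, with a
per-field breakdown for the fields listed in `fields`.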

==== Loading global ordinals

The global ordinal mapping must be built before ordinals can be used during a
search. By default, the mapping is loaded during search the first time that
global ordinals are needed. This is the right approach if you are optimizing
for indexing speed, but if search performance is a priority, it's recommended
to eagerly load global ordinals on fields that will be used in aggregations:

[source,console]
------------
PUT my_index/_mapping
{
  "properties": {
    "tags": {
      "type": "keyword",
      "eager_global_ordinals": true
    }
  }
}
------------
// TEST[s/^/PUT my_index\n/]

When `eager_global_ordinals` is enabled, global ordinals are built when a shard
is <<indices-refresh, refreshed>>: Elasticsearch always loads them before
exposing changes to the content of the index. This shifts the cost of building
global ordinals from search time to index time. Elasticsearch will also eagerly
build global ordinals when creating a new copy of a shard, as can occur when
increasing the number of replicas or relocating a shard onto a new node.

Eager loading can be disabled at any time by updating the `eager_global_ordinals` setting:

[source,console]
------------
PUT my_index/_mapping
{
  "properties": {
    "tags": {
      "type": "keyword",
      "eager_global_ordinals": false
    }
  }
}
------------
// TEST[continued]

IMPORTANT: On a <<frozen-indices,frozen index>>, global ordinals are discarded
after each search and rebuilt again when they're requested. This means that
`eager_global_ordinals` should not be used on frozen indices: it would
cause global ordinals to be reloaded on every search. Instead, the index should
be force-merged to a single segment before being frozen. This avoids building
global ordinals altogether (more details can be found in the next section).

==== Avoiding global ordinal loading

Usually, global ordinals do not present a large overhead in terms of their
loading time and memory usage. However, loading global ordinals can be
expensive on indices with large shards, or if the fields contain a large
number of unique term values. Because global ordinals provide a unified mapping
for all segments on the shard, they also need to be rebuilt entirely when a new
segment becomes visible.

In some cases it is possible to avoid global ordinal loading altogether:

* The `terms`, `diversified_sampler`, and `significant_terms` aggregations
support a parameter
<<search-aggregations-bucket-terms-aggregation-execution-hint, `execution_hint`>>
that helps control how buckets are collected. It defaults to `global_ordinals`,
but can be set to `map` to instead use the term values directly.
* If a shard has been <<indices-forcemerge,force-merged>> down to a single
segment, then its segment ordinals are already 'global' to the shard. In this
case, Elasticsearch does not need to build a global ordinal mapping and there
is no additional overhead from using global ordinals. Note that for performance
reasons you should only force-merge an index that you will never write to
again.
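
Both options can be sketched as requests against the `my_index` index and
`tags` field used earlier on this page. The aggregation name `tag_counts` is an
arbitrary label chosen for this example. The first request asks the `terms`
aggregation to collect buckets from term values directly instead of from global
ordinals:

[source,console]
------------
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "tag_counts": {
      "terms": {
        "field": "tags",
        "execution_hint": "map"
      }
    }
  }
}
------------
// TEST[continued]

The second request force-merges each shard of the index down to a single
segment. As noted above, it should only be run once the index is no longer
being written to:

[source,console]
------------
POST my_index/_forcemerge?max_num_segments=1
------------
// TEST[continued]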