[DOCS] Remove the section about codecs.
This documentation was dangerous because it felt like it was possible to gain substantial performance by just switching the codec of the index. However, non-default codecs are dangerous to use since they are not supported in terms of backward compatibility, and most improvements that they bring have been folded into the default codec anyway (for example, the default codec "pulses" postings lists that contain a single document).
parent e6632ec63e
commit a242a63817
@ -76,8 +76,6 @@ include::index-modules/query-cache.asciidoc[]
 include::index-modules/fielddata.asciidoc[]
 
-include::index-modules/codec.asciidoc[]
-
 include::index-modules/similarity.asciidoc[]
|
|
@ -1,277 +0,0 @@
|
|||
[[index-modules-codec]]
|
||||
== Codec module
|
||||
|
||||
Codecs define how documents are written to disk and read from disk. The
|
||||
postings format is the part of the codec that is responsible for reading
|
||||
and writing the term dictionary, postings lists and positions, as well as the payloads
|
||||
and offsets stored in the postings list. The doc values format is
|
||||
responsible for reading column-stride storage for a field and is typically
|
||||
used for sorting or faceting. When a field doesn't have doc values enabled,
|
||||
it is still possible to sort or facet by loading field values from the
|
||||
inverted index into main memory.
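To make the contrast between the two layouts concrete, here is a minimal Python sketch. It is not Lucene or Elasticsearch code: the toy `inverted_index` dictionary and the `uninvert` helper are hypothetical names chosen for illustration. It shows why sorting a field without doc values means first rebuilding a per-document column in memory from the postings lists.

```python
from collections import defaultdict

# Toy inverted index for one field: term -> list of doc ids (illustrative data).
inverted_index = {
    "apple": [0, 2],
    "banana": [1],
    "cherry": [3],
}


def uninvert(index):
    """Rebuild a doc id -> values mapping by walking every postings list."""
    column = defaultdict(list)
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            column[doc_id].append(term)
    return dict(column)


# Column-stride view: one entry per document, which is what sorting needs.
column = uninvert(inverted_index)
sorted_docs = sorted(column, key=lambda doc_id: column[doc_id][0])
```

With doc values, the `column` structure would already exist on disk, so the uninverting step above would not be needed at load time.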
|
||||
|
||||
Configuring custom postings or doc values formats is an expert feature and
|
||||
the builtin formats will most likely suit your needs, as described
|
||||
in the <<mapping-core-types,mapping section>>.
|
||||
|
||||
[WARNING]
|
||||
Only the default codec, postings format and doc values format are supported:
|
||||
other formats may break backward compatibility between minor versions of
|
||||
Elasticsearch, requiring data to be reindexed.
|
||||
|
||||
|
||||
[float]
|
||||
[[custom-postings]]
|
||||
=== Configuring a custom postings format
|
||||
|
||||
A custom postings format can be defined in the index settings in the
|
||||
`codec` part. The `codec` part can be configured when creating an index
|
||||
or updating index settings. An example of how to define a custom
|
||||
postings format:
|
||||
|
||||
[source,js]
--------------------------------------------------
curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_format" : {
                        "type" : "pulsing",
                        "freq_cut_off" : "5"
                    }
                }
            }
        }
    }
}'
--------------------------------------------------
|
||||
|
||||
Then when defining your mapping you can use the `my_format` name in the
|
||||
`postings_format` option as the example below illustrates:
|
||||
|
||||
[source,js]
--------------------------------------------------
{
    "person" : {
        "properties" : {
            "second_person_id" : {"type" : "string", "postings_format" : "my_format"}
        }
    }
}
--------------------------------------------------
|
||||
|
||||
[float]
|
||||
=== Available postings formats
|
||||
|
||||
[float]
|
||||
[[direct-postings]]
|
||||
==== Direct postings format
|
||||
|
||||
Wraps the default postings format for on-disk storage, but then at read
|
||||
time loads and stores all terms & postings directly in RAM. This
|
||||
postings format makes no effort to compress the terms and posting list
|
||||
and therefore is memory intensive, but because of this it gives a
|
||||
substantial increase in search performance. Because this holds all term
|
||||
bytes as a single byte[], you cannot have more than 2.1GB worth of terms
|
||||
in a single segment.
|
||||
|
||||
This postings format offers the following parameters:
|
||||
|
||||
`min_skip_count`::
|
||||
The minimum number of terms with a shared prefix to
|
||||
allow a skip pointer to be written. The default is *8*.
|
||||
|
||||
`low_freq_cutoff`::
|
||||
Terms with a lower document frequency use a
|
||||
single array object representation for postings and positions. The
|
||||
default is *32*.
|
||||
|
||||
Type name: `direct`
|
||||
|
||||
[float]
|
||||
[[memory-postings]]
|
||||
==== Memory postings format
|
||||
|
||||
A postings format that stores terms & postings (docs, positions,
|
||||
payloads) in RAM, using an FST. This postings format does write to disk,
|
||||
but loads everything into memory. The memory postings format has the
|
||||
following options:
|
||||
|
||||
`pack_fst`::
|
||||
A boolean option that defines if the in memory structure
|
||||
should be packed once it is built. Packing reduces the size of the
|
||||
data-structure in memory but requires more memory during building.
|
||||
Default is *false*.
|
||||
|
||||
`acceptable_overhead_ratio`::
|
||||
The compression ratio specified as a
|
||||
float, that is used to compress internal structures. Example ratios `0`
|
||||
(Compact, no memory overhead at all, but the returned implementation may
|
||||
be slow), `0.5` (Fast, at most 50% memory overhead, always select a
|
||||
reasonably fast implementation), `7` (Fastest, at most 700% memory
|
||||
overhead, no compression). Default is `0.2`.
|
||||
|
||||
Type name: `memory`
|
||||
|
||||
[float]
|
||||
[[bloom-postings]]
|
||||
==== Bloom filter postings format
|
||||
|
||||
The bloom filter postings format wraps a delegate postings format and on
|
||||
top of this creates a bloom filter that is written to disk. During
|
||||
opening this bloom filter is loaded into memory and used to offer
|
||||
"fast-fail" reads. This postings format is useful for low doc-frequency
|
||||
fields such as primary keys. The bloom filter postings format has the
|
||||
following options:
|
||||
|
||||
`delegate`::
|
||||
The name of the configured postings format that the
|
||||
bloom filter postings format will wrap.
|
||||
|
||||
`fpp`::
|
||||
The desired false positive probability specified as a
|
||||
floating point number between 0 and 1.0. The `fpp` can be configured for
|
||||
multiple expected insertions. Example expression: *10k=0.01,1m=0.03*. If
|
||||
the number of docs per index segment is larger than *1m* then use *0.03* as fpp
|
||||
and if the number of docs per segment is larger than *10k* use *0.01* as
|
||||
fpp. The last fallback value is always *0.03*. This example expression
|
||||
is also the default.
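One way to read the `fpp` expression is sketched below in Python. This is an interpretation of the description above, not the actual Elasticsearch parser: the helper names `parse_count`, `parse_fpp_expression` and `select_fpp` are hypothetical, and the rule "use the value of the largest threshold the segment exceeds, else fall back to 0.03" is an assumption drawn from the wording of the option.

```python
def parse_count(text):
    """Convert a count such as '10k' or '1m' to an integer."""
    suffixes = {"k": 1_000, "m": 1_000_000}
    text = text.strip().lower()
    if text and text[-1] in suffixes:
        return int(float(text[:-1]) * suffixes[text[-1]])
    return int(text)


def parse_fpp_expression(expression):
    """Parse '10k=0.01,1m=0.03' into sorted (threshold, fpp) pairs."""
    pairs = []
    for part in expression.split(","):
        count, fpp = part.split("=")
        pairs.append((parse_count(count), float(fpp)))
    return sorted(pairs)


def select_fpp(pairs, docs_in_segment, fallback=0.03):
    """Pick the fpp of the largest threshold the segment exceeds (assumed rule)."""
    chosen = fallback
    for threshold, fpp in pairs:
        if docs_in_segment > threshold:
            chosen = fpp
    return chosen


pairs = parse_fpp_expression("10k=0.01,1m=0.03")
```

Under these assumptions, a segment with 50,000 docs would get `0.01`, while a segment with 2 million docs, or one with fewer than 10,000 docs, would get `0.03`.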
|
||||
|
||||
Type name: `bloom`
|
||||
|
||||
[[codec-bloom-load]]
|
||||
[TIP]
|
||||
==================================================
|
||||
|
||||
As of 1.4, the bloom filters are no longer loaded at search time by
|
||||
default: they consume RAM in proportion to the number of unique terms,
|
||||
which can quickly add up for certain use cases, and separate
|
||||
performance improvements have made the performance gains with bloom
|
||||
filters very small.
|
||||
|
||||
You can enable loading of the bloom filter at search time on a
|
||||
per-index basis by updating the index settings:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
PUT /old_index/_settings?index.codec.bloom.load=true
|
||||
--------------------------------------------------
|
||||
|
||||
This setting, which defaults to `false`, can be updated on a live index. Note,
|
||||
however, that changing the value will cause the index to be reopened, which
|
||||
will invalidate any existing caches.
|
||||
|
||||
==================================================
|
||||
|
||||
[float]
|
||||
[[pulsing-postings]]
|
||||
==== Pulsing postings format
|
||||
|
||||
The pulsing implementation in-lines the posting lists for very low
|
||||
frequent terms in the term dictionary. This is useful to improve lookup
|
||||
performance for low-frequent terms. This postings format offers the
|
||||
following parameters:
|
||||
|
||||
`min_block_size`::
|
||||
The minimum block size the default Lucene term
|
||||
dictionary uses to encode on-disk blocks. Defaults to *25*.
|
||||
|
||||
`max_block_size`::
|
||||
The maximum block size the default Lucene term
|
||||
dictionary uses to encode on-disk blocks. Defaults to *48*.
|
||||
|
||||
`freq_cut_off`::
|
||||
The document frequency cut off where pulsing
|
||||
in-lines posting lists into the term dictionary. Terms with a document
|
||||
frequency less than or equal to the cutoff will be in-lined. The default is
|
||||
*1*.
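The idea behind pulsing can be sketched in a few lines of Python. This is a toy model, not the Lucene implementation: `build_term_dictionary`, `lookup` and the in-memory `postings_file` list are hypothetical stand-ins, and only the `freq_cut_off` name comes from the option above. Terms at or below the cutoff keep their doc ids next to the term, so a lookup needs no second hop to the postings storage.

```python
def build_term_dictionary(postings, freq_cut_off=1):
    """Inline short postings lists; store only a pointer for the rest."""
    term_dict = {}
    postings_file = []  # stands in for the separate on-disk postings
    for term, doc_ids in sorted(postings.items()):
        if len(doc_ids) <= freq_cut_off:
            # Low-frequency term: keep the doc ids next to the term itself.
            term_dict[term] = ("inline", list(doc_ids))
        else:
            # Frequent term: record where its postings list is stored.
            term_dict[term] = ("pointer", len(postings_file))
            postings_file.append(list(doc_ids))
    return term_dict, postings_file


def lookup(term, term_dict, postings_file):
    """Return the doc ids for a term, following the pointer if needed."""
    kind, value = term_dict[term]
    return value if kind == "inline" else postings_file[value]


term_dict, postings_file = build_term_dictionary(
    {"user-42": [7], "the": [0, 1, 2, 3]}, freq_cut_off=1
)
```

Here the primary-key-like term `user-42` is resolved from the dictionary entry alone, while `the` still requires reading its separate postings list.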
|
||||
|
||||
Type name: `pulsing`
|
||||
|
||||
[float]
|
||||
[[default-postings]]
|
||||
==== Default postings format
|
||||
|
||||
The default postings format has the following options:
|
||||
|
||||
`min_block_size`::
|
||||
The minimum block size the default Lucene term
|
||||
dictionary uses to encode on-disk blocks. Defaults to *25*.
|
||||
|
||||
`max_block_size`::
|
||||
The maximum block size the default Lucene term
|
||||
dictionary uses to encode on-disk blocks. Defaults to *48*.
|
||||
|
||||
Type name: `default`
|
||||
|
||||
[float]
|
||||
=== Configuring a custom doc values format
|
||||
|
||||
A custom doc values format can be defined in the index settings in the
|
||||
`codec` part. The `codec` part can be configured when creating an index
|
||||
or updating index settings. An example of how to define a custom
|
||||
doc values format:
|
||||
|
||||
[source,js]
--------------------------------------------------
curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "codec" : {
                "doc_values_format" : {
                    "my_format" : {
                        "type" : "disk"
                    }
                }
            }
        }
    }
}'
--------------------------------------------------
|
||||
|
||||
Then when defining your mapping you can use the `my_format` name in the
|
||||
`doc_values_format` option as the example below illustrates:
|
||||
|
||||
[source,js]
--------------------------------------------------
{
    "product" : {
        "properties" : {
            "price" : {"type" : "integer", "doc_values_format" : "my_format"}
        }
    }
}
--------------------------------------------------
|
||||
|
||||
[float]
|
||||
=== Available doc values formats
|
||||
|
||||
[float]
|
||||
==== Memory doc values format
|
||||
|
||||
A doc values format that stores all values in a FST in RAM. This format does
|
||||
write to disk but the whole data-structure is loaded into memory when reading
|
||||
the index. The memory doc values format has no options.
|
||||
|
||||
Type name: `memory`
|
||||
|
||||
[float]
|
||||
==== Disk doc values format
|
||||
|
||||
A doc values format that stores and reads everything from disk. Using it is
generally not a good idea: it saves very little memory compared
to the default doc values format and can be significantly slower.
The disk doc values format has no options.
|
||||
|
||||
Type name: `disk`
|
||||
|
||||
[float]
|
||||
==== Default doc values format
|
||||
|
||||
The default doc values format tries to make a good compromise between speed and
|
||||
memory usage by only loading into memory data-structures that matter for
|
||||
performance. This makes this doc values format a good fit for most use-cases.
|
||||
The default doc values format has no options.
|
||||
|
||||
Type name: `default`
|
|
@ -57,12 +57,9 @@ settings API:
 `index.index_concurrency`::
 Defaults to `8`.
 
-`index.codec`::
-Codec. Default to `default`.
-
 `index.codec.bloom.load`::
-Whether to load the bloom filter. Defaults to `true`.
-See <<bloom-postings>>.
+Whether to load the bloom filter. Defaults to `false`.
+See <<codec-bloom-load>>.
 
 `index.fail_on_merge_failure`::
 Default to `true`.
|
@ -225,3 +222,35 @@ curl -XPUT 'localhost:9200/myindex/_settings' -d '{
|
|||
|
||||
curl -XPOST 'localhost:9200/myindex/_open'
|
||||
--------------------------------------------------
|
||||
|
||||
[float]
|
||||
[[codec-bloom-load]]
|
||||
=== Bloom filters
|
||||
|
||||
Up to version 1.3, Elasticsearch used to generate bloom filters for the `_uid`
|
||||
field at indexing time and to load them at search time in order to speed up
|
||||
primary-key lookups by saving disk seeks.
|
||||
|
||||
As of 1.4, bloom filters are still generated at indexing time, but they are
|
||||
no longer loaded at search time by default: they consume RAM in proportion to
|
||||
the number of unique terms, which can quickly add up for certain use cases,
|
||||
and separate performance improvements have made the performance gains with
|
||||
bloom filters very small.
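As a rough illustration of why a bloom filter can save disk seeks on primary-key lookups, and why it costs RAM in proportion to the number of terms, consider the following Python sketch. It is a generic bloom filter, not the one Elasticsearch or Lucene uses: the `BloomFilter` class, its `might_contain` method, the bit-array size and the SHA-256-based hashing are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key is definitely absent, so the segment can be skipped.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


segment_filter = BloomFilter()
segment_filter.add("tweet#1")
```

A lookup would call `might_contain` first and only consult the term dictionary of a segment when it returns `True`. Keeping the false positive rate low requires more bits per unique term, which is the memory cost described above.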
|
||||
|
||||
[TIP]
|
||||
==================================================
|
||||
|
||||
You can enable loading of the bloom filter at search time on a
|
||||
per-index basis by updating the index settings:
|
||||
|
||||
[source,js]
|
||||
--------------------------------------------------
|
||||
PUT /old_index/_settings?index.codec.bloom.load=true
|
||||
--------------------------------------------------
|
||||
|
||||
This setting, which defaults to `false`, can be updated on a live index. Note,
|
||||
however, that changing the value will cause the index to be reopened, which
|
||||
will invalidate any existing caches.
|
||||
|
||||
==================================================
|
||||
|
||||
|
|
|
@ -511,116 +511,6 @@ effect the next time the fielddata for a segment is loaded. Use the
|
|||
<<indices-clearcache,Clear Cache>> API
|
||||
to reload the fielddata using the new filters.
|
||||
|
||||
[float]
|
||||
[[postings]]
|
||||
==== Postings format
|
||||
|
||||
Postings formats define how fields are written into the index and how
|
||||
fields are represented in memory. Postings formats can be defined per
|
||||
field via the `postings_format` option. Postings formats are configurable.
|
||||
Elasticsearch has several builtin formats:
|
||||
|
||||
`direct`::
|
||||
A postings format that uses disk-based storage but loads
|
||||
its terms and postings directly into memory. Note this postings format
|
||||
is very memory intensive and has certain limitations that don't allow
|
||||
segments to grow beyond 2.1GB; see `DirectPostingsFormat` for
|
||||
details.
|
||||
|
||||
`memory`::
|
||||
A postings format that stores its entire terms, postings,
|
||||
positions and payloads in a finite state transducer. This format should
|
||||
only be used for primary keys or with fields where each term is
|
||||
contained in a very low number of documents.
|
||||
|
||||
`pulsing`::
|
||||
A postings format that in-lines the posting lists for very low
|
||||
frequent terms in the term dictionary. This is useful to improve lookup
|
||||
performance for low-frequent terms.
|
||||
|
||||
`bloom_default`::
|
||||
A postings format that uses a bloom filter to
|
||||
improve term lookup performance. This is useful for primary keys or
|
||||
fields that are used as a delete key.
|
||||
|
||||
`bloom_pulsing`::
|
||||
A postings format that combines the advantages of
|
||||
*bloom* and *pulsing* to further improve lookup performance.
|
||||
|
||||
`default`::
|
||||
The default Elasticsearch postings format offering best
|
||||
general purpose performance. This format is used if no postings format
|
||||
is specified in the field mapping.
|
||||
|
||||
[float]
|
||||
===== Postings format example
|
||||
|
||||
On all field types it is possible to configure a `postings_format`
|
||||
attribute:
|
||||
|
||||
[source,js]
--------------------------------------------------
{
    "person" : {
        "properties" : {
            "second_person_id" : {"type" : "string", "postings_format" : "pulsing"}
        }
    }
}
--------------------------------------------------
|
||||
|
||||
On top of using the built-in postings formats it is possible to define
|
||||
custom postings formats. See
|
||||
<<index-modules-codec,codec module>> for more
|
||||
information.
|
||||
|
||||
[float]
|
||||
==== Doc values format
|
||||
|
||||
Doc values formats define how fields are written into column-stride storage in
|
||||
the index for the purpose of sorting or faceting. Fields that have doc values
|
||||
enabled will have special field data instances, which will not be uninverted
|
||||
from the inverted index, but directly read from disk. This makes _refresh faster
|
||||
and ultimately allows for having field data stored on disk depending on the
|
||||
configured doc values format.
|
||||
|
||||
Doc values formats are configurable. Elasticsearch has several builtin formats:
|
||||
|
||||
`memory`::
|
||||
A doc values format which stores data in memory. Compared to the default
|
||||
field data implementations, using doc values with this format will have
|
||||
similar performance but will be faster to load, making '_refresh' less
|
||||
time-consuming.
|
||||
|
||||
`disk`::
|
||||
A doc values format which stores all data on disk, requiring almost no
|
||||
memory from the JVM at the cost of a slight performance degradation.
|
||||
|
||||
`default`::
|
||||
The default Elasticsearch doc values format, offering good performance
|
||||
with low memory usage. This format is used if no format is specified in
|
||||
the field mapping.
|
||||
|
||||
[float]
|
||||
===== Doc values format example
|
||||
|
||||
On all field types, it is possible to configure a `doc_values_format` attribute:
|
||||
|
||||
[source,js]
--------------------------------------------------
{
    "product" : {
        "properties" : {
            "price" : {"type" : "integer", "doc_values_format" : "memory"}
        }
    }
}
--------------------------------------------------
|
||||
|
||||
On top of using the built-in doc values formats it is possible to define
|
||||
custom doc values formats. See
|
||||
<<index-modules-codec,codec module>> for more information.
|
||||
|
||||
[float]
|
||||
==== Similarity
|
||||
|
||||
|
|