[[index-modules-codec]]
== Codec module

Codecs define how documents are written to disk and read from disk. The
postings format is the part of the codec that is responsible for reading
and writing the term dictionary, postings lists and positions, payloads
and offsets stored in the postings list. The doc values format is
responsible for reading column-stride storage for a field and is typically
used for sorting or faceting. When a field doesn't have doc values enabled,
it is still possible to sort or facet by loading field values from the
inverted index into main memory.

Configuring custom postings or doc values formats is an expert feature. The
built-in formats, as described in the
<<mapping-core-types,mapping section>>, will most likely suit your needs.

**********************************
Only the default codec, postings format and doc values format are supported:
other formats may break backward compatibility between minor versions of
Elasticsearch, requiring data to be reindexed.
**********************************

[float]
[[custom-postings]]
=== Configuring a custom postings format

A custom postings format can be defined in the `codec` part of the index
settings. The `codec` part can be configured when creating an index
or updating index settings. The following example defines a custom
postings format named `my_format`:

[source,js]
--------------------------------------------------
curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_format" : {
                        "type" : "pulsing",
                        "freq_cut_off" : "5"
                    }
                }
            }
        }
    }
}'
--------------------------------------------------

Then, when defining your mapping, you can use the `my_format` name in the
`postings_format` option, as the example below illustrates:

[source,js]
--------------------------------------------------
{
    "person" : {
        "properties" : {
            "second_person_id" : {"type" : "string", "postings_format" : "my_format"}
        }
    }
}
--------------------------------------------------

[float]
=== Available postings formats

[float]
[[direct-postings]]
==== Direct postings format

Wraps the default postings format for on-disk storage, but at read
time loads and stores all terms and postings directly in RAM. This
postings format makes no effort to compress the terms and postings lists
and is therefore memory intensive, but in exchange it gives a
substantial increase in search performance. Because it holds all term
bytes as a single `byte[]`, you cannot have more than 2.1GB worth of terms
in a single segment.

This postings format offers the following parameters:

`min_skip_count`::
    The minimum number of terms with a shared prefix required to
    allow a skip pointer to be written. The default is *8*.

`low_freq_cutoff`::
    Terms with a lower document frequency use a
    single array object representation for postings and positions. The
    default is *32*.

Type name: `direct`
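
As a minimal sketch, the parameters above can be placed in the same `codec`
settings structure used when configuring a custom postings format. The index
name `twitter` and the format name `my_direct_format` are illustrative
assumptions, and the values shown are simply the documented defaults:

[source,js]
--------------------------------------------------
curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_direct_format" : {
                        "type" : "direct",
                        "min_skip_count" : "8",
                        "low_freq_cutoff" : "32"
                    }
                }
            }
        }
    }
}'
--------------------------------------------------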

[float]
[[memory-postings]]
==== Memory postings format

A postings format that stores terms and postings (docs, positions,
payloads) in RAM, using an FST. This postings format does write to disk,
but loads everything into memory. The memory postings format has the
following options:

`pack_fst`::
    A boolean option that defines whether the in-memory structure
    should be packed once it is built. Packing reduces the size of the
    data structure in memory but requires more memory while building.
    Default is *false*.

`acceptable_overhead_ratio`::
    The compression ratio, specified as a
    float, that is used to compress internal structures. Example ratios: `0`
    (Compact, no memory overhead at all, but the returned implementation may
    be slow), `0.5` (Fast, at most 50% memory overhead, always select a
    reasonably fast implementation), `7` (Fastest, at most 700% memory
    overhead, no compression). Default is `0.2`.

Type name: `memory`
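
The options above might be combined as in the following sketch, which
trades extra memory while building for a smaller in-memory structure. The
index name `twitter` and the format name `my_memory_format` are assumptions
made for illustration, and the chosen values are examples rather than
recommendations:

[source,js]
--------------------------------------------------
curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_memory_format" : {
                        "type" : "memory",
                        "pack_fst" : "true",
                        "acceptable_overhead_ratio" : "0.5"
                    }
                }
            }
        }
    }
}'
--------------------------------------------------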

[float]
[[bloom-postings]]
==== Bloom filter postings format

The bloom filter postings format wraps a delegate postings format and on
top of this creates a bloom filter that is written to disk. When the index
is opened, this bloom filter is loaded into memory and used to offer
"fast-fail" reads. This postings format is useful for low doc-frequency
fields such as primary keys. The bloom filter postings format has the
following options:

`delegate`::
    The name of the configured postings format that the
    bloom filter postings format will wrap.

`fpp`::
    The desired false positive probability, specified as a
    floating point number between 0 and 1.0. The `fpp` can be configured for
    multiple expected insertions. Example expression: *10k=0.01,1m=0.03*. If
    the number of docs per index segment is larger than *1m*, *0.03* is used
    as the fpp, and if the number of docs per segment is larger than *10k*,
    *0.01* is used as the fpp. The last fallback value is always *0.03*.
    This example expression is also the default.

Type name: `bloom`
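
A bloom filter postings format could be configured roughly as follows. This
is a sketch under assumptions: the index name `twitter` and the format name
`my_bloom_format` are illustrative, and it assumes that the `default` type
name is accepted as a `delegate`. If it is not, define a custom postings
format first and reference its name instead. The `fpp` value is the default
expression described above:

[source,js]
--------------------------------------------------
curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_bloom_format" : {
                        "type" : "bloom",
                        "delegate" : "default",
                        "fpp" : "10k=0.01,1m=0.03"
                    }
                }
            }
        }
    }
}'
--------------------------------------------------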

[[codec-bloom-load]]
[TIP]
==================================================

It can sometimes make sense to disable bloom filters. For instance, if you are
logging into an index per day, and you have thousands of indices, the bloom
filters can take up a sizable amount of memory. For most queries you are only
interested in recent indices, so you don't mind queries on older indices
taking slightly longer.

In these cases you can disable loading of the bloom filter on a per-index
basis by updating the index settings:

[source,js]
--------------------------------------------------
PUT /old_index/_settings?index.codec.bloom.load=false
--------------------------------------------------

This setting, which defaults to `true`, can be updated on a live index. Note,
however, that changing the value will cause the index to be reopened, which
will invalidate any existing caches.

==================================================

[float]
[[pulsing-postings]]
==== Pulsing postings format

The pulsing implementation in-lines the postings lists for very
low-frequency terms in the term dictionary. This is useful to improve lookup
performance for low-frequency terms. This postings format offers the
following parameters:

`min_block_size`::
    The minimum block size the default Lucene term
    dictionary uses to encode on-disk blocks. Defaults to *25*.

`max_block_size`::
    The maximum block size the default Lucene term
    dictionary uses to encode on-disk blocks. Defaults to *48*.

`freq_cut_off`::
    The document frequency cutoff at or below which pulsing
    in-lines postings lists into the term dictionary. Terms with a document
    frequency less than or equal to the cutoff will be in-lined. The default
    is *1*.

Type name: `pulsing`

[float]
[[default-postings]]
==== Default postings format

The default postings format has the following options:

`min_block_size`::
    The minimum block size the default Lucene term
    dictionary uses to encode on-disk blocks. Defaults to *25*.

`max_block_size`::
    The maximum block size the default Lucene term
    dictionary uses to encode on-disk blocks. Defaults to *48*.

Type name: `default`
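
To illustrate, the block size options can be set on a named format of type
`default`, using the same `codec` settings structure as the other postings
formats. The index name `twitter` and the format name `my_default_format` are
hypothetical, and the values shown are the documented defaults:

[source,js]
--------------------------------------------------
curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "codec" : {
                "postings_format" : {
                    "my_default_format" : {
                        "type" : "default",
                        "min_block_size" : "25",
                        "max_block_size" : "48"
                    }
                }
            }
        }
    }
}'
--------------------------------------------------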

[float]
=== Configuring a custom doc values format

A custom doc values format can be defined in the `codec` part of the index
settings. The `codec` part can be configured when creating an index
or updating index settings. The following example defines a custom
doc values format named `my_format`:

[source,js]
--------------------------------------------------
curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "codec" : {
                "doc_values_format" : {
                    "my_format" : {
                        "type" : "disk"
                    }
                }
            }
        }
    }
}'
--------------------------------------------------

Then, when defining your mapping, you can use the `my_format` name in the
`doc_values_format` option, as the example below illustrates:

[source,js]
--------------------------------------------------
{
    "product" : {
        "properties" : {
            "price" : {"type" : "integer", "doc_values_format" : "my_format"}
        }
    }
}
--------------------------------------------------

[float]
=== Available doc values formats

[float]
==== Memory doc values format

A doc values format that stores all values in an FST in RAM. This format does
write to disk, but the whole data structure is loaded into memory when reading
the index. The memory doc values format has no options.

Type name: `memory`

[float]
==== Disk doc values format

A doc values format that stores and reads everything from disk. Although it may
be slightly slower than the default doc values format, this doc values format
will require almost no memory from the JVM. The disk doc values format has no
options.

Type name: `disk`

[float]
==== Default doc values format

The default doc values format tries to make a good compromise between speed and
memory usage by only loading into memory data-structures that matter for
performance. This makes this doc values format a good fit for most use-cases.
The default doc values format has no options.

Type name: `default`