OpenSearch/docs/reference/index-modules/codec.asciidoc

[[index-modules-codec]]
== Codec module

Codecs define how documents are written to disk and read from disk. The
postings format is the part of the codec that responsible for reading
and writing the term dictionary, postings lists and positions, payloads
and offsets stored in the postings list. The doc values format is
responsible for reading column-stride storage for a field and is typically
used for sorting or faceting. When a field doesn't have doc values enabled,
it is still possible to sort or facet by loading field values from the
inverted index into main memory.

Configuring custom postings or doc values formats is an expert feature and
most likely using the builtin formats will suit your needs as is described
in the <<mapping-core-types,mapping section>>.

**********************************
Only the default codec, postings format and doc values format are supported:
other formats may break backward compatibility between minor versions of
Elasticsearch, requiring data to be reindexed.
**********************************


[float]
[[custom-postings]]
=== Configuring a custom postings format

Custom postings format can be defined in the index settings in the
`codec` part. The `codec` part can be configured when creating an index
or updating index settings. An example on how to define your custom
postings format:

[source,js]
--------------------------------------------------
curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "codec" : {
          "postings_format" : {
             "my_format" : {
                "type" : "pulsing",
                "freq_cut_off" : "5"
             } 
          }
       }
        }
    }
}'
--------------------------------------------------

Then we defining your mapping your can use the `my_format` name in the
`postings_format` option as the example below illustrates:

[source,js]
--------------------------------------------------
{
  "person" : {
     "properties" : {
         "second_person_id" : {"type" : "string", "postings_format" : "my_format"}
     }
  }
}
--------------------------------------------------

[float]
=== Available postings formats

[float]
[[direct-postings]]
==== Direct postings format

Wraps the default postings format for on-disk storage, but then at read
time loads and stores all terms & postings directly in RAM. This
postings format makes no effort to compress the terms and posting list
and therefore is memory intensive, but because of this it gives a
substantial increase in search performance. Because this holds all term
bytes as a single byte[], you cannot have more than 2.1GB worth of terms
in a single segment.

This postings format offers the following parameters: 

`min_skip_count`:: 
    The minimum number terms with a shared prefix to
    allow a skip pointer to be written. The default is *8*. 

`low_freq_cutoff`:: 
    Terms with a lower document frequency use a
    single array object representation for postings and positions. The
    default is *32*.

Type name: `direct`

[float]
[[memory-postings]]
==== Memory postings format

A postings format that stores terms & postings (docs, positions,
payloads) in RAM, using an FST. This postings format does write to disk,
but loads everything into memory. The memory postings format has the
following options: 

`pack_fst`:: 
    A boolean option that defines if the in memory structure
    should be packed once its build. Packed will reduce the size for the
    data-structure in memory but requires more memory during building.
    Default is *false*.

`acceptable_overhead_ratio`:: 
    The compression ratio specified as a
    float, that is used to compress internal structures. Example ratios `0`
    (Compact, no memory overhead at all, but the returned implementation may
    be slow), `0.5` (Fast, at most 50% memory overhead, always select a
    reasonably fast implementation), `7` (Fastest, at most 700% memory
    overhead, no compression). Default is `0.2`.

Type name: `memory`

[float]
[[bloom-postings]]
==== Bloom filter posting format

The bloom filter postings format wraps a delegate postings format and on
top of this creates a bloom filter that is written to disk. During
opening this bloom filter is loaded into memory and used to offer
"fast-fail" reads. This postings format is useful for low doc-frequency
fields such as primary keys. The bloom filter postings format has the
following options: 

`delegate`:: 
    The name of the configured postings format that the
    bloom filter postings format will wrap. 

`fpp`:: 
    The desired false positive probability specified as a
    floating point number between 0 and 1.0. The `fpp` can be configured for
    multiple expected insertions. Example expression: *10k=0.01,1m=0.03*. If
    number docs per index segment is larger than *1m* then use *0.03* as fpp
    and if number of docs per segment is larger than *10k* use *0.01* as
    fpp. The last fallback value is always *0.03*. This example expression
    is also the default.

Type name: `bloom`

[float]
[[pulsing-postings]]
==== Pulsing postings format

The pulsing implementation in-lines the posting lists for very low
frequent terms in the term dictionary. This is useful to improve lookup
performance for low-frequent terms. This postings format offers the
following parameters: 

`min_block_size`:: 
    The minimum block size the default Lucene term
    dictionary uses to encode on-disk blocks. Defaults to *25*. 

`max_block_size`:: 
    The maximum block size the default Lucene term
    dictionary uses to encode on-disk blocks. Defaults to *48*. 

`freq_cut_off`:: 
    The document frequency cut off where pulsing
    in-lines posting lists into the term dictionary. Terms with a document
    frequency less or equal to the cutoff will be in-lined. The default is
    *1*.

Type name: `pulsing`

[float]
[[default-postings]]
==== Default postings format

The default postings format has the following options: 

`min_block_size`:: 
    The minimum block size the default Lucene term
    dictionary uses to encode on-disk blocks. Defaults to *25*. 

`max_block_size`::
    The maximum block size the default Lucene term
    dictionary uses to encode on-disk blocks. Defaults to *48*.

Type name: `default`

[float]
=== Configuring a custom doc values format

Custom doc values format can be defined in the index settings in the
`codec` part. The `codec` part can be configured when creating an index
or updating index settings. An example on how to define your custom
doc values format:

[source,js]
--------------------------------------------------
curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "settings" : {
        "index" : {
            "codec" : {
                "doc_values_format" : {
                    "my_format" : {
                        "type" : "disk"
                    }
                }
            }
        }
    }
}'
--------------------------------------------------

Then we defining your mapping your can use the `my_format` name in the
`doc_values_format` option as the example below illustrates:

[source,js]
--------------------------------------------------
{
  "product" : {
     "properties" : {
         "price" : {"type" : "integer", "doc_values_format" : "my_format"}
     }
  }
}
--------------------------------------------------

[float]
=== Available doc values formats

[float]
==== Memory doc values format

A doc values format that stores all values in a FST in RAM. This format does
write to disk but the whole data-structure is loaded into memory when reading
the index. The memory postings format has no options.

Type name: `memory`

[float]
==== Disk doc values format

A doc values format that stores and reads everything from disk. Although it may
be slightly slower than the default doc values format, this doc values format
will require almost no memory from the JVM. The disk doc values format has no
options.

Type name: `disk`

[float]
==== Default doc values format

The default doc values format tries to make a good compromise between speed and
memory usage by only loading into memory data-structures that matter for
performance. This makes this doc values format a good fit for most use-cases.
The default doc values format has no options.

Type name: `default`
Migrated documentation into the main repo 2013-08-28 19:24:34 -04:00			`[[index-modules-codec]]`
			`== Codec module`

			`Codecs define how documents are written to disk and read from disk. The`
			`postings format is the part of the codec that responsible for reading`
			`and writing the term dictionary, postings lists and positions, payloads`
Doc values integration. This commit allows for using Lucene doc values as a backend for field data, moving the cost of building field data from the refresh operation to indexing. In addition, Lucene doc values can be stored on disk (partially, or even entirely), so that memory management is done at the operating system level (file-system cache) instead of the JVM, avoiding long pauses during major collections due to large heaps. So far doc values are supported on numeric types and non-analyzed strings (index:no or index:not_analyzed). Under the hood, it uses SORTED_SET doc values which is the only type to support multi-valued fields. Since the field data API set is a bit wider than the doc values API set, some operations are not supported: - field data filtering: this will fail if doc values are enabled, - field data cache clearing, even for memory-based doc values formats, - getting the memory usage for a specific field, - knowing whether a field is actually multi-valued. This commit also allows for configuring doc-values formats on a per-field basis similarly to postings formats. In particular the doc values format of the _version field can be configured through its own field mapper (it used to be handled in UidFieldMapper previously). Closes #3806 2013-06-12 06:51:51 -04:00			`and offsets stored in the postings list. The doc values format is`
			`responsible for reading column-stride storage for a field and is typically`
			`used for sorting or faceting. When a field doesn't have doc values enabled,`
			`it is still possible to sort or facet by loading field values from the`
			`inverted index into main memory.`

			`Configuring custom postings or doc values formats is an expert feature and`
			`most likely using the builtin formats will suit your needs as is described`
Add clear warnings that only the default codec, postings format and doc values format have backward compatibility warranties. 2013-10-10 07:26:31 -04:00			`in the <<mapping-core-types,mapping section>>.`

			`**********************************`
			`Only the default codec, postings format and doc values format are supported:`
			`other formats may break backward compatibility between minor versions of`
			`Elasticsearch, requiring data to be reindexed.`
			`**********************************`

Migrated documentation into the main repo 2013-08-28 19:24:34 -04:00
			`[float]`
Uniquify anchor links to fix asciidoc/docbook generation 2013-09-30 17:32:00 -04:00			`[[custom-postings]]`
Migrated documentation into the main repo 2013-08-28 19:24:34 -04:00			`=== Configuring a custom postings format`

			`Custom postings format can be defined in the index settings in the`
			`codec` part. The `codec` part can be configured when creating an index
			`or updating index settings. An example on how to define your custom`
			`postings format:`

			`[source,js]`
			`--------------------------------------------------`
			`curl -XPUT 'http://localhost:9200/twitter/' -d '{`
			`"settings" : {`
			`"index" : {`
			`"codec" : {`
			`"postings_format" : {`
			`"my_format" : {`
			`"type" : "pulsing",`
			`"freq_cut_off" : "5"`
			`}`
			`}`
			`}`
			`}`
			`}`
			`}'`
			`--------------------------------------------------`

			Then we defining your mapping your can use the `my_format` name in the
			`postings_format` option as the example below illustrates:

			`[source,js]`
			`--------------------------------------------------`
			`{`
			`"person" : {`
			`"properties" : {`
			`"second_person_id" : {"type" : "string", "postings_format" : "my_format"}`
			`}`
			`}`
			`}`
			`--------------------------------------------------`

			`[float]`
			`=== Available postings formats`

			`[float]`
Add more anchor links to documentation Related to #3679 2013-09-25 12:17:40 -04:00			`[[direct-postings]]`
Migrated documentation into the main repo 2013-08-28 19:24:34 -04:00			`==== Direct postings format`

			`Wraps the default postings format for on-disk storage, but then at read`
			`time loads and stores all terms & postings directly in RAM. This`
			`postings format makes no effort to compress the terms and posting list`
			`and therefore is memory intensive, but because of this it gives a`
			`substantial increase in search performance. Because this holds all term`
			`bytes as a single byte[], you cannot have more than 2.1GB worth of terms`
			`in a single segment.`

			`This postings format offers the following parameters:`

			`min_skip_count`::
			`The minimum number terms with a shared prefix to`
			`allow a skip pointer to be written. The default is 8.`

			`low_freq_cutoff`::
			`Terms with a lower document frequency use a`
			`single array object representation for postings and positions. The`
			`default is 32.`

			Type name: `direct`

			`[float]`
Add more anchor links to documentation Related to #3679 2013-09-25 12:17:40 -04:00			`[[memory-postings]]`
Migrated documentation into the main repo 2013-08-28 19:24:34 -04:00			`==== Memory postings format`

			`A postings format that stores terms & postings (docs, positions,`
			`payloads) in RAM, using an FST. This postings format does write to disk,`
			`but loads everything into memory. The memory postings format has the`
			`following options:`

			`pack_fst`::
			`A boolean option that defines if the in memory structure`
			`should be packed once its build. Packed will reduce the size for the`
			`data-structure in memory but requires more memory during building.`
			`Default is false.`

			`acceptable_overhead_ratio`::
			`The compression ratio specified as a`
			float, that is used to compress internal structures. Example ratios `0`
			`(Compact, no memory overhead at all, but the returned implementation may`
			be slow), `0.5` (Fast, at most 50% memory overhead, always select a
			reasonably fast implementation), `7` (Fastest, at most 700% memory
			overhead, no compression). Default is `0.2`.

			Type name: `memory`

			`[float]`
Add more anchor links to documentation Related to #3679 2013-09-25 12:17:40 -04:00			`[[bloom-postings]]`
Migrated documentation into the main repo 2013-08-28 19:24:34 -04:00			`==== Bloom filter posting format`

			`The bloom filter postings format wraps a delegate postings format and on`
			`top of this creates a bloom filter that is written to disk. During`
			`opening this bloom filter is loaded into memory and used to offer`
			`"fast-fail" reads. This postings format is useful for low doc-frequency`
			`fields such as primary keys. The bloom filter postings format has the`
			`following options:`

			`delegate`::
			`The name of the configured postings format that the`
			`bloom filter postings format will wrap.`

			`fpp`::
			`The desired false positive probability specified as a`
			floating point number between 0 and 1.0. The `fpp` can be configured for
			`multiple expected insertions. Example expression: 10k=0.01,1m=0.03. If`
			`number docs per index segment is larger than 1m then use 0.03 as fpp`
			`and if number of docs per segment is larger than 10k use 0.01 as`
			`fpp. The last fallback value is always 0.03. This example expression`
			`is also the default.`

			Type name: `bloom`

			`[float]`
Add more anchor links to documentation Related to #3679 2013-09-25 12:17:40 -04:00			`[[pulsing-postings]]`
Migrated documentation into the main repo 2013-08-28 19:24:34 -04:00			`==== Pulsing postings format`

			`The pulsing implementation in-lines the posting lists for very low`
			`frequent terms in the term dictionary. This is useful to improve lookup`
			`performance for low-frequent terms. This postings format offers the`
			`following parameters:`

			`min_block_size`::
			`The minimum block size the default Lucene term`
			`dictionary uses to encode on-disk blocks. Defaults to 25.`

			`max_block_size`::
			`The maximum block size the default Lucene term`
			`dictionary uses to encode on-disk blocks. Defaults to 48.`

			`freq_cut_off`::
			`The document frequency cut off where pulsing`
			`in-lines posting lists into the term dictionary. Terms with a document`
			`frequency less or equal to the cutoff will be in-lined. The default is`
			`1.`

			Type name: `pulsing`

			`[float]`
Add more anchor links to documentation Related to #3679 2013-09-25 12:17:40 -04:00			`[[default-postings]]`
Migrated documentation into the main repo 2013-08-28 19:24:34 -04:00			`==== Default postings format`

			`The default postings format has the following options:`

			`min_block_size`::
			`The minimum block size the default Lucene term`
			`dictionary uses to encode on-disk blocks. Defaults to 25.`

			`max_block_size`::
			`The maximum block size the default Lucene term`
			`dictionary uses to encode on-disk blocks. Defaults to 48.`

			Type name: `default`
Doc values integration. This commit allows for using Lucene doc values as a backend for field data, moving the cost of building field data from the refresh operation to indexing. In addition, Lucene doc values can be stored on disk (partially, or even entirely), so that memory management is done at the operating system level (file-system cache) instead of the JVM, avoiding long pauses during major collections due to large heaps. So far doc values are supported on numeric types and non-analyzed strings (index:no or index:not_analyzed). Under the hood, it uses SORTED_SET doc values which is the only type to support multi-valued fields. Since the field data API set is a bit wider than the doc values API set, some operations are not supported: - field data filtering: this will fail if doc values are enabled, - field data cache clearing, even for memory-based doc values formats, - getting the memory usage for a specific field, - knowing whether a field is actually multi-valued. This commit also allows for configuring doc-values formats on a per-field basis similarly to postings formats. In particular the doc values format of the _version field can be configured through its own field mapper (it used to be handled in UidFieldMapper previously). Closes #3806 2013-06-12 06:51:51 -04:00
			`[float]`
			`=== Configuring a custom doc values format`

			`Custom doc values format can be defined in the index settings in the`
			`codec` part. The `codec` part can be configured when creating an index
			`or updating index settings. An example on how to define your custom`
			`doc values format:`

			`[source,js]`
			`--------------------------------------------------`
			`curl -XPUT 'http://localhost:9200/twitter/' -d '{`
			`"settings" : {`
			`"index" : {`
			`"codec" : {`
			`"doc_values_format" : {`
			`"my_format" : {`
			`"type" : "disk"`
			`}`
			`}`
			`}`
			`}`
			`}`
			`}'`
			`--------------------------------------------------`

			Then we defining your mapping your can use the `my_format` name in the
			`doc_values_format` option as the example below illustrates:

			`[source,js]`
			`--------------------------------------------------`
			`{`
			`"product" : {`
			`"properties" : {`
			`"price" : {"type" : "integer", "doc_values_format" : "my_format"}`
			`}`
			`}`
			`}`
			`--------------------------------------------------`

			`[float]`
			`=== Available doc values formats`

			`[float]`
			`==== Memory doc values format`

			`A doc values format that stores all values in a FST in RAM. This format does`
			`write to disk but the whole data-structure is loaded into memory when reading`
			`the index. The memory postings format has no options.`

			Type name: `memory`

			`[float]`
			`==== Disk doc values format`

			`A doc values format that stores and reads everything from disk. Although it may`
			`be slightly slower than the default doc values format, this doc values format`
			`will require almost no memory from the JVM. The disk doc values format has no`
			`options.`

			Type name: `disk`

			`[float]`
			`==== Default doc values format`

			`The default doc values format tries to make a good compromise between speed and`
			`memory usage by only loading into memory data-structures that matter for`
			`performance. This makes this doc values format a good fit for most use-cases.`
			`The default doc values format has no options.`

			Type name: `default`