2013-08-28 19:24:34 -04:00
|
|
|
[[index-modules-codec]]
|
|
|
|
== Codec module
|
|
|
|
|
|
|
|
Codecs define how documents are written to disk and read from disk. The
|
|
|
|
postings format is the part of the codec that responsible for reading
|
|
|
|
and writing the term dictionary, postings lists and positions, payloads
|
|
|
|
and offsets stored in the postings list.
|
|
|
|
|
|
|
|
Configuring custom postings formats is an expert feature and most likely
|
|
|
|
using the builtin postings formats will suite your needs as is described
|
|
|
|
in the <<mapping-core-types,mapping section>>
|
|
|
|
|
|
|
|
[float]
|
2013-09-25 12:17:40 -04:00
|
|
|
[[postings]]
|
2013-08-28 19:24:34 -04:00
|
|
|
=== Configuring a custom postings format
|
|
|
|
|
|
|
|
Custom postings format can be defined in the index settings in the
|
|
|
|
`codec` part. The `codec` part can be configured when creating an index
|
|
|
|
or updating index settings. An example on how to define your custom
|
|
|
|
postings format:
|
|
|
|
|
|
|
|
[source,js]
|
|
|
|
--------------------------------------------------
|
|
|
|
curl -XPUT 'http://localhost:9200/twitter/' -d '{
|
|
|
|
"settings" : {
|
|
|
|
"index" : {
|
|
|
|
"codec" : {
|
|
|
|
"postings_format" : {
|
|
|
|
"my_format" : {
|
|
|
|
"type" : "pulsing",
|
|
|
|
"freq_cut_off" : "5"
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}'
|
|
|
|
--------------------------------------------------
|
|
|
|
|
|
|
|
Then we defining your mapping your can use the `my_format` name in the
|
|
|
|
`postings_format` option as the example below illustrates:
|
|
|
|
|
|
|
|
[source,js]
|
|
|
|
--------------------------------------------------
|
|
|
|
{
|
|
|
|
"person" : {
|
|
|
|
"properties" : {
|
|
|
|
"second_person_id" : {"type" : "string", "postings_format" : "my_format"}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
--------------------------------------------------
|
|
|
|
|
|
|
|
[float]
|
|
|
|
=== Available postings formats
|
|
|
|
|
|
|
|
[float]
|
2013-09-25 12:17:40 -04:00
|
|
|
[[direct-postings]]
|
2013-08-28 19:24:34 -04:00
|
|
|
==== Direct postings format
|
|
|
|
|
|
|
|
Wraps the default postings format for on-disk storage, but then at read
|
|
|
|
time loads and stores all terms & postings directly in RAM. This
|
|
|
|
postings format makes no effort to compress the terms and posting list
|
|
|
|
and therefore is memory intensive, but because of this it gives a
|
|
|
|
substantial increase in search performance. Because this holds all term
|
|
|
|
bytes as a single byte[], you cannot have more than 2.1GB worth of terms
|
|
|
|
in a single segment.
|
|
|
|
|
|
|
|
This postings format offers the following parameters:
|
|
|
|
|
|
|
|
`min_skip_count`::
|
|
|
|
The minimum number terms with a shared prefix to
|
|
|
|
allow a skip pointer to be written. The default is *8*.
|
|
|
|
|
|
|
|
`low_freq_cutoff`::
|
|
|
|
Terms with a lower document frequency use a
|
|
|
|
single array object representation for postings and positions. The
|
|
|
|
default is *32*.
|
|
|
|
|
|
|
|
Type name: `direct`
|
|
|
|
|
|
|
|
[float]
|
2013-09-25 12:17:40 -04:00
|
|
|
[[memory-postings]]
|
2013-08-28 19:24:34 -04:00
|
|
|
==== Memory postings format
|
|
|
|
|
|
|
|
A postings format that stores terms & postings (docs, positions,
|
|
|
|
payloads) in RAM, using an FST. This postings format does write to disk,
|
|
|
|
but loads everything into memory. The memory postings format has the
|
|
|
|
following options:
|
|
|
|
|
|
|
|
`pack_fst`::
|
|
|
|
A boolean option that defines if the in memory structure
|
|
|
|
should be packed once its build. Packed will reduce the size for the
|
|
|
|
data-structure in memory but requires more memory during building.
|
|
|
|
Default is *false*.
|
|
|
|
|
|
|
|
`acceptable_overhead_ratio`::
|
|
|
|
The compression ratio specified as a
|
|
|
|
float, that is used to compress internal structures. Example ratios `0`
|
|
|
|
(Compact, no memory overhead at all, but the returned implementation may
|
|
|
|
be slow), `0.5` (Fast, at most 50% memory overhead, always select a
|
|
|
|
reasonably fast implementation), `7` (Fastest, at most 700% memory
|
|
|
|
overhead, no compression). Default is `0.2`.
|
|
|
|
|
|
|
|
Type name: `memory`
|
|
|
|
|
|
|
|
[float]
|
2013-09-25 12:17:40 -04:00
|
|
|
[[bloom-postings]]
|
2013-08-28 19:24:34 -04:00
|
|
|
==== Bloom filter posting format
|
|
|
|
|
|
|
|
The bloom filter postings format wraps a delegate postings format and on
|
|
|
|
top of this creates a bloom filter that is written to disk. During
|
|
|
|
opening this bloom filter is loaded into memory and used to offer
|
|
|
|
"fast-fail" reads. This postings format is useful for low doc-frequency
|
|
|
|
fields such as primary keys. The bloom filter postings format has the
|
|
|
|
following options:
|
|
|
|
|
|
|
|
`delegate`::
|
|
|
|
The name of the configured postings format that the
|
|
|
|
bloom filter postings format will wrap.
|
|
|
|
|
|
|
|
`fpp`::
|
|
|
|
The desired false positive probability specified as a
|
|
|
|
floating point number between 0 and 1.0. The `fpp` can be configured for
|
|
|
|
multiple expected insertions. Example expression: *10k=0.01,1m=0.03*. If
|
|
|
|
number docs per index segment is larger than *1m* then use *0.03* as fpp
|
|
|
|
and if number of docs per segment is larger than *10k* use *0.01* as
|
|
|
|
fpp. The last fallback value is always *0.03*. This example expression
|
|
|
|
is also the default.
|
|
|
|
|
|
|
|
Type name: `bloom`
|
|
|
|
|
|
|
|
[float]
|
2013-09-25 12:17:40 -04:00
|
|
|
[[pulsing-postings]]
|
2013-08-28 19:24:34 -04:00
|
|
|
==== Pulsing postings format
|
|
|
|
|
|
|
|
The pulsing implementation in-lines the posting lists for very low
|
|
|
|
frequent terms in the term dictionary. This is useful to improve lookup
|
|
|
|
performance for low-frequent terms. This postings format offers the
|
|
|
|
following parameters:
|
|
|
|
|
|
|
|
`min_block_size`::
|
|
|
|
The minimum block size the default Lucene term
|
|
|
|
dictionary uses to encode on-disk blocks. Defaults to *25*.
|
|
|
|
|
|
|
|
`max_block_size`::
|
|
|
|
The maximum block size the default Lucene term
|
|
|
|
dictionary uses to encode on-disk blocks. Defaults to *48*.
|
|
|
|
|
|
|
|
`freq_cut_off`::
|
|
|
|
The document frequency cut off where pulsing
|
|
|
|
in-lines posting lists into the term dictionary. Terms with a document
|
|
|
|
frequency less or equal to the cutoff will be in-lined. The default is
|
|
|
|
*1*.
|
|
|
|
|
|
|
|
Type name: `pulsing`
|
|
|
|
|
|
|
|
[float]
|
2013-09-25 12:17:40 -04:00
|
|
|
[[default-postings]]
|
2013-08-28 19:24:34 -04:00
|
|
|
==== Default postings format
|
|
|
|
|
|
|
|
The default postings format has the following options:
|
|
|
|
|
|
|
|
`min_block_size`::
|
|
|
|
The minimum block size the default Lucene term
|
|
|
|
dictionary uses to encode on-disk blocks. Defaults to *25*.
|
|
|
|
|
|
|
|
`max_block_size`::
|
|
|
|
The maximum block size the default Lucene term
|
|
|
|
dictionary uses to encode on-disk blocks. Defaults to *48*.
|
|
|
|
|
|
|
|
Type name: `default`
|