From f295a218a05f75ab5cd219afbd4c402f9984f477 Mon Sep 17 00:00:00 2001
From: Adrien Grand
Date: Thu, 7 Jul 2016 09:12:09 +0200
Subject: [PATCH] Add notes about sparsity.

---
 docs/reference/how-to.asciidoc         |   2 +
 docs/reference/how-to/general.asciidoc | 104 +++++++++++++++++++++++++
 2 files changed, 106 insertions(+)
 create mode 100644 docs/reference/how-to/general.asciidoc

diff --git a/docs/reference/how-to.asciidoc b/docs/reference/how-to.asciidoc
index ee954553617..f41c3a3bb9c 100644
--- a/docs/reference/how-to.asciidoc
+++ b/docs/reference/how-to.asciidoc
@@ -15,6 +15,8 @@ This section provides guidance about which changes should and shouldn't be
 made.
 --
 
+include::how-to/general.asciidoc[]
+
 include::how-to/indexing-speed.asciidoc[]
 
 include::how-to/search-speed.asciidoc[]
diff --git a/docs/reference/how-to/general.asciidoc b/docs/reference/how-to/general.asciidoc
new file mode 100644
index 00000000000..de37a714de6
--- /dev/null
+++ b/docs/reference/how-to/general.asciidoc
@@ -0,0 +1,104 @@
+[[general-recommendations]]
+== General recommendations
+
+[float]
+[[large-size]]
+=== Don't return large result sets
+
+Elasticsearch is designed as a search engine, which makes it very good at
+getting back the top documents that match a query. However, it is not as good
+for workloads that fall into the database domain, such as retrieving all
+documents that match a particular query. If you need to do this, make sure to
+use the <> API.
+
+[float]
+[[sparsity]]
+=== Avoid sparsity
+
+The data structures behind Lucene, which Elasticsearch relies on in order to
+index and store data, work best with dense data, i.e. when all documents have the
+same fields. This is especially true for fields that have norms enabled (which
+is the case for `text` fields by default) or doc values enabled (which is the
+case for numerics, `date`, `ip` and `keyword` by default).
+
+The reason is that Lucene internally identifies documents with so-called doc
+ids, which are integers between 0 and the total number of documents in the
+index. These doc ids are used for communication between the internal APIs of
+Lucene: for instance, searching on a term with a `match` query produces an
+iterator of doc ids, and these doc ids are then used to retrieve the value of
+the `norm` in order to compute a score for these documents. The way this `norm`
+lookup is implemented currently is by reserving one byte for each document.
+The `norm` value for a given doc id can then be retrieved by reading the
+byte at index `doc_id`. While this is very efficient and helps Lucene quickly
+have access to the `norm` values of every document, this has the drawback that
+documents that do not have a value will also require one byte of storage.
+
+In practice, this means that if an index has `M` documents, norms will require
+`M` bytes of storage *per field*, even for fields that only appear in a small
+fraction of the documents of the index. Although slightly more complex with doc
+values due to the fact that doc values have multiple ways that they can be
+encoded depending on the type of field and on the actual data that the field
+stores, the problem is very similar. In case you are wondering: `fielddata`, which was
+used in Elasticsearch pre-2.0 before being replaced with doc values, also
+suffered from this issue, except that the impact was only on the memory
+footprint since `fielddata` was not explicitly materialized on disk.
+
+Note that even though the most notable impact of sparsity is on storage
+requirements, it also has an impact on indexing speed and search speed since
+these bytes for documents that do not have a field still need to be written
+at index time and skipped over at search time.
+
+It is totally fine to have a minority of sparse fields in an index. But beware
+that if sparsity becomes the rule rather than the exception, then the index
+will not be as efficient as it could be.
+
+This section mostly focused on `norms` and `doc values` because those are the
+two features that are most affected by sparsity. Sparsity also affects the
+efficiency of the inverted index (used to index `text`/`keyword` fields) and
+dimensional points (used to index `geo_point` and numerics) but to a lesser
+extent.
+
+Here are some recommendations that can help avoid sparsity:
+
+[float]
+==== Avoid putting unrelated data in the same index
+
+You should avoid putting documents that have totally different structures into
+the same index in order to avoid sparsity. It is often better to put these
+documents into different indices. You could also consider giving fewer shards
+to these smaller indices since they will contain fewer documents overall.
+
+Note that this advice does not apply in the case that you need to use
+parent/child relations between your documents since this feature is only
+supported on documents that live in the same index.
+
+[float]
+==== Normalize document structures
+
+Even if you really need to put different kinds of documents in the same index,
+maybe there are opportunities to reduce sparsity. For instance, if all documents
+in the index have a timestamp field but some call it `timestamp` and others
+call it `creation_date`, it would help to rename it so that all documents have
+the same field name for the same data.
+
+[float]
+==== Avoid types
+
+Types might sound like a good way to store multiple tenants in a single index.
+They are not: given that types store everything in a single index, having
+multiple types that have different fields in a single index will also cause
+problems due to sparsity as described above. If your types do not have very
+similar mappings, you might want to consider moving them to a dedicated index.
+
+[float]
+==== Disable `norms` and `doc_values` on sparse fields
+
+If none of the above recommendations apply in your case, you might want to
+check whether you actually need `norms` and `doc_values` on your sparse fields.
+`norms` can be disabled if producing scores is not necessary on a field; this is
+typically true for fields that are only used for filtering. `doc_values` can be
+disabled on fields that are neither used for sorting nor for aggregations.
+Beware that this decision should not be made lightly since these parameters
+cannot be changed on a live index, so you would have to reindex if you realize
+that you need `norms` or `doc_values`.
+
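
As a rough illustration of the norms cost described in the "Avoid sparsity" section of the patch, the sketch below applies the stated rule that an index with `M` documents needs `M` bytes of norms per field, whether or not a document has the field. The function names and the example figures (1,000,000 documents, 50 sparse fields, 10,000 documents per field) are hypothetical and chosen only for illustration. It models the one-byte-per-document scheme from the text, not Lucene's actual on-disk encoding.

```python
def norms_bytes(total_docs: int, fields_with_norms: int) -> int:
    """Bytes used by norms when each field reserves one byte per document."""
    return total_docs * fields_with_norms


def wasted_fraction(total_docs: int, docs_with_field: int) -> float:
    """Share of one field's norms bytes spent on documents lacking the field."""
    if total_docs == 0:
        return 0.0
    return (total_docs - docs_with_field) / total_docs


if __name__ == "__main__":
    # Hypothetical index: 1,000,000 documents and 50 sparse fields with norms
    # enabled, each field present in only 10,000 documents.
    total = norms_bytes(1_000_000, 50)
    wasted = wasted_fraction(1_000_000, 10_000)
    print(f"norms storage: {total} bytes, wasted per field: {wasted:.0%}")
```

Under these assumptions, almost all of the norms storage is spent on documents that never had the field, which is the situation the recommendations in the patch aim to avoid.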