From 52408fc389abd1abc2f58506271617713c2e2a6e Mon Sep 17 00:00:00 2001
From: Adrien Grand
Date: Mon, 21 Nov 2016 15:01:36 +0100
Subject: [PATCH] Add a recommendation against large documents to the docs. (#21652)

---
 docs/reference/how-to/general.asciidoc | 29 ++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/docs/reference/how-to/general.asciidoc b/docs/reference/how-to/general.asciidoc
index 60f0181b2bb..0900c49ce06 100644
--- a/docs/reference/how-to/general.asciidoc
+++ b/docs/reference/how-to/general.asciidoc
@@ -11,6 +11,35 @@ for workloads that fall into the database domain, such as retrieving all
 documents that match a particular query. If you need to do this, make sure to
 use the <> API.
 
+[float]
+[[maximum-document-size]]
+=== Avoid large documents
+
+Given that the default <> is set to
+100MB, Elasticsearch will refuse to index any document that is larger than
+that. You might decide to increase that particular setting, but Lucene still
+has a limit of about 2GB.
+
+Even without considering hard limits, large documents are usually not
+practical. Large documents put more stress on network, memory usage and disk,
+even for search requests that do not request the `_source` since Elasticsearch
+needs to fetch the `_id` of the document in all cases, and the cost of getting
+this field is bigger for large documents due to how the filesystem cache works.
+Indexing such a document can use an amount of memory that is a multiple of the
+original size of the document. Proximity search (phrase queries for instance)
+and <> also become more expensive
+since their cost directly depends on the size of the original document.
+
+It is sometimes useful to reconsider what the unit of information should be.
+For instance, the fact you want to make books searchable doesn't necessarily
+mean that a document should consist of a whole book. It might be a better idea
+to use chapters or even paragraphs as documents, and then have a property in
+these documents that identifies which book they belong to. This not only
+avoids the issues with large documents, it also makes the search experience
+better. For instance, if a user searches for two words `foo` and `bar`, a match
+across different chapters is probably very poor, while a match within the same
+paragraph is likely good.
+
 [float]
 [[sparsity]]
 === Avoid sparsity
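
Review note: the "Avoid large documents" section added by this patch recommends indexing chapters or paragraphs rather than whole books, with a property that identifies the parent book. A minimal sketch of that idea follows, in Python. The function name `paragraph_documents` and the field names `doc_id`, `book_id`, `chapter`, `paragraph` and `text` are hypothetical choices for illustration, not anything defined by Elasticsearch or by this patch. The sketch also assumes that paragraphs are separated by blank lines.

```python
import re
from typing import Dict, Iterator


def paragraph_documents(book_id: str, chapters: Dict[str, str]) -> Iterator[dict]:
    """Yield one small document per paragraph instead of one per book.

    `chapters` maps a chapter title to its raw text. Paragraphs are assumed
    to be separated by one or more blank lines (an assumption of this sketch).
    """
    for chapter_no, (title, text) in enumerate(chapters.items(), start=1):
        # Split on blank lines and drop empty fragments.
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
        for para_no, paragraph in enumerate(paragraphs, start=1):
            yield {
                # Hypothetical identifier built from book, chapter and paragraph.
                "doc_id": f"{book_id}-{chapter_no}-{para_no}",
                # Property that ties the paragraph back to its book.
                "book_id": book_id,
                "chapter": title,
                "paragraph": para_no,
                "text": paragraph,
            }


if __name__ == "__main__":
    book = {
        "Chapter 1": "foo appears here.\n\nA second paragraph.",
        "Chapter 2": "bar appears here.",
    }
    for doc in paragraph_documents("example-book", book):
        print(doc)
```

Each yielded dictionary could then be sent to Elasticsearch as its own document. Searches for `foo` and `bar` would then match within a single paragraph, and results could be grouped or filtered by `book_id`.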