From 52408fc389abd1abc2f58506271617713c2e2a6e Mon Sep 17 00:00:00 2001
From: Adrien Grand <jpountz@gmail.com>
Date: Mon, 21 Nov 2016 15:01:36 +0100
Subject: [PATCH] Add a recommendation against large documents to the docs.
 (#21652)

---
 docs/reference/how-to/general.asciidoc | 29 ++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/docs/reference/how-to/general.asciidoc b/docs/reference/how-to/general.asciidoc
index 60f0181b2bb..0900c49ce06 100644
--- a/docs/reference/how-to/general.asciidoc
+++ b/docs/reference/how-to/general.asciidoc
@@ -11,6 +11,35 @@ for workloads that fall into the database domain, such as retrieving all
 documents that match a particular query. If you need to do this, make sure to
 use the <<search-request-scroll,Scroll>> API.
 
+[float]
+[[maximum-document-size]]
+=== Avoid large documents
+
+Given that the default <<modules-http,`http.max_content_length`>> is set to
+100MB, Elasticsearch will refuse to index any document that is larger than
+that. You might decide to increase that particular setting, but Lucene still
+has a limit of about 2GB.
+
+Even without considering hard limits, large documents are usually not
+practical. Large documents put more stress on network, memory usage and disk,
+even for search requests that do not request the `_source` since Elasticsearch
+needs to fetch the `_id` of the document in all cases, and the cost of getting
+this field is bigger for large documents due to how the filesystem cache works.
+Indexing such a document can use an amount of memory that is a multiple of the
+original size of the document. Proximity search (phrase queries for instance)
+and <<search-request-highlighting,highlighting>> also become more expensive
+since their cost directly depends on the size of the original document.
+
+It is sometimes useful to reconsider what the unit of information should be.
+For instance, the fact that you want to make books searchable doesn't
+necessarily mean that a document should consist of a whole book. It might be a
+better idea to use chapters or even paragraphs as documents, and then have a
+property in these documents that identifies which book they belong to. This not
+only avoids the issues with large documents but also makes the search
+experience better. For instance, if a user searches for two words `foo` and
+`bar`, a match across different chapters is probably very poor, while a match
+within the same paragraph is likely good.
+
 [float]
 [[sparsity]]
 === Avoid sparsity