From 22a8e4cbfa70a9cc08c9dbe6184d985731f7f608 Mon Sep 17 00:00:00 2001 From: Adrien Grand Date: Tue, 11 Jan 2022 15:16:05 +0100 Subject: [PATCH] Add documentation on file formats. (#598) --- dev-docs/file-formats.md | 127 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 127 insertions(+) create mode 100644 dev-docs/file-formats.md diff --git a/dev-docs/file-formats.md b/dev-docs/file-formats.md new file mode 100644 index 00000000000..6f23adcc39f --- /dev/null +++ b/dev-docs/file-formats.md @@ -0,0 +1,127 @@ + + +# Designing file formats + +## Use little JVM heap + +Lucene generally prefers to avoid loading gigabytes of data into the JVM heap. +Could this data be stored in a file and accessed using a +`org.apache.lucene.store.RandomAccessInput` instead? + +## Avoid options + +One of the hardest problems with file formats is maintaining backward +compatibility. Avoid giving options to the user, and instead let the file +format make decisions based on the information it has. If an expert user wants +to optimize for a specific case, they can write a custom codec and maintain it +on their own. + +## How to split the data into files? + +Most file formats split the data into 3 files: + - metadata, + - index data, + - raw data. + +The metadata file contains all the data that is read once at open time. This +helps on several fronts: + - One can validate the checksums of this data at open time without significant + overhead since all data needs to be read anyway, this helps detect + corruptions early. + - No need to perform expensive seeks into the index/raw data files at open + time, one can create slices into these files from offsets that have been + written into the metadata file. + +The index file contains data-structures that help search the raw data. For KD +trees, this would be the inner nodes, for doc values this would be jump tables, +for KNN vectors this would be the HNSW graph structure, for terms this would be +the FST that stores term prefixes, etc. Having it in a separate file from the +data file enables users to do things like `MMapDirectory#setPreload(boolean)` +on these files which are generally rather small and accessed randomly. It is +also convenient at times so that index and raw data can be written on the fly +without buffering all index data into memory. + +The raw file contains the data that needs to be retrieved. + +Some file formats are simpler, e.g. the compound file format's index is so +small that it can be loaded fully into memory at open time. So it becomes +read-once and can be stored in the same file as metadata. + +Some file formats are more complex, e.g. postings have multiple types of data +(docs, freqs, positions, offsets, payloads) that are optionally retrieved, so +they use multiple data files in order not to have to read lots of useless data. + +## Don't use too many files + +The maximum number of file descriptors is usually not infinite. It's ok to use +multiple files per segment as described above, but this number should always be +small. For instance, it would be a bad practice to use a different file per +field. + +## Add codec headers and footers to all files + +Use `CodecUtil` to add headers and footers to all files of the index. This +helps make sure that we are opening the right file and differenciate Lucene +bugs from file corruptions. + +## Validate checksums of the metadata file when opening a segment + +If data has been organized in such a way that the metadata file only contains +read-once data then verifying checksums is very cheap to do and can help detect +corruptions early and in a way that we can give users a meaningful error +message that tells users that their index is corrupt, rather than a confusing +exception that tells them that Lucene tried to read data beyond the end of the +file or anything like that. + +## Validate structures of other files when opening a segment + +One of the most frequent case of index corruption that we have observed over +the years is file truncation. Verifying that index files have the expected +codec header and a correct structure for the codec footer when opening a +segment helps detect a significant spectrum of cases of corruption. + +## Do as many consistency checks as reasonable + +It is common for some data to be redundant, e.g. data from the metadata file +might be redundant with information from `FieldInfos`, or all files from the +same file format should have the same version in their codec header. Checking +that these redundant pieces of information are consistent is always a good +idea, as it would make cases of corruption much easier to debug. + +## Make sure to not leak files + +Be paranoid regarding where exceptions might be thrown and make sure that files +would be closed on all paths. E.g. imagine that opening the data file fails +while the index file is already open, make sure that the index file would also +get closed in that case. Lucene has tests that randomly throw exceptions when +interacting with the `Directory` in order to detect some bugs, but it might +take many runs before randomization triggers the exact case that triggers a +bug. + +## Verify checksums upon merges + +Merges need to read most if not all input data anyway, so make sure to verify +checksums before starting a merge by calling `checkIntegrity()` on the file +format reader in order to make sure that file corruptions don't get propagated +by merges. All default implementations do this. + +## How to make backward-compatible changes to file formats? + +See [here](../lucene/backward-codecs/README.md).