mirror of https://github.com/apache/lucene.git
Add documentation on file formats. (#598)
This commit is contained in:
parent
f81c760cc8
commit
308ddd7502
|
@ -0,0 +1,127 @@
|
|||
<!--
|
||||
Licensed to the Apache Software Foundation (ASF) under one
|
||||
or more contributor license agreements. See the NOTICE file
|
||||
distributed with this work for additional information
|
||||
regarding copyright ownership. The ASF licenses this file
|
||||
to you under the Apache License, Version 2.0 (the
|
||||
"License"); you may not use this file except in compliance
|
||||
with the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing,
|
||||
software distributed under the License is distributed on an
|
||||
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||||
KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations
|
||||
under the License.
|
||||
-->
|
||||
|
||||
# Designing file formats
|
||||
|
||||
## Use little JVM heap
|
||||
|
||||
Lucene generally prefers to avoid loading gigabytes of data into the JVM heap.
|
||||
Could this data be stored in a file and accessed using a
|
||||
`org.apache.lucene.store.RandomAccessInput` instead?
|
||||
|
||||
## Avoid options
|
||||
|
||||
One of the hardest problems with file formats is maintaining backward
|
||||
compatibility. Avoid giving options to the user, and instead let the file
|
||||
format make decisions based on the information it has. If an expert user wants
|
||||
to optimize for a specific case, they can write a custom codec and maintain it
|
||||
on their own.
|
||||
|
||||
## How to split the data into files?
|
||||
|
||||
Most file formats split the data into 3 files:
|
||||
- metadata,
|
||||
- index data,
|
||||
- raw data.
|
||||
|
||||
The metadata file contains all the data that is read once at open time. This
|
||||
helps on several fronts:
|
||||
- One can validate the checksums of this data at open time without significant
|
||||
overhead since all data needs to be read anyway, this helps detect
|
||||
corruptions early.
|
||||
- No need to perform expensive seeks into the index/raw data files at open
|
||||
time, one can create slices into these files from offsets that have been
|
||||
written into the metadata file.
|
||||
|
||||
The index file contains data-structures that help search the raw data. For KD
|
||||
trees, this would be the inner nodes, for doc values this would be jump tables,
|
||||
for KNN vectors this would be the HNSW graph structure, for terms this would be
|
||||
the FST that stores term prefixes, etc. Having it in a separate file from the
|
||||
data file enables users to do things like `MMapDirectory#setPreload(boolean)`
|
||||
on these files which are generally rather small and accessed randomly. It is
|
||||
also convenient at times so that index and raw data can be written on the fly
|
||||
without buffering all index data into memory.
|
||||
|
||||
The raw file contains the data that needs to be retrieved.
|
||||
|
||||
Some file formats are simpler, e.g. the compound file format's index is so
|
||||
small that it can be loaded fully into memory at open time. So it becomes
|
||||
read-once and can be stored in the same file as metadata.
|
||||
|
||||
Some file formats are more complex, e.g. postings have multiple types of data
|
||||
(docs, freqs, positions, offsets, payloads) that are optionally retrieved, so
|
||||
they use multiple data files in order not to have to read lots of useless data.
|
||||
|
||||
## Don't use too many files
|
||||
|
||||
The maximum number of file descriptors is usually not infinite. It's ok to use
|
||||
multiple files per segment as described above, but this number should always be
|
||||
small. For instance, it would be a bad practice to use a different file per
|
||||
field.
|
||||
|
||||
## Add codec headers and footers to all files
|
||||
|
||||
Use `CodecUtil` to add headers and footers to all files of the index. This
|
||||
helps make sure that we are opening the right file and differenciate Lucene
|
||||
bugs from file corruptions.
|
||||
|
||||
## Validate checksums of the metadata file when opening a segment
|
||||
|
||||
If data has been organized in such a way that the metadata file only contains
|
||||
read-once data then verifying checksums is very cheap to do and can help detect
|
||||
corruptions early and in a way that we can give users a meaningful error
|
||||
message that tells users that their index is corrupt, rather than a confusing
|
||||
exception that tells them that Lucene tried to read data beyond the end of the
|
||||
file or anything like that.
|
||||
|
||||
## Validate structures of other files when opening a segment
|
||||
|
||||
One of the most frequent case of index corruption that we have observed over
|
||||
the years is file truncation. Verifying that index files have the expected
|
||||
codec header and a correct structure for the codec footer when opening a
|
||||
segment helps detect a significant spectrum of cases of corruption.
|
||||
|
||||
## Do as many consistency checks as reasonable
|
||||
|
||||
It is common for some data to be redundant, e.g. data from the metadata file
|
||||
might be redundant with information from `FieldInfos`, or all files from the
|
||||
same file format should have the same version in their codec header. Checking
|
||||
that these redundant pieces of information are consistent is always a good
|
||||
idea, as it would make cases of corruption much easier to debug.
|
||||
|
||||
## Make sure to not leak files
|
||||
|
||||
Be paranoid regarding where exceptions might be thrown and make sure that files
|
||||
would be closed on all paths. E.g. imagine that opening the data file fails
|
||||
while the index file is already open, make sure that the index file would also
|
||||
get closed in that case. Lucene has tests that randomly throw exceptions when
|
||||
interacting with the `Directory` in order to detect some bugs, but it might
|
||||
take many runs before randomization triggers the exact case that triggers a
|
||||
bug.
|
||||
|
||||
## Verify checksums upon merges
|
||||
|
||||
Merges need to read most if not all input data anyway, so make sure to verify
|
||||
checksums before starting a merge by calling `checkIntegrity()` on the file
|
||||
format reader in order to make sure that file corruptions don't get propagated
|
||||
by merges. All default implementations do this.
|
||||
|
||||
## How to make backward-compatible changes to file formats?
|
||||
|
||||
See [here](../lucene/backward-codecs/README.md).
|
Loading…
Reference in New Issue