mirror of https://github.com/apache/lucene.git
128 lines
5.6 KiB
Markdown
128 lines
5.6 KiB
Markdown
|
<!--
|
||
|
Licensed to the Apache Software Foundation (ASF) under one
|
||
|
or more contributor license agreements. See the NOTICE file
|
||
|
distributed with this work for additional information
|
||
|
regarding copyright ownership. The ASF licenses this file
|
||
|
to you under the Apache License, Version 2.0 (the
|
||
|
"License"); you may not use this file except in compliance
|
||
|
with the License. You may obtain a copy of the License at
|
||
|
|
||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||
|
|
||
|
Unless required by applicable law or agreed to in writing,
|
||
|
software distributed under the License is distributed on an
|
||
|
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
||
|
KIND, either express or implied. See the License for the
|
||
|
specific language governing permissions and limitations
|
||
|
under the License.
|
||
|
-->
|
||
|
|
||
|
# Designing file formats
|
||
|
|
||
|
## Use little JVM heap
|
||
|
|
||
|
Lucene generally prefers to avoid loading gigabytes of data into the JVM heap.
|
||
|
Could this data be stored in a file and accessed using a
|
||
|
`org.apache.lucene.store.RandomAccessInput` instead?
|
||
|
|
||
|
## Avoid options
|
||
|
|
||
|
One of the hardest problems with file formats is maintaining backward
|
||
|
compatibility. Avoid giving options to the user, and instead let the file
|
||
|
format make decisions based on the information it has. If an expert user wants
|
||
|
to optimize for a specific case, they can write a custom codec and maintain it
|
||
|
on their own.
|
||
|
|
||
|
## How to split the data into files?
|
||
|
|
||
|
Most file formats split the data into 3 files:
|
||
|
- metadata,
|
||
|
- index data,
|
||
|
- raw data.
|
||
|
|
||
|
The metadata file contains all the data that is read once at open time. This
|
||
|
helps on several fronts:
|
||
|
- One can validate the checksums of this data at open time without significant
|
||
|
overhead since all data needs to be read anyway, this helps detect
|
||
|
corruptions early.
|
||
|
- No need to perform expensive seeks into the index/raw data files at open
|
||
|
time, one can create slices into these files from offsets that have been
|
||
|
written into the metadata file.
|
||
|
|
||
|
The index file contains data-structures that help search the raw data. For KD
|
||
|
trees, this would be the inner nodes, for doc values this would be jump tables,
|
||
|
for KNN vectors this would be the HNSW graph structure, for terms this would be
|
||
|
the FST that stores term prefixes, etc. Having it in a separate file from the
|
||
|
data file enables users to do things like `MMapDirectory#setPreload(boolean)`
|
||
|
on these files which are generally rather small and accessed randomly. It is
|
||
|
also convenient at times so that index and raw data can be written on the fly
|
||
|
without buffering all index data into memory.
|
||
|
|
||
|
The raw file contains the data that needs to be retrieved.
|
||
|
|
||
|
Some file formats are simpler, e.g. the compound file format's index is so
|
||
|
small that it can be loaded fully into memory at open time. So it becomes
|
||
|
read-once and can be stored in the same file as metadata.
|
||
|
|
||
|
Some file formats are more complex, e.g. postings have multiple types of data
|
||
|
(docs, freqs, positions, offsets, payloads) that are optionally retrieved, so
|
||
|
they use multiple data files in order not to have to read lots of useless data.
|
||
|
|
||
|
## Don't use too many files
|
||
|
|
||
|
The maximum number of file descriptors is usually not infinite. It's ok to use
|
||
|
multiple files per segment as described above, but this number should always be
|
||
|
small. For instance, it would be a bad practice to use a different file per
|
||
|
field.
|
||
|
|
||
|
## Add codec headers and footers to all files
|
||
|
|
||
|
Use `CodecUtil` to add headers and footers to all files of the index. This
|
||
|
helps make sure that we are opening the right file and differenciate Lucene
|
||
|
bugs from file corruptions.
|
||
|
|
||
|
## Validate checksums of the metadata file when opening a segment
|
||
|
|
||
|
If data has been organized in such a way that the metadata file only contains
|
||
|
read-once data then verifying checksums is very cheap to do and can help detect
|
||
|
corruptions early and in a way that we can give users a meaningful error
|
||
|
message that tells users that their index is corrupt, rather than a confusing
|
||
|
exception that tells them that Lucene tried to read data beyond the end of the
|
||
|
file or anything like that.
|
||
|
|
||
|
## Validate structures of other files when opening a segment
|
||
|
|
||
|
One of the most frequent case of index corruption that we have observed over
|
||
|
the years is file truncation. Verifying that index files have the expected
|
||
|
codec header and a correct structure for the codec footer when opening a
|
||
|
segment helps detect a significant spectrum of cases of corruption.
|
||
|
|
||
|
## Do as many consistency checks as reasonable
|
||
|
|
||
|
It is common for some data to be redundant, e.g. data from the metadata file
|
||
|
might be redundant with information from `FieldInfos`, or all files from the
|
||
|
same file format should have the same version in their codec header. Checking
|
||
|
that these redundant pieces of information are consistent is always a good
|
||
|
idea, as it would make cases of corruption much easier to debug.
|
||
|
|
||
|
## Make sure to not leak files
|
||
|
|
||
|
Be paranoid regarding where exceptions might be thrown and make sure that files
|
||
|
would be closed on all paths. E.g. imagine that opening the data file fails
|
||
|
while the index file is already open, make sure that the index file would also
|
||
|
get closed in that case. Lucene has tests that randomly throw exceptions when
|
||
|
interacting with the `Directory` in order to detect some bugs, but it might
|
||
|
take many runs before randomization triggers the exact case that triggers a
|
||
|
bug.
|
||
|
|
||
|
## Verify checksums upon merges
|
||
|
|
||
|
Merges need to read most if not all input data anyway, so make sure to verify
|
||
|
checksums before starting a merge by calling `checkIntegrity()` on the file
|
||
|
format reader in order to make sure that file corruptions don't get propagated
|
||
|
by merges. All default implementations do this.
|
||
|
|
||
|
## How to make backward-compatible changes to file formats?
|
||
|
|
||
|
See [here](../lucene/backward-codecs/README.md).
|