LUCENE-9616: Add developer docs on how to update a format. (#2395)

This commit adds simple guidelines on how to make a change to a file format:
* Document how the 'copy-on-write' approach works with backwards-codecs
* Clarify that we prefer to copy the format instead of using internal versions
This commit is contained in:
Julie Tibshirani 2021-02-22 11:02:37 -08:00 committed by GitHub
parent f43fe7642e
commit bfce5f36da
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed files with 54 additions and 0 deletions

View File

@ -0,0 +1,54 @@
# Index backwards compatibility
This README describes the approach to maintaining compatibility with indices
from previous versions and gives guidelines for making format changes.
## Compatibility strategy
Codecs and file formats are versioned according to the minor version in which
they were created. For example Lucene87Codec represents the codec used for
creating Lucene 8.7 indices, and potentially later index versions too. Each
segment records the codec name that was used to write it.
Lucene supports the ability to read segments created in older versions by
maintaining old codec classes. These older codecs live in the backwards-codecs
package along with their file formats. When making a change to a file format,
we create fresh copies of the codec and format, and move the existing ones
into backwards-codecs.
Older codecs are tested in two ways:
* Through unit tests like TestLucene80NormsFormat, which checks we can write
then read data using each old format
* Through TestBackwardsCompatibility, which loads indices created in previous
versions and checks that we can search them
## Making index format changes
As an example, let's say we're making a change to the norms file format, and
the current class in core is Lucene80NormsFormat. We'd perform the following
steps:
1. Create a new format with the target version for the changes, for example
Lucene90NormsFormat. This includes creating copies of its writer and reader
classes, as well as any helper classes. Make sure to copy unit tests too, like
TestLucene80NormsFormat.
2. Move the old Lucene80NormsFormat, along with its writer, reader, tests, and
helper classes to the backwards-codecs package. If the format will only be
used for reading, then delete the write-side logic and move it to a test-only
class like Lucene80RWNormsFormat to support unit tests. Note that most formats
only need read logic, but a small set including DocValuesFormat and
FieldInfosFormat will need to retain write logic since they can be used to
update old segments.
3. Update the current codec, for example Lucene90Codec, to use the new file
format. If this new codec doesn't exist yet, then create it first and move the
existing one to backwards-codecs.
4. Make a change to the new format!
## Internal format versions
Each format class maintains an internal version which is written into the
file header. Generally these internal versions should not be used to make
format changes. For any significant change, we prefer to use the
'copy-on-write' approach described above, even if it produces a fair amount of
duplicated code. This keeps the versioning strategy simple and clear, and
ensures that we unit test all older index formats.