mirror of https://github.com/apache/lucene.git
LUCENE-9616: Add developer docs on how to update a format. (#2395)
This commit adds simple guidelines on how to make a change to a file format: * Document how the 'copy-on-write' approach works with backwards-codecs * Clarify that we prefer to copy the format instead of using internal versions
This commit is contained in:
parent
f43fe7642e
commit
bfce5f36da
|
@ -0,0 +1,54 @@
|
|||
# Index backwards compatibility
|
||||
|
||||
This README describes the approach to maintaining compatibility with indices
|
||||
from previous versions and gives guidelines for making format changes.
|
||||
|
||||
## Compatibility strategy
|
||||
|
||||
Codecs and file formats are versioned according to the minor version in which
|
||||
they were created. For example Lucene87Codec represents the codec used for
|
||||
creating Lucene 8.7 indices, and potentially later index versions too. Each
|
||||
segment records the codec name that was used to write it.
|
||||
|
||||
Lucene supports the ability to read segments created in older versions by
|
||||
maintaining old codec classes. These older codecs live in the backwards-codecs
|
||||
package along with their file formats. When making a change to a file format,
|
||||
we create fresh copies of the codec and format, and move the existing ones
|
||||
into backwards-codecs.
|
||||
|
||||
Older codecs are tested in two ways:
|
||||
* Through unit tests like TestLucene80NormsFormat, which checks we can write
|
||||
then read data using each old format
|
||||
* Through TestBackwardsCompatibility, which loads indices created in previous
|
||||
versions and checks that we can search them
|
||||
|
||||
## Making index format changes
|
||||
|
||||
As an example, let's say we're making a change to the norms file format, and
|
||||
the current class in core is Lucene80NormsFormat. We'd perform the following
|
||||
steps:
|
||||
|
||||
1. Create a new format with the target version for the changes, for example
|
||||
Lucene90NormsFormat. This includes creating copies of its writer and reader
|
||||
classes, as well as any helper classes. Make sure to copy unit tests too, like
|
||||
TestLucene80NormsFormat.
|
||||
2. Move the old Lucene80NormsFormat, along with its writer, reader, tests, and
|
||||
helper classes to the backwards-codecs package. If the format will only be
|
||||
used for reading, then delete the write-side logic and move it to a test-only
|
||||
class like Lucene80RWNormsFormat to support unit tests. Note that most formats
|
||||
only need read logic, but a small set including DocValuesFormat and
|
||||
FieldInfosFormat will need to retain write logic since they can be used to
|
||||
update old segments.
|
||||
3. Update the current codec, for example Lucene90Codec, to use the new file
|
||||
format. If this new codec doesn't exist yet, then create it first and move the
|
||||
existing one to backwards-codecs.
|
||||
4. Make a change to the new format!
|
||||
|
||||
## Internal format versions
|
||||
|
||||
Each format class maintains an internal version which is written into the
|
||||
file header. Generally these internal versions should not be used to make
|
||||
format changes. For any significant change, we prefer to use the
|
||||
'copy-on-write' approach described above, even if it produces a fair amount of
|
||||
duplicated code. This keeps the versioning strategy simple and clear, and
|
||||
ensures that we unit test all older index formats.
|
Loading…
Reference in New Issue