1. Add an option to supply a custom leaf sorter for IndexWriter.
A DirectoryReader opened from this IndexWriter will have its leaf
readers sorted with the provided leaf sorter. This is useful for
indices that are expected to serve many queries with a particular
sort criterion (e.g. for time-based indices this is usually a
descending sort on timestamp). Providing a leafSorter speeds up
early termination for sorted queries of this kind.
2. Add an option to supply a custom sub-readers sorter for
BaseCompositeReader. In this case sub-readers will be sorted
according to the provided leafSorter.
3. Add an option to supply a custom leaf sorter for
StandardDirectoryReader. The leaf readers of this
StandardDirectoryReader will be sorted according to
the provided leaf sorter.
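As a rough illustration of the ordering described in item 1, the sketch below sorts stand-in segment objects by their maximum timestamp, newest first. It uses only the JDK: `Segment`, `maxTimestamp`, `NEWEST_FIRST` and `sortLeaves` are hypothetical names chosen for this example and are not Lucene classes or methods. In Lucene the comparator would compare leaf readers rather than these stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LeafSorterSketch {
    /** Hypothetical stand-in for a leaf reader; not a Lucene class. */
    static final class Segment {
        final String name;
        final long maxTimestamp;

        Segment(String name, long maxTimestamp) {
            this.name = name;
            this.maxTimestamp = maxTimestamp;
        }
    }

    /** Orders segments so the one holding the newest documents comes first. */
    static final Comparator<Segment> NEWEST_FIRST =
        Comparator.comparingLong((Segment s) -> s.maxTimestamp).reversed();

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<Segment> sortLeaves(List<Segment> leaves) {
        List<Segment> sorted = new ArrayList<>(leaves);
        sorted.sort(NEWEST_FIRST);
        return sorted;
    }

    public static void main(String[] args) {
        List<Segment> leaves = Arrays.asList(
            new Segment("_0", 1_000L),
            new Segment("_2", 3_000L),
            new Segment("_1", 2_000L));
        for (Segment s : sortLeaves(leaves)) {
            System.out.println(s.name + " maxTimestamp=" + s.maxTimestamp);
        }
    }
}
```

With leaves visited in this order, a query sorted by descending timestamp sees its most competitive documents first, which is what lets it stop early.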
Requiring the @Override annotation is helpful because if an abstract method is removed, the concrete methods that implemented it will then show up as compile errors, preventing dead code from being accidentally left behind.
Co-authored-by: Robert Muir <rmuir@apache.org>
Enable ecj unused local variable, private instance and method detection. Allow SuppressWarnings("unused") to disable unused checks (e.g. for generated code or very special tests). Fix gradlew regenerate for python 3.9. Add SuppressWarnings("unused") to generated javacc and jflex code. Enable a few other easy ecj checks such as Deprecated annotation, hashcode/equals, equals across different types.
Co-authored-by: Mike McCandless <mikemccand@apache.org>
The MockSynonymFilter should set the TypeAttribute on the synonyms it
generates in order to make it a better stand-in for the real filter in tests.
SortedDocValues do not have a per-document binary value; they have a
per-document numeric `ordValue()`. The ordinal can then be dereferenced
to its binary form with `lookupOrd()`, but it was a performance trap to
implement a `binaryValue()` on the SortedDocValues API that does this
behind the scenes on every document.
You can replace calls of `binaryValue()` with `lookupOrd(ordValue())`
as a "quick fix", but it is better to use the ordinal alone
(integer-based data structures) for per-document access, and only call
`lookupOrd()` a few times at the end (e.g. for the hits you want to display).
Otherwise, if you really don't want per-document ordinals, but instead a
per-document `byte[]`, use a BinaryDocValues field.
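The ordinal-first pattern recommended above can be sketched with plain JDK types. This is only a model of the idea, not Lucene code: `DICTIONARY`, `lookups`, `mostFrequentTerm` and the local `lookupOrd` are hypothetical stand-ins, with `lookupOrd` here simply indexing an in-memory array and counting how often it is called.

```java
import java.nio.charset.StandardCharsets;

public class OrdinalFacetSketch {
    /** Hypothetical sorted term dictionary: ordinal -> term bytes. */
    static final byte[][] DICTIONARY = {bytes("apple"), bytes("banana"), bytes("cherry")};

    /** Counts dereferences so the cost difference is visible. */
    static int lookups = 0;

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Stand-in for the comparatively expensive ordinal-to-bytes dereference. */
    static byte[] lookupOrd(int ord) {
        lookups++;
        return DICTIONARY[ord];
    }

    /** Counts documents per ordinal using only ints, then resolves just the winner. */
    static String mostFrequentTerm(int[] ordPerDoc) {
        int[] counts = new int[DICTIONARY.length];
        for (int ord : ordPerDoc) {
            counts[ord]++; // per-document work touches integers only
        }
        int best = 0;
        for (int ord = 1; ord < counts.length; ord++) {
            if (counts[ord] > counts[best]) {
                best = ord;
            }
        }
        // A single dereference at the end, instead of one per document.
        return new String(lookupOrd(best), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int[] ordPerDoc = {1, 0, 1, 2, 1, 0};
        String top = mostFrequentTerm(ordPerDoc);
        System.out.println(top + " (dereferences: " + lookups + ")");
    }
}
```

The per-document loop never materializes bytes; the dictionary is consulted once for the value that is actually reported.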
This change only addresses the API (slow `binaryValue()` trap), but
doesn't yet fix any slow algorithms that were discovered in the process,
so it doesn't yield any performance improvements.
Stored Fields and Term Vectors are block-compressed. Decompressing and
recompressing all the documents on every merge is too slow, so we try to
avoid doing it unless it will actually improve the compression ratio. If
we can get away with it, we just bulk-copy existing compressed blocks to
the new segment.
Previously, small segments would always be considered dirty and
recompressed; the special optimized bulk merge wouldn't kick in until
segments were relatively large. But as block size and compression ratio
(shared dictionaries, etc.) have increased, "relatively large" has become
a much bigger number.
So try to avoid doing wasted work: if there's only 1 dirty chunk
(incompletely filled compression block), then don't recompress: it will
likely only give us 1 dirty chunk as a result, at the expense of CPU.
Require at least 2 dirty chunks to recompress: this way the
recompression actually buys us something (reduces 2 to 1).
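The rule described here amounts to a small threshold check. The sketch below models only that rule as stated in this message; `MIN_DIRTY_CHUNKS_TO_RECOMPRESS`, `shouldRecompress` and `mergeStrategy` are hypothetical names, and the real merge code may well combine this with other conditions that are not shown.

```java
public class DirtyChunkSketch {
    /** Assumed from the description above: recompress only with 2 or more dirty chunks. */
    static final int MIN_DIRTY_CHUNKS_TO_RECOMPRESS = 2;

    /** True when recompressing can actually reduce the number of dirty chunks. */
    static boolean shouldRecompress(int numDirtyChunks) {
        return numDirtyChunks >= MIN_DIRTY_CHUNKS_TO_RECOMPRESS;
    }

    /** Names the path a merge would take for a segment with this many dirty chunks. */
    static String mergeStrategy(int numDirtyChunks) {
        return shouldRecompress(numDirtyChunks) ? "recompress" : "bulk-copy";
    }

    public static void main(String[] args) {
        for (int dirty = 0; dirty <= 3; dirty++) {
            System.out.println(dirty + " dirty chunk(s): " + mergeStrategy(dirty));
        }
    }
}
```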
The change also means that bulk merge will now happen often in
the unit test suite, increasing coverage.