SortedDocValues do not have a per-document binary value; they have a
per-document numeric `ordValue()`. The ordinal can then be dereferenced
to its binary form with `lookupOrd()`, but it was a performance trap to
implement a `binaryValue()` on the SortedDocValues API that does this
behind the scenes on every document.
You can replace calls to `binaryValue()` with `lookupOrd(ordValue())`
as a "quick fix", but it is better to use the ordinal alone
(integer-based data structures) for per-document access, and only call
`lookupOrd()` a few times at the end (e.g. for the hits you want to display).
Otherwise, if you really don't want per-document ordinals, but instead a
per-document `byte[]`, use a BinaryDocValues field.
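The ordinal-first pattern described above can be sketched as follows. This is a minimal illustration, not Lucene code: `OrdinalFieldSketch` is a hypothetical stand-in that only mimics the `ordValue()` and `lookupOrd()` calls, and its `docCount()`, `valueCount()`, `lookups`, and `countByValue` members are assumptions made for the example. The point is that the per-document loop touches only integers, and the dictionary is dereferenced once per distinct value that actually occurred.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a sorted doc-values field: NOT Lucene's class.
// It only mimics the two calls discussed above, ordValue() and lookupOrd().
final class OrdinalFieldSketch {
    private final String[] dictionary; // sorted unique values, indexed by ordinal
    private final int[] docToOrd;      // one ordinal per document
    int lookups = 0;                   // counts dictionary dereferences

    OrdinalFieldSketch(String[] dictionary, int[] docToOrd) {
        this.dictionary = dictionary;
        this.docToOrd = docToOrd;
    }

    int docCount() { return docToOrd.length; }
    int valueCount() { return dictionary.length; }
    int ordValue(int doc) { return docToOrd[doc]; }

    String lookupOrd(int ord) {
        lookups++;
        return dictionary[ord];
    }

    // Per-document work uses ordinals only; values are resolved at the end,
    // once per ordinal that was actually seen.
    static List<String> countByValue(OrdinalFieldSketch field) {
        int[] counts = new int[field.valueCount()];
        for (int doc = 0; doc < field.docCount(); doc++) {
            counts[field.ordValue(doc)]++;
        }
        List<String> out = new ArrayList<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                out.add(field.lookupOrd(ord) + "=" + counts[ord]);
            }
        }
        return out;
    }
}
```

With five documents and two distinct values in use, this performs two lookups instead of five, which is the difference the "quick fix" of `lookupOrd(ordValue())` per document would not give you.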
This change only addresses the API (slow `binaryValue()` trap), but
doesn't yet fix any slow algorithms that were discovered in the process,
so it doesn't yield any performance improvements.

Stored Fields and Term Vectors are block-compressed. Decompressing and
recompressing all the documents on every merge is too slow, so we try to
avoid doing it unless it will actually improve the compression ratio. If
we can get away with it, we just bulk-copy existing compressed blocks to
the new segment.
Previously, small segments would always be considered dirty and
recompressed: the special optimized bulk merge wouldn't kick in until
segments were relatively large. But as block size and compression ratio
(shared dictionaries, etc.) have increased, "relatively large" has become
a much bigger number.
So try to avoid doing wasted work: if there's only 1 dirty chunk
(incompletely filled compression block), then don't recompress: it will
likely only give us 1 dirty chunk as a result, at the expense of CPU.
Require at least 2 dirty chunks to recompress: this way the
recompression actually buys us something (reduces 2 to 1).
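As a rough sketch of the rule above (not the actual merge code), the decision can be written as a threshold on the dirty-chunk count. The names `DirtyChunkRuleSketch`, `MergeChoice`, `choose`, and `dirtyChunksAfterRecompress` are hypothetical, and the real heuristic may weigh other factors that this message does not describe.

```java
// Hypothetical sketch of the "at least 2 dirty chunks" rule described above.
final class DirtyChunkRuleSketch {
    enum MergeChoice { BULK_COPY, RECOMPRESS }

    // Recompress only when it can reduce the number of dirty chunks.
    static MergeChoice choose(int numDirtyChunks) {
        return numDirtyChunks >= 2 ? MergeChoice.RECOMPRESS : MergeChoice.BULK_COPY;
    }

    // Assumption taken from the text: recompression likely leaves at most
    // one incompletely filled block behind.
    static int dirtyChunksAfterRecompress(int numDirtyChunks) {
        return Math.min(numDirtyChunks, 1);
    }
}
```

With one dirty chunk, recompression would leave one dirty chunk and only cost CPU, so the sketch bulk-copies; with two, recompression reduces 2 to 1.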
The change also means that bulk merge will now happen often in
the unit test suite, increasing coverage.

Removes the `scratch1` field in `BytesRefHash` by accessing the underlying bytes pool directly
in the `equals` method. As a result, it is now possible to call `BytesRefHash#find`
concurrently, as long as there are no concurrent modifications to the `BytesRefHash` instance
and it is correctly published.
This addresses the concurrency issue with Monitor (aka Luwak) since it
is using `BytesRefHash#find` concurrently without additional synchronization.
It was never truly required there.
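To illustrate why comparing against the pool in place makes read-only lookups safe to run concurrently, here is a simplified sketch. `PoolLookupSketch` is a hypothetical stand-in, not `BytesRefHash`: it uses a linear scan instead of hashing, and all of its names are assumptions for the example. What it shares with the change described above is that `find` writes to no shared scratch state, and the instance is not modified after construction.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in: NOT Lucene's BytesRefHash.
final class PoolLookupSketch {
    private final byte[] pool;   // all keys, concatenated
    private final int[] starts;  // start offset of each key in the pool
    private final int[] lengths; // length of each key in bytes

    PoolLookupSketch(String... keys) {
        starts = new int[keys.length];
        lengths = new int[keys.length];
        byte[][] encoded = new byte[keys.length][];
        int total = 0;
        for (int i = 0; i < keys.length; i++) {
            encoded[i] = keys[i].getBytes(StandardCharsets.UTF_8);
            starts[i] = total;
            lengths[i] = encoded[i].length;
            total += encoded[i].length;
        }
        pool = new byte[total];
        for (int i = 0; i < keys.length; i++) {
            System.arraycopy(encoded[i], 0, pool, starts[i], lengths[i]);
        }
    }

    // Compares against the pool in place. No shared scratch buffer is
    // written, so concurrent readers cannot interfere with each other.
    int find(byte[] key) {
        for (int id = 0; id < starts.length; id++) {
            int from = starts[id];
            if (Arrays.equals(pool, from, from + lengths[id], key, 0, key.length)) {
                return id;
            }
        }
        return -1;
    }

    // Calls find() from several threads on one instance and returns how many
    // lookups came back with an unexpected id.
    static int countMismatches(PoolLookupSketch hash, String[] keys, int threads) {
        AtomicInteger mismatches = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int round = 0; round < 1000; round++) {
                    for (int id = 0; id < keys.length; id++) {
                        byte[] key = keys[id].getBytes(StandardCharsets.UTF_8);
                        if (hash.find(key) != id) {
                            mismatches.incrementAndGet();
                        }
                    }
                }
            });
            workers[t].start(); // Thread.start() publishes the instance safely
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return mismatches.get();
    }
}
```

A version that first copied each candidate into a shared instance field before comparing would let two readers overwrite each other's copy; removing that field is what removes the need for synchronization on the read path.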

Pervasive use of "javabin" reduces the need to care about client-side XML speed. It is better to reduce dependencies and let clients use the libraries they want.