When merging `Lucene99HnswScalarQuantizedVectorsFormat`, an NPE is possible when deleted documents are present.
`ScalarQuantizer#fromVectors` doesn't take deleted documents into account, so the value returned by `FloatVectorValues#size` may be larger than the number of live documents. Consequently, the sampling loop can iterate past the last live vector, and an NPE is thrown.
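As a rough illustration of the failure mode (not the actual Lucene code), the sketch below treats the reported size as an upper bound only and stops as soon as the live vectors are exhausted. The class `LiveVectorSampling` and the method `collectLive` are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper, for illustration only; not part of Lucene. */
public final class LiveVectorSampling {

  private LiveVectorSampling() {}

  /**
   * Collects at most {@code reportedSize} vectors. The reported size may count
   * deleted documents, so the loop also stops when no live vector remains
   * instead of dereferencing a missing value.
   */
  public static List<float[]> collectLive(Iterator<float[]> liveVectors, int reportedSize) {
    List<float[]> collected = new ArrayList<>();
    for (int i = 0; i < reportedSize && liveVectors.hasNext(); i++) {
      collected.add(liveVectors.next());
    }
    return collected;
  }
}
```

With two live vectors and a reported size of five, the loop returns two vectors rather than reading past the end.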
Split taxonomy arrays across chunks
Taxonomy ordinals are added in an append-only way.
Instead of reallocating a single big array when loading new taxonomy
ordinals and copying all the values from the previous arrays over
individually, we can keep blocks of ordinals and reuse blocks from the
previous arrays.
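The block-reuse idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the taxonomy implementation: the class name `ChunkedIntArray`, the tiny chunk size, and the `append` and `sharesChunk` methods are all hypothetical. Because values are append-only, completely filled chunks never change and can be shared by reference; only a trailing partial chunk needs to be copied.

```java
import java.util.Arrays;

/** Hypothetical append-only int array stored in fixed-size chunks; illustration only. */
public final class ChunkedIntArray {
  private static final int CHUNK_SIZE = 4; // deliberately tiny for the example

  private final int[][] chunks;
  private final int length;

  private ChunkedIntArray(int[][] chunks, int length) {
    this.chunks = chunks;
    this.length = length;
  }

  public static ChunkedIntArray of(int... values) {
    return new ChunkedIntArray(new int[0][], 0).append(values);
  }

  public int length() {
    return length;
  }

  public int get(int index) {
    if (index < 0 || index >= length) {
      throw new IndexOutOfBoundsException("index " + index + ", length " + length);
    }
    return chunks[index / CHUNK_SIZE][index % CHUNK_SIZE];
  }

  /** Returns a new array with the values appended; full chunks are shared, not copied. */
  public ChunkedIntArray append(int[] newValues) {
    int newLength = length + newValues.length;
    int newChunkCount = (newLength + CHUNK_SIZE - 1) / CHUNK_SIZE;
    int[][] newChunks = new int[newChunkCount][];

    // Full chunks are immutable from now on, so copy only their references.
    int fullChunks = length / CHUNK_SIZE;
    System.arraycopy(chunks, 0, newChunks, 0, fullChunks);

    // Copy the trailing partial chunk (if any) and allocate the fresh ones.
    for (int c = fullChunks; c < newChunkCount; c++) {
      newChunks[c] = c < chunks.length ? Arrays.copyOf(chunks[c], CHUNK_SIZE) : new int[CHUNK_SIZE];
    }
    for (int i = 0; i < newValues.length; i++) {
      int index = length + i;
      newChunks[index / CHUNK_SIZE][index % CHUNK_SIZE] = newValues[i];
    }
    return new ChunkedIntArray(newChunks, newLength);
  }

  /** True if both arrays point at the same underlying chunk object. */
  public boolean sharesChunk(ChunkedIntArray other, int chunkIndex) {
    return chunks[chunkIndex] == other.chunks[chunkIndex];
  }
}
```

Growing from six to nine values in this sketch reuses the first chunk and copies only the second, partially filled one.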
This change runs the current BWC indices generation code together
with the unit tests to catch issues with the generated indices early.
Each generation method runs a sanity check on the generated indices.
The changes in #12929 broke the generation code for BWC indices
since they expect certain fields created by LineDocFile.
This change adds some sanity checks that run with the unit tests to ensure
the generated BWC indices are at least readable with the current version.
Relates to #12929
This change prevents users from adding a parent field to an existing index.
The parent field must be added before any documents are added to the index to
prevent documents without the parent field from being indexed and later
being treated as child documents upon merge.
Relates to #12829
Before this change, `DocumentsWriterPerThread#getAndLock` could sometimes
return `null` even though the queue was empty at no point in time. The
practical implication is that we can end up with more DWPTs in memory than
indexing threads, which, while not strictly a bug, may require doing more
merging than we'd like later on.
I ran luceneutil's `IndexGeonames` with this change, and
`DocumentsWriterPerThread#getAndLock` was not the main source of
contention.
Closes #12649 #12916
Binary doc values were being written directly in SimpleTextCodec, though
they may not be valid UTF-8 (i.e. they may not be "text"). This change
encodes them as a string representing an array of hexadecimal bytes.
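To illustrate why a hex representation is safe for arbitrary bytes, here is a minimal round-trip sketch. The exact textual layout that SimpleTextCodec writes is not reproduced here; the class `HexBytes` and its two-digits-per-byte format are assumptions made for this example.

```java
/** Hypothetical hex codec for arbitrary byte arrays; illustration only. */
public final class HexBytes {

  private HexBytes() {}

  /** Encodes each byte as two lowercase hexadecimal digits. */
  public static String encode(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(Character.forDigit((b >> 4) & 0xF, 16));
      sb.append(Character.forDigit(b & 0xF, 16));
    }
    return sb.toString();
  }

  /** Decodes a string produced by {@link #encode(byte[])} back into bytes. */
  public static byte[] decode(String hex) {
    if (hex.length() % 2 != 0) {
      throw new IllegalArgumentException("odd number of hex digits: " + hex.length());
    }
    byte[] out = new byte[hex.length() / 2];
    for (int i = 0; i < out.length; i++) {
      int hi = Character.digit(hex.charAt(2 * i), 16);
      int lo = Character.digit(hex.charAt(2 * i + 1), 16);
      if (hi < 0 || lo < 0) {
        throw new IllegalArgumentException("not a hex digit at position " + (2 * i));
      }
      out[i] = (byte) ((hi << 4) | lo);
    }
    return out;
  }
}
```

Bytes such as `0xff` or a lone `0xc3`, which are not valid UTF-8 on their own, survive the round trip because the encoded form contains only ASCII hex digits.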
Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block will not be reordered by the sorting algorithm and will sort alongside their parent documents.
Relates to #12711
The SimpleTextSegmentInfoFormat was writing the random byte array used
as a segment's ID directly -- not converting to a simple text
representation of the byte array. As a result, the segment infos were
often malformed.
A previous iteration of this code used an AtomicInteger and
required this comment. The committed version uses a self-documenting
boolean and the comment is not needed.
Both of these tests have been disabled for quite a long time. While `TestManyPointsInOldIndex`
looks indeed stale, `TestIndexWriterOnOldIndex` is not a more general test.
`flushControl.isFullFlush()` is a surprising source of contention with
documents that are cheap to index and many indexing threads. If I slightly
modify luceneutil's `IndexGeoNames` benchmark to configure a 4GB indexing
buffer and disable `TextField` fields, which are more costly to index than
`KeywordField` or `IntField` fields, this brings the time to load all the
dataset in the `IndexWriter` buffers from 8.0s to 7.0s.
* Introduce stale workflow
* Exempt draft PRs
* Tune the action to our needs
1. Don't mark issues stale, only PRs.
2. Don't close anything automatically.
3. Keep the default Stale label.
4. Run in debug-only mode to start.
### Description
Identified in #12955, where `TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity` fails because of a disconnected HNSW graph.
This is a bigger issue, but we can reduce intermittent failures by keeping the number of docs and dimensions the same as in [`BaseKnnVectorQueryTestCase.testRandom`](dc9f154aa5/lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java (L470)) (a similar test for KNN with random vectors).
### Command to reproduce
```
./gradlew :lucene:core:test --tests "org.apache.lucene.search.TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity" -Ptests.jvms=12 -Ptests.jvmargs= -Ptests.seed=1A1CDC0974AF361
```