The SimpleTextSegmentInfoFormat was writing the random byte array used
as a segment's ID directly -- not converting to a simple text
representation of the byte array. As a result, the segment infos were
often malformed.
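
As an illustration of the kind of conversion described above, here is a minimal sketch that turns a raw ID byte array into plain hexadecimal text and back. The class `SegmentIdText` and its methods are hypothetical names for this example. They are not the code in `SimpleTextSegmentInfoFormat`, and hex is only one possible text representation.

```java
final class SegmentIdText {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    /** Encodes raw ID bytes as lowercase hex so the value is safe to write as text. */
    static String toHex(byte[] id) {
        StringBuilder sb = new StringBuilder(id.length * 2);
        for (byte b : id) {
            // High nibble first, then low nibble; the mask drops sign extension.
            sb.append(HEX[(b >> 4) & 0x0f]).append(HEX[b & 0x0f]);
        }
        return sb.toString();
    }

    /** Decodes the hex text produced by toHex back into the original bytes. */
    static byte[] fromHex(String text) {
        if (text.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string: " + text);
        }
        byte[] id = new byte[text.length() / 2];
        for (int i = 0; i < id.length; i++) {
            int hi = Character.digit(text.charAt(2 * i), 16);
            int lo = Character.digit(text.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex string: " + text);
            }
            id[i] = (byte) ((hi << 4) | lo);
        }
        return id;
    }

    public static void main(String[] args) {
        // Fixed bytes stand in for the random segment ID.
        byte[] id = {0x00, 0x0a, (byte) 0xff, 0x7f};
        String text = toHex(id);
        System.out.println(text); // prints 000aff7f
        System.out.println(java.util.Arrays.equals(id, fromHex(text))); // prints true
    }
}
```

Writing the bytes directly can produce characters that break a line-oriented text format, whereas the hex form only ever contains `0-9` and `a-f`.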
A previous iteration of this code used an AtomicInteger and
required this comment. The committed version uses a self-documenting
boolean and the comment is not needed.
Both of these tests have been disabled for quite a long time. While `TestManyPointsInOldIndex`
looks indeed stale, `TestIndexWriterOnOldIndex` is not a more general test.
`flushControl.isFullFlush()` is a surprising source of contention with
documents that are cheap to index and many indexing threads. With
luceneutil's `IndexGeoNames` benchmark slightly modified to configure a 4GB
indexing buffer and disable `TextField` fields, which are more costly to
index than `KeywordField` or `IntField` fields, this change brings the time
to load the whole dataset into the `IndexWriter` buffers from 8.0s to 7.0s.
* Introduce stale workflow
* Exempt draft PRs
* Tune the action to our needs
1. Don't mark issues stale, only PRs.
2. Don't close anything automatically.
3. Keep the default Stale label.
4. Run in debug-only mode to start.
### Description
Identified in #12955, where `TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity` fails because of a disconnected HNSW graph.
This is a bigger issue, but we can reduce intermittent failures by keeping the number of docs and dimensions the same as in [`BaseKnnVectorQueryTestCase.testRandom`](dc9f154aa5/lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java (L470)) (a similar test for KNN with random vectors).
### Command to reproduce
```
./gradlew :lucene:core:test --tests "org.apache.lucene.search.TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity" -Ptests.jvms=12 -Ptests.jvmargs= -Ptests.seed=1A1CDC0974AF361
```
Real-world data exhibits patterns that the compression logic takes
advantage of, but that are hard to reproduce in a randomized way, so this
new test adds interesting coverage.
It takes one second to run on my machine, so I did not mark it `@Nightly`.
* Fix PathHierarchyTokenizer positions
PathHierarchyTokenizer was emitting multiple tokens in the same position
with changing offsets. To be consistent with EdgeNGramTokenizer (which
is conceptually similar -- it's emitting multiple prefixes/suffixes off
the input string), we can output every token with length 1 with
positions incrementing by 1.
* Fix ReversePathHierarchyTokenizer positions
Making ReversePathHierarchyTokenizer consistent with recent changes in PathHierarchyTokenizer.
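
The position scheme described above can be sketched in plain Java, without any Lucene classes. `PathPositionsSketch` and `Token` are hypothetical names for this illustration. Reading "length 1" as a position length of 1 is an assumption, and edge cases such as a trailing delimiter may differ from what the real tokenizer does.

```java
import java.util.ArrayList;
import java.util.List;

final class PathPositionsSketch {
    /** Hypothetical stand-in for a token with position and offset attributes. */
    static final class Token {
        final String term;
        final int positionIncrement;
        final int positionLength;
        final int startOffset;
        final int endOffset;

        Token(String term, int endOffset) {
            this.term = term;
            this.positionIncrement = 1; // every token advances the position by one
            this.positionLength = 1;    // every token spans a single position
            this.startOffset = 0;       // each prefix starts at the beginning of the input
            this.endOffset = endOffset;
        }
    }

    /** Emits one token per path prefix, e.g. "/a", "/a/b", "/a/b/c" for "/a/b/c". */
    static List<Token> tokenize(String path, char delimiter) {
        List<Token> tokens = new ArrayList<>();
        int lastEnd = 0;
        // Start at 1 so a leading delimiter does not produce an empty token.
        for (int i = 1; i < path.length(); i++) {
            if (path.charAt(i) == delimiter) {
                tokens.add(new Token(path.substring(0, i), i));
                lastEnd = i;
            }
        }
        // The full path is the last prefix.
        if (path.length() > lastEnd) {
            tokens.add(new Token(path, path.length()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        int position = -1; // absolute position, accumulated from the increments
        for (Token t : tokenize("/a/b/c", '/')) {
            position += t.positionIncrement;
            System.out.println(position + " " + t.term + " [" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

With an increment of 1 per token, consecutive prefixes land on consecutive positions instead of stacking on a single position with changing offsets.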
---------
Co-authored-by: Michael Froh <froh@amazon.com>
While quantization generally works well, we can end up with unexpected ordering of the resulting vectors when the number of dimensions is tiny (just two, as in our tests), we are indexing a circle, and we have random merge policies.
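
To illustrate one way ordering can become ambiguous with so few dimensions, here is a rough sketch using a generic linear quantizer. `QuantizationTieSketch`, the fixed `[-1, 1]` range, and the 0..127 bucket count are assumptions made for this example, not Lucene's quantization code. It shows two distinct points on the unit circle collapsing to the same quantized vector, after which any similarity computed on the quantized values cannot tell them apart.

```java
final class QuantizationTieSketch {
    /** Hypothetical linear quantizer: maps [min, max] onto integer buckets 0..127. */
    static int quantize(float value, float min, float max) {
        float clamped = Math.max(min, Math.min(max, value));
        return Math.round((clamped - min) / (max - min) * 127f);
    }

    /** Quantizes each component of a vector independently. */
    static int[] quantize(float[] vector, float min, float max) {
        int[] out = new int[vector.length];
        for (int i = 0; i < vector.length; i++) {
            out[i] = quantize(vector[i], min, max);
        }
        return out;
    }

    /** Returns the two-dimensional point on the unit circle at the given angle (radians). */
    static float[] pointOnCircle(double angle) {
        return new float[] {(float) Math.cos(angle), (float) Math.sin(angle)};
    }

    public static void main(String[] args) {
        float[] a = pointOnCircle(1.000);
        float[] b = pointOnCircle(1.004);
        // The float vectors differ, but both components fall into the same buckets.
        System.out.println(java.util.Arrays.toString(quantize(a, -1f, 1f)));
        System.out.println(java.util.Arrays.toString(quantize(b, -1f, 1f)));
    }
}
```

With only two components there is little information left to break such ties, so the relative order of nearby vectors can depend on details such as how segments were merged.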
closes: https://github.com/apache/lucene/issues/12940