Both of these tests have been disabled for quiet a long time. While `TestManyPointsInOldIndex`
looks indeed stale, `TestIndexWriterOnOldIndex` is not a more general test.
`flushControl.isFullFlush()` is a surprising source of contention with
documents that are cheap to index and many indexing threads. If I slightly
modify luceneutil's `IndexGeoNames` benchmark to configure a 4GB indexing
buffer and disable `TextField` fields, which are more costly to index than
`KeywordField` or `IntField` fields, this brings the time to load all the
dataset in the `IndexWriter` buffers from 8.0s to 7.0s.
* Introduce stale workflow
* Exempt draft PRs
* Tune the action to our needs
1. Don't mark issues stale, only PRs.
2. Don't close anything automatically.
3. Keep the default Stale label.
4. Run in debug-only mode to start.
### Description
Identified in #12955, where `TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity` fails because of a disconnected HNSW graph
This is a bigger issue, but we can reduce intermittent failures by keeping the number of docs and dimensions same as [`BaseKnnVectorQueryTestCase.testRandom`](dc9f154aa5/lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java (L470)) (similar test for KNN with random vectors)
### Command to reproduce
```
./gradlew :lucene:core:test --tests "org.apache.lucene.search.TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity" -Ptests.jvms=12 -Ptests.jvmargs= -Ptests.seed=1A1CDC0974AF361
```
Real-world data exhibits patterns that are taken advantage of by the
compression logic, but also hardly reproducible in a randomized way. This makes
this new test introduce interesting coverage.
It takes one second to run on my machine, so I did not mark it `@Nightly`.
* Fix PathHierarchyTokenizer positions
PathHierarchyTokenizer was emitting multiple tokens in the same position
with changing offsets. To be consistent with EdgeNGramTokenizer (which
is conceptually similar -- it's emitting multiple prefixes/suffixes off
the input string), we can output every token with length 1 with
positions incrementing by 1.
* Fix ReversePathHierarchyTokenizer positions
Making ReversePathHierarchyTokenizer consistent with recent changes in PathHierarchyTokenizer.
---------
Co-authored-by: Michael Froh <froh@amazon.com>
While quantization generally works well, when the number of dimensions is tiny (just two like in our tests), and we are indexing a circle, and we have random merge policies, we can end up getting unexpected ordering on the resulting vectors.
closes: https://github.com/apache/lucene/issues/12940
* ensure Nori/Kuromoji shipped binary FST is the latest version (closes#12911)
* fold feedback from @uschindler: sharpen test failure methods to give the specific gradlew command to regenerate the precise FST (not everything)
* add javadoc for FSTMetadata.getVersion
* Simple patch to prevent the common zero-width code points in our source and some types of resource files
* Validate correct UTF-8 input and fix buggy CSS file (ISO-8859-x encoded)
* add a bit of context
* Add CHANGES.txt
Discovered in #12921, and introduced in #12679
The first issue is that we weren't advancing the `VectorScorer` [here](cf13a92950/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java (L257-L262)) -- so it was still un-positioned while trying to compute the similarity score
Earlier in the PR, the underlying delegate of the `FilteredDocIdSetIterator` was `scorer.iterator()` (see [here](cad565439b/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java (L107))) -- so we didn't need to explicitly advance it
Later, we decided to maintain parity to `AbstractKnnVectorQuery` and introduce filtering in `AbstractVectorSimilarityQuery` (see [this commit](5096790f28)) to determine the `visitLimit` of approximate search -- after which the underlying iterator changed to the accepted docs (see [here](5096790f28/lucene/core/src/java/org/apache/lucene/search/AbstractVectorSimilarityQuery.java (L255))) and I missed advancing the `VectorScorer` explicitly..
After doing so, we no longer get the original `java.lang.ArrayIndexOutOfBoundsException` -- but the `BaseVectorSimilarityQueryTestCase#testApproximate` starts failing because it falls back to exact search, as the limit of the prefilter is met during graph search
Relaxed the parameters of the test to fix this (making the filter less restrictive, and trying to visit a fewer number of nodes so that approximate search completes without hitting its limit)
Sorry for missing this earlier!