Binary doc values were being written directly in SimpleTextCodec, though
they may not be valid UTF-8 (i.e. they may not be "text"). This change
encodes them as a string representing an array of hexadecimal bytes.
Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block not be reordered by the sorting algorithm and will sort along side their parent documents.
Relates to #12711
The SimpleTextSegmentInfoFormat was writing the random byte array used
as a segment's ID directly -- not converting to a simple text
representation of the byte array. As a result, the segment infos were
often malformed.
A previous iteration of this code used an AtomicInteger and
required this comment. The committed version uses a self-documenting
boolean and the comment is not needed.
Both of these tests have been disabled for quiet a long time. While `TestManyPointsInOldIndex`
looks indeed stale, `TestIndexWriterOnOldIndex` is not a more general test.
`flushControl.isFullFlush()` is a surprising source of contention with
documents that are cheap to index and many indexing threads. If I slightly
modify luceneutil's `IndexGeoNames` benchmark to configure a 4GB indexing
buffer and disable `TextField` fields, which are more costly to index than
`KeywordField` or `IntField` fields, this brings the time to load all the
dataset in the `IndexWriter` buffers from 8.0s to 7.0s.
* Introduce stale workflow
* Exempt draft PRs
* Tune the action to our needs
1. Don't mark issues stale, only PRs.
2. Don't close anything automatically.
3. Keep the default Stale label.
4. Run in debug-only mode to start.
### Description
Identified in #12955, where `TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity` fails because of a disconnected HNSW graph
This is a bigger issue, but we can reduce intermittent failures by keeping the number of docs and dimensions same as [`BaseKnnVectorQueryTestCase.testRandom`](dc9f154aa5/lucene/core/src/test/org/apache/lucene/search/BaseKnnVectorQueryTestCase.java (L470)) (similar test for KNN with random vectors)
### Command to reproduce
```
./gradlew :lucene:core:test --tests "org.apache.lucene.search.TestFloatVectorSimilarityQuery.testVectorsAboveSimilarity" -Ptests.jvms=12 -Ptests.jvmargs= -Ptests.seed=1A1CDC0974AF361
```
Real-world data exhibits patterns that are taken advantage of by the
compression logic, but also hardly reproducible in a randomized way. This makes
this new test introduce interesting coverage.
It takes one second to run on my machine, so I did not mark it `@Nightly`.
* Fix PathHierarchyTokenizer positions
PathHierarchyTokenizer was emitting multiple tokens in the same position
with changing offsets. To be consistent with EdgeNGramTokenizer (which
is conceptually similar -- it's emitting multiple prefixes/suffixes off
the input string), we can output every token with length 1 with
positions incrementing by 1.
* Fix ReversePathHierarchyTokenizer positions
Making ReversePathHierarchyTokenizer consistent with recent changes in PathHierarchyTokenizer.
---------
Co-authored-by: Michael Froh <froh@amazon.com>