Given an input text 'A B A C A B C' and search ORDERED(A, B, C), we should
retrieve hits [0,3] and [4,6]; currently [4,6] is skipped.
After finding the first interval [0, 3], the subintervals will become A[0,0], B[1,1],
C[3,3]; then the algorithm will try to minimize it and the subintervals will
become: A:[2,2], B:[5,5], C:[3,3] (after finding 5 > 3 it breaks the minimization)
And when finding next interval, it will do advance(B) before checking whether
it is after A(the do-while loop), so subintervals will become A[2,2], B[inf, inf],
C[3,3] and return NO_MORE_INTERVAL.
This commit instead continues advancing subintervals from where the last
`nextInterval` call stopped, rather than always advancing all subintervals.
Currently we're only half reusing postings enums when flushing sorted indexes
as we still create new wrapper instances every time, which can be costly with
fields that have many terms.
Obtaining a DWPT and putting it back into the pool is subject to contention.
This change reduces contention by using 8 sub pools that are tried sequentially.
When applied on top of #12198, this reduces the time to index geonames with 20
threads from ~19s to ~16-17s.
This switches to LSBRadixSorter instead of TimSorter to sort postings whose
index options are `DOCS`. On a synthetic benchmark this yielded barely any
difference in the case when the index order is the same as the sort order, or
reverse, but almost a 3x speedup for writing postings in the case when the
index order is mostly random.
lucene-util's `IndexGeoNames` benchmark is heavily contended when running with
many indexing threads, 20 in my case. The main offender is
`DocumentsWriterFlushControl#doAfterDocument`, which runs after every index
operation to update doc and RAM accounting.
This change reduces contention by only updating RAM accounting if the amount of
RAM consumption that has not been committed yet by a single DWPT is at least
0.1% of the total RAM buffer size. This effectively batches updates to RAM
accounting, similarly to what happens when using `IndexWriter#addDocuments` to
index multiple documents at once. Since updates to RAM accounting may be
batched, `FlushPolicy` can no longer distinguish between inserts, updates and
deletes, so all 3 methods got merged into a single one.
With this change, `IndexGeoNames` goes from ~22s to ~19s and the main offender
for contention is now `DocumentsWriterPerThreadPool#getAndLock`.
Co-authored-by: Simon Willnauer <simonw@apache.org>
In Lucene 5.4.0 62313b83ba9c69379e1f84dffc881a361713ce9 introduced some changes for immutability of queries. setBoost() function was replaced with new BoostQuery(), but BoostQuery is not handled in setSlop function. This commit adds the handling of BoostQuery in setSlop() function.
GH#11744 deprecated LongValueFacetCounts#getTopChildrenSortByCount in favor
of the standard Facets#getTopChildren. The issue is that #getTopChildrenSortByCount
didn't do any input validation and allowed for topN == 0, while #getTopChildren
does input validation. Randomized testing could produce topN values of 0, which
resulted in falied tests. This addresses the tests.
* Define inputs and outputs for task validateJarLicenses
* Lazily configure validateJarLicenses
* Move functionality from copyTestResources task into processTestResources task
* Lazily configure processTestResources
* Altered TestCustomAnalyzer.testStopWordsFromFile() to find resources in updated location
* Resolve "overlapping output" issue preventing processTestResources from being cached
* Provide system properties from CommandLineArgumentProviders
* Configure certain system properties as inputs to take advantage of UP-TO-DATE checking
* Applies the correct pathing strategies to take full advantage of caching even if builds are executed from different locations on disk
* Make validateSourcePatterns task cacheable by removing .gradle directory from its input
- Reduce overhead of non-concurrent search by preserving original execution
- Improve readability by factoring into separate functions
---------
Co-authored-by: Kaival Parikh <kaivalp2000@gmail.com>
This change introduces `MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE`, a new rewrite method
meant to be used in place of `MultiTermQuery#CONSTANT_SCORE_REWRITE` as the default for multi-term
queries that act as a filter. Currently, multi-term queries with a filter rewrite internally rewrite to a
disjunction if 16 terms or less match the query. Otherwise postings lists of
matching terms are collected into a `DocIdSetBuilder`. This change replaces the
latter with a mixed approach where a disjunction is created between the 16
terms that have the highest document frequency and an iterator produced from
the `DocIdSetBuilder` that collects all other terms. On fields that have a
zipfian distribution, it's quite likely that no high-frequency terms make it to
the `DocIdSetBuilder`. This provides two main benefits:
- Queries are less likely to allocate a FixedBitSet of size `maxDoc`.
- Queries are better at skipping or early terminating.
On the other hand, queries that need to consume most or all matching documents may get a slowdown, so
users can still opt-in to the "full filter rewrite" functionality by overriding the rewrite method. This is the new
default for PrefixQuery, WildcardQuery and TermRangeQuery.
Co-authored-by: Adrien Grand <jpountz@gmail.com> / Greg Miller <gsmiller@gmail.com>
The default implementation of merging doc values resolves the ordinal of a
document in `nextDoc()`. But sometimes, doc values iterators are consumed
without retrieving ordinals, e.g. to write the set of documents that have a
value, so this may be wasteful.
With this change, ordinals get resolved lazily upon `ordValue()`.
`LogMergePolicy` has this boundary at the floor level that prevents merging
segments above the minimum segment size with segments below this size. I cannot
see a benefit from doing this, and no tests fail if I remove it, while this
boundary has the downside of not running merges that seem legit to me. Should
we remove this boundary check?
Indexing simple keywords through a `TokenStream` abstraction introduces a bit
of overhead due to attribute management. Not much, but indexing keywords boils
down to adding to a hash map and appending to a postings list, which is quite
cheap too so even some low overhead can significantly impact indexing speed.
The two minor performance improvements are around count and Weight#scorer.
segmentStarts is a monotonically increasing start for each scored document indexed by leaf-segment ordinal. Consequently, if the upper and lower segments are equivalent, that means no docs match for this segment.
Count is similarly calculated by the difference between upper and lower segmentStarts according to the segment ordinal.
Similar to use of ScorerSupplier in #12129, implement it here too,
because creation of a Scorer requires lookupTerm() operations in the DV
terms dictionary. This results in wasted effort/random accesses, if, based on the cost(),
IndexOrDocValuesQuery decides not to use this query.
The helper class DocAndScoreQuery implements advanceShallow to help skip
non-competitive documents. This method doesn't actually keep track of where it
has advanced, which means it can do extra work.
Overall the complexity here didn't seem worth it, given the low cost of
collecting matching kNN docs. This PR switches to a simpler approach, which uses
a fixed upper bound on the max score.
This change adjusts the cache policy to ensure that all segments in the
max tier to be cached. Before, we cache segments that have more than 3%
of the total documents in the index; now cache segments have more than
half of the average documents per leave of the index.
Closes#12140
The test assumes a single segment is created (because only one scorer is created from the leaf contexts).
But, force merging wasn't done before getting the reader. Forcemerge to a single segment before getting the reader.
In TestUnifiedHighlighterTermVec, we have a special reader which counts the
number of times term vectors are accessed, so that we can assert that caching
works correctly here. There is some special logic in place to skip the check
when the test framework wraps readers with CheckIndex or ParallelReader;
however, this logic no longer works with ParallelReader in particular, because
term vectors are now accessed through an anonymous class.
A simpler solution here is to call newSearcher(reader, false), which disables
wrapping, meaning that we can remove this extra logic entirely.
Fixes#12115
* Remove implicit addition of vector 0
Removes logic to add 0 vector implicitly. This is in preparation for
adding nodes from other graphs to initialize a new graph. Having the
implicit addition of node 0 complicates this logic.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Enable out of order insertion of nodes in hnsw
Enables nodes to be added into OnHeapHnswGraph in out of order fashion.
To do so, additional operations have to be taken to resort the
nodesByLevel array. Optimizations have been made to avoid sorting
whenever possible.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Add ability to initialize from graph
Adds method to initialize an HNSWGraphBuilder from another HNSWGraph.
Initialization can only happen when the builder's graph is empty.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Utilize merge with graph init in HNSWWriter
Uses HNSWGraphBuilder initialization from graph functionality in
Lucene95HnswVectorsWriter. Selects the largest graph to initialize the
new graph produced by the HNSWGraphBuilder for merge.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Minor modifications to Lucene95HnswVectorsWriter
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Use TreeMap for graph structure for levels > 0
Refactors OnHeapHnswGraph to use TreeMap to represent graph structure of
levels greater than 0. Refactors NodesIterator to support set
representation of nodes.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Refactor initializer to be in static create method
Refeactors initialization from graph to be accessible via a create
static method in HnswGraphBuilder.
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Address review comments
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Add change log entry
Signed-off-by: John Mazanec <jmazane@amazon.com>
* Remove empty iterator for neighborqueue
Signed-off-by: John Mazanec <jmazane@amazon.com>
---------
Signed-off-by: John Mazanec <jmazane@amazon.com>
`KeywordField` is a combination of `StringField` and `SortedSetDocValuesField`,
similarly to how `LongField` is a combination of `LongPoint` and
`SortedNumericDocValuesField`. This makes it easier for users to create fields
that can be used for filtering, sorting and faceting.
* Optimize the common case that docs only have single values for the field
* In the multivalued case, terminate reading docvalues if they are > maximum set ordinal
* Implement ScorerSupplier, so that (potentially large) number of ordinal lookups aren't performed just to get the cost()
* Graduate to Sorted(Set)DocValuesField.newSlowSetQuery to complement newSlowRangeQuery, newSlowExactQuery
Like other slow queries in these classes, it's currently only recommended to use with points, e.g. IndexOrDocValuesQuery(new PointInSetQuery, newSlowSetQuery)
LongHashSet is used for the set of numbers, but it has some issues:
* tries to hard to extend AbstractSet, mostly for testing
* causes traps with boxing if you aren't careful
* complex hashcode/equals
Practically we should take advantage of the fact numbers come in sorted
order for multivalued fields: just like range queries do. So we use
min/max to our advantage, including termination of docvalues iteration
Actually it is generally a win to just check min/max even in the single-valued
case: these constant time comparisons are cheap and can avoid hashing,
etc.
In the worst-case, if all of your query Sets contain both the minimum and maximum
possible values, then it won't help, but it doesn't hurt either.
There's no need to make things abstract: DocValues does the right thing
Optimizing for where no docs for the field in the segment exist is easy, simple null check (replacing the existing one!)
Currently stored fields have to look at binaryValue(), stringValue() and
numericValue() to guess the type of the value and then store it. This has a few
issues:
- If there is a problem, e.g. all of these 3 methods return null, it's
currently discovered late, when we already passed the responsibility of
writing data from IndexingChain to the codec.
- numericValue() is used both for numeric doc values and storage. This makes
it impossible to implement a `double` field that is stored and doc-valued,
as numericValue() needs to return simultaneously a number that consists of
the double for storage, and the long bits of the double for doc values.
- binaryValue() is used both for sorted(_set) doc values and storage. This
makes it impossible to implement `keyword` fields that is stored and
doc-valued, as the field returns a non-null value for both binaryValue() and
stringValue() and stored fields no longer know which field to store.
This commit introduces `IndexableField#storedValue()`, which is used only for
stored fields. This addresses the above issues. IndexingChain passes the
storedValue() directly to the codec, so it's impossible for a stored fields
format to mistakenly use binaryValue()/stringValue()/numericValue() instead of
storedValue().