Commit Graph

35346 Commits

Author SHA1 Message Date
Dawid Weiss 1bb4554832
LUCENE-10135: Correct passage selector behavior for long matching snippets (#334) 2021-09-30 15:05:41 +02:00
Chris Hegarty 797cfbf477
LUCENE-10118: Improve CMS infostream messages (#337)
Expand the log message when CMS.MergeThread completes its merge operation, 
to include additional useful diagnostic information, like the total-bytes-written,
the time taken, as well as rate limiter information. Also, while here, unify the 
thread start and end log output to help improve tracing.
2021-09-30 11:43:45 +01:00
Alan Woodward ca810e732d
LUCENE-10138: Use maven central to resolve third-party gradle plugins (#336)
The gradle plugin portal uses jcenter to resolve third-party plugins, which
can be flaky. This commit instructs gradle to look first in maven central,
and only use the plugin portal for gradle's own plugins.
2021-09-30 11:41:05 +01:00
Chris Hegarty 3e568b911f
Support addition of diagnostics by custom merge policies (#329)
This commit adds a new `addDiagnostics` method to `SegmentInfo` that
allows custom merge policies to add new diagnostic information to the
segment's diagnostic map.
2021-09-30 09:50:22 +01:00
Dawid Weiss 0c13a52df5 Correct test error that allowed an empty array. 2021-09-30 09:17:29 +02:00
Dawid Weiss d2b88b7a0b LUCENE-10134: clean up the test from leaking threads and resources if an error occurs somewhere - this obscures the original cause of the problem. 2021-09-30 09:14:58 +02:00
Mayya Sharipova 88b264a368
LUCENE-10126 Add extra test on _doc sort (#326)
Add extra test on _doc sort to test
that search with after collects all documents
2021-09-29 14:49:16 -04:00
Adrien Grand 84e4050269
LUCENE-10125: Speed up DirectWriter. (#327)
There was a regression introduced in
https://github.com/apache/lucene/pull/107/files#diff-49b11ced76acedf749c5a5a0ff6e7fe93b8fb64caf8697e487a56f4f7adbb510
where we moved from write logic that was optimized for every number of bits per
value to more general logic that had to work for every number of bits per value.

This PR doesn't restore as much specialization, but settles on a middle ground
that makes flushes and merges of doc values noticeably faster (though not much
faster).
2021-09-29 19:21:14 +02:00
Nhat Nguyen e56995d85e LUCENE-10126: Remove chunk scoring in AssertingBulkScorer
Many tests are failing due to the newly introduced chunk scoring in
AssertingBulkScorer. This commit reverts that change and will
reintroduce it later.
2021-09-28 22:21:17 -04:00
Nhat Nguyen cb153886eb LUCENE-10126: Fix AssertingBulkScorer
AssertingBulkScorer can generate a backward sub-range.
2021-09-28 19:50:40 -04:00
Timothy Potter a73848cfab DOAP changes for release 8.10.0 2021-09-28 13:28:45 -06:00
Nhat Nguyen 5ab900e10b
LUCENE-10126: Fix competitiveIterator wrongly skip documents (#324)
The competitive iterator can wrongly skip a document that is advanced 
but not collected in the previous scoreRange.
2021-09-28 15:26:30 -04:00
Adrien Grand 9f80b4d8fb
LUCENE-10125: Speed up computation of exceptions. (#322)
Even though it was not the driver for the slowdown, in LUCENE-10125 we
identified that the move to PFOR had slowed down indexing significantly
for fields indexed with indexOptions=DOCS. This patch gets some of the
performance back by using the `LongHeap` that we introduced for vectors
instead of sorting the same array over and over again.

On the NYC Taxis benchmark, I observed ~8% faster merges of postings
with this change.
2021-09-28 17:25:56 +02:00
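The LUCENE-10125 message above describes using a heap instead of re-sorting the same array. As a rough illustration only, the sketch below selects the k largest values both ways. It uses `java.util.PriorityQueue` as a stand-in for Lucene's `LongHeap`, and the names `TopKSketch`, `largestBySorting` and `largestByHeap` are hypothetical; this is not the code from the patch.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

final class TopKSketch {
  /** Baseline: sort a copy of the whole array, then take the last k values. Assumes 0 <= k <= values.length. */
  static long[] largestBySorting(long[] values, int k) {
    long[] copy = values.clone();
    Arrays.sort(copy);
    return Arrays.copyOfRange(copy, copy.length - k, copy.length);
  }

  /** Heap variant: a min-heap of at most k entries holds the largest values seen so far. */
  static long[] largestByHeap(long[] values, int k) {
    PriorityQueue<Long> heap = new PriorityQueue<>();
    for (long v : values) {
      if (heap.size() < k) {
        heap.add(v);
      } else if (k > 0 && v > heap.peek()) {
        // The smallest retained value is no longer in the top k; replace it.
        heap.poll();
        heap.add(v);
      }
    }
    long[] result = new long[heap.size()];
    for (int i = 0; i < result.length; i++) {
      result[i] = heap.poll(); // polled in ascending order
    }
    return result;
  }
}
```

Both methods return the same values in ascending order; the heap variant does O(n log k) work rather than sorting all n values each time.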
Adrien Grand 8f3f2ea4ab
LUCENE-10127: Minor speedup to doc values writes. (#325)
This slightly reduces the overhead of writing doc values. On the NYC Taxis
benchmark this resulted in ~10% faster merges for doc values.
2021-09-28 17:23:09 +02:00
Robert Muir 6ac311068f
LUCENE-10128: avoid costly reflection in SparseFixedBitSet ctor
Seems that VectorFormat merge creates A LOT of these bitsets. We don't
need to do any fancy reflection here via shallowSizeOf(Object), when we
can call sizeOf(long[]) which is fast.

We may want to revisit this RAMUsageEstimator api in the future to
prevent traps like this.
2021-09-28 09:39:36 -04:00
Adrien Grand 7357bdc272
LUCENE-10123: Handling of singletons in DocValuesConsumer. (#320)
This avoids double wrapping of doc values in `Lucene90DocValuesConsumer`.
2021-09-28 08:54:46 +02:00
Greg Miller 1ebd193fbe
Move CHANGES entry for LUCENE-10070 under 8.11 after backport (#323) 2021-09-27 12:15:52 -07:00
Uwe Schindler 849d5fc1ac
LUCENE-10125: Optimize primitive writes in OutputStreamIndexOutput (#321) 2021-09-27 19:04:03 +02:00
Julie Tibshirani eaa421094d
LUCENE-10109: Bump default beam width for HNSW (#312)
Lucene90HnswVectorsFormat has a default 'beam width' of 16. This is quite low
and produces poor recall on typical-sized datasets.

This commit bumps it to 100. This new default tries to balance good search
performance with indexing speed. Most runs in ann-benchmarks set the parameter
between ~400 and 800, but they are heavily optimizing search over index speed.
2021-09-24 18:02:34 -07:00
Greg Miller eb44d1e6ad
Add slightly more language in the README Contributing section (#318) 2021-09-24 12:06:06 -07:00
Nhat Nguyen 7390d1af51
LUCENE-10119: Do not set single sort with search after (#317)
We should not set single sort when the search_after is non-null; 
otherwise, we will incorrectly skip documents whose values are equal to
the value from the search_after and docIDs are greater than the docID
from the search_after.
2021-09-23 13:10:17 -04:00
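The LUCENE-10119 message above turns on how ties with the search_after value are broken. The sketch below is a toy model of that rule, assuming an ascending sort on a single long value; `SearchAfterSketch`, `isAfter` and `isAfterValueOnly` are hypothetical names and not Lucene APIs.

```java
final class SearchAfterSketch {
  /** Full comparison: sort value first, then docID as the tie-breaker. */
  static boolean isAfter(long value, int docID, long afterValue, int afterDoc) {
    if (value != afterValue) {
      return value > afterValue;
    }
    // Equal values: only documents with a greater docID come after the cursor.
    return docID > afterDoc;
  }

  /** Value-only comparison: treats every document that ties with the cursor as not after it. */
  static boolean isAfterValueOnly(long value, long afterValue) {
    return value > afterValue;
  }
}
```

With a cursor of (value 10, docID 5), a document (value 10, docID 7) should still be returned. The full comparison accepts it, while the value-only comparison skips it, which is the kind of loss the commit message describes.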
Uwe Schindler fc475360a8 Only pass "--illegal-access=deny" up to JDK-15, later versions deprecate the option and default to "deny" 2021-09-22 19:41:59 +02:00
Lu Xugang ed7fb8dea0
LUCENE-10116: Missing calculating the bytes used of DocsWithFieldSet and currentValues in SortedSetDocValuesWriter (#316) 2021-09-22 14:25:40 +02:00
Lu Xugang a7bddfaacc
LUCENE-10111: Missing calculating the bytes used of DocsWithFieldSet in NormValuesWriter (#307) 2021-09-22 07:44:30 +02:00
Chris Hegarty a7578709a6
LUCENE-10115: Add a fuzzy parsing extension point for custom query parsers
This commit adds the QueryParserBase::getFuzzyDistance protected method, which 
can be overridden by subclasses to provide customisation of how the similarity distance 
is determined. The default implementation retains the current behaviour.
2021-09-21 13:25:09 +01:00
Julie Tibshirani b2a04a4bb4 LUCENE-10069: Adjust TestKnnVectorQuery#testRandom to stop failures
The test fails randomly because HNSW can sometimes miss results when k is close
to the number of total docs. While we wait for a fix, this commit decreases k to
prevent failures.
2021-09-20 14:16:47 -07:00
Uwe Schindler 5871ea7972
LUCENE-10112: Improve LZ4 Compression performance with direct primitive read/writes (#310)
Co-authored-by: Tim Brooks <tim@timbrooks.org>
2021-09-20 19:12:38 +02:00
Christine Poerschke 57524c6a5e
LUCENE-9809: replace 'master' with 'main' in release wizard (#305) 2021-09-20 17:51:41 +01:00
Uwe Schindler c57d6e5f8c
LUCENE-10113: Use VarHandles to access int/long/short types in byte arrays (e.g. ByteArrayDataInput) (#308)
Co-authored-by: Robert Muir <rmuir@apache.org>
2021-09-20 15:37:33 +02:00
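For readers unfamiliar with the technique named in LUCENE-10113 above, here is a minimal sketch of reading and writing primitives in a `byte[]` through `MethodHandles.byteArrayViewVarHandle`. The little-endian byte order and the class and method names are assumptions made for illustration; this is not the code from the patch.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

final class ByteArrayVarHandleSketch {
  // Views over byte[] that read/write int and long values; byte order is an assumption here.
  private static final VarHandle INT_LE =
      MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
  private static final VarHandle LONG_LE =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  /** Byte-by-byte little-endian read, shown for comparison. */
  static int readIntManual(byte[] b, int off) {
    return (b[off] & 0xFF)
        | ((b[off + 1] & 0xFF) << 8)
        | ((b[off + 2] & 0xFF) << 16)
        | ((b[off + 3] & 0xFF) << 24);
  }

  /** The same read as a single VarHandle access; the index is a byte offset. */
  static int readInt(byte[] b, int off) {
    return (int) INT_LE.get(b, off);
  }

  static void writeLong(byte[] b, int off, long v) {
    LONG_LE.set(b, off, v);
  }

  static long readLong(byte[] b, int off) {
    return (long) LONG_LE.get(b, off);
  }
}
```

Plain `get` and `set` on these handles accept unaligned byte offsets, so the offset does not need to be a multiple of the value size.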
Adrien Grand 4bcd64c5ed LUCENE-9620: Fix test bug. 2021-09-20 09:49:13 +02:00
Uwe Schindler 075d801abe
LUCENE-10114: Remove unused byte order mark in Lucene90PostingsWriter (#309)
Co-authored-by: Robert Muir <rmuir@apache.org>
2021-09-20 08:37:05 +02:00
Jim Ferenczi ccf0d5404d
LUCENE-10110: MultiCollector should conditionally wrap single leaf collector (#303)
MultiCollector should wrap a single leaf collector that wants to skip
low-scoring hits when the combined score mode doesn't allow it.
2021-09-20 07:26:51 +02:00
Tomoko Uchida 6c1e5920d8 LUCENE-10102: do not call incrementToken() against already consumed input stream. 2021-09-20 10:58:39 +09:00
Robert Muir 8b95e51d70
Add additional docs refs (nightly, build system help/) to README.md (#302) 2021-09-19 20:24:13 -04:00
Uwe Schindler f3c3b90e35
LUCENE-9047: fix typo in javadocs (still referred to big endian) 2021-09-19 13:51:51 +02:00
Tomoko Uchida 5dfbef313c LUCENE-10102: Fix JapaneseCompletionFilter javadoc 2021-09-18 14:43:03 +09:00
goankur deff5a1f5a
LUCENE-10070: Skip deleted documents during facet counting for all documents (#293) 2021-09-17 10:35:44 -07:00
Tomoko Uchida 4e86df96c0
LUCENE-10102: Add JapaneseCompletionFilter for Input Method-aware auto-completion (#297)
Co-authored-by: Robert Muir <rmuir@apache.org>
2021-09-17 22:37:12 +09:00
Dawid Weiss de45b68c90 LUCENE-9448, LUCENE-9990: fix Luke's launcher task. 2021-09-16 08:49:26 +02:00
Nhat Nguyen b7a286dd69
LUCENE-10106: Sort optimization wrongly skip first docs (#300)
The first documents of subsequent segments are mistakenly skipped when 
sort optimization is enabled. We should initialize maxDocVisited in
NumericComparator to -1 instead of 0.
2021-09-15 09:21:59 -04:00
Uwe Schindler 1586933b18 Merge branch 'main' of https://gitbox.apache.org/repos/asf/lucene into main 2021-09-15 01:19:42 +02:00
Uwe Schindler 3c6d4a00cd LUCENE-10104, SOLR-15631: Upgrade forbiddenapis to version 3.2 2021-09-15 01:19:17 +02:00
Alan Woodward 26093735cc
LUCENE-8638: Expressions haversin() method should continue to return its value in km (#299)
SloppyMath had a deprecated haversin() function that returned its values in
km, which has been replaced by a haversinMeters() function that is explicit
about its units. As part of removing this function, we changed the expressions
module haversin function to point instead to haversinMeters. However, this
may silently change the behaviour of expressions on upgrade.

This commit instead adds a haversinKilometers method to the expressions
module and maps the haversin function to it. It also adds a new
haversinMeters expression function to be more explicit for future users.
2021-09-14 14:01:10 +01:00
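To make the units issue in LUCENE-8638 above concrete, the sketch below computes the haversine distance in meters and derives the kilometre value from it. The method names follow the commit message, but the signatures, the class name `HaversinUnitsSketch` and the earth radius constant are assumptions for illustration, not the SloppyMath or expressions module code.

```java
final class HaversinUnitsSketch {
  // Assumed mean earth radius in meters, chosen for this sketch only.
  static final double EARTH_MEAN_RADIUS_METERS = 6_371_008.7714;

  /** Great-circle distance in meters between two points given in degrees. */
  static double haversinMeters(double lat1, double lon1, double lat2, double lon2) {
    double phi1 = Math.toRadians(lat1);
    double phi2 = Math.toRadians(lat2);
    double sinHalfDPhi = Math.sin((phi2 - phi1) / 2);
    double sinHalfDLambda = Math.sin(Math.toRadians(lon2 - lon1) / 2);
    double a = sinHalfDPhi * sinHalfDPhi
        + Math.cos(phi1) * Math.cos(phi2) * sinHalfDLambda * sinHalfDLambda;
    // Clamp to 1.0 so rounding error cannot push asin out of its domain.
    return 2 * EARTH_MEAN_RADIUS_METERS * Math.asin(Math.min(1.0, Math.sqrt(a)));
  }

  /** Same distance expressed in kilometers. */
  static double haversinKilometers(double lat1, double lon1, double lat2, double lon2) {
    return haversinMeters(lat1, lon1, lat2, lon2) / 1000.0;
  }
}
```

An expression written against the kilometre function would see values 1000 times larger if it were silently remapped to the meters function, which is the behaviour change the commit avoids.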
Uwe Schindler 3802bdc686
LUCENE-10101: Use getField() instead of getDeclaredField() to minimize security impact by analysis SPI discovery (#298) 2021-09-14 10:31:46 +02:00
Jim Ferenczi 19537578dd
LUCENE-10089: Disable numeric sort optimization early (#291)
This commit moves the responsibility to disable
the numeric sort optimization on comparators to the SortField.
This way we don't need to apply the logic in every top field collector.
2021-09-13 07:31:43 +02:00
Robert Muir 56968b762a
LUCENE-10098: add note/link to GermanAnalyzer for decompounding nouns. (#294)
We can't do this out of the box with the analyzer, due to incompatible
licenses. But we can make it easy for the user to do this, by linking to
a repo that has sample code, documentation, and the required data files.
2021-09-12 12:55:51 -04:00
Robert Muir 24aa45dc3e
LUCENE-10096: Tamil Analyzer (#292)
Add Tamil analyzer based on snowball stemmer and TamilNLP stopwords
2021-09-10 21:02:11 -04:00
Robert Muir 8bce765218
LUCENE-10095: Nepali Analyzer (#290)
Add Nepali analyzer based on snowball stemmer and NLTK stopwords
2021-09-10 20:45:23 -04:00
Alan Woodward cc8c4283dd LUCENE-10094: Fix test bug 2021-09-10 16:32:33 +01:00
Alan Woodward 1bb52859c8
LUCENE-10094: Delegate count() from CachingWrapperWeight (#289)
CachingWrapperWeight always returns -1 from its count() method, which
disables the fast path for TermQuery, MatchAllDocQuery, etc, when running
IndexSearcher.count(Query). This commit makes it delegate the method
to its wrapped Weight.
2021-09-10 10:45:20 +01:00
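The effect described in LUCENE-10094 above can be modelled with a toy wrapper. The sketch assumes, as the message states, that -1 means the count is not cheaply known and the caller must fall back to a slower path. `CountingWeight`, `wrapWithoutDelegation`, `wrapWithDelegation` and `countHits` are hypothetical stand-ins, not Lucene classes or methods.

```java
import java.util.function.IntSupplier;

final class CountDelegationSketch {
  /** Hypothetical stand-in for a weight; -1 means "count not cheaply known". */
  interface CountingWeight {
    int count();
  }

  /** Wrapper that always reports "unknown", modelling the behaviour before the fix. */
  static CountingWeight wrapWithoutDelegation(CountingWeight inner) {
    return () -> -1;
  }

  /** Wrapper that forwards count() to the wrapped weight, modelling the fix. */
  static CountingWeight wrapWithDelegation(CountingWeight inner) {
    return inner::count;
  }

  /** Caller: use the cheap count when available, otherwise run the slow path. */
  static int countHits(CountingWeight weight, IntSupplier slowPath) {
    int fast = weight.count();
    return fast >= 0 ? fast : slowPath.getAsInt();
  }
}
```

Both wrappers lead to the same final count, but only the delegating one lets the caller skip the slow path.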