12085 Commits

Author SHA1 Message Date
Dawid Weiss c5cf13259e
LUCENE-9562: All binary analysis packages (and corresponding Maven artifacts) with names containing '-analyzers-' have been renamed to '-analysis-'. (#1968) 2020-10-12 09:15:56 +02:00
Dawid Weiss 7362c4ce60
LUCENE-6831: start removing LinkedList in favor of ArrayList or De/Queues (#1969)
I'm committing it in, seems like a trivial thing.
2020-10-12 09:15:07 +02:00
Jason Gerlowski 9e13d99c52 Add back-compat indices for 8.6.3 2020-10-09 11:39:20 -04:00
msfroh 4e0aa0d23b
LUCENE-9567: JPOSSFF loads built-in stop tags by default (#1961)
load stoptags.txt from analysis-kuromoji when no tags argument is specified
2020-10-09 10:52:07 -04:00
Jason Gerlowski 76a8cc3c3e Add bugfix version 8.6.3 2020-10-09 10:27:42 -04:00
Uwe Schindler 2329423e5c
LUCENE-9577: Move Lucene/Solr Documentation assembly to subproject (#1967) 2020-10-09 14:56:44 +02:00
Mike Drob 08e38d3452
LUCENE-9488 Create Release Artifacts with Gradle (#1905)
* Build Lucene binary distribution using Gradle
* Generate SHA-512 checksums for all release artifacts
* Update documentation artifacts included in binaries
* Delete some additional Ant relics

Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>
Co-authored-by: Uwe Schindler <uschindler@apache.org>
2020-10-08 14:25:51 -05:00
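For readers unfamiliar with the checksum step named in the commit above, here is a minimal sketch of computing a SHA-512 checksum for an artifact in plain Java. It is not the Gradle task from the commit. The class name Sha512Sketch and the method sha512Hex are hypothetical and only illustrate what such a checksum is.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the Lucene build.
final class Sha512Sketch {

    // Returns the SHA-512 digest of the given bytes as lowercase hex.
    static String sha512Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-512");
            byte[] digest = md.digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to avoid sign extension of negative bytes.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-512 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hash the file named on the command line, or a sample string if none is given.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : "example artifact bytes".getBytes(StandardCharsets.UTF_8);
        System.out.println(sha512Hex(data));
    }
}
```

A SHA-512 digest is 64 bytes, so the hex string is always 128 characters long.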
Mayya Sharipova 0b08943112
LUCENE-9566 TestApproximationSearchEquivalence.testExclusion fix (#1955) 2020-10-07 07:02:46 -04:00
Mayya Sharipova 5039e7170b Mute TestApproximationSearchEquivalence.testExclusion
Temporarily mute TestApproximationSearchEquivalence.testExclusion
2020-10-06 15:10:48 -04:00
Mayya Sharipova 6b8288445f LUCENE-9541 ConjunctionDISI sub-iterators check (#1937)
Ensure sub-iterators of a conjunction iterator are on the same doc.
2020-10-06 13:23:01 -04:00
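The check described in the commit above can be sketched roughly as follows. This is an illustration only, not the code from the patch: DocIterator is a hypothetical stand-in for Lucene's iterator type, and the helper names are invented for this example.

```java
import java.util.List;

// Hypothetical sketch; the real check is in Lucene's conjunction iterator.
final class ConjunctionCheckSketch {

    /** Minimal stand-in for a doc ID iterator (assumption, not Lucene's API). */
    interface DocIterator {
        int docID();
    }

    /** Creates an iterator that is positioned on a fixed doc, for demonstration. */
    static DocIterator fixed(int doc) {
        return () -> doc;
    }

    /** Returns true if every sub-iterator reports the same current doc ID. */
    static boolean onSameDoc(List<? extends DocIterator> subs) {
        if (subs.isEmpty()) {
            return true;
        }
        int expected = subs.get(0).docID();
        for (DocIterator it : subs) {
            if (it.docID() != expected) {
                return false;
            }
        }
        return true;
    }

    /** Rejects a conjunction whose sub-iterators are not aligned. */
    static void requireSameDoc(List<? extends DocIterator> subs) {
        if (!onSameDoc(subs)) {
            throw new IllegalArgumentException(
                "Sub-iterators of a conjunction must be positioned on the same doc");
        }
    }

    public static void main(String[] args) {
        System.out.println(onSameDoc(List.of(fixed(3), fixed(3)))); // aligned
        System.out.println(onSameDoc(List.of(fixed(3), fixed(7)))); // misaligned
    }
}
```

Validating alignment up front turns a silent wrong-result bug into an immediate failure at construction time.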
Mayya Sharipova 874c446ab9
LUCENE-9565 Fix competitive iteration (#1952)
PR #1351 introduced a sort optimization where documents can be skipped.
But iteration over competitive iterators was not properly organized:
they did not store the current doc ID, so when a competitive
iterator was updated, the current doc ID was lost.

This patch fixed it.

Relates to #1351
2020-10-06 13:22:16 -04:00
Mayya Sharipova 6ac94a6f9f
LUCENE-9555: Advance conjunction Iterator for two phase iteration (#1943)
PR #1351 introduced a sort optimization where
documents can be skipped.
But there was a bug when using a two-phase
approximation: we would advance it without advancing
the overall conjunction iterator.

This patch fixed it.

Relates to #1351
2020-10-06 09:22:42 -04:00
Mayya Sharipova e325f66e61 Revert "LUCENE-9541 ConjunctionDISI sub-iterators check (#1937)"
This reverts commit 5f34acfdb5.
2020-10-05 10:55:25 -04:00
Mayya Sharipova 5f34acfdb5
LUCENE-9541 ConjunctionDISI sub-iterators check (#1937)
Ensure sub-iterators of a conjunction iterator are on the same doc.
2020-10-05 09:38:17 -04:00
Tomoko Uchida b70eaeee5a
LUCENE-9558: Clean up package name conflicts for analyzers-icu. (#1946) 2020-10-05 17:52:23 +09:00
iverase 0864b39a11 make sure we don't build circles with zero radius in ShapeTestUtil 2020-10-05 10:45:31 +02:00
Adrien Grand 1038fe8bee Fix rare test failure.
This test fails when the maximum segment size is only one byte larger
than the min segment size.
2020-10-05 09:58:18 +02:00
Erick Erickson f6c4f8a755 SOLR-14910: Use in-line tags for logger declarations in Gradle ValidateLogCalls that are non-standard, change //logok to //nowarn 2020-10-03 09:47:37 -04:00
Nhat Nguyen 7e04e4d0ca
LUCENE-9554: Expose IndexWriter#pendingNumDocs (#1941)
Some applications can use the pendingNumDocs from IndexWriter to
estimate whether the number of documents in an index is very close to the
hard limit, so that they can reject writes without constructing Lucene
documents.
2020-10-02 17:32:20 -04:00
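The use case in the commit above might look like the sketch below. It is an assumption-laden illustration, not Lucene code. The LongSupplier stands in for whatever accessor the commit exposes on IndexWriter. HARD_LIMIT copies what is, to my understanding, the value of IndexWriter.MAX_DOCS (Integer.MAX_VALUE - 128). The safety margin and all class and method names are hypothetical.

```java
import java.util.function.LongSupplier;

// Hypothetical admission check that runs before any document is built.
final class WriteAdmissionSketch {

    // Assumed to match Lucene's per-index document limit.
    static final long HARD_LIMIT = Integer.MAX_VALUE - 128;

    private final LongSupplier pendingNumDocs; // stand-in for the IndexWriter accessor
    private final long safetyMargin;           // headroom kept below the hard limit

    WriteAdmissionSketch(LongSupplier pendingNumDocs, long safetyMargin) {
        this.pendingNumDocs = pendingNumDocs;
        this.safetyMargin = safetyMargin;
    }

    /** Returns true if a batch of incomingDocs would still fit under the limit. */
    boolean shouldAccept(int incomingDocs) {
        return pendingNumDocs.getAsLong() + incomingDocs <= HARD_LIMIT - safetyMargin;
    }

    public static void main(String[] args) {
        WriteAdmissionSketch nearlyFull =
            new WriteAdmissionSketch(() -> HARD_LIMIT - 5, 10);
        // The batch is rejected before any document object is constructed.
        System.out.println(nearlyFull.shouldAccept(1));
    }
}
```

The value is only an estimate when other threads are indexing at the same time, which is why the sketch keeps a margin rather than comparing against the limit exactly.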
David Smiley 2aa51fe77c
LUCENE-9032: BaseTokenStreamTestCase minor...
* make checkResetException() public
* one assertAnalyzesTo variant should be calling checkAnalysisConsistency (only used by OpenNLP tests now)
2020-10-01 23:43:03 -04:00
David Smiley 0303063e12
LUCENE-9458: WDGF should tie-break by endOffset (#1740)
A tie can occur with catenateAll when either word parts or number parts (but not both) are not generated and the input ends with the non-generated sub-token.
Fuzzing revealed that only start & end offsets are needed to order sub-tokens.
2020-10-01 22:27:45 -04:00
goankur 2e2161b0e0
LUCENE-9444: Improve test coverage for TaxonomyFacetLabels (#1928)
Co-authored-by: Ankur Goel <goankur@amazon.com>
2020-09-30 13:21:18 -04:00
Munendra S N b9c7f50b6e LUCENE-9401: include field in the complex phrase query's toString 2020-09-29 19:28:01 +05:30
Dawid Weiss 65a62b04c5 Remove unused imports. 2020-09-29 10:24:17 +02:00
Mike McCandless 98a49ed18d LUCENE-9444: add CHANGES.txt entry 2020-09-28 13:01:22 -04:00
goankur 24aadc220b
LUCENE-9444: add utility class to retrieve facet labels from the taxonomy index for a facet field (#1893)
The utility class retrieves facet labels from the taxonomy index for a facet field, so such fields do not also have to be stored redundantly in the index.

Co-authored-by: Ankur Goel <goankur@amazon.com>
2020-09-28 10:55:37 -04:00
Adrien Grand fc6d0a40dc LUCENE-9317: Remove unused imports. 2020-09-28 16:29:36 +02:00
Adrien Grand c3f97fbdc1
Compute RAM usage ByteBuffersDataOutput on the fly. (#1919)
This helps remove the assumption that all blocks have the same size.
2020-09-28 15:08:08 +02:00
Namgyu Kim 00d7f5ea68
LUCENE-9544: Port Nori dictionary compilation (#1926) 2020-09-28 20:28:21 +09:00
Tomoko Uchida 5e617ccc33
LUCENE-9317: Clean up split package in analyzers-common (#1836) 2020-09-28 16:49:28 +09:00
Adrien Grand c032cd1b6f Revert "LUCENE-9535: Reduce the size of compressed blocks of stored fields by 2x."
This reverts commit 12dd19427e.
2020-09-25 22:17:18 +02:00
Simon Willnauer c258905bd0
LUCENE-9535: Commit DWPT bytes used before locking indexing (#1918)
Currently we calculate the ramBytesUsed by the DWPT under the flushControl
lock. We can do this calculation safely outside of the lock without any downside.
The FlushControl lock should be used with care since it's a central part of indexing
and might block all indexing.
2020-09-24 09:39:33 +02:00
Adrien Grand d226abd448
LUCENE-9535: Make ByteBuffersDataOutput#ramBytesUsed run in constant-time. (#1917) 2020-09-23 19:49:15 +02:00
Mayya Sharipova 7d90b858c2
Fix bug in sort optimization (#1903)
Fix a bug in how an iterator with skipping functionality
advances and produces docs

Relates to #1725
2020-09-23 09:09:43 -04:00
Uwe Schindler e19239d96b Upgrade forbiddenapis to version 3.1 2020-09-23 14:56:46 +02:00
Simon Willnauer 17c285d617
LUCENE-9539: Remove caches from SortingCodecReader (#1909)
SortingCodecReader keeps in memory all docvalues that are loaded from this reader.
Yet this reader should only be used for merging, which happens sequentially. This makes
caching docvalues unnecessary.

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
2020-09-23 14:21:28 +02:00
Adrien Grand 12dd19427e LUCENE-9535: Reduce the size of compressed blocks of stored fields by 2x.
This is to see whether it has any effect on nightly benchmarks.
2020-09-23 12:22:22 +02:00
Simon Willnauer c82b99464d
LUCENE-9539: Use more compact datastructures for sorting doc-values (#1908)
This change cuts over from object based data-structures to primitive / compressed data-structures.
2020-09-22 15:10:53 +02:00
Simon Willnauer 208a1c07b0
LUCENE-9534: Ensure DWPT#ramBytesUsed is only called under lock (#1889)
Consumers of the used RAM of a DWPT should use its committed bytesUsed
value, which is thread-safe.
2020-09-18 17:59:05 +02:00
Dawid Weiss 3a92e1b93e
LUCENE-9528: cleanup of flexible query parser's grammar (#1879) 2020-09-18 09:38:20 +02:00
Dawid Weiss 5ec2bac91c
LUCENE-9531: Consolidate duplicated generated classes CharStream and FastCharStream (#1886) 2020-09-18 08:53:30 +02:00
Ignacio Vera fbf8e4f044
LUCENE-9523: Speed up query shapes for geometries that generate multiple points (#1866)
In query shapes over shape fields, skip points while traversing the BKD tree when the relationship with the document is already known
2020-09-18 07:50:58 +02:00
Adrien Grand 33f7280078
LUCENE-9529: Track dirtiness of stored fields via a number of docs, not chunks. (#1882)
The problem of tracking dirtiness via numbers of chunks is that larger
chunks make stored fields readers more likely to be considered dirty, so
I'm trying to work around it by tracking numbers of docs instead.
2020-09-17 18:59:08 +02:00
Adrien Grand e0a64908d8
Further tune Lucene87StoredFieldsFormat for small documents. (#1888)
The increase of the maximum number of docs per chunk done in previous
issues was mostly random. I'd like to provide users with a trade-off
similar to what the old versions of BEST_SPEED and BEST_COMPRESSION
used to do. So since BEST_SPEED used to compress at most 128 docs at
once, I think we should roughly make it 128*10 now since there are 10
sub blocks. I made it 1024 to account for the fact that there is a preset
dict as well that needs decompressing. And similarly BEST_COMPRESSION used
to allow 4x more docs than BEST_SPEED, so I made it 4096.

With such larger numbers of docs per chunk, the decoding of metadata
became a bottleneck for stored field access so I made it a bit faster by
doing bulk decoding of the packed longs.
2020-09-17 18:30:57 +02:00
Dawid Weiss 6c9d7adf79
LUCENE-9527: upgrade javacc to 7.0.4 (#1884) 2020-09-17 13:29:18 +02:00
Dawid Weiss 4f344cb0d4
LUCENE-9530: cleaned up javacc gradle generation scripts. (#1883)
* LUCENE-9530: cleaned up gradle javacc generation/ tweaks script so that it's consistent across runs. Removed ant remnants.
2020-09-17 10:53:02 +02:00
Adrien Grand ad71bee016
LUCENE-9525: Better handle small documents with Lucene87StoredFieldsFormat. (#1876)
Instead of configuring a dictionary size and a block size, the format
now tries to have 10 sub blocks per bigger block, and adapts the size of
the dictionary and of the sub blocks to this overall block size.
2020-09-16 13:09:00 +02:00
Adrien Grand 93094ef7e4
LUCENE-9510: Don't compress temporary stored fields and term vectors when index sorting is enabled. (#1874)
When index sorting is enabled, stored fields and term vectors can't be
written on the fly like in the normal case, so they are written into
temporary files that then get resorted. For these temporary files,
disabling compression speeds up indexing significantly.

On a synthetic test that indexes stored fields and a doc value field
populated with random values that is used for index sorting, this
resulted in a 3x indexing speedup.
2020-09-16 13:05:22 +02:00
Dawid Weiss 9b9b0a6339 Fix corrupted umlaut characters. This was introduced back in 2009... 2020-09-15 19:07:30 +02:00
Mike Drob 3134f10a42
LUCENE-9488 Update release process to work with gradle (#1860)
* Restore lucene/version.properties
* Switch release wizard commands from ant to gradle equivalents
* Remove remaining checks for ant
* Remove checks for Java 8
* Update Copyright year
* Minor bug fixes around determining next version for a major release
2020-09-15 10:10:17 -05:00