Commit Graph

35575 Commits

Author SHA1 Message Date
balmukundblr 66062e8991
Add explicit flush to Lucene's benchmarks module (#116)
* Added an explicit flush task to flush data at the thread level once each thread completes its processing

* Included an explicit flush at the thread level
2021-04-29 20:45:34 -04:00
Mayya Sharipova a9a3f6529d
Fix regression to account payloads while merging (#103)
Before PR #11, if any segment being merged had payloads
for a certain field, the new merged segment would also have
payloads set up for this field.

PR #11 introduced a bug where the first of the merging
segments alone determines whether the new merged segment
has payloads. If the first segment doesn't have payloads and
others do, the new merged segment mistakenly will not
have payloads set up.

This PR fixes this bug.

Relates to #11
2021-04-29 08:37:59 -04:00
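
The regression described in the commit above comes down to the difference between reading a flag from the first segment only and combining it across all merging segments. The following is a minimal sketch of that difference, not Lucene's merge code: `SegmentField` and both method names are hypothetical stand-ins invented for illustration.

```java
import java.util.List;

final class PayloadMergeSketch {
    // Hypothetical stand-in for per-segment field metadata; not a Lucene class.
    static final class SegmentField {
        final String name;
        final boolean hasPayloads;

        SegmentField(String name, boolean hasPayloads) {
            this.name = name;
            this.hasPayloads = hasPayloads;
        }
    }

    // Behaviour described as the bug: only the first segment decides.
    static boolean mergedHasPayloadsFirstOnly(List<SegmentField> segments) {
        return !segments.isEmpty() && segments.get(0).hasPayloads;
    }

    // Behaviour described as intended: payloads are set up if any segment has them.
    static boolean mergedHasPayloadsAny(List<SegmentField> segments) {
        return segments.stream().anyMatch(s -> s.hasPayloads);
    }

    public static void main(String[] args) {
        List<SegmentField> segments = List.of(
            new SegmentField("body", false),
            new SegmentField("body", true));
        System.out.println(mergedHasPayloadsFirstOnly(segments)); // false
        System.out.println(mergedHasPayloadsAny(segments));       // true
    }
}
```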
Alan Woodward f7a3587091
LUCENE-9940: DisjunctionMaxQuery shouldn't depend on disjunct order for equals checks (#110)
DisjunctionMaxQuery stores its disjuncts in a Query[], and uses
Arrays.equals() for comparisons in its equals() implementation.
This means that the order in which disjuncts are added to the query
matters for equality checks.

This commit changes DMQ to instead store its disjuncts in a Multiset,
meaning that ordering no longer matters. The getDisjuncts()
method now returns a Collection<Query> rather than a List, and
some tests are changed to use query equality checks rather than
iterating over disjuncts and expecting a particular order.
2021-04-29 09:47:55 +01:00
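
To illustrate why array-based equality depends on insertion order while multiset-based equality does not, here is a small self-contained sketch. It uses plain strings in place of `Query` objects and a hand-rolled count map in place of whatever multiset type the commit uses, so the class and method names are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DisjunctOrderSketch {
    // Order-sensitive comparison, as with an array compared via Arrays.equals().
    static boolean equalsOrdered(String[] a, String[] b) {
        return Arrays.equals(a, b);
    }

    // Builds a simple multiset: each element mapped to its number of occurrences.
    static Map<String, Integer> toMultiset(List<String> items) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : items) {
            counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }

    // Order-insensitive comparison that still distinguishes duplicate counts.
    static boolean equalsAsMultiset(List<String> a, List<String> b) {
        return toMultiset(a).equals(toMultiset(b));
    }

    public static void main(String[] args) {
        String[] x = {"title:foo", "body:foo"};
        String[] y = {"body:foo", "title:foo"};
        System.out.println(equalsOrdered(x, y));                                  // false
        System.out.println(equalsAsMultiset(Arrays.asList(x), Arrays.asList(y))); // true
    }
}
```

Unlike a plain set, the count map keeps duplicates significant: two copies of a disjunct do not compare equal to one.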
Gus Heck 043ed3a91f
LUCENE-9572 adjust changes entry (#112) 2021-04-29 00:23:15 -04:00
Ayushman Singh Chauhan c49bfb8e01
DOC: Fix spelling (#111) 2021-04-28 13:19:34 -04:00
Alan Woodward 90d363ece7
LUCENE-9930: Only load Ukrainian morfologik dictionary once per JVM (#109)
The UkrainianMorfologikAnalyzer was reloading its dictionary every
time it created a new TokenStreamComponents, which meant that
while the analyzer was open it would hold onto one copy of the
dictionary per thread.

This commit loads the dictionary in a lazy static initializer, alongside
its stopword set. It also makes the normalizer charmap a singleton
so that we do not rebuild the same immutable object on every call
to initReader.
2021-04-28 13:51:23 +01:00
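
A lazy static initializer of the kind mentioned in the commit above is commonly written with the initialization-on-demand holder idiom. The sketch below shows that idiom in isolation; `loadDictionary()` and the `LOADS` counter are hypothetical and exist only to make the single load observable. It is not the analyzer's actual code.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class LazyDictionarySketch {
    // Counts how many times the (hypothetical) expensive load runs.
    static final AtomicInteger LOADS = new AtomicInteger();

    // Hypothetical stand-in for an expensive dictionary load.
    private static String loadDictionary() {
        LOADS.incrementAndGet();
        return "dictionary-contents";
    }

    // The JVM initializes Holder once, thread-safely, on first access to DICTIONARY.
    private static final class Holder {
        static final String DICTIONARY = loadDictionary();
    }

    static String dictionary() {
        return Holder.DICTIONARY;
    }

    public static void main(String[] args) {
        dictionary();
        dictionary();
        System.out.println(LOADS.get()); // 1
    }
}
```

Because class initialization happens at most once per class loader, every thread shares the same instance instead of holding its own copy.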
Gus Heck 0c33e621f9 LUCENE-9574 adjust changes entry 2021-04-27 23:13:11 -04:00
Michael Sokolov 45bd06c804 LUCENE-9905: rename Lucene90VectorFormat and its reader and writer 2021-04-27 18:59:40 -04:00
Michael Sokolov 6d4b5eaba3 LUCENE-9905: rename VectorValues.SearchStrategy to VectorValues.SimilarityFunction 2021-04-27 16:18:58 -04:00
Julie Tibshirani 3115f85697
LUCENE-9908: Move VectorValues#search to LeafReader (#104)
This PR removes `VectorValues#search` in favor of exposing NN search through
`VectorReader#search` and `LeafReader#searchNearestVectors`. It also marks the
vector methods on `LeafReader` as experimental.
2021-04-26 11:26:49 -07:00
Ignacio Vera 6b386e7e68
LUCENE-9047: Remove unnecessary ByteBuffersDataOutput in BKD writer (#102) 2021-04-26 10:28:26 +02:00
Kai 1b1fd7206b
Use HTTPS for documentation link (#105) 2021-04-24 09:19:16 -04:00
Robert Muir 044d152d95
LUCENE-9928: speed up analysis/icu regeneration (#82)
Compiling the library is slow, so disable optimization, since it doesn't speed up our usage of the gennorm2 tool.
Use a better heuristic for make parallelism (tests.jvms rather than a hardcoded value of four).
2021-04-22 07:24:44 -04:00
John Carlson 2c43f57f91
Update gradle to 6.8.3 (#100) 2021-04-21 21:02:37 +02:00
Tomoko Uchida 5f5d1949e9
LUCENE-9353: revise format documentation of Lucene90BlockTreeTermsWriter (#90) 2021-04-20 23:36:34 +09:00
Ignacio Vera 5592d582b8
LUCENE-9047: Adapt big endian dependent code to work in little endian 2021-04-20 10:55:19 +02:00
Ignacio Vera e0436872c4
LUCENE-9907: Move PackedInts#getReaderNoHeader() to backwards codec 2021-04-20 09:09:38 +02:00
Ignacio Vera b0662c807c
LUCENE-9907: Remove unused method PackedInts.Mutable#save 2021-04-19 14:52:21 +02:00
Ignacio Vera 2a7951cd30
LUCENE-9907: Remove unused methods in PackedInts (#94) 2021-04-19 14:10:49 +02:00
Dawid Weiss bd8f182b13
LUCENE-9933: Add non-file properties to wrapped regenerate checksums (#95) 2021-04-19 13:37:47 +02:00
Ignacio Vera 936b3451af
LUCENE-9907: Remove unused BlockPackedReader (#93) 2021-04-19 09:47:44 +02:00
Ignacio Vera d15231709a
LUCENE-9907: Remove dependency on PackedInts#getReaderNoHeader in MonotonicBlockPackedReader (#85) 2021-04-19 07:18:41 +02:00
Dawid Weiss beafd113de
LUCENE-9931: Rename checksummed regen. tasks FooInternal and generated wrappers Foo (#88) 2021-04-16 22:35:51 +02:00
Mayya Sharipova 52e2abc665
Temporarily mute BaseTermVectorsFormatTestCase::testMerge (#89)
Relates to #11, and #86
2021-04-16 15:27:17 -04:00
Mayya Sharipova 49c7cc1197
Fix test that modifies schema (#87)
LUCENE-9334 requires that a field have the same schema
across all documents.
This fixes the test that attempts to change the schema of the "number" field
from DocValues and Points to just DocValues.

Relates to #11
2021-04-15 17:48:49 -04:00
Mayya Sharipova 9a346e3739
Temporarily mute TestLucene50TermVectorsFormat:testMerge (#86)
Relates to #11
2021-04-15 11:43:29 -04:00
Ignacio Vera 873ac5f162
LUCENE-9907: Remove packedInts#getReaderNoHeader dependency on TermsVectorFieldsFormat (#72) 2021-04-15 16:04:13 +02:00
Mayya Sharipova d03662c48b
LUCENE-9334 Consistency of field data structures
Require consistency between data-structures on a per-field basis

A field must be indexed with the same index options and data-structures across
all documents. Thus, for example, it is not allowed to have one document
where a certain field is indexed with doc values and points, and another document
where the same field is indexed only with points.
But it is allowed for a document not to have a certain field at all.

As a consequence of this, doc values updates are
only applicable for fields that are indexed with doc values only.
2021-04-14 15:00:41 -04:00
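
The per-field consistency rule in the commit above can be pictured as a schema that is fixed by the first document to use a field and checked against every later one. The sketch below is an illustration under that assumption only: `FieldSchemaSketch`, `Structure`, `addField`, and `canUpdateDocValues` are hypothetical names and do not correspond to Lucene's API.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

final class FieldSchemaSketch {
    // Hypothetical names for the per-field data structures mentioned above.
    enum Structure { DOC_VALUES, POINTS }

    private final Map<String, EnumSet<Structure>> schema = new HashMap<>();

    // The first document to use a field fixes its schema; later documents must match it.
    void addField(String name, EnumSet<Structure> structures) {
        EnumSet<Structure> existing = schema.putIfAbsent(name, EnumSet.copyOf(structures));
        if (existing != null && !existing.equals(structures)) {
            throw new IllegalArgumentException(
                "field \"" + name + "\" was indexed with " + existing + ", got " + structures);
        }
    }

    // Doc values updates only apply to fields indexed with doc values and nothing else.
    boolean canUpdateDocValues(String name) {
        return EnumSet.of(Structure.DOC_VALUES).equals(schema.get(name));
    }

    public static void main(String[] args) {
        FieldSchemaSketch index = new FieldSchemaSketch();
        index.addField("number", EnumSet.of(Structure.DOC_VALUES, Structure.POINTS));
        index.addField("price", EnumSet.of(Structure.DOC_VALUES));
        System.out.println(index.canUpdateDocValues("number")); // false
        System.out.println(index.canUpdateDocValues("price"));  // true
        try {
            // Same field, different structures: rejected.
            index.addField("number", EnumSet.of(Structure.POINTS));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A document that omits a field never calls `addField` for it, so it stays valid, matching the rule that a document may lack a field entirely.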
Adrien Grand 79f14b1742
LUCENE-9387: Remove CodecReader#ramBytesUsed. (#79)
This commit removes `ramBytesUsed()` from `CodecReader` and all file formats
besides vectors, which is the only remaining file format that might use lots of
memory in the default codec. I left `ramBytesUsed()` on the `completion` format
too, which is another feature that could use lots of memory.

Other components that relied on being able to compute the memory usage of readers,
such as facets' TaxonomyReader and the analyzing suggester, now assume that readers
have a RAM usage of 0.
2021-04-14 14:37:54 +02:00
Greg Miller fbbdc62913
LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR) (#69)
Co-authored-by: Greg Miller <gmiller@amazon.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>
2021-04-14 14:36:20 +02:00
Dawid Weiss 0b1d8ccba6
LUCENE-9925: add checksums to snowball-generated files (#80) 2021-04-13 08:59:31 +02:00
Mike McCandless b23e261786 LUCENE-9888: revert CheckIndex change that confirmed all segments have identical segment sort: it is too strict 2021-04-12 17:59:58 -04:00
Michael Sokolov 757da76919 Revert "LUCENE-9798 : Fix looping bug and made Full Knn calculation parallelizable (#55)"
This reverts commit e7de06eb51.
2021-04-12 16:50:16 -04:00
Mike Drob df0780843a Add back-compat indices for 8.8.2 2021-04-12 15:07:30 -05:00
Mike Drob 68ccfb7d1e Add bugfix version 8.8.2 2021-04-12 14:48:31 -05:00
Mike Drob a2a68360ff DOAP changes for release 8.8.2 2021-04-12 13:31:10 -05:00
Dawid Weiss 3f3917d504 LUCENE-9914: remove stale file. 2021-04-12 20:19:14 +02:00
Dawid Weiss f91700a713
LUCENE-9914: Modernize Emoji regeneration scripts (#78) 2021-04-12 20:16:43 +02:00
nitirajrathore e7de06eb51
LUCENE-9798 : Fix looping bug and made Full Knn calculation parallelizable (#55) 2021-04-12 12:38:29 -04:00
Adrien Grand a7b0aadcfc LUCENE-9827: Propagate `numChunks` through bulk merges for term vectors as well.
This commit also adds more checks about the values of `numChunks`,
`numDirtyChunks` and `numDirtyDocs` that would have helped discover this issue
earlier.
2021-04-12 09:44:35 +02:00
Robert Muir 9d15435b15
LUCENE-9916: add a simple regeneration help doc (#73)
Add a simple regeneration help doc

Improve task help and checksum failure message (include corresponding regeneration task). Sorry for being verbose. Maybe somebody will read it. :)

Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>
2021-04-11 11:28:41 -04:00
Robert Muir b0bd64c620
LUCENE-9924: generate TLD list from IANA TLD db, rather than root zone db (#77)
This adds a bit of simplicity as the file is a simple domain list,
rather than a DNS zone. So the regexes parsing DNS can be removed.

Also the file may change less often as it contains JUST the list of
TLDs, and not any additional DNS metadata.
2021-04-11 11:25:15 -04:00
Robert Muir f33335157d
LUCENE-9923: remove always-changing timestamp from ASCIITLD.jflex generation (#76)
This makes regenerate idempotent by removing the new Date() from the
output.

We already have the root.zone's Last-Modified date, which is the one
that matters and only changes when the root.zone changes.
2021-04-10 16:13:29 -04:00
Robert Muir 15bfb28d7f
LUCENE-9922: checksum files should use a deterministic sort order (#75)
This way the files don't unnecessarily change, depending on filesystem
order or anything else.
2021-04-10 16:00:55 -04:00
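
One simple way to get the deterministic ordering described in the commit above is to sort entries by path before writing them out. The sketch below assumes a "path checksum" line format and placeholder checksum values purely for illustration; the real checksum file layout is not shown in this log.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class ChecksumFileSketch {
    // Renders one "path checksum" line per entry, sorted by path so the
    // output does not depend on the iteration order of the input map.
    static String render(Map<String, String> checksumsByPath) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(checksumsByPath).entrySet()) {
            out.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two maps with the same entries in different insertion order.
        Map<String, String> a = new LinkedHashMap<>();
        a.put("b.txt", "2222");
        a.put("a.txt", "1111");
        Map<String, String> b = new LinkedHashMap<>();
        b.put("a.txt", "1111");
        b.put("b.txt", "2222");
        System.out.println(render(a).equals(render(b))); // true
    }
}
```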
Dawid Weiss 4818a83cb2 LUCENE-9920: Remove binary gradle-wrapper.jar from the repository 2021-04-10 16:08:39 +02:00
Julie Tibshirani c587677150
LUCENE-9705: Correct the format names in Lucene90StoredFieldsFormat (#74)
We accidentally kept the old names when creating the new format.
2021-04-09 16:19:43 -07:00
Adrien Grand e510ef11c2 LUCENE-9827: Propagate `numChunks` through bulk merges. 2021-04-08 16:45:52 +02:00
Uwe Schindler 779e00542c Make the character printout code uniform (always print at least 4 hex chars) 2021-04-08 16:38:31 +02:00
Dawid Weiss 4c2384a1f3 LUCENE-9872: load input/output checksums prior to executing the target task, even if regenerate is not called. 2021-04-08 15:00:20 +02:00
Robert Muir 2971f311a2
LUCENE-9911: enable ecjLint unusedExceptionParameter (#70)
Fails the linter if an exception is swallowed (e.g. variable completely
unused).

If this is intentional for some reason, the exception can simply be
annotated with @SuppressWarnings("unused").
2021-04-08 08:19:01 -04:00
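
As an illustration of the two ways to deal with an unused exception parameter, the sketch below either annotates the catch parameter with `@SuppressWarnings("unused")`, as the commit message suggests, or actually uses the exception by passing it on as a cause. The class and method names are hypothetical, and whether a given linter configuration accepts each form is an assumption.

```java
final class UnusedExceptionSketch {
    // Intentionally ignores the exception; the annotation documents that intent.
    static int parseOrDefault(String s, int fallback) {
        try {
            return Integer.parseInt(s);
        } catch (@SuppressWarnings("unused") NumberFormatException e) {
            return fallback;
        }
    }

    // Alternative: use the exception by keeping it as the cause of a new one.
    static int parseStrict(String s) {
        try {
            return Integer.parseInt(s);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("not a number: " + s, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("42", 0));  // 42
        System.out.println(parseOrDefault("x", -1));  // -1
    }
}
```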