Commit Graph

35253 Commits

Author SHA1 Message Date
Dawid Weiss beafd113de
LUCENE-9931: Rename checksummed regen. tasks FooInternal and generated wrappers Foo (#88) 2021-04-16 22:35:51 +02:00
Mayya Sharipova 52e2abc665
Temporarily mute BaseTermVectorsFormatTestCase::testMerge (#89)
Relates to #11, and #86
2021-04-16 15:27:17 -04:00
Mayya Sharipova 49c7cc1197
Fix test that modifies schema (#87)
LUCENE-9334 requires that docs have the same schema
across the whole index.
This fixes the test that attempts to modify schema of "number" field
from DocValues and Points to just DocValues.

Relates to #11
2021-04-15 17:48:49 -04:00
Mayya Sharipova 9a346e3739
Temporarily mute TestLucene50TermVectorsFormat:testMerge (#86)
Relates to #11
2021-04-15 11:43:29 -04:00
Ignacio Vera 873ac5f162
LUCENE-9907: Remove packedInts#getReaderNoHeader dependency on TermsVectorFieldsFormat (#72) 2021-04-15 16:04:13 +02:00
Mayya Sharipova d03662c48b
LUCENE-9334 Consistency of field data structures
Require consistency between data-structures on a per-field basis

A field must be indexed with the same index options and data-structures across
all documents. Thus, for example, it is not allowed to have one document
where a certain field is indexed with doc values and points, and another document
where the same field is indexed only with points.
But it is allowed for a document not to have a certain field at all.

As a consequence of this, doc values updates are
only applicable for fields that are indexed with doc values only.
2021-04-14 15:00:41 -04:00
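The per-field consistency rule described in the LUCENE-9334 message above can be illustrated with a minimal sketch. The class `FieldSchemaChecker`, its `Structure` enum, and both methods are hypothetical names invented for this illustration; they are not Lucene APIs and do not reflect how Lucene implements the check.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not Lucene's implementation.
final class FieldSchemaChecker {
  // Illustrative subset of the data structures a field may be indexed with.
  enum Structure { DOC_VALUES, POINTS, POSTINGS }

  // Schema recorded the first time each field name is seen.
  private final Map<String, EnumSet<Structure>> schemas = new HashMap<>();

  /** Records the schema on first use; every later use of the field must match it exactly. */
  void check(String field, EnumSet<Structure> structures) {
    EnumSet<Structure> previous = schemas.putIfAbsent(field, EnumSet.copyOf(structures));
    if (previous != null && !previous.equals(structures)) {
      throw new IllegalArgumentException(
          "field '" + field + "' was indexed with " + previous + " but now uses " + structures);
    }
  }

  /** Doc values updates apply only to fields indexed with doc values and nothing else. */
  boolean allowsDocValuesUpdate(String field) {
    return EnumSet.of(Structure.DOC_VALUES).equals(schemas.get(field));
  }
}
```

A document that omits a field never calls `check` for it, which matches the rule that a document may lack a field entirely.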
Adrien Grand 79f14b1742
LUCENE-9387: Remove CodecReader#ramBytesUsed. (#79)
This commit removes `ramBytesUsed()` from `CodecReader` and all file formats
besides vectors, which is the only remaining file format that might use lots of
memory in the default codec. I left `ramBytesUsed()` on the `completion` format
too, which is another feature that could use lots of memory.

Other components that relied on being able to compute memory usage of readers
like facets' TaxonomyReader and the analyzing suggester assume that readers have
a RAM usage of 0 now.
2021-04-14 14:37:54 +02:00
Greg Miller fbbdc62913
LUCENE-9850: Use PFOR encoding for doc IDs (instead of FOR) (#69)
Co-authored-by: Greg Miller <gmiller@amazon.com>
Co-authored-by: Adrien Grand <jpountz@gmail.com>
2021-04-14 14:36:20 +02:00
Dawid Weiss 0b1d8ccba6
LUCENE-9925: add checksums to snowball-generated files (#80) 2021-04-13 08:59:31 +02:00
Mike McCandless b23e261786 LUCENE-9888: revert CheckIndex change that confirmed all segments have identical segment sort: it is too strict 2021-04-12 17:59:58 -04:00
Michael Sokolov 757da76919 Revert "LUCENE-9798 : Fix looping bug and made Full Knn calculation parallelizable (#55)"
This reverts commit e7de06eb51.
2021-04-12 16:50:16 -04:00
Mike Drob df0780843a Add back-compat indices for 8.8.2 2021-04-12 15:07:30 -05:00
Mike Drob 68ccfb7d1e Add bugfix version 8.8.2 2021-04-12 14:48:31 -05:00
Mike Drob a2a68360ff DOAP changes for release 8.8.2 2021-04-12 13:31:10 -05:00
Dawid Weiss 3f3917d504 LUCENE-9914: remove stale file. 2021-04-12 20:19:14 +02:00
Dawid Weiss f91700a713
LUCENE-9914: Modernize Emoji regeneration scripts (#78) 2021-04-12 20:16:43 +02:00
nitirajrathore e7de06eb51
LUCENE-9798 : Fix looping bug and made Full Knn calculation parallelizable (#55) 2021-04-12 12:38:29 -04:00
Adrien Grand a7b0aadcfc LUCENE-9827: Propagate `numChunks` through bulk merges for term vectors as well.
This commit also adds more checks about the values of `numChunks`,
`numDirtyChunks` and `numDirtyDocs` that would have helped discover this issue
earlier.
2021-04-12 09:44:35 +02:00
Robert Muir 9d15435b15
LUCENE-9916: add a simple regeneration help doc (#73)
Add a simple regeneration help doc

Improve task help and checksum failure message (include corresponding regeneration task). Sorry for being verbose. Maybe somebody will read it. :)

Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>
2021-04-11 11:28:41 -04:00
Robert Muir b0bd64c620
LUCENE-9924: generate TLD list from IANA TLD db, rather than root zone db (#77)
This adds a bit of simplicity as the file is a simple domain list,
rather than a DNS zone. So the regexes parsing DNS can be removed.

Also the file may change less often as it contains JUST the list of
TLDs, and not any additional DNS metadata.
2021-04-11 11:25:15 -04:00
Robert Muir f33335157d
LUCENE-9923: remove always-changing timestamp from ASCIITLD.jflex generation (#76)
This makes regenerate idempotent by removing the new Date() from the
output.

We already have the root.zone's Last-Modified date, which is the one
that matters and only changes when the root.zone changes.
2021-04-10 16:13:29 -04:00
Robert Muir 15bfb28d7f
LUCENE-9922: checksum files should use a deterministic sort order (#75)
This way the files don't unnecessarily change, depending on filesystem
order or anything else.
2021-04-10 16:00:55 -04:00
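The idea behind LUCENE-9922 above, writing checksum entries in an order that does not depend on the filesystem, might look roughly like the following. The class name `ChecksumFileWriter` and the one-entry-per-line `path checksum` layout are assumptions made for this sketch, not the format the Lucene build actually uses.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; the file layout here is an assumption.
final class ChecksumFileWriter {
  /** Renders "path checksum" lines sorted by path, independent of the input map's order. */
  static String render(Map<String, String> checksumsByPath) {
    StringBuilder out = new StringBuilder();
    // TreeMap iterates keys in natural String order, so the output is deterministic.
    for (Map.Entry<String, String> entry : new TreeMap<>(checksumsByPath).entrySet()) {
      out.append(entry.getKey()).append(' ').append(entry.getValue()).append('\n');
    }
    return out.toString();
  }
}
```

Because the output depends only on the map's contents, regenerating with unchanged inputs yields a byte-identical file.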
Dawid Weiss 4818a83cb2 LUCENE-9920: Remove binary gradle-wrapper.jar from the repository 2021-04-10 16:08:39 +02:00
Julie Tibshirani c587677150
LUCENE-9705: Correct the format names in Lucene90StoredFieldsFormat (#74)
We accidentally kept the old names when creating the new format.
2021-04-09 16:19:43 -07:00
Adrien Grand e510ef11c2 LUCENE-9827: Propagate `numChunks` through bulk merges. 2021-04-08 16:45:52 +02:00
Uwe Schindler 779e00542c Make the character printout code uniform (always print at least 4 hex chars) 2021-04-08 16:38:31 +02:00
Dawid Weiss 4c2384a1f3 LUCENE-9872: load input/output checksums prior to executing the target task, even if regenerate is not called. 2021-04-08 15:00:20 +02:00
Robert Muir 2971f311a2
LUCENE-9911: enable ecjLint unusedExceptionParameter (#70)
Fails the linter if an exception is swallowed (e.g. variable completely
unused).

If this is intentional for some reason, the exception can simply be
annotated with @SuppressWarnings("unused").
2021-04-08 08:19:01 -04:00
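As a small illustration of the annotation mentioned in the LUCENE-9911 message above, the sketch below intentionally ignores an exception and marks the catch parameter accordingly. The class and method names are hypothetical and chosen only for this example.

```java
// Hypothetical illustration only; names are invented for this example.
final class SwallowedExceptionExample {
  /** Parses an int, returning the fallback when the text is not a valid number. */
  static int parseOrDefault(String text, int fallback) {
    try {
      return Integer.parseInt(text);
    } catch (@SuppressWarnings("unused") NumberFormatException e) {
      // Intentionally ignored: malformed input simply means "use the fallback".
      return fallback;
    }
  }
}
```

Without the annotation, a linter configured to flag unused exception parameters would report `e` here, since the variable is never read.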
Peter Gromov 7f147fece0
LUCENE-9894: Hunspell: add user-friendly diagnostics for morph data API misuse (#51) 2021-04-07 14:52:36 +02:00
Peter Gromov 8eb582e671
LUCENE-9895: Hunspell: make suggest-with-timeout API public (#54) 2021-04-07 14:52:29 +02:00
Robert Muir df25653cbd
LUCENE-9882: better synchronize eclipse formatter with spotless. (#47)
Import the spotless formatting settings to our eclipse IDE setting, so
that it is a closer match.
2021-04-07 06:20:42 -04:00
Robert Muir 4026753744
LUCENE-9910: maximize javac lint (#68)
This enables quite a few javac warnings from java11+ that weren't
enabled for some reason. None of them fail, so lock them in.

Additionally some newer checks are only recognized for newer JDK
versions, so they are only enabled based on the javac version used. They
will cause no annoyance because they relate to newer language features.
2021-04-07 06:10:29 -04:00
Ignacio Vera 430b3baa80
LUCENE-9907: Remove packedInts dependency on StoredFieldsFormat (#64) 2021-04-07 11:33:49 +02:00
Dawid Weiss 39071dbc54
LUCENE-9904: Port GenerateJflexTLDMacros.java regeneration to gradle and regenerate UAX tokenizer with up-to-date TLDs 2021-04-07 10:56:21 +02:00
Gautam Worah efeea0b8ee
LUCENE-9902 Minor fixes to the faceting API (#62) 2021-04-06 14:50:23 -04:00
Robert Muir be94a667f2
LUCENE-9827: avoid wasteful recompression for small segments (#28)
Require that the segment has enough dirty documents to create a clean
chunk before recompressing during merge: there must be at least maxChunkSize.

This prevents wasteful recompression with small flushes (e.g. every
document): we ensure recompression achieves some "permanent" progress.

Expose maxDocsPerChunk as a parameter for Term vectors too, matching the
stored fields format. This allows for easy testing.

Increment numDirtyDocs for partially optimized merges:
If segment N needs recompression, we have to flush any buffered docs
before bulk-copying segment N+1. Don't just increment numDirtyChunks,
also make sure numDirtyDocs is incremented, too.
This doesn't have a performance impact, and is unrelated to tooDirty()
improvements, but it is easier to reason about things with correct
statistics in the index.

Further tuning of how dirtiness is measured: for simplification just use percentage
of dirty chunks.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
2021-04-06 14:18:48 -04:00
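The two conditions described in the LUCENE-9827 message above (enough dirty documents to fill a chunk, and dirtiness measured as a percentage of dirty chunks) can be sketched as a single predicate. This is an assumption-laden illustration: the class `DirtinessCheck`, the parameter `maxDirtyChunkPercent`, and the exact comparison operators are hypothetical and are not taken from the commit.

```java
// Hypothetical illustration only; thresholds and comparisons are assumptions.
final class DirtinessCheck {
  /**
   * Returns true when a segment should be recompressed during merge:
   * it must hold at least one full chunk's worth of dirty docs, and the share
   * of dirty chunks must exceed the given percentage.
   */
  static boolean tooDirty(
      long numDirtyDocs,
      long numDirtyChunks,
      long numChunks,
      int maxDocsPerChunk,
      int maxDirtyChunkPercent) {
    // Recompressing fewer docs than fit in one chunk cannot produce a clean chunk.
    boolean enoughDirtyDocs = numDirtyDocs >= maxDocsPerChunk;
    // Integer form of (numDirtyChunks / numChunks) > (maxDirtyChunkPercent / 100).
    boolean dirtyShareTooHigh = numDirtyChunks * 100 > numChunks * (long) maxDirtyChunkPercent;
    return enoughDirtyDocs && dirtyShareTooHigh;
  }
}
```

With this shape, a segment produced by many tiny flushes is left alone until its dirty documents add up to a full chunk, which is the "permanent progress" the message refers to.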
Adrien Grand d991fefb49
Add an example to the CacheHelper docs. (#50) 2021-04-06 16:25:15 +02:00
Dawid Weiss 2662a74cab Correct some of the jdk17-offending javadocs. 2021-04-05 20:34:52 +02:00
Dawid Weiss 2773172455 Correct some of the jdk17-offending javadocs. 2021-04-05 20:21:52 +02:00
Dawid Weiss baceb16904 Correct some of the jdk17-offending javadocs. 2021-04-05 20:19:56 +02:00
Dawid Weiss fbf9191abf
LUCENE-9901: UnicodeData.java has no regeneration task (#63) 2021-04-05 20:12:56 +02:00
Ignacio Vera 67a0bd4b6d
LUCENE-9705: Final clean-up and entry in CHANGES.txt (#59) 2021-04-04 11:30:47 +02:00
Dawid Weiss 010e3a1ba9
LUCENE-9900: Regenerate/ run ICU only if inputs changed (#61) 2021-04-02 11:46:43 +02:00
Dawid Weiss e3ae57a3c1
LUCENE-9872: Make the most painful tasks in regenerate fully incremental (#60) 2021-04-02 09:56:47 +02:00
Tomoko Uchida 670bbf8b99
Ignore sdkmanrc file on Git (#58) 2021-04-02 01:04:14 +09:00
Ignacio Vera 8c9b9546cc
LUCENE-9705: Create Lucene90PointsFormat (#52) 2021-04-01 07:04:04 +02:00
Pieter van Boxtel 1d579b9448
LUCENE-9898 Remove no longer used scorePayload method from BM25Similarity (#57) 2021-04-01 09:06:03 +09:00
zacharymorn 79fcd99f4c
LUCENE-9883: Turn on ecj missingEnumCaseDespiteDefault setting (#56) 2021-03-31 15:50:52 +09:00
Dawid Weiss 32e891c60f LUCENE-9871: move dummy outputs aspect into a separate file. 2021-03-30 20:15:55 +02:00
Adrien Grand 10520185a9 LUCENE-9877: Move CHANGES entry under 8.9. 2021-03-30 15:13:00 +02:00