lucene

Commit Graph

Author	SHA1	Message	Date
Ignacio Vera	5ba87f9efa	LUCENE-10310: Fix test error in TestXYDocValuesQueries#testRandomDistanceHuge (#537 ) We create random circles using ShapeTestUtils which is safe.	2021-12-13 12:02:13 +01:00
Tomoko Uchida	e0a6e1c662	LUCENE-10309: Minimum KnnVector codec support in Luke (#535 )	2021-12-12 15:32:10 +09:00
Tomoko Uchida	140f48e267	LUCENE-10303: Upgrade log4j to 2.15.0	2021-12-11 10:45:00 +09:00
Tomoko Uchida	f046b59a5b	LUCENE-10305: Ensure line endings of versions.props is LF	2021-12-11 10:12:34 +09:00
Dawid Weiss	cf5a7337e2	LUCENE-10229: change the wording a bit.	2021-12-09 17:35:18 +01:00
Patrick Zhai	2a47bbe8be	LUCENE-10229: Unify behaviour of match offsets for interval queries (#521 )	2021-12-09 17:35:18 +01:00
Ignacio Vera	5e0d8dc87a	Revert "LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long (#520 )" (#532 ) This reverts commit `af1e68b891`.	2021-12-09 13:55:36 +01:00
Dawid Weiss	17aaab654e	LUCENE-10294: Avoid compiling javadocs twice in 'gradlew check'.	2021-12-09 09:56:32 +01:00
Julie Tibshirani	394472d4b8	LUCENE-10040: Add test for vector search with skewed deletions (#527 ) This exercises a challenging case where the documents to skip all happen to be closest to the query vector. In many cases, HNSW appears to be robust to this case and maintains good recall.	2021-12-08 11:26:45 -08:00
Robert Muir	c74642d9a7	remove unnecessary "dependencies" in versions.props (#526 ) Looks like stray cats from back when it was shared with solr	2021-12-07 21:23:33 -05:00
Adrien Grand	85caa4364e	DOAP changes for release 9.0.0	2021-12-07 14:38:56 +01:00
Ignacio Vera	1eb935229f	LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long (#520 )	2021-12-07 07:42:04 +01:00
Tomoko Uchida	3eadfd4596	LUCENE-10287: Re-add abstract FSDirectory class as a supported directory (#522 )	2021-12-07 15:34:24 +09:00
gf2121	892e324d02	LUCENE-10280: Store BKD blocks with continuous ids more efficiently (#510 )	2021-12-07 07:27:11 +01:00
Robert Muir	4d48dc87f7	speed up TestSimpleExplanationsWithFillerDocs (#516 ) This is the slowest test suite, runs for ~ 60s, because between every document it adds 2048 "filler docs". This just adds up to a ton of indexing across all the test methods. Use 2048 for Nightly, and instead a smaller number (4) for local builds.	2021-12-06 22:13:04 -05:00
Robert Muir	9000dfc382	simplify jflex grammars by using difference rather than negation (#515 ) Jflex grammars now avoid using complement operator twice as a demorgan-workaround for "macros in char classes". With the latest version of jflex, we can just do the subtraction directly and avoid unnecessary NFA->DFA conversions. This speeds up `generateUAX29URLEmailTokenizer` around 3x.	2021-12-06 21:59:40 -05:00
Uwe Schindler	d36c70cdd6	LUCENE-10287: Add changes entry	2021-12-06 20:28:46 +01:00
Uwe Schindler	8e7fbcaf5b	LUCENE-10287: Fix startup script of module enabled Luke to pass jdk.unsupported as module (#517 )	2021-12-06 20:24:56 +01:00
gf2121	ebee531df7	LUCENE-10233: fix Unit Test TestFixedBitSet#testAndNot (#512 ) Co-authored-by: guofeng.my <guofeng.my@bytedance.com>	2021-12-06 07:34:41 +01:00
Robert Muir	aa6a78c28c	tone down TestIndexWriter.testMaxCompletedSequenceNumber in non-nightly (#506 ) this test currently indexes up to 600 docs for each thread.	2021-12-04 14:23:03 -05:00
Robert Muir	3e06e2338e	tone down BaseTermVectorsFormatTestCase.testLotsOfFields in non-nightly (#505 ) This test runs across every IndexOptions, indexing hundreds of fields. It is slow for some implementations (e.g. SimpleText). Use less fields for normal runs.	2021-12-04 14:22:57 -05:00
Robert Muir	401d6209fb	Make TestNRTReplication.testCrashReplica nightly (#504 ) This test is forking and crashing JVMs, always runs over 10 seconds	2021-12-04 14:22:48 -05:00
Dawid Weiss	d2563e6f1f	LUCENE-10284: Upgrade morfologik-stemming to 2.1.8 (#514 )	2021-12-04 09:57:02 +01:00
Robert Muir	eff5430e58	LUCENE-10243: increase unicode versions of tokenizers to 12.1 (#465 ) * Bump %unicode 9 -> %unicode 12.1 for the 3 unicode grammars * regenerate emoji conformance tests for unicode 12.1 * modify wordbreak conformance tests to use emoji data (which replaces old crazy E_base etc properties) * regenerate wordbreak conformance tests * Simplify grammar files and word-break conformance test generator, now that full-width numbers are WordBreak=Numeric * Use jflex emoji properties rather than ICU-generated ones	2021-12-03 20:34:29 -05:00
gf2121	eadc146e08	LUCENE-10233: Use AND NOT for inverse intersector (#499 ) When docIds are stored as a BitSet, use andNot to speed up collecting them.	2021-12-03 09:24:54 +01:00
Ignacio Vera	e2264cd7ef	LUCENE-10279: add entry in CHANGES.txt and make RangeClause final (#507 )	2021-12-03 07:24:40 +01:00
Misha Tiurin	074a233244	Remove duplicate entries in SpanishPluralStemmer invariants list (#508 ) * Remove duplicate entries in SpanishPluralStemmer invariants list Add assertion to prevent duplicates in the future Co-authored-by: Xavier Sanchez <xavier.sanchez@wallapop.com>	2021-12-02 14:23:39 -05:00
Robert Muir	cbd306f87b	LUCENE-10278: don't write zero-sized array in this test (#501 ) DocIdsWriter is not prepared for this.	2021-12-02 15:54:28 +01:00
gf2121	dd14424817	LUCENE-10233: Store docIds as bitset to speed up addAll (#438 )	2021-12-02 15:54:18 +01:00
Adrien Grand	1b04807440	Make EndiannessReverser(Data|Index)Input always reverse byte order. (#502 ) Currently EndiannessReverser(Data|Index)Input doesn't reverse the byte order for `readLongs` and `readFloats`. The reasoning is that these two method replaced `readLELongs` and `readLEFloats`, so the byte order doesn't need changing. However this is creating some confusing situations where new code expects a consistent byte order on the write and read sides and gets longs in the wrong byte order. So this commit suggests a different approach, where EndiannessReverser(Data|Index)Input always changes the byte order, and former call sites of `readLELongs` and `readLEFloats` are changed to manually reverse the byte order on top of `readLongs` and `readFloats`. This is making old codecs a bit slower, but I think it's fair since these are old codecs. And this makes the endianness reversing backward compatibility layer easier to reason about?	2021-12-02 14:01:37 +01:00
Ignacio Vera	c0f0686f74	LUCENE-10279: Fix equals in MultiRangeQuery (#503 )	2021-12-02 13:34:47 +01:00
Robert Muir	f33ae4e81f	improve term vector merging tests (#500 ) Use less iterations locally so that term vector merging doesn't dominate the list of slowest tests. Split out deletes/no-deletes into separate methods to improve debuggability. Remove nightly from SimpleText term vectors merging tests, now that they run much faster.	2021-12-02 05:40:36 -05:00
Ignacio Vera	a580e29539	LUCENE-10275: Speed up MultiRangeQuery by using an interval tree	2021-12-02 09:54:32 +01:00
Robert Muir	072b775199	LUCENE-10273: Deprecate SpanishMinimalStemmer in favor of SpanishPluralStemmer (#497 ) * LUCENE-10273: Deprecate SpanishMinimalStemmer in favor of SpanishPluralStemmer The new SpanishPluralStemmer is in fact more "minimal", less agressive stemming and normalization. For the user that wants only plural stemming, it is the better choice.	2021-12-01 14:27:48 -05:00
Tomoko Uchida	d551e128de	backport `a7ebf6618c`	2021-12-01 20:13:23 +09:00
Robert Muir	a201fde054	LUCENE-10270: Improve MIGRATE.md (#491 ) * LUCENE-10270: Improve MIGRATE.md * Separate sections for 9.0 and 9.1 * Remove abbreviations for artifact, package, class names etc. e.g. `lucene-core` instead of `core` and `org.apache.lucene.analysis` instead of `o.a.l.a`. * Specify "java" for text blocks to get syntax highlighting * When provided, consistently put JIRA issue in the same place * Fixed-width font for classes/reserved words (e.g. false, true, long, makes for less ambiguous reading) * More use of tables vs lists when there is mapping of old -> new names (packages, classes, etc) * Use consistent notation for method calls (Class.method() vs Class.method vs Class#method etc) * LUCENE-10270: replace LUCENE_ with LUCENE- so it gets JIRA link * LUCENE-10270: fix things found by msokolov	2021-12-01 05:30:26 -05:00
Dawid Weiss	025b4b3106	LUCENE-10234: update smoke-tester with new module names.	2021-12-01 10:39:28 +01:00
Greg Miller	da48e47f3b	Move CHANGES entry for LUCENE-10232 to 9.0	2021-11-30 14:37:49 -08:00
Xavier Sanchez Loro	0293fc896d	LUCENE-10248: Spanish Plural Stemmer (#461 ) Adds a new Spanish stemmer just for stemming plural to singular whilst maintaining gender: the SpanishPluralStemmer. The goal is to provide a lightweight algorithmic approach with better precision and recall than current approaches. See blog post for more details: https://medium.com/inside-wallapop/spanish-plural-stemmer-matching-plural-and-singular-forms-in-spanish-using-lucene-93e005e38373 This approach is based on rules specified in WikiLingua: http://www.wikilengua.org/index.php/Plural_(formaci%C3%B3n) Some characteristics: * Designed to stem just plural to singular form * Distinguishes between masculine and feminine forms * It will increase recall but precision can be reduced depending on the use case/information need * Stems plural words of foreign origin: i.e. complots, bits, punks, robots * Support for invariant words: same plural and singular form or plural does not make sense: i.e. crisis, jueves, lapsus, abrebotellas, etc * Support for special cases: i.e. yoes, clubes, itemes, faralaes * Use it when the distinction between singular and plural is not relevant but gender is relevant * Produces meaningful tokens in form of singular * Not strange stems like “amig”: it’s true that stemmers must not generate grammatically correct tokens, but if we generate correct stems we decrease the possibility of collisions with other words	2021-11-30 16:07:09 -05:00
Robert Muir	468aceff0c	LUCENE-10248: add CHANGES.txt entry	2021-11-30 16:07:09 -05:00
Dawid Weiss	26257292c3	LUCENE-10234: Change module prefix to org.apache.* (#487 )	2021-11-30 22:04:24 +01:00
Robert Muir	c89c78cee0	LUCENE-10272: cross-check norms with postings in checkindex (#493 ) Previously, CheckIndex would iterate norms and validate each one. But if norms that should be there were missing, nothing would fail. Now it computes an expected count of norms and ensures it saw them all.	2021-11-30 15:40:35 -05:00
Greg Miller	51e023bf7a	LUCENE-10232: Fix MultiRangeQuery to confirm all dimensions for a given range match	2021-11-30 12:01:34 -08:00
Alan Woodward	b697745407	LUCENE-10263: Implement Weight.count() on NormsFieldExistsQuery (#477 ) If all documents in the segment have a value, then `Reader.getDocCount()` will equal `maxDoc` and we can return `numDocs` as a shortcut.	2021-11-30 10:07:39 +00:00
Greg Miller	8a03d2ffc9	Add javadoc note in DoubleValuesSource (see LUCENE-10258) (#490 )	2021-11-29 18:02:26 -08:00
Robert Muir	c5b5fd641b	support tables in generated html documentation (#489 ) Tables can be used in markdown (e.g. MIGRATE.md) and will become html tables in our generated HTML docs on the website.	2021-11-29 17:44:26 -05:00
Robert Muir	278316377c	Improve MIGRATE.md around analyzers artifacts. (#488 ) * Improve MIGRATE.md around analyzers artifacts. Move this to the very top of MIGRATE, the user needs to first be able to pull in the artifacts, before doing anything else like trying to compile, deal with renamed classes, etc. Add a table of each package that got moved, with explicit old and new names. Hopefully it helps search engines and users. Link to MIGRATE.md explicitly from README.md	2021-11-29 17:44:26 -05:00
Ignacio Vera	70243ea811	LUCENE-9538: Detect polygon self-intersections in the Tessellator (#428 ) Detect self-intersections so it can provide a more meaningful error to the users.	2021-11-29 11:06:06 +01:00
Ignacio Vera	62084d7138	LUCENE-10264: Clone index input when creating a PointTree in SimpleTextBKDReader (#478 ) Fixes a race condition introduced in LUCENE-9820.	2021-11-29 09:21:27 +01:00
Robert Muir	8d0103724d	Speed up ECJ tasks by avoiding --release (#484 ) LUCENE-10185 caused a large performance regression in ECJ tasks by using the --release flag. Instead of using --release, we can just disable "terminal deprecation", and leave this check to `javac`. The --release flag makes this tool run 50% slower.	2021-11-28 15:11:02 -05:00

35546 Commits