Commit Graph

35823 Commits

Author SHA1 Message Date
Xavier Sanchez Loro edb936f090
LUCENE-10248: Spanish Plural Stemmer (#461)
Adds a new Spanish stemmer just for stemming plural to singular whilst maintaining gender: the SpanishPluralStemmer. The goal is to provide a lightweight algorithmic approach with better precision and recall than current approaches.

See blog post for more details: https://medium.com/inside-wallapop/spanish-plural-stemmer-matching-plural-and-singular-forms-in-spanish-using-lucene-93e005e38373

This approach is based on rules specified in Wikilengua: http://www.wikilengua.org/index.php/Plural_(formaci%C3%B3n)

Some characteristics:

* Designed to stem just plural to singular form
* Distinguishes between masculine and feminine forms
* It will increase recall but precision can be reduced depending on the use case/information need
* Stems plural words of foreign origin, e.g. complots, bits, punks, robots
* Supports invariant words, where the plural and singular forms are the same or a plural does not make sense, e.g. crisis, jueves, lapsus, abrebotellas
* Supports special cases, e.g. yoes, clubes, itemes, faralaes
* Use it when the distinction between singular and plural is not relevant but gender is relevant
* Produces meaningful tokens in singular form
* Avoids strange stems like “amig”: stemmers are not required to generate grammatically correct tokens, but generating correct stems reduces the chance of collisions with other words
2021-11-30 15:51:10 -05:00
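The characteristics listed for LUCENE-10248 can be illustrated with a deliberately tiny sketch. This is not the SpanishPluralStemmer from the commit: the class name `PluralSketch`, the word lists, and the single "strip a trailing s" rule are assumptions made for illustration, using only example words from the commit message. Plurals ending in -es are not handled beyond the two hand-mapped special cases.

```java
import java.util.Map;
import java.util.Set;

/** Toy illustration only; NOT the Lucene SpanishPluralStemmer. */
final class PluralSketch {
  // Invariant words (same singular and plural form), taken from the commit message.
  private static final Set<String> INVARIANT =
      Set.of("crisis", "jueves", "lapsus", "abrebotellas");

  // Two of the special cases named in the commit message, mapped by hand.
  private static final Map<String, String> SPECIAL =
      Map.of("yoes", "yo", "clubes", "club");

  /** Returns a singular form while keeping the gender-bearing vowel (amigos/amigas). */
  static String singular(String word) {
    if (INVARIANT.contains(word)) {
      return word; // nothing to strip
    }
    String special = SPECIAL.get(word);
    if (special != null) {
      return special;
    }
    // Simplified rule (assumption): drop one trailing "s" and nothing else.
    if (word.length() > 2 && word.endsWith("s")) {
      return word.substring(0, word.length() - 1);
    }
    return word;
  }
}
```

Because only the final "s" is removed, "amigos" and "amigas" stay distinct ("amigo", "amiga") instead of collapsing to a shared stem such as "amig".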
Greg Miller f48a430f35
LUCENE-10232: Fix MultiRangeQuery to confirm all dimensions for a given range match (#437) 2021-11-30 11:58:38 -08:00
Robert Muir 46a5a57724
LUCENE-10272: cross-check norms with postings in checkindex (#493)
Previously, CheckIndex would iterate norms and validate each one. But if norms that should be there were missing, nothing would fail. Now it computes an expected count of norms and ensures it saw them all.
2021-11-30 14:21:40 -05:00
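The idea behind LUCENE-10272 can be sketched as a count comparison. This is a hedged illustration and not the CheckIndex code: `NormsCheckSketch`, its method, and the use of a `BitSet` to represent "documents that postings say should have a norm" are all assumptions.

```java
import java.util.BitSet;

/** Illustration only; names and data structures are hypothetical. */
final class NormsCheckSketch {
  /**
   * @param docsExpectedFromPostings documents that should have a norm
   * @param docsWithNorms doc IDs actually returned while iterating norms
   * @return true if the number of norms seen matches the expected count
   */
  static boolean normsComplete(BitSet docsExpectedFromPostings, int[] docsWithNorms) {
    int expected = docsExpectedFromPostings.cardinality();
    int seen = 0;
    for (int doc : docsWithNorms) {
      // Per-value validation of each norm would happen here.
      seen++;
    }
    // Iterating alone cannot notice a missing norm; the count comparison can.
    return seen == expected;
  }
}
```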
Alan Woodward 749b744c0c
LUCENE-10263: Implement Weight.count() on NormsFieldExistsQuery (#477)
If all documents in the segment have a value, then `Reader.getDocCount()` will
equal `maxDoc` and we can return `numDocs` as a shortcut.
2021-11-30 10:00:38 +00:00
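The shortcut described in LUCENE-10263 can be sketched as below. The class and parameter names are hypothetical and the real change lives inside the query's `Weight`; returning -1 for "not cheaply known" follows the `Weight#count` convention and is stated here as an assumption about how the fallback is signalled.

```java
/** Illustration only; not the NormsFieldExistsQuery implementation. */
final class CountShortcutSketch {
  /**
   * @param fieldDocCount number of documents in the segment with a value for the field
   * @param maxDoc total documents in the segment, including deleted ones
   * @param numDocs live (non-deleted) documents in the segment
   * @return the match count, or -1 if it cannot be derived from statistics alone
   */
  static int count(int fieldDocCount, int maxDoc, int numDocs) {
    if (fieldDocCount == maxDoc) {
      // Every document has a value, so every live document matches.
      return numDocs;
    }
    return -1; // caller falls back to iterating matches
  }
}
```

`numDocs` is returned instead of `maxDoc` because deleted documents must not be counted as matches.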
Greg Miller 4f5b41a71c
Add javadoc note in DoubleValuesSource (see LUCENE-10258) (#490) 2021-11-29 18:00:52 -08:00
Robert Muir 453168ec76
support tables in generated html documentation (#489)
Tables can be used in markdown (e.g. MIGRATE.md) and will become html tables in our generated HTML docs on the website.
2021-11-29 17:38:14 -05:00
Robert Muir 5aa9da9ead
Improve MIGRATE.md around analyzers artifacts. (#488)
* Improve MIGRATE.md around analyzers artifacts.

Move this to the very top of MIGRATE, the user needs to first be able to
pull in the artifacts, before doing anything else like trying to
compile, deal with renamed classes, etc.

Add a table of each package that got moved, with explicit old and new
names. Hopefully it helps search engines and users.

Link to MIGRATE.md explicitly from README.md
2021-11-29 17:04:15 -05:00
Ignacio Vera 78c8d7b7ea
LUCENE-9538: Detect polygon self-intersections in the Tessellator (#428)
Detect self-intersections so it can provide a more meaningful error to the users.
2021-11-29 11:05:03 +01:00
Ignacio Vera 634c22c527
LUCENE-10264: Clone index input when creating a PointTree in SimpleTextBKDReader (#478)
Fixes a race condition introduced in LUCENE-9820.
2021-11-29 09:20:20 +01:00
Robert Muir 63c89f678d
Speed up ECJ tasks by avoiding --release (#484)
LUCENE-10185 caused a large performance regression in ECJ tasks by using the --release flag.

Instead of using --release, we can just disable "terminal deprecation", and leave this check to `javac`. The --release flag makes this tool run 50% slower.
2021-11-28 15:10:32 -05:00
Robert Muir 1fb45da7bb
upgrade ecj linter from 3.25.0 -> 3.27.0 (#483)
The newest version has a significant performance increase for our
use-case.
2021-11-28 12:05:19 -05:00
Robert Muir 3772ff563a
speed up extremely slow test methods (runtime 15-30s) (#471) 2021-11-28 09:40:43 -05:00
Tomoko Uchida cb5f1b6ca0
Use the same analysis chain as StandardAnalyzer (a follow-up of #480) (#482) 2021-11-28 21:22:28 +09:00
Tomoko Uchida c041517304
set group to 'run' benchmark task (#481) 2021-11-28 21:22:07 +09:00
Tomoko Uchida 9eb7857199 fix typo in documentation 2021-11-28 10:11:49 +09:00
Uwe Schindler aed47c1862 Fix wrong path in documentation 2021-11-28 00:55:28 +01:00
Tomoko Uchida 57f695b14d
LUCENE-10261: clean up reflection stuff in luke module and make minor adjustments (#480) 2021-11-27 15:36:38 +09:00
Dawid Weiss 1029651d12 Don't log warnings from ant (different class loader, I guess). Makes Alan happier. 2021-11-26 11:39:55 +01:00
Dawid Weiss 651755aab7
LUCENE-10260: Luke's about window no longer shows version number (#473) 2021-11-26 08:32:23 +01:00
Ignacio Vera a590c6d2a0
LUCENE-10262: Lift up restrictions for navigating PointValues#PointTree (#476)
This change allows random navigation of a PointValues#PointTree.
2021-11-26 07:42:13 +01:00
Uwe Schindler d973e50c15
LUCENE-10259: Fix startup scripts to allow whitespace in path names and use /bin/sh only (#472) 2021-11-25 16:07:23 +01:00
Tomoko Uchida 40b38438c8
LUCENE-10261: Remove preset analyzer panel from Luke Analysis UI. (#475) 2021-11-25 20:30:36 +09:00
Ignacio Vera 800f002e44
LUCENE-9820: PointTree#size() should handle the case of balanced tree in pre-8.6 indexes (#462)
Properly handle the case where trees are fully balanced when the number of dimensions is > 1
2021-11-25 11:03:16 +01:00
Adrien Grand 8710252116 Fix test failures with testIndexUpgraderCommandLineArgs and ExtraFS. 2021-11-25 08:51:56 +01:00
Adrien Grand f80d816ce7
Speed up TestBackwardsCompatibility#testCommandLineArgs. (#467)
This test unzips files that we already unzipped. This commit copies the already
uncompressed files instead.
2021-11-24 08:25:22 +01:00
Adrien Grand 24fcd80a37
LUCENE-10168: Only test N-2 codecs on nightly runs. (#466)
In order for tests to keep running fast, this annotates all tests of N-2 codecs
with `@Nightly`. To keep good coverage of releases, the smoke tester is now
configured to run nightly tests.
2021-11-24 08:20:04 +01:00
Greg Miller 6ee69e06fb
LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (#264) 2021-11-23 06:00:11 -08:00
David Smiley 0fcf9c825f
Javadocs, Sorter impls (#426)
* Javadocs, Sorter impls
* clarify which sorts are stable/not
* link from utility methods to the primary Sorter implementations for further information
* describe when InPlaceMergeSorter is useful. Fix incorrect statement that it uses insertion sort.

* Javadocs for Sorter
2021-11-23 07:13:40 -05:00
Tomoko Uchida 4193bcbc02 LUCENE-10200: fix luke launch script. 2021-11-22 18:46:28 +09:00
Greg Miller 78ee53f837 Add missing CHANGES entry 2021-11-21 07:41:25 -08:00
Greg Miller 9d7e5ef388 Fixup TestCombinedFieldQuery to not (randomly) use numHits = 0 2021-11-21 07:38:28 -08:00
Andriy Redko 5993b9050a
LUCENE-10244: Please consider opening MultiCollector::getCollectors for public use (#455) 2021-11-21 07:36:54 -08:00
Adrien Grand 0902d803fd Add dash between `rev` and the git hash. 2021-11-20 08:09:42 +01:00
Quentin Pradet 1a869c185b
LUCENE-10085: Implement Weight#count on DocValuesFieldExistsQuery (#445)
Co-authored-by: Adrien Grand <jpountz@gmail.com>
2021-11-19 18:06:58 +01:00
Robert Muir af831d2810
LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2) (#452)
Upgrade jflex.

Change doesn't alter the behavior of any of the analyzers (unicode
version or grammar refactorings), just the minimal to get new tooling
working.
2021-11-19 09:24:27 -05:00
Ignacio Vera ad911df260
LUCENE-9820: Separate logic for reading the BKD index from logic for intersecting it (#7)
Extract BKD tree interface and move intersecting logic to the PointValues abstract class.
2021-11-19 08:28:01 +01:00
zacharymorn 07ee3ba83a
LUCENE-10236: Update field-weight used in CombinedFieldQuery scoring calculation (#444) 2021-11-18 21:36:38 -08:00
Andriy Redko 6bd5c14bf3
LUCENE-10242: The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc (#450)
Signed-off-by: Andriy Redko <andriy.redko@aiven.io>
2021-11-18 16:35:59 +01:00
Patrick Zhai b4476e4318
LUCENE-10122 Use NumericDocValue to store taxonomy parent array instead of custom term positions (#451) 2021-11-17 19:32:34 -05:00
Dawid Weiss bae095ae48
LUCENE-10240: gradle regenerate fails on java 17 (#449) 2021-11-17 18:36:34 +01:00
Dawid Weiss 0eeba8d37c
LUCENE-10238: Update icu4j to 70.1. (#447) 2021-11-17 18:13:40 +01:00
Adrien Grand 556c7c5fb5 Add back-compat indices for 8.11.0. 2021-11-17 11:53:49 +01:00
Bruno Roustant c71cbac4f9
LUCENE-10225: Improve IntroSelector with 3-way partitioning. 2021-11-17 10:38:27 +01:00
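For readers unfamiliar with the term in LUCENE-10225, 3-way partitioning splits a range into elements less than, equal to, and greater than a pivot, which helps when many values are duplicates. The following is a generic textbook sketch on an `int[]`, not the IntroSelector code; the class and method names are hypothetical.

```java
/** Generic 3-way (Dutch national flag) partition; illustration only. */
final class ThreeWayPartitionSketch {
  /**
   * Rearranges a[from, to) so that a[from, lt) < pivot, a[lt, gt) == pivot and
   * a[gt, to) > pivot.
   *
   * @return a two-element array {lt, gt}
   */
  static int[] partition(int[] a, int from, int to, int pivot) {
    int lt = from; // next slot for a value smaller than the pivot
    int i = from;  // current element under inspection
    int gt = to;   // start of the region known to be larger than the pivot
    while (i < gt) {
      if (a[i] < pivot) {
        swap(a, lt++, i++);
      } else if (a[i] > pivot) {
        swap(a, i, --gt); // do not advance i: the swapped-in value is unexamined
      } else {
        i++;
      }
    }
    return new int[] {lt, gt};
  }

  private static void swap(int[] a, int x, int y) {
    int tmp = a[x];
    a[x] = a[y];
    a[y] = tmp;
  }
}
```

A selection algorithm can then stop immediately when the target index falls inside `[lt, gt)`, since every element there equals the pivot.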
Adrien Grand c0112dd2ff DOAP changes for release 8.11.0 2021-11-16 10:54:24 +01:00
Dawid Weiss f5e5cf008a
LUCENE-10234: Add automatic module name to JAR manifests. (#440) 2021-11-15 17:02:40 +01:00
Quentin Pradet 1e5e997880
LUCENE-10085: Rename DocValuesFieldExistsQuery test (#441)
FieldValueQuery got renamed to DocValuesFieldExistsQuery but the test
wasn't renamed.
2021-11-15 16:24:29 +01:00
Julie Tibshirani 3b914a4d73
LUCENE-10069: Document that kNN queries might not return all results (#434)
Performing a kNN search with very large k may return fewer than k documents.
This is due to the fact that the HNSW graph is not guaranteed to be connected.
This commit documents the behavior as part of a general warning that the results
of a kNN search may be approximate.
2021-11-12 14:19:20 -08:00
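The effect documented in LUCENE-10069 can be reproduced on a toy graph. The sketch below is a plain breadth-first traversal and not the HNSW search algorithm; the class name, the adjacency-array representation, and the entry point are assumptions. It only shows that a search starting from one entry point cannot return nodes in a component it cannot reach, however large k is.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustration only; a plain graph traversal, not HNSW. */
final class DisconnectedGraphSketch {
  /** Collects up to k nodes reachable from the entry node. */
  static List<Integer> collect(int[][] neighbors, int entry, int k) {
    boolean[] visited = new boolean[neighbors.length];
    Deque<Integer> queue = new ArrayDeque<>();
    List<Integer> results = new ArrayList<>();
    visited[entry] = true;
    queue.add(entry);
    while (!queue.isEmpty() && results.size() < k) {
      int node = queue.poll();
      results.add(node);
      for (int next : neighbors[node]) {
        if (!visited[next]) {
          visited[next] = true;
          queue.add(next);
        }
      }
    }
    return results; // may hold fewer than k nodes if the graph is disconnected
  }
}
```

With two components {0, 1} and {2, 3}, a search entering at node 0 with k = 4 yields only two results.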
Julie Tibshirani 2a9adb81df LUCENE-10063: Fix score calculation in SimpleTextKnnVectorsFormat
The method VectorSimilarityFunction#convertToScore already reverses the
similarity, so we shouldn't reverse it again.
2021-11-11 11:22:03 -08:00
Dawid Weiss f725b27e12
LUCENE-10223: interval support in standard syntax parser (#429) 2021-11-11 08:54:59 +01:00
Julie Tibshirani 53586d4231
LUCENE-10228: Ensure PerFieldKnnVectorsFormat uses right format name (#432)
Before when creating a KnnVectorsWriter for merging, we consulted the existing
"PER_FIELD_SUFFIX_KEY" attribute to determine the format's per-field suffix.
This isn't correct since we could be using a new codec (that produces different
formats/suffixes).

This commit modifies TestPerFieldDocValuesFormat#testMergeUsesNewFormat to
trigger the problem. Without the fix, it throws an error like
"java.nio.file.FileAlreadyExistsException: File
"_3_Lucene90HnswVectorsFormat_0.vem" was already written to."
2021-11-10 08:18:01 -08:00