35558 Commits

Author SHA1 Message Date
Xavier Sanchez Loro 0293fc896d
LUCENE-10248: Spanish Plural Stemmer (#461)
Adds a new Spanish stemmer just for stemming plural to singular whilst maintaining gender: the SpanishPluralStemmer. The goal is to provide a lightweight algorithmic approach with better precision and recall than current approaches.

See blog post for more details: https://medium.com/inside-wallapop/spanish-plural-stemmer-matching-plural-and-singular-forms-in-spanish-using-lucene-93e005e38373

This approach is based on rules specified in Wikilengua: http://www.wikilengua.org/index.php/Plural_(formaci%C3%B3n)

Some characteristics:

* Designed to stem just plural to singular form
* Distinguishes between masculine and feminine forms
* It will increase recall but precision can be reduced depending on the use case/information need
* Stems plural words of foreign origin, e.g. complots, bits, punks, robots
* Support for invariant words (same plural and singular form, or a plural that does not make sense), e.g. crisis, jueves, lapsus, abrebotellas
* Support for special cases, e.g. yoes, clubes, itemes, faralaes
* Use it when the distinction between singular and plural is not relevant but gender is relevant
* Produces meaningful tokens in singular form
* Avoids strange stems like “amig”: stemmers are not required to generate grammatically correct tokens, but generating correct stems decreases the possibility of collisions with other words
2021-11-30 16:07:09 -05:00
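The characteristics listed in this commit message can be illustrated with a minimal sketch. This is not the SpanishPluralStemmer code from the commit: the class name `PluralSketch`, the two suffix rules, and the small word lists are assumptions chosen only to show the general idea of reducing plurals to a real singular form while keeping gender (amigos/amigas stay distinct).

```java
import java.util.Map;
import java.util.Set;

// Illustrative only: a tiny rule-based plural-to-singular reducer.
// It is NOT Lucene's SpanishPluralStemmer; rules and word lists are assumptions.
final class PluralSketch {
  // Words whose singular and plural coincide (examples from the commit message).
  private static final Set<String> INVARIANT =
      Set.of("crisis", "jueves", "lapsus", "abrebotellas");

  // Special cases mapped directly to their singular (hypothetical table).
  private static final Map<String, String> SPECIAL =
      Map.of("yoes", "yo", "clubes", "club");

  private static boolean isVowel(char c) {
    return "aeiouáéíóú".indexOf(c) >= 0;
  }

  static String stem(String word) {
    if (INVARIANT.contains(word)) {
      return word; // nothing to strip
    }
    String special = SPECIAL.get(word);
    if (special != null) {
      return special;
    }
    int n = word.length();
    // consonant + "es": "papeles" -> "papel"
    if (n > 3 && word.endsWith("es") && !isVowel(word.charAt(n - 3))) {
      return word.substring(0, n - 2);
    }
    // plain trailing "s": "amigas" -> "amiga", "robots" -> "robot"
    if (n > 2 && word.endsWith("s")) {
      return word.substring(0, n - 1);
    }
    return word;
  }
}
```

With these rules, `PluralSketch.stem("amigos")` yields `amigo` and `PluralSketch.stem("amigas")` yields `amiga`, so gender survives, unlike a stem such as “amig”. The sketch is deliberately naive: two suffix rules cannot cover the full rule set described on Wikilengua.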
Robert Muir 468aceff0c
LUCENE-10248: add CHANGES.txt entry 2021-11-30 16:07:09 -05:00
Dawid Weiss 26257292c3 LUCENE-10234: Change module prefix to org.apache.* (#487) 2021-11-30 22:04:24 +01:00
Robert Muir c89c78cee0
LUCENE-10272: cross-check norms with postings in checkindex (#493)
Previously, CheckIndex would iterate norms and validate each one. But if norms that should be there were missing, nothing would fail. Now it computes an expected count of norms and ensures it saw them all.
2021-11-30 15:40:35 -05:00
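The idea of comparing an expected norm count against the norms actually seen can be sketched as follows. This is a simplified model, not CheckIndex code: the class `NormsCrossCheckSketch`, the use of `BitSet` to represent "documents with postings" and "documents with norms", and the way the expected count is derived are all assumptions made for illustration.

```java
import java.util.BitSet;

// Hypothetical model of the cross-check: one BitSet marks documents that have
// postings for a field, another marks documents for which a norm was found.
final class NormsCrossCheckSketch {
  static boolean isConsistent(BitSet docsWithPostings, BitSet docsWithNorms, int maxDoc) {
    // Expected number of norms, derived from the postings side (assumption).
    int expected = docsWithPostings.cardinality();

    int seen = 0;
    for (int doc = docsWithNorms.nextSetBit(0); doc >= 0; doc = docsWithNorms.nextSetBit(doc + 1)) {
      if (doc >= maxDoc) {
        return false; // per-norm validation: document id out of range
      }
      seen++;
    }
    // Validating each norm is not enough: missing norms only show up in the count.
    return seen == expected;
  }
}
```

Iterating the norms alone would accept an index where a norm is missing, because there is nothing to visit for the missing entry; the final count comparison is what catches that case.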
Greg Miller 51e023bf7a LUCENE-10232: Fix MultiRangeQuery to confirm all dimensions for a given range match 2021-11-30 12:01:34 -08:00
Alan Woodward b697745407 LUCENE-10263: Implement Weight.count() on NormsFieldExistsQuery (#477)
If all documents in the segment have a value, then `Reader.getDocCount()` will
equal `maxDoc` and we can return `numDocs` as a shortcut.
2021-11-30 10:07:39 +00:00
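The shortcut described in this commit message can be sketched with plain integers. The class `CountShortcutSketch` and its parameters are hypothetical stand-ins for the reader statistics, and returning -1 for "count not cheaply known" is used here as the sketch's convention.

```java
// Illustrative only: models the shortcut with plain numbers instead of a reader.
final class CountShortcutSketch {
  /**
   * @param docCountForField number of documents with a value for the field (deleted ones included)
   * @param maxDoc           total number of documents in the segment (deleted ones included)
   * @param numDocs          number of live (non-deleted) documents in the segment
   * @return the exact number of matching live documents, or -1 if the shortcut does not apply
   */
  static int count(int docCountForField, int maxDoc, int numDocs) {
    if (docCountForField == maxDoc) {
      // Every document has a value, so every live document matches.
      return numDocs;
    }
    return -1; // the caller would have to count matches another way
  }
}
```

For example, `CountShortcutSketch.count(10, 10, 8)` returns 8, while `CountShortcutSketch.count(7, 10, 8)` returns -1 because some documents lack a value and deletions make the live count unknown without iterating.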
Greg Miller 8a03d2ffc9 Add javadoc note in DoubleValuesSource (see LUCENE-10258) (#490) 2021-11-29 18:02:26 -08:00
Robert Muir c5b5fd641b
support tables in generated html documentation (#489)
Tables can be used in markdown (e.g. MIGRATE.md) and will become html tables in our generated HTML docs on the website.
2021-11-29 17:44:26 -05:00
Robert Muir 278316377c
Improve MIGRATE.md around analyzers artifacts. (#488)
* Improve MIGRATE.md around analyzers artifacts.

Move this to the very top of MIGRATE, the user needs to first be able to
pull in the artifacts, before doing anything else like trying to
compile, deal with renamed classes, etc.

Add a table of each package that got moved, with explicit old and new
names. Hopefully it helps search engines and users.

Link to MIGRATE.md explicitly from README.md
2021-11-29 17:44:26 -05:00
Ignacio Vera 70243ea811 LUCENE-9538: Detect polygon self-intersections in the Tessellator (#428)
Detect self-intersections so it can provide a more meaningful error to the users.
2021-11-29 11:06:06 +01:00
Ignacio Vera 62084d7138 LUCENE-10264: Clone index input when creating a PointTree in SimpleTextBKDReader (#478)
Fixes a race condition introduced in LUCENE-9820.
2021-11-29 09:21:27 +01:00
Robert Muir 8d0103724d
Speed up ECJ tasks by avoiding --release (#484)
LUCENE-10185 caused a large performance regression in ECJ tasks by using the --release flag.

Instead of using --release, we can just disable "terminal deprecation", and leave this check to `javac`. The --release flag makes this tool run 50% slower.
2021-11-28 15:11:02 -05:00
Robert Muir 95095d0d49
upgrade ecj linter from 3.25.0 -> 3.27.0 (#483)
The newest version has a significant performance increase for our
use-case.
2021-11-28 12:05:42 -05:00
Robert Muir 756550f88b
speed up extremely slow test methods (runtime 15-30s) (#471) 2021-11-28 09:41:15 -05:00
Tomoko Uchida 38762ee8cf Use the same analysis chain as StandardAnalyzer (a follow-up of #480) (#482) 2021-11-28 21:24:11 +09:00
Tomoko Uchida 93bb52c601 set group to 'run' benchmark task (#481) 2021-11-28 21:23:48 +09:00
Tomoko Uchida eb912a9158 fix typo in documentation 2021-11-28 10:13:28 +09:00
Uwe Schindler 92a2428906 Fix wrong path in documentation 2021-11-28 00:56:27 +01:00
Tomoko Uchida e222031943 LUCENE-10261: clean up reflection stuff in luke module and make minor adjustments (#480) 2021-11-27 15:42:16 +09:00
Dawid Weiss f599a8e2ee LUCENE-10260: Luke's about window no longer shows version number (#473) 2021-11-26 08:33:30 +01:00
Ignacio Vera cb0c2b87ed LUCENE-10262: Lift up restrictions for navigating PointValues#PointTree (#476)
This change allows random navigation of a PointValues#PointTree.
2021-11-26 07:43:43 +01:00
Uwe Schindler 30c4b8d5b8 LUCENE-10259: Fix startup scripts to allow whitespace in path names and use /bin/sh only (#472) 2021-11-25 16:11:37 +01:00
Tomoko Uchida bfa3f01a17 LUCENE-10261: Remove preset analyzer panel from Luke Analysis UI. (#475) 2021-11-25 20:33:34 +09:00
Ignacio Vera 58ef7d911a
LUCENE-9820: PointTree#size() should handle the case of balanced tree in pre-8.6 indexes (#462) (#474)
Properly handle the case where trees are fully balanced and the number of dimensions is > 1
2021-11-25 11:19:02 +01:00
Adrien Grand a97a1e2815 Fix test failures with testIndexUpgraderCommandLineArgs and ExtraFS. 2021-11-25 08:50:27 +01:00
David Smiley e2e99da4a8
Javadocs, Sorter impls (#426)
* Javadocs, Sorter impls
* clarify which sorts are stable/not
* link from utility methods to the primary Sorter implementations for further information
* describe when InPlaceMergeSorter is useful. Fix incorrect statement that it uses insertion sort.

* Javadocs for Sorter
2021-11-25 00:44:58 -05:00
Adrien Grand b3a36166a5 Speed up TestBackwardsCompatibility#testCommandLineArgs. (#467)
This test unzips files that we already unzipped. This commit copies the already
uncompressed files instead.
2021-11-24 08:26:42 +01:00
Adrien Grand 3f634e2ab9 LUCENE-10168: Only test N-2 codecs on nightly runs.
In order for tests to keep running fast, this annotates all tests of N-2 codecs
with `@Nightly`. To keep good coverage of releases, the smoke tester is now
configured to run nightly tests.
2021-11-24 08:26:42 +01:00
Tomoko Uchida 170137129a LUCENE-10200: fix luke launch script. 2021-11-22 19:16:18 +09:00
Andriy Redko 51c37db005 LUCENE-10244: MultiCollector::getCollectors is now public 2021-11-21 07:44:08 -08:00
Adrien Grand ee8829da5b Add dash between `rev` and the git hash. 2021-11-20 08:09:33 +01:00
Greg Miller 0ba310782f
LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals 2021-11-19 13:11:42 -08:00
Patrick Zhai 6b99f03cdd
LUCENE-10122 Use NumericDocValue to store taxonomy parent array (#454) 2021-11-19 13:05:56 -05:00
Quentin Pradet 631d1ad749 LUCENE-10085: Implement Weight#count on DocValuesFieldExistsQuery (#445)
Co-authored-by: Adrien Grand <jpountz@gmail.com>
2021-11-19 18:07:29 +01:00
Robert Muir ee56d31425
LUCENE-10239: upgrade jflex (1.7.0 -> 1.8.2) (#452)
Upgrade jflex.

Change doesn't alter the behavior of any of the analyzers (unicode
version or grammar refactorings), just the minimal to get new tooling
working.
2021-11-19 09:28:11 -05:00
Ignacio Vera 9adf7e27f9
LUCENE-9820: Separate logic for reading the BKD index from logic for intersecting it (#7) (#457)
Extract BKD tree interface and move intersecting logic to the PointValues abstract class.
2021-11-19 08:39:28 +01:00
Jim Ferenczi 2e5c4bb5a5 LUCENE-10208: Ensure that the minimum competitive score does not decrease in concurrent search (#431)
Co-authored-by: Adrien Grand <jpountz@gmail.com>
2021-11-18 17:33:04 +01:00
Andriy Redko 42bee6f223 LUCENE-10242: The TopScoreDocCollector::createSharedManager should use ScoreDoc instead of FieldDoc (#450)
Signed-off-by: Andriy Redko <andriy.redko@aiven.io>
2021-11-18 16:36:32 +01:00
Dawid Weiss 8d07018050 LUCENE-10240: gradle regenerate fails on java 17 (#449) 2021-11-17 18:36:58 +01:00
Dawid Weiss 4c22d30f80 LUCENE-10238: Update icu4j to 70.1. (#447) 2021-11-17 18:14:33 +01:00
Adrien Grand 7ce0cfa9c5 Add back-compat indices for 8.11.0 2021-11-17 11:51:18 +01:00
Bruno Roustant 02a63f688c
LUCENE-10225: Improve IntroSelector with 3-way partitioning. 2021-11-17 11:31:11 +01:00
Adrien Grand b6f456573a DOAP changes for release 8.11.0 2021-11-16 10:55:08 +01:00
Dawid Weiss 9d0eb88d2c LUCENE-10234: Add automatic module name to JAR manifests. (#440) 2021-11-15 17:03:08 +01:00
Quentin Pradet e034a2d6e2 LUCENE-10085: Rename DocValuesFieldExistsQuery test (#441)
FieldValueQuery got renamed to DocValuesFieldExistsQuery but the test
wasn't renamed.
2021-11-15 16:24:57 +01:00
Julie Tibshirani 607b10dc2a LUCENE-10069: Document that kNN queries might not return all results (#434)
Performing a kNN search with very large k may return fewer than k documents.
This is due to the fact that the HNSW graph is not guaranteed to be connected.
This commit documents the behavior as part of a general warning that the results
of a kNN search may be approximate.
2021-11-12 14:20:09 -08:00
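Why a disconnected graph can yield fewer than k results can be shown with a toy traversal. This is not the HNSW search from Lucene: the class `DisconnectedGraphSketch`, the adjacency-array representation, and the plain breadth-first walk are assumptions that only demonstrate the reachability limit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: collects up to k nodes reachable from an entry point.
// Real HNSW search is greedy and distance-driven; this walk ignores distances.
final class DisconnectedGraphSketch {
  static List<Integer> collect(int[][] neighbors, int entryPoint, int k) {
    List<Integer> results = new ArrayList<>();
    Set<Integer> visited = new HashSet<>();
    Deque<Integer> queue = new ArrayDeque<>();
    queue.add(entryPoint);
    visited.add(entryPoint);
    while (!queue.isEmpty() && results.size() < k) {
      int node = queue.poll();
      results.add(node);
      for (int next : neighbors[node]) {
        if (visited.add(next)) { // add() returns false for already-visited nodes
          queue.add(next);
        }
      }
    }
    return results;
  }
}
```

With five nodes split into the components {0, 1, 2} and {3, 4}, a search entering at node 0 and asking for k = 5 can return at most three nodes, because nodes 3 and 4 are never reached.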
Julie Tibshirani 68be365283 LUCENE-10063: Fix score calculation in SimpleTextKnnVectorsFormat
The method VectorSimilarityFunction#convertToScore already reverses the
similarity, so we shouldn't reverse it again.
2021-11-11 11:36:50 -08:00
Julie Tibshirani 9c73562161 LUCENE-10228: Ensure PerFieldKnnVectorsFormat uses right format name (#432)
Before, when creating a KnnVectorsWriter for merging, we consulted the existing
"PER_FIELD_SUFFIX_KEY" attribute to determine the format's per-field suffix.
This isn't correct since we could be using a new codec (that produces different
formats/suffixes).

This commit modifies TestPerFieldDocValuesFormat#testMergeUsesNewFormat to
trigger the problem. Without the fix it throws an error like
"java.nio.file.FileAlreadyExistsException: File
"_3_Lucene90HnswVectorsFormat_0.vem" was already written to."
2021-11-11 11:22:52 -08:00
Dawid Weiss ff9ee28c60 LUCENE-10223: interval support in standard syntax parser (#429) 2021-11-11 08:56:48 +01:00
Dawid Weiss 238cd5fd0c LUCENE-10226: test target creates a weird folder (lazy property). 2021-11-09 08:38:42 +01:00