mirror of https://github.com/apache/lucene.git synced 2025-02-28 13:29:26 +00:00

11754 Commits

Author SHA1 Message Date
Alan Woodward
7c1ba1aebe
LUCENE-9099: Correctly handle repeats in ORDERED and UNORDERED intervals ()
If you have repeating intervals in an ordered or unordered interval source, you currently 
get somewhat confusing behaviour:

* `ORDERED(a, a, b)` will return an extra interval over just `a b` if it first matches `a a b`, meaning
that you can get incorrect results if used in a `CONTAINING` filter - 
`CONTAINING(ORDERED(x, y), ORDERED(a, a, b))` will match on the document `a x a b y`
* `UNORDERED(a, a)` will match on documents that contain just a single `a`.

This commit adds a RepeatingIntervalsSource that correctly handles repeats within 
ordered and unordered sources. It also changes the way that gaps are calculated within 
ordered and unordered sources, by using a new width() method on IntervalIterator. The 
default implementation just returns end() - start() + 1, but RepeatingIntervalsSource 
instead returns the sum of the widths of its child iterators. This preserves maxgaps filtering 
on ordered and unordered sources that contain repeats.

In order to correctly handle matches in this scenario, IntervalsSource#matches now always 
returns an explicit IntervalsMatchesIterator rather than a plain MatchesIterator, which adds 
gaps() and width() methods so that submatches can be combined in the same way that 
subiterators are. Extra checks have been added to checkIntervals() to ensure that the same 
intervals are returned by both iterator and matches, and a fix to 
DisjunctionIntervalIterator#matches() is also included - DisjunctionIntervalIterator minimizes 
its intervals, while MatchesUtils.disjunction does not, so there was a discrepancy between 
the two methods.
2020-02-06 14:44:47 +00:00
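
The width() arithmetic described in LUCENE-9099 above can be illustrated with a minimal Java sketch. The `Interval`, `Term` and `Repeated` types below are hypothetical stand-ins, not Lucene classes; only the two rules (a default width of end() - start() + 1, and a repeat whose width is the sum of its children's widths) come from the commit message. The `gaps` helper is an assumption about how a gap count could be derived from those widths.

```java
public class WidthSketch {
    /** Hypothetical stand-in for an interval; not a Lucene class. */
    public interface Interval {
        int start();
        int end();

        /** Default width as described in the commit message. */
        default int width() {
            return end() - start() + 1;
        }
    }

    /** A single-term match occupying one position. */
    public static final class Term implements Interval {
        private final int pos;

        public Term(int pos) {
            this.pos = pos;
        }

        public int start() { return pos; }
        public int end() { return pos; }
    }

    /** Hypothetical repeat: spans all children, but its width only counts positions they cover. */
    public static final class Repeated implements Interval {
        private final Interval[] children;

        public Repeated(Interval... children) {
            this.children = children;
        }

        public int start() { return children[0].start(); }
        public int end() { return children[children.length - 1].end(); }

        @Override
        public int width() {
            int sum = 0;
            for (Interval child : children) {
                sum += child.width();
            }
            return sum;
        }
    }

    /** Assumed gap calculation: positions spanned minus positions covered. */
    public static int gaps(Interval match) {
        return (match.end() - match.start() + 1) - match.width();
    }
}
```

For the document `a x a b`, a repeat over the two `a` positions (0 and 2) spans three positions but has width 2, so one gap is reported instead of zero.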
Adrien Grand
fdf5ade727 LUCENE-9147: Fix codec excludes. 2020-02-06 10:34:03 +01:00
Adrien Grand
1b882246d7 LUCENE-9147: Avoid reusing file names with FileSwitchDirectory or NRTCachingDirectory and IOContext randomization. 2020-02-06 08:27:33 +01:00
Robert Muir
196ec5f4a8
LUCENE-9206: add forbidden api exclusion to new class 2020-02-05 20:30:18 -05:00
Robert Muir
93b83f635d
LUCENE-9206: Improve IndexMergeTool defaults and options
IndexMergeTool previously had no options and always called forceMerge(1)
on the resulting index. This can result in wasted work and confusing
performance (unbalancing the index).

Instead the default is to not do anything, except merges from the
merge policy.
2020-02-05 16:31:07 -05:00
Adrien Grand
136dcbdbbc
LUCENE-9147: Move the stored fields index off-heap. ()
This replaces the index of stored fields and term vectors with two
`DirectMonotonic` arrays. `DirectMonotonicWriter` needs to know the number
of values to write up-front, so incoming doc IDs and file pointers are buffered
on disk using temporary files that never get fsynced, but have index headers
and footers to make sure any corruption in these files wouldn't propagate to the
index.

`DirectMonotonicReader` gets a specialized `binarySearch` implementation that
leverages the metadata in order to avoid going to the IndexInput as much as
possible. In the common case, it only goes to a single
sub `DirectReader` which, combined with the size of blocks of 1k values, helps
bound the number of page faults to 2.
2020-02-05 18:35:08 +01:00
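
The idea of using in-memory metadata to limit how many blocks a binary search touches can be sketched as follows. This is a simplified illustration, not the `DirectMonotonicReader` implementation: the `BlockSearchSketch` class, its per-block minimum array and the `blockReads` counter are all assumptions made for the example, and the real metadata layout is not described in the commit message.

```java
import java.util.Arrays;

public class BlockSearchSketch {
    private final long[][] blocks;   // stand-in for on-disk blocks of values
    private final long[] blockMins;  // stand-in for in-memory per-block metadata
    private final int blockSize;
    public int blockReads = 0;       // counts how many blocks a search touched

    public BlockSearchSketch(long[] sorted, int blockSize) {
        this.blockSize = blockSize;
        int numBlocks = (sorted.length + blockSize - 1) / blockSize;
        blocks = new long[numBlocks][];
        blockMins = new long[numBlocks];
        for (int b = 0; b < numBlocks; b++) {
            int from = b * blockSize;
            int to = Math.min(sorted.length, from + blockSize);
            blocks[b] = Arrays.copyOfRange(sorted, from, to);
            blockMins[b] = sorted[from];
        }
    }

    /** Returns the index of the last value <= target, or -1 if there is none. */
    public long floorIndex(long target) {
        // Step 1: binary search over metadata only, no block is read.
        int lo = 0, hi = blockMins.length - 1, block = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (blockMins[mid] <= target) {
                block = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (block < 0) {
            return -1;
        }
        // Step 2: binary search inside the single candidate block.
        blockReads++;
        long[] values = blocks[block];
        lo = 0;
        hi = values.length - 1;
        int idx = 0; // values[0] == blockMins[block] <= target, so index 0 always qualifies
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] <= target) {
                idx = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return (long) block * blockSize + idx;
    }
}
```

Because the first step only consults the in-memory array, each lookup reads at most one block, which is the property the commit message relies on to bound page faults.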
Mike McCandless
47386f8cca LUCENE-9200: consistently use double (not float) math for TieredMergePolicy's decisions, to fix a corner-case bug uncovered by randomized tests 2020-02-05 09:51:31 -05:00
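
The commit message for LUCENE-9200 does not describe the corner case itself, so the following is only a generic illustration of why float math can lead to a different decision than double math. The class and method names are hypothetical and are not taken from TieredMergePolicy.

```java
public class FloatVsDouble {
    /** Compares two sizes after converting them to float. */
    public static boolean largerAsFloat(long a, long b) {
        return (float) a > (float) b;
    }

    /** Compares the same two sizes after converting them to double. */
    public static boolean largerAsDouble(long a, long b) {
        return (double) a > (double) b;
    }

    public static void main(String[] args) {
        long a = 16_777_217L; // 2^24 + 1, not exactly representable as a float
        long b = 16_777_216L; // 2^24
        System.out.println("float:  " + largerAsFloat(a, b));
        System.out.println("double: " + largerAsDouble(a, b));
    }
}
```

A float has a 24-bit significand, so sizes above 2^24 bytes (16 MB) can collapse onto the same float value, while a double represents every integer up to 2^53 exactly.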
Ignacio Vera
641680fbf1
LUCENE-9197: fix wrong implementation on Point2D#withinTriangle () 2020-02-04 07:10:08 +01:00
Erick Erickson
d3ac1329a3
LUCENE-8656: Deprecations in FuzzyQuery ()

Closes 
2020-02-03 08:52:33 -05:00
Jan Høydahl
16b8d50284
SOLR-14221: Upgrade restlet to version 2.4.0 () 2020-02-02 11:35:14 +01:00
Kazuaki Hiraga
b457c2ee2e LUCENE-9123: Add new JapaneseTokenizer constructors with discardCompoundToken option to control whether the tokenizer emits original tokens when the mode is not NORMAL. 2020-02-01 14:51:09 +09:00
Erick Erickson
5253c0cb74
LUCENE-9134 Port ant-regenerate tasks to Gradle build ()
LUCENE-9134: Port ant-regenerate tasks to Gradle build Javacc sub-task. Closes 
2020-01-31 17:04:10 -05:00
Robert Muir
7382375d8a
support ECJ linting on newer JDK versions
The entire precommit task will still fail with unsupported java version
(subsequent checks do not support the newer javadocs format).

But this allows the ECJ linter to run, which checks for things such as
unused imports.
2020-01-31 14:16:04 -05:00
Christine Poerschke
0c1b19a321 LUCENE-8530: fix some 'rawtypes' javac warnings 2020-01-31 16:40:55 +00:00
Robert Muir
9ceaff913e
LUCENE-9195: more slow tests fixes 2020-01-31 07:57:34 -05:00
Dawid Weiss
043dd207b6 LUCENE-9080: this jflex file got corrupted somehow during previous commit. I regenerated it with ant, along with the final java file. I also added a crlf normalization, encoding and forced-regeneration to ant because it didn't work before. 2020-01-30 13:09:47 +01:00
Adrien Grand
13e2094804 LUCENE-4702: Improve performance for fuzzy queries.
Fuzzy queries with an edit distance of 1 or 2 must visit all blocks whose prefix
length is 1 or 2. By not compressing those, we can trade very little space (a
couple MBs in the case of the wikibigall index) for better query efficiency.
2020-01-30 10:37:39 +01:00
Ignacio Vera
a9482911a8
LUCENE-9141: Simplify LatLonShapeXQuery API by adding a new abstract class called LatLonGeometry. () 2020-01-30 08:03:22 +01:00
Robert Muir
29469b454f
LUCENE-9192: speed up more slow tests 2020-01-29 14:31:32 -05:00
Ignacio Vera
c98229948a
LUCENE-9152: Improve line intersection detection for polygons () 2020-01-29 19:24:51 +01:00
Adrien Grand
92b684c647
LUCENE-9161: DirectMonotonicWriter checks for overflows. () 2020-01-28 19:06:53 +01:00
Adrien Grand
6eb8834a57
LUCENE-4702: Reduce terms dictionary compression overhead. ()
Changes include:
 - Removed LZ4 compression of suffix lengths which didn't save much space
   anyway.
 - For stats, LZ4 was only really used for run-length compression of terms whose
   docFreq is 1. This has been replaced by explicit run-length compression.
 - Since we only use LZ4 for suffix bytes if the compression ratio is < 75%, we
   now only try LZ4 out if the average suffix length is greater than 6, in order
   to reduce index-time overhead.
2020-01-28 18:38:30 +01:00
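
As a rough illustration of explicit run-length compression for terms whose docFreq is 1, here is a minimal sketch. The encoding (a negative number standing for a run of ones) and the `DocFreqRunLength` class are assumptions made for the example; the commit message does not describe the on-disk format Lucene uses.

```java
import java.util.ArrayList;
import java.util.List;

public class DocFreqRunLength {
    /**
     * Encodes each run of docFreq == 1 as a single negative count.
     * Other values are kept as-is; docFreq is always >= 1, so the sign is unambiguous.
     */
    public static List<Integer> encode(int[] docFreqs) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < docFreqs.length) {
            if (docFreqs[i] == 1) {
                int run = 0;
                while (i < docFreqs.length && docFreqs[i] == 1) {
                    run++;
                    i++;
                }
                out.add(-run);
            } else {
                out.add(docFreqs[i]);
                i++;
            }
        }
        return out;
    }

    /** Reverses encode(); count is the number of original values. */
    public static int[] decode(List<Integer> encoded, int count) {
        int[] out = new int[count];
        int pos = 0;
        for (int v : encoded) {
            if (v < 0) {
                for (int k = 0; k < -v; k++) {
                    out[pos++] = 1;
                }
            } else {
                out[pos++] = v;
            }
        }
        return out;
    }
}
```

Long runs of singleton terms, which are common in fields with many unique values, shrink to one token each without running a general-purpose compressor.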
Robert Muir
4773574578
LUCENE-9189: TestIndexWriterDelete.testDeletesOnDiskFull can run for minutes
The issue is that MockDirectoryWrapper's disk full check is horribly
inefficient. On every writeByte/etc, it totally recomputes disk space
across all files. This means it calls listAll() on the underlying
Directory (which sorts all the underlying files), then sums up fileLength()
for each of those files.

This leads to many pathological cases in the disk full tests... but the
number of tests impacted by this is minimal, and the logic is scary.
2020-01-28 12:24:31 -05:00
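
The cost pattern described in LUCENE-9189 above (re-summing every file length on every write) can be contrasted with a running total in a small sketch. `DiskUsageSketch` is a hypothetical class, not MockDirectoryWrapper, and the commit message does not say that the fix took this form; the sketch only shows why the per-write recomputation is expensive.

```java
import java.util.HashMap;
import java.util.Map;

public class DiskUsageSketch {
    private final Map<String, Long> fileLengths = new HashMap<>();
    private long runningTotal = 0;

    /** Records n bytes written to the given file. */
    public void writeBytes(String file, long n) {
        fileLengths.merge(file, n, Long::sum);
        runningTotal += n;
    }

    /** O(number of files) per call: the pattern the commit message describes. */
    public long recomputeTotal() {
        long sum = 0;
        for (long len : fileLengths.values()) {
            sum += len;
        }
        return sum;
    }

    /** O(1) per call: a running total maintained on each write. */
    public long trackedTotal() {
        return runningTotal;
    }
}
```

Calling `recomputeTotal()` once per written byte makes the total work proportional to bytes written times number of files, which is consistent with the multi-minute runtimes mentioned above.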
Robert Muir
3bcc97c8eb
LUCENE-9186: remove linefiledocs usage from BaseTokenStreamTestCase 2020-01-28 11:55:51 -05:00
Robert Muir
4350efa932
LUCENE-9187: remove too-expensive assert from LZ4 HighCompressionHashTable 2020-01-28 11:45:43 -05:00
Adrien Grand
9e4c445d17 LUCENE-4702: CHANGES entry. 2020-01-27 18:27:53 +01:00
Robert Muir
975df9ddd3
LUCENE-9182: add apache license headers to all .gradle files and enforce in rat task 2020-01-27 12:05:34 -05:00
Robert Muir
8e357b167b
LUCENE-9180: dos2unix files that don't need dos line endings 2020-01-27 11:29:59 -05:00
Robert Muir
fddb5314fc
LUCENE-9172: nuke some compiler warnings 2020-01-27 06:08:30 -05:00
Alan Woodward
02f862670e
LUCENE-9153: Allow WhitespaceAnalyzer to set a custom maxTokenLen ()
WhitespaceTokenizer defaults to a maximum token length of 255, and WhitespaceAnalyzer
does not allow this to be changed. This commit adds an optional maxTokenLen parameter
to WhitespaceAnalyzer as well, and documents the existing token length restriction.
2020-01-27 09:22:25 +00:00
Ignacio Vera
1fe4177ac0
LUCENE-9176: Handle the case when there is only one leaf node in TestEstimatePointCount () 2020-01-27 09:52:25 +01:00
Uwe Schindler
0635756f76 Fix Windows Line endings in the source-patterns checker (silly bug: it's \r\n on windows not the other way round) 2020-01-26 11:48:24 +01:00
Robert Muir
c53cc3edaf
LUCENE-9167: test speedup for slowest/pathological tests (round 3) 2020-01-24 08:58:59 -05:00
Adrien Grand
b283b8df62
LUCENE-4702: Terms dictionary compression. ()
Compress blocks of suffixes in order to make the terms dictionary more
space-efficient. Two compression algorithms are used depending on which one is
more space-efficient:
 - LowercaseAsciiCompression, which applies when all bytes are in the
   `[0x1F,0x3F)` or `[0x5F,0x7F)` ranges, which notably include all digits,
   lowercase ASCII characters, '.', '-' and '_', and encodes 4 chars on 3 bytes.
   It is very often applicable on analyzed content and decompresses very quickly
   thanks to auto-vectorization support in the JVM.
 - LZ4, when the compression ratio is less than 0.75.

I was a bit unhappy with the complexity of the high-compression LZ4 option, so
I simplified it in order to only keep the logic that detects duplicate strings.
The logic about what to do in case overlapping matches are found, which was
responsible for most of the complexity while only yielding tiny benefits, has
been removed.
2020-01-24 14:46:57 +01:00
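
The "4 chars on 3 bytes" figure follows from the two ranges named above: `[0x1F,0x3F)` and `[0x5F,0x7F)` hold 32 values each, 64 in total, so every byte fits in 6 bits and four of them fit in 24 bits. The sketch below shows that arithmetic only. The `SixBitPacking` class and its bit layout are assumptions for illustration and are not the LowercaseAsciiCompression implementation.

```java
public class SixBitPacking {
    /** True if every byte falls in [0x1F,0x3F) or [0x5F,0x7F). */
    public static boolean isCompressible(byte[] in) {
        for (byte b : in) {
            boolean low = b >= 0x1F && b < 0x3F;
            boolean high = b >= 0x5F && b < 0x7F;
            if (!low && !high) {
                return false;
            }
        }
        return true;
    }

    /** Maps a compressible byte to a 6-bit code in [0, 64). */
    static int toCode(byte b) {
        return b < 0x3F ? b - 0x1F : b - 0x5F + 32;
    }

    /** Inverse of toCode. */
    static byte fromCode(int code) {
        return (byte) (code < 32 ? code + 0x1F : code - 32 + 0x5F);
    }

    /** Packs exactly 4 compressible bytes into 3 bytes (hypothetical layout). */
    public static byte[] pack4(byte[] in) {
        int bits = 0;
        for (int i = 0; i < 4; i++) {
            bits = (bits << 6) | toCode(in[i]);
        }
        return new byte[] {(byte) (bits >>> 16), (byte) (bits >>> 8), (byte) bits};
    }

    /** Restores the 4 original bytes from 3 packed bytes. */
    public static byte[] unpack4(byte[] packed) {
        int bits = ((packed[0] & 0xFF) << 16) | ((packed[1] & 0xFF) << 8) | (packed[2] & 0xFF);
        byte[] out = new byte[4];
        for (int i = 3; i >= 0; i--) {
            out[i] = fromCode(bits & 0x3F);
            bits >>>= 6;
        }
        return out;
    }
}
```

Uppercase letters fall outside both ranges, which is why the scheme suits analyzed (typically lowercased) content.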
Robert Muir
a29a4f4aa5
LUCENE-9168: don't let crazy tests run us out of open files with these params 2020-01-24 08:46:50 -05:00
Cassandra Targett
64cb1c8fe8
SOLR-12930: Create developer docs directories in source repo () 2020-01-23 14:00:23 -06:00
Robert Muir
f440fbdf59
LUCENE-9083: throw assumption if you try to remap /dev to /dev with this test mock 2020-01-22 21:58:52 -05:00
Robert Muir
1051db4038
LUCENE-9163: test speedup for slowest/pathological tests
Calming down individual test methods with double-digit execution times
after running tests many times.

There are a few more issues remaining, but this solves the majority of them.
2020-01-22 17:49:33 -05:00
Robert Muir
8fd3fbd93c
TestPointValues: only index 300k docs in the NIGHTLY configuration, as that is too much locally 2020-01-22 10:27:15 -05:00
Robert Muir
b7694535eb
mark StressRamUsageEstimator tests nightly.
This is consistently the slowest test for me in all of lucene core by
far. Takes around an entire minute. Mark it nightly: should catch any
issues with RAM estimation but keep local builds fast.
2020-01-22 10:19:44 -05:00
Robert Muir
9dae566ee7
LUCENE-9160: add params/docs to override jvm params in gradle build, default C2 off in tests.
Adds some build parameters to tune how tests run. There is an example
shown by "gradle helpLocalSettings"

Default C2 off in tests as it is wasteful locally and slows down
test runs. You can override this by setting tests.jvmargs for gradle,
or args for ant.

Some crazy lucene stress tests may need to be toned down after the
change, as they may have been doing too many iterations by default...
but this is not a new problem.
2020-01-22 09:58:30 -05:00
Robert Muir
3ecd7a03aa
LUCENE-9159: merge gradle/ant test security policies (main file) 2020-01-21 23:43:31 -05:00
Robert Muir
7e0534d87c
LUCENE-9159: merge gradle/ant test security policies 2020-01-21 21:26:37 -05:00
Robert Muir
c754a764d4
LUCENE-9157: test speedup for slowest tests 2020-01-21 19:27:19 -05:00
Mike
ec6a9aab09
LUCENE-9098 Use multibyte code-points for complex fuzzy query () 2020-01-21 12:16:42 -06:00
Bruno Roustant
8894babd4a
LUCENE-9135: Make UniformSplit FieldMetadata counters long.
Closes 
2020-01-21 11:24:26 +01:00
Adrien Grand
bddb06b650 CompetitiveImpactAccumulator should protect its costly invariant checks behind an assert. 2020-01-20 11:16:09 +01:00
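
Guarding a costly invariant check with `assert` means the check only runs when assertions are enabled (for example with `java -ea`, as test runs commonly do). The sketch below shows the general Java idiom; `AssertGuardSketch` and its sortedness invariant are hypothetical and unrelated to the actual checks in CompetitiveImpactAccumulator.

```java
public class AssertGuardSketch {
    private final int[] values;

    public AssertGuardSketch(int[] sortedValues) {
        this.values = sortedValues.clone();
        // The O(n) check is skipped entirely unless assertions are enabled.
        assert isSorted() : "values must be sorted";
    }

    /** Costly invariant check; returns a boolean so it can sit behind an assert. */
    private boolean isSorted() {
        for (int i = 1; i < values.length; i++) {
            if (values[i - 1] > values[i]) {
                return false;
            }
        }
        return true;
    }

    /** Relies on the invariant: the last element is the maximum. */
    public int max() {
        return values[values.length - 1];
    }
}
```

Writing the check as a boolean method called from an `assert` statement, rather than calling it unconditionally, keeps production code paths free of the extra work.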
Nicholas Knize
aad849bf87 LUCENE-8621: Refactor LatLonShape, XYShape, and all query and utility classes from sandbox to core 2020-01-17 14:34:40 -06:00
Mike
338d386ae0
LUCENE-9145 First pass addressing static analysis ()
Fixed a bunch of the smaller warnings found by error-prone compiler
plugin, while ignoring a lot of the bigger ones.
2020-01-17 13:30:39 -06:00
Mike McCandless
8147e491ce LUCENE-9053: improve FST's package-info.java comment to clarify required (Unicode code point) sort order for FST.Builder 2020-01-17 13:35:05 -05:00