Commit Graph

1617 Commits

Author SHA1 Message Date
Ignacio Vera 0cef29f138
LUCENE-9417: Tessellator might fail when several holes are connected to the same vertex (#1614) 2020-06-29 17:46:21 +02:00
Simon Willnauer 7f352a9665
LUCENE-8962: Merge small segments on commit (#1617)
Add IndexWriter merge-on-commit feature to selectively merge small segments on commit,
subject to a configurable timeout, to improve search performance by reducing the number of small
segments for searching.

Co-authored-by: Michael Froh <msfroh@apache.org>
Co-authored-by: Michael Sokolov <sokolov@falutin.net>
Co-authored-by: Mike McCandless <mikemccand@apache.org>
2020-06-27 22:25:45 +02:00
Mayya Sharipova b0333ab5c8
LUCENE-9280: Collectors to skip noncompetitive documents (#1351)
Similar to how scorers can update their iterators to skip non-competitive
documents, collectors and comparators should also provide and update
iterators that allow them to skip non-competitive documents.
2020-06-23 16:04:58 -04:00
Tomas Fernandez Lobbe 4774c6f0c1
Include delegate in AssertingSimilarity toString (#1596) 2020-06-22 16:38:00 -07:00
Michael Sokolov 5d43e73c66 Revert "LUCENE-8962: add ability to selectively merge on commit (#1552)"
This reverts commit 972c84022e.
2020-06-22 17:35:49 -04:00
Michael Sokolov 972c84022e
LUCENE-8962: add ability to selectively merge on commit (#1552)
Co-authored-by: Michael Froh <msfroh@apache.org>
Co-authored-by: Simon Willnauer <simonw@apache.org>
2020-06-18 16:56:29 -04:00
Adrien Grand 87a3bef50f
LUCENE-9353: Move terms metadata to its own file. (#1473) 2020-06-16 15:05:28 +02:00
Michael Sokolov 26075fc1dc
LUCENE-9394: fix and suppress warnings (#1563)
* LUCENE-9394: fix and suppress warnings in lucene/*
* Change type of ValuesSource context from raw Map to Map<Object, Object>
2020-06-12 07:25:31 -04:00
Bruno Roustant 75d25ad677
LUCENE-9397: UniformSplit supports encodable fields metadata. 2020-06-11 18:19:48 +02:00
Adrien Grand 54c5dd7d6d
LUCENE-9148: Move the BKD index to its own file. (#1475) 2020-06-09 09:59:14 +02:00
Alan Woodward de2bad9039
LUCENE-9330: Make SortFields responsible for index sorting and serialization (#1440)
This commit adds a new class IndexSorter which handles how a sort should be applied
to documents in an index:

* how to serialize/deserialize sort info in the segment header
* how to sort documents within a segment
* how to sort documents from merging segments

SortField has a getIndexSorter() method, which will return null if the sort cannot be used
to sort an index (e.g. if it uses scores or other query-dependent values). This also requires a
new Codec as there is a change to the SegmentInfoFormat.
2020-05-22 13:33:06 +01:00
Erick Erickson 21b08d5cab LUCENE-9376: Fix or suppress 20 resource leak precommit warnings in lucene/search 2020-05-21 20:29:18 -04:00
Uwe Schindler 06df50e759
LUCENE-9321: Port markdown task to Gradle (#1477) 2020-05-17 14:46:26 +02:00
Mike McCandless 1783c4ad47 LUCENE-9191: ensure LineFileDocs random seeking effort does not seek into the middle of a multi-byte UTF-8 encoded Unicode character 2020-05-04 13:29:00 -04:00
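A minimal sketch of the general idea behind avoiding a seek into the middle of a multi-byte UTF-8 character. This is not the LineFileDocs code; the class and method names (`Utf8SeekSketch`, `alignToCharStart`) are hypothetical. It relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx, so a random position can be advanced until it no longer points at one.

```java
public class Utf8SeekSketch {
    // A UTF-8 continuation byte has the bit pattern 10xxxxxx.
    static boolean isContinuationByte(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Hypothetical helper: moves pos forward until it points at the first
    // byte of a character, or at the end of the data.
    static int alignToCharStart(byte[] data, int pos) {
        while (pos < data.length && isContinuationByte(data[pos])) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), 'b' (1 byte)
        byte[] data = {
            0x61,
            (byte) 0xC3, (byte) 0xA9,
            (byte) 0xE2, (byte) 0x82, (byte) 0xAC,
            0x62
        };
        // Position 4 is inside the 3-byte character, so it is moved to 6.
        System.out.println(alignToCharStart(data, 4));
    }
}
```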
Simon Willnauer 207d240ae2 Fix tests to survive nightly runs with many documents 2020-04-29 22:11:42 +02:00
Simon Willnauer bc4da80776
Fix visibility on member variables in IndexWriter and friends (#1460)
Today it looks like wild wild west inside IndexWriter and some of its
associated classes. This change makes sure all non-final members have
private visibility, and methods that are not used outside of IW today are
made private unless they have been public. This change also removes
some unused or unnecessary members where possible and deletes some dead
code from previous refactoring.
2020-04-27 17:49:20 +02:00
Simon Willnauer d7e0b906ab
LUCENE-9345: Separate MergeScheduler from IndexWriter (#1451)
This change extracts the methods that are used by MergeScheduler into
a MergeSource interface. This allows IndexWriter to better ensure
locking and hide internal methods, and removes the tight coupling between the two
complex classes. This will also improve future testing.
2020-04-24 15:02:55 +02:00
Simon Willnauer 83018deef7 Ensure we use a sane IWC for tests adding many documents.
This test produced tons of files on nightly builds causing
TooManyOpenFilesExceptions likely due to not using CFS on flush
and/or very small maxMergeSize values.
2020-04-24 08:36:06 +02:00
Simon Willnauer 4a98918bfa
LUCENE-9339: Only call MergeScheduler when we actually found new merges (#1445)
IW#maybeMerge calls the MergeScheduler even if it didn't find any merges. We should instead only do this if there is in fact anything to merge, and save the call into a sync'd method.
2020-04-22 21:26:45 +02:00
Mike McCandless e0c06ee6a6 LUCENE-9191: make LineFileDocs random seeking more efficient by recording safe skip points in the concatenated gzip'd chunks 2020-04-21 12:09:17 -04:00
Simon Willnauer 113043b1ed
LUCENE-9324: Add an ID to SegmentCommitInfo (#1434)
We already have IDs in SegmentInfo, as well as on SegmentInfos, which are useful to uniquely identify segments and entire commits. Having IDs on SegmentCommitInfo is useful too in
order to compare commits for equality and make snapshots incremental on generational files.
This change adds a unique ID to SegmentCommitInfo starting from Lucene 8.6. Older segments won't have an ID until the segment receives an update or a delete even if they have been opened and / or committed by Lucene 8.6 or above.
2020-04-18 14:24:57 +02:00
Adrien Grand 0aa4ba7ccb
LUCENE-9260: Verify checksums of CFS files. (#1311) 2020-04-15 15:10:59 +02:00
Simon Willnauer 2602269f3e
LUCENE-9304: Refactor DWPTPool to pool DWPT directly (#1397)
This change removes the ThreadState indirection from DWPTPool and pools DWPT directly. The tracking information and locking semantics are mostly moved to DWPT directly, and the pool semantics have changed slightly such that DWPTs need to be checked out of the pool once they need to be flushed or aborted. This automatically grows and shrinks the number of DWPTs in the system when the number of threads grows or shrinks. Access to pooled DWPTs is more straightforward and doesn't require an ordinal. Instead consumers can just iterate over the elements in the pool.
This allowed for removal of indirections in DWPTFlushControl like BlockedFlush, the removal of DWPTPool setter and getter in IndexWriterConfig and the addition of stronger assertions in DWPT and DW.
2020-04-11 12:23:46 +02:00
Bruno Roustant c7cf9e8e4f
LUCENE-9254: UniformSplit supports FST off-heap.
Closes #1301
2020-03-09 16:35:42 +01:00
Michael Sokolov 4501b3d3fd Revert "LUCENE-8962: Split test case (#1313)"
This reverts commit 90aced5a51.

Revert "LUCENE-8962: woops, remove leftover accidental copyright (darned IDEs)"

This reverts commit 3dbfd10279.

Revert "LUCENE-8962: Fix intermittent test failures"

This reverts commit a5475de57f.

Revert "LUCENE-8962: Add ability to selectively merge on commit (#1155)"

This reverts commit a1791e7714.
2020-03-08 18:27:54 -04:00
Bruno Roustant c73d2c15ba
LUCENE-9257: Always keep FST off-heap. Remove SegmentReadState.openedFromWriter. 2020-03-06 14:24:12 +01:00
Robert Muir 624f5a3c2f
LUCENE-9264: Remove SimpleFSDirectory in favor of NIOFSDirectory
Closes #1321
2020-03-06 05:42:22 -05:00
Bruno Roustant 9733643466
LUCENE-9257: Always keep FST off-heap. Remove FSTLoadMode and Reader attributes.
Closes #1320
2020-03-06 11:15:09 +01:00
Yannick Welsch 8a88dd02c6 Remove SimpleFSDirectory in favor of NIOFSDirectory 2020-03-06 00:04:25 +01:00
Ignacio Vera c313365c5f
LUCENE-9251: Filter equal edges with different value on isEdgeFromPolygon (#1290)
Fix bug in the polygon tessellator where edges with different values on #isEdgeFromPolygon were not filtered out properly
2020-03-03 07:07:34 +01:00
msfroh 043c5dff6f
LUCENE-8962: Add ability to selectively merge on commit (#1155)
* LUCENE-8962: Add ability to selectively merge on commit

This adds a new "findCommitMerges" method to MergePolicy, which can
specify merges to be executed before the
IndexWriter.prepareCommitInternal method returns.

If we have many index writer threads, they will flush their DWPT buffers
on commit, resulting in many small segments, which can be merged before
the commit returns.

* Add missing Javadoc

* Fix incorrect comment

* Refactoring and fix intermittent test failure

1. Made some changes to the callback to update toCommit, leveraging
SegmentInfos.applyMergeChanges.
2. I realized that we'll never end up with 0 registered merges, because
we throw an exception if we fail to register a merge.
3. Moved the IndexWriterEvents.beginMergeOnCommit notification to before
we call MergeScheduler.merge, since we may not be merging on another
thread.
4. There was an intermittent test failure due to randomness in the time
it takes for merges to complete. Before doing the final commit, we wait
for pending merges to finish. We may still end up abandoning the final
merge, but we can detect that and assert that either the merge was
abandoned (and we have > 1 segment) or we did merge down to 1 segment.

* Fix typo

* Fix/improve comments based on PR feedback

* More comment improvements from PR feedback

* Rename method and add new MergeTrigger

1. Renamed findCommitMerges -> findFullFlushMerges.
2. Added MergeTrigger.COMMIT, passed to findFullFlushMerges and to
   MergeScheduler when merging on commit.

* Update renamed method name in strings and comments
2020-03-02 12:19:47 -05:00
Adrien Grand f4b5069c1f LUCENE-9247: Fix test failures with ExtraFS. 2020-03-02 08:59:13 +01:00
Adrien Grand c929b65c81 LUCENE-9247: Exclude `write.lock` from files whose integrity is expected to be verified. 2020-02-29 08:46:16 +01:00
Adrien Grand 30944d3520 LUCENE-9247: Fix class visibility issue discovered when backporting. 2020-02-28 15:34:36 +01:00
Adrien Grand cd984e2dc0
LUCENE-9247: Add tests for `checkIntegrity`. (#1284)
This adds a test to `BaseIndexFileFormatTestCase` that the combination
of opening a reader and calling `checkIntegrity` on it reads all bytes
of all files (including index headers and footers). This would help
detect most cases when `checkIntegrity` is not implemented correctly.
2020-02-28 14:19:56 +01:00
Ignacio Vera 88dd1c3f3d
LUCENE-9238: Add new XYPointField, queries and sorting capabilities (#1272)
New XYPointField field and Queries for indexing, searching and sorting cartesian points.
2020-02-21 11:26:30 +01:00
Ignacio Vera d48bafb299
LUCENE-8707: Add LatLonShape and XYShape distance query (#587) 2020-02-19 16:03:30 +01:00
markharwood 79a4a680e7 Test fix - new binary doc values test could use invalid values. 2020-02-19 09:14:14 +00:00
markharwood ce2959fe4c
LUCENE-9211 Add compression for Binary doc value fields (#1234)
Stores groups of 32 binary doc values in LZ4-compressed blocks.
2020-02-18 14:02:42 +00:00
Robert Muir f41eabdc5f
LUCENE-8279: fix javadocs wrong header levels and accessibility issues
Java 13 adds a new doclint check under "accessibility" that the html
header nesting level isn't crazy.

Many are incorrect because the html4-style javadocs had horrible
font-sizes, so developers used the wrong header level to work around it.
This is no issue in trunk (always html5).

Java recommends against using such structured tags at all in javadocs,
but that is a more involved change: this just "shifts" header levels
in documents to be correct.
2020-02-08 10:00:00 -05:00
Nicholas Knize 206a70e7b7 LUCENE-9149: Increase data dimension limit in BKD 2020-02-07 16:08:14 -06:00
Robert Muir 0d339043e3
LUCENE-9209: fix javadocs to be html5, enable doclint html checks, remove jtidy
Current javadocs declare an HTML5 doctype: <!DOCTYPE HTML>. Some HTML5
features are used, but unfortunately also some constructs that do not
exist in HTML5 are used as well.

Because of this, we have no checking of any html syntax. jtidy is
disabled because it works with html4. doclint is disabled because it
works with html5. our docs are neither.

javadoc "doclint" feature can efficiently check that the html isn't
crazy. we just have to fix really ancient removed/deprecated stuff
(such as use of tt tag).

This enables the html checking in both ant and gradle. The docs are
fixed via straightforward transformations.

One exception is table cellpadding, for this some helper CSS classes
were added to make the transition easier (since it must apply padding
to inner th/td, not possible inline). I added TODOs, we should clean
this up. Most problems look like they may have been generated from a
GUI or similar and not a human.
2020-02-06 22:30:52 -05:00
Adrien Grand 136dcbdbbc
LUCENE-9147: Move the stored fields index off-heap. (#1179)
This replaces the index of stored fields and term vectors with two
`DirectMonotonic` arrays. `DirectMonotonicWriter` needs to know the number
of values to write up-front, so incoming doc IDs and file pointers are buffered
on disk using temporary files that never get fsynced, but have index headers
and footers to make sure any corruption in these files wouldn't propagate to the
index.

`DirectMonotonicReader` gets a specialized `binarySearch` implementation that
leverages the metadata in order to avoid going to the IndexInput whenever
possible. Actually in the common case, it would only go to a single
sub `DirectReader` which, combined with the size of blocks of 1k values, helps
bound the number of page faults to 2.
2020-02-05 18:35:08 +01:00
Mike McCandless 47386f8cca LUCENE-9200: consistently use double (not float) math for TieredMergePolicy's decisions, to fix a corner-case bug uncovered by randomized tests 2020-02-05 09:51:31 -05:00
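The commit message above does not say which corner case the randomized tests hit, so the following is only a generic illustration of how float and double math can disagree on a size comparison. The class and method names (`FloatVsDoubleSketch`, `largerAsFloat`, `largerAsDouble`) are hypothetical and unrelated to TieredMergePolicy's actual code. A float has a 24-bit significand, so two long values that differ by one above 2^24 can collapse to the same float.

```java
public class FloatVsDoubleSketch {
    // Compares two sizes after narrowing them to float.
    static boolean largerAsFloat(long a, long b) {
        return (float) a > (float) b;
    }

    // Compares the same sizes using double, which is exact for these values.
    static boolean largerAsDouble(long a, long b) {
        return (double) a > (double) b;
    }

    public static void main(String[] args) {
        long a = 16_777_217L; // 2^24 + 1, not exactly representable as float
        long b = 16_777_216L; // 2^24
        System.out.println(largerAsFloat(a, b));  // false: both round to 2^24
        System.out.println(largerAsDouble(a, b)); // true
    }
}
```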
Robert Muir 9ceaff913e
LUCENE-9195: more slow tests fixes 2020-01-31 07:57:34 -05:00
Robert Muir 29469b454f
LUCENE-9192: speed up more slow tests 2020-01-29 14:31:32 -05:00
Robert Muir 3bcc97c8eb
LUCENE-9186: remove linefiledocs usage from BaseTokenStreamTestCase 2020-01-28 11:55:51 -05:00
Robert Muir 975df9ddd3
LUCENE-9182: add apache license headers to all .gradle files and enforce in rat task 2020-01-27 12:05:34 -05:00
Robert Muir fddb5314fc
LUCENE-9172: nuke some compiler warnings 2020-01-27 06:08:30 -05:00
Robert Muir c53cc3edaf
LUCENE-9167: test speedup for slowest/pathological tests (round 3) 2020-01-24 08:58:59 -05:00