Commit Graph

33250 Commits

Author SHA1 Message Date
Dawid Weiss 3a8ed5e8ed LUCENE-9134: add python-based regeneration of HTMLCharacterEntities.jflex inside jflexHTMLStripCharFilter. 2020-01-30 13:48:16 +01:00
Dawid Weiss 043dd207b6 LUCENE-9080: this jflex file got corrupted somehow during previous commit. I regenerated it with ant, along with the final java file. I also added a crlf normalization, encoding and forced-regeneration to ant because it didn't work before. 2020-01-30 13:09:47 +01:00
Adrien Grand 13e2094804 LUCENE-4702: Improve performance for fuzzy queries.
Fuzzy queries with an edit distance of 1 or 2 must visit all blocks whose prefix
length is 1 or 2. By not compressing those, we can trade very little space (a
couple MBs in the case of the wikibigall index) for better query efficiency.
2020-01-30 10:37:39 +01:00
Ignacio Vera a9482911a8
LUCENE-9141: Simplify LatLonShapeXQuery API by adding a new abstract class called LatLonGeometry. (#1170) 2020-01-30 08:03:22 +01:00
Robert Muir 29469b454f
LUCENE-9192: speed up more slow tests 2020-01-29 14:31:32 -05:00
Ignacio Vera c98229948a
LUCENE-9152: Improve line intersection detection for polygons (#1187) 2020-01-29 19:24:51 +01:00
Dawid Weiss e25dac085f LUCENE-9134: this adds initial javacc support (without follow-up tweaks required to make the sources identical as those generated by ant). 2020-01-29 17:02:59 +01:00
Adrien Grand 7941d109bd SOLR-13897: Fix precommit. 2020-01-28 20:11:47 +01:00
Adrien Grand 92b684c647
LUCENE-9161: DirectMonotonicWriter checks for overflows. (#1197) 2020-01-28 19:06:53 +01:00
Adrien Grand 6eb8834a57
LUCENE-4702: Reduce terms dictionary compression overhead. (#1216)
Changes include:
 - Removed LZ4 compression of suffix lengths which didn't save much space
   anyway.
 - For stats, LZ4 was only really used for run-length compression of terms whose
   docFreq is 1. This has been replaced by explicit run-length compression.
 - Since we only use LZ4 for suffix bytes if the compression ration is < 75%, we
   now only try LZ4 out if the average suffix length is greater than 6, in order
   to reduce index-time overhead.
2020-01-28 18:38:30 +01:00
Robert Muir 4773574578
LUCENE-9189: TestIndexWriterDelete.testDeletesOnDiskFull can run for minutes
The issue is that MockDirectoryWrapper's disk full check is horribly
inefficient. On every writeByte/etc, it totally recomputes disk space
across all files. This means it calls listAll() on the underlying
Directory (which sorts all the underlying files), then sums up fileLength()
for each of those files.

This leads to many pathological cases in the disk full tests... but the
number of tests impacted by this is minimal, and the logic is scary.
2020-01-28 12:24:31 -05:00
Robert Muir 3bcc97c8eb
LUCENE-9186: remove linefiledocs usage from BaseTokenStreamTestCase 2020-01-28 11:55:51 -05:00
Robert Muir 4350efa932
LUCENE-9187: remove too-expensive assert from LZ4 HighCompressionHashTable 2020-01-28 11:45:43 -05:00
Robert Muir e504798a44
LUCENE-9185: add "tests.profile" to gradle build to aid fixing slow tests
Run test(s) with -Ptests.profile=true to print a histogram at the end of
the build.
2020-01-28 11:27:18 -05:00
Cassandra Targett 1a14c67426 Ref Guide: Remove outdated or invalid links to Solr Wiki; update URL of those that remain 2020-01-27 16:38:31 -06:00
Cassandra Targett b2f51f1941 Ref Guide: fix undefined substitution error caused by formatting of variables in paths 2020-01-27 16:38:30 -06:00
Jan Høydahl 53f7b394e4 SOLR-11207: Mute warnings for owasp false positives 2020-01-27 21:03:20 +01:00
Dawid Weiss ff635cf701 LUCENE-9184, LUCENE-9183: allow skipping git status check in precommit with -Pvalidation.git.failOnModified=false (or place this in gradle.properties to make it permanent). 2020-01-27 20:47:02 +01:00
Uwe Schindler 7dc35e3a62 Let precommit depend on generic forbiddenApis task 2020-01-27 19:47:54 +01:00
Adrien Grand 9e4c445d17 LUCENE-4702: CHANGES entry. 2020-01-27 18:27:53 +01:00
Robert Muir fd5a0ce7c2
LUCENE-9182: the rat-sources.gradle was the one .gradle file already with a license header, we don't need it twice 2020-01-27 12:11:44 -05:00
Robert Muir 975df9ddd3
LUCENE-9182: add apache license headers to all .gradle files and enforce in rat task 2020-01-27 12:05:34 -05:00
Dawid Weiss b420ef8f77 LUCENE-9179: don't invoke the same build recursively upon first run, just continue. Seems like gradle bug but let's not cry about it - it just happens once and CI defaults can be passed independently on command-line. 2020-01-27 17:34:13 +01:00
Robert Muir 8e357b167b
LUCENE-9180: dos2unix files that don't need dos line endings 2020-01-27 11:29:59 -05:00
Dawid Weiss a3b0cfcbe2 Moved under help/ 2020-01-27 17:23:41 +01:00
Dawid Weiss 6bde0f3ec8 LUCENE-9134: UAX29URLEmailTokenizerImpl regeneration. This requires TONS
of memory and time... insane compared to the size of the input. None of my
machines pass it without at least 12 gigs of heap (!).
2020-01-27 12:36:13 +01:00
Jan Høydahl 39df74de37 SOLR-11207: Exclude configuration 'unifiedClasspath'
It is generated by consistent-versions plugin and triggers owasp warnings for deps even for excluded projects
2020-01-27 12:17:31 +01:00
Robert Muir 2bb63afdaf
LUCENE-9166: gradle build: test failures need stacktraces 2020-01-27 06:09:04 -05:00
Robert Muir fddb5314fc
LUCENE-9172: nuke some compiler warnings 2020-01-27 06:08:30 -05:00
Robert Muir 5f964eeef2
SOLR-14217: tests respect tests.workDir correctly (prevent SSD destruction) 2020-01-27 06:07:48 -05:00
Alan Woodward 02f862670e
LUCENE-9153: Allow WhitespaceAnalyzer to set a custom maxTokenLen (#1198)
WhitespaceTokenizer defaults to a maximum token length of 255, and WhitespaceAnalyzer
does not allow this to be changed. This commit adds an optional maxTokenLen parameter
to WhitespaceAnalyzer as well, and documents the existing token length restriction.
2020-01-27 09:22:25 +00:00
Jan Høydahl 9ddd05cd14 SOLR-11207: Exclude solr-ref-guide from owasp check
It picked up log4j1 dependency only used during build
2020-01-27 09:55:12 +01:00
Ignacio Vera 1fe4177ac0
LUCENE-9176: Handle the case when there is only one leaf node in TestEstimatePointCount (#1212) 2020-01-27 09:52:25 +01:00
Dawid Weiss ae95f0ab68 LUCENE-9134: lucene:core:jflexStandardTokenizerImpl 2020-01-27 09:03:19 +01:00
Shalin Shekhar Mangar 776631254f SOLR-13897: Fix unsafe publication of Terms object in ZkShardTerms that can cause visibility issues and race conditions under contention 2020-01-27 12:08:20 +05:30
Dawid Weiss 6f85ec0460 LUCENE-9174: Bump default gradle memory to 2g 2020-01-26 18:27:41 +01:00
Uwe Schindler fd49c903b8 SOLR-14189: Add changes entry 2020-01-26 12:06:13 +01:00
andywebb1975 efd0e8f3e8 SOLR-14189 switch from String.trim() to StringUtils.isBlank() (#1172) 2020-01-26 12:03:39 +01:00
Uwe Schindler 0635756f76 Fix Windows Line endings in the source-patterns checker (silly bug: it's \r\n on windows not the other way round) 2020-01-26 11:48:24 +01:00
Dawid Weiss 5ab59f59ac SOLR-11207: minor changes:
- added 'owasp' task to the root project. This depends on
dependencyCheckAggregate which seems to be a better fit for multi-module
projects than dependencyCheckAnalyze (the difference is vague to me
from plugin's documentation).

- you can run the "gradlew owasp" task explicitly and it'll run the
validation without any flags.

- the owasp task is only added to check if validation.owasp property
is true. I think this should stay as the default on non-CI systems
(developer defaults) because it's a significant chunk of time it takes
to download and validate dependencies.

- I'm not sure *all* configurations should be included in the check...
perhaps we should only limit ourselves to actual runtime dependencies
 not build dependencies, solr-ref-guide, etc.
2020-01-26 10:45:05 +01:00
Jan Høydahl 74a8d6d5ac SOLR-11207: Add OWASP dependency checker to gradle build (#1121)
* SOLR-11207: Add OWASP dependency checker to gradle build
2020-01-26 10:01:51 +01:00
Gus Heck 127ce3e360 SOLR-13749 adjust changes to reflect backport to 8.5 2020-01-25 00:53:27 -05:00
Cassandra Targett 74e88deba7 Revert "SOLR-12930: move Gradle docs from ./help/ to new ./dev-docs/ directory"
This reverts commit 2d8650d36c.
2020-01-24 15:56:00 -06:00
Cassandra Targett ba77a5f2eb SOLR-14214: Clean up client lists and references 2020-01-24 15:46:30 -06:00
Mike eaa3dbe440
SOLR-14162 TestInjection can leak Timer objects (#1137) 2020-01-24 14:04:22 -06:00
Paul Merlin 24f7a28ac1 Add Github Workflow for Gradle Wrapper Validation (#1207) 2020-01-24 20:42:30 +01:00
Robert Muir f5e9bb9493
LUCENE-9165: explicitly cast with the horrible groovy language so that numbers above 9 don't fail 2020-01-24 09:53:47 -05:00
Robert Muir c53cc3edaf
LUCENE-9167: test speedup for slowest/pathological tests (round 3) 2020-01-24 08:58:59 -05:00
Robert Muir 4d61e4aaab
change generate-defaults.gradle not to cap testsJvms at 4 2020-01-24 08:49:17 -05:00
Adrien Grand b283b8df62
LUCENE-4702: Terms dictionary compression. (#1126)
Compress blocks of suffixes in order to make the terms dictionary more
space-efficient. Two compression algorithms are used depending on which one is
more space-efficient:
 - LowercaseAsciiCompression, which applies when all bytes are in the
   `[0x1F,0x3F)` or `[0x5F,0x7F)` ranges, which notably include all digits,
   lowercase ASCII characters, '.', '-' and '_', and encodes 4 chars on 3 bytes.
   It is very often applicable on analyzed content and decompresses very quickly
   thanks to auto-vectorization support in the JVM.
 - LZ4, when the compression ratio is less than 0.75.

I was a bit unhappy with the complexity of the high-compression LZ4 option, so
I simplified it in order to only keep the logic that detects duplicate strings.
The logic about what to do in case overlapping matches are found, which was
responsible for most of the complexity while only yielding tiny benefits, has
been removed.
2020-01-24 14:46:57 +01:00