lucene

Commit Graph

Author	SHA1	Message	Date
Dawid Weiss	f8a2c39906	LUCENE-9155: add missing naist dictionary generation, clean up the code a bit.	2020-02-21 10:24:05 +01:00
Robert Muir	9302eee1e0	LUCENE-9235: upgrade all python to python3 Die, python2, die. Some generated .java files change (parameterized automata for spell-correction). This is because the order of python dictionaries was not well-defined previously. A sort() was added so that the python code now generates reproducible output (Thanks @mikemccand). So we'll suffer a change once, but the automata are equivalent. If you run the script again you should not see source code changes. The relevant unit tests are exhaustive (if you trust the paper!), so we can be confident it does not break things, even though it looks very scary.	2020-02-20 21:27:38 -05:00
Anshum Gupta	cb18586ea0	LUCENE-9155: Add Apache License header to the Kuromoji dictionary compilation (#1271)	2020-02-20 14:59:06 -08:00
Dawid Weiss	62662e477a	LUCENE-9155: Port Kuromoji dictionary compilation (regenerate).	2020-02-20 19:00:56 +01:00
Robert Muir	b9a569e7be	LUCENE-9230: explicitly call python version we want from builds On newer linux distros, at least, 'python' now means python3. So we can't rely on what version of python it will invoke (at least for a few years). For example in Fedora Linux: https://fedoraproject.org/wiki/Changes/Python_means_Python3 For python2.x code, explicitly call 'python2.7' and for python3.x code, explicitly call 'python3'. Ant variable names are cleaned up, e.g. 'python.exe' is renamed to 'python2.exe' and 'python32.exe' is renamed to 'python3.exe'. This also makes it easy to identify remaining python 2.x code that should be migrated to python 3.x	2020-02-18 18:58:17 -05:00
Robert Muir	ccb390d4a6	LUCENE-9220: prevent zip file reproducibility issues based on users umask	2020-02-17 13:34:00 -05:00
Robert Muir	0203815ab2	LUCENE-9220: regenerate all stemmers/stopwords/test data from snowball 2.0 (#1262) Previous situation: * The snowball base classes (Among, SnowballProgram, etc) had accumulated local performance-related changes. There was a task that would also "patch" generated classes (e.g. GermanStemmer) after-the-fact. * Snowball classes had many "non-changes" from the original such as removal of tabs addition of javadocs, license headers, etc. * Snowball test data (inputs and expected stems) was incorporated into lucene testing, but this was maintained manually. Also files had become large, making the test too slow (Nightly). * Snowball stopwords lists from their website were manually maintained. In some cases encoding fixes were manually applied. * Some generated stemmers (such as Estonian and Armenian) exist in lucene, but have no corresponding `.sbl` file in snowball sources at all. Besides this mess, snowball project is "moving along" and acquiring new languages, adding non-BSD-licensed test data, huge test data, and other complexity. So it is time to automate the integration better. New situation: * Lucene has a `gradle snowball` regeneration task. It works on Linux or Mac only. It checks out their repos, applies the `snowball.patch` in our repository, compiles snowball stemmers, regenerates all java code, applies any adjustments so that our build is happy. * Tests data is automatically regenerated from the commit hash of the snowball test data repository. Not all languages are tested from their data: only where the license is simple BSD. Test data is also (deterministically) sampled, so that we don't have huge files. We just want to make sure our integration works. * Randomized tests are still set to test every language with generated fake words. The regeneration task ensures all languages get tested (it writes a simple text file list of them). * Stopword files are automatically regenerated from the commit hash of the snowball website repository. * The regeneration procedure is idempotent. This way when stuff does change, you know exactly what happened. For example if test data changes to a different license, you may see a git deletion. Or if a new language/stopwords/test data gets added, you will see git additions.	2020-02-17 12:38:01 -05:00
Dawid Weiss	dcf448efeb	LUCENE-9134: Minor cleanups.	2020-02-13 11:18:01 +01:00
Erick Erickson	f9357ab0d2	LUCENE-9134: Port ant-regenerate tasks to Gradle build (util and packed) (#1251) * LUCENE-9134: Port ant-regenerate tasks to Gradle build	2020-02-11 18:56:11 -05:00
Erick Erickson	b0bb299dc4	LUCENE-9134: Port ant-regenerate tasks to Gradle build (#1230) LUCENE-9134: Port ant-regenerate tasks to Gradle build (Solr javacc)	2020-02-04 09:16:38 -05:00
Erick Erickson	5253c0cb74	LUCENE-9134 Port ant-regenerate tasks to Gradle build (#1226) LUCENE-9134: Port ant-regenerate tasks to Gradle build Javacc sub-task. Closes #1226	2020-01-31 17:04:10 -05:00
Dawid Weiss	3a8ed5e8ed	LUCENE-9134: add python-based regeneration of HTMLCharacterEntities.jflex inside jflexHTMLStripCharFilter.	2020-01-30 13:48:16 +01:00
Dawid Weiss	e25dac085f	LUCENE-9134: this adds initial javacc support (without follow-up tweaks required to make the sources identical as those generated by ant).	2020-01-29 17:02:59 +01:00
Robert Muir	975df9ddd3	LUCENE-9182: add apache license headers to all .gradle files and enforce in rat task	2020-01-27 12:05:34 -05:00
Dawid Weiss	6bde0f3ec8	LUCENE-9134: UAX29URLEmailTokenizerImpl regeneration. This requires TONS of memory and time... insane compared to the size of the input. None of my machines pass it without at least 12 gigs of heap (!).	2020-01-27 12:36:13 +01:00
Dawid Weiss	ae95f0ab68	LUCENE-9134: lucene:core:jflexStandardTokenizerImpl	2020-01-27 09:03:19 +01:00
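
The reproducibility fix described in the LUCENE-9235 message (adding a sort() so that generated output no longer depends on dictionary order) can be illustrated with a small sketch. This is not the actual Lucene generator script: `emit_cases` and its output format are hypothetical, chosen only to show that iterating over sorted keys makes the generated text independent of insertion order.

```python
def emit_cases(table):
    """Render a {state: target} mapping as source lines, in sorted key order.

    Hypothetical helper for illustration; not taken from the Lucene scripts.
    """
    lines = [f"case {state}: return {table[state]};" for state in sorted(table)]
    return "\n".join(lines)


# Two mappings with the same content but different insertion order.
first = {3: 7, 1: 4, 2: 9}
second = {2: 9, 3: 7, 1: 4}

# Because keys are sorted before emission, both produce identical text.
assert emit_cases(first) == emit_cases(second)
```

Running such a generator twice on the same input then yields no source diff, which matches the behaviour the commit message says to expect after the one-time change.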

16 Commits
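
The LUCENE-9220 message says test data is "(deterministically) sampled" and that regeneration is idempotent, but it does not say how the sampling is done. One way to get both properties, shown here purely as an assumption, is to decide per line from a hash of its content. The function name `sample_lines` and the `keep_one_in` parameter are hypothetical.

```python
import hashlib


def sample_lines(lines, keep_one_in=10):
    """Keep roughly one line in `keep_one_in`, chosen by content hash.

    Assumed technique for illustration only. The decision for each line
    depends solely on its text, so repeated runs select the same lines.
    """
    kept = []
    for line in lines:
        digest = hashlib.sha256(line.encode("utf-8")).digest()
        # Interpret the first 8 bytes of the digest as an unsigned integer.
        bucket = int.from_bytes(digest[:8], "big") % keep_one_in
        if bucket == 0:
            kept.append(line)
    return kept


words = [f"word{i}" for i in range(1000)]
sampled = sample_lines(words)

# Re-running the sampler gives the same result (idempotent regeneration).
assert sampled == sample_lines(words)
```

Because no random seed or input ordering is involved, adding or removing unrelated lines upstream does not change which of the remaining lines are kept, so diffs in the sampled file stay small.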