lucene

History

Robert Muir 0203815ab2 LUCENE-9220: regenerate all stemmers/stopwords/test data from snowball 2.0 (#1262 ) Previous situation: * The snowball base classes (Among, SnowballProgram, etc) had accumulated local performance-related changes. There was a task that would also "patch" generated classes (e.g. GermanStemmer) after-the-fact. * Snowball classes had many "non-changes" from the original such as removal of tabs addition of javadocs, license headers, etc. * Snowball test data (inputs and expected stems) was incorporated into lucene testing, but this was maintained manually. Also files had become large, making the test too slow (Nightly). * Snowball stopwords lists from their website were manually maintained. In some cases encoding fixes were manually applied. * Some generated stemmers (such as Estonian and Armenian) exist in lucene, but have no corresponding `.sbl` file in snowball sources at all. Besides this mess, snowball project is "moving along" and acquiring new languages, adding non-BSD-licensed test data, huge test data, and other complexity. So it is time to automate the integration better. New situation: * Lucene has a `gradle snowball` regeneration task. It works on Linux or Mac only. It checks out their repos, applies the `snowball.patch` in our repository, compiles snowball stemmers, regenerates all java code, applies any adjustments so that our build is happy. * Tests data is automatically regenerated from the commit hash of the snowball test data repository. Not all languages are tested from their data: only where the license is simple BSD. Test data is also (deterministically) sampled, so that we don't have huge files. We just want to make sure our integration works. * Randomized tests are still set to test every language with generated fake words. The regeneration task ensures all languages get tested (it writes a simple text file list of them). * Stopword files are automatically regenerated from the commit hash of the snowball website repository. * The regeneration procedure is idempotent. This way when stuff does change, you know exactly what happened. For example if test data changes to a different license, you may see a git deletion. Or if a new language/stopwords/test data gets added, you will see git additions.		2020-02-17 12:38:01 -05:00
..
analysis-extras	Update README.txt (#1090 )	2020-02-15 22:57:46 +01:00
analytics	LUCENE-9182: add apache license headers to all .gradle files and enforce in rat task	2020-01-27 12:05:34 -05:00
clustering	LUCENE-9220: regenerate all stemmers/stopwords/test data from snowball 2.0 (#1262 )	2020-02-17 12:38:01 -05:00
dataimporthandler	LUCENE-9182: add apache license headers to all .gradle files and enforce in rat task	2020-01-27 12:05:34 -05:00
dataimporthandler-extras	LUCENE-9182: add apache license headers to all .gradle files and enforce in rat task	2020-01-27 12:05:34 -05:00
extraction	LUCENE-9182: add apache license headers to all .gradle files and enforce in rat task	2020-01-27 12:05:34 -05:00
jaegertracer-configurator	LUCENE-9182: add apache license headers to all .gradle files and enforce in rat task	2020-01-27 12:05:34 -05:00
langid	LUCENE-9182: add apache license headers to all .gradle files and enforce in rat task	2020-01-27 12:05:34 -05:00
ltr	LUCENE-8279: fix javadocs wrong header levels and accessibility issues	2020-02-08 10:00:00 -05:00
prometheus-exporter	LUCENE-9182: add apache license headers to all .gradle files and enforce in rat task	2020-01-27 12:05:34 -05:00
velocity	SOLR-14209: Upgrade JQuery to 3.4.1	2020-02-08 11:57:56 -06:00
contrib-build.xml	SOLR-7227: Don't create the WAR file at all	2015-07-28 19:04:21 +00:00