LUCENE-9220: regenerate all stemmers/stopwords/test data from snowball 2.0 (#1262)
Previous situation:
* The snowball base classes (Among, SnowballProgram, etc) had accumulated local performance-related changes. There was a task that would also "patch" generated classes (e.g. GermanStemmer) after-the-fact.
* Snowball classes had many "non-changes" from the original, such as removal of tabs, addition of javadocs, license headers, etc.
* Snowball test data (inputs and expected stems) was incorporated into lucene testing, but this was maintained manually. Also files had become large, making the test too slow (Nightly).
* Snowball stopwords lists from their website were manually maintained. In some cases encoding fixes were manually applied.
* Some generated stemmers (such as Estonian and Armenian) exist in lucene, but have no corresponding `.sbl` file in snowball sources at all.
Besides this mess, the snowball project is "moving along" and acquiring new languages, adding non-BSD-licensed test data, huge test data, and other complexity. So it is time to automate the integration better.
New situation:
* Lucene has a `gradle snowball` regeneration task. It works on Linux or Mac only. It checks out their repos, applies the `snowball.patch` in our repository, compiles snowball stemmers, regenerates all java code, and applies any adjustments so that our build is happy.
* Test data is automatically regenerated from the commit hash of the snowball test data repository. Not all languages are tested from their data: only where the license is simple BSD. Test data is also (deterministically) sampled, so that we don't have huge files. We just want to make sure our integration works.
* Randomized tests are still set to test every language with generated fake words. The regeneration task ensures all languages get tested (it writes a simple text file list of them).
* Stopword files are automatically regenerated from the commit hash of the snowball website repository.
* The regeneration procedure is idempotent. This way when stuff does change, you know exactly what happened. For example if test data changes to a different license, you may see a git deletion. Or if a new language/stopwords/test data gets added, you will see git additions.
2020-02-17 12:38:01 -05:00
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

2020-08-30 11:10:18 -04:00
import org.apache.tools.ant.taskdefs.condition.Os

apply plugin: "de.undercouch.download"

configure(rootProject) {
  task snowball() {
    description "Regenerate snowball-based sources, stopwords, and tests for ...lucene/analysis."
    group "generation"

    dependsOn ":lucene:analysis:common:snowballGen"
  }
}

configure(project(":lucene:analysis:common")) {
  ext {
    // git commit hash of source code https://github.com/snowballstem/snowball/
    snowballStemmerCommit = "53739a805cfa6c77ff8496dc711dc1c106d987c1"
    // git commit hash of stopwords https://github.com/snowballstem/snowball-website
2020-05-01 21:11:35 -04:00
    snowballWebsiteCommit = "5a8cf2451d108217585d8e32d744f8b8fd20c711"
    // git commit hash of test data https://github.com/snowballstem/snowball-data
    snowballDataCommit = "9145f8732ec952c8a3d1066be251da198a8bc792"

    snowballWorkDir = file("${buildDir}/snowball")

    snowballStemmerDir = file("${snowballWorkDir}/stemmers-${snowballStemmerCommit}")
    snowballWebsiteDir = file("${snowballWorkDir}/website-${snowballWebsiteCommit}")
    snowballDataDir = file("${snowballWorkDir}/data-${snowballDataCommit}")

    snowballPatchFile = rootProject.file("gradle/generation/snowball.patch")
    snowballScript = rootProject.file("gradle/generation/snowball.sh")
  }

  // downloads snowball stemmers (or use cached copy)
  task downloadSnowballStemmers(type: Download) {
    inputs.file(snowballPatchFile)
    src "https://github.com/snowballstem/snowball/archive/${snowballStemmerCommit}.zip"
    def snowballStemmerZip = file("${snowballStemmerDir}.zip")
    dest snowballStemmerZip
    overwrite false
    tempAndMove true

    doLast {
      ant.unzip(src: snowballStemmerZip, dest: snowballStemmerDir, overwrite: "true") {
        ant.cutdirsmapper(dirs: "1")
      }
      ant.patch(patchfile: snowballPatchFile, dir: snowballStemmerDir, strip: "1")
    }
  }

  // downloads snowball website (or use cached copy)
  task downloadSnowballWebsite(type: Download) {
    src "https://github.com/snowballstem/snowball-website/archive/${snowballWebsiteCommit}.zip"
    def snowballWebsiteZip = file("${snowballWebsiteDir}.zip")
    dest snowballWebsiteZip
    overwrite false
    tempAndMove true

    doLast {
      ant.unzip(src: snowballWebsiteZip, dest: snowballWebsiteDir, overwrite: "true") {
        ant.cutdirsmapper(dirs: "1")
      }
    }
  }

  // downloads snowball test data (or use cached copy)
  task downloadSnowballData(type: Download) {
    src "https://github.com/snowballstem/snowball-data/archive/${snowballDataCommit}.zip"
    def snowballDataZip = file("${snowballDataDir}.zip")
    dest snowballDataZip
    overwrite false
    tempAndMove true

    doLast {
      ant.unzip(src: snowballDataZip, dest: snowballDataDir, overwrite: "true") {
        ant.cutdirsmapper(dirs: "1")
      }
    }
  }

  // runs shell script to regenerate stemmers, base stemming subclasses, test data, and stopwords.
  task snowballGen() {
    dependsOn downloadSnowballStemmers
    dependsOn downloadSnowballWebsite
    dependsOn downloadSnowballData

    doLast {
      if (Os.isFamily(Os.FAMILY_WINDOWS)) {
        throw new GradleException("Snowball generation does not work on Windows, use a platform where bash is available.")
      }

      project.exec {
        executable "bash"
        args = [snowballScript, snowballStemmerDir, snowballWebsiteDir, snowballDataDir, projectDir]
      }
    }
  }
}