Robert Muir 0203815ab2
LUCENE-9220: regenerate all stemmers/stopwords/test data from snowball 2.0 (#1262)
Previous situation:

* The snowball base classes (Among, SnowballProgram, etc.) had accumulated local performance-related changes. There was a task that would also "patch" generated classes (e.g. GermanStemmer) after the fact.
* Snowball classes had many "non-changes" from the original, such as removal of tabs, addition of javadocs, license headers, etc.
* Snowball test data (inputs and expected stems) was incorporated into lucene testing, but this was maintained manually. Also, the files had become large, making the test too slow (Nightly).
* Snowball stopwords lists from their website were manually maintained. In some cases encoding fixes were manually applied.
* Some generated stemmers (such as Estonian and Armenian) exist in lucene, but have no corresponding `.sbl` file in snowball sources at all.

Besides this mess, the snowball project is "moving along" and acquiring new languages, adding non-BSD-licensed test data, huge test data, and other complexity. So it is time to automate the integration better.

New situation:

* Lucene has a `gradle snowball` regeneration task (see the example after this list). It works on Linux or Mac only. It checks out their repos, applies the `snowball.patch` in our repository, compiles snowball stemmers, regenerates all java code, and applies any adjustments so that our build is happy.
* Test data is automatically regenerated from the commit hash of the snowball test data repository. Not all languages are tested from their data: only those where the license is simple BSD. Test data is also (deterministically) sampled so that we don't have huge files. We just want to make sure our integration works.
* Randomized tests are still set to test every language with generated fake words. The regeneration task ensures all languages get tested (it writes a simple text file list of them).
* Stopword files are automatically regenerated from the commit hash of the snowball website repository.
* The regeneration procedure is idempotent. This way when stuff does change, you know exactly what happened. For example if test data changes to a different license, you may see a git deletion. Or if a new language/stopwords/test data gets added, you will see git additions.
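For example, the whole pipeline would be a single invocation (a sketch; Linux or Mac only, as noted above):

./gradlew snowball

followed by an ordinary `git diff` to review exactly what changed.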

Apache Lucene and Solr

Apache Lucene is a high-performance, full-featured text search engine library written in Java.

Apache Solr is an enterprise search platform written using Apache Lucene. Major features include full-text search, index replication and sharding, and result faceting and highlighting.


Online Documentation

This README file only contains basic setup instructions. For more comprehensive documentation, visit https://lucene.apache.org/.

Building Lucene/Solr

(You do not need to do this if you downloaded a pre-built package)

Building with Ant

Lucene and Solr are built using Apache Ant. To build Lucene and Solr, run:

ant compile

If you see an error about Ivy missing while invoking Ant (e.g., .ant/lib does not exist), run ant ivy-bootstrap and retry.
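For example, a typical recovery sequence (ant ivy-bootstrap installs Ivy into ~/.ant/lib):

ant ivy-bootstrap
ant compile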

Sometimes you may face issues with Ivy (e.g., an incompletely downloaded artifact). Cleaning up the Ivy cache and retrying is a workaround for most such issues:

rm -rf ~/.ivy2/cache

The Solr server can then be packaged and prepared for startup by running the following command from the solr/ directory:

ant server
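Putting the steps together, a typical sequence from the repository root might look like this (a sketch; bin/solr is covered in the Running Solr section below):

cd solr
ant server
bin/solr start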

Building with Gradle

There is ongoing work (see LUCENE-9077) to switch the legacy ant-based build system to gradle. Please give it a try!

At the moment of writing, the gradle build requires precisely Java 11 (it may or may not work with newer Java versions).

To build Lucene and Solr, run (./ can be omitted on Windows):

./gradlew assemble

The command above also packages a full distribution of Solr server; the package can be located at:

solr/packaging/build/solr-*

Note that the gradle build does not create or copy binaries throughout the source repository (like the ant build does), so you need to switch to the packaging output folder above; the rest of the instructions below remain identical.
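For example, to start the freshly assembled server (a sketch; the exact directory name depends on the version being built):

cd solr/packaging/build/solr-*
bin/solr start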

Running Solr

After building Solr, the server can be started using the bin/solr control scripts. Solr can be run in either standalone or distributed (SolrCloud) mode.

To run Solr in standalone mode, run the following command from the solr/ directory:

bin/solr start

To run Solr in SolrCloud mode, run the following command from the solr/ directory:

bin/solr start -c

The bin/solr control script allows extensive customization of the started Solr instance. Common options are described in some detail in solr/README.txt. For an exhaustive treatment of options, run bin/solr start -h from the solr/ directory.
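For example, a sketch of a customized startup, assuming the commonly documented -p (listen port) and -m (heap size) options:

bin/solr start -p 8984 -m 2g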

Development/IDEs

Ant can be used to generate project files compatible with most common IDEs. Run the ant command corresponding to your IDE of choice before attempting to import Lucene/Solr.

  • Eclipse - ant eclipse
  • IntelliJ - ant idea
  • Netbeans - ant netbeans

Gradle build and IDE support

  • IntelliJ - IntelliJ IDEA can import the project out of the box. Code formatting conventions should be manually adjusted.
  • Eclipse - Not tested.
  • Netbeans - Not tested.

Running Tests

The standard test suite can be run with the command:

ant test

Like Solr itself, the test run can be customized or tailored in a number of ways. For an exhaustive discussion of the options available, run:

ant test-help
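For example, a single test class can be run with a fixed random seed (a sketch; TestDemo is a placeholder class name, and ant test-help lists the supported properties):

ant test -Dtestcase=TestDemo -Dtests.seed=DEADBEEF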

Gradle build and tests

Run the following command to display extensive help for running tests with gradle:

./gradlew helpTests
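For example, to run a single test class in one module (a sketch using standard gradle options; TestDemo is a placeholder class name):

./gradlew -p lucene/core test --tests TestDemo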

Contributing

Please review the Contributing to Solr Guide for information on contributing.

Discussion and Support
