Apache Solr Language Identifier


Introduction
------------
This module is intended to be used while indexing documents.
It is implemented as an UpdateProcessor to be placed in an UpdateChain.
Its purpose is to identify the language of each document and tag the document with a language code.
The module can optionally map field names to their language-specific counterparts,
e.g. if the input field is "title" and the language is detected as "en", map to "title_en".
Language may be detected globally for the document, and/or individually per field.
Language detector implementations are pluggable.
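
The flow described above (detect a language, tag the document, optionally
rename fields) can be illustrated with a minimal, standalone Java sketch.
It does not use the Solr API: the class LangFieldMappingSketch, the
LanguageDetector interface, the "language" field name and the plain-Map
document are all hypothetical stand-ins chosen for illustration, and the
detector is a stub that always answers "en".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; none of these names come from the Solr langid module.
final class LangFieldMappingSketch {

    /** Hypothetical pluggable detector: returns a language code such as "en". */
    interface LanguageDetector {
        String detect(String text);
    }

    /** Maps a field name to its language-specific counterpart, e.g. title + en -> title_en. */
    static String mapFieldName(String field, String lang) {
        return field + "_" + lang;
    }

    /**
     * Detects one language for the whole document, renames the fields listed
     * in fieldsToMap, and tags the result with the code under langField.
     */
    static Map<String, String> process(Map<String, String> doc,
                                       Set<String> fieldsToMap,
                                       String langField,
                                       LanguageDetector detector) {
        // Global detection: concatenate all field values into one text.
        StringBuilder text = new StringBuilder();
        for (String value : doc.values()) {
            text.append(value).append(' ');
        }
        String lang = detector.detect(text.toString());

        // Copy fields, renaming only those selected for mapping.
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            String name = fieldsToMap.contains(e.getKey())
                    ? mapFieldName(e.getKey(), lang)
                    : e.getKey();
            out.put(name, e.getValue());
        }
        out.put(langField, lang); // tag the document with the language code
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "1");
        doc.put("title", "Hello world");
        LanguageDetector stub = text -> "en"; // stand-in for a real detector
        System.out.println(process(doc, Set.of("title"), "language", stub));
    }
}
```

In the actual module the equivalent behaviour is configured on the
UpdateProcessor rather than coded by hand; see the module documentation
for the supported parameters.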

Getting Started
---------------
Please refer to the module documentation at http://wiki.apache.org/solr/LanguageDetection

Dependencies
------------
The Tika detector depends on Tika Core (which is part of the extraction contrib).
The Langdetect detector depends on the LangDetect library.
The OpenNLP detector depends on OpenNLP tools and requires a previously trained, user-supplied model.