Apache Solr Language Identifier

Introduction

This module is intended for use while indexing documents. It is implemented as an UpdateProcessor to be placed in an UpdateChain. Its purpose is to identify the language of each document and tag the document with a language code. The module can optionally map field names to their language-specific counterparts: for example, if the input field is "title" and the language is detected as "en", the field is mapped to "title_en". Language may be detected globally for the document and/or individually per field. Language detector implementations are pluggable.
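
The tagging and field-mapping behaviour described above can be illustrated with a small standalone sketch. This is not the module's code and does not use the Solr API: the class `LangFieldMappingSketch`, its methods, and the `language` field name are hypothetical, and the language code is assumed to have already been produced by some detector.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of the Solr langid module.
final class LangFieldMappingSketch {

    // Map a field name to its language-specific counterpart,
    // e.g. ("title", "en") -> "title_en".
    static String mapFieldName(String field, String langCode) {
        return field + "_" + langCode;
    }

    // Return a copy of the document in which the selected fields are renamed
    // and a language tag is added. The language code is assumed to come from
    // a detector that has already run.
    static Map<String, Object> tagAndMap(Map<String, Object> doc,
                                         String langCode,
                                         String langField,
                                         Set<String> fieldsToMap) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            String name = fieldsToMap.contains(e.getKey())
                    ? mapFieldName(e.getKey(), langCode)
                    : e.getKey();
            out.put(name, e.getValue());
        }
        out.put(langField, langCode);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", "1");
        doc.put("title", "A tale of two cities");
        // Prints {id=1, title_en=A tale of two cities, language=en}
        System.out.println(tagAndMap(doc, "en", "language", Set.of("title")));
    }
}
```

In the real module this logic runs inside the update chain and is driven by configuration rather than method arguments; see the module documentation for the actual parameters.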

Getting Started

Please refer to the module documentation at http://wiki.apache.org/solr/LanguageDetection

Dependencies

- The Tika detector depends on Tika Core (which is part of the extraction contrib).
- The Langdetect detector depends on the LangDetect library.
- The OpenNLP detector depends on OpenNLP tools and requires a previously trained, user-supplied model.