mirror of https://github.com/apache/lucene.git
Update language-analysis.adoc
This commit is contained in:
parent
956772b7ef
commit
87564a3e19
@ -577,6 +577,7 @@ Perform model-based lemmatization only, preserving the original token and emitti
These factories are each designed to work with specific languages. The languages covered here are:

* <<Arabic>>
* <<Bengali>>
* <<Brazilian Portuguese>>
* <<Bulgarian>>
* <<Catalan>>

@ -633,6 +634,31 @@ This algorithm defines both character normalization and stemming, so these are s
</analyzer>
----

=== Bengali

There are two filters written specifically for the Bengali language, one for normalization and one for stemming. They use the Lucene classes `org.apache.lucene.analysis.bn.BengaliNormalizationFilter` and `org.apache.lucene.analysis.bn.BengaliStemFilter`.

*Factory classes:* `solr.BengaliStemFilterFactory`, `solr.BengaliNormalizationFilterFactory`

*Arguments:* None

*Example:*

[source,xml]
----
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.BengaliNormalizationFilterFactory"/>
  <filter class="solr.BengaliStemFilterFactory"/>
</analyzer>
----

*Normalization* - `মানুষ` -> `মানুস`

*Stemming* - `সমস্ত` -> `সমস্`
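
As a rough sketch of where the analyzer above might sit in a schema, the following wraps the same filter chain in a field type. The names `text_bn` and `content_bn`, and the `positionIncrementGap` value, are assumptions chosen for illustration and are not part of this change.

[source,xml]
----
<!-- Hypothetical field type; the name text_bn is an illustrative assumption -->
<fieldType name="text_bn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Same order as the example above: normalize first, then stem -->
    <filter class="solr.BengaliNormalizationFilterFactory"/>
    <filter class="solr.BengaliStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that uses the type -->
<field name="content_bn" type="text_bn" indexed="true" stored="true"/>
----

Because a single `<analyzer>` element is declared without a `type` attribute, the same chain would apply at both index and query time, so normalized and stemmed forms match on both sides.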

=== Brazilian Portuguese

This is a Java filter written specifically for stemming the Brazilian dialect of the Portuguese language. It uses the Lucene class `org.apache.lucene.analysis.br.BrazilianStemmer`. Although that stemmer can be configured to use a list of protected words (which should not be stemmed), this factory does not accept any arguments to specify such a list.