Update language-analysis.adoc

Konstantin Perikov 2019-01-30 10:28:27 +00:00 committed by Tomas Fernandez Lobbe
parent 956772b7ef
commit 87564a3e19
1 changed files with 26 additions and 0 deletions

@ -577,6 +577,7 @@ Perform model-based lemmatization only, preserving the original token and emitti
These factories are each designed to work with specific languages. The languages covered here are:
* <<Arabic>>
* <<Bengali>>
* <<Brazilian Portuguese>>
* <<Bulgarian>>
* <<Catalan>>
@ -633,6 +634,31 @@ This algorithm defines both character normalization and stemming, so these are s
</analyzer>
----
=== Bengali
There are two filters written specifically for dealing with the Bengali language. They use the Lucene classes `org.apache.lucene.analysis.bn.BengaliNormalizationFilter` and `org.apache.lucene.analysis.bn.BengaliStemFilter`.
*Factory classes:* `solr.BengaliStemFilterFactory`, `solr.BengaliNormalizationFilterFactory`
*Arguments:* None
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.BengaliNormalizationFilterFactory"/>
<filter class="solr.BengaliStemFilterFactory"/>
</analyzer>
----
*Normalization* - `মানুষ` -> `মানুস`
*Stemming* - `সমস্ত` -> `সমস্`
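To apply this analyzer chain to a field, the `<analyzer>` block is typically placed inside a `<fieldType>` in the schema. The following is a minimal sketch only: the names `text_bn_example` and `content_bn` are hypothetical and chosen purely for illustration.
[source,xml]
----
<!-- Hypothetical field type name; not part of the original example -->
<fieldType name="text_bn_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Normalization is listed before stemming so the stemmer receives normalized tokens -->
    <filter class="solr.BengaliNormalizationFilterFactory"/>
    <filter class="solr.BengaliStemFilterFactory"/>
  </analyzer>
</fieldType>
<!-- Hypothetical field that uses the type above -->
<field name="content_bn" type="text_bn_example" indexed="true" stored="true"/>
----
Because no separate index and query analyzers are declared, the same chain is applied at both index and query time, so query terms are normalized and stemmed the same way as indexed terms.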
=== Brazilian Portuguese
This is a Java filter written specifically for stemming the Brazilian dialect of the Portuguese language. It uses the Lucene class `org.apache.lucene.analysis.br.BrazilianStemmer`. Although that stemmer can be configured to use a list of protected words (which should not be stemmed), this factory does not accept any arguments to specify such a list.