mirror of https://github.com/apache/lucene.git
Update language-analysis.adoc
parent 956772b7ef
commit 87564a3e19
@@ -577,6 +577,7 @@ Perform model-based lemmatization only, preserving the original token and emitti
 These factories are each designed to work with specific languages. The languages covered here are:

 * <<Arabic>>
+* <<Bengali>>
 * <<Brazilian Portuguese>>
 * <<Bulgarian>>
 * <<Catalan>>
@@ -633,6 +634,31 @@ This algorithm defines both character normalization and stemming, so these are s
 </analyzer>
 ----

+=== Bengali
+
+There are two filters written specifically for dealing with the Bengali language. They use the Lucene classes `org.apache.lucene.analysis.bn.BengaliNormalizationFilter` and `org.apache.lucene.analysis.bn.BengaliStemFilter`.
+
+*Factory classes:* `solr.BengaliStemFilterFactory`, `solr.BengaliNormalizationFilterFactory`
+
+*Arguments:* None
+
+*Example:*
+
+[source,xml]
+----
+<analyzer>
+  <tokenizer class="solr.StandardTokenizerFactory"/>
+  <filter class="solr.BengaliNormalizationFilterFactory"/>
+  <filter class="solr.BengaliStemFilterFactory"/>
+</analyzer>
+
+----
+
+*Normalisation* - `মানুষ` -> `মানুস`
+
+*Stemming* - `সমস্ত` -> `সমস্`
+
+
 === Brazilian Portuguese

 This is a Java filter written specifically for stemming the Brazilian dialect of the Portuguese language. It uses the Lucene class `org.apache.lucene.analysis.br.BrazilianStemmer`. Although that stemmer can be configured to use a list of protected words (which should not be stemmed), this factory does not accept any arguments to specify such a list.
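
The Bengali `<analyzer>` in this commit is shown on its own. As a rough sketch of how that chain might be attached to a field in a Solr schema, the fragment below wraps it in a field type. The names `text_bn` and `body_bn` are hypothetical and do not come from the commit; the tokenizer and filter factories are the ones listed in the Bengali section, in the same order (normalization before stemming).

[source,xml]
----
<!-- Hypothetical field type name; choose whatever suits the schema. -->
<fieldType name="text_bn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Normalize characters first so the stemmer sees a consistent form. -->
    <filter class="solr.BengaliNormalizationFilterFactory"/>
    <filter class="solr.BengaliStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that uses the type above. -->
<field name="body_bn" type="text_bn" indexed="true" stored="true"/>
----

Because a single `<analyzer>` element is given, the same chain would apply at both index and query time, so indexed and queried terms are normalized and stemmed the same way.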