mirror of https://github.com/apache/lucene.git
LUCENE-2894: apply formatting to more code samples
git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1076237 13f79535-47bb-0310-9956-ffa450edef68
Parent: 6600f5acdf
Commit: d51068ffd6
@@ -26,7 +26,7 @@ Fragmenter, fragment Scorer, and Formatter classes.
 <h2>Example Usage</h2>
 
-<pre>
+<pre class="prettyprint">
 //... Above, create documents with two fields, one with term vectors (tv) and one without (notv)
 IndexSearcher searcher = new IndexSearcher(directory);
 QueryParser parser = new QueryParser("notv", analyzer);
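For context, the highlighter sample continues past the diff window; a minimal sketch of the full flow, assuming the setup from the snippet (directory, analyzer, the "notv" field) and hypothetical query text:
<pre class="prettyprint">
// Sketch (not part of this commit): run the query and highlight each hit.
Query query = parser.parse("lucene highlighting");
TopDocs hits = searcher.search(query, 10);
Highlighter highlighter = new Highlighter(new QueryScorer(query));
for (ScoreDoc sd : hits.scoreDocs) {
  String text = searcher.doc(sd.doc).get("notv");
  // the analyzer re-tokenizes the stored text so fragments can be scored
  String fragment = highlighter.getBestFragment(analyzer, "notv", text);
  System.out.println(fragment);
}
</pre>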
@@ -131,7 +131,7 @@ you don't need to worry about dealing with those.
 {@link org.apache.lucene.queryParser.standard.StandardQueryParser} usage:
 
-<pre>
+<pre class="prettyprint">
 StandardQueryParser qpHelper = new StandardQueryParser();
 StandardQueryConfigHandler config = qpHelper.getQueryConfigHandler();
 config.setAllowLeadingWildcard(true);
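A short sketch (an assumption, not in the commit) of actually parsing with the configured helper; the query string and field name are placeholders:
<pre class="prettyprint">
// With leading wildcards enabled above, a query like "*ucene" is accepted.
Query query = qpHelper.parse("*ucene AND apache", "defaultField");
</pre>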
@@ -130,7 +130,7 @@ There are many post tokenization steps that can be done, including (but not limi
 </li>
 </ul>
 However an application might invoke Analysis of any text for testing or for any other purpose, something like:
-<PRE>
+<PRE class="prettyprint">
 Analyzer analyzer = new StandardAnalyzer(); // or any other analyzer
 TokenStream ts = analyzer.tokenStream("myfield",new StringReader("some text goes here"));
 while (ts.incrementToken()) {
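The consumption loop is cut off by the diff context; a minimal sketch of the complete pattern, using the standard attribute API of that era:
<pre class="prettyprint">
TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
ts.reset();                       // required before the first incrementToken()
while (ts.incrementToken()) {
  System.out.println("token: " + termAtt.toString());
}
ts.end();                         // records end-of-stream state (e.g. final offset)
ts.close();
</pre>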
@@ -182,7 +182,7 @@ the source code of any one of the many samples located in this package.
 This allows phrase search and proximity search to seamlessly cross
 boundaries between these "sections".
 In other words, if a certain field "f" is added like this:
-<PRE>
+<PRE class="prettyprint">
 document.add(new Field("f","first ends",...));
 document.add(new Field("f","starts two",...));
 indexWriter.addDocument(document);
@@ -191,7 +191,7 @@ the source code of any one of the many samples located in this package.
 Where desired, this behavior can be modified by introducing a "position gap" between consecutive field "sections",
 simply by overriding
 {@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap(java.lang.String) Analyzer.getPositionIncrementGap(fieldName)}:
-<PRE>
+<PRE class="prettyprint">
 Analyzer myAnalyzer = new StandardAnalyzer() {
   public int getPositionIncrementGap(String fieldName) {
     return 10;
</PRE>
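A completed sketch of the same override (an illustration, not from the commit; it subclasses the abstract Analyzer directly, since only getPositionIncrementGap matters here):
<pre class="prettyprint">
Analyzer myAnalyzer = new Analyzer() {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new WhitespaceTokenizer(reader);
  }
  @Override
  public int getPositionIncrementGap(String fieldName) {
    return 10;   // a gap of 10 positions between consecutive field "sections"
  }
};
</pre>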
@@ -220,7 +220,7 @@ the source code of any one of the many samples located in this package.
 tokens following a removed stop word, using
 {@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#setPositionIncrement(int)}.
 This can be done with something like:
-<PRE>
+<PRE class="prettyprint">
 public TokenStream tokenStream(final String fieldName, Reader reader) {
   final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
   TokenStream res = new TokenStream() {
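The same idea as a standalone TokenFilter, sketched under stated assumptions (the class and its stop-word set are hypothetical; Lucene's own StopFilter does this when position increments are enabled):
<pre class="prettyprint">
public final class PositionPreservingStopFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
  private final Set&lt;String&gt; stopWords;   // assumption: caller supplies the stop words

  public PositionPreservingStopFilter(TokenStream input, Set&lt;String&gt; stopWords) {
    super(input);
    this.stopWords = stopWords;
  }

  @Override
  public boolean incrementToken() throws IOException {
    int skipped = 0;   // positions consumed by removed stop words
    while (input.incrementToken()) {
      if (!stopWords.contains(termAtt.toString())) {
        // leave a "hole" where the stop words were removed
        posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skipped);
        return true;
      }
      skipped += posIncrAtt.getPositionIncrement();
    }
    return false;
  }
}
</pre>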
@@ -334,7 +334,7 @@ here to illustrate the usage of the new TokenStream API.<br>
 Then we will develop a custom Attribute, a PartOfSpeechAttribute, and add another filter to the chain which
 utilizes the new custom attribute, and call it PartOfSpeechTaggingFilter.
 <h4>Whitespace tokenization</h4>
-<pre>
+<pre class="prettyprint">
 public class MyAnalyzer extends Analyzer {
 
   public TokenStream tokenStream(String fieldName, Reader reader) {
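The class body falls outside the diff context; it continues roughly as in the underlying javadoc:
<pre class="prettyprint">
public class MyAnalyzer extends Analyzer {

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new WhitespaceTokenizer(reader);
    return stream;
  }
}
</pre>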
@@ -381,7 +381,7 @@ API
 <h4>Adding a LengthFilter</h4>
 We want to suppress all tokens that have 2 or less characters. We can do that easily by adding a LengthFilter
 to the chain. Only the tokenStream() method in our analyzer needs to be changed:
-<pre>
+<pre class="prettyprint">
 public TokenStream tokenStream(String fieldName, Reader reader) {
   TokenStream stream = new WhitespaceTokenizer(reader);
   stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
@@ -398,7 +398,7 @@ TokenStream
 API
 </pre>
 Now let's take a look how the LengthFilter is implemented (it is part of Lucene's core):
-<pre>
+<pre class="prettyprint">
 public final class LengthFilter extends TokenFilter {
 
   final int min;
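The diff shows only the head of the class; the implementation continues roughly like this (paraphrased from Lucene's core LengthFilter of that era, so treat it as a sketch rather than the exact source):
<pre class="prettyprint">
public final class LengthFilter extends TokenFilter {

  final int min;
  final int max;

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  /** Build a filter that removes words that are too long or too short. */
  public LengthFilter(TokenStream in, int min, int max) {
    super(in);
    this.min = min;
    this.max = max;
  }

  /** Returns the next token in the stream, or false at end of stream. */
  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      int len = termAtt.length();
      if (len >= min && len <= max) {
        return true;
      }
      // skip tokens outside the length bounds; note that the position
      // increment of surviving tokens is not adjusted here
    }
    return false;
  }
}
</pre>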
@@ -448,7 +448,7 @@ is necessary. The same is true for the consumer, which can simply use local ref
 <h4>Adding a custom Attribute</h4>
 Now we're going to implement our own custom Attribute for part-of-speech tagging and call it consequently
 <code>PartOfSpeechAttribute</code>. First we need to define the interface of the new Attribute:
-<pre>
+<pre class="prettyprint">
 public interface PartOfSpeechAttribute extends Attribute {
   public static enum PartOfSpeech {
     Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Conjunction, Article, Unknown
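The interface is cut off by the hunk; it continues with a setter and a getter, roughly:
<pre class="prettyprint">
public interface PartOfSpeechAttribute extends Attribute {
  public static enum PartOfSpeech {
    Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Conjunction, Article, Unknown
  }

  public void setPartOfSpeech(PartOfSpeech pos);

  public PartOfSpeech getPartOfSpeech();
}
</pre>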
@@ -470,7 +470,7 @@ and returns an actual instance. You can implement your own factory if you need t
 Now here is the actual class that implements our new Attribute. Notice that the class has to extend
 {@link org.apache.lucene.util.AttributeImpl}:
 
-<pre>
+<pre class="prettyprint">
 public final class PartOfSpeechAttributeImpl extends AttributeImpl
                    implements PartOfSpeechAttribute {
 
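A sketch of the full implementation, following the pattern the surrounding text describes (a single variable plus the abstract methods clear(), copyTo(), equals(), hashCode()):
<pre class="prettyprint">
public final class PartOfSpeechAttributeImpl extends AttributeImpl
                   implements PartOfSpeechAttribute {

  private PartOfSpeech pos = PartOfSpeech.Unknown;

  public void setPartOfSpeech(PartOfSpeech pos) {
    this.pos = pos;
  }

  public PartOfSpeech getPartOfSpeech() {
    return pos;
  }

  @Override
  public void clear() {
    pos = PartOfSpeech.Unknown;
  }

  @Override
  public void copyTo(AttributeImpl target) {
    ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
  }

  @Override
  public boolean equals(Object other) {
    if (other == this) return true;
    return other instanceof PartOfSpeechAttributeImpl
        && ((PartOfSpeechAttributeImpl) other).pos == pos;
  }

  @Override
  public int hashCode() {
    return pos.ordinal();
  }
}
</pre>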
@@ -513,7 +513,7 @@ This is a simple Attribute implementation that has only a single variable that stores
 new <code>AttributeImpl</code> class and therefore implements its abstract methods <code>clear(), copyTo(), equals(), hashCode()</code>.
 Now we need a TokenFilter that can set this new PartOfSpeechAttribute for each token. In this example we show a very naive filter
 that tags every word with a leading upper-case letter as a 'Noun' and all other words as 'Unknown'.
-<pre>
+<pre class="prettyprint">
 public static class PartOfSpeechTaggingFilter extends TokenFilter {
   PartOfSpeechAttribute posAtt;
   CharTermAttribute termAtt;
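The filter body is truncated by the diff context; it continues roughly as follows (a sketch in the spirit of the underlying javadoc):
<pre class="prettyprint">
public static class PartOfSpeechTaggingFilter extends TokenFilter {
  PartOfSpeechAttribute posAtt = addAttribute(PartOfSpeechAttribute.class);
  CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  protected PartOfSpeechTaggingFilter(TokenStream input) {
    super(input);
  }

  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    posAtt.setPartOfSpeech(determinePOS(termAtt.buffer(), 0, termAtt.length()));
    return true;
  }

  // naive part-of-speech tagging: words with a leading upper-case letter are nouns
  protected PartOfSpeech determinePOS(char[] term, int offset, int length) {
    if (length > 0 &amp;&amp; Character.isUpperCase(term[offset])) {
      return PartOfSpeech.Noun;
    }
    return PartOfSpeech.Unknown;
  }
}
</pre>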
@@ -544,7 +544,7 @@ Just like the LengthFilter, this new filter accesses the attributes it needs in
 stores references in instance variables. Notice how you only need to pass in the interface of the new
 Attribute and instantiating the correct class is automatically taken care of.
 Now we need to add the filter to the chain:
-<pre>
+<pre class="prettyprint">
 public TokenStream tokenStream(String fieldName, Reader reader) {
   TokenStream stream = new WhitespaceTokenizer(reader);
   stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
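The remainder of the method falls outside the hunk; the completed chain is simply:
<pre class="prettyprint">
public TokenStream tokenStream(String fieldName, Reader reader) {
  TokenStream stream = new WhitespaceTokenizer(reader);
  stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
  stream = new PartOfSpeechTaggingFilter(stream);
  return stream;
}
</pre>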
@@ -564,7 +564,7 @@ API
 Apparently it hasn't changed, which shows that adding a custom attribute to a TokenStream/Filter chain does not
 affect any existing consumers, simply because they don't know the new Attribute. Now let's change the consumer
 to make use of the new PartOfSpeechAttribute and print it out:
-<pre>
+<pre class="prettyprint">
 public static void main(String[] args) throws IOException {
   // text to tokenize
   final String text = "This is a demo of the new TokenStream API";
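The main method is truncated by the diff context; a sketch of the full consumer, printing each term with its part of speech:
<pre class="prettyprint">
public static void main(String[] args) throws IOException {
  // text to tokenize
  final String text = "This is a demo of the new TokenStream API";

  MyAnalyzer analyzer = new MyAnalyzer();
  TokenStream stream = analyzer.tokenStream("field", new StringReader(text));

  // get the attributes that the filters in the chain populate
  CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
  PartOfSpeechAttribute posAtt = stream.addAttribute(PartOfSpeechAttribute.class);

  stream.reset();
  while (stream.incrementToken()) {
    System.out.println(termAtt.toString() + ": " + posAtt.getPartOfSpeech());
  }
  stream.end();
  stream.close();
}
</pre>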
@@ -606,7 +606,7 @@ API the reader could now write an Attribute and TokenFilter that can specify for
 of a sentence or not. Then the PartOfSpeechTaggingFilter can make use of this knowledge and only tag capitalized words
 as nouns if not the first word of a sentence (we know, this is still not a correct behavior, but hey, it's a good exercise).
 As a small hint, this is how the new Attribute class could begin:
-<pre>
+<pre class="prettyprint">
 public class FirstTokenOfSentenceAttributeImpl extends AttributeImpl
                    implements FirstTokenOfSentenceAttribute {
 
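One way the hinted class could continue (purely illustrative; the FirstTokenOfSentenceAttribute interface and its setter/getter are assumptions, since the exercise leaves them to the reader):
<pre class="prettyprint">
public class FirstTokenOfSentenceAttributeImpl extends AttributeImpl
                   implements FirstTokenOfSentenceAttribute {

  private boolean firstToken;

  public void setFirstToken(boolean firstToken) { this.firstToken = firstToken; }

  public boolean getFirstToken() { return firstToken; }

  @Override
  public void clear() { firstToken = false; }

  @Override
  public void copyTo(AttributeImpl target) {
    ((FirstTokenOfSentenceAttribute) target).setFirstToken(firstToken);
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof FirstTokenOfSentenceAttributeImpl
        &amp;&amp; ((FirstTokenOfSentenceAttributeImpl) other).firstToken == firstToken;
  }

  @Override
  public int hashCode() { return firstToken ? 1231 : 1237; }
}
</pre>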
@@ -45,7 +45,7 @@ Features:
 <p>
 Lazy loading of Message Strings
 
-<pre>
+<pre class="prettyprint">
 public class MessagesTestBundle extends NLS {
 
   private static final String BUNDLE_NAME = MessagesTestBundle.class.getName();
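The bundle class is cut off after the BUNDLE_NAME constant; it continues roughly as in Lucene's NLS documentation (string ids are registered in a static initializer and must match keys in the property files; the Q0001E id shown here is illustrative):
<pre class="prettyprint">
public class MessagesTestBundle extends NLS {

  private static final String BUNDLE_NAME = MessagesTestBundle.class.getName();

  private MessagesTestBundle() {
    // should never be instantiated
  }

  static {
    // register all string ids with the NLS class and initialize static string values
    NLS.initializeMessages(BUNDLE_NAME, MessagesTestBundle.class);
  }

  // static strings must match the keys in the property files
  public static String Q0001E_INVALID_SYNTAX;
  public static String Q0004E_INVALID_SYNTAX_ESCAPE_UNICODE_TRUNCATION;
}
</pre>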
@@ -85,7 +85,7 @@ Lazy loading of Message Strings
 <p>
 Normal loading of Message Strings
 
-<pre>
+<pre class="prettyprint">
 String message1 = NLS.getLocalizedMessage(MessagesTestBundle.Q0004E_INVALID_SYNTAX_ESCAPE_UNICODE_TRUNCATION);
 String message2 = NLS.getLocalizedMessage(MessagesTestBundle.Q0004E_INVALID_SYNTAX_ESCAPE_UNICODE_TRUNCATION, Locale.JAPANESE);
 </pre>
@@ -130,14 +130,14 @@
 Using field (byte) values as scores:
 <p>
 Indexing:
-<pre>
+<pre class="prettyprint">
 f = new Field("score", "7", Field.Store.NO, Field.Index.UN_TOKENIZED);
 f.setOmitNorms(true);
 d1.add(f);
 </pre>
 <p>
 Search:
-<pre>
+<pre class="prettyprint">
 Query q = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
 </pre>
 Document d1 above would get a score of 7.
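A small sketch (an assumption, not part of the commit) of executing that query; `directory` stands for the index that contains d1:
<pre class="prettyprint">
IndexSearcher searcher = new IndexSearcher(directory);
TopDocs hits = searcher.search(q, 10);
// each hit's score is the byte value indexed in its "score" field,
// so d1 scores 7.0
</pre>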
@@ -148,7 +148,7 @@
 <p>
 Dividing the original score of each document by a square root of its docid
 (just to demonstrate what it takes to manipulate scores this way)
-<pre>
+<pre class="prettyprint">
 Query q = queryParser.parse("my query text");
 CustomScoreQuery customQ = new CustomScoreQuery(q) {
   public float customScore(int doc, float subQueryScore, float valSrcScore) {
@@ -158,7 +158,7 @@
 </pre>
 <p>
 For more informative debug info on the custom query, also override the name() method:
-<pre>
+<pre class="prettyprint">
 CustomScoreQuery customQ = new CustomScoreQuery(q) {
   public float customScore(int doc, float subQueryScore, float valSrcScore) {
     return subQueryScore / (float) Math.sqrt(doc);
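A completed sketch of the anonymous subclass with name() overridden (the float cast is needed because Math.sqrt returns a double; the name string is illustrative):
<pre class="prettyprint">
CustomScoreQuery customQ = new CustomScoreQuery(q) {
  @Override
  public float customScore(int doc, float subQueryScore, float valSrcScore) {
    return subQueryScore / (float) Math.sqrt(doc);
  }

  @Override
  public String name() {
    return "1/sqrt(docid)";
  }
};
</pre>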
@@ -171,7 +171,7 @@
 <p>
 Taking the square root of the original score and multiplying it by a "short field driven score", ie, the
 short value that was indexed for the scored doc in a certain field:
-<pre>
+<pre class="prettyprint">
 Query q = queryParser.parse("my query text");
 FieldScoreQuery qf = new FieldScoreQuery("shortScore", FieldScoreQuery.Type.SHORT);
 CustomScoreQuery customQ = new CustomScoreQuery(q,qf) {
@@ -59,7 +59,7 @@ two starts and ends at the greater of the two ends.
 <p>For example, a span query which matches "John Kerry" within ten
 words of "George Bush" within the first 100 words of the document
 could be constructed with:
-<pre>
+<pre class="prettyprint">
 SpanQuery john = new SpanTermQuery(new Term("content", "john"));
 SpanQuery kerry = new SpanTermQuery(new Term("content", "kerry"));
 SpanQuery george = new SpanTermQuery(new Term("content", "george"));
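The construction continues past the diff context; roughly, the terms are combined with SpanNearQuery and then anchored to the document start with SpanFirstQuery:
<pre class="prettyprint">
SpanQuery bush = new SpanTermQuery(new Term("content", "bush"));
SpanQuery johnKerry =
  new SpanNearQuery(new SpanQuery[] {john, kerry}, 0, true);
SpanQuery georgeBush =
  new SpanNearQuery(new SpanQuery[] {george, bush}, 0, true);
SpanQuery johnKerryNearGeorgeBush =
  new SpanNearQuery(new SpanQuery[] {johnKerry, georgeBush}, 10, false);
SpanQuery johnKerryNearGeorgeBushAtStart =
  new SpanFirstQuery(johnKerryNearGeorgeBush, 100);
</pre>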
@@ -82,7 +82,7 @@ SpanQuery johnKerryNearGeorgeBushAtStart =
 So, for example, the above query can be restricted to documents which
 also use the word "iraq" with:
 
-<pre>
+<pre class="prettyprint">
 Query query = new BooleanQuery();
 query.add(johnKerryNearGeorgeBushAtStart, true, false);
 query.add(new TermQuery("content", "iraq"), true, false);
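The snippet above uses two signatures that had already been superseded at this point (TermQuery takes a Term, and BooleanQuery.add takes a BooleanClause.Occur rather than required/prohibited booleans); an updated sketch:
<pre class="prettyprint">
BooleanQuery query = new BooleanQuery();
query.add(johnKerryNearGeorgeBushAtStart, BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("content", "iraq")), BooleanClause.Occur.MUST);
</pre>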
@@ -52,7 +52,7 @@
 <h2>Example Usages</h2>
 
 <h3>Farsi Range Queries</h3>
-<code><pre>
+<pre class="prettyprint">
 // "fa" Locale is not supported by Sun JDK 1.4 or 1.5
 Collator collator = Collator.getInstance(new Locale("ar"));
 CollationKeyAnalyzer analyzer = new CollationKeyAnalyzer(Version.LUCENE_40, collator);
@@ -76,10 +76,10 @@
 ScoreDoc[] result
   = is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
 assertEquals("The index Term should not be included.", 0, result.length);
-</pre></code>
+</pre>
 
 <h3>Danish Sorting</h3>
-<code><pre>
+<pre class="prettyprint">
 Analyzer analyzer
   = new CollationKeyAnalyzer(Version.LUCENE_40, Collator.getInstance(new Locale("da", "dk")));
 RAMDirectory indexStore = new RAMDirectory();
@@ -103,10 +103,10 @@
 Document doc = searcher.doc(result[i].doc);
 assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
 }
-</pre></code>
+</pre>
 
 <h3>Turkish Case Normalization</h3>
-<code><pre>
+<pre class="prettyprint">
 Collator collator = Collator.getInstance(new Locale("tr", "TR"));
 collator.setStrength(Collator.PRIMARY);
 Analyzer analyzer = new CollationKeyAnalyzer(Version.LUCENE_40, collator);
@@ -121,7 +121,7 @@
 Query query = parser.parse("d\u0131gy"); // U+0131: dotless i
 ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
 assertEquals("The index Term should be included.", 1, result.length);
-</pre></code>
+</pre>
 
 <h2>Caveats and Comparisons</h2>
 <p>
@@ -66,12 +66,12 @@ algorithm.
 </ul>
 <h2>Example Usages</h2>
 <h3>Tokenizing multilanguage text</h3>
-<code><pre>
+<pre class="prettyprint">
 /**
  * This tokenizer will work well in general for most languages.
  */
 Tokenizer tokenizer = new ICUTokenizer(reader);
-</pre></code>
+</pre>
 <hr/>
 <h1><a name="collation">Collation</a></h1>
 <p>
@@ -111,7 +111,7 @@ algorithm.
 <h2>Example Usages</h2>
 
 <h3>Farsi Range Queries</h3>
-<code><pre>
+<pre class="prettyprint">
 Collator collator = Collator.getInstance(new ULocale("ar"));
 ICUCollationKeyAnalyzer analyzer = new ICUCollationKeyAnalyzer(Version.LUCENE_40, collator);
 RAMDirectory ramDir = new RAMDirectory();
@@ -134,10 +134,10 @@ algorithm.
 ScoreDoc[] result
   = is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
 assertEquals("The index Term should not be included.", 0, result.length);
-</pre></code>
+</pre>
 
 <h3>Danish Sorting</h3>
-<code><pre>
+<pre class="prettyprint">
 Analyzer analyzer
   = new ICUCollationKeyAnalyzer(Version.LUCENE_40, Collator.getInstance(new ULocale("da", "dk")));
 RAMDirectory indexStore = new RAMDirectory();
@@ -161,10 +161,10 @@ algorithm.
 Document doc = searcher.doc(result[i].doc);
 assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
 }
-</pre></code>
+</pre>
 
 <h3>Turkish Case Normalization</h3>
-<code><pre>
+<pre class="prettyprint">
 Collator collator = Collator.getInstance(new ULocale("tr", "TR"));
 collator.setStrength(Collator.PRIMARY);
 Analyzer analyzer = new ICUCollationKeyAnalyzer(Version.LUCENE_40, collator);
@@ -179,7 +179,7 @@ algorithm.
 Query query = parser.parse("d\u0131gy"); // U+0131: dotless i
 ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
 assertEquals("The index Term should be included.", 1, result.length);
-</pre></code>
+</pre>
 
 <h2>Caveats and Comparisons</h2>
 <p>
@@ -239,7 +239,7 @@ algorithm.
 </ul>
 <h2>Example Usages</h2>
 <h3>Normalizing text to NFC</h3>
-<code><pre>
+<pre class="prettyprint">
 /**
  * Normalizer2 objects are unmodifiable and immutable.
  */
@@ -248,7 +248,7 @@ algorithm.
  * This filter will normalize to NFC.
  */
 TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
-</pre></code>
+</pre>
 <hr/>
 <h1><a name="casefolding">Case Folding</a></h1>
 <p>
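The `normalizer` variable is defined outside the diff context; a sketch (an assumption about the missing line, using the standard ICU4J factory) of how it could be obtained:
<pre class="prettyprint">
// nfc with COMPOSE mode yields an NFC normalizer from ICU's bundled data
Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
</pre>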
@@ -278,12 +278,12 @@ this integration. To perform case-folding, you use normalization with the form
 </ul>
 <h2>Example Usages</h2>
 <h3>Lowercasing text</h3>
-<code><pre>
+<pre class="prettyprint">
 /**
  * This filter will case-fold and normalize to NFKC.
  */
 TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
-</pre></code>
+</pre>
 <hr/>
 <h1><a name="searchfolding">Search Term Folding</a></h1>
 <p>
@@ -305,13 +305,13 @@ many character foldings recursively.
 </ul>
 <h2>Example Usages</h2>
 <h3>Removing accents</h3>
-<code><pre>
+<pre class="prettyprint">
 /**
  * This filter will case-fold, remove accents and other distinctions, and
  * normalize to NFKC.
 */
 TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
-</pre></code>
+</pre>
 <hr/>
 <h1><a name="transform">Text Transformation</a></h1>
 <p>
@@ -335,19 +335,19 @@ and
 </ul>
 <h2>Example Usages</h2>
 <h3>Convert Traditional to Simplified</h3>
-<code><pre>
+<pre class="prettyprint">
 /**
  * This filter will map Traditional Chinese to Simplified Chinese
  */
 TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Traditional-Simplified"));
-</pre></code>
+</pre>
 <h3>Transliterate Serbian Cyrillic to Serbian Latin</h3>
-<code><pre>
+<pre class="prettyprint">
 /**
  * This filter will map Serbian Cyrillic to Serbian Latin according to BGN rules
 */
 TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Serbian-Latin/BGN"));
-</pre></code>
+</pre>
 <hr/>
 <h1><a name="backcompat">Backwards Compatibility</a></h1>
 <p>
@@ -359,7 +359,7 @@ a specific Unicode Version by using a {@link com.ibm.icu.text.FilteredNormalizer
 </p>
 <h2>Example Usages</h2>
 <h3>Restricting normalization to Unicode 5.0</h3>
-<code><pre>
+<pre class="prettyprint">
 /**
  * This filter will do NFC normalization, but will ignore any characters that
  * did not exist as of Unicode 5.0. Because of the normalization stability policy
@@ -371,6 +371,6 @@ a specific Unicode Version by using a {@link com.ibm.icu.text.FilteredNormalizer
 set.freeze();
 FilteredNormalizer2 unicode50 = new FilteredNormalizer2(normalizer, set);
 TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, unicode50);
-</pre></code>
+</pre>
 </body>
 </html>