LUCENE-2894: apply formatting to more code samples

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1076237 13f79535-47bb-0310-9956-ffa450edef68
Robert Muir 2011-03-02 14:59:02 +00:00
parent 6600f5acdf
commit d51068ffd6
8 changed files with 50 additions and 50 deletions


@@ -26,7 +26,7 @@ Fragmenter, fragment Scorer, and Formatter classes.
<h2>Example Usage</h2>
-<pre>
+<pre class="prettyprint">
//... Above, create documents with two fields, one with term vectors (tv) and one without (notv)
IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser("notv", analyzer);
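The hunk cuts the example off here. For orientation, the highlighter walkthrough plausibly continues along these lines (a sketch, not part of this commit; the query string is assumed):
<pre class="prettyprint">
Query query = parser.parse("fox");
TopDocs hits = searcher.search(query, 10);

Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), new QueryScorer(query));
for (ScoreDoc sd : hits.scoreDocs) {
  String text = searcher.doc(sd.doc).get("notv");
  // the "notv" field stores no term vectors, so the stored text is re-analyzed
  String fragment = highlighter.getBestFragment(analyzer, "notv", text);
  System.out.println(fragment);
}
</pre>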


@@ -131,7 +131,7 @@ you don't need to worry about dealing with those.
{@link org.apache.lucene.queryParser.standard.StandardQueryParser} usage:
-<pre>
+<pre class="prettyprint">
StandardQueryParser qpHelper = new StandardQueryParser();
StandardQueryConfigHandler config = qpHelper.getQueryConfigHandler();
config.setAllowLeadingWildcard(true);
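A hedged sketch of how the configured parser would then be used (the query string and default field name are invented for illustration):
<pre class="prettyprint">
// parse a query against a default field using the configured helper;
// the leading wildcard is legal because of setAllowLeadingWildcard(true)
Query query = qpHelper.parse("*luc?ne AND contrib", "defaultField");
</pre>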


@@ -130,7 +130,7 @@ There are many post tokenization steps that can be done, including (but not limi
</li>
</ul>
However, an application might invoke analysis of any text for testing or for any other purpose, something like:
-<PRE>
+<PRE class="prettyprint">
Analyzer analyzer = new StandardAnalyzer(); // or any other analyzer
TokenStream ts = analyzer.tokenStream("myfield",new StringReader("some text goes here"));
while (ts.incrementToken()) {
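The loop body is cut off by the hunk; a minimal complete consumer, assuming the post-3.0 attribute API, looks roughly like this:
<pre class="prettyprint">
Analyzer analyzer = new StandardAnalyzer(); // or any other analyzer
TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
// the attribute instance is reused for every token, so fetch it once up front
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
  System.out.println("token: " + termAtt.toString());
}
ts.end();
ts.close();
</pre>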
@@ -182,7 +182,7 @@ the source code of any one of the many samples located in this package.
This allows phrase search and proximity search to seamlessly cross
boundaries between these "sections".
In other words, if a certain field "f" is added like this:
-<PRE>
+<PRE class="prettyprint">
document.add(new Field("f","first ends",...));
document.add(new Field("f","starts two",...));
indexWriter.addDocument(document);
@@ -191,7 +191,7 @@ the source code of any one of the many samples located in this package.
Where desired, this behavior can be modified by introducing a "position gap" between consecutive field "sections",
simply by overriding
{@link org.apache.lucene.analysis.Analyzer#getPositionIncrementGap(java.lang.String) Analyzer.getPositionIncrementGap(fieldName)}:
-<PRE>
+<PRE class="prettyprint">
Analyzer myAnalyzer = new StandardAnalyzer() {
public int getPositionIncrementGap(String fieldName) {
return 10;
@@ -220,7 +220,7 @@ the source code of any one of the many samples located in this package.
tokens following a removed stop word, using
{@link org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#setPositionIncrement(int)}.
This can be done with something like:
-<PRE>
+<PRE class="prettyprint">
public TokenStream tokenStream(final String fieldName, Reader reader) {
final TokenStream ts = someAnalyzer.tokenStream(fieldName, reader);
TokenStream res = new TokenStream() {
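The anonymous-class example is truncated by the hunk. The same idea can be sketched as a standalone TokenFilter (the class name and the stopWords set are assumptions, not part of this commit); because a TokenFilter shares attributes with its input, the adjusted increments land on the surviving tokens:
<pre class="prettyprint">
public final class PositionPreservingStopFilter extends TokenFilter {
  final Set stopWords; // a Set of stop-word Strings (assumed)
  final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

  public PositionPreservingStopFilter(TokenStream in, Set stopWords) {
    super(in);
    this.stopWords = stopWords;
  }

  public boolean incrementToken() throws IOException {
    int extraIncrement = 0;
    while (input.incrementToken()) {
      if (stopWords.contains(termAtt.toString())) {
        extraIncrement += posIncrAtt.getPositionIncrement(); // remember the hole
        continue;
      }
      if (extraIncrement > 0) {
        posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + extraIncrement);
      }
      return true;
    }
    return false;
  }
}
</pre>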
@@ -334,7 +334,7 @@ here to illustrate the usage of the new TokenStream API.<br>
Then we will develop a custom Attribute, a PartOfSpeechAttribute, and add another filter to the chain which
utilizes the new custom attribute, and call it PartOfSpeechTaggingFilter.
<h4>Whitespace tokenization</h4>
-<pre>
+<pre class="prettyprint">
public class MyAnalyzer extends Analyzer {
public TokenStream tokenStream(String fieldName, Reader reader) {
@@ -381,7 +381,7 @@ API
<h4>Adding a LengthFilter</h4>
We want to suppress all tokens that have 2 or fewer characters. We can do that easily by adding a LengthFilter
to the chain. Only the tokenStream() method in our analyzer needs to be changed:
-<pre>
+<pre class="prettyprint">
public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream stream = new WhitespaceTokenizer(reader);
stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
@@ -398,7 +398,7 @@ TokenStream
API
</pre>
Now let's take a look at how the LengthFilter is implemented (it is part of Lucene's core):
-<pre>
+<pre class="prettyprint">
public final class LengthFilter extends TokenFilter {
final int min;
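Only the first lines of the filter survive the hunk; a simplified sketch of the rest (not necessarily the exact core source) is:
<pre class="prettyprint">
  final int max;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public LengthFilter(TokenStream in, int min, int max) {
    super(in);
    this.min = min;
    this.max = max;
  }

  public boolean incrementToken() throws IOException {
    // skip tokens whose term length falls outside [min, max]
    while (input.incrementToken()) {
      int len = termAtt.length();
      if (len >= min && len <= max) {
        return true;
      }
    }
    return false;
  }
}
</pre>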
@@ -448,7 +448,7 @@ is necessary. The same is true for the consumer, which can simply use local ref
<h4>Adding a custom Attribute</h4>
Now we're going to implement our own custom Attribute for part-of-speech tagging and, accordingly, call it
<code>PartOfSpeechAttribute</code>. First we need to define the interface of the new Attribute:
-<pre>
+<pre class="prettyprint">
public interface PartOfSpeechAttribute extends Attribute {
public static enum PartOfSpeech {
Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Conjunction, Article, Unknown
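The interface is cut off here; judging from how the attribute is used later in the walkthrough, it plausibly continues with a setter/getter pair for the tag (a sketch):
<pre class="prettyprint">
  }

  public void setPartOfSpeech(PartOfSpeech pos);
  public PartOfSpeech getPartOfSpeech();
}
</pre>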
@@ -470,7 +470,7 @@ and returns an actual instance. You can implement your own factory if you need t
Now here is the actual class that implements our new Attribute. Notice that the class has to extend
{@link org.apache.lucene.util.AttributeImpl}:
-<pre>
+<pre class="prettyprint">
public final class PartOfSpeechAttributeImpl extends AttributeImpl
implements PartOfSpeechAttribute{
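The hunk shows only the class declaration; a minimal body consistent with the description below (one variable plus clear(), copyTo(), equals(), hashCode()) could look like this sketch:
<pre class="prettyprint">
  private PartOfSpeech pos = PartOfSpeech.Unknown;

  public void setPartOfSpeech(PartOfSpeech pos) { this.pos = pos; }
  public PartOfSpeech getPartOfSpeech() { return pos; }

  public void clear() { pos = PartOfSpeech.Unknown; }

  public void copyTo(AttributeImpl target) {
    ((PartOfSpeechAttributeImpl) target).pos = pos;
  }

  public boolean equals(Object other) {
    return other instanceof PartOfSpeechAttributeImpl
        && ((PartOfSpeechAttributeImpl) other).pos == pos;
  }

  public int hashCode() { return pos.ordinal(); }
}
</pre>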
@@ -513,7 +513,7 @@ This is a simple Attribute implementation that has only a single variable that stores
new <code>AttributeImpl</code> class and therefore implements its abstract methods <code>clear(), copyTo(), equals(), hashCode()</code>.
Now we need a TokenFilter that can set this new PartOfSpeechAttribute for each token. In this example we show a very naive filter
that tags every word with a leading upper-case letter as a 'Noun' and all other words as 'Unknown'.
-<pre>
+<pre class="prettyprint">
public static class PartOfSpeechTaggingFilter extends TokenFilter {
PartOfSpeechAttribute posAtt;
CharTermAttribute termAtt;
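The filter body is truncated by the hunk; a sketch of the rest, following the "leading upper-case letter means Noun" rule described above:
<pre class="prettyprint">
  protected PartOfSpeechTaggingFilter(TokenStream input) {
    super(input);
    posAtt = addAttribute(PartOfSpeechAttribute.class);
    termAtt = addAttribute(CharTermAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    posAtt.setPartOfSpeech(determinePOS(termAtt.buffer(), 0, termAtt.length()));
    return true;
  }

  // naive rule: a leading upper-case letter means 'Noun'
  protected PartOfSpeech determinePOS(char[] term, int offset, int length) {
    if (length > 0 && Character.isUpperCase(term[offset])) {
      return PartOfSpeech.Noun;
    }
    return PartOfSpeech.Unknown;
  }
}
</pre>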
@@ -544,7 +544,7 @@ Just like the LengthFilter, this new filter accesses the attributes it needs in
stores references in instance variables. Notice how you only need to pass in the interface of the new
Attribute; instantiating the correct implementation class is taken care of automatically.
Now we need to add the filter to the chain:
-<pre>
+<pre class="prettyprint">
public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream stream = new WhitespaceTokenizer(reader);
stream = new LengthFilter(stream, 3, Integer.MAX_VALUE);
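The method is cut off here; per the surrounding text it simply appends the new filter and returns the chain (sketch):
<pre class="prettyprint">
  stream = new PartOfSpeechTaggingFilter(stream);
  return stream;
}
</pre>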
@@ -564,7 +564,7 @@ API
Apparently it hasn't changed, which shows that adding a custom attribute to a TokenStream/Filter chain does not
affect any existing consumers, simply because they don't know about the new Attribute. Now let's change the consumer
to make use of the new PartOfSpeechAttribute and print it out:
-<pre>
+<pre class="prettyprint">
public static void main(String[] args) throws IOException {
// text to tokenize
final String text = "This is a demo of the new TokenStream API";
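The consumer is truncated by the hunk; a hedged sketch of the rest, printing each term together with its tag:
<pre class="prettyprint">
  MyAnalyzer analyzer = new MyAnalyzer();
  TokenStream stream = analyzer.tokenStream("field", new StringReader(text));

  // get references to the attributes once; they are reused per token
  CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
  PartOfSpeechAttribute posAtt = stream.addAttribute(PartOfSpeechAttribute.class);

  stream.reset();
  while (stream.incrementToken()) {
    System.out.println(termAtt.toString() + ": " + posAtt.getPartOfSpeech());
  }
  stream.end();
  stream.close();
}
</pre>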
@@ -606,7 +606,7 @@ API the reader could now write an Attribute and TokenFilter that can specify for
of a sentence or not. Then the PartOfSpeechTaggingFilter can make use of this knowledge and only tag capitalized words
as nouns if they are not the first word of a sentence (we know this is still not correct behavior, but hey, it's a good exercise).
As a small hint, this is how the new Attribute class could begin:
-<pre>
+<pre class="prettyprint">
public class FirstTokenOfSentenceAttributeImpl extends Attribute
implements FirstTokenOfSentenceAttribute {
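Since the text deliberately leaves the rest as an exercise, here is only a hedged hint of how the implementation might continue (the method names are invented; note the class would normally extend AttributeImpl rather than Attribute):
<pre class="prettyprint">
  private boolean firstToken;

  public void setFirstToken(boolean firstToken) { this.firstToken = firstToken; }
  public boolean isFirstToken() { return firstToken; }

  public void clear() { firstToken = false; }
}
</pre>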


@@ -45,7 +45,7 @@ Features:
<p>
Lazy loading of Message Strings
-<pre>
+<pre class="prettyprint">
public class MessagesTestBundle extends NLS {
private static final String BUNDLE_NAME = MessagesTestBundle.class.getName();
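The bundle class is truncated by the hunk; a sketch of the usual NLS pattern follows (the Q0001E key is an assumption, while Q0004E appears later on this page):
<pre class="prettyprint">
  // static String field per message key; NLS resolves them lazily
  // against the MessagesTestBundle.properties resource bundle
  public static String Q0001E_INVALID_SYNTAX;
  public static String Q0004E_INVALID_SYNTAX_ESCAPE_UNICODE_TRUNCATION;

  private MessagesTestBundle() {
    // no instances, static access only
  }

  static {
    // register the bundle and initialize the message fields
    NLS.initializeMessages(BUNDLE_NAME, MessagesTestBundle.class);
  }
}
</pre>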
@@ -85,7 +85,7 @@ Lazy loading of Message Strings
<p>
Normal loading of Message Strings
-<pre>
+<pre class="prettyprint">
String message1 = NLS.getLocalizedMessage(MessagesTestBundle.Q0004E_INVALID_SYNTAX_ESCAPE_UNICODE_TRUNCATION);
String message2 = NLS.getLocalizedMessage(MessagesTestBundle.Q0004E_INVALID_SYNTAX_ESCAPE_UNICODE_TRUNCATION, Locale.JAPANESE);
</pre>


@@ -130,14 +130,14 @@
Using field (byte) values as scores:
<p>
Indexing:
-<pre>
+<pre class="prettyprint">
f = new Field("score", "7", Field.Store.NO, Field.Index.UN_TOKENIZED);
f.setOmitNorms(true);
d1.add(f);
</pre>
<p>
Search:
-<pre>
+<pre class="prettyprint">
Query q = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
</pre>
Document d1 above would get a score of 7.
@@ -148,7 +148,7 @@
<p>
Dividing the original score of each document by the square root of its docid
(just to demonstrate what it takes to manipulate scores this way):
-<pre>
+<pre class="prettyprint">
Query q = queryParser.parse("my query text");
CustomScoreQuery customQ = new CustomScoreQuery(q) {
public float customScore(int doc, float subQueryScore, float valSrcScore) {
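The override is cut off mid-body; it plausibly ends like this (a sketch; note the float cast, since Math.sqrt() returns a double):
<pre class="prettyprint">
    return subQueryScore / (float) Math.sqrt(doc);
  }
};
</pre>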
@@ -158,7 +158,7 @@
</pre>
<p>
For more informative debug info on the custom query, also override the name() method:
-<pre>
+<pre class="prettyprint">
CustomScoreQuery customQ = new CustomScoreQuery(q) {
public float customScore(int doc, float subQueryScore, float valSrcScore) {
return subQueryScore / (float) Math.sqrt(doc);
@@ -171,7 +171,7 @@
<p>
Taking the square root of the original score and multiplying it by a "short field driven score", i.e., the
short value that was indexed for the scored doc in a certain field:
-<pre>
+<pre class="prettyprint">
Query q = queryParser.parse("my query text");
FieldScoreQuery qf = new FieldScoreQuery("shortScore", FieldScoreQuery.Type.SHORT);
CustomScoreQuery customQ = new CustomScoreQuery(q,qf) {


@@ -59,7 +59,7 @@ two starts and ends at the greater of the two ends.
<p>For example, a span query which matches "John Kerry" within ten
words of "George Bush" within the first 100 words of the document
could be constructed with:
-<pre>
+<pre class="prettyprint">
SpanQuery john = new SpanTermQuery(new Term("content", "john"));
SpanQuery kerry = new SpanTermQuery(new Term("content", "kerry"));
SpanQuery george = new SpanTermQuery(new Term("content", "george"));
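The hunk truncates the example after the three SpanTermQuerys; given the johnKerryNearGeorgeBushAtStart name in the next hunk, it plausibly continues along these lines (a sketch):
<pre class="prettyprint">
SpanQuery bush = new SpanTermQuery(new Term("content", "bush"));
// exact phrases: slop 0, in order
SpanQuery johnKerry = new SpanNearQuery(new SpanQuery[] {john, kerry}, 0, true);
SpanQuery georgeBush = new SpanNearQuery(new SpanQuery[] {george, bush}, 0, true);
// the two phrases within ten words of each other, in any order
SpanQuery johnKerryNearGeorgeBush =
    new SpanNearQuery(new SpanQuery[] {johnKerry, georgeBush}, 10, false);
// restricted to the first 100 positions of the document
SpanQuery johnKerryNearGeorgeBushAtStart =
    new SpanFirstQuery(johnKerryNearGeorgeBush, 100);
</pre>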
@@ -82,7 +82,7 @@ SpanQuery johnKerryNearGeorgeBushAtStart =
So, for example, the above query can be restricted to documents which
also use the word "iraq" with:
-<pre>
+<pre class="prettyprint">
Query query = new BooleanQuery();
query.add(johnKerryNearGeorgeBushAtStart, true, false);
query.add(new TermQuery(new Term("content", "iraq")), true, false);


@@ -52,7 +52,7 @@
<h2>Example Usages</h2>
<h3>Farsi Range Queries</h3>
-<code><pre>
+<pre class="prettyprint">
// "fa" Locale is not supported by Sun JDK 1.4 or 1.5
Collator collator = Collator.getInstance(new Locale("ar"));
CollationKeyAnalyzer analyzer = new CollationKeyAnalyzer(Version.LUCENE_40, collator);
@@ -76,10 +76,10 @@
ScoreDoc[] result
= is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
assertEquals("The index Term should not be included.", 0, result.length);
-</pre></code>
+</pre>
<h3>Danish Sorting</h3>
-<code><pre>
+<pre class="prettyprint">
Analyzer analyzer
= new CollationKeyAnalyzer(Version.LUCENE_40, Collator.getInstance(new Locale("da", "dk")));
RAMDirectory indexStore = new RAMDirectory();
@@ -103,10 +103,10 @@
Document doc = searcher.doc(result[i].doc);
assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
}
-</pre></code>
+</pre>
<h3>Turkish Case Normalization</h3>
-<code><pre>
+<pre class="prettyprint">
Collator collator = Collator.getInstance(new Locale("tr", "TR"));
collator.setStrength(Collator.PRIMARY);
Analyzer analyzer = new CollationKeyAnalyzer(Version.LUCENE_40, collator);
@@ -121,7 +121,7 @@
Query query = parser.parse("d\u0131gy"); // U+0131: dotless i
ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
assertEquals("The index Term should be included.", 1, result.length);
-</pre></code>
+</pre>
<h2>Caveats and Comparisons</h2>
<p>


@@ -66,12 +66,12 @@ algorithm.
</ul>
<h2>Example Usages</h2>
<h3>Tokenizing multilanguage text</h3>
-<code><pre>
+<pre class="prettyprint">
/**
 * This tokenizer will work well in general for most languages.
 */
Tokenizer tokenizer = new ICUTokenizer(reader);
-</pre></code>
+</pre>
<hr/>
<h1><a name="collation">Collation</a></h1>
<p>
@@ -111,7 +111,7 @@ algorithm.
<h2>Example Usages</h2>
<h3>Farsi Range Queries</h3>
-<code><pre>
+<pre class="prettyprint">
Collator collator = Collator.getInstance(new ULocale("ar"));
ICUCollationKeyAnalyzer analyzer = new ICUCollationKeyAnalyzer(Version.LUCENE_40, collator);
RAMDirectory ramDir = new RAMDirectory();
@@ -134,10 +134,10 @@ algorithm.
ScoreDoc[] result
= is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
assertEquals("The index Term should not be included.", 0, result.length);
-</pre></code>
+</pre>
<h3>Danish Sorting</h3>
-<code><pre>
+<pre class="prettyprint">
Analyzer analyzer
= new ICUCollationKeyAnalyzer(Version.LUCENE_40, Collator.getInstance(new ULocale("da", "dk")));
RAMDirectory indexStore = new RAMDirectory();
@@ -161,10 +161,10 @@ algorithm.
Document doc = searcher.doc(result[i].doc);
assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
}
-</pre></code>
+</pre>
<h3>Turkish Case Normalization</h3>
-<code><pre>
+<pre class="prettyprint">
Collator collator = Collator.getInstance(new ULocale("tr", "TR"));
collator.setStrength(Collator.PRIMARY);
Analyzer analyzer = new ICUCollationKeyAnalyzer(Version.LUCENE_40, collator);
@@ -179,7 +179,7 @@ algorithm.
Query query = parser.parse("d\u0131gy"); // U+0131: dotless i
ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
assertEquals("The index Term should be included.", 1, result.length);
-</pre></code>
+</pre>
<h2>Caveats and Comparisons</h2>
<p>
@@ -239,7 +239,7 @@ algorithm.
</ul>
<h2>Example Usages</h2>
<h3>Normalizing text to NFC</h3>
-<code><pre>
+<pre class="prettyprint">
/**
* Normalizer2 objects are unmodifiable and immutable.
*/
@ -248,7 +248,7 @@ algorithm.
* This filter will normalize to NFC.
*/
TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
-</pre></code>
+</pre>
<hr/>
<h1><a name="casefolding">Case Folding</a></h1>
<p>
@@ -278,12 +278,12 @@ this integration. To perform case-folding, you use normalization with the form
</ul>
<h2>Example Usages</h2>
<h3>Lowercasing text</h3>
-<code><pre>
+<pre class="prettyprint">
/**
* This filter will case-fold and normalize to NFKC.
*/
TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
-</pre></code>
+</pre>
<hr/>
<h1><a name="searchfolding">Search Term Folding</a></h1>
<p>
@@ -305,13 +305,13 @@ many character foldings recursively.
</ul>
<h2>Example Usages</h2>
<h3>Removing accents</h3>
-<code><pre>
+<pre class="prettyprint">
/**
* This filter will case-fold, remove accents and other distinctions, and
* normalize to NFKC.
*/
TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
-</pre></code>
+</pre>
<hr/>
<h1><a name="transform">Text Transformation</a></h1>
<p>
@@ -335,19 +335,19 @@ and
</ul>
<h2>Example Usages</h2>
<h3>Convert Traditional to Simplified</h3>
-<code><pre>
+<pre class="prettyprint">
/**
* This filter will map Traditional Chinese to Simplified Chinese
*/
TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Traditional-Simplified"));
-</pre></code>
+</pre>
<h3>Transliterate Serbian Cyrillic to Serbian Latin</h3>
-<code><pre>
+<pre class="prettyprint">
/**
* This filter will map Serbian Cyrillic to Serbian Latin according to BGN rules
*/
TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Serbian-Latin/BGN"));
-</pre></code>
+</pre>
<hr/>
<h1><a name="backcompat">Backwards Compatibility</a></h1>
<p>
@@ -359,7 +359,7 @@ a specific Unicode Version by using a {@link com.ibm.icu.text.FilteredNormalizer
</p>
<h2>Example Usages</h2>
<h3>Restricting normalization to Unicode 5.0</h3>
-<code><pre>
+<pre class="prettyprint">
/**
* This filter will do NFC normalization, but will ignore any characters that
* did not exist as of Unicode 5.0. Because of the normalization stability policy
@@ -371,6 +371,6 @@ a specific Unicode Version by using a {@link com.ibm.icu.text.FilteredNormalizer
set.freeze();
FilteredNormalizer2 unicode50 = new FilteredNormalizer2(normalizer, set);
TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, unicode50);
-</pre></code>
+</pre>
</body>
</html>