<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
|
|
|
|
|
|
|
|
<!-- Content Stylesheet for Site -->
|
|
|
|
|
|
|
|
|
|
|
|
<!-- start the processing -->
|
|
|
|
<!-- ====================================================================== -->
|
|
|
|
<!-- Main Page Section -->
|
|
|
|
<!-- ====================================================================== -->
|
|
|
|
<html>
|
|
|
|
<head>
|
|
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
|
|
|
|
|
|
|
|
<meta name="author" content="Andrew C. Oliver">
|
|
|
|
<meta name="email" content="acoliver@apache.org">
<title>Jakarta Lucene - Basic Demo Sources Walkthrough</title>
|
|
|
|
</head>
|
|
|
|
|
|
|
|
<body bgcolor="#ffffff" text="#000000" link="#525D76">
|
|
|
|
<table border="0" width="100%" cellspacing="0">
|
|
|
|
<!-- TOP IMAGE -->
|
|
|
|
<tr>
|
|
|
|
<td align="left">
|
|
|
|
<a href="http://jakarta.apache.org"><img src="http://jakarta.apache.org/images/jakarta-logo.gif" border="0"/></a>
|
|
|
|
</td>
|
|
|
|
<td align="right">
|
|
|
|
<a href="http://jakarta.apache.org/lucene/"><img src="./images/lucene_green_300.gif" alt="Jakarta Lucene" border="0"/></a>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
</table>
|
|
|
|
<table border="0" width="100%" cellspacing="4">
|
|
|
|
<tr><td colspan="2">
|
|
|
|
<hr noshade="" size="1"/>
|
|
|
|
</td></tr>
|
|
|
|
|
|
|
|
<tr>
|
|
|
|
<!-- LEFT SIDE NAVIGATION -->
|
|
|
|
<td width="20%" valign="top" nowrap="true">
|
|
|
|
<p><strong>About</strong></p>
|
|
|
|
<ul>
|
|
|
|
<li> <a href="./index.html">Overview</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="./powered.html">Powered by Lucene</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="./whoweare.html">Who We Are</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="http://jakarta.apache.org/site/mail.html">Mailing Lists</a>
|
|
|
|
</li>
|
|
|
|
</ul>
|
|
|
|
<p><strong>Resources</strong></p>
|
|
|
|
<ul>
|
|
|
|
<li> <a href="http://www.lucene.com/cgi-bin/faq/faqmanager.cgi">FAQ (Official)</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="./gettingstarted.html">Getting Started</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="http://www.jguru.com/faq/Lucene">JGuru FAQ</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="http://jakarta.apache.org/site/bugs.html">Bugs</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="http://nagoya.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=%27Importance%27">Lucene Bugs</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="./resources.html">Articles</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="./api/index.html">Javadoc</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="./contributions.html">Contributions</a>
</li>
|
|
|
|
</ul>
|
|
|
|
<p><strong>Plans</strong></p>
|
|
|
|
<ul>
|
|
|
|
<li> <a href="./luceneplan.html">Application Extensions</a>
</li>
|
|
|
|
</ul>
|
|
|
|
<p><strong>Download</strong></p>
|
|
|
|
<ul>
|
|
|
|
<li> <a href="http://jakarta.apache.org/site/binindex.html">Binaries</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="http://jakarta.apache.org/site/sourceindex.html">Source Code</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="http://jakarta.apache.org/site/cvsindex.html">CVS Repositories</a>
|
|
|
|
</li>
|
|
|
|
</ul>
|
|
|
|
<p><strong>Jakarta</strong></p>
|
|
|
|
<ul>
|
|
|
|
<li> <a href="http://jakarta.apache.org/site/getinvolved.html">Get Involved</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="http://jakarta.apache.org/site/acknowledgements.html">Acknowledgements</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="http://jakarta.apache.org/site/contact.html">Contact</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="http://jakarta.apache.org/site/legal.html">Legal</a>
|
|
|
|
</li>
|
|
|
|
</ul>
|
|
|
|
</td>
|
|
|
|
<td width="80%" align="left" valign="top">
|
|
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
|
|
<tr><td bgcolor="#525D76">
|
|
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
|
|
<a name="About the Code"><strong>About the Code</strong></a>
|
|
|
|
</font>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td>
|
|
|
|
<blockquote>
|
|
|
|
<p>
|
|
|
|
In this section we walk through the sources behind the basic Lucene demo: where to
find them, what their parts are, and what each part does. This section is intended for Java developers
who want to understand how to use Jakarta Lucene in their applications.
|
|
|
|
</p>
|
|
|
|
</blockquote>
|
|
|
|
</p>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td><br/></td></tr>
|
|
|
|
</table>
|
|
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
|
|
<tr><td bgcolor="#525D76">
|
|
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
|
|
<a name="Location of the source"><strong>Location of the source</strong></a>
|
|
|
|
</font>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td>
|
|
|
|
<blockquote>
|
|
|
|
<p>
|
|
|
|
Relative to the directory created when you extracted Lucene or retrieved it from CVS, you
should see a directory called "src", which in turn contains a directory called "demo".
This is the root for all of the Lucene demos. Under this directory is org/apache/lucene/demo,
which is where all the Java sources live.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
Within this directory you should see the IndexFiles class we executed earlier. Bring that
up in vi or your alternative text editor and let's take a look at it.
|
|
|
|
</p>
|
|
|
|
</blockquote>
|
|
|
|
</p>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td><br/></td></tr>
|
|
|
|
</table>
|
|
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
|
|
<tr><td bgcolor="#525D76">
|
|
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
|
|
<a name="IndexFiles"><strong>IndexFiles</strong></a>
|
|
|
|
</font>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td>
|
|
|
|
<blockquote>
|
|
|
|
<p>
|
|
|
|
As we discussed in the previous walkthrough, the IndexFiles class creates a Lucene index.
Let's take a look at how it does this.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The first substantial thing the main function does is create an instance
of IndexWriter. It passes a string called "index" and a new instance of a class called
"StandardAnalyzer". The "index" string is the name of the directory in which all index information
is stored. Because no path information is passed, the directory is created as a subdirectory
of the current directory (if it does not already exist). On
some platforms it may actually be created in another directory (such as
the user's home directory).
|
|
|
|
</p>
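<p>
To see where a bare name such as "index" would end up on your platform, a small
check like the following can help. It is only a sketch that uses the standard
java.io.File class; the class name IndexPathSketch is made up for this illustration and is
not part of the demo.
</p>

```java
import java.io.File;

// Sketch (not part of the Lucene demo): shows how the JVM resolves a
// relative directory name such as "index" against the working directory.
class IndexPathSketch {

    // Returns the absolute location that a relative name refers to.
    static String resolve(String relativeName) {
        return new File(relativeName).getAbsolutePath();
    }

    public static void main(String[] args) {
        System.out.println("working directory:   " + System.getProperty("user.dir"));
        System.out.println("\"index\" resolves to: " + resolve("index"));
    }
}
```

<p>
Running it from the directory in which you intend to run IndexFiles prints the location the
relative name refers to.
</p>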
|
|
|
|
<p>
|
|
|
|
The <b>IndexWriter</b> is the main class responsible for creating indices. To use it you
must instantiate it with a path that it can write the index into. If this path does not
exist, it will create it; otherwise it will refresh the index living at that path. You
must also pass an instance of <b>org.apache.lucene.analysis.Analyzer</b>.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The <b>Analyzer</b>, in this case the <b>StandardAnalyzer</b>, is little more than a standard Java
tokenizer that converts all strings to lowercase and filters useless words out of the index.
By useless words I mean common language words such as articles (a, an, the) and other words that
contribute little to a search. Note that every language has different rules,
and you should use the proper analyzer for each. Lucene currently provides Analyzers
for English and German.
|
|
|
|
</p>
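<p>
The idea of analysis described above can be sketched in a few lines of plain Java. This is
a toy written for this walkthrough, not Lucene's implementation: the class name ToyAnalyzer,
the three-word stop list and the split-on-non-alphanumerics rule are all assumptions
chosen to keep the example short.
</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration of what an analyzer does: tokenize, lowercase,
// and drop stop words. This is NOT how Lucene implements it.
class ToyAnalyzer {

    // Hypothetical stop list; a real analyzer ships a much longer one.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Fox and an Apple"));
    }
}
```

<p>
A different language would need a different stop list and possibly different tokenizing
rules, which is why the analyzer is passed in rather than hard-wired.
</p>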
|
|
|
|
<p>
|
|
|
|
Looking further down in the file, you should see the indexDocs() code. This recursive function
simply crawls the directories and uses FileDocument to create Document objects. The Document
is simply a data object that represents the content of the file as well as its last-modified time and
location. These instances are added to the IndexWriter. Take a look inside FileDocument. It's
not particularly complicated; it just adds fields to the Document.
|
|
|
|
</p>
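<p>
The shape of such a recursive crawl can be sketched without Lucene at all. In the
sketch below a plain Map stands in for the Document, and the field names "path" and
"modified" are placeholders chosen for this illustration; the class and method names are
likewise made up and are not the demo's own code.
</p>

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a recursive directory crawl. A Map of named fields stands in
// for a Lucene Document; nothing here uses the Lucene API.
class CrawlSketch {

    // Builds the stand-in "document": location and last-modified time.
    static Map<String, String> describe(File file) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("path", file.getPath());
        fields.put("modified", Long.toString(file.lastModified()));
        return fields;
    }

    // Descends into directories and records one entry per regular file.
    static void crawl(File file, List<Map<String, String>> out) {
        if (file.isDirectory()) {
            String[] names = file.list();
            if (names == null) {
                return; // directory could not be read
            }
            for (String name : names) {
                crawl(new File(file, name), out);
            }
        } else if (file.isFile()) {
            out.add(describe(file));
        }
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        crawl(new File(args.length > 0 ? args[0] : "."), docs);
        for (Map<String, String> doc : docs) {
            System.out.println(doc);
        }
    }
}
```

<p>
In the real demo the place where this sketch appends to a list is where each Document is
handed to the IndexWriter.
</p>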
|
|
|
|
<p>
|
|
|
|
As you can see there isn't much to creating an index. The devil is in the details. You may also
|
|
|
|
wish to examine the other samples in this directory, particularly the IndexHTML class. It is
|
|
|
|
a bit more complex but builds upon this example.
|
|
|
|
</p>
|
|
|
|
</blockquote>
|
|
|
|
</p>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td><br/></td></tr>
|
|
|
|
</table>
|
|
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
|
|
<tr><td bgcolor="#525D76">
|
|
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
|
|
<a name="Searching Files"><strong>Searching Files</strong></a>
|
|
|
|
</font>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td>
|
|
|
|
<blockquote>
|
|
|
|
<p>
|
|
|
|
The SearchFiles class is quite simple. It primarily collaborates with an IndexSearcher, a StandardAnalyzer
(which is used in the IndexFiles class as well) and a QueryParser. The query parser is constructed
with an analyzer so that it interprets your query in the same way the index was interpreted: finding
the ends of words and removing useless words like 'a', 'an' and 'the'. The Query object holds the
output of the QueryParser and is passed to the searcher. The searcher returns its results in
a collection of Documents called "Hits", which is then iterated through and displayed to the user.
|
|
|
|
</p>
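<p>
Why the query must be analyzed the same way as the index can be shown with a toy
in-memory index. Everything in this sketch is an assumption made for illustration: the
class ToySearch, its stop list, and the rule that a document must contain every query
term are simplifications, and none of it is Lucene's API or ranking behaviour.
</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: the same analyze() step is applied to documents
// and to queries, so both sides agree on what a "term" is.
class ToySearch {

    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the"));

    // Lowercases, splits on non-alphanumerics, and drops stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    // term -> names of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<>();

    void add(String name, String contents) {
        for (String term : analyze(contents)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(name);
        }
    }

    // Returns the documents that contain every analyzed query term.
    Set<String> search(String queryText) {
        Set<String> hits = null;
        for (String term : analyze(queryText)) {
            Set<String> docs = postings.getOrDefault(term, Collections.<String>emptySet());
            if (hits == null) {
                hits = new TreeSet<>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        return hits == null ? new TreeSet<String>() : hits;
    }

    public static void main(String[] args) {
        ToySearch index = new ToySearch();
        index.add("a.txt", "The quick brown fox");
        index.add("b.txt", "A lazy dog");
        System.out.println(index.search("the FOX"));
    }
}
```

<p>
If the query were not analyzed, "the" and "FOX" would be looked up verbatim and neither would
match anything in the index, because the index only ever saw lowercase terms with the stop
words removed.
</p>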
|
|
|
|
</blockquote>
|
|
|
|
</p>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td><br/></td></tr>
|
|
|
|
</table>
|
|
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
|
|
<tr><td bgcolor="#525D76">
|
|
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
|
|
<a name="The Web example..."><strong>The Web example...</strong></a>
|
|
|
|
</font>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td>
|
|
|
|
<blockquote>
|
|
|
|
<p>
|
|
|
|
<a href="demo3.html">read on &gt;&gt;&gt;</a>
|
|
|
|
</p>
|
|
|
|
</blockquote>
|
|
|
|
</p>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td><br/></td></tr>
|
|
|
|
</table>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
|
|
|
|
<!-- FOOTER -->
|
|
|
|
<tr><td colspan="2">
|
|
|
|
<hr noshade="" size="1"/>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td colspan="2">
|
|
|
|
<div align="center"><font color="#525D76" size="-1"><em>
|
|
|
|
Copyright © 1999-2002, Apache Software Foundation
|
|
|
|
</em></font></div>
|
|
|
|
</td></tr>
|
|
|
|
</table>
|
|
|
|
</body>
|
|
|
|
</html>
|
|
|
|
<!-- end the processing -->