2002-01-26 11:38:28 -05:00
|
|
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
|
|
|
|
|
2004-03-04 09:41:28 -05:00
|
|
|
<!--
|
|
|
|
Copyright 1999-2004 The Apache Software Foundation
|
|
|
|
Licensed under the Apache License, Version 2.0 (the "License");
|
|
|
|
you may not use this file except in compliance with the License.
|
|
|
|
You may obtain a copy of the License at
|
|
|
|
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
|
|
|
|
Unless required by applicable law or agreed to in writing, software
|
|
|
|
distributed under the License is distributed on an "AS IS" BASIS,
|
|
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
|
|
See the License for the specific language governing permissions and
|
|
|
|
limitations under the License.
|
|
|
|
-->
|
|
|
|
|
|
|
|
|
2002-01-26 11:38:28 -05:00
|
|
|
<!-- Content Stylesheet for Site -->
|
|
|
|
|
|
|
|
|
|
|
|
<!-- start the processing -->
|
|
|
|
<!-- ====================================================================== -->
|
2002-12-12 01:23:48 -05:00
|
|
|
<!-- GENERATED FILE, DO NOT EDIT, EDIT THE XML FILE IN xdocs INSTEAD! -->
|
2002-01-26 11:38:28 -05:00
|
|
|
<!-- Main Page Section -->
|
|
|
|
<!-- ====================================================================== -->
|
|
|
|
<html>
|
|
|
|
<head>
|
|
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
|
|
|
|
|
|
|
|
<meta name="author" value="Andrew C. Oliver">
|
|
|
|
<meta name="email" value="acoliver@apache.org">
|
|
|
|
|
2002-02-11 14:45:24 -05:00
|
|
|
|
|
|
|
|
2003-01-04 11:29:08 -05:00
|
|
|
|
2005-02-14 11:48:47 -05:00
|
|
|
<title>Apache Lucene - Apache Lucene - Basic Demo Sources Walkthrough</title>
|
2002-01-26 11:38:28 -05:00
|
|
|
</head>
|
|
|
|
|
|
|
|
<body bgcolor="#ffffff" text="#000000" link="#525D76">
|
|
|
|
<table border="0" width="100%" cellspacing="0">
|
|
|
|
<!-- TOP IMAGE -->
|
|
|
|
<tr>
|
|
|
|
<td align="left">
|
2005-05-02 14:50:45 -04:00
|
|
|
<a href="http://www.apache.org"><img src="http://lucene.apache.org/java/docs/images/asf-logo.gif" width="387" height="100" border="0"/></a>
|
2002-01-26 11:38:28 -05:00
|
|
|
</td>
|
|
|
|
<td align="right">
|
2005-02-14 11:48:47 -05:00
|
|
|
<a href="http://lucene.apache.org/"><img src="./images/lucene_green_300.gif" alt="Apache Lucene" border="0"/></a>
|
2002-01-26 11:38:28 -05:00
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
</table>
|
|
|
|
<table border="0" width="100%" cellspacing="4">
|
|
|
|
<tr><td colspan="2">
|
|
|
|
<hr noshade="" size="1"/>
|
|
|
|
</td></tr>
|
|
|
|
|
|
|
|
<tr>
|
|
|
|
<!-- LEFT SIDE NAVIGATION -->
|
|
|
|
<td width="20%" valign="top" nowrap="true">
|
2003-10-09 10:40:30 -04:00
|
|
|
|
|
|
|
<!-- ============================================================ -->
|
|
|
|
|
|
|
|
<p><strong>About</strong></p>
|
2002-01-26 11:38:28 -05:00
|
|
|
<ul>
|
|
|
|
<li> <a href="./index.html">Overview</a>
|
2005-05-14 19:05:04 -04:00
|
|
|
</li>
|
|
|
|
<li> <a href="./features.html">Features</a>
|
2002-01-26 11:38:28 -05:00
|
|
|
</li>
|
2004-05-18 09:32:01 -04:00
|
|
|
<li> <a href="http://wiki.apache.org/jakarta-lucene/PoweredBy">Powered by Lucene</a>
|
2002-01-26 11:38:28 -05:00
|
|
|
</li>
|
|
|
|
<li> <a href="./whoweare.html">Who We Are</a>
|
|
|
|
</li>
|
2005-03-26 07:58:35 -05:00
|
|
|
<li> <a href="./mailinglists.html">Mailing Lists</a>
|
2002-01-26 11:38:28 -05:00
|
|
|
</li>
|
|
|
|
</ul>
|
|
|
|
<p><strong>Resources</strong></p>
|
|
|
|
<ul>
|
2004-02-28 21:43:36 -05:00
|
|
|
<li> <a href="http://wiki.apache.org/jakarta-lucene">Wiki</a>
|
|
|
|
</li>
|
2004-12-30 16:50:43 -05:00
|
|
|
<li> <a href="http://wiki.apache.org/jakarta-lucene/LuceneFAQ">FAQ</a>
|
2002-06-20 10:23:48 -04:00
|
|
|
</li>
|
|
|
|
<li> <a href="./gettingstarted.html">Getting Started</a>
|
2002-05-16 01:16:10 -04:00
|
|
|
</li>
|
|
|
|
<li> <a href="./queryparsersyntax.html">Query Syntax</a>
|
2002-10-29 23:14:11 -05:00
|
|
|
</li>
|
|
|
|
<li> <a href="./fileformats.html">File Formats</a>
|
2002-01-26 11:38:28 -05:00
|
|
|
</li>
|
|
|
|
<li> <a href="./api/index.html">Javadoc</a>
|
|
|
|
</li>
|
|
|
|
<li> <a href="./contributions.html">Contributions</a>
|
2002-12-04 00:56:33 -05:00
|
|
|
</li>
|
|
|
|
<li> <a href="./benchmarks.html">Benchmarks</a>
|
2002-11-29 16:23:47 -05:00
|
|
|
</li>
|
2004-11-15 14:00:49 -05:00
|
|
|
<li> <a href="http://issues.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=%5BPATCH%5D&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=Importance">Patches</a>
|
2002-11-07 15:18:09 -05:00
|
|
|
</li>
|
2004-11-15 14:00:49 -05:00
|
|
|
<li> <a href="http://issues.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=Importance">Lucene Bugs</a>
|
2002-12-04 00:56:33 -05:00
|
|
|
</li>
|
|
|
|
<li> <a href="./lucene-sandbox/">Lucene Sandbox</a>
|
2002-01-26 11:38:28 -05:00
|
|
|
</li>
|
|
|
|
</ul>
|
|
|
|
<p><strong>Download</strong></p>
|
|
|
|
<ul>
|
2004-11-29 08:34:57 -05:00
|
|
|
<li> <a href="http://www.apache.org/dyn/closer.cgi/jakarta/lucene/binaries/">Binaries</a>
|
2002-01-26 11:38:28 -05:00
|
|
|
</li>
|
2004-11-29 08:34:57 -05:00
|
|
|
<li> <a href="http://www.apache.org/dyn/closer.cgi/jakarta/lucene/source/">Source Code</a>
|
2002-01-26 11:38:28 -05:00
|
|
|
</li>
|
2005-03-11 22:13:09 -05:00
|
|
|
<li> <a href="http://svn.apache.org/viewcvs.cgi/lucene/">Source Repository</a>
|
2002-01-26 11:38:28 -05:00
|
|
|
</li>
|
|
|
|
</ul>
|
|
|
|
</td>
|
|
|
|
<td width="80%" align="left" valign="top">
|
|
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
|
|
<tr><td bgcolor="#525D76">
|
|
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
|
|
<a name="About the Code"><strong>About the Code</strong></a>
|
|
|
|
</font>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td>
|
|
|
|
<blockquote>
|
|
|
|
<p>
|
|
|
|
In this section we walk through the sources behind the basic Lucene demo such as where to
|
|
|
|
find it, its parts and their function. This section is intended for Java developers
|
2005-02-14 11:48:47 -05:00
|
|
|
wishing to understand how to use Apache Lucene in their applications.
|
2002-01-26 11:38:28 -05:00
|
|
|
</p>
|
|
|
|
</blockquote>
|
|
|
|
</p>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td><br/></td></tr>
|
|
|
|
</table>
|
|
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
|
|
<tr><td bgcolor="#525D76">
|
|
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
|
|
<a name="Location of the source"><strong>Location of the source</strong></a>
|
|
|
|
</font>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td>
|
|
|
|
<blockquote>
|
|
|
|
<p>
|
2005-02-14 11:48:47 -05:00
|
|
|
Relative to the directory created when you extracted Lucene or retreived it from Subversion, you
|
2002-01-26 11:38:28 -05:00
|
|
|
should see a directory called "src" which in turn contains a directory called "demo".
|
|
|
|
This is the root for all of the Lucene demos. Under this directory is org/apache/lucene/demo,
|
|
|
|
this is where all the Java sources live.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
Within this directory you should see the IndexFiles class we executed earlier. Bring that
|
|
|
|
up in vi or your alternative text editor and lets take a look at it.
|
|
|
|
</p>
|
|
|
|
</blockquote>
|
|
|
|
</p>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td><br/></td></tr>
|
|
|
|
</table>
|
|
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
|
|
<tr><td bgcolor="#525D76">
|
|
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
|
|
<a name="IndexFiles"><strong>IndexFiles</strong></a>
|
|
|
|
</font>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td>
|
|
|
|
<blockquote>
|
|
|
|
<p>
|
|
|
|
As we discussed in the previous walkthrough, the IndexFiles class creates a Lucene Index.
|
|
|
|
Lets take a look at how it does this.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The first substantial thing the main function does is instantiate an instance
|
|
|
|
of IndexWriter. It passes a string called "index" and a new instance of a class called
|
|
|
|
"StandardAnalyzer". The "index" string is the name of the directory that all index information
|
|
|
|
should be stored in. Because we're not passing any path information, one must assume this
|
2004-10-30 08:16:09 -04:00
|
|
|
will be created as a subdirectory of the current directory (if it does not already exist). On
|
2002-01-26 11:38:28 -05:00
|
|
|
some platforms this may actually result in it being created in other directories (such as
|
|
|
|
the user's home directory).
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
The <b>IndexWriter</b> is the main class responsible for creating indicies. To use it you
|
|
|
|
must instantiate it with a path that it can write the index into, if this path does not
|
|
|
|
exist it will create it, otherwise it will refresh the index living at that path. You
|
2005-04-22 00:31:17 -04:00
|
|
|
must a also pass an instance of <b>org.apache.lucene.analysis.Analyzer</b>.
|
2002-01-26 11:38:28 -05:00
|
|
|
</p>
|
|
|
|
<p>
|
2004-10-30 08:16:09 -04:00
|
|
|
The <b>Analyzer</b>, in this case, the <b>StandardAnalyzer</b> is little more than a standard Java
|
|
|
|
Tokenizer, converting all strings to lowercase and filtering out useless words and characters from the index.
|
|
|
|
By useless words and characters I mean common language words such as articles (a, an, the, etc.) and other
|
|
|
|
strings that would be useless for searching (e.g. <b>'s</b>) . It should be noted that there are different
|
|
|
|
rules for every language, and you should use the proper analyzer for each. Lucene currently
|
|
|
|
provides Analyzers for English and German, more can be found in the Lucene Sandbox.
|
2002-01-26 11:38:28 -05:00
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
Looking down further in the file, you should see the indexDocs() code. This recursive function
|
|
|
|
simply crawls the directories and uses FileDocument to create Document objects. The Document
|
|
|
|
is simply a data object to represent the content in the file as well as its creation time and
|
2004-10-30 08:16:09 -04:00
|
|
|
location. These instances are added to the indexWriter. Take a look inside FileDocument. It's
|
2002-01-26 11:38:28 -05:00
|
|
|
not particularly complicated, it just adds fields to the Document.
|
|
|
|
</p>
|
|
|
|
<p>
|
|
|
|
As you can see there isn't much to creating an index. The devil is in the details. You may also
|
|
|
|
wish to examine the other samples in this directory, particularly the IndexHTML class. It is
|
|
|
|
a bit more complex but builds upon this example.
|
|
|
|
</p>
|
|
|
|
</blockquote>
|
|
|
|
</p>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td><br/></td></tr>
|
|
|
|
</table>
|
|
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
|
|
<tr><td bgcolor="#525D76">
|
|
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
|
|
<a name="Searching Files"><strong>Searching Files</strong></a>
|
|
|
|
</font>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td>
|
|
|
|
<blockquote>
|
|
|
|
<p>
|
|
|
|
The SearchFiles class is quite simple. It primarily collaborates with an IndexSearcher, StandardAnalyzer
|
|
|
|
(which is used in the IndexFiles class as well) and a QueryParser. The query parser is constructed
|
|
|
|
with an analyzer used to interperate your query in the same way the Index was interperated: finding
|
|
|
|
the end of words and removing useless words like 'a', 'an' and 'the'. The Query object contains the
|
|
|
|
results from the QueryParser which is passed to the searcher. The searcher results are returned in
|
|
|
|
a collection of Documents called "Hits" which is then iterated through and displayed to the user.
|
|
|
|
</p>
|
|
|
|
</blockquote>
|
|
|
|
</p>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td><br/></td></tr>
|
|
|
|
</table>
|
|
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
|
|
<tr><td bgcolor="#525D76">
|
|
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
|
|
<a name="The Web example..."><strong>The Web example...</strong></a>
|
|
|
|
</font>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td>
|
|
|
|
<blockquote>
|
|
|
|
<p>
|
|
|
|
<a href="demo3.html">read on>>></a>
|
|
|
|
</p>
|
|
|
|
</blockquote>
|
|
|
|
</p>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td><br/></td></tr>
|
|
|
|
</table>
|
|
|
|
</td>
|
|
|
|
</tr>
|
|
|
|
|
|
|
|
<!-- FOOTER -->
|
|
|
|
<tr><td colspan="2">
|
|
|
|
<hr noshade="" size="1"/>
|
|
|
|
</td></tr>
|
|
|
|
<tr><td colspan="2">
|
|
|
|
<div align="center"><font color="#525D76" size="-1"><em>
|
2005-01-18 08:57:32 -05:00
|
|
|
Copyright © 1999-2005, The Apache Software Foundation
|
2002-01-26 11:38:28 -05:00
|
|
|
</em></font></div>
|
|
|
|
</td></tr>
|
|
|
|
</table>
|
|
|
|
</body>
|
|
|
|
</html>
|
|
|
|
<!-- end the processing -->
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
2002-02-11 14:45:24 -05:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|