As we discussed in the previous walk-through, the IndexFiles class creates a Lucene index. Let's take a look at how it does this.
The first substantial thing the main function does is instantiate IndexWriter. It passes the string "index" and a new instance of a class called StandardAnalyzer.
The "index" string is the name of the filesystem directory where all index information should be stored. Because we're not passing a full path, the directory is created as a subdirectory of the current working directory (if it does not already exist). On some platforms, it may be created in another directory instead (such as the user's home directory).
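
To check where a relative name such as "index" will actually land on a given platform, it can be resolved against the working directory before any index is written. The following is a minimal sketch in plain Java; the class name IndexDirSketch and the resolve helper are hypothetical and are not part of the Lucene demo.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the Lucene demo sources.
public class IndexDirSketch {

    // Resolve a possibly relative directory name against the JVM's
    // current working directory and clean up any "." or ".." segments.
    static Path resolve(String name) {
        return Paths.get(name).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        Path indexDir = resolve("index");
        // Report the absolute location and whether the directory is already there.
        System.out.println(indexDir + " exists: " + Files.isDirectory(indexDir));
    }
}
```

Running this from the directory where you plan to start IndexFiles shows the absolute path that the relative "index" name maps to.
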
The IndexWriter is the main class responsible for creating indices. To use it, you must instantiate it with a path that it can write the index into. If this path does not exist, it will first create it. Otherwise it will refresh the index at that path. You can also create an index using one of the subclasses of Directory. In any case, you must also pass an instance of org.apache.lucene.analysis.Analyzer.
The particular Analyzer we are using, StandardAnalyzer, is little more than a standard Java tokenizer that converts all strings to lowercase and filters useless words and characters out of the index. By useless words and characters I mean common language words such as articles (a, an, the, etc.) and other strings that would be useless for searching (e.g. 's). Note that the rules differ from language to language, and you should use the proper analyzer for each. Lucene currently provides Analyzers for a number of different languages (see the *Analyzer.java sources under contrib/analyzers/src/java/org/apache/lucene/analysis).
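
To make the lowercase-and-filter idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene API and is not how StandardAnalyzer is implemented; the class name ToyAnalyzer, the regular expression, and the short stop-word list are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative stand-in only; Lucene's StandardAnalyzer is more sophisticated.
public class ToyAnalyzer {

    // Hypothetical stop-word list, not Lucene's actual list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "and", "of"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Lowercase first, then split on anything that is not a letter, digit, or apostrophe.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            // Drop a trailing possessive 's, which adds nothing for searching.
            String token = raw.endsWith("'s") ? raw.substring(0, raw.length() - 2) : raw;
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [analyzer, job, index, document]
        System.out.println(analyze("The Analyzer's job: index a Document"));
    }
}
```

Because the stop-word list and the tokenizing rule are both English-specific here, the sketch also shows why a different language needs its own analyzer.
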
Looking further down in the file, you should see the indexDocs() code. This recursive function simply crawls the directories and uses FileDocument to create Document objects. The Document is simply a data object to represent the content in the file as well as its creation time and location. These instances are added to the indexWriter. Take a look inside FileDocument. It's not particularly complicated. It just adds fields to the Document.
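
The shape of that recursion can be sketched without Lucene at all. In the plain-Java sketch below, a Map of field names to values stands in for a Document, and a List stands in for the index writer; the class name CrawlSketch, the method names, and the field names are assumptions for illustration, and the timestamp used is the last-modified time because that is what java.io.File exposes directly.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for the demo's indexDocs()/FileDocument pair; not Lucene code.
public class CrawlSketch {

    // Stand-in for FileDocument: describe one file as a set of named fields.
    static Map<String, String> toRecord(File file) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("path", file.getPath());
        fields.put("modified", Long.toString(file.lastModified()));
        return fields;
    }

    // Stand-in for indexDocs(): recurse into directories, add one record per readable file.
    static void collect(File file, List<Map<String, String>> out) {
        if (!file.canRead()) {
            return; // skip anything we cannot read
        }
        if (file.isDirectory()) {
            File[] children = file.listFiles();
            if (children != null) { // listFiles() returns null on an I/O error
                for (File child : children) {
                    collect(child, out);
                }
            }
        } else {
            out.add(toRecord(file));
        }
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = new ArrayList<>();
        collect(new File(args.length > 0 ? args[0] : "."), records);
        System.out.println(records.size() + " file(s) found");
    }
}
```

In the real demo, the place where this sketch calls out.add(...) is where a Document is handed to the index writer.
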
As you can see, there isn't much to creating an index. The devil is in the details. You may also wish to examine the other samples in this directory, particularly the IndexHTML class. It is a bit more complex but builds upon this example.