<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
|
|
|
|
<!-- Content Stylesheet for Site -->
|
|
|
|
|
|
<!-- start the processing -->
|
|
<!-- ====================================================================== -->
|
|
<!-- Main Page Section -->
|
|
<!-- ====================================================================== -->
|
|
<html>
|
|
<head>
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
|
|
|
|
<meta name="author" value="Kelvin Tan">
|
|
<meta name="email" value="kelvint@apache.org">
|
|
|
|
|
|
|
|
<title>Jakarta Lucene - Indyo Tutorial</title>
|
|
</head>
|
|
|
|
<body bgcolor="#ffffff" text="#000000" link="#525D76">
|
|
<table border="0" width="100%" cellspacing="0">
|
|
<!-- TOP IMAGE -->
|
|
<tr>
|
|
<td align="left">
|
|
<a href="http://jakarta.apache.org"><img src="http://jakarta.apache.org/images/jakarta-logo.gif" border="0"/></a>
|
|
</td>
|
|
<td align="right">
|
|
<a href="http://jakarta.apache.org/lucene/"><img src="../../images/lucene_green_300.gif" alt="Jakarta Lucene" border="0"/></a>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
<table border="0" width="100%" cellspacing="4">
|
|
<tr><td colspan="2">
|
|
<hr noshade="" size="1"/>
|
|
</td></tr>
|
|
|
|
<tr>
|
|
<!-- LEFT SIDE NAVIGATION -->
|
|
<td width="20%" valign="top" nowrap="true">
|
|
<p><strong>About</strong></p>
|
|
<ul>
|
|
<li> <a href="../../index.html">Overview</a>
|
|
</li>
|
|
<li> <a href="../../powered.html">Powered by Lucene</a>
|
|
</li>
|
|
<li> <a href="../../whoweare.html">Who We Are</a>
|
|
</li>
|
|
<li> <a href="http://jakarta.apache.org/site/mail.html">Mailing Lists</a>
|
|
</li>
|
|
</ul>
|
|
<p><strong>Resources</strong></p>
|
|
<ul>
|
|
<li> <a href="http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi">FAQ (Official)</a>
|
|
</li>
|
|
<li> <a href="http://www.jguru.com/faq/Lucene">JGuru FAQ</a>
|
|
</li>
|
|
<li> <a href="../../gettingstarted.html">Getting Started</a>
|
|
</li>
|
|
<li> <a href="http://jakarta.apache.org/site/bugs.html">Bugs</a>
|
|
</li>
|
|
<li> <a href="http://nagoya.apache.org/bugzilla/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&email1=&emailtype1=substring&emailassigned_to1=1&email2=&emailtype2=substring&emailreporter2=1&bugidtype=include&bug_id=&changedin=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&product=Lucene&short_desc=&short_desc_type=allwordssubstr&long_desc=&long_desc_type=allwordssubstr&bug_file_loc=&bug_file_loc_type=allwordssubstr&keywords=&keywords_type=anywords&field0-0-0=noop&type0-0-0=noop&value0-0-0=&cmdtype=doit&order=%27Importance%27">Lucene Bugs</a>
|
|
</li>
|
|
<li> <a href="../../queryparsersyntax.html">Query Syntax</a>
|
|
</li>
|
|
<li> <a href="../../fileformats.html">File Formats</a>
|
|
</li>
|
|
<li> <a href="../../api/index.html">Javadoc</a>
|
|
</li>
|
|
<li> <a href="../../contributions.html">Contributions</a>
|
|
</li>
|
|
<li> <a href="../../lucene-sandbox/">Lucene Sandbox</a>
|
|
</li>
|
|
<li> <a href="../../resources.html">Articles, etc.</a>
|
|
</li>
|
|
</ul>
|
|
<p><strong>Plans</strong></p>
|
|
<ul>
|
|
<li> <a href="../../luceneplan.html">Application Extensions</a>
|
|
</li>
|
|
</ul>
|
|
<p><strong>Download</strong></p>
|
|
<ul>
|
|
<li> <a href="http://jakarta.apache.org/site/binindex.html">Binaries</a>
|
|
</li>
|
|
<li> <a href="http://jakarta.apache.org/site/sourceindex.html">Source Code</a>
|
|
</li>
|
|
<li> <a href="http://jakarta.apache.org/site/cvsindex.html">CVS Repositories</a>
|
|
</li>
|
|
</ul>
|
|
<p><strong>Jakarta</strong></p>
|
|
<ul>
|
|
<li> <a href="http://jakarta.apache.org/site/getinvolved.html">Get Involved</a>
|
|
</li>
|
|
<li> <a href="http://jakarta.apache.org/site/acknowledgements.html">Acknowledgements</a>
|
|
</li>
|
|
<li> <a href="http://jakarta.apache.org/site/contact.html">Contact</a>
|
|
</li>
|
|
<li> <a href="http://jakarta.apache.org/site/legal.html">Legal</a>
|
|
</li>
|
|
</ul>
|
|
</td>
|
|
<td width="80%" align="left" valign="top">
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
<tr><td bgcolor="#525D76">
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
<a name="About this Tutorial"><strong>About this Tutorial</strong></a>
|
|
</font>
|
|
</td></tr>
|
|
<tr><td>
|
|
<blockquote>
|
|
<p>
|
|
This tutorial is intended to give first-time users an
|
|
introduction to using Indyo, a datasource-independent
|
|
Lucene indexing framework.
|
|
</p>
|
|
<p>
|
|
This covers how to obtain, build, and configure Indyo,
and how to index a directory on a filesystem.
|
|
</p>
|
|
</blockquote>
|
|
</p>
|
|
</td></tr>
|
|
<tr><td><br/></td></tr>
|
|
</table>
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
<tr><td bgcolor="#525D76">
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
<a name="Step 1: Obtaining Indyo"><strong>Step 1: Obtaining Indyo</strong></a>
|
|
</font>
|
|
</td></tr>
|
|
<tr><td>
|
|
<blockquote>
|
|
<p>
|
|
First, you need to obtain Indyo. As
|
|
of this writing, Indyo is only available via CVS, from the
|
|
"jakarta-lucene-sandbox" repository. See
|
|
<a href="http://jakarta.apache.org/cvsindex.html">Jakarta CVS</a>
|
|
for instructions on accessing files via CVS.</p>
|
|
</blockquote>
|
|
</p>
|
|
</td></tr>
|
|
<tr><td><br/></td></tr>
|
|
</table>
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
<tr><td bgcolor="#525D76">
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
<a name="Step 2: Building Indyo"><strong>Step 2: Building Indyo</strong></a>
|
|
</font>
|
|
</td></tr>
|
|
<tr><td>
|
|
<blockquote>
|
|
<p>
|
|
Get a copy of <a href="http://jakarta.apache.org/ant">Ant</a> if
|
|
you don't already have it installed. Then simply type "ant" in the
directory where the local copy of the Indyo sources resides.
|
|
</p>
|
|
<p>
|
|
Voila! You should now have a jar file "indyo-&lt;version number&gt;.jar".
|
|
</p>
|
|
</blockquote>
|
|
</p>
|
|
</td></tr>
|
|
<tr><td><br/></td></tr>
|
|
</table>
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
<tr><td bgcolor="#525D76">
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
<a name="Step 3: Configuring Indyo"><strong>Step 3: Configuring Indyo</strong></a>
|
|
</font>
|
|
</td></tr>
|
|
<tr><td>
|
|
<blockquote>
|
|
<p>
|
|
The "src/conf" folder contains a default configuration file which is
|
|
sufficient for normal use.
|
|
</p>
|
|
</blockquote>
|
|
</p>
|
|
</td></tr>
|
|
<tr><td><br/></td></tr>
|
|
</table>
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
<tr><td bgcolor="#525D76">
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
<a name="Step 4: Using Indyo"><strong>Step 4: Using Indyo</strong></a>
|
|
</font>
|
|
</td></tr>
|
|
<tr><td>
|
|
<blockquote>
|
|
<p>
|
|
Congratulations, you have finally reached the fun
part of this tutorial. This is where you'll discover
the power of Indyo.
|
|
</p>
|
|
<p>
|
|
To index a datasource, first instantiate the respective
|
|
datasource, then hand it to IndyoIndexer for indexing.
|
|
For example:
|
|
</p>
|
|
<div align="left">
|
|
<table cellspacing="4" cellpadding="0" border="0">
|
|
<tr>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
</tr>
|
|
<tr>
|
|
<td bgcolor="#023264" width="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#ffffff"><pre>
IndexDataSource ds = new FSDataSource("/usr/local/lucene/docs");
IndyoIndexer indexer = new IndyoIndexer("/usr/local/index",
                                        "/usr/local/indyo/default.config.xml");
indexer.index(ds);
</pre></td>
|
|
<td bgcolor="#023264" width="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
</tr>
|
|
<tr>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
</tr>
|
|
</table>
|
|
</div>
|
|
<p>
|
|
FSDataSource is a simple datasource which indexes both files
and directories. It adds the following metadata to each document:
filePath, fileName, fileSize, fileFormat, fileContents, and
fileLastModifiedDate. Based on the extension of each file indexed,
Indyo selects a file content-handler according to the mappings found in the
configuration file. If you're not happy with this list of file
metadata, feel free to subclass FSDataSource, or, as we're about
to cover next, write your own custom IndexDataSource.
|
|
</p>
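<p>
To make the field list above concrete, here is a rough sketch of how a
map with those six keys could be built for a single file. This is not
FSDataSource's actual code: the class FileMetadataSketch, the value types,
and the rule that fileFormat is just the file extension are all assumptions
made for illustration.
</p>
<pre>

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Indyo: builds a map carrying the six
// field names listed above for one regular file.
class FileMetadataSketch
{
    static Map describe(File file) throws Exception
    {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        Map data = new HashMap();
        data.put("filePath", file.getPath());
        data.put("fileName", name);
        data.put("fileSize", Long.valueOf(file.length()));
        // Assumption: the format is simply the extension ("" if there is none).
        data.put("fileFormat", dot == -1 ? "" : name.substring(dot + 1));
        // Assumption: contents are read as UTF-8 text.
        data.put("fileContents",
                 new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8));
        data.put("fileLastModifiedDate", new Date(file.lastModified()));
        return data;
    }

    public static void main(String[] args) throws Exception
    {
        File tmp = File.createTempFile("indyo", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "hello".getBytes(StandardCharsets.UTF_8));
        Map data = describe(tmp);
        // prints "txt 5"
        System.out.println(data.get("fileFormat") + " " + data.get("fileSize"));
    }
}
```

</pre>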
|
|
<p>
|
|
Get familiar with FSDataSource. You'll find it very handy, both for indexing
files directly and for nesting within another datasource. For example,
you might need to index a database table in which one of the columns holds
the location of a file, and you may want to use FSDataSource to index this
file as well.
|
|
</p>
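<p>
The database scenario might look roughly like the sketch below. So that it
compiles on its own, it declares small stand-ins for IndexDataSource and
FSDataSource; the real types ship with Indyo. RowDataSource, the "attachment"
column name, and the assumption that NESTED_DATASOURCE is a String constant
on the interface are all hypothetical.
</p>
<pre>

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

class RowDemo
{
    public static void main(String[] args)
    {
        Map row = new HashMap();
        row.put("title", "Quarterly report");
        row.put("attachment", "/tmp/report.pdf");
        Map doc = new RowDataSource(row).getData()[0];
        // prints "Quarterly report / nested: true"
        System.out.println(doc.get("title") + " / nested: "
            + (doc.get(IndexDataSource.NESTED_DATASOURCE) instanceof FSDataSource));
    }
}

// Stand-in for Indyo's interface (assumed shape, for illustration only).
interface IndexDataSource
{
    String NESTED_DATASOURCE = "NESTED_DATASOURCE";
    Map[] getData() throws Exception;
}

// Stand-in for Indyo's FSDataSource; it only records the path.
class FSDataSource implements IndexDataSource
{
    private final File file;

    FSDataSource(File file) { this.file = file; }

    public Map[] getData()
    {
        Map data = new HashMap();
        data.put("filePath", file.getPath());
        return new Map[]{data};
    }
}

// Hypothetical datasource for one database row whose "attachment"
// column holds the location of a file.
class RowDataSource implements IndexDataSource
{
    private final Map row;

    RowDataSource(Map row) { this.row = row; }

    public Map[] getData()
    {
        // Copy the row so the caller's map is left untouched.
        Map data = new HashMap(row);
        // Swap the raw path for a nested datasource that indexes the file.
        Object path = data.remove("attachment");
        data.put(NESTED_DATASOURCE, new FSDataSource(new File((String) path)));
        return new Map[]{data};
    }
}
```

</pre>
<p>
The row's other columns stay as ordinary fields, and the file is handed to
the nested datasource instead of being stored as a plain path.
</p>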
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
<tr><td bgcolor="#828DA6">
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
<a name="Writing your custom IndexDataSource"><strong>Writing your custom IndexDataSource</strong></a>
|
|
</font>
|
|
</td></tr>
|
|
<tr><td>
|
|
<blockquote>
|
|
<p>
|
|
To write a custom IndexDataSource, write a class
which implements IndexDataSource and provides an implementation
of the getData() method, which returns a Map[]. The javadoc of the
getData() method reads:
|
|
</p>
|
|
<div align="left">
|
|
<table cellspacing="4" cellpadding="0" border="0">
|
|
<tr>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
</tr>
|
|
<tr>
|
|
<td bgcolor="#023264" width="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#ffffff"><pre>
/**
 * Retrieve an array of Maps. Each map represents
 * a document to be indexed. The key:value pairs of the map
 * are the metadata of the document.
 */
</pre></td>
|
|
<td bgcolor="#023264" width="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
</tr>
|
|
<tr>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
</tr>
|
|
</table>
|
|
</div>
|
|
<p>
|
|
So, the getData() method provides a way for Indyo to retrieve document
metadata from each IndexDataSource. A simple example of a custom
IndexDataSource, HashMapDataSource, is provided below.
|
|
</p>
|
|
<div align="left">
|
|
<table cellspacing="4" cellpadding="0" border="0">
|
|
<tr>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
</tr>
|
|
<tr>
|
|
<td bgcolor="#023264" width="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#ffffff"><pre>
import java.util.Map;

public class HashMapDataSource implements IndexDataSource
{
    private Map data;

    public HashMapDataSource(Map data)
    {
        this.data = data;
    }

    // Always yields exactly one document, described by the supplied map.
    public Map[] getData() throws Exception
    {
        return new Map[]{data};
    }
}
</pre></td>
|
|
<td bgcolor="#023264" width="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
</tr>
|
|
<tr>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
</tr>
|
|
</table>
|
|
</div>
|
|
<p>
|
|
As you can see, HashMapDataSource doesn't do anything very useful. It
|
|
always results in one Document being indexed, and the document's fields
|
|
depend on the contents of the map that HashMapDataSource was initialized
|
|
with.
|
|
</p>
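<p>
To see what HashMapDataSource hands back, the sketch below fills a map with
two fields and calls getData() directly. It declares its own stand-in for the
IndexDataSource interface so that it compiles without Indyo, and the field
names "title" and "author" are made up for the example. In a real application
you would pass the datasource to IndyoIndexer instead of calling getData()
yourself.
</p>
<pre>

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HashMapDemo
{
    public static void main(String[] args)
    {
        // LinkedHashMap keeps insertion order, so the output is stable.
        Map fields = new LinkedHashMap();
        fields.put("title", "Indyo Tutorial");
        fields.put("author", "Kelvin Tan");

        Map[] docs = new HashMapDataSource(fields).getData();
        // prints "1 document, fields [title, author]"
        System.out.println(docs.length + " document, fields " + docs[0].keySet());
    }
}

// Stand-in for Indyo's interface (assumed shape, for illustration only).
interface IndexDataSource
{
    Map[] getData() throws Exception;
}

class HashMapDataSource implements IndexDataSource
{
    private final Map data;

    HashMapDataSource(Map data) { this.data = data; }

    // One map in, one document out.
    public Map[] getData()
    {
        return new Map[]{data};
    }
}
```

</pre>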
|
|
<p>
|
|
A slightly more useful IndexDataSource, SingleDocumentFSDataSource,
|
|
provides an example of how to nest datasources. Given a directory,
|
|
SingleDocumentFSDataSource recursively indexes all directories
|
|
and files within that directory <i>as the same Document</i>. In other
|
|
words, only one Document is created in the index. This is accomplished
|
|
by the use of a nested datasource. The code for
|
|
SingleDocumentFSDataSource is listed below:
|
|
</p>
|
|
<div align="left">
|
|
<table cellspacing="4" cellpadding="0" border="0">
|
|
<tr>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
</tr>
|
|
<tr>
|
|
<td bgcolor="#023264" width="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#ffffff"><pre>
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class SingleDocumentFSDataSource
        implements IndexDataSource
{
    private File file;

    public SingleDocumentFSDataSource(File file)
    {
        this.file = file;
    }

    public Map[] getData() throws Exception
    {
        // A single map means a single Document; the nested FSDataSource
        // contributes its fields to that same Document.
        Map data = new HashMap(1);
        data.put(NESTED_DATASOURCE, new FSDataSource(file));
        return new Map[]{data};
    }
}
</pre></td>
|
|
<td bgcolor="#023264" width="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
</tr>
|
|
<tr>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
<td bgcolor="#023264" width="1" height="1"><img src="/images/void.gif" width="1" height="1" vspace="0" hspace="0" border="0"/></td>
|
|
</tr>
|
|
</table>
|
|
</div>
|
|
<p>
|
|
Nested datasources don't result in a separate Document being created.
|
|
Use them when working with complex datasources, i.e., datasources
|
|
which are an aggregation of multiple datasources. The current way to
|
|
add a nested datasource is using the key "NESTED_DATASOURCE". Indyo
|
|
accepts an IndexDataSource object, a List of IndexDataSources,
|
|
or an IndexDataSource[] for this key.
|
|
</p>
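<p>
The three accepted shapes can be illustrated with the sketch below. The
nested() helper is hypothetical and is not Indyo's own code; it only shows
one way a consumer could flatten a single IndexDataSource, a List, or an
IndexDataSource[] into one list. The interface is a stand-in, and the
assumption that NESTED_DATASOURCE is a String constant on it is ours.
</p>
<pre>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NestedShapesDemo
{
    // Hypothetical helper: returns whatever is stored under the key as a List.
    static List nested(Map data)
    {
        Object value = data.get(IndexDataSource.NESTED_DATASOURCE);
        if (value instanceof IndexDataSource)
            return Collections.singletonList(value);
        if (value instanceof IndexDataSource[])
            return Arrays.asList((IndexDataSource[]) value);
        if (value instanceof List)
            return (List) value;
        return Collections.EMPTY_LIST;
    }

    public static void main(String[] args)
    {
        IndexDataSource a = new EmptySource();
        IndexDataSource b = new EmptySource();

        Map single = new HashMap();
        single.put(IndexDataSource.NESTED_DATASOURCE, a);

        List both = new ArrayList();
        both.add(a);
        both.add(b);
        Map asList = new HashMap();
        asList.put(IndexDataSource.NESTED_DATASOURCE, both);

        Map asArray = new HashMap();
        asArray.put(IndexDataSource.NESTED_DATASOURCE, new IndexDataSource[]{a, b});

        // prints "1 2 2"
        System.out.println(nested(single).size() + " "
            + nested(asList).size() + " " + nested(asArray).size());
    }
}

// Stand-in for Indyo's interface (assumed shape, for illustration only).
interface IndexDataSource
{
    String NESTED_DATASOURCE = "NESTED_DATASOURCE";
    Map[] getData() throws Exception;
}

// Trivial datasource used only to populate the examples.
class EmptySource implements IndexDataSource
{
    public Map[] getData() { return new Map[0]; }
}
```

</pre>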
|
|
</blockquote>
|
|
</td></tr>
|
|
<tr><td><br/></td></tr>
|
|
</table>
|
|
</blockquote>
|
|
</p>
|
|
</td></tr>
|
|
<tr><td><br/></td></tr>
|
|
</table>
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
<tr><td bgcolor="#525D76">
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
<a name="Where to Go From Here"><strong>Where to Go From Here</strong></a>
|
|
</font>
|
|
</td></tr>
|
|
<tr><td>
|
|
<blockquote>
|
|
<p>
|
|
Congratulations! You have completed the Indyo
tutorial. Although this has only been an introduction,
it should be sufficient to get you started
with Indyo in your applications. For those of you
|
|
seeking additional information, there are several other
|
|
documents on this site that can provide details on
|
|
various subjects. Lastly, the source code is an
|
|
invaluable resource when all else fails to provide
|
|
answers!
|
|
</p>
|
|
</blockquote>
|
|
</p>
|
|
</td></tr>
|
|
<tr><td><br/></td></tr>
|
|
</table>
|
|
<table border="0" cellspacing="0" cellpadding="2" width="100%">
|
|
<tr><td bgcolor="#525D76">
|
|
<font color="#ffffff" face="arial,helvetica,sanserif">
|
|
<a name="Acknowledgements"><strong>Acknowledgements</strong></a>
|
|
</font>
|
|
</td></tr>
|
|
<tr><td>
|
|
<blockquote>
|
|
<p>
|
|
This document was shamelessly ripped from the extremely well-written
|
|
and well-organized
|
|
<a href="http://jakarta.apache.org/turbine/torque/tutorial.html">Torque
|
|
</a> tutorial. Thanks Pete!
|
|
</p>
|
|
</blockquote>
|
|
</p>
|
|
</td></tr>
|
|
<tr><td><br/></td></tr>
|
|
</table>
|
|
</td>
|
|
</tr>
|
|
|
|
<!-- FOOTER -->
|
|
<tr><td colspan="2">
|
|
<hr noshade="" size="1"/>
|
|
</td></tr>
|
|
<tr><td colspan="2">
|
|
<div align="center"><font color="#525D76" size="-1"><em>
|
|
Copyright © 1999-2002, Apache Software Foundation
|
|
</em></font></div>
|
|
</td></tr>
|
|
</table>
|
|
</body>
|
|
</html>
|
|
<!-- end the processing -->