<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="Apache Forrest" name="Generator">
<meta name="Forrest-version" content="0.7">
<meta name="Forrest-skin-name" content="pelt">
<title>Solr tutorial</title>
<link type="text/css" href="skin/basic.css" rel="stylesheet">
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
<link type="text/css" href="skin/profile.css" rel="stylesheet">
<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
<link rel="shortcut icon" href="images/favicon.ico">
</head>
<body onload="init()">
<script type="text/javascript">ndeSetTextSize();</script>
<div id="top">
<div class="breadtrail">
<a href="http://www.apache.org/">Apache</a> &gt; <a href="http://lucene.apache.org/">Lucene</a> &gt; <a href="http://incubator.apache.org/solr/">Solr</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
</div>
<div class="header">
<div class="grouplogo">
<a href="http://incubator.apache.org/"><img class="logoImage" alt="Apache Incubator" src="http://incubator.apache.org/images/apache-incubator-logo.png" title="Apache Incubator"></a>
</div>
<div class="projectlogo">
<a href="http://incubator.apache.org/solr/"><img class="logoImage" alt="Solr" src="images/solr.png" title="Solr Description"></a>
</div>
<div class="searchbox">
<form action="http://www.google.com/search" method="get" class="roundtopsmall">
<input value="incubator.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">&nbsp;
<input attr="value" name="Search" value="Search" type="submit">
</form>
</div>
<ul id="tabs">
<li class="current">
<a class="base-selected" href="index.html">Main</a>
</li>
<li>
<a class="base-not-selected" href="http://wiki.apache.org/solr">Wiki</a>
</li>
</ul>
</div>
</div>
<div id="main">
<div id="publishedStrip">
<div id="level2tabs"></div>
<script type="text/javascript"><!--
document.write("<text>Last Published:</text> " + document.lastModified);
// --></script>
</div>
<div class="breadtrail">
&nbsp;
</div>
<div id="menu">
<div onclick="SwitchMenu('menu_1.1', 'skin/')" id="menu_1.1Title" class="menutitle">About</div>
<div id="menu_1.1" class="menuitemgroup">
<div class="menuitem">
<a href="index.html" title="Welcome to Solr">Welcome</a>
</div>
<div class="menuitem">
<a href="who.html" title="Solr Committers">Who We Are</a>
</div>
</div>
<div onclick="SwitchMenu('menu_selected_1.2', 'skin/')" id="menu_selected_1.2Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Documentation</div>
<div id="menu_selected_1.2" class="selectedmenuitemgroup" style="display: block;">
<div class="menuitem">
<a href="features.html">Features</a>
</div>
<div class="menuitem">
<a href="http://wiki.apache.org/solr/FAQ">FAQ</a>
</div>
<div class="menuitem">
<a href="http://wiki.apache.org/solr/">Wiki</a>
</div>
<div class="menupage">
<div class="menupagetitle">Tutorial</div>
</div>
<div class="menuitem">
<a href="docs/api/">API Docs</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Resources</div>
<div id="menu_1.3" class="menuitemgroup">
<div class="menuitem">
<a href="http://cvs.apache.org/dist/lucene/solr/nightly/">Download</a>
</div>
<div class="menuitem">
<a href="mailing_lists.html">Mailing Lists</a>
</div>
<div class="menuitem">
<a href="issue_tracking.html">Issue Tracking</a>
</div>
<div class="menuitem">
<a href="version_control.html">Version Control</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.4', 'skin/')" id="menu_1.4Title" class="menutitle">Related Projects</div>
<div id="menu_1.4" class="menuitemgroup">
<div class="menuitem">
<a href="http://lucene.apache.org/java/">Lucene Java</a>
</div>
<div class="menuitem">
<a href="http://lucene.apache.org/nutch/">Nutch</a>
</div>
</div>
<div id="credit"></div>
<div id="roundbottom">
<img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
<div id="credit2"></div>
</div>
<div id="content">
<div title="Portable Document Format" class="pdflink">
<a class="dida" href="tutorial.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
PDF</a>
</div>
<h1>Solr tutorial</h1>
<div id="minitoc-area">
<ul class="minitoc">
<li>
<a href="#Overview">Overview</a>
</li>
<li>
<a href="#Requirements">Requirements</a>
</li>
<li>
<a href="#Getting+Started">Getting Started</a>
</li>
<li>
<a href="#Indexing+Data">Indexing Data</a>
</li>
<li>
<a href="#Updating+Data">Updating Data</a>
<ul class="minitoc">
<li>
<a href="#Deleting+Data">Deleting Data</a>
</li>
</ul>
</li>
<li>
<a href="#Querying+Data">Querying Data</a>
<ul class="minitoc">
<li>
<a href="#Sorting">Sorting</a>
</li>
</ul>
</li>
<li>
<a href="#Text+Analysis">Text Analysis</a>
<ul class="minitoc">
<li>
<a href="#Analysis+Debugging">Analysis Debugging</a>
</li>
</ul>
</li>
</ul>
</div>
<a name="N1000C"></a><a name="Overview"></a>
<h2 class="boxed">Overview</h2>
<div class="section">
<p>
This document covers the basics of running Solr using an example
schema, and some sample data.
</p>
</div>
<a name="N10016"></a><a name="Requirements"></a>
<h2 class="boxed">Requirements</h2>
<div class="section">
<p>
To follow along with this tutorial, you will need...
</p>
<ol>
<li>Java 1.5 or greater, available from
<a href="http://java.sun.com/j2se/downloads.html">Sun</a>,
<a href="http://www-106.ibm.com/developerworks/java/jdk/">IBM</a>, or
<a href="http://www.bea.com/jrockit/">BEA</a>.
</li>
<li>A <a href="http://cvs.apache.org/dist/lucene/solr/nightly/">Solr release</a>.
</li>
<li>On Win32, <a href="http://www.cygwin.com/">cygwin</a>, for
shell support. (If you plan to use Subversion on Win32, be
sure to select the subversion package when you install, in the
"Devel" category.) This tutorial will assume that "<span class="codefrag">sh</span>"
is in your PATH, and that you have "curl" installed from the "Web" category.
</li>
<li>Firefox or Mozilla is the preferred browser for viewing the admin pages;
  the current stylesheet does not render well in IE.
</li>
</ol>
</div>
<a name="N10046"></a><a name="Getting+Started"></a>
<h2 class="boxed">Getting Started</h2>
<div class="section">
<p>
Begin by unzipping the Solr release and changing your working directory
to the "<span class="codefrag">example</span>" directory.
</p>
<pre class="code">
chrish@asimov:~/tmp/solr$ ls
solr-1.0.zip
chrish@asimov:~/tmp/solr$ unzip -q solr-1.0.zip
chrish@asimov:~/tmp/solr$ cd solr-1.0/example/
</pre>
<p>
Solr can run in any Java Servlet Container of your choice, but to simplify
this tutorial, the example directory includes a small installation of Jetty.
</p>
<p>
To launch Jetty with the Solr WAR and the example configs, run <span class="codefrag">start.jar</span>:
</p>
<pre class="code">
chrish@asimov:~/tmp/solr/solr-1.0/example$ java -jar start.jar
1 [main] INFO org.mortbay.log - Logging to org.slf4j.impl.SimpleLogger@1f436f5 via org.mortbay.log.Slf4jLog
334 [main] INFO org.mortbay.log - Extract jar:file:/home/chrish/tmp/solr/solr-1.0/example/webapps/solr.war!/ to /tmp/Jetty__solr/webapp
Feb 24, 2006 5:54:52 PM org.apache.solr.servlet.SolrServlet init
INFO: user.dir=/home/chrish/tmp/solr/solr-1.0/example
Feb 24, 2006 5:54:52 PM org.apache.solr.core.SolrConfig &lt;clinit&gt;
INFO: Loaded Config solrconfig.xml
...
1656 [main] INFO org.mortbay.log - Started SelectChannelConnector @ 0.0.0.0:8983
</pre>
<p>
This will start up the Jetty application server on port 8983, and use your terminal to display the logging information from Solr.
</p>
<p>
You can confirm that Solr is running by loading <a href="http://localhost:8983/solr/admin/">http://localhost:8983/solr/admin/</a> in your web browser. This is the main starting point for administering Solr.
</p>
</div>
<a name="N1006E"></a><a name="Indexing+Data"></a>
<h2 class="boxed">Indexing Data</h2>
<div class="section">
<p>
Your Solr server is up and running, but it doesn't contain any data. You can modify a Solr index by POSTing XML documents containing instructions to add (or update) documents, delete documents, commit pending adds and deletes, and optimize your index. The <span class="codefrag">exampledocs</span> directory contains samples of the types of instructions Solr expects, as well as a shell script for posting them using the command line utility "<span class="codefrag">curl</span>".
</p>
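<p>
As a rough sketch of what an "add" instruction looks like, the Python snippet below builds one with the standard library. It assumes the add/doc/field layout described on the UpdateXmlMessages wiki page. The helper name <span class="codefrag">build_add_message</span> and the sample values are hypothetical; only the <span class="codefrag">id</span> and <span class="codefrag">name</span> field names come from this tutorial.
</p>
<pre class="code">
```python
import xml.etree.ElementTree as ET


def build_add_message(fields):
    """Build an add message for one document from (name, value) pairs."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        # Each pair becomes one field element inside the doc element.
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical sample values, not taken from the exampledocs files.
message = build_add_message([("id", "EXAMPLE-1"), ("name", "An example document")])
print(message)
```
</pre>
<p>
The printed string is the kind of body that would be POSTed to the update URL; special characters in field values are escaped by the XML library.
</p>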
<p>
Open a new Terminal window, enter the exampledocs directory, and run the "<span class="codefrag">post.sh</span>" script on some of the XML files in that directory...
</p>
<pre class="code">
chrish@asimov:~/tmp/solr/solr-1.0/example/exampledocs$ sh post.sh solr.xml
Posting file solr.xml to http://localhost:8983/solr/update
&lt;result status="0"&gt;&lt;/result&gt;
&lt;result status="0"&gt;&lt;/result&gt;
</pre>
<p>
You have now indexed one document about Solr, and committed that change. You can now search for "solr" using the "Make a Query" interface on the Admin screen, and you should get one result. Clicking the "Search" button should take you to the following URL...
</p>
<p>
<a href="http://localhost:8983/solr/select/?stylesheet=&q=solr&version=2.1&start=0&rows=10&indent=on">http://localhost:8983/solr/select/?stylesheet=&amp;q=solr&amp;version=2.1&amp;start=0&amp;rows=10&amp;indent=on</a>
</p>
<p>
You can index all of the sample data, using the following command...
</p>
<pre class="code">
chrish@asimov:~/tmp/solr/solr-1.0/example/exampledocs$ sh post.sh *.xml
Posting file hd.xml to http://localhost:8983/solr/update
&lt;result status="0"&gt;&lt;/result&gt;&lt;result status="0"&gt;&lt;/result&gt;
Posting file ipod_other.xml to http://localhost:8983/solr/update
&lt;result status="0"&gt;&lt;/result&gt;&lt;result status="0"&gt;&lt;/result&gt;
Posting file ipod_video.xml to http://localhost:8983/solr/update
&lt;result status="0"&gt;&lt;/result&gt;
Posting file mem.xml to http://localhost:8983/solr/update
&lt;result status="0"&gt;&lt;/result&gt;&lt;result status="0"&gt;&lt;/result&gt;&lt;result status="0"&gt;&lt;/result&gt;
Posting file monitor.xml to http://localhost:8983/solr/update
&lt;result status="0"&gt;&lt;/result&gt;
Posting file monitor2.xml to http://localhost:8983/solr/update
&lt;result status="0"&gt;&lt;/result&gt;
Posting file mp500.xml to http://localhost:8983/solr/update
&lt;result status="0"&gt;&lt;/result&gt;
Posting file sd500.xml to http://localhost:8983/solr/update
&lt;result status="0"&gt;&lt;/result&gt;
Posting file solr.xml to http://localhost:8983/solr/update
&lt;result status="0"&gt;&lt;/result&gt;
Posting file vidcard.xml to http://localhost:8983/solr/update
&lt;result status="0"&gt;&lt;/result&gt;&lt;result status="0"&gt;&lt;/result&gt;
&lt;result status="0"&gt;&lt;/result&gt;
</pre>
<p>
...and now you can search for all sorts of things using the default <a href="http://lucene.apache.org/java/docs/queryparsersyntax.html">Lucene QueryParser syntax</a>...
</p>
<ul>
<li>
<a href="http://localhost:8983/solr/select/?version=2.1&indent=on&q=video">video</a>
</li>
<li>
<a href="http://localhost:8983/solr/select/?version=2.1&indent=on&q=name:video">name:video</a>
</li>
<li>
<a href="http://localhost:8983/solr/select/?version=2.1&indent=on&q=%2Bvideo+%2Bprice%3A[*+TO+400]">+video +price:[* TO 400]</a>
</li>
</ul>
</div>
<a name="N100B2"></a><a name="Updating+Data"></a>
<h2 class="boxed">Updating Data</h2>
<div class="section">
<p>
You may have noticed that even though the file <span class="codefrag">solr.xml</span> has now
been POSTed to the server twice, you still only get 1 result when searching for
"solr". This is because the example schema.xml specifies a "uniqueKey" field
called "<span class="codefrag">id</span>". Whenever you POST instructions to Solr to add a
document with the same value for the uniqueKey as an existing document, it
automatically replaces it for you. You can see that this has happened by
looking at the values for <span class="codefrag">numDocs</span> and <span class="codefrag">maxDoc</span> in the
"CORE" section of the statistics page... </p>
<p>
<a href="http://localhost:8983/solr/admin/stats.jsp">http://localhost:8983/solr/admin/stats.jsp</a>
</p>
<p>
numDocs should be 15, but maxDoc may be larger (the maxDoc count includes logically deleted documents that have not yet been removed from the index). You can re-post the sample XML
files over and over again as much as you want and numDocs will never increase,
because the new documents will constantly be replacing the old.
</p>
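<p>
The relationship between <span class="codefrag">numDocs</span> and <span class="codefrag">maxDoc</span> can be illustrated with a toy model in Python. This is only a sketch of the counting behaviour described above, not how Solr or Lucene actually store documents; the class <span class="codefrag">ToyIndex</span>, its attributes, and the sample id are hypothetical.
</p>
<pre class="code">
```python
class ToyIndex:
    """Toy model of uniqueKey replacement; not Solr's real data structures."""

    def __init__(self):
        self.live = {}           # uniqueKey value mapped to the current document
        self.deleted_slots = 0   # replaced documents not yet purged

    def add(self, doc):
        # Adding a document whose id already exists replaces the old one,
        # which stays behind as a logically deleted entry.
        if doc["id"] in self.live:
            self.deleted_slots += 1
        self.live[doc["id"]] = doc

    @property
    def num_docs(self):
        return len(self.live)

    @property
    def max_doc(self):
        return len(self.live) + self.deleted_slots


index = ToyIndex()
index.add({"id": "EXAMPLE-1", "name": "first version"})
index.add({"id": "EXAMPLE-1", "name": "second version"})
print(index.num_docs, index.max_doc)  # prints "1 2"
```
</pre>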
<p>
Go ahead and edit the existing XML files to change some of the data, then re-run the post.sh command; you'll see your changes reflected in subsequent searches.
</p>
<a name="N100D4"></a><a name="Deleting+Data"></a>
<h3 class="boxed">Deleting Data</h3>
<p>You can delete data by POSTing a delete command to the update URL and specifying the value
of the document's unique key field, or a query that matches multiple documents. Since these commands
are smaller, we will specify them right on the command line rather than reference an XML file.
</p>
<p>Execute the following command to delete a document:</p>
<pre class="code">curl http://localhost:8983/solr/update --data-binary '&lt;delete&gt;&lt;id&gt;SP2514N&lt;/id&gt;&lt;/delete&gt;'</pre>
<p>Now go to the <a href="http://localhost:8983/solr/admin/stats.jsp">statistics</a> page, scroll down
to the UPDATE_HANDLERS section, and verify that it shows "<span class="codefrag">deletesPending : 1</span>".</p>
<p>If you search for <a href="http://localhost:8983/solr/select?q=id:SP2514N">id:SP2514N</a> it will still be found,
because index changes are not visible until changes are flushed to disk, and a new searcher is opened. To cause
this to happen, send the following commit command to Solr:</p>
<pre class="code">curl http://localhost:8983/solr/update --data-binary '&lt;commit/&gt;'</pre>
<p>Now re-execute the previous search and verify that no matching documents are found. Also revisit the
statistics page and observe the changes in both the UPDATE_HANDLERS section and the CORE section.</p>
<p>Here is an example of using delete-by-query to delete anything with
<a href="http://localhost:8983/solr/select?q=name:DDR&fl=name">DDR</a> in the name:</p>
<pre class="code">curl http://localhost:8983/solr/update --data-binary '&lt;delete&gt;&lt;query&gt;name:DDR&lt;/query&gt;&lt;/delete&gt;'
curl http://localhost:8983/solr/update --data-binary '&lt;commit/&gt;'
</pre>
<p>Commit can be a very expensive operation so it's best to make many changes to an index in a batch and
then send the commit command at the end. There is also an optimize command that does the same thing as commit,
in addition to merging all index segments into a single segment, making it faster to search and causing any
deleted documents to be removed. All of the update commands are documented <a href="http://wiki.apache.org/solr/UpdateXmlMessages">here</a>.
</p>
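<p>
The advice to batch changes and commit once can be sketched in Python as follows. This is a minimal illustration, not an official client: the helper names are hypothetical, the delete and commit messages mirror the curl examples above, and the <span class="codefrag">text/xml</span> content type is an assumption about what the server accepts.
</p>
<pre class="code">
```python
import urllib.request
import xml.etree.ElementTree as ET

UPDATE_URL = "http://localhost:8983/solr/update"  # update URL used in this tutorial


def xml_command(tag, child=None, text=None):
    """Serialize a small update command such as delete or commit."""
    root = ET.Element(tag)
    if child is not None:
        ET.SubElement(root, child).text = text
    return ET.tostring(root, encoding="unicode")


def post_to_solr(message):
    """POST one XML message to the update URL (needs a running Solr)."""
    request = urllib.request.Request(
        UPDATE_URL,
        data=message.encode("utf-8"),
        # Assumption: the server accepts this content type.
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def send_batch(messages, post=post_to_solr):
    """Send every message, then a single commit at the end."""
    replies = [post(message) for message in messages]
    replies.append(post(xml_command("commit")))
    return replies


# The same two deletes as in the curl examples, queued as one batch.
batch = [
    xml_command("delete", "id", "SP2514N"),
    xml_command("delete", "query", "name:DDR"),
]
```
</pre>
<p>
With the example server running, <span class="codefrag">send_batch(batch)</span> would post both deletes and then one commit. Passing a different <span class="codefrag">post</span> function makes it possible to inspect the messages without contacting a server.
</p>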
<p>To continue with the tutorial, re-add any documents you may have deleted by going to the <span class="codefrag">exampledocs</span> directory and executing</p>
<pre class="code">sh post.sh *.xml</pre>
</div>
<a name="N1011A"></a><a name="Querying+Data"></a>
<h2 class="boxed">Querying Data</h2>
<div class="section">
<p>
Searches are done via HTTP GET on the select URL with the query string in the q parameter.
You can pass a number of optional <a href="http://wiki.apache.org/solr/StandardRequestHandler">request parameters</a>
to the request handler to control what information is returned. For example, you can use the "fl" parameter
to control which stored fields are returned, and whether the relevancy score is returned...
</p>
<ul>
<li>
<a href="http://localhost:8983/solr/select/?indent=on&q=video&fl=name,id">q=video&amp;fl=name,id</a> (return only name and id fields) </li>
<li>
<a href="http://localhost:8983/solr/select/?indent=on&q=video&fl=name,id,score">q=video&amp;fl=name,id,score</a> (return relevancy score as well) </li>
<li>
<a href="http://localhost:8983/solr/select/?indent=on&q=video&fl=*,score">q=video&amp;fl=*,score</a> (return all stored fields, as well as relevancy score) </li>
<li>
<a href="http://localhost:8983/solr/select/?indent=on&q=video;price desc&fl=name,id">q=video;price desc&amp;fl=name,id</a> (add sort specification: sort by price descending) </li>
</ul>
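<p>
For scripted use, URLs like the ones above can be assembled with Python's standard library. The sketch below only builds the URL and assumes the example server address used throughout this tutorial; the function name <span class="codefrag">select_url</span> is hypothetical, while <span class="codefrag">q</span>, <span class="codefrag">fl</span> and <span class="codefrag">indent</span> are the parameters shown in the links.
</p>
<pre class="code">
```python
from urllib.parse import urlencode

SELECT_URL = "http://localhost:8983/solr/select/"  # select URL used in this tutorial


def select_url(query, fields=None, **extra):
    """Build a select URL from q, an optional fl list and extra parameters."""
    params = {"q": query}
    if fields:
        params["fl"] = ",".join(fields)  # fl is a comma-separated field list
    params.update(extra)
    # urlencode percent-encodes reserved characters in each value.
    return SELECT_URL + "?" + urlencode(params)


print(select_url("video", fields=["name", "id", "score"], indent="on"))
```
</pre>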
<p>
Solr provides a <a href="http://localhost:8983/solr/admin/form.jsp">query form</a> within the web admin interface
that allows setting the various request parameters and is useful when trying out or debugging queries.
</p>
<a name="N10149"></a><a name="Sorting"></a>
<h3 class="boxed">Sorting</h3>
<p>
Solr provides a simple extension to the Lucene QueryParser syntax for specifying sort options. After your search, add a semi-colon followed by a list of "field direction" pairs...
</p>
<ul>
<li>
<a href="http://localhost:8983/solr/select/?indent=on&q=video;price+desc">video; price desc</a>
</li>
<li>
<a href="http://localhost:8983/solr/select/?indent=on&q=video;price+asc">video; price asc</a>
</li>
<li>
<a href="http://localhost:8983/solr/select/?indent=on&q=video;inStock+asc,price+desc">video; inStock asc, price desc</a>
</li>
</ul>
<p>
"score" can also be used as a field name when specifying a sort...
</p>
<ul>
<li>
<a href="http://localhost:8983/solr/select/?indent=on&q=video;score+desc">video; score desc</a>
</li>
<li>
<a href="http://localhost:8983/solr/select/?indent=on&q=video;inStock+asc,score+desc">video; inStock asc, score desc</a>
</li>
</ul>
<p>
If no sort is specified, the default is <span class="codefrag">score desc</span>, the same as in the Lucene search APIs.
</p>
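<p>
The sort syntax shown in this section can also be generated programmatically. The Python sketch below appends a sort clause to a query string in the form used by the links above; the helper name <span class="codefrag">sorted_query</span> is hypothetical, and it assumes that directions are limited to <span class="codefrag">asc</span> and <span class="codefrag">desc</span> as in the examples.
</p>
<pre class="code">
```python
from urllib.parse import quote_plus


def sorted_query(query, sorts):
    """Append a semicolon and a list of field direction pairs to a query."""
    for _field, direction in sorts:
        if direction not in ("asc", "desc"):
            raise ValueError("direction must be 'asc' or 'desc'")
    if not sorts:
        return query  # no sort given, so the default ordering applies
    clause = ",".join(field + " " + direction for field, direction in sorts)
    return query + ";" + clause


q = sorted_query("video", [("inStock", "asc"), ("price", "desc")])
print(q)              # video;inStock asc,price desc
print(quote_plus(q))  # the same value, URL-encoded for the q parameter
```
</pre>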
</div>
<a name="N1017C"></a><a name="Text+Analysis"></a>
<h2 class="boxed">Text Analysis</h2>
<div class="section">
<p>
Text fields are typically indexed by breaking the field into words and applying various transformations such as
lowercasing, removing plurals, or stemming to increase relevancy. The same text transformations are normally
applied to any queries in order to match what is indexed.
</p>
<p>Example queries demonstrating relevancy-improving transformations:</p>
<ul>
<li>A search for
<a href="http://localhost:8983/solr/select/?indent=on&q=power-shot&fl=name">power-shot</a>
matches <span class="codefrag">PowerShot</span>, and
<a href="http://localhost:8983/solr/select/?indent=on&q=adata&fl=name">adata</a>
matches <span class="codefrag">A-DATA</span> due to the use of WordDelimiterFilter and LowerCaseFilter.
</li>
<li>A search for
<a href="http://localhost:8983/solr/select/?indent=on&q=name:printers&fl=name">name:printers</a>
matches <span class="codefrag">Printer</span>, and
<a href="http://localhost:8983/solr/select/?indent=on&q=features:recharging&fl=name,features">features:recharging</a>
matches <span class="codefrag">Rechargeable</span> due to stemming with the EnglishPorterFilter.
</li>
<li>A search for
<a href="http://localhost:8983/solr/select/?indent=on&q=%221+gigabyte%22&fl=name">"1 gigabyte"</a>
matches things with <span class="codefrag">GB</span>, and
<a href="http://localhost:8983/solr/select/?indent=on&q=pixima&fl=name">pixima</a>
matches <span class="codefrag">Pixma</span> due to use of a SynonymFilter.
</li>
</ul>
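<p>
The principle that indexed text and query text pass through the same transformations can be illustrated with a toy analyzer in Python. This is not Solr's WordDelimiterFilter or LowerCaseFilter, only a simplified stand-in that splits on punctuation, case changes and digits and then lowercases; the function names are hypothetical.
</p>
<pre class="code">
```python
import re

# One token per lowercase run (optionally led by a capital), capital run or digit run.
TOKEN = re.compile(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def toy_analyze(text):
    """Split on punctuation, case changes and digits, then lowercase."""
    return [token.lower() for token in TOKEN.findall(text)]


def toy_match(indexed_text, query_text):
    """True when every analyzed query token also appears in the indexed text."""
    indexed = set(toy_analyze(indexed_text))
    return all(token in indexed for token in toy_analyze(query_text))


print(toy_analyze("Canon PowerShot SD500"))  # ['canon', 'power', 'shot', 'sd', '500']
print(toy_match("Canon PowerShot SD500", "power-shot"))  # True
```
</pre>
<p>
Because both sides are reduced to the same tokens, the hyphenated query matches the mixed-case product name even though the raw strings differ.
</p>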
<p>
The <a href="http://wiki.apache.org/solr/SchemaXml">schema</a> defines
the fields in the index and what type of analysis is applied to them. The current schema your server is using
may be accessed via the <span class="codefrag">[SCHEMA]</span> link on the <a href="http://localhost:8983/solr/admin/">admin</a> page.
</p>
<p>A full description of the analysis components, Analyzers, Tokenizers, and TokenFilters
available for use is <a href="http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters">here</a>.
</p>
<a name="N101D3"></a><a name="Analysis+Debugging"></a>
<h3 class="boxed">Analysis Debugging</h3>
<p>There is a handy <a href="http://localhost:8983/solr/admin/analysis.jsp">analysis</a>
      debugging page that shows how a text value is broken down into words,
      along with the resulting tokens after they pass through each filter in the chain.
</p>
<p>
<a href="http://localhost:8983/solr/admin/analysis.jsp?name=name&val=Canon+PowerShot+SD500">This</a>
shows how "<span class="codefrag">Canon PowerShot SD500</span>" would be indexed as a value in the name field. Each row of
the table shows the resulting tokens after having passed through the next TokenFilter in the Analyzer for the <span class="codefrag">name</span> field.
Notice how both <span class="codefrag">powershot</span> and <span class="codefrag">power</span>, <span class="codefrag">shot</span> are indexed. Tokens generated at the same position
are shown in the same column, in this case <span class="codefrag">shot</span> and <span class="codefrag">powershot</span>.
</p>
<p>Selecting <a href="http://localhost:8983/solr/admin/analysis.jsp?name=name&verbose=on&val=Canon+PowerShot+SD500">verbose output</a>
will show more details, such as the name of each analyzer component in the chain, token positions, and the start and end positions
of the token in the original text.
</p>
<p>Selecting <a href="http://localhost:8983/solr/admin/analysis.jsp?name=name&highlight=on&val=Canon+PowerShot+SD500&qval=power-shot">highlight matches</a>
when both index and query values are provided will take the resulting terms from the query value and highlight
all matches in the index value analysis.
</p>
<p>
<a href="http://localhost:8983/solr/admin/analysis.jsp?name=text&highlight=on&val=Four+score+and+seven+years+ago+our+fathers+brought+forth+on+this+continent+a+new+nation%2C+conceived+in+liberty+and+dedicated+to+the+proposition+that+all+men+are+created+equal.+&qval=liberties+and+equality">Here</a>
is an example of stemming and stop-words at work.
</p>
</div>
</div>
<div class="clearboth">&nbsp;</div>
</div>
<div id="footer">
<div class="lastmodified">
<script type="text/javascript"><!--
document.write("<text>Last Published:</text> " + document.lastModified);
// --></script>
</div>
<div class="copyright">
Copyright &copy;
2006 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
</div>
</div>
</body>
</html>