SOLR-284: Solr Cell: Add support for Tika content extraction

git-svn-id: https://svn.apache.org/repos/asf/lucene/solr/trunk@723977 13f79535-47bb-0310-9956-ffa450edef68
Grant Ingersoll 2008-12-06 13:04:26 +00:00
parent 474ab9a515
commit cedd07b500
49 changed files with 2675 additions and 93 deletions

View File

@ -98,6 +98,8 @@ New Features
can be specified.
(Georgios Stamatis, Lars Kotthoff, Chris Harris via koji)
20. SOLR-284: Added support for extracting content from binary documents like MS Word and PDF using Apache Tika. See also contrib/extraction/CHANGES.txt (Eric Pugh, Chris Harris, gsingers)
Optimizations
----------------------
1. SOLR-374: Use IndexReader.reopen to save resources by re-using parts of the

View File

@ -261,9 +261,9 @@ such code.
1.13. You (or Your) means an individual or a legal entity exercising rights
under, and complying with all of the terms of, this License. For legal
entities, You includes any entity which controls, is controlled by, or is under
common control with You. For purposes of this definition, control means (a) the
power, direct or indirect, to cause the direction or management of such entity,
whether by contract or otherwise, or (b) ownership of more than fifty percent
(50%) of the outstanding shares or beneficial ownership of such entity.
2. License Grants.
@ -278,12 +278,12 @@ with or without Modifications, and/or as part of a Larger Work; and (b) under
Patent Claims infringed by the making, using or selling of Original Software,
to make, have made, use, practice, sell, and offer for sale, and/or otherwise
dispose of the Original Software (or portions thereof). (c) The licenses
granted in Sections 2.1(a) and (b) are effective on the date Initial Developer
first distributes or otherwise makes the Original Software available to a third
party under the terms of this License. (d) Notwithstanding Section 2.1(b)
above, no patent license is granted: (1) for code that You delete from the
Original Software, or (2) for infringements caused by: (i) the modification of
the Original Software, or (ii) the combination of the Original Software with
other software or devices.
2.2. Contributor Grant. Conditioned upon Your compliance with Section 3.1
@ -297,17 +297,17 @@ and/or as part of a Larger Work; and (b) under Patent Claims infringed by the
making, using, or selling of Modifications made by that Contributor either
alone and/or in combination with its Contributor Version (or portions of such
combination), to make, use, sell, offer for sale, have made, and/or otherwise
dispose of: (1) Modifications made by that Contributor (or portions thereof);
and (2) the combination of Modifications made by that Contributor with its
Contributor Version (or portions of such combination). (c) The licenses
granted in Sections 2.2(a) and 2.2(b) are effective on the date Contributor
first distributes or otherwise makes the Modifications available to a third
party. (d) Notwithstanding Section 2.2(b) above, no patent license is granted:
(1) for any code that Contributor has deleted from the Contributor Version;
(2) for infringements caused by: (i) third party modifications of Contributor
Version, or (ii) the combination of Modifications made by that Contributor with
other software (except as part of the Contributor Version) or other devices; or
(3) under Patent Claims infringed by Covered Software in the absence of
Modifications made by that Contributor.
3. Distribution Obligations.
@ -389,9 +389,9 @@ License published by the license steward. 4.3. Modified Versions.
When You are an Initial Developer and You want to create a new license for Your
Original Software, You may create and use a modified version of this License if
You: (a) rename the license and remove any references to the name of the
license steward (except to note that the license differs from this License);
and (b) otherwise make it clear that the license contains terms which differ
from this License.
5. DISCLAIMER OF WARRANTY.
@ -422,14 +422,14 @@ the Participant is a Contributor or the Original Software where the Participant
is the Initial Developer) directly or indirectly infringes any patent, then any
and all rights granted directly or indirectly to You by such Participant, the
Initial Developer (if the Initial Developer is not the Participant) and all
Contributors under Sections 2.1 and/or 2.2 of this License shall, upon 60 days
notice from Participant terminate prospectively and automatically at the
expiration of such 60 day notice period, unless if within such 60 day period
You withdraw Your claim with respect to the Participant Software against such
Participant either unilaterally or pursuant to a written agreement with
Participant.
6.3. In the event of termination under Sections 6.1 or 6.2 above, all end user
licenses that have been validly granted by You or any distributor hereunder
prior to termination (excluding licenses granted to You by any distributor)
shall survive termination.
@ -453,9 +453,9 @@ LIMITATION MAY NOT APPLY TO YOU.
8. U.S. GOVERNMENT END USERS.
The Covered Software is a commercial item, as that term is defined in
48 C.F.R. 2.101 (Oct. 1995), consisting of commercial computer software (as
that term is defined at 48 C.F.R. 252.227-7014(a)(1)) and commercial computer
software documentation as such terms are used in 48 C.F.R. 12.212 (Sept. 1995).
Consistent with 48 C.F.R. 12.212 and 48 C.F.R. 227.7202-1 through 227.7202-4
(June 1995), all U.S. Government End Users acquire Covered Software with only
those rights set forth herein. This U.S. Government Rights clause is in lieu
@ -736,3 +736,161 @@ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
===========================================================================
Apache Tika Licenses - contrib/extraction
---------------------------------------------------------------------------
Apache Tika is licensed under the ASL 2.0. See above for the text of the license.
APACHE TIKA SUBCOMPONENTS
Apache Tika includes a number of subcomponents with separate copyright notices
and license terms. Your use of these subcomponents is subject to the terms and
conditions of the following licenses.
Bouncy Castle libraries (bcmail and bcprov)
Copyright (c) 2000-2006 The Legion Of The Bouncy Castle
(http://www.bouncycastle.org)
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files
(the "Software"), to deal in the Software without restriction,
including without limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of the Software,
and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
PDFBox library (pdfbox)
Copyright (c) 2003-2005, www.pdfbox.org
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of pdfbox; nor the names of its
contributors may be used to endorse or promote products derived from this
software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
OF SUCH DAMAGE.
FontBox and JempBox libraries (fontbox, jempbox)
Copyright (c) 2003-2005, www.fontbox.org
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of fontbox; nor the names of its
contributors may be used to endorse or promote products derived from this
software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
OF SUCH DAMAGE.
ICU4J library (icu4j)
Copyright (c) 1995-2005 International Business Machines Corporation
and others
All rights reserved.
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, and/or sell copies of the Software, and to permit persons
to whom the Software is furnished to do so, provided that the above
copyright notice(s) and this permission notice appear in all copies
of the Software and that both the above copyright notice(s) and this
permission notice appear in supporting documentation.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE
BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES,
OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
SOFTWARE.
Except as contained in this notice, the name of a copyright holder shall
not be used in advertising or otherwise to promote the sale, use or other
dealings in this Software without prior written authorization of the
copyright holder.
ASM library (asm)
Copyright (c) 2000-2005 INRIA, France Telecom
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holders nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
THE POSSIBILITY OF SUCH DAMAGE.

View File

@ -113,3 +113,24 @@ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
=========================================================================
== Apache Tika Notices ==
=========================================================================
The following notices apply to the Apache Tika libraries in contrib/extraction/lib:
This product includes software developed by the following copyright owners:
Copyright (c) 2000-2006 The Legion Of The Bouncy Castle
(http://www.bouncycastle.org)
Copyright (c) 2003-2005, www.pdfbox.org
Copyright (c) 2003-2005, www.fontbox.org
Copyright (c) 1995-2005 International Business Machines Corporation and others
Copyright (c) 2000-2005 INRIA, France Telecom

View File

@ -30,8 +30,7 @@
<!-- Destination for distribution files (demo WAR, src distro, etc.) -->
<property name="dist" value="dist" />
<!-- Example directory -->
<property name="example" value="example" />
<property name="clover.db.dir" location="${dest}/tests/clover/db"/>
<property name="clover.report.dir" location="${dest}/tests/clover/reports"/>
@ -612,7 +611,7 @@
<target name="example"
description="Creates a runnable example configuration."
depends="init-forrest-entities,dist-contrib,dist-war">
depends="init-forrest-entities,dist-contrib,dist-war,example-contrib">
<copy file="${dist}/${fullnamever}.war"
tofile="${example}/webapps/${ant.project.name}.war"/>
<jar destfile="${example}/exampledocs/post.jar"
@ -624,7 +623,7 @@
value="org.apache.solr.util.SimplePostTool"/>
</manifest>
</jar>
<copy todir="${example}/solr/bin">
<fileset dir="${src}/scripts">
<exclude name="scripts.conf"/>

View File

@ -23,17 +23,14 @@ import java.io.Writer;
import java.net.URLEncoder;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Date;
import java.util.Iterator;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.httpclient.util.DateParseException;
import org.apache.commons.httpclient.util.DateUtil;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.SolrInputField;
@ -41,6 +38,7 @@ import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.ContentStream;
import org.apache.solr.common.util.ContentStreamBase;
import org.apache.solr.common.util.XML;
import org.apache.solr.common.util.DateUtil;
/**
@ -61,17 +59,17 @@ public class ClientUtils
{
if( str == null )
return null;
ArrayList<ContentStream> streams = new ArrayList<ContentStream>( 1 );
ContentStreamBase ccc = new ContentStreamBase.StringStream( str );
ccc.setContentType( contentType );
streams.add( ccc );
return streams;
}
/**
* @param d SolrDocument to convert
* @return a SolrInputDocument with the same fields and values as the
* SolrDocument. All boosts are 1.0f
*/
public static SolrInputDocument toSolrInputDocument( SolrDocument d )
@ -95,38 +93,38 @@ public class ClientUtils
}
return doc;
}
//------------------------------------------------------------------------
//------------------------------------------------------------------------
public static void writeXML( SolrInputDocument doc, Writer writer ) throws IOException
{
writer.write("<doc boost=\""+doc.getDocumentBoost()+"\">");
for( SolrInputField field : doc ) {
float boost = field.getBoost();
String name = field.getName();
for( Object v : field ) {
if (v instanceof Date) {
v = fmtThreadLocal.get().format( (Date)v );
v = DateUtil.getThreadLocalDateFormat().format( (Date)v );
}
if( boost != 1.0f ) {
XML.writeXML(writer, "field", v.toString(), "name", name, "boost", boost );
}
else {
XML.writeXML(writer, "field", v.toString(), "name", name );
}
// only write the boost for the first multi-valued field
// otherwise, the used boost is the product of all the boost values
boost = 1.0f;
}
}
writer.write("</doc>");
}
public static String toXML( SolrInputDocument doc )
{
StringWriter str = new StringWriter();
try {
@ -135,59 +133,45 @@ public class ClientUtils
catch( Exception ex ){}
return str.toString();
}
//---------------------------------------------------------------------------------------
public static final Collection<String> fmts = new ArrayList<String>();
static {
fmts.add( "yyyy-MM-dd'T'HH:mm:ss'Z'" );
fmts.add( "yyyy-MM-dd'T'HH:mm:ss" );
fmts.add( "yyyy-MM-dd" );
}
/**
* @deprecated Use {@link org.apache.solr.common.util.DateUtil#DEFAULT_DATE_FORMATS}
*/
public static final Collection<String> fmts = DateUtil.DEFAULT_DATE_FORMATS;
/**
* Returns a formatter that can be use by the current thread if needed to
* convert Date objects to the Internal representation.
* @throws ParseException
* @throws DateParseException
*
* @deprecated Use {@link org.apache.solr.common.util.DateUtil#parseDate(String)}
*/
public static Date parseDate( String d ) throws ParseException, DateParseException
{
// 2007-04-26T08:05:04Z
if( d.endsWith( "Z" ) && d.length() > 20 ) {
return getThreadLocalDateFormat().parse( d );
}
return DateUtil.parseDate( d, fmts );
return DateUtil.parseDate(d);
}
/**
* Returns a formatter that can be use by the current thread if needed to
* convert Date objects to the Internal representation.
*/
public static DateFormat getThreadLocalDateFormat() {
return fmtThreadLocal.get();
}
public static TimeZone UTC = TimeZone.getTimeZone("UTC");
private static ThreadLocalDateFormat fmtThreadLocal = new ThreadLocalDateFormat();
private static class ThreadLocalDateFormat extends ThreadLocal<DateFormat> {
DateFormat proto;
public ThreadLocalDateFormat() {
super();
//2007-04-26T08:05:04Z
SimpleDateFormat tmp = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
tmp.setTimeZone(UTC);
proto = tmp;
}
@Override
protected DateFormat initialValue() {
return (DateFormat) proto.clone();
}
}
/**
* Returns a formatter that can be use by the current thread if needed to
* convert Date objects to the Internal representation.
*
* @deprecated use {@link org.apache.solr.common.util.DateUtil#getThreadLocalDateFormat()}
*/
public static DateFormat getThreadLocalDateFormat() {
return DateUtil.getThreadLocalDateFormat();
}
/**
* @deprecated Use {@link org.apache.solr.common.util.DateUtil#UTC}.
*/
public static TimeZone UTC = DateUtil.UTC;
/**
* See: http://lucene.apache.org/java/docs/queryparsersyntax.html#Escaping Special Characters
*/
@ -206,7 +190,7 @@ public class ClientUtils
}
return sb.toString();
}
public static String toQueryString( SolrParams params, boolean xml ) {
StringBuilder sb = new StringBuilder(128);
try {
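
The refactoring above moves Solr's internal date handling out of ClientUtils into org.apache.solr.common.util.DateUtil. For reference, here is a minimal self-contained sketch of the thread-local UTC formatter pattern the removed code used (and that DateUtil is assumed to keep providing); SimpleDateFormat is not thread-safe, so one prototype is cloned per thread:

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

final class InternalDateFormat {
  // Each thread lazily gets its own formatter pinned to UTC and to
  // Solr's internal date form, yyyy-MM-dd'T'HH:mm:ss.SSS'Z'.
  private static final ThreadLocal<DateFormat> FMT = new ThreadLocal<DateFormat>() {
    @Override
    protected DateFormat initialValue() {
      SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
      f.setTimeZone(TimeZone.getTimeZone("UTC"));
      return f;
    }
  };

  static String format(Date d) {
    return FMT.get().format(d); // e.g. 2008-12-06T13:04:26.000Z
  }

  public static void main(String[] args) {
    System.out.println(format(new Date()));
  }
}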

View File

@ -38,6 +38,9 @@
<format property="dateversion" pattern="yyyy.MM.dd.HH.mm.ss" />
</tstamp>
<!-- Example directory -->
<property name="example" value="${common.dir}/example" />
<!--
we attempt to exec svnversion to get details build information
for jar manifests. this property can be set at runtime to an
@ -332,6 +335,10 @@
<contrib-crawl target="dist" failonerror="true" />
</target>
<target name="example-contrib" description="Tell the contrib to add their stuff to examples">
<contrib-crawl target="example" failonerror="true" />
</target>
<!-- Creates a Manifest file for Jars and WARs -->
<target name="make-manifest">
<!-- If possible, include the svnversion -->

View File

@ -121,6 +121,8 @@
</sources>
</invoke-javadoc>
</sequential>
</target>
<target name="example" depends="build"/>
</project>

View File

@ -0,0 +1,25 @@
Apache Solr Content Extraction Library (Solr Cell)
Version 1.4-dev
Release Notes
This file describes changes to the Solr Cell (contrib/extraction) module. See SOLR-284 for details.
Introduction
------------
Apache Solr Extraction provides a means for extracting and indexing content contained in "rich" documents such
as Microsoft Word and Adobe PDF (each name is a trademark of its respective owner). This contrib module
uses Apache Tika to extract content and metadata from the files, which can then be indexed. For more information,
see http://wiki.apache.org/solr/ExtractingRequestHandler
Getting Started
---------------
You will need Solr up and running. Then, simply add the extraction JAR file, plus the Tika dependencies (in the ./lib folder),
to your Solr Home lib directory. See http://wiki.apache.org/solr/ExtractingRequestHandler for more details on hooking it in
and configuring it.
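
As a quick, hypothetical illustration of the kind of request this enables, the sketch below posts a PDF with plain JDK HTTP. The /update/extract path and the "text"/"id" field names are assumptions about your configuration; the ext.* parameter names come from ExtractingParams in this commit:

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ExtractPost {
  public static void main(String[] args) throws Exception {
    // ext.def.fl: field for the extracted text; ext.literal.id: a literal field value;
    // ext.resource.name: a hint Tika can use when guessing the MIME type.
    URL url = new URL("http://localhost:8983/solr/update/extract"
        + "?ext.def.fl=text&ext.literal.id=doc1&ext.resource.name=example.pdf");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setDoOutput(true);
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/pdf");
    InputStream in = new FileInputStream("example.pdf");
    OutputStream out = conn.getOutputStream();
    byte[] buf = new byte[8192];
    for (int n; (n = in.read(buf)) != -1; ) {
      out.write(buf, 0, n);
    }
    out.close();
    in.close();
    System.out.println("HTTP " + conn.getResponseCode());
  }
}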
$Id:$
================== Release 1.4-dev ==================
1. SOLR-284: Added in support for extraction. (Eric Pugh, Chris Harris, gsingers)

View File

@ -0,0 +1,134 @@
<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<project name="solr-extraction" default="build">
<property name="solr-path" value="../.." />
<property name="tika.version" value="0.2-SNAPSHOT"/>
<property name="tika.lib" value="lib/tika-${tika.version}-standalone.jar"/>
<import file="../../common-build.xml"/>
<description>
Solr Integration with Tika for extracting content from binary file formats such as Microsoft Word and Adobe PDF.
</description>
<path id="common.classpath">
<pathelement location="${solr-path}/build/common" />
<pathelement location="${solr-path}/build/core" />
<fileset dir="lib" includes="*.jar"/>
<fileset dir="${solr-path}/lib" includes="*.jar"></fileset>
</path>
<path id="test.classpath">
<path refid="common.classpath" />
<pathelement path="${dest}/classes" />
<pathelement path="${dest}/test-classes" />
<pathelement path="${java.class.path}"/>
</path>
<target name="clean">
<delete failonerror="false" dir="${dest}"/>
</target>
<target name="init">
<mkdir dir="${dest}/classes"/>
<mkdir dir="${build.javadoc}" />
<ant dir="../../" inheritall="false" target="compile" />
<ant dir="../../" inheritall="false" target="make-manifest" />
</target>
<target name="compile" depends="init">
<solr-javac destdir="${dest}/classes"
classpathref="common.classpath">
<src path="src/main/java" />
</solr-javac>
</target>
<target name="build" depends="compile">
<solr-jar destfile="${dest}/${fullnamever}.jar" basedir="${dest}/classes"
manifest="${common.dir}/${dest}/META-INF/MANIFEST.MF">
<!--<zipfileset src="${tika.lib}"/>-->
</solr-jar>
</target>
<target name="compileTests" depends="compile">
<solr-javac destdir="${dest}/test-classes"
classpathref="test.classpath">
<src path="src/test/java" />
</solr-javac>
</target>
<target name="test" depends="compileTests">
<mkdir dir="${junit.output.dir}"/>
<junit printsummary="on"
haltonfailure="no"
errorProperty="tests.failed"
failureProperty="tests.failed"
dir="src/test/resources/"
>
<formatter type="brief" usefile="false" if="junit.details"/>
<classpath refid="test.classpath"/>
<formatter type="xml"/>
<batchtest fork="yes" todir="${junit.output.dir}" unless="testcase">
<fileset dir="src/test/java" includes="${junit.includes}"/>
</batchtest>
<batchtest fork="yes" todir="${junit.output.dir}" if="testcase">
<fileset dir="src/test/java" includes="**/${testcase}.java"/>
</batchtest>
</junit>
<fail if="tests.failed">Tests failed!</fail>
</target>
<target name="dist" depends="build">
</target>
<target name="example" depends="build">
<!-- Copy the jar into example/solr/lib -->
<copy file="${dest}/${fullnamever}.jar" todir="${example}/solr/lib"/>
<copy todir="${example}/solr/lib">
<fileset dir="lib">
<include name="**/*.jar"/>
</fileset>
</copy>
</target>
<target name="javadoc">
<sequential>
<mkdir dir="${build.javadoc}/contrib-${name}"/>
<path id="javadoc.classpath">
<path refid="common.classpath"/>
</path>
<invoke-javadoc
destdir="${build.javadoc}/contrib-${name}"
title="${Name} ${version} contrib-${fullnamever} API">
<sources>
<packageset dir="src/main/java"/>
</sources>
</invoke-javadoc>
</sequential>
</target>
</project>

View File

@ -0,0 +1,2 @@
AnyObjectId[8217cae0a1bc977b241e0c8517cc2e3e7cede276] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[680f8c60c1f0393f7e56595e24b29b3ceb46e933] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[552721d0e8deb28f2909cfc5ec900a5e35736795] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[957b6752af9a60c1bb2a4f65db0e90e5ce00f521] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[133dc6cb35f5ca2c5920fd0933a557c2def88680] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[87b80ab5db1729662ccf3439e147430a28c36d03] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[b73a80fab641131e6fbe3ae833549efb3c540d17] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[c9030febd2ae484532407db9ef98247cbe61b779] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[f5e8c167e7f7f3d078407859cb50b8abf23c697e] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[674d71e89ea154dbe2e3cd032821c22b39e8fd68] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[625130719013f195869881a36dcb8d2b14d64d1e] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[037b4fe2743eb161eec649f6fa5fa4725585b518] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[f821d644766c4d5c95e53db4b83cc6cb37b553f6] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[9e472a1610fa5d6736ecd56aec663623170003a3] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[58a33ac11683bec703fadffdbb263036146d7a74] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[16b9a3ed370d5a617d72f0b8935859bf0eac7678] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[3b351f6e2b566f73b742510738a52b866b4ffd0d] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,2 @@
AnyObjectId[b338fb66932a763d6939dc93f27ed985ca5d1ebb] was removed in git history.
Apache SVN contains full history.

View File

@ -0,0 +1,179 @@
package org.apache.solr.handler;
import org.apache.commons.io.IOUtils;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.params.UpdateParams;
import org.apache.solr.common.util.ContentStream;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.apache.tika.sax.xpath.Matcher;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.xml.sax.ContentHandler;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
/**
 * Loads a ContentStream by running it through Tika and handing the extracted
 * content to a SolrContentHandler, which produces the document to index.
 **/
public class ExtractingDocumentLoader extends ContentStreamLoader {
/**
* XHTML XPath parser.
*/
private static final XPathParser PARSER =
new XPathParser("xhtml", XHTMLContentHandler.XHTML);
final IndexSchema schema;
final SolrParams params;
final UpdateRequestProcessor processor;
protected AutoDetectParser autoDetectParser;
private final AddUpdateCommand templateAdd;
protected TikaConfig config;
protected SolrContentHandlerFactory factory;
//protected Collection<String> dateFormats = DateUtil.DEFAULT_DATE_FORMATS;
ExtractingDocumentLoader(SolrQueryRequest req, UpdateRequestProcessor processor,
TikaConfig config, SolrContentHandlerFactory factory) {
this.params = req.getParams();
schema = req.getSchema();
this.config = config;
this.processor = processor;
templateAdd = new AddUpdateCommand();
templateAdd.allowDups = false;
templateAdd.overwriteCommitted = true;
templateAdd.overwritePending = true;
if (params.getBool(UpdateParams.OVERWRITE, true)) {
templateAdd.allowDups = false;
templateAdd.overwriteCommitted = true;
templateAdd.overwritePending = true;
} else {
templateAdd.allowDups = true;
templateAdd.overwriteCommitted = false;
templateAdd.overwritePending = false;
}
//this is lightweight
autoDetectParser = new AutoDetectParser(config);
this.factory = factory;
}
/**
* this must be MT safe... may be called concurrently from multiple threads.
*
* @param handler the content handler that produces the SolrInputDocument
* @param template the add command to populate with the document and forward
*/
void doAdd(SolrContentHandler handler, AddUpdateCommand template)
throws IOException {
template.solrDoc = handler.newDocument();
processor.processAdd(template);
}
void addDoc(SolrContentHandler handler) throws IOException {
templateAdd.indexedId = null;
doAdd(handler, templateAdd);
}
/**
* @param req the solr request
* @param rsp the solr response
* @param stream the content stream from which to extract
* @throws java.io.IOException if reading the stream or indexing fails
*/
public void load(SolrQueryRequest req, SolrQueryResponse rsp, ContentStream stream) throws IOException {
errHeader = "ExtractingDocumentLoader: " + stream.getSourceInfo();
Parser parser = null;
String streamType = req.getParams().get(ExtractingParams.STREAM_TYPE, null);
if (streamType != null) {
//Cache? Parsers are lightweight to construct and thread-safe, so I'm told
parser = config.getParser(streamType.trim().toLowerCase());
} else {
parser = autoDetectParser;
}
if (parser != null) {
Metadata metadata = new Metadata();
metadata.add(ExtractingMetadataConstants.STREAM_NAME, stream.getName());
metadata.add(ExtractingMetadataConstants.STREAM_SOURCE_INFO, stream.getSourceInfo());
metadata.add(ExtractingMetadataConstants.STREAM_SIZE, String.valueOf(stream.getSize()));
metadata.add(ExtractingMetadataConstants.STREAM_CONTENT_TYPE, stream.getContentType());
// If you specify the resource name (the filename, roughly) with this parameter,
// then Tika can make use of it in guessing the appropriate MIME type:
String resourceName = req.getParams().get(ExtractingParams.RESOURCE_NAME, null);
if (resourceName != null) {
metadata.add(Metadata.RESOURCE_NAME_KEY, resourceName);
}
SolrContentHandler handler = factory.createSolrContentHandler(metadata, params, schema);
InputStream inputStream = null;
try {
inputStream = stream.getStream();
String xpathExpr = params.get(ExtractingParams.XPATH_EXPRESSION);
boolean extractOnly = params.getBool(ExtractingParams.EXTRACT_ONLY, false);
ContentHandler parsingHandler = handler;
StringWriter writer = null;
XMLSerializer serializer = null;
if (extractOnly == true) {
writer = new StringWriter();
serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", true));
if (xpathExpr != null) {
Matcher matcher =
PARSER.parse(xpathExpr);
serializer.startDocument();//The MatchingContentHandler does not invoke startDocument. See http://tika.markmail.org/message/kknu3hw7argwiqin
parsingHandler = new MatchingContentHandler(serializer, matcher);
} else {
parsingHandler = serializer;
}
} else if (xpathExpr != null) {
Matcher matcher =
PARSER.parse(xpathExpr);
parsingHandler = new MatchingContentHandler(handler, matcher);
} //else leave it as is
//potentially use a wrapper handler for parsing, but we still need the SolrContentHandler for getting the document.
parser.parse(inputStream, parsingHandler, metadata);
if (extractOnly == false) {
addDoc(handler);
} else {
//serializer is not null, so we need to call endDoc on it if using xpath
if (xpathExpr != null){
serializer.endDocument();
}
rsp.add(stream.getName(), writer.toString());
writer.close();
}
} catch (Exception e) {
//TODO: handle here with an option to not fail and just log the exception
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, e);
} finally {
IOUtils.closeQuietly(inputStream);
}
} else {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Stream type of " + streamType + " didn't match any known parsers. Please supply the " + ExtractingParams.STREAM_TYPE + " parameter.");
}
}
}

View File

@ -0,0 +1,13 @@
package org.apache.solr.handler;
/**
 * Constants for metadata keys that describe the incoming content stream
 * (name, source info, size, and content type).
 **/
public interface ExtractingMetadataConstants {
String STREAM_NAME = "stream_name";
String STREAM_SOURCE_INFO = "stream_source_info";
String STREAM_SIZE = "stream_size";
String STREAM_CONTENT_TYPE = "stream_content_type";
}

View File

@ -0,0 +1,125 @@
package org.apache.solr.handler;
/**
* The various parameters to use when extracting content.
*
**/
public interface ExtractingParams {
public static final String EXTRACTING_PREFIX = "ext.";
/**
* The param prefix for mapping Tika metadata to Solr fields.
* <p/>
* To map a field, add a name like:
* <pre>ext.map.title=solr.title</pre>
*
* In this example, the tika "title" metadata value will be added to a Solr field named "solr.title"
*
*
*/
public static final String MAP_PREFIX = EXTRACTING_PREFIX + "map.";
/**
* The boost value for the name of the field. The boost can be specified by a name mapping.
* <p/>
* For example
* <pre>
* ext.map.title=solr.title
* ext.boost.solr.title=2.5
* </pre>
* will boost the solr.title field for this document by 2.5
*
*/
public static final String BOOST_PREFIX = EXTRACTING_PREFIX + "boost.";
/**
* Pass in literal values to be added to the document, as in
* <pre>
* ext.literal.myField=Foo
* </pre>
*
*/
public static final String LITERALS_PREFIX = EXTRACTING_PREFIX + "literal.";
/**
* Restrict the extracted parts of a document to be indexed
* by passing in an XPath expression. All content that satisfies the XPath expr.
* will be passed to the {@link org.apache.solr.handler.SolrContentHandler}.
* <p/>
* See Tika's docs for what the extracted document looks like.
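* <p/>
* A hypothetical example restricting indexing to body content (the exact
* expression syntax is Tika's, not defined by this commit):
* <pre>ext.xpath=/xhtml:html/xhtml:body/descendant:node()</pre>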
* <p/>
* @see #DEFAULT_FIELDNAME
* @see #CAPTURE_FIELDS
*/
public static final String XPATH_EXPRESSION = EXTRACTING_PREFIX + "xpath";
/**
* Only extract and return the document, do not index it.
*/
public static final String EXTRACT_ONLY = EXTRACTING_PREFIX + "extract.only";
/**
* Don't throw an exception if a field doesn't exist, just ignore it
*/
public static final String IGNORE_UNDECLARED_FIELDS = EXTRACTING_PREFIX + "ignore.und.fl";
/**
* Index attributes separately according to their name, instead of just adding them to the string buffer
*/
public static final String INDEX_ATTRIBUTES = EXTRACTING_PREFIX + "idx.attr";
/**
* The field to index the contents to by default. If you want to capture a specific piece
* of the Tika document separately, see {@link #CAPTURE_FIELDS}.
*
* @see #CAPTURE_FIELDS
*/
public static final String DEFAULT_FIELDNAME = EXTRACTING_PREFIX + "def.fl";
/**
* Capture the specified fields (and everything included below them that isn't captured by some other capture field) separately from the default. This is different
* from the case of passing in an XPath expression.
* <p/>
* The Capture field is based on the localName returned to the {@link org.apache.solr.handler.SolrContentHandler}
* by Tika, not to be confused with the mapped field. The field name can then
* be mapped into the index schema.
* <p/>
* For instance, a Tika document may look like:
* <pre>
* &lt;html&gt;
* ...
* &lt;body&gt;
* &lt;p&gt;some text here. &lt;div&gt;more text&lt;/div&gt;&lt;/p&gt;
* Some more text
* &lt;/body&gt;
* </pre>
* By passing in the p tag, you could capture all P tags separately from the rest of the text.
* Thus, in the example, the capture of the P tag would be: "some text here. more text"
*
* @see #DEFAULT_FIELDNAME
*/
public static final String CAPTURE_FIELDS = EXTRACTING_PREFIX + "capture";
/**
* The type of the stream. If not specified, Tika will use mime type detection.
*/
public static final String STREAM_TYPE = EXTRACTING_PREFIX + "stream.type";
/**
* Optional. The file name. If specified, Tika can take this into account while
* guessing the MIME type.
*/
public static final String RESOURCE_NAME = EXTRACTING_PREFIX + "resource.name";
/**
* Optional. If specified, the prefix will be prepended to all Metadata, such that it would be possible
* to set up a dynamic field to automatically capture it
*/
public static final String METADATA_PREFIX = EXTRACTING_PREFIX + "metadata.prefix";
}
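
Taken together, a single extraction request can combine several of these parameters. A hypothetical query string (the /update/extract path and the divText/author field names are illustrative assumptions, not defined by this commit):

http://localhost:8983/solr/update/extract?ext.def.fl=text&ext.capture=div&ext.map.div=divText&ext.boost.divText=2.5&ext.literal.author=pugh&ext.resource.name=report.pdf

Here every div element is captured into divText and boosted by 2.5, a literal author field is added, the resource name helps Tika guess the MIME type, and the remaining content goes to the default text field.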

View File

@ -0,0 +1,134 @@
package org.apache.solr.handler;
/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrException.ErrorCode;
import org.apache.solr.common.util.DateUtil;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.util.plugin.SolrCoreAware;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.File;
import java.util.Collection;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
/**
 * Handler for rich documents, like PDF or Word, or any other file format that Tika handles, where the text
 * first needs to be extracted from the document.
 */
public class ExtractingRequestHandler extends ContentStreamHandlerBase implements SolrCoreAware {
private transient static Logger log = LoggerFactory.getLogger(ExtractingRequestHandler.class);
public static final String CONFIG_LOCATION = "tika.config";
public static final String DATE_FORMATS = "date.formats";
protected TikaConfig config;
protected Collection<String> dateFormats = DateUtil.DEFAULT_DATE_FORMATS;
protected SolrContentHandlerFactory factory;
@Override
public void init(NamedList args) {
super.init(args);
}
public void inform(SolrCore core) {
if (initArgs != null) {
//if relative, then relative to config dir; otherwise, absolute path
String tikaConfigLoc = (String) initArgs.get(CONFIG_LOCATION);
if (tikaConfigLoc != null) {
File configFile = new File(tikaConfigLoc);
if (configFile.isAbsolute() == false) {
configFile = new File(core.getResourceLoader().getConfigDir(), configFile.getPath());
}
try {
config = new TikaConfig(configFile);
} catch (Exception e) {
throw new SolrException(ErrorCode.SERVER_ERROR, e);
}
} else {
try {
config = TikaConfig.getDefaultConfig();
} catch (TikaException e) {
throw new SolrException(ErrorCode.SERVER_ERROR, e);
}
}
NamedList configDateFormats = (NamedList) initArgs.get(DATE_FORMATS);
if (configDateFormats != null && configDateFormats.size() > 0) {
dateFormats = new HashSet<String>();
//hoist the iterator: calling iterator() inside the loop would always revisit the first entry
Iterator it = configDateFormats.iterator();
while (it.hasNext()) {
String format = (String) ((Map.Entry) it.next()).getValue();
log.info("Adding Date Format: " + format);
dateFormats.add(format);
}
}
} else {
try {
config = TikaConfig.getDefaultConfig();
} catch (TikaException e) {
throw new SolrException(ErrorCode.SERVER_ERROR, e);
}
}
factory = createFactory();
}
protected SolrContentHandlerFactory createFactory() {
return new SolrContentHandlerFactory(dateFormats);
}
protected ContentStreamLoader newLoader(SolrQueryRequest req, UpdateRequestProcessor processor) {
return new ExtractingDocumentLoader(req, processor, config, factory);
}
// ////////////////////// SolrInfoMBeans methods //////////////////////
@Override
public String getDescription() {
return "Add/Update Rich document";
}
@Override
public String getVersion() {
return "$Revision:$";
}
@Override
public String getSourceId() {
return "$Id:$";
}
@Override
public String getSource() {
return "$URL:$";
}
}

View File

@ -0,0 +1,353 @@
package org.apache.solr.handler;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.SolrInputField;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.DateUtil;
import org.apache.solr.schema.DateField;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.schema.StrField;
import org.apache.solr.schema.TextField;
import org.apache.solr.schema.FieldType;
import org.apache.solr.schema.UUIDField;
import org.apache.tika.metadata.Metadata;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import java.text.DateFormat;
import java.util.Collection;
import java.util.Collections;
import java.util.Date;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Stack;
import java.util.UUID;
/**
* This class is not thread-safe. It is responsible for responding to Tika extraction events and producing a Solr document
*/
public class SolrContentHandler extends DefaultHandler implements ExtractingParams {
private transient static Logger log = LoggerFactory.getLogger(SolrContentHandler.class);
protected SolrInputDocument document;
protected Collection<String> dateFormats = DateUtil.DEFAULT_DATE_FORMATS;
protected Metadata metadata;
protected SolrParams params;
protected StringBuilder catchAllBuilder = new StringBuilder(2048);
//private StringBuilder currentBuilder;
protected IndexSchema schema;
//create empty so we don't have to worry about null checks
protected Map<String, StringBuilder> fieldBuilders = Collections.emptyMap();
protected Stack<StringBuilder> bldrStack = new Stack<StringBuilder>();
protected boolean ignoreUndeclaredFields = false;
protected boolean indexAttribs = false;
protected String defaultFieldName;
protected String metadataPrefix = "";
/**
* Only access through getNextId();
*/
private static long identifier = Long.MIN_VALUE;
public SolrContentHandler(Metadata metadata, SolrParams params, IndexSchema schema) {
this(metadata, params, schema, DateUtil.DEFAULT_DATE_FORMATS);
}
public SolrContentHandler(Metadata metadata, SolrParams params,
IndexSchema schema, Collection<String> dateFormats) {
document = new SolrInputDocument();
this.metadata = metadata;
this.params = params;
this.schema = schema;
this.dateFormats = dateFormats;
this.ignoreUndeclaredFields = params.getBool(ExtractingParams.IGNORE_UNDECLARED_FIELDS, false);
this.indexAttribs = params.getBool(ExtractingParams.INDEX_ATTRIBUTES, false);
this.defaultFieldName = params.get(ExtractingParams.DEFAULT_FIELDNAME);
this.metadataPrefix = params.get(ExtractingParams.METADATA_PREFIX, "");
//if there's no default field and we are intending to index, then throw an exception
if (defaultFieldName == null && params.getBool(ExtractingParams.EXTRACT_ONLY, false) == false) {
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "No default field name specified");
}
String[] captureFields = params.getParams(ExtractingParams.CAPTURE_FIELDS);
if (captureFields != null && captureFields.length > 0) {
fieldBuilders = new HashMap<String, StringBuilder>();
for (int i = 0; i < captureFields.length; i++) {
fieldBuilders.put(captureFields[i], new StringBuilder());
}
}
bldrStack.push(catchAllBuilder);
}
/**
* This is called by a consumer when it is ready to deal with a new SolrInputDocument. Overriding
* classes can use this hook to add in or change whatever they deem fit for the document at that time.
* The base implementation adds the metadata as fields, allowing for potential remapping.
*
* @return The {@link org.apache.solr.common.SolrInputDocument}.
*/
public SolrInputDocument newDocument() {
float boost = 1.0f;
//handle the metadata extracted from the document
for (String name : metadata.names()) {
String[] vals = metadata.getValues(name);
name = findMappedMetadataName(name);
SchemaField schFld = schema.getFieldOrNull(name);
if (schFld != null) {
boost = getBoost(name);
if (schFld.multiValued()) {
for (int i = 0; i < vals.length; i++) {
String val = vals[i];
document.addField(name, transformValue(val, schFld), boost);
}
} else {
StringBuilder builder = new StringBuilder();
for (int i = 0; i < vals.length; i++) {
builder.append(vals[i]).append(' ');
}
document.addField(name, transformValue(builder.toString().trim(), schFld), boost);
}
} else {
//TODO: error or log?
if (ignoreUndeclaredFields == false) {
// Arguably we should handle this as a special case. Why? Because unlike basically
// all the other fields in metadata, this one was probably set not by Tika by in
// ExtractingDocumentLoader.load(). You shouldn't have to define a mapping for this
// field just because you specified a resource.name parameter to the handler, should
// you?
if (!Metadata.RESOURCE_NAME_KEY.equals(name)) {
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Invalid field: " + name);
}
}
}
}
//handle the literals from the params
Iterator<String> paramNames = params.getParameterNamesIterator();
while (paramNames.hasNext()) {
String name = paramNames.next();
if (name.startsWith(LITERALS_PREFIX)) {
String fieldName = name.substring(LITERALS_PREFIX.length());
//no need to map names here, since they are literals from the user
SchemaField schFld = schema.getFieldOrNull(fieldName);
if (schFld != null) {
String value = params.get(name);
boost = getBoost(fieldName);
//no need to transform here, b/c we can assume the user sent it in correctly
document.addField(fieldName, value, boost);
} else {
handleUndeclaredField(fieldName);
}
}
}
//add in the content
document.addField(defaultFieldName, catchAllBuilder.toString(), getBoost(defaultFieldName));
//add in the captured content
for (Map.Entry<String, StringBuilder> entry : fieldBuilders.entrySet()) {
if (entry.getValue().length() > 0) {
String fieldName = findMappedName(entry.getKey());
SchemaField schFld = schema.getFieldOrNull(fieldName);
if (schFld != null) {
document.addField(fieldName, transformValue(entry.getValue().toString(), schFld), getBoost(fieldName));
} else {
handleUndeclaredField(fieldName);
}
}
}
//make sure we have a unique id, if one is needed
SchemaField uniqueField = schema.getUniqueKeyField();
if (uniqueField != null) {
String uniqueFieldName = uniqueField.getName();
SolrInputField uniqFld = document.getField(uniqueFieldName);
if (uniqFld == null) {
String uniqId = generateId(uniqueField);
if (uniqId != null) {
document.addField(uniqueFieldName, uniqId);
}
}
}
if (log.isDebugEnabled()) {
log.debug("Doc: " + document);
}
return document;
}
/**
* Generate an ID for the document. First try to get
* {@link org.apache.solr.handler.ExtractingMetadataConstants#STREAM_NAME} from the
* {@link org.apache.tika.metadata.Metadata}, then try {@link ExtractingMetadataConstants#STREAM_SOURCE_INFO}
* then try {@link org.apache.tika.metadata.Metadata#IDENTIFIER}.
* If those all are null, then generate a random UUID using {@link java.util.UUID#randomUUID()}.
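* <p>
* For example, a file posted with a stream name of "solr-word.pdf" will, for a string-typed
* unique key field, receive "solr-word.pdf" as its id.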
*
* @param uniqueField The SchemaField representing the unique field.
* @return The id as a string
*/
protected String generateId(SchemaField uniqueField) {
//we don't have a unique field specified, so let's add one
String uniqId = null;
FieldType type = uniqueField.getType();
if (type instanceof StrField || type instanceof TextField) {
uniqId = metadata.get(ExtractingMetadataConstants.STREAM_NAME);
if (uniqId == null) {
uniqId = metadata.get(ExtractingMetadataConstants.STREAM_SOURCE_INFO);
}
if (uniqId == null) {
uniqId = metadata.get(Metadata.IDENTIFIER);
}
if (uniqId == null) {
//last chance, just create one
uniqId = UUID.randomUUID().toString();
}
} else if (type instanceof UUIDField){
uniqId = UUID.randomUUID().toString();
}
else {
uniqId = String.valueOf(getNextId());
}
return uniqId;
}
@Override
public void startDocument() throws SAXException {
document.clear();
catchAllBuilder.setLength(0);
for (StringBuilder builder : fieldBuilders.values()) {
builder.setLength(0);
}
bldrStack.clear();
bldrStack.push(catchAllBuilder);
}
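/*
 * Text accumulation model: characters() appends to whichever StringBuilder is on top of
 * bldrStack. Entering an element named in ExtractingParams.CAPTURE_FIELDS pushes that
 * field's builder; endElement pops it, so surrounding text falls back to the catch-all builder.
 */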
@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
StringBuilder theBldr = fieldBuilders.get(localName);
if (theBldr != null) {
//we need to switch the currentBuilder
bldrStack.push(theBldr);
}
if (indexAttribs) {
String fieldName = findMappedName(localName);
for (int i = 0; i < attributes.getLength(); i++) {
SchemaField schFld = schema.getFieldOrNull(fieldName);
if (schFld != null) {
document.addField(fieldName, transformValue(attributes.getValue(i), schFld), getBoost(fieldName));
} else {
handleUndeclaredField(fieldName);
}
}
} else {
for (int i = 0; i < attributes.getLength(); i++) {
bldrStack.peek().append(attributes.getValue(i)).append(' ');
}
}
}
protected void handleUndeclaredField(String fieldName) {
if (!ignoreUndeclaredFields) {
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Invalid field: " + fieldName);
} else {
if (log.isInfoEnabled()) {
log.info("Ignoring Field: " + fieldName);
}
}
}
@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
StringBuilder theBldr = fieldBuilders.get(localName);
if (theBldr != null) {
//pop the stack
bldrStack.pop();
assert (bldrStack.size() >= 1);
}
}
@Override
public void characters(char[] chars, int offset, int length) throws SAXException {
bldrStack.peek().append(chars, offset, length);
}
/**
* Can be used to transform input values based on their {@link org.apache.solr.schema.SchemaField}
* <p/>
* This implementation only formats dates using the {@link org.apache.solr.common.util.DateUtil}.
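* <p/>
* For example, with the default formats an input of "2008-07-04" for a date field is
* re-rendered in Solr's internal form as "2008-07-04T00:00:00.000Z".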
*
* @param val The value to transform
* @param schFld The {@link org.apache.solr.schema.SchemaField}
* @return The potentially new value.
*/
protected String transformValue(String val, SchemaField schFld) {
String result = val;
if (schFld.getType() instanceof DateField) {
//try to transform the date
try {
Date date = DateUtil.parseDate(val, dateFormats);
DateFormat df = DateUtil.getThreadLocalDateFormat();
result = df.format(date);
} catch (Exception e) {
//TODO: error or log?
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Invalid value: " + val + " for field: " + schFld, e);
}
}
return result;
}
/**
* Get the value of any boost factor for the mapped name.
*
* @param name The name of the field to see if there is a boost specified
* @return The boost value
*/
protected float getBoost(String name) {
return params.getFloat(BOOST_PREFIX + name, 1.0f);
}
/**
* Get the name mapping
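* For example, a request parameter <code>ext.map.Author=extractedAuthor</code> (as used in
* the tests) maps the name "Author" to "extractedAuthor".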
*
* @param name The name to check to see if there is a mapping
* @return The new name, if there is one, else <code>name</code>
*/
protected String findMappedName(String name) {
return params.get(ExtractingParams.MAP_PREFIX + name, name);
}
/**
* Get the name mapping for the metadata field. Prepends metadataPrefix onto the returned result.
*
* @param name The name to check to see if there is a mapping
* @return The new name, else <code>name</code>
*/
protected String findMappedMetadataName(String name) {
return metadataPrefix + params.get(ExtractingParams.MAP_PREFIX + name, name);
}
protected synchronized long getNextId(){
return identifier++;
}
}

View File

@@ -0,0 +1,25 @@
package org.apache.solr.handler;
import org.apache.tika.metadata.Metadata;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.schema.IndexSchema;
import java.util.Collection;
/**
* Factory for creating {@link SolrContentHandler} instances, parameterized by the collection
* of date formats the handlers should recognize.
**/
public class SolrContentHandlerFactory {
protected Collection<String> dateFormats;
public SolrContentHandlerFactory(Collection<String> dateFormats) {
this.dateFormats = dateFormats;
}
public SolrContentHandler createSolrContentHandler(Metadata metadata, SolrParams params, IndexSchema schema) {
return new SolrContentHandler(metadata, params, schema,
dateFormats);
}
}

View File

@@ -0,0 +1,140 @@
package org.apache.solr.handler;
import org.apache.solr.util.AbstractSolrTestCase;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.common.util.ContentStream;
import org.apache.solr.common.util.ContentStreamBase;
import org.apache.solr.common.util.NamedList;
import java.util.List;
import java.util.ArrayList;
import java.io.File;
/**
* Tests for the Solr Cell {@link ExtractingRequestHandler}, covering Tika-based extraction
* from PDF, HTML, XML, and plain text streams.
**/
public class ExtractingRequestHandlerTest extends AbstractSolrTestCase {
@Override public String getSchemaFile() { return "schema.xml"; }
@Override public String getSolrConfigFile() { return "solrconfig.xml"; }
public void testExtraction() throws Exception {
ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
assertTrue("handler is null and it shouldn't be", handler != null);
loadLocal("solr-word.pdf", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
"ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
"ext.map.Author", "extractedAuthor",
"ext.def.fl", "extractedContent",
"ext.map.Last-Modified", "extractedDate"
);
assertQ(req("title:solr-word"),"//*[@numFound='0']");
assertU(commit());
assertQ(req("title:solr-word"),"//*[@numFound='1']");
loadLocal("simple.html", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
"ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
"ext.map.Author", "extractedAuthor",
"ext.map.language", "extractedLanguage",
"ext.def.fl", "extractedContent",
"ext.map.Last-Modified", "extractedDate"
);
assertQ(req("title:Welcome"),"//*[@numFound='0']");
assertU(commit());
assertQ(req("title:Welcome"),"//*[@numFound='1']");
loadLocal("version_control.xml", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
"ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
"ext.map.Author", "extractedAuthor",
"ext.def.fl", "extractedContent",
"ext.map.Last-Modified", "extractedDate"
);
assertQ(req("stream_name:version_control.xml"),"//*[@numFound='0']");
assertU(commit());
assertQ(req("stream_name:version_control.xml"),"//*[@numFound='1']");
}
public void testPlainTextSpecifyingMimeType() throws Exception {
ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
assertTrue("handler is null and it shouldn't be", handler != null);
// Load plain text specifying MIME type:
loadLocal("version_control.txt", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
"ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
"ext.map.Author", "extractedAuthor",
"ext.map.language", "extractedLanguage",
"ext.def.fl", "extractedContent",
ExtractingParams.STREAM_TYPE, "text/plain"
);
assertQ(req("extractedContent:Apache"),"//*[@numFound='0']");
assertU(commit());
assertQ(req("extractedContent:Apache"),"//*[@numFound='1']");
}
public void testPlainTextSpecifyingResourceName() throws Exception {
ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
assertTrue("handler is null and it shouldn't be", handler != null);
// Load plain text specifying filename
loadLocal("version_control.txt", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
"ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
"ext.map.Author", "extractedAuthor",
"ext.map.language", "extractedLanguage",
"ext.def.fl", "extractedContent",
ExtractingParams.RESOURCE_NAME, "version_control.txt"
);
assertQ(req("extractedContent:Apache"),"//*[@numFound='0']");
assertU(commit());
assertQ(req("extractedContent:Apache"),"//*[@numFound='1']");
}
// Note: If you load a plain text file specifying neither MIME type nor filename, extraction will silently fail. This is because
// Tika's automatic MIME type detection will fail, and it will fall back to a default parser that returns an empty string.
public void testExtractOnly() throws Exception {
ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
assertTrue("handler is null and it shouldn't be", handler != null);
SolrQueryResponse rsp = loadLocal("solr-word.pdf", ExtractingParams.EXTRACT_ONLY, "true");
assertTrue("rsp is null and it shouldn't be", rsp != null);
NamedList list = rsp.getValues();
String extraction = (String) list.get("solr-word.pdf");
assertTrue("extraction is null and it shouldn't be", extraction != null);
assertTrue(extraction + " does not contain " + "solr-word", extraction.indexOf("solr-word") != -1);
}
public void testXPath() throws Exception {
ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
assertTrue("handler is null and it shouldn't be", handler != null);
SolrQueryResponse rsp = loadLocal("example.html",
ExtractingParams.XPATH_EXPRESSION, "/xhtml:html/xhtml:body/xhtml:a/descendant:node()",
ExtractingParams.EXTRACT_ONLY, "true"
);
assertTrue("rsp is null and it shouldn't be", rsp != null);
NamedList list = rsp.getValues();
String val = (String) list.get("example.html");
val = val.trim();
assertTrue(val + " is not equal to " + "linkNews", val.equals("linkNews"));//there are two <a> tags, and they get collapsed
}
SolrQueryResponse loadLocal(String filename, String... args) throws Exception {
LocalSolrQueryRequest req = (LocalSolrQueryRequest)req(args);
// TODO: stop using locally defined streams once stream.file and
// stream.body work everywhere
List<ContentStream> cs = new ArrayList<ContentStream>();
cs.add(new ContentStreamBase.FileStream(new File(filename)));
req.setContentStreams(cs);
return h.queryAndResponse("/update/extract", req);
}
}

View File

@@ -0,0 +1,49 @@
<html>
<head>
<title>Welcome to Solr</title>
</head>
<body>
<p>
Here is some text
</p>
<div>Here is some text in a div</div>
<div>This has a <a href="http://www.apache.org">link</a>.</div>
<a href="#news">News</a>
<ul class="minitoc">
<li>
<a href="#03+October+2008+-+Solr+Logo+Contest">03 October 2008 - Solr Logo Contest</a>
</li>
<li>
<a href="#15+September+2008+-+Solr+1.3.0+Available">15 September 2008 - Solr 1.3.0 Available</a>
</li>
<li>
<a href="#28+August+2008+-+Lucene%2FSolr+at+ApacheCon+New+Orleans">28 August 2008 - Lucene/Solr at ApacheCon New Orleans</a>
</li>
<li>
<a href="#03+September+2007+-+Lucene+at+ApacheCon+Atlanta">03 September 2007 - Lucene at ApacheCon Atlanta</a>
</li>
<li>
<a href="#06+June+2007%3A+Release+1.2+available">06 June 2007: Release 1.2 available</a>
</li>
<li>
<a href="#17+January+2007%3A+Solr+graduates+from+Incubator">17 January 2007: Solr graduates from Incubator</a>
</li>
<li>
<a href="#22+December+2006%3A+Release+1.1.0+available">22 December 2006: Release 1.1.0 available</a>
</li>
<li>
<a href="#15+August+2006%3A+Solr+at+ApacheCon+US">15 August 2006: Solr at ApacheCon US</a>
</li>
<li>
<a href="#21+April+2006%3A+Solr+at+ApacheCon">21 April 2006: Solr at ApacheCon</a>
</li>
<li>
<a href="#21+February+2006%3A+nightly+builds">21 February 2006: nightly builds</a>
</li>
<li>
<a href="#17+January+2006%3A+Solr+Joins+Apache+Incubator">17 January 2006: Solr Joins Apache Incubator</a>
</li>
</ul>
</body>
</html>

View File

@@ -0,0 +1,12 @@
<html>
<head>
<title>Welcome to Solr</title>
</head>
<body>
<p>
Here is some text
</p>
<div>Here is some text in a div</div>
<div>This has a <a href="http://www.apache.org">link</a>.</div>
</body>
</html>

Binary file not shown.

View File

@@ -0,0 +1,20 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#use a protected word file to avoid stemming two
#unrelated words to the same base word.
#to test, we will use words that would normally obviously be stemmed.
cats
ridding

View File

@@ -0,0 +1,467 @@
<?xml version="1.0" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- The Solr schema file. This file should be named "schema.xml" and
should be located where the classloader for the Solr webapp can find it.
This schema is used for testing, and as such has everything and the
kitchen sink thrown in. See example/solr/conf/schema.xml for a
more concise example.
$Id: schema.xml 382610 2006-03-03 01:43:03Z yonik $
$Source: /cvs/main/searching/solr-configs/test/WEB-INF/classes/schema.xml,v $
$Name: $
-->
<schema name="test" version="1.0">
<types>
<!-- field type definitions... note that the "name" attribute is
just a label to be used by field definitions. The "class"
attribute and any other attributes determine the real type and
behavior of the fieldtype.
-->
<!-- numeric field types that store and index the text
value verbatim (and hence don't sort correctly or support range queries.)
These are provided more for backward compatibility, allowing one
to create a schema that matches an existing lucene index.
-->
<fieldType name="integer" class="solr.IntField"/>
<fieldType name="long" class="solr.LongField"/>
<fieldtype name="float" class="solr.FloatField"/>
<fieldType name="double" class="solr.DoubleField"/>
<!-- numeric field types that manipulate the value into
a string value that isn't human readable in its internal form,
but sorts correctly and supports range queries.
If sortMissingLast="true" then a sort on this field will cause documents
without the field to come after documents with the field,
regardless of the requested sort order.
If sortMissingFirst="true" then a sort on this field will cause documents
without the field to come before documents with the field,
regardless of the requested sort order.
If sortMissingLast="false" and sortMissingFirst="false" (the default),
then default lucene sorting will be used which places docs without the field
first in an ascending sort and last in a descending sort.
-->
<fieldtype name="sint" class="solr.SortableIntField" sortMissingLast="true"/>
<fieldtype name="slong" class="solr.SortableLongField" sortMissingLast="true"/>
<fieldtype name="sfloat" class="solr.SortableFloatField" sortMissingLast="true"/>
<fieldtype name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true"/>
<!-- bcd versions of sortable numeric type may provide smaller
storage space and support very large numbers.
-->
<fieldtype name="bcdint" class="solr.BCDIntField" sortMissingLast="true"/>
<fieldtype name="bcdlong" class="solr.BCDLongField" sortMissingLast="true"/>
<fieldtype name="bcdstr" class="solr.BCDStrField" sortMissingLast="true"/>
<!-- Field type demonstrating an Analyzer failure -->
<fieldtype name="failtype1" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
<!-- Demonstrating ignoreCaseChange -->
<fieldtype name="wdf_nocase" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
<fieldtype name="wdf_preserve" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
<!-- HighlitText optimizes storage for (long) columns which will be highlit -->
<fieldtype name="highlittext" class="solr.TextField" compressThreshold="345" />
<fieldtype name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldtype name="string" class="solr.StrField" sortMissingLast="true"/>
<!-- format for date is 1995-12-31T23:59:59.999Z and only the fractional
seconds part (.999) is optional.
-->
<fieldtype name="date" class="solr.DateField" sortMissingLast="true"/>
<!-- solr.TextField allows the specification of custom
text analyzers specified as a tokenizer and a list
of token filters.
-->
<fieldtype name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<!-- lucene PorterStemFilterFactory deprecated
<filter class="solr.PorterStemFilterFactory"/>
-->
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldtype>
<fieldtype name="nametext" class="solr.TextField">
<analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
</fieldtype>
<fieldtype name="teststop" class="solr.TextField">
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
</analyzer>
</fieldtype>
<!-- fieldtypes in this section isolate tokenizers and tokenfilters for testing -->
<fieldtype name="lowertok" class="solr.TextField">
<analyzer><tokenizer class="solr.LowerCaseTokenizerFactory"/></analyzer>
</fieldtype>
<fieldtype name="keywordtok" class="solr.TextField">
<analyzer><tokenizer class="solr.KeywordTokenizerFactory"/></analyzer>
</fieldtype>
<fieldtype name="standardtok" class="solr.TextField">
<analyzer><tokenizer class="solr.StandardTokenizerFactory"/></analyzer>
</fieldtype>
<fieldtype name="lettertok" class="solr.TextField">
<analyzer><tokenizer class="solr.LetterTokenizerFactory"/></analyzer>
</fieldtype>
<fieldtype name="whitetok" class="solr.TextField">
<analyzer><tokenizer class="solr.WhitespaceTokenizerFactory"/></analyzer>
</fieldtype>
<fieldtype name="HTMLstandardtok" class="solr.TextField">
<analyzer><tokenizer class="solr.HTMLStripStandardTokenizerFactory"/></analyzer>
</fieldtype>
<fieldtype name="HTMLwhitetok" class="solr.TextField">
<analyzer><tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/></analyzer>
</fieldtype>
<fieldtype name="standardtokfilt" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
</fieldtype>
<fieldtype name="standardfilt" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
</fieldtype>
<fieldtype name="lowerfilt" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
<fieldtype name="patternreplacefilt" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory"
pattern="([^a-zA-Z])" replacement="_" replace="all"
/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
</fieldtype>
<fieldtype name="porterfilt" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldtype>
<!-- fieldtype name="snowballfilt" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory"/>
</analyzer>
</fieldtype -->
<fieldtype name="engporterfilt" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldtype>
<fieldtype name="custengporterfilt" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
</analyzer>
</fieldtype>
<fieldtype name="stopfilt" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"/>
</analyzer>
</fieldtype>
<fieldtype name="custstopfilt" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
</analyzer>
</fieldtype>
<fieldtype name="lengthfilt" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="2" max="5"/>
</analyzer>
</fieldtype>
<fieldtype name="subword" class="solr.TextField" multiValued="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldtype>
<!-- more flexible in matching skus, but more chance of a false match -->
<fieldtype name="skutype1" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
<!-- less flexible in matching skus, but less chance of a false match -->
<fieldtype name="skutype2" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
<fieldtype name="syn" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter name="syn" class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
</analyzer>
</fieldtype>
<!-- Demonstrates How RemoveDuplicatesTokenFilter makes stemmed
synonyms "better"
-->
<fieldtype name="dedup" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" expand="true" />
<filter class="solr.EnglishPorterFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldtype>
<fieldtype name="unstored" class="solr.StrField" indexed="true" stored="false"/>
<fieldtype name="textgap" class="solr.TextField" multiValued="true" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
</types>
<fields>
<field name="id" type="integer" indexed="true" stored="true" multiValued="false" required="false"/>
<field name="name" type="nametext" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="false"/>
<field name="subject" type="text" indexed="true" stored="true"/>
<field name="title" type="nametext" indexed="true" stored="true"/>
<field name="weight" type="float" indexed="true" stored="true"/>
<field name="bday" type="date" indexed="true" stored="true"/>
<field name="title_stemmed" type="text" indexed="true" stored="false"/>
<field name="title_lettertok" type="lettertok" indexed="true" stored="false"/>
<field name="syn" type="syn" indexed="true" stored="true"/>
<!-- to test property inheritance and overriding -->
<field name="shouldbeunstored" type="unstored" />
<field name="shouldbestored" type="unstored" stored="true"/>
<field name="shouldbeunindexed" type="unstored" indexed="false" stored="true"/>
<!-- test different combinations of indexed and stored -->
<field name="bind" type="boolean" indexed="true" stored="false"/>
<field name="bsto" type="boolean" indexed="false" stored="true"/>
<field name="bindsto" type="boolean" indexed="true" stored="true"/>
<field name="isto" type="integer" indexed="false" stored="true"/>
<field name="iind" type="integer" indexed="true" stored="false"/>
<field name="ssto" type="string" indexed="false" stored="true"/>
<field name="sind" type="string" indexed="true" stored="false"/>
<field name="sindsto" type="string" indexed="true" stored="true"/>
<!-- test combinations of term vector settings -->
<field name="test_basictv" type="text" termVectors="true"/>
<field name="test_notv" type="text" termVectors="false"/>
<field name="test_postv" type="text" termVectors="true" termPositions="true"/>
<field name="test_offtv" type="text" termVectors="true" termOffsets="true"/>
<field name="test_posofftv" type="text" termVectors="true"
termPositions="true" termOffsets="true"/>
<!-- test highlit field settings -->
<field name="test_hlt" type="highlittext" indexed="true" compressed="true"/>
<field name="test_hlt_off" type="highlittext" indexed="true" compressed="false"/>
<!-- fields to test individual tokenizers and tokenfilters -->
<field name="teststop" type="teststop" indexed="true" stored="true"/>
<field name="lowertok" type="lowertok" indexed="true" stored="true"/>
<field name="keywordtok" type="keywordtok" indexed="true" stored="true"/>
<field name="standardtok" type="standardtok" indexed="true" stored="true"/>
<field name="HTMLstandardtok" type="HTMLstandardtok" indexed="true" stored="true"/>
<field name="lettertok" type="lettertok" indexed="true" stored="true"/>
<field name="whitetok" type="whitetok" indexed="true" stored="true"/>
<field name="HTMLwhitetok" type="HTMLwhitetok" indexed="true" stored="true"/>
<field name="standardtokfilt" type="standardtokfilt" indexed="true" stored="true"/>
<field name="standardfilt" type="standardfilt" indexed="true" stored="true"/>
<field name="lowerfilt" type="lowerfilt" indexed="true" stored="true"/>
<field name="patternreplacefilt" type="patternreplacefilt" indexed="true" stored="true"/>
<field name="porterfilt" type="porterfilt" indexed="true" stored="true"/>
<field name="engporterfilt" type="engporterfilt" indexed="true" stored="true"/>
<field name="custengporterfilt" type="custengporterfilt" indexed="true" stored="true"/>
<field name="stopfilt" type="stopfilt" indexed="true" stored="true"/>
<field name="custstopfilt" type="custstopfilt" indexed="true" stored="true"/>
<field name="lengthfilt" type="lengthfilt" indexed="true" stored="true"/>
<field name="dedup" type="dedup" indexed="true" stored="true"/>
<field name="wdf_nocase" type="wdf_nocase" indexed="true" stored="true"/>
<field name="wdf_preserve" type="wdf_preserve" indexed="true" stored="true"/>
<field name="numberpartfail" type="failtype1" indexed="true" stored="true"/>
<field name="nullfirst" type="string" indexed="true" stored="true" sortMissingFirst="true"/>
<field name="subword" type="subword" indexed="true" stored="true"/>
<field name="sku1" type="skutype1" indexed="true" stored="true"/>
<field name="sku2" type="skutype2" indexed="true" stored="true"/>
<field name="textgap" type="textgap" indexed="true" stored="true"/>
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<field name="multiDefault" type="string" indexed="true" stored="true" default="muLti-Default" multiValued="true"/>
<field name="intDefault" type="sint" indexed="true" stored="true" default="42" multiValued="false"/>
<field name="extractedDate" type="date" indexed="true" stored="true" multiValued="true"/>
<field name="extractedContent" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="extractedProducer" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="extractedCreator" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="extractedKeywords" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="extractedAuthor" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="extractedLanguage" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="resourceName" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- Dynamic field definitions. If a field name is not found, dynamicFields
will be used if the name matches any of the patterns.
RESTRICTION: the glob-like pattern in the name attribute must have
a "*" only at the start or the end.
EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
Longer patterns will be matched first. if equal size patterns
both match, the first appearing in the schema will be used.
-->
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_s1" type="string" indexed="true" stored="true" multiValued="false"/>
<dynamicField name="*_l" type="slong" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="sfloat" indexed="true" stored="true"/>
<dynamicField name="*_d" type="sdouble" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
<dynamicField name="*_bcd" type="bcdstr" indexed="true" stored="true"/>
<dynamicField name="*_sI" type="string" indexed="true" stored="false"/>
<dynamicField name="*_sS" type="string" indexed="false" stored="true"/>
<dynamicField name="t_*" type="text" indexed="true" stored="true"/>
<dynamicField name="tv_*" type="text" indexed="true" stored="true"
termVectors="true" termPositions="true" termOffsets="true"/>
<dynamicField name="stream_*" type="text" indexed="true" stored="true"/>
<dynamicField name="Content*" type="text" indexed="true" stored="true"/>
<!-- special fields for dynamic copyField test -->
<dynamicField name="dynamic_*" type="string" indexed="true" stored="true"/>
<dynamicField name="*_dynamic" type="string" indexed="true" stored="true"/>
<!-- for testing to ensure that longer patterns are matched first -->
<dynamicField name="*aa" type="string" indexed="true" stored="true"/>
<dynamicField name="*aaa" type="integer" indexed="false" stored="true"/>
<!-- ignored because not stored or indexed -->
<dynamicField name="*_ignored" type="text" indexed="false" stored="false"/>
</fields>
<defaultSearchField>text</defaultSearchField>
<uniqueKey>id</uniqueKey>
<!-- copyField commands copy one field to another at the time a document
is added to the index. It's used either to index the same field different
ways, or to add multiple fields to the same field for easier/faster searching.
-->
<copyField source="title" dest="title_stemmed"/>
<copyField source="title" dest="title_lettertok"/>
<copyField source="title" dest="text"/>
<copyField source="subject" dest="text"/>
<copyField source="*_t" dest="text"/>
<!-- dynamic destination -->
<copyField source="*_dynamic" dest="dynamic_*"/>
</schema>

View File

@@ -0,0 +1,359 @@
<?xml version="1.0" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- $Id: solrconfig.xml 382610 2006-03-03 01:43:03Z yonik $
$Source$
$Name$
-->
<config>
<jmx />
<!-- Used to specify an alternate directory to hold all index data.
It defaults to "index" if not present, and should probably
not be changed if replication is in use. -->
<dataDir>${solr.data.dir:./solr/data}</dataDir>
<indexDefaults>
<!-- Values here affect all index writers and act as a default
unless overridden. -->
<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<!--<maxBufferedDocs>1000</maxBufferedDocs>-->
<!-- Tell Lucene when to flush documents to disk.
Giving Lucene more memory for indexing means faster indexing at the cost of more RAM.
If both ramBufferSizeMB and maxBufferedDocs are set, then Lucene will flush based on whichever limit is hit first.
-->
<ramBufferSizeMB>32</ramBufferSizeMB>
<maxMergeDocs>2147483647</maxMergeDocs>
<maxFieldLength>10000</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
<!--
Expert: Turn on Lucene's auto commit capability.
NOTE: Despite the name, this value does not have any relation to Solr's autoCommit functionality
-->
<luceneAutoCommit>false</luceneAutoCommit>
<!--
Expert:
The Merge Policy in Lucene controls how merging is handled by Lucene. The default in 2.3 is LogByteSizeMergePolicy; previous
versions used LogDocMergePolicy.
LogByteSizeMergePolicy chooses segments to merge based on their size. The Lucene 2.2 default, LogDocMergePolicy, chose when
to merge based on the number of documents.
Other implementations of MergePolicy must have a no-argument constructor.
-->
<mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
<!--
Expert:
The Merge Scheduler in Lucene controls how merges are performed. The ConcurrentMergeScheduler (Lucene 2.3 default)
can perform merges in the background using separate threads. The SerialMergeScheduler (Lucene 2.2 default) does not.
-->
<mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>
<!-- these are global... can't currently override per index -->
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
<lockType>single</lockType>
</indexDefaults>
<mainIndex>
<!-- lucene options specific to the main on-disk lucene index -->
<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>32</ramBufferSizeMB>
<maxMergeDocs>2147483647</maxMergeDocs>
<maxFieldLength>10000</maxFieldLength>
<unlockOnStartup>true</unlockOnStartup>
</mainIndex>
<updateHandler class="solr.DirectUpdateHandler2">
<!-- autocommit pending docs if certain criteria are met
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>3600000</maxTime>
</autoCommit>
-->
<!-- represents a lower bound on the frequency that commits may
occur (in seconds). NOTE: not yet implemented
<commitIntervalLowerBound>0</commitIntervalLowerBound>
-->
<!-- The RunExecutableListener executes an external command.
exe - the name of the executable to run
dir - dir to use as the current working directory. default="."
wait - the calling thread waits until the executable returns. default="true"
args - the arguments to pass to the program. default=nothing
env - environment variables to set. default=nothing
-->
<!-- A postCommit event is fired after every commit
<listener event="postCommit" class="solr.RunExecutableListener">
<str name="exe">/var/opt/resin3/__PORT__/scripts/solr/snapshooter</str>
<str name="dir">/var/opt/resin3/__PORT__</str>
<bool name="wait">true</bool>
<arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
<arr name="env"> <str>MYVAR=val1</str> </arr>
</listener>
-->
</updateHandler>
<query>
<!-- Maximum number of clauses in a boolean query... can affect
range or wildcard queries that expand to big boolean
queries. An exception is thrown if exceeded.
-->
<maxBooleanClauses>1024</maxBooleanClauses>
<!-- Cache specification for Filters or DocSets - unordered set of *all* documents
that match a particular query.
-->
<filterCache
class="solr.search.LRUCache"
size="512"
initialSize="512"
autowarmCount="256"/>
<queryResultCache
class="solr.search.LRUCache"
size="512"
initialSize="512"
autowarmCount="1024"/>
<documentCache
class="solr.search.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>
<!-- If true, stored fields that are not requested will be loaded lazily.
-->
<enableLazyFieldLoading>true</enableLazyFieldLoading>
<!--
<cache name="myUserCache"
class="solr.search.LRUCache"
size="4096"
initialSize="1024"
autowarmCount="1024"
regenerator="MyRegenerator"
/>
-->
<useFilterForSortedQuery>true</useFilterForSortedQuery>
<queryResultWindowSize>10</queryResultWindowSize>
<!-- set maxSize artificially low to exercise both types of sets -->
<HashDocSet maxSize="3" loadFactor="0.75"/>
<!-- boolToFilterOptimizer converts boolean clauses with zero boost
into cached filters if the number of docs selected by the clause exceeds
the threshold (represented as a fraction of the total index)
-->
<boolTofilterOptimizer enabled="false" cacheSize="32" threshold=".05"/>
<!-- a newSearcher event is fired whenever a new searcher is being prepared
and there is a current searcher handling requests (aka registered). -->
<!-- QuerySenderListener takes an array of NamedList and executes a
local query request for each NamedList in sequence. -->
<!--
<listener event="newSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst> <str name="q">solr</str> <str name="start">0</str> <str name="rows">10</str> </lst>
<lst> <str name="q">rocks</str> <str name="start">0</str> <str name="rows">10</str> </lst>
</arr>
</listener>
-->
<!-- a firstSearcher event is fired whenever a new searcher is being
prepared but there is no current registered searcher to handle
requests or to gain prewarming data from. -->
<!--
<listener event="firstSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst> <str name="q">fast_warm</str> <str name="start">0</str> <str name="rows">10</str> </lst>
</arr>
</listener>
-->
</query>
<!-- An alternate set representation that uses an integer hash to store filters (sets of docids).
If the set cardinality is <= maxSize elements, then HashDocSet will be used instead of the
bitset-based HashBitset. -->
<!-- requestHandler plugins... incoming queries will be dispatched to the
correct handler based on the qt (query type) param matching the
name of registered handlers.
The "standard" request handler is the default and will be used if qt
is not specified in the request.
-->
<requestHandler name="standard" class="solr.StandardRequestHandler">
<bool name="httpCaching">true</bool>
</requestHandler>
<requestHandler name="dismaxOldStyleDefaults"
class="solr.DisMaxRequestHandler" >
<!-- for historic reasons, DisMaxRequestHandler will use all of
its init params as "defaults" if there is no "defaults" list
specified
-->
<float name="tie">0.01</float>
<str name="qf">
text^0.5 features_t^1.0 subject^1.4 title_stemmed^2.0
</str>
<str name="pf">
text^0.2 features_t^1.1 subject^1.4 title_stemmed^2.0 title^1.5
</str>
<str name="bf">
ord(weight)^0.5 recip(rord(iind),1,1000,1000)^0.3
</str>
<str name="mm">
3&lt;-1 5&lt;-2 6&lt;90%
</str>
<int name="ps">100</int>
</requestHandler>
<requestHandler name="dismax" class="solr.DisMaxRequestHandler" >
<lst name="defaults">
<str name="q.alt">*:*</str>
<float name="tie">0.01</float>
<str name="qf">
text^0.5 features_t^1.0 subject^1.4 title_stemmed^2.0
</str>
<str name="pf">
text^0.2 features_t^1.1 subject^1.4 title_stemmed^2.0 title^1.5
</str>
<str name="bf">
ord(weight)^0.5 recip(rord(iind),1,1000,1000)^0.3
</str>
<str name="mm">
3&lt;-1 5&lt;-2 6&lt;90%
</str>
<int name="ps">100</int>
</lst>
</requestHandler>
<requestHandler name="old" class="solr.tst.OldRequestHandler" >
<int name="myparam">1000</int>
<float name="ratio">1.4142135</float>
<arr name="myarr"><int>1</int><int>2</int></arr>
<str>foo</str>
</requestHandler>
<requestHandler name="oldagain" class="solr.tst.OldRequestHandler" >
<lst name="lst1"> <str name="op">sqrt</str> <int name="val">2</int> </lst>
<lst name="lst2"> <str name="op">log</str> <float name="val">10</float> </lst>
</requestHandler>
<requestHandler name="test" class="solr.tst.TestRequestHandler" />
<!-- test query parameter defaults -->
<requestHandler name="defaults" class="solr.StandardRequestHandler">
<lst name="defaults">
<int name="rows">4</int>
<bool name="hl">true</bool>
<str name="hl.fl">text,name,subject,title,whitetok</str>
</lst>
</requestHandler>
<!-- test query parameter defaults -->
<requestHandler name="lazy" class="solr.StandardRequestHandler" startup="lazy">
<lst name="defaults">
<int name="rows">4</int>
<bool name="hl">true</bool>
<str name="hl.fl">text,name,subject,title,whitetok</str>
</lst>
</requestHandler>
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
<requestHandler name="/update/csv" class="solr.CSVRequestHandler" startup="lazy">
<bool name="httpCaching">false</bool>
</requestHandler>
<requestHandler name="/update/extract" class="org.apache.solr.handler.ExtractingRequestHandler"/>
<highlighting>
<!-- Configure the standard fragmenter -->
<fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
<lst name="defaults">
<int name="hl.fragsize">100</int>
</lst>
</fragmenter>
<fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
<lst name="defaults">
<int name="hl.fragsize">70</int>
</lst>
</fragmenter>
<!-- Configure the standard formatter -->
<formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
<str name="hl.simple.post"><![CDATA[</em>]]></str>
</lst>
</formatter>
</highlighting>
<!-- enable streaming for testing... -->
<requestDispatcher handleSelect="true" >
<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />
<httpCaching lastModifiedFrom="openTime" etagSeed="Solr" never304="false">
<cacheControl>max-age=30, public</cacheControl>
</httpCaching>
</requestDispatcher>
<admin>
<defaultQuery>solr</defaultQuery>
<gettableFiles>solrconfig.xml schema.xml admin-extra.html</gettableFiles>
</admin>
<!-- test getting system property -->
<propTest attr1="${solr.test.sys.prop1}-$${literal}"
attr2="${non.existent.sys.prop:default-from-config}">prefix-${solr.test.sys.prop2}-suffix</propTest>
</config>

View File

@@ -0,0 +1,16 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
stopworda
stopwordb

View File

@@ -0,0 +1,22 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
a => aa
b => b1 b2
c => c1,c2
a\=>a => b\=>b
a\,a => b\,b
foo,bar,baz
Television,TV,Televisions

View File

@@ -0,0 +1,18 @@
Solr Version Control System
Overview
The Solr source code resides in the Apache Subversion (SVN) repository.
The command-line SVN client can be obtained here or as an optional package
for cygwin.
The TortoiseSVN GUI client for Windows can be obtained here. There
are also SVN plugins available for older versions of Eclipse and
IntelliJ IDEA that don't have subversion support already included.
-------------------------------
Note: This document is an excerpt from a document Licensed to the
Apache Software Foundation (ASF) under one or more contributor
license agreements. See the XML version (version_control.xml) for
more details.

View File

@@ -0,0 +1,42 @@
<?xml version="1.0"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<document>
<header>
<title>Solr Version Control System</title>
</header>
<body>
<section>
<title>Overview</title>
<p>
The Solr source code resides in the Apache <a href="http://subversion.tigris.org/">Subversion (SVN)</a> repository.
The command-line SVN client can be obtained <a href="http://subversion.tigris.org/project_packages.html">here</a> or as an optional package for <a href="http://www.cygwin.com/">cygwin</a>.
The TortoiseSVN GUI client for Windows can be obtained <a href="http://tortoisesvn.tigris.org/">here</a>. There
are also SVN plugins available for older versions of <a href="http://subclipse.tigris.org/">Eclipse</a> and
<a href="http://svnup.tigris.org/">IntelliJ IDEA</a> that don't have subversion support already included.
</p>
</section>
<p>Here is some more text. It contains <a href="http://lucene.apache.org">a link</a>. </p>
<p>Text Here</p>
</body>
</document>

View File

@@ -105,6 +105,8 @@
 </target>
 -->
+<target name="example" depends="build"/>
 <!-- do nothing for now, required for generate maven artifacts -->
 <target name="build"/>

View File

@@ -121,4 +121,6 @@
 <!-- TODO: Autolaunch Solr -->
 </target>
+<target name="example" depends="build"/>
 </project>

View File

@@ -0,0 +1,200 @@
package org.apache.solr.common.util;
/**
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import java.util.Collection;
import java.util.Date;
import java.util.Iterator;
import java.util.Locale;
import java.util.TimeZone;
/**
* This class has some code from HttpClient DateUtil.
*/
public class DateUtil {
//start HttpClient
/**
* Date format pattern used to parse HTTP date headers in RFC 1123 format.
*/
public static final String PATTERN_RFC1123 = "EEE, dd MMM yyyy HH:mm:ss zzz";
/**
* Date format pattern used to parse HTTP date headers in RFC 1036 format.
*/
public static final String PATTERN_RFC1036 = "EEEE, dd-MMM-yy HH:mm:ss zzz";
/**
* Date format pattern used to parse HTTP date headers in ANSI C
* <code>asctime()</code> format.
*/
public static final String PATTERN_ASCTIME = "EEE MMM d HH:mm:ss yyyy";
//These are included for back compat
private static final Collection<String> DEFAULT_HTTP_CLIENT_PATTERNS = Arrays.asList(
PATTERN_ASCTIME, PATTERN_RFC1036, PATTERN_RFC1123);
private static final Date DEFAULT_TWO_DIGIT_YEAR_START;
static {
Calendar calendar = Calendar.getInstance();
calendar.set(2000, Calendar.JANUARY, 1, 0, 0);
DEFAULT_TWO_DIGIT_YEAR_START = calendar.getTime();
}
private static final TimeZone GMT = TimeZone.getTimeZone("GMT");
//end HttpClient
//---------------------------------------------------------------------------------------
/**
* The default set of date format patterns: dates in any of these formats can be parsed and
* thus transformed into the Solr-specific internal representation.
*/
public static final Collection<String> DEFAULT_DATE_FORMATS = new ArrayList<String>();
static {
DEFAULT_DATE_FORMATS.add("yyyy-MM-dd'T'HH:mm:ss'Z'");
DEFAULT_DATE_FORMATS.add("yyyy-MM-dd'T'HH:mm:ss");
DEFAULT_DATE_FORMATS.add("yyyy-MM-dd");
DEFAULT_DATE_FORMATS.add("yyyy-MM-dd hh:mm:ss");
DEFAULT_DATE_FORMATS.add("yyyy-MM-dd HH:mm:ss");
DEFAULT_DATE_FORMATS.add("EEE MMM d hh:mm:ss z yyyy");
DEFAULT_DATE_FORMATS.addAll(DEFAULT_HTTP_CLIENT_PATTERNS);
}
/**
* Parses the given date string using {@link #DEFAULT_DATE_FORMATS}.
*
* @param d The input date to parse
* @return The parsed {@link java.util.Date}
* @throws java.text.ParseException If the input can't be parsed
*/
public static Date parseDate(String d) throws ParseException {
return parseDate(d, DEFAULT_DATE_FORMATS);
}
public static Date parseDate(String d, Collection<String> fmts) throws ParseException {
// 2007-04-26T08:05:04Z
if (d.endsWith("Z") && d.length() > 20) {
return getThreadLocalDateFormat().parse(d);
}
return parseDate(d, fmts, null);
}
/**
* Slightly modified from org.apache.commons.httpclient.util.DateUtil.parseDate
* <p/>
* Parses the date value using the given date formats.
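* <p/>
* Usage sketch (the header value is an arbitrary RFC 1123 example):
* <pre>
*   Date d = DateUtil.parseDate("Sun, 06 Nov 1994 08:49:37 GMT", DateUtil.DEFAULT_DATE_FORMATS, null);
* </pre>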
*
* @param dateValue the date value to parse
* @param dateFormats the date formats to use
* @param startDate During parsing, two digit years will be placed in the range
* <code>startDate</code> to <code>startDate + 100 years</code>. This value may
* be <code>null</code>. When <code>null</code> is given as a parameter, year
* <code>2000</code> will be used.
* @return the parsed date
* @throws ParseException if none of the dateFormats could parse the dateValue
*/
public static Date parseDate(
String dateValue,
Collection<String> dateFormats,
Date startDate
) throws ParseException {
if (dateValue == null) {
throw new IllegalArgumentException("dateValue is null");
}
if (dateFormats == null) {
dateFormats = DEFAULT_HTTP_CLIENT_PATTERNS;
}
if (startDate == null) {
startDate = DEFAULT_TWO_DIGIT_YEAR_START;
}
// trim single quotes around date if present
// see issue #5279
if (dateValue.length() > 1
&& dateValue.startsWith("'")
&& dateValue.endsWith("'")
) {
dateValue = dateValue.substring(1, dateValue.length() - 1);
}
SimpleDateFormat dateParser = null;
Iterator<String> formatIter = dateFormats.iterator();
while (formatIter.hasNext()) {
String format = formatIter.next();
if (dateParser == null) {
dateParser = new SimpleDateFormat(format, Locale.US);
dateParser.setTimeZone(GMT);
dateParser.set2DigitYearStart(startDate);
} else {
dateParser.applyPattern(format);
}
try {
return dateParser.parse(dateValue);
} catch (ParseException pe) {
// ignore this exception, we will try the next format
}
}
// we were unable to parse the date
throw new ParseException("Unable to parse the date " + dateValue, 0);
}
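// Editor's illustrative sketch (method name is an assumption): demonstrates the
// two-digit-year window documented above. With a null startDate, the default
// window starts at Jan 1, 2000, so the RFC 1036 year "00" resolves to 2000.
private static Date demoTwoDigitYear() throws ParseException {
return parseDate("Saturday, 01-Jan-00 00:00:00 GMT",
Arrays.asList(PATTERN_RFC1036), null);
}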
/**
* Returns a formatter that can be used by the current thread if needed to
* convert Date objects to the internal representation.
*
* @return The {@link java.text.DateFormat} for the current thread
*/
public static DateFormat getThreadLocalDateFormat() {
return fmtThreadLocal.get();
}
public static final TimeZone UTC = TimeZone.getTimeZone("UTC");
private static final ThreadLocalDateFormat fmtThreadLocal = new ThreadLocalDateFormat();
private static class ThreadLocalDateFormat extends ThreadLocal<DateFormat> {
DateFormat proto;
public ThreadLocalDateFormat() {
super();
//2007-04-26T08:05:04Z
SimpleDateFormat tmp = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
tmp.setTimeZone(UTC);
proto = tmp;
}
@Override
protected DateFormat initialValue() {
return (DateFormat) proto.clone();
}
}
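// Editor's illustrative sketch (method name is an assumption): SimpleDateFormat
// is not thread-safe, so each thread gets its own clone from the ThreadLocal
// above and can reuse it without synchronization.
private static String demoThreadLocalFormat(Date d) {
return getThreadLocalDateFormat().format(d); // e.g. 2007-04-26T08:05:04.123Z
}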
}

View File

@@ -31,6 +31,7 @@ import java.util.*;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 import java.nio.charset.Charset;
+import java.lang.reflect.Constructor;
 import javax.naming.Context;
 import javax.naming.InitialContext;
@@ -308,6 +309,36 @@ public class SolrResourceLoader implements ResourceLoader
     }
     return obj;
   }
+  public Object newInstance(String cName, String[] subPackages, Class<?>[] params, Object[] args) {
+    Class<?> clazz = findClass(cName, subPackages);
+    if (clazz == null) {
+      throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
+          "Cannot find class: " + cName + " in " + classLoader, false);
+    }
+    Object obj = null;
+    try {
+      Constructor<?> constructor = clazz.getConstructor(params);
+      obj = constructor.newInstance(args);
+    } catch (Exception e) {
+      throw new SolrException(SolrException.ErrorCode.SERVER_ERROR,
+          "Error instantiating class: '" + clazz.getName() + "'", e, false);
+    }
+    if (obj instanceof SolrCoreAware) {
+      assertAwareCompatibility(SolrCoreAware.class, obj);
+      waitingForCore.add((SolrCoreAware) obj);
+    }
+    if (obj instanceof ResourceLoaderAware) {
+      assertAwareCompatibility(ResourceLoaderAware.class, obj);
+      waitingForResources.add((ResourceLoaderAware) obj);
+    }
+    return obj;
+  }
   /**
    * Tell all {@link SolrCoreAware} instances about the SolrCore
@@ -436,4 +467,4 @@ public class SolrResourceLoader implements ResourceLoader
       throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, builder.toString() );
   }
 }
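A minimal sketch of how the new constructor-aware newInstance could be called
(the class name, sub-packages, and constructor arguments below are illustrative
assumptions, not taken from this commit):

  SolrResourceLoader loader = new SolrResourceLoader(instanceDir);
  // Look up "MyHandler" under the given sub-packages and invoke its
  // single-String constructor reflectively.
  Object handler = loader.newInstance(
      "MyHandler",                       // hypothetical class name
      new String[] { "handler." },       // sub-packages to search
      new Class<?>[] { String.class },   // constructor signature
      new Object[] { "/update/extract" } // constructor arguments
  );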

View File

@@ -315,11 +315,7 @@ public class TestHarness {
    * @see LocalSolrQueryRequest
    */
   public String query(String handler, SolrQueryRequest req) throws IOException, Exception {
-    SolrQueryResponse rsp = new SolrQueryResponse();
-    core.execute(core.getRequestHandler(handler), req, rsp);
-    if (rsp.getException() != null) {
-      throw rsp.getException();
-    }
+    SolrQueryResponse rsp = queryAndResponse(handler, req);

     StringWriter sw = new StringWriter(32000);
     QueryResponseWriter responseWriter = core.getQueryResponseWriter(req);
@@ -330,6 +326,15 @@ public class TestHarness {
     return sw.toString();
   }
+  public SolrQueryResponse queryAndResponse(String handler, SolrQueryRequest req) throws Exception {
+    SolrQueryResponse rsp = new SolrQueryResponse();
+    core.execute(core.getRequestHandler(handler), req, rsp);
+    if (rsp.getException() != null) {
+      throw rsp.getException();
+    }
+    return rsp;
+  }
   /**
    * A helper method which validates a String against an array of XPath test