mirror of https://github.com/apache/lucene.git

SOLR-284: Solr Cell: Add support for Tika content extraction

git-svn-id: https://svn.apache.org/repos/asf/lucene/solr/trunk@723977 13f79535-47bb-0310-9956-ffa450edef68

parent 474ab9a515
commit cedd07b500
CHANGES.txt

@@ -98,6 +98,8 @@ New Features
     can be specified.
     (Georgios Stamatis, Lars Kotthoff, Chris Harris via koji)
 
+20. SOLR-284: Added support for extracting content from binary documents like MS Word and PDF using Apache Tika. See also contrib/extraction/CHANGES.txt (Eric Pugh, Chris Harris, gsingers)
+
 Optimizations
 ----------------------
  1. SOLR-374: Use IndexReader.reopen to save resources by re-using parts of the
LICENSE.txt (202 lines changed)
@@ -261,9 +261,9 @@ such code.
 1.13. You (or Your) means an individual or a legal entity exercising rights
 under, and complying with all of the terms of, this License. For legal
 entities, You includes any entity which controls, is controlled by, or is under
-common control with You. For purposes of this definition, control means (a)áthe
+common control with You. For purposes of this definition, control means (a) the
 power, direct or indirect, to cause the direction or management of such entity,
-whether by contract or otherwise, or (b)áownership of more than fifty percent
+whether by contract or otherwise, or (b) ownership of more than fifty percent
 (50%) of the outstanding shares or beneficial ownership of such entity.
 
 2. License Grants.
@@ -278,12 +278,12 @@ with or without Modifications, and/or as part of a Larger Work; and (b) under
 Patent Claims infringed by the making, using or selling of Original Software,
 to make, have made, use, practice, sell, and offer for sale, and/or otherwise
 dispose of the Original Software (or portions thereof). (c) The licenses
-granted in Sectionsá2.1(a) and (b) are effective on the date Initial Developer
+granted in Sections 2.1(a) and (b) are effective on the date Initial Developer
 first distributes or otherwise makes the Original Software available to a third
-party under the terms of this License. (d) Notwithstanding Sectioná2.1(b)
-above, no patent license is granted: (1)áfor code that You delete from the
-Original Software, or (2)áfor infringements caused by: (i)áthe modification of
-the Original Software, or (ii)áthe combination of the Original Software with
+party under the terms of this License. (d) Notwithstanding Section 2.1(b)
+above, no patent license is granted: (1) for code that You delete from the
+Original Software, or (2) for infringements caused by: (i) the modification of
+the Original Software, or (ii) the combination of the Original Software with
 other software or devices.
 
 2.2. Contributor Grant. Conditioned upon Your compliance with Section 3.1
@@ -297,17 +297,17 @@ and/or as part of a Larger Work; and (b) under Patent Claims infringed by the
 making, using, or selling of Modifications made by that Contributor either
 alone and/or in combination with its Contributor Version (or portions of such
 combination), to make, use, sell, offer for sale, have made, and/or otherwise
-dispose of: (1)áModifications made by that Contributor (or portions thereof);
-and (2)áthe combination of Modifications made by that Contributor with its
+dispose of: (1) Modifications made by that Contributor (or portions thereof);
+and (2) the combination of Modifications made by that Contributor with its
 Contributor Version (or portions of such combination). (c) The licenses
-granted in Sectionsá2.2(a) and 2.2(b) are effective on the date Contributor
+granted in Sections 2.2(a) and 2.2(b) are effective on the date Contributor
 first distributes or otherwise makes the Modifications available to a third
-party. (d) Notwithstanding Sectioná2.2(b) above, no patent license is granted:
-(1)áfor any code that Contributor has deleted from the Contributor Version;
-(2)áfor infringements caused by: (i)áthird party modifications of Contributor
-Version, or (ii)áthe combination of Modifications made by that Contributor with
+party. (d) Notwithstanding Section 2.2(b) above, no patent license is granted:
+(1) for any code that Contributor has deleted from the Contributor Version;
+(2) for infringements caused by: (i) third party modifications of Contributor
+Version, or (ii) the combination of Modifications made by that Contributor with
 other software (except as part of the Contributor Version) or other devices; or
-(3)áunder Patent Claims infringed by Covered Software in the absence of
+(3) under Patent Claims infringed by Covered Software in the absence of
 Modifications made by that Contributor.
 
 3. Distribution Obligations.
@@ -389,9 +389,9 @@ License published by the license steward. 4.3. Modified Versions.
 
 When You are an Initial Developer and You want to create a new license for Your
 Original Software, You may create and use a modified version of this License if
-You: (a)árename the license and remove any references to the name of the
+You: (a) rename the license and remove any references to the name of the
 license steward (except to note that the license differs from this License);
-and (b)áotherwise make it clear that the license contains terms which differ
+and (b) otherwise make it clear that the license contains terms which differ
 from this License.
 
 5. DISCLAIMER OF WARRANTY.
@@ -422,14 +422,14 @@ the Participant is a Contributor or the Original Software where the Participant
 is the Initial Developer) directly or indirectly infringes any patent, then any
 and all rights granted directly or indirectly to You by such Participant, the
 Initial Developer (if the Initial Developer is not the Participant) and all
-Contributors under Sectionsá2.1 and/or 2.2 of this License shall, upon 60 days
+Contributors under Sections 2.1 and/or 2.2 of this License shall, upon 60 days
 notice from Participant terminate prospectively and automatically at the
 expiration of such 60 day notice period, unless if within such 60 day period
 You withdraw Your claim with respect to the Participant Software against such
 Participant either unilaterally or pursuant to a written agreement with
 Participant.
 
-6.3. In the event of termination under Sectionsá6.1 or 6.2 above, all end user
+6.3. In the event of termination under Sections 6.1 or 6.2 above, all end user
 licenses that have been validly granted by You or any distributor hereunder
 prior to termination (excluding licenses granted to You by any distributor)
 shall survive termination.
@@ -453,9 +453,9 @@ LIMITATION MAY NOT APPLY TO YOU.
 8. U.S. GOVERNMENT END USERS.
 
 The Covered Software is a commercial item, as that term is defined in
-48áC.F.R.á2.101 (Oct. 1995), consisting of commercial computer software (as
-that term is defined at 48 C.F.R. á252.227-7014(a)(1)) and commercial computer
-software documentation as such terms are used in 48áC.F.R.á12.212 (Sept. 1995).
+48 C.F.R. 2.101 (Oct. 1995), consisting of commercial computer software (as
+that term is defined at 48 C.F.R. 252.227-7014(a)(1)) and commercial computer
+software documentation as such terms are used in 48 C.F.R. 12.212 (Sept. 1995).
 Consistent with 48 C.F.R. 12.212 and 48 C.F.R. 227.7202-1 through 227.7202-4
 (June 1995), all U.S. Government End Users acquire Covered Software with only
 those rights set forth herein. This U.S. Government Rights clause is in lieu
@@ -736,3 +736,161 @@ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
 WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+===========================================================================
+Apache Tika Licenses - contrib/extraction
+---------------------------------------------------------------------------
+Apache Tika is licensed under the ASL 2.0. See above for the text of the license
+
+APACHE TIKA SUBCOMPONENTS
+
+Apache Tika includes a number of subcomponents with separate copyright notices
+and license terms. Your use of these subcomponents is subject to the terms and
+conditions of the following licenses.
+
+Bouncy Castle libraries (bcmail and bcprov)
+
+Copyright (c) 2000-2006 The Legion Of The Bouncy Castle
+(http://www.bouncycastle.org)
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files
+(the "Software"), to deal in the Software without restriction,
+including without limitation the rights to use, copy, modify, merge,
+publish, distribute, sublicense, and/or sell copies of the Software,
+and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included
+in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
+OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+OTHER DEALINGS IN THE SOFTWARE.
+
+PDFBox library (pdfbox)
+
+Copyright (c) 2003-2005, www.pdfbox.org
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice,
+   this list of conditions and the following disclaimer.
+2. Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+3. Neither the name of pdfbox; nor the names of its
+   contributors may be used to endorse or promote products derived from this
+   software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
+OF SUCH DAMAGE.
+
+FontBox and JempBox libraries (fontbox, jempbox)
+
+Copyright (c) 2003-2005, www.fontbox.org
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice,
+   this list of conditions and the following disclaimer.
+2. Redistributions in binary form must reproduce the above copyright notice,
+   this list of conditions and the following disclaimer in the documentation
+   and/or other materials provided with the distribution.
+3. Neither the name of fontbox; nor the names of its
+   contributors may be used to endorse or promote products derived from this
+   software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
+OF SUCH DAMAGE.
+
+ICU4J library (icu4j)
+
+Copyright (c) 1995-2005 International Business Machines Corporation
+and others
+
+All rights reserved.
+
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, and/or sell copies of the Software, and to permit persons
+to whom the Software is furnished to do so, provided that the above
+copyright notice(s) and this permission notice appear in all copies
+of the Software and that both the above copyright notice(s) and this
+permission notice appear in supporting documentation.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS.
+IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE
+BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES,
+OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
+WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
+ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
+SOFTWARE.
+
+Except as contained in this notice, the name of a copyright holder shall
+not be used in advertising or otherwise to promote the sale, use or other
+dealings in this Software without prior written authorization of the
+copyright holder.
+
+ASM library (asm)
+
+Copyright (c) 2000-2005 INRIA, France Telecom
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions
+are met:
+
+1. Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+
+3. Neither the name of the copyright holders nor the names of its
+   contributors may be used to endorse or promote products derived from
+   this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
+THE POSSIBILITY OF SUCH DAMAGE.
NOTICE.txt (21 lines changed)

@@ -113,3 +113,24 @@ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
 LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
 OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
 WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+=========================================================================
+== Apache Tika Notices ==
+=========================================================================
+
+The following notices apply to the Apache Tika libraries in contrib/extraction/lib:
+
+This product includes software developed by the following copyright owners:
+
+Copyright (c) 2000-2006 The Legion Of The Bouncy Castle
+(http://www.bouncycastle.org)
+
+Copyright (c) 2003-2005, www.pdfbox.org
+
+Copyright (c) 2003-2005, www.fontbox.org
+
+Copyright (c) 1995-2005 International Business Machines Corporation and others
+
+Copyright (c) 2000-2005 INRIA, France Telecom
build.xml

@@ -30,8 +30,7 @@
   <!-- Destination for distribution files (demo WAR, src distro, etc.) -->
   <property name="dist" value="dist" />
 
-  <!-- Example directory -->
-  <property name="example" value="example" />
-
+
   <property name="clover.db.dir" location="${dest}/tests/clover/db"/>
   <property name="clover.report.dir" location="${dest}/tests/clover/reports"/>

@@ -612,7 +611,7 @@
 
   <target name="example"
           description="Creates a runnable example configuration."
-          depends="init-forrest-entities,dist-contrib,dist-war">
+          depends="init-forrest-entities,dist-contrib,dist-war,example-contrib">
     <copy file="${dist}/${fullnamever}.war"
           tofile="${example}/webapps/${ant.project.name}.war"/>
     <jar destfile="${example}/exampledocs/post.jar"

@@ -624,7 +623,7 @@
              value="org.apache.solr.util.SimplePostTool"/>
     </manifest>
   </jar>
 
   <copy todir="${example}/solr/bin">
     <fileset dir="${src}/scripts">
       <exclude name="scripts.conf"/>
ClientUtils.java

@@ -23,17 +23,14 @@ import java.io.Writer;
 import java.net.URLEncoder;
 import java.text.DateFormat;
 import java.text.ParseException;
-import java.text.SimpleDateFormat;
 import java.util.ArrayList;
 import java.util.Collection;
 import java.util.Date;
-import java.util.Iterator;
 import java.util.TimeZone;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 
 import org.apache.commons.httpclient.util.DateParseException;
-import org.apache.commons.httpclient.util.DateUtil;
 
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrInputDocument;
 import org.apache.solr.common.SolrInputField;

@@ -41,6 +38,7 @@ import org.apache.solr.common.params.SolrParams;
 import org.apache.solr.common.util.ContentStream;
 import org.apache.solr.common.util.ContentStreamBase;
 import org.apache.solr.common.util.XML;
+import org.apache.solr.common.util.DateUtil;
 
 
 /**

@@ -61,17 +59,17 @@ public class ClientUtils
   {
     if( str == null )
       return null;
 
     ArrayList<ContentStream> streams = new ArrayList<ContentStream>( 1 );
     ContentStreamBase ccc = new ContentStreamBase.StringStream( str );
     ccc.setContentType( contentType );
     streams.add( ccc );
     return streams;
   }
 
   /**
    * @param d SolrDocument to convert
    * @return a SolrInputDocument with the same fields and values as the
    *   SolrDocument.  All boosts are 1.0f
    */
   public static SolrInputDocument toSolrInputDocument( SolrDocument d )

@@ -95,38 +93,38 @@ public class ClientUtils
     }
     return doc;
   }
 
   //------------------------------------------------------------------------
 
   public static void writeXML( SolrInputDocument doc, Writer writer ) throws IOException
   {
     writer.write("<doc boost=\""+doc.getDocumentBoost()+"\">");
 
     for( SolrInputField field : doc ) {
       float boost = field.getBoost();
       String name = field.getName();
       for( Object v : field ) {
         if (v instanceof Date) {
-          v = fmtThreadLocal.get().format( (Date)v );
+          v = DateUtil.getThreadLocalDateFormat().format( (Date)v );
         }
         if( boost != 1.0f ) {
           XML.writeXML(writer, "field", v.toString(), "name", name, "boost", boost );
         }
         else {
           XML.writeXML(writer, "field", v.toString(), "name", name );
         }
 
         // only write the boost for the first multi-valued field
         // otherwise, the used boost is the product of all the boost values
         boost = 1.0f;
       }
     }
     writer.write("</doc>");
   }
 
   public static String toXML( SolrInputDocument doc )
   {
     StringWriter str = new StringWriter();
     try {
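The comment in writeXML above ("the used boost is the product of all the boost values") explains why `boost` is reset to `1.0f` after the first value of a multi-valued field: repeating it would compound the boost once per value at index time. A minimal, self-contained sketch of the two outcomes (class and method names here are illustrative, not Solr code):

```java
// Demonstrates why ClientUtils.writeXML emits the boost only once per field:
// per-value boosts on a multi-valued field multiply together at index time.
public class BoostDemo {

    // Effective boost if the same boost were written for every value.
    static float repeatedOnEveryValue(float boost, int numValues) {
        float effective = 1.0f;
        for (int i = 0; i < numValues; i++) {
            effective *= boost; // each value's boost multiplies in
        }
        return effective;
    }

    // Effective boost when only the first value carries it (Solr's choice):
    // the remaining values contribute a neutral 1.0f each.
    static float onFirstValueOnly(float boost, int numValues) {
        return boost;
    }

    public static void main(String[] args) {
        // boost 2.0 over a 3-valued field: 8.0 if repeated, 2.0 if written once
        System.out.println(repeatedOnEveryValue(2.0f, 3));
        System.out.println(onFirstValueOnly(2.0f, 3));
    }
}
```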
@@ -135,59 +133,45 @@ public class ClientUtils
     catch( Exception ex ){}
     return str.toString();
   }
 
   //---------------------------------------------------------------------------------------
 
-  public static final Collection<String> fmts = new ArrayList<String>();
-  static {
-    fmts.add( "yyyy-MM-dd'T'HH:mm:ss'Z'" );
-    fmts.add( "yyyy-MM-dd'T'HH:mm:ss" );
-    fmts.add( "yyyy-MM-dd" );
-  }
+  /**
+   * @deprecated Use {@link org.apache.solr.common.util.DateUtil#DEFAULT_DATE_FORMATS}
+   */
+  public static final Collection<String> fmts = DateUtil.DEFAULT_DATE_FORMATS;
 
   /**
    * Returns a formatter that can be use by the current thread if needed to
    * convert Date objects to the Internal representation.
    * @throws ParseException
    * @throws DateParseException
+   *
+   * @deprecated Use {@link org.apache.solr.common.util.DateUtil#parseDate(String)}
    */
   public static Date parseDate( String d ) throws ParseException, DateParseException
   {
-    // 2007-04-26T08:05:04Z
-    if( d.endsWith( "Z" ) && d.length() > 20 ) {
-      return getThreadLocalDateFormat().parse( d );
-    }
-    return DateUtil.parseDate( d, fmts );
+    return DateUtil.parseDate(d);
   }
 
   /**
    * Returns a formatter that can be use by the current thread if needed to
    * convert Date objects to the Internal representation.
+   *
+   * @deprecated use {@link org.apache.solr.common.util.DateUtil#getThreadLocalDateFormat()}
    */
   public static DateFormat getThreadLocalDateFormat() {
-    return fmtThreadLocal.get();
+    return DateUtil.getThreadLocalDateFormat();
   }
 
-  public static TimeZone UTC = TimeZone.getTimeZone("UTC");
-  private static ThreadLocalDateFormat fmtThreadLocal = new ThreadLocalDateFormat();
-
-  private static class ThreadLocalDateFormat extends ThreadLocal<DateFormat> {
-    DateFormat proto;
-    public ThreadLocalDateFormat() {
-      super();
-      //2007-04-26T08:05:04Z
-      SimpleDateFormat tmp = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
-      tmp.setTimeZone(UTC);
-      proto = tmp;
-    }
-
-    @Override
-    protected DateFormat initialValue() {
-      return (DateFormat) proto.clone();
-    }
-  }
+  /**
+   * @deprecated Use {@link org.apache.solr.common.util.DateUtil#UTC}.
+   */
+  public static TimeZone UTC = DateUtil.UTC;
 
   /**
    * See: http://lucene.apache.org/java/docs/queryparsersyntax.html#Escaping Special Characters
    */
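The ThreadLocalDateFormat class removed above exists because SimpleDateFormat is not thread-safe: each thread lazily gets its own clone of a prototype formatter configured for Solr's internal UTC date form. The same pattern now lives in org.apache.solr.common.util.DateUtil; the standalone sketch below (demo class name is illustrative, not Solr's API) shows the idiom:

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch of the ThreadLocal<DateFormat> idiom that ClientUtils delegated to
// DateUtil in this commit. SimpleDateFormat keeps mutable parse state, so a
// single shared instance cannot be used from multiple threads; instead each
// thread builds its own formatter on first use.
public class ThreadLocalDateFormatDemo {

    static final ThreadLocal<DateFormat> FORMAT = new ThreadLocal<DateFormat>() {
        @Override
        protected DateFormat initialValue() {
            // 2007-04-26T08:05:04.000Z -- Solr's internal date representation
            SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
            f.setTimeZone(TimeZone.getTimeZone("UTC"));
            return f;
        }
    };

    public static void main(String[] args) {
        // Each calling thread gets its own formatter instance from get().
        System.out.println(FORMAT.get().format(new Date(0L)));
        // prints 1970-01-01T00:00:00.000Z
    }
}
```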
@@ -206,7 +190,7 @@ public class ClientUtils
     }
     return sb.toString();
   }
 
   public static String toQueryString( SolrParams params, boolean xml ) {
     StringBuilder sb = new StringBuilder(128);
     try {
common-build.xml

@@ -38,6 +38,9 @@
     <format property="dateversion" pattern="yyyy.MM.dd.HH.mm.ss" />
   </tstamp>
 
+  <!-- Example directory -->
+  <property name="example" value="${common.dir}/example" />
+
   <!--
     we attempt to exec svnversion to get details build information
     for jar manifests. this property can be set at runtime to an

@@ -332,6 +335,10 @@
     <contrib-crawl target="dist" failonerror="true" />
   </target>
 
+  <target name="example-contrib" description="Tell the contrib to add their stuff to examples">
+    <contrib-crawl target="example" failonerror="true" />
+  </target>
+
   <!-- Creates a Manifest file for Jars and WARs -->
   <target name="make-manifest">
     <!-- If possible, include the svnversion -->
@@ -121,6 +121,8 @@
         </sources>
       </invoke-javadoc>
     </sequential>
   </target>
 
+  <target name="example" depends="build"/>
+
 </project>
contrib/extraction/CHANGES.txt (new file)

@@ -0,0 +1,25 @@
+Apache Solr Content Extraction Library (Solr Cell)
+                   Version 1.4-dev
+Release Notes
+
+This file describes changes to the Solr Cell (contrib/extraction) module. See SOLR-284 for details.
+
+Introduction
+------------
+
+Apache Solr Extraction provides a means for extracting and indexing content contained in "rich" documents, such
+as Microsoft Word, Adobe PDF, etc. (Each name is a trademark of their respective owners) This contrib module
+uses Apache Tika to extract content and metadata from the files, which can then be indexed. For more information,
+see http://wiki.apache.org/solr/ExtractingRequestHandler
+
+Getting Started
+---------------
+You will need Solr up and running. Then, simply add the extraction JAR file, plus the Tika dependencies (in the ./lib folder)
+to your Solr Home lib directory. See http://wiki.apache.org/solr/ExtractingRequestHandler for more details on hooking it in
+and configuring.
+
+$Id:$
+================== Release 1.4-dev ==================
+
+1. SOLR-284: Added in support for extraction. (Eric Pugh, Chris Harris, gsingers)
contrib/extraction/build.xml (new file)

@@ -0,0 +1,134 @@
+<?xml version="1.0"?>
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+<project name="solr-extraction" default="build">
+
+  <property name="solr-path" value="../.." />
+  <property name="tika.version" value="0.2-SNAPSHOT"/>
+  <property name="tika.lib" value="lib/tika-${tika.version}-standalone.jar"/>
+
+  <import file="../../common-build.xml"/>
+
+  <description>
+    Solr Integration with Tika for extracting content from binary file formats such as Microsoft Word and Adobe PDF.
+  </description>
+
+  <path id="common.classpath">
+    <pathelement location="${solr-path}/build/common" />
+    <pathelement location="${solr-path}/build/core" />
+    <fileset dir="lib" includes="*.jar"/>
+    <fileset dir="${solr-path}/lib" includes="*.jar"/>
+  </path>
+
+  <path id="test.classpath">
+    <path refid="common.classpath" />
+    <pathelement path="${dest}/classes" />
+    <pathelement path="${dest}/test-classes" />
+    <pathelement path="${java.class.path}"/>
+  </path>
+
+  <target name="clean">
+    <delete failonerror="false" dir="${dest}"/>
+  </target>
+
+  <target name="init">
+    <mkdir dir="${dest}/classes"/>
+    <mkdir dir="${build.javadoc}" />
+    <ant dir="../../" inheritall="false" target="compile" />
+    <ant dir="../../" inheritall="false" target="make-manifest" />
+  </target>
+
+  <target name="compile" depends="init">
+    <solr-javac destdir="${dest}/classes"
+                classpathref="common.classpath">
+      <src path="src/main/java" />
+    </solr-javac>
+  </target>
+
+  <target name="build" depends="compile">
+    <solr-jar destfile="${dest}/${fullnamever}.jar" basedir="${dest}/classes"
+              manifest="${common.dir}/${dest}/META-INF/MANIFEST.MF">
+      <!--<zipfileset src="${tika.lib}"/>-->
+    </solr-jar>
+  </target>
+
+  <target name="compileTests" depends="compile">
+    <solr-javac destdir="${dest}/test-classes"
+                classpathref="test.classpath">
+      <src path="src/test/java" />
+    </solr-javac>
+  </target>
+
+  <target name="test" depends="compileTests">
+    <mkdir dir="${junit.output.dir}"/>
+
+    <junit printsummary="on"
+           haltonfailure="no"
+           errorProperty="tests.failed"
+           failureProperty="tests.failed"
+           dir="src/test/resources/">
+      <formatter type="brief" usefile="false" if="junit.details"/>
+      <classpath refid="test.classpath"/>
+      <formatter type="xml"/>
+      <batchtest fork="yes" todir="${junit.output.dir}" unless="testcase">
+        <fileset dir="src/test/java" includes="${junit.includes}"/>
+      </batchtest>
+      <batchtest fork="yes" todir="${junit.output.dir}" if="testcase">
+        <fileset dir="src/test/java" includes="**/${testcase}.java"/>
+      </batchtest>
+    </junit>
+
+    <fail if="tests.failed">Tests failed!</fail>
+  </target>
+
+  <target name="dist" depends="build">
+  </target>
+
+  <target name="example" depends="build">
+    <!-- Copy the jar into example/solr/lib -->
+    <copy file="${dest}/${fullnamever}.jar" todir="${example}/solr/lib"/>
+    <copy todir="${example}/solr/lib">
+      <fileset dir="lib">
+        <include name="**/*.jar"/>
+      </fileset>
+    </copy>
+  </target>
+
+  <target name="javadoc">
+    <sequential>
+      <mkdir dir="${build.javadoc}/contrib-${name}"/>
+
+      <path id="javadoc.classpath">
+        <path refid="common.classpath"/>
+      </path>
+
+      <invoke-javadoc
+        destdir="${build.javadoc}/contrib-${name}"
+        title="${Name} ${version} contrib-${fullnamever} API">
+        <sources>
+          <packageset dir="src/main/java"/>
+        </sources>
+      </invoke-javadoc>
+    </sequential>
+  </target>
+
+</project>
@ -0,0 +1,2 @@
|
|||
AnyObjectId[8217cae0a1bc977b241e0c8517cc2e3e7cede276] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[680f8c60c1f0393f7e56595e24b29b3ceb46e933] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[552721d0e8deb28f2909cfc5ec900a5e35736795] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[957b6752af9a60c1bb2a4f65db0e90e5ce00f521] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[133dc6cb35f5ca2c5920fd0933a557c2def88680] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[87b80ab5db1729662ccf3439e147430a28c36d03] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[b73a80fab641131e6fbe3ae833549efb3c540d17] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[c9030febd2ae484532407db9ef98247cbe61b779] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[f5e8c167e7f7f3d078407859cb50b8abf23c697e] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[674d71e89ea154dbe2e3cd032821c22b39e8fd68] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[625130719013f195869881a36dcb8d2b14d64d1e] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[037b4fe2743eb161eec649f6fa5fa4725585b518] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[f821d644766c4d5c95e53db4b83cc6cb37b553f6] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[9e472a1610fa5d6736ecd56aec663623170003a3] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[58a33ac11683bec703fadffdbb263036146d7a74] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[16b9a3ed370d5a617d72f0b8935859bf0eac7678] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[3b351f6e2b566f73b742510738a52b866b4ffd0d] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,2 @@
|
|||
AnyObjectId[b338fb66932a763d6939dc93f27ed985ca5d1ebb] was removed in git history.
|
||||
Apache SVN contains full history.
|
|
@ -0,0 +1,179 @@
|
|||
package org.apache.solr.handler;
|
||||
|
||||
import org.apache.commons.io.IOUtils;
|
||||
import org.apache.solr.common.SolrException;
|
||||
import org.apache.solr.common.params.SolrParams;
|
||||
import org.apache.solr.common.params.UpdateParams;
|
||||
import org.apache.solr.common.util.ContentStream;
|
||||
import org.apache.solr.request.SolrQueryRequest;
|
||||
import org.apache.solr.request.SolrQueryResponse;
|
||||
import org.apache.solr.schema.IndexSchema;
|
||||
import org.apache.solr.update.AddUpdateCommand;
|
||||
import org.apache.solr.update.processor.UpdateRequestProcessor;
|
||||
import org.apache.tika.config.TikaConfig;
|
||||
import org.apache.tika.metadata.Metadata;
|
||||
import org.apache.tika.parser.AutoDetectParser;
|
||||
import org.apache.tika.parser.Parser;
|
||||
import org.apache.tika.sax.XHTMLContentHandler;
|
||||
import org.apache.tika.sax.xpath.Matcher;
|
||||
import org.apache.tika.sax.xpath.MatchingContentHandler;
|
||||
import org.apache.tika.sax.xpath.XPathParser;
|
||||
import org.apache.xml.serialize.OutputFormat;
|
||||
import org.apache.xml.serialize.XMLSerializer;
|
||||
import org.xml.sax.ContentHandler;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.io.InputStream;
|
||||
import java.io.StringWriter;
|
||||
|
||||
|
||||
/**
|
||||
* Loads a document by extracting its text and metadata with Tika, then either
|
||||
* indexes the result or, in extract-only mode, returns the extracted content.
|
||||
**/
|
||||
public class ExtractingDocumentLoader extends ContentStreamLoader {
|
||||
|
||||
/**
|
||||
* XHTML XPath parser.
|
||||
*/
|
||||
private static final XPathParser PARSER =
|
||||
new XPathParser("xhtml", XHTMLContentHandler.XHTML);
|
||||
|
||||
final IndexSchema schema;
|
||||
final SolrParams params;
|
||||
final UpdateRequestProcessor processor;
|
||||
protected AutoDetectParser autoDetectParser;
|
||||
|
||||
private final AddUpdateCommand templateAdd;
|
||||
|
||||
protected TikaConfig config;
|
||||
protected SolrContentHandlerFactory factory;
|
||||
//protected Collection<String> dateFormats = DateUtil.DEFAULT_DATE_FORMATS;
|
||||
|
||||
ExtractingDocumentLoader(SolrQueryRequest req, UpdateRequestProcessor processor,
|
||||
TikaConfig config, SolrContentHandlerFactory factory) {
|
||||
this.params = req.getParams();
|
||||
schema = req.getSchema();
|
||||
this.config = config;
|
||||
this.processor = processor;
|
||||
|
||||
templateAdd = new AddUpdateCommand();
|
||||
|
||||
if (params.getBool(UpdateParams.OVERWRITE, true)) {
|
||||
templateAdd.allowDups = false;
|
||||
templateAdd.overwriteCommitted = true;
|
||||
templateAdd.overwritePending = true;
|
||||
} else {
|
||||
templateAdd.allowDups = true;
|
||||
templateAdd.overwriteCommitted = false;
|
||||
templateAdd.overwritePending = false;
|
||||
}
|
||||
//this is lightweight
|
||||
autoDetectParser = new AutoDetectParser(config);
|
||||
this.factory = factory;
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* This must be MT-safe; it may be called concurrently from multiple threads.
|
||||
*
|
||||
* @param handler  the content handler holding the extracted document
|
||||
* @param template the add command to populate and process
|
||||
*/
|
||||
void doAdd(SolrContentHandler handler, AddUpdateCommand template)
|
||||
throws IOException {
|
||||
template.solrDoc = handler.newDocument();
|
||||
processor.processAdd(template);
|
||||
}
|
||||
|
||||
void addDoc(SolrContentHandler handler) throws IOException {
|
||||
templateAdd.indexedId = null;
|
||||
doAdd(handler, templateAdd);
|
||||
}
|
||||
|
||||
/**
|
||||
* @param req    the Solr request
|
||||
* @param rsp    the Solr response
|
||||
* @param stream the content stream to extract and index
|
||||
* @throws java.io.IOException
|
||||
*/
|
||||
public void load(SolrQueryRequest req, SolrQueryResponse rsp, ContentStream stream) throws IOException {
|
||||
errHeader = "ExtractingDocumentLoader: " + stream.getSourceInfo();
|
||||
Parser parser = null;
|
||||
String streamType = req.getParams().get(ExtractingParams.STREAM_TYPE, null);
|
||||
if (streamType != null) {
|
||||
//Cache? Parsers are lightweight to construct and thread-safe, so I'm told
|
||||
parser = config.getParser(streamType.trim().toLowerCase());
|
||||
} else {
|
||||
parser = autoDetectParser;
|
||||
}
|
||||
if (parser != null) {
|
||||
Metadata metadata = new Metadata();
|
||||
metadata.add(ExtractingMetadataConstants.STREAM_NAME, stream.getName());
|
||||
metadata.add(ExtractingMetadataConstants.STREAM_SOURCE_INFO, stream.getSourceInfo());
|
||||
metadata.add(ExtractingMetadataConstants.STREAM_SIZE, String.valueOf(stream.getSize()));
|
||||
metadata.add(ExtractingMetadataConstants.STREAM_CONTENT_TYPE, stream.getContentType());
|
||||
|
||||
// If you specify the resource name (the filename, roughly) with this parameter,
|
||||
// then Tika can make use of it in guessing the appropriate MIME type:
|
||||
String resourceName = req.getParams().get(ExtractingParams.RESOURCE_NAME, null);
|
||||
if (resourceName != null) {
|
||||
metadata.add(Metadata.RESOURCE_NAME_KEY, resourceName);
|
||||
}
|
||||
|
||||
SolrContentHandler handler = factory.createSolrContentHandler(metadata, params, schema);
|
||||
InputStream inputStream = null;
|
||||
try {
|
||||
inputStream = stream.getStream();
|
||||
String xpathExpr = params.get(ExtractingParams.XPATH_EXPRESSION);
|
||||
boolean extractOnly = params.getBool(ExtractingParams.EXTRACT_ONLY, false);
|
||||
ContentHandler parsingHandler = handler;
|
||||
|
||||
StringWriter writer = null;
|
||||
XMLSerializer serializer = null;
|
||||
if (extractOnly) {
|
||||
writer = new StringWriter();
|
||||
serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", true));
|
||||
if (xpathExpr != null) {
|
||||
Matcher matcher =
|
||||
PARSER.parse(xpathExpr);
|
||||
serializer.startDocument();//The MatchingContentHandler does not invoke startDocument. See http://tika.markmail.org/message/kknu3hw7argwiqin
|
||||
parsingHandler = new MatchingContentHandler(serializer, matcher);
|
||||
} else {
|
||||
parsingHandler = serializer;
|
||||
}
|
||||
} else if (xpathExpr != null) {
|
||||
Matcher matcher =
|
||||
PARSER.parse(xpathExpr);
|
||||
parsingHandler = new MatchingContentHandler(handler, matcher);
|
||||
} //else leave it as is
|
||||
|
||||
//potentially use a wrapper handler for parsing, but we still need the SolrContentHandler for getting the document.
|
||||
parser.parse(inputStream, parsingHandler, metadata);
|
||||
if (!extractOnly) {
|
||||
addDoc(handler);
|
||||
} else {
|
||||
//serializer is not null, so we need to call endDocument on it if using xpath
|
||||
if (xpathExpr != null){
|
||||
serializer.endDocument();
|
||||
}
|
||||
rsp.add(stream.getName(), writer.toString());
|
||||
writer.close();
|
||||
|
||||
}
|
||||
} catch (Exception e) {
|
||||
//TODO: handle here with an option to not fail and just log the exception
|
||||
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, e);
|
||||
|
||||
} finally {
|
||||
IOUtils.closeQuietly(inputStream);
|
||||
}
|
||||
} else {
|
||||
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Stream type of " + streamType + " didn't match any known parsers. Please supply the " + ExtractingParams.STREAM_TYPE + " parameter.");
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
}
|
|
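The load() method above routes Tika's SAX events to different handlers depending on the ext.extract.only and ext.xpath parameters. A minimal standalone sketch of just that decision table (the returned labels are illustrative; the real code constructs the Tika/Solr handler objects):

```java
// Sketch of the handler-chain selection in ExtractingDocumentLoader.load().
// Labels only; the actual code builds MatchingContentHandler, XMLSerializer,
// and SolrContentHandler instances instead of strings.
class HandlerChoice {
    static String choose(boolean extractOnly, String xpathExpr) {
        if (extractOnly) {
            // Extracted XHTML is serialized back to the client,
            // optionally filtered through a MatchingContentHandler first.
            return xpathExpr != null
                ? "MatchingContentHandler -> XMLSerializer"
                : "XMLSerializer";
        }
        // Normal indexing path: events build the Solr document.
        return xpathExpr != null
            ? "MatchingContentHandler -> SolrContentHandler"
            : "SolrContentHandler";
    }
}
```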
@ -0,0 +1,13 @@
|
|||
package org.apache.solr.handler;
|
||||
|
||||
|
||||
/**
|
||||
* Metadata keys, added to the Tika {@link org.apache.tika.metadata.Metadata}, that describe the incoming content stream.
|
||||
**/
|
||||
public interface ExtractingMetadataConstants {
|
||||
String STREAM_NAME = "stream_name";
|
||||
String STREAM_SOURCE_INFO = "stream_source_info";
|
||||
String STREAM_SIZE = "stream_size";
|
||||
String STREAM_CONTENT_TYPE = "stream_content_type";
|
||||
}
|
|
@ -0,0 +1,125 @@
|
|||
package org.apache.solr.handler;
|
||||
|
||||
|
||||
/**
|
||||
* The various parameters to use when extracting content.
|
||||
*
|
||||
**/
|
||||
public interface ExtractingParams {
|
||||
|
||||
public static final String EXTRACTING_PREFIX = "ext.";
|
||||
|
||||
/**
|
||||
* The param prefix for mapping Tika metadata to Solr fields.
|
||||
* <p/>
|
||||
* To map a field, add a name like:
|
||||
* <pre>ext.map.title=solr.title</pre>
|
||||
*
|
||||
* In this example, the Tika "title" metadata value will be added to a Solr field named "solr.title".
|
||||
*
|
||||
*
|
||||
*/
|
||||
public static final String MAP_PREFIX = EXTRACTING_PREFIX + "map.";
|
||||
|
||||
/**
|
||||
* The boost value for the name of the field. The boost can be specified by a name mapping.
|
||||
* <p/>
|
||||
* For example
|
||||
* <pre>
|
||||
* ext.map.title=solr.title
|
||||
* ext.boost.solr.title=2.5
|
||||
* </pre>
|
||||
* will boost the solr.title field for this document by 2.5
|
||||
*
|
||||
*/
|
||||
public static final String BOOST_PREFIX = EXTRACTING_PREFIX + "boost.";
|
||||
|
||||
/**
|
||||
* Pass in literal values to be added to the document, as in
|
||||
* <pre>
|
||||
* ext.literal.myField=Foo
|
||||
* </pre>
|
||||
*
|
||||
*/
|
||||
public static final String LITERALS_PREFIX = EXTRACTING_PREFIX + "literal.";
|
||||
|
||||
|
||||
/**
|
||||
* Restrict the extracted parts of a document to be indexed
|
||||
* by passing in an XPath expression. All content that satisfies the XPath expr.
|
||||
* will be passed to the {@link org.apache.solr.handler.SolrContentHandler}.
|
||||
* <p/>
|
||||
* See Tika's docs for what the extracted document looks like.
|
||||
* <p/>
|
||||
* @see #DEFAULT_FIELDNAME
|
||||
* @see #CAPTURE_FIELDS
|
||||
*/
|
||||
public static final String XPATH_EXPRESSION = EXTRACTING_PREFIX + "xpath";
|
||||
|
||||
|
||||
/**
|
||||
* Only extract and return the document; do not index it.
|
||||
*/
|
||||
public static final String EXTRACT_ONLY = EXTRACTING_PREFIX + "extract.only";
|
||||
|
||||
/**
|
||||
* Don't throw an exception if a field doesn't exist; just ignore it.
|
||||
*/
|
||||
public static final String IGNORE_UNDECLARED_FIELDS = EXTRACTING_PREFIX + "ignore.und.fl";
|
||||
|
||||
/**
|
||||
* Index attributes separately according to their name, instead of just adding them to the string buffer
|
||||
*/
|
||||
public static final String INDEX_ATTRIBUTES = EXTRACTING_PREFIX + "idx.attr";
|
||||
|
||||
/**
|
||||
* The field to index the contents to by default. If you want to capture a specific piece
|
||||
* of the Tika document separately, see {@link #CAPTURE_FIELDS}.
|
||||
*
|
||||
* @see #CAPTURE_FIELDS
|
||||
*/
|
||||
public static final String DEFAULT_FIELDNAME = EXTRACTING_PREFIX + "def.fl";
|
||||
|
||||
/**
|
||||
* Capture the specified fields (and everything included below them that isn't captured by some other capture field) separately from the default. This is different
|
||||
* from the case of passing in an XPath expression.
|
||||
* <p/>
|
||||
* The Capture field is based on the localName returned to the {@link org.apache.solr.handler.SolrContentHandler}
|
||||
* by Tika, not to be confused with the mapped field. The field name can then
|
||||
* be mapped into the index schema.
|
||||
* <p/>
|
||||
* For instance, a Tika document may look like:
|
||||
* <pre>
|
||||
* <html>
|
||||
* ...
|
||||
* <body>
|
||||
* <p>some text here. <div>more text</div></p>
|
||||
* Some more text
|
||||
* </body>
|
||||
* </pre>
|
||||
* By passing in the p tag, you could capture all P tags separately from the rest of the text.
|
||||
* Thus, in the example, the capture of the P tag would be: "some text here. more text"
|
||||
*
|
||||
* @see #DEFAULT_FIELDNAME
|
||||
*/
|
||||
public static final String CAPTURE_FIELDS = EXTRACTING_PREFIX + "capture";
|
||||
|
||||
/**
|
||||
* The type of the stream. If not specified, Tika will use mime type detection.
|
||||
*/
|
||||
public static final String STREAM_TYPE = EXTRACTING_PREFIX + "stream.type";
|
||||
|
||||
|
||||
/**
|
||||
* Optional. The file name. If specified, Tika can take this into account while
|
||||
* guessing the MIME type.
|
||||
*/
|
||||
public static final String RESOURCE_NAME = EXTRACTING_PREFIX + "resource.name";
|
||||
|
||||
|
||||
/**
|
||||
* Optional. If specified, the prefix will be prepended to all Metadata, such that it would be possible
|
||||
* to set up a dynamic field to automatically capture it.
|
||||
*/
|
||||
public static final String METADATA_PREFIX = EXTRACTING_PREFIX + "metadata.prefix";
|
||||
}
|
|
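The prefixes defined above (ext.map., ext.boost., ext.literal.) all follow the same pattern: any request parameter whose name starts with the prefix is collected, and the remainder of the name becomes the key. A hypothetical sketch of that interpretation (the helper byPrefix is illustrative, not part of Solr):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper showing how prefix-scoped parameters such as
// ext.map.title=solr.title or ext.boost.solr.title=2.5 are grouped.
class PrefixParams {
    static Map<String, String> byPrefix(Map<String, String> params, String prefix) {
        Map<String, String> out = new HashMap<String, String>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                // Strip the prefix: "ext.map.title" -> "title"
                out.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return out;
    }
}
```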
@ -0,0 +1,134 @@
|
|||
package org.apache.solr.handler;
|
||||
|
||||
/**
|
||||
* Licensed to the Apache Software Foundation (ASF) under one or more
|
||||
* contributor license agreements. See the NOTICE file distributed with
|
||||
* this work for additional information regarding copyright ownership.
|
||||
* The ASF licenses this file to You under the Apache License, Version 2.0
|
||||
* (the "License"); you may not use this file except in compliance with
|
||||
* the License. You may obtain a copy of the License at
|
||||
*
|
||||
* http://www.apache.org/licenses/LICENSE-2.0
|
||||
*
|
||||
* Unless required by applicable law or agreed to in writing, software
|
||||
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
* See the License for the specific language governing permissions and
|
||||
* limitations under the License.
|
||||
*/
|
||||
|
||||
import org.apache.solr.common.SolrException;
|
||||
import org.apache.solr.common.SolrException.ErrorCode;
|
||||
import org.apache.solr.common.util.DateUtil;
|
||||
import org.apache.solr.common.util.NamedList;
|
||||
import org.apache.solr.core.SolrCore;
|
||||
import org.apache.solr.request.SolrQueryRequest;
|
||||
import org.apache.solr.update.processor.UpdateRequestProcessor;
|
||||
import org.apache.solr.util.plugin.SolrCoreAware;
|
||||
import org.apache.tika.config.TikaConfig;
|
||||
import org.apache.tika.exception.TikaException;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
|
||||
import java.io.File;
|
||||
import java.util.Collection;
|
||||
import java.util.HashSet;
|
||||
import java.util.Iterator;
|
||||
import java.util.Map;
|
||||
|
||||
|
||||
/**
|
||||
* Handler for rich documents, such as PDF or Word, or any other file format that Tika handles, where the text must be
|
||||
* extracted from the document before indexing.
|
||||
* <p/>
|
||||
*/
|
||||
|
||||
public class ExtractingRequestHandler extends ContentStreamHandlerBase implements SolrCoreAware {
|
||||
|
||||
private transient static Logger log = LoggerFactory.getLogger(ExtractingRequestHandler.class);
|
||||
|
||||
public static final String CONFIG_LOCATION = "tika.config";
|
||||
public static final String DATE_FORMATS = "date.formats";
|
||||
|
||||
protected TikaConfig config;
|
||||
|
||||
|
||||
protected Collection<String> dateFormats = DateUtil.DEFAULT_DATE_FORMATS;
|
||||
protected SolrContentHandlerFactory factory;
|
||||
|
||||
|
||||
@Override
|
||||
public void init(NamedList args) {
|
||||
super.init(args);
|
||||
}
|
||||
|
||||
public void inform(SolrCore core) {
|
||||
if (initArgs != null) {
|
||||
//if relative, then relative to the config dir; otherwise, an absolute path
|
||||
String tikaConfigLoc = (String) initArgs.get(CONFIG_LOCATION);
|
||||
if (tikaConfigLoc != null) {
|
||||
File configFile = new File(tikaConfigLoc);
|
||||
if (configFile.isAbsolute() == false) {
|
||||
configFile = new File(core.getResourceLoader().getConfigDir(), configFile.getPath());
|
||||
}
|
||||
try {
|
||||
config = new TikaConfig(configFile);
|
||||
} catch (Exception e) {
|
||||
throw new SolrException(ErrorCode.SERVER_ERROR, e);
|
||||
}
|
||||
} else {
|
||||
try {
|
||||
config = TikaConfig.getDefaultConfig();
|
||||
} catch (TikaException e) {
|
||||
throw new SolrException(ErrorCode.SERVER_ERROR, e);
|
||||
}
|
||||
}
|
||||
NamedList configDateFormats = (NamedList) initArgs.get(DATE_FORMATS);
|
||||
if (configDateFormats != null && configDateFormats.size() > 0) {
|
||||
dateFormats = new HashSet<String>();
|
||||
Iterator<Map.Entry<String, Object>> it = configDateFormats.iterator(); //reuse one iterator; creating a new one per call would loop forever
|
||||
while (it.hasNext()) {
|
||||
String format = (String) it.next().getValue();
|
||||
log.info("Adding Date Format: " + format);
|
||||
dateFormats.add(format);
|
||||
}
|
||||
}
|
||||
} else {
|
||||
try {
|
||||
config = TikaConfig.getDefaultConfig();
|
||||
} catch (TikaException e) {
|
||||
throw new SolrException(ErrorCode.SERVER_ERROR, e);
|
||||
}
|
||||
}
|
||||
factory = createFactory();
|
||||
}
|
||||
|
||||
protected SolrContentHandlerFactory createFactory() {
|
||||
return new SolrContentHandlerFactory(dateFormats);
|
||||
}
|
||||
|
||||
|
||||
protected ContentStreamLoader newLoader(SolrQueryRequest req, UpdateRequestProcessor processor) {
|
||||
return new ExtractingDocumentLoader(req, processor, config, factory);
|
||||
}
|
||||
|
||||
// ////////////////////// SolrInfoMBeans methods //////////////////////
|
||||
@Override
|
||||
public String getDescription() {
|
||||
return "Add/Update Rich document";
|
||||
}
|
||||
|
||||
@Override
|
||||
public String getVersion() {
|
||||
return "$Revision:$";
|
||||
}
|
||||
|
||||
@Override
|
||||
public String getSourceId() {
|
||||
return "$Id:$";
|
||||
}
|
||||
|
||||
@Override
|
||||
public String getSource() {
|
||||
return "$URL:$";
|
||||
}
|
||||
}
|
||||
|
||||
|
|
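inform() above resolves the tika.config location in two steps: a relative path is resolved against the core's config directory, while an absolute path is used as-is. A sketch of just that resolution, outside of Solr:

```java
import java.io.File;

// Sketch of the tika.config path resolution performed in inform():
// relative locations are resolved against the core's config directory.
class TikaConfigLocation {
    static File resolve(String configDir, String tikaConfigLoc) {
        File configFile = new File(tikaConfigLoc);
        if (!configFile.isAbsolute()) {
            // Relative: anchor it under the config directory.
            configFile = new File(configDir, configFile.getPath());
        }
        return configFile;
    }
}
```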
@ -0,0 +1,353 @@
|
|||
package org.apache.solr.handler;
|
||||
|
||||
import org.apache.solr.common.SolrException;
|
||||
import org.apache.solr.common.SolrInputDocument;
|
||||
import org.apache.solr.common.SolrInputField;
|
||||
import org.apache.solr.common.params.SolrParams;
|
||||
import org.apache.solr.common.util.DateUtil;
|
||||
import org.apache.solr.schema.DateField;
|
||||
import org.apache.solr.schema.IndexSchema;
|
||||
import org.apache.solr.schema.SchemaField;
|
||||
import org.apache.solr.schema.StrField;
|
||||
import org.apache.solr.schema.TextField;
|
||||
import org.apache.solr.schema.FieldType;
|
||||
import org.apache.solr.schema.UUIDField;
|
||||
import org.apache.tika.metadata.Metadata;
|
||||
import org.slf4j.Logger;
|
||||
import org.slf4j.LoggerFactory;
|
||||
import org.xml.sax.Attributes;
|
||||
import org.xml.sax.SAXException;
|
||||
import org.xml.sax.helpers.DefaultHandler;
|
||||
|
||||
import java.text.DateFormat;
|
||||
import java.util.Collection;
|
||||
import java.util.Collections;
|
||||
import java.util.Date;
|
||||
import java.util.HashMap;
|
||||
import java.util.Iterator;
|
||||
import java.util.Map;
|
||||
import java.util.Stack;
|
||||
import java.util.UUID;
|
||||
|
||||
|
||||
/**
|
||||
* This class is not thread-safe. It is responsible for responding to Tika extraction events and producing a Solr document
|
||||
*/
|
||||
public class SolrContentHandler extends DefaultHandler implements ExtractingParams {
|
||||
private transient static Logger log = LoggerFactory.getLogger(SolrContentHandler.class);
|
||||
protected SolrInputDocument document;
|
||||
|
||||
protected Collection<String> dateFormats = DateUtil.DEFAULT_DATE_FORMATS;
|
||||
|
||||
protected Metadata metadata;
|
||||
protected SolrParams params;
|
||||
protected StringBuilder catchAllBuilder = new StringBuilder(2048);
|
||||
//private StringBuilder currentBuilder;
|
||||
protected IndexSchema schema;
|
||||
//create empty so we don't have to worry about null checks
|
||||
protected Map<String, StringBuilder> fieldBuilders = Collections.emptyMap();
|
||||
protected Stack<StringBuilder> bldrStack = new Stack<StringBuilder>();
|
||||
|
||||
protected boolean ignoreUndeclaredFields = false;
|
||||
protected boolean indexAttribs = false;
|
||||
protected String defaultFieldName;
|
||||
|
||||
protected String metadataPrefix = "";
|
||||
|
||||
/**
|
||||
* Only access through getNextId();
|
||||
*/
|
||||
private static long identifier = Long.MIN_VALUE;
|
||||
|
||||
|
||||
public SolrContentHandler(Metadata metadata, SolrParams params, IndexSchema schema) {
|
||||
this(metadata, params, schema, DateUtil.DEFAULT_DATE_FORMATS);
|
||||
}
|
||||
|
||||
|
||||
public SolrContentHandler(Metadata metadata, SolrParams params,
|
||||
IndexSchema schema, Collection<String> dateFormats) {
|
||||
document = new SolrInputDocument();
|
||||
this.metadata = metadata;
|
||||
this.params = params;
|
||||
this.schema = schema;
|
||||
this.dateFormats = dateFormats;
|
||||
this.ignoreUndeclaredFields = params.getBool(ExtractingParams.IGNORE_UNDECLARED_FIELDS, false);
|
||||
this.indexAttribs = params.getBool(ExtractingParams.INDEX_ATTRIBUTES, false);
|
||||
this.defaultFieldName = params.get(ExtractingParams.DEFAULT_FIELDNAME);
|
||||
this.metadataPrefix = params.get(ExtractingParams.METADATA_PREFIX, "");
|
||||
//if there's no default field and we are intending to index, then throw an exception
|
||||
if (defaultFieldName == null && params.getBool(ExtractingParams.EXTRACT_ONLY, false) == false) {
|
||||
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "No default field name specified");
|
||||
}
|
||||
String[] captureFields = params.getParams(ExtractingParams.CAPTURE_FIELDS);
|
||||
if (captureFields != null && captureFields.length > 0) {
|
||||
fieldBuilders = new HashMap<String, StringBuilder>();
|
||||
for (int i = 0; i < captureFields.length; i++) {
|
||||
fieldBuilders.put(captureFields[i], new StringBuilder());
|
||||
}
|
||||
}
|
||||
bldrStack.push(catchAllBuilder);
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* This is called by a consumer when it is ready to deal with a new SolrInputDocument. Overriding
|
||||
* classes can use this hook to add in or change whatever they deem fit for the document at that time.
|
||||
* The base implementation adds the metadata as fields, allowing for potential remapping.
|
||||
*
|
||||
* @return The {@link org.apache.solr.common.SolrInputDocument}.
|
||||
*/
|
||||
public SolrInputDocument newDocument() {
|
||||
float boost = 1.0f;
|
||||
//handle the metadata extracted from the document
|
||||
for (String name : metadata.names()) {
|
||||
String[] vals = metadata.getValues(name);
|
||||
name = findMappedMetadataName(name);
|
||||
SchemaField schFld = schema.getFieldOrNull(name);
|
||||
if (schFld != null) {
|
||||
boost = getBoost(name);
|
||||
if (schFld.multiValued()) {
|
||||
for (int i = 0; i < vals.length; i++) {
|
||||
String val = vals[i];
|
||||
document.addField(name, transformValue(val, schFld), boost);
|
||||
}
|
||||
} else {
|
||||
StringBuilder builder = new StringBuilder();
|
||||
for (int i = 0; i < vals.length; i++) {
|
||||
builder.append(vals[i]).append(' ');
|
||||
}
|
||||
document.addField(name, transformValue(builder.toString().trim(), schFld), boost);
|
||||
}
|
||||
} else {
|
||||
//TODO: error or log?
|
||||
if (ignoreUndeclaredFields == false) {
|
||||
// Arguably we should handle this as a special case. Why? Because unlike basically
|
||||
// all the other fields in metadata, this one was probably set not by Tika but in
|
||||
// ExtractingDocumentLoader.load(). You shouldn't have to define a mapping for this
|
||||
// field just because you specified a resource.name parameter to the handler, should
|
||||
// you?
|
||||
if (!name.equals(Metadata.RESOURCE_NAME_KEY)) { //use equals: name may have been remapped to a new String instance
|
||||
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Invalid field: " + name);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
//handle the literals from the params
|
||||
Iterator<String> paramNames = params.getParameterNamesIterator();
|
||||
while (paramNames.hasNext()) {
|
||||
String name = paramNames.next();
|
||||
if (name.startsWith(LITERALS_PREFIX)) {
|
||||
String fieldName = name.substring(LITERALS_PREFIX.length());
|
||||
//no need to map names here, since they are literals from the user
|
||||
SchemaField schFld = schema.getFieldOrNull(fieldName);
|
||||
if (schFld != null) {
|
||||
String value = params.get(name);
|
||||
boost = getBoost(fieldName);
|
||||
//no need to transform here, b/c we can assume the user sent it in correctly
|
||||
document.addField(fieldName, value, boost);
|
||||
} else {
|
||||
handleUndeclaredField(fieldName);
|
||||
}
|
||||
}
|
||||
}
|
||||
//add in the content
|
||||
document.addField(defaultFieldName, catchAllBuilder.toString(), getBoost(defaultFieldName));
|
||||
|
||||
//add in the captured content
|
||||
for (Map.Entry<String, StringBuilder> entry : fieldBuilders.entrySet()) {
|
||||
if (entry.getValue().length() > 0) {
|
||||
String fieldName = findMappedName(entry.getKey());
|
||||
SchemaField schFld = schema.getFieldOrNull(fieldName);
|
||||
if (schFld != null) {
|
||||
document.addField(fieldName, transformValue(entry.getValue().toString(), schFld), getBoost(fieldName));
|
||||
} else {
|
||||
handleUndeclaredField(fieldName);
|
||||
}
|
||||
}
|
||||
}
|
||||
//make sure we have a unique id, if one is needed
|
||||
SchemaField uniqueField = schema.getUniqueKeyField();
|
||||
if (uniqueField != null) {
|
||||
String uniqueFieldName = uniqueField.getName();
|
||||
SolrInputField uniqFld = document.getField(uniqueFieldName);
|
||||
if (uniqFld == null) {
|
||||
String uniqId = generateId(uniqueField);
|
||||
if (uniqId != null) {
|
||||
document.addField(uniqueFieldName, uniqId);
|
||||
}
|
||||
}
|
||||
}
|
||||
if (log.isDebugEnabled()) {
|
||||
log.debug("Doc: " + document);
|
||||
}
|
||||
return document;
|
||||
}
|
||||
|
||||
/**
|
||||
* Generate an ID for the document. First try to get
|
||||
* {@link org.apache.solr.handler.ExtractingMetadataConstants#STREAM_NAME} from the
|
||||
* {@link org.apache.tika.metadata.Metadata}, then try {@link ExtractingMetadataConstants#STREAM_SOURCE_INFO}
|
||||
* then try {@link org.apache.tika.metadata.Metadata#IDENTIFIER}.
|
||||
* If those all are null, then generate a random UUID using {@link java.util.UUID#randomUUID()}.
|
||||
*
|
||||
* @param uniqueField The SchemaField representing the unique field.
|
||||
* @return The id as a string
|
||||
*/
|
||||
protected String generateId(SchemaField uniqueField) {
|
||||
//the document has no value for the unique key field, so generate one
|
||||
String uniqId = null;
|
||||
FieldType type = uniqueField.getType();
|
||||
if (type instanceof StrField || type instanceof TextField) {
|
||||
uniqId = metadata.get(ExtractingMetadataConstants.STREAM_NAME);
|
||||
if (uniqId == null) {
|
||||
uniqId = metadata.get(ExtractingMetadataConstants.STREAM_SOURCE_INFO);
|
||||
}
|
||||
if (uniqId == null) {
|
||||
uniqId = metadata.get(Metadata.IDENTIFIER);
|
||||
}
|
||||
if (uniqId == null) {
|
||||
//last chance, just create one
|
||||
uniqId = UUID.randomUUID().toString();
|
||||
}
|
||||
} else if (type instanceof UUIDField){
|
||||
uniqId = UUID.randomUUID().toString();
|
||||
}
|
||||
else {
|
||||
uniqId = String.valueOf(getNextId());
|
||||
}
|
||||
return uniqId;
|
||||
}
|
||||
|
||||
|
||||
@Override
|
||||
public void startDocument() throws SAXException {
|
||||
document.clear();
|
||||
catchAllBuilder.setLength(0);
|
||||
for (StringBuilder builder : fieldBuilders.values()) {
|
||||
builder.setLength(0);
|
||||
}
|
||||
bldrStack.clear();
|
||||
bldrStack.push(catchAllBuilder);
|
||||
}
|
||||
|
||||
|
||||
@Override
|
||||
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
|
||||
StringBuilder theBldr = fieldBuilders.get(localName);
|
||||
if (theBldr != null) {
|
||||
//we need to switch the currentBuilder
|
||||
bldrStack.push(theBldr);
|
||||
}
|
||||
if (indexAttribs) {
|
||||
for (int i = 0; i < attributes.getLength(); i++) {
|
||||
String fieldName = findMappedName(localName);
|
||||
SchemaField schFld = schema.getFieldOrNull(fieldName);
|
||||
if (schFld != null) {
|
||||
document.addField(fieldName, transformValue(attributes.getValue(i), schFld), getBoost(fieldName));
|
||||
} else {
|
||||
handleUndeclaredField(fieldName);
|
||||
}
|
||||
}
|
||||
} else {
|
||||
for (int i = 0; i < attributes.getLength(); i++) {
|
||||
bldrStack.peek().append(attributes.getValue(i)).append(' ');
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
protected void handleUndeclaredField(String fieldName) {
|
||||
if (ignoreUndeclaredFields == false) {
|
||||
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Invalid field: " + fieldName);
|
||||
} else {
|
||||
if (log.isInfoEnabled()) {
|
||||
log.info("Ignoring Field: " + fieldName);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public void endElement(String uri, String localName, String qName) throws SAXException {
|
||||
StringBuilder theBldr = fieldBuilders.get(localName);
|
||||
if (theBldr != null) {
|
||||
//pop the stack
|
||||
bldrStack.pop();
|
||||
assert (bldrStack.size() >= 1);
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
|
||||
|
||||
@Override
|
||||
public void characters(char[] chars, int offset, int length) throws SAXException {
|
||||
bldrStack.peek().append(chars, offset, length);
|
||||
}
|
||||
|
||||
|
||||
|
||||
|
||||
/**
|
||||
* Can be used to transform input values based on their {@link org.apache.solr.schema.SchemaField}
|
||||
* <p/>
|
||||
* This implementation only formats dates using the {@link org.apache.solr.common.util.DateUtil}.
|
||||
*
|
||||
* @param val The value to transform
|
||||
* @param schFld The {@link org.apache.solr.schema.SchemaField}
|
||||
* @return The potentially new value.
|
||||
*/
|
||||
protected String transformValue(String val, SchemaField schFld) {
|
||||
String result = val;
|
||||
if (schFld.getType() instanceof DateField) {
|
||||
//try to transform the date
|
||||
try {
|
||||
Date date = DateUtil.parseDate(val, dateFormats);
|
||||
DateFormat df = DateUtil.getThreadLocalDateFormat();
|
||||
result = df.format(date);
|
||||
|
||||
} catch (Exception e) {
|
||||
//TODO: error or log?
|
||||
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Invalid value: " + val + " for field: " + schFld, e);
|
||||
}
|
||||
}
|
||||
return result;
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* Get the value of any boost factor for the mapped name.
|
||||
*
|
||||
* @param name The name of the field to see if there is a boost specified
|
||||
* @return The boost value
|
||||
*/
|
||||
protected float getBoost(String name) {
|
||||
return params.getFloat(BOOST_PREFIX + name, 1.0f);
|
||||
}
|
||||
|
||||
/**
|
||||
* Get the name mapping
|
||||
*
|
||||
* @param name The name to check to see if there is a mapping
|
||||
* @return The new name, if there is one, else <code>name</code>
|
||||
*/
|
||||
protected String findMappedName(String name) {
|
||||
return params.get(ExtractingParams.MAP_PREFIX + name, name);
|
||||
}
|
||||
|
||||
/**
|
||||
* Get the name mapping for the metadata field. Prepends metadataPrefix onto the returned result.
|
||||
*
|
||||
* @param name The name to check to see if there is a mapping
|
||||
* @return The new name, else <code>name</code>
|
||||
*/
|
||||
protected String findMappedMetadataName(String name) {
|
||||
return metadataPrefix + params.get(ExtractingParams.MAP_PREFIX + name, name);
|
||||
}
|
||||
|
||||
|
||||
protected synchronized long getNextId(){
|
||||
return identifier++;
|
||||
}
|
||||
|
||||
|
||||
}
|
|
@@ -0,0 +1,25 @@
package org.apache.solr.handler;

import org.apache.tika.metadata.Metadata;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.schema.IndexSchema;

import java.util.Collection;


/**
 *
 *
 **/
public class SolrContentHandlerFactory {
  protected Collection<String> dateFormats;

  public SolrContentHandlerFactory(Collection<String> dateFormats) {
    this.dateFormats = dateFormats;
  }

  public SolrContentHandler createSolrContentHandler(Metadata metadata, SolrParams params, IndexSchema schema) {
    return new SolrContentHandler(metadata, params, schema, dateFormats);
  }
}
@@ -0,0 +1,140 @@
package org.apache.solr.handler;

import org.apache.solr.util.AbstractSolrTestCase;
import org.apache.solr.request.LocalSolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.common.util.ContentStream;
import org.apache.solr.common.util.ContentStreamBase;
import org.apache.solr.common.util.NamedList;

import java.util.List;
import java.util.ArrayList;
import java.io.File;


/**
 *
 *
 **/
public class ExtractingRequestHandlerTest extends AbstractSolrTestCase {
  @Override public String getSchemaFile() { return "schema.xml"; }
  @Override public String getSolrConfigFile() { return "solrconfig.xml"; }


  public void testExtraction() throws Exception {
    ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
    assertTrue("handler is null and it shouldn't be", handler != null);
    loadLocal("solr-word.pdf", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
            "ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
            "ext.map.Author", "extractedAuthor",
            "ext.def.fl", "extractedContent",
            "ext.map.Last-Modified", "extractedDate"
    );
    assertQ(req("title:solr-word"), "//*[@numFound='0']");
    assertU(commit());
    assertQ(req("title:solr-word"), "//*[@numFound='1']");

    loadLocal("simple.html", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
            "ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
            "ext.map.Author", "extractedAuthor",
            "ext.map.language", "extractedLanguage",
            "ext.def.fl", "extractedContent",
            "ext.map.Last-Modified", "extractedDate"
    );
    assertQ(req("title:Welcome"), "//*[@numFound='0']");
    assertU(commit());
    assertQ(req("title:Welcome"), "//*[@numFound='1']");

    loadLocal("version_control.xml", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
            "ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
            "ext.map.Author", "extractedAuthor",
            "ext.def.fl", "extractedContent",
            "ext.map.Last-Modified", "extractedDate"
    );
    assertQ(req("stream_name:version_control.xml"), "//*[@numFound='0']");
    assertU(commit());
    assertQ(req("stream_name:version_control.xml"), "//*[@numFound='1']");
  }


  public void testPlainTextSpecifyingMimeType() throws Exception {
    ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
    assertTrue("handler is null and it shouldn't be", handler != null);

    // Load plain text specifying MIME type:
    loadLocal("version_control.txt", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
            "ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
            "ext.map.Author", "extractedAuthor",
            "ext.map.language", "extractedLanguage",
            "ext.def.fl", "extractedContent",
            ExtractingParams.STREAM_TYPE, "text/plain"
    );
    assertQ(req("extractedContent:Apache"), "//*[@numFound='0']");
    assertU(commit());
    assertQ(req("extractedContent:Apache"), "//*[@numFound='1']");
  }

  public void testPlainTextSpecifyingResourceName() throws Exception {
    ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
    assertTrue("handler is null and it shouldn't be", handler != null);

    // Load plain text specifying filename
    loadLocal("version_control.txt", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
            "ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
            "ext.map.Author", "extractedAuthor",
            "ext.map.language", "extractedLanguage",
            "ext.def.fl", "extractedContent",
            ExtractingParams.RESOURCE_NAME, "version_control.txt"
    );
    assertQ(req("extractedContent:Apache"), "//*[@numFound='0']");
    assertU(commit());
    assertQ(req("extractedContent:Apache"), "//*[@numFound='1']");
  }

  // Note: If you load a plain text file specifying neither MIME type nor filename, extraction will silently fail. This is because Tika's
  // automatic MIME type detection will fail, and it will default to using an empty-string-returning default parser.


  public void testExtractOnly() throws Exception {
    ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
    assertTrue("handler is null and it shouldn't be", handler != null);
    SolrQueryResponse rsp = loadLocal("solr-word.pdf", ExtractingParams.EXTRACT_ONLY, "true");
    assertTrue("rsp is null and it shouldn't be", rsp != null);
    NamedList list = rsp.getValues();

    String extraction = (String) list.get("solr-word.pdf");
    assertTrue("extraction is null and it shouldn't be", extraction != null);
    assertTrue(extraction + " does not contain " + "solr-word", extraction.indexOf("solr-word") != -1);
  }

  public void testXPath() throws Exception {
    ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
    assertTrue("handler is null and it shouldn't be", handler != null);
    SolrQueryResponse rsp = loadLocal("example.html",
            ExtractingParams.XPATH_EXPRESSION, "/xhtml:html/xhtml:body/xhtml:a/descendant:node()",
            ExtractingParams.EXTRACT_ONLY, "true"
    );
    assertTrue("rsp is null and it shouldn't be", rsp != null);
    NamedList list = rsp.getValues();
    String val = (String) list.get("example.html");
    val = val.trim();
    assertTrue(val + " is not equal to " + "linkNews", val.equals("linkNews")); //there are two <a> tags, and they get collapsed
  }


  SolrQueryResponse loadLocal(String filename, String... args) throws Exception {
    LocalSolrQueryRequest req = (LocalSolrQueryRequest) req(args);

    // TODO: stop using locally defined streams once stream.file and
    // stream.body work everywhere
    List<ContentStream> cs = new ArrayList<ContentStream>();
    cs.add(new ContentStreamBase.FileStream(new File(filename)));
    req.setContentStreams(cs);
    return h.queryAndResponse("/update/extract", req);
  }

}
@@ -0,0 +1,49 @@
<html>
<head>
  <title>Welcome to Solr</title>
</head>
<body>
<p>
  Here is some text
</p>
<div>Here is some text in a div</div>
<div>This has a <a href="http://www.apache.org">link</a>.</div>
<a href="#news">News</a>
<ul class="minitoc">
  <li>
    <a href="#03+October+2008+-+Solr+Logo+Contest">03 October 2008 - Solr Logo Contest</a>
  </li>
  <li>
    <a href="#15+September+2008+-+Solr+1.3.0+Available">15 September 2008 - Solr 1.3.0 Available</a>
  </li>
  <li>
    <a href="#28+August+2008+-+Lucene%2FSolr+at+ApacheCon+New+Orleans">28 August 2008 - Lucene/Solr at ApacheCon New Orleans</a>
  </li>
  <li>
    <a href="#03+September+2007+-+Lucene+at+ApacheCon+Atlanta">03 September 2007 - Lucene at ApacheCon Atlanta</a>
  </li>
  <li>
    <a href="#06+June+2007%3A+Release+1.2+available">06 June 2007: Release 1.2 available</a>
  </li>
  <li>
    <a href="#17+January+2007%3A+Solr+graduates+from+Incubator">17 January 2007: Solr graduates from Incubator</a>
  </li>
  <li>
    <a href="#22+December+2006%3A+Release+1.1.0+available">22 December 2006: Release 1.1.0 available</a>
  </li>
  <li>
    <a href="#15+August+2006%3A+Solr+at+ApacheCon+US">15 August 2006: Solr at ApacheCon US</a>
  </li>
  <li>
    <a href="#21+April+2006%3A+Solr+at+ApacheCon">21 April 2006: Solr at ApacheCon</a>
  </li>
  <li>
    <a href="#21+February+2006%3A+nightly+builds">21 February 2006: nightly builds</a>
  </li>
  <li>
    <a href="#17+January+2006%3A+Solr+Joins+Apache+Incubator">17 January 2006: Solr Joins Apache Incubator</a>
  </li>
</ul>

</body>
</html>
@@ -0,0 +1,12 @@
<html>
<head>
  <title>Welcome to Solr</title>
</head>
<body>
<p>
  Here is some text
</p>
<div>Here is some text in a div</div>
<div>This has a <a href="http://www.apache.org">link</a>.</div>
</body>
</html>
Binary file not shown.
@@ -0,0 +1,20 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#use a protected word file to avoid stemming two
#unrelated words to the same base word.
#to test, we will use words that would normally obviously be stemmed.
cats
ridding
@@ -0,0 +1,467 @@
<?xml version="1.0" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!-- The Solr schema file. This file should be named "schema.xml" and
     should be located where the classloader for the Solr webapp can find it.

     This schema is used for testing, and as such has everything and the
     kitchen sink thrown in. See example/solr/conf/schema.xml for a
     more concise example.

     $Id: schema.xml 382610 2006-03-03 01:43:03Z yonik $
     $Source: /cvs/main/searching/solr-configs/test/WEB-INF/classes/schema.xml,v $
     $Name:  $
  -->

<schema name="test" version="1.0">
  <types>

    <!-- field type definitions... note that the "name" attribute is
         just a label to be used by field definitions. The "class"
         attribute and any other attributes determine the real type and
         behavior of the fieldtype.
      -->

    <!-- numeric field types that store and index the text
         value verbatim (and hence don't sort correctly or support range queries.)
         These are provided more for backward compatibility, allowing one
         to create a schema that matches an existing lucene index.
      -->
    <fieldType name="integer" class="solr.IntField"/>
    <fieldType name="long" class="solr.LongField"/>
    <fieldtype name="float" class="solr.FloatField"/>
    <fieldType name="double" class="solr.DoubleField"/>

    <!-- numeric field types that manipulate the value into
         a string value that isn't human readable in its internal form,
         but sorts correctly and supports range queries.

         If sortMissingLast="true" then a sort on this field will cause documents
         without the field to come after documents with the field,
         regardless of the requested sort order.
         If sortMissingFirst="true" then a sort on this field will cause documents
         without the field to come before documents with the field,
         regardless of the requested sort order.
         If sortMissingLast="false" and sortMissingFirst="false" (the default),
         then default lucene sorting will be used which places docs without the field
         first in an ascending sort and last in a descending sort.
      -->
    <fieldtype name="sint" class="solr.SortableIntField" sortMissingLast="true"/>
    <fieldtype name="slong" class="solr.SortableLongField" sortMissingLast="true"/>
    <fieldtype name="sfloat" class="solr.SortableFloatField" sortMissingLast="true"/>
    <fieldtype name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true"/>

    <!-- bcd versions of sortable numeric type may provide smaller
         storage space and support very large numbers.
      -->
    <fieldtype name="bcdint" class="solr.BCDIntField" sortMissingLast="true"/>
    <fieldtype name="bcdlong" class="solr.BCDLongField" sortMissingLast="true"/>
    <fieldtype name="bcdstr" class="solr.BCDStrField" sortMissingLast="true"/>

    <!-- Field type demonstrating an Analyzer failure -->
    <fieldtype name="failtype1" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

    <!-- Demonstrating ignoreCaseChange -->
    <fieldtype name="wdf_nocase" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

    <fieldtype name="wdf_preserve" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>


    <!-- HighlitText optimizes storage for (long) columns which will be highlit -->
    <fieldtype name="highlittext" class="solr.TextField" compressThreshold="345"/>

    <fieldtype name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldtype name="string" class="solr.StrField" sortMissingLast="true"/>

    <!-- format for date is 1995-12-31T23:59:59.999Z and only the fractional
         seconds part (.999) is optional.
      -->
    <fieldtype name="date" class="solr.DateField" sortMissingLast="true"/>

    <!-- solr.TextField allows the specification of custom
         text analyzers specified as a tokenizer and a list
         of token filters.
      -->
    <fieldtype name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <!-- lucene PorterStemFilterFactory deprecated
        <filter class="solr.PorterStemFilterFactory"/>
        -->
        <filter class="solr.EnglishPorterFilterFactory"/>
      </analyzer>
    </fieldtype>


    <fieldtype name="nametext" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
    </fieldtype>

    <fieldtype name="teststop" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
      </analyzer>
    </fieldtype>

    <!-- fieldtypes in this section isolate tokenizers and tokenfilters for testing -->
    <fieldtype name="lowertok" class="solr.TextField">
      <analyzer><tokenizer class="solr.LowerCaseTokenizerFactory"/></analyzer>
    </fieldtype>
    <fieldtype name="keywordtok" class="solr.TextField">
      <analyzer><tokenizer class="solr.KeywordTokenizerFactory"/></analyzer>
    </fieldtype>
    <fieldtype name="standardtok" class="solr.TextField">
      <analyzer><tokenizer class="solr.StandardTokenizerFactory"/></analyzer>
    </fieldtype>
    <fieldtype name="lettertok" class="solr.TextField">
      <analyzer><tokenizer class="solr.LetterTokenizerFactory"/></analyzer>
    </fieldtype>
    <fieldtype name="whitetok" class="solr.TextField">
      <analyzer><tokenizer class="solr.WhitespaceTokenizerFactory"/></analyzer>
    </fieldtype>
    <fieldtype name="HTMLstandardtok" class="solr.TextField">
      <analyzer><tokenizer class="solr.HTMLStripStandardTokenizerFactory"/></analyzer>
    </fieldtype>
    <fieldtype name="HTMLwhitetok" class="solr.TextField">
      <analyzer><tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/></analyzer>
    </fieldtype>
    <fieldtype name="standardtokfilt" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
      </analyzer>
    </fieldtype>
    <fieldtype name="standardfilt" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
      </analyzer>
    </fieldtype>
    <fieldtype name="lowerfilt" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>
    <fieldtype name="patternreplacefilt" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="([^a-zA-Z])" replacement="_" replace="all"
        />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
      </analyzer>
    </fieldtype>
    <fieldtype name="porterfilt" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldtype>
    <!-- fieldtype name="snowballfilt" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SnowballPorterFilterFactory"/>
      </analyzer>
    </fieldtype -->
    <fieldtype name="engporterfilt" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"/>
      </analyzer>
    </fieldtype>
    <fieldtype name="custengporterfilt" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      </analyzer>
    </fieldtype>
    <fieldtype name="stopfilt" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"/>
      </analyzer>
    </fieldtype>
    <fieldtype name="custstopfilt" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
      </analyzer>
    </fieldtype>
    <fieldtype name="lengthfilt" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LengthFilterFactory" min="2" max="5"/>
      </analyzer>
    </fieldtype>

    <fieldtype name="subword" class="solr.TextField" multiValued="true" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"/>
      </analyzer>
    </fieldtype>

    <!-- more flexible in matching skus, but more chance of a false match -->
    <fieldtype name="skutype1" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

    <!-- less flexible in matching skus, but less chance of a false match -->
    <fieldtype name="skutype2" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

    <fieldtype name="syn" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter name="syn" class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
      </analyzer>
    </fieldtype>

    <!-- Demonstrates how RemoveDuplicatesTokenFilter makes stemmed
         synonyms "better"
      -->
    <fieldtype name="dedup" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" expand="true"/>
        <filter class="solr.EnglishPorterFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>

    <fieldtype name="unstored" class="solr.StrField" indexed="true" stored="false"/>


    <fieldtype name="textgap" class="solr.TextField" multiValued="true" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

  </types>


  <fields>
    <field name="id" type="integer" indexed="true" stored="true" multiValued="false" required="false"/>
    <field name="name" type="nametext" indexed="true" stored="true"/>
    <field name="text" type="text" indexed="true" stored="false"/>
    <field name="subject" type="text" indexed="true" stored="true"/>
    <field name="title" type="nametext" indexed="true" stored="true"/>
    <field name="weight" type="float" indexed="true" stored="true"/>
    <field name="bday" type="date" indexed="true" stored="true"/>

    <field name="title_stemmed" type="text" indexed="true" stored="false"/>
    <field name="title_lettertok" type="lettertok" indexed="true" stored="false"/>

    <field name="syn" type="syn" indexed="true" stored="true"/>

    <!-- to test property inheritance and overriding -->
    <field name="shouldbeunstored" type="unstored"/>
    <field name="shouldbestored" type="unstored" stored="true"/>
    <field name="shouldbeunindexed" type="unstored" indexed="false" stored="true"/>


    <!-- test different combinations of indexed and stored -->
    <field name="bind" type="boolean" indexed="true" stored="false"/>
    <field name="bsto" type="boolean" indexed="false" stored="true"/>
    <field name="bindsto" type="boolean" indexed="true" stored="true"/>
    <field name="isto" type="integer" indexed="false" stored="true"/>
    <field name="iind" type="integer" indexed="true" stored="false"/>
    <field name="ssto" type="string" indexed="false" stored="true"/>
    <field name="sind" type="string" indexed="true" stored="false"/>
    <field name="sindsto" type="string" indexed="true" stored="true"/>

    <!-- test combinations of term vector settings -->
    <field name="test_basictv" type="text" termVectors="true"/>
    <field name="test_notv" type="text" termVectors="false"/>
    <field name="test_postv" type="text" termVectors="true" termPositions="true"/>
    <field name="test_offtv" type="text" termVectors="true" termOffsets="true"/>
    <field name="test_posofftv" type="text" termVectors="true"
           termPositions="true" termOffsets="true"/>

    <!-- test highlit field settings -->
    <field name="test_hlt" type="highlittext" indexed="true" compressed="true"/>
    <field name="test_hlt_off" type="highlittext" indexed="true" compressed="false"/>

    <!-- fields to test individual tokenizers and tokenfilters -->
    <field name="teststop" type="teststop" indexed="true" stored="true"/>
    <field name="lowertok" type="lowertok" indexed="true" stored="true"/>
    <field name="keywordtok" type="keywordtok" indexed="true" stored="true"/>
    <field name="standardtok" type="standardtok" indexed="true" stored="true"/>
    <field name="HTMLstandardtok" type="HTMLstandardtok" indexed="true" stored="true"/>
    <field name="lettertok" type="lettertok" indexed="true" stored="true"/>
    <field name="whitetok" type="whitetok" indexed="true" stored="true"/>
    <field name="HTMLwhitetok" type="HTMLwhitetok" indexed="true" stored="true"/>
    <field name="standardtokfilt" type="standardtokfilt" indexed="true" stored="true"/>
    <field name="standardfilt" type="standardfilt" indexed="true" stored="true"/>
    <field name="lowerfilt" type="lowerfilt" indexed="true" stored="true"/>
    <field name="patternreplacefilt" type="patternreplacefilt" indexed="true" stored="true"/>
    <field name="porterfilt" type="porterfilt" indexed="true" stored="true"/>
    <field name="engporterfilt" type="engporterfilt" indexed="true" stored="true"/>
    <field name="custengporterfilt" type="custengporterfilt" indexed="true" stored="true"/>
    <field name="stopfilt" type="stopfilt" indexed="true" stored="true"/>
    <field name="custstopfilt" type="custstopfilt" indexed="true" stored="true"/>
    <field name="lengthfilt" type="lengthfilt" indexed="true" stored="true"/>
    <field name="dedup" type="dedup" indexed="true" stored="true"/>
    <field name="wdf_nocase" type="wdf_nocase" indexed="true" stored="true"/>
    <field name="wdf_preserve" type="wdf_preserve" indexed="true" stored="true"/>

    <field name="numberpartfail" type="failtype1" indexed="true" stored="true"/>

    <field name="nullfirst" type="string" indexed="true" stored="true" sortMissingFirst="true"/>

    <field name="subword" type="subword" indexed="true" stored="true"/>
    <field name="sku1" type="skutype1" indexed="true" stored="true"/>
    <field name="sku2" type="skutype2" indexed="true" stored="true"/>

    <field name="textgap" type="textgap" indexed="true" stored="true"/>

    <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
    <field name="multiDefault" type="string" indexed="true" stored="true" default="muLti-Default" multiValued="true"/>
    <field name="intDefault" type="sint" indexed="true" stored="true" default="42" multiValued="false"/>

    <field name="extractedDate" type="date" indexed="true" stored="true" multiValued="true"/>
    <field name="extractedContent" type="text" indexed="true" stored="true" multiValued="true"/>
    <field name="extractedProducer" type="text" indexed="true" stored="true" multiValued="true"/>
    <field name="extractedCreator" type="text" indexed="true" stored="true" multiValued="true"/>
    <field name="extractedKeywords" type="text" indexed="true" stored="true" multiValued="true"/>
    <field name="extractedAuthor" type="text" indexed="true" stored="true" multiValued="true"/>
|
||||
<field name="extractedLanguage" type="string" indexed="true" stored="true" multiValued="true"/>
|
||||
<field name="resourceName" type="string" indexed="true" stored="true" multiValued="true"/>
|
||||
|
||||
|
||||
<!-- Dynamic field definitions. If a field name is not found, dynamicFields
|
||||
will be used if the name matches any of the patterns.
|
||||
RESTRICTION: the glob-like pattern in the name attribute must have
|
||||
a "*" only at the start or the end.
|
||||
EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
|
||||
Longer patterns will be matched first. if equal size patterns
|
||||
both match, the first appearing in the schema will be used.
|
||||
-->
|
||||
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
|
||||
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
|
||||
<dynamicField name="*_s1" type="string" indexed="true" stored="true" multiValued="false"/>
|
||||
<dynamicField name="*_l" type="slong" indexed="true" stored="true"/>
|
||||
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
|
||||
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
|
||||
<dynamicField name="*_f" type="sfloat" indexed="true" stored="true"/>
|
||||
<dynamicField name="*_d" type="sdouble" indexed="true" stored="true"/>
|
||||
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
|
||||
<dynamicField name="*_bcd" type="bcdstr" indexed="true" stored="true"/>
|
||||
|
||||
<dynamicField name="*_sI" type="string" indexed="true" stored="false"/>
|
||||
<dynamicField name="*_sS" type="string" indexed="false" stored="true"/>
|
||||
<dynamicField name="t_*" type="text" indexed="true" stored="true"/>
|
||||
<dynamicField name="tv_*" type="text" indexed="true" stored="true"
|
||||
termVectors="true" termPositions="true" termOffsets="true"/>
|
||||
|
||||
<dynamicField name="stream_*" type="text" indexed="true" stored="true"/>
|
||||
<dynamicField name="Content*" type="text" indexed="true" stored="true"/>
|
||||
|
||||
|
||||
<!-- special fields for dynamic copyField test -->
|
||||
<dynamicField name="dynamic_*" type="string" indexed="true" stored="true"/>
|
||||
<dynamicField name="*_dynamic" type="string" indexed="true" stored="true"/>
|
||||
|
||||
<!-- for testing to ensure that longer patterns are matched first -->
|
||||
<dynamicField name="*aa" type="string" indexed="true" stored="true"/>
|
||||
<dynamicField name="*aaa" type="integer" indexed="false" stored="true"/>
|
||||
|
||||
<!-- ignored becuase not stored or indexed -->
|
||||
<dynamicField name="*_ignored" type="text" indexed="false" stored="false"/>
|
||||
|
||||
</fields>
|
||||
|
||||
<defaultSearchField>text</defaultSearchField>
|
||||
<uniqueKey>id</uniqueKey>
|
||||
|
||||
<!-- copyField commands copy one field to another at the time a document
|
||||
is added to the index. It's used either to index the same field different
|
||||
ways, or to add multiple fields to the same field for easier/faster searching.
|
||||
-->
|
||||
<copyField source="title" dest="title_stemmed"/>
|
||||
<copyField source="title" dest="title_lettertok"/>
|
||||
|
||||
<copyField source="title" dest="text"/>
|
||||
<copyField source="subject" dest="text"/>
|
||||
|
||||
<copyField source="*_t" dest="text"/>
|
||||
|
||||
<!-- dynamic destination -->
|
||||
<copyField source="*_dynamic" dest="dynamic_*"/>
|
||||
|
||||
|
||||
</schema>
|
|
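The dynamicField rules described in the schema comment above (a "*" only at the start or end of a pattern, longer patterns matched first, ties broken by schema order) can be sketched as a small matcher. This is an illustrative standalone sketch, not Solr's actual implementation; the class and method names are assumptions:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class DynamicFieldMatchSketch {
    // True if the field name matches a glob that has "*" only
    // at the start or the end, as dynamicField patterns require.
    static boolean matches(String pattern, String name) {
        if (pattern.startsWith("*")) return name.endsWith(pattern.substring(1));
        if (pattern.endsWith("*")) return name.startsWith(pattern.substring(0, pattern.length() - 1));
        return pattern.equals(name);
    }

    // Longer patterns are tried first; the stable sort preserves
    // schema order for equal-size patterns.
    static String firstMatch(List<String> patterns, String name) {
        return patterns.stream()
                .sorted(Comparator.comparingInt(String::length).reversed())
                .filter(p -> matches(p, name))
                .findFirst().orElse(null);
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList("*aa", "*aaa", "*_i");
        System.out.println(firstMatch(patterns, "myfieldaaa")); // "*aaa" wins over "*aa"
        System.out.println(firstMatch(patterns, "myid_i"));
    }
}
```

This mirrors why the schema declares both `*aa` and `*aaa`: the test verifies the longer pattern wins.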
@ -0,0 +1,359 @@
<?xml version="1.0" ?>

<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<!-- $Id: solrconfig.xml 382610 2006-03-03 01:43:03Z yonik $
     $Source$
     $Name$
-->

<config>

  <jmx />

  <!-- Used to specify an alternate directory to hold all index data.
       It defaults to "index" if not present, and should probably
       not be changed if replication is in use. -->
  <dataDir>${solr.data.dir:./solr/data}</dataDir>

  <indexDefaults>
    <!-- Values here affect all index writers and act as a default
         unless overridden. -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <!--<maxBufferedDocs>1000</maxBufferedDocs>-->
    <!-- Tell Lucene when to flush documents to disk.
         Giving Lucene more memory for indexing means faster indexing at the cost of more RAM.

         If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will flush
         based on whichever limit is hit first.
    -->
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>

    <!--
      Expert: Turn on Lucene's auto commit capability.

      NOTE: Despite the name, this value does not have any relation to Solr's autoCommit functionality.
    -->
    <luceneAutoCommit>false</luceneAutoCommit>

    <!--
      Expert:
      The Merge Policy in Lucene controls how merging is handled by Lucene. The default in 2.3 is the
      LogByteSizeMergePolicy; previous versions used LogDocMergePolicy.

      LogByteSizeMergePolicy chooses segments to merge based on their size. The Lucene 2.2 default,
      LogDocMergePolicy, chose when to merge based on number of documents.

      Other implementations of MergePolicy must have a no-argument constructor.
    -->
    <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>

    <!--
      Expert:
      The Merge Scheduler in Lucene controls how merges are performed. The ConcurrentMergeScheduler (Lucene 2.3 default)
      can perform merges in the background using separate threads. The SerialMergeScheduler (Lucene 2.2 default) does not.
    -->
    <mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>

    <!-- these are global... can't currently override per index -->
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>

    <lockType>single</lockType>
  </indexDefaults>

  <mainIndex>
    <!-- lucene options specific to the main on-disk lucene index -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>

    <unlockOnStartup>true</unlockOnStartup>
  </mainIndex>

  <updateHandler class="solr.DirectUpdateHandler2">

    <!-- autocommit pending docs if certain criteria are met
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>3600000</maxTime>
    </autoCommit>
    -->
    <!-- represents a lower bound on the frequency that commits may
         occur (in seconds). NOTE: not yet implemented

    <commitIntervalLowerBound>0</commitIntervalLowerBound>
    -->

    <!-- The RunExecutableListener executes an external command.
         exe - the name of the executable to run
         dir - dir to use as the current working directory. default="."
         wait - the calling thread waits until the executable returns. default="true"
         args - the arguments to pass to the program. default=nothing
         env - environment variables to set. default=nothing
    -->
    <!-- A postCommit event is fired after every commit
    <listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">/var/opt/resin3/__PORT__/scripts/solr/snapshooter</str>
      <str name="dir">/var/opt/resin3/__PORT__</str>
      <bool name="wait">true</bool>
      <arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
      <arr name="env"> <str>MYVAR=val1</str> </arr>
    </listener>
    -->

  </updateHandler>


  <query>
    <!-- Maximum number of clauses in a boolean query... can affect
         range or wildcard queries that expand to big boolean
         queries. An exception is thrown if exceeded.
    -->
    <maxBooleanClauses>1024</maxBooleanClauses>

    <!-- Cache specification for Filters or DocSets - unordered set of *all* documents
         that match a particular query.
    -->
    <filterCache
      class="solr.search.LRUCache"
      size="512"
      initialSize="512"
      autowarmCount="256"/>

    <queryResultCache
      class="solr.search.LRUCache"
      size="512"
      initialSize="512"
      autowarmCount="1024"/>

    <documentCache
      class="solr.search.LRUCache"
      size="512"
      initialSize="512"
      autowarmCount="0"/>

    <!-- If true, stored fields that are not requested will be loaded lazily. -->
    <enableLazyFieldLoading>true</enableLazyFieldLoading>

    <!--
    <cache name="myUserCache"
      class="solr.search.LRUCache"
      size="4096"
      initialSize="1024"
      autowarmCount="1024"
      regenerator="MyRegenerator"
      />
    -->

    <useFilterForSortedQuery>true</useFilterForSortedQuery>

    <queryResultWindowSize>10</queryResultWindowSize>

    <!-- set maxSize artificially low to exercise both types of sets -->
    <HashDocSet maxSize="3" loadFactor="0.75"/>

    <!-- boolToFilterOptimizer converts boolean clauses with zero boost
         into cached filters if the number of docs selected by the clause exceeds
         the threshold (represented as a fraction of the total index)
    -->
    <boolTofilterOptimizer enabled="false" cacheSize="32" threshold=".05"/>

    <!-- a newSearcher event is fired whenever a new searcher is being prepared
         and there is a current searcher handling requests (aka registered). -->
    <!-- QuerySenderListener takes an array of NamedList and executes a
         local query request for each NamedList in sequence. -->
    <!--
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst> <str name="q">solr</str> <str name="start">0</str> <str name="rows">10</str> </lst>
        <lst> <str name="q">rocks</str> <str name="start">0</str> <str name="rows">10</str> </lst>
      </arr>
    </listener>
    -->

    <!-- a firstSearcher event is fired whenever a new searcher is being
         prepared but there is no current registered searcher to handle
         requests or to gain prewarming data from. -->
    <!--
    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst> <str name="q">fast_warm</str> <str name="start">0</str> <str name="rows">10</str> </lst>
      </arr>
    </listener>
    -->

  </query>


  <!-- An alternate set representation that uses an integer hash to store filters (sets of docids).
       If the set cardinality <= maxSize elements, then HashDocSet will be used instead of the bitset
       based HashBitset. -->

  <!-- requestHandler plugins... incoming queries will be dispatched to the
       correct handler based on the qt (query type) param matching the
       name of registered handlers.
       The "standard" request handler is the default and will be used if qt
       is not specified in the request.
  -->
  <requestHandler name="standard" class="solr.StandardRequestHandler">
    <bool name="httpCaching">true</bool>
  </requestHandler>
  <requestHandler name="dismaxOldStyleDefaults"
                  class="solr.DisMaxRequestHandler" >
    <!-- for historic reasons, DisMaxRequestHandler will use all of
         its init params as "defaults" if there is no "defaults" list
         specified
    -->
    <float name="tie">0.01</float>
    <str name="qf">
      text^0.5 features_t^1.0 subject^1.4 title_stemmed^2.0
    </str>
    <str name="pf">
      text^0.2 features_t^1.1 subject^1.4 title_stemmed^2.0 title^1.5
    </str>
    <str name="bf">
      ord(weight)^0.5 recip(rord(iind),1,1000,1000)^0.3
    </str>
    <str name="mm">
      3<-1 5<-2 6<90%
    </str>
    <int name="ps">100</int>
  </requestHandler>
  <requestHandler name="dismax" class="solr.DisMaxRequestHandler" >
    <lst name="defaults">
      <str name="q.alt">*:*</str>
      <float name="tie">0.01</float>
      <str name="qf">
        text^0.5 features_t^1.0 subject^1.4 title_stemmed^2.0
      </str>
      <str name="pf">
        text^0.2 features_t^1.1 subject^1.4 title_stemmed^2.0 title^1.5
      </str>
      <str name="bf">
        ord(weight)^0.5 recip(rord(iind),1,1000,1000)^0.3
      </str>
      <str name="mm">
        3<-1 5<-2 6<90%
      </str>
      <int name="ps">100</int>
    </lst>
  </requestHandler>
  <requestHandler name="old" class="solr.tst.OldRequestHandler" >
    <int name="myparam">1000</int>
    <float name="ratio">1.4142135</float>
    <arr name="myarr"><int>1</int><int>2</int></arr>
    <str>foo</str>
  </requestHandler>
  <requestHandler name="oldagain" class="solr.tst.OldRequestHandler" >
    <lst name="lst1"> <str name="op">sqrt</str> <int name="val">2</int> </lst>
    <lst name="lst2"> <str name="op">log</str> <float name="val">10</float> </lst>
  </requestHandler>

  <requestHandler name="test" class="solr.tst.TestRequestHandler" />

  <!-- test query parameter defaults -->
  <requestHandler name="defaults" class="solr.StandardRequestHandler">
    <lst name="defaults">
      <int name="rows">4</int>
      <bool name="hl">true</bool>
      <str name="hl.fl">text,name,subject,title,whitetok</str>
    </lst>
  </requestHandler>

  <!-- test query parameter defaults -->
  <requestHandler name="lazy" class="solr.StandardRequestHandler" startup="lazy">
    <lst name="defaults">
      <int name="rows">4</int>
      <bool name="hl">true</bool>
      <str name="hl.fl">text,name,subject,title,whitetok</str>
    </lst>
  </requestHandler>

  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
  <requestHandler name="/update/csv" class="solr.CSVRequestHandler" startup="lazy">
    <bool name="httpCaching">false</bool>
  </requestHandler>

  <requestHandler name="/update/extract" class="org.apache.solr.handler.ExtractingRequestHandler"/>


  <highlighting>
    <!-- Configure the standard fragmenter -->
    <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
      <lst name="defaults">
        <int name="hl.fragsize">100</int>
      </lst>
    </fragmenter>

    <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
      <lst name="defaults">
        <int name="hl.fragsize">70</int>
      </lst>
    </fragmenter>

    <!-- Configure the standard formatter -->
    <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
      <lst name="defaults">
        <str name="hl.simple.pre"><![CDATA[<em>]]></str>
        <str name="hl.simple.post"><![CDATA[</em>]]></str>
      </lst>
    </formatter>
  </highlighting>


  <!-- enable streaming for testing... -->
  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />
    <httpCaching lastModifiedFrom="openTime" etagSeed="Solr" never304="false">
      <cacheControl>max-age=30, public</cacheControl>
    </httpCaching>
  </requestDispatcher>

  <admin>
    <defaultQuery>solr</defaultQuery>
    <gettableFiles>solrconfig.xml schema.xml admin-extra.html</gettableFiles>
  </admin>

  <!-- test getting system property -->
  <propTest attr1="${solr.test.sys.prop1}-$${literal}"
            attr2="${non.existent.sys.prop:default-from-config}">prefix-${solr.test.sys.prop2}-suffix</propTest>

</config>
@ -0,0 +1,16 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
stopworda
stopwordb
@ -0,0 +1,22 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
a => aa
b => b1 b2
c => c1,c2
a\=>a => b\=>b
a\,a => b\,b
foo,bar,baz

Television,TV,Televisions
@ -0,0 +1,18 @@
Solr Version Control System

Overview

The Solr source code resides in the Apache Subversion (SVN) repository.
The command-line SVN client can be obtained here or as an optional package
for cygwin.

The TortoiseSVN GUI client for Windows can be obtained here. There
are also SVN plugins available for older versions of Eclipse and
IntelliJ IDEA that don't have subversion support already included.

-------------------------------

Note: This document is an excerpt from a document Licensed to the
Apache Software Foundation (ASF) under one or more contributor
license agreements. See the XML version (version_control.xml) for
more details.
@ -0,0 +1,42 @@
<?xml version="1.0"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<document>

  <header>
    <title>Solr Version Control System</title>
  </header>

  <body>

    <section>
      <title>Overview</title>
      <p>
        The Solr source code resides in the Apache <a href="http://subversion.tigris.org/">Subversion (SVN)</a> repository.
        The command-line SVN client can be obtained <a href="http://subversion.tigris.org/project_packages.html">here</a> or as an optional package for <a href="http://www.cygwin.com/">cygwin</a>.
        The TortoiseSVN GUI client for Windows can be obtained <a href="http://tortoisesvn.tigris.org/">here</a>. There
        are also SVN plugins available for older versions of <a href="http://subclipse.tigris.org/">Eclipse</a> and
        <a href="http://svnup.tigris.org/">IntelliJ IDEA</a> that don't have subversion support already included.
      </p>
    </section>
    <p>Here is some more text. It contains <a href="http://lucene.apache.org">a link</a>. </p>
    <p>Text Here</p>
  </body>

</document>
@ -105,6 +105,8 @@
  </target>
  -->

  <target name="example" depends="build"/>


  <!-- do nothing for now, required to generate maven artifacts -->
  <target name="build"/>
@ -121,4 +121,6 @@
    <!-- TODO: Autolaunch Solr -->
  </target>

  <target name="example" depends="build"/>

</project>
@ -0,0 +1,200 @@
package org.apache.solr.common.util;
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import java.util.Collection;
import java.util.Date;
import java.util.Iterator;
import java.util.Locale;
import java.util.TimeZone;


/**
 * This class has some code from HttpClient DateUtil.
 */
public class DateUtil {
  //start HttpClient
  /**
   * Date format pattern used to parse HTTP date headers in RFC 1123 format.
   */
  public static final String PATTERN_RFC1123 = "EEE, dd MMM yyyy HH:mm:ss zzz";

  /**
   * Date format pattern used to parse HTTP date headers in RFC 1036 format.
   */
  public static final String PATTERN_RFC1036 = "EEEE, dd-MMM-yy HH:mm:ss zzz";

  /**
   * Date format pattern used to parse HTTP date headers in ANSI C
   * <code>asctime()</code> format.
   */
  public static final String PATTERN_ASCTIME = "EEE MMM d HH:mm:ss yyyy";
  //These are included for back compat
  private static final Collection<String> DEFAULT_HTTP_CLIENT_PATTERNS = Arrays.asList(
      PATTERN_ASCTIME, PATTERN_RFC1036, PATTERN_RFC1123);

  private static final Date DEFAULT_TWO_DIGIT_YEAR_START;

  static {
    Calendar calendar = Calendar.getInstance();
    calendar.set(2000, Calendar.JANUARY, 1, 0, 0);
    DEFAULT_TWO_DIGIT_YEAR_START = calendar.getTime();
  }

  private static final TimeZone GMT = TimeZone.getTimeZone("GMT");

  //end HttpClient

  //---------------------------------------------------------------------------------------

  /**
   * A suite of default date formats that can be parsed, and thus transformed to the Solr specific format
   */
  public static final Collection<String> DEFAULT_DATE_FORMATS = new ArrayList<String>();

  static {
    DEFAULT_DATE_FORMATS.add("yyyy-MM-dd'T'HH:mm:ss'Z'");
    DEFAULT_DATE_FORMATS.add("yyyy-MM-dd'T'HH:mm:ss");
    DEFAULT_DATE_FORMATS.add("yyyy-MM-dd");
    DEFAULT_DATE_FORMATS.add("yyyy-MM-dd hh:mm:ss");
    DEFAULT_DATE_FORMATS.add("yyyy-MM-dd HH:mm:ss");
    DEFAULT_DATE_FORMATS.add("EEE MMM d hh:mm:ss z yyyy");
    DEFAULT_DATE_FORMATS.addAll(DEFAULT_HTTP_CLIENT_PATTERNS);
  }

  /**
   * Parses the input date string using the default date formats.
   *
   * @param d The input date to parse
   * @return The parsed {@link java.util.Date}
   * @throws java.text.ParseException If the input can't be parsed
   */
  public static Date parseDate(String d) throws ParseException {
    return parseDate(d, DEFAULT_DATE_FORMATS);
  }

  public static Date parseDate(String d, Collection<String> fmts) throws ParseException {
    // 2007-04-26T08:05:04Z
    if (d.endsWith("Z") && d.length() > 20) {
      return getThreadLocalDateFormat().parse(d);
    }
    return parseDate(d, fmts, null);
  }

  /**
   * Slightly modified from org.apache.commons.httpclient.util.DateUtil.parseDate
   * <p/>
   * Parses the date value using the given date formats.
   *
   * @param dateValue   the date value to parse
   * @param dateFormats the date formats to use
   * @param startDate   During parsing, two digit years will be placed in the range
   *                    <code>startDate</code> to <code>startDate + 100 years</code>. This value may
   *                    be <code>null</code>. When <code>null</code> is given as a parameter, year
   *                    <code>2000</code> will be used.
   * @return the parsed date
   * @throws ParseException if none of the dateFormats could parse the dateValue
   */
  public static Date parseDate(
      String dateValue,
      Collection<String> dateFormats,
      Date startDate
  ) throws ParseException {

    if (dateValue == null) {
      throw new IllegalArgumentException("dateValue is null");
    }
    if (dateFormats == null) {
      dateFormats = DEFAULT_HTTP_CLIENT_PATTERNS;
    }
    if (startDate == null) {
      startDate = DEFAULT_TWO_DIGIT_YEAR_START;
    }
    // trim single quotes around date if present
    // see issue #5279
    if (dateValue.length() > 1
        && dateValue.startsWith("'")
        && dateValue.endsWith("'")
    ) {
      dateValue = dateValue.substring(1, dateValue.length() - 1);
    }

    SimpleDateFormat dateParser = null;
    Iterator formatIter = dateFormats.iterator();

    while (formatIter.hasNext()) {
      String format = (String) formatIter.next();
      if (dateParser == null) {
        dateParser = new SimpleDateFormat(format, Locale.US);
        dateParser.setTimeZone(GMT);
        dateParser.set2DigitYearStart(startDate);
      } else {
        dateParser.applyPattern(format);
      }
      try {
        return dateParser.parse(dateValue);
      } catch (ParseException pe) {
        // ignore this exception, we will try the next format
      }
    }

    // we were unable to parse the date
    throw new ParseException("Unable to parse the date " + dateValue, 0);
  }


  /**
   * Returns a formatter that can be used by the current thread if needed to
   * convert Date objects to the internal representation.
   *
   * @return The {@link java.text.DateFormat} for the current thread
   */
  public static DateFormat getThreadLocalDateFormat() {
    return fmtThreadLocal.get();
  }

  public static TimeZone UTC = TimeZone.getTimeZone("UTC");
  private static ThreadLocalDateFormat fmtThreadLocal = new ThreadLocalDateFormat();

  private static class ThreadLocalDateFormat extends ThreadLocal<DateFormat> {
    DateFormat proto;

    public ThreadLocalDateFormat() {
      super();
      //2007-04-26T08:05:04Z
      SimpleDateFormat tmp = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
      tmp.setTimeZone(UTC);
      proto = tmp;
    }

    @Override
    protected DateFormat initialValue() {
      return (DateFormat) proto.clone();
    }
  }

}
|
|
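The ThreadLocalDateFormat class above exists because SimpleDateFormat is not thread-safe: each thread gets its own clone of a configured prototype. A minimal standalone sketch of the same clone-per-thread idiom (the class and method names here are illustrative, not part of the patch):

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class ThreadLocalFormatDemo {
    // SimpleDateFormat is not thread-safe; give each thread its own copy
    // by overriding initialValue(), mirroring ThreadLocalDateFormat above.
    static final ThreadLocal<DateFormat> FMT = new ThreadLocal<DateFormat>() {
        @Override
        protected DateFormat initialValue() {
            SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
            f.setTimeZone(TimeZone.getTimeZone("UTC"));
            return f;
        }
    };

    static String format(Date d) {
        // get() lazily builds one DateFormat per calling thread
        return FMT.get().format(d);
    }

    public static void main(String[] args) {
        System.out.println(format(new Date(0L))); // prints "1970-01-01T00:00:00.000Z"
    }
}
```

Cloning a prototype (as the patch does) and overriding initialValue() directly (as here) are equivalent; the point in both is that no DateFormat instance is ever shared across threads.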
@@ -31,6 +31,7 @@ import java.util.*;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 import java.nio.charset.Charset;
+import java.lang.reflect.Constructor;
 
 import javax.naming.Context;
 import javax.naming.InitialContext;
@@ -308,6 +309,36 @@ public class SolrResourceLoader implements ResourceLoader
     }
     return obj;
   }
+
+  public Object newInstance(String cName, String [] subPackages, Class[] params, Object[] args){
+    Class clazz = findClass(cName,subPackages);
+    if( clazz == null ) {
+      throw new SolrException( SolrException.ErrorCode.SERVER_ERROR,
+          "Can not find class: "+cName + " in " + classLoader, false);
+    }
+
+    Object obj = null;
+    try {
+      Constructor constructor = clazz.getConstructor(params);
+      obj = constructor.newInstance(args);
+    }
+    catch (Exception e) {
+      throw new SolrException( SolrException.ErrorCode.SERVER_ERROR,
+          "Error instantiating class: '" + clazz.getName()+"'", e, false );
+    }
+
+    if( obj instanceof SolrCoreAware ) {
+      assertAwareCompatibility( SolrCoreAware.class, obj );
+      waitingForCore.add( (SolrCoreAware)obj );
+    }
+    if( obj instanceof ResourceLoaderAware ) {
+      assertAwareCompatibility( ResourceLoaderAware.class, obj );
+      waitingForResources.add( (ResourceLoaderAware)obj );
+    }
+    return obj;
+  }
 
 
   /**
    * Tell all {@link SolrCoreAware} instances about the SolrCore
@@ -436,4 +467,4 @@ public class SolrResourceLoader implements ResourceLoader
       throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, builder.toString() );
     }
   }
 }
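The newInstance overload added above resolves a constructor by its parameter types and invokes it reflectively with matching arguments. A standalone sketch of that lookup-and-invoke pattern, using a hypothetical helper and java.lang.StringBuilder as a stand-in for a real Solr plugin class:

```java
import java.lang.reflect.Constructor;

public class ReflectDemo {
    // Resolve a constructor whose signature matches `params`, then invoke it
    // with `args` -- the same reflection steps as newInstance(...) above.
    static Object construct(String className, Class<?>[] params, Object[] args)
            throws Exception {
        Class<?> clazz = Class.forName(className);
        Constructor<?> ctor = clazz.getConstructor(params);
        return ctor.newInstance(args);
    }

    public static void main(String[] args) throws Exception {
        Object sb = construct("java.lang.StringBuilder",
                new Class<?>[]{String.class}, new Object[]{"hello"});
        System.out.println(sb); // prints "hello"
    }
}
```

Note that getConstructor matches parameter types exactly (no boxing or widening), which is why callers of the Solr method must pass a `params` array that mirrors the target constructor's declared signature.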
|
@ -315,11 +315,7 @@ public class TestHarness {
|
|||
* @see LocalSolrQueryRequest
|
||||
*/
|
||||
public String query(String handler, SolrQueryRequest req) throws IOException, Exception {
|
||||
SolrQueryResponse rsp = new SolrQueryResponse();
|
||||
core.execute(core.getRequestHandler(handler),req,rsp);
|
||||
if (rsp.getException() != null) {
|
||||
throw rsp.getException();
|
||||
}
|
||||
SolrQueryResponse rsp = queryAndResponse(handler, req);
|
||||
|
||||
StringWriter sw = new StringWriter(32000);
|
||||
QueryResponseWriter responseWriter = core.getQueryResponseWriter(req);
|
||||
|
@ -330,6 +326,15 @@ public class TestHarness {
|
|||
return sw.toString();
|
||||
}
|
||||
|
||||
public SolrQueryResponse queryAndResponse(String handler, SolrQueryRequest req) throws Exception {
|
||||
SolrQueryResponse rsp = new SolrQueryResponse();
|
||||
core.execute(core.getRequestHandler(handler),req,rsp);
|
||||
if (rsp.getException() != null) {
|
||||
throw rsp.getException();
|
||||
}
|
||||
return rsp;
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* A helper method which valides a String against an array of XPath test
|
||||
|
|
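The TestHarness change extracts the execute-and-rethrow logic into queryAndResponse, so tests can work with the raw SolrQueryResponse instead of serialized output. The underlying pattern is that handlers record failures on the response object rather than throwing, and the caller rethrows on demand; a minimal sketch with hypothetical stand-in classes (Response here is not the real SolrQueryResponse):

```java
public class ResponseDemo {
    // Hypothetical stand-in for SolrQueryResponse: execution records a
    // failure on the response instead of throwing it immediately.
    static class Response {
        private Exception exception;
        void setException(Exception e) { this.exception = e; }
        Exception getException() { return exception; }
    }

    // Stand-in for core.execute(...): never throws, only records.
    static Response execute(boolean fail) {
        Response rsp = new Response();
        if (fail) {
            rsp.setException(new IllegalStateException("handler failed"));
        }
        return rsp;
    }

    // Mirrors queryAndResponse(): surface a captured failure as a thrown
    // exception, otherwise hand back the response for inspection.
    static Response queryAndResponse(boolean fail) throws Exception {
        Response rsp = execute(fail);
        if (rsp.getException() != null) {
            throw rsp.getException();
        }
        return rsp;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(queryAndResponse(false).getException() == null); // prints "true"
    }
}
```

Both the old query body and the new helper implement this same rethrow step; the refactoring only moves it so other callers can reuse it.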