mirror of https://github.com/apache/lucene.git
SOLR-284: Solr Cell: Add support for Tika content extraction
git-svn-id: https://svn.apache.org/repos/asf/lucene/solr/trunk@723977 13f79535-47bb-0310-9956-ffa450edef68
parent 474ab9a515
commit cedd07b500

CHANGES.txt
@@ -98,6 +98,8 @@ New Features
    can be specified.
    (Georgios Stamatis, Lars Kotthoff, Chris Harris via koji)

+20. SOLR-284: Added support for extracting content from binary documents like MS Word and PDF using Apache Tika.  See also contrib/extraction/CHANGES.txt (Eric Pugh, Chris Harris, gsingers)
+
Optimizations
----------------------
 1. SOLR-374: Use IndexReader.reopen to save resources by re-using parts of the
LICENSE.txt
@@ -261,9 +261,9 @@ such code.
1.13. You (or Your) means an individual or a legal entity exercising rights
under, and complying with all of the terms of, this License. For legal
entities, You includes any entity which controls, is controlled by, or is under
common control with You. For purposes of this definition, control means (a) the
power, direct or indirect, to cause the direction or management of such entity,
whether by contract or otherwise, or (b) ownership of more than fifty percent
(50%) of the outstanding shares or beneficial ownership of such entity.

2. License Grants.
@@ -278,12 +278,12 @@ with or without Modifications, and/or as part of a Larger Work; and (b) under
Patent Claims infringed by the making, using or selling of Original Software,
to make, have made, use, practice, sell, and offer for sale, and/or otherwise
dispose of the Original Software (or portions thereof). (c) The licenses
granted in Sections 2.1(a) and (b) are effective on the date Initial Developer
first distributes or otherwise makes the Original Software available to a third
party under the terms of this License. (d) Notwithstanding Section 2.1(b)
above, no patent license is granted: (1) for code that You delete from the
Original Software, or (2) for infringements caused by: (i) the modification of
the Original Software, or (ii) the combination of the Original Software with
other software or devices.

2.2. Contributor Grant. Conditioned upon Your compliance with Section 3.1
@@ -297,17 +297,17 @@ and/or as part of a Larger Work; and (b) under Patent Claims infringed by the
making, using, or selling of Modifications made by that Contributor either
alone and/or in combination with its Contributor Version (or portions of such
combination), to make, use, sell, offer for sale, have made, and/or otherwise
dispose of: (1) Modifications made by that Contributor (or portions thereof);
and (2) the combination of Modifications made by that Contributor with its
Contributor Version (or portions of such combination). (c) The licenses
granted in Sections 2.2(a) and 2.2(b) are effective on the date Contributor
first distributes or otherwise makes the Modifications available to a third
party. (d) Notwithstanding Section 2.2(b) above, no patent license is granted:
(1) for any code that Contributor has deleted from the Contributor Version;
(2) for infringements caused by: (i) third party modifications of Contributor
Version, or (ii) the combination of Modifications made by that Contributor with
other software (except as part of the Contributor Version) or other devices; or
(3) under Patent Claims infringed by Covered Software in the absence of
Modifications made by that Contributor.

3. Distribution Obligations.
@@ -389,9 +389,9 @@ License published by the license steward. 4.3. Modified Versions.

When You are an Initial Developer and You want to create a new license for Your
Original Software, You may create and use a modified version of this License if
You: (a) rename the license and remove any references to the name of the
license steward (except to note that the license differs from this License);
and (b) otherwise make it clear that the license contains terms which differ
from this License.

5. DISCLAIMER OF WARRANTY.
@@ -422,14 +422,14 @@ the Participant is a Contributor or the Original Software where the Participant
is the Initial Developer) directly or indirectly infringes any patent, then any
and all rights granted directly or indirectly to You by such Participant, the
Initial Developer (if the Initial Developer is not the Participant) and all
Contributors under Sections 2.1 and/or 2.2 of this License shall, upon 60 days
notice from Participant terminate prospectively and automatically at the
expiration of such 60 day notice period, unless if within such 60 day period
You withdraw Your claim with respect to the Participant Software against such
Participant either unilaterally or pursuant to a written agreement with
Participant.

6.3. In the event of termination under Sections 6.1 or 6.2 above, all end user
licenses that have been validly granted by You or any distributor hereunder
prior to termination (excluding licenses granted to You by any distributor)
shall survive termination.
|
@ -453,9 +453,9 @@ LIMITATION MAY NOT APPLY TO YOU.
|
||||||
8. U.S. GOVERNMENT END USERS.
|
8. U.S. GOVERNMENT END USERS.
|
||||||
|
|
||||||
The Covered Software is a commercial item, as that term is defined in
|
The Covered Software is a commercial item, as that term is defined in
|
||||||
48áC.F.R.á2.101 (Oct. 1995), consisting of commercial computer software (as
|
48<EFBFBD>C.F.R.<2E>2.101 (Oct. 1995), consisting of commercial computer software (as
|
||||||
that term is defined at 48 C.F.R. á252.227-7014(a)(1)) and commercial computer
|
that term is defined at 48 C.F.R. <EFBFBD>252.227-7014(a)(1)) and commercial computer
|
||||||
software documentation as such terms are used in 48áC.F.R.á12.212 (Sept. 1995).
|
software documentation as such terms are used in 48<EFBFBD>C.F.R.<2E>12.212 (Sept. 1995).
|
||||||
Consistent with 48 C.F.R. 12.212 and 48 C.F.R. 227.7202-1 through 227.7202-4
|
Consistent with 48 C.F.R. 12.212 and 48 C.F.R. 227.7202-1 through 227.7202-4
|
||||||
(June 1995), all U.S. Government End Users acquire Covered Software with only
|
(June 1995), all U.S. Government End Users acquire Covered Software with only
|
||||||
those rights set forth herein. This U.S. Government Rights clause is in lieu
|
those rights set forth herein. This U.S. Government Rights clause is in lieu
|
||||||
|
@ -736,3 +736,161 @@ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
|
||||||
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
|
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
|
||||||
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||||
|
|
||||||
|
===========================================================================
|
||||||
|
Apache Tika Licenses - contrib/extraction
|
||||||
|
---------------------------------------------------------------------------
|
||||||
|
Apache Tika is licensed under the ASL 2.0. See above for the text of the license
|
||||||
|
|
||||||
|
APACHE TIKA SUBCOMPONENTS
|
||||||
|
|
||||||
|
Apache Tika includes a number of subcomponents with separate copyright notices
|
||||||
|
and license terms. Your use of these subcomponents is subject to the terms and
|
||||||
|
conditions of the following licenses.
|
||||||
|
|
||||||
|
Bouncy Castle libraries (bcmail and bcprov)
|
||||||
|
|
||||||
|
Copyright (c) 2000-2006 The Legion Of The Bouncy Castle
|
||||||
|
(http://www.bouncycastle.org)
|
||||||
|
|
||||||
|
Permission is hereby granted, free of charge, to any person obtaining
|
||||||
|
a copy of this software and associated documentation files
|
||||||
|
(the "Software"), to deal in the Software without restriction,
|
||||||
|
including without limitation the rights to use, copy, modify, merge,
|
||||||
|
publish, distribute, sublicense, and/or sell copies of the Software,
|
||||||
|
and to permit persons to whom the Software is furnished to do so,
|
||||||
|
subject to the following conditions:
|
||||||
|
|
||||||
|
The above copyright notice and this permission notice shall be included
|
||||||
|
in all copies or substantial portions of the Software.
|
||||||
|
|
||||||
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
|
||||||
|
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||||
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
|
||||||
|
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
|
||||||
|
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
|
||||||
|
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
|
||||||
|
OTHER DEALINGS IN THE SOFTWARE.
|
||||||
|
|
||||||
|
PDFBox library (pdfbox)
|
||||||
|
|
||||||
|
Copyright (c) 2003-2005, www.pdfbox.org
|
||||||
|
All rights reserved.
|
||||||
|
|
||||||
|
Redistribution and use in source and binary forms, with or without
|
||||||
|
modification, are permitted provided that the following conditions are met:
|
||||||
|
|
||||||
|
1. Redistributions of source code must retain the above copyright notice,
|
||||||
|
this list of conditions and the following disclaimer.
|
||||||
|
2. Redistributions in binary form must reproduce the above copyright notice,
|
||||||
|
this list of conditions and the following disclaimer in the documentation
|
||||||
|
and/or other materials provided with the distribution.
|
||||||
|
3. Neither the name of pdfbox; nor the names of its
|
||||||
|
contributors may be used to endorse or promote products derived from this
|
||||||
|
software without specific prior written permission.
|
||||||
|
|
||||||
|
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||||
|
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||||
|
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||||
|
ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
||||||
|
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||||
|
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
||||||
|
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
||||||
|
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
||||||
|
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
||||||
|
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
|
||||||
|
OF SUCH DAMAGE.
|
||||||
|
|
||||||
|
FontBox and JempBox libraries (fontbox, jempbox)
|
||||||
|
|
||||||
|
Copyright (c) 2003-2005, www.fontbox.org
|
||||||
|
All rights reserved.
|
||||||
|
|
||||||
|
Redistribution and use in source and binary forms, with or without
|
||||||
|
modification, are permitted provided that the following conditions are met:
|
||||||
|
|
||||||
|
1. Redistributions of source code must retain the above copyright notice,
|
||||||
|
this list of conditions and the following disclaimer.
|
||||||
|
2. Redistributions in binary form must reproduce the above copyright notice,
|
||||||
|
this list of conditions and the following disclaimer in the documentation
|
||||||
|
and/or other materials provided with the distribution.
|
||||||
|
3. Neither the name of fontbox; nor the names of its
|
||||||
|
contributors may be used to endorse or promote products derived from this
|
||||||
|
software without specific prior written permission.
|
||||||
|
|
||||||
|
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||||
|
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||||
|
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||||
|
ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
||||||
|
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
||||||
|
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
||||||
|
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
||||||
|
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
||||||
|
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
||||||
|
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
|
||||||
|
OF SUCH DAMAGE.
|
||||||
|
|
||||||
|
ICU4J library (icu4j)
|
||||||
|
|
||||||
|
Copyright (c) 1995-2005 International Business Machines Corporation
|
||||||
|
and others
|
||||||
|
|
||||||
|
All rights reserved.
|
||||||
|
|
||||||
|
Permission is hereby granted, free of charge, to any person obtaining
|
||||||
|
a copy of this software and associated documentation files (the
|
||||||
|
"Software"), to deal in the Software without restriction, including
|
||||||
|
without limitation the rights to use, copy, modify, merge, publish,
|
||||||
|
distribute, and/or sell copies of the Software, and to permit persons
|
||||||
|
to whom the Software is furnished to do so, provided that the above
|
||||||
|
copyright notice(s) and this permission notice appear in all copies
|
||||||
|
of the Software and that both the above copyright notice(s) and this
|
||||||
|
permission notice appear in supporting documentation.
|
||||||
|
|
||||||
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
|
||||||
|
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||||
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS.
|
||||||
|
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE
|
||||||
|
BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES,
|
||||||
|
OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
|
||||||
|
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
|
||||||
|
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
|
||||||
|
SOFTWARE.
|
||||||
|
|
||||||
|
Except as contained in this notice, the name of a copyright holder shall
|
||||||
|
not be used in advertising or otherwise to promote the sale, use or other
|
||||||
|
dealings in this Software without prior written authorization of the
|
||||||
|
copyright holder.
|
||||||
|
|
||||||
|
ASM library (asm)
|
||||||
|
|
||||||
|
Copyright (c) 2000-2005 INRIA, France Telecom
|
||||||
|
All rights reserved.
|
||||||
|
|
||||||
|
Redistribution and use in source and binary forms, with or without
|
||||||
|
modification, are permitted provided that the following conditions
|
||||||
|
are met:
|
||||||
|
|
||||||
|
1. Redistributions of source code must retain the above copyright
|
||||||
|
notice, this list of conditions and the following disclaimer.
|
||||||
|
|
||||||
|
2. Redistributions in binary form must reproduce the above copyright
|
||||||
|
notice, this list of conditions and the following disclaimer in the
|
||||||
|
documentation and/or other materials provided with the distribution.
|
||||||
|
|
||||||
|
3. Neither the name of the copyright holders nor the names of its
|
||||||
|
contributors may be used to endorse or promote products derived from
|
||||||
|
this software without specific prior written permission.
|
||||||
|
|
||||||
|
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
|
||||||
|
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
||||||
|
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
||||||
|
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
|
||||||
|
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
|
||||||
|
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
|
||||||
|
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
|
||||||
|
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
|
||||||
|
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
|
||||||
|
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
|
||||||
|
THE POSSIBILITY OF SUCH DAMAGE.
|
||||||
|
|
||||||
|
|
||||||
|
|
NOTICE.txt
@@ -113,3 +113,24 @@ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

=========================================================================
== Apache Tika Notices ==
=========================================================================

The following notices apply to the Apache Tika libraries in contrib/extraction/lib:

This product includes software developed by the following copyright owners:

Copyright (c) 2000-2006 The Legion Of The Bouncy Castle
(http://www.bouncycastle.org)

Copyright (c) 2003-2005, www.pdfbox.org

Copyright (c) 2003-2005, www.fontbox.org

Copyright (c) 1995-2005 International Business Machines Corporation and others

Copyright (c) 2000-2005 INRIA, France Telecom
build.xml
@@ -30,8 +30,7 @@
  <!-- Destination for distribution files (demo WAR, src distro, etc.) -->
  <property name="dist" value="dist" />

- <!-- Example directory -->
- <property name="example" value="example" />

  <property name="clover.db.dir" location="${dest}/tests/clover/db"/>
  <property name="clover.report.dir" location="${dest}/tests/clover/reports"/>
@@ -612,7 +611,7 @@
  <target name="example"
          description="Creates a runnable example configuration."
-         depends="init-forrest-entities,dist-contrib,dist-war">
+         depends="init-forrest-entities,dist-contrib,dist-war,example-contrib">
    <copy file="${dist}/${fullnamever}.war"
          tofile="${example}/webapps/${ant.project.name}.war"/>
    <jar destfile="${example}/exampledocs/post.jar"
@@ -624,7 +623,7 @@
           value="org.apache.solr.util.SimplePostTool"/>
      </manifest>
    </jar>

    <copy todir="${example}/solr/bin">
      <fileset dir="${src}/scripts">
        <exclude name="scripts.conf"/>
ClientUtils.java
@@ -23,17 +23,14 @@ import java.io.Writer;
import java.net.URLEncoder;
import java.text.DateFormat;
import java.text.ParseException;
-import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Date;
import java.util.Iterator;
import java.util.TimeZone;
-import java.util.regex.Matcher;
-import java.util.regex.Pattern;

import org.apache.commons.httpclient.util.DateParseException;
-import org.apache.commons.httpclient.util.DateUtil;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.SolrInputField;
@@ -41,6 +38,7 @@ import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.ContentStream;
import org.apache.solr.common.util.ContentStreamBase;
import org.apache.solr.common.util.XML;
+import org.apache.solr.common.util.DateUtil;

/**
@@ -61,17 +59,17 @@ public class ClientUtils
  {
    if( str == null )
      return null;

    ArrayList<ContentStream> streams = new ArrayList<ContentStream>( 1 );
    ContentStreamBase ccc = new ContentStreamBase.StringStream( str );
    ccc.setContentType( contentType );
    streams.add( ccc );
    return streams;
  }

  /**
   * @param d SolrDocument to convert
   * @return a SolrInputDocument with the same fields and values as the
   *   SolrDocument.  All boosts are 1.0f
   */
  public static SolrInputDocument toSolrInputDocument( SolrDocument d )
@@ -95,38 +93,38 @@ public class ClientUtils
    }
    return doc;
  }

  //------------------------------------------------------------------------
  //------------------------------------------------------------------------

  public static void writeXML( SolrInputDocument doc, Writer writer ) throws IOException
  {
    writer.write("<doc boost=\""+doc.getDocumentBoost()+"\">");

    for( SolrInputField field : doc ) {
      float boost = field.getBoost();
      String name = field.getName();
      for( Object v : field ) {
        if (v instanceof Date) {
-         v = fmtThreadLocal.get().format( (Date)v );
+         v = DateUtil.getThreadLocalDateFormat().format( (Date)v );
        }
        if( boost != 1.0f ) {
          XML.writeXML(writer, "field", v.toString(), "name", name, "boost", boost );
        }
        else {
          XML.writeXML(writer, "field", v.toString(), "name", name );
        }

        // only write the boost for the first multi-valued field
        // otherwise, the used boost is the product of all the boost values
        boost = 1.0f;
      }
    }
    writer.write("</doc>");
  }

  public static String toXML( SolrInputDocument doc )
  {
    StringWriter str = new StringWriter();
    try {
@@ -135,59 +133,45 @@ public class ClientUtils
    catch( Exception ex ){}
    return str.toString();
  }

  //---------------------------------------------------------------------------------------

-  public static final Collection<String> fmts = new ArrayList<String>();
-  static {
-    fmts.add( "yyyy-MM-dd'T'HH:mm:ss'Z'" );
-    fmts.add( "yyyy-MM-dd'T'HH:mm:ss" );
-    fmts.add( "yyyy-MM-dd" );
-  }
+  /**
+   * @deprecated Use {@link org.apache.solr.common.util.DateUtil#DEFAULT_DATE_FORMATS}
+   */
+  public static final Collection<String> fmts = DateUtil.DEFAULT_DATE_FORMATS;

  /**
   * Returns a formatter that can be use by the current thread if needed to
   * convert Date objects to the Internal representation.
   * @throws ParseException
   * @throws DateParseException
+   *
+   * @deprecated Use {@link org.apache.solr.common.util.DateUtil#parseDate(String)}
   */
  public static Date parseDate( String d ) throws ParseException, DateParseException
  {
-    // 2007-04-26T08:05:04Z
-    if( d.endsWith( "Z" ) && d.length() > 20 ) {
-      return getThreadLocalDateFormat().parse( d );
-    }
-    return DateUtil.parseDate( d, fmts );
+    return DateUtil.parseDate(d);
  }

-  /**
-   * Returns a formatter that can be use by the current thread if needed to
-   * convert Date objects to the Internal representation.
-   */
-  public static DateFormat getThreadLocalDateFormat() {
-    return fmtThreadLocal.get();
-  }
-
-  public static TimeZone UTC = TimeZone.getTimeZone("UTC");
-  private static ThreadLocalDateFormat fmtThreadLocal = new ThreadLocalDateFormat();
-
-  private static class ThreadLocalDateFormat extends ThreadLocal<DateFormat> {
-    DateFormat proto;
-    public ThreadLocalDateFormat() {
-      super();
-      //2007-04-26T08:05:04Z
-      SimpleDateFormat tmp = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
-      tmp.setTimeZone(UTC);
-      proto = tmp;
-    }
-    @Override
-    protected DateFormat initialValue() {
-      return (DateFormat) proto.clone();
-    }
-  }
+  /**
+   * Returns a formatter that can be use by the current thread if needed to
+   * convert Date objects to the Internal representation.
+   *
+   * @deprecated use {@link org.apache.solr.common.util.DateUtil#getThreadLocalDateFormat()}
+   */
+  public static DateFormat getThreadLocalDateFormat() {
+    return DateUtil.getThreadLocalDateFormat();
+  }
+
+  /**
+   * @deprecated Use {@link org.apache.solr.common.util.DateUtil#UTC}.
+   */
+  public static TimeZone UTC = DateUtil.UTC;

  /**
   * See: http://lucene.apache.org/java/docs/queryparsersyntax.html#Escaping Special Characters
   */
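The net effect of the hunk above is that ClientUtils keeps its date helpers only as deprecated forwards to org.apache.solr.common.util.DateUtil. A minimal sketch of what calling code migrates to, using only the DateUtil methods named in the hunk; the wrapper class and the round-trip printout are illustrative, not part of the commit:

import java.util.Date;
import org.apache.solr.common.util.DateUtil;

public class DateUtilMigrationSketch {
  public static void main(String[] args) throws Exception {
    // Format a Date into Solr's internal representation using the shared
    // thread-local formatter that ClientUtils now delegates to.
    String formatted = DateUtil.getThreadLocalDateFormat().format(new Date());

    // Parse it back; DateUtil.parseDate(String) is the documented replacement
    // for the removed ClientUtils date-parsing logic.
    Date roundTripped = DateUtil.parseDate(formatted);

    System.out.println(formatted + " -> " + roundTripped);
  }
}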
@@ -206,7 +190,7 @@ public class ClientUtils
    }
    return sb.toString();
  }

  public static String toQueryString( SolrParams params, boolean xml ) {
    StringBuilder sb = new StringBuilder(128);
    try {
common-build.xml
@@ -38,6 +38,9 @@
    <format property="dateversion" pattern="yyyy.MM.dd.HH.mm.ss" />
  </tstamp>

+ <!-- Example directory -->
+ <property name="example" value="${common.dir}/example" />
+
  <!--
   we attempt to exec svnversion to get details build information
   for jar manifests.  this property can be set at runtime to an
@@ -332,6 +335,10 @@
    <contrib-crawl target="dist" failonerror="true" />
  </target>

+  <target name="example-contrib" description="Tell the contrib to add their stuff to examples">
+    <contrib-crawl target="example" failonerror="true" />
+  </target>
+
  <!-- Creates a Manifest file for Jars and WARs -->
  <target name="make-manifest">
    <!-- If possible, include the svnversion -->
@@ -121,6 +121,8 @@
      </sources>
    </invoke-javadoc>
  </sequential>
  </target>

+  <target name="example" depends="build"/>
+
</project>
contrib/extraction/CHANGES.txt (new file)
@@ -0,0 +1,25 @@
Apache Solr Content Extraction Library (Solr Cell)
Version 1.4-dev
Release Notes

This file describes changes to the Solr Cell (contrib/extraction) module.  See SOLR-284 for details.

Introduction
------------

Apache Solr Extraction provides a means for extracting and indexing content contained in "rich" documents, such
as Microsoft Word, Adobe PDF, etc.  (Each name is a trademark of their respective owners)  This contrib module
uses Apache Tika to extract content and metadata from the files, which can then be indexed.  For more information,
see http://wiki.apache.org/solr/ExtractingRequestHandler

Getting Started
---------------
You will need Solr up and running.  Then, simply add the extraction JAR file, plus the Tika dependencies (in the ./lib folder)
to your Solr Home lib directory.  See http://wiki.apache.org/solr/ExtractingRequestHandler for more details on hooking it in
and configuring.
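As a concrete illustration of the paragraph above, the sketch below streams one local file to the extraction handler over plain HTTP. It is only a sketch: the handler URL (/solr/update/extract), the literal "id" field, the default field "text", and the file name are assumptions about how the handler is registered in solrconfig.xml and how the schema is laid out, not anything this file specifies.

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PostRichDocumentSketch {
  public static void main(String[] args) throws Exception {
    // Assumed handler location and parameters; adjust to the actual solrconfig.xml mapping.
    URL url = new URL("http://localhost:8983/solr/update/extract"
        + "?ext.literal.id=doc1&ext.def.fl=text&commit=true");

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/pdf");

    // Stream the binary document as the request body; Tika does the extraction server-side.
    InputStream in = new FileInputStream("example.pdf");
    OutputStream out = conn.getOutputStream();
    byte[] buf = new byte[8192];
    for (int n; (n = in.read(buf)) != -1; ) {
      out.write(buf, 0, n);
    }
    out.close();
    in.close();

    System.out.println("HTTP " + conn.getResponseCode());
  }
}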

$Id:$
==================  Release 1.4-dev ==================

1. SOLR-284: Added in support for extraction. (Eric Pugh, Chris Harris, gsingers)
contrib/extraction/build.xml (new file)
@@ -0,0 +1,134 @@
<?xml version="1.0"?>

<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<project name="solr-extraction" default="build">

  <property name="solr-path" value="../.." />
  <property name="tika.version" value="0.2-SNAPSHOT"/>
  <property name="tika.lib" value="lib/tika-${tika.version}-standalone.jar"/>

  <import file="../../common-build.xml"/>

  <description>
    Solr Integration with Tika for extracting content from binary file formats such as Microsoft Word and Adobe PDF.
  </description>

  <path id="common.classpath">
    <pathelement location="${solr-path}/build/common" />
    <pathelement location="${solr-path}/build/core" />
    <fileset dir="lib" includes="*.jar"/>
    <fileset dir="${solr-path}/lib" includes="*.jar"></fileset>
  </path>

  <path id="test.classpath">
    <path refid="common.classpath" />
    <pathelement path="${dest}/classes" />
    <pathelement path="${dest}/test-classes" />
    <pathelement path="${java.class.path}"/>
  </path>

  <target name="clean">
    <delete failonerror="false" dir="${dest}"/>
  </target>

  <target name="init">
    <mkdir dir="${dest}/classes"/>
    <mkdir dir="${build.javadoc}" />
    <ant dir="../../" inheritall="false" target="compile" />
    <ant dir="../../" inheritall="false" target="make-manifest" />
  </target>

  <target name="compile" depends="init">
    <solr-javac destdir="${dest}/classes"
                classpathref="common.classpath">
      <src path="src/main/java" />
    </solr-javac>
  </target>

  <target name="build" depends="compile">
    <solr-jar destfile="${dest}/${fullnamever}.jar" basedir="${dest}/classes"
              manifest="${common.dir}/${dest}/META-INF/MANIFEST.MF">
      <!--<zipfileset src="${tika.lib}"/>-->
    </solr-jar>
  </target>

  <target name="compileTests" depends="compile">
    <solr-javac destdir="${dest}/test-classes"
                classpathref="test.classpath">
      <src path="src/test/java" />
    </solr-javac>
  </target>

  <target name="test" depends="compileTests">
    <mkdir dir="${junit.output.dir}"/>

    <junit printsummary="on"
           haltonfailure="no"
           errorProperty="tests.failed"
           failureProperty="tests.failed"
           dir="src/test/resources/"
           >
      <formatter type="brief" usefile="false" if="junit.details"/>
      <classpath refid="test.classpath"/>
      <formatter type="xml"/>
      <batchtest fork="yes" todir="${junit.output.dir}" unless="testcase">
        <fileset dir="src/test/java" includes="${junit.includes}"/>
      </batchtest>
      <batchtest fork="yes" todir="${junit.output.dir}" if="testcase">
        <fileset dir="src/test/java" includes="**/${testcase}.java"/>
      </batchtest>
    </junit>

    <fail if="tests.failed">Tests failed!</fail>
  </target>

  <target name="dist" depends="build">
  </target>

  <target name="example" depends="build">
    <!-- Copy the jar into example/solr/lib -->
    <copy file="${dest}/${fullnamever}.jar" todir="${example}/solr/lib"/>
    <copy todir="${example}/solr/lib">
      <fileset dir="lib">
        <include name="**/*.jar"/>
      </fileset>
    </copy>
  </target>

  <target name="javadoc">
    <sequential>
      <mkdir dir="${build.javadoc}/contrib-${name}"/>

      <path id="javadoc.classpath">
        <path refid="common.classpath"/>
      </path>

      <invoke-javadoc
              destdir="${build.javadoc}/contrib-${name}"
              title="${Name} ${version} contrib-${fullnamever} API">
        <sources>
          <packageset dir="src/main/java"/>
        </sources>
      </invoke-javadoc>
    </sequential>
  </target>

</project>
@@ -0,0 +1,2 @@
AnyObjectId[8217cae0a1bc977b241e0c8517cc2e3e7cede276] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[680f8c60c1f0393f7e56595e24b29b3ceb46e933] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[552721d0e8deb28f2909cfc5ec900a5e35736795] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[957b6752af9a60c1bb2a4f65db0e90e5ce00f521] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[133dc6cb35f5ca2c5920fd0933a557c2def88680] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[87b80ab5db1729662ccf3439e147430a28c36d03] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[b73a80fab641131e6fbe3ae833549efb3c540d17] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[c9030febd2ae484532407db9ef98247cbe61b779] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[f5e8c167e7f7f3d078407859cb50b8abf23c697e] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[674d71e89ea154dbe2e3cd032821c22b39e8fd68] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[625130719013f195869881a36dcb8d2b14d64d1e] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[037b4fe2743eb161eec649f6fa5fa4725585b518] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[f821d644766c4d5c95e53db4b83cc6cb37b553f6] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[9e472a1610fa5d6736ecd56aec663623170003a3] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[58a33ac11683bec703fadffdbb263036146d7a74] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[16b9a3ed370d5a617d72f0b8935859bf0eac7678] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[3b351f6e2b566f73b742510738a52b866b4ffd0d] was removed in git history.
Apache SVN contains full history.
@@ -0,0 +1,2 @@
AnyObjectId[b338fb66932a763d6939dc93f27ed985ca5d1ebb] was removed in git history.
Apache SVN contains full history.
ExtractingDocumentLoader.java (new file)
@@ -0,0 +1,179 @@
package org.apache.solr.handler;

import org.apache.commons.io.IOUtils;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.params.UpdateParams;
import org.apache.solr.common.util.ContentStream;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.apache.tika.sax.xpath.Matcher;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.xml.sax.ContentHandler;

import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;


/**
 *
 *
 **/
public class ExtractingDocumentLoader extends ContentStreamLoader {

  /**
   * XHTML XPath parser.
   */
  private static final XPathParser PARSER =
          new XPathParser("xhtml", XHTMLContentHandler.XHTML);

  final IndexSchema schema;
  final SolrParams params;
  final UpdateRequestProcessor processor;
  protected AutoDetectParser autoDetectParser;

  private final AddUpdateCommand templateAdd;

  protected TikaConfig config;
  protected SolrContentHandlerFactory factory;
  //protected Collection<String> dateFormats = DateUtil.DEFAULT_DATE_FORMATS;

  ExtractingDocumentLoader(SolrQueryRequest req, UpdateRequestProcessor processor,
                           TikaConfig config, SolrContentHandlerFactory factory) {
    this.params = req.getParams();
    schema = req.getSchema();
    this.config = config;
    this.processor = processor;

    templateAdd = new AddUpdateCommand();
    templateAdd.allowDups = false;
    templateAdd.overwriteCommitted = true;
    templateAdd.overwritePending = true;

    if (params.getBool(UpdateParams.OVERWRITE, true)) {
      templateAdd.allowDups = false;
      templateAdd.overwriteCommitted = true;
      templateAdd.overwritePending = true;
    } else {
      templateAdd.allowDups = true;
      templateAdd.overwriteCommitted = false;
      templateAdd.overwritePending = false;
    }
    //this is lightweight
    autoDetectParser = new AutoDetectParser(config);
    this.factory = factory;
  }


  /**
   * this must be MT safe... may be called concurrently from multiple threads.
   *
   * @param
   * @param
   */
  void doAdd(SolrContentHandler handler, AddUpdateCommand template)
          throws IOException {
    template.solrDoc = handler.newDocument();
    processor.processAdd(template);
  }

  void addDoc(SolrContentHandler handler) throws IOException {
    templateAdd.indexedId = null;
    doAdd(handler, templateAdd);
  }

  /**
   * @param req
   * @param stream
   * @throws java.io.IOException
   */
  public void load(SolrQueryRequest req, SolrQueryResponse rsp, ContentStream stream) throws IOException {
    errHeader = "ExtractingDocumentLoader: " + stream.getSourceInfo();
    Parser parser = null;
    String streamType = req.getParams().get(ExtractingParams.STREAM_TYPE, null);
    if (streamType != null) {
      //Cache?  Parsers are lightweight to construct and thread-safe, so I'm told
      parser = config.getParser(streamType.trim().toLowerCase());
    } else {
      parser = autoDetectParser;
    }
    if (parser != null) {
      Metadata metadata = new Metadata();
      metadata.add(ExtractingMetadataConstants.STREAM_NAME, stream.getName());
      metadata.add(ExtractingMetadataConstants.STREAM_SOURCE_INFO, stream.getSourceInfo());
      metadata.add(ExtractingMetadataConstants.STREAM_SIZE, String.valueOf(stream.getSize()));
      metadata.add(ExtractingMetadataConstants.STREAM_CONTENT_TYPE, stream.getContentType());

      // If you specify the resource name (the filename, roughly) with this parameter,
      // then Tika can make use of it in guessing the appropriate MIME type:
      String resourceName = req.getParams().get(ExtractingParams.RESOURCE_NAME, null);
      if (resourceName != null) {
        metadata.add(Metadata.RESOURCE_NAME_KEY, resourceName);
      }

      SolrContentHandler handler = factory.createSolrContentHandler(metadata, params, schema);
      InputStream inputStream = null;
      try {
        inputStream = stream.getStream();
        String xpathExpr = params.get(ExtractingParams.XPATH_EXPRESSION);
        boolean extractOnly = params.getBool(ExtractingParams.EXTRACT_ONLY, false);
        ContentHandler parsingHandler = handler;

        StringWriter writer = null;
        XMLSerializer serializer = null;
        if (extractOnly == true) {
          writer = new StringWriter();
          serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", true));
          if (xpathExpr != null) {
            Matcher matcher =
                    PARSER.parse(xpathExpr);
            serializer.startDocument();//The MatchingContentHandler does not invoke startDocument.  See http://tika.markmail.org/message/kknu3hw7argwiqin
            parsingHandler = new MatchingContentHandler(serializer, matcher);
          } else {
            parsingHandler = serializer;
          }
        } else if (xpathExpr != null) {
          Matcher matcher =
                  PARSER.parse(xpathExpr);
          parsingHandler = new MatchingContentHandler(handler, matcher);
        } //else leave it as is

        //potentially use a wrapper handler for parsing, but we still need the SolrContentHandler for getting the document.
        parser.parse(inputStream, parsingHandler, metadata);
        if (extractOnly == false) {
          addDoc(handler);
        } else {
          //serializer is not null, so we need to call endDoc on it if using xpath
          if (xpathExpr != null){
            serializer.endDocument();
          }
          rsp.add(stream.getName(), writer.toString());
          writer.close();
        }
      } catch (Exception e) {
        //TODO: handle here with an option to not fail and just log the exception
        throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, e);
      } finally {
        IOUtils.closeQuietly(inputStream);
      }
    } else {
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "Stream type of " + streamType + " didn't match any known parsers.  Please supply the " + ExtractingParams.STREAM_TYPE + " parameter.");
    }
  }

}
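The heart of the loader above is the single parser.parse(inputStream, parsingHandler, metadata) call. The standalone sketch below exercises the same Tika classes the loader imports to dump a local file as XHTML; it is illustrative only, and the use of TikaConfig.getDefaultConfig(), the hard-coded file name, and printing the detected content type are assumptions rather than anything this commit prescribes.

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.xml.sax.ContentHandler;

public class TikaExtractionSketch {
  public static void main(String[] args) throws Exception {
    // Same auto-detecting parser the loader builds in its constructor.
    AutoDetectParser parser = new AutoDetectParser(TikaConfig.getDefaultConfig());

    // Metadata is both an input hint (e.g. the resource name) and an output of parsing.
    Metadata metadata = new Metadata();
    metadata.add(Metadata.RESOURCE_NAME_KEY, "example.pdf");

    // Serialize the XHTML SAX events Tika produces, as the loader's extract-only path does.
    StringWriter writer = new StringWriter();
    ContentHandler handler = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", true));

    InputStream in = new FileInputStream("example.pdf");
    try {
      parser.parse(in, handler, metadata);
    } finally {
      in.close();
    }

    System.out.println(writer.toString());             // extracted XHTML
    System.out.println(metadata.get("Content-Type"));  // MIME type Tika detected
  }
}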
ExtractingMetadataConstants.java (new file)
@@ -0,0 +1,13 @@
package org.apache.solr.handler;


/**
 *
 *
 **/
public interface ExtractingMetadataConstants {
  String STREAM_NAME = "stream_name";
  String STREAM_SOURCE_INFO = "stream_source_info";
  String STREAM_SIZE = "stream_size";
  String STREAM_CONTENT_TYPE = "stream_content_type";
}
ExtractingParams.java (new file)
@@ -0,0 +1,125 @@
package org.apache.solr.handler;


/**
 * The various parameters to use when extracting content.
 *
 **/
public interface ExtractingParams {

  public static final String EXTRACTING_PREFIX = "ext.";

  /**
   * The param prefix for mapping Tika metadata to Solr fields.
   * <p/>
   * To map a field, add a name like:
   * <pre>ext.map.title=solr.title</pre>
   *
   * In this example, the tika "title" metadata value will be added to a Solr field named "solr.title"
   *
   */
  public static final String MAP_PREFIX = EXTRACTING_PREFIX + "map.";

  /**
   * The boost value for the name of the field.  The boost can be specified by a name mapping.
   * <p/>
   * For example
   * <pre>
   * ext.map.title=solr.title
   * ext.boost.solr.title=2.5
   * </pre>
   * will boost the solr.title field for this document by 2.5
   *
   */
  public static final String BOOST_PREFIX = EXTRACTING_PREFIX + "boost.";

  /**
   * Pass in literal values to be added to the document, as in
   * <pre>
   * ext.literal.myField=Foo
   * </pre>
   *
   */
  public static final String LITERALS_PREFIX = EXTRACTING_PREFIX + "literal.";


  /**
   * Restrict the extracted parts of a document to be indexed
   * by passing in an XPath expression.  All content that satisfies the XPath expr.
   * will be passed to the {@link org.apache.solr.handler.SolrContentHandler}.
   * <p/>
   * See Tika's docs for what the extracted document looks like.
   * <p/>
   * @see #DEFAULT_FIELDNAME
   * @see #CAPTURE_FIELDS
   */
  public static final String XPATH_EXPRESSION = EXTRACTING_PREFIX + "xpath";


  /**
   * Only extract and return the document, do not index it.
   */
  public static final String EXTRACT_ONLY = EXTRACTING_PREFIX + "extract.only";

  /**
   * Don't throw an exception if a field doesn't exist, just ignore it
   */
  public static final String IGNORE_UNDECLARED_FIELDS = EXTRACTING_PREFIX + "ignore.und.fl";

  /**
   * Index attributes separately according to their name, instead of just adding them to the string buffer
   */
  public static final String INDEX_ATTRIBUTES = EXTRACTING_PREFIX + "idx.attr";

  /**
   * The field to index the contents to by default.  If you want to capture a specific piece
   * of the Tika document separately, see {@link #CAPTURE_FIELDS}.
   *
   * @see #CAPTURE_FIELDS
   */
  public static final String DEFAULT_FIELDNAME = EXTRACTING_PREFIX + "def.fl";

  /**
   * Capture the specified fields (and everything included below it that isn't capture by some other capture field) separately from the default.  This is different
   * then the case of passing in an XPath expression.
   * <p/>
   * The Capture field is based on the localName returned to the {@link org.apache.solr.handler.SolrContentHandler}
   * by Tika, not to be confused by the mapped field.  The field name can then
   * be mapped into the index schema.
   * <p/>
   * For instance, a Tika document may look like:
   * <pre>
   *  <html>
   *    ...
   *    <body>
   *      <p>some text here. <div>more text</div></p>
   *      Some more text
   *    </body>
   * </pre>
   * By passing in the p tag, you could capture all P tags separately from the rest of the text.
   * Thus, in the example, the capture of the P tag would be: "some text here. more text"
   *
   * @see #DEFAULT_FIELDNAME
   */
  public static final String CAPTURE_FIELDS = EXTRACTING_PREFIX + "capture";

  /**
   * The type of the stream.  If not specified, Tika will use mime type detection.
   */
  public static final String STREAM_TYPE = EXTRACTING_PREFIX + "stream.type";


  /**
   * Optional.  The file name.  If specified, Tika can take this into account while
   * guessing the MIME type.
   */
  public static final String RESOURCE_NAME = EXTRACTING_PREFIX + "resource.name";


  /**
   * Optional.  If specified, the prefix will be prepended to all Metadata, such that it would be possible
   * to setup a dynamic field to automatically capture it
   */
  public static final String METADATA_PREFIX = EXTRACTING_PREFIX + "metadata.prefix";
}
@ -0,0 +1,134 @@
|
||||||
|
package org.apache.solr.handler;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
|
* contributor license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright ownership.
|
||||||
|
* The ASF licenses this file to You under the Apache License, Version 2.0
|
||||||
|
* (the "License"); you may not use this file except in compliance with
|
||||||
|
* the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
import org.apache.solr.common.SolrException;
|
||||||
|
import org.apache.solr.common.SolrException.ErrorCode;
|
||||||
|
import org.apache.solr.common.util.DateUtil;
|
||||||
|
import org.apache.solr.common.util.NamedList;
|
||||||
|
import org.apache.solr.core.SolrCore;
|
||||||
|
import org.apache.solr.request.SolrQueryRequest;
|
||||||
|
import org.apache.solr.update.processor.UpdateRequestProcessor;
|
||||||
|
import org.apache.solr.util.plugin.SolrCoreAware;
|
||||||
|
import org.apache.tika.config.TikaConfig;
|
||||||
|
import org.apache.tika.exception.TikaException;
|
||||||
|
import org.slf4j.Logger;
|
||||||
|
import org.slf4j.LoggerFactory;
|
||||||
|
|
||||||
|
import java.io.File;
|
||||||
|
import java.util.Collection;
|
||||||
|
import java.util.HashSet;
|
||||||
|
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Handler for rich documents like PDF or Word or any other file format that Tika handles that need the text to be extracted
|
||||||
|
* first from the document.
|
||||||
|
* <p/>
|
||||||
|
*/
|
||||||
|
|
||||||
|
public class ExtractingRequestHandler extends ContentStreamHandlerBase implements SolrCoreAware {
|
||||||
|
|
||||||
|
private transient static Logger log = LoggerFactory.getLogger(ExtractingRequestHandler.class);
|
||||||
|
|
||||||
|
public static final String CONFIG_LOCATION = "tika.config";
|
||||||
|
public static final String DATE_FORMATS = "date.formats";
|
||||||
|
|
||||||
|
protected TikaConfig config;
|
||||||
|
|
||||||
|
|
||||||
|
protected Collection<String> dateFormats = DateUtil.DEFAULT_DATE_FORMATS;
|
||||||
|
protected SolrContentHandlerFactory factory;
|
||||||
|
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public void init(NamedList args) {
|
||||||
|
super.init(args);
|
||||||
|
}
|
||||||
|
|
||||||
|
public void inform(SolrCore core) {
|
||||||
|
if (initArgs != null) {
|
||||||
|
//if relative,then relative to config dir, otherwise, absolute path
|
||||||
|
String tikaConfigLoc = (String) initArgs.get(CONFIG_LOCATION);
|
||||||
|
if (tikaConfigLoc != null) {
|
||||||
|
File configFile = new File(tikaConfigLoc);
|
||||||
|
if (configFile.isAbsolute() == false) {
|
||||||
|
configFile = new File(core.getResourceLoader().getConfigDir(), configFile.getPath());
|
||||||
|
}
|
||||||
|
try {
|
||||||
|
config = new TikaConfig(configFile);
|
||||||
|
} catch (Exception e) {
|
||||||
|
throw new SolrException(ErrorCode.SERVER_ERROR, e);
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
try {
|
||||||
|
config = TikaConfig.getDefaultConfig();
|
||||||
|
} catch (TikaException e) {
|
||||||
|
throw new SolrException(ErrorCode.SERVER_ERROR, e);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
NamedList configDateFormats = (NamedList) initArgs.get(DATE_FORMATS);
|
||||||
|
if (configDateFormats != null && configDateFormats.size() > 0) {
|
||||||
|
dateFormats = new HashSet<String>();
|
||||||
|
while (configDateFormats.iterator().hasNext()) {
|
||||||
|
String format = (String) configDateFormats.iterator().next();
|
||||||
|
log.info("Adding Date Format: " + format);
|
||||||
|
dateFormats.add(format);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
try {
|
||||||
|
config = TikaConfig.getDefaultConfig();
|
||||||
|
} catch (TikaException e) {
|
||||||
|
throw new SolrException(ErrorCode.SERVER_ERROR, e);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
factory = createFactory();
|
||||||
|
}
|
||||||
|
|
||||||
|
protected SolrContentHandlerFactory createFactory() {
|
||||||
|
return new SolrContentHandlerFactory(dateFormats);
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
protected ContentStreamLoader newLoader(SolrQueryRequest req, UpdateRequestProcessor processor) {
|
||||||
|
return new ExtractingDocumentLoader(req, processor, config, factory);
|
||||||
|
}
|
||||||
|
|
||||||
|
// ////////////////////// SolrInfoMBeans methods //////////////////////
|
||||||
|
@Override
|
||||||
|
public String getDescription() {
|
||||||
|
return "Add/Update Rich document";
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public String getVersion() {
|
||||||
|
return "$Revision:$";
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public String getSourceId() {
|
||||||
|
return "$Id:$";
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public String getSource() {
|
||||||
|
return "$URL:$";
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
|
|
@ -0,0 +1,353 @@
|
||||||
|
package org.apache.solr.handler;
|
||||||
|
|
||||||
|
import org.apache.solr.common.SolrException;
|
||||||
|
import org.apache.solr.common.SolrInputDocument;
|
||||||
|
import org.apache.solr.common.SolrInputField;
|
||||||
|
import org.apache.solr.common.params.SolrParams;
|
||||||
|
import org.apache.solr.common.util.DateUtil;
|
||||||
|
import org.apache.solr.schema.DateField;
|
||||||
|
import org.apache.solr.schema.IndexSchema;
|
||||||
|
import org.apache.solr.schema.SchemaField;
|
||||||
|
import org.apache.solr.schema.StrField;
|
||||||
|
import org.apache.solr.schema.TextField;
|
||||||
|
import org.apache.solr.schema.FieldType;
|
||||||
|
import org.apache.solr.schema.UUIDField;
|
||||||
|
import org.apache.tika.metadata.Metadata;
|
||||||
|
import org.slf4j.Logger;
|
||||||
|
import org.slf4j.LoggerFactory;
|
||||||
|
import org.xml.sax.Attributes;
|
||||||
|
import org.xml.sax.SAXException;
|
||||||
|
import org.xml.sax.helpers.DefaultHandler;
|
||||||
|
|
||||||
|
import java.text.DateFormat;
|
||||||
|
import java.util.Collection;
|
||||||
|
import java.util.Collections;
|
||||||
|
import java.util.Date;
|
||||||
|
import java.util.HashMap;
|
||||||
|
import java.util.Iterator;
|
||||||
|
import java.util.Map;
|
||||||
|
import java.util.Stack;
|
||||||
|
import java.util.UUID;
|
||||||
|
|
||||||
|
|
||||||
|
/**
|
||||||
|
* This class is not thread-safe. It is responsible for responding to Tika extraction events and producing a Solr document
|
||||||
|
*/
|
||||||
|
public class SolrContentHandler extends DefaultHandler implements ExtractingParams {
|
||||||
|
private transient static Logger log = LoggerFactory.getLogger(SolrContentHandler.class);
|
||||||
|
protected SolrInputDocument document;
|
||||||
|
|
||||||
|
protected Collection<String> dateFormats = DateUtil.DEFAULT_DATE_FORMATS;
|
||||||
|
|
||||||
|
protected Metadata metadata;
|
||||||
|
protected SolrParams params;
|
||||||
|
protected StringBuilder catchAllBuilder = new StringBuilder(2048);
|
||||||
|
//private StringBuilder currentBuilder;
|
||||||
|
protected IndexSchema schema;
|
||||||
|
//create empty so we don't have to worry about null checks
|
||||||
|
protected Map<String, StringBuilder> fieldBuilders = Collections.emptyMap();
|
||||||
|
protected Stack<StringBuilder> bldrStack = new Stack<StringBuilder>();
|
||||||
|
|
||||||
|
protected boolean ignoreUndeclaredFields = false;
|
||||||
|
protected boolean indexAttribs = false;
|
||||||
|
protected String defaultFieldName;
|
||||||
|
|
||||||
|
protected String metadataPrefix = "";
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Only access through getNextId();
|
||||||
|
*/
|
||||||
|
private static long identifier = Long.MIN_VALUE;
|
||||||
|
|
||||||
|
|
||||||
|
public SolrContentHandler(Metadata metadata, SolrParams params, IndexSchema schema) {
|
||||||
|
this(metadata, params, schema, DateUtil.DEFAULT_DATE_FORMATS);
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
public SolrContentHandler(Metadata metadata, SolrParams params,
|
||||||
|
IndexSchema schema, Collection<String> dateFormats) {
|
||||||
|
document = new SolrInputDocument();
|
||||||
|
this.metadata = metadata;
|
||||||
|
this.params = params;
|
||||||
|
this.schema = schema;
|
||||||
|
this.dateFormats = dateFormats;
|
||||||
|
this.ignoreUndeclaredFields = params.getBool(ExtractingParams.IGNORE_UNDECLARED_FIELDS, false);
|
||||||
|
this.indexAttribs = params.getBool(ExtractingParams.INDEX_ATTRIBUTES, false);
|
||||||
|
this.defaultFieldName = params.get(ExtractingParams.DEFAULT_FIELDNAME);
|
||||||
|
this.metadataPrefix = params.get(ExtractingParams.METADATA_PREFIX, "");
|
||||||
|
//if there's no default field and we are intending to index, then throw an exception
|
||||||
|
if (defaultFieldName == null && params.getBool(ExtractingParams.EXTRACT_ONLY, false) == false) {
|
||||||
|
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "No default field name specified");
|
||||||
|
}
|
||||||
|
String[] captureFields = params.getParams(ExtractingParams.CAPTURE_FIELDS);
|
||||||
|
if (captureFields != null && captureFields.length > 0) {
|
||||||
|
fieldBuilders = new HashMap<String, StringBuilder>();
|
||||||
|
for (int i = 0; i < captureFields.length; i++) {
|
||||||
|
fieldBuilders.put(captureFields[i], new StringBuilder());
|
||||||
|
}
|
||||||
|
}
|
||||||
|
bldrStack.push(catchAllBuilder);
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
/**
|
||||||
|
* This is called by a consumer when it is ready to deal with a new SolrInputDocument. Overriding
|
||||||
|
* classes can use this hook to add in or change whatever they deem fit for the document at that time.
|
||||||
|
* The base implementation adds the metadata as fields, allowing for potential remapping.
|
||||||
|
*
|
||||||
|
* @return The {@link org.apache.solr.common.SolrInputDocument}.
|
||||||
|
*/
|
||||||
|
public SolrInputDocument newDocument() {
|
||||||
|
float boost = 1.0f;
|
||||||
|
//handle the metadata extracted from the document
|
||||||
|
for (String name : metadata.names()) {
|
||||||
|
String[] vals = metadata.getValues(name);
|
||||||
|
name = findMappedMetadataName(name);
|
||||||
|
SchemaField schFld = schema.getFieldOrNull(name);
|
||||||
|
if (schFld != null) {
|
||||||
|
boost = getBoost(name);
|
||||||
|
if (schFld.multiValued()) {
|
||||||
|
for (int i = 0; i < vals.length; i++) {
|
||||||
|
String val = vals[i];
|
||||||
|
document.addField(name, transformValue(val, schFld), boost);
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
StringBuilder builder = new StringBuilder();
|
||||||
|
for (int i = 0; i < vals.length; i++) {
|
||||||
|
builder.append(vals[i]).append(' ');
|
||||||
|
}
|
||||||
|
document.addField(name, transformValue(builder.toString().trim(), schFld), boost);
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
//TODO: error or log?
|
||||||
|
if (ignoreUndeclaredFields == false) {
|
||||||
|
// Arguably we should handle this as a special case. Why? Because unlike basically
|
||||||
|
// all the other fields in metadata, this one was probably set not by Tika by in
|
||||||
|
// ExtractingDocumentLoader.load(). You shouldn't have to define a mapping for this
|
||||||
|
// field just because you specified a resource.name parameter to the handler, should
|
||||||
|
// you?
|
||||||
|
if (name != Metadata.RESOURCE_NAME_KEY) {
|
||||||
|
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Invalid field: " + name);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
//handle the literals from the params
|
||||||
|
Iterator<String> paramNames = params.getParameterNamesIterator();
|
||||||
|
while (paramNames.hasNext()) {
|
||||||
|
String name = paramNames.next();
|
||||||
|
if (name.startsWith(LITERALS_PREFIX)) {
|
||||||
|
String fieldName = name.substring(LITERALS_PREFIX.length());
|
||||||
|
//no need to map names here, since they are literals from the user
|
||||||
|
SchemaField schFld = schema.getFieldOrNull(fieldName);
|
||||||
|
if (schFld != null) {
|
||||||
|
String value = params.get(name);
|
||||||
|
boost = getBoost(fieldName);
|
||||||
|
//no need to transform here, b/c we can assume the user sent it in correctly
|
||||||
|
document.addField(fieldName, value, boost);
|
||||||
|
} else {
|
||||||
|
handleUndeclaredField(fieldName);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
//add in the content
|
||||||
|
document.addField(defaultFieldName, catchAllBuilder.toString(), getBoost(defaultFieldName));
|
||||||
|
|
||||||
|
//add in the captured content
|
||||||
|
for (Map.Entry<String, StringBuilder> entry : fieldBuilders.entrySet()) {
|
||||||
|
if (entry.getValue().length() > 0) {
|
||||||
|
String fieldName = findMappedName(entry.getKey());
|
||||||
|
SchemaField schFld = schema.getFieldOrNull(fieldName);
|
||||||
|
if (schFld != null) {
|
||||||
|
document.addField(fieldName, transformValue(entry.getValue().toString(), schFld), getBoost(fieldName));
|
||||||
|
} else {
|
||||||
|
handleUndeclaredField(fieldName);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
//make sure we have a unique id, if one is needed
|
||||||
|
SchemaField uniqueField = schema.getUniqueKeyField();
|
||||||
|
if (uniqueField != null) {
|
||||||
|
String uniqueFieldName = uniqueField.getName();
|
||||||
|
SolrInputField uniqFld = document.getField(uniqueFieldName);
|
||||||
|
if (uniqFld == null) {
|
||||||
|
String uniqId = generateId(uniqueField);
|
||||||
|
if (uniqId != null) {
|
||||||
|
document.addField(uniqueFieldName, uniqId);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (log.isDebugEnabled()) {
|
||||||
|
log.debug("Doc: " + document);
|
||||||
|
}
|
||||||
|
return document;
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Generate an ID for the document. First try to get
|
||||||
|
* {@link org.apache.solr.handler.ExtractingMetadataConstants#STREAM_NAME} from the
|
||||||
|
* {@link org.apache.tika.metadata.Metadata}, then try {@link ExtractingMetadataConstants#STREAM_SOURCE_INFO}
|
||||||
|
* then try {@link org.apache.tika.metadata.Metadata#IDENTIFIER}.
|
||||||
|
* If those all are null, then generate a random UUID using {@link java.util.UUID#randomUUID()}.
|
||||||
|
*
|
||||||
|
* @param uniqueField The SchemaField representing the unique field.
|
||||||
|
* @return The id as a string
|
||||||
|
*/
|
||||||
|
protected String generateId(SchemaField uniqueField) {
|
||||||
|
//we don't have a unique field specified, so let's add one
|
||||||
|
String uniqId = null;
|
||||||
|
FieldType type = uniqueField.getType();
|
||||||
|
if (type instanceof StrField || type instanceof TextField) {
|
||||||
|
uniqId = metadata.get(ExtractingMetadataConstants.STREAM_NAME);
|
||||||
|
if (uniqId == null) {
|
||||||
|
uniqId = metadata.get(ExtractingMetadataConstants.STREAM_SOURCE_INFO);
|
||||||
|
}
|
||||||
|
if (uniqId == null) {
|
||||||
|
uniqId = metadata.get(Metadata.IDENTIFIER);
|
||||||
|
}
|
||||||
|
if (uniqId == null) {
|
||||||
|
//last chance, just create one
|
||||||
|
uniqId = UUID.randomUUID().toString();
|
||||||
|
}
|
||||||
|
} else if (type instanceof UUIDField){
|
||||||
|
uniqId = UUID.randomUUID().toString();
|
||||||
|
}
|
||||||
|
else {
|
||||||
|
uniqId = String.valueOf(getNextId());
|
||||||
|
}
|
||||||
|
return uniqId;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public void startDocument() throws SAXException {
|
||||||
|
document.clear();
|
||||||
|
catchAllBuilder.setLength(0);
|
||||||
|
for (StringBuilder builder : fieldBuilders.values()) {
|
||||||
|
builder.setLength(0);
|
||||||
|
}
|
||||||
|
bldrStack.clear();
|
||||||
|
bldrStack.push(catchAllBuilder);
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
|
||||||
|
StringBuilder theBldr = fieldBuilders.get(localName);
|
||||||
|
if (theBldr != null) {
|
||||||
|
//we need to switch the currentBuilder
|
||||||
|
bldrStack.push(theBldr);
|
||||||
|
}
|
||||||
|
if (indexAttribs == true) {
|
||||||
|
for (int i = 0; i < attributes.getLength(); i++) {
|
||||||
|
String fieldName = findMappedName(localName);
|
||||||
|
SchemaField schFld = schema.getFieldOrNull(fieldName);
|
||||||
|
if (schFld != null) {
|
||||||
|
document.addField(fieldName, transformValue(attributes.getValue(i), schFld), getBoost(fieldName));
|
||||||
|
} else {
|
||||||
|
handleUndeclaredField(fieldName);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
for (int i = 0; i < attributes.getLength(); i++) {
|
||||||
|
bldrStack.peek().append(attributes.getValue(i)).append(' ');
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
protected void handleUndeclaredField(String fieldName) {
|
||||||
|
if (ignoreUndeclaredFields == false) {
|
||||||
|
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Invalid field: " + fieldName);
|
||||||
|
} else {
|
||||||
|
if (log.isInfoEnabled()) {
|
||||||
|
log.info("Ignoring Field: " + fieldName);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public void endElement(String uri, String localName, String qName) throws SAXException {
|
||||||
|
StringBuilder theBldr = fieldBuilders.get(localName);
|
||||||
|
if (theBldr != null) {
|
||||||
|
//pop the stack
|
||||||
|
bldrStack.pop();
|
||||||
|
assert (bldrStack.size() >= 1);
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@Override
|
||||||
|
public void characters(char[] chars, int offset, int length) throws SAXException {
|
||||||
|
bldrStack.peek().append(chars, offset, length);
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Can be used to transform input values based on their {@link org.apache.solr.schema.SchemaField}
|
||||||
|
* <p/>
|
||||||
|
* This implementation only formats dates using the {@link org.apache.solr.common.util.DateUtil}.
|
||||||
|
*
|
||||||
|
* @param val The value to transform
|
||||||
|
* @param schFld The {@link org.apache.solr.schema.SchemaField}
|
||||||
|
* @return The potentially new value.
|
||||||
|
*/
|
||||||
|
protected String transformValue(String val, SchemaField schFld) {
|
||||||
|
String result = val;
|
||||||
|
if (schFld.getType() instanceof DateField) {
|
||||||
|
//try to transform the date
|
||||||
|
try {
|
||||||
|
Date date = DateUtil.parseDate(val, dateFormats);
|
||||||
|
DateFormat df = DateUtil.getThreadLocalDateFormat();
|
||||||
|
result = df.format(date);
|
||||||
|
|
||||||
|
} catch (Exception e) {
|
||||||
|
//TODO: error or log?
|
||||||
|
throw new SolrException(SolrException.ErrorCode.SERVER_ERROR, "Invalid value: " + val + " for field: " + schFld, e);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return result;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Get the value of any boost factor for the mapped name.
|
||||||
|
*
|
||||||
|
* @param name The name of the field to see if there is a boost specified
|
||||||
|
* @return The boost value
|
||||||
|
*/
|
||||||
|
protected float getBoost(String name) {
|
||||||
|
return params.getFloat(BOOST_PREFIX + name, 1.0f);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Get the name mapping
|
||||||
|
*
|
||||||
|
* @param name The name to check to see if there is a mapping
|
||||||
|
* @return The new name, if there is one, else <code>name</code>
|
||||||
|
*/
|
||||||
|
protected String findMappedName(String name) {
|
||||||
|
return params.get(ExtractingParams.MAP_PREFIX + name, name);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Get the name mapping for the metadata field. Prepends metadataPrefix onto the returned result.
|
||||||
|
*
|
||||||
|
* @param name The name to check to see if there is a mapping
|
||||||
|
* @return The new name, else <code>name</code>
|
||||||
|
*/
|
||||||
|
protected String findMappedMetadataName(String name) {
|
||||||
|
return metadataPrefix + params.get(ExtractingParams.MAP_PREFIX + name, name);
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
protected synchronized long getNextId(){
|
||||||
|
return identifier++;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
}
|
|
@ -0,0 +1,25 @@
|
||||||
|
package org.apache.solr.handler;
|
||||||
|
|
||||||
|
import org.apache.tika.metadata.Metadata;
|
||||||
|
import org.apache.solr.common.params.SolrParams;
|
||||||
|
import org.apache.solr.schema.IndexSchema;
|
||||||
|
|
||||||
|
import java.util.Collection;
|
||||||
|
|
||||||
|
|
||||||
|
/**
|
||||||
|
*
|
||||||
|
*
|
||||||
|
**/
|
||||||
|
public class SolrContentHandlerFactory {
|
||||||
|
protected Collection<String> dateFormats;
|
||||||
|
|
||||||
|
public SolrContentHandlerFactory(Collection<String> dateFormats) {
|
||||||
|
this.dateFormats = dateFormats;
|
||||||
|
}
|
||||||
|
|
||||||
|
public SolrContentHandler createSolrContentHandler(Metadata metadata, SolrParams params, IndexSchema schema) {
|
||||||
|
return new SolrContentHandler(metadata, params, schema,
|
||||||
|
dateFormats);
|
||||||
|
}
|
||||||
|
}
|
|
@ -0,0 +1,140 @@
|
||||||
|
package org.apache.solr.handler;
|
||||||
|
|
||||||
|
import org.apache.solr.util.AbstractSolrTestCase;
|
||||||
|
import org.apache.solr.request.LocalSolrQueryRequest;
|
||||||
|
import org.apache.solr.request.SolrQueryResponse;
|
||||||
|
import org.apache.solr.common.util.ContentStream;
|
||||||
|
import org.apache.solr.common.util.ContentStreamBase;
|
||||||
|
import org.apache.solr.common.util.NamedList;
|
||||||
|
|
||||||
|
import java.util.List;
|
||||||
|
import java.util.ArrayList;
|
||||||
|
import java.io.File;
|
||||||
|
|
||||||
|
|
||||||
|
/**
|
||||||
|
*
|
||||||
|
*
|
||||||
|
**/
|
||||||
|
public class ExtractingRequestHandlerTest extends AbstractSolrTestCase {
|
||||||
|
@Override public String getSchemaFile() { return "schema.xml"; }
|
||||||
|
@Override public String getSolrConfigFile() { return "solrconfig.xml"; }
|
||||||
|
|
||||||
|
|
||||||
|
public void testExtraction() throws Exception {
|
||||||
|
ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
|
||||||
|
assertTrue("handler is null and it shouldn't be", handler != null);
|
||||||
|
loadLocal("solr-word.pdf", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
|
||||||
|
"ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
|
||||||
|
"ext.map.Author", "extractedAuthor",
|
||||||
|
"ext.def.fl", "extractedContent",
|
||||||
|
"ext.map.Last-Modified", "extractedDate"
|
||||||
|
);
|
||||||
|
assertQ(req("title:solr-word"),"//*[@numFound='0']");
|
||||||
|
assertU(commit());
|
||||||
|
assertQ(req("title:solr-word"),"//*[@numFound='1']");
|
||||||
|
|
||||||
|
loadLocal("simple.html", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
|
||||||
|
"ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
|
||||||
|
"ext.map.Author", "extractedAuthor",
|
||||||
|
"ext.map.language", "extractedLanguage",
|
||||||
|
"ext.def.fl", "extractedContent",
|
||||||
|
"ext.map.Last-Modified", "extractedDate"
|
||||||
|
);
|
||||||
|
assertQ(req("title:Welcome"),"//*[@numFound='0']");
|
||||||
|
assertU(commit());
|
||||||
|
assertQ(req("title:Welcome"),"//*[@numFound='1']");
|
||||||
|
|
||||||
|
loadLocal("version_control.xml", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
|
||||||
|
"ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
|
||||||
|
"ext.map.Author", "extractedAuthor",
|
||||||
|
"ext.def.fl", "extractedContent",
|
||||||
|
"ext.map.Last-Modified", "extractedDate"
|
||||||
|
);
|
||||||
|
assertQ(req("stream_name:version_control.xml"),"//*[@numFound='0']");
|
||||||
|
assertU(commit());
|
||||||
|
assertQ(req("stream_name:version_control.xml"),"//*[@numFound='1']");
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
public void testPlainTextSpecifyingMimeType() throws Exception {
|
||||||
|
ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
|
||||||
|
assertTrue("handler is null and it shouldn't be", handler != null);
|
||||||
|
|
||||||
|
// Load plain text specifying MIME type:
|
||||||
|
loadLocal("version_control.txt", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
|
||||||
|
"ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
|
||||||
|
"ext.map.Author", "extractedAuthor",
|
||||||
|
"ext.map.language", "extractedLanguage",
|
||||||
|
"ext.def.fl", "extractedContent",
|
||||||
|
ExtractingParams.STREAM_TYPE, "text/plain"
|
||||||
|
);
|
||||||
|
assertQ(req("extractedContent:Apache"),"//*[@numFound='0']");
|
||||||
|
assertU(commit());
|
||||||
|
assertQ(req("extractedContent:Apache"),"//*[@numFound='1']");
|
||||||
|
}
|
||||||
|
|
||||||
|
public void testPlainTextSpecifyingResourceName() throws Exception {
|
||||||
|
ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
|
||||||
|
assertTrue("handler is null and it shouldn't be", handler != null);
|
||||||
|
|
||||||
|
// Load plain text specifying filename
|
||||||
|
loadLocal("version_control.txt", "ext.map.created", "extractedDate", "ext.map.producer", "extractedProducer",
|
||||||
|
"ext.map.creator", "extractedCreator", "ext.map.Keywords", "extractedKeywords",
|
||||||
|
"ext.map.Author", "extractedAuthor",
|
||||||
|
"ext.map.language", "extractedLanguage",
|
||||||
|
"ext.def.fl", "extractedContent",
|
||||||
|
ExtractingParams.RESOURCE_NAME, "version_control.txt"
|
||||||
|
);
|
||||||
|
assertQ(req("extractedContent:Apache"),"//*[@numFound='0']");
|
||||||
|
assertU(commit());
|
||||||
|
assertQ(req("extractedContent:Apache"),"//*[@numFound='1']");
|
||||||
|
}
|
||||||
|
|
||||||
|
// Note: If you load a plain text file specifying neither MIME type nor filename, extraction will silently fail. This is because Tika's
|
||||||
|
// automatic MIME type detection will fail, and it will default to using an empty-string-returning default parser
|
||||||
|
|
||||||
|
|
||||||
|
public void testExtractOnly() throws Exception {
|
||||||
|
ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
|
||||||
|
assertTrue("handler is null and it shouldn't be", handler != null);
|
||||||
|
SolrQueryResponse rsp = loadLocal("solr-word.pdf", ExtractingParams.EXTRACT_ONLY, "true");
|
||||||
|
assertTrue("rsp is null and it shouldn't be", rsp != null);
|
||||||
|
NamedList list = rsp.getValues();
|
||||||
|
|
||||||
|
String extraction = (String) list.get("solr-word.pdf");
|
||||||
|
assertTrue("extraction is null and it shouldn't be", extraction != null);
|
||||||
|
assertTrue(extraction + " does not contain " + "solr-word", extraction.indexOf("solr-word") != -1);
|
||||||
|
|
||||||
|
}
|
||||||
|
|
||||||
|
public void testXPath() throws Exception {
|
||||||
|
ExtractingRequestHandler handler = (ExtractingRequestHandler) h.getCore().getRequestHandler("/update/extract");
|
||||||
|
assertTrue("handler is null and it shouldn't be", handler != null);
|
||||||
|
SolrQueryResponse rsp = loadLocal("example.html",
|
||||||
|
ExtractingParams.XPATH_EXPRESSION, "/xhtml:html/xhtml:body/xhtml:a/descendant:node()",
|
||||||
|
ExtractingParams.EXTRACT_ONLY, "true"
|
||||||
|
);
|
||||||
|
assertTrue("rsp is null and it shouldn't be", rsp != null);
|
||||||
|
NamedList list = rsp.getValues();
|
||||||
|
String val = (String) list.get("example.html");
|
||||||
|
val = val.trim();
|
||||||
|
assertTrue(val + " is not equal to " + "linkNews", val.equals("linkNews") == true);//there are two <a> tags, and they get collapesd
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
SolrQueryResponse loadLocal(String filename, String... args) throws Exception {
|
||||||
|
LocalSolrQueryRequest req = (LocalSolrQueryRequest)req(args);
|
||||||
|
|
||||||
|
// TODO: stop using locally defined streams once stream.file and
|
||||||
|
// stream.body work everywhere
|
||||||
|
List<ContentStream> cs = new ArrayList<ContentStream>();
|
||||||
|
cs.add(new ContentStreamBase.FileStream(new File(filename)));
|
||||||
|
req.setContentStreams(cs);
|
||||||
|
return h.queryAndResponse("/update/extract", req);
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
}
|
|
@ -0,0 +1,49 @@
|
||||||
|
<html>
|
||||||
|
<head>
|
||||||
|
<title>Welcome to Solr</title>
|
||||||
|
</head>
|
||||||
|
<body>
|
||||||
|
<p>
|
||||||
|
Here is some text
|
||||||
|
</p>
|
||||||
|
<div>Here is some text in a div</div>
|
||||||
|
<div>This has a <a href="http://www.apache.org">link</a>.</div>
|
||||||
|
<a href="#news">News</a>
|
||||||
|
<ul class="minitoc">
|
||||||
|
<li>
|
||||||
|
<a href="#03+October+2008+-+Solr+Logo+Contest">03 October 2008 - Solr Logo Contest</a>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<a href="#15+September+2008+-+Solr+1.3.0+Available">15 September 2008 - Solr 1.3.0 Available</a>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<a href="#28+August+2008+-+Lucene%2FSolr+at+ApacheCon+New+Orleans">28 August 2008 - Lucene/Solr at ApacheCon New Orleans</a>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<a href="#03+September+2007+-+Lucene+at+ApacheCon+Atlanta">03 September 2007 - Lucene at ApacheCon Atlanta</a>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<a href="#06+June+2007%3A+Release+1.2+available">06 June 2007: Release 1.2 available</a>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<a href="#17+January+2007%3A+Solr+graduates+from+Incubator">17 January 2007: Solr graduates from Incubator</a>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<a href="#22+December+2006%3A+Release+1.1.0+available">22 December 2006: Release 1.1.0 available</a>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<a href="#15+August+2006%3A+Solr+at+ApacheCon+US">15 August 2006: Solr at ApacheCon US</a>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<a href="#21+April+2006%3A+Solr+at+ApacheCon">21 April 2006: Solr at ApacheCon</a>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<a href="#21+February+2006%3A+nightly+builds">21 February 2006: nightly builds</a>
|
||||||
|
</li>
|
||||||
|
<li>
|
||||||
|
<a href="#17+January+2006%3A+Solr+Joins+Apache+Incubator">17 January 2006: Solr Joins Apache Incubator</a>
|
||||||
|
</li>
|
||||||
|
</ul>
|
||||||
|
|
||||||
|
</body>
|
||||||
|
</html>
|
|
@ -0,0 +1,12 @@
|
||||||
|
<html>
|
||||||
|
<head>
|
||||||
|
<title>Welcome to Solr</title>
|
||||||
|
</head>
|
||||||
|
<body>
|
||||||
|
<p>
|
||||||
|
Here is some text
|
||||||
|
</p>
|
||||||
|
<div>Here is some text in a div</div>
|
||||||
|
<div>This has a <a href="http://www.apache.org">link</a>.</div>
|
||||||
|
</body>
|
||||||
|
</html>
|
Binary file not shown.
|
@ -0,0 +1,20 @@
|
||||||
|
# Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
|
# contributor license agreements. See the NOTICE file distributed with
|
||||||
|
# this work for additional information regarding copyright ownership.
|
||||||
|
# The ASF licenses this file to You under the Apache License, Version 2.0
|
||||||
|
# (the "License"); you may not use this file except in compliance with
|
||||||
|
# the License. You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
#use a protected word file to avoid stemming two
|
||||||
|
#unrelated words to the same base word.
|
||||||
|
#to test, we will use words that would normally obviously be stemmed.
|
||||||
|
cats
|
||||||
|
ridding
|
|
@ -0,0 +1,467 @@
|
||||||
|
<?xml version="1.0" ?>
|
||||||
|
<!--
|
||||||
|
Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
|
contributor license agreements. See the NOTICE file distributed with
|
||||||
|
this work for additional information regarding copyright ownership.
|
||||||
|
The ASF licenses this file to You under the Apache License, Version 2.0
|
||||||
|
(the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
<!-- The Solr schema file. This file should be named "schema.xml" and
|
||||||
|
should be located where the classloader for the Solr webapp can find it.
|
||||||
|
|
||||||
|
This schema is used for testing, and as such has everything and the
|
||||||
|
kitchen sink thrown in. See example/solr/conf/schema.xml for a
|
||||||
|
more concise example.
|
||||||
|
|
||||||
|
$Id: schema.xml 382610 2006-03-03 01:43:03Z yonik $
|
||||||
|
$Source: /cvs/main/searching/solr-configs/test/WEB-INF/classes/schema.xml,v $
|
||||||
|
$Name: $
|
||||||
|
-->
|
||||||
|
|
||||||
|
<schema name="test" version="1.0">
|
||||||
|
<types>
|
||||||
|
|
||||||
|
<!-- field type definitions... note that the "name" attribute is
|
||||||
|
just a label to be used by field definitions. The "class"
|
||||||
|
attribute and any other attributes determine the real type and
|
||||||
|
behavior of the fieldtype.
|
||||||
|
-->
|
||||||
|
|
||||||
|
<!-- numeric field types that store and index the text
|
||||||
|
value verbatim (and hence don't sort correctly or support range queries.)
|
||||||
|
These are provided more for backward compatability, allowing one
|
||||||
|
to create a schema that matches an existing lucene index.
|
||||||
|
-->
|
||||||
|
<fieldType name="integer" class="solr.IntField"/>
|
||||||
|
<fieldType name="long" class="solr.LongField"/>
|
||||||
|
<fieldtype name="float" class="solr.FloatField"/>
|
||||||
|
<fieldType name="double" class="solr.DoubleField"/>
|
||||||
|
|
||||||
|
<!-- numeric field types that manipulate the value into
|
||||||
|
a string value that isn't human readable in it's internal form,
|
||||||
|
but sorts correctly and supports range queries.
|
||||||
|
|
||||||
|
If sortMissingLast="true" then a sort on this field will cause documents
|
||||||
|
without the field to come after documents with the field,
|
||||||
|
regardless of the requested sort order.
|
||||||
|
If sortMissingFirst="true" then a sort on this field will cause documents
|
||||||
|
without the field to come before documents with the field,
|
||||||
|
regardless of the requested sort order.
|
||||||
|
If sortMissingLast="false" and sortMissingFirst="false" (the default),
|
||||||
|
then default lucene sorting will be used which places docs without the field
|
||||||
|
first in an ascending sort and last in a descending sort.
|
||||||
|
-->
|
||||||
|
<fieldtype name="sint" class="solr.SortableIntField" sortMissingLast="true"/>
|
||||||
|
<fieldtype name="slong" class="solr.SortableLongField" sortMissingLast="true"/>
|
||||||
|
<fieldtype name="sfloat" class="solr.SortableFloatField" sortMissingLast="true"/>
|
||||||
|
<fieldtype name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true"/>
|
||||||
|
|
||||||
|
<!-- bcd versions of sortable numeric type may provide smaller
|
||||||
|
storage space and support very large numbers.
|
||||||
|
-->
|
||||||
|
<fieldtype name="bcdint" class="solr.BCDIntField" sortMissingLast="true"/>
|
||||||
|
<fieldtype name="bcdlong" class="solr.BCDLongField" sortMissingLast="true"/>
|
||||||
|
<fieldtype name="bcdstr" class="solr.BCDStrField" sortMissingLast="true"/>
|
||||||
|
|
||||||
|
<!-- Field type demonstrating an Analyzer failure -->
|
||||||
|
<fieldtype name="failtype1" class="solr.TextField">
|
||||||
|
<analyzer type="index">
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
|
||||||
|
<filter class="solr.LowerCaseFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
|
||||||
|
<!-- Demonstrating ignoreCaseChange -->
|
||||||
|
<fieldtype name="wdf_nocase" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
|
||||||
|
<filter class="solr.LowerCaseFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
|
||||||
|
<fieldtype name="wdf_preserve" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
|
||||||
|
<filter class="solr.LowerCaseFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
|
||||||
|
|
||||||
|
<!-- HighlitText optimizes storage for (long) columns which will be highlit -->
|
||||||
|
<fieldtype name="highlittext" class="solr.TextField" compressThreshold="345" />
|
||||||
|
|
||||||
|
<fieldtype name="boolean" class="solr.BoolField" sortMissingLast="true"/>
|
||||||
|
<fieldtype name="string" class="solr.StrField" sortMissingLast="true"/>
|
||||||
|
|
||||||
|
<!-- format for date is 1995-12-31T23:59:59.999Z and only the fractional
|
||||||
|
seconds part (.999) is optional.
|
||||||
|
-->
|
||||||
|
<fieldtype name="date" class="solr.DateField" sortMissingLast="true"/>
|
||||||
|
|
||||||
|
<!-- solr.TextField allows the specification of custom
|
||||||
|
text analyzers specified as a tokenizer and a list
|
||||||
|
of token filters.
|
||||||
|
-->
|
||||||
|
<fieldtype name="text" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.StandardTokenizerFactory"/>
|
||||||
|
<filter class="solr.StandardFilterFactory"/>
|
||||||
|
<filter class="solr.LowerCaseFilterFactory"/>
|
||||||
|
<filter class="solr.StopFilterFactory"/>
|
||||||
|
<!-- lucene PorterStemFilterFactory deprecated
|
||||||
|
<filter class="solr.PorterStemFilterFactory"/>
|
||||||
|
-->
|
||||||
|
<filter class="solr.EnglishPorterFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
|
||||||
|
|
||||||
|
<fieldtype name="nametext" class="solr.TextField">
|
||||||
|
<analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
|
||||||
|
</fieldtype>
|
||||||
|
|
||||||
|
<fieldtype name="teststop" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
|
||||||
|
<filter class="solr.StandardFilterFactory"/>
|
||||||
|
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
|
||||||
|
<!-- fieldtypes in this section isolate tokenizers and tokenfilters for testing -->
|
||||||
|
<fieldtype name="lowertok" class="solr.TextField">
|
||||||
|
<analyzer><tokenizer class="solr.LowerCaseTokenizerFactory"/></analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="keywordtok" class="solr.TextField">
|
||||||
|
<analyzer><tokenizer class="solr.KeywordTokenizerFactory"/></analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="standardtok" class="solr.TextField">
|
||||||
|
<analyzer><tokenizer class="solr.StandardTokenizerFactory"/></analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="lettertok" class="solr.TextField">
|
||||||
|
<analyzer><tokenizer class="solr.LetterTokenizerFactory"/></analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="whitetok" class="solr.TextField">
|
||||||
|
<analyzer><tokenizer class="solr.WhitespaceTokenizerFactory"/></analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="HTMLstandardtok" class="solr.TextField">
|
||||||
|
<analyzer><tokenizer class="solr.HTMLStripStandardTokenizerFactory"/></analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="HTMLwhitetok" class="solr.TextField">
|
||||||
|
<analyzer><tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/></analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="standardtokfilt" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.StandardTokenizerFactory"/>
|
||||||
|
<filter class="solr.StandardFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="standardfilt" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.StandardFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="lowerfilt" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.LowerCaseFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="patternreplacefilt" class="solr.TextField">
|
||||||
|
<analyzer type="index">
|
||||||
|
<tokenizer class="solr.KeywordTokenizerFactory"/>
|
||||||
|
<filter class="solr.PatternReplaceFilterFactory"
|
||||||
|
pattern="([^a-zA-Z])" replacement="_" replace="all"
|
||||||
|
/>
|
||||||
|
</analyzer>
|
||||||
|
<analyzer type="query">
|
||||||
|
<tokenizer class="solr.KeywordTokenizerFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="porterfilt" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.PorterStemFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<!-- fieldtype name="snowballfilt" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.SnowballPorterFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype -->
|
||||||
|
<fieldtype name="engporterfilt" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.EnglishPorterFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="custengporterfilt" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="stopfilt" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.StopFilterFactory" ignoreCase="true"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="custstopfilt" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
<fieldtype name="lengthfilt" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.LengthFilterFactory" min="2" max="5"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
|
||||||
|
<fieldtype name="subword" class="solr.TextField" multiValued="true" positionIncrementGap="100">
|
||||||
|
<analyzer type="index">
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
|
||||||
|
<filter class="solr.LowerCaseFilterFactory"/>
|
||||||
|
<filter class="solr.StopFilterFactory"/>
|
||||||
|
<filter class="solr.EnglishPorterFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
<analyzer type="query">
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
|
||||||
|
<filter class="solr.LowerCaseFilterFactory"/>
|
||||||
|
<filter class="solr.StopFilterFactory"/>
|
||||||
|
<filter class="solr.EnglishPorterFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
|
||||||
|
<!-- more flexible in matching skus, but more chance of a false match -->
|
||||||
|
<fieldtype name="skutype1" class="solr.TextField">
|
||||||
|
<analyzer type="index">
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
|
||||||
|
<filter class="solr.LowerCaseFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
<analyzer type="query">
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
|
||||||
|
<filter class="solr.LowerCaseFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
|
||||||
|
<!-- less flexible in matching skus, but less chance of a false match -->
|
||||||
|
<fieldtype name="skutype2" class="solr.TextField">
|
||||||
|
<analyzer type="index">
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
|
||||||
|
<filter class="solr.LowerCaseFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
<analyzer type="query">
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
|
||||||
|
<filter class="solr.LowerCaseFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
|
||||||
|
<!-- less flexible in matching skus, but less chance of a false match -->
|
||||||
|
<fieldtype name="syn" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter name="syn" class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
|
||||||
|
<!-- Demonstrates How RemoveDuplicatesTokenFilter makes stemmed
|
||||||
|
synonyms "better"
|
||||||
|
-->
|
||||||
|
<fieldtype name="dedup" class="solr.TextField">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.SynonymFilterFactory"
|
||||||
|
synonyms="synonyms.txt" expand="true" />
|
||||||
|
<filter class="solr.EnglishPorterFilterFactory"/>
|
||||||
|
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
|
||||||
|
<fieldtype name="unstored" class="solr.StrField" indexed="true" stored="false"/>
|
||||||
|
|
||||||
|
|
||||||
|
<fieldtype name="textgap" class="solr.TextField" multiValued="true" positionIncrementGap="100">
|
||||||
|
<analyzer>
|
||||||
|
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
|
||||||
|
<filter class="solr.LowerCaseFilterFactory"/>
|
||||||
|
</analyzer>
|
||||||
|
</fieldtype>
|
||||||
|
|
||||||
|
</types>
|
||||||
|
|
||||||
|
|
||||||
|
<fields>
|
||||||
|
<field name="id" type="integer" indexed="true" stored="true" multiValued="false" required="false"/>
|
||||||
|
<field name="name" type="nametext" indexed="true" stored="true"/>
|
||||||
|
<field name="text" type="text" indexed="true" stored="false"/>
|
||||||
|
<field name="subject" type="text" indexed="true" stored="true"/>
|
||||||
|
<field name="title" type="nametext" indexed="true" stored="true"/>
|
||||||
|
<field name="weight" type="float" indexed="true" stored="true"/>
|
||||||
|
<field name="bday" type="date" indexed="true" stored="true"/>
|
||||||
|
|
||||||
|
<field name="title_stemmed" type="text" indexed="true" stored="false"/>
|
||||||
|
<field name="title_lettertok" type="lettertok" indexed="true" stored="false"/>
|
||||||
|
|
||||||
|
<field name="syn" type="syn" indexed="true" stored="true"/>
|
||||||
|
|
||||||
|
<!-- to test property inheritance and overriding -->
|
||||||
|
<field name="shouldbeunstored" type="unstored" />
|
||||||
|
<field name="shouldbestored" type="unstored" stored="true"/>
|
||||||
|
<field name="shouldbeunindexed" type="unstored" indexed="false" stored="true"/>
|
||||||
|
|
||||||
|
|
||||||
|
<!-- test different combinations of indexed and stored -->
|
||||||
|
<field name="bind" type="boolean" indexed="true" stored="false"/>
|
||||||
|
<field name="bsto" type="boolean" indexed="false" stored="true"/>
|
||||||
|
<field name="bindsto" type="boolean" indexed="true" stored="true"/>
|
||||||
|
<field name="isto" type="integer" indexed="false" stored="true"/>
|
||||||
|
<field name="iind" type="integer" indexed="true" stored="false"/>
|
||||||
|
<field name="ssto" type="string" indexed="false" stored="true"/>
|
||||||
|
<field name="sind" type="string" indexed="true" stored="false"/>
|
||||||
|
<field name="sindsto" type="string" indexed="true" stored="true"/>
|
||||||
|
|
||||||
|
<!-- test combinations of term vector settings -->
|
||||||
|
<field name="test_basictv" type="text" termVectors="true"/>
|
||||||
|
<field name="test_notv" type="text" termVectors="false"/>
|
||||||
|
<field name="test_postv" type="text" termVectors="true" termPositions="true"/>
|
||||||
|
<field name="test_offtv" type="text" termVectors="true" termOffsets="true"/>
|
||||||
|
<field name="test_posofftv" type="text" termVectors="true"
|
||||||
|
termPositions="true" termOffsets="true"/>
|
||||||
|
|
||||||
|
<!-- test highlit field settings -->
|
||||||
|
<field name="test_hlt" type="highlittext" indexed="true" compressed="true"/>
|
||||||
|
<field name="test_hlt_off" type="highlittext" indexed="true" compressed="false"/>
|
||||||
|
|
||||||
|
<!-- fields to test individual tokenizers and tokenfilters -->
|
||||||
|
<field name="teststop" type="teststop" indexed="true" stored="true"/>
|
||||||
|
<field name="lowertok" type="lowertok" indexed="true" stored="true"/>
|
||||||
|
<field name="keywordtok" type="keywordtok" indexed="true" stored="true"/>
|
||||||
|
<field name="standardtok" type="standardtok" indexed="true" stored="true"/>
|
||||||
|
<field name="HTMLstandardtok" type="HTMLstandardtok" indexed="true" stored="true"/>
|
||||||
|
<field name="lettertok" type="lettertok" indexed="true" stored="true"/>
|
||||||
|
<field name="whitetok" type="whitetok" indexed="true" stored="true"/>
|
||||||
|
<field name="HTMLwhitetok" type="HTMLwhitetok" indexed="true" stored="true"/>
|
||||||
|
<field name="standardtokfilt" type="standardtokfilt" indexed="true" stored="true"/>
|
||||||
|
<field name="standardfilt" type="standardfilt" indexed="true" stored="true"/>
|
||||||
|
<field name="lowerfilt" type="lowerfilt" indexed="true" stored="true"/>
|
||||||
|
<field name="patternreplacefilt" type="patternreplacefilt" indexed="true" stored="true"/>
|
||||||
|
<field name="porterfilt" type="porterfilt" indexed="true" stored="true"/>
|
||||||
|
<field name="engporterfilt" type="engporterfilt" indexed="true" stored="true"/>
|
||||||
|
<field name="custengporterfilt" type="custengporterfilt" indexed="true" stored="true"/>
|
||||||
|
<field name="stopfilt" type="stopfilt" indexed="true" stored="true"/>
|
||||||
|
<field name="custstopfilt" type="custstopfilt" indexed="true" stored="true"/>
|
||||||
|
<field name="lengthfilt" type="lengthfilt" indexed="true" stored="true"/>
|
||||||
|
<field name="dedup" type="dedup" indexed="true" stored="true"/>
|
||||||
|
<field name="wdf_nocase" type="wdf_nocase" indexed="true" stored="true"/>
|
||||||
|
<field name="wdf_preserve" type="wdf_preserve" indexed="true" stored="true"/>
|
||||||
|
|
||||||
|
<field name="numberpartfail" type="failtype1" indexed="true" stored="true"/>
|
||||||
|
|
||||||
|
<field name="nullfirst" type="string" indexed="true" stored="true" sortMissingFirst="true"/>
|
||||||
|
|
||||||
|
<field name="subword" type="subword" indexed="true" stored="true"/>
|
||||||
|
<field name="sku1" type="skutype1" indexed="true" stored="true"/>
|
||||||
|
<field name="sku2" type="skutype2" indexed="true" stored="true"/>
|
||||||
|
|
||||||
|
<field name="textgap" type="textgap" indexed="true" stored="true"/>
|
||||||
|
|
||||||
|
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
|
||||||
|
<field name="multiDefault" type="string" indexed="true" stored="true" default="muLti-Default" multiValued="true"/>
|
||||||
|
<field name="intDefault" type="sint" indexed="true" stored="true" default="42" multiValued="false"/>
|
||||||
|
|
||||||
|
<field name="extractedDate" type="date" indexed="true" stored="true" multiValued="true"/>
|
||||||
|
<field name="extractedContent" type="text" indexed="true" stored="true" multiValued="true"/>
|
||||||
|
<field name="extractedProducer" type="text" indexed="true" stored="true" multiValued="true"/>
|
||||||
|
<field name="extractedCreator" type="text" indexed="true" stored="true" multiValued="true"/>
|
||||||
|
<field name="extractedKeywords" type="text" indexed="true" stored="true" multiValued="true"/>
|
||||||
|
<field name="extractedAuthor" type="text" indexed="true" stored="true" multiValued="true"/>
|
||||||
|
<field name="extractedLanguage" type="string" indexed="true" stored="true" multiValued="true"/>
|
||||||
|
<field name="resourceName" type="string" indexed="true" stored="true" multiValued="true"/>
|
||||||
|
|
||||||
|
|
||||||
|
<!-- Dynamic field definitions. If a field name is not found, dynamicFields
|
||||||
|
will be used if the name matches any of the patterns.
|
||||||
|
RESTRICTION: the glob-like pattern in the name attribute must have
|
||||||
|
a "*" only at the start or the end.
|
||||||
|
EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
|
||||||
|
Longer patterns will be matched first. if equal size patterns
|
||||||
|
both match, the first appearing in the schema will be used.
|
||||||
|
-->
|
||||||
|
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
|
||||||
|
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
|
||||||
|
<dynamicField name="*_s1" type="string" indexed="true" stored="true" multiValued="false"/>
|
||||||
|
<dynamicField name="*_l" type="slong" indexed="true" stored="true"/>
|
||||||
|
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
|
||||||
|
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
|
||||||
|
<dynamicField name="*_f" type="sfloat" indexed="true" stored="true"/>
|
||||||
|
<dynamicField name="*_d" type="sdouble" indexed="true" stored="true"/>
|
||||||
|
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
|
||||||
|
<dynamicField name="*_bcd" type="bcdstr" indexed="true" stored="true"/>
|
||||||
|
|
||||||
|
<dynamicField name="*_sI" type="string" indexed="true" stored="false"/>
|
||||||
|
<dynamicField name="*_sS" type="string" indexed="false" stored="true"/>
|
||||||
|
<dynamicField name="t_*" type="text" indexed="true" stored="true"/>
|
||||||
|
<dynamicField name="tv_*" type="text" indexed="true" stored="true"
|
||||||
|
termVectors="true" termPositions="true" termOffsets="true"/>
|
||||||
|
|
||||||
|
<dynamicField name="stream_*" type="text" indexed="true" stored="true"/>
|
||||||
|
<dynamicField name="Content*" type="text" indexed="true" stored="true"/>
|
||||||
|
|
||||||
|
|
||||||
|
<!-- special fields for dynamic copyField test -->
|
||||||
|
<dynamicField name="dynamic_*" type="string" indexed="true" stored="true"/>
|
||||||
|
<dynamicField name="*_dynamic" type="string" indexed="true" stored="true"/>
|
||||||
|
|
||||||
|
<!-- for testing to ensure that longer patterns are matched first -->
|
||||||
|
<dynamicField name="*aa" type="string" indexed="true" stored="true"/>
|
||||||
|
<dynamicField name="*aaa" type="integer" indexed="false" stored="true"/>
|
||||||
|
|
||||||
|
<!-- ignored because not stored or indexed -->
|
||||||
|
<dynamicField name="*_ignored" type="text" indexed="false" stored="false"/>
|
||||||
|
|
||||||
|
</fields>
|
||||||
|
|
||||||
|
<defaultSearchField>text</defaultSearchField>
|
||||||
|
<uniqueKey>id</uniqueKey>
|
||||||
|
|
||||||
|
<!-- copyField commands copy one field to another at the time a document
|
||||||
|
is added to the index. It's used either to index the same field different
|
||||||
|
ways, or to add multiple fields to the same field for easier/faster searching.
|
||||||
|
-->
|
||||||
|
<copyField source="title" dest="title_stemmed"/>
|
||||||
|
<copyField source="title" dest="title_lettertok"/>
|
||||||
|
|
||||||
|
<copyField source="title" dest="text"/>
|
||||||
|
<copyField source="subject" dest="text"/>
|
||||||
|
|
||||||
|
<copyField source="*_t" dest="text"/>
|
||||||
|
|
||||||
|
<!-- dynamic destination -->
|
||||||
|
<copyField source="*_dynamic" dest="dynamic_*"/>
|
||||||
|
|
||||||
|
|
||||||
|
</schema>
|
|
@ -0,0 +1,359 @@
|
||||||
|
<?xml version="1.0" ?>
|
||||||
|
|
||||||
|
<!--
|
||||||
|
Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
|
contributor license agreements. See the NOTICE file distributed with
|
||||||
|
this work for additional information regarding copyright ownership.
|
||||||
|
The ASF licenses this file to You under the Apache License, Version 2.0
|
||||||
|
(the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
<!-- $Id: solrconfig.xml 382610 2006-03-03 01:43:03Z yonik $
|
||||||
|
$Source$
|
||||||
|
$Name$
|
||||||
|
-->
|
||||||
|
|
||||||
|
<config>
|
||||||
|
|
||||||
|
<jmx />
|
||||||
|
|
||||||
|
<!-- Used to specify an alternate directory to hold all index data.
|
||||||
|
It defaults to "index" if not present, and should probably
|
||||||
|
not be changed if replication is in use. -->
|
||||||
|
<dataDir>${solr.data.dir:./solr/data}</dataDir>
|
||||||
|
|
||||||
|
<indexDefaults>
|
||||||
|
<!-- Values here affect all index writers and act as a default
|
||||||
|
unless overridden. -->
|
||||||
|
<!-- Values here affect all index writers and act as a default unless overridden. -->
|
||||||
|
<useCompoundFile>false</useCompoundFile>
|
||||||
|
<mergeFactor>10</mergeFactor>
|
||||||
|
<!-- If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will flush based on whichever limit is hit first.
|
||||||
|
-->
|
||||||
|
<!--<maxBufferedDocs>1000</maxBufferedDocs>-->
|
||||||
|
<!-- Tell Lucene when to flush documents to disk.
|
||||||
|
Giving Lucene more memory for indexing means faster indexing at the cost of more RAM
|
||||||
|
|
||||||
|
If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will flush based on whichever limit is hit first.
|
||||||
|
|
||||||
|
-->
|
||||||
|
<ramBufferSizeMB>32</ramBufferSizeMB>
|
||||||
|
<maxMergeDocs>2147483647</maxMergeDocs>
|
||||||
|
<maxFieldLength>10000</maxFieldLength>
|
||||||
|
<writeLockTimeout>1000</writeLockTimeout>
|
||||||
|
<commitLockTimeout>10000</commitLockTimeout>
|
||||||
|
|
||||||
|
<!--
|
||||||
|
Expert: Turn on Lucene's auto commit capability.
|
||||||
|
|
||||||
|
NOTE: Despite the name, this value does not have any relation to Solr's autoCommit functionality
|
||||||
|
|
||||||
|
-->
|
||||||
|
<luceneAutoCommit>false</luceneAutoCommit>
|
||||||
|
|
||||||
|
<!--
|
||||||
|
Expert:
|
||||||
|
The Merge Policy in Lucene controls how merging is handled by Lucene. The default in 2.3 is the LogByteSizeMergePolicy, previous
|
||||||
|
versions used LogDocMergePolicy.
|
||||||
|
|
||||||
|
LogByteSizeMergePolicy chooses segments to merge based on their size. The Lucene 2.2 default, LogDocMergePolicy chose when
|
||||||
|
to merge based on the number of documents.
|
||||||
|
|
||||||
|
Other implementations of MergePolicy must have a no-argument constructor
|
||||||
|
-->
|
||||||
|
<mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
|
||||||
|
|
||||||
|
<!--
|
||||||
|
Expert:
|
||||||
|
The Merge Scheduler in Lucene controls how merges are performed. The ConcurrentMergeScheduler (Lucene 2.3 default)
|
||||||
|
can perform merges in the background using separate threads. The SerialMergeScheduler (Lucene 2.2 default) does not.
|
||||||
|
-->
|
||||||
|
<mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>
|
||||||
|
<!-- these are global... can't currently override per index -->
|
||||||
|
<writeLockTimeout>1000</writeLockTimeout>
|
||||||
|
<commitLockTimeout>10000</commitLockTimeout>
|
||||||
|
|
||||||
|
<lockType>single</lockType>
|
||||||
|
</indexDefaults>
|
||||||
|
|
||||||
|
<mainIndex>
|
||||||
|
<!-- lucene options specific to the main on-disk lucene index -->
|
||||||
|
<useCompoundFile>false</useCompoundFile>
|
||||||
|
<mergeFactor>10</mergeFactor>
|
||||||
|
<ramBufferSizeMB>32</ramBufferSizeMB>
|
||||||
|
<maxMergeDocs>2147483647</maxMergeDocs>
|
||||||
|
<maxFieldLength>10000</maxFieldLength>
|
||||||
|
|
||||||
|
<unlockOnStartup>true</unlockOnStartup>
|
||||||
|
</mainIndex>
|
||||||
|
|
||||||
|
<updateHandler class="solr.DirectUpdateHandler2">
|
||||||
|
|
||||||
|
<!-- autocommit pending docs if certain criteria are met
|
||||||
|
<autoCommit>
|
||||||
|
<maxDocs>10000</maxDocs>
|
||||||
|
<maxTime>3600000</maxTime>
|
||||||
|
</autoCommit>
|
||||||
|
-->
|
||||||
|
<!-- represents a lower bound on the frequency that commits may
|
||||||
|
occur (in seconds). NOTE: not yet implemented
|
||||||
|
|
||||||
|
<commitIntervalLowerBound>0</commitIntervalLowerBound>
|
||||||
|
-->
|
||||||
|
|
||||||
|
<!-- The RunExecutableListener executes an external command.
|
||||||
|
exe - the name of the executable to run
|
||||||
|
dir - dir to use as the current working directory. default="."
|
||||||
|
wait - the calling thread waits until the executable returns. default="true"
|
||||||
|
args - the arguments to pass to the program. default=nothing
|
||||||
|
env - environment variables to set. default=nothing
|
||||||
|
-->
|
||||||
|
<!-- A postCommit event is fired after every commit
|
||||||
|
<listener event="postCommit" class="solr.RunExecutableListener">
|
||||||
|
<str name="exe">/var/opt/resin3/__PORT__/scripts/solr/snapshooter</str>
|
||||||
|
<str name="dir">/var/opt/resin3/__PORT__</str>
|
||||||
|
<bool name="wait">true</bool>
|
||||||
|
<arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
|
||||||
|
<arr name="env"> <str>MYVAR=val1</str> </arr>
|
||||||
|
</listener>
|
||||||
|
-->
|
||||||
|
|
||||||
|
|
||||||
|
</updateHandler>
|
||||||
|
|
||||||
|
|
||||||
|
<query>
|
||||||
|
<!-- Maximum number of clauses in a boolean query... can affect
|
||||||
|
range or wildcard queries that expand to big boolean
|
||||||
|
queries. An exception is thrown if exceeded.
|
||||||
|
-->
|
||||||
|
<maxBooleanClauses>1024</maxBooleanClauses>
|
||||||
|
|
||||||
|
|
||||||
|
<!-- Cache specification for Filters or DocSets - unordered set of *all* documents
|
||||||
|
that match a particular query.
|
||||||
|
-->
|
||||||
|
<filterCache
|
||||||
|
class="solr.search.LRUCache"
|
||||||
|
size="512"
|
||||||
|
initialSize="512"
|
||||||
|
autowarmCount="256"/>
|
||||||
|
|
||||||
|
<queryResultCache
|
||||||
|
class="solr.search.LRUCache"
|
||||||
|
size="512"
|
||||||
|
initialSize="512"
|
||||||
|
autowarmCount="1024"/>
|
||||||
|
|
||||||
|
<documentCache
|
||||||
|
class="solr.search.LRUCache"
|
||||||
|
size="512"
|
||||||
|
initialSize="512"
|
||||||
|
autowarmCount="0"/>
|
||||||
|
|
||||||
|
<!-- If true, stored fields that are not requested will be loaded lazily.
|
||||||
|
-->
|
||||||
|
<enableLazyFieldLoading>true</enableLazyFieldLoading>
|
||||||
|
|
||||||
|
<!--
|
||||||
|
|
||||||
|
<cache name="myUserCache"
|
||||||
|
class="solr.search.LRUCache"
|
||||||
|
size="4096"
|
||||||
|
initialSize="1024"
|
||||||
|
autowarmCount="1024"
|
||||||
|
regenerator="MyRegenerator"
|
||||||
|
/>
|
||||||
|
-->
|
||||||
|
|
||||||
|
|
||||||
|
<useFilterForSortedQuery>true</useFilterForSortedQuery>
|
||||||
|
|
||||||
|
<queryResultWindowSize>10</queryResultWindowSize>
|
||||||
|
|
||||||
|
<!-- set maxSize artificially low to exercise both types of sets -->
|
||||||
|
<HashDocSet maxSize="3" loadFactor="0.75"/>
|
||||||
|
|
||||||
|
|
||||||
|
<!-- boolToFilterOptimizer converts boolean clauses with zero boost
|
||||||
|
into cached filters if the number of docs selected by the clause exceeds
|
||||||
|
the threshold (represented as a fraction of the total index)
|
||||||
|
-->
|
||||||
|
<boolTofilterOptimizer enabled="false" cacheSize="32" threshold=".05"/>
|
||||||
|
|
||||||
|
|
||||||
|
<!-- a newSearcher event is fired whenever a new searcher is being prepared
|
||||||
|
and there is a current searcher handling requests (aka registered). -->
|
||||||
|
<!-- QuerySenderListener takes an array of NamedList and executes a
|
||||||
|
local query request for each NamedList in sequence. -->
|
||||||
|
<!--
|
||||||
|
<listener event="newSearcher" class="solr.QuerySenderListener">
|
||||||
|
<arr name="queries">
|
||||||
|
<lst> <str name="q">solr</str> <str name="start">0</str> <str name="rows">10</str> </lst>
|
||||||
|
<lst> <str name="q">rocks</str> <str name="start">0</str> <str name="rows">10</str> </lst>
|
||||||
|
</arr>
|
||||||
|
</listener>
|
||||||
|
-->
|
||||||
|
|
||||||
|
<!-- a firstSearcher event is fired whenever a new searcher is being
|
||||||
|
prepared but there is no current registered searcher to handle
|
||||||
|
requests or to gain prewarming data from. -->
|
||||||
|
<!--
|
||||||
|
<listener event="firstSearcher" class="solr.QuerySenderListener">
|
||||||
|
<arr name="queries">
|
||||||
|
<lst> <str name="q">fast_warm</str> <str name="start">0</str> <str name="rows">10</str> </lst>
|
||||||
|
</arr>
|
||||||
|
</listener>
|
||||||
|
-->
|
||||||
|
|
||||||
|
|
||||||
|
</query>
|
||||||
|
|
||||||
|
|
||||||
|
<!-- An alternate set representation that uses an integer hash to store filters (sets of docids).
|
||||||
|
If the set cardinality <= maxSize elements, then HashDocSet will be used instead of the bitset
|
||||||
|
based HashBitset. -->
|
||||||
|
|
||||||
|
<!-- requestHandler plugins... incoming queries will be dispatched to the
|
||||||
|
correct handler based on the qt (query type) param matching the
|
||||||
|
name of registered handlers.
|
||||||
|
The "standard" request handler is the default and will be used if qt
|
||||||
|
is not specified in the request.
|
||||||
|
-->
|
||||||
|
<requestHandler name="standard" class="solr.StandardRequestHandler">
|
||||||
|
<bool name="httpCaching">true</bool>
|
||||||
|
</requestHandler>
|
||||||
|
<requestHandler name="dismaxOldStyleDefaults"
|
||||||
|
class="solr.DisMaxRequestHandler" >
|
||||||
|
<!-- for historic reasons, DisMaxRequestHandler will use all of
|
||||||
|
its init params as "defaults" if there is no "defaults" list
|
||||||
|
specified
|
||||||
|
-->
|
||||||
|
<float name="tie">0.01</float>
|
||||||
|
<str name="qf">
|
||||||
|
text^0.5 features_t^1.0 subject^1.4 title_stemmed^2.0
|
||||||
|
</str>
|
||||||
|
<str name="pf">
|
||||||
|
text^0.2 features_t^1.1 subject^1.4 title_stemmed^2.0 title^1.5
|
||||||
|
</str>
|
||||||
|
<str name="bf">
|
||||||
|
ord(weight)^0.5 recip(rord(iind),1,1000,1000)^0.3
|
||||||
|
</str>
|
||||||
|
<str name="mm">
|
||||||
|
3<-1 5<-2 6<90%
|
||||||
|
</str>
|
||||||
|
<int name="ps">100</int>
|
||||||
|
</requestHandler>
|
||||||
|
<requestHandler name="dismax" class="solr.DisMaxRequestHandler" >
|
||||||
|
<lst name="defaults">
|
||||||
|
<str name="q.alt">*:*</str>
|
||||||
|
<float name="tie">0.01</float>
|
||||||
|
<str name="qf">
|
||||||
|
text^0.5 features_t^1.0 subject^1.4 title_stemmed^2.0
|
||||||
|
</str>
|
||||||
|
<str name="pf">
|
||||||
|
text^0.2 features_t^1.1 subject^1.4 title_stemmed^2.0 title^1.5
|
||||||
|
</str>
|
||||||
|
<str name="bf">
|
||||||
|
ord(weight)^0.5 recip(rord(iind),1,1000,1000)^0.3
|
||||||
|
</str>
|
||||||
|
<str name="mm">
|
||||||
|
3<-1 5<-2 6<90%
|
||||||
|
</str>
|
||||||
|
<int name="ps">100</int>
|
||||||
|
</lst>
|
||||||
|
</requestHandler>
|
||||||
|
<requestHandler name="old" class="solr.tst.OldRequestHandler" >
|
||||||
|
<int name="myparam">1000</int>
|
||||||
|
<float name="ratio">1.4142135</float>
|
||||||
|
<arr name="myarr"><int>1</int><int>2</int></arr>
|
||||||
|
<str>foo</str>
|
||||||
|
</requestHandler>
|
||||||
|
<requestHandler name="oldagain" class="solr.tst.OldRequestHandler" >
|
||||||
|
<lst name="lst1"> <str name="op">sqrt</str> <int name="val">2</int> </lst>
|
||||||
|
<lst name="lst2"> <str name="op">log</str> <float name="val">10</float> </lst>
|
||||||
|
</requestHandler>
|
||||||
|
|
||||||
|
<requestHandler name="test" class="solr.tst.TestRequestHandler" />
|
||||||
|
|
||||||
|
<!-- test query parameter defaults -->
|
||||||
|
<requestHandler name="defaults" class="solr.StandardRequestHandler">
|
||||||
|
<lst name="defaults">
|
||||||
|
<int name="rows">4</int>
|
||||||
|
<bool name="hl">true</bool>
|
||||||
|
<str name="hl.fl">text,name,subject,title,whitetok</str>
|
||||||
|
</lst>
|
||||||
|
</requestHandler>
|
||||||
|
|
||||||
|
<!-- test query parameter defaults -->
|
||||||
|
<requestHandler name="lazy" class="solr.StandardRequestHandler" startup="lazy">
|
||||||
|
<lst name="defaults">
|
||||||
|
<int name="rows">4</int>
|
||||||
|
<bool name="hl">true</bool>
|
||||||
|
<str name="hl.fl">text,name,subject,title,whitetok</str>
|
||||||
|
</lst>
|
||||||
|
</requestHandler>
|
||||||
|
|
||||||
|
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
|
||||||
|
<requestHandler name="/update/csv" class="solr.CSVRequestHandler" startup="lazy">
|
||||||
|
<bool name="httpCaching">false</bool>
|
||||||
|
</requestHandler>
|
||||||
|
|
||||||
|
<requestHandler name="/update/extract" class="org.apache.solr.handler.ExtractingRequestHandler"/>
|
||||||
|
|
||||||
|
|
||||||
|
<highlighting>
|
||||||
|
<!-- Configure the standard fragmenter -->
|
||||||
|
<fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
|
||||||
|
<lst name="defaults">
|
||||||
|
<int name="hl.fragsize">100</int>
|
||||||
|
</lst>
|
||||||
|
</fragmenter>
|
||||||
|
|
||||||
|
<fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
|
||||||
|
<lst name="defaults">
|
||||||
|
<int name="hl.fragsize">70</int>
|
||||||
|
</lst>
|
||||||
|
</fragmenter>
|
||||||
|
|
||||||
|
<!-- Configure the standard formatter -->
|
||||||
|
<formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
|
||||||
|
<lst name="defaults">
|
||||||
|
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
|
||||||
|
<str name="hl.simple.post"><![CDATA[</em>]]></str>
|
||||||
|
</lst>
|
||||||
|
</formatter>
|
||||||
|
</highlighting>
|
||||||
|
|
||||||
|
|
||||||
|
<!-- enable streaming for testing... -->
|
||||||
|
<requestDispatcher handleSelect="true" >
|
||||||
|
<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />
|
||||||
|
<httpCaching lastModifiedFrom="openTime" etagSeed="Solr" never304="false">
|
||||||
|
<cacheControl>max-age=30, public</cacheControl>
|
||||||
|
</httpCaching>
|
||||||
|
</requestDispatcher>
|
||||||
|
|
||||||
|
<admin>
|
||||||
|
<defaultQuery>solr</defaultQuery>
|
||||||
|
<gettableFiles>solrconfig.xml schema.xml admin-extra.html</gettableFiles>
|
||||||
|
</admin>
|
||||||
|
|
||||||
|
<!-- test getting system property -->
|
||||||
|
<propTest attr1="${solr.test.sys.prop1}-$${literal}"
|
||||||
|
attr2="${non.existent.sys.prop:default-from-config}">prefix-${solr.test.sys.prop2}-suffix</propTest>
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
</config>
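
The config above registers org.apache.solr.handler.ExtractingRequestHandler at /update/extract. Below is a minimal sketch of exercising that endpoint over HTTP from Java, assuming a local single-core Solr on the default port; the host, port, sample file name, and content type are illustrative, and the handler's extraction-specific request parameters are not shown because they may differ between releases.

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ExtractPostSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical endpoint; the path matches the requestHandler name declared above.
    URL url = new URL("http://localhost:8983/solr/update/extract?commit=true");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/pdf");
    InputStream in = new FileInputStream("example.pdf");  // hypothetical sample document
    OutputStream out = conn.getOutputStream();
    byte[] buf = new byte[8192];
    for (int n; (n = in.read(buf)) != -1; ) {
      out.write(buf, 0, n);                               // stream the raw binary body to Solr
    }
    out.close();
    in.close();
    System.out.println("HTTP status: " + conn.getResponseCode());
  }
}

The request body is just the raw binary document; identifying and parsing it is left to Tika on the server side.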
|
|
@ -0,0 +1,16 @@
|
||||||
|
# Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
|
# contributor license agreements. See the NOTICE file distributed with
|
||||||
|
# this work for additional information regarding copyright ownership.
|
||||||
|
# The ASF licenses this file to You under the Apache License, Version 2.0
|
||||||
|
# (the "License"); you may not use this file except in compliance with
|
||||||
|
# the License. You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
stopworda
|
||||||
|
stopwordb
|
|
@ -0,0 +1,22 @@
|
||||||
|
# Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
|
# contributor license agreements. See the NOTICE file distributed with
|
||||||
|
# this work for additional information regarding copyright ownership.
|
||||||
|
# The ASF licenses this file to You under the Apache License, Version 2.0
|
||||||
|
# (the "License"); you may not use this file except in compliance with
|
||||||
|
# the License. You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
a => aa
|
||||||
|
b => b1 b2
|
||||||
|
c => c1,c2
|
||||||
|
a\=>a => b\=>b
|
||||||
|
a\,a => b\,b
|
||||||
|
foo,bar,baz
|
||||||
|
|
||||||
|
Television,TV,Televisions
|
|
@ -0,0 +1,18 @@
|
||||||
|
Solr Version Control System
|
||||||
|
|
||||||
|
Overview
|
||||||
|
|
||||||
|
The Solr source code resides in the Apache Subversion (SVN) repository.
|
||||||
|
The command-line SVN client can be obtained here or as an optional package
|
||||||
|
for cygwin.
|
||||||
|
|
||||||
|
The TortoiseSVN GUI client for Windows can be obtained here. There
|
||||||
|
are also SVN plugins available for older versions of Eclipse and
|
||||||
|
IntelliJ IDEA that don't have subversion support already included.
|
||||||
|
|
||||||
|
-------------------------------
|
||||||
|
|
||||||
|
Note: This document is an excerpt from a document Licensed to the
|
||||||
|
Apache Software Foundation (ASF) under one or more contributor
|
||||||
|
license agreements. See the XML version (version_control.xml) for
|
||||||
|
more details.
|
|
@ -0,0 +1,42 @@
|
||||||
|
<?xml version="1.0"?>
|
||||||
|
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
|
||||||
|
<!--
|
||||||
|
Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
|
contributor license agreements. See the NOTICE file distributed with
|
||||||
|
this work for additional information regarding copyright ownership.
|
||||||
|
The ASF licenses this file to You under the Apache License, Version 2.0
|
||||||
|
(the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
<document>
|
||||||
|
|
||||||
|
<header>
|
||||||
|
<title>Solr Version Control System</title>
|
||||||
|
</header>
|
||||||
|
|
||||||
|
<body>
|
||||||
|
|
||||||
|
<section>
|
||||||
|
<title>Overview</title>
|
||||||
|
<p>
|
||||||
|
The Solr source code resides in the Apache <a href="http://subversion.tigris.org/">Subversion (SVN)</a> repository.
|
||||||
|
The command-line SVN client can be obtained <a href="http://subversion.tigris.org/project_packages.html">here</a> or as an optional package for <a href="http://www.cygwin.com/">cygwin</a>.
|
||||||
|
The TortoiseSVN GUI client for Windows can be obtained <a href="http://tortoisesvn.tigris.org/">here</a>. There
|
||||||
|
are also SVN plugins available for older versions of <a href="http://subclipse.tigris.org/">Eclipse</a> and
|
||||||
|
<a href="http://svnup.tigris.org/">IntelliJ IDEA</a> that don't have subversion support already included.
|
||||||
|
</p>
|
||||||
|
</section>
|
||||||
|
<p>Here is some more text. It contains <a href="http://lucene.apache.org">a link</a>. </p>
|
||||||
|
<p>Text Here</p>
|
||||||
|
</body>
|
||||||
|
|
||||||
|
</document>
|
|
@ -105,6 +105,8 @@
|
||||||
</target>
|
</target>
|
||||||
-->
|
-->
|
||||||
|
|
||||||
|
<target name="example" depends="build"/>
|
||||||
|
|
||||||
|
|
||||||
<!-- do nothing for now, required for generate maven artifacts -->
|
<!-- do nothing for now, required for generate maven artifacts -->
|
||||||
<target name="build"/>
|
<target name="build"/>
|
||||||
|
|
|
@ -121,4 +121,6 @@
|
||||||
<!-- TODO: Autolaunch Solr -->
|
<!-- TODO: Autolaunch Solr -->
|
||||||
</target>
|
</target>
|
||||||
|
|
||||||
|
<target name="example" depends="build"/>
|
||||||
|
|
||||||
</project>
|
</project>
|
||||||
|
|
|
@ -0,0 +1,200 @@
|
||||||
|
package org.apache.solr.common.util;
|
||||||
|
/**
|
||||||
|
* Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
|
* contributor license agreements. See the NOTICE file distributed with
|
||||||
|
* this work for additional information regarding copyright ownership.
|
||||||
|
* The ASF licenses this file to You under the Apache License, Version 2.0
|
||||||
|
* (the "License"); you may not use this file except in compliance with
|
||||||
|
* the License. You may obtain a copy of the License at
|
||||||
|
*
|
||||||
|
* http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
*
|
||||||
|
* Unless required by applicable law or agreed to in writing, software
|
||||||
|
* distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
* See the License for the specific language governing permissions and
|
||||||
|
* limitations under the License.
|
||||||
|
*/
|
||||||
|
|
||||||
|
import java.text.DateFormat;
|
||||||
|
import java.text.ParseException;
|
||||||
|
import java.text.SimpleDateFormat;
|
||||||
|
import java.util.ArrayList;
|
||||||
|
import java.util.Arrays;
|
||||||
|
import java.util.Calendar;
|
||||||
|
import java.util.Collection;
|
||||||
|
import java.util.Date;
|
||||||
|
import java.util.Iterator;
|
||||||
|
import java.util.Locale;
|
||||||
|
import java.util.TimeZone;
|
||||||
|
|
||||||
|
|
||||||
|
/**
|
||||||
|
* This class has some code from HttpClient DateUtil.
|
||||||
|
*/
|
||||||
|
public class DateUtil {
|
||||||
|
//start HttpClient
|
||||||
|
/**
|
||||||
|
* Date format pattern used to parse HTTP date headers in RFC 1123 format.
|
||||||
|
*/
|
||||||
|
public static final String PATTERN_RFC1123 = "EEE, dd MMM yyyy HH:mm:ss zzz";
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Date format pattern used to parse HTTP date headers in RFC 1036 format.
|
||||||
|
*/
|
||||||
|
public static final String PATTERN_RFC1036 = "EEEE, dd-MMM-yy HH:mm:ss zzz";
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Date format pattern used to parse HTTP date headers in ANSI C
|
||||||
|
* <code>asctime()</code> format.
|
||||||
|
*/
|
||||||
|
public static final String PATTERN_ASCTIME = "EEE MMM d HH:mm:ss yyyy";
|
||||||
|
//These are included for back compat
|
||||||
|
private static final Collection<String> DEFAULT_HTTP_CLIENT_PATTERNS = Arrays.asList(
|
||||||
|
PATTERN_ASCTIME, PATTERN_RFC1036, PATTERN_RFC1123);
|
||||||
|
|
||||||
|
private static final Date DEFAULT_TWO_DIGIT_YEAR_START;
|
||||||
|
|
||||||
|
static {
|
||||||
|
Calendar calendar = Calendar.getInstance();
|
||||||
|
calendar.set(2000, Calendar.JANUARY, 1, 0, 0);
|
||||||
|
DEFAULT_TWO_DIGIT_YEAR_START = calendar.getTime();
|
||||||
|
}
|
||||||
|
|
||||||
|
private static final TimeZone GMT = TimeZone.getTimeZone("GMT");
|
||||||
|
|
||||||
|
//end HttpClient
|
||||||
|
|
||||||
|
//---------------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
/**
|
||||||
|
* A suite of default date formats that can be parsed, and thus transformed to the Solr specific format
|
||||||
|
*/
|
||||||
|
public static final Collection<String> DEFAULT_DATE_FORMATS = new ArrayList<String>();
|
||||||
|
|
||||||
|
static {
|
||||||
|
DEFAULT_DATE_FORMATS.add("yyyy-MM-dd'T'HH:mm:ss'Z'");
|
||||||
|
DEFAULT_DATE_FORMATS.add("yyyy-MM-dd'T'HH:mm:ss");
|
||||||
|
DEFAULT_DATE_FORMATS.add("yyyy-MM-dd");
|
||||||
|
DEFAULT_DATE_FORMATS.add("yyyy-MM-dd hh:mm:ss");
|
||||||
|
DEFAULT_DATE_FORMATS.add("yyyy-MM-dd HH:mm:ss");
|
||||||
|
DEFAULT_DATE_FORMATS.add("EEE MMM d hh:mm:ss z yyyy");
|
||||||
|
DEFAULT_DATE_FORMATS.addAll(DEFAULT_HTTP_CLIENT_PATTERNS);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Parses the given date string using the suite of default date formats
* and returns the resulting {@link java.util.Date}.
|
||||||
|
*
|
||||||
|
* @param d The input date to parse
|
||||||
|
* @return The parsed {@link java.util.Date}
|
||||||
|
* @throws java.text.ParseException If the input can't be parsed
|
||||||
|
* @throws org.apache.commons.httpclient.util.DateParseException
|
||||||
|
* If the input can't be parsed
|
||||||
|
*/
|
||||||
|
public static Date parseDate(String d) throws ParseException {
|
||||||
|
return parseDate(d, DEFAULT_DATE_FORMATS);
|
||||||
|
}
|
||||||
|
|
||||||
|
public static Date parseDate(String d, Collection<String> fmts) throws ParseException {
|
||||||
|
// 2007-04-26T08:05:04Z
|
||||||
|
if (d.endsWith("Z") && d.length() > 20) {
|
||||||
|
return getThreadLocalDateFormat().parse(d);
|
||||||
|
}
|
||||||
|
return parseDate(d, fmts, null);
|
||||||
|
}
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Slightly modified from org.apache.commons.httpclient.util.DateUtil.parseDate
|
||||||
|
* <p/>
|
||||||
|
* Parses the date value using the given date formats.
|
||||||
|
*
|
||||||
|
* @param dateValue the date value to parse
|
||||||
|
* @param dateFormats the date formats to use
|
||||||
|
* @param startDate During parsing, two digit years will be placed in the range
|
||||||
|
* <code>startDate</code> to <code>startDate + 100 years</code>. This value may
|
||||||
|
* be <code>null</code>. When <code>null</code> is given as a parameter, year
|
||||||
|
* <code>2000</code> will be used.
|
||||||
|
* @return the parsed date
|
||||||
|
* @throws ParseException if none of the dateFormats could parse the dateValue
|
||||||
|
*/
|
||||||
|
public static Date parseDate(
|
||||||
|
String dateValue,
|
||||||
|
Collection<String> dateFormats,
|
||||||
|
Date startDate
|
||||||
|
) throws ParseException {
|
||||||
|
|
||||||
|
if (dateValue == null) {
|
||||||
|
throw new IllegalArgumentException("dateValue is null");
|
||||||
|
}
|
||||||
|
if (dateFormats == null) {
|
||||||
|
dateFormats = DEFAULT_HTTP_CLIENT_PATTERNS;
|
||||||
|
}
|
||||||
|
if (startDate == null) {
|
||||||
|
startDate = DEFAULT_TWO_DIGIT_YEAR_START;
|
||||||
|
}
|
||||||
|
// trim single quotes around date if present
|
||||||
|
// see issue #5279
|
||||||
|
if (dateValue.length() > 1
|
||||||
|
&& dateValue.startsWith("'")
|
||||||
|
&& dateValue.endsWith("'")
|
||||||
|
) {
|
||||||
|
dateValue = dateValue.substring(1, dateValue.length() - 1);
|
||||||
|
}
|
||||||
|
|
||||||
|
SimpleDateFormat dateParser = null;
|
||||||
|
Iterator formatIter = dateFormats.iterator();
|
||||||
|
|
||||||
|
while (formatIter.hasNext()) {
|
||||||
|
String format = (String) formatIter.next();
|
||||||
|
if (dateParser == null) {
|
||||||
|
dateParser = new SimpleDateFormat(format, Locale.US);
|
||||||
|
dateParser.setTimeZone(GMT);
|
||||||
|
dateParser.set2DigitYearStart(startDate);
|
||||||
|
} else {
|
||||||
|
dateParser.applyPattern(format);
|
||||||
|
}
|
||||||
|
try {
|
||||||
|
return dateParser.parse(dateValue);
|
||||||
|
} catch (ParseException pe) {
|
||||||
|
// ignore this exception, we will try the next format
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// we were unable to parse the date
|
||||||
|
throw new ParseException("Unable to parse the date " + dateValue, 0);
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Returns a formatter that can be used by the current thread if needed to
|
||||||
|
* convert Date objects to the Internal representation.
|
||||||
|
*
|
||||||
|
* @return The {@link java.text.DateFormat} for the current thread
|
||||||
|
*/
|
||||||
|
public static DateFormat getThreadLocalDateFormat() {
|
||||||
|
return fmtThreadLocal.get();
|
||||||
|
}
|
||||||
|
|
||||||
|
public static TimeZone UTC = TimeZone.getTimeZone("UTC");
|
||||||
|
private static ThreadLocalDateFormat fmtThreadLocal = new ThreadLocalDateFormat();
|
||||||
|
|
||||||
|
private static class ThreadLocalDateFormat extends ThreadLocal<DateFormat> {
|
||||||
|
DateFormat proto;
|
||||||
|
|
||||||
|
public ThreadLocalDateFormat() {
|
||||||
|
super();
|
||||||
|
//2007-04-26T08:05:04Z
|
||||||
|
SimpleDateFormat tmp = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
|
||||||
|
tmp.setTimeZone(UTC);
|
||||||
|
proto = tmp;
|
||||||
|
}
|
||||||
|
|
||||||
|
@Override
|
||||||
|
protected DateFormat initialValue() {
|
||||||
|
return (DateFormat) proto.clone();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
}
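
A small hedged example of how the default-format parsing above behaves; both input strings are made up for illustration.

import java.util.Date;

import org.apache.solr.common.util.DateUtil;

public class DateUtilSketch {
  public static void main(String[] args) throws Exception {
    // Ends with 'Z' and is longer than 20 characters, so it takes the
    // thread-local "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" fast path.
    Date iso = DateUtil.parseDate("2007-04-26T08:05:04.000Z");
    // Does not match the fast path, so it walks DEFAULT_DATE_FORMATS until
    // the RFC 1123 pattern parses it.
    Date http = DateUtil.parseDate("Thu, 26 Apr 2007 08:05:04 GMT");
    System.out.println(iso + " / " + http);
  }
}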
|
|
@ -31,6 +31,7 @@ import java.util.*;
|
||||||
import org.slf4j.Logger;
|
import org.slf4j.Logger;
|
||||||
import org.slf4j.LoggerFactory;
|
import org.slf4j.LoggerFactory;
|
||||||
import java.nio.charset.Charset;
|
import java.nio.charset.Charset;
|
||||||
|
import java.lang.reflect.Constructor;
|
||||||
|
|
||||||
import javax.naming.Context;
|
import javax.naming.Context;
|
||||||
import javax.naming.InitialContext;
|
import javax.naming.InitialContext;
|
||||||
|
@ -308,6 +309,36 @@ public class SolrResourceLoader implements ResourceLoader
|
||||||
}
|
}
|
||||||
return obj;
|
return obj;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
public Object newInstance(String cName, String [] subPackages, Class[] params, Object[] args){
|
||||||
|
Class clazz = findClass(cName,subPackages);
|
||||||
|
if( clazz == null ) {
|
||||||
|
throw new SolrException( SolrException.ErrorCode.SERVER_ERROR,
|
||||||
|
"Can not find class: "+cName + " in " + classLoader, false);
|
||||||
|
}
|
||||||
|
|
||||||
|
Object obj = null;
|
||||||
|
try {
|
||||||
|
|
||||||
|
Constructor constructor = clazz.getConstructor(params);
|
||||||
|
obj = constructor.newInstance(args);
|
||||||
|
}
|
||||||
|
catch (Exception e) {
|
||||||
|
throw new SolrException( SolrException.ErrorCode.SERVER_ERROR,
|
||||||
|
"Error instantiating class: '" + clazz.getName()+"'", e, false );
|
||||||
|
}
|
||||||
|
|
||||||
|
if( obj instanceof SolrCoreAware ) {
|
||||||
|
assertAwareCompatibility( SolrCoreAware.class, obj );
|
||||||
|
waitingForCore.add( (SolrCoreAware)obj );
|
||||||
|
}
|
||||||
|
if( obj instanceof ResourceLoaderAware ) {
|
||||||
|
assertAwareCompatibility( ResourceLoaderAware.class, obj );
|
||||||
|
waitingForResources.add( (ResourceLoaderAware)obj );
|
||||||
|
}
|
||||||
|
return obj;
|
||||||
|
}
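
A hedged sketch of calling the new newInstance overload added above; the target class name, sub-package list, and single-String constructor are invented for illustration, so the call would only succeed against a class that actually exists on the loader's classpath.

import org.apache.solr.core.SolrResourceLoader;

public class NewInstanceSketch {
  // The class name and constructor signature here are hypothetical.
  public static Object create(SolrResourceLoader loader) {
    return loader.newInstance(
        "com.example.CustomComponent",           // resolved via findClass()
        new String[] { "handler.", "search." },  // sub-packages tried for short names
        new Class[] { String.class },            // constructor parameter types
        new Object[] { "example-arg" });         // constructor arguments
  }
}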
|
||||||
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Tell all {@link SolrCoreAware} instances about the SolrCore
|
* Tell all {@link SolrCoreAware} instances about the SolrCore
|
||||||
|
@ -436,4 +467,4 @@ public class SolrResourceLoader implements ResourceLoader
|
||||||
throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, builder.toString() );
|
throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, builder.toString() );
|
||||||
}
|
}
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|
|
@ -315,11 +315,7 @@ public class TestHarness {
|
||||||
* @see LocalSolrQueryRequest
|
* @see LocalSolrQueryRequest
|
||||||
*/
|
*/
|
||||||
public String query(String handler, SolrQueryRequest req) throws IOException, Exception {
|
public String query(String handler, SolrQueryRequest req) throws IOException, Exception {
|
||||||
SolrQueryResponse rsp = new SolrQueryResponse();
|
SolrQueryResponse rsp = queryAndResponse(handler, req);
|
||||||
core.execute(core.getRequestHandler(handler),req,rsp);
|
|
||||||
if (rsp.getException() != null) {
|
|
||||||
throw rsp.getException();
|
|
||||||
}
|
|
||||||
|
|
||||||
StringWriter sw = new StringWriter(32000);
|
StringWriter sw = new StringWriter(32000);
|
||||||
QueryResponseWriter responseWriter = core.getQueryResponseWriter(req);
|
QueryResponseWriter responseWriter = core.getQueryResponseWriter(req);
|
||||||
|
@ -330,6 +326,15 @@ public class TestHarness {
|
||||||
return sw.toString();
|
return sw.toString();
|
||||||
}
|
}
|
||||||
|
|
||||||
|
public SolrQueryResponse queryAndResponse(String handler, SolrQueryRequest req) throws Exception {
|
||||||
|
SolrQueryResponse rsp = new SolrQueryResponse();
|
||||||
|
core.execute(core.getRequestHandler(handler),req,rsp);
|
||||||
|
if (rsp.getException() != null) {
|
||||||
|
throw rsp.getException();
|
||||||
|
}
|
||||||
|
return rsp;
|
||||||
|
}
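
A hedged sketch of using this new helper from a test; the handler name and the way the harness and request are obtained are assumptions based on the surrounding test code.

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.util.TestHarness;

public class QueryAndResponseSketch {
  // "standard" matches the request handler declared in the test solrconfig.xml.
  public static Object rawValues(TestHarness h, SolrQueryRequest req) throws Exception {
    SolrQueryResponse rsp = h.queryAndResponse("standard", req);
    try {
      return rsp.getValues();  // work with the in-memory response instead of serialized XML
    } finally {
      req.close();
    }
  }
}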
|
||||||
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* A helper method which validates a String against an array of XPath test
|
* A helper method which validates a String against an array of XPath test
|
||||||
|
|