lucene/solr/contrib/extraction
Uwe Schindler 57acbcfd00 SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words.
git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1512296 13f79535-47bb-0310-9956-ffa450edef68
2013-08-09 13:26:55 +00:00
src SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words. 2013-08-09 13:26:55 +00:00
README.txt SOLR-3650: checkpoint, migrated CHANGES.txt for contrib/uima and contrib/extraction 2012-07-31 01:37:17 +00:00
build.xml SOLR-2452: merged with trunk up r1144161; applied the svn movement script and the latest version of the post-svn-movement patch 2011-07-08 06:41:23 +00:00
ivy.xml SOLR-4986: Upgrade to Tika 1.4 2013-07-03 11:49:59 +00:00

README.txt

Apache Solr Content Extraction Library (Solr Cell)

Introduction
------------

Apache Solr Extraction (Solr Cell) extracts and indexes content contained in "rich" documents, such as Microsoft Word
and Adobe PDF files.  (Each name is a trademark of its respective owner.)  This contrib module uses Apache Tika to
extract content and metadata from the files, which can then be indexed.  For more information, see
http://wiki.apache.org/solr/ExtractingRequestHandler
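
To give a rough sense of what indexing a rich document through this module looks like, the sketch below
builds (but does not send) an HTTP POST for the extracting handler.  It is an illustration only: the handler path
/update/extract, the base URL http://localhost:8983/solr, and the helper name build_extract_request are assumptions
that depend on your solrconfig.xml and deployment; literal.id and commit are request parameters described on the
wiki page above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_url, doc_id, file_bytes, content_type, commit=True):
    """Build a POST request that streams a raw file body to the extracting handler.

    Hypothetical helper: solr_url and the /update/extract path are assumptions
    and must match the handler registered in your solrconfig.xml.
    """
    # literal.id supplies the value of the document's "id" field.
    params = {"literal.id": doc_id}
    if commit:
        # Ask Solr to commit so the document becomes searchable right away.
        params["commit"] = "true"
    url = solr_url.rstrip("/") + "/update/extract?" + urlencode(params)
    # The Content-Type header tells the handler what kind of file is in the body.
    return Request(url, data=file_bytes,
                   headers={"Content-Type": content_type}, method="POST")


# To actually send the request against a running Solr, something like:
#   from urllib.request import urlopen
#   with open("example.pdf", "rb") as f:
#       req = build_extract_request("http://localhost:8983/solr", "doc1",
#                                   f.read(), "application/pdf")
#   with urlopen(req) as resp:
#       print(resp.status)
```

The request is built separately from being sent so that the URL and headers can be inspected before anything
reaches the server.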

Getting Started
---------------

You will need a running Solr instance.  Add the extraction JAR file, plus the Tika dependencies (in the ./lib folder),
to the lib directory of your Solr Home.  See http://wiki.apache.org/solr/ExtractingRequestHandler for details on
hooking the handler in and configuring it.
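
The copy step can be scripted.  The following is a minimal sketch, assuming the extraction JAR sits directly in
the contrib directory and the Tika dependencies sit in its lib subfolder; the function name
install_extraction_jars and both directory arguments are hypothetical, so adjust them to your own layout.

```python
import shutil
from pathlib import Path


def install_extraction_jars(contrib_dir, solr_home):
    """Copy the extraction JAR and the Tika dependency JARs into <solr_home>/lib.

    Assumed layout (not confirmed by this README): the extraction JAR is in
    contrib_dir itself and the Tika dependencies are in contrib_dir/lib.
    Returns the names of the files that were copied.
    """
    contrib_dir = Path(contrib_dir)
    target = Path(solr_home) / "lib"
    # Create the Solr Home lib directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)

    jars = list(contrib_dir.glob("*.jar")) + list((contrib_dir / "lib").glob("*.jar"))
    copied = []
    for jar in sorted(jars):
        # copy2 preserves timestamps, which helps when comparing versions later.
        shutil.copy2(jar, target / jar.name)
        copied.append(jar.name)
    return copied
```

Only files ending in .jar are copied, so other files in the contrib directory are left alone.  Solr typically
needs a restart before it picks up newly added JARs.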