Uwe Schindler 57acbcfd00 SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words. git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1512296 13f79535-47bb-0310-9956-ffa450edef68		2013-08-09 13:26:55 +00:00
src	SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words.	2013-08-09 13:26:55 +00:00
README.txt	SOLR-3650: checkpoint, migrated CHANGES.txt for contrib/uima and contrib/extraction	2012-07-31 01:37:17 +00:00
build.xml	SOLR-2452: merged with trunk up r1144161; applied the svn movement script and the latest version of the post-svn-movement patch	2011-07-08 06:41:23 +00:00
ivy.xml	SOLR-4986: Upgrade to Tika 1.4	2013-07-03 11:49:59 +00:00

README.txt

Apache Solr Content Extraction Library (Solr Cell)

Introduction
------------

Apache Solr Extraction (Solr Cell) extracts and indexes content contained in "rich" documents, such as
Microsoft Word and Adobe PDF files.  (Each name is a trademark of its respective owner.)  This contrib module
uses Apache Tika to extract content and metadata from the files, which can then be indexed.  For more information,
see http://wiki.apache.org/solr/ExtractingRequestHandler
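
As a rough illustration of how a client might hand a rich document to the extraction handler, the
sketch below builds (but does not send) an HTTP request.  It assumes the handler is registered at
/update/extract and that the unique key field is named "id" (so it is supplied as literal.id); both
depend on your solrconfig.xml and schema.  The function names, the base URL and the file name are
hypothetical and are not part of this module.

```python
import mimetypes
import urllib.parse
import urllib.request


def build_extract_url(solr_base, doc_id, commit=True):
    """Return the URL for an extraction request.

    Assumes the handler is mapped to /update/extract and that the
    unique key field is called "id"; adjust both for your setup.
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    query = urllib.parse.urlencode(params)
    return solr_base.rstrip("/") + "/update/extract?" + query


def build_extract_request(solr_base, doc_id, filename, payload):
    """Wrap the raw file bytes in a POST request (the request is not sent here)."""
    # Guess the content type from the file name; fall back to a generic type.
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    return urllib.request.Request(
        build_extract_url(solr_base, doc_id),
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen() would submit the document to a running Solr
instance; see the wiki page above for the parameters the handler actually accepts.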

Getting Started
---------------
You will need Solr up and running.  Then add the extraction JAR file, plus the Tika dependencies (in the ./lib folder),
to the lib directory of your Solr Home.  See http://wiki.apache.org/solr/ExtractingRequestHandler for details on
hooking it in and configuring it.
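
The copy step described above can be sketched as a small script.  This is only one possible way to
do it: the function name and the directory layout are hypothetical, so point source_dirs at wherever
the extraction JAR and the Tika dependency JARs live in your checkout, and solr_home at your own
Solr Home.

```python
import shutil
from pathlib import Path


def install_jars(source_dirs, solr_home):
    """Copy every *.jar found directly in source_dirs into <solr_home>/lib.

    Returns the names of the copied files, in the order they were copied.
    """
    lib_dir = Path(solr_home) / "lib"
    lib_dir.mkdir(parents=True, exist_ok=True)  # create lib/ if it is missing
    copied = []
    for source in source_dirs:
        # Only JAR files are copied; other files in the folder are ignored.
        for jar in sorted(Path(source).glob("*.jar")):
            shutil.copy2(jar, lib_dir / jar.name)
            copied.append(jar.name)
    return copied
```

For example, install_jars(["path/to/extraction-jar-dir", "./lib"], "path/to/solr/home") would place
the JARs from both folders into path/to/solr/home/lib (both paths are placeholders).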