diff --git a/solr/CHANGES.txt b/solr/CHANGES.txt index fde092f626d..de1757f1896 100644 --- a/solr/CHANGES.txt +++ b/solr/CHANGES.txt @@ -1045,6 +1045,11 @@ New Features support numeric collation, ignore punctuation/whitespace, ignore accents but not case, control whether upper/lowercase values are sorted first, etc. (rmuir) +* SOLR-2346: Add a chance to set content encoding explicitly via content type + of stream for extracting request handler. This is convenient when Tika's + auto detector cannot detect encoding, especially the text file is too short + to detect encoding. (koji) + Optimizations ---------------------- * SOLR-1931: Speedup for LukeRequestHandler and admin/schema browser. New parameter @@ -1286,6 +1291,12 @@ Other Changes repository). Also updated dependencies jackson-core-asl and jackson-mapper-asl (both v1.5.2 -> v1.7.4). (Dawid Weiss, Steve Rowe) +* SOLR-3295: netcdf jar is excluded from the binary release (and disabled in + ivy.xml) because it requires java 6. If you want to parse this content with + extracting request handler and are willing to use java 6, just add the jar. + (rmuir) + + Build ---------------------- * SOLR-2487: Add build target to package war without slf4j jars (janhoy) @@ -1419,6 +1430,9 @@ Bug Fixes * SOLR-2591: Remove commitLockTimeout option from solrconfig.xml (Luca Cavanna via Martijn van Groningen) +* SOLR-2746: Upgraded UIMA dependencies from *-2.3.1-SNAPSHOT.jar to *-2.3.1.jar. + + ================== 3.4.0 ================== Upgrading from Solr 3.3 @@ -1577,6 +1591,9 @@ Bug Fixes * SOLR-2629: Eliminate deprecation warnings in some JSPs. (Bernd Fehling, hossman) +* SOLR-2743: Remove commons logging from contrib/extraction. (koji) + + Build ---------------------- @@ -1648,6 +1665,13 @@ New Features * SOLR-2610 -- Add an option to delete index through CoreAdmin UNLOAD action (shalin) +* SOLR-2480: Add ignoreTikaException flag to the extraction request handler so + that users can ignore TikaException but index meta data. + (Shinichiro Abe, koji) + +* SOLR-2582: Use uniqueKey for error log in UIMAUpdateRequestProcessor. + (Tommaso Teofili via koji) + Optimizations ---------------------- @@ -1667,6 +1691,12 @@ Bug Fixes parameter is added to avoid excessive CPU time in extreme cases (e.g. long queries with many misspelled words). (James Dyer via rmuir) +* SOLR-2579: UIMAUpdateRequestProcessor ignore error fails if text.length() < 100. + (Elmer Garduno via koji) + +* SOLR-2581: UIMAToSolrMapper wrongly instantiates Type with reflection. + (Tommaso Teofili via koji) + Other Changes ---------------------- @@ -1706,6 +1736,10 @@ Upgrading from Solr 3.1 with update.chain rather than update.processor. The latter still works, but has been deprecated. +* just beneath ... is no longer supported. + It should move to UIMAUpdateRequestProcessorFactory setting. + See contrib/uima/README.txt for more details. (SOLR-2436) + Detailed Change List ---------------------- @@ -1732,6 +1766,12 @@ New Features for clustering (SOLR-2450), output of cluster scores (SOLR-2505) (Stanislaw Osinski, Dawid Weiss). +* SOLR-2503: extend UIMAUpdateRequestProcessorFactory mapping function to + map feature value to dynamicField. (koji) + +* SOLR-2512: add ignoreErrors flag to UIMAUpdateRequestProcessorFactory so + that users can ignore exceptions in AE. (Tommaso Teofili, koji) + Optimizations ---------------------- @@ -1818,6 +1858,12 @@ Other Changes * SOLR-2528: Remove default="true" from HtmlEncoder in example solrconfig.xml, because html encoding confuses non-ascii users. (koji) +* SOLR-2387: add mock annotators for improved testing in contrib/uima, + (Tommaso Teofili via rmuir) + +* SOLR-2436: move uimaConfig to under the uima's update processor in + solrconfig.xml. (Tommaso Teofili, koji) + Build ---------------------- @@ -2377,6 +2423,12 @@ Bug Fixes * SOLR-1692: Fix bug in clustering component relating to carrot.produceSummary option (gsingers) +* SOLR-1756: The date.format setting for extraction request handler causes + ClassCastException when enabled and the config code that parses this setting + does not properly use the same iterator instance. + (Christoph Brill, Mark Miller) + + Other Changes ---------------------- @@ -2504,6 +2556,10 @@ Other Changes * SOLR-141: Errors and Exceptions are formated by ResponseWriter. (Mike Sokolov, Rich Cariens, Daniel Naber, ryan) +* SOLR-1902: Upgraded to Tika 0.8 and changed deprecated parse call + +* SOLR-1813: Add ICU4j to contrib/extraction libs and add tests for Arabic + extraction (Robert Muir via gsingers) Build ---------------------- @@ -2874,6 +2930,11 @@ New Features 84. SOLR-1449: Add elements to solrconfig.xml to specifying additional classpath directories and regular expressions. (hossman via yonik) +85. SOLR-1128: Added metadata output to extraction request handler "extract + only" option. (gsingers) + +86. SOLR-1274: Added text serialization output for extractOnly + (Peter Wolanin, gsingers) Optimizations ---------------------- @@ -3288,6 +3349,14 @@ Other Changes 50. SOLR-1357 SolrInputDocument cannot process dynamic fields (Lars Grote via noble) +51. SOLR-1075: Upgrade to Tika 0.3. See http://www.apache.org/dist/lucene/tika/CHANGES-0.3.txt (gsingers) + +52. SOLR-1310: Upgrade to Tika 0.4. Note there are some differences in + detecting Languages now in extracting request handler. + See http://www.lucidimagination.com/search/document/d6f1899a85b2a45c/vote_apache_tika_0_4_release_candidate_2#d6f1899a85b2a45c + for discussion on language detection. + See http://www.apache.org/dist/lucene/tika/CHANGES-0.4.txt. (gsingers) + Build ---------------------- 1. SOLR-776: Added in ability to sign artifacts via Ant for releases (gsingers) diff --git a/solr/contrib/extraction/CHANGES.txt b/solr/contrib/extraction/CHANGES.txt deleted file mode 100644 index 728c0479a30..00000000000 --- a/solr/contrib/extraction/CHANGES.txt +++ /dev/null @@ -1,97 +0,0 @@ -Apache Solr Content Extraction Library (Solr Cell) - Release Notes - -This file describes changes to the Solr Cell (contrib/extraction) module. See SOLR-284 for details. - -Introduction ------------- - -Apache Solr Extraction provides a means for extracting and indexing content contained in "rich" documents, such -as Microsoft Word, Adobe PDF, etc. (Each name is a trademark of their respective owners) This contrib module -uses Apache Tika to extract content and metadata from the files, which can then be indexed. For more information, -see http://wiki.apache.org/solr/ExtractingRequestHandler - -Getting Started ---------------- -You will need Solr up and running. Then, simply add the extraction JAR file, plus the Tika dependencies (in the ./lib folder) -to your Solr Home lib directory. See http://wiki.apache.org/solr/ExtractingRequestHandler for more details on hooking it in - and configuring. - -Tika Dependency ---------------- - -Current Version: Tika 1.1 (released 2012-03-23) - -$Id$ - -================== Release 5.0.0 ============== - - (No changes) - -================== Release 4.0.0-ALPHA ============== - -* SOLR-3254: Upgrade Solr to Tika 1.1 (janhoy) - -================== Release 3.6.1 ================== - -(No Changes) - -================== Release 3.6.0 ================== - -* SOLR-2346: Add a chance to set content encoding explicitly via content type of stream. - This is convenient when Tika's auto detector cannot detect encoding, especially - the text file is too short to detect encoding. (koji) - -* SOLR-2901: Upgrade Solr to Tika 1.0 (janhoy) - -* SOLR-3295: netcdf jar is excluded from the binary release (and disabled in ivy.xml) - because it requires java 6. If you want to parse this content and are willing to - use java 6, just add the jar. (rmuir) - -================== Release 3.5.0 ================== - -* SOLR-2372: Upgrade Solr to Tika 0.10 (janhoy) - -================== Release 3.4.0 ================== - -* SOLR-2540: CommitWithin as an Update Request parameter - You can now specify &commitWithin=N (ms) on the update request (janhoy) - -* SOLR-2743: Remove commons logging. (koji) - -================== Release 3.3.0 ================== - -(No Changes) - -================== Release 3.2.0 ================== - -* SOLR-2480: Add ignoreTikaException flag so that users can ignore TikaException but index - meta data. (Shinichiro Abe, koji) - -================== Release 3.1.0 ================== - -* SOLR-1902: Upgraded to Tika 0.8 and changed deprecated parse call - -* SOLR-1756: The date.format setting causes ClassCastException when enabled and the config code that - parses this setting does not properly use the same iterator instance. (Christoph Brill, Mark Miller) - -* SOLR-18913: Add ICU4j to libs and add tests for Arabic extraction (Robert Muir via gsingers) - -* SOLR-1902: Upgraded to Tika 0.8-SNAPSHOT to incorporate passing in Solr's custom ClassLoader (gsingers) - -================== Release 1.4.0 ================== - -1. SOLR-284: Added in support for extraction. (Eric Pugh, Chris Harris, gsingers) - -2. SOLR-284: Removed "silent success" key generation (gsingers) - -3. SOLR-1075: Upgrade to Tika 0.3. See http://www.apache.org/dist/lucene/tika/CHANGES-0.3.txt (gsingers) - -4. SOLR-1128: Added metadata output to "extract only" option. (gsingers) - -5. SOLR-1310: Upgrade to Tika 0.4. Note there are some differences in detecting Languages now. - See http://www.lucidimagination.com/search/document/d6f1899a85b2a45c/vote_apache_tika_0_4_release_candidate_2#d6f1899a85b2a45c - for discussion on language detection. - See http://www.apache.org/dist/lucene/tika/CHANGES-0.4.txt. (gsingers) - -6. SOLR-1274: Added text serialization output for extractOnly (Peter Wolanin, gsingers) diff --git a/solr/contrib/extraction/README.txt b/solr/contrib/extraction/README.txt new file mode 100644 index 00000000000..b2ba66d50f9 --- /dev/null +++ b/solr/contrib/extraction/README.txt @@ -0,0 +1,16 @@ +Apache Solr Content Extraction Library (Solr Cell) + +Introduction +------------ + +Apache Solr Extraction provides a means for extracting and indexing content contained in "rich" documents, such +as Microsoft Word, Adobe PDF, etc. (Each name is a trademark of their respective owners) This contrib module +uses Apache Tika to extract content and metadata from the files, which can then be indexed. For more information, +see http://wiki.apache.org/solr/ExtractingRequestHandler + +Getting Started +--------------- +You will need Solr up and running. Then, simply add the extraction JAR file, plus the Tika dependencies (in the ./lib folder) +to your Solr Home lib directory. See http://wiki.apache.org/solr/ExtractingRequestHandler for more details on hooking it in + and configuring. + diff --git a/solr/contrib/uima/CHANGES.txt b/solr/contrib/uima/CHANGES.txt deleted file mode 100644 index 3e515ddab25..00000000000 --- a/solr/contrib/uima/CHANGES.txt +++ /dev/null @@ -1,100 +0,0 @@ -Apache Solr UIMA Metadata Extraction Library - Release Notes - -This file describes changes to the Solr UIMA (contrib/uima) module. See SOLR-2129 for details. - -Introduction ------------- -This module is intended to be used both as an UpdateRequestProcessor while indexing documents and as a set of tokenizer/filters -to be configured inside the schema.xml for use during analysis phase. -UIMAUpdateRequestProcessor purpose is to provide additional on the fly automatically generated fields to the Solr index. -Such fields could be language, concepts, keywords, sentences, named entities, etc. -UIMA based tokenizers/filters can be used either inside plain Lucene or as index/query analyzers to be defined -inside the schema.xml of a Solr core to create/filter tokens using specific UIMA annotations. - - UIMA Dependency - --------------- -uimaj-core v2.3.1 -OpenCalaisAnnotator v2.3.1 -HMMTagger v2.3.1 -AlchemyAPIAnnotator v2.3.1 -WhitespaceTokenizer v2.3.1 - -$Id$ - -================== 5.0.0 ============== - -(No Changes) - -================== 4.0.0-ALPHA ============== - -(No Changes) - -================== 3.6.1 ================== - -(No Changes) - -================== 3.6.0 ================== - -(No Changes) - -================== 3.5.0 ================== - -Other Changes ----------------------- - -* SOLR-2746: Upgraded dependencies from *-2.3.1-SNAPSHOT.jar to *-2.3.1.jar. - -================== 3.4.0 ================== - -(No Changes) - -================== 3.3.0 ================== - -New Features ----------------------- - -* SOLR-2582: Use uniqueKey for error log in UIMAUpdateRequestProcessor. - (Tommaso Teofili via koji) - -Bug Fixes ----------------------- - -* SOLR-2579: UIMAUpdateRequestProcessor ignore error fails if text.length() < 100. - (Elmer Garduno via koji) - -* SOLR-2581: UIMAToSolrMapper wrongly instantiates Type with reflection. - (Tommaso Teofili via koji) - -================== 3.2.0 ================== - -Upgrading from Solr 3.1 ----------------------- - -* just beneath ... is no longer supported. - It should move to UIMAUpdateRequestProcessorFactory setting. - See contrib/uima/README.txt for more details. (SOLR-2436) - -New Features ----------------------- - -* SOLR-2503: extend mapping function to map feature value to dynamicField. (koji) - -* SOLR-2512: add ignoreErrors flag so that users can ignore exceptions in AE. - (Tommaso Teofili, koji) - -Test Cases: ----------------------- - -* SOLR-2387: add mock annotators for improved testing, - (Tommaso Teofili via rmuir) - -Other Changes ----------------------- - -* SOLR-2436: move uimaConfig to under the uima's update processor in solrconfig.xml. - (Tommaso Teofili, koji) - -================== 3.1.0 ================== - -Initial Release diff --git a/solr/contrib/uima/README.txt b/solr/contrib/uima/README.txt index 70d49f8ff37..9a862b7e7d1 100644 --- a/solr/contrib/uima/README.txt +++ b/solr/contrib/uima/README.txt @@ -1,3 +1,15 @@ +Apache Solr UIMA Metadata Extraction Library + +Introduction +------------ +This module is intended to be used both as an UpdateRequestProcessor while indexing documents and as a set of tokenizer/filters +to be configured inside the schema.xml for use during analysis phase. +UIMAUpdateRequestProcessor purpose is to provide additional on the fly automatically generated fields to the Solr index. +Such fields could be language, concepts, keywords, sentences, named entities, etc. +UIMA based tokenizers/filters can be used either inside plain Lucene or as index/query analyzers to be defined +inside the schema.xml of a Solr core to create/filter tokens using specific UIMA annotations. + + Getting Started --------------- To start using Solr UIMA Metadata Extraction Library you should go through the following configuration steps: