SOLR-3650: migrate DIH CHANGES.txt

git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1368190 13f79535-47bb-0310-9956-ffa450edef68
Chris M. Hostetter 2012-08-01 18:44:02 +00:00
parent f1ae7dad35
commit 4eb362c0b3
3 changed files with 514 additions and 549 deletions


@ -709,6 +709,13 @@ Bug Fixes
* SOLR-3470: contrib/clustering: custom Carrot2 tokenizer and stemmer factories
are respected now (Stanislaw Osinski, Dawid Weiss)
* SOLR-3430: Added a new DIH test against a real SQL database. Fixed problems
revealed by this new test related to the expanded cache support added to
3.6/SOLR-2382 (James Dyer)
* SOLR-1958: When using the MailEntityProcessor, import would fail if
fetchMailsSince was not specified. (Max Lynch via James Dyer)
Other Changes
----------------------
@ -862,7 +869,13 @@ Other Changes
* SOLR-3534: The Dismax and eDismax query parsers will fall back on the 'df' parameter
when 'qf' is absent. And if neither is present nor the schema default search field
then an exception will be thrown now. (dsmiley)
* SOLR-3262: The "threads" feature of DIH is removed (deprecated in Solr 3.6)
(James Dyer)
* SOLR-3422: Refactored DIH internal data classes. All entities in
data-config.xml must have a name (James Dyer)
Documentation
----------------------
@ -898,6 +911,17 @@ Bug Fixes:
* SOLR-3470: contrib/clustering: custom Carrot2 tokenizer and stemmer factories
are respected now (Stanislaw Osinski, Dawid Weiss)
* SOLR-3360: More DIH bug fixes for the deprecated "threads" parameter.
(Mikhail Khludnev, Claudio R, via James Dyer)
* SOLR-3430: Added a new DIH test against a real SQL database. Fixed problems
revealed by this new test related to the expanded cache support added to
3.6/SOLR-2382 (James Dyer)
* SOLR-3336: SolrEntityProcessor substitutes most variables at query time.
(Michael Kroh, Lance Norskog, via Martijn van Groningen)
================== 3.6.0 ==================
More information about this release, including any errata related to the
release notes, upgrade instructions, or other changes may be found online at:
@ -1050,6 +1074,27 @@ New Features
auto detector cannot detect encoding, especially when the text file is too short
to detect encoding. (koji)
* SOLR-1499: Added SolrEntityProcessor that imports data from another Solr core
or instance based on a specified query.
(Lance Norskog, Erik Hatcher, Pulkit Singhal, Ahmet Arslan, Luca Cavanna,
Martijn van Groningen)
* SOLR-3190: Minor improvements to SolrEntityProcessor. Add more consistency
between solr parameters and parameters used in SolrEntityProcessor and
ability to specify a custom HttpClient instance.
(Luca Cavanna via Martijn van Groningen)
* SOLR-2382: Added pluggable cache support to DIH so that any Entity can be
made cache-able by adding the "cacheImpl" parameter. Include
"SortedMapBackedCache" to provide in-memory caching (as previously this was
the only option when using CachedSqlEntityProcessor). Users can provide
their own implementations of DIHCache for other caching strategies.
Deprecate CachedSqlEntityProcessor in favor of specifying "cacheImpl" with
SqlEntityProcessor. Make SolrWriter implement DIHWriter and allow the
possibility of pluggable Writers (DIH writing to something other than Solr).
(James Dyer, Noble Paul)
Optimizations
----------------------
* SOLR-1931: Speedup for LukeRequestHandler and admin/schema browser. New parameter
@ -1296,6 +1341,10 @@ Other Changes
extracting request handler and are willing to use java 6, just add the jar.
(rmuir)
* SOLR-3142: DIH Imports no longer default optimize to true, instead false.
If you want to force all segments to be merged into one, you can specify
this parameter yourself. NOTE: this can be a very expensive operation and
usually does not make sense for delta-imports. (Robert Muir)
Build
----------------------
@ -1393,6 +1442,9 @@ Bug Fixes
a wrong number of collation results in the response.
(Bastiaan Verhoef, James Dyer via Simon Willnauer)
* SOLR-2875: Fix the incorrect url in DIH example tika-data-config.xml
(Shinichiro Abe via koji)
Other Changes
----------------------
@ -1585,6 +1637,24 @@ Bug Fixes
* SOLR-2692: contrib/clustering: Typo in param name fixed: "carrot.fragzise"
changed to "carrot.fragSize" (Stanislaw Osinski).
* SOLR-2644: When using DIH with threads=2 the default logging is set too high
(Bill Bell via shalin)
* SOLR-2492: DIH does not commit if only deletes are processed
(James Dyer via shalin)
* SOLR-2186: DataImportHandler's multi-threaded option throws NPE
(Lance Norskog, Frank Wesemann, shalin)
* SOLR-2655: DIH multi threaded mode does not resolve attributes correctly
(Frank Wesemann, shalin)
* SOLR-2695: DIH: Documents are collected in unsynchronized list in
multi-threaded debug mode (Michael McCandless, shalin)
* SOLR-2668: DIH multithreaded mode does not rollback on errors from
EntityProcessor (Frank Wesemann, shalin)
Other Changes
----------------------
@ -1697,6 +1767,9 @@ Bug Fixes
* SOLR-2581: UIMAToSolrMapper wrongly instantiates Type with reflection.
(Tommaso Teofili via koji)
* SOLR-2551: Check dataimport.properties for write access (if delta-import is
supported in DIH configuration) before starting an import (C S, shalin)
Other Changes
----------------------
@ -2141,6 +2214,30 @@ New Features
* SOLR-2237: Added StempelPolishStemFilterFactory to contrib/analysis-extras (rmuir)
* SOLR-1525: allow DIH to refer to core properties (noble)
* SOLR-1547: DIH TemplateTransformer copy objects more intelligently when the
template is a single variable (noble)
* SOLR-1627: DIH VariableResolver should be fetched just in time (noble)
* SOLR-1583: DIH Create DataSources that return InputStream (noble)
* SOLR-1358: Integration of Tika and DataImportHandler (Akshay Ukey, noble)
* SOLR-1654: TikaEntityProcessor example added to DIHExample
(Akshay Ukey via noble)
* SOLR-1678: Move onError handling to DIH framework (noble)
* SOLR-1352: Multi-threaded implementation of DIH (noble)
* SOLR-1721: Add explicit option to run DataImportHandler in synchronous mode
(Alexey Serba via noble)
* SOLR-1737: Added FieldStreamDataSource (noble)
Optimizations
----------------------
@ -2166,6 +2263,9 @@ Optimizations
SolrIndexSearcher.doc(int, Set<String>) method b/c it can use the document
cache (gsingers)
* SOLR-2200: Improve the performance of DataImportHandler for large
delta-import updates. (Mark Waddle via rmuir)
Bug Fixes
----------------------
* SOLR-1769: Solr 1.4 Replication - Repeater throwing NullPointerException (Jörgen Rydenius via noble)
@ -2428,6 +2528,61 @@ Bug Fixes
does not properly use the same iterator instance.
(Christoph Brill, Mark Miller)
* SOLR-1638: Fixed NullPointerException during DIH import if uniqueKey is not
specified in schema (Akshay Ukey via shalin)
* SOLR-1639: Fixed misleading error message when dataimport.properties is not
writable (shalin)
* SOLR-1598: DIH: Reader used in PlainTextEntityProcessor is not explicitly
closed (Sascha Szott via noble)
* SOLR-1759: DIH: $skipDoc was not working correctly
(Gian Marco Tagliani via noble)
* SOLR-1762: DIH: DateFormatTransformer does not work correctly with
non-default locale dates (tommy chheng via noble)
* SOLR-1757: DIH multithreading sometimes throws NPE (noble)
* SOLR-1766: DIH with threads enabled doesn't respond to the abort command
(Michael Henson via noble)
* SOLR-1767: dataimporter.functions.escapeSql() does not escape backslash
character (Sean Timm via noble)
* SOLR-1811: formatDate should use the current NOW value always
(Sean Timm via noble)
* SOLR-1794: Dataimport of CLOB fields fails when getCharacterStream() is
defined in a superclass. (Gunnar Gauslaa Bergem via rmuir)
* SOLR-2057: DataImportHandler never calls UpdateRequestProcessor.finish()
(Drew Farris via koji)
* SOLR-1973: Empty fields in XML update messages confuse DataImportHandler.
(koji)
* SOLR-2221: Use StrUtils.parseBool() to get values of boolean options in DIH.
true/on/yes (for TRUE) and false/off/no (for FALSE) can be used for
sub-options (debug, verbose, synchronous, commit, clean, optimize) for
full/delta-import commands. (koji)
* SOLR-2310: DIH: getTimeElapsedSince() returns incorrect hour value when
the elapsed time is over 60 hours (tom liu via koji)
* SOLR-2252: DIH: When a child entity in nested entities is rootEntity="true",
delta-import doesn't work. (koji)
* SOLR-2330: solrconfig.xml files in example-DIH are broken. (Matt Parker, koji)
* SOLR-1191: resolve DataImportHandler deltaQuery column against pk when pk
has a prefix (e.g. pk="book.id" deltaQuery="select id from ..."). More
useful error reporting when no match found (previously failed with a
NullPointerException in log and no clear user feedback). (gthb via yonik)
* SOLR-2116: Fix TikaConfig classloader bug in TikaEntityProcessor
(Martijn van Groningen via hossman)
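The SOLR-2221 entry above lists the literals accepted for boolean DIH sub-options. The sketch below illustrates that mapping only. It is not Solr's StrUtils.parseBool(): the class name BoolOptionSketch is hypothetical, and the case-insensitive matching, whitespace trimming, and exception on unknown input are assumptions made for the example.

```java
import java.util.Locale;

public class BoolOptionSketch {
    // Hypothetical helper (not Solr's StrUtils.parseBool): maps the literals
    // named in the SOLR-2221 entry to a boolean.
    // Assumptions: matching is case-insensitive, surrounding whitespace is
    // ignored, and anything else is rejected with an exception.
    public static boolean parseBool(String value) {
        if (value == null) {
            throw new IllegalArgumentException("missing boolean value");
        }
        switch (value.trim().toLowerCase(Locale.ROOT)) {
            case "true":
            case "on":
            case "yes":
                return true;
            case "false":
            case "off":
            case "no":
                return false;
            default:
                throw new IllegalArgumentException("invalid boolean value: " + value);
        }
    }

    public static void main(String[] args) {
        // e.g. the value of a request parameter such as clean=on or commit=no
        System.out.println(parseBool("on"));
        System.out.println(parseBool("no"));
    }
}
```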
Other Changes
----------------------
@ -2561,6 +2716,12 @@ Other Changes
* SOLR-1813: Add ICU4j to contrib/extraction libs and add tests for Arabic
extraction (Robert Muir via gsingers)
* SOLR-1821: Fix TimeZone-dependent test failure in TestEvaluatorBag.
(Chris Male via rmuir)
* SOLR-2367: Reduced noise in test output by ensuring the properties file
can be written. (Gunnlaugur Thor Briem via rmuir)
Build
----------------------
@ -2645,6 +2806,33 @@ error. See SOLR-1410 for more information.
* RussianLowerCaseFilterFactory
* RussianLetterTokenizerFactory
DIH: Evaluator API has been changed in a non back-compatible way. Users who
have developed custom Evaluators will need to change their code according to
the new API for it to work. See SOLR-996 for details.
DIH: The formatDate evaluator's syntax has been changed. The new syntax is
formatDate(<variable>, '<format_string>'). For example,
formatDate(x.date, 'yyyy-MM-dd'). In the old syntax, the date string was
written without single quotes. The old syntax has been deprecated and will
be removed in 1.5; until then, using the old syntax will log a warning.
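To show what a format string such as 'yyyy-MM-dd' produces, here is a minimal Java sketch. It assumes the format string follows java.text.SimpleDateFormat pattern syntax; the class FormatDatePatternSketch, the fixed UTC time zone, and the sample date are illustrative choices and not part of DIH.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class FormatDatePatternSketch {
    // Formats a date with the given pattern.
    // Assumption: the pattern uses SimpleDateFormat syntax; UTC is fixed here
    // only so the example output does not depend on the local time zone.
    public static String format(Date date, String pattern) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        // 86,400,000 ms after the epoch is one day later: 1970-01-02 in UTC
        System.out.println(format(new Date(86_400_000L), "yyyy-MM-dd"));
    }
}
```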
DIH: The Context API has been changed in a non back-compatible way. In
particular, the Context.currentProcess() method now returns a String
describing the type of the current import process instead of an int.
Similarly, the public constants in Context, viz. FULL_DUMP, DELTA_DUMP and
FIND_DELTA are changed to a String type. See SOLR-969 for details.
DIH: The EntityProcessor API has been simplified by moving logic for applying
transformers and handling multi-row outputs from Transformers into an
EntityProcessorWrapper class. The EntityProcessor#destroy is now called once
per parent-row at the end of row (end of data). A new method
EntityProcessor#close is added which is called at the end of import.
DIH: In Solr 1.3, if the last_index_time was not available (first import) and
a delta-import was requested, a full-import was run instead. This is no longer
the case. In Solr 1.4 delta import is run with last_index_time as the epoch
date (January 1, 1970, 00:00:00 GMT) if last_index_time is not available.
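The epoch fallback described in the last note can be sketched as follows. This is an illustration, not DIH code: the class EpochFallbackSketch and its method names are hypothetical, and the "yyyy-MM-dd HH:mm:ss" rendering is an assumption used only to print the value.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class EpochFallbackSketch {
    // Returns the stored last_index_time, or the epoch when none is available
    // (for example, on the first import).
    public static Date effectiveLastIndexTime(Date lastIndexTime) {
        return lastIndexTime != null ? lastIndexTime : new Date(0L);
    }

    // Renders a date in GMT. The pattern is an assumption for display only.
    public static String render(Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        // No stored value: the delta query would see January 1, 1970, 00:00:00 GMT
        System.out.println(render(effectiveLastIndexTime(null)));
    }
}
```

With the epoch as the lower bound, a delta query that filters on a last-modified column would be expected to match every row that has a timestamp, which is why a delta import can still run before any import has been recorded.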
Versions of Major Components
----------------------------
Apache Lucene 2.9.1 (r832363 on 2.9 branch)
@ -2936,6 +3124,141 @@ New Features
86. SOLR-1274: Added text serialization output for extractOnly
(Peter Wolanin, gsingers)
87. SOLR-768: DIH: Set last_index_time variable in full-import command.
(Wojtek Piaseczny, Noble Paul via shalin)
88. SOLR-811: Allow a "deltaImportQuery" attribute in SqlEntityProcessor
which is used for delta imports instead of DataImportHandler manipulating
the SQL itself. (Noble Paul via shalin)
89. SOLR-842: Better error handling in DataImportHandler with options to
abort, skip and continue imports. (Noble Paul, shalin)
90. SOLR-833: DIH: A DataSource to read data from a field as a reader. This
can be used, for example, to read XMLs residing as CLOBs or BLOBs in
databases. (Noble Paul via shalin)
91. SOLR-887: A DIH Transformer to strip HTML tags. (Ahmed Hammad via shalin)
92. SOLR-886: DataImportHandler should rollback when an import fails or it is
aborted (shalin)
93. SOLR-891: A DIH Transformer to read strings from Clob type.
(Noble Paul via shalin)
94. SOLR-812: Configurable JDBC settings in JdbcDataSource including optimized
defaults for read only mode. (David Smiley, Glen Newton, shalin)
95. SOLR-910: Add a few utility commands to the DIH admin page such as full
import, delta import, status, reload config. (Ahmed Hammad via shalin)
96. SOLR-938: Add event listener API for DIH import start and end.
(Kay Kay, Noble Paul via shalin)
97. SOLR-801: DIH: Add support for configurable pre-import and post-import
delete query per root-entity. (Noble Paul via shalin)
98. SOLR-988: Add a new scope for session data stored in Context to store
objects across imports. (Noble Paul via shalin)
99. SOLR-980: A PlainTextEntityProcessor which can read from any
DataSource<Reader> and output a String.
(Nathan Adams, Noble Paul via shalin)
100.SOLR-1003: XPathEntityProcessor must allow slurping all text from a given
xml node and its children. (Noble Paul via shalin)
101.SOLR-1001: Allow variables in various attributes of RegexTransformer,
HTMLStripTransformer and NumberFormatTransformer.
(Fergus McMenemie, Noble Paul, shalin)
102.SOLR-989: DIH: Expose running statistics from the Context API.
(Noble Paul, shalin)
103.SOLR-996: DIH: Expose Context to Evaluators. (Noble Paul, shalin)
104.SOLR-783: DIH: Enhance delta-imports by maintaining separate
last_index_time for each entity. (Jon Baer, Noble Paul via shalin)
105.SOLR-1033: Current entity's namespace is made available to all DIH
Transformers. This allows one to use an output field of TemplateTransformer
in other transformers, among other things.
(Fergus McMenemie, Noble Paul via shalin)
106.SOLR-1066: New methods in DIH Context to expose Script details.
ScriptTransformer changed to read scripts through the new API methods.
(Noble Paul via shalin)
107.SOLR-1062: A DIH LogTransformer which can log data in a given template
format. (Jon Baer, Noble Paul via shalin)
108.SOLR-1065: A DIH ContentStreamDataSource which can accept HTTP POST data
in a content stream. This can be used to push data to Solr instead of
just pulling it from DB/Files/URLs. (Noble Paul via shalin)
109.SOLR-1061: Improve DIH RegexTransformer to create multiple columns from
regex groups. (Noble Paul via shalin)
110.SOLR-1059: Special DIH flags introduced for deleting documents by query or
id, skipping rows and stopping further transforms. Use $deleteDocById,
$deleteDocByQuery for deleting by id and query respectively. Use $skipRow
to skip the current row but continue with the document. Use $stopTransform
to stop further transformers. New methods are introduced in Context for
deleting by id and query. (Noble Paul, Fergus McMenemie, shalin)
111.SOLR-1076: JdbcDataSource should resolve DIH variables in all its
configuration parameters. (shalin)
112.SOLR-1055: Make DIH JdbcDataSource easily extensible by making the
createConnectionFactory method protected and return a
Callable<Connection> object. (Noble Paul, shalin)
113.SOLR-1058: DIH: JdbcDataSource can lookup javax.sql.DataSource using JNDI.
Use a jndiName attribute to specify the location of the data source.
(Jason Shepherd, Noble Paul via shalin)
114.SOLR-1083: A DIH Evaluator for escaping query characters.
(Noble Paul, shalin)
115.SOLR-934: A MailEntityProcessor to enable indexing mails from
POP/IMAP sources into a solr index. (Preetam Rao, shalin)
116.SOLR-1060: A DIH LineEntityProcessor which can stream lines of text from a
given file to be indexed directly or for processing with transformers and
child entities.
(Fergus McMenemie, Noble Paul, shalin)
117.SOLR-1127: Add support for DIH field name to be templatized.
(Noble Paul, shalin)
118.SOLR-1092: Added a new DIH command named 'import' which does not
automatically clean the index. This is useful and more appropriate when one
needs to import only some of the entities.
(Noble Paul via shalin)
119.SOLR-1153: DIH 'deltaImportQuery' is honored on child entities as well
(noble)
120.SOLR-1230: Enhanced dataimport.jsp to work with all DataImportHandler
request handler configurations, rather than just a hardcoded /dataimport
handler. (ehatcher)
121.SOLR-1235: disallow period (.) in DIH entity names (noble)
122.SOLR-1234: Multiple DIH does not work because all of them write to
dataimport.properties. Use the handler name as the properties file name
(noble)
123.SOLR-1348: Support binary field type in convertType logic in DIH
JdbcDataSource (shalin)
124.SOLR-1406: DIH: Make FileDataSource and FileListEntityProcessor more
extensible (Luke Forehand, shalin)
125.SOLR-1437: DIH: XPathEntityProcessor can deal with xpath syntaxes such as
//tagname , /root//tagname (Fergus McMenemie via noble)
Optimizations
----------------------
1. SOLR-374: Use IndexReader.reopen to save resources by re-using parts of the
@ -2993,6 +3316,21 @@ Optimizations
17. SOLR-1296: Enables setting IndexReader's termInfosIndexDivisor via a new attribute to StandardIndexReaderFactory. Enables
setting termIndexInterval to IndexWriter via SolrIndexConfig. (Jason Rutherglen, hossman, gsingers)
18. SOLR-846: DIH: Reduce memory consumption during delta import by removing
keys when used (Ricky Leung, Noble Paul via shalin)
19. SOLR-974: DataImportHandler skips commit if no data has been updated.
(Wojtek Piaseczny, shalin)
20. SOLR-1004: DIH: Check for abort more frequently during delta-imports.
(Marc Sturlese, shalin)
21. SOLR-1098: DIH DateFormatTransformer can cache the format objects.
(Noble Paul via shalin)
22. SOLR-1465: Replaced string concatenations with StringBuilder append
calls in DIH XPathRecordReader. (Mark Miller, shalin)
Bug Fixes
----------------------
1. SOLR-774: Fixed logging level display (Sean Timm via Otis Gospodnetic)
@ -3210,6 +3548,103 @@ Bug Fixes
caused an error to be returned, although the deletes were
still executed. (asmodean via yonik)
76. SOLR-800: Deep copy collections to avoid ConcurrentModificationException
in XPathEntityProcessor while streaming
(Kyle Morrison, Noble Paul via shalin)
77. SOLR-823: Request parameter variables ${dataimporter.request.xxx} are not
resolved in DIH (Mck SembWever, Noble Paul, shalin)
78. SOLR-728: Add synchronization to avoid race condition of multiple DIH
imports working concurrently (Walter Ferrara, shalin)
79. SOLR-742: Add ability to create dynamic fields with custom
DataImportHandler transformers (Wojtek Piaseczny, Noble Paul, shalin)
80. SOLR-832: Rows parameter is not honored in DIH non-debug mode and can
abort a running import in debug mode. (Akshay Ukey, shalin)
81. SOLR-838: The DIH VariableResolver obtained from a DataSource's context
does not have current data. (Noble Paul via shalin)
82. SOLR-864: DataImportHandler does not catch and log Errors (shalin)
83. SOLR-873: Fix case-sensitive field names and columns (Jon Baer, shalin)
84. SOLR-893: Unable to delete documents via SQL and deletedPkQuery with
deltaimport (Dan Rosher via shalin)
85. SOLR-888: DIH DateFormatTransformer cannot convert non-string type
(Amit Nithian via shalin)
86. SOLR-841: DataImportHandler should throw exception if a field does not
have column attribute (Michael Henson, shalin)
87. SOLR-884: CachedSqlEntityProcessor should check if the cache key is
present in the query results (Noble Paul via shalin)
88. SOLR-985: Fix thread-safety issue with DIH TemplateString for concurrent
imports with multiple cores. (Ryuuichi Kumai via shalin)
89. SOLR-999: DIH XPathRecordReader fails on XMLs with nodes mixed with
CDATA content. (Fergus McMenemie, Noble Paul via shalin)
90. SOLR-1000: DIH FileListEntityProcessor should not apply fileName filter to
directory names. (Fergus McMenemie via shalin)
91. SOLR-1009: Repeated column names result in duplicate values.
(Fergus McMenemie, Noble Paul via shalin)
92. SOLR-1017: Fix DIH thread-safety issue with last_index_time for concurrent
imports in multiple cores due to unsafe usage of SimpleDateFormat by
multiple threads. (Ryuuichi Kumai via shalin)
93. SOLR-1024: Calling abort on DataImportHandler import commits data instead
of calling rollback. (shalin)
94. SOLR-1037: DIH should not add null values in a row returned by
EntityProcessor to documents. (shalin)
95. SOLR-1040: DIH XPathEntityProcessor fails with an xpath like
/feed/entry/link[@type='text/html']/@href (Noble Paul via shalin)
96. SOLR-1042: Fix memory leak in DIH by making TemplateString non-static
member in VariableResolverImpl (Ryuuichi Kumai via shalin)
97. SOLR-1053: IndexOutOfBoundsException in DIH SolrWriter.getResourceAsString
when size of data-config.xml is a multiple of 1024 bytes.
(Herb Jiang via shalin)
98. SOLR-1077: IndexOutOfBoundsException with useSolrAddSchema in DIH
XPathEntityProcessor. (Sam Keen, Noble Paul via shalin)
99. SOLR-1080: DIH RegexTransformer should not replace if regex is not matched.
(Noble Paul, Fergus McMenemie via shalin)
100.SOLR-1090: DataImportHandler should load the data-config.xml using UTF-8
encoding. (Rui Pereira, shalin)
101.SOLR-1146: ConcurrentModificationException in DataImporter.getStatusMessages
(Walter Ferrara, Noble Paul via shalin)
102.SOLR-1229: Fixes for DIH deletedPkQuery, particularly when using
transformed Solr unique id's
(Lance Norskog, Noble Paul via ehatcher)
103.SOLR-1286: Fix the DIH commit parameter always defaulting to "true" even
if "false" is explicitly passed in. (Jay Hill, Noble Paul via ehatcher)
104.SOLR-1323: Reset XPathEntityProcessor's $hasMore/$nextUrl when fetching
next URL (noble, ehatcher)
105.SOLR-1450: DIH: Jdbc connection properties such as batchSize are not
applied if the driver jar is placed in solr_home/lib.
(Steve Sun via shalin)
106.SOLR-1474: DIH Delta-import should run even if last_index_time is not set.
(shalin)
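Entry 92 (SOLR-1017) above attributes a thread-safety issue to SimpleDateFormat being shared by multiple threads. SimpleDateFormat instances are not safe for concurrent use, and one common remedy is to give each thread its own instance. The sketch below shows that general technique; it is not necessarily how SOLR-1017 was fixed, and the class name and date pattern are assumptions.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class PerThreadDateFormatSketch {
    // One SimpleDateFormat per thread, so no instance is ever shared.
    // The pattern and GMT time zone are assumptions for the example.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(() -> {
            SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
            f.setTimeZone(TimeZone.getTimeZone("GMT"));
            return f;
        });

    public static String format(Date date) {
        return FORMAT.get().format(date);
    }

    public static void main(String[] args) throws InterruptedException {
        // Two threads format concurrently; each uses its own formatter.
        Runnable task = () -> System.out.println(format(new Date(0L)));
        Thread a = new Thread(task);
        Thread b = new Thread(task);
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```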
Other Changes
----------------------
1. Upgraded to Lucene 2.4.0 (yonik)
@ -3357,6 +3792,55 @@ Other Changes
for discussion on language detection.
See http://www.apache.org/dist/lucene/tika/CHANGES-0.4.txt. (gsingers)
53. SOLR-782: DIH: Refactored SolrWriter to make it a concrete class and
removed wrappers over SolrInputDocument. Refactored to load Evaluators
lazily. Removed multiple document nodes in the configuration xml. Removed
support for 'default' variables, they are automatically available as
request parameters. (Noble Paul via shalin)
54. SOLR-964: DIH: XPathEntityProcessor now ignores DTD validations
(Fergus McMenemie, Noble Paul via shalin)
55. SOLR-1029: DIH: Standardize Evaluator parameter parsing and added helper
functions for parsing all evaluator parameters in a standard way.
(Noble Paul, shalin)
56. SOLR-1081: Change DIH EventListener to be an interface so that components
such as an EntityProcessor or a Transformer can act as an event listener.
(Noble Paul, shalin)
57. SOLR-1027: DIH: Alias the 'dataimporter' namespace to a shorter name 'dih'.
(Noble Paul via shalin)
58. SOLR-1084: Better error reporting when DIH entity name is a reserved word
and data-config.xml root node is not <dataConfig>.
(Noble Paul via shalin)
59. SOLR-1087: Deprecate 'where' attribute in CachedSqlEntityProcessor in
favor of cacheKey and cacheLookup. (Noble Paul via shalin)
60. SOLR-969: Change the FULL_DUMP, DELTA_DUMP, FIND_DELTA constants in DIH
Context to String. Change Context.currentProcess() to return a string
instead of an integer. (Kay Kay, Noble Paul, shalin)
61. SOLR-1120: Simplified DIH EntityProcessor API by moving logic for applying
transformers and handling multi-row outputs from Transformers into an
EntityProcessorWrapper class. The behavior of the method
EntityProcessor#destroy has been modified to be called once per parent-row
at the end of row. A new method EntityProcessor#close is added which is
called at the end of import. A new method
Context#getResolvedEntityAttribute is added which returns the resolved
value of an entity's attribute. Introduced a DocWrapper which takes care
of maintaining document level session variables.
(Noble Paul, shalin)
62. SOLR-1265: Add DIH variable resolving for URLDataSource properties like
baseUrl. (Chris Eldredge via ehatcher)
63. SOLR-1269: Better error messages from DIH JdbcDataSource when JDBC Driver
name or SQL is incorrect. (ehatcher, shalin)
Build
----------------------
1. SOLR-776: Added in ability to sign artifacts via Ant for releases (gsingers)
@ -3382,6 +3866,10 @@ Documentation
3. SOLR-1409: Added Solr Powered By Logos
4. SOLR-1369: Add HSQLDB Jar to example-DIH, unzip database and update
instructions.
================== Release 1.3.0 ==================
Upgrading from Solr 1.2
@ -3727,7 +4215,10 @@ New Features
71. SOLR-1129 : Support binding dynamic fields to beans in SolrJ (Avlesh Singh , noble)
72. SOLR-920 : Cache and reuse IndexSchema . A new attribute added in solr.xml called 'shareSchema' (noble)
73. SOLR-700: DIH: Allow configurable locales through a locale attribute in
fields for NumberFormatTransformer. (Stefan Oestreicher, shalin)
Changes in runtime behavior
1. SOLR-559: use Lucene updateDocument, deleteDocuments methods. This
removes the maxBufferedDeletes parameter added by SOLR-310 as Lucene
@ -3942,6 +4433,18 @@ Bug Fixes
50. SOLR-749: Allow QParser and ValueSourceParsers to be extended with same name (hossman, gsingers)
51. SOLR-704: DIH NumberFormatTransformer can silently ignore part of the
string while parsing. Now it tries to use the complete string for parsing.
Failure to do so will result in an exception.
(Stefan Oestreicher via shalin)
52. SOLR-729: DIH Context.getDataSource(String) gives current entity's
DataSource instance regardless of argument. (Noble Paul, shalin)
53. SOLR-726: DIH: Jdbc Drivers and DataSources fail to load if placed in
multicore sharedLib or core's lib directory.
(Walter Ferrara, Noble Paul, shalin)
Other Changes
1. SOLR-135: Moved common classes to org.apache.solr.common and altered the
build scripts to make two jars: apache-solr-1.3.jar and


@ -1,547 +0,0 @@
Apache Solr - DataImportHandler
Release Notes
Introduction
------------
DataImportHandler is a data import tool for Solr which makes importing data from Databases, XML files and
HTTP data sources quick and easy.
$Id$
================== 5.0.0 ==============
(No changes)
================== 4.0.0-ALPHA ==============
Bug Fixes
----------------------
* SOLR-3430: Added a new test against a real SQL database. Fixed problems revealed by this new test
related to the expanded cache support added to 3.6/SOLR-2382 (James Dyer)
* SOLR-1958: When using the MailEntityProcessor, import would fail if fetchMailsSince was not specified.
(Max Lynch via James Dyer)
Other Changes
----------------------
* SOLR-3262: The "threads" feature is removed (deprecated in Solr 3.6) (James Dyer)
* SOLR-3422: Refactored internal data classes.
All entities in data-config.xml must have a name (James Dyer)
================== 3.6.1 ==================
Bug Fixes
----------------------
* SOLR-3360: More bug fixes for the deprecated "threads" parameter. (Mikhail Khludnev, Claudio R, via James Dyer)
* SOLR-3430: Added a new test against a real SQL database. Fixed problems revealed by this new test
related to the expanded cache support added to 3.6/SOLR-2382 (James Dyer)
* SOLR-3336: SolrEntityProcessor substitutes most variables at query time.
(Michael Kroh, Lance Norskog, via Martijn van Groningen)
================== 3.6.0 ==================
New Features
----------------------
* SOLR-1499: Added SolrEntityProcessor that imports data from another Solr core or instance based on a specified query.
(Lance Norskog, Erik Hatcher, Pulkit Singhal, Ahmet Arslan, Luca Cavanna, Martijn van Groningen)
Additional Work:
SOLR-3190: Minor improvements to SolrEntityProcessor. Add more consistency between solr parameters
and parameters used in SolrEntityProcessor and ability to specify a custom HttpClient instance.
(Luca Cavanna via Martijn van Groningen)
* SOLR-2382: Added pluggable cache support so that any Entity can be made cache-able by adding the "cacheImpl" parameter.
Include "SortedMapBackedCache" to provide in-memory caching (as previously this was the only option when
using CachedSqlEntityProcessor). Users can provide their own implementations of DIHCache for other
caching strategies. Deprecate CachedSqlEntityProcessor in favor of specifying "cacheImpl" with
SqlEntityProcessor. Make SolrWriter implement DIHWriter and allow the possibility of pluggable Writers
(DIH writing to something other than Solr). (James Dyer, Noble Paul)
Changes in Runtime Behavior
----------------------
* SOLR-3142: Imports no longer default optimize to true, instead false. If you want to force all segments to be merged
into one, you can specify this parameter yourself. NOTE: this can be a very expensive operation and usually
does not make sense for delta-imports. (Robert Muir)
================== 3.5.0 ==================
Bug Fixes
----------------------
* SOLR-2875: Fix the incorrect url in tika-data-config.xml (Shinichiro Abe via koji)
================== 3.4.0 ==================
Bug Fixes
----------------------
* SOLR-2644: When using threads=2 the default logging is set too high (Bill Bell via shalin)
* SOLR-2492: DIH does not commit if only deletes are processed (James Dyer via shalin)
* SOLR-2186: DataImportHandler's multi-threaded option throws NPE (Lance Norskog, Frank Wesemann, shalin)
* SOLR-2655: DIH multi threaded mode does not resolve attributes correctly (Frank Wesemann, shalin)
* SOLR-2695: Documents are collected in unsynchronized list in multi-threaded debug mode (Michael McCandless, shalin)
* SOLR-2668: DIH multithreaded mode does not rollback on errors from EntityProcessor (Frank Wesemann, shalin)
================== 3.3.0 ==================
* SOLR-2551: Check dataimport.properties for write access (if delta-import is supported
in DIH configuration) before starting an import (C S, shalin)
================== 3.2.0 ==================
(No Changes)
================== 3.1.0 ==================
Upgrading from Solr 1.4
----------------------
Versions of Major Components
---------------------
Detailed Change List
----------------------
New Features
----------------------
* SOLR-1525 : allow DIH to refer to core properties (noble)
* SOLR-1547 : TemplateTransformer copy objects more intelligently when the template is a single variable (noble)
* SOLR-1627 : VariableResolver should be fetched just in time (noble)
* SOLR-1583 : Create DataSources that return InputStream (noble)
* SOLR-1358 : Integration of Tika and DataImportHandler (Akshay Ukey, noble)
* SOLR-1654 : TikaEntityProcessor example added to DIHExample (Akshay Ukey via noble)
* SOLR-1678 : Move onError handling to DIH framework (noble)
* SOLR-1352 : Multi-threaded implementation of DIH (noble)
* SOLR-1721 : Add explicit option to run DataImportHandler in synchronous mode (Alexey Serba via noble)
* SOLR-1737 : Added FieldStreamDataSource (noble)
Optimizations
----------------------
* SOLR-2200: Improve the performance of DataImportHandler for large delta-import
updates. (Mark Waddle via rmuir)
Bug Fixes
----------------------
* SOLR-1638: Fixed NullPointerException during import if uniqueKey is not specified
in schema (Akshay Ukey via shalin)
* SOLR-1639: Fixed misleading error message when dataimport.properties is not writable (shalin)
* SOLR-1598: Reader used in PlainTextEntityProcessor is not explicitly closed (Sascha Szott via noble)
* SOLR-1759: $skipDoc was not working correctly (Gian Marco Tagliani via noble)
* SOLR-1762: DateFormatTransformer does not work correctly with non-default locale dates (tommy chheng via noble)
* SOLR-1757: DIH multithreading sometimes throws NPE (noble)
* SOLR-1766: DIH with threads enabled doesn't respond to the abort command (Michael Henson via noble)
* SOLR-1767: dataimporter.functions.escapeSql() does not escape backslash character (Sean Timm via noble)
* SOLR-1811: formatDate should use the current NOW value always (Sean Timm via noble)
* SOLR-1794: Dataimport of CLOB fields fails when getCharacterStream() is
defined in a superclass. (Gunnar Gauslaa Bergem via rmuir)
* SOLR-2057: DataImportHandler never calls UpdateRequestProcessor.finish()
(Drew Farris via koji)
* SOLR-1973: Empty fields in XML update messages confuse DataImportHandler. (koji)
* SOLR-2221: Use StrUtils.parseBool() to get values of boolean options in DIH.
true/on/yes (for TRUE) and false/off/no (for FALSE) can be used for sub-options
(debug, verbose, synchronous, commit, clean, optimize) for full/delta-import commands. (koji)
* SOLR-2310: getTimeElapsedSince() returns incorrect hour value when the elapsed time is over 60 hours
(tom liu via koji)
* SOLR-2252: When a child entity in nested entities is rootEntity="true", delta-import doesn't work.
(koji)
* SOLR-2330: solrconfig.xml files in example-DIH are broken. (Matt Parker, koji)
* SOLR-1191: resolve DataImportHandler deltaQuery column against pk when pk
has a prefix (e.g. pk="book.id" deltaQuery="select id from ..."). More
useful error reporting when no match found (previously failed with a
NullPointerException in log and no clear user feedback). (gthb via yonik)
* SOLR-2116: Fix TikaConfig classloader bug in TikaEntityProcessor
(Martijn van Groningen via hossman)
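The boolean option values listed under SOLR-2221 above can be illustrated with a minimal sketch. This is not the code of StrUtils.parseBool(); the class name DihBoolOptionSketch is hypothetical, and both the case-insensitive matching and the exception thrown for unrecognized values are assumptions made for the example.

```java
import java.util.Locale;

// Hypothetical stand-in for illustration only; not Solr's StrUtils.
final class DihBoolOptionSketch {

    // Maps true/on/yes to TRUE and false/off/no to FALSE.
    // Assumption: matching is case-insensitive and anything else is rejected.
    static boolean parseBool(String value) {
        String v = value.trim().toLowerCase(Locale.ROOT);
        switch (v) {
            case "true":
            case "on":
            case "yes":
                return true;
            case "false":
            case "off":
            case "no":
                return false;
            default:
                throw new IllegalArgumentException("invalid boolean value: " + value);
        }
    }

    public static void main(String[] args) {
        // Example: sub-options such as clean or commit on a full-import command.
        System.out.println("clean=" + parseBool("yes"));
        System.out.println("commit=" + parseBool("off"));
    }
}
```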
Other Changes
----------------------
* SOLR-1821: Fix TimeZone-dependent test failure in TestEvaluatorBag.
(Chris Male via rmuir)
* SOLR-2367: Reduced noise in test output by ensuring the properties file can be written.
(Gunnlaugur Thor Briem via rmuir)
Build
----------------------
Documentation
----------------------
================== Release 1.4.0 ==================
Upgrading from Solr 1.3
-----------------------
Evaluator API has been changed in a non back-compatible way. Users who have developed custom Evaluators will need
to change their code according to the new API for it to work. See SOLR-996 for details.
The formatDate evaluator's syntax has been changed. The new syntax is formatDate(<variable>, '<format_string>').
For example, formatDate(x.date, 'yyyy-MM-dd'). In the old syntax, the format string was written without single quotes.
The old syntax has been deprecated and will be removed in 1.5. Until then, using the old syntax will log a warning.
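As a rough illustration of what a format string such as 'yyyy-MM-dd' produces, here is a minimal Java sketch. It assumes the format string follows java.text.SimpleDateFormat pattern syntax, as the example above suggests. The class FormatDateSketch and its helper are hypothetical and are not part of the Evaluator API; the GMT time zone and Locale.ROOT are chosen only to make the output reproducible.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration; not DataImportHandler code.
final class FormatDateSketch {

    // Formats a date with a SimpleDateFormat-style pattern.
    // Assumption: a fixed locale and GMT, so results do not depend on the host.
    static String formatDate(Date value, String pattern) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(value);
    }

    public static void main(String[] args) {
        // Corresponds loosely to formatDate(x.date, 'yyyy-MM-dd') in the new syntax.
        System.out.println(formatDate(new Date(0L), "yyyy-MM-dd")); // prints 1970-01-01
    }
}
```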
The Context API has been changed in a non back-compatible way. In particular, the Context.currentProcess() method
now returns a String describing the type of the current import process instead of an int. Similarly, the public
constants in Context viz. FULL_DUMP, DELTA_DUMP and FIND_DELTA are changed to a String type. See SOLR-969 for details.
The EntityProcessor API has been simplified by moving logic for applying transformers and handling multi-row outputs
from Transformers into an EntityProcessorWrapper class. The EntityProcessor#destroy is now called once per
parent-row at the end of row (end of data). A new method EntityProcessor#close is added which is called at the end
of import.
In Solr 1.3, if the last_index_time was not available (first import) and a delta-import was requested, a full-import
was run instead. This is no longer the case. In Solr 1.4, if last_index_time is not available, the delta-import
is run with last_index_time set to the epoch date (January 1, 1970, 00:00:00 GMT).
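The fallback described above can be sketched as follows. This is only an illustration of the rule, not the DataImportHandler implementation; the class LastIndexTimeSketch and the method effectiveLastIndexTime are hypothetical names.

```java
import java.util.Date;

// Hypothetical illustration of the Solr 1.4 behavior; not DIH source code.
final class LastIndexTimeSketch {

    // Returns the stored last_index_time, or the epoch when none is available
    // (for example on the very first import).
    static Date effectiveLastIndexTime(Date stored) {
        return stored != null ? stored : new Date(0L);
    }

    public static void main(String[] args) {
        // No previous import: the delta-import compares against the epoch,
        // so every row modified after January 1, 1970 qualifies.
        Date first = effectiveLastIndexTime(null);
        System.out.println(first.getTime()); // prints 0
    }
}
```

With the epoch as the reference time, a first delta-import effectively selects every row, but it still runs as a delta-import rather than being silently replaced by a full-import.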
Detailed Change List
----------------------
New Features
----------------------
1. SOLR-768: Set last_index_time variable in full-import command.
(Wojtek Piaseczny, Noble Paul via shalin)
2. SOLR-811: Allow a "deltaImportQuery" attribute in SqlEntityProcessor which is used for delta imports
instead of DataImportHandler manipulating the SQL itself.
(Noble Paul via shalin)
3. SOLR-842: Better error handling in DataImportHandler with options to abort, skip and continue imports.
(Noble Paul, shalin)
4. SOLR-833: A DataSource to read data from a field as a reader. This can be used, for example, to read XMLs
residing as CLOBs or BLOBs in databases.
(Noble Paul via shalin)
5. SOLR-887: A Transformer to strip HTML tags.
(Ahmed Hammad via shalin)
6. SOLR-886: DataImportHandler should rollback when an import fails or it is aborted
(shalin)
7. SOLR-891: A Transformer to read strings from Clob type.
(Noble Paul via shalin)
8. SOLR-812: Configurable JDBC settings in JdbcDataSource including optimized defaults for read only mode.
(David Smiley, Glen Newton, shalin)
9. SOLR-910: Add a few utility commands to the DIH admin page such as full import, delta import, status, reload config.
(Ahmed Hammad via shalin)
10.SOLR-938: Add event listener API for import start and end.
(Kay Kay, Noble Paul via shalin)
11.SOLR-801: Add support for configurable pre-import and post-import delete query per root-entity.
(Noble Paul via shalin)
12.SOLR-988: Add a new scope for session data stored in Context to store objects across imports.
(Noble Paul via shalin)
13.SOLR-980: A PlainTextEntityProcessor which can read from any DataSource<Reader> and output a String.
(Nathan Adams, Noble Paul via shalin)
14.SOLR-1003: XPathEntityProcessor must allow slurping all text from a given xml node and its children.
(Noble Paul via shalin)
15.SOLR-1001: Allow variables in various attributes of RegexTransformer, HTMLStripTransformer
and NumberFormatTransformer.
(Fergus McMenemie, Noble Paul, shalin)
16.SOLR-989: Expose running statistics from the Context API.
(Noble Paul, shalin)
17.SOLR-996: Expose Context to Evaluators.
(Noble Paul, shalin)
18.SOLR-783: Enhance delta-imports by maintaining separate last_index_time for each entity.
(Jon Baer, Noble Paul via shalin)
19.SOLR-1033: Current entity's namespace is made available to all Transformers. This allows one to use an output field
of TemplateTransformer in other transformers, among other things.
(Fergus McMenemie, Noble Paul via shalin)
20.SOLR-1066: New methods in Context to expose Script details. ScriptTransformer changed to read scripts
through the new API methods.
(Noble Paul via shalin)
21.SOLR-1062: A LogTransformer which can log data in a given template format.
(Jon Baer, Noble Paul via shalin)
22.SOLR-1065: A ContentStreamDataSource which can accept HTTP POST data in a content stream. This can be used to
push data to Solr instead of just pulling it from DB/Files/URLs.
(Noble Paul via shalin)
23.SOLR-1061: Improve RegexTransformer to create multiple columns from regex groups.
(Noble Paul via shalin)
24.SOLR-1059: Special flags introduced for deleting documents by query or id, skipping rows and stopping further
transforms. Use $deleteDocById, $deleteDocByQuery for deleting by id and query respectively.
Use $skipRow to skip the current row but continue with the document. Use $stopTransform to stop
further transformers. New methods are introduced in Context for deleting by id and query.
(Noble Paul, Fergus McMenemie, shalin)
25.SOLR-1076: JdbcDataSource should resolve variables in all its configuration parameters.
(shalin)
26.SOLR-1055: Make DIH JdbcDataSource easily extensible by making the createConnectionFactory method protected and
return a Callable<Connection> object.
(Noble Paul, shalin)
27.SOLR-1058: JdbcDataSource can lookup javax.sql.DataSource using JNDI. Use a jndiName attribute to specify the
location of the data source.
(Jason Shepherd, Noble Paul via shalin)
28.SOLR-1083: An Evaluator for escaping query characters.
(Noble Paul, shalin)
29.SOLR-934: A MailEntityProcessor to enable indexing mails from POP/IMAP sources into a Solr index.
(Preetam Rao, shalin)
30.SOLR-1060: A LineEntityProcessor which can stream lines of text from a given file to be indexed directly or
for processing with transformers and child entities.
(Fergus McMenemie, Noble Paul, shalin)
31.SOLR-1127: Add support for field name to be templatized.
(Noble Paul, shalin)
32.SOLR-1092: Added a new command named 'import' which does not automatically clean the index. This is useful and
more appropriate when one needs to import only some of the entities.
(Noble Paul via shalin)
33.SOLR-1153: 'deltaImportQuery' is honored on child entities as well (noble)
34.SOLR-1230: Enhanced dataimport.jsp to work with all DataImportHandler request handler configurations,
rather than just a hardcoded /dataimport handler. (ehatcher)
35.SOLR-1235: disallow period (.) in entity names (noble)
36.SOLR-1234: Multiple DIH handlers do not work because all of them write to dataimport.properties.
Use the handler name as the properties file name (noble)
37.SOLR-1348: Support binary field type in convertType logic in JdbcDataSource (shalin)
38.SOLR-1406: Make FileDataSource and FileListEntityProcessor more extensible (Luke Forehand, shalin)
39.SOLR-1437 : XPathEntityProcessor can deal with xpath syntaxes such as //tagname , /root//tagname (Fergus McMenemie via noble)
Optimizations
----------------------
1. SOLR-846: Reduce memory consumption during delta import by removing keys when used
(Ricky Leung, Noble Paul via shalin)
2. SOLR-974: DataImportHandler skips commit if no data has been updated.
(Wojtek Piaseczny, shalin)
3. SOLR-1004: Check for abort more frequently during delta-imports.
(Marc Sturlese, shalin)
4. SOLR-1098: DateFormatTransformer can cache the format objects.
(Noble Paul via shalin)
5. SOLR-1465: Replaced string concatenations with StringBuilder append calls in XPathRecordReader.
(Mark Miller, shalin)
Bug Fixes
----------------------
1. SOLR-800: Deep copy collections to avoid ConcurrentModificationException in XPathEntityProcessor while streaming
(Kyle Morrison, Noble Paul via shalin)
2. SOLR-823: Request parameter variables ${dataimporter.request.xxx} are not resolved
(Mck SembWever, Noble Paul, shalin)
3. SOLR-728: Add synchronization to avoid race condition of multiple imports working concurrently
(Walter Ferrara, shalin)
4. SOLR-742: Add ability to create dynamic fields with custom DataImportHandler transformers
(Wojtek Piaseczny, Noble Paul, shalin)
5. SOLR-832: Rows parameter is not honored in non-debug mode and can abort a running import in debug mode.
(Akshay Ukey, shalin)
6. SOLR-838: The VariableResolver obtained from a DataSource's context does not have current data.
(Noble Paul via shalin)
7. SOLR-864: DataImportHandler does not catch and log Errors (shalin)
8. SOLR-873: Fix case-sensitive field names and columns (Jon Baer, shalin)
9. SOLR-893: Unable to delete documents via SQL and deletedPkQuery with deltaimport
(Dan Rosher via shalin)
10. SOLR-888: DateFormatTransformer cannot convert non-string type
(Amit Nithian via shalin)
11. SOLR-841: DataImportHandler should throw exception if a field does not have column attribute
(Michael Henson, shalin)
12. SOLR-884: CachedSqlEntityProcessor should check if the cache key is present in the query results
(Noble Paul via shalin)
13. SOLR-985: Fix thread-safety issue with TemplateString for concurrent imports with multiple cores.
(Ryuuichi Kumai via shalin)
14. SOLR-999: XPathRecordReader fails on XMLs with nodes mixed with CDATA content.
(Fergus McMenemie, Noble Paul via shalin)
15.SOLR-1000: FileListEntityProcessor should not apply fileName filter to directory names.
(Fergus McMenemie via shalin)
16.SOLR-1009: Repeated column names result in duplicate values.
(Fergus McMenemie, Noble Paul via shalin)
17.SOLR-1017: Fix thread-safety issue with last_index_time for concurrent imports in multiple cores due to unsafe usage
of SimpleDateFormat by multiple threads.
(Ryuuichi Kumai via shalin)
18.SOLR-1024: Calling abort on DataImportHandler import commits data instead of calling rollback.
(shalin)
19.SOLR-1037: DIH should not add null values in a row returned by EntityProcessor to documents.
(shalin)
20.SOLR-1040: XPathEntityProcessor fails with an xpath like /feed/entry/link[@type='text/html']/@href
(Noble Paul via shalin)
21.SOLR-1042: Fix memory leak in DIH by making TemplateString non-static member in VariableResolverImpl
(Ryuuichi Kumai via shalin)
22.SOLR-1053: IndexOutOfBoundsException in SolrWriter.getResourceAsString when size of data-config.xml is a
multiple of 1024 bytes.
(Herb Jiang via shalin)
23.SOLR-1077: IndexOutOfBoundsException with useSolrAddSchema in XPathEntityProcessor.
(Sam Keen, Noble Paul via shalin)
24.SOLR-1080: RegexTransformer should not replace if regex is not matched.
(Noble Paul, Fergus McMenemie via shalin)
25.SOLR-1090: DataImportHandler should load the data-config.xml using UTF-8 encoding.
(Rui Pereira, shalin)
26.SOLR-1146: ConcurrentModificationException in DataImporter.getStatusMessages
(Walter Ferrara, Noble Paul via shalin)
27.SOLR-1229: Fixes for deletedPkQuery, particularly when using transformed Solr unique ids
(Lance Norskog, Noble Paul via ehatcher)
28.SOLR-1286: Fix the commit parameter always defaulting to "true" even if "false" is explicitly passed in.
(Jay Hill, Noble Paul via ehatcher)
29.SOLR-1323: Reset XPathEntityProcessor's $hasMore/$nextUrl when fetching next URL (noble, ehatcher)
30.SOLR-1450: Jdbc connection properties such as batchSize are not applied if the driver jar is placed
in solr_home/lib.
(Steve Sun via shalin)
31.SOLR-1474: Delta-import should run even if last_index_time is not set.
(shalin)
Documentation
----------------------
1. SOLR-1369: Add HSQLDB Jar to example-DIH, unzip database and update instructions.
Other
----------------------
1. SOLR-782: Refactored SolrWriter to make it a concrete class and removed wrappers over SolrInputDocument.
Refactored to load Evaluators lazily. Removed multiple document nodes in the configuration xml.
Removed support for 'default' variables, they are automatically available as request parameters.
(Noble Paul via shalin)
2. SOLR-964: XPathEntityProcessor now ignores DTD validations
(Fergus McMenemie, Noble Paul via shalin)
3. SOLR-1029: Standardize Evaluator parameter parsing and added helper functions for parsing all evaluator
parameters in a standard way.
(Noble Paul, shalin)
4. SOLR-1081: Change EventListener to be an interface so that components such as an EntityProcessor or a Transformer
can act as an event listener.
(Noble Paul, shalin)
5. SOLR-1027: Alias the 'dataimporter' namespace to a shorter name 'dih'.
(Noble Paul via shalin)
6. SOLR-1084: Better error reporting when entity name is a reserved word and data-config.xml root node
is not <dataConfig>.
(Noble Paul via shalin)
7. SOLR-1087: Deprecate 'where' attribute in CachedSqlEntityProcessor in favor of cacheKey and cacheLookup.
(Noble Paul via shalin)
8. SOLR-969: Change the FULL_DUMP, DELTA_DUMP, FIND_DELTA constants in Context to String.
Change Context.currentProcess() to return a string instead of an integer.
(Kay Kay, Noble Paul, shalin)
9. SOLR-1120: Simplified EntityProcessor API by moving logic for applying transformers and handling multi-row outputs
from Transformers into an EntityProcessorWrapper class. The behavior of the method
EntityProcessor#destroy has been modified to be called once per parent-row at the end of row. A new
method EntityProcessor#close is added which is called at the end of import. A new method
Context#getResolvedEntityAttribute is added which returns the resolved value of an entity's attribute.
Introduced a DocWrapper which takes care of maintaining document level session variables.
(Noble Paul, shalin)
10.SOLR-1265: Add variable resolving for URLDataSource properties like baseUrl. (Chris Eldredge via ehatcher)
11.SOLR-1269: Better error messages from JdbcDataSource when JDBC Driver name or SQL is incorrect.
(ehatcher, shalin)
================== Release 1.3.0 ==================
Status
------
This is the first release since DataImportHandler was added to the Solr contrib distribution.
The following list covers changes since the code was introduced, not since
the first official release.
Detailed Change List
--------------------
New Features
1. SOLR-700: Allow configurable locales through a locale attribute in fields for NumberFormatTransformer.
(Stefan Oestreicher, shalin)
Changes in runtime behavior
Bug Fixes
1. SOLR-704: NumberFormatTransformer can silently ignore part of the string while parsing. Now it tries to
use the complete string for parsing. Failure to do so will result in an exception.
(Stefan Oestreicher via shalin)
2. SOLR-729: Context.getDataSource(String) gives current entity's DataSource instance regardless of argument.
(Noble Paul, shalin)
3. SOLR-726: Jdbc Drivers and DataSources fail to load if placed in multicore sharedLib or core's lib directory.
(Walter Ferrara, Noble Paul, shalin)
Other Changes

View File

@ -1,3 +1,12 @@
Apache Solr - DataImportHandler
Introduction
------------
DataImportHandler is a data import tool for Solr which makes importing data from Databases, XML files and
HTTP data sources quick and easy.
Important Note
--------------
Although Solr strives to be agnostic of the Locale where the server is
running, some code paths in DataImportHandler are known to depend on the
System default Locale, Timezone, or Charset. It is recommended that when