generates index segments.
The HadoopIndexTask run method wraps a HadoopDruidIndexerJob run method. The
key modifications to the HadoopDruidIndexerJob are as follows:
- The UpdaterJobSpec field of the config that is used to set up the indexer job
is set to null. This ensures that the job does not push a list of published
segments to the database, so that the indexing service can handle this
later.
- Set the version field of the config file based on the TaskContext. Also
changed the config.setVersion method to take a String (as opposed to a Date) as
input, and propagated this change where necessary.
- Set the SegmentOutputDir field of the config file based on the TaskToolbox,
to allow the indexing service to control where the segments are written.
- Added a method to IndexGeneratorJob called getPublishedSegments, which simply
returns the list of published segments without publishing this list to the
database.
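
A minimal sketch of the wrapping described above, not the actual Druid code.
All class names here (SketchIndexerConfig, SketchIndexerJob, SketchIndexTask)
are hypothetical stand-ins, and segments are modelled as plain path strings.
It only illustrates the idea: null out the updater spec, take the version and
output directory from the task side, then read the segment list back instead
of letting the job write it to the database.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the indexer config; fields are assumptions.
class SketchIndexerConfig {
    String updaterJobSpec = "db-spec"; // non-null means "publish to the database"
    String version;                    // a String, not a Date
    String segmentOutputDir;
}

// Hypothetical stand-in for the wrapped indexer job.
class SketchIndexerJob {
    private final SketchIndexerConfig config;
    private final List<String> published = new ArrayList<>();
    final List<String> dbWrites = new ArrayList<>(); // records simulated DB pushes

    SketchIndexerJob(SketchIndexerConfig config) {
        this.config = config;
    }

    boolean run() {
        // Pretend one segment was generated under the configured directory.
        published.add(config.segmentOutputDir + "/" + config.version + "/segment-0");
        // Only push to the database when an updater spec is present.
        if (config.updaterJobSpec != null) {
            dbWrites.addAll(published);
        }
        return true;
    }

    // Returns the segment list without touching the database.
    List<String> getPublishedSegments() {
        return new ArrayList<>(published);
    }
}

// Hypothetical stand-in for the task that wraps the job.
class SketchIndexTask {
    static List<String> runTask(String contextVersion, String toolboxOutputDir) {
        SketchIndexerConfig config = new SketchIndexerConfig();
        config.updaterJobSpec = null;              // the job must not publish
        config.version = contextVersion;           // version comes from the task context
        config.segmentOutputDir = toolboxOutputDir; // output dir comes from the toolbox
        SketchIndexerJob job = new SketchIndexerJob(config);
        if (!job.run()) {
            throw new IllegalStateException("indexer job failed");
        }
        if (!job.dbWrites.isEmpty()) {
            throw new IllegalStateException("job wrote to the database");
        }
        return job.getPublishedSegments();         // the caller publishes later
    }

    public static void main(String[] args) {
        System.out.println(runTask("2013-01-01T00:00:00Z", "/tmp/segments"));
    }
}
```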
HadoopDruidIndexerConfig:
- Add partitionsSpec (backwards compatible with targetPartitionSize and partitionDimension)
- Add assumeGrouped flag to partitionsSpec
DeterminePartitionsJob:
- Skip group-by job if assumeGrouped is set
- Clean up code a bit
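
A rough sketch of how the backwards-compatible partitionsSpec and the
assumeGrouped flag could fit together. The names SketchPartitionsSpec and
SketchPartitionConfig are hypothetical, and rejecting a config that mixes the
new spec with the legacy fields is an assumed design choice, not something
this change is confirmed to do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for partitionsSpec.
class SketchPartitionsSpec {
    final String partitionDimension; // may be null
    final Long targetPartitionSize;  // may be null
    final boolean assumeGrouped;

    SketchPartitionsSpec(String partitionDimension, Long targetPartitionSize, boolean assumeGrouped) {
        this.partitionDimension = partitionDimension;
        this.targetPartitionSize = targetPartitionSize;
        this.assumeGrouped = assumeGrouped;
    }
}

class SketchPartitionConfig {
    // Accept either the new spec or the legacy top-level fields.
    static SketchPartitionsSpec resolve(SketchPartitionsSpec spec,
                                        Long legacyTargetPartitionSize,
                                        String legacyPartitionDimension) {
        if (spec != null) {
            // Assumption: mixing both styles is treated as a config error.
            if (legacyTargetPartitionSize != null || legacyPartitionDimension != null) {
                throw new IllegalArgumentException(
                    "Use either partitionsSpec or the legacy fields, not both");
            }
            return spec;
        }
        // Legacy configs never set assumeGrouped, so default it to false.
        return new SketchPartitionsSpec(legacyPartitionDimension, legacyTargetPartitionSize, false);
    }

    // Lists the stages to run; the group-by stage is skipped when
    // the input is assumed to be grouped already.
    static List<String> planStages(SketchPartitionsSpec spec) {
        List<String> stages = new ArrayList<>();
        if (!spec.assumeGrouped) {
            stages.add("group-by");
        }
        stages.add("determine-partitions");
        return stages;
    }

    public static void main(String[] args) {
        System.out.println(planStages(resolve(null, 5000000L, "page")));
        System.out.println(planStages(resolve(new SketchPartitionsSpec(null, 5000000L, true), null, null)));
    }
}
```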
2) Fix bug with IndexMerger and null columns
3) Add QueryableIndexIndexableAdapter so that QueryableIndexes can be merged
4) Adjust twitter example to have multiple values for each hash tag
5) Adjust GroupByQueryEngine to just drop dimensions that don't exist instead of throwing an NPE
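
To illustrate item 5, here is a small sketch of the "drop instead of NPE"
behavior. SketchGroupByDims and its column map are hypothetical and do not
reflect GroupByQueryEngine's real internals; the point is only that a
requested dimension with no backing column is filtered out before it can be
dereferenced.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SketchGroupByDims {
    // Keeps only the requested dimensions that have a backing column.
    // Missing dimensions are dropped rather than dereferenced as null.
    static List<String> resolveDimensions(List<String> requested, Map<String, int[]> columns) {
        List<String> usable = new ArrayList<>();
        for (String dim : requested) {
            if (columns.get(dim) != null) {
                usable.add(dim);
            }
        }
        return usable;
    }

    public static void main(String[] args) {
        Map<String, int[]> columns = new HashMap<>();
        columns.put("hashtag", new int[] {0, 1, 1});
        // "lang" has no column, so only "hashtag" survives.
        System.out.println(resolveDimensions(Arrays.asList("hashtag", "lang"), columns));
    }
}
```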
- Can handle non-rolled-up input (by grouping input rows using an additional MR stage)
- Can select its own partitioning dimension, if none is supplied
- Can detect and avoid oversized shards due to bad dimension value distribution
- Shares input parsing code with IndexGeneratorJob
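
One way the dimension selection and oversized-shard check listed above might
work, as a sketch only. The greedy packing, the maxSize threshold and the
"smallest worst-case shard wins" rule are all assumptions made for
illustration; the actual job may use a different heuristic.
SketchPartitionChooser is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SketchPartitionChooser {
    // Greedily packs sorted dimension values into shards of roughly targetSize rows.
    // A single value is never split, so a very common value yields an oversized shard.
    static List<Long> shardSizes(TreeMap<String, Long> rowsPerValue, long targetSize) {
        List<Long> sizes = new ArrayList<>();
        long current = 0;
        for (long rows : rowsPerValue.values()) {
            if (current > 0 && current + rows > targetSize) {
                sizes.add(current);
                current = 0;
            }
            current += rows;
        }
        if (current > 0) {
            sizes.add(current);
        }
        return sizes;
    }

    // Picks the candidate dimension whose largest shard is smallest,
    // skipping any dimension that would produce a shard above maxSize.
    // Returns null when every candidate would be oversized.
    static String choose(Map<String, TreeMap<String, Long>> candidates, long targetSize, long maxSize) {
        String best = null;
        long bestWorst = Long.MAX_VALUE;
        for (Map.Entry<String, TreeMap<String, Long>> entry : candidates.entrySet()) {
            long worst = 0;
            for (long size : shardSizes(entry.getValue(), targetSize)) {
                worst = Math.max(worst, size);
            }
            if (worst <= maxSize && worst < bestWorst) {
                best = entry.getKey();
                bestWorst = worst;
            }
        }
        return best;
    }
}
```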
2) Have the DbUpdaterJob read descriptors from the temporary working directory instead of looking in the final segment output location (often the eventually consistent S3)
3) 1 and 2 fix #30
2) Add some docs to InputRow/Row to indicate that column names passed into the methods are *always* lowercase and that the rows need to act accordingly. (fixes #29, or at least clarifies the behavior...)
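
A minimal sketch of what "act accordingly" could mean for a row
implementation. SketchLowercaseRow and its getRaw method are hypothetical and
are not the real InputRow/Row interface; the assumption is that a row
normalizes its own keys once, so that the always-lowercase names passed in by
callers match.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class SketchLowercaseRow {
    private final Map<String, Object> values = new HashMap<>();

    // Lowercases every key up front so later lookups need no conversion.
    SketchLowercaseRow(Map<String, Object> event) {
        for (Map.Entry<String, Object> entry : event.entrySet()) {
            values.put(entry.getKey().toLowerCase(Locale.ROOT), entry.getValue());
        }
    }

    // Callers are expected to pass an already-lowercase column name.
    Object getRaw(String lowercaseColumn) {
        return values.get(lowercaseColumn);
    }

    public static void main(String[] args) {
        Map<String, Object> event = new HashMap<>();
        event.put("UserName", "alice");
        SketchLowercaseRow row = new SketchLowercaseRow(event);
        System.out.println(row.getRaw("username"));
    }
}
```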