Commit Graph

852 Commits

Author SHA1 Message Date
Fokko Driesprong 67920c114e Fixed info message (#3481) 2016-09-21 15:50:29 -07:00
Gian Merlino 27bd5cb13a Add forceExtendableShardSpecs option to Hadoop indexing, IndexTask. (#3473)
Fixes #3241.
2016-09-21 13:40:04 -06:00
Slim ba6ddf307e Adding hadoop kerberos authentication. (#3419)
* adding kerberos authentication

* make the 2 functions identical
2016-09-13 10:42:50 -07:00
Jonathan Wei df766b2bbd Add dimension handling interface for ingestion and segment creation (#3217)
* Add dimension handling interface for ingestion and segment creation

* update javadocs for DimensionHandler/DimensionIndexer

* Move IndexIO row validation into DimensionHandler

* Fix null column skipping in mergerV9

* Add deprecation note for 'numeric_dims' filename pattern in IndexIO v8->v9 conversion

* Fix java7 test failure
2016-09-12 12:54:02 -07:00
Himanshu 3b6c81e7c0 fix cleanup of hadoop ingestion intermediate path (#3385) 2016-09-08 01:36:56 +05:30
Dave Li c4e8440c22 Adds long compression methods (#3148)
* add read

* update deprecated guava calls

* add write and vsizeserde

* add benchmark

* separate encoding and compression

* add header and reformat

* update doc

* address PR comment

* fix buffer order

* generate benchmark files

* separate encoding strategy and format

* fix benchmark

* modify supplier write to channel

* add float NONE handling

* address PR comment

* address PR comment 2
2016-08-30 16:17:46 -07:00
Hamlet Lee e4f0eac8e6 Fix issue #2707 (#2708) 2016-08-16 12:19:44 -05:00
Gian Merlino a2bcd97512 IncrementalIndex: Fix multi-value dimensions returned from iterators. (#3344)
They had arrays as values, which MapBasedRow doesn't understand and
toStrings rather than converting to lists.
2016-08-10 08:47:29 -07:00
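The difference between an array value and a list value described in the commit above can be illustrated with a minimal sketch. The class `MultiValueSketch` and its `normalize` helper are hypothetical names made up for this illustration; they are not taken from the Druid codebase, and the real fix may be structured differently.

```java
import java.util.Arrays;

public class MultiValueSketch {
    // Hypothetical normalizer (not Druid code): converts String[] dimension
    // values to lists and passes every other value through unchanged.
    public static Object normalize(Object value) {
        if (value instanceof String[]) {
            return Arrays.asList((String[]) value);
        }
        return value;
    }

    public static void main(String[] args) {
        String[] raw = {"a", "b"};
        // An array's toString() is an identity string, not its contents.
        System.out.println(String.valueOf((Object) raw));
        // The list form prints its elements.
        System.out.println(normalize(raw));
    }
}
```

A consumer that calls `toString()` on the raw array sees only a type-and-hash string, while the list form exposes the individual values.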
Gian Merlino 9437a7a313 HLL: Avoid some allocations when possible. (#3314)
- HLLC.fold avoids duplicating the other buffer by saving and restoring its position.
- HLLC.makeCollector(buffer) no longer duplicates incoming BBs.
- Updated call sites where appropriate to duplicate BBs passed to HLLC.
2016-08-03 18:08:52 -07:00
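The save-and-restore pattern mentioned for HLLC.fold can be sketched as follows. This is only an illustration of the general `ByteBuffer` technique; `BufferPositionSketch` and `sumRemaining` are hypothetical names, and the summing loop stands in for whatever work the real method does on the other buffer.

```java
import java.nio.ByteBuffer;

public class BufferPositionSketch {
    // Hypothetical helper (not Druid code): reads the remaining bytes of
    // `other` without calling duplicate(), then restores its position so
    // the caller sees the buffer unchanged.
    public static int sumRemaining(ByteBuffer other) {
        final int savedPosition = other.position();
        try {
            int sum = 0;
            while (other.hasRemaining()) {
                sum += other.get();
            }
            return sum;
        } finally {
            // Restore the position even if reading throws.
            other.position(savedPosition);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[]{1, 2, 3, 4});
        buf.position(1);
        int sum = sumRemaining(buf);
        System.out.println(sum + " " + buf.position());
    }
}
```

The trade-off is that the buffer is briefly mutated, so this approach is only safe when no other thread reads the same buffer concurrently; callers that cannot guarantee this would still pass in a duplicate.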
kaijianding 50d52a24fc ability to not rollup at index time, make pre aggregation an option (#3020)
* ability to not rollup at index time, make pre aggregation an option

* rename getRowIndexForRollup to getPriorIndex

* fix doc misspelling

* test query using no-rollup indexes

* fix benchmark fail due to jmh bug
2016-08-02 11:13:05 -07:00
kaijianding 3dc2974894 Add timestampSpec to metadata.drd and SegmentMetadataQuery (#3227)
* save TimestampSpec in metadata.drd

* add timestampSpec info in SegmentMetadataQuery
2016-07-25 15:45:30 -07:00
Navis Ryu cd7337fc8a Calculate max split size based on numMapTask in DatasourceInputFormat (#2882)
* Calculate max split size based on numMapTask

* updated docs & fixed possible ArithmeticException
2016-07-20 16:53:51 -07:00
Hyukjin Kwon 55e7a52475 Replace deprecated usage for StringInputRowParser and JSONParseSpec (#3215) 2016-07-14 09:19:17 -07:00
Gian Merlino ea03906fcf Configurable compressRunOnSerialization for Roaring bitmaps. (#3228)
Defaults to true, which is a change in behavior (this used to be false and unconfigurable).
2016-07-08 10:24:19 +05:30
Hyukjin Kwon 45f553fc28 Replace the deprecated usage of NoneShardSpec (#3166) 2016-06-25 10:27:25 -07:00
Nishant 2696b0c451 Retry for transient exceptions while doing cleanup for Hadoop Jobs (#3177)
* fix 1828

fixes https://github.com/druid-io/druid/issues/1828

* remove unused import

* Review comment
2016-06-23 13:38:47 -07:00
Nishant 6f330dc816 Better handling for parseExceptions for Batch Ingestion (#3171)
* Better handling for parseExceptions

* make parseException handling consistent with Realtime

* change combiner default val to true

* review comments

* review comments
2016-06-22 16:38:29 -07:00
Gian Merlino ebf890fe79 Update master version to 0.9.2-SNAPSHOT. (#3133) 2016-06-13 13:10:38 -07:00
Nishant 778f97a80e attempt to fix-2906 (#2985)
* attempt to fix-2984

* review comments

* Add test
2016-05-18 15:12:38 -05:00
Charles Allen 15ccf451f9 Move QueryGranularity static fields to QueryGranularities (#2980)
* Move QueryGranularity static fields to QueryGranularityUtil
* Fixes #2979

* Add test showing #2979

* change name to QueryGranularities
2016-05-17 16:23:48 -07:00
David Lim b489f63698 Supervisor for KafkaIndexTask (#2656)
* supervisor for kafka indexing tasks

* cr changes
2016-05-04 23:13:13 -07:00
Navis Ryu 49ef4d96cb Merge pull request #2802 from navis/optimize_multiplepath_concat
Optimize adding lots of paths to pathspec
2016-04-11 23:35:28 -05:00
jon-wei 0e481d6f93 Allow filters to use extraction functions 2016-04-05 13:24:56 -07:00
Gian Merlino 977e867ad8 Downgrade geoip2, exclude com.google.http-client.
Reverts "Update com.maxmind.geoip2 to 2.6.0" and exclude the google http client
from com.maxmind.geoip2. This should satisfy the original need from #2646 (wanting
to run Druid along with an upgraded com.google.http-client) while preventing
Jackson conflicts pointed out in #2717.

Fixes #2717.

This reverts commit 21b7572533.
2016-03-25 14:43:22 -07:00
Gian Merlino ff25325f3b Improved docs for multi-value dimensions.
- Add central doc for multi-value dimensions, with some content from other docs.
- Link to multi-value dimension doc from topN and groupBy docs.
- Fixes a broken link from dimensionspecs.md, which was presciently already
  linking to this nonexistent doc.
- Resolve inconsistent naming in docs & code (sometimes "multi-valued", sometimes
  "multi-value") in favor of "multi-value".
2016-03-22 14:40:55 -07:00
Himanshu 00d7021291 Merge pull request #2607 from jon-wei/dim_schema
Support use of DimensionSchema class in DimensionsSpec
2016-03-22 11:53:46 -05:00
Himanshu 3220b109ad Merge pull request #2570 from binlijin/single_dimension_partitioning
Single dimension hash-based partitioning
2016-03-22 11:51:06 -05:00
binlijin bce600f5d5 Single dimension hash-based partitioning 2016-03-22 13:15:33 +08:00
jon-wei a59c9ee1b1 Support use of DimensionSchema class in DimensionsSpec 2016-03-21 13:12:04 -07:00
Gian Merlino 738dcd8cd9 Update version to 0.9.1-SNAPSHOT.
Fixes #2462
2016-03-17 10:34:20 -07:00
Himanshu ea3281ad78 Merge pull request #2645 from atomx/gs-scheme
Add gs:// hdfs support
2016-03-14 22:15:42 -05:00
Erik Dubbelboer 375620cfb3 Add gs:// hdfs support
Used to access google cloud storage
2016-03-12 08:57:57 +00:00
Gian Merlino 187569e702 DataSource metadata.
Geared towards supporting transactional inserts of new segments. This involves an
interface "DataSourceMetadata" that allows combining of partially specified metadata
(useful for partitioned ingestion).

DataSource metadata is stored in a new "dataSource" table.
2016-03-10 17:41:50 -08:00
Fangjin Yang 1e49092ce7 Merge pull request #2627 from himanshug/fix_datasource_inputformat_locations
fix regression - bug in DatasourceInputFormat best effort split location finder code
2016-03-10 13:46:04 -08:00
Himanshu Gupta eab8a0b54d in DatasourceInputFormat code for determining segment block locations avoid the split calculation by helper TextInputFormat 2016-03-10 14:28:53 -06:00
Nishant ba1185963b Fix a bunch of dependencies
* Eliminate exclusion groups from pull-deps
* Only consider dependency nodes in pull-deps if they are not in the following scopes
	* provided
	* test
	* system
* Fix a bunch of `<scope>provided</scope>` missing tags
* Better exclusions for a couple of problematic libs
2016-03-10 10:18:08 -08:00
Bingkun Guo c20d7682a9 log exceptions correctly in DatasourceInputFormat and IndexGeneratorJob 2016-03-09 13:41:31 -06:00
gaodayue a6dc3703ca use ISODataTimeFormat for both hdfs and viewfs schema to support federated HDFS 2016-03-08 13:55:05 +08:00
Björn Zettergren 2462c82c0e New defaults for maxRowsInMemory rowFlushBoundary
To bring consistency to docs and source this commit changes the default
values for maxRowsInMemory and rowFlushBoundary to 75000 after
discussion in PR https://github.com/druid-io/druid/pull/2457.

The previous default was 500000 and it's lower now on the grounds that
it's better for a default to be somewhat less efficient, and work,
than to reach for the stars and possibly result in
"OutOfMemoryError: java heap space" errors.
2016-03-01 13:50:28 +01:00
Gian Merlino 3534483433 Better handling of ParseExceptions.
Two changes:
- Allow IncrementalIndex to suppress ParseExceptions on "aggregate".
- Add "reportParseExceptions" option to realtime tuning configs. By default this is "false".

Behavior of the counters should now be:

- processed: Number of rows indexed, including rows where some fields could be parsed and some could not.
- thrownAway: Number of rows thrown away due to rejection policy.
- unparseable: Number of rows thrown away due to being completely unparseable (no fields salvageable at all).

If "reportParseExceptions" is true then "unparseable" will always be zero (because a parse error would
cause an exception to be thrown). In addition, "processed" will only include fully parseable rows
(because even partial parse failures will cause exceptions to be thrown).

Fixes #2510.
2016-02-23 10:11:43 -08:00
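The counter semantics described in the commit above can be sketched as a small model. Everything here is an assumption made for illustration: `ParseCounterSketch` is a hypothetical class, a row is modeled as a list of strings that should each parse as a long, and the "thrownAway" counter is left out because it depends on a rejection policy rather than on parsing.

```java
import java.util.List;

public class ParseCounterSketch {
    public long processed;
    public long unparseable;
    private final boolean reportParseExceptions;

    public ParseCounterSketch(boolean reportParseExceptions) {
        this.reportParseExceptions = reportParseExceptions;
    }

    // Hypothetical row model: each field should parse as a long.
    public void add(List<String> fields) {
        int parsed = 0;
        for (String field : fields) {
            try {
                Long.parseLong(field);
                parsed++;
            } catch (NumberFormatException e) {
                if (reportParseExceptions) {
                    // Strict mode: any parse failure aborts the row.
                    throw new IllegalArgumentException("Unparseable field: " + field, e);
                }
                // Lenient mode: skip the bad field and keep the rest.
            }
        }
        if (parsed == 0) {
            if (reportParseExceptions) {
                throw new IllegalArgumentException("Row has no parseable fields");
            }
            unparseable++; // nothing salvageable at all
        } else {
            processed++; // fully or partially parsed
        }
    }
}
```

In lenient mode a partially parsed row still counts as processed, and only a row with no usable fields counts as unparseable. In strict mode the exception is thrown before either counter changes, so "unparseable" stays at zero and "processed" only counts fully parseable rows.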
Himanshu Gupta 09ffcae4ae give user the option to specify the segments for dataSource inputSpec 2016-02-21 23:15:31 -06:00
Himanshu Gupta 2faae9d0d1 In JobHelper.makeSegmentOutputPath(..) use DataSegmentPusherUtils to construct the segment storage path 2016-02-09 21:42:32 -06:00
Himanshu Gupta b3437825f0 add ignoreWhenNoSegments flag to optionally ignore the dataSource inputSpec when no segments were found 2016-01-26 17:23:55 -06:00
binlijin cd1c71ceb4 rename persistBackgroundCount to numBackgroundPersistThreads 2016-01-22 14:29:41 +08:00
Charles Allen 2a69a58570 Merge pull request #2149 from binlijin/master
Do persist IncrementalIndex in another thread in IndexGeneratorReducer
2016-01-20 17:06:42 -08:00
Fangjin Yang 996c1173c6 Merge pull request #2223 from navis/besteffort-split-locations
Best effort to find locations for input splits
2016-01-20 16:53:43 -08:00
Fangjin Yang 695f107870 Merge pull request #2302 from metamx/lowerCaseGranPathTest
Make GranularityPathSpecTest check with lower-case enums
2016-01-20 09:18:06 -08:00
Charles Allen 3c5ca3a5f2 Make GranularityPathSpecTest check with lower-case enums 2016-01-20 08:35:13 -08:00
binlijin 8e43e2c446 Do persist IncrementalIndex in another thread in IndexGeneratorReducer 2016-01-20 09:20:09 +08:00
jon-wei 747343e621 Preserve dimension order across indexes during ingestion 2016-01-19 13:34:11 -08:00
Jonathan Wei df2906a91c Merge pull request #2290 from gianm/index-merger-v9-stuff
Respect buildV9Directly in PlumberSchools, so it works on standalone realtime.
2016-01-19 13:04:00 -08:00
Gian Merlino 1dcf22edb7 Respect buildV9Directly in PlumberSchools, so it works on standalone realtime nodes.
Also parameterize some tests to run with/without buildV9Directly:

- IndexGeneratorJobTest
- RealtimeIndexTaskTest
- RealtimePlumberSchoolTest
2016-01-19 12:15:06 -08:00
Himanshu Gupta 164b0aad7a removing Map<String,Object> segmentMetadata from methods in Index[Maker/Merger] and using Metadata class
instead of a Map to store segment metadata
2016-01-18 22:03:46 -06:00
navis.ryu f03f7fb625 Best effort to find locations for input splits 2016-01-18 08:31:05 +09:00
Kurt Young 82ff98c2bf add config for build v9 directly and update docs 2016-01-16 11:26:34 +08:00
Kurt Young 1f2168fae5 add IndexMergerV9
add unit tests for IndexMergerV9 and fix some bugs

add more unit tests and fix bugs

handle null values and add more tests

minor changes & use LoggingProgressIndicator in IndexGeneratorReducer

make some static class public from IndexMerger

minor changes and add some comments

changes for comments
2016-01-16 11:25:28 +08:00
navis.ryu 976ebc45c0 Simplify information in IncrementalIndex 2016-01-12 10:18:11 +09:00
dclim 2308c8c07f continue hadoop job for sparse intervals 2016-01-07 01:35:08 -07:00
fjy faf421726b remove IndexMaker 2015-12-28 14:19:02 -08:00
Fangjin Yang 14229ba0f2 Merge pull request #1922 from metamx/jsonIgnoresFinalFields
Change DefaultObjectMapper to NOT overwrite final fields unless explicitly asked to
2015-12-18 15:38:32 -08:00
binlijin 219367221b optimize InputRowSerde 2015-12-09 09:51:56 +08:00
Fangjin Yang d957a6602c Merge pull request #2049 from himanshug/hadoop_indexing_unique_path
add a unique string to intermediate path for the hadoop indexing task
2015-12-07 11:46:16 -08:00
Himanshu Gupta 6cfaf59d7e add a unique string to intermediate path for the hadoop indexing task 2015-12-06 22:20:38 -06:00
Himanshu Gupta 62ba9ade37 unifying license header in all java files 2015-12-05 22:16:23 -06:00
Himanshu Gupta 61aaa09012 support multiple intervals in dataSource input spec 2015-12-03 21:28:04 -06:00
Fangjin Yang 21c84b5ff7 Merge pull request #1896 from gianm/allocate-segment
SegmentAllocateAction (fixes #1515)
2015-11-18 21:05:46 -08:00
Gian Merlino e4e5f0375b SegmentAllocateAction (fixes #1515)
This is a feature meant to allow realtime tasks to work without being told upfront
what shardSpec they should use (so we can potentially publish a variable number
of segments per interval).

The idea is that there is a "pendingSegments" table in the metadata store that
tracks allocated segments. Each one has a segment id (the same segment id we know
and love) and is also part of a sequence.

The sequences are an idea from @cheddar that offers a way of doing replication.
If there are N tasks reading exactly the same data with exactly the same logic
(think Kafka tasks reading a fixed range of offsets) then you can place them
in the same sequence, and they will generate the same sequence of segments.
2015-11-11 16:54:35 -08:00
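The sequence idea in the commit above can be modeled with a small in-memory sketch. This is not the actual SegmentAllocateAction: `PendingSegmentsSketch`, its `allocate` method, the `positionInSequence` parameter, and the generated segment id format are all hypothetical, and a plain map stands in for the "pendingSegments" table.

```java
import java.util.HashMap;
import java.util.Map;

public class PendingSegmentsSketch {
    // Stand-in for the "pendingSegments" table, keyed by sequence and position.
    private final Map<String, String> pendingSegments = new HashMap<>();
    private int nextPartition = 0;

    // Hypothetical allocation call: the first caller for a given
    // (sequenceName, positionInSequence) creates a segment id; every later
    // caller with the same key gets that same id back.
    public synchronized String allocate(String sequenceName, int positionInSequence) {
        String key = sequenceName + "#" + positionInSequence;
        String existing = pendingSegments.get(key);
        if (existing != null) {
            return existing;
        }
        String segmentId = "segment_" + nextPartition++;
        pendingSegments.put(key, segmentId);
        return segmentId;
    }

    public static void main(String[] args) {
        PendingSegmentsSketch table = new PendingSegmentsSketch();
        // Two replica tasks in the same sequence ask for the same positions.
        String taskA = table.allocate("kafka-0-100", 0);
        String taskB = table.allocate("kafka-0-100", 0);
        System.out.println(taskA.equals(taskB));
    }
}
```

Because allocation is idempotent per key, replica tasks that read the same data with the same logic end up with the same sequence of segments, while a task in a different sequence gets its own.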
Xavier Léauté fa6142e217 cleanup and remove unused imports 2015-11-11 12:25:21 -08:00
Charles Allen abae47850a Add backwards compatibility for PR #1922 2015-11-11 10:27:00 -08:00
Gian Merlino dfbd0e2b60 Merge pull request #1925 from gianm/fix-index-generator
Fix reference to INDEX_MAKER in IndexGeneratorJob.
2015-11-06 09:56:30 -08:00
Gian Merlino 75122dc396 Fix reference to INDEX_MAKER in IndexGeneratorJob. 2015-11-06 09:19:58 -08:00
Himanshu Gupta 6bed633121 do not use LoggingProgressIndicator in IndexGeneratorJob because it uses Stopwatch methods from guava that are not available in older guava versions; this makes the behavior the same as LegacyIndexGeneratorJob 2015-11-06 00:40:51 -06:00
Charles Allen 929b981710 Change DefaultObjectMapper to NOT overwrite final fields unless explicitly asked to 2015-11-05 18:10:13 -08:00
Xavier Léauté 223d1ebe9f fix a very old todo 2015-11-05 13:00:30 -08:00
fjy 8f231fd3e3 cleanup druid codebase 2015-11-04 13:59:53 -08:00
Himanshu Gupta 84f7d8d264 making static final variables in HadoopDruidIndexerConfig upper case 2015-11-02 23:24:26 -06:00
Himanshu Gupta 8b67417ac8 make methods in Index[Merger,Maker,IO] non-static so that they can have
appropriate ObjectMapper injected instead of creating one statically
2015-11-02 23:24:26 -06:00
Himanshu Gupta aeffeaf3e2 fixing hadoop test scope dependencies in indexing-hadoop 2015-10-26 17:09:39 -05:00
Nishant 3641a0e553 Fix Race in jar upload during hadoop indexing - https://github.com/druid-io/druid/issues/582
few fixes

delete intermediate file early

better exception handling

use static pattern instead of compiling it every time

Add retry for transient exceptions

remove usage of deprecated method.

Add test

fix imports

fix javadoc

review comment.

review comment: handle crazy snapshot naming

review comments

remove default retry count in favour of already present constant

review comment

make random intermediate and final paths.

review comment, use temporaryFolder where possible
2015-10-22 21:41:07 +05:30
Xavier Léauté e4ac78e43d bump next snapshot to 0.9.0 2015-10-20 13:46:13 -07:00
Xavier Léauté 4c2c7a2c37 update version to 0.8.3 2015-10-14 21:40:55 -07:00
Himanshu Gupta 0368260018 For dataSource inputSpec in hadoop batch ingestion, use configured query granularity for reading existing segments instead of NONE 2015-10-12 22:19:44 -05:00
Gian Merlino 3aba401ee0 SQLMetadataConnector: Retry table creation, in case something goes wrong.
Also rejigger table creation methods to not take a DBI. It's already available
inside the connector, and everyone was just using that one anyway.
2015-09-24 21:39:36 -07:00
Himanshu Gupta e8b9ee85a7 HadoopyStringInputRowParser to convert stringy Text, BytesWritable etc into InputRow 2015-09-16 10:58:13 -05:00
Himanshu Gupta 74f4572bd4 Lazily deserialize "parser" to InputRowParser in DataSchema
so that user hadoop related InputRowParsers are created only when needed
this allows overlord to accept a HadoopIndexTask with a hadoopy InputRowParser
and not fail because hadoopy InputRowParser might need hadoop libraries
2015-09-16 10:58:13 -05:00
Himanshu Gupta 9ca6106128 user specified hadoop settings are ignored if explicitly set in code 2015-08-31 10:50:18 -05:00
Gian Merlino 940e1aa3eb Replace funky imports with standard ones.
1) Lots of Guava imports were not coming from the actual Guava
2) junit.framework.Assert should be org.junit.Assert
2015-08-28 18:02:05 -07:00
jon-wei e5c4927b14 Add support for parsing BytesWritable strings to Hadoop Indexer 2015-08-28 14:27:14 -07:00
Gian Merlino 414a6fb477 Fix overlapping segments in IngestSegmentFirehose, DatasourceInputFormat.
Fixes #1678. IngestSegmentFirehose (and its users) need to remember which
windows of which segments should actually be read, based on a timeline.
2015-08-28 07:32:41 -07:00
Himanshu Gupta 2e0dd1d792 adding UTs and addressing review comments to
firehoseV2 addition to Realtime[Manager|Plumber],
essential segment metadata persist support,
kafka-simple-consumer-firehose extension patch
2015-08-27 20:50:46 -05:00
lvjq 2237a8cf0f kafka 8 simple consumer firehose 2015-08-27 20:50:46 -05:00
Charles Allen e38cf54bc8 Migrate TestDerbyConnector to a JUnit @Rule 2015-08-26 21:47:40 -07:00
Himanshu Gupta b3c570e78d update BatchDeltaIngestion.testDeltaIngestion(..) to check for proper glob path handling 2015-08-20 21:36:34 -05:00
Himanshu Gupta 85e3ce9096 split hadoop glob path before adding it to MultipleInputs
This can be safely reverted once https://issues.apache.org/jira/browse/MAPREDUCE-5061 is fixed
2015-08-20 21:36:34 -05:00
Himanshu Gupta a603bd9547 HadoopGlobPathSplitter implementation to split hadoop glob paths
This can be safely reverted once https://issues.apache.org/jira/browse/MAPREDUCE-5061 is fixed
2015-08-20 21:36:34 -05:00
Himanshu Gupta cf3ec8eb46 helpful cause explaining why SegmentDescriptorInfo did not exist 2015-08-19 10:29:04 -05:00
Xavier Léauté 3b2e41e42a update for next release 2015-08-18 17:16:46 -07:00
Himanshu Gupta a3bab5b7d9 IndexGeneratorJobTest type unit test for batch delta ingestion and reindexing 2015-08-16 14:07:35 -05:00
Himanshu Gupta 15fa43dd43 changing DatasourcePathSpec to receive the segment list: the hadoop indexer uses an overlord action to get the list of segments when running as an overlord task, and uses the metadata store directly when running as a standalone hadoop indexer
also, serialized list of segments is passed to DatasourcePathSpec so that hadoop classloader issues do not creep up
2015-08-16 14:07:35 -05:00
Himanshu Gupta 45947a1021 add ability to specify Multiple PathSpecs in batch ingestion, so that we can grab data from multiple places in same ingestion
Conflicts:
	indexing-hadoop/src/main/java/io/druid/indexer/HadoopDruidIndexerConfig.java
	indexing-hadoop/src/main/java/io/druid/indexer/JobHelper.java

Conflicts:
	indexing-hadoop/src/main/java/io/druid/indexer/path/PathSpec.java
2015-08-16 13:15:38 -05:00