Apache Druid: a high performance real-time analytics database.
Chi Cao Minh bab78fc80e Parallel indexing single dim partitions (#8925)
* Parallel indexing single dim partitions

Implements single dimension range partitioning for native parallel batch
indexing as described in #8769. This initial version requires the
druid-datasketches extension to be loaded.

The algorithm has 5 phases that are orchestrated by the supervisor in
`ParallelIndexSupervisorTask#runRangePartitionMultiPhaseParallel()`.
These phases and the main classes involved are described below:

1) In parallel, determine the distribution of dimension values for each
   input source split.

   `PartialDimensionDistributionTask` uses `StringSketch` to generate
   the approximate distribution of dimension values for each input
   source split. If the rows are ungrouped,
   `PartialDimensionDistributionTask.UngroupedRowDimensionValueFilter`
   uses a Bloom filter to skip rows that would be grouped. The final
   distribution is sent back to the supervisor via
   `DimensionDistributionReport`.

2) The range partitions are determined.

   In `ParallelIndexSupervisorTask#determineAllRangePartitions()`, the
   supervisor uses `StringSketchMerger` to merge the individual
   `StringSketch`es created in the preceding phase. The merged sketch is
   then used to create the range partitions.

3) In parallel, generate partial range-partitioned segments.

   `PartialRangeSegmentGenerateTask` uses the range partitions
   determined in the preceding phase and
   `RangePartitionCachingLocalSegmentAllocator` to generate
   `SingleDimensionShardSpec`s.  The partition information is sent back
   to the supervisor via `GeneratedGenericPartitionsReport`.

4) The partial range segments are grouped.

   In `ParallelIndexSupervisorTask#groupGenericPartitionLocationsPerPartition()`,
   the supervisor creates the `PartialGenericSegmentMergeIOConfig`s
   necessary for the next phase.

5) In parallel, merge partial range-partitioned segments.

   `PartialGenericSegmentMergeTask` uses `GenericPartitionLocation` to
   retrieve the partial range-partitioned segments generated earlier and
   then merges and publishes them.
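
A rough illustration of how the five phases above fit together, written as a toy sketch in Python rather than taken from the Druid code. Every function name here is hypothetical. Exact sorted value lists stand in for `StringSketch`, the Bloom-filter rollup check and null handling are omitted, and "publishing" is reduced to returning sorted rows.

```python
from bisect import bisect_right
from collections import defaultdict


def collect_values(split, dim):
    """Phase 1 (per split): distribution of dimension values.
    An exact sorted list is an assumption standing in for an approximate sketch."""
    return sorted(row[dim] for row in split)


def determine_boundaries(distributions, rows_per_partition):
    """Phase 2 (supervisor): merge the per-split distributions and pick the
    values at which a new range partition starts."""
    merged = sorted(v for dist in distributions for v in dist)
    boundaries = []
    for i in range(rows_per_partition, len(merged), rows_per_partition):
        value = merged[i]
        # Skip repeated values so every range is non-empty.
        if not boundaries or value > boundaries[-1]:
            boundaries.append(value)
    return boundaries


def partition_of(value, boundaries):
    """Partition i covers [boundaries[i-1], boundaries[i]); the first
    partition has no lower bound and the last has no upper bound."""
    return bisect_right(boundaries, value)


def generate_partial_segments(split, dim, boundaries):
    """Phase 3 (per split): bucket rows into partial segments by range."""
    partial = defaultdict(list)
    for row in split:
        partial[partition_of(row[dim], boundaries)].append(row)
    return dict(partial)


def group_by_partition(partials):
    """Phase 4 (supervisor): collect the pieces of each partition across splits."""
    grouped = defaultdict(list)
    for partial in partials:
        for partition_id, rows in partial.items():
            grouped[partition_id].append(rows)
    return dict(grouped)


def merge_partition(pieces, dim):
    """Phase 5 (per partition): merge the partial segments into one."""
    return sorted((row for piece in pieces for row in piece), key=lambda r: r[dim])


def run(splits, dim, rows_per_partition):
    distributions = [collect_values(s, dim) for s in splits]
    boundaries = determine_boundaries(distributions, rows_per_partition)
    partials = [generate_partial_segments(s, dim, boundaries) for s in splits]
    grouped = group_by_partition(partials)
    segments = {pid: merge_partition(p, dim) for pid, p in sorted(grouped.items())}
    return boundaries, segments
```

Calling `run(splits, "country", 2)` on two small input splits returns the chosen boundaries plus one merged row list per range partition. Phases 1, 3, and 5 are the ones that operate independently per split or per partition, which is what lets them run in parallel subtasks.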

* Fix dependencies & forbidden apis

* Fixes for integration test

* Address review comments

* Fix docs, strict compile, sketch check, rollup check

* Fix first shard spec, partition serde, single subtask

* Fix first partition check in test

* Misc rewording/refactoring to address code review

* Fix doc link

* Split batch index integration test

* Do not run parallel-batch-index twice

* Adjust last partition

* Split ITParallelIndexTest to reduce runtime

* Rename test class

* Allow null values in range partitions

* Indicate which phase failed

* Improve asserts in tests
2019-12-09 23:05:49 -08:00
.github add checkbox for licenses.yaml in PR template, mention it in CONTRIBUTING.md (#8367) 2019-08-22 14:14:24 -07:00
.idea Implementing dropwizard emitter for druid (#7363) 2019-10-01 14:59:30 -07:00
benchmarks add query metrics for broker parallel merges, off by default (#8981) 2019-12-06 13:42:53 -08:00
cloud Revert "[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1" 2019-11-27 23:22:43 -08:00
codestyle Add FileUtils.createTempDir() and enforce its usage. (#8932) 2019-11-22 19:48:49 -08:00
core Parallel indexing single dim partitions (#8925) 2019-12-09 23:05:49 -08:00
dev Add an item to concurrency checklist about assertions in parall… (#8701) 2019-10-29 11:38:04 +03:00
distribution Address security vulnerabilities CVSS >= 7 (#8980) 2019-12-05 14:34:35 -08:00
docs Parallel indexing single dim partitions (#8925) 2019-12-09 23:05:49 -08:00
examples Improve verify-default-ports to check both INADDR_ANY and 127.0.0.1. (#8942) 2019-11-26 16:05:15 -08:00
extendedset Revert "[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1" 2019-11-27 23:22:43 -08:00
extensions-contrib Address security vulnerabilities CVSS >= 7 (#8980) 2019-12-05 14:34:35 -08:00
extensions-core Parallel indexing single dim partitions (#8925) 2019-12-09 23:05:49 -08:00
hll Revert "[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1" 2019-11-27 23:22:43 -08:00
indexing-hadoop Revert "[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1" 2019-11-27 23:22:43 -08:00
indexing-service Parallel indexing single dim partitions (#8925) 2019-12-09 23:05:49 -08:00
integration-tests Parallel indexing single dim partitions (#8925) 2019-12-09 23:05:49 -08:00
licenses Address security vulnerabilities CVSS >= 7 (#8980) 2019-12-05 14:34:35 -08:00
processing modify multi-value expression transformation behavior to not treat re-use of the same input as a candidate for cartesian mapping (#8957) 2019-12-09 20:38:15 -08:00
publications [ImgBot] Optimize images (#7873) 2019-06-24 21:27:48 -07:00
server fix npe while logging sql/query request (#9001) 2019-12-09 12:02:11 -08:00
services Add SelfDiscoveryResource; rename org.apache.druid.discovery.No… (#6702) 2019-12-08 18:47:58 +03:00
sql modify multi-value expression transformation behavior to not treat re-use of the same input as a candidate for cartesian mapping (#8957) 2019-12-09 20:38:15 -08:00
web-console better input format detection (#9007) 2019-12-09 22:31:28 -08:00
website Add DruidInputSource (replacement for IngestSegmentFirehose) (#8982) 2019-12-05 16:50:00 -08:00
.codecov.yml Use Codecov (#8388) 2019-08-28 08:49:30 -07:00
.dockerignore Add docker container for druid (#6896) 2019-02-08 12:12:28 +00:00
.gitignore autogenerate NOTICE.BINARY from NOTICE and licenses.yaml (#8306) 2019-08-21 12:46:27 -07:00
.lgtm.yml Add license header for LGTM yaml config file (#8902) 2019-11-18 18:26:45 -08:00
.travis.yml Parallel indexing single dim partitions (#8925) 2019-12-09 23:05:49 -08:00
CONTRIBUTING.md Fix incorrect build from source path in README.md and druid repo url. (#8531) 2019-09-12 19:48:01 -07:00
DISCLAIMER add missing license headers, in particular to MD files; clean up RAT … (#6563) 2018-11-13 09:38:37 -08:00
LABELS Add plain text README.txt, use relative link from README.md to build.md (#7611) 2019-05-09 21:29:26 -07:00
LICENSE Add licenses.yaml entry for Wikipedia sample data (#8968) 2019-11-28 11:41:42 -08:00
NOTICE Address security vulnerabilities (#8878) 2019-11-19 09:14:33 -08:00
README.md Adding quick links to readme (#8946) 2019-11-26 16:04:54 -08:00
README.template switch links from druid.io to druid.apache.org (#7914) 2019-06-18 09:06:27 -07:00
licenses.yaml Address security vulnerabilities CVSS >= 7 (#8980) 2019-12-05 14:34:35 -08:00
owasp-dependency-check-suppressions.xml Address security vulnerabilities CVSS >= 7 (#8980) 2019-12-05 14:34:35 -08:00
pom.xml Parallel indexing single dim partitions (#8925) 2019-12-09 23:05:49 -08:00
upload.sh Adding licenses and enable apache-rat-plugin. (#6215) 2018-09-18 08:39:26 -07:00

README.md


Website | Documentation | Developer Mailing List | User Mailing List | Slack | Twitter | Download


Apache Druid (incubating)

Apache Druid (incubating) is a high performance real-time analytics database.

Druid is a next-gen open source alternative to analytical databases such as Vertica, Greenplum, and Exadata, and data warehouses such as Snowflake, BigQuery, and Redshift.

Getting started

To get started with Druid, follow the quickstart.

Druid provides a rich set of APIs (via HTTP and JDBC) for loading, managing, and querying your data. You can also interact with Druid via the built-in console (shown below).
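
As a minimal sketch of the HTTP route, the snippet below builds a Druid SQL request using only the Python standard library. The base URL `http://localhost:8888` and the `wikipedia` datasource in the usage note are assumptions about a local quickstart-style setup, and the helper names are hypothetical; adjust them to your cluster.

```python
import json
from urllib import request


def build_sql_request(sql, base_url="http://localhost:8888"):
    """Build a POST request for Druid's SQL HTTP endpoint.
    base_url is an assumption; point it at your own Router or Broker."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return request.Request(
        url=f"{base_url}/druid/v2/sql/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(sql, base_url="http://localhost:8888"):
    """Send the query and decode the JSON response (requires a running cluster)."""
    with request.urlopen(build_sql_request(sql, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a cluster running and a datasource loaded, `run_sql("SELECT COUNT(*) AS cnt FROM wikipedia")` would return the result rows as a list of JSON objects.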

Load data

Load streaming and batch data using a point-and-click wizard that guides you through ingestion setup. Monitor one-off tasks and ingestion supervisors.

Manage the cluster

Manage your cluster with ease. Get a view of your datasources, segments, ingestion tasks, and servers from one convenient location. Every view is powered by SQL system tables, so you can see the underlying query for each one.

Issue queries

Use the built-in query workbench to prototype Druid SQL and native queries, or connect one of the many tools that help you make the most of Druid.

Documentation

You can find the documentation for the latest Druid release on the project website.

If you would like to contribute documentation, please do so under /docs in this repository and submit a pull request.

Community

Community support is available on the druid-user mailing list, which is hosted at Google Groups.

Development discussions occur on dev@druid.apache.org, which you can subscribe to by emailing dev-subscribe@druid.apache.org.

Chat with Druid committers and users in real time on the #druid channel in the Apache Slack team. Use this invitation link to join the ASF Slack, then go to the #druid channel.

Building from source

Please note that JDK 8 is required to build Druid.

For instructions on building Druid from source, see docs/development/build.md.

Contributing

Please follow the community guidelines for contributing.

License

Apache License, Version 2.0

Disclaimer: Apache Druid is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.