druid

Apache Druid: a high performance real-time analytics database.

Chi Cao Minh bab78fc80e	2019-12-09 23:05:49 -08:00

Parallel indexing single dim partitions (#8925)

* Parallel indexing single dim partitions

Implements single dimension range partitioning for native parallel batch indexing as described in #8769. This initial version requires the druid-datasketches extension to be loaded.

The algorithm has 5 phases that are orchestrated by the supervisor in `ParallelIndexSupervisorTask#runRangePartitionMultiPhaseParallel()`. These phases and the main classes involved are described below:

1) In parallel, determine the distribution of dimension values for each input source split. `PartialDimensionDistributionTask` uses `StringSketch` to generate the approximate distribution of dimension values for each input source split. If the rows are ungrouped, `PartialDimensionDistributionTask.UngroupedRowDimensionValueFilter` uses a Bloom filter to skip rows that would be grouped. The final distribution is sent back to the supervisor via `DimensionDistributionReport`.

2) The range partitions are determined. In `ParallelIndexSupervisorTask#determineAllRangePartitions()`, the supervisor uses `StringSketchMerger` to merge the individual `StringSketch`es created in the preceding phase. The merged sketch is then used to create the range partitions.

3) In parallel, generate partial range-partitioned segments. `PartialRangeSegmentGenerateTask` uses the range partitions determined in the preceding phase and `RangePartitionCachingLocalSegmentAllocator` to generate `SingleDimensionShardSpec`s. The partition information is sent back to the supervisor via `GeneratedGenericPartitionsReport`.

4) The partial range segments are grouped. In `ParallelIndexSupervisorTask#groupGenericPartitionLocationsPerPartition()`, the supervisor creates the `PartialGenericSegmentMergeIOConfig`s necessary for the next phase.

5) In parallel, merge partial range-partitioned segments. `PartialGenericSegmentMergeTask` uses `GenericPartitionLocation` to retrieve the partial range-partitioned segments generated earlier and then merges and publishes them.

* Fix dependencies & forbidden apis
* Fixes for integration test
* Address review comments
* Fix docs, strict compile, sketch check, rollup check
* Fix first shard spec, partition serde, single subtask
* Fix first partition check in test
* Misc rewording/refactoring to address code review
* Fix doc link
* Split batch index integration test
* Do not run parallel-batch-index twice
* Adjust last partition
* Split ITParallelIndexTest to reduce runtime
* Rename test class
* Allow null values in range partitions
* Indicate which phase failed
* Improve asserts in tests
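
The idea behind phases 1 to 3 of the commit message above can be illustrated with a small sketch. This is not Druid's implementation: it is written in Python rather than Java, it keeps exact sorted value lists where Druid uses an approximate `StringSketch`, it skips the Bloom-filter step and null handling, and every function name here (`split_distribution`, `determine_boundaries`, `partition_for`) is hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def split_distribution(rows: Sequence[dict], dimension: str) -> List[str]:
    """Phase 1 stand-in: collect the sorted dimension values of one input split.

    Druid builds an approximate sketch here; an exact list is an assumption
    made to keep the example short.
    """
    return sorted(row[dimension] for row in rows)


def determine_boundaries(distributions: Sequence[List[str]], num_partitions: int) -> List[str]:
    """Phase 2 stand-in: merge per-split distributions and pick evenly spaced cut points."""
    merged = sorted(value for dist in distributions for value in dist)
    if not merged or num_partitions < 2:
        return []
    cuts: List[str] = []
    for i in range(1, num_partitions):
        candidate = merged[i * len(merged) // num_partitions]
        # Keep boundaries strictly increasing so no range is defined twice.
        if not cuts or candidate > cuts[-1]:
            cuts.append(candidate)
    return cuts


def partition_for(value: str, boundaries: List[str]) -> int:
    """Phase 3 stand-in: partition i holds values in [boundaries[i-1], boundaries[i])."""
    return bisect_right(boundaries, value)


if __name__ == "__main__":
    splits = [
        [{"country": "US"}, {"country": "FR"}],
        [{"country": "DE"}, {"country": "JP"}],
    ]
    dists = [split_distribution(split, "country") for split in splits]
    boundaries = determine_boundaries(dists, num_partitions=2)
    for split in splits:
        for row in split:
            print(row["country"], "->", partition_for(row["country"], boundaries))
```

Each worker only needs the shared boundary list to place a row, which is why the boundaries are computed once by the supervisor before the segment-generation phase starts.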
.github	add checkbox for licenses.yaml in PR template, mention it in CONTRIBUTING.md (#8367 )	2019-08-22 14:14:24 -07:00
.idea	Implementing dropwizard emitter for druid (#7363 )	2019-10-01 14:59:30 -07:00
benchmarks	add query metrics for broker parallel merges, off by default (#8981 )	2019-12-06 13:42:53 -08:00
cloud	Revert "[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1"	2019-11-27 23:22:43 -08:00
codestyle	Add FileUtils.createTempDir() and enforce its usage. (#8932 )	2019-11-22 19:48:49 -08:00
core	Parallel indexing single dim partitions (#8925 )	2019-12-09 23:05:49 -08:00
dev	Add an item to concurrency checklist about assertions in parall… (#8701 )	2019-10-29 11:38:04 +03:00
distribution	Address security vulnerabilities CVSS >= 7 (#8980 )	2019-12-05 14:34:35 -08:00
docs	Parallel indexing single dim partitions (#8925 )	2019-12-09 23:05:49 -08:00
examples	Improve verify-default-ports to check both INADDR_ANY and 127.0.0.1. (#8942 )	2019-11-26 16:05:15 -08:00
extendedset	Revert "[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1"	2019-11-27 23:22:43 -08:00
extensions-contrib	Address security vulnerabilities CVSS >= 7 (#8980 )	2019-12-05 14:34:35 -08:00
extensions-core	Parallel indexing single dim partitions (#8925 )	2019-12-09 23:05:49 -08:00
hll	Revert "[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1"	2019-11-27 23:22:43 -08:00
indexing-hadoop	Revert "[maven-release-plugin] prepare release druid-0.16.1-incubating-rc1"	2019-11-27 23:22:43 -08:00
indexing-service	Parallel indexing single dim partitions (#8925 )	2019-12-09 23:05:49 -08:00
integration-tests	Parallel indexing single dim partitions (#8925 )	2019-12-09 23:05:49 -08:00
licenses	Address security vulnerabilities CVSS >= 7 (#8980 )	2019-12-05 14:34:35 -08:00
processing	modify multi-value expression transformation behavior to not treat re-use of the same input as a candidate for cartesian mapping (#8957 )	2019-12-09 20:38:15 -08:00
publications	[ImgBot] Optimize images (#7873 )	2019-06-24 21:27:48 -07:00
server	fix npe while logging sql/query request (#9001 )	2019-12-09 12:02:11 -08:00
services	Add SelfDiscoveryResource; rename org.apache.druid.discovery.No… (#6702 )	2019-12-08 18:47:58 +03:00
sql	modify multi-value expression transformation behavior to not treat re-use of the same input as a candidate for cartesian mapping (#8957 )	2019-12-09 20:38:15 -08:00
web-console	better input format detection (#9007 )	2019-12-09 22:31:28 -08:00
website	Add DruidInputSource (replacement for IngestSegmentFirehose) (#8982 )	2019-12-05 16:50:00 -08:00
.codecov.yml	Use Codecov (#8388 )	2019-08-28 08:49:30 -07:00
.dockerignore	Add docker container for druid (#6896 )	2019-02-08 12:12:28 +00:00
.gitignore	autogenerate NOTICE.BINARY from NOTICE and licenses.yaml (#8306 )	2019-08-21 12:46:27 -07:00
.lgtm.yml	Add license header for LGTM yaml config file (#8902 )	2019-11-18 18:26:45 -08:00
.travis.yml	Parallel indexing single dim partitions (#8925 )	2019-12-09 23:05:49 -08:00
CONTRIBUTING.md	Fix incorrect build from source path in README.md and druid repo url. (#8531 )	2019-09-12 19:48:01 -07:00
DISCLAIMER	add missing license headers, in particular to MD files; clean up RAT … (#6563 )	2018-11-13 09:38:37 -08:00
LABELS	Add plain text README.txt, use relative link from README.md to build.md (#7611 )	2019-05-09 21:29:26 -07:00
LICENSE	Add licenses.yaml entry for Wikipedia sample data (#8968 )	2019-11-28 11:41:42 -08:00
NOTICE	Address security vulnerabilities (#8878 )	2019-11-19 09:14:33 -08:00
README.md	Adding quick links to readme (#8946 )	2019-11-26 16:04:54 -08:00
README.template	switch links from druid.io to druid.apache.org (#7914 )	2019-06-18 09:06:27 -07:00
licenses.yaml	Address security vulnerabilities CVSS >= 7 (#8980 )	2019-12-05 14:34:35 -08:00
owasp-dependency-check-suppressions.xml	Address security vulnerabilities CVSS >= 7 (#8980 )	2019-12-05 14:34:35 -08:00
pom.xml	Parallel indexing single dim partitions (#8925 )	2019-12-09 23:05:49 -08:00
upload.sh	Adding licenses and enable apache-rat-plugin. (#6215 )	2018-09-18 08:39:26 -07:00

README.md

Apache Druid (incubating)

Apache Druid (incubating) is a high-performance, real-time analytics database.

Druid is a next-gen open source alternative to analytical databases such as Vertica, Greenplum, and Exadata, and data warehouses such as Snowflake, BigQuery, and Redshift.

Getting started

To get started with Druid, follow our quickstart.

Druid provides a rich set of APIs (via HTTP and JDBC) for loading, managing, and querying your data. You can also interact with Druid via the built-in console (shown below).

Load data

Load streaming and batch data using a point-and-click wizard that guides you through ingestion setup. Monitor one-off tasks and ingestion supervisors.

Manage the cluster

Manage your cluster with ease. Get a view of your datasources, segments, ingestion tasks, and servers from one convenient location. Every view is powered by SQL system tables, so you can see the underlying query for each one.

Issue queries

Use the built-in query workbench to prototype Druid SQL and native queries, or connect one of the many tools that help you make the most of Druid.
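
Queries can also be issued outside the console through the HTTP API mentioned under Getting started. The following is a minimal sketch in Python, assuming a quickstart-style deployment reachable at `http://localhost:8888` and the SQL endpoint at `/druid/v2/sql`; the host, port, and the helper names `build_sql_request` and `run_sql` are assumptions for illustration, so check the documentation for your release before relying on them.

```python
import json
import urllib.request


def build_sql_request(sql: str, base_url: str = "http://localhost:8888") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a Druid SQL query as JSON.

    base_url is an assumed quickstart address, not a fixed Druid default.
    """
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(sql: str, base_url: str = "http://localhost:8888"):
    """Send the query and decode the JSON response; requires a running cluster."""
    with urllib.request.urlopen(build_sql_request(sql, base_url)) as response:
        return json.load(response)


if __name__ == "__main__":
    request = build_sql_request("SELECT 1")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Separating request construction from sending keeps the example inspectable without a running cluster; call `run_sql` only once a Druid service is actually listening at the chosen address.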

Documentation

You can find the documentation for the latest Druid release on the project website.

If you would like to contribute documentation, please do so under /docs in this repository and submit a pull request.

Community

Community support is available on the druid-user mailing list, which is hosted at Google Groups.

Development discussions occur on dev@druid.apache.org, which you can subscribe to by emailing dev-subscribe@druid.apache.org.

Chat with Druid committers and users in real time on the #druid channel in the Apache Slack team. Use this invitation link to join the ASF Slack, then go to the #druid channel.

Building from source

Please note that JDK 8 is required to build Druid.

For instructions on building Druid from source, see docs/development/build.md.

Contributing

Please follow the community guidelines for contributing.

License

Apache License, Version 2.0

Disclaimer: Apache Druid is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.