mirror of https://github.com/apache/druid.git
Added titles and harmonized docs to improve usability and SEO (#6731)
* added titles and harmonized docs * manually fixed some titles
This commit is contained in:
parent 55914687bb
commit da4836f38c
|
@ -19,10 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Druid vs Elasticsearch"
|
||||
---
|
||||
|
||||
Druid vs Elasticsearch
|
||||
======================
|
||||
# Druid vs Elasticsearch
|
||||
|
||||
We are not experts on search systems. If anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.
|
||||
|
||||
|
|
|
@ -19,10 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Druid vs. Key/Value Stores (HBase/Cassandra/OpenTSDB)"
|
||||
---
|
||||
|
||||
Druid vs. Key/Value Stores (HBase/Cassandra/OpenTSDB)
|
||||
====================================================
|
||||
# Druid vs. Key/Value Stores (HBase/Cassandra/OpenTSDB)
|
||||
|
||||
Druid is highly optimized for scans and aggregations, and it supports arbitrarily deep drill-downs into data sets. This same functionality
|
||||
is supported in key/value stores in 2 ways:
|
||||
|
|
|
@ -19,10 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Druid vs Kudu"
|
||||
---
|
||||
|
||||
Druid vs Kudu
|
||||
=============
|
||||
# Druid vs Kudu
|
||||
|
||||
Kudu's storage format enables single row updates, whereas updates to existing Druid segments require recreating the segment, so theoretically
|
||||
the process for updating old values should be higher latency in Druid. However, the requirements in Kudu for maintaining extra head space to store
|
||||
|
|
|
@ -19,10 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Druid vs Redshift"
|
||||
---
|
||||
Druid vs Redshift
|
||||
=================
|
||||
|
||||
# Druid vs Redshift
|
||||
|
||||
### How does Druid compare to Redshift?
|
||||
|
||||
|
|
|
@ -19,10 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Druid vs Spark"
|
||||
---
|
||||
|
||||
Druid vs Spark
|
||||
==============
|
||||
# Druid vs Spark
|
||||
|
||||
Druid and Spark are complementary solutions as Druid can be used to accelerate OLAP queries in Spark.
|
||||
|
||||
|
|
|
@ -19,17 +19,16 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Druid vs SQL-on-Hadoop"
|
||||
---
|
||||
# Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
|
||||
|
||||
Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
|
||||
===========================================================
|
||||
|
||||
SQL-on-Hadoop engines provide an
|
||||
execution engine for various data formats and data stores, and
|
||||
SQL-on-Hadoop engines provide an
|
||||
execution engine for various data formats and data stores, and
|
||||
many can be made to push computations down to Druid, while providing a SQL interface to Druid.
|
||||
|
||||
For a direct comparison between the technologies and when to use only one or the other, it basically comes down to your
|
||||
product requirements and what the systems were designed to do.
|
||||
For a direct comparison between the technologies and when to use only one or the other, it basically comes down to your
|
||||
product requirements and what the systems were designed to do.
|
||||
|
||||
Druid was designed to
|
||||
|
||||
|
@ -37,7 +36,7 @@ Druid was designed to
|
|||
1. ingest data in real-time
|
||||
1. handle slice-n-dice style ad-hoc queries
|
||||
|
||||
SQL-on-Hadoop engines generally sidestep Map/Reduce, instead querying data directly from HDFS or, in some cases, other storage systems.
|
||||
SQL-on-Hadoop engines generally sidestep Map/Reduce, instead querying data directly from HDFS or, in some cases, other storage systems.
|
||||
Some of these engines (including Impala and Presto) can be colocated with HDFS data nodes and coordinate with them to achieve data locality for queries.
|
||||
What does this mean? We can talk about it in terms of three general areas
|
||||
|
||||
|
@ -47,37 +46,37 @@ What does this mean? We can talk about it in terms of three general areas
|
|||
|
||||
### Queries
|
||||
|
||||
Druid segments store data in a custom column format. Segments are scanned directly as part of queries and each Druid server
|
||||
calculates a set of results that are eventually merged at the Broker level. This means the data that is transferred between servers
|
||||
Druid segments store data in a custom column format. Segments are scanned directly as part of queries and each Druid server
|
||||
calculates a set of results that are eventually merged at the Broker level. This means the data that is transferred between servers
|
||||
are queries and results, and all computation is done internally as part of the Druid servers.
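
The flow described above (each server scans its own segments, and the Broker merges the partial results) can be sketched roughly as follows. This is an illustrative toy in Python, not Druid code: `scan_segment`, `broker_merge`, and the `page`/`edits` columns are hypothetical names chosen for the example.

```python
from collections import Counter


def scan_segment(rows):
    """Hypothetical per-server step: aggregate one segment's rows locally."""
    partial = Counter()
    for row in rows:
        # Sum the "edits" metric grouped by the "page" dimension.
        partial[row["page"]] += row["edits"]
    return partial


def broker_merge(partials):
    """Hypothetical Broker step: combine the partial results from all servers."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts key by key
    return merged


server_a = scan_segment([{"page": "a", "edits": 2}, {"page": "b", "edits": 1}])
server_b = scan_segment([{"page": "a", "edits": 3}])
result = broker_merge([server_a, server_b])
```

Only the small per-server aggregates cross the network in this sketch, which is the point the paragraph makes about queries and results being the only data transferred.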
|
||||
|
||||
Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats.
|
||||
They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce).
|
||||
Some (Impala/Presto) SQL-on-Hadoop engines have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still
|
||||
some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
|
||||
Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats.
|
||||
They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce).
|
||||
Some (Impala/Presto) SQL-on-Hadoop engines have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still
|
||||
some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
|
||||
how much of a performance impact this makes.
|
||||
|
||||
### Data Ingestion
|
||||
|
||||
Druid is built to allow for real-time ingestion of data. You can ingest data and query it immediately upon ingestion,
|
||||
Druid is built to allow for real-time ingestion of data. You can ingest data and query it immediately upon ingestion,
|
||||
and the delay before an event is reflected in query results is dominated by how long it takes to deliver the event to Druid.
|
||||
|
||||
SQL-on-Hadoop engines, being based on data in HDFS or some other backing store, are limited in their data ingestion rates by the
|
||||
rate at which that backing store can make data available. Generally, the backing store is the biggest bottleneck for
|
||||
SQL-on-Hadoop engines, being based on data in HDFS or some other backing store, are limited in their data ingestion rates by the
|
||||
rate at which that backing store can make data available. Generally, the backing store is the biggest bottleneck for
|
||||
how quickly data can become available.
|
||||
|
||||
### Query Flexibility
|
||||
|
||||
Druid's query language is fairly low level and maps to how Druid operates internally. Although Druid can be combined with a high level query
|
||||
planner such as [Plywood](https://github.com/implydata/plywood) to support most SQL queries and analytic SQL queries (minus joins among large tables),
|
||||
Druid's query language is fairly low level and maps to how Druid operates internally. Although Druid can be combined with a high level query
|
||||
planner such as [Plywood](https://github.com/implydata/plywood) to support most SQL queries and analytic SQL queries (minus joins among large tables),
|
||||
base Druid is less flexible than SQL-on-Hadoop solutions for generic processing.
|
||||
|
||||
SQL-on-Hadoop engines support SQL-style queries with full joins.
|
||||
|
||||
## Druid vs Parquet
|
||||
|
||||
Parquet is a column storage format that is designed to work with SQL-on-Hadoop engines. Parquet doesn't have a query execution engine, and instead
|
||||
Parquet is a column storage format that is designed to work with SQL-on-Hadoop engines. Parquet doesn't have a query execution engine, and instead
|
||||
relies on external sources to pull data out of it.
|
||||
|
||||
Druid's storage format is highly optimized for linear scans. Although Druid has support for nested data, Parquet's storage format is much
|
||||
Druid's storage format is highly optimized for linear scans. Although Druid has support for nested data, Parquet's storage format is much
|
||||
more hierarchical, and is designed more for binary chunking. In theory, this should lead to faster scans in Druid.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Configuration Reference"
|
||||
---
|
||||
|
||||
# Configuration Reference
|
||||
|
||||
This page documents all of the configuration properties for each Druid service type.
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Logging"
|
||||
---
|
||||
Logging
|
||||
==========================
|
||||
# Logging
|
||||
|
||||
Druid nodes will emit logs that are useful for debugging to the console. Druid nodes also emit periodic metrics about their state. For more about metrics, see [Configuration](../configuration/index.html#enabling-metrics). Metric logs are printed to the console by default, and can be disabled with `-Ddruid.emitter.logging.logLevel=debug`.
|
||||
|
||||
|
|
|
@ -19,10 +19,10 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Realtime Node Configuration"
|
||||
---
|
||||
# Realtime Node Configuration
|
||||
|
||||
Realtime Node Configuration
|
||||
==============================
|
||||
For general Realtime Node information, see [here](../design/realtime.html).
|
||||
|
||||
Runtime Configuration
|
||||
|
|
|
@ -19,15 +19,18 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Cassandra Deep Storage"
|
||||
---
|
||||
# Cassandra Deep Storage
|
||||
|
||||
## Introduction
|
||||
|
||||
Druid can use Cassandra as a deep storage mechanism. Segments and their metadata are stored in Cassandra in two tables:
|
||||
`index_storage` and `descriptor_storage`. Underneath the hood, the Cassandra integration leverages Astyanax. The
|
||||
`index_storage` and `descriptor_storage`. Underneath the hood, the Cassandra integration leverages Astyanax. The
|
||||
index storage table is a [Chunked Object](https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store) repository. It contains
|
||||
compressed segments for distribution to historical nodes. Since segments can be large, the Chunked Object storage allows the integration to multi-thread
|
||||
the write to Cassandra, and spreads the data across all the nodes in a cluster. The descriptor storage table is a normal C* table that
|
||||
stores the segment metadata.
|
||||
the write to Cassandra, and spreads the data across all the nodes in a cluster. The descriptor storage table is a normal C* table that
|
||||
stores the segment metadata.
|
||||
|
||||
## Schema
|
||||
Below are the create statements for each:
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Deep Storage"
|
||||
---
|
||||
|
||||
# Deep Storage
|
||||
|
||||
Deep storage is where segments are stored. It is a storage mechanism that Druid does not provide. This deep storage infrastructure defines the level of durability of your data: as long as Druid nodes can see this storage infrastructure and get at the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from this storage layer, then you will lose whatever data those segments represented.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Metadata Storage"
|
||||
---
|
||||
|
||||
# Metadata Storage
|
||||
|
||||
The Metadata Storage is an external dependency of Druid. Druid uses it to store
|
||||
|
|
|
@ -19,8 +19,10 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "ZooKeeper"
|
||||
---
|
||||
# ZooKeeper
|
||||
|
||||
Druid uses [ZooKeeper](http://zookeeper.apache.org/) (ZK) for management of current cluster state. The operations that happen over ZK are
|
||||
|
||||
1. [Coordinator](../design/coordinator.html) leader election
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Authentication and Authorization"
|
||||
---
|
||||
|
||||
# Authentication and Authorization
|
||||
|
||||
|Property|Type|Description|Default|Required|
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Broker"
|
||||
---
|
||||
Broker
|
||||
======
|
||||
# Broker
|
||||
|
||||
### Configuration
|
||||
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Coordinator Node"
|
||||
---
|
||||
Coordinator Node
|
||||
================
|
||||
# Coordinator Node
|
||||
|
||||
### Configuration
|
||||
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Historical Node"
|
||||
---
|
||||
Historical Node
|
||||
===============
|
||||
# Historical Node
|
||||
|
||||
### Configuration
|
||||
|
||||
|
|
|
@ -19,6 +19,7 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Design"
|
||||
---
|
||||
|
||||
# What is Druid?<a id="what-is-druid"></a>
|
||||
|
@ -159,7 +160,7 @@ queries:
|
|||
- Bitmap compression for bitmap indexes
|
||||
- Type-aware compression for all columns
|
||||
|
||||
Periodically, segments are committed and published. At this point, they are written to [deep storage](#deep-storage),
|
||||
Periodically, segments are committed and published. At this point, they are written to [deep storage](#deep-storage),
|
||||
become immutable, and move from MiddleManagers to the Historical processes (see [Architecture](#architecture) above
|
||||
for details). An entry about the segment is also written to the [metadata store](#metadata-storage). This entry is a
|
||||
self-describing bit of metadata about the segment, including things like the schema of the segment, its size, and its
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Indexing Service"
|
||||
---
|
||||
Indexing Service
|
||||
================
|
||||
# Indexing Service
|
||||
|
||||
The indexing service is a highly-available, distributed service that runs indexing related tasks.
|
||||
|
||||
|
|
|
@ -19,10 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "MiddleManager Node"
|
||||
---
|
||||
|
||||
Middle Manager Node
|
||||
------------------
|
||||
# MiddleManager Node
|
||||
|
||||
### Configuration
|
||||
|
||||
|
|
|
@ -19,10 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Overlord Node"
|
||||
---
|
||||
|
||||
Overlord Node
|
||||
-------------
|
||||
# Overlord Node
|
||||
|
||||
### Configuration
|
||||
|
||||
|
|
|
@ -19,10 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Peons"
|
||||
---
|
||||
|
||||
Peons
|
||||
-----
|
||||
# Peons
|
||||
|
||||
### Configuration
|
||||
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Druid Plumbers"
|
||||
---
|
||||
|
||||
# Druid Plumbers
|
||||
|
||||
The plumber handles generated segments both while they are being generated and when they are "done". This is also technically a pluggable interface and there are multiple implementations. However, plumbers handle numerous complex details, and therefore an advanced understanding of Druid is recommended before implementing your own.
|
||||
|
|
|
@ -19,10 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Real-time Node"
|
||||
---
|
||||
|
||||
Real-time Node
|
||||
==============
|
||||
# Real-time Node
|
||||
|
||||
<div class="note info">
|
||||
NOTE: Realtime nodes are deprecated. Please use the <a href="../development/extensions-core/kafka-ingestion.html">Kafka Indexing Service</a> for stream pull use cases instead.
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Segments"
|
||||
---
|
||||
Segments
|
||||
========
|
||||
# Segments
|
||||
|
||||
Druid stores its index in *segment files*, which are partitioned by
|
||||
time. In a basic setup, one segment file is created for each time
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Build from Source"
|
||||
---
|
||||
|
||||
### Build from Source
|
||||
# Build from Source
|
||||
|
||||
You can build Druid directly from source. Please note that these instructions are for building the latest stable version of Druid.
|
||||
For building the latest code in master, follow the instructions [here](https://github.com/apache/incubator-druid/blob/master/docs/content/development/build.md).
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Experimental Features"
|
||||
---
|
||||
|
||||
# About Experimental Features
|
||||
# Experimental Features
|
||||
|
||||
Experimental features are features we have developed but have not fully tested in a production environment. If you choose to try them out, there will likely be edge cases that we have not covered. We would love feedback on any of these features, whether they are bug reports, suggestions for improvement, or letting us know they work as intended.
|
||||
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Ambari Metrics Emitter"
|
||||
---
|
||||
|
||||
# Ambari Metrics Emitter
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) the `ambari-metrics-emitter` extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Microsoft Azure"
|
||||
---
|
||||
|
||||
# Microsoft Azure
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) the `druid-azure-extensions` extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Apache Cassandra"
|
||||
---
|
||||
|
||||
# Apache Cassandra
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) the `druid-cassandra-storage` extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Rackspace Cloud Files"
|
||||
---
|
||||
|
||||
# Rackspace Cloud Files
|
||||
|
||||
## Deep Storage
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "DistinctCount Aggregator"
|
||||
---
|
||||
|
||||
# DistinctCount aggregator
|
||||
# DistinctCount Aggregator
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) the `druid-distinctcount` extension.
|
||||
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Google Cloud Storage"
|
||||
---
|
||||
|
||||
# Google Cloud Storage
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) the `druid-google-extensions` extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Graphite Emitter"
|
||||
---
|
||||
|
||||
# Graphite Emitter
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) the `graphite-emitter` extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "InfluxDB Line Protocol Parser"
|
||||
---
|
||||
|
||||
# InfluxDB Line Protocol Parser
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-influx-extensions`.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Kafka Emitter"
|
||||
---
|
||||
|
||||
# Kafka Emitter
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) the `kafka-emitter` extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Kafka Simple Consumer"
|
||||
---
|
||||
|
||||
# Kafka Simple Consumer
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) the `druid-kafka-eight-simpleConsumer` extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Materialized View"
|
||||
---
|
||||
|
||||
# Materialized View
|
||||
|
||||
To use this feature, make sure to load materialized-view-selection only on the Broker and materialized-view-maintenance on the Overlord. In addition, this feature currently requires a Hadoop cluster.
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "OpenTSDB Emitter"
|
||||
---
|
||||
|
||||
# Opentsdb Emitter
|
||||
# OpenTSDB Emitter
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) the `opentsdb-emitter` extension.
|
||||
|
||||
|
@ -57,5 +57,5 @@ e.g.
|
|||
"type"
|
||||
]
|
||||
```
|
||||
|
||||
|
||||
For most use-cases, the default configuration is sufficient.
|
||||
|
|
|
@ -19,15 +19,15 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "ORC"
|
||||
---
|
||||
|
||||
# Orc
|
||||
# ORC
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-orc-extensions`.
|
||||
|
||||
This extension enables Druid to ingest and understand the Apache Orc data format offline.
|
||||
This extension enables Druid to ingest and understand the Apache ORC data format offline.
|
||||
|
||||
## Orc Hadoop Parser
|
||||
## ORC Hadoop Parser
|
||||
|
||||
This is for batch ingestion using the HadoopDruidIndexer. The inputFormat of inputSpec in ioConfig must be set to `"org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat"`.
|
||||
|
||||
|
@ -35,7 +35,7 @@ This is for batch ingestion using the HadoopDruidIndexer. The inputFormat of inp
|
|||
|----------|-------------|----------------------------------------------------------------------------------------|---------|
|
||||
|type | String | This should say `orc` | yes|
|
||||
|parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Any parse spec that extends ParseSpec is possible but only their TimestampSpec and DimensionsSpec are used. | yes|
|
||||
|typeString| String | String representation of Orc struct type info. If not specified, auto constructed from parseSpec but all metric columns are dropped | no|
|
||||
|typeString| String | String representation of ORC struct type info. If not specified, auto constructed from parseSpec but all metric columns are dropped | no|
|
||||
|mapFieldNameFormat| String | String format for resolving the flattened map fields. Default is `<PARENT>_<CHILD>`. | no |
|
||||
|
||||
As an example of `typeString`, a string column col1 and an array-of-string column col2 are represented by `"struct<col1:string,col2:array<string>>"`.
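
To make the `typeString` shape concrete, here is a minimal sketch that assembles such a string from column names and ORC type names. The helper `orc_type_string` is hypothetical and only does string formatting; it is not part of the extension.

```python
def orc_type_string(columns):
    """Build an ORC struct type string from (name, orc_type) pairs.

    Hypothetical helper for illustration; it does not validate ORC type names.
    """
    fields = ",".join(f"{name}:{orc_type}" for name, orc_type in columns)
    return f"struct<{fields}>"


type_string = orc_type_string([("col1", "string"), ("col2", "array<string>")])
```

With the two columns from the example above, `type_string` matches the value shown in the text.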
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "RabbitMQ"
|
||||
---
|
||||
|
||||
# RabbitMQ
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) the `druid-rabbitmq` extension.
|
||||
|
|
|
@ -19,10 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Druid Redis Cache"
|
||||
---
|
||||
|
||||
Druid Redis Cache
|
||||
--------------------
|
||||
# Druid Redis Cache
|
||||
|
||||
A cache implementation for Druid based on [Redis](https://github.com/antirez/redis).
|
||||
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "RocketMQ"
|
||||
---
|
||||
|
||||
# RocketMQ
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) the `druid-rocketmq` extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Microsoft SQLServer"
|
||||
---
|
||||
|
||||
# Microsoft SQLServer
|
||||
|
||||
Make sure to [include](../../operations/including-extensions.html) `sqlserver-metadata-storage` as an extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "StatsD Emitter"
|
||||
---
|
||||
|
||||
# StatsD Emitter
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) the `statsd-emitter` extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Thrift"
|
||||
---
|
||||
|
||||
# Thrift
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-thrift-extensions`.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Timestamp Min/Max aggregators"
|
||||
---
|
||||
|
||||
# Timestamp Min/Max aggregators
|
||||
|
||||
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-time-min-max`.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Approximate Histogram aggregator"
|
||||
---
|
||||
|
||||
# Approximate Histogram aggregator
|
||||
|
||||
Make sure to [include](../../operations/including-extensions.html) `druid-histogram` as an extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Avro"
|
||||
---
|
||||
|
||||
# Avro
|
||||
|
||||
This extension enables Druid to ingest and understand the Apache Avro data format. Make sure to [include](../../operations/including-extensions.html) `druid-avro-extensions` as an extension.
|
||||
|
|
|
@ -19,25 +19,26 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Bloom Filter"
|
||||
---
|
||||
|
||||
# Druid Bloom Filter
|
||||
# Bloom Filter
|
||||
|
||||
Make sure to [include](../../operations/including-extensions.html) `druid-bloom-filter` as an extension.
|
||||
|
||||
BloomFilter is a probabilistic data structure for set membership check.
|
||||
The following are some characteristics of BloomFilter:
|
||||
BloomFilter is a probabilistic data structure for set membership check.
|
||||
The following are some characteristics of BloomFilter:
|
||||
- BloomFilters are highly space efficient when compared to using a HashSet.
|
||||
- Because of the probabilistic nature of bloom filters, false positives (element not present in the bloom filter but test() says true) are possible
|
||||
- false negatives are not possible (if element is present then test() will never say false).
|
||||
- The false positive probability is configurable (default: 5%) depending on which storage requirement may increase or decrease.
|
||||
- false negatives are not possible (if element is present then test() will never say false).
|
||||
- The false positive probability is configurable (default: 5%) depending on which storage requirement may increase or decrease.
|
||||
- The lower the false positive probability, the greater the space requirement.
|
||||
- Bloom filters are sensitive to number of elements that will be inserted in the bloom filter.
|
||||
- During the creation of a bloom filter, the expected number of entries must be specified. If the number of insertions exceeds the specified initial number of entries, then the false positive probability will increase accordingly.
|
||||
|
||||
Internally, this implementation of bloom filter uses the Murmur3 fast non-cryptographic hash algorithm.
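
The characteristics listed above can be illustrated with a small toy filter. This is a sketch under stated assumptions, not the implementation used by the extension: it derives bit positions from a SHA-256 digest (standard library) instead of Murmur3, and `SimpleBloomFilter` with its sizing formulas is a generic textbook construction.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Toy Bloom filter for illustration; uses SHA-256 rather than Murmur3."""

    def __init__(self, expected_entries, fpp=0.05):
        # Textbook sizing: bits m = -n*ln(p)/(ln 2)^2, hash count k = (m/n)*ln 2.
        self.num_bits = max(
            1, math.ceil(-expected_entries * math.log(fpp) / (math.log(2) ** 2))
        )
        self.num_hashes = max(
            1, round(self.num_bits / expected_entries * math.log(2))
        )
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value):
        # Double hashing: combine two 64-bit halves of the digest into k positions.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def test(self, value):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


bloom = SimpleBloomFilter(expected_entries=1000, fpp=0.05)
for i in range(1000):
    bloom.add(f"key-{i}")
```

Every inserted key tests as present (no false negatives), and constructing the filter with a smaller `fpp` yields a larger `num_bits`, which mirrors the space trade-off described in the list.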
|
||||
|
||||
### Json Representation of Bloom Filter
|
||||
### JSON Representation of Bloom Filter
|
||||
|
||||
```json
|
||||
{
|
||||
"type" : "bloom",
|
||||
|
@ -60,7 +61,7 @@ Internally, this implementation of bloom filter uses Murmur3 fast non-cryptograp
|
|||
- 1 byte for the number of hash functions.
|
||||
- 1 big endian int (that is how OutputStream works) for the number of longs in the bitset
|
||||
- big endian longs in the BloomKFilter bitset
|
||||
|
||||
|
||||
Note: `org.apache.hive.common.util.BloomKFilter` provides a serialize method which can be used to serialize bloom filters to outputStream.
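
As a rough illustration of the byte layout listed above, the following Python sketch packs and unpacks the three fields with the standard `struct` module. It assumes those three fields are the complete layout, and the function names are hypothetical; for real filters, use the `BloomKFilter` serialize method mentioned in the note.

```python
import struct


def serialize_bloom(num_hash_functions, bitset_words):
    """Pack: 1 byte (hash count), 1 big endian int (word count), big endian longs."""
    header = struct.pack(">Bi", num_hash_functions, len(bitset_words))
    # Words are treated as unsigned 64-bit values; the bytes match Java's signed longs.
    body = struct.pack(f">{len(bitset_words)}Q", *bitset_words)
    return header + body


def deserialize_bloom(data):
    """Inverse of serialize_bloom; returns (num_hash_functions, bitset_words)."""
    num_hash_functions, num_words = struct.unpack_from(">Bi", data, 0)
    words = list(struct.unpack_from(f">{num_words}Q", data, 5))
    return num_hash_functions, words


payload = serialize_bloom(3, [1, 2**63])
```

The `>` prefix selects big endian order with no padding, so the header is always 5 bytes and each bitset word adds 8 more.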
|
||||
|
||||
### SQL Queries
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "DataSketches extension"
|
||||
---
|
||||
|
||||
## DataSketches extension
|
||||
# DataSketches extension
|
||||
|
||||
Druid aggregators based on [datasketches](http://datasketches.github.io/) library. Sketches are data structures implementing approximate streaming mergeable algorithms. Sketches can be ingested from the outside of Druid or built from raw data at ingestion time. Sketches can be stored in Druid segments as additive metrics.
|
||||
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "DataSketches HLL Sketch module"
|
||||
---
|
||||
|
||||
## DataSketches HLL Sketch module
|
||||
# DataSketches HLL Sketch module
|
||||
|
||||
This module provides Druid aggregators for distinct counting based on HLL sketch from [datasketches](http://datasketches.github.io/) library. At ingestion time, this aggregator creates the HLL sketch objects to be stored in Druid segments. At query time, sketches are read and merged together. In the end, by default, you receive the estimate of the number of distinct values presented to the sketch. Also, you can use post aggregator to produce a union of sketch columns in the same row.
|
||||
You can use the HLL sketch aggregator on columns of any identifiers. It will return estimated cardinality of the column.
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "DataSketches Quantiles Sketch module"
|
||||
---
|
||||
|
||||
## DataSketches Quantiles Sketch module
|
||||
# DataSketches Quantiles Sketch module
|
||||
|
||||
This module provides Druid aggregators based on numeric quantiles DoublesSketch from the [datasketches](http://datasketches.github.io/) library. Quantiles sketch is a mergeable streaming algorithm to estimate the distribution of values, and approximately answer queries about the rank of a value, probability mass function of the distribution (PMF) or histogram, cumulative distribution function (CDF), and quantiles (median, min, max, 95th percentile and such). See [Quantiles Sketch Overview](https://datasketches.github.io/docs/Quantiles/QuantilesOverview.html).
|
||||
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "DataSketches Theta Sketch module"
|
||||
---
|
||||
|
||||
## DataSketches Theta Sketch module
|
||||
# DataSketches Theta Sketch module
|
||||
|
||||
This module provides Druid aggregators based on Theta sketch from [datasketches](http://datasketches.github.io/) library. Note that sketch algorithms are approximate; see details in the "Accuracy" section of the datasketches doc.
|
||||
At ingestion time, this aggregator creates the Theta sketch objects which get stored in Druid segments. Logically speaking, a Theta sketch object can be thought of as a Set data structure. At query time, sketches are read and aggregated (set unioned) together. In the end, by default, you receive the estimate of the number of unique entries in the sketch object. Also, you can use post aggregators to do union, intersection or difference on sketch columns in the same row.
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "DataSketches Tuple Sketch module"
|
||||
---
|
||||
|
||||
## DataSketches Tuple Sketch module
|
||||
# DataSketches Tuple Sketch module
|
||||
|
||||
This module provides Druid aggregators based on Tuple sketch from [datasketches](http://datasketches.github.io/) library. ArrayOfDoublesSketch sketches extend the functionality of the count-distinct Theta sketches by adding arrays of double values associated with unique keys.
|
||||
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Basic Security"
|
||||
---
|
||||
|
||||
# Druid Basic Security
|
||||
|
||||
This extension adds:
|
||||
|
@ -58,7 +58,7 @@ druid.auth.authenticator.MyBasicAuthenticator.initialInternalClientPassword=pass
|
|||
druid.auth.authenticator.MyBasicAuthenticator.authorizerName=MyBasicAuthorizer
|
||||
```
|
||||
|
||||
To use the Basic authenticator, add an authenticator with type `basic` to the authenticatorChain.
|
||||
|
||||
Configuration of the named authenticator is assigned through properties with the form:
|
||||
|
||||
|
@ -208,14 +208,14 @@ Set the permissions of {roleName}. This replaces the previous set of permissions
|
|||
Content: List of JSON Resource-Action objects, e.g.:
|
||||
```
|
||||
[
|
||||
{
|
||||
{
|
||||
"resource": {
|
||||
"name": "wiki.*",
|
||||
"type": "DATASOURCE"
|
||||
},
|
||||
"action": "READ"
|
||||
},
|
||||
{
|
||||
{
|
||||
"resource": {
|
||||
"name": "wikiticker",
|
||||
"type": "DATASOURCE"
|
||||
|
@ -225,7 +225,7 @@ Content: List of JSON Resource-Action objects, e.g.:
|
|||
]
|
||||
```
|
||||
|
||||
The "name" field for resources in the permission definitions are regexes used to match resource names during authorization checks.
|
||||
|
||||
Please see [Defining permissions](#defining-permissions) for more details.
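As a rough illustration of how regex resource names behave, the sketch below checks resource names against a list of permissions. It assumes that the regex must match the whole resource name; the `is_permitted` helper and the sample permission list are hypothetical and only mirror the shape of the JSON shown above. They are not part of Druid's API.

```python
import re

# Sample permissions shaped like the JSON Resource-Action objects above
# (illustrative data, not taken from a real cluster).
permissions = [
    {"resource": {"name": "wiki.*", "type": "DATASOURCE"}, "action": "READ"},
    {"resource": {"name": "wikiticker", "type": "DATASOURCE"}, "action": "WRITE"},
]


def is_permitted(permissions, name, resource_type, action):
    """Hypothetical check: True if any permission matches the request."""
    for permission in permissions:
        resource = permission["resource"]
        if resource["type"] != resource_type or permission["action"] != action:
            continue
        # Assumption: the regex has to match the entire resource name.
        if re.fullmatch(resource["name"], name):
            return True
    return False
```

Under these assumptions, `wiki.*` grants READ on both `wikipedia` and `wikiticker`, while a name such as `mywiki` is not matched because the pattern is anchored to the full name.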
|
||||
|
||||
|
@ -238,7 +238,7 @@ Return the current load status of the local caches of the authorization database
|
|||
### Authenticator
|
||||
If `druid.auth.authenticator.<authenticator-name>.initialAdminPassword` is set, a default admin user named "admin" will be created, with the specified initial password. If this configuration is omitted, the "admin" user will not be created.
|
||||
|
||||
If `druid.auth.authenticator.<authenticator-name>.initialInternalClientPassword` is set, a default internal system user named "druid_system" will be created, with the specified initial password. If this configuration is omitted, the "druid_system" user will not be created.
|
||||
|
||||
|
||||
### Authorizer
|
||||
|
|
|
@ -19,12 +19,12 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Kerberos"
|
||||
---
|
||||
|
||||
# Druid-Kerberos
|
||||
# Kerberos
|
||||
|
||||
Druid Extension to enable Authentication for Druid Nodes using Kerberos.
|
||||
This extension adds an Authenticator which is used to protect HTTP Endpoints using the simple and protected GSSAPI negotiation mechanism [SPNEGO](https://en.wikipedia.org/wiki/SPNEGO).
|
||||
Make sure to [include](../../operations/including-extensions.html) `druid-kerberos` as an extension.
|
||||
|
||||
|
||||
|
@ -57,23 +57,23 @@ The configuration examples in the rest of this document will use "kerberos" as t
|
|||
|`druid.auth.authenticator.kerberos.cookieSignatureSecret`|`secretString`| Secret used to sign authentication cookies. It is advisable to set it explicitly if you have multiple Druid nodes running on the same machine on different ports, as the Cookie Specification does not guarantee isolation by port.|<Random value>|No|
|
||||
|`druid.auth.authenticator.kerberos.authorizerName`|Depends on available authorizers|Authorizer that requests should be directed to|Empty|Yes|
|
||||
|
||||
As a note, the SPNego principal used by the Druid nodes must start with HTTP (this is specified by [RFC-4559](https://tools.ietf.org/html/rfc4559)) and must be of the form "HTTP/_HOST@REALM".
|
||||
The special string _HOST will be replaced automatically with the value of config `druid.host`
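A minimal sketch of the `_HOST` substitution described above. The `resolve_spnego_principal` function is a hypothetical helper written for illustration; it is not Druid code, and the host and realm values in the usage note are made up.

```python
def resolve_spnego_principal(principal, druid_host):
    """Replace the _HOST placeholder with the configured druid.host value.

    Hypothetical helper: raises ValueError if the principal does not
    start with the HTTP service name required for SPNego.
    """
    if not principal.startswith("HTTP/"):
        raise ValueError("SPNego principal must start with HTTP/")
    return principal.replace("_HOST", druid_host)
```

For example, `resolve_spnego_principal("HTTP/_HOST@EXAMPLE.COM", "druid1.example.com")` yields `HTTP/druid1.example.com@EXAMPLE.COM`.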
|
||||
|
||||
### Auth to Local Syntax
|
||||
`druid.auth.authenticator.kerberos.authToLocal` allows you to set general rules for mapping principal names to local user names.
|
||||
The syntax for mapping rules is `RULE:\[n:string](regexp)s/pattern/replacement/g`. The integer n indicates how many components the target principal should have. If this matches, a string is formed from `string`, substituting the realm of the principal for $0 and the n'th component of the principal for $n. For example, if the principal was druid/admin, then `\[2:$2$1suffix]` would result in the string `admindruidsuffix`.
|
||||
If this string matches regexp, then the s//\[g] substitution command will be run over the string. The optional g will cause the substitution to be global over the string, instead of replacing only the first match in the string.
|
||||
If required, multiple rules can be joined by a newline character and specified as a String.
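The rule semantics described above can be sketched in a few lines. This is an illustrative re-implementation under stated assumptions, not the code Druid uses: the function name and parameters are hypothetical, the regexp is assumed to have to match the whole formed string, and the replacement uses Python's `re.sub` syntax.

```python
import re


def apply_auth_to_local_rule(principal, n, fmt, regexp, pattern, replacement,
                             global_sub=False):
    """Apply one RULE:[n:fmt](regexp)s/pattern/replacement/[g] to a principal.

    Returns the mapped local name, or None when the rule does not apply.
    """
    name, _, realm = principal.partition("@")
    components = name.split("/")
    # The rule only applies to principals with exactly n components.
    if len(components) != n:
        return None
    # $0 is the realm; $1..$n are the components of the principal.
    values = [realm] + components

    def substitute(match):
        index = int(match.group(1))
        # Leave unknown placeholders untouched.
        return values[index] if index < len(values) else match.group(0)

    candidate = re.sub(r"\$(\d)", substitute, fmt)
    # Assumption: the regexp must match the entire formed string.
    if not re.fullmatch(regexp, candidate):
        return None
    # count=0 replaces every match (the optional g flag); count=1 only the first.
    return re.sub(pattern, replacement, candidate, count=0 if global_sub else 1)
```

With the principal `druid/admin@EXAMPLE.COM`, `n=2`, and the format `$2$1suffix`, the formed string is `admindruidsuffix`, matching the example in the text; a substitution of `suffix` with an empty string would then reduce it to `admindruid`.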
|
||||
|
||||
### Increasing HTTP Header size for large SPNEGO negotiate header
|
||||
In an Active Directory environment, the SPNEGO token in the Authorization header includes PAC (Privilege Attribute Certificate) information,
which includes all security groups for the user. In some cases, when the user belongs to many security groups, the header can grow beyond what Druid can handle by default.
In such cases, the maximum request header size that Druid can handle can be increased by setting `druid.server.http.maxRequestHeaderSize` (default 8KB) and `druid.router.http.maxRequestBufferSize` (default 8KB).
|
||||
|
||||
## Configuring Kerberos Escalated Client
|
||||
|
||||
Druid internal nodes communicate with each other using an escalated HTTP client. A Kerberos-enabled escalated HTTP client can be configured with the following properties:
|
||||
|
||||
|
||||
|Property|Example Values|Description|Default|required|
|
||||
|
@ -83,15 +83,15 @@ Druid internal nodes communicate with each other using an escalated http Client.
|
|||
|`druid.escalator.internalClientKeytab`|`/etc/security/keytabs/druid.keytab`|Path to keytab file used for internal node communication|n/a|Yes|
|
||||
|`druid.escalator.authorizerName`|`MyBasicAuthorizer`|Authorizer that requests should be directed to.|n/a|Yes|
|
||||
|
||||
## Accessing Druid HTTP endpoints when Kerberos security is enabled

1. To access Druid HTTP endpoints via curl, the user first needs to log in using the `kinit` command as follows:
|
||||
|
||||
```
|
||||
kinit -k -t <path_to_keytab_file> user@REALM.COM
|
||||
```
|
||||
|
||||
2. Once the login succeeds, verify it using the `klist` command.
3. Now you can access Druid HTTP endpoints using curl as follows:
|
||||
|
||||
```
|
||||
curl --negotiate -u:anyUser -b ~/cookies.txt -c ~/cookies.txt -X POST -H'Content-Type: application/json' <HTTP_END_POINT>
|
||||
|
@ -105,13 +105,13 @@ Druid internal nodes communicate with each other using an escalated http Client.
|
|||
Note: The above command authenticates the user the first time using the SPNego negotiate mechanism and stores the authentication cookie in a file. Subsequent requests use the cookie for authentication.
|
||||
|
||||
## Accessing coordinator or overlord console from web browser
|
||||
To access the Coordinator/Overlord console from a browser, you will need to configure your browser for SPNego authentication as follows:
|
||||
|
||||
1. Safari - No configurations required.
|
||||
2. Firefox - Open Firefox and follow these steps:
|
||||
1. Go to `about:config` and search for `network.negotiate-auth.trusted-uris`.
|
||||
2. Double-click and add the following values: `"http://druid-coordinator-hostname:ui-port"` and `"http://druid-overlord-hostname:port"`
|
||||
3. Google Chrome - From the command line, run the following commands:
|
||||
1. `google-chrome --auth-server-whitelist="druid-coordinator-hostname" --auth-negotiate-delegate-whitelist="druid-coordinator-hostname"`
|
||||
2. `google-chrome --auth-server-whitelist="druid-overlord-hostname" --auth-negotiate-delegate-whitelist="druid-overlord-hostname"`
|
||||
4. Internet Explorer -
|
||||
|
@ -119,4 +119,4 @@ To access Coordinator/Overlord console from browser you will need to configure y
|
|||
2. Allow negotiation for the UI website.
|
||||
|
||||
## Sending Queries programmatically
|
||||
Many HTTP client libraries, such as Apache Commons [HttpComponents](https://hc.apache.org/), already support SPNEGO authentication. You can use any of the available HTTP client libraries to communicate with the Druid cluster.
|
||||
|
|
|
@ -19,6 +19,7 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Cached Lookup Module"
|
||||
---
|
||||
# Cached Lookup Module
|
||||
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Extension Examples"
|
||||
---
|
||||
|
||||
# Druid examples
|
||||
# Extension Examples
|
||||
|
||||
## TwitterSpritzerFirehose
|
||||
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "HDFS"
|
||||
---
|
||||
|
||||
# HDFS
|
||||
|
||||
Make sure to [include](../../operations/including-extensions.html) `druid-hdfs-storage` as an extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Kafka Eight Firehose"
|
||||
---
|
||||
|
||||
# Kafka Eight Firehose
|
||||
|
||||
Make sure to [include](../../operations/including-extensions.html) `druid-kafka-eight` as an extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Kafka Lookups"
|
||||
---
|
||||
|
||||
# Kafka Lookups
|
||||
|
||||
<div class="note caution">
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Kafka Indexing Service"
|
||||
---
|
||||
|
||||
# Kafka Indexing Service
|
||||
|
||||
The Kafka indexing service enables the configuration of *supervisors* on the Overlord, which facilitate ingestion from
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Globally Cached Lookups"
|
||||
---
|
||||
|
||||
# Globally Cached Lookups
|
||||
|
||||
<div class="note caution">
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "MySQL Metadata Store"
|
||||
---
|
||||
|
||||
# MySQL Metadata Store
|
||||
|
||||
Make sure to [include](../../operations/including-extensions.html) `mysql-metadata-storage` as an extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Druid Parquet Extension"
|
||||
---
|
||||
|
||||
# Druid Parquet Extension
|
||||
|
||||
This module extends [Druid Hadoop based indexing](../../ingestion/hadoop.html) to ingest data directly from offline
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "PostgreSQL Metadata Store"
|
||||
---
|
||||
|
||||
# PostgreSQL Metadata Store
|
||||
|
||||
Make sure to [include](../../operations/including-extensions.html) `postgresql-metadata-storage` as an extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Protobuf"
|
||||
---
|
||||
|
||||
# Protobuf
|
||||
|
||||
This extension enables Druid to ingest and understand the Protobuf data format. Make sure to [include](../../operations/including-extensions.html) `druid-protobuf-extensions` as an extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "S3-compatible"
|
||||
---
|
||||
|
||||
# S3-compatible
|
||||
|
||||
Make sure to [include](../../operations/including-extensions.html) `druid-s3-extensions` as an extension.
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Simple SSLContext Provider Module"
|
||||
---
|
||||
|
||||
## Simple SSLContext Provider Module
|
||||
# Simple SSLContext Provider Module
|
||||
|
||||
This module contains a simple implementation of [SSLContext](http://docs.oracle.com/javase/8/docs/api/javax/net/ssl/SSLContext.html)
|
||||
that will be injected to be used with HttpClient that Druid nodes use internally to communicate with each other. To learn more about
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Stats aggregator"
|
||||
---
|
||||
|
||||
# Stats aggregator
|
||||
|
||||
Includes stat-related aggregators, including variance and standard deviations, etc. Make sure to [include](../../operations/including-extensions.html) `druid-stats` as an extension.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Test Stats Aggregators"
|
||||
---
|
||||
|
||||
# Test Stats Aggregators
|
||||
|
||||
Incorporates test statistics related aggregators, including z-score and p-value. Please refer to [https://www.paypal-engineering.com/2017/06/29/democratizing-experimentation-data-for-product-innovations/](https://www.paypal-engineering.com/2017/06/29/democratizing-experimentation-data-for-product-innovations/) for math background and details.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Druid extensions"
|
||||
---
|
||||
|
||||
# Druid extensions
|
||||
|
||||
Druid implements an extension system that allows for adding functionality at runtime. Extensions
|
||||
|
|
|
@ -19,8 +19,10 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Geographic Queries"
|
||||
---
|
||||
# Geographic Queries
|
||||
|
||||
Druid supports filtering on spatially indexed columns based on an origin and a bound.
|
||||
|
||||
# Spatial Indexing
|
||||
|
|
|
@ -19,6 +19,7 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Integrating Druid With Other Technologies"
|
||||
---
|
||||
# Integrating Druid With Other Technologies
|
||||
|
||||
|
|
|
@ -19,6 +19,7 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "JavaScript Programming Guide"
|
||||
---
|
||||
# JavaScript Programming Guide
|
||||
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Extending Druid With Custom Modules"
|
||||
---
|
||||
|
||||
# Extending Druid With Custom Modules
|
||||
|
||||
Druid uses a module system that allows for the addition of extensions at runtime.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Developing on Druid"
|
||||
---
|
||||
|
||||
# Developing on Druid
|
||||
|
||||
Druid's codebase consists of several major components. For developers interested in learning the code, this document provides
|
||||
|
|
|
@ -19,10 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Router Node"
|
||||
---
|
||||
|
||||
Router Node
|
||||
===========
|
||||
# Router Node
|
||||
|
||||
You should only ever need the router node if you have a Druid cluster well into the terabyte range. The router node can be used to route queries to different broker nodes. By default, the broker routes queries based on how [Rules](../operations/rule-configuration.html) are set up. For example, if 1 month of recent data is loaded into a `hot` cluster, queries that fall within the recent month can be routed to a dedicated set of brokers. Queries outside this range are routed to another set of brokers. This setup provides query isolation such that queries for more important data are not impacted by queries for less important data.
|
||||
|
||||
|
|
|
@ -19,8 +19,10 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Versioning Druid"
|
||||
---
|
||||
# Versioning Druid
|
||||
|
||||
This page discusses how we do versioning and provides information on our stable releases.
|
||||
|
||||
Versioning Strategy
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Batch Data Ingestion"
|
||||
---
|
||||
|
||||
# Batch Data Ingestion
|
||||
|
||||
Druid can load data from static files through a variety of methods described here.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Command Line Hadoop Indexer"
|
||||
---
|
||||
|
||||
# Command Line Hadoop Indexer
|
||||
|
||||
To run:
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Compaction Task"
|
||||
---
|
||||
|
||||
# Compaction Task
|
||||
|
||||
Compaction tasks merge all segments of the given interval. The syntax is:
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Data Formats for Ingestion"
|
||||
---
|
||||
Data Formats for Ingestion
|
||||
==========================
|
||||
# Data Formats for Ingestion
|
||||
|
||||
Druid can ingest denormalized data in JSON, CSV, or a delimited form such as TSV, or any custom format. While most examples in the documentation use data in JSON format, it is not difficult to configure Druid to ingest any other delimited data.
|
||||
We welcome any contributions to new formats.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Deleting Data"
|
||||
---
|
||||
|
||||
# Deleting Data
|
||||
|
||||
Permanent deletion of a Druid segment has two steps:
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "My Data isn't being loaded"
|
||||
---
|
||||
|
||||
## My Data isn't being loaded
|
||||
# My Data isn't being loaded
|
||||
|
||||
### Realtime Ingestion
|
||||
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Druid Firehoses"
|
||||
---
|
||||
|
||||
# Druid Firehoses
|
||||
|
||||
Firehoses are used in [native batch ingestion tasks](../ingestion/native_tasks.html), stream push tasks automatically created by [Tranquility](../ingestion/stream-push.html), and the [stream-pull (deprecated)](../ingestion/stream-pull.html) ingestion model.
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "JSON Flatten Spec"
|
||||
---
|
||||
|
||||
# JSON Flatten Spec
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Hadoop-based Batch Ingestion"
|
||||
---
|
||||
|
||||
# Hadoop-based Batch Ingestion
|
||||
|
||||
Hadoop-based batch ingestion in Druid is supported via a Hadoop-ingestion task. These tasks can be posted to a running
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Ingestion"
|
||||
---
|
||||
|
||||
# Ingestion
|
||||
|
||||
## Overview
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Ingestion Spec"
|
||||
---
|
||||
|
||||
# Ingestion Spec
|
||||
|
||||
A Druid ingestion spec consists of 3 components:
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Task Locking & Priority"
|
||||
---
|
||||
|
||||
# Task Locking & Priority
|
||||
|
||||
## Locking
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Miscellaneous Tasks"
|
||||
---
|
||||
|
||||
# Miscellaneous Tasks
|
||||
|
||||
## Noop Task
|
||||
|
|
|
@ -19,6 +19,7 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Native Index Tasks"
|
||||
---
|
||||
# Native Index Tasks
|
||||
|
||||
|
|
|
@ -19,6 +19,7 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Ingestion Reports"
|
||||
---
|
||||
# Ingestion Reports
|
||||
|
||||
|
|
|
@ -19,6 +19,7 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Schema Changes"
|
||||
---
|
||||
# Schema Changes
|
||||
|
||||
|
|
|
@ -19,8 +19,8 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Schema Design"
|
||||
---
|
||||
|
||||
# Schema Design
|
||||
|
||||
This page is meant to assist users in designing a schema for data to be ingested in Druid. Druid intakes denormalized data
|
||||
|
|
|
@ -19,22 +19,22 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Loading Streams"
|
||||
---
|
||||
# Loading Streams
|
||||
|
||||
# Loading streams
|
||||
|
||||
Streams can be ingested in Druid using either [Tranquility](https://github.com/druid-io/tranquility) (a Druid-aware
|
||||
Streams can be ingested in Druid using either [Tranquility](https://github.com/druid-io/tranquility) (a Druid-aware
|
||||
client) or the [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.html).
|
||||
|
||||
## Tranquility (Stream Push)
|
||||
|
||||
If you have a program that generates a stream, then you can push that stream directly into Druid in
real-time. With this approach, Tranquility is embedded in your data-producing application.
Tranquility comes with bindings for the
Storm and Samza stream processors. It also has a direct API that can be used from any JVM-based
program, such as Spark Streaming or a Kafka consumer.
|
||||
|
||||
Tranquility handles partitioning, replication, service discovery, and schema rollover for you,
|
||||
Tranquility handles partitioning, replication, service discovery, and schema rollover for you,
|
||||
seamlessly and without downtime. You only have to define your Druid schema.
|
||||
|
||||
For examples and more information, please see the [Tranquility README](https://github.com/druid-io/tranquility).
|
||||
|
|
|
@ -19,14 +19,14 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Stream Pull Ingestion"
|
||||
---
|
||||
|
||||
<div class="note info">
|
||||
NOTE: Realtime nodes are deprecated. Please use the <a href="../development/extensions-core/kafka-ingestion.html">Kafka Indexing Service</a> for stream pull use cases instead.
|
||||
</div>
|
||||
|
||||
Stream Pull Ingestion
|
||||
=====================
|
||||
# Stream Pull Ingestion
|
||||
|
||||
If you have an external service that you want to pull data from, you have two options. The simplest
|
||||
option is to set up a "copying" service that reads from the data source and writes to Druid using
|
||||
|
@ -34,7 +34,7 @@ the [stream push method](stream-push.html).
|
|||
|
||||
Another option is *stream pull*. With this approach, a Druid Realtime Node ingests data from a
|
||||
[Firehose](../ingestion/firehose.html) connected to the data you want to
|
||||
read. The Druid quickstart and tutorials do not include information about how to set up standalone realtime nodes, but
they can be used in place of the Tranquility server and the indexing service. Please note that Realtime nodes have different properties and roles than the indexing service.
|
||||
|
||||
## Realtime Node Ingestion
|
||||
|
@ -182,7 +182,7 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
|
|||
|dedupColumn|String|The column used to judge whether a row is already in this segment; if so, the row is thrown away. For a String column, a long hashcode of the column's value is used for the check to reduce heap cost, so there is a very small chance that a row that was not ingested before is thrown away.|no (default == null)|
|
||||
|indexSpec|Object|Tune how data is indexed. See below for more information.|no|
|
||||
|
||||
Before enabling thread priority settings, users are highly encouraged to read the [original pull request](https://github.com/apache/incubator-druid/pull/984) and other documentation about proper use of `-XX:+UseThreadPriorities`.
|
||||
|
||||
#### Rejection Policy
|
||||
|
||||
|
@ -254,7 +254,7 @@ Configure `linear` under `schema`:
|
|||
"partitionNum": 0
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
##### Numbered
|
||||
|
||||
|
@ -269,7 +269,7 @@ Configure `numbered` under `schema`:
|
|||
"partitions": 2
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
##### Scale and Redundancy
|
||||
|
||||
|
@ -283,7 +283,7 @@ For example, if RealTimeNode1 has:
|
|||
"partitionNum": 0
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
and RealTimeNode2 has:
|
||||
|
||||
```json
|
||||
|
@ -329,48 +329,48 @@ The normal, expected use cases have the following overall constraints: `intermed
|
|||
|
||||
Standalone realtime nodes use the Kafka high level consumer, which imposes a few restrictions.
|
||||
|
||||
Druid replicates segments such that logically equivalent data segments are concurrently hosted on N nodes. If N–1 nodes go down,
the data will still be available for querying. On real-time nodes, this process depends on maintaining logically equivalent
data segments on each of the N nodes, which is not possible with standard Kafka consumer groups if your Kafka topic requires more than one consumer
(because consumers in different consumer groups will split up the data differently).
|
||||
|
||||
For example, let's say your topic is split across Kafka partitions 1, 2, & 3 and you have 2 real-time nodes with linear shard specs 1 & 2.
Both of the real-time nodes are in the same consumer group. Real-time node 1 may consume data from partitions 1 & 3, and real-time node 2 may consume data from partition 2.
Querying for your data through the broker will yield correct results.
|
||||
|
||||
The problem arises if you want to replicate your data by creating real-time nodes 3 & 4. These new real-time nodes also
have linear shard specs 1 & 2, and they will consume data from Kafka using a different consumer group. In this case,
real-time node 3 may consume data from partitions 1 & 2, and real-time node 4 may consume data from partition 3.
From Druid's perspective, the segments hosted by real-time nodes 1 and 3 are the same, and the data hosted by real-time nodes
2 and 4 are the same, although they are reading from different Kafka partitions. Querying for the data will yield inconsistent
results.
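The inconsistency can be made concrete with a small simulation. This is only a sketch with made-up event data: the partition assignments follow the example above, with node 4 assumed to read the one partition left over in its consumer group (partition 3), and none of the names below come from Druid or Kafka APIs.

```python
# Events stored in each Kafka partition (sample data for illustration).
partitions = {1: ["a", "b"], 2: ["c"], 3: ["d", "e"]}

# Within a consumer group, every partition is read by exactly one member.
# Keys are real-time node numbers; values are the partitions that node reads.
group_a = {1: [1, 3], 2: [2]}  # original nodes 1 and 2
group_b = {3: [1, 2], 4: [3]}  # replica nodes 3 and 4, in a different group

# Linear shard spec that each node claims to serve.
shard_of_node = {1: 1, 2: 2, 3: 1, 4: 2}


def events_for(assignment):
    """Collect the events each node ingests under a partition assignment."""
    return {
        node: sorted(e for p in parts for e in partitions[p])
        for node, parts in assignment.items()
    }


events = {**events_for(group_a), **events_for(group_b)}


def replicas_agree(shard):
    """True if every node serving this shard holds the same events."""
    nodes = [n for n, s in shard_of_node.items() if s == shard]
    return all(events[n] == events[nodes[0]] for n in nodes)
```

Here nodes 1 and 3 both claim shard 1 but hold different events, so `replicas_agree(1)` is false, and a query answered by one replica differs from the same query answered by the other.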
|
||||
|
||||
Is this always a problem? No. If your data is small enough to fit on a single Kafka partition, you can replicate without issues.
Otherwise, you can run real-time nodes without replication.
|
||||
|
||||
Please note that Druid will skip over any event that fails its checksum and is therefore corrupt.
|
||||
|
||||
### Locking
|
||||
|
||||
Using stream pull ingestion with Realtime nodes together with batch ingestion may introduce data override issues. For example, if you
are generating hourly segments for the current day, and run a daily batch job for the current day's data, the segments created by
the batch job will have a more recent version than most of the segments generated by realtime ingestion. If your batch job is indexing
data that isn't yet complete for the day, the daily segment created by the batch job can override recent segments created by
realtime nodes. A portion of data will appear to be lost in this case.
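The override effect can be illustrated with a toy model of versioned segments. This is a simplified sketch, not Druid's timeline logic: it assumes that, for each hour, the segment with the highest version is the one that gets queried, that a realtime segment's version is the hour it was created, and that the batch job ran at hour 12.5 on input that was only complete through hour 10. All names and numbers are hypothetical.

```python
def visible_segments(segments):
    """For each covered hour, pick the segment with the highest version."""
    winner = {}
    for segment in segments:
        for hour in range(segment["start"], segment["end"]):
            current = winner.get(hour)
            if current is None or segment["version"] > current["version"]:
                winner[hour] = segment
    return winner


# Hourly realtime segments for hours 0..13, each holding 100 rows.
realtime = [
    {"name": f"rt-{h}", "start": h, "end": h + 1, "version": float(h), "rows": {h: 100}}
    for h in range(14)
]

# Daily batch segment created at hour 12.5 from input covering hours 0..10 only.
batch = {
    "name": "batch-day",
    "start": 0,
    "end": 24,
    "version": 12.5,
    "rows": {h: 100 for h in range(11)},
}

winner = visible_segments(realtime + [batch])
# Rows a query would see for each hour that realtime ingestion has covered.
visible_rows = {h: winner[h]["rows"].get(h, 0) for h in range(14)}
```

In this model the batch segment wins for hours 0 through 12, so the rows that realtime ingestion had for hours 11 and 12 are hidden, while hour 13 is still served by its newer realtime segment.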
|
||||
|
||||
### Schema changes
|
||||
|
||||
Standalone realtime nodes require stopping a node to update a schema, and starting it up again for the schema to take effect.
This can be difficult to manage at scale, especially with multiple partitions.
|
||||
|
||||
### Log management
|
||||
|
||||
Each standalone realtime node has its own set of logs. Diagnosing errors across many partitions and many servers may be
difficult to manage and track at scale.
|
||||
|
||||
## Deployment Notes
|
||||
|
||||
Stream ingestion may generate a large number of small segments because it's difficult to optimize the segment size at
ingestion time. The number of segments will increase over time, which can degrade query performance.
|
||||
|
||||
Details on how to optimize the segment size can be found on [Segment size optimization](../operations/segment-optimization.html).
|
||||
|
|
|
@ -19,9 +19,9 @@
|
|||
|
||||
---
|
||||
layout: doc_page
|
||||
title: "Stream Push"
|
||||
---
|
||||
|
||||
## Stream Push
|
||||
# Stream Push
|
||||
|
||||
Druid can connect to any streaming data source through
|
||||
[Tranquility](https://github.com/druid-io/tranquility/blob/master/README.md), a package for pushing
|
||||
|
|
Some files were not shown because too many files have changed in this diff.