Added titles and harmonized docs to improve usability and SEO (#6731)

* added titles and harmonized docs

* manually fixed some titles
Vadim Ogievetsky 2018-12-12 20:42:12 -08:00 committed by Fangjin Yang
parent 55914687bb
commit da4836f38c
166 changed files with 312 additions and 271 deletions

View File

@ -19,10 +19,9 @@
---
layout: doc_page
title: "Druid vs Elasticsearch"
---
Druid vs Elasticsearch
======================
# Druid vs Elasticsearch
We are not experts on search systems; if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.

View File

@ -19,10 +19,9 @@
---
layout: doc_page
title: "Druid vs. Key/Value Stores (HBase/Cassandra/OpenTSDB)"
---
Druid vs. Key/Value Stores (HBase/Cassandra/OpenTSDB)
====================================================
# Druid vs. Key/Value Stores (HBase/Cassandra/OpenTSDB)
Druid is highly optimized for scans and aggregations; it supports arbitrarily deep drill-downs into data sets. This same functionality
is supported in key/value stores in 2 ways:

View File

@ -19,10 +19,9 @@
---
layout: doc_page
title: "Druid vs Kudu"
---
Druid vs Kudu
=============
# Druid vs Kudu
Kudu's storage format enables single-row updates, whereas updates to existing Druid segments require recreating the segment, so theoretically
the process for updating old values should have higher latency in Druid. However, the requirements in Kudu for maintaining extra head space to store

View File

@ -19,10 +19,9 @@
---
layout: doc_page
title: "Druid vs Redshift"
---
Druid vs Redshift
=================
# Druid vs Redshift
### How does Druid compare to Redshift?

View File

@ -19,10 +19,9 @@
---
layout: doc_page
title: "Druid vs Spark"
---
Druid vs Spark
==============
# Druid vs Spark
Druid and Spark are complementary solutions as Druid can be used to accelerate OLAP queries in Spark.

View File

@ -19,17 +19,16 @@
---
layout: doc_page
title: "Druid vs SQL-on-Hadoop"
---
# Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
===========================================================
SQL-on-Hadoop engines provide an
execution engine for various data formats and data stores, and
many can be made to push computations down to Druid, while providing a SQL interface to Druid.
For a direct comparison between the technologies and when to use only one or the other, it basically comes down to your
product requirements and what the systems were designed to do.
Druid was designed to
@ -37,7 +36,7 @@ Druid was designed to
1. ingest data in real-time
1. handle slice-n-dice style ad-hoc queries
SQL-on-Hadoop engines generally sidestep Map/Reduce, instead querying data directly from HDFS or, in some cases, other storage systems.
Some of these engines (including Impala and Presto) can be colocated with HDFS data nodes and coordinate with them to achieve data locality for queries.
What does this mean? We can talk about it in terms of three general areas
@ -47,37 +46,37 @@ What does this mean? We can talk about it in terms of three general areas
### Queries
Druid segments store data in a custom column format. Segments are scanned directly as part of queries and each Druid server
calculates a set of results that are eventually merged at the Broker level. This means that the data transferred between servers
consists of queries and results, and all computation is done internally as part of the Druid servers.
Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats.
They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce).
Some (Impala/Presto) SQL-on-Hadoop engines have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still
some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
how much of a performance impact this makes.
### Data Ingestion
Druid is built to allow for real-time ingestion of data. You can ingest data and query it immediately upon ingestion;
the latency for an event to be reflected in the data is dominated by how long it takes to deliver the event to Druid.
SQL-on-Hadoop engines, being based on data in HDFS or some other backing store, are limited in their data ingestion rates by the
rate at which that backing store can make data available. Generally, the backing store is the biggest bottleneck for
how quickly data can become available.
### Query Flexibility
Druid's query language is fairly low level and maps to how Druid operates internally. Although Druid can be combined with a high level query
planner such as [Plywood](https://github.com/implydata/plywood) to support most SQL queries and analytic SQL queries (minus joins among large tables),
base Druid is less flexible than SQL-on-Hadoop solutions for generic processing.
SQL-on-Hadoop engines support SQL-style queries with full joins.
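For a concrete sense of the difference, below is a minimal sketch of a Druid native query. The datasource and column names (`wikipedia`, `count`, `edits`) are illustrative only; this is the kind of low-level JSON that a planner such as Plywood, or a SQL layer, would generate on your behalf.

```json
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "intervals": ["2018-01-01/2018-01-08"],
  "granularity": "day",
  "aggregations": [
    { "type": "longSum", "name": "edits", "fieldName": "count" }
  ]
}
```

A SQL-on-Hadoop engine would express the same aggregation as an ordinary `SELECT ... GROUP BY`, and could additionally join in other tables.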
## Druid vs Parquet
Parquet is a column storage format that is designed to work with SQL-on-Hadoop engines. Parquet doesn't have a query execution engine, and instead
relies on external sources to pull data out of it.
Druid's storage format is highly optimized for linear scans. Although Druid has support for nested data, Parquet's storage format is much
more hierarchical, and is designed more for binary chunking. In theory, this should lead to faster scans in Druid.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Configuration Reference"
---
# Configuration Reference
This page documents all of the configuration properties for each Druid service type.

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "Logging"
---
Logging
==========================
# Logging
Druid nodes will emit logs that are useful for debugging to the console. Druid nodes also emit periodic metrics about their state. For more about metrics, see [Configuration](../configuration/index.html#enabling-metrics). Metric logs are printed to the console by default, and can be disabled with `-Ddruid.emitter.logging.logLevel=debug`.

View File

@ -19,10 +19,10 @@
---
layout: doc_page
title: "Realtime Node Configuration"
---
# Realtime Node Configuration
Realtime Node Configuration
==============================
For general Realtime Node information, see [here](../design/realtime.html).
Runtime Configuration

View File

@ -19,15 +19,18 @@
---
layout: doc_page
title: "Cassandra Deep Storage"
---
# Cassandra Deep Storage
## Introduction
Druid can use Cassandra as a deep storage mechanism. Segments and their metadata are stored in Cassandra in two tables:
`index_storage` and `descriptor_storage`. Under the hood, the Cassandra integration leverages Astyanax. The
index storage table is a [Chunked Object](https://github.com/Netflix/astyanax/wiki/Chunked-Object-Store) repository. It contains
compressed segments for distribution to historical nodes. Since segments can be large, the Chunked Object storage allows the integration to multi-thread
the write to Cassandra, and spreads the data across all the nodes in a cluster. The descriptor storage table is a normal C* table that
stores the segment metadata.
## Schema
Below are the create statements for each:

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Deep Storage"
---
# Deep Storage
Deep storage is where segments are stored. It is a storage mechanism that Druid does not provide. This deep storage infrastructure defines the level of durability of your data; as long as Druid nodes can see this storage infrastructure and get at the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from this storage layer, then you will lose whatever data those segments represented.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Metadata Storage"
---
# Metadata Storage
The Metadata Storage is an external dependency of Druid. Druid uses it to store

View File

@ -19,8 +19,10 @@
---
layout: doc_page
title: "ZooKeeper"
---
# ZooKeeper
Druid uses [ZooKeeper](http://zookeeper.apache.org/) (ZK) for management of current cluster state. The operations that happen over ZK are
1. [Coordinator](../design/coordinator.html) leader election

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Authentication and Authorization"
---
# Authentication and Authorization
|Property|Type|Description|Default|Required|

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "Broker"
---
Broker
======
# Broker
### Configuration

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "Coordinator Node"
---
Coordinator Node
================
# Coordinator Node
### Configuration

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "Historical Node"
---
Historical Node
===============
# Historical Node
### Configuration

View File

@ -19,6 +19,7 @@
---
layout: doc_page
title: "Design"
---
# What is Druid?<a id="what-is-druid"></a>
@ -159,7 +160,7 @@ queries:
- Bitmap compression for bitmap indexes
- Type-aware compression for all columns
Periodically, segments are committed and published. At this point, they are written to [deep storage](#deep-storage),
become immutable, and move from MiddleManagers to the Historical processes (see [Architecture](#architecture) above
for details). An entry about the segment is also written to the [metadata store](#metadata-storage). This entry is a
self-describing bit of metadata about the segment, including things like the schema of the segment, its size, and its

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "Indexing Service"
---
Indexing Service
================
# Indexing Service
The indexing service is a highly-available, distributed service that runs indexing related tasks.

View File

@ -19,10 +19,9 @@
---
layout: doc_page
title: "MiddleManager Node"
---
Middle Manager Node
------------------
# MiddleManager Node
### Configuration

View File

@ -19,10 +19,9 @@
---
layout: doc_page
title: "Overlord Node"
---
Overlord Node
-------------
# Overlord Node
### Configuration

View File

@ -19,10 +19,9 @@
---
layout: doc_page
title: "Peons"
---
Peons
-----
# Peons
### Configuration

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Druid Plumbers"
---
# Druid Plumbers
The plumber handles generated segments both while they are being generated and when they are "done". This is also technically a pluggable interface and there are multiple implementations. However, plumbers handle numerous complex details, and therefore an advanced understanding of Druid is recommended before implementing your own.

View File

@ -19,10 +19,9 @@
---
layout: doc_page
title: "Real-time Node"
---
Real-time Node
==============
# Real-time Node
<div class="note info">
NOTE: Realtime nodes are deprecated. Please use the <a href="../development/extensions-core/kafka-ingestion.html">Kafka Indexing Service</a> for stream pull use cases instead.

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "Segments"
---
Segments
========
# Segments
Druid stores its index in *segment files*, which are partitioned by
time. In a basic setup, one segment file is created for each time

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "Build from Source"
---
### Build from Source
# Build from Source
You can build Druid directly from source. Please note that these instructions are for building the latest stable version of Druid.
For building the latest code in master, follow the instructions [here](https://github.com/apache/incubator-druid/blob/master/docs/content/development/build.md).

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "Experimental Features"
---
# About Experimental Features
# Experimental Features
Experimental features are features we have developed but have not fully tested in a production environment. If you choose to try them out, there will likely be edge cases that we have not covered. We would love feedback on any of these features, whether they are bug reports, suggestions for improvement, or letting us know they work as intended.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Ambari Metrics Emitter"
---
# Ambari Metrics Emitter
To use this extension, make sure to [include](../../operations/including-extensions.html) `ambari-metrics-emitter` extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Microsoft Azure"
---
# Microsoft Azure
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-azure-extensions` extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Apache Cassandra"
---
# Apache Cassandra
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-cassandra-storage` extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Rackspace Cloud Files"
---
# Rackspace Cloud Files
## Deep Storage

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "DistinctCount Aggregator"
---
# DistinctCount aggregator
# DistinctCount Aggregator
To use this extension, make sure to [include](../../operations/including-extensions.html) the `druid-distinctcount` extension.
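As a rough sketch, assuming the aggregator type registered by this extension is `distinctCount` and using hypothetical metric and column names, an aggregator spec might look like:

```json
{ "type": "distinctCount", "name": "uv", "fieldName": "visitor_id" }
```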

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Google Cloud Storage"
---
# Google Cloud Storage
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-google-extensions` extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Graphite Emitter"
---
# Graphite Emitter
To use this extension, make sure to [include](../../operations/including-extensions.html) `graphite-emitter` extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "InfluxDB Line Protocol Parser"
---
# InfluxDB Line Protocol Parser
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-influx-extensions`.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Kafka Emitter"
---
# Kafka Emitter
To use this extension, make sure to [include](../../operations/including-extensions.html) `kafka-emitter` extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Kafka Simple Consumer"
---
# Kafka Simple Consumer
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-kafka-eight-simpleConsumer` extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Materialized View"
---
# Materialized View
To use this feature, make sure to load only materialized-view-selection on the Broker and materialized-view-maintenance on the Overlord. In addition, this feature currently requires a Hadoop cluster.

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "OpenTSDB Emitter"
---
# Opentsdb Emitter
# OpenTSDB Emitter
To use this extension, make sure to [include](../../operations/including-extensions.html) `opentsdb-emitter` extension.
@ -57,5 +57,5 @@ e.g.
"type"
]
```
For most use-cases, the default configuration is sufficient.

View File

@ -19,15 +19,15 @@
---
layout: doc_page
title: "ORC"
---
# Orc
# ORC
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-orc-extensions`.
This extension enables Druid to ingest and understand the Apache Orc data format offline.
This extension enables Druid to ingest and understand the Apache ORC data format offline.
## Orc Hadoop Parser
## ORC Hadoop Parser
This is for batch ingestion using the HadoopDruidIndexer. The inputFormat of inputSpec in ioConfig must be set to `"org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat"`.
@ -35,7 +35,7 @@ This is for batch ingestion using the HadoopDruidIndexer. The inputFormat of inp
|----------|-------------|----------------------------------------------------------------------------------------|---------|
|type | String | This should say `orc` | yes|
|parseSpec | JSON Object | Specifies the timestamp and dimensions of the data. Any parse spec that extends ParseSpec is possible but only their TimestampSpec and DimensionsSpec are used. | yes|
|typeString| String | String representation of Orc struct type info. If not specified, auto constructed from parseSpec but all metric columns are dropped | no|
|typeString| String | String representation of ORC struct type info. If not specified, auto constructed from parseSpec but all metric columns are dropped | no|
|mapFieldNameFormat| String | String format for resolving the flatten map fields. Default is `<PARENT>_<CHILD>`. | no |
As an example of `typeString`, a string column `col1` and an array-of-string column `col2` are represented by `"struct<col1:string,col2:array<string>>"`.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "RabbitMQ"
---
# RabbitMQ
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-rabbitmq` extension.

View File

@ -19,10 +19,9 @@
---
layout: doc_page
title: "Druid Redis Cache"
---
Druid Redis Cache
--------------------
# Druid Redis Cache
A cache implementation for Druid based on [Redis](https://github.com/antirez/redis).

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "RocketMQ"
---
# RocketMQ
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-rocketmq` extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Microsoft SQLServer"
---
# Microsoft SQLServer
Make sure to [include](../../operations/including-extensions.html) `sqlserver-metadata-storage` as an extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "StatsD Emitter"
---
# StatsD Emitter
To use this extension, make sure to [include](../../operations/including-extensions.html) `statsd-emitter` extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Thrift"
---
# Thrift
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-thrift-extensions`.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Timestamp Min/Max aggregators"
---
# Timestamp Min/Max aggregators
To use this extension, make sure to [include](../../operations/including-extensions.html) `druid-time-min-max`.
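As a rough sketch, assuming the aggregator types provided by this extension are `timeMin` and `timeMax` and using a hypothetical timestamp column, a pair of aggregators might look like:

```json
[
  { "type": "timeMin", "name": "earliestEvent", "fieldName": "created_at" },
  { "type": "timeMax", "name": "latestEvent", "fieldName": "created_at" }
]
```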

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Approximate Histogram aggregator"
---
# Approximate Histogram aggregator
Make sure to [include](../../operations/including-extensions.html) `druid-histogram` as an extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Avro"
---
# Avro
This extension enables Druid to ingest and understand the Apache Avro data format. Make sure to [include](../../operations/including-extensions.html) `druid-avro-extensions` as an extension.

View File

@ -19,25 +19,26 @@
---
layout: doc_page
title: "Bloom Filter"
---
# Druid Bloom Filter
# Bloom Filter
Make sure to [include](../../operations/including-extensions.html) `druid-bloom-filter` as an extension.
BloomFilter is a probabilistic data structure for set membership checks.
Following are some characteristics of BloomFilter:
- BloomFilters are highly space efficient compared to using a HashSet.
- Because of the probabilistic nature of a bloom filter, false positives (an element is not present in the bloom filter but test() says true) are possible.
- False negatives are not possible (if an element is present then test() will never say false).
- The false positive probability is configurable (default: 5%); the storage requirement increases or decreases accordingly.
- The lower the false positive probability, the greater the space requirement.
- Bloom filters are sensitive to the number of elements that will be inserted into the bloom filter.
- During the creation of a bloom filter, the expected number of entries must be specified. If the number of insertions exceeds the specified initial number of entries, the false positive probability will increase accordingly.
Internally, this implementation of bloom filter uses the Murmur3 fast non-cryptographic hash algorithm.
### Json Representation of Bloom Filter
### JSON Representation of Bloom Filter
```json
{
"type" : "bloom",
@ -60,7 +61,7 @@ Internally, this implementation of bloom filter uses Murmur3 fast non-cryptograp
- 1 byte for the number of hash functions.
- 1 big-endian int (that is how OutputStream works) for the number of longs in the bitset
- big endian longs in the BloomKFilter bitset
Note: `org.apache.hive.common.util.BloomKFilter` provides a serialize method which can be used to serialize bloom filters to outputStream.
### SQL Queries

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "DataSketches extension"
---
## DataSketches extension
# DataSketches extension
Druid aggregators based on [datasketches](http://datasketches.github.io/) library. Sketches are data structures implementing approximate streaming mergeable algorithms. Sketches can be ingested from the outside of Druid or built from raw data at ingestion time. Sketches can be stored in Druid segments as additive metrics.

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "DataSketches HLL Sketch module"
---
## DataSketches HLL Sketch module
# DataSketches HLL Sketch module
This module provides Druid aggregators for distinct counting based on HLL sketch from [datasketches](http://datasketches.github.io/) library. At ingestion time, this aggregator creates the HLL sketch objects to be stored in Druid segments. At query time, sketches are read and merged together. In the end, by default, you receive the estimate of the number of distinct values presented to the sketch. Also, you can use post aggregator to produce a union of sketch columns in the same row.
You can use the HLL sketch aggregator on any column of identifiers. It will return the estimated cardinality of the column.
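For illustration, a minimal ingestion-time aggregator might look like the sketch below, assuming the build-time aggregator type is `HLLSketchBuild`; the metric and column names are hypothetical.

```json
{ "type": "HLLSketchBuild", "name": "unique_users", "fieldName": "user_id" }
```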

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "DataSketches Quantiles Sketch module"
---
## DataSketches Quantiles Sketch module
# DataSketches Quantiles Sketch module
This module provides Druid aggregators based on numeric quantiles DoublesSketch from [datasketches](http://datasketches.github.io/) library. Quantiles sketch is a mergeable streaming algorithm to estimate the distribution of values, and approximately answer queries about the rank of a value, probability mass function of the distribution (PMF) or histogram, cumulative distribution function (CDF), and quantiles (median, min, max, 95th percentile and such). See [Quantiles Sketch Overview](https://datasketches.github.io/docs/Quantiles/QuantilesOverview.html).
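For illustration, a minimal aggregator might look like the sketch below, assuming the aggregator type is `quantilesDoublesSketch` and that `k` controls sketch size and accuracy; the metric and column names are hypothetical.

```json
{ "type": "quantilesDoublesSketch", "name": "latencySketch", "fieldName": "latency_ms", "k": 128 }
```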

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "DataSketches Theta Sketch module"
---
## DataSketches Theta Sketch module
# DataSketches Theta Sketch module
This module provides Druid aggregators based on Theta sketch from [datasketches](http://datasketches.github.io/) library. Note that sketch algorithms are approximate; see details in the "Accuracy" section of the datasketches doc.
At ingestion time, this aggregator creates the Theta sketch objects which get stored in Druid segments. Logically speaking, a Theta sketch object can be thought of as a Set data structure. At query time, sketches are read and aggregated (set unioned) together. In the end, by default, you receive the estimate of the number of unique entries in the sketch object. Also, you can use post aggregators to do union, intersection or difference on sketch columns in the same row.
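For illustration, a minimal ingestion-time aggregator might look like the sketch below, assuming the aggregator type is `thetaSketch`; the metric and column names are hypothetical.

```json
{ "type": "thetaSketch", "name": "unique_users", "fieldName": "user_id" }
```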

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "DataSketches Tuple Sketch module"
---
## DataSketches Tuple Sketch module
# DataSketches Tuple Sketch module
This module provides Druid aggregators based on Tuple sketch from [datasketches](http://datasketches.github.io/) library. ArrayOfDoublesSketch sketches extend the functionality of the count-distinct Theta sketches by adding arrays of double values associated with unique keys.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Basic Security"
---
# Druid Basic Security
This extension adds:
@ -58,7 +58,7 @@ druid.auth.authenticator.MyBasicAuthenticator.initialInternalClientPassword=pass
druid.auth.authenticator.MyBasicAuthenticator.authorizerName=MyBasicAuthorizer
```
To use the Basic authenticator, add an authenticator with type `basic` to the authenticatorChain.
Configuration of the named authenticator is assigned through properties with the form:
@ -208,14 +208,14 @@ Set the permissions of {roleName}. This replaces the previous set of permissions
Content: List of JSON Resource-Action objects, e.g.:
```
[
{
"resource": {
"name": "wiki.*",
"type": "DATASOURCE"
},
"action": "READ"
},
{
"resource": {
"name": "wikiticker",
"type": "DATASOURCE"
@ -225,7 +225,7 @@ Content: List of JSON Resource-Action objects, e.g.:
]
```
The "name" field for resources in the permission definitions are regexes used to match resource names during authorization checks.
The "name" field for resources in the permission definitions are regexes used to match resource names during authorization checks.
Please see [Defining permissions](#defining-permissions) for more details.
@ -238,7 +238,7 @@ Return the current load status of the local caches of the authorization database
### Authenticator
If `druid.auth.authenticator.<authenticator-name>.initialAdminPassword` is set, a default admin user named "admin" will be created, with the specified initial password. If this configuration is omitted, the "admin" user will not be created.
If `druid.auth.authenticator.<authenticator-name>.initialInternalClientPassword` is set, a default internal system user named "druid_system" will be created, with the specified initial password. If this configuration is omitted, the "druid_system" user will not be created.
### Authorizer

View File

@ -19,12 +19,12 @@
---
layout: doc_page
title: "Kerberos"
---
# Druid-Kerberos
# Kerberos
A Druid extension to enable authentication for Druid nodes using Kerberos.
This extension adds an Authenticator which is used to protect HTTP endpoints using the simple and protected GSSAPI negotiation mechanism [SPNEGO](https://en.wikipedia.org/wiki/SPNEGO).
Make sure to [include](../../operations/including-extensions.html) `druid-kerberos` as an extension.
@ -57,23 +57,23 @@ The configuration examples in the rest of this document will use "kerberos" as t
|`druid.auth.authenticator.kerberos.cookieSignatureSecret`|`secretString`| Secret used to sign authentication cookies. It is advisable to explicitly set it if you have multiple Druid nodes running on the same machine with different ports, as the Cookie Specification does not guarantee isolation by port.|<Random value>|No|
|`druid.auth.authenticator.kerberos.authorizerName`|Depends on available authorizers|Authorizer that requests should be directed to|Empty|Yes|
As a note, it is required that the SPNego principal in use by the Druid nodes must start with HTTP (this is specified by [RFC-4559](https://tools.ietf.org/html/rfc4559)) and must be of the form "HTTP/_HOST@REALM".
The special string _HOST will be replaced automatically with the value of the config `druid.host`.
### Auth to Local Syntax
`druid.auth.authenticator.kerberos.authToLocal` allows you to set general rules for mapping principal names to local user names.
The syntax for mapping rules is `RULE:\[n:string](regexp)s/pattern/replacement/g`. The integer n indicates how many components the target principal should have. If this matches, then a string will be formed from string, substituting the realm of the principal for $0 and the nth component of the principal for $n. e.g. if the principal was druid/admin then `\[2:$2$1suffix]` would result in the string `admindruidsuffix`.
If this string matches regexp, then the s//\[g] substitution command will be run over the string. The optional g will cause the substitution to be global over the string, instead of replacing only the first match in the string.
If required, multiple rules can be joined by a newline character and specified as a String.
### Increasing HTTP Header size for large SPNEGO negotiate header
In an Active Directory environment, the SPNEGO token in the Authorization header includes PAC (Privilege Attribute Certificate) information,
which includes all security groups for the user. In some cases, when the user belongs to many security groups, the header can grow beyond what Druid can handle by default.
In such cases, the maximum request header size that Druid can handle can be increased by setting `druid.server.http.maxRequestHeaderSize` (default 8Kb) and `druid.router.http.maxRequestBufferSize` (default 8Kb).
## Configuring Kerberos Escalated Client
Druid internal nodes communicate with each other using an escalated HTTP client. A Kerberos-enabled escalated HTTP client can be configured with the following properties:
|Property|Example Values|Description|Default|required|
@ -83,15 +83,15 @@ Druid internal nodes communicate with each other using an escalated http Client.
|`druid.escalator.internalClientKeytab`|`/etc/security/keytabs/druid.keytab`|Path to keytab file used for internal node communication|n/a|Yes|
|`druid.escalator.authorizerName`|`MyBasicAuthorizer`|Authorizer that requests should be directed to.|n/a|Yes|
## Accessing Druid HTTP endpoints when Kerberos security is enabled
1. To access Druid HTTP endpoints via curl, the user will need to first log in using the `kinit` command as follows:
```
kinit -k -t <path_to_keytab_file> user@REALM.COM
```
2. Once the login is successful, verify it using the `klist` command.
3. Now you can access Druid HTTP endpoints using curl as follows:
```
curl --negotiate -u:anyUser -b ~/cookies.txt -c ~/cookies.txt -X POST -H'Content-Type: application/json' <HTTP_END_POINT>
@ -105,13 +105,13 @@ Druid internal nodes communicate with each other using an escalated http Client.
Note: The above command will authenticate the user the first time using the SPNEGO negotiate mechanism and store the authentication cookie in a file. For subsequent requests the cookie will be used for authentication.
## Accessing coordinator or overlord console from web browser
To access the Coordinator/Overlord console from a browser, you will need to configure your browser for SPNEGO authentication as follows:
1. Safari - No configurations required.
2. Firefox - Open Firefox and follow these steps:
1. Go to `about:config` and search for `network.negotiate-auth.trusted-uris`.
2. Double-click and add the following values: `"http://druid-coordinator-hostname:ui-port"` and `"http://druid-overlord-hostname:port"`
3. Google Chrome - From the command line, run the following commands:
1. `google-chrome --auth-server-whitelist="druid-coordinator-hostname" --auth-negotiate-delegate-whitelist="druid-coordinator-hostname"`
2. `google-chrome --auth-server-whitelist="druid-overlord-hostname" --auth-negotiate-delegate-whitelist="druid-overlord-hostname"`
4. Internet Explorer -
@ -119,4 +119,4 @@ To access Coordinator/Overlord console from browser you will need to configure y
2. Allow negotiation for the UI website.
## Sending Queries programmatically
Many HTTP client libraries, such as Apache [HttpComponents](https://hc.apache.org/), already have support for performing SPNEGO authentication. You can use any of the available HTTP client libraries to communicate with the Druid cluster.

View File

@ -19,6 +19,7 @@
---
layout: doc_page
title: "Cached Lookup Module"
---
# Cached Lookup Module

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "Extension Examples"
---
# Druid examples
# Extension Examples
## TwitterSpritzerFirehose

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "HDFS"
---
# HDFS
Make sure to [include](../../operations/including-extensions.html) `druid-hdfs-storage` as an extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Kafka Eight Firehose"
---
# Kafka Eight Firehose
Make sure to [include](../../operations/including-extensions.html) `druid-kafka-eight` as an extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Kafka Lookups"
---
# Kafka Lookups
<div class="note caution">

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Kafka Indexing Service"
---
# Kafka Indexing Service
The Kafka indexing service enables the configuration of *supervisors* on the Overlord, which facilitate ingestion from

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Globally Cached Lookups"
---
# Globally Cached Lookups
<div class="note caution">

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "MySQL Metadata Store"
---
# MySQL Metadata Store
Make sure to [include](../../operations/including-extensions.html) `mysql-metadata-storage` as an extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Druid Parquet Extension"
---
# Druid Parquet Extension
This module extends [Druid Hadoop based indexing](../../ingestion/hadoop.html) to ingest data directly from offline

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "PostgreSQL Metadata Store"
---
# PostgreSQL Metadata Store
Make sure to [include](../../operations/including-extensions.html) `postgresql-metadata-storage` as an extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Protobuf"
---
# Protobuf
This extension enables Druid to ingest and understand the Protobuf data format. Make sure to [include](../../operations/including-extensions.html) `druid-protobuf-extensions` as an extension.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "S3-compatible"
---
# S3-compatible
Make sure to [include](../../operations/including-extensions.html) `druid-s3-extensions` as an extension.

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "Simple SSLContext Provider Module"
---
## Simple SSLContext Provider Module
# Simple SSLContext Provider Module
This module contains a simple implementation of [SSLContext](http://docs.oracle.com/javase/8/docs/api/javax/net/ssl/SSLContext.html)
that will be injected to be used with HttpClient that Druid nodes use internally to communicate with each other. To learn more about

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Stats aggregator"
---
# Stats aggregator
Includes stats-related aggregators, such as variance and standard deviation. Make sure to [include](../../operations/including-extensions.html) `druid-stats` as an extension.
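For illustration, a variance aggregator might be specified as in the sketch below, assuming the aggregator type is `variance`; the metric and column names are hypothetical.

```json
{ "type": "variance", "name": "value_var", "fieldName": "value" }
```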

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Test Stats Aggregators"
---
# Test Stats Aggregators
Incorporates test statistics related aggregators, including z-score and p-value. Please refer to [https://www.paypal-engineering.com/2017/06/29/democratizing-experimentation-data-for-product-innovations/](https://www.paypal-engineering.com/2017/06/29/democratizing-experimentation-data-for-product-innovations/) for math background and details.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Druid extensions"
---
# Druid extensions
Druid implements an extension system that allows for adding functionality at runtime. Extensions

View File

@ -19,8 +19,10 @@
---
layout: doc_page
title: "Geographic Queries"
---
# Geographic Queries
Druid supports filtering on specially indexed spatial columns based on an origin and a bound.
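For illustration, a spatial filter with a radius bound might look like the sketch below, assuming a filter of type `spatial` with a nested `bound`; the dimension name and coordinates are hypothetical.

```json
{
  "type": "spatial",
  "dimension": "coordinates",
  "bound": {
    "type": "radius",
    "coords": [37.77, -122.41],
    "radius": 0.5
  }
}
```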
# Spatial Indexing

View File

@ -19,6 +19,7 @@
---
layout: doc_page
title: "Integrating Druid With Other Technologies"
---
# Integrating Druid With Other Technologies

View File

@ -19,6 +19,7 @@
---
layout: doc_page
title: "JavaScript Programming Guide"
---
# JavaScript Programming Guide

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Extending Druid With Custom Modules"
---
# Extending Druid With Custom Modules
Druid uses a module system that allows for the addition of extensions at runtime.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Developing on Druid"
---
# Developing on Druid
Druid's codebase consists of several major components. For developers interested in learning the code, this document provides

View File

@ -19,10 +19,9 @@
---
layout: doc_page
title: "Router Node"
---
Router Node
===========
# Router Node
You should only ever need the router node if you have a Druid cluster well into the terabyte range. The router node can be used to route queries to different broker nodes. By default, the broker routes queries based on how [Rules](../operations/rule-configuration.html) are set up. For example, if 1 month of recent data is loaded into a `hot` cluster, queries that fall within the recent month can be routed to a dedicated set of brokers. Queries outside this range are routed to another set of brokers. This setup provides query isolation such that queries for more important data are not impacted by queries for less important data.
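For illustration, the "1 month of recent data in a `hot` cluster" scenario above corresponds to a load rule along the lines of the sketch below, assuming a rule of type `loadByPeriod` with `tieredReplicants`; the tier names and replica counts are hypothetical.

```json
{
  "type": "loadByPeriod",
  "period": "P1M",
  "tieredReplicants": { "hot": 2, "_default_tier": 1 }
}
```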

View File

@ -19,8 +19,10 @@
---
layout: doc_page
title: "Versioning Druid"
---
# Versioning Druid
This page discusses how we do versioning and provides information on our stable releases.
Versioning Strategy

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Batch Data Ingestion"
---
# Batch Data Ingestion
Druid can load data from static files through a variety of methods described here.

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Command Line Hadoop Indexer"
---
# Command Line Hadoop Indexer
To run:

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Compaction Task"
---
# Compaction Task
Compaction tasks merge all segments of the given interval. The syntax is:

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "Data Formats for Ingestion"
---
Data Formats for Ingestion
==========================
# Data Formats for Ingestion
Druid can ingest denormalized data in JSON, CSV, a delimited form such as TSV, or any custom format. While most examples in the documentation use data in JSON format, it is not difficult to configure Druid to ingest any other delimited data.
We welcome any contributions to new formats.
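As an illustration only (the record below is made up), a single denormalized JSON event might look like this; the same row could equally be expressed as a CSV or TSV line with a matching list of columns.

```json
{ "timestamp": "2018-01-01T01:02:33Z", "page": "Druid_(open-source_data_store)", "language": "en", "added": 57, "deleted": 3 }
```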

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Deleting Data"
---
# Deleting Data
Permanent deletion of a Druid segment has two steps:

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "My Data isn't being loaded"
---
## My Data isn't being loaded
# My Data isn't being loaded
### Realtime Ingestion

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Druid Firehoses"
---
# Druid Firehoses
Firehoses are used in [native batch ingestion tasks](../ingestion/native_tasks.html), stream push tasks automatically created by [Tranquility](../ingestion/stream-push.html), and the [stream-pull (deprecated)](../ingestion/stream-pull.html) ingestion model.
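For illustration, a simple firehose that reads JSON files from a local directory might be specified as in the sketch below, assuming the firehose type `local` with `baseDir` and `filter` fields; the paths are hypothetical.

```json
{ "type": "local", "baseDir": "/tmp/druid-data", "filter": "*.json" }
```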

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "JSON Flatten Spec"
---
# JSON Flatten Spec
| Field | Type | Description | Required |

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Hadoop-based Batch Ingestion"
---
# Hadoop-based Batch Ingestion
Hadoop-based batch ingestion in Druid is supported via a Hadoop-ingestion task. These tasks can be posted to a running

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Ingestion"
---
# Ingestion
## Overview

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Ingestion Spec"
---
# Ingestion Spec
A Druid ingestion spec consists of 3 components:

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Task Locking & Priority"
---
# Task Locking & Priority
## Locking

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Miscellaneous Tasks"
---
# Miscellaneous Tasks
## Noop Task

View File

@ -19,6 +19,7 @@
---
layout: doc_page
title: "Native Index Tasks"
---
# Native Index Tasks

View File

@ -19,6 +19,7 @@
---
layout: doc_page
title: "Ingestion Reports"
---
# Ingestion Reports

View File

@ -19,6 +19,7 @@
---
layout: doc_page
title: "Schema Changes"
---
# Schema Changes

View File

@ -19,8 +19,8 @@
---
layout: doc_page
title: "Schema Design"
---
# Schema Design
This page is meant to assist users in designing a schema for data to be ingested in Druid. Druid intakes denormalized data

View File

@ -19,22 +19,22 @@
---
layout: doc_page
title: "Loading Streams"
---
# Loading Streams
# Loading streams
Streams can be ingested in Druid using either [Tranquility](https://github.com/druid-io/tranquility) (a Druid-aware
client) or the [Kafka Indexing Service](../development/extensions-core/kafka-ingestion.html).
## Tranquility (Stream Push)
If you have a program that generates a stream, then you can push that stream directly into Druid in
real-time. With this approach, Tranquility is embedded in your data-producing application.
Tranquility comes with bindings for the
Storm and Samza stream processors. It also has a direct API that can be used from any JVM-based
program, such as Spark Streaming or a Kafka consumer.
Tranquility handles partitioning, replication, service discovery, and schema rollover for you,
seamlessly and without downtime. You only have to define your Druid schema.
For examples and more information, please see the [Tranquility README](https://github.com/druid-io/tranquility).

View File

@ -19,14 +19,14 @@
---
layout: doc_page
title: "Stream Pull Ingestion"
---
<div class="note info">
NOTE: Realtime nodes are deprecated. Please use the <a href="../development/extensions-core/kafka-ingestion.html">Kafka Indexing Service</a> for stream pull use cases instead.
</div>
Stream Pull Ingestion
=====================
# Stream Pull Ingestion
If you have an external service that you want to pull data from, you have two options. The simplest
option is to set up a "copying" service that reads from the data source and writes to Druid using
@ -34,7 +34,7 @@ the [stream push method](stream-push.html).
Another option is *stream pull*. With this approach, a Druid Realtime Node ingests data from a
[Firehose](../ingestion/firehose.html) connected to the data you want to
read. The Druid quickstart and tutorials do not include information about how to set up standalone realtime nodes, but
they can be used in place of Tranquility server and the indexing service. Please note that Realtime nodes have different properties and roles than the indexing service.
## Realtime Node Ingestion
@ -182,7 +182,7 @@ The tuningConfig is optional and default parameters will be used if no tuningCon
|dedupColumn|String|The column used to judge whether this row is already in this segment; if so, the row is thrown away. If it is a String-type column, to reduce heap cost, the long-type hashcode of this column's value is used to judge whether this row is already ingested, so there may be a very small chance of throwing away a row that was not ingested before.|no (default == null)|
|indexSpec|Object|Tune how data is indexed. See below for more information.|no|
Before enabling thread priority settings, users are highly encouraged to read the [original pull request](https://github.com/apache/incubator-druid/pull/984) and other documentation about proper use of `-XX:+UseThreadPriorities`.
#### Rejection Policy
@ -254,7 +254,7 @@ Configure `linear` under `schema`:
"partitionNum": 0
}
```
##### Numbered
@ -269,7 +269,7 @@ Configure `numbered` under `schema`:
"partitions": 2
}
```
##### Scale and Redundancy
@ -283,7 +283,7 @@ For example, if RealTimeNode1 has:
"partitionNum": 0
}
```
and RealTimeNode2 has:
```json
@ -329,48 +329,48 @@ The normal, expected use cases have the following overall constraints: `intermed
Standalone realtime nodes use the Kafka high level consumer, which imposes a few restrictions.
Druid replicates segments such that logically equivalent data segments are concurrently hosted on N nodes. If N-1 nodes go down,
the data will still be available for querying. On real-time nodes, this process depends on maintaining logically equivalent
data segments on each of the N nodes, which is not possible with standard Kafka consumer groups if your Kafka topic requires more than one consumer
(because consumers in different consumer groups will split up the data differently).
For example, let's say your topic is split across Kafka partitions 1, 2, & 3 and you have 2 real-time nodes with linear shard specs 1 & 2.
Both of the real-time nodes are in the same consumer group. Real-time node 1 may consume data from partitions 1 & 3, and real-time node 2 may consume data from partition 2.
Querying for your data through the broker will yield correct results.
The problem arises if you want to replicate your data by creating real-time nodes 3 & 4. These new real-time nodes also
have linear shard specs 1 & 2, and they will consume data from Kafka using a different consumer group. In this case,
real-time node 3 may consume data from partitions 1 & 2, and real-time node 4 may consume data from partition 2.
From Druid's perspective, the segments hosted by real-time nodes 1 and 3 are the same, and the data hosted by real-time nodes
2 and 4 are the same, although they are reading from different Kafka partitions. Querying for the data will yield inconsistent
results.
Is this always a problem? No. If your data is small enough to fit on a single Kafka partition, you can replicate without issues.
Otherwise, you can run real-time nodes without replication.
Please note that Druid will skip over events that fail their checksum and are corrupt.
### Locking
Using stream pull ingestion with Realtime nodes together with batch ingestion may introduce data override issues. For example, if you
are generating hourly segments for the current day, and run a daily batch job for the current day's data, the segments created by
the batch job will have a more recent version than most of the segments generated by realtime ingestion. If your batch job is indexing
data that isn't yet complete for the day, the daily segment created by the batch job can override recent segments created by
realtime nodes. A portion of data will appear to be lost in this case.
### Schema changes
Standalone realtime nodes require stopping a node to update a schema, and starting it up again for the schema to take effect.
This can be difficult to manage at scale, especially with multiple partitions.
### Log management
Each standalone realtime node has its own set of logs. Diagnosing errors across many partitions across many servers may be
difficult to manage and track at scale.
## Deployment Notes
Stream ingestion may generate a large number of small segments because it's difficult to optimize the segment size at
ingestion time. The number of segments will increase over time, and this might cause query performance issues.
Details on how to optimize the segment size can be found on [Segment size optimization](../operations/segment-optimization.html).

View File

@ -19,9 +19,9 @@
---
layout: doc_page
title: "Stream Push"
---
## Stream Push
# Stream Push
Druid can connect to any streaming data source through
[Tranquility](https://github.com/druid-io/tranquility/blob/master/README.md), a package for pushing

Some files were not shown because too many files have changed in this diff.