mirror of https://github.com/apache/druid.git
Merge pull request #2398 from implydata/doc-render
Numerous fixes to enable docs to render correctly
This commit is contained in:
commit
8ac77a9644
|
@ -5,14 +5,13 @@ layout: doc_page
|
|||
# Caching
|
||||
|
||||
Caching can optionally be enabled on the broker, historical, and realtime
|
||||
nodes, as well as realtime index tasks. See [broker](broker.html#caching),
|
||||
[historical](historical.html#caching), and [realtime](realtime.html#caching)
|
||||
configuration options for how to enable it for individual node types.
|
||||
processing. See [broker](broker.html#caching),
|
||||
[historical](historical.html#caching), and [realtime](realtime.html#caching)
|
||||
configuration options for how to enable it for different processes.
|
||||
|
||||
Druid uses a local in-memory cache by default, unless a different type of cache is specified.
|
||||
Use the `druid.cache.type` configuration to set a different kind of cache.
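For example, a minimal sketch of cache settings in a `runtime.properties` file. The property names follow the Druid cache configuration; the size and the memcached host list are purely illustrative.

```
# Local in-memory cache (the default)
druid.cache.type=local
druid.cache.sizeInBytes=10000000

# Or, hypothetically, a shared memcached cache
# druid.cache.type=memcached
# druid.cache.hosts=memcached1.example.com:11211,memcached2.example.com:11211
```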
|
||||
|
||||
|
||||
## Cache configuration
|
||||
|
||||
Cache settings are set globally, so the same configuration can be re-used
|
||||
|
|
|
@ -6,16 +6,7 @@ Logging
|
|||
|
||||
Druid nodes will emit logs that are useful for debugging to the console. Druid nodes also emit periodic metrics about their state. For more about metrics, see [Configuration](../configuration/index.html). Metric logs are printed to the console by default, and can be disabled with `-Ddruid.emitter.logging.logLevel=debug`.
|
||||
|
||||
Druid uses [log4j2](http://logging.apache.org/log4j/2.x/) for logging. Logging can be configured with a log4j2.xml file. Add the path to the directory containing the log4j2.xml file (eg a config dir) to your classpath if you want to override default Druid log configuration. Note that this directory should be earlier in the classpath than the druid jars. The easiest way to do this is to prefix the classpath with the config dir. For example, if the log4j2.xml file is in config/_common:
|
||||
|
||||
```bash
|
||||
java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 \
|
||||
-Ddruid.realtime.specFile=examples/indexing/wikipedia.spec \
|
||||
-classpath "config/_common:config/realtime:lib/*" \
|
||||
io.druid.cli.Main server realtime
|
||||
```
|
||||
|
||||
Note the "-classpath" in this example has the config dir before the jars under lib/*.
|
||||
Druid uses [log4j2](http://logging.apache.org/log4j/2.x/) for logging. Logging can be configured with a log4j2.xml file. Add the path to the directory containing the log4j2.xml file (e.g. the _common/ dir) to your classpath if you want to override default Druid log configuration. Note that this directory should be earlier in the classpath than the druid jars. The easiest way to do this is to prefix the classpath with the config dir.
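As an illustration, a minimal `config/_common/log4j2.xml` could look like the following. This is a generic log4j2 console configuration, assumed rather than copied from the Druid distribution; adjust the pattern and levels to taste.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<Configuration status="WARN">
  <Appenders>
    <!-- Write all log output to the console -->
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <!-- Default to INFO for everything -->
    <Root level="info">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>
```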
|
||||
|
||||
To enable java logging to go through log4j2, set the `-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager` server parameter.
|
||||
|
||||
|
|
|
@ -4,40 +4,40 @@ layout: doc_page
|
|||
Production Cluster Configuration
|
||||
================================
|
||||
|
||||
```note-info
|
||||
This configuration is an example of what a production cluster could look like. Many other hardware combinations are
|
||||
<div class="note info">
|
||||
This configuration is an example of what a production cluster could look like. Many other hardware combinations are
|
||||
possible! Cheaper hardware is absolutely possible.
|
||||
```
|
||||
</div>
|
||||
|
||||
This production Druid cluster assumes that metadata storage and Zookeeper are already set up. The deep storage that is
|
||||
This production Druid cluster assumes that metadata storage and Zookeeper are already set up. The deep storage that is
|
||||
used for examples is [S3](https://aws.amazon.com/s3/) and [memcached](http://memcached.org/) is used for a distributed cache.
|
||||
|
||||
```note-info
|
||||
The nodes in this example do not need to be on their own individual servers. Overlord and Coordinator nodes should be
|
||||
co-located on the same hardware.
|
||||
```
|
||||
<div class="note info">
|
||||
The nodes in this example do not need to be on their own individual servers. Overlord and Coordinator nodes should be
|
||||
co-located on the same hardware.
|
||||
</div>
|
||||
|
||||
The nodes that respond to queries (Historical, Broker, and MiddleManager nodes) will use as many cores as are available,
|
||||
depending on usage, so it is best to keep these on dedicated machines. The upper limit of effectively utilized cores is
|
||||
not well characterized yet and would depend on types of queries, query load, and the schema. Historical daemons should
|
||||
have a heap size of at least 1GB per core for normal usage, but could be squeezed into a smaller heap for testing.
|
||||
Since in-memory caching is essential for good performance, even more RAM is better.
|
||||
Broker nodes will use RAM for caching, so they do more than just route queries.
|
||||
The nodes that respond to queries (Historical, Broker, and MiddleManager nodes) will use as many cores as are available,
|
||||
depending on usage, so it is best to keep these on dedicated machines. The upper limit of effectively utilized cores is
|
||||
not well characterized yet and would depend on types of queries, query load, and the schema. Historical daemons should
|
||||
have a heap size of at least 1GB per core for normal usage, but could be squeezed into a smaller heap for testing.
|
||||
Since in-memory caching is essential for good performance, even more RAM is better.
|
||||
Broker nodes will use RAM for caching, so they do more than just route queries.
|
||||
SSDs are highly recommended for Historical nodes when they have more segments loaded than available memory.
|
||||
|
||||
The nodes that are responsible for coordination (Coordinator and Overlord nodes) require much less processing.
|
||||
|
||||
The effective utilization of cores by Zookeeper, metadata storage, and Coordinator nodes is likely to be between 1 and 2
|
||||
for each process/daemon, so these could potentially share a machine with lots of cores. These daemons work with heap
|
||||
The effective utilization of cores by Zookeeper, metadata storage, and Coordinator nodes is likely to be between 1 and 2
|
||||
for each process/daemon, so these could potentially share a machine with lots of cores. These daemons work with heap
|
||||
size between 500MB and 1GB.
|
||||
|
||||
We'll use [EC2](https://aws.amazon.com/ec2/) r3.8xlarge nodes for query facing nodes and m1.xlarge nodes for coordination nodes.
|
||||
The following examples work relatively well in production, however, a more optimized tuning for the nodes we selected and
|
||||
We'll use [EC2](https://aws.amazon.com/ec2/) r3.8xlarge nodes for query facing nodes and m1.xlarge nodes for coordination nodes.
|
||||
The following examples work relatively well in production, however, a more optimized tuning for the nodes we selected and
|
||||
more optimal hardware for a Druid cluster are both definitely possible.
|
||||
|
||||
```note-caution
|
||||
<div class="note caution">
|
||||
For high availability, there should be at least a redundant copy of every process running on separate hardware.
|
||||
```
|
||||
</div>
|
||||
|
||||
### Common Configuration (common.runtime.properties)
|
||||
|
||||
|
|
|
@ -3,11 +3,6 @@
|
|||
layout: doc_page
|
||||
---
|
||||
|
||||
```note-caution
|
||||
If you are doing stream-pull based ingestion, we suggest using [stream-pushed](../ingestion/stream-push.html) based ingestion instead and not
|
||||
using real-time nodes.
|
||||
```
|
||||
|
||||
Realtime Node Configuration
|
||||
==============================
|
||||
For general Realtime Node information, see [here](../design/realtime.html).
|
||||
|
|
|
@ -154,14 +154,16 @@ an issue).
|
|||
|
||||
The `payload` column stores a JSON blob that has all of the metadata for the segment (some of the data stored in this payload is redundant with some of the columns in the table, that is intentional). This looks something like
|
||||
|
||||
```
|
||||
```json
|
||||
{
|
||||
"dataSource":"wikipedia",
|
||||
"interval":"2012-05-23T00:00:00.000Z/2012-05-24T00:00:00.000Z",
|
||||
"version":"2012-05-24T00:10:00.046Z",
|
||||
"loadSpec":{"type":"s3_zip",
|
||||
"bucket":"bucket_for_segment",
|
||||
"key":"path/to/segment/on/s3"},
|
||||
"loadSpec":{
|
||||
"type":"s3_zip",
|
||||
"bucket":"bucket_for_segment",
|
||||
"key":"path/to/segment/on/s3"
|
||||
},
|
||||
"dimensions":"comma-delimited-list-of-dimension-names",
|
||||
"metrics":"comma-delimited-list-of-metric-names",
|
||||
"shardSpec":{"type":"none"},
|
||||
|
@ -197,13 +199,12 @@ The Audit table is used to store the audit history for configuration changes
|
|||
e.g. rule changes done by [Coordinator](../design/coordinator.html) and other
|
||||
config changes.
|
||||
|
||||
## Accessed By: ##

The Metadata Storage is accessed only by:

1. Realtime Nodes
2. Indexing Service Nodes (if any)
3. Coordinator Nodes

Thus you need to give permissions (eg in AWS Security Groups) only for these machines to access the Metadata storage.
|
||||
## Accessed By: ##
|
||||
|
||||
The Metadata Storage is accessed only by:
|
||||
|
||||
1. Indexing Service Nodes (if any)
|
||||
2. Realtime Nodes (if any)
|
||||
3. Coordinator Nodes
|
||||
|
||||
Thus you need to give permissions (eg in AWS Security Groups) only for these machines to access the Metadata storage.
|
||||
|
|
|
@ -1,45 +0,0 @@
|
|||
---
|
||||
layout: doc_page
|
||||
---
|
||||
Concepts and Terminology
|
||||
========================
|
||||
|
||||
The following definitions are given with respect to the Druid data store. They are intended to help you better understand the Druid documentation, where these terms and concepts occur.
|
||||
|
||||
More definitions are available on the [design page](../design/design.html).
|
||||
|
||||
* **Aggregation** The summarizing of data meeting certain specifications. Druid aggregates [timeseries data](#timeseries), which in effect compacts the data. Time intervals (set in configuration) are used to create buckets, while [timestamps](#timestamp) determine which buckets data aggregated in.
|
||||
|
||||
* **Aggregators** A mechanism for combining records during realtime incremental indexing, Hadoop batch indexing, and in queries.
|
||||
|
||||
* **Compute node** Obsolete name for a [Historical node](../design/historical.html).
|
||||
|
||||
* **DataSource** A table-like view of data; specified in [specFiles](#specfile) and in queries. A dataSource specifies the source of data being ingested and ultimately stored in [segments](#segment).
|
||||
|
||||
* **Dimensions** Aspects or categories of data, such as languages or locations. For example, with *language* and *country* as the type of dimension, values could be "English" or "Mandarin" for language, or "USA" or "China" for country. In Druid, dimensions can serve as filters for narrowing down hits (for example, language = "English" or country = "China").
|
||||
|
||||
* **Ephemeral Node** A Zookeeper node (or "znode") that exists for as long as the session that created the znode is active. More info [here](http://zookeeper.apache.org/doc/r3.2.1/zookeeperProgrammers.html#Ephemeral+Nodes). In a Druid cluster, ephemeral nodes are typically used for commands (such as assigning [segments](#segment) to certain nodes).
|
||||
|
||||
* **Granularity** The time interval corresponding to aggregation by time. Druid configuration settings specify the granularity of [timestamp](#timestamp) buckets in a [segment](#segment) (for example, by minute or by hour), as well as the granularity of the segment itself. The latter is essentially the overall range of absolute time covered by the segment. In queries, granularity settings control the summarization of findings.
|
||||
|
||||
* **Ingestion** The pulling and initial storing and processing of data. Druid supports realtime and batch ingestion of data, and applies indexing in both cases.
|
||||
|
||||
* **Master node** Obsolete name for a [Coordinator node](../design/coordinator.html).
|
||||
|
||||
* **Metrics** Countable data that can be aggregated. Metrics, for example, can be the number of visitors to a website, number of tweets per day, or average revenue.
|
||||
|
||||
* **Rollup** The aggregation of data that occurs at one or more stages, based on settings in a [configuration file](#specfile).
|
||||
|
||||
<a name="segment"></a>
|
||||
* **Segment** A collection of (internal) records that are stored and processed together. Druid chunks data into segments representing a time interval, and these are stored and manipulated in the cluster.
|
||||
|
||||
* **Shard** A sub-partition of the data, allowing multiple [segments](#segment) to represent the data in a certain time interval. Sharding occurs along time partitions to better handle amounts of data that exceed certain limits on segment size, although sharding along dimensions may also occur to optimize efficiency.
|
||||
|
||||
<a name="specfile"></a>
|
||||
* **specFile** The specification for services in JSON format; see [Realtime](../design/realtime.html) and [Batch-ingestion](../ingestion/batch-ingestion.html)
|
||||
|
||||
<a name="timeseries"></a>
|
||||
* **Timeseries Data** Data points which are ordered in time. The closing value of a financial index or the number of tweets per hour with a certain hashtag are examples of timeseries data.
|
||||
|
||||
<a name="timestamp"></a>
|
||||
* **Timestamp** An absolute position on a timeline, given in a standard alpha-numerical format such as with UTC time. [Timeseries data](#timeseries) points can be ordered by timestamp, and in Druid, they are.
|
|
@ -31,7 +31,6 @@ Each of the systems, or components, described below also has a dedicated page wi
|
|||
The node types that currently exist are:
|
||||
|
||||
* [**Historical**](../design/historical.html) nodes are the workhorses that handle storage and querying on "historical" data (non-realtime). Historical nodes download segments from deep storage, respond to the queries from broker nodes about these segments, and return results to the broker nodes. They announce themselves and the segments they are serving in Zookeeper, and also use Zookeeper to monitor for signals to load or drop new segments.
|
||||
* [**Realtime**](../design/realtime.html) nodes ingest data in real time. They are in charge of listening to a stream of incoming data and making it available immediately inside the Druid system. Real-time nodes respond to query requests from Broker nodes, returning query results to those nodes. Aged data is pushed from Realtime nodes to deep storage. Realtime nodes monitor ZooKeeper to discover segments that they've pushed to deep storage have been loaded by Historicals—if so, they drop those segments.
|
||||
* [**Coordinator**](../design/coordinator.html) nodes monitor the grouping of historical nodes to ensure that data is available, replicated and in a generally "optimal" configuration. They do this by reading segment metadata information from metadata storage to determine what segments should be loaded in the cluster, using Zookeeper to determine what Historical nodes exist, and creating Zookeeper entries to tell Historical nodes to load and drop new segments.
|
||||
* [**Broker**](../design/broker.html) nodes receive queries from external clients and forward those queries to Realtime and Historical nodes. When Broker nodes receive results, they merge these results and return them to the caller. For knowing topology, Broker nodes use Zookeeper to determine what Realtime and Historical nodes exist.
|
||||
* [**Indexing Service**](../design/indexing-service.html) nodes form a cluster of workers to load batch and real-time data into the system as well as allow for alterations to the data stored in the system.
|
||||
|
|
|
@ -3,8 +3,9 @@ layout: doc_page
|
|||
---
|
||||
|
||||
## DataSketches aggregator
|
||||
Druid aggregators based on [datasketches]()http://datasketches.github.io/) library. Note that sketch algorithms are approxiate, see details in the "Accuracy" section of the datasketches doc.
|
||||
At ingestion time, this aggregator creates the theta sketch objects which get stored in Druid segments. Logically speaking, a theta sketch object can be thought of as a Set data structure. At query time, sketches are read and aggregated(set unioned) together. In the end, by default, you receive the estimate of number of unique entries in the sketch object. Also, You can use post aggregators to do union, intersection or difference on sketch columns in the same row.
|
||||
|
||||
Druid aggregators based on the [datasketches](http://datasketches.github.io/) library. Note that sketch algorithms are approximate; see details in the "Accuracy" section of the datasketches doc.
At ingestion time, this aggregator creates the theta sketch objects which get stored in Druid segments. Logically speaking, a theta sketch object can be thought of as a Set data structure. At query time, sketches are read and aggregated (set unioned) together. In the end, by default, you receive the estimate of the number of unique entries in the sketch object. You can also use post aggregators to do union, intersection, or difference on sketch columns in the same row.
Note that you can use the `thetaSketch` aggregator on columns which were not ingested using it; it will return the estimated cardinality of the column. It is recommended to use it at ingestion time as well to make querying faster.
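For instance, a hypothetical `thetaSketch` aggregator over a `user_id` column (the names are illustrative; see the aggregator reference below for the full set of options):

```json
{ "type" : "thetaSketch", "name" : "unique_users", "fieldName" : "user_id" }
```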
|
||||
|
||||
### Aggregators
|
||||
|
|
|
@ -9,7 +9,7 @@ This page discusses how we can integrate Druid with other technologies.
|
|||
|
||||
Event streams can be stored in a distributed message bus such as Kafka and further processed via a distributed stream
|
||||
processor system such as Storm, Samza, or Spark Streaming. Data processed by the stream processor can feed into Druid using
|
||||
the [Tranquility](https://github.com/druid-io/tranquility) library. Data can be
|
||||
the [Tranquility](https://github.com/druid-io/tranquility) library.
|
||||
|
||||
<img src="../../img/druid-production.png" width="800"/>
|
||||
|
||||
|
|
|
@ -39,26 +39,13 @@ Some great folks have written their own libraries to interact with Druid
|
|||
#### SQL
|
||||
|
||||
* [implydata/plyql](https://github.com/implydata/plyql) - A command line interface for issuing SQL queries to Druid via [plywood](https://github.com/implydata/plywood)
|
||||
* [srikalyc/Sql4D](https://github.com/srikalyc/Sql4D) - A SQL client for Druid. Used in production at Yahoo.
|
||||
|
||||
|
||||
Community Helper Libraries
|
||||
--------------------------
|
||||
|
||||
* [madvertise/druid-dumbo](https://github.com/madvertise/druid-dumbo) - Scripts to help generate batch configs for the ingestion of data into Druid
|
||||
* [housejester/druid-test-harness](https://github.com/housejester/druid-test-harness) - A set of scripts to simplify standing up some servers and seeing how things work
|
||||
* [mingfang/docker-druid](https://github.com/mingfang/docker-druid) - A Dockerfile to run the entire Druid cluster
|
||||
|
||||
Other Druid Distributions
|
||||
-------------------------
|
||||
|
||||
* [Imply Analytics Platform](http://imply.io/download) - The Imply Analytics platform repackages Druid, all its dependencies, and an UI and SQL layer.
|
||||
|
||||
Tools
|
||||
---
|
||||
|
||||
* [Insert Segments](../../operations/insert-segment-to-db.html) - A tool that can insert segments' metadata into Druid metadata storage.
|
||||
|
||||
UIs
|
||||
---
|
||||
|
||||
|
@ -68,13 +55,20 @@ UIs
|
|||
* [Metabase](https://github.com/metabase/metabase) - Simple dashboards, charts and query tool for your Druid DB
|
||||
|
||||
Tools
|
||||
---
|
||||
-----
|
||||
|
||||
* [Insert Segments](../../operations/insert-segment-to-db.html) - A tool that can insert segments' metadata into Druid metadata storage.
|
||||
|
||||
Other Community Extensions
|
||||
Community Helper Libraries
|
||||
--------------------------
|
||||
|
||||
* [madvertise/druid-dumbo](https://github.com/madvertise/druid-dumbo) - Scripts to help generate batch configs for the ingestion of data into Druid
|
||||
* [housejester/druid-test-harness](https://github.com/housejester/druid-test-harness) - A set of scripts to simplify standing up some servers and seeing how things work
|
||||
* [mingfang/docker-druid](https://github.com/mingfang/docker-druid) - A Dockerfile to run the entire Druid cluster
|
||||
|
||||
Community Extensions
|
||||
--------------------
|
||||
|
||||
These are extensions from the community. (If you would like yours listed please speak up!)
|
||||
|
||||
* [acesinc/druid-cors-filter-extension](https://github.com/acesinc/druid-cors-filter-extension) - An extension to enable CORS headers in http requests.
|
||||
|
|
|
@ -70,7 +70,8 @@ of a Druid [overlord](../design/indexing-service.html). A sample task is shown b
|
|||
"tuningConfig" : {
|
||||
"type": "hadoop"
|
||||
}
|
||||
}
|
||||
},
|
||||
"hadoopDependencyCoordinates": <my_hadoop_version>
|
||||
}
|
||||
```
|
||||
|
||||
|
@ -261,7 +262,7 @@ with Amazon-specific features such as S3 encryption and consistent views. If you
|
|||
features, you will need to make the Amazon EMR Hadoop JARs available to Druid through one of the
|
||||
mechanisms described in the [Using other Hadoop distributions](#using-other-hadoop-distributions) section.
|
||||
|
||||
### Using other Hadoop distributions
|
||||
## Using other Hadoop distributions
|
||||
|
||||
Druid works out of the box with many Hadoop distributions.
|
||||
|
||||
|
|
|
@ -10,6 +10,13 @@ To run:
|
|||
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:<hadoop_config_dir> io.druid.cli.Main index hadoop <spec_file>
|
||||
```
|
||||
|
||||
## Options
|
||||
|
||||
- "--coordinate" - provide a version of Hadoop to use. This property will override the default Hadoop coordinates. Once specified, Druid will look for those Hadoop dependencies from the location specified by `druid.extensions.hadoopDependenciesDir`.
|
||||
- "--no-default-hadoop" - don't pull down the default hadoop version
|
||||
|
||||
## Spec file
|
||||
|
||||
The spec file needs to contain a JSON object where the contents are the same as the "spec" field in the Hadoop index task.
|
||||
In addition, the following fields need to be added to the ioConfig:
|
||||
|
||||
|
|
|
@ -66,6 +66,8 @@ All forms of Druid ingestion require some form of schema object. The format of t
|
|||
}
|
||||
```
|
||||
|
||||
If you have nested JSON, [Druid can automatically flatten it for you](flatten-json.html).
|
||||
|
||||
### CSV
|
||||
|
||||
Since the CSV data cannot contain the column names (no header is allowed), these must be added before that data can be processed:
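A minimal sketch of such a CSV parseSpec, with purely illustrative column names:

```json
"parseSpec" : {
  "format" : "csv",
  "timestampSpec" : { "column" : "timestamp" },
  "columns" : ["timestamp", "page", "language", "user", "added", "deleted"],
  "dimensionsSpec" : { "dimensions" : ["page", "language", "user"] }
}
```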
|
||||
|
|
|
@ -93,7 +93,7 @@ If `type` is not included, the parser defaults to `string`.
|
|||
|
||||
### Avro Stream Parser
|
||||
|
||||
This is for realtime ingestion.
|
||||
This is for realtime ingestion. Make sure to include "io.druid.extensions:druid-avro-extensions" as an extension.
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
|
@ -102,6 +102,7 @@ This is for realtime ingestion.
|
|||
| parseSpec | JSON Object | Specifies the format of the data. | yes |
|
||||
|
||||
For example, using Avro stream parser with schema repo Avro bytes decoder:
|
||||
|
||||
```json
|
||||
"parser" : {
|
||||
"type" : "avro_stream",
|
||||
|
@ -116,11 +117,7 @@ For example, using Avro stream parser with schema repo Avro bytes decoder:
|
|||
"url" : "${YOUR_SCHEMA_REPO_END_POINT}",
|
||||
}
|
||||
},
|
||||
"parsSpec" : {
|
||||
"format" : "timeAndDims",
|
||||
"timestampSpec" : {},
|
||||
"dimensionsSpec" : {}
|
||||
}
|
||||
"parseSpec" : <standard_druid_parseSpec>
|
||||
}
|
||||
```
|
||||
|
||||
|
@ -155,7 +152,7 @@ This Avro bytes decoder first extract `subject` and `id` from input message byte
|
|||
|
||||
### Avro Hadoop Parser
|
||||
|
||||
This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"io.druid.data.input.avro.AvroValueInputFormat"`. You may want to set Avro reader's schema in `jobProperties` in `tuningConfig`, eg: `"avro.schema.path.input.value": "/path/to/your/schema.avsc"` or `"avro.schema.input.value": "your_schema_JSON_object"`, if reader's schema is not set, the schema in Avro object container file will be used, see [Avro specification](http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution).
|
||||
This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `inputSpec` in `ioConfig` must be set to `"io.druid.data.input.avro.AvroValueInputFormat"`. You may want to set Avro reader's schema in `jobProperties` in `tuningConfig`, eg: `"avro.schema.path.input.value": "/path/to/your/schema.avsc"` or `"avro.schema.input.value": "your_schema_JSON_object"`, if reader's schema is not set, the schema in Avro object container file will be used, see [Avro specification](http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution). Make sure to include "io.druid.extensions:druid-avro-extensions" as an extension.
|
||||
|
||||
| Field | Type | Description | Required |
|
||||
|-------|------|-------------|----------|
|
||||
|
@ -164,20 +161,16 @@ This is for batch ingestion using the HadoopDruidIndexer. The `inputFormat` of `
|
|||
| fromPigAvroStorage | Boolean | Specifies whether the data file is stored using AvroStorage. | no(default == false) |
|
||||
|
||||
For example, using Avro Hadoop parser with custom reader's schema file:
|
||||
|
||||
```json
|
||||
{
|
||||
"type" : "index_hadoop",
|
||||
"hadoopDependencyCoordinates" : ["io.druid.extensions:druid-avro-extensions"],
|
||||
"type" : "index_hadoop",
|
||||
"spec" : {
|
||||
"dataSchema" : {
|
||||
"dataSource" : "",
|
||||
"parser" : {
|
||||
"type" : "avro_hadoop",
|
||||
"parsSpec" : {
|
||||
"format" : "timeAndDims",
|
||||
"timestampSpec" : {},
|
||||
"dimensionsSpec" : {}
|
||||
}
|
||||
"parseSpec" : <standard_druid_parseSpec>
|
||||
}
|
||||
},
|
||||
"ioConfig" : {
|
||||
|
|
|
@ -4,7 +4,7 @@ layout: doc_page
|
|||
|
||||
# Loading streams
|
||||
|
||||
Streams can be ingested in Druid using either [Tranquility](https://github.com/druid-io/tranquility (a Druid-aware
|
||||
Streams can be ingested in Druid using either [Tranquility](https://github.com/druid-io/tranquility) (a Druid-aware
|
||||
client) and the [indexing service](../design/indexing-service.html) or through standalone [Realtime nodes](../design/realtime.html).
|
||||
The first approach will be more complex to set up, but also offers scalability and high availability characteristics that advanced production
|
||||
setups may require. The second approach has some known [limitations](../ingestion/stream-pull.html#limitations).
|
||||
|
|
|
@ -25,7 +25,6 @@ For Real-time Node Configuration, see [Realtime Configuration](../configuration/
|
|||
|
||||
For writing your own plugins to the real-time node, see [Firehose](../ingestion/firehose.html).
|
||||
|
||||
<a id="realtime-specfile"></a>
|
||||
## Realtime "specFile"
|
||||
|
||||
The property `druid.realtime.specFile` has the path of a file (absolute or relative path and file name) with realtime specifications in it. This "specFile" should be a JSON Array of JSON objects like the following:
|
||||
|
@ -166,7 +165,7 @@ The following policies are available:
|
|||
* `none` – All events are accepted. Never hands off data unless shutdown() is called on the configured firehose.
|
||||
|
||||
|
||||
####<a id="sharding"></a> Sharding
|
||||
#### Sharding
|
||||
|
||||
Druid uses shards, or segments with partition numbers, to more efficiently handle large amounts of incoming data. In Druid, shards represent the segments that together cover a time interval based on the value of `segmentGranularity`. If, for example, `segmentGranularity` is set to "hour", then a number of shards may be used to store the data for that hour. Sharding along dimensions may also occur to optimize efficiency.
|
||||
|
||||
|
|
|
@ -4,24 +4,24 @@ layout: doc_page
|
|||
|
||||
## Stream Push
|
||||
|
||||
Druid can connect to any streaming data source through
|
||||
[Tranquility](https://github.com/druid-io/tranquility/blob/master/README.md), a package for pushing
|
||||
Druid can connect to any streaming data source through
|
||||
[Tranquility](https://github.com/druid-io/tranquility/blob/master/README.md), a package for pushing
|
||||
streams to Druid in real-time. Druid does not come bundled with Tranquility, and you will have to download the distribution.
|
||||
|
||||
```note-info
|
||||
<div class="note info">
|
||||
If you've never loaded streaming data into Druid, we recommend trying out the
|
||||
[stream loading tutorial](../tutorials/tutorial-streams.html) first and then coming back to this page.
|
||||
```
|
||||
<a href="../tutorials/tutorial-streams.html">stream loading tutorial</a> first and then coming back to this page.
|
||||
</div>
|
||||
|
||||
Note that with all streaming ingestion options, you must ensure that incoming data is recent
|
||||
enough (within a [configurable windowPeriod](#segmentgranularity-and-windowperiod) of the current
|
||||
time). Older messages will not be processed in real-time. Historical data is best processed with
|
||||
Note that with all streaming ingestion options, you must ensure that incoming data is recent
|
||||
enough (within a [configurable windowPeriod](#segmentgranularity-and-windowperiod) of the current
|
||||
time). Older messages will not be processed in real-time. Historical data is best processed with
|
||||
[batch ingestion](../ingestion/batch-ingestion.html).
|
||||
|
||||
### Server
|
||||
|
||||
Druid can use [Tranquility Server](https://github.com/druid-io/tranquility/blob/master/docs/server.md), which
|
||||
lets you send data to Druid without developing a JVM app. You can run Tranquility server colocated with Druid middleManagers
|
||||
Druid can use [Tranquility Server](https://github.com/druid-io/tranquility/blob/master/docs/server.md), which
|
||||
lets you send data to Druid without developing a JVM app. You can run Tranquility server colocated with Druid middleManagers
|
||||
and historical processes.
|
||||
|
||||
Tranquility server is started by issuing:
|
||||
|
@ -33,7 +33,7 @@ bin/tranquility server -configFile <path_to_config_file>/server.json
|
|||
To customize Tranquility Server:
|
||||
|
||||
- In `server.json`, customize the `properties` and `dataSources`.
|
||||
- If you have servers already running Tranquility, stop them (CTRL-C) and start
|
||||
- If you have servers already running Tranquility, stop them (CTRL-C) and start
|
||||
them up again.
|
||||
|
||||
For tips on customizing `server.json`, see the
|
||||
|
@ -42,8 +42,8 @@ For tips on customizing `server.json`, see the
|
|||
|
||||
### Kafka
|
||||
|
||||
[Tranquility Kafka](https://github.com/druid-io/tranquility/blob/master/docs/kafka.md)
|
||||
lets you load data from Kafka into Druid without writing any code. You only need a configuration
|
||||
[Tranquility Kafka](https://github.com/druid-io/tranquility/blob/master/docs/kafka.md)
|
||||
lets you load data from Kafka into Druid without writing any code. You only need a configuration
|
||||
file.
|
||||
|
||||
Tranquility server is started by issuing:
|
||||
|
@ -57,80 +57,80 @@ To customize Tranquility Kafka in the single-machine quickstart configuration:
|
|||
- In `kafka.json`, customize the `properties` and `dataSources`.
|
||||
- If you have Tranquility already running, stop it (CTRL-C) and start it up again.
|
||||
|
||||
For tips on customizing `kafka.json`, see the
|
||||
For tips on customizing `kafka.json`, see the
|
||||
[Tranquility Kafka documentation](https://github.com/druid-io/tranquility/blob/master/docs/kafka.md).
|
||||
|
||||
### JVM apps and stream processors
|
||||
|
||||
Tranquility can also be embedded in JVM-based applications as a library. You can do this directly
|
||||
in your own program using the
|
||||
[Core API](https://github.com/druid-io/tranquility/blob/master/docs/core.md), or you can use
|
||||
the connectors bundled in Tranquility for popular JVM-based stream processors such as
|
||||
[Storm](https://github.com/druid-io/tranquility/blob/master/docs/storm.md),
|
||||
[Samza](https://github.com/druid-io/tranquility/blob/master/docs/samza.md),
|
||||
[Spark Streaming](https://github.com/druid-io/tranquility/blob/master/docs/spark.md), and
|
||||
Tranquility can also be embedded in JVM-based applications as a library. You can do this directly
|
||||
in your own program using the
|
||||
[Core API](https://github.com/druid-io/tranquility/blob/master/docs/core.md), or you can use
|
||||
the connectors bundled in Tranquility for popular JVM-based stream processors such as
|
||||
[Storm](https://github.com/druid-io/tranquility/blob/master/docs/storm.md),
|
||||
[Samza](https://github.com/druid-io/tranquility/blob/master/docs/samza.md),
|
||||
[Spark Streaming](https://github.com/druid-io/tranquility/blob/master/docs/spark.md), and
|
||||
[Flink](https://github.com/druid-io/tranquility/blob/master/docs/flink.md).
|
||||
|
||||
## Concepts
|
||||
|
||||
### Task creation
|
||||
|
||||
Tranquility automates creation of Druid realtime indexing tasks, handling partitioning, replication,
|
||||
service discovery, and schema rollover for you, seamlessly and without downtime. You never have to
|
||||
write code to deal with individual tasks directly. But, it can be helpful to understand how
|
||||
Tranquility automates creation of Druid realtime indexing tasks, handling partitioning, replication,
|
||||
service discovery, and schema rollover for you, seamlessly and without downtime. You never have to
|
||||
write code to deal with individual tasks directly. But, it can be helpful to understand how
|
||||
Tranquility creates tasks.
|
||||
|
||||
Tranquility spawns relatively short-lived tasks periodically, and each one handles a small number of
|
||||
[Druid segments](../design/segments.html). Tranquility coordinates all task
|
||||
creation through ZooKeeper. You can start up as many Tranquility instances as you like with the same
|
||||
Tranquility spawns relatively short-lived tasks periodically, and each one handles a small number of
|
||||
[Druid segments](../design/segments.html). Tranquility coordinates all task
|
||||
creation through ZooKeeper. You can start up as many Tranquility instances as you like with the same
|
||||
configuration, even on different machines, and they will send to the same set of tasks.
|
||||
|
||||
See the [Tranquility overview](https://github.com/druid-io/tranquility/blob/master/docs/overview.md)
|
||||
See the [Tranquility overview](https://github.com/druid-io/tranquility/blob/master/docs/overview.md)
|
||||
for more details about how Tranquility manages tasks.
|
||||
|
||||
### segmentGranularity and windowPeriod
|
||||
|
||||
The segmentGranularity is the time period covered by the segments produced by each task. For
|
||||
example, a segmentGranularity of "hour" will spawn tasks that create segments covering one hour
|
||||
The segmentGranularity is the time period covered by the segments produced by each task. For
|
||||
example, a segmentGranularity of "hour" will spawn tasks that create segments covering one hour
|
||||
each.
|
||||
|
||||
The windowPeriod is the slack time permitted for events. For example, a windowPeriod of ten minutes
|
||||
(the default) means that any events with a timestamp older than ten minutes in the past, or more
|
||||
The windowPeriod is the slack time permitted for events. For example, a windowPeriod of ten minutes
|
||||
(the default) means that any events with a timestamp older than ten minutes in the past, or more
|
||||
than ten minutes in the future, will be dropped.
|
||||
|
||||
These are important configurations because they influence how long tasks will be alive for, and how
|
||||
long data stays in the realtime system before being handed off to the historical nodes. For example,
|
||||
if your configuration has segmentGranularity "hour" and windowPeriod ten minutes, tasks will stay
|
||||
around listening for events for an hour and ten minutes. For this reason, to prevent excessive
|
||||
These are important configurations because they influence how long tasks will be alive for, and how
|
||||
long data stays in the realtime system before being handed off to the historical nodes. For example,
|
||||
if your configuration has segmentGranularity "hour" and windowPeriod ten minutes, tasks will stay
|
||||
around listening for events for an hour and ten minutes. For this reason, to prevent excessive
|
||||
buildup of tasks, it is recommended that your windowPeriod be less than your segmentGranularity.
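As a sketch, these two settings sit in the standard Druid realtime spec layout, which is what a Tranquility dataSource spec wraps. The fragment below omits all other required fields and uses illustrative values:

```json
{
  "dataSchema" : {
    "granularitySpec" : {
      "type" : "uniform",
      "segmentGranularity" : "hour",
      "queryGranularity" : "none"
    }
  },
  "tuningConfig" : {
    "type" : "realtime",
    "windowPeriod" : "PT10M"
  }
}
```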
|
||||
|
||||
### Append only
|
||||
|
||||
Druid streaming ingestion is *append-only*, meaning you cannot use streaming ingestion to update or
|
||||
delete individual records after they are inserted. If you need to update or delete individual
|
||||
records, you need to use a batch reindexing process. See the *[batch ingest](batch-ingestion.html)*
|
||||
Druid streaming ingestion is *append-only*, meaning you cannot use streaming ingestion to update or
|
||||
delete individual records after they are inserted. If you need to update or delete individual
|
||||
records, you need to use a batch reindexing process. See the *[batch ingest](batch-ingestion.html)*
|
||||
page for more details.
|
||||
|
||||
Druid does support efficient deletion of entire time ranges without resorting to batch reindexing.
|
||||
Druid does support efficient deletion of entire time ranges without resorting to batch reindexing.
|
||||
This can be done automatically through setting up retention policies.
|
||||
|
||||
### Guarantees
|
||||
|
||||
Tranquility operates under a best-effort design. It tries reasonably hard to preserve your data, by allowing you to set
|
||||
up replicas and by retrying failed pushes for a period of time, but it does not guarantee that your events will be
|
||||
Tranquility operates under a best-effort design. It tries reasonably hard to preserve your data, by allowing you to set
|
||||
up replicas and by retrying failed pushes for a period of time, but it does not guarantee that your events will be
|
||||
processed exactly once. In some conditions, it can drop or duplicate events:
|
||||
|
||||
- Events with timestamps outside your configured windowPeriod will be dropped.
|
||||
- If you suffer more Druid Middle Manager failures than your configured replicas count, some
|
||||
- If you suffer more Druid Middle Manager failures than your configured replicas count, some
|
||||
partially indexed data may be lost.
|
||||
- If there is a persistent issue that prevents communication with the Druid indexing service, and
|
||||
retry policies are exhausted during that period, or the period lasts longer than your windowPeriod,
|
||||
- If there is a persistent issue that prevents communication with the Druid indexing service, and
|
||||
retry policies are exhausted during that period, or the period lasts longer than your windowPeriod,
|
||||
some events will be dropped.
|
||||
- If there is an issue that prevents Tranquility from receiving an acknowledgement from the indexing
|
||||
- If there is an issue that prevents Tranquility from receiving an acknowledgement from the indexing
|
||||
service, it will retry the batch, which can lead to duplicated events.
|
||||
- If you are using Tranquility inside Storm or Samza, various parts of both architectures have an
|
||||
- If you are using Tranquility inside Storm or Samza, various parts of both architectures have an
|
||||
at-least-once design and can lead to duplicated events.
|
||||
|
||||
Under normal operation, these risks are minimal. But if you need absolute 100% fidelity for
|
||||
historical data, we recommend a [hybrid batch/streaming](../tutorials/ingestion.html#hybrid-batch-streaming)
|
||||
Under normal operation, these risks are minimal. But if you need absolute 100% fidelity for
|
||||
historical data, we recommend a [hybrid batch/streaming](../tutorials/ingestion.html#hybrid-batch-streaming)
|
||||
architecture.
|
||||
|
|
|
@ -25,21 +25,12 @@ To let Druid load your extensions, follow the steps below
|
|||
|
||||
Example:
|
||||
|
||||
Suppose you specify `druid.extensions.directory=/usr/local/druid/extensions`, and want Druid to load normal extensions ```druid-examples```, ```druid-kafka-eight``` and ```mysql-metadata-storage```.
|
||||
Suppose you specify `druid.extensions.directory=/usr/local/druid/extensions`, and want Druid to load normal extensions ```druid-kafka-eight``` and ```mysql-metadata-storage```.
|
||||
|
||||
Then under ```extensions```, it should look like this,
|
||||
|
||||
```
|
||||
extensions/
|
||||
├── druid-examples
|
||||
│ ├── commons-beanutils-1.8.3.jar
|
||||
│ ├── commons-digester-1.8.jar
|
||||
│ ├── commons-logging-1.1.1.jar
|
||||
│ ├── commons-validator-1.4.0.jar
|
||||
│ ├── druid-examples-0.8.0-rc1.jar
|
||||
│ ├── twitter4j-async-3.0.3.jar
|
||||
│ ├── twitter4j-core-3.0.3.jar
|
||||
│ └── twitter4j-stream-3.0.3.jar
|
||||
├── druid-kafka-eight
|
||||
│ ├── druid-kafka-eight-0.7.3.jar
|
||||
│ ├── jline-0.9.94.jar
|
||||
|
@ -61,7 +52,7 @@ extensions/
|
|||
└── mysql-metadata-storage-0.8.0-rc1.jar
|
||||
```
|
||||
|
||||
As you can see, under ```extensions``` there are three sub-directories ```druid-examples```, ```druid-kafka-eight``` and ```mysql-metadata-storage```, each sub-directory denotes an extension that Druid might load.
|
||||
As you can see, under ```extensions``` there are two sub-directories ```druid-kafka-eight``` and ```mysql-metadata-storage```, each sub-directory denotes an extension that Druid might load.
|
||||
|
||||
3) To have Druid load a specific list of extensions present under the root extension directory, set `druid.extensions.loadList` to the list of extensions to load. Using the example above, if you want Druid to load ```druid-kafka-eight``` and ```mysql-metadata-storage```, you can specify `druid.extensions.loadList=["druid-kafka-eight", "mysql-metadata-storage"]`.
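Putting the two properties together, a hypothetical `common.runtime.properties` fragment would be:

```
druid.extensions.directory=/usr/local/druid/extensions
druid.extensions.loadList=["druid-kafka-eight", "mysql-metadata-storage"]
```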
|
||||
|
||||
|
|
|
@ -114,9 +114,14 @@ your functionality as a native Java aggregator.
|
|||
The javascript aggregator is recommended for rapidly prototyping features. This aggregator will be much slower in production
|
||||
use than a native Java aggregator.
|
||||
|
||||
## Approximate Aggregations
|
||||
|
||||
### Cardinality aggregator
|
||||
|
||||
Computes the cardinality of a set of Druid dimensions, using HyperLogLog to estimate the cardinality.
|
||||
Computes the cardinality of a set of Druid dimensions, using HyperLogLog to estimate the cardinality. Please note that this
|
||||
aggregator will be much slower than indexing a column with the hyperUnique aggregator. This aggregator also runs over a dimension column, which
|
||||
means the string dimension cannot be removed from the dataset to improve rollup. In general, we strongly recommend using the hyperUnique aggregator
|
||||
instead of the cardinality aggregator if you do not care about the individual values of a dimension.
|
||||
|
||||
```json
|
||||
{
|
||||
|
@ -181,10 +186,6 @@ Determine the number of distinct people (i.e. combinations of first and last nam
|
|||
}
|
||||
```
|
||||
|
||||
## Complex Aggregations
|
||||
|
||||
Druid supports complex aggregations such as various types of approximate sketches.
|
||||
|
||||
### HyperUnique aggregator
|
||||
|
||||
Uses [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) to compute the estimated cardinality of a dimension that has been aggregated as a "hyperUnique" metric at indexing time.
|
||||
|
@ -193,6 +194,8 @@ Uses [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) to
|
|||
{ "type" : "hyperUnique", "name" : <output_name>, "fieldName" : <metric_name> }
|
||||
```
|
||||
|
||||
For more approximate aggregators, please see [theta sketches](../development/datasketches-aggregators.html).
|
||||
|
||||
## Miscellaneous Aggregations
|
||||
|
||||
### Filtered Aggregator
|
||||
|
|
|
@ -361,11 +361,13 @@ Then groupBy/topN processing pipeline "explodes" all multi-valued dimensions res
|
|||
In addition to "query filter" which efficiently selects the rows to be processed, you can use the filtering dimension spec to filter for specific values within the values of a multi-valued dimension. These dimensionSpecs take a delegate DimensionSpec and a filtering criteria. From the "exploded" rows, only rows matching the given filtering criteria are returned in the query result.
|
||||
|
||||
The following filtered dimension spec acts as a whitelist or blacklist for values as per the "isWhitelist" attribute value.
|
||||
|
||||
```json
|
||||
{ "type" : "listFiltered", "delegate" : <dimensionSpec>, "values": <array of strings>, "isWhitelist": <optional attribute for true/false, default is true> }
|
||||
```
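For instance, a hypothetical whitelist over a multi-valued `tags` dimension that keeps only the values `t1` and `t3` (the delegate here is a plain default dimension spec; all names are illustrative):

```json
{
  "type" : "listFiltered",
  "delegate" : { "type" : "default", "dimension" : "tags", "outputName" : "tags" },
  "values" : ["t1", "t3"]
}
```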
|
||||
|
||||
Following filtered dimension spec retains only the values matching regex. Note that `listFiltered` is faster than this and one should use that for whitelist or blacklist usecase.
|
||||
|
||||
```json
|
||||
{ "type" : "regexFiltered", "delegate" : <dimensionSpec>, "pattern": <java regex pattern> }
|
||||
```
|
||||
|
|
|
@ -5,31 +5,31 @@ layout: doc_page
|
|||
|
||||
Druid has limited support for joins through [query-time lookups](../querying/lookups.html). The common use case of
|
||||
query-time lookups is to replace one dimension value (e.g. a String ID) with another value (e.g. a human-readable
String value).
String value). This is similar to a star-schema join.
|
||||
|
||||
Druid does not yet have full support for joins. Although Druid’s storage format would allow for the implementation
|
||||
Druid does not yet have full support for joins. Although Druid’s storage format would allow for the implementation
|
||||
of joins (there is no loss of fidelity for columns included as dimensions), full support for joins has not yet been implemented
|
||||
for the following reasons:
|
||||
|
||||
1. Scaling join queries has been, in our professional experience,
|
||||
1. Scaling join queries has been, in our professional experience,
|
||||
a constant bottleneck of working with distributed databases.
|
||||
2. The incremental gains in functionality are perceived to be
|
||||
of less value than the anticipated problems with managing
|
||||
2. The incremental gains in functionality are perceived to be
|
||||
of less value than the anticipated problems with managing
|
||||
highly concurrent, join-heavy workloads.
|
||||
|
||||
A join query is essentially the merging of two or more streams of data based on a shared set of keys. The primary
|
||||
high-level strategies for join queries we are aware of are a hash-based strategy or a
|
||||
sorted-merge strategy. The hash-based strategy requires that all but
|
||||
one data set be available as something that looks like a hash table,
|
||||
a lookup operation is then performed on this hash table for every
|
||||
row in the “primary” stream. The sorted-merge strategy assumes
|
||||
that each stream is sorted by the join key and thus allows for the incremental
|
||||
joining of the streams. Each of these strategies, however,
|
||||
requires the materialization of some number of the streams either in
|
||||
A join query is essentially the merging of two or more streams of data based on a shared set of keys. The primary
|
||||
high-level strategies for join queries we are aware of are a hash-based strategy or a
|
||||
sorted-merge strategy. The hash-based strategy requires that all but
|
||||
one data set be available as something that looks like a hash table,
|
||||
a lookup operation is then performed on this hash table for every
|
||||
row in the “primary” stream. The sorted-merge strategy assumes
|
||||
that each stream is sorted by the join key and thus allows for the incremental
|
||||
joining of the streams. Each of these strategies, however,
|
||||
requires the materialization of some number of the streams either in
|
||||
sorted order or in a hash table form.
|
||||
|
||||
When all sides of the join are significantly large tables (> 1 billion
|
||||
records), materializing the pre-join streams requires complex
|
||||
distributed memory management. The complexity of the memory
|
||||
management is only amplified by the fact that we are targeting highly
|
||||
When all sides of the join are significantly large tables (> 1 billion
|
||||
records), materializing the pre-join streams requires complex
|
||||
distributed memory management. The complexity of the memory
|
||||
management is only amplified by the fact that we are targeting highly
|
||||
concurrent, multi-tenant workloads.
|
||||
|
|
|
@ -103,7 +103,7 @@ It can be used in a sample calculation as so:
|
|||
}]
|
||||
```
|
||||
|
||||
#### Example Usage
|
||||
## Example Usage
|
||||
|
||||
In this example, let’s calculate a simple percentage using post aggregators. Let’s imagine our data set has a metric called "total".
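For instance, a hypothetical post aggregation that expresses a metric `part` as a percentage of `total` (both metric names are illustrative) could nest `arithmetic`, `fieldAccess`, and `constant` post aggregators:

```json
{
  "type"   : "arithmetic",
  "name"   : "part_percentage",
  "fn"     : "*",
  "fields" : [
    { "type" : "arithmetic", "name" : "ratio", "fn" : "/", "fields" : [
      { "type" : "fieldAccess", "name" : "part", "fieldName" : "part" },
      { "type" : "fieldAccess", "name" : "total", "fieldName" : "total" }
    ]},
    { "type" : "constant", "name" : "const", "value" : 100 }
  ]
}
```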
|
||||
|
||||
|
|
|
@ -6,32 +6,32 @@ layout: doc_page
|
|||
|
||||
Druid is designed to be deployed as a scalable, fault-tolerant cluster.
|
||||
|
||||
In this document, we'll set up a simple cluster and discuss how it can be further configured to meet
|
||||
your needs. This simple cluster will feature scalable, fault-tolerant servers for Historicals and MiddleManagers, and a single
|
||||
coordination server to host the Coordinator and Overlord processes. In production, we recommend deploying Coordinators and Overlords in a fault-tolerant
|
||||
In this document, we'll set up a simple cluster and discuss how it can be further configured to meet
|
||||
your needs. This simple cluster will feature scalable, fault-tolerant servers for Historicals and MiddleManagers, and a single
|
||||
coordination server to host the Coordinator and Overlord processes. In production, we recommend deploying Coordinators and Overlords in a fault-tolerant
|
||||
configuration as well.
|
||||
|
||||
## Select hardware
|
||||
|
||||
The Coordinator and Overlord processes can be co-located on a single server that is responsible for handling the metadata and coordination needs of your cluster.
|
||||
The equivalent of an AWS [m3.xlarge](https://aws.amazon.com/ec2/instance-types/#M3) is sufficient for most clusters. This
|
||||
The Coordinator and Overlord processes can be co-located on a single server that is responsible for handling the metadata and coordination needs of your cluster.
|
||||
The equivalent of an AWS [m3.xlarge](https://aws.amazon.com/ec2/instance-types/#M3) is sufficient for most clusters. This
|
||||
hardware offers:
|
||||
|
||||
- 4 vCPUs
|
||||
- 15 GB RAM
|
||||
- 80 GB SSD storage
|
||||
|
||||
Historicals and MiddleManagers can be colocated on a single server to handle the actual data in your cluster. These servers benefit greatly from CPU, RAM,
|
||||
and SSDs. The equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3) is a
|
||||
Historicals and MiddleManagers can be colocated on a single server to handle the actual data in your cluster. These servers benefit greatly from CPU, RAM,
|
||||
and SSDs. The equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3) is a
|
||||
good starting point. This hardware offers:
|
||||
|
||||
- 8 vCPUs
|
||||
- 61 GB RAM
|
||||
- 160 GB SSD storage
|
||||
|
||||
Druid Brokers accept queries and farm them out to the rest of the cluster. They also optionally maintain an
|
||||
in-memory query cache. These servers benefit greatly from CPU and RAM, and can also be deployed on
|
||||
the equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3). This hardware
|
||||
Druid Brokers accept queries and farm them out to the rest of the cluster. They also optionally maintain an
|
||||
in-memory query cache. These servers benefit greatly from CPU and RAM, and can also be deployed on
|
||||
the equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3). This hardware
|
||||
offers:
|
||||
|
||||
- 8 vCPUs
|
||||
|
@ -48,14 +48,14 @@ We recommend running your favorite Linux distribution. You will also need:
|
|||
|
||||
* Java 7 or better
|
||||
|
||||
Your OS package manager should be able to help with Java. If your Ubuntu-based OS
does not have a recent enough version of Java, WebUpd8 offers [packages for those
Your OS package manager should be able to help with Java. If your Ubuntu-based OS
does not have a recent enough version of Java, WebUpd8 offers [packages for those
OSes](http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html).
|
||||
|
||||
## Download the distribution
|
||||
|
||||
First, download and unpack the release archive. It's best to do this on a single machine at first,
|
||||
since you will be editing the configurations and then copying the modified distribution out to all
|
||||
First, download and unpack the release archive. It's best to do this on a single machine at first,
|
||||
since you will be editing the configurations and then copying the modified distribution out to all
|
||||
of your servers.
|
||||
|
||||
```bash
|
||||
|
@ -80,8 +80,8 @@ We'll be editing the files in `conf/` in order to get things running.
|
|||
|
||||
## Configure deep storage
|
||||
|
||||
Druid relies on a distributed filesystem or large object (blob) store for data storage. The most
|
||||
commonly used deep storage implementations are S3 (popular for those on AWS) and HDFS (popular if
|
||||
Druid relies on a distributed filesystem or large object (blob) store for data storage. The most
|
||||
commonly used deep storage implementations are S3 (popular for those on AWS) and HDFS (popular if
|
||||
you already have a Hadoop deployment).
|
||||
|
||||
### S3
|
||||
|
@ -148,13 +148,13 @@ druid.indexer.logs.directory=/druid/indexing-logs
|
|||
|
||||
Also,
|
||||
|
||||
- Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml,
|
||||
mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into
|
||||
- Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml,
|
||||
mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into
|
||||
`conf/druid/_common/`.
|
||||
|
||||
## Configure Tranquility Server (optional)
|
||||
|
||||
Data streams can be sent to Druid through a simple HTTP API powered by Tranquility
|
||||
Data streams can be sent to Druid through a simple HTTP API powered by Tranquility
|
||||
Server. If you will be using this functionality, then at this point you should [configure
|
||||
Tranquility Server](../ingestion/stream-ingestion.html#server).
|
||||
|
||||
|
@ -166,56 +166,56 @@ using this functionality, then at this point you should
|
|||
|
||||
## Configure for connecting to Hadoop (optional)
|
||||
|
||||
If you will be loading data from a Hadoop cluster, then at this point you should configure Druid to be aware
|
||||
If you will be loading data from a Hadoop cluster, then at this point you should configure Druid to be aware
|
||||
of your cluster:
|
||||
|
||||
- Update `druid.indexer.task.hadoopWorkingPath` in `conf/middleManager/runtime.properties` to
|
||||
a path on HDFS that you'd like to use for temporary files required during the indexing process.
|
||||
- Update `druid.indexer.task.hadoopWorkingPath` in `conf/middleManager/runtime.properties` to
|
||||
a path on HDFS that you'd like to use for temporary files required during the indexing process.
|
||||
`druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing` is a common choice.
|
||||
|
||||
- Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml,
|
||||
mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into
|
||||
- Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml,
|
||||
mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into
|
||||
`conf/druid/_common/core-site.xml`, `conf/druid/_common/hdfs-site.xml`, and so on.

Note that you don't need to use HDFS deep storage in order to load data from Hadoop. For example, if
your cluster is running on Amazon Web Services, we recommend using S3 for deep storage even if you
are loading data using Hadoop or Elastic MapReduce.

For more info, please see [batch ingestion](../ingestion/batch-ingestion.html).

## Configure addresses for Druid coordination

In this simple cluster, you will deploy a single Druid Coordinator, a
single Druid Overlord, a single ZooKeeper instance, and an embedded Derby metadata store on the same server.

In `conf/druid/_common/common.runtime.properties`, replace
"zk.host.ip" with the IP address of the machine that runs your ZK instance:

- `druid.zk.service.host`

In `conf/druid/_common/common.runtime.properties`, replace
"metadata.store.ip" with the IP address of the machine that you will use as your metadata store:

- `druid.metadata.storage.connector.connectURI`
- `druid.metadata.storage.connector.host`
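
As a rough sketch, if that machine had the placeholder IP address `10.0.0.2` and you kept the default Derby metadata store, the edited lines might end up looking like this (the exact `connectURI` value depends on which metadata store you configured; only the host portion is meant to change):

```
# 10.0.0.2 is a placeholder IP; the connectURI shown assumes the default Derby settings.
druid.zk.service.host=10.0.0.2
druid.metadata.storage.connector.connectURI=jdbc:derby://10.0.0.2:1527/var/druid/metadata.db;create=true
druid.metadata.storage.connector.host=10.0.0.2
```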

<div class="note caution">
In production, we recommend running 2 servers, each running a Druid Coordinator
and a Druid Overlord. We also recommend running a ZooKeeper cluster on its own dedicated hardware,
as well as replicated [metadata
storage](http://druid.io/docs/latest/dependencies/metadata-storage.html) such as MySQL or
PostgreSQL, on its own dedicated hardware.
</div>

## Tune Druid processes that serve queries

Druid Historicals and MiddleManagers can be co-located on the same hardware. Both Druid processes benefit greatly from
being tuned to the hardware they run on. If you are running Tranquility Server or Kafka, you can also colocate Tranquility with these two Druid processes.
If you are using [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3)
EC2 instances, or similar hardware, the configuration in the distribution is a
reasonable starting point.

If you are using different hardware, we recommend adjusting configurations for your specific
hardware. The most commonly adjusted configurations are:

- `-Xmx` and `-Xms`

@ -227,20 +227,20 @@ hardware. The most commonly adjusted configurations are:

- `druid.server.maxSize` and `druid.segmentCache.locations` on Historical Nodes
- `druid.worker.capacity` on MiddleManagers

<div class="note info">
Keep -XX:MaxDirectMemorySize >= numThreads*sizeBytes, otherwise Druid will fail to start up.
</div>

Please see the Druid [configuration documentation](../configuration/index.html) for a full description of all
possible configuration options.
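
To make the direct-memory rule concrete: with `druid.processing.numThreads=7` and `druid.processing.buffer.sizeBytes=536870912` (512MB), the processing buffers alone need at least 7 * 512MB = 3.5GB of direct memory. A Historical `jvm.config` for such a machine might then look roughly like the following sketch; the heap and direct-memory sizes here are illustrative assumptions, not recommendations:

```
# Illustrative values only; size these to your own hardware and runtime.properties.
-server
-Xms8g
-Xmx8g
-XX:MaxDirectMemorySize=4096m
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
```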

## Tune Druid Brokers

Druid Brokers also benefit greatly from being tuned to the hardware they
run on. If you are using [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3) EC2 instances,
or similar hardware, the configuration in the distribution is a reasonable starting point.

If you are using different hardware, we recommend adjusting configurations for your specific
hardware. The most commonly adjusted configurations are:

- `-Xmx` and `-Xms`

@ -251,17 +251,17 @@ hardware. The most commonly adjusted configurations are:

- `druid.query.groupBy.maxIntermediateRows`
- `druid.query.groupBy.maxResults`

<div class="note caution">
Keep -XX:MaxDirectMemorySize >= numThreads*sizeBytes, otherwise Druid will fail to start up.
</div>

Please see the Druid [configuration documentation](../configuration/index.html) for a full description of all
possible configuration options.

## Start Coordinator, Overlord, Zookeeper, and metadata store

Copy the Druid distribution and your edited configurations to your coordination
server. If you have been editing the configurations on your local machine, you can use *rsync* to
copy them:

```bash

@ -269,18 +269,19 @@ rsync -az druid-0.9.0/ COORDINATION_SERVER:druid-0.9.0/
```

Log on to your coordination server and install Zookeeper:

```bash
curl http://www.gtlib.gatech.edu/pub/apache/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz -o zookeeper-3.4.6.tar.gz
tar -xzf zookeeper-3.4.6.tar.gz
cd zookeeper-3.4.6
cp conf/zoo_sample.cfg conf/zoo.cfg
./bin/zkServer.sh start
```

<div class="note caution">
In production, we also recommend running a ZooKeeper cluster on its own dedicated hardware.
</div>

On your coordination server, *cd* into the distribution and start up the coordination services (you should do this in different windows or pipe the log to a file):

```bash

@ -288,7 +289,7 @@ java `cat conf/druid/coordinator/jvm.config | xargs` -cp conf/druid/_common:conf
java `cat conf/druid/overlord/jvm.config | xargs` -cp conf/druid/_common:conf/druid/overlord:lib/* io.druid.cli.Main server overlord
```

You should see a log message printed out for each service that starts up. You can view detailed logs
for any service by looking in the `var/log/druid` directory using another terminal.

## Start Historicals and MiddleManagers

@ -304,15 +305,15 @@ java `cat conf/druid/middleManager/jvm.config | xargs` -cp conf/druid/_common:co

You can add more servers with Druid Historicals and MiddleManagers as needed.
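
On each such server, the start commands follow the same pattern as the coordinator, overlord, and broker commands shown in this guide; a sketch, assuming the same distribution layout:

```bash
java `cat conf/druid/historical/jvm.config | xargs` -cp conf/druid/_common:conf/druid/historical:lib/* io.druid.cli.Main server historical
java `cat conf/druid/middleManager/jvm.config | xargs` -cp conf/druid/_common:conf/druid/middleManager:lib/* io.druid.cli.Main server middleManager
```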

<div class="note info">
For clusters with complex resource allocation needs, you can break apart Historicals and MiddleManagers and scale the components individually.
This also allows you to take advantage of Druid's built-in MiddleManager
autoscaling facility.
</div>

If you are doing push-based stream ingestion with Kafka or over HTTP, you can also start Tranquility server on the same
hardware that holds MiddleManagers and Historicals. For large scale production, MiddleManagers and Tranquility server
can still be co-located. If you are running Tranquility (not server) with a stream processor, you can co-locate
Tranquility with the stream processor and not require Tranquility server.

```bash

@ -324,9 +325,9 @@ bin/tranquility <server or kafka> -configFile <path_to_druid_distro>/conf/tranqu

## Start Druid Broker

Copy the Druid distribution and your edited configurations to your servers set aside for the Druid Brokers.

On each one, *cd* into the distribution and run this command to start a Broker (you may want to pipe the output to a log file):

```bash
java `cat conf/druid/broker/jvm.config | xargs` -cp conf/druid/_common:conf/druid/broker:lib/* io.druid.cli.Main server broker

@ -336,5 +337,5 @@ You can add more Brokers as needed based on query load.

## Loading data

Congratulations, you now have a Druid cluster! The next step is to learn about recommended ways to load data into
Druid based on your use case. Read more about [loading data](ingestion.html).

@ -9,15 +9,15 @@ layout: doc_page

Druid supports streaming (real-time) and file-based (batch) ingestion methods. The most
popular configurations are:

- [Files](../ingestion/batch-ingestion.html) - Load data from HDFS, S3, local files, or any supported Hadoop
  filesystem in batches. We recommend this method if your dataset is already in flat files.

- [Stream push](../ingestion/stream-ingestion.html#stream-push) - Push a data stream into Druid in real-time
  using [Tranquility](http://github.com/druid-io/tranquility), a client library for sending streams
  to Druid. We recommend this method if your dataset originates in a streaming system like Kafka,
  Storm, Spark Streaming, or your own system.

- [Stream pull](../ingestion/stream-ingestion.html#stream-pull) - Pull a data stream directly from an external
  data source into Druid using Realtime Nodes.

## Getting started

@ -15,17 +15,17 @@ You will need:

* 8G of RAM
* 2 vCPUs

On Mac OS X, you can use [Oracle's JDK
8](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) to install
Java.

On Linux, your OS package manager should be able to help you install Java. If your Ubuntu-based
OS does not have a recent enough version of Java, WebUpd8 offers [packages for those
OSes](http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html).
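
Either way, you can check which Java version is on your path with:

```bash
java -version
```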

## Getting started

To install Druid, issue the following commands in your terminal:

```bash
curl -O http://static.druid.io/artifacts/releases/druid-0.9.0-bin.tar.gz

@ -46,7 +46,7 @@ In the package, you should find:

## Start up Zookeeper

Druid currently has a dependency on [Apache ZooKeeper](http://zookeeper.apache.org/) for distributed coordination. You'll
need to download and run Zookeeper.

```bash

@ -65,8 +65,9 @@ With Zookeeper running, return to the druid-0.9.0 directory. In that directory,

bin/init
```

This will set up some directories for you. Next, you can start up the Druid processes in different terminal windows.
This tutorial runs every Druid process on the same system. In a large distributed production cluster,
many of these Druid processes can still be co-located together.

```bash
java `cat conf-quickstart/druid/historical/jvm.config | xargs` -cp conf-quickstart/druid/_common:conf-quickstart/druid/historical:lib/* io.druid.cli.Main server historical

@ -78,7 +79,7 @@ java `cat conf-quickstart/druid/middleManager/jvm.config | xargs` -cp conf-quick

You should see a log message printed out for each service that starts up.

Later on, if you'd like to stop the services, CTRL-C to exit from the running java processes. If you
want a clean start after stopping the services, delete the `var` directory and run the `init` script again.
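
For example, a clean restart from the druid-0.9.0 directory could look like this:

```bash
# Stop the Druid processes first (CTRL-C in each terminal), then:
rm -rf var
bin/init
```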

Once every service has started, you are now ready to load data.

@ -87,13 +88,13 @@ Once every service has started, you are now ready to load data.

We've included a sample of Wikipedia edits from September 12, 2015 to get you started.

<div class="note info">
This section shows you how to load data in batches, but you can skip ahead to learn how to <a href="quickstart.html#load-streaming-data">load
streams in real-time</a>. Druid's streaming ingestion can load data
with virtually no delay between events occurring and being available for queries.
</div>

The [dimensions](https://en.wikipedia.org/wiki/Dimension_%28data_warehouse%29) (attributes you can
filter and split on) in the Wikipedia dataset, other than time, are:

* channel

@ -113,7 +114,7 @@ filter and split on) in the Wikipedia dataset, other than time, are:

* regionName
* user

The [measures](https://en.wikipedia.org/wiki/Measure_%28data_warehouse%29), or *metrics* as they are known in Druid (values you can aggregate)
in the Wikipedia dataset are:

* count

@ -122,8 +123,8 @@ in the Wikipedia dataset are:

* delta
* user_unique

To load this data into Druid, you can submit an *ingestion task* pointing to the file. We've included
a task that loads the `wikiticker-2015-09-12-sampled.json` file included in the archive. To submit
this task, POST it to Druid in a new terminal window from the druid-0.9.0 directory:

```bash

@ -132,27 +133,27 @@ curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/wikiticker-inde

Which will print the ID of the task if the submission was successful:

```bash
{"task":"index_hadoop_wikipedia_2013-10-09T21:30:32.802Z"}
```

To view the status of your ingestion task, go to your overlord console:
[http://localhost:8090/console.html](http://localhost:8090/console.html). You can refresh the console periodically, and after
the task is successful, you should see a "SUCCESS" status for the task.
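
If you prefer the command line, you can also poll the Overlord's task status API; the task ID below is the example ID printed above, so substitute your own:

```bash
# Returns a small JSON document describing the task's current status.
curl http://localhost:8090/druid/indexer/v1/task/index_hadoop_wikipedia_2013-10-09T21:30:32.802Z/status
```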

After your ingestion task finishes, the data will be loaded by historical nodes and available for
querying within a minute or two. You can monitor the progress of loading your data in the
coordinator console, by checking whether there is a datasource "wikiticker" with a blue circle
indicating "fully available": [http://localhost:8081/#/](http://localhost:8081/#/).

Once the data is fully available, you can immediately query it. To see how, skip to the [Query
data](#query-data) section below. Or, continue to the [Load your own data](#load-your-own-data)
section if you'd like to load a different dataset.
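
As a quick taste before that, here is a minimal timeseries query sketch that counts the day's edits, POSTed straight to the broker (port 8082 is the quickstart default; the aggregator uses the `count` metric listed above):

```bash
# A minimal sketch; assumes the quickstart broker on port 8082.
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8082/druid/v2/?pretty' -d '{
  "queryType": "timeseries",
  "dataSource": "wikiticker",
  "granularity": "all",
  "intervals": ["2015-09-12/2015-09-13"],
  "aggregations": [{"type": "longSum", "name": "edits", "fieldName": "count"}]
}'
```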

## Load streaming data

To load streaming data, we are going to push events into Druid
over a simple HTTP API. We will do this using [Tranquility], a high level data producer
library for Druid.

To download Tranquility, issue the following commands in your terminal:

@ -163,19 +164,19 @@ tar -xzf tranquility-distribution-0.7.2.tgz

cd tranquility-distribution-0.7.2
```

We've included a configuration file in `conf-quickstart/tranquility/server.json` as part of the Druid distribution
for a *metrics* datasource. We're going to start the Tranquility server process, which can be used to push events
directly to Druid.

```bash
bin/tranquility server -configFile <path_to_druid_distro>/conf-quickstart/tranquility/server.json
```

<div class="note info">
This section shows you how to load data using Tranquility Server, but Druid also supports a wide
variety of <a href="ingestion-streams.html#stream-push">other streaming ingestion options</a>, including from
popular streaming systems like Kafka, Storm, Samza, and Spark Streaming.
</div>

The [dimensions](https://en.wikipedia.org/wiki/Dimension_%28data_warehouse%29) (attributes you can
filter and split on) for this datasource are flexible. It's configured for *schemaless dimensions*,

@ -223,17 +224,17 @@ curl -L -H'Content-Type: application/json' -XPOST --data-binary @quickstart/wiki

## Visualizing data

Druid is ideal for power user-facing analytic applications. There are a number of different open source applications to
visualize and explore data in Druid. We recommend trying [Pivot](https://github.com/implydata/pivot),
[Panoramix](https://github.com/mistercrunch/panoramix), or [Metabase](https://github.com/metabase/metabase) to start
visualizing the data you just ingested.

If you installed Pivot for example, you should be able to view your data in your browser at [localhost:9090](http://localhost:9090).

### SQL and other query libraries

There are many more query tools for Druid than we've included here, including SQL
engines, and libraries for various languages like Python and Ruby. Please see [the list of
libraries](../development/libraries.html) for more information.

## Clustered setup

@ -2,25 +2,36 @@

layout: doc_page
---

# Tutorial: Load your own batch data

## Getting started

This tutorial shows you how to load your own data files into Druid.

For this tutorial, we'll assume you've already downloaded Druid as described in
the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
don't need to have loaded any data yet.

Once that's complete, you can load your own dataset by writing a custom ingestion spec.

## Writing an ingestion spec

When loading files into Druid, you will use Druid's [batch loading](ingestion-batch.html) process.
There's an example batch ingestion spec in `quickstart/wikiticker-index.json` that you can modify
for your own needs.

The most important questions are:

* What should the dataset be called? This is the "dataSource" field of the "dataSchema".
* Where is the dataset located? The file paths belong in the "paths" of the "inputSpec". If you
  want to load multiple files, you can provide them as a comma-separated string.
* Which field should be treated as a timestamp? This belongs in the "column" of the "timestampSpec".
* Which fields should be treated as dimensions? This belongs in the "dimensions" of the "dimensionsSpec".
* Which fields should be treated as metrics? This belongs in the "metricsSpec".
* What time ranges (intervals) are being loaded? This belongs in the "intervals" of the "granularitySpec".

```note-info
If your data does not have a natural sense of time, you can tag each row with the current time.
You can also tag all rows with a fixed timestamp, like "2000-01-01T00:00:00.000Z".
```

Let's use this pageviews dataset as an example. Druid supports TSV, CSV, and JSON out of the box.
Note that nested JSON objects are not supported, so if you do use JSON, you should provide a file

@ -90,19 +101,39 @@ And modify it by altering these sections:

}
```

## Running the task

To actually run this task, first make sure that the indexing task can read *pageviews.json*:

- If you're running locally (no configuration for connecting to Hadoop; this is the default) then
  place it in the root of the Druid distribution.
- If you configured Druid to connect to a Hadoop cluster, upload
  the pageviews.json file to HDFS. You may need to adjust the `paths` in the ingestion spec
  (a sketch of the upload follows this list).
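
For example, one way to do the upload with the standard Hadoop CLI, assuming an `hdfs` client on your path and `/tmp/pageviews.json` as the target path (match whatever you put in "paths"):

```bash
# Upload the sample file to HDFS; the destination path is only an example.
hdfs dfs -put pageviews.json /tmp/pageviews.json
```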

To kick off the indexing process, POST your indexing task to the Druid Overlord. In a standard Druid
install, the URL is `http://OVERLORD_IP:8090/druid/indexer/v1/task`.

```bash
curl -X 'POST' -H 'Content-Type:application/json' -d @my-index-task.json OVERLORD_IP:8090/druid/indexer/v1/task
```

If you're running everything on a single machine, you can use localhost:

```bash
curl -X 'POST' -H 'Content-Type:application/json' -d @my-index-task.json localhost:8090/druid/indexer/v1/task
```

If anything goes wrong with this task (e.g. it finishes with status FAILED), you can troubleshoot
by visiting the "Task log" on the [overlord console](http://localhost:8090/console.html).

```note-info
Druid supports a wide variety of data formats, ingestion options, and configurations not
discussed here. For a full explanation of all available features, see the ingestion sections of the Druid
documentation.
```

## Querying your data

Your data should become fully available within a minute or two. You can monitor this process on
your Coordinator console at [http://localhost:8081/#/](http://localhost:8081/#/).

Once your data is fully available, you can query it using any of the
[supported query methods](../querying/querying.html).

## Further reading

For more information on loading batch data, please see [the batch ingestion documentation](../ingestion/batch-ingestion.html).

@ -8,21 +8,21 @@ layout: doc_page

This tutorial shows you how to load data from Kafka into Druid.

For this tutorial, we'll assume you've already downloaded Druid and Tranquility as described in
the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
don't need to have loaded any data yet.

<div class="note info">
This tutorial will show you how to load data from Kafka into Druid, but Druid additionally supports
a wide variety of batch and streaming loading methods. See the <a href="../ingestion/batch-ingestion.html">Loading files</a>
and <a href="../ingestion/stream-ingestion.html">Loading streams</a> pages for more information about other options,
including from Hadoop, HTTP, Storm, Samza, Spark Streaming, and your own JVM apps.
</div>

## Start Kafka

[Apache Kafka](http://kafka.apache.org/) is a high throughput message bus that works well with
Druid. For this tutorial, we will use Kafka 0.9.0.0. To download Kafka, issue the following
commands in your terminal:

```bash

@ -45,7 +45,7 @@ Run this command to create a Kafka topic called *metrics*, to which we'll send d

## Enable Druid Kafka ingestion

Druid includes configs for [Tranquility Kafka](ingestion-streams.html#kafka) to support loading data from Kafka.
To enable this in the quickstart-based configuration:

- Stop your Tranquility command (CTRL-C) and then start it up again.

@ -66,25 +66,25 @@ In your Kafka directory, run:

./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic metrics
```

The *kafka-console-producer* command is now awaiting input. Copy the generated example metrics,
paste them into the *kafka-console-producer* terminal, and press enter. If you like, you can also
paste more messages into the producer, or you can press CTRL-D to exit the console producer.

You can immediately query this data, or you can skip ahead to the
[Loading your own data](#loading-your-own-data) section if you'd like to load your own dataset.

## Querying your data

After sending data, you can immediately query it using any of the
[supported query methods](../querying/querying.html).

## Loading your own data

So far, you've loaded data into Druid from Kafka using an ingestion spec that we've included in the
distribution. Each ingestion spec is designed to work with a particular dataset. You can load your own
datasets into Druid by writing a custom ingestion spec.

You can write a custom ingestion spec by starting from the bundled configuration in
`conf-quickstart/tranquility/kafka.json` and modifying it for your own needs.

The most important questions are:

@ -111,7 +111,7 @@ Next, edit `conf-quickstart/tranquility/kafka.json`:

* Let's call the dataset "pageviews-kafka".
* The timestamp is the "time" field.
* Good choices for dimensions are the string fields "url" and "user".
* Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that
  sum when we load the data will allow us to compute an average at query time as well.

You can edit the existing `conf-quickstart/tranquility/kafka.json` file by altering these

@ -157,7 +157,7 @@ Next, start Druid Kafka ingestion:

bin/tranquility kafka -configFile ../druid-0.9.0-SNAPSHOT/conf-quickstart/tranquility/kafka.json
```

- If your Tranquility server or Kafka is already running, stop it (CTRL-C) and
  start it up again.

Finally, send some data to the Kafka topic. Let's start with these messages:

@ -168,23 +168,23 @@ Finally, send some data to the Kafka topic. Let's start with these messages:

{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
```

Druid streaming ingestion requires relatively current messages (relative to a slack time controlled by the
[windowPeriod](../ingestion/stream-ingestion.html#segmentgranularity-and-windowperiod) value), so you should
replace `2000-01-01T00:00:00Z` in these messages with the current time in ISO8601 format. You can
get this by running:

```bash
python -c 'import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))'
```

Update the timestamps in the JSON above, then copy and paste these messages into this console
producer and press enter:

```bash
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic pageviews
```

That's it, your data should now be in Druid. You can immediately query it using any of the
[supported query methods](../querying/querying.html).

## Further reading

@ -2,33 +2,33 @@

layout: doc_page
---

# Tutorial: Load your own streaming data

## Getting started

This tutorial shows you how to load your own streams into Druid.

For this tutorial, we'll assume you've already downloaded Druid and Tranquility as described in
the [single-machine quickstart](quickstart.html) and have it running on your local machine. You
don't need to have loaded any data yet.

Once that's complete, you can load your own dataset by writing a custom ingestion spec.

## Writing an ingestion spec

When loading streams into Druid, we recommend using the [stream push](../ingestion/stream-push.html)
process. In this tutorial we'll be using [Tranquility Server](../ingestion/stream-ingestion.html#server) to push
data into Druid over HTTP.

<div class="note info">
This tutorial will show you how to push streams to Druid using HTTP, but Druid additionally supports
a wide variety of batch and streaming loading methods. See the <a href="../ingestion/batch-ingestion.html">Loading files</a>
and <a href="../ingestion/stream-ingestion.html">Loading streams</a> pages for more information about other options,
including from Hadoop, Kafka, Storm, Samza, Spark Streaming, and your own JVM apps.
</div>

You can prepare for loading a new dataset over HTTP by writing a custom Tranquility Server
configuration. The bundled configuration is in `conf-quickstart/tranquility/server.json`, which
you can modify for your own needs.

The most important questions are:

@ -49,10 +49,10 @@ So the answers to the questions above are:

* Let's call the dataset "pageviews".
* The timestamp is the "time" field.
* Good choices for dimensions are the string fields "url" and "user".
* Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that
  sum when we load the data will allow us to compute an average at query time as well.

Now, edit the existing `conf-quickstart/tranquility/server.json` file by altering these
sections:

1. Change the key `"metrics"` under `"dataSources"` to `"pageviews"`

@ -95,16 +95,16 @@ Let's send some data! We'll start with these three records:

{"time": "2000-01-01T00:00:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
```

Druid streaming ingestion requires relatively current messages (relative to a slack time controlled by the
[windowPeriod](ingestion-streams.html#segmentgranularity-and-windowperiod) value), so you should
replace `2000-01-01T00:00:00Z` in these messages with the current time in ISO8601 format. You can
get this by running:

```bash
python -c 'import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))'
```

Update the timestamps in the JSON above, and save it to a file named `pageviews.json`. Then send
it to Druid by running:

```bash

@ -117,16 +117,16 @@ This will print something like:

{"result":{"received":3,"sent":3}}
```

This indicates that the HTTP server received 3 events from you, and sent 3 to Druid. Note that
this may take a few seconds to finish the first time you run it, as Druid resources must be
allocated to the ingestion task. Subsequent POSTs should complete quickly.

If you see `"sent":0` this likely means that your timestamps are not recent enough. Try adjusting
your timestamps and re-sending your data.

## Querying your data

After sending data, you can immediately query it using any of the
[supported query methods](../querying/querying.html).
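
For example, a small topN query sketch against the broker (port 8082 is the quickstart default) that ranks the "url" values you just sent by row count might look like this:

```bash
# A minimal sketch; assumes the quickstart broker on port 8082 and the "pageviews" datasource above.
curl -X POST -H 'Content-Type: application/json' 'http://localhost:8082/druid/v2/?pretty' -d '{
  "queryType": "topN",
  "dataSource": "pageviews",
  "dimension": "url",
  "metric": "rows",
  "threshold": 10,
  "granularity": "all",
  "intervals": ["2000-01-01/3000-01-01"],
  "aggregations": [{"type": "count", "name": "rows"}]
}'
```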

## Further reading