mirror of
https://github.com/apache/druid.git
synced 2025-02-17 07:25:02 +00:00
more fixes to docs
parent 4e3d8faed8
commit 5727193873
@@ -78,8 +78,8 @@ The spec\_file is a path to a file that contains JSON and an example looks like:
"paths" : "/MyDirectory/examples/indexing/wikipedia_data.json"
},
"metadataUpdateSpec" : {
"type":"mysql",
"connectURI" : "jdbc:mysql://localhost:3306/druid",
"type":"metadata",
"connectURI" : "jdbc:metadata storage://localhost:3306/druid",
"password" : "diurd",
"segmentTable" : "druid_segments",
"user" : "druid"
@@ -158,7 +158,7 @@ This is a specification of the properties that tell the job how to update metadata

|Field|Type|Description|Required|
|-----|----|-----------|--------|
|type|String|"metadata" is the only value available.|yes|
|connectURI|String|A valid JDBC url to MySQL.|yes|
|connectURI|String|A valid JDBC url to metadata storage.|yes|
|user|String|Username for db.|yes|
|password|String|password for db.|yes|
|segmentTable|String|Table to use in DB.|yes|
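For reference, a minimal sketch of a `metadataUpdateSpec` covering the fields in this table, assuming MySQL as the metadata store; the URI, credentials, and table name are the placeholder values used elsewhere in this diff:

```json
"metadataUpdateSpec" : {
  "type" : "metadata",
  "connectURI" : "jdbc:mysql://localhost:3306/druid",
  "user" : "druid",
  "password" : "diurd",
  "segmentTable" : "druid_segments"
}
```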
@@ -92,4 +92,4 @@ To check the state of the overlord, open up your browser and go to `#{OVERLORD_P

Next, go to `#{COORDINATOR_PUBLIC_IP_ADDR}:#{PORT}`. Click "View Information about the Cluster"->"Full Cluster View." You should now see the information about servers and segments. If the cluster runs correctly, Segment dimensions and Segment binaryVersion fields should be filled up. Allow a few minutes for the segments to be processed.

Now you should be able to query the data using the broker's public IP address.
@@ -241,7 +241,7 @@ This module is used to read announcements of segments in ZooKeeper. The configs

### Database Connector Module

These properties specify the jdbc connection and other configuration around the database. The only processes that connect to the DB with these properties are the [Coordinator](Coordinator.html) and [Indexing service](Indexing-service.html). This is tested on MySQL.
These properties specify the jdbc connection and other configuration around the database. The only processes that connect to the DB with these properties are the [Coordinator](Coordinator.html) and [Indexing service](Indexing-service.html). This is tested on metadata storage.

|Property|Description|Default|
|--------|-----------|-------|
@@ -25,7 +25,7 @@ The coordinator module uses several of the default modules in [Configuration](Co

Dynamic Configuration
---------------------

The coordinator has dynamic configuration to change certain behaviour on the fly. The coordinator reads a JSON spec object from the Druid [MySQL](Metadata-storage.html) config table. This object is detailed below:
The coordinator has dynamic configuration to change certain behaviour on the fly. The coordinator reads a JSON spec object from the Druid [metadata storage](Metadata-storage.html) config table. This object is detailed below:

It is recommended that you use the Coordinator Console to configure these parameters. However, if you need to do it via HTTP, the JSON object can be submitted to the overlord via a POST request at:
@@ -31,7 +31,7 @@ The node types that currently exist are:

* [**Historical**](Historical.html) nodes are the workhorses that handle storage and querying on "historical" data (non-realtime). Historical nodes download segments from deep storage, respond to the queries from broker nodes about these segments, and return results to the broker nodes. They announce themselves and the segments they are serving in Zookeeper, and also use Zookeeper to monitor for signals to load or drop new segments.
* [**Realtime**](Realtime.html) nodes ingest data in real time. They are in charge of listening to a stream of incoming data and making it available immediately inside the Druid system. Real-time nodes respond to query requests from Broker nodes, returning query results to those nodes. Aged data is pushed from Realtime nodes to Historical nodes.
* [**Coordinator**](Coordinator.html) nodes monitor the grouping of historical nodes to ensure that data is available, replicated and in a generally "optimal" configuration. They do this by reading segment metadata information from MySQL to determine what segments should be loaded in the cluster, using Zookeeper to determine what Historical nodes exist, and creating Zookeeper entries to tell Historical nodes to load and drop new segments.
* [**Coordinator**](Coordinator.html) nodes monitor the grouping of historical nodes to ensure that data is available, replicated and in a generally "optimal" configuration. They do this by reading segment metadata information from metadata storage to determine what segments should be loaded in the cluster, using Zookeeper to determine what Historical nodes exist, and creating Zookeeper entries to tell Historical nodes to load and drop new segments.
* [**Broker**](Broker.html) nodes receive queries from external clients and forward those queries to Realtime and Historical nodes. When Broker nodes receive results, they merge these results and return them to the caller. For knowing topology, Broker nodes use Zookeeper to determine what Realtime and Historical nodes exist.
* [**Indexer**](Indexing-Service.html) nodes form a cluster of workers to load batch and real-time data into the system as well as allow for alterations to the data stored in the system (also known as the Indexing Service).
@@ -46,7 +46,7 @@ All nodes can be run in some highly available fashion, either as symmetric peers

Aside from these nodes, there are 3 external dependencies to the system:

1. A running [ZooKeeper](ZooKeeper.html) cluster for cluster service discovery and maintenance of current data topology
2. A [MySQL instance](Metadata-storage.html) for maintenance of metadata about the data segments that should be served by the system
2. A [metadata storage instance](Metadata-storage.html) for maintenance of metadata about the data segments that should be served by the system
3. A ["deep storage" LOB store/file system](Deep-Storage.html) to hold the stored segments

The following diagram illustrates the cluster's management layer, showing how certain nodes and dependencies help manage the cluster by tracking and exchanging metadata:
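For reference, a minimal sketch of how these three dependencies typically show up in a node's runtime.properties. The property names below appear elsewhere in this diff; the hosts, credentials, and the `druid.storage.type` value are placeholder assumptions:

```bash
# ZooKeeper ensemble used for service discovery and data topology (host is a placeholder)
druid.zk.service.host=localhost

# Metadata storage holding segment metadata (URI and credentials are placeholders)
druid.metadata.storage.connector.connectURI=jdbc:mysql://localhost:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd

# Deep storage for segment files ("local" is an assumed example; S3 or HDFS are common in production)
druid.storage.type=local
```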
@@ -70,7 +70,7 @@ The output of the indexing process is stored in a "deep storage" LOB store/file

If a Historical node dies, it will no longer serve its segments, but given that the segments are still available on the "deep storage", any other node can simply download the segment and start serving it. This means that it is possible to actually remove all historical nodes from the cluster and then re-provision them without any data loss. It also means that if the "deep storage" is not available, the nodes can continue to serve the segments they have already pulled down (i.e. the cluster goes stale, not down).

In order for a segment to exist inside of the cluster, an entry has to be added to a table in a MySQL instance. This entry is a self-describing bit of metadata about the segment, including things like the schema of the segment, its size, and its location on deep storage. These entries are what the Coordinator uses to know what data **should** be available on the cluster.
In order for a segment to exist inside of the cluster, an entry has to be added to a table in a metadata storage instance. This entry is a self-describing bit of metadata about the segment, including things like the schema of the segment, its size, and its location on deep storage. These entries are what the Coordinator uses to know what data **should** be available on the cluster.

### Fault Tolerance
@@ -79,7 +79,7 @@ In order for a segment to exist inside of the cluster, an entry has to be added

- **Broker** Can be run in parallel or in hot fail-over.
- **Realtime** Depending on the semantics of the delivery stream, multiple of these can be run in parallel processing the exact same stream. They periodically checkpoint to disk and eventually push out to the Historical nodes. Steps are taken to be able to recover from process death, but loss of access to the local disk can result in data loss if this is the only method of adding data to the system.
- **"deep storage" file system** If this is not available, new data will not be able to enter the cluster, but the cluster will continue operating as is.
- **MySQL** If this is not available, the Coordinator will be unable to find out about new segments in the system, but it will continue with its current view of the segments that should exist in the cluster.
- **metadata storage** If this is not available, the Coordinator will be unable to find out about new segments in the system, but it will continue with its current view of the segments that should exist in the cluster.
- **ZooKeeper** If this is not available, data topology changes cannot be made, but the Brokers will maintain their most recent view of the data topology and continue serving requests accordingly.

### Query processing
@@ -14,7 +14,7 @@ This guide walks you through the steps to create the cluster and then how to cre

## What’s in this Druid Demo Cluster?

1. A single "Coordinator" node. This node co-locates the [Coordinator](Coordinator.html) process, the [Broker](Broker.html) process, Zookeeper, and the MySQL instance. You can read more about the Druid architecture in [Design](Design.html).
1. A single "Coordinator" node. This node co-locates the [Coordinator](Coordinator.html) process, the [Broker](Broker.html) process, Zookeeper, and the metadata storage instance. You can read more about the Druid architecture in [Design](Design.html).

1. Three historical nodes; these historical nodes have been pre-configured to work with the Coordinator node and should automatically load up the Wikipedia edit stream data (no specific setup is required).
@@ -44,7 +44,7 @@ There are additional configs for autoscaling (if it is enabled):

#### Dynamic Configuration

Overlord dynamic configuration is mainly for autoscaling. The overlord reads a worker setup spec as a JSON object from the Druid [MySQL](Metadata-storage.html) config table. This object contains information about the version of middle managers to create, the maximum and minimum number of middle managers in the cluster at one time, and additional information required to automatically create middle managers.
Overlord dynamic configuration is mainly for autoscaling. The overlord reads a worker setup spec as a JSON object from the Druid [metadata storage](Metadata-storage.html) config table. This object contains information about the version of middle managers to create, the maximum and minimum number of middle managers in the cluster at one time, and additional information required to automatically create middle managers.
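A minimal sketch of what such a worker setup spec might contain, based on the description above; the field names are illustrative assumptions rather than values confirmed by this diff:

```json
{
  "minVersion": "0.6.160",
  "minNumWorkers": 1,
  "maxNumWorkers": 3
}
```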
The JSON object can be submitted to the overlord via a POST request at:
@@ -17,7 +17,7 @@ Make sure that the `druid.publish.type` on your real-time nodes is set to `metad

```
druid.publish.type=metadata

druid.metadata.storage.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
druid.metadata.storage.connector.connectURI=jdbc\:metadata storage\://localhost\:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd
@@ -6,13 +6,13 @@ Production Cluster Configuration

__This configuration is an example of what a production cluster could look like. Many other hardware combinations are possible! Cheaper hardware is absolutely possible.__

This production Druid cluster assumes that MySQL and Zookeeper are already set up. The deep storage that is used for examples is S3 and memcached is used for a distributed cache.
This production Druid cluster assumes that metadata storage and Zookeeper are already set up. The deep storage that is used for examples is S3 and memcached is used for a distributed cache.

The nodes that respond to queries (Historical, Broker, and Middle manager nodes) will use as many cores as are available, depending on usage, so it is best to keep these on dedicated machines. The upper limit of effectively utilized cores is not well characterized yet and would depend on types of queries, query load, and the schema. Historical daemons should have a heap size of at least 1GB per core for normal usage, but could be squeezed into a smaller heap for testing. Since in-memory caching is essential for good performance, even more RAM is better. Broker nodes will use RAM for caching, so they do more than just route queries. SSDs are highly recommended for Historical nodes if not all data is loaded in available memory.

The nodes that are responsible for coordination (Coordinator and Overlord nodes) require much less processing.

The effective utilization of cores by Zookeeper, MySQL, and Coordinator nodes is likely to be between 1 and 2 for each process/daemon, so these could potentially share a machine with lots of cores. These daemons work with a heap size between 500MB and 1GB.
The effective utilization of cores by Zookeeper, metadata storage, and Coordinator nodes is likely to be between 1 and 2 for each process/daemon, so these could potentially share a machine with lots of cores. These daemons work with a heap size between 500MB and 1GB.

We'll use r3.8xlarge nodes for query facing nodes and m1.xlarge nodes for coordination nodes. The following examples work relatively well in production, however, a more optimized tuning for the nodes we selected and more optimal hardware for a Druid cluster are both definitely possible.
@@ -7,7 +7,7 @@ For Real-time Node Configuration, see [Realtime Configuration](Realtime-Config.h

For Real-time Ingestion, see [Realtime Ingestion](Realtime-ingestion.html).

Realtime nodes provide a realtime index. Data indexed via these nodes is immediately available for querying. Realtime nodes will periodically build segments representing the data they’ve collected over some span of time and transfer these segments off to [Historical](Historical.html) nodes. They use ZooKeeper to monitor the transfer and MySQL to store metadata about the transferred segment. Once transferred, segments are forgotten by the Realtime nodes.
Realtime nodes provide a realtime index. Data indexed via these nodes is immediately available for querying. Realtime nodes will periodically build segments representing the data they’ve collected over some span of time and transfer these segments off to [Historical](Historical.html) nodes. They use ZooKeeper to monitor the transfer and the metadata storage to store metadata about the transferred segment. Once transferred, segments are forgotten by the Realtime nodes.

### Running
@@ -133,4 +133,4 @@ druid.processing.buffer.sizeBytes=100000000
druid.processing.numThreads=1
```

This simple broker will run groupBys in a single thread.
@@ -58,7 +58,7 @@ This just walks through getting the relevant software installed and running. Yo

1. Install necessary software with yum

yum install -y java-1.7.0-openjdk-devel git wget mysql-server
yum install -y java-1.7.0-openjdk-devel git wget metadata storage-server

1. Install maven
@@ -204,11 +204,11 @@ This just walks through getting the relevant software installed and running. Yo

s3cmd ls

1. Start MySQL server
1. Start metadata storage server

service mysqld start
chkconfig mysqld on
/usr/bin/mysqladmin -u root password 'riakdruid'
service metadata storaged start
chkconfig metadata storaged on
/usr/bin/metadata storageadmin -u root password 'riakdruid'

NOTE: If you don't like "riakdruid" as your password, feel free to change it around.
NOTE: This example connects to the database as the root user to keep things simple; in practice you should create and use a dedicated database user.
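If you do move away from the root account, a minimal sketch of creating a dedicated database and user, assuming MySQL as the metadata store; the database name, user, and passwords below are the placeholders used elsewhere in this document:

```bash
# create the druid database and a dedicated user instead of connecting as root
mysql -u root -p'riakdruid' -e "CREATE DATABASE druid DEFAULT CHARACTER SET utf8;"
mysql -u root -p'riakdruid' -e "GRANT ALL ON druid.* TO 'druid'@'localhost' IDENTIFIED BY 'diurd';"
```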
@@ -14,7 +14,7 @@ Before we start digging into how to query Druid, make sure you've gone through t

Let's start up a simple Druid cluster so we can query all the things.

Note: If Zookeeper and MySQL aren't running, you'll have to start them again as described in [The Druid Cluster](Tutorial%3A-The-Druid-Cluster.html).
Note: If Zookeeper and metadata storage aren't running, you'll have to start them again as described in [The Druid Cluster](Tutorial%3A-The-Druid-Cluster.html).

To start a Coordinator node:
@@ -1,275 +0,0 @@
---
layout: doc_page
---

# Tutorial: Loading Your Data (Part 1)
In our last [tutorial](Tutorial%3A-The-Druid-Cluster.html), we set up a complete Druid cluster. We created all the Druid dependencies and loaded some batched data. Druid shards data into self-contained chunks known as [segments](Segments.html). Segments are the fundamental unit of storage in Druid and all Druid nodes only understand segments.

In this tutorial, we will learn about batch ingestion (as opposed to real-time ingestion) and how to create segments using the final piece of the Druid Cluster, the [indexing service](Indexing-Service.html). The indexing service is a standalone service that accepts [tasks](Tasks.html) in the form of POST requests. The output of most tasks is segments.

If you are interested in learning more about ingesting your own data into Druid, skip to the next [tutorial](Tutorial%3A-Loading-Your-Data-Part-2.html).

About the data
--------------

The data source we'll be working with is Wikipedia edits once again. The data schema is the same as the previous tutorials:

Dimensions (things to filter on):

```json
"page"
"language"
"user"
"unpatrolled"
"newPage"
"robot"
"anonymous"
"namespace"
"continent"
"country"
"region"
"city"
```

Metrics (things to aggregate over):

```json
"count"
"added"
"delta"
"deleted"
```
Setting Up
----------

At this point, you should already have Druid downloaded and are comfortable with running a Druid cluster locally. If you are not, see [here](Tutorial%3A-The-Druid-Cluster.html).

Let's start from our usual starting point in the tarball directory.

Segments require data, so before we can build a Druid segment, we are going to need some raw data. Make sure that the following file exists:

```
examples/indexing/wikipedia_data.json
```

Open the file and make sure the following events exist:

```json
{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "stringer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
```

There are five data points spread across the day of 2013-08-31. Talk about big data right? Thankfully, we don't need a ton of data to introduce how batch ingestion works.

In order to ingest and query this data, we are going to need to run a historical node, a coordinator node, and an indexing service to run the batch ingestion.

Note: If Zookeeper and MySQL aren't running, you'll have to start them again as described in [The Druid Cluster](Tutorial%3A-The-Druid-Cluster.html).

#### Starting a Local Indexing Service

The simplest indexing service we can start up is to run an [overlord](Indexing-Service.html) node in local mode. You can do so by issuing:

```bash
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/overlord io.druid.cli.Main server overlord
```

The overlord configurations should already exist in:

```
config/overlord/runtime.properties
```

The configurations for the overlord node are as follows:

```bash
druid.host=localhost
druid.port=8087
druid.service=overlord

druid.zk.service.host=localhost

druid.extensions.coordinates=["io.druid.extensions:druid-kafka-seven:0.6.160"]

druid.metadata.storage.connector.connectURI=jdbc:mysql://localhost:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd

druid.selectors.indexing.serviceName=overlord
druid.indexer.queue.startDelay=PT0M
druid.indexer.runner.javaOpts="-server -Xmx256m"
druid.indexer.fork.property.druid.processing.numThreads=1
druid.indexer.fork.property.druid.computation.buffer.size=100000000
```
If you are interested in reading more about these configurations, see [here](Indexing-Service.html).

#### Starting Other Nodes

Just in case you forgot how, let's start up the other nodes we require:

Coordinator node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/coordinator io.druid.cli.Main server coordinator
```

Historical node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/historical io.druid.cli.Main server historical
```

Note: Historical, real-time and broker nodes share the same query interface. Hence, we do not explicitly need a broker node for this tutorial. All queries can go against the historical node directly.

Once all the nodes are up and running, we are ready to index some data.

Indexing the Data
-----------------

To index the data and build a Druid segment, we are going to need to submit a task to the indexing service. This task should already exist:

```
examples/indexing/wikipedia_index_task.json
```

Open up the file to see the following:

```json
{
  "type" : "index",
  "dataSource" : "wikipedia",
  "granularitySpec" : {
    "type" : "uniform",
    "gran" : "DAY",
    "intervals" : [ "2013-08-31/2013-09-01" ]
  },
  "aggregators" : [{
    "type" : "count",
    "name" : "count"
  }, {
    "type" : "doubleSum",
    "name" : "added",
    "fieldName" : "added"
  }, {
    "type" : "doubleSum",
    "name" : "deleted",
    "fieldName" : "deleted"
  }, {
    "type" : "doubleSum",
    "name" : "delta",
    "fieldName" : "delta"
  }],
  "firehose" : {
    "type" : "local",
    "baseDir" : "examples/indexing",
    "filter" : "wikipedia_data.json",
    "parser" : {
      "timestampSpec" : {
        "column" : "timestamp"
      },
      "data" : {
        "format" : "json",
        "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
      }
    }
  }
}
```
Okay, so what is happening here? The "type" field indicates the type of task we plan to run. In this case, it is a simple "index" task. The "granularitySpec" indicates that we are building a daily segment for 2013-08-31 to 2013-09-01. Next, the "aggregators" indicate which fields in our data set we plan to build metric columns for. The "fieldName" corresponds to the metric name in the raw data. The "name" corresponds to what our metric column is actually going to be called in the segment. Finally, we have a local "firehose" that is going to read data from disk. We tell the firehose where our data is located and the types of files we are looking to ingest. In our case, we only have a single data file.

Let's send our task to the indexing service now:

```bash
curl -X 'POST' -H 'Content-Type:application/json' -d @examples/indexing/wikipedia_index_task.json localhost:8087/druid/indexer/v1/task
```

Issuing the request should return a task ID like so:

```bash
$ curl -X 'POST' -H 'Content-Type:application/json' -d @examples/indexing/wikipedia_index_task.json localhost:8087/druid/indexer/v1/task
{"task":"index_wikipedia_2013-10-09T21:30:32.802Z"}
$
```

In your indexing service logs, you should see the following:

```bash
2013-10-09 21:41:41,150 INFO [qtp300448720-21] io.druid.indexing.overlord.HeapMemoryTaskStorage - Inserting task index_wikipedia_2013-10-09T21:41:41.147Z with status: TaskStatus{id=index_wikipedia_2013-10-09T21:41:41.147Z, status=RUNNING, duration=-1}
2013-10-09 21:41:41,151 INFO [qtp300448720-21] io.druid.indexing.overlord.TaskLockbox - Created new TaskLockPosse: TaskLockPosse{taskLock=TaskLock{groupId=index_wikipedia_2013-10-09T21:41:41.147Z, dataSource=wikipedia, interval=2013-08-31T00:00:00.000Z/2013-09-01T00:00:00.000Z, version=2013-10-09T21:41:41.151Z}, taskIds=[]}
...
2013-10-09 21:41:41,215 INFO [pool-6-thread-1] io.druid.indexing.overlord.ForkingTaskRunner - Logging task index_wikipedia_2013-10-09T21:41:41.147Z_generator_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_0 output to: /tmp/persistent/index_wikipedia_2013-10-09T21:41:41.147Z_generator_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_0/b5099fdb-d6b0-4b81-9053-b2af70336a7e/log
2013-10-09 21:41:45,017 INFO [qtp300448720-22] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_wikipedia_2013-10-09T21:41:41.147Z_generator_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_0]: LockListAction{}
```

After a few seconds, the task should complete and you should see in the indexing service logs:

```bash
2013-10-09 21:41:45,765 INFO [pool-6-thread-1] io.druid.indexing.overlord.exec.TaskConsumer - Received SUCCESS status for task: IndexGeneratorTask{id=index_wikipedia_2013-10-09T21:41:41.147Z_generator_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_0, type=index_generator, dataSource=wikipedia, interval=Optional.of(2013-08-31T00:00:00.000Z/2013-09-01T00:00:00.000Z)}
```

Congratulations! The segment has completed building. Once a segment is built, a segment metadata entry is created in your MySQL table. The coordinator compares what is in the segment metadata table with what is in the cluster. A new entry in the metadata table will cause the coordinator to load the new segment in a minute or so.

You should see the following logs on the coordinator:

```bash
2013-10-09 21:41:54,368 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - [_default_tier] : Assigned 1 segments among 1 servers
2013-10-09 21:41:54,369 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - Load Queues:
2013-10-09 21:41:54,369 INFO [Coordinator-Exec--0] io.druid.server.coordinator.helper.DruidCoordinatorLogger - Server[localhost:8081, historical, _default_tier] has 1 left to load, 0 left to drop, 4,477 bytes queued, 4,477 bytes served.
```

These logs indicate that the coordinator has assigned our new segment to the historical node to download and serve. If you look at the historical node logs, you should see:

```bash
2013-10-09 21:41:54,369 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Loading segment wikipedia_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_2013-10-09T21:41:41.151Z
2013-10-09 21:41:54,369 INFO [ZkCoordinator-0] io.druid.segment.loading.LocalDataSegmentPuller - Unzipping local file[/tmp/druid/localStorage/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2013-10-09T21:41:41.151Z/0/index.zip] to [/tmp/druid/indexCache/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2013-10-09T21:41:41.151Z/0]
2013-10-09 21:41:54,370 INFO [ZkCoordinator-0] io.druid.utils.CompressionUtils - Unzipping file[/tmp/druid/localStorage/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2013-10-09T21:41:41.151Z/0/index.zip] to [/tmp/druid/indexCache/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2013-10-09T21:41:41.151Z/0]
2013-10-09 21:41:54,380 INFO [ZkCoordinator-0] io.druid.server.coordination.SingleDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_2013-10-09T21:41:41.151Z] to path[/druid/servedSegments/localhost:8081/wikipedia_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_2013-10-09T21:41:41.151Z]
```

Once the segment is announced, the segment is queryable. Now you should be able to query the data.

Issuing a [TimeBoundaryQuery](TimeBoundaryQuery.html) should yield:

```json
[ {
  "timestamp" : "2013-08-31T01:02:33.000Z",
  "result" : {
    "minTime" : "2013-08-31T01:02:33.000Z",
    "maxTime" : "2013-08-31T12:41:27.000Z"
  }
} ]
```
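For reference, a minimal sketch of the query behind that result, POSTed directly to the historical node; the port (8081, taken from the coordinator logs above) and the `/druid/v2` endpoint are assumptions:

```bash
# time boundary query for the wikipedia datasource, sent straight to the historical node
curl -X 'POST' -H 'Content-Type:application/json' -d '{"queryType":"timeBoundary","dataSource":"wikipedia"}' localhost:8081/druid/v2/?pretty
```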
Console
--------

The indexing service overlord has a console located at:

```bash
localhost:8087/console.html
```

On this console, you can look at statuses and logs of recently submitted and completed tasks.

If you decide to reuse the local firehose to ingest your own data and if you run into problems, you can use the console to read the individual task logs.

Task logs can be stored locally or uploaded to [Deep Storage](Deep-Storage.html). More information about how to configure this is [here](Configuration.html).

The most common data ingestion problems are around timestamp formats and other malformed data issues.
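Since timestamps are the usual culprit, one way to avoid such problems is to declare the timestamp format explicitly in the task's `timestampSpec` instead of relying on auto-detection; a minimal sketch, where the `"iso"` format value is an assumption for this Druid version:

```json
"timestampSpec" : {
  "column" : "timestamp",
  "format" : "iso"
}
```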
Next Steps
----------

This tutorial covered ingesting a small batch data set and loading it into Druid. In [Loading Your Data Part 2](Tutorial%3A-Loading-Your-Data-Part-2.html), we will cover how to ingest data using Hadoop for larger data sets.

Note: The index task and local firehose can be used to ingest your own data if the size of that data is relatively small (< 1G). The index task is fairly slow and we highly recommend using the Hadoop Index Task for ingesting larger quantities of data.

Additional Information
----------------------

Getting data into Druid can definitely be difficult for first time users. Please don't hesitate to ask questions in our IRC channel or on our [google groups page](https://groups.google.com/forum/#!forum/druid-development).
@@ -1,343 +0,0 @@
---
layout: doc_page
---

# Tutorial: Loading Your Data (Part 2)
In this tutorial we will cover more advanced/real-world ingestion topics.

Druid can ingest streaming or batch data. Streaming data is ingested via the real-time node, and batch data is ingested via the Hadoop batch indexer. Druid also has a standalone ingestion service called the [indexing service](Indexing-Service.html).

The Data
--------
The data source we'll be using is (surprise!) Wikipedia edits. The data schema is still:

Dimensions (things to filter on):

```json
"page"
"language"
"user"
"unpatrolled"
"newPage"
"robot"
"anonymous"
"namespace"
"continent"
"country"
"region"
"city"
```

Metrics (things to aggregate over):

```json
"count"
"added"
"delta"
"deleted"
```
Streaming Event Ingestion
-------------------------

With real-world data, we recommend having a message bus such as [Apache Kafka](http://kafka.apache.org/) sit between the data stream and the real-time node. The message bus provides higher availability for production environments. [Firehoses](Firehose.html) are the key abstraction for real-time ingestion.

<a id="set-up-kafka"></a>
#### Setting up Kafka

[KafkaFirehoseFactory](Firehose.html) is how Druid communicates with Kafka. Using this [Firehose](Firehose.html) with the right configuration, we can import data into Druid in real-time without writing any code. To load data to a real-time node via Kafka, we'll first need to initialize Zookeeper and Kafka, and then configure and initialize a [Realtime](Realtime.html) node.

The following quick-start instructions for booting a Zookeeper and then Kafka cluster were taken from the [Kafka website](http://kafka.apache.org/07/quickstart.html).

1. Download Apache Kafka 0.7.2 from [http://kafka.apache.org/downloads.html](http://kafka.apache.org/downloads.html)

```bash
wget http://archive.apache.org/dist/kafka/old_releases/kafka-0.7.2-incubating/kafka-0.7.2-incubating-src.tgz
tar -xvzf kafka-0.7.2-incubating-src.tgz
cd kafka-0.7.2-incubating-src
```

2. Build Kafka

```bash
./sbt update
./sbt package
```

3. Boot Kafka

```bash
cat config/zookeeper.properties
bin/zookeeper-server-start.sh config/zookeeper.properties

# in a new console
bin/kafka-server-start.sh config/server.properties
```

4. Launch the console producer (so you can type in JSON kafka messages in a bit)

```bash
bin/kafka-console-producer.sh --zookeeper localhost:2181 --topic wikipedia
```

When things are ready, you should see log messages such as:

```
[2013-10-09 22:03:07,802] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
```
#### Launch a Realtime Node

You should be comfortable starting Druid nodes at this point. If not, it may be worthwhile to revisit the first few tutorials.

1. Real-time nodes can be started with:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.realtime.specFile=examples/indexing/wikipedia.spec -classpath lib/*:config/realtime io.druid.cli.Main server realtime
```

2. A realtime.spec should already exist for the data source in the Druid tarball. You should be able to find it at:

```bash
examples/indexing/wikipedia.spec
```

The contents of the file should match:

```json
[
  {
    "schema": {
      "dataSource": "wikipedia",
      "aggregators" : [
        {
          "type" : "count",
          "name" : "count"
        },
        {
          "type" : "doubleSum",
          "name" : "added",
          "fieldName" : "added"
        },
        {
          "type" : "doubleSum",
          "name" : "deleted",
          "fieldName" : "deleted"
        },
        {
          "type" : "doubleSum",
          "name" : "delta",
          "fieldName" : "delta"
        }
      ],
      "indexGranularity": "none"
    },
    "config": {
      "maxRowsInMemory": 500000,
      "intermediatePersistPeriod": "PT10m"
    },
    "firehose": {
      "type": "kafka-0.7.2",
      "consumerProps": {
        "zk.connect": "localhost:2181",
        "zk.connectiontimeout.ms": "15000",
        "zk.sessiontimeout.ms": "15000",
        "zk.synctime.ms": "5000",
        "groupid": "druid-example",
        "fetch.size": "1048586",
        "autooffset.reset": "largest",
        "autocommit.enable": "false"
      },
      "feed": "wikipedia",
      "parser": {
        "timestampSpec": {
          "column": "timestamp"
        },
        "data": {
          "format": "json",
          "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
        }
      }
    },
    "plumber": {
      "type": "realtime",
      "windowPeriod": "PT10m",
      "segmentGranularity": "hour",
      "basePersistDirectory": "\/tmp\/realtime\/basePersist",
      "rejectionPolicy": {
        "type": "test"
      }
    }
  }
]
```
Note: This config uses a "test" [rejection policy](Plumber.html) which will accept all events and hand them off in a timely manner; however, we strongly recommend you do not use this in production. Using this rejection policy, segments for events for the same time range will be overridden.
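A minimal sketch of the production-oriented alternative hinted at in the note, swapping the "test" type for a server-time based policy; the "serverTime" type name is an assumption for this Druid version:

```json
"rejectionPolicy": {
  "type": "serverTime"
}
```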
3. Let's copy and paste some data into the Kafka console producer

```json
{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "stringer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
```

Disclaimer: We recognize the timestamps of these events aren't actually recent.

4. Watch the events as they are ingested by Druid's real-time node:

```bash
...
2013-10-10 05:13:18,976 INFO [chief-wikipedia] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-31T01:00:00.000Z_2013-08-31T02:00:00.000Z_2013-08-31T01:00:00.000Z] at path[/druid/segments/localhost:8083/2013-10-10T05:13:18.972Z0]
2013-10-10 05:13:18,992 INFO [chief-wikipedia] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-31T03:00:00.000Z_2013-08-31T04:00:00.000Z_2013-08-31T03:00:00.000Z] at path[/druid/segments/localhost:8083/2013-10-10T05:13:18.972Z0]
2013-10-10 05:13:18,997 INFO [chief-wikipedia] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-31T07:00:00.000Z_2013-08-31T08:00:00.000Z_2013-08-31T07:00:00.000Z] at path[/druid/segments/localhost:8083/2013-10-10T05:13:18.972Z0]
2013-10-10 05:13:19,003 INFO [chief-wikipedia] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-31T11:00:00.000Z_2013-08-31T12:00:00.000Z_2013-08-31T11:00:00.000Z] at path[/druid/segments/localhost:8083/2013-10-10T05:13:18.972Z0]
2013-10-10 05:13:19,008 INFO [chief-wikipedia] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-31T12:00:00.000Z_2013-08-31T13:00:00.000Z_2013-08-31T12:00:00.000Z] at path[/druid/segments/localhost:8083/2013-10-10T05:13:18.972Z0]
...
```

Issuing a [TimeBoundaryQuery](TimeBoundaryQuery.html) to the real-time node should yield valid results:

```json
[
  {
    "timestamp" : "2013-08-31T01:02:33.000Z",
    "result" : {
      "minTime" : "2013-08-31T01:02:33.000Z",
      "maxTime" : "2013-08-31T12:41:27.000Z"
    }
  }
]
```
Batch Ingestion
---------------
Druid is designed for large data volumes, and most real-world data sets require batch indexing be done through a Hadoop job.

For this tutorial, we used [Hadoop 1.0.3](https://archive.apache.org/dist/hadoop/core/hadoop-1.0.3/). There are many pages on the Internet showing how to set up a single-node (standalone) Hadoop cluster, which is all that's needed for this example.

For the purposes of this tutorial, we are going to use our very small and simple Wikipedia data set. This data can directly be ingested via other means as shown in the previous [tutorial](Tutorial%3A-Loading-Your-Data-Part-1.html), but we are going to use Hadoop here for demonstration purposes.

Our data is located at:

```
examples/indexing/wikipedia_data.json
```

The following events should exist in the file:

```json
{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Oblast", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "stringer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
```

#### Set Up a Druid Cluster

To index the data, we are going to need an indexing service, a historical node, and a coordinator node.

Note: If Zookeeper and MySQL aren't running, you'll have to start them again as described in [The Druid Cluster](Tutorial%3A-The-Druid-Cluster.html).

To start the Indexing Service:

```bash
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:<hadoop_config_path>:config/overlord io.druid.cli.Main server overlord
```

To start the Coordinator Node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/coordinator io.druid.cli.Main server coordinator
```

To start the Historical Node:

```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/historical io.druid.cli.Main server historical
```
#### Index the Data

Before indexing the data, make sure you have a valid Hadoop cluster running. To build our Druid segment, we are going to submit a [Hadoop index task](Tasks.html) to the indexing service. The grammar for the Hadoop index task is very similar to the index task of the last tutorial. The tutorial Hadoop index task should be located at:

```
examples/indexing/wikipedia_index_hadoop_task.json
```

Examining the contents of the file, you should find:

```json
{
  "type" : "index_hadoop",
  "config": {
    "dataSource" : "wikipedia",
    "timestampSpec" : {
      "column" : "timestamp",
      "format" : "auto"
    },
    "dataSpec" : {
      "format" : "json",
      "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
    },
    "granularitySpec" : {
      "type" : "uniform",
      "gran" : "DAY",
      "intervals" : [ "2013-08-31/2013-09-01" ]
    },
    "pathSpec" : {
      "type" : "static",
      "paths" : "examples/indexing/wikipedia_data.json"
    },
    "targetPartitionSize" : 5000000,
    "rollupSpec" : {
      "aggs": [
        {
          "type" : "count",
          "name" : "count"
        },
        {
          "type" : "doubleSum",
          "name" : "added",
          "fieldName" : "added"
        },
        {
          "type" : "doubleSum",
          "name" : "deleted",
          "fieldName" : "deleted"
        },
        {
          "type" : "doubleSum",
          "name" : "delta",
          "fieldName" : "delta"
        }
      ],
      "rollupGranularity" : "none"
    }
  }
}
```
If you are curious about what all this configuration means, see [here](Tasks.html).

To submit the task:

```bash
curl -X 'POST' -H 'Content-Type:application/json' -d @examples/indexing/wikipedia_index_hadoop_task.json localhost:8087/druid/indexer/v1/task
```

After the task is completed, the segment should be assigned to your historical node. You should be able to query the segment.

Next Steps
----------
We demonstrated using the indexing service as a way to ingest data into Druid. Previous versions of Druid used the [HadoopDruidIndexer](Batch-ingestion.html) to ingest batch data. The `HadoopDruidIndexer` still remains a valid option for batch ingestion, however, we recommend using the indexing service as the preferred method of getting batch data into Druid.

For more information on querying, check out this [tutorial](Tutorial%3A-All-About-Queries.html).

Additional Information
----------------------

Getting data into Druid can definitely be difficult for first time users. Please don't hesitate to ask questions in our IRC channel or on our [google groups page](https://groups.google.com/forum/#!forum/druid-development).