Merge pull request #263 from metamx/is-docs

New Tutorial for batch ingestion with indexing service
This commit is contained in:
fjy 2013-10-09 15:45:03 -07:00
commit cf4711d99d
32 changed files with 831 additions and 294 deletions

View File

@ -20,7 +20,7 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-cassandra-storage</artifactId>
<name>druid-cassandra-storage</name>
<description>druid-cassandra-storage</description>

View File

@ -20,7 +20,7 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-common</artifactId>
<name>druid-common</name>
<description>druid-common</description>

View File

@ -1,69 +1,147 @@
---
layout: doc_page
---
Disclaimer: We are still in the process of finalizing the indexing service and these configs are prone to change at any time. We will announce when we feel the indexing service and the configurations described are stable.
The indexing service is a highly-available, distributed service that runs indexing related tasks. Indexing service [tasks](Tasks.html) create (and sometimes destroy) Druid [segments](Segments.html). The indexing service has a master/slave like architecture.
The indexing service is a distributed task/job queue. It accepts requests in the form of [Tasks](Tasks.html) and executes those tasks across a set of worker nodes. Worker capacity can be automatically adjusted based on the number of tasks pending in the system. The indexing service is highly available, has built in retry logic, and can backup per task logs in deep storage.
The indexing service is composed of three main components: a peon component that can run a single task, a middle manager component that manages peons, and an overlord component that manages task distribution to middle managers.
Overlords and middle managers may run on the same node or across multiple nodes while middle managers and peons always run on the same node.
The indexing service is composed of two main components, a coordinator node that manages task distribution and worker capacity, and worker nodes that execute tasks in separate JVMs.
Most Basic Getting Started Configuration
----------------------------------------
Run:
```
io.druid.cli.Main server overlord
```
With the following JVM configuration:
```
-server
-Xmx2g
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
-Ddruid.host=localhost
-Ddruid.port=8080
-Ddruid.service=overlord
-Ddruid.zk.service.host=localhost
-Ddruid.db.connector.connectURI=jdbc:mysql://localhost:3306/druid
-Ddruid.db.connector.user=druid
-Ddruid.db.connector.password=diurd
-Ddruid.selectors.indexing.serviceName=overlord
-Ddruid.indexer.runner.javaOpts="-server -Xmx1g"
-Ddruid.indexer.runner.startPort=8081
-Ddruid.indexer.fork.property.druid.computation.buffer.size=268435456
```
You can now submit simple indexing tasks to the indexing service.
<!--
Preamble
--------
The truth is, the indexing service is an experience that is difficult to characterize with words. When they asked me to write this preamble, I was taken aback. I wasnt quite sure what exactly to write or how to describe this… entity. I accepted the job, as much for the challenge and inner growth as the money, and took to the mountains for reflection. Six months later, I knew I had it, I was done and had achieved the next euphoric victory in the continuous struggle that plagues my life. But, enough about me. This is about the indexing service.
The indexing service is philosophical transcendence, an infallible truth that will shape your soul, mold your character, and define your reality. The indexing service is creating world peace, playing with puppies, unwrapping presents on Christmas morning, cradling a loved one, and beating Goro in Mortal Kombat for the first time. The indexing service is sustainable economic growth, global propensity, and a world of transparent financial transactions. The indexing service is a true belieber. The indexing service is panicking because you forgot you signed up for a course and the big exam is in a few minutes, only to wake up and realize it was all a dream. What is the indexing service? More like what isnt the indexing service. The indexing service is here and it is ready, but are you?
-->
Indexer Coordinator Node
------------------------
<!--
Indexing Service Overview
-------------------------
The indexer coordinator node exposes HTTP endpoints where tasks can be submitted by posting a JSON blob to specific endpoints. It can be started by launching IndexerCoordinatorMain.java. The indexer coordinator node can operate in local mode or remote mode. In local mode, the coordinator and worker run on the same host and port. In remote mode, worker processes run on separate hosts and ports.
We need one of these.
-->
Tasks can be submitted via POST requests to:
Overlord Node
-------------
The overlord node is responsible for accepting tasks, coordinating task distribution, creating locks around tasks, and returning statuses to callers.
#### Usage
Tasks are submitted to the overlord node in the form of JSON objects. Tasks can be submitted via POST requests to:
```
http://<COORDINATOR_IP>:<port>/druid/indexer/v1/task
http://<OVERLORD_IP>:<port>/druid/indexer/v1/task
```
Tasks can cancelled via POST requests to:
```
http://<COORDINATOR_IP>:<port>/druid/indexer/v1/task/{taskId}/shutdown
http://<OVERLORD_IP>:<port>/druid/indexer/v1/task/{taskId}/shutdown
```
Issuing the cancel request once sends a graceful shutdown request. Graceful shutdowns may not stop a task right away, but instead issue a safe stop command at a point deemed least impactful to the system. Issuing the cancel request twice in succession will kill 9 the task.
Issuing the cancel request will kill 9 the task.
Task statuses can be retrieved via GET requests to:
```
http://<COORDINATOR_IP>:<port>/druid/indexer/v1/task/{taskId}/status
http://<OVERLORD_IP>:<port>/druid/indexer/v1/task/{taskId}/status
```
Task segments can be retrieved via GET requests to:
```
http://<COORDINATOR_IP>:<port>/druid/indexer/v1/task/{taskId}/segments
http://<OVERLORD_IP>:<port>/druid/indexer/v1/task/{taskId}/segments
```
When a task is submitted, the coordinator creates a lock over the data source and interval of the task. The coordinator also stores the task in a MySQL database table. The database table is read at startup time to bootstrap any tasks that may have been submitted to the coordinator but may not yet have been executed.
#### Console
The coordinator also exposes a simple UI to show what tasks are currently running on what nodes at
The overlord console can be used to view pending tasks, running tasks, available workers, and recent worker creation and termination. The console can be accessed at:
```
http://<COORDINATOR_IP>:<port>/static/console.html
http://<OVERLORD_IP>:8080/console.html
```
#### Task Execution
The coordinator retrieves worker setup metadata from the Druid [MySQL](MySQL.html) config table. This metadata contains information about the version of workers to create, the maximum and minimum number of workers in the cluster at one time, and additional information required to automatically create workers.
Tasks are assigned to workers by creating entries under specific /tasks paths associated with a worker, similar to how the Druid coordinator node assigns segments to historical nodes. See [Worker Configuration](Indexing-Service#configuration-1). Once a worker picks up a task, it deletes the task entry and announces a task status under a /status path associated with the worker. Tasks are submitted to a worker until the worker hits capacity. If all workers in a cluster are at capacity, the indexer coordinator node automatically creates new worker resources.
#### Autoscaling
The Autoscaling mechanisms currently in place are tightly coupled with our deployment infrastructure but the framework should be in place for other implementations. We are highly open to new implementations or extensions of the existing mechanisms. In our own deployments, worker nodes are Amazon AWS EC2 nodes and they are provisioned to register themselves in a [galaxy](https://github.com/ning/galaxy) environment.
The Autoscaling mechanisms currently in place are tightly coupled with our deployment infrastructure but the framework should be in place for other implementations. We are highly open to new implementations or extensions of the existing mechanisms. In our own deployments, middle manager nodes are Amazon AWS EC2 nodes and they are provisioned to register themselves in a [galaxy](https://github.com/ning/galaxy) environment.
The Coordinator node controls the number of workers in the cluster according to a worker setup spec that is submitted via a POST request to the indexer at:
If autoscaling is enabled, new middle managers may be added when a task has been in pending state for too long. Middle managers may be terminated if they have not run any tasks for a period of time.
#### JVM Configuration
The overlord module requires the following basic configs to run in remote mode:
|Property|Description|Default|
|--------|-----------|-------|
|`druid.indexer.runner.type`|Choices "local" or "remote". Indicates whether tasks should be run locally or in a distributed environment.|local|
|`druid.indexer.storage.type`|Choices are "local" or "db". Indicates whether incoming tasks should be stored locally (in heap) or in a database. Storing incoming tasks in a database allows for tasks to be bootstrapped if the overlord should fail.|local|
The following configs only apply if the overlord is running in remote mode:
|Property|Description|Default|
|--------|-----------|-------|
|`druid.indexer.runner.taskAssignmentTimeout`|How long to wait after a task as been assigned to a middle manager before throwing an error.|PT5M|
|`druid.indexer.runner.minWorkerVersion`|The minimum middle manager version to send tasks to. |none|
|`druid.indexer.runner.compressZnodes`|Indicates whether or not the overlord should expect middle managers to compress Znodes.|false|
|`druid.indexer.runner.maxZnodeBytes`|The maximum size Znode in bytes that can be created in Zookeeper.|524288|
There are additional configs for autoscaling (if it is enabled):
|Property|Description|Default|
|--------|-----------|-------|
|`druid.indexer.autoscale.doAutoscale`|If set to "true" autoscaling will be enabled.|false|
|`druid.indexer.autoscale.provisionPeriod`|How often to check whether or not new middle managers should be added.|PT1M|
|`druid.indexer.autoscale.terminatePeriod`|How often to check when middle managers should be removed.|PT1H|
|`druid.indexer.autoscale.originTime`|The starting reference timestamp that the terminate period increments upon.|2012-01-01T00:55:00.000Z|
|`druid.indexer.autoscale.strategy`|Choices are "noop" or "ec2". Sets the strategy to run when autoscaling is required.|noop|
|`druid.indexer.autoscale.workerIdleTimeout`|How long can a worker be idle (not a run task) before it can be considered for termination.|PT10M|
|`druid.indexer.autoscale.maxScalingDuration`|How long the overlord will wait around for a middle manager to show up before giving up.|PT15M|
|`druid.indexer.autoscale.numEventsToTrack`|The number of autoscaling related events (node creation and termination) to track.|10|
|`druid.indexer.autoscale.pendingTaskTimeout`|How long a task can be in "pending" state before the overlord tries to scale up.|PT30S|
|`druid.indexer.autoscale.workerVersion`|If set, will only create nodes of set version during autoscaling. Overrides dynamic configuration. |null|
|`druid.indexer.autoscale.workerPort`|The port that middle managers will run on.|8080|
#### Dynamic Configuration
Overlord dynamic configuration is mainly for autoscaling. The overlord reads a worker setup spec as a JSON object from the Druid [MySQL](MySQL.html) config table. This object contains information about the version of middle managers to create, the maximum and minimum number of middle managers in the cluster at one time, and additional information required to automatically create middle managers.
The JSON object can be submitted to the overlord via a POST request at:
```
http://<COORDINATOR_IP>:<port>/druid/indexer/v1/worker/setup
@ -94,7 +172,7 @@ A sample worker setup spec is shown below:
}
```
Issuing a GET request at the same URL will return the current worker setup spec that is currently in place. The worker setup spec list above is just a sample and it is possible to write worker setup specs for other deployment environments. A description of the worker setup spec is shown below.
Issuing a GET request at the same URL will return the current worker setup spec that is currently in place. The worker setup spec list above is just a sample and it is possible to extend the code base for other deployment environments. A description of the worker setup spec is shown below.
|Property|Description|Default|
|--------|-----------|-------|
@ -104,109 +182,85 @@ Issuing a GET request at the same URL will return the current worker setup spec
|`nodeData`|A JSON object that contains metadata about new nodes to create.|none|
|`userData`|A JSON object that contains metadata about how the node should register itself on startup. This data is sent with node creation requests.|none|
For more information about configuring Auto-scaling, see [Auto-Scaling Configuration](https://github.com/metamx/druid/wiki/Indexing-Service#auto-scaling-configuration).
#### Running
```
io.druid.cli.Main server overlord
```
Note: When running the overlord in local mode, all middle manager and peon configurations must be provided as well.
MiddleManager Node
------------------
The middle manager node is a worker node that executes submitted tasks. Middle Managers forward tasks to peons that run in separate JVMs. Each peon is capable of running only one task at a time, however, a middle manager may have multiple peons.
#### JVM Configuration
Middle managers pass their configurations down to their child peons. The middle manager module requires the following configs:
|Property|Description|Default|
|--------|-----------|-------|
|`druid.worker.ip`|The IP of the worker.|localhost|
|`druid.worker.version`|Version identifier for the middle manager.|0|
|`druid.worker.capacity`|Maximum number of tasks the middle manager can accept.|Number of available processors - 1|
|`druid.indexer.runner.compressZnodes`|Indicates whether or not the middle managers should compress Znodes.|false|
|`druid.indexer.runner.maxZnodeBytes`|The maximum size Znode in bytes that can be created in Zookeeper.|524288|
|`druid.indexer.runner.taskDir`|Temporary intermediate directory used during task execution.|/tmp/persistent|
|`druid.indexer.runner.javaCommand`|Command required to execute java.|java|
|`druid.indexer.runner.javaOpts`|-X Java options to run the peon in its own JVM.|""|
|`druid.indexer.runner.classpath`|Java classpath for the peon.|System.getProperty("java.class.path")|
|`druid.indexer.runner.startPort`|The port that peons begin running on.|8080|
|`druid.indexer.runner.allowedPrefixes`|Whitelist of prefixes for configs that can be passed down to child peons.|"com.metamx", "druid", "io.druid", "user.timezone","file.encoding"|
#### Running
Indexer Coordinator nodes can be run using the `com.metamx.druid.indexing.coordinator.http.IndexerCoordinatorMain` class.
#### Configuration
Indexer Coordinator nodes require [basic service configuration](https://github.com/metamx/druid/wiki/Configuration#basic-service-configuration). In addition, there are several extra configurations that are required.
```
-Ddruid.zk.paths.indexer.announcementsPath=/druid/indexer/announcements
-Ddruid.zk.paths.indexer.leaderLatchPath=/druid/indexer/leaderLatchPath
-Ddruid.zk.paths.indexer.statusPath=/druid/indexer/status
-Ddruid.zk.paths.indexer.tasksPath=/druid/demo/indexer/tasks
-Ddruid.indexer.runner=remote
-Ddruid.indexer.taskDir=/mnt/persistent/task/
-Ddruid.indexer.configTable=sample_config
-Ddruid.indexer.workerSetupConfigName=worker_setup
-Ddruid.indexer.strategy=ec2
-Ddruid.indexer.hadoopWorkingPath=/tmp/druid-indexing
-Ddruid.indexer.logs.s3bucket=some_bucket
-Ddruid.indexer.logs.s3prefix=some_prefix
io.druid.cli.Main server middleManager
```
The indexing service requires some additional Zookeeper configs.
Peons
-----
Peons run a single task in a single JVM. Peons are a part of middle managers and should rarely (if ever) be run on their own.
#### JVM Configuration
Although peons inherit the configurations of their parent middle managers, explicit child peon configs can be set by prefixing them with:
```
druid.indexer.fork.property
```
Additional peon configs include:
|Property|Description|Default|
|--------|-----------|-------|
|`druid.zk.paths.indexer.announcementsPath`|The base path where workers announce themselves.|none|
|`druid.zk.paths.indexer.leaderLatchPath`|The base that coordinator nodes use to determine a leader.|none|
|`druid.zk.paths.indexer.statusPath`|The base path where workers announce task statuses.|none|
|`druid.zk.paths.indexer.tasksPath`|The base path where the coordinator assigns new tasks.|none|
|`druid.peon.mode`|Choices are "local" and "remote". Setting this to local means you intend to run the peon as a standalone node (Not recommended).|remote|
|`druid.indexer.baseDir`|Base temporary working directory.|/tmp|
|`druid.indexer.baseTaskDir`|Base temporary working directory for tasks.|/tmp/persistent/tasks|
|`druid.indexer.hadoopWorkingPath`|Temporary working directory for Hadoop tasks.|/tmp/druid-indexing|
|`druid.indexer.defaultRowFlushBoundary`|Highest row count before persisting to disk. Used for indexing generating tasks.|50000|
|`druid.indexer.task.chathandler.type`|Choices are "noop" and "announce". Certain tasks will use service discovery to announce an HTTP endpoint that events can be posted to.|noop|
Theres several additional configs that are required to run tasks.
If the peon is running in remote mode, there must be an overlord up and running. Running peons in remote mode require the following configurations:
|Property|Description|Default|
|--------|-----------|-------|
|`druid.indexer.runner`|Indicates whether tasks should be run locally or in a distributed environment. "local" or "remote".|local|
|`druid.indexer.taskDir`|Intermediate temporary directory that tasks may use.|none|
|`druid.indexer.configTable`|The MySQL config table where misc configs live.|none|
|`druid.indexer.strategy`|The autoscaling strategy to use.|noop|
|`druid.indexer.hadoopWorkingPath`|Intermediate temporary hadoop working directory that certain index tasks may use.|none|
|`druid.indexer.logs.s3bucket`|S3 bucket to store logs.|none|
|`druid.indexer.logs.s3prefix`|S3 key prefix to store logs.|none|
#### Console
The indexer console can be used to view pending tasks, running tasks, available workers, and recent worker creation and termination. The console can be accessed at:
```
http://<COORDINATOR_IP>:8080/static/console.html
```
Worker Node
-----------
The worker node executes submitted tasks. Workers run tasks in separate JVMs.
|`druid.peon.taskActionClient.retry.minWait`|The minimum retry time to communicate with overlord.|PT1M|
|`druid.peon.taskActionClient.retry.maxWait`|The maximum retry time to communicate with overlord.|PT10M|
|`druid.peon.taskActionClient.retry.maxRetryCount`|The maximum number of retries to communicate with overlord.|10|
#### Running
Worker nodes can be run using the `com.metamx.druid.indexing.worker.http.WorkerMain` class. Worker nodes can automatically be created by the Indexer Coordinator as part of autoscaling.
#### Configuration
Worker nodes require [basic service configuration](https://github.com/metamx/druid/wiki/Configuration#basic-service-configuration). In addition, there are several extra configurations that are required.
The peon should very rarely ever be run independent of the middle manager.
```
-Ddruid.worker.version=0
-Ddruid.worker.capacity=3
-Ddruid.indexer.threads=3
-Ddruid.indexer.taskDir=/mnt/persistent/task/
-Ddruid.indexer.hadoopWorkingPath=/tmp/druid-indexing
-Ddruid.worker.coordinatorService=druid:sample_cluster:indexer
-Ddruid.indexer.fork.hostpattern=<IP>:%d
-Ddruid.indexer.fork.startport=8080
-Ddruid.indexer.fork.main=com.metamx.druid.indexing.worker.executor.ExecutorMain
-Ddruid.indexer.fork.opts="-server -Xmx1g -Xms1g -XX:NewSize=256m -XX:MaxNewSize=256m"
-Ddruid.indexer.fork.property.druid.service=druid/sample_cluster/executor
# These configs are the same configs you would set for basic service configuration, just with a different prefix
-Ddruid.indexer.fork.property.druid.monitoring.monitorSystem=false
-Ddruid.indexer.fork.property.druid.computation.buffer.size=268435456
-Ddruid.indexer.fork.property.druid.indexer.taskDir=/mnt/persistent/task/
-Ddruid.indexer.fork.property.druid.processing.formatString=processing-%s
-Ddruid.indexer.fork.property.druid.processing.numThreads=1
-Ddruid.indexer.fork.property.druid.server.maxSize=0
-Ddruid.indexer.fork.property.druid.request.logging.dir=request_logs/
io.druid.cli.Main internal peon <task_file> <status_file>
```
Many of the configurations for workers are similar to those for basic service configuration":https://github.com/metamx/druid/wiki/Configuration\#basic-service-configuration, but with a different config prefix. Below we describe the unique worker configs.
The task file contains the task JSON object.
The status file indicates where the task status will be output.
|Property|Description|Default|
|--------|-----------|-------|
|`druid.worker.version`|Version identifier for the worker.|0|
|`druid.worker.capacity`|Maximum number of tasks the worker can accept.|1|
|`druid.indexer.threads`|Number of processing threads per worker.|1|
|`druid.worker.coordinatorService`|Name of the indexer coordinator used for service discovery.|none|
|`druid.indexer.fork.hostpattern`|The format of the host name.|none|
|`druid.indexer.fork.startport`|Port in which child JVM starts from.|none|
|`druid.indexer.fork.opts`|JVM options for child JVMs.|none|
Tasks
-----
See [Tasks](Tasks.html).

View File

@ -1,71 +1,260 @@
---
layout: doc_page
---
Tasks are run on workers and always operate on a single datasource. Once an indexer coordinator node accepts a task, a lock is created for the datasource and interval specified in the task. Tasks do not need to explicitly release locks, they are released upon task completion. Tasks may potentially release locks early if they desire. Tasks ids are unique by naming them using UUIDs or the timestamp in which the task was created. Tasks are also part of a "task group", which is a set of tasks that can share interval locks.
Tasks are run on middle managers and always operate on a single data source.
There are several different types of tasks.
Append Task
-----------
Segment Creation Tasks
----------------------
#### Index Task
The Index Task is a simpler variation of the Index Hadoop task that is designed to be used for smaller data sets. The task executes within the indexing service and does not require an external Hadoop setup to use. The grammar of the index task is as follows:
```
{
"type" : "index",
"dataSource" : "example",
"granularitySpec" : {
"type" : "uniform",
"gran" : "DAY",
"intervals" : [ "2010/2020" ]
},
"aggregators" : [ {
"type" : "count",
"name" : "count"
}, {
"type" : "doubleSum",
"name" : "value",
"fieldName" : "value"
} ],
"firehose" : {
"type" : "local",
"baseDir" : "/tmp/data/json",
"filter" : "sample_data.json",
"parser" : {
"timestampSpec" : {
"column" : "timestamp"
},
"data" : {
"format" : "json",
"dimensions" : [ "dim1", "dim2", "dim3" ]
}
}
}
}
```
|property|description|required?|
|--------|-----------|---------|
|type|The task type, this should always be "index".|yes|
|id|The task ID.|no|
|granularitySpec|See [granularitySpec](Tasks.html#Granularity-Spec)|yes|
|spatialDimensions|Dimensions to build spatial indexes over. See [Spatial-Indexing](Spatial-Indexing.html)|no|
|aggregators|The metrics to aggregate in the data set. For more info, see [Aggregations](Aggregations.html)|yes|
|indexGranularity|The rollup granularity for timestamps.|no|
|targetPartitionSize|Used in sharding. Determines how many rows are in each segment.|no|
|firehose|The input source of data. For more info, see [Firehose](Firehose.html)|yes|
|rowFlushBoundary|Used in determining when intermediate persist should occur to disk.|no|
#### Index Hadoop Task
The Hadoop Index Task is used to index larger data sets that require the parallelization and processing power of a Hadoop cluster.
```
{
"type" : "index_hadoop",
"config": <Hadoop index config>
}
```
|property|description|required?|
|--------|-----------|---------|
|type|The task type, this should always be "index_hadoop".|yes|
|config|See [Batch Ingestion](Batch-ingestion.html)|yes|
#### Realtime Index Task
The indexing service can also run real-time tasks. These tasks effectively transform a middle manager into a real-time node. We introduced real-time tasks as a way to programmatically add new real-time data sources without needing to manually add nodes. The grammar for the real-time task is as follows:
```
{
"type" : "index_realtime",
"id": "example,
"resource": {
"availabilityGroup" : "someGroup",
"requiredCapacity" : 1
},
"schema": {
"dataSource": "dataSourceName",
"aggregators": [
{
"type": "count",
"name": "events"
},
{
"type": "doubleSum",
"name": "outColumn",
"fieldName": "inColumn"
}
],
"indexGranularity": "minute",
"shardSpec": {
"type": "none"
}
},
"firehose": {
"type": "kafka-0.7.2",
"consumerProps": {
"zk.connect": "zk_connect_string",
"zk.connectiontimeout.ms": "15000",
"zk.sessiontimeout.ms": "15000",
"zk.synctime.ms": "5000",
"groupid": "consumer-group",
"fetch.size": "1048586",
"autooffset.reset": "largest",
"autocommit.enable": "false"
},
"feed": "your_kafka_topic",
"parser": {
"timestampSpec": {
"column": "timestamp",
"format": "iso"
},
"data": {
"format": "json"
},
"dimensionExclusions": [
"value"
]
}
},
"fireDepartmentConfig": {
"maxRowsInMemory": 500000,
"intermediatePersistPeriod": "PT10m"
},
"windowPeriod": "PT10m",
"segmentGranularity": "hour",
"rejectionPolicy": {
"type": "messageTime"
}
}
```
Id:
The ID of the task. Not required.
Resource:
A JSON object used for high availability purposes. Not required.
|Field|Type|Description|Required|
|-----|----|-----------|--------|
|availabilityGroup|String|An uniqueness identifier for the task. Tasks with the same availability group will always run on different middle managers. Used mainly for replication. |yes|
|requiredCapacity|Integer|How much middle manager capacity this task will take.|yes|
Schema:
See [Schema](Realtime.html#Schema).
Fire Department Config:
See [Config](Realtime.html#Config).
Firehose:
See [Firehose](Firehose.html).
Window Period:
See [Realtime](Realtime.html).
Segment Granularity:
See [Realtime](Realtime.html).
Rejection Policy:
See [Realtime](Realtime.html).
Segment Merging Tasks
---------------------
#### Append Task
Append tasks append a list of segments together into a single segment (one after the other). The grammar is:
{
"id": <task_id>,
"dataSource": <task_datasource>,
"segments": <JSON list of DataSegment objects to append>
}
```
{
"id": <task_id>,
"dataSource": <task_datasource>,
"segments": <JSON list of DataSegment objects to append>
}
```
Merge Task
----------
#### Merge Task
Merge tasks merge a list of segments together. Any common timestamps are merged. The grammar is:
{
"id": <task_id>,
"dataSource": <task_datasource>,
"segments": <JSON list of DataSegment objects to append>
}
```
{
"id": <task_id>,
"dataSource": <task_datasource>,
"segments": <JSON list of DataSegment objects to append>
}
```
Delete Task
-----------
Segment Destroying Tasks
------------------------
#### Delete Task
Delete tasks create empty segments with no data. The grammar is:
{
"id": <task_id>,
"dataSource": <task_datasource>,
"segments": <JSON list of DataSegment objects to append>
}
```
{
"id": <task_id>,
"dataSource": <task_datasource>,
"segments": <JSON list of DataSegment objects to append>
}
```
Kill Task
---------
#### Kill Task
Kill tasks delete all information about a segment and removes it from deep storage. Killable segments must be disabled (used==0) in the Druid segment table. The available grammar is:
{
"id": <task_id>,
"dataSource": <task_datasource>,
"segments": <JSON list of DataSegment objects to append>
}
```
{
"id": <task_id>,
"dataSource": <task_datasource>,
"segments": <JSON list of DataSegment objects to append>
}
```
Index Task
----------
Misc. Tasks
-----------
Index Partitions Task
---------------------
#### Version Converter Task
Index Generator Task
--------------------
These tasks convert segments from an existing older index version to the latest index version. The available grammar is:
Index Hadoop Task
-----------------
```
{
"id": <task_id>,
"groupId" : <task_group_id>,
"dataSource": <task_datasource>,
"interval" : <segment_interval>,
"segment": <JSON DataSegment object to convert>
}
```
Index Realtime Task
-------------------
#### Noop Task
Version Converter Task
----------------------
These tasks start, sleep for a time and are used only for testing. The available grammar is:
Version Converter SubTask
-------------------------
```
{
"id": <optional_task_id>,
"interval" : <optional_segment_interval>,
"runTime" : <optional_millis_to_sleep>,
"firehose": <optional_firehose_to_test_connect>
}
```
Locking
-------
Once an overlord node accepts a task, a lock is created for the data source and interval specified in the task. Tasks do not need to explicitly release locks, they are released upon task completion. Tasks may potentially release locks early if they desire. Tasks ids are unique by naming them using UUIDs or the timestamp in which the task was created. Tasks are also part of a "task group", which is a set of tasks that can share interval locks.

View File

@ -0,0 +1,249 @@
---
layout: doc_page
---
In our last [tutorial](Tutorial:-The-Druid-Cluster.html), we setup a complete Druid cluster. We created all the Druid dependencies and loaded some batched data. Druid shards data into self-contained chunks known as [segments](Segments.html). Segments are the fundamental unit of storage in Druid and all Druid nodes only understand segments.
In this tutorial, we will learn about batch ingestion (as opposed to real-time ingestion) and how to create segments using the final piece of the Druid Cluster, the [indexing service](Indexing-Service.html). The indexing service is a standalone service that accepts [tasks](Tasks.html) in the form of POST requests. The output of most tasks are segments.
About the data
--------------
The data source we'll be working with is Wikipedia edits once again. The data schema is the same as the previous tutorials:
Dimensions (things to filter on):
```json
"page"
"language"
"user"
"unpatrolled"
"newPage"
"robot"
"anonymous"
"namespace"
"continent"
"country"
"region"
"city"
```
Metrics (things to aggregate over):
```json
"count"
"added"
"delta"
"deleted"
```
Setting Up
----------
At this point, you should already have Druid downloaded and are comfortable with running a Druid cluster locally. If you are not, stop here and familiarize yourself with the first two tutorials.
Let's start from our usual starting point in the tarball directory.
Segments require data, so before we can build a Druid segment, we are going to need some raw data. Make sure that the following file exists:
```
examples/indexing/wikipedia_data.json
```
Open the file and make sure the following events exist:
```
{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Dingo Land", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Vodka Land", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "cancer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
```
There are five data points spread across the day of 2013-08-31. Talk about big data right? Thankfully, we don't need a ton of data to introduce how batch ingestion works.
In order to ingest and query this data, we are going to need to run a historical node, a coordinator node, and an indexing service to run the batch ingestion.
#### Starting a Local Indexing Service
The simplest indexing service we can start up is to run an [overlord](Indexing-Service.html) node in local mode. You can do so by issuing:
```bash
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/overlord io.druid.cli.Main server overlord
```
The overlord configurations should already exist in:
```
config/overlord/runtime.properties
```
The configurations for the overlord node are as follows:
```
druid.host=localhost
druid.port=8087
druid.service=overlord
druid.zk.service.host=localhost
druid.db.connector.connectURI=jdbc:mysql://localhost:3306/druid
druid.db.connector.user=druid
druid.db.connector.password=diurd
druid.selectors.indexing.serviceName=overlord
druid.indexer.runner.javaOpts="-server -Xmx1g"
druid.indexer.runner.startPort=8088
druid.indexer.fork.property.druid.computation.buffer.size=268435456
```
If you are interested in reading more about these configurations, see [here](Indexing-Service.html).
When the overlord node is ready for tasks, you should see a message like the following:
```
013-10-09 21:30:32,817 INFO [Thread-14] io.druid.indexing.overlord.TaskQueue - Waiting for work...
```
#### Starting Other Nodes
Just in case you forgot how, let's start up the other nodes we require:
Coordinator node:
```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/coordinator io.druid.cli.Main server coordinator
```
Historical node:
```bash
java -Xmx256m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath lib/*:config/historical io.druid.cli.Main server historical
```
Note: Historical, real-time and broker nodes share the same query interface. Hence, we do not explicitly need a broker node for this tutorial. All queries can go against the historical node directly.
Once all the nodes are up and running, we are ready to index some data.
Indexing the Data
-----------------
To index the data and build a Druid segment, we are going to need to submit a task to the indexing service. This task should already exist:
```
examples/indexing/index_task.json
```
Open up the file to see the following:
```
{
"type" : "index",
"dataSource" : "wikipedia",
"granularitySpec" : {
"type" : "uniform",
"gran" : "DAY",
"intervals" : [ "2013-08-31/2013-09-01" ]
},
"aggregators" : [{
"type" : "count",
"name" : "edit_count"
}, {
"type" : "doubleSum",
"name" : "added",
"fieldName" : "added"
}, {
"type" : "doubleSum",
"name" : "deleted",
"fieldName" : "deleted"
}, {
"type" : "doubleSum",
"name" : "delta",
"fieldName" : "delta"
}],
"firehose" : {
"type" : "local",
"baseDir" : "examples/indexing",
"filter" : "wikipedia_data.json",
"parser" : {
"timestampSpec" : {
"column" : "timestamp"
},
"data" : {
"format" : "json",
"dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
}
}
}
}
```
Okay, so what is happening here? The "type" field indicates the type of task we plan to run. In this case, it is a simple "index" task. The "granularitySpec" indicates that we are building a daily segment for 2013-08-31 to 2013-09-01. Next, the "aggregators" indicate which fields in our data set we plan to build metric columns for. The "fieldName" corresponds to the metric name in the raw data. The "name" corresponds to what our metric column is actually going to be called in the segment. Finally, we have a local "firehose" that is going to read data from disk. We tell the firehose where our data is located and the types of files we are looking to ingest. In our case, we only have a single data file.
Let's send our task to the indexing service now:
```
curl -X 'POST' -H 'Content-Type:application/json' -d @examples/indexing/wikipedia_index_task.json localhost:8087/druid/indexer/v1/task
```
Issuing the request should return a task ID like so:
```
fjy$ curl -X 'POST' -H 'Content-Type:application/json' -d @examples/indexing/wikipedia_index_task.json localhost:8087/druid/indexer/v1/task
{"task":"index_wikipedia_2013-10-09T21:30:32.802Z"}
fjy$
```
In your indexing service logs, you should see the following:
````
2013-10-09 21:41:41,150 INFO [qtp300448720-21] io.druid.indexing.overlord.HeapMemoryTaskStorage - Inserting task index_wikipedia_2013-10-09T21:41:41.147Z with status: TaskStatus{id=index_wikipedia_2013-10-09T21:41:41.147Z, status=RUNNING, duration=-1}
2013-10-09 21:41:41,151 INFO [qtp300448720-21] io.druid.indexing.overlord.TaskLockbox - Created new TaskLockPosse: TaskLockPosse{taskLock=TaskLock{groupId=index_wikipedia_2013-10-09T21:41:41.147Z, dataSource=wikipedia, interval=2013-08-31T00:00:00.000Z/2013-09-01T00:00:00.000Z, version=2013-10-09T21:41:41.151Z}, taskIds=[]}
...
013-10-09 21:41:41,215 INFO [pool-6-thread-1] io.druid.indexing.overlord.ForkingTaskRunner - Logging task index_wikipedia_2013-10-09T21:41:41.147Z_generator_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_0 output to: /tmp/persistent/index_wikipedia_2013-10-09T21:41:41.147Z_generator_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_0/b5099fdb-d6b0-4b81-9053-b2af70336a7e/log
2013-10-09 21:41:45,017 INFO [qtp300448720-22] io.druid.indexing.common.actions.LocalTaskActionClient - Performing action for task[index_wikipedia_2013-10-09T21:41:41.147Z_generator_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_0]: LockListAction{}
````
After a few seconds, the task should complete and you should see in the indexing service logs:
```
2013-10-09 21:41:45,765 INFO [pool-6-thread-1] io.druid.indexing.overlord.exec.TaskConsumer - Received SUCCESS status for task: IndexGeneratorTask{id=index_wikipedia_2013-10-09T21:41:41.147Z_generator_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_0, type=index_generator, dataSource=wikipedia, interval=Optional.of(2013-08-31T00:00:00.000Z/2013-09-01T00:00:00.000Z)}
```
Congratulations! The segment has completed building. Once a segment is built, a segment metadata entry is created in your MySQL table. The coordinator compares what is in the segment metadata table with what is in the cluster. A new entry in the metadata table will cause the coordinator to load the new segment in a minute or so.
You should see the following logs on the coordinator:
```
2013-10-09 21:41:54,368 INFO [Coordinator-Exec--0] io.druid.server.coordinator.DruidCoordinatorLogger - [_default_tier] : Assigned 1 segments among 1 servers
2013-10-09 21:41:54,369 INFO [Coordinator-Exec--0] io.druid.server.coordinator.DruidCoordinatorLogger - Load Queues:
2013-10-09 21:41:54,369 INFO [Coordinator-Exec--0] io.druid.server.coordinator.DruidCoordinatorLogger - Server[localhost:8081, historical, _default_tier] has 1 left to load, 0 left to drop, 4,477 bytes queued, 4,477 bytes served.
```
These logs indicate that the coordinator has assigned our new segment to the historical node to download and serve. If you look at the historical node logs, you should see:
```
2013-10-09 21:41:54,369 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Loading segment wikipedia_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_2013-10-09T21:41:41.151Z
2013-10-09 21:41:54,369 INFO [ZkCoordinator-0] io.druid.segment.loading.LocalDataSegmentPuller - Unzipping local file[/tmp/druid/localStorage/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2013-10-09T21:41:41.151Z/0/index.zip] to [/tmp/druid/indexCache/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2013-10-09T21:41:41.151Z/0]
2013-10-09 21:41:54,370 INFO [ZkCoordinator-0] io.druid.utils.CompressionUtils - Unzipping file[/tmp/druid/localStorage/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2013-10-09T21:41:41.151Z/0/index.zip] to [/tmp/druid/indexCache/wikipedia/2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z/2013-10-09T21:41:41.151Z/0]
2013-10-09 21:41:54,380 INFO [ZkCoordinator-0] io.druid.server.coordination.SingleDataSegmentAnnouncer - Announcing segment[wikipedia_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_2013-10-09T21:41:41.151Z] to path[/druid/servedSegments/localhost:8081/wikipedia_2013-08-31T00:00:00.000Z_2013-09-01T00:00:00.000Z_2013-10-09T21:41:41.151Z]
```
Once the segment is announced the segment is queryable. Now you should be able to query the data.
Issuing a [TimeBoundaryQuery](TimeBoundaryQuery.html) should yield:
```
[ {
"timestamp" : "2013-08-31T01:02:33.000Z",
"result" : {
"minTime" : "2013-08-31T01:02:33.000Z",
"maxTime" : "2013-08-31T12:41:27.000Z"
}
} ]
```
Next Steps
----------
This tutorial covered ingesting a small batch data set and loading it into Druid. In [Loading Your Data Part 2](Tutorial-Loading-Your-Data-Part-2.html), we will cover how to ingest data using Hadoop for larger data sets.
Additional Information
----------------------
Getting data into Druid can definitely be difficult for first time users. Please don't hesitate to ask questions in our IRC channel or on our [google groups page](http://www.groups.google.com/forum/#!forum/druid-development).

View File

@ -1,7 +1,7 @@
---
layout: doc_page
---
Welcome back! In our first [tutorial](Tutorial:-A-First-Look-at-Druid), we introduced you to the most basic Druid setup: a single realtime node. We streamed in some data and queried it. Realtime nodes collect very recent data and periodically hand that data off to the rest of the Druid cluster. Some questions about the architecture must naturally come to mind. What does the rest of Druid cluster look like? How does Druid load available static data?
Welcome back! In our first [tutorial](Tutorial:-A-First-Look-at-Druid.html), we introduced you to the most basic Druid setup: a single realtime node. We streamed in some data and queried it. Realtime nodes collect very recent data and periodically hand that data off to the rest of the Druid cluster. Some questions about the architecture must naturally come to mind. What does the rest of Druid cluster look like? How does Druid load available static data?
This tutorial will hopefully answer these questions!

View File

@ -0,0 +1,5 @@
{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Dingo Land", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
{"timestamp": "2013-08-31T07:11:21Z", "page": "Cherno Alpha", "language" : "ru", "user" : "masterYi", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"article", "continent":"Asia", "country":"Russia", "region":"Vodka Land", "city":"Moscow", "added": 123, "deleted": 12, "delta": 111}
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "cancer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}

View File

@ -0,0 +1,39 @@
{
"type" : "index",
"dataSource" : "wikipedia",
"granularitySpec" : {
"type" : "uniform",
"gran" : "DAY",
"intervals" : [ "2013-08-31/2013-09-01" ]
},
"aggregators" : [{
"type" : "count",
"name" : "edit_count"
}, {
"type" : "doubleSum",
"name" : "added",
"fieldName" : "added"
}, {
"type" : "doubleSum",
"name" : "deleted",
"fieldName" : "deleted"
}, {
"type" : "doubleSum",
"name" : "delta",
"fieldName" : "delta"
}],
"firehose" : {
"type" : "local",
"baseDir" : "examples/indexing/",
"filter" : "wikipedia_data.json",
"parser" : {
"timestampSpec" : {
"column" : "timestamp"
},
"data" : {
"format" : "json",
"dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
}
}
}
}

View File

@ -0,0 +1,14 @@
druid.host=localhost
druid.port=8087
druid.service=overlord
druid.zk.service.host=localhost
druid.db.connector.connectURI=jdbc:mysql://localhost:3306/druid
druid.db.connector.user=druid
druid.db.connector.password=diurd
druid.selectors.indexing.serviceName=overlord
druid.indexer.runner.javaOpts="-server -Xmx1g"
druid.indexer.runner.startPort=8088
druid.indexer.fork.property.druid.computation.buffer.size=268435456

View File

@ -20,7 +20,7 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-examples</artifactId>
<name>druid-examples</name>
<description>druid-examples</description>
@ -33,17 +33,12 @@
<dependencies>
<dependency>
<groupId>com.metamx.druid</groupId>
<artifactId>druid-realtime</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-server</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-common</artifactId>
<version>${project.parent.version}</version>
</dependency>

View File

@ -20,7 +20,7 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-hdfs-storage</artifactId>
<name>druid-hdfs-storage</name>
<description>druid-hdfs-storage</description>

View File

@ -20,7 +20,7 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-indexing-hadoop</artifactId>
<name>druid-indexing-hadoop</name>
<description>Druid Indexing Hadoop</description>
@ -33,7 +33,7 @@
<dependencies>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-server</artifactId>
<version>${project.parent.version}</version>
</dependency>

View File

@ -20,7 +20,7 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-indexing-service</artifactId>
<name>druid-indexing-service</name>
<description>druid-indexing-service</description>
@ -33,25 +33,20 @@
<dependencies>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-common</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-server</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-indexing-hadoop</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>com.metamx.druid</groupId>
<artifactId>druid-realtime</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>com.metamx</groupId>

View File

@ -44,6 +44,7 @@ import io.druid.guice.annotations.Self;
import io.druid.indexing.common.TaskStatus;
import io.druid.indexing.common.task.Task;
import io.druid.indexing.overlord.config.ForkingTaskRunnerConfig;
import io.druid.indexing.worker.config.WorkerConfig;
import io.druid.server.DruidNode;
import io.druid.tasklogs.TaskLogPusher;
import io.druid.tasklogs.TaskLogStreamer;
@ -84,6 +85,7 @@ public class ForkingTaskRunner implements TaskRunner, TaskLogStreamer
@Inject
public ForkingTaskRunner(
ForkingTaskRunnerConfig config,
WorkerConfig workerConfig,
Properties props,
TaskLogPusher taskLogPusher,
ObjectMapper jsonMapper,
@ -96,7 +98,7 @@ public class ForkingTaskRunner implements TaskRunner, TaskLogStreamer
this.jsonMapper = jsonMapper;
this.node = node;
this.exec = MoreExecutors.listeningDecorator(Executors.newFixedThreadPool(config.maxForks()));
this.exec = MoreExecutors.listeningDecorator(Executors.newFixedThreadPool(workerConfig.getCapacity()));
}
@Override

View File

@ -23,6 +23,7 @@ import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.inject.Inject;
import io.druid.guice.annotations.Self;
import io.druid.indexing.overlord.config.ForkingTaskRunnerConfig;
import io.druid.indexing.worker.config.WorkerConfig;
import io.druid.server.DruidNode;
import io.druid.tasklogs.TaskLogPusher;
@ -33,6 +34,7 @@ import java.util.Properties;
public class ForkingTaskRunnerFactory implements TaskRunnerFactory
{
private final ForkingTaskRunnerConfig config;
private final WorkerConfig workerConfig;
private final Properties props;
private final ObjectMapper jsonMapper;
private final TaskLogPusher persistentTaskLogs;
@ -41,12 +43,14 @@ public class ForkingTaskRunnerFactory implements TaskRunnerFactory
@Inject
public ForkingTaskRunnerFactory(
final ForkingTaskRunnerConfig config,
final WorkerConfig workerConfig,
final Properties props,
final ObjectMapper jsonMapper,
final TaskLogPusher persistentTaskLogs,
@Self DruidNode node
) {
this.config = config;
this.workerConfig = workerConfig;
this.props = props;
this.jsonMapper = jsonMapper;
this.persistentTaskLogs = persistentTaskLogs;
@ -56,6 +60,6 @@ public class ForkingTaskRunnerFactory implements TaskRunnerFactory
@Override
public TaskRunner build()
{
return new ForkingTaskRunner(config, props, persistentTaskLogs, jsonMapper, node);
return new ForkingTaskRunner(config, workerConfig, props, persistentTaskLogs, jsonMapper, node);
}
}

View File

@ -748,7 +748,7 @@ public class RemoteTaskRunner implements TaskRunner, TaskLogStreamer
);
sortedWorkers.addAll(zkWorkers.values());
final String configMinWorkerVer = workerSetupData.get().getMinVersion();
final String minWorkerVer = configMinWorkerVer == null ? config.getWorkerVersion() : configMinWorkerVer;
final String minWorkerVer = configMinWorkerVer == null ? config.getMinWorkerVersion() : configMinWorkerVer;
for (ZkWorker zkWorker : sortedWorkers) {
if (zkWorker.canRunTask(task) && zkWorker.isValidVersion(minWorkerVer)) {

View File

@ -1,37 +0,0 @@
/*
* Druid - a distributed column store.
* Copyright (C) 2012, 2013 Metamarkets Group Inc.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
*/
package io.druid.indexing.overlord.config;
import org.skife.config.Config;
import org.skife.config.Default;
import org.skife.config.DefaultNull;
/**
*/
public abstract class EC2AutoScalingStrategyConfig
{
@Config("druid.indexer.worker.port")
@Default("8080")
public abstract String getWorkerPort();
@Config("druid.indexer.worker.version")
@DefaultNull
public abstract String getWorkerVersion();
}

View File

@ -29,10 +29,6 @@ import java.util.List;
public class ForkingTaskRunnerConfig
{
@JsonProperty
@Min(1)
private int maxForks = 1;
@JsonProperty
@NotNull
private String taskDir = "/tmp/persistent";
@ -69,11 +65,6 @@ public class ForkingTaskRunnerConfig
"file.encoding"
);
public int maxForks()
{
return maxForks;
}
public String getTaskDir()
{
return taskDir;

View File

@ -37,7 +37,7 @@ public class RemoteTaskRunnerConfig
private boolean compressZnodes = false;
@JsonProperty
private String workerVersion = null;
private String minWorkerVersion = null;
@JsonProperty
@Min(10 * 1024)
@ -53,9 +53,9 @@ public class RemoteTaskRunnerConfig
return compressZnodes;
}
public String getWorkerVersion()
public String getMinWorkerVersion()
{
return workerVersion;
return minWorkerVersion;
}
public long getMaxZnodeBytes()

View File

@ -30,11 +30,11 @@ public class WorkerConfig
{
@JsonProperty
@NotNull
private String ip = null;
private String ip = "localhost";
@JsonProperty
@NotNull
private String version = null;
private String version = "0";
@JsonProperty
@Min(1)

View File

@ -1,44 +0,0 @@
/*
* Druid - a distributed column store.
* Copyright (C) 2012, 2013 Metamarkets Group Inc.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
*/
package io.druid.indexing.worker.executor;
import com.fasterxml.jackson.databind.ObjectMapper;
import io.druid.indexing.overlord.TaskRunner;
import java.io.File;
public class ExecutorLifecycleFactory
{
private final File taskFile;
private final File statusFile;
public ExecutorLifecycleFactory(File taskFile, File statusFile)
{
this.taskFile = taskFile;
this.statusFile = statusFile;
}
public ExecutorLifecycle build(TaskRunner taskRunner, ObjectMapper jsonMapper)
{
return new ExecutorLifecycle(
new ExecutorLifecycleConfig().setTaskFile(taskFile).setStatusFile(statusFile), taskRunner, jsonMapper
);
}
}

View File

@ -45,7 +45,7 @@ public class TestRemoteTaskRunnerConfig extends RemoteTaskRunnerConfig
}
@Override
public String getWorkerVersion()
public String getMinWorkerVersion()
{
return "";
}

View File

@ -20,7 +20,7 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-processing</artifactId>
<name>druid-processing</name>
<description>A module that is everything required to understands Druid Segments</description>
@ -33,7 +33,7 @@
<dependencies>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-common</artifactId>
<version>${project.parent.version}</version>
</dependency>

View File

@ -20,7 +20,7 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-s3-extensions</artifactId>
<name>druid-s3-extensions</name>
<description>druid-s3-extensions</description>

View File

@ -21,7 +21,7 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-server</artifactId>
<name>druid-server</name>
<description>Druid Server</description>
@ -34,7 +34,7 @@
<dependencies>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-processing</artifactId>
<version>${project.parent.version}</version>
</dependency>
@ -221,7 +221,7 @@
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-processing</artifactId>
<version>${project.parent.version}</version>
<type>test-jar</type>

View File

@ -20,7 +20,7 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-services</artifactId>
<name>druid-services</name>
<description>druid-services</description>
@ -33,37 +33,32 @@
<dependencies>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-examples</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-indexing-hadoop</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-indexing-service</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>com.metamx.druid</groupId>
<artifactId>druid-realtime</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-server</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-examples</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>com.metamx.druid</groupId>
<groupId>io.druid</groupId>
<artifactId>druid-indexing-service</artifactId>
<version>${project.parent.version}</version>
</dependency>

View File

@ -42,6 +42,13 @@
</includes>
<outputDirectory>config/historical</outputDirectory>
</fileSet>
<fileSet>
<directory>../examples/config/overlord</directory>
<includes>
<include>*</include>
</includes>
<outputDirectory>config/overlord</outputDirectory>
</fileSet>
<fileSet>
<directory>../examples/bin</directory>
<includes>

View File

@ -122,7 +122,7 @@ public class CliPeon extends GuiceRunnable
binder.bind(TaskToolboxFactory.class).in(LazySingleton.class);
JsonConfigProvider.bind(binder, "druid.indexer.task", TaskConfig.class);
JsonConfigProvider.bind(binder, "druid.worker.taskActionClient.retry", RetryPolicyConfig.class);
JsonConfigProvider.bind(binder, "druid.peon.taskActionClient.retry", RetryPolicyConfig.class);
configureTaskActionClient(binder);

View File

@ -24,6 +24,7 @@ import io.airlift.command.Cli;
import io.airlift.command.Help;
import io.airlift.command.ParseException;
import io.druid.cli.convert.ConvertProperties;
import io.druid.cli.validate.DruidJsonValidator;
import io.druid.initialization.Initialization;
import io.druid.server.initialization.ExtensionsConfig;
@ -58,7 +59,7 @@ public class Main
builder.withGroup("tools")
.withDescription("Various tools for working with Druid")
.withDefaultCommand(Help.class)
.withCommands(ConvertProperties.class);
.withCommands(ConvertProperties.class, DruidJsonValidator.class);
builder.withGroup("index")
.withDescription("Run indexing for druid")

View File

@ -85,7 +85,7 @@ public class ConvertProperties implements Runnable
new Rename("druid.indexer.fork.startport", "druid.indexer.runner.startPort"),
new Rename("druid.indexer.properties.prefixes", "druid.indexer.runner.allowedPrefixes"),
new Rename("druid.indexer.taskAssignmentTimeoutDuration", "druid.indexer.runner.taskAssignmentTimeout"),
new Rename("druid.indexer.worker.version", "druid.indexer.runner.workerVersion"),
new Rename("druid.indexer.worker.version", "druid.indexer.runner.minWorkerVersion"),
new Rename("druid.zk.maxNumBytes", "druid.indexer.runner.maxZnodeBytes"),
new Rename("druid.indexer.provisionResources.duration", "druid.indexer.autoscale.provisionPeriod"),
new Rename("druid.indexer.terminateResources.duration", "druid.indexer.autoscale.terminatePeriod"),

View File

@ -0,0 +1,78 @@
/*
* Druid - a distributed column store.
* Copyright (C) 2012, 2013 Metamarkets Group Inc.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
*/
package io.druid.cli.validate;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.api.client.repackaged.com.google.common.base.Throwables;
import com.metamx.common.UOE;
import io.airlift.command.Arguments;
import io.airlift.command.Command;
import io.airlift.command.Option;
import io.druid.indexer.HadoopDruidIndexerConfig;
import io.druid.indexing.common.task.Task;
import io.druid.jackson.DefaultObjectMapper;
import io.druid.query.Query;
import io.druid.segment.realtime.Schema;
import java.io.File;
/**
*/
@Command(
name = "validator",
description = "Validates that a given Druid JSON object is correctly formatted"
)
public class DruidJsonValidator implements Runnable
{
@Option(name = "-f", title = "file", description = "file to validate", required = true)
public String jsonFile;
@Option(name = "-t", title = "type", description = "the type of schema to validate", required = true)
public String type;
@Override
public void run()
{
File file = new File(jsonFile);
if (!file.exists()) {
System.out.printf("File[%s] does not exist.%n", file);
}
final ObjectMapper jsonMapper = new DefaultObjectMapper();
try {
if (type.equalsIgnoreCase("query")) {
jsonMapper.readValue(file, Query.class);
} else if (type.equalsIgnoreCase("hadoopConfig")) {
jsonMapper.readValue(file, HadoopDruidIndexerConfig.class);
} else if (type.equalsIgnoreCase("task")) {
jsonMapper.readValue(file, Task.class);
} else if (type.equalsIgnoreCase("realtimeSchema")) {
jsonMapper.readValue(file, Schema.class);
} else {
throw new UOE("Unknown type[%s]", type);
}
}
catch (Exception e) {
System.out.println("INVALID JSON!");
throw Throwables.propagate(e);
}
}
}