mirror of https://github.com/apache/druid.git
Added prepend tag to make pages display.
This commit is contained in:
parent c06b37f36e
commit 248fba683a

@@ -1,87 +1,6 @@
Aggregations are specifications of processing over metrics available in Druid.

Available aggregations are:

### Sum aggregators

#### `longSum` aggregator

Computes the sum of values as a 64-bit, signed integer.

<code>{
    "type" : "longSum",
    "name" : <output_name>,
    "fieldName" : <metric_name>
}</code>

`name` – output name for the summed value
`fieldName` – name of the metric column to sum over

#### `doubleSum` aggregator

Computes the sum of values as a 64-bit floating point value. Similar to `longSum`.

<code>{
    "type" : "doubleSum",
    "name" : <output_name>,
    "fieldName" : <metric_name>
}</code>

### Count aggregator

`count` computes the number of rows that match the filters.

<code>{
    "type" : "count",
    "name" : <output_name>
}</code>

### Min / Max aggregators

#### `min` aggregator

`min` computes the minimum metric value.

<code>{
    "type" : "min",
    "name" : <output_name>,
    "fieldName" : <metric_name>
}</code>

#### `max` aggregator

`max` computes the maximum metric value.

<code>{
    "type" : "max",
    "name" : <output_name>,
    "fieldName" : <metric_name>
}</code>

### JavaScript aggregator

Computes an arbitrary JavaScript function over a set of columns (both metrics and dimensions).

All JavaScript functions must return numerical values.

<code>{
    "type": "javascript",
    "name": "<output_name>",
    "fieldNames"  : [ <column1>, <column2>, ... ],
    "fnAggregate" : "function(current, column1, column2, ...) {
                       <updates partial aggregate (current) based on the current row values>
                       return <updated partial aggregate>
                     }",
    "fnCombine"   : "function(partialA, partialB) { return <combined partial results>; }",
    "fnReset"     : "function() { return <initial value>; }"
}</code>

**Example**

<code>{
    "type": "javascript",
    "name": "sum(log(x)/y) + 10",
    "fieldNames": ["x", "y"],
    "fnAggregate" : "function(current, a, b) { return current + (Math.log(a) * b); }",
    "fnCombine"   : "function(partialA, partialB) { return partialA + partialB; }",
    "fnReset"     : "function() { return 10; }"
}</code>

---
layout: default
---

---
layout: default
---

@@ -1,3 +1,6 @@
---
layout: default
---
Batch Data Ingestion
====================

@@ -1,3 +1,6 @@
---
layout: default
---
# Booting a Single Node Cluster #

[[Loading Your Data]] and [[Querying Your Data]] contain recipes to boot a small Druid cluster on localhost. Here we will boot a small cluster on EC2. You can check out the code, or download a tarball from [here](http://static.druid.io/artifacts/druid-services-0.5.51-SNAPSHOT-bin.tar.gz).

@@ -1,3 +1,6 @@
---
layout: default
---
Broker
======

@@ -1,3 +1,6 @@
---
layout: default
---
### Clone and Build from Source

The other way to set up Druid is from source via git. To do so, run these commands:

@@ -1,3 +1,6 @@
---
layout: default
---
A Druid cluster consists of various node types that need to be set up depending on your use case. See our [[Design]] docs for a description of the different node types.

Setup Scripts

@@ -1,3 +1,6 @@
---
layout: default
---
Compute
=======

@@ -1,3 +1,6 @@
---
layout: default
---
Concepts and Terminology
========================

@@ -1,3 +1,6 @@
---
layout: default
---
This describes the basic server configuration that is loaded by all the server processes; the same file is loaded by each of them. See also the JSON “specFile” descriptions in [[Realtime]] and [[Batch-ingestion]].

JVM Configuration Best Practices

@@ -1,3 +1,6 @@
---
layout: default
---
If you are interested in contributing to the code, we accept [pull requests](https://help.github.com/articles/using-pull-requests). Note: we have only just completed decoupling our Metamarkets-specific code from the code base, and we took some short-cuts in interface design to make it happen. So, there are a number of interfaces that exist right now which are likely to be in flux. If you are embedding Druid in your system, it will be safest for the time being to only extend/implement interfaces that this wiki describes, as those are intended to be stable (unless otherwise mentioned).

For issue tracking, we are using the GitHub issue tracker. Please fill out an issue from the Issues tab on the GitHub page.

@@ -1,3 +1,6 @@
---
layout: default
---
Deep storage is where segments are stored. It is a storage mechanism that Druid does not provide. This deep storage infrastructure defines the level of durability of your data: as long as Druid nodes can see this storage infrastructure and get at the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from this storage layer, then you will lose whatever data those segments represented.

The currently supported types of deep storage follow.

@@ -1,3 +1,6 @@
---
layout: default
---
For a comprehensive look at the architecture of Druid, read the [White Paper](http://static.druid.io/docs/druid.pdf).

What is Druid?

@@ -1,3 +1,6 @@
---
layout: default
---
A version may be declared as a release candidate if it has been deployed to a sizable production cluster. Release candidates are declared as stable after we feel fairly confident there are no major bugs in the version. Check out the [[Versioning]] section for how we describe software versions.

Release Candidate

@@ -1,3 +1,6 @@
---
layout: default
---
# Druid Personal Demo Cluster (DPDC)

Note: there are currently some issues with the CloudFormation. We are working through them and will update the documentation here when things work properly. In the meantime, the simplest way to get your feet wet with a cluster setup is to run through the instructions at [housejester/druid-test-harness](https://github.com/housejester/druid-test-harness), though it is based on an older version. If you just want to get a feel for the types of data and queries that you can issue, check out [[Realtime Examples]].

@@ -1,3 +1,6 @@
---
layout: default
---
We are not experts on Cassandra; if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means. We will fix this page.

Druid is highly optimized for scans and aggregations, supports arbitrarily deep drill-downs into data sets without the need to pre-compute, and can ingest event streams in real-time and allow users to query events as they come in. Cassandra is a great key-value store and it has some features that allow you to use it to do more interesting things than what you can do with a pure key-value store. But it is not built for the same use cases that Druid handles, namely regularly scanning over billions of entries per query.

@@ -1,3 +1,6 @@
---
layout: default
---
Druid is a complementary addition to Hadoop. Hadoop is great at storing and making accessible large amounts of individually low-value data. Unfortunately, Hadoop is not great at providing query speed guarantees on top of that data, nor does it have very good operational characteristics for a customer-facing production system. Druid, on the other hand, excels at taking high-value summaries of the low-value data on Hadoop, making it available in a fast and always-on fashion, such that it could be exposed directly to a customer.

Druid also requires some infrastructure to exist for “deep storage”. HDFS is one of the implemented options for this “deep storage”.

@@ -1,3 +1,6 @@
---
layout: default
---
The question of Druid versus Impala or Shark basically comes down to your product requirements and what the systems were designed to do.

Druid was designed to

@@ -1,3 +1,6 @@
---
layout: default
---
### How does Druid compare to Redshift?

In terms of drawing a differentiation, Redshift is essentially ParAccel (Actian), which Amazon is licensing.

@@ -1,3 +1,6 @@
---
layout: default
---
How does Druid compare to Vertica?

Vertica is similar to ParAccel/Redshift ([[Druid-vs-Redshift]]) described above in that it wasn’t built for real-time streaming data ingestion and it supports full SQL.

@@ -1,3 +1,6 @@
---
layout: default
---
Examples
========

@@ -1,3 +1,6 @@
---
layout: default
---
A filter is a JSON object indicating which rows of data should be included in the computation for a query. It’s essentially the equivalent of the WHERE clause in SQL. Druid supports the following types of filters.

### Selector filter
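
For illustration only (this sketch is not part of the diff; the angle-bracketed fields are placeholders), a selector filter matches a specific dimension against a specific value:

<code>{
    "type" : "selector",
    "dimension" : <dimension>,
    "value" : <dimension_value>
}</code>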

@@ -1,3 +1,6 @@
---
layout: default
---
Firehoses describe the data stream source. They are pluggable and thus the configuration schema can and will vary based on the `type` of the firehose.

|Field|Type|Description|Required|

@@ -1,3 +1,6 @@
---
layout: default
---
The granularity field determines how data gets bucketed across the time dimension, i.e., how it gets aggregated by hour, day, minute, etc.

It can be specified either as a string for simple granularities or as an object for arbitrary granularities.
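
For illustration (not part of this hunk), a simple granularity is just a string such as "minute", "hour", or "day", while a duration granularity can be sketched as an object with the bucket length in milliseconds; the value below is a placeholder:

<code>"granularity" : {
    "type" : "duration",
    "duration" : 3600000
}</code>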

@@ -1,3 +1,6 @@
---
layout: default
---
These types of queries take a groupBy query object and return an array of JSON objects where each object represents a grouping asked for by the query.

An example groupBy query object is shown below:
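
The example itself falls outside this hunk; as a rough sketch with placeholder names (not the page's original example), a minimal groupBy query has this shape:

<code>{
    "queryType"    : "groupBy",
    "dataSource"   : <data_source_name>,
    "granularity"  : "day",
    "dimensions"   : [<dimension>],
    "aggregations" : [
        { "type" : "longSum", "name" : <output_name>, "fieldName" : <metric_name> }
    ],
    "intervals"    : ["2012-01-01T00:00:00.000/2012-01-03T00:00:00.000"]
}</code>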

@@ -1,3 +1,6 @@
---
layout: default
---
A having clause is a JSON object identifying which rows from a groupBy query should be returned, by specifying conditions on aggregated values.

It is essentially the equivalent of the HAVING clause in SQL.
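
For illustration only (not part of this hunk; the names and threshold are placeholders), a numeric having spec that keeps groups whose aggregated value exceeds a threshold might look like:

<code>{
    "type"        : "greaterThan",
    "aggregation" : <aggregator_output_name>,
    "value"       : <numeric_value>
}</code>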

@@ -1,3 +1,6 @@
---
layout: default
---
Druid is an open-source analytics datastore designed for realtime, exploratory queries on large-scale data sets (hundreds of billions of entries, hundreds of terabytes of data). Druid provides for cost-effective, always-on, realtime data ingestion and arbitrary data exploration.

- Check out some [[Examples]]

@@ -1,3 +1,6 @@
---
layout: default
---
Disclaimer: We are still in the process of finalizing the indexing service and these configs are prone to change at any time. We will announce when we feel the indexing service and the configurations described are stable.

The indexing service is a distributed task/job queue. It accepts requests in the form of [[Tasks]] and executes those tasks across a set of worker nodes. Worker capacity can be automatically adjusted based on the number of tasks pending in the system. The indexing service is highly available, has built-in retry logic, and can back up per-task logs in deep storage.

@@ -1,3 +1,6 @@
---
layout: default
---
### R

- [RDruid](https://github.com/metamx/RDruid) - Druid connector for R

@@ -1,3 +1,6 @@
---
layout: default
---
Once you have a realtime node working, it is time to load your own data to see how Druid performs.

Druid can ingest data in three ways: via Kafka and a realtime node, via the indexing service, and via the Hadoop batch loader. Data is ingested in realtime using a [[Firehose]].

@@ -1,3 +1,6 @@
---
layout: default
---
Master
======

@@ -12,7 +15,7 @@ Rules

Segments are loaded and dropped from the cluster based on a set of rules. Rules indicate how segments should be assigned to different compute node tiers and how many replicants of a segment should exist in each tier. Rules may also indicate when segments should be dropped entirely from the cluster. The master loads a set of rules from the database. Rules may be specific to a certain datasource and/or a default set of rules can be configured. Rules are read in order, so the ordering of rules is important. The master will cycle through all available segments and match each segment with the first rule that applies. Each segment may only match a single rule.

For more information on rules, see [[Rule Configuration]].
For more information on rules, see [[Rule Configuration.md]].

Cleaning Up Segments
--------------------

@@ -1,3 +1,6 @@
---
layout: default
---
MySQL is an external dependency of Druid. We use it to store various metadata about the system, but not to store the actual data. There are a number of tables used for various purposes described below.

Segments Table

@@ -1,3 +1,6 @@
---
layout: default
---
The orderBy field provides the functionality to sort and limit the set of results from a groupBy query. Available options are:

### DefaultLimitSpec
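
As a rough sketch (not part of this hunk; the fields shown are placeholders), a default limit spec orders the results on the listed columns and caps how many rows come back:

<code>{
    "type"    : "default",
    "limit"   : <integer_value>,
    "columns" : [<column_name>, ...]
}</code>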

@@ -1,3 +1,6 @@
---
layout: default
---
The Plumber handles generated segments, both while they are being generated and when they are “done”. This is also technically a pluggable interface and there are multiple implementations, but there are enough details handled by the plumber that only a few implementations are expected, and only more advanced third parties will implement their own. See [here](https://github.com/metamx/druid/wiki/Plumber#available-plumbers) for a description of the plumbers included with Druid.

|Field|Type|Description|Required|

@@ -1,3 +1,6 @@
---
layout: default
---
Post-aggregations are specifications of processing that should happen on aggregated values as they come out of Druid. If you include a post aggregation as part of a query, make sure to include all aggregators the post-aggregator requires.

There are several post-aggregators available.
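
One example, sketched here for illustration with placeholder names (it is not part of this hunk), is an arithmetic post-aggregator, which applies a function to the outputs of other aggregators:

<code>{
    "type"   : "arithmetic",
    "name"   : <output_name>,
    "fn"     : "/",
    "fields" : [
        { "type" : "fieldAccess", "fieldName" : <aggregator_name> },
        { "type" : "fieldAccess", "fieldName" : <aggregator_name> }
    ]
}</code>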

@@ -1,3 +1,6 @@
---
layout: default
---
# Setup #

Before we start querying Druid, we're going to finish setting up a complete cluster on localhost. In [[Loading Your Data]] we set up a [[Realtime]], [[Compute]] and [[Master]] node. If you've already completed that tutorial, you need only follow the directions for 'Booting a Broker Node'.

@@ -1,3 +1,6 @@
---
layout: default
---
Querying
========

@@ -1,3 +1,6 @@
---
layout: default
---
Realtime
========

@@ -1,3 +1,6 @@
---
layout: default
---
Note: It is recommended that the master console be used to configure rules. However, the master node does have HTTP endpoints to programmatically configure rules.

Load Rules
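
As a rough illustration (the actual rule grammar is outside this hunk; the period and tier values are placeholders), a period-based load rule might look like:

<code>{
    "type"   : "loadByPeriod",
    "period" : "P1M",
    "tier"   : "hot"
}</code>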

@@ -1,3 +1,6 @@
---
layout: default
---
A search query returns dimension values that match the search specification.

<code>{

@@ -1,3 +1,6 @@
---
layout: default
---
Search query specs define how a “match” is defined between a search value and a dimension value. The available search query specs are:

InsensitiveContainsSearchQuerySpec
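
For illustration (the spec details are outside this hunk; the value is a placeholder), an insensitive-contains spec matches any dimension value that contains the given string, ignoring case:

<code>{
    "type"  : "insensitive_contains",
    "value" : <some_value>
}</code>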

@@ -1,3 +1,6 @@
---
layout: default
---
Segment metadata queries return per segment information about:

* Cardinality of all columns in the segment
* Estimated byte size for the segment columns in TSV format
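
For illustration only (not part of this hunk; the dataSource and interval are placeholders), a segment metadata query is issued like any other Druid query:

<code>{
    "queryType"  : "segmentMetadata",
    "dataSource" : <data_source_name>,
    "intervals"  : ["2013-01-01/2014-01-01"]
}</code>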

@@ -1,3 +1,6 @@
---
layout: default
---
Segments
========

@@ -1,3 +1,6 @@
---
layout: default
---
Note: This feature is highly experimental and only works with spatially indexed dimensions.

The grammar for a spatial filter is as follows:
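
The grammar itself is cut off in this hunk; as a rough, unverified sketch (all field names here are assumptions), a rectangular spatial filter could look like:

<code>{
    "type"      : "spatial",
    "dimension" : <spatial_dimension>,
    "bound"     : {
        "type"      : "rectangular",
        "minCoords" : [<x_min>, <y_min>],
        "maxCoords" : [<x_max>, <y_max>]
    }
}</code>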

@@ -1,3 +1,6 @@
---
layout: default
---
Note: This feature is highly experimental.

In any of the data specs, there is now the option of providing spatial dimensions. For example, for a JSON data spec, spatial dimensions can be specified as follows:
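
The actual snippet is outside this hunk; as a rough, unverified sketch (the field names are assumptions), spatial dimensions inside a JSON data spec might be declared along these lines:

<code>"spatialDimensions" : [
    {
        "dimName" : <spatial_dimension_name>,
        "dims"    : [<latitude_column>, <longitude_column>]
    }
]</code>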

@@ -1,3 +1,6 @@
---
layout: default
---
This page describes how to use Riak-CS for deep storage instead of S3. We are still setting up some of the peripheral stuff (file downloads, etc.).

This guide was provided by Pablo Nebrera. Thanks!

@@ -1,3 +1,6 @@
---
layout: default
---
Numerous backend engineers at [Metamarkets](http://www.metamarkets.com) work on Druid full-time. If you have any questions about usage or code, feel free to contact any of us.

Google Groups Mailing List

@@ -1,3 +1,6 @@
---
layout: default
---
Tasks are run on workers and always operate on a single datasource. Once an indexer coordinator node accepts a task, a lock is created for the datasource and interval specified in the task. Tasks do not need to explicitly release locks; they are released upon task completion. Tasks may potentially release locks early if they desire. Task ids are made unique by naming them using UUIDs or the timestamp at which the task was created. Tasks are also part of a “task group”, which is a set of tasks that can share interval locks.

There are several different types of tasks.

@@ -1,3 +1,6 @@
---
layout: default
---
YourKit supports the Druid open source projects with its
full-featured Java Profiler.
YourKit, LLC is the creator of innovative and intelligent tools for profiling

@@ -1,3 +1,6 @@
---
layout: default
---
Time boundary queries return the earliest and latest data points of a data set. The grammar is:

<code>{

@@ -1,3 +1,6 @@
---
layout: default
---
Timeseries queries
==================

@@ -1,3 +1,6 @@
---
layout: default
---
Greetings! This tutorial will help clarify some core Druid concepts. We will use a realtime dataset and issue some basic Druid queries. If you are ready to explore Druid, and learn a thing or two, read on!

About the data

@@ -1,3 +1,6 @@
---
layout: default
---
Welcome back! In our first [tutorial](https://github.com/metamx/druid/wiki/Tutorial%3A-A-First-Look-at-Druid), we introduced you to the most basic Druid setup: a single realtime node. We streamed in some data and queried it. Realtime nodes collect very recent data and periodically hand that data off to the rest of the Druid cluster. Some questions about the architecture must naturally come to mind. What does the rest of the Druid cluster look like? How does Druid load available static data?

This tutorial will hopefully answer these questions!

@@ -1,3 +1,6 @@
---
layout: default
---
Greetings! This tutorial will help clarify some core Druid concepts. We will use a realtime dataset and issue some basic Druid queries. If you are ready to explore Druid, and learn a thing or two, read on!

About the data

@@ -1,3 +1,6 @@
---
layout: default
---
Greetings! We see you’ve taken an interest in Druid. That’s awesome! Hopefully this tutorial will help clarify some core Druid concepts. We will go through one of the Real-time [[Examples]], and issue some basic Druid queries. The data source we’ll be working with is the [Twitter spritzer stream](https://dev.twitter.com/docs/streaming-apis/streams/public). If you are ready to explore Druid, brave its challenges, and maybe learn a thing or two, read on!

Setting Up

@@ -1,3 +1,6 @@
---
layout: default
---
This page discusses how we do versioning and provides information on our stable releases.

Versioning Strategy

@@ -1,3 +1,6 @@
---
layout: default
---
Druid uses ZooKeeper (ZK) for management of current cluster state. The operations that happen over ZK are:

1. [[Master]] leader election

@@ -1,2 +1,3 @@
name: Your New Jekyll Site
pygments: true
markdown: redcarpet

@@ -1,3 +1,6 @@
---
layout: default
---
Contents

* [[Introduction|Home]]
* [[Download]]