@ -122,7 +122,6 @@ edit.
\begin{table*}
\begin{table*}
\centering
\centering
\label{tab:sample_data}
\begin{tabular}{| l | l | l | l | l | l | l |}
\begin{tabular}{| l | l | l | l | l | l | l |}
\hline
\hline
\textbf{Timestamp} & \textbf{Page} & \textbf{Username} & \textbf{Gender} & \textbf{City} & \textbf{Characters Added} & \textbf{Characters Removed} \\ \hline
\textbf{Timestamp} & \textbf{Page} & \textbf{Username} & \textbf{Gender} & \textbf{City} & \textbf{Characters Added} & \textbf{Characters Removed} \\ \hline
@ -132,6 +131,7 @@ edit.
2011-01-01T02:00:00Z & Ke\$ha & Xeno & Male & Taiyuan & 3194 & 170 \\ \hline
2011-01-01T02:00:00Z & Ke\$ha & Xeno & Male & Taiyuan & 3194 & 170 \\ \hline
\end{tabular}
\end{tabular}
\caption{Sample Druid data for edits that have occurred on Wikipedia.}
\caption{Sample Druid data for edits that have occurred on Wikipedia.}
\label{tab:sample_data}
\end{table*}
\end{table*}
Our goal is to rapidly compute drill-downs and aggregates over this data. We
Our goal is to rapidly compute drill-downs and aggregates over this data. We
@ -160,7 +160,7 @@ determine business success or failure.
Finally, another key problem that Metamarkets faced in its early days was to
Finally, another key problem that Metamarkets faced in its early days was to
allow users and alerting systems to make business decisions in
allow users and alerting systems to make business decisions in
"real-time". The time from when an event is created to when that
``real-time". The time from when an event is created to when that
event is queryable determines how fast users and systems are able to react to
event is queryable determines how fast users and systems are able to react to
potentially catastrophic occurrences in their systems. Popular open source data
potentially catastrophic occurrences in their systems. Popular open source data
warehousing systems such as Hadoop were unable to provide the sub-second data ingestion
warehousing systems such as Hadoop were unable to provide the sub-second data ingestion
@ -177,7 +177,7 @@ A Druid cluster consists of different types of nodes and each node type is
designed to perform a specific set of tasks. We believe this design separates
designed to perform a specific set of tasks. We believe this design separates
concerns and simplifies the complexity of the system. The different node types
concerns and simplifies the complexity of the system. The different node types
operate fairly independently of each other and there is minimal interaction
operate fairly independently of each other and there is minimal interaction
between them. Hence, intra-cluster communication failures have minimal impact
among them. Hence, intra-cluster communication failures have minimal impact
on data availability. To solve complex data analysis problems, the different
on data availability. To solve complex data analysis problems, the different
node types come together to form a fully working system. The name Druid comes
node types come together to form a fully working system. The name Druid comes
from the Druid class in many role-playing games: it is a shape-shifter, capable
from the Druid class in many role-playing games: it is a shape-shifter, capable
@ -231,10 +231,10 @@ On a periodic basis, each real-time node will schedule a background task that
searches for all locally persisted indexes. The task merges these indexes
searches for all locally persisted indexes. The task merges these indexes
together and builds an immutable block of data that contains all the events
together and builds an immutable block of data that contains all the events
that have been ingested by a real-time node for some span of time. We refer to this
that have been ingested by a real-time node for some span of time. We refer to this
block of data as a "segment". During the handoff stage, a real-time node
block of data as a ``segment". During the handoff stage, a real-time node
uploads this segment to a permanent backup storage, typically a distributed
uploads this segment to a permanent backup storage, typically a distributed
file system such as S3 \cite{decandia2007dynamo} or HDFS
file system such as S3 \cite{decandia2007dynamo} or HDFS
\cite{shvachko2010hadoop}, which Druid refers to as "deep storage". The ingest,
\cite{shvachko2010hadoop}, which Druid refers to as ``deep storage". The ingest,
persist, merge, and handoff steps are fluid; there is no data loss during any
persist, merge, and handoff steps are fluid; there is no data loss during any
of the processes.
of the processes.
@ -260,7 +260,7 @@ collected for 13:00 to 14:00 and unannounces it is serving this data.
\centering
\centering
\includegraphics[width = 4.5in]{realtime_timeline}
\includegraphics[width = 4.5in]{realtime_timeline}
\caption{The node starts, ingests data, persists, and periodically hands data
\caption{The node starts, ingests data, persists, and periodically hands data
off. This process repeats indefinitely. The time intervals between different
off. This process repeats indefinitely. The time periods between different
real-time node operations are configurable.}
real-time node operations are configurable.}
\label{fig:realtime_timeline}
\label{fig:realtime_timeline}
\end{figure*}
\end{figure*}
@ -436,8 +436,8 @@ Rules indicate how segments should be assigned to different historical node
tiers and how many replicas of a segment should exist in each tier. Rules may
tiers and how many replicas of a segment should exist in each tier. Rules may
also indicate when segments should be dropped entirely from the cluster. Rules
also indicate when segments should be dropped entirely from the cluster. Rules
are usually set for a period of time. For example, a user may use rules to
are usually set for a period of time. For example, a user may use rules to
load the most recent one month's worth of segments into a "hot" cluster, the
load the most recent one month's worth of segments into a ``hot" cluster, the
most recent one year's worth of segments into a "cold" cluster, and drop any
most recent one year's worth of segments into a ``cold" cluster, and drop any
segments that are older.
segments that are older.
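As a minimal sketch of the example above, the following Python snippet assigns a segment to a tier based on its age. The rule layout, the first-match evaluation order, the 30-day and 365-day periods, and the function name are assumptions made for illustration; they are not Druid's actual rule format.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rule list, checked in order; the first matching rule wins.
# Tier names mirror the example in the text; the layout is an assumption.
RULES = [
    {"action": "load", "tier": "hot", "period": timedelta(days=30)},
    {"action": "load", "tier": "cold", "period": timedelta(days=365)},
    {"action": "drop"},
]


def assign_tier(segment_end, now, rules=RULES):
    """Return the tier a segment should be loaded into, or None to drop it."""
    age = now - segment_end
    for rule in rules:
        if rule["action"] == "drop":
            return None
        if age <= rule["period"]:
            return rule["tier"]
    return None  # no rule matched: treat the segment as dropped


now = datetime(2014, 1, 1, tzinfo=timezone.utc)
recent = assign_tier(now - timedelta(days=10), now)
older = assign_tier(now - timedelta(days=100), now)
oldest = assign_tier(now - timedelta(days=400), now)
```

Under these assumptions, a 10-day-old segment lands in the hot tier, a 100-day-old segment in the cold tier, and a 400-day-old segment is dropped.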
The coordinator nodes load a set of rules from a rule table in the MySQL
The coordinator nodes load a set of rules from a rule table in the MySQL
@ -569,7 +569,7 @@ representations.
\subsection{Indices for Filtering Data}
\subsection{Indices for Filtering Data}
In many real-world OLAP workflows, queries are issued for the aggregated
In many real-world OLAP workflows, queries are issued for the aggregated
results of some set of metrics where some set of dimension specifications are
results of some set of metrics where some set of dimension specifications are
met. An example query is: "How many Wikipedia edits were done by users in
met. An example query is: ``How many Wikipedia edits were done by users in
San Francisco who are also male?". This query filters the Wikipedia data
San Francisco who are also male?". This query filters the Wikipedia data
set in Table~\ref{tab:sample_data} based on a Boolean expression of dimension
set in Table~\ref{tab:sample_data} based on a Boolean expression of dimension
values. In many real-world data sets, dimension columns contain strings and
values. In many real-world data sets, dimension columns contain strings and
@ -609,12 +609,11 @@ used in search engines. Bitmap indices for OLAP workloads is described in
detail in \cite{o1997improved}. Bitmap compression algorithms are a
detail in \cite{o1997improved}. Bitmap compression algorithms are a
well-defined area of research \cite{antoshenkov1995byte, wu2006optimizing,
well-defined area of research \cite{antoshenkov1995byte, wu2006optimizing,
van2011memory} and often utilize run-length encoding. Druid opted to use the
van2011memory} and often utilize run-length encoding. Druid opted to use the
Concise algorithm \cite{colantonio2010concise} as it can outperform WAH by
Concise algorithm \cite{colantonio2010concise}. Figure~\ref{fig:concise_plot}
reducing compressed bitmap size by up to 50\%. Figure~\ref{fig:concise_plot}
illustrates the number of bytes using Concise compression versus using an
illustrates the number of bytes using Concise compression versus using an
integer array. The results were generated on a \texttt{cc2.8xlarge} system with a single
integer array. The results were generated on a \texttt{cc2.8xlarge} system with
thread, 2G heap, 512m young gen, and a forced GC between each run. The data set
a single thread, 2G heap, 512m young gen, and a forced GC between each run. The
is a single day’s worth of data collected from the Twitter garden hose
data set is a single day’s worth of data collected from the Twitter garden hose
\cite{twitter2013} data stream. The data set contains 2,272,295 rows and 12
\cite{twitter2013} data stream. The data set contains 2,272,295 rows and 12
dimensions of varying cardinality. As an additional comparison, we also
dimensions of varying cardinality. As an additional comparison, we also
re-sorted the data set rows to maximize compression.
re-sorted the data set rows to maximize compression.
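To make the bitmap index idea concrete, the following Python sketch answers the example filter from the start of this subsection (edits by users in San Francisco who are also male). It is only an illustration: the rows are made-up sample data, the function name is hypothetical, and plain uncompressed integers stand in for the bitmaps, so no Concise or run-length compression is applied.

```python
from collections import defaultdict


def build_bitmap_index(rows, dimension):
    """Map each value of `dimension` to an int whose bit i is set
    when row i holds that value (an uncompressed bitmap)."""
    index = defaultdict(int)
    for i, row in enumerate(rows):
        index[row[dimension]] |= 1 << i
    return dict(index)


# Made-up sample rows, for illustration only.
rows = [
    {"city": "San Francisco", "gender": "Male"},
    {"city": "San Francisco", "gender": "Female"},
    {"city": "Calgary", "gender": "Male"},
    {"city": "Taiyuan", "gender": "Male"},
    {"city": "San Francisco", "gender": "Male"},
]

city_index = build_bitmap_index(rows, "city")
gender_index = build_bitmap_index(rows, "gender")

# The Boolean filter becomes a bitwise AND of the two bitmaps.
matches = city_index["San Francisco"] & gender_index["Male"]
matching_rows = [i for i in range(len(rows)) if (matches >> i) & 1]
match_count = bin(matches).count("1")
```

Only the rows whose bits survive the AND need to be scanned for aggregation, which is why the size of the stored bitmaps matters.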
@ -680,8 +679,8 @@ A sample count query over a week of data is as follows:
}
}
\end{verbatim}}
\end{verbatim}}
The query shown above will return a count of the number of rows in the Wikipedia datasource
The query shown above will return a count of the number of rows in the Wikipedia datasource
from 2013-01-01 to 2013-01-08, filtered for only those rows where the value of the "page" dimension is
from 2013-01-01 to 2013-01-08, filtered for only those rows where the value of the ``page" dimension is
equal to "Ke\$ha". The results will be bucketed by day and will be a JSON array of the following form:
equal to ``Ke\$ha". The results will be bucketed by day and will be a JSON array of the following form:
{\scriptsize \begin{verbatim}
{\scriptsize \begin{verbatim}
[ {
[ {
"timestamp": "2013-01-01T00:00:00.000Z",
"timestamp": "2013-01-01T00:00:00.000Z",
@ -706,7 +705,7 @@ of this paper to fully describe the query API but more information can be found
online\footnote{\href{http://druid.io/docs/latest/Querying.html}{http://druid.io/docs/latest/Querying.html}}.
online\footnote{\href{http://druid.io/docs/latest/Querying.html}{http://druid.io/docs/latest/Querying.html}}.
As of this writing, a join query for Druid is not yet implemented. This has
As of this writing, a join query for Druid is not yet implemented. This has
been a function of engineering resource allocation decisions and use case more
been a function of engineering resource allocation and use case decisions more
than a decision driven by technical merit. Indeed, Druid's storage format
than a decision driven by technical merit. Indeed, Druid's storage format
would allow for the implementation of joins (there is no loss of fidelity for
would allow for the implementation of joins (there is no loss of fidelity for
columns included as dimensions) and the implementation of them has been a
columns included as dimensions) and the implementation of them has been a
@ -724,7 +723,7 @@ a shared set of keys. The primary high-level strategies for join queries the
authors are aware of are a hash-based strategy and a sorted-merge strategy. The
authors are aware of are a hash-based strategy and a sorted-merge strategy. The
hash-based strategy requires that all but one data set be available as
hash-based strategy requires that all but one data set be available as
something that looks like a hash table; a lookup operation is then performed on
something that looks like a hash table; a lookup operation is then performed on
this hash table for every row in the "primary" stream. The sorted-merge
this hash table for every row in the ``primary" stream. The sorted-merge
strategy assumes that each stream is sorted by the join key and thus allows for
strategy assumes that each stream is sorted by the join key and thus allows for
the incremental joining of the streams. Each of these strategies, however,
the incremental joining of the streams. Each of these strategies, however,
requires the materialization of some number of the streams either in sorted
requires the materialization of some number of the streams either in sorted
@ -751,8 +750,7 @@ Druid query performance can vary significantly depending on the query
being issued. For example, sorting the values of a high cardinality dimension
being issued. For example, sorting the values of a high cardinality dimension
based on a given metric is much more expensive than a simple count over a time
based on a given metric is much more expensive than a simple count over a time
range. To showcase the average query latencies in a production Druid cluster,
range. To showcase the average query latencies in a production Druid cluster,
we selected 8 of our most queried data sources, described in
we selected 8 of our most queried data sources, described in Table~\ref{tab:datasources}.
Table~\ref{tab:datasources}.
Approximately 30\% of the queries are standard
Approximately 30\% of the queries are standard
aggregates involving different types of metrics and filters, 60\% of queries
aggregates involving different types of metrics and filters, 60\% of queries
@ -764,7 +762,6 @@ involving all columns are very rare.
\begin{table}
\begin{table}
\centering
\centering
\label{tab:datasources}
\begin{tabular}{| l | l | l |}
\begin{tabular}{| l | l | l |}
\hline
\hline
\textbf{Data Source} & \textbf{Dimensions} & \textbf{Metrics} \\ \hline
\textbf{Data Source} & \textbf{Dimensions} & \textbf{Metrics} \\ \hline
@ -778,14 +775,15 @@ involving all columns are very rare.
\texttt{h} & 78 & 14 \\ \hline
\texttt{h} & 78 & 14 \\ \hline
\end{tabular}
\end{tabular}
\caption{Characteristics of production data sources.}
\caption{Characteristics of production data sources.}
\label{tab:datasources}
\end{table}
\end{table}
A few notes about our results:
A few notes about our results:
\begin{itemize}[leftmargin=*,beginpenalty=5000,topsep=0pt]
\begin{itemize}[leftmargin=*,beginpenalty=5000,topsep=0pt]
\item The results are from a "hot" tier in our production cluster. We run
\item The results are from a ``hot" tier in our production cluster. We run
several tiers of varying performance in production.
several tiers of varying performance in production.
\item There is approximately 10.5TB of RAM available in the "hot" tier and
\item There is approximately 10.5TB of RAM available in the ``hot" tier and
approximately 10TB of segments loaded (including replication). Collectively,
approximately 10TB of segments loaded (including replication). Collectively,
there are about 50 billion Druid rows in this tier. Results for
there are about 50 billion Druid rows in this tier. Results for
every data source are not shown.
every data source are not shown.
@ -798,12 +796,12 @@ threads and 672 total cores (hyperthreaded).
\end{itemize}
\end{itemize}
Query latencies are shown in Figure~\ref{fig:query_latency} and the queries per
Query latencies are shown in Figure~\ref{fig:query_latency} and the queries per
minute is shown in Figure~\ref{fig:queries_per_min}. Across all the various
minute are shown in Figure~\ref{fig:queries_per_min}. Across all the various
data sources, average query latency is approximately 550 milliseconds, with
data sources, average query latency is approximately 550 milliseconds, with
90\% of queries returning in less than 1 second, 95\% in under 2 seconds, and
90\% of queries returning in less than 1 second, 95\% in under 2 seconds, and
99\% of queries taking less than 10 seconds to complete.
99\% of queries returning in less than 10 seconds.
Occasionally we observe spikes in latency, as observed on February 19,
Occasionally we observe spikes in latency, as observed on February 19,
in which case network issues on the cache nodes were compounded by very high
in which case network issues on the Memcached instances were compounded by very high
query load on one of our largest datasources.
query load on one of our largest datasources.
\begin{figure}
\begin{figure}
@ -893,11 +891,10 @@ aggregations we want to perform on those metrics. With the most basic data set
800,000 events/second/core, which is really just a measurement of how fast we can
800,000 events/second/core, which is really just a measurement of how fast we can
deserialize events. Real world data sets are never this simple.
deserialize events. Real world data sets are never this simple.
Table~\ref{tab:ingest_datasources} shows a selection of data sources and their
Table~\ref{tab:ingest_datasources} shows a selection of data sources and their
chracteristics.
characteristics.
\begin{table}
\begin{table}
\centering
\centering
\label{tab:ingest_datasources}
\begin{tabular}{| l | l | l | l |}
\begin{tabular}{| l | l | l | l |}
\hline
\hline
\scriptsize\textbf{Data Source} & \scriptsize\textbf{Dimensions} & \scriptsize\textbf{Metrics} & \scriptsize\textbf{Peak events/s} \\ \hline
\scriptsize\textbf{Data Source} & \scriptsize\textbf{Dimensions} & \scriptsize\textbf{Metrics} & \scriptsize\textbf{Peak events/s} \\ \hline
@ -911,6 +908,7 @@ chracteristics.
\texttt{z} & 33 & 24 & 95747.74 \\ \hline
\texttt{z} & 33 & 24 & 95747.74 \\ \hline
\end{tabular}
\end{tabular}
\caption{Ingestion characteristics of various data sources.}
\caption{Ingestion characteristics of various data sources.}
\label{tab:ingest_datasources}
\end{table}
\end{table}
We can see that, based on the descriptions in
We can see that, based on the descriptions in
@ -938,7 +936,7 @@ The latency measurements we presented are sufficient to address our stated
problems of interactivity. We would prefer the variability in the latencies to
problems of interactivity. We would prefer the variability in the latencies to
be less. It is still very possible to decrease latencies by adding
be less. It is still very possible to decrease latencies by adding
additional hardware, but we have not chosen to do so because infrastructure
additional hardware, but we have not chosen to do so because infrastructure
cost is still a consideration to us.
costs are still a consideration to us.
\section{Druid in Production}\label{sec:production}
\section{Druid in Production}\label{sec:production}
Over the last few years, we have gained tremendous knowledge about handling
Over the last few years, we have gained tremendous knowledge about handling
@ -976,7 +974,7 @@ historical nodes.
\paragraph{Data Center Outages}
\paragraph{Data Center Outages}
Complete cluster failures are possible, but extremely rare. If Druid is
Complete cluster failures are possible, but extremely rare. If Druid is
deployed only in a single data center, it is possible for the entire data
only deployed in a single data center, it is possible for the entire data
center to fail. In such cases, new machines need to be provisioned. As long as
center to fail. In such cases, new machines need to be provisioned. As long as
deep storage is still available, cluster recovery time is network bound as
deep storage is still available, cluster recovery time is network bound as
historical nodes simply need to redownload every segment from deep storage. We
historical nodes simply need to redownload every segment from deep storage. We
@ -1076,7 +1074,7 @@ stores \cite{macnicol2004sybase}.
In this paper, we presented Druid, a distributed, column-oriented, real-time
In this paper, we presented Druid, a distributed, column-oriented, real-time
analytical data store. Druid is designed to power high performance applications
analytical data store. Druid is designed to power high performance applications
and is optimized for low query latencies. Druid supports streaming data
and is optimized for low query latencies. Druid supports streaming data
ingestion and is fault-tolerant. We discussed how Druid benchmarks and
ingestion and is fault-tolerant. We discussed Druid benchmarks and
summarized key architecture aspects such
summarized key architecture aspects such
as the storage format, query language, and general execution.
as the storage format, query language, and general execution.