final edits to paper

fjy 2014-03-28 14:38:21 -07:00
parent 2858949431
commit 4be5cd6386
4 changed files with 20 additions and 20 deletions

Binary file not shown.

@ -198,31 +198,31 @@ determine business success or failure.
Finally, another key problem that Metamarkets faced in its early days was to Finally, another key problem that Metamarkets faced in its early days was to
allow users and alerting systems to be able to make business decisions in allow users and alerting systems to be able to make business decisions in
``real-time". The time from when an event is created to when that ``real-time". The time from when an event is created to when that event is
event is queryable determines how fast users and systems are able to react to queryable determines how fast interested parties are able to react to
potentially catastrophic occurrences in their systems. Popular open source data potentially catastrophic situations in their systems. Popular open source data
warehousing systems such as Hadoop were unable to provide the sub-second data ingestion warehousing systems such as Hadoop were unable to provide the sub-second data
latencies we required. ingestion latencies we required.
The problems of data exploration, ingestion, and availability span multiple The problems of data exploration, ingestion, and availability span multiple
industries. Since Druid was open sourced in October 2012, it been deployed as a industries. Since Druid was open sourced in October 2012, it been deployed as a
video, network monitoring, operations monitoring, and online advertising video, network monitoring, operations monitoring, and online advertising
analytics platform in multiple companies. analytics platform at multiple companies.
\section{Architecture} \section{Architecture}
\label{sec:architecture} \label{sec:architecture}
A Druid cluster consists of different types of nodes and each node type is A Druid cluster consists of different types of nodes and each node type is
designed to perform a specific set of things. We believe this design separates designed to perform a specific set of things. We believe this design separates
concerns and simplifies the complexity of the system. The different node types concerns and simplifies the complexity of the overall system. The different
operate fairly independent of each other and there is minimal interaction node types operate fairly independent of each other and there is minimal
among them. Hence, intra-cluster communication failures have minimal impact interaction among them. Hence, intra-cluster communication failures have
on data availability. minimal impact on data availability.
To solve complex data analysis problems, the different To solve complex data analysis problems, the different node types come together
node types come together to form a fully working system. The composition of and to form a fully working system. The name Druid comes from the Druid class in
flow of data in a Druid cluster are shown in Figure~\ref{fig:cluster}. The name Druid comes from the Druid class in many role-playing games: it is a many role-playing games: it is a shape-shifter, capable of taking on many
shape-shifter, capable of taking on many different forms to fulfill various different forms to fulfill various different roles in a group. The composition
different roles in a group. of and flow of data in a Druid cluster are shown in Figure~\ref{fig:cluster}.
\begin{figure*} \begin{figure*}
\centering \centering
@ -422,7 +422,7 @@ their results, the broker will cache these results on a per segment basis for
future use. This process is illustrated in Figure~\ref{fig:caching}. Real-time future use. This process is illustrated in Figure~\ref{fig:caching}. Real-time
data is never cached and hence requests for real-time data will always be data is never cached and hence requests for real-time data will always be
forwarded to real-time nodes. Real-time data is perpetually changing and forwarded to real-time nodes. Real-time data is perpetually changing and
caching the results would be unreliable. caching the results is unreliable.
\begin{figure*} \begin{figure*}
\centering \centering
@ -534,7 +534,7 @@ queryable during MySQL outages.
Data tables in Druid (called \emph{data sources}) are collections of Data tables in Druid (called \emph{data sources}) are collections of
timestamped events and partitioned into a set of segments, where each segment timestamped events and partitioned into a set of segments, where each segment
is typically 5--10 million rows. Formally, we define a segment as a collection is typically 5--10 million rows. Formally, we define a segment as a collection
of rows of data that span some period in time. Segments represent the of rows of data that span some period of time. Segments represent the
fundamental storage unit in Druid and replication and distribution are done at fundamental storage unit in Druid and replication and distribution are done at
a segment level. a segment level.
@ -839,7 +839,7 @@ minute are shown in Figure~\ref{fig:queries_per_min}. Across all the various
data sources, average query latency is approximately 550 milliseconds, with data sources, average query latency is approximately 550 milliseconds, with
90\% of queries returning in less than 1 second, 95\% in under 2 seconds, and 90\% of queries returning in less than 1 second, 95\% in under 2 seconds, and
99\% of queries returning in less than 10 seconds. Occasionally we observe 99\% of queries returning in less than 10 seconds. Occasionally we observe
spikes in latency, as observed on February 19, in which case network issues on spikes in latency, as observed on February 19, where network issues on
the Memcached instances were compounded by very high query load on one of our the Memcached instances were compounded by very high query load on one of our
largest data sources. largest data sources.
@ -984,7 +984,7 @@ production workloads with Druid and have made a couple of interesting observatio
\paragraph{Query Patterns} \paragraph{Query Patterns}
Druid is often used to explore data and generate reports on data. In the Druid is often used to explore data and generate reports on data. In the
explore use case, the number of queries issued by a single user is much higher explore use case, the number of queries issued by a single user are much higher
than in the reporting use case. Exploratory queries often involve progressively than in the reporting use case. Exploratory queries often involve progressively
adding filters for the same time range to narrow down results. Users tend to adding filters for the same time range to narrow down results. Users tend to
explore short time intervals of recent data. In the generate report use case, explore short time intervals of recent data. In the generate report use case,