mirror of https://github.com/apache/druid.git
final edits to paper
parent 2858949431
commit 4be5cd6386
Binary file not shown.
@@ -198,31 +198,31 @@ determine business success or failure.
 
 Finally, another key problem that Metamarkets faced in its early days was to
 allow users and alerting systems to be able to make business decisions in
-``real-time". The time from when an event is created to when that
-event is queryable determines how fast users and systems are able to react to
-potentially catastrophic occurrences in their systems. Popular open source data
-warehousing systems such as Hadoop were unable to provide the sub-second data ingestion
-latencies we required.
+``real-time". The time from when an event is created to when that event is
+queryable determines how fast interested parties are able to react to
+potentially catastrophic situations in their systems. Popular open source data
+warehousing systems such as Hadoop were unable to provide the sub-second data
+ingestion latencies we required.
 
 The problems of data exploration, ingestion, and availability span multiple
 industries. Since Druid was open sourced in October 2012, it been deployed as a
 video, network monitoring, operations monitoring, and online advertising
-analytics platform in multiple companies.
+analytics platform at multiple companies.
 
 \section{Architecture}
 \label{sec:architecture}
 A Druid cluster consists of different types of nodes and each node type is
 designed to perform a specific set of things. We believe this design separates
-concerns and simplifies the complexity of the system. The different node types
-operate fairly independent of each other and there is minimal interaction
-among them. Hence, intra-cluster communication failures have minimal impact
-on data availability.
+concerns and simplifies the complexity of the overall system. The different
+node types operate fairly independent of each other and there is minimal
+interaction among them. Hence, intra-cluster communication failures have
+minimal impact on data availability.
 
-To solve complex data analysis problems, the different
-node types come together to form a fully working system. The composition of and
-flow of data in a Druid cluster are shown in Figure~\ref{fig:cluster}. The name Druid comes from the Druid class in many role-playing games: it is a
-shape-shifter, capable of taking on many different forms to fulfill various
-different roles in a group.
+To solve complex data analysis problems, the different node types come together
+to form a fully working system. The name Druid comes from the Druid class in
+many role-playing games: it is a shape-shifter, capable of taking on many
+different forms to fulfill various different roles in a group. The composition
+of and flow of data in a Druid cluster are shown in Figure~\ref{fig:cluster}.
 
 \begin{figure*}
 \centering
@@ -422,7 +422,7 @@ their results, the broker will cache these results on a per segment basis for
 future use. This process is illustrated in Figure~\ref{fig:caching}. Real-time
 data is never cached and hence requests for real-time data will always be
 forwarded to real-time nodes. Real-time data is perpetually changing and
-caching the results would be unreliable.
+caching the results is unreliable.
 
 \begin{figure*}
 \centering
@@ -534,7 +534,7 @@ queryable during MySQL outages.
 Data tables in Druid (called \emph{data sources}) are collections of
 timestamped events and partitioned into a set of segments, where each segment
 is typically 5--10 million rows. Formally, we define a segment as a collection
-of rows of data that span some period in time. Segments represent the
+of rows of data that span some period of time. Segments represent the
 fundamental storage unit in Druid and replication and distribution are done at
 a segment level.
 
@@ -839,9 +839,9 @@ minute are shown in Figure~\ref{fig:queries_per_min}. Across all the various
 data sources, average query latency is approximately 550 milliseconds, with
 90\% of queries returning in less than 1 second, 95\% in under 2 seconds, and
 99\% of queries returning in less than 10 seconds. Occasionally we observe
-spikes in latency, as observed on February 19, in which case network issues on
+spikes in latency, as observed on February 19, where network issues on
 the Memcached instances were compounded by very high query load on one of our
-largest datasources.
+largest data sources.
 
 \begin{figure}
 \centering
@@ -984,7 +984,7 @@ production workloads with Druid and have made a couple of interesting observatio
 
 \paragraph{Query Patterns}
 Druid is often used to explore data and generate reports on data. In the
-explore use case, the number of queries issued by a single user is much higher
+explore use case, the number of queries issued by a single user are much higher
 than in the reporting use case. Exploratory queries often involve progressively
 adding filters for the same time range to narrow down results. Users tend to
 explore short time intervals of recent data. In the generate report use case,
Binary file not shown.
Binary file not shown.