mirror of https://github.com/apache/druid.git
minor layout/wording changes to make it fit nicely
parent b0790783b7
commit d206a9e5f7
@@ -8,7 +8,7 @@ zip : sgmd0658-yang.zip
 
 %.zip : %.pdf
 @rm -f dummy.ps
-@touch dummy.ps
+@echo 1234 > dummy.ps
 zip $@ $*.pdf $*.tex dummy.ps
 
 clean :
Binary file not shown.
@@ -61,8 +61,7 @@
 \maketitle
 
 \begin{abstract}
-Druid is an open
-source\footnote{\href{http://druid.io/}{http://druid.io/} \href{https://github.com/metamx/druid}{https://github.com/metamx/druid}}
+Druid is an open source\footnote{\href{http://druid.io/}{http://druid.io/} \href{https://github.com/metamx/druid}{https://github.com/metamx/druid}}
 data store designed for real-time exploratory analytics on large data sets.
 The system combines a column-oriented storage layout, a distributed,
 shared-nothing architecture, and an advanced indexing structure to allow for
@@ -131,7 +130,6 @@ service, and attempts to help inform anyone who faces a similar problem about a
 potential method of solving it. Druid is deployed in production at several
 technology
 companies\footnote{\href{http://druid.io/druid.html}{http://druid.io/druid.html}}.
-
 The structure of the paper is as follows: we first describe the problem in
 Section \ref{sec:problem-definition}. Next, we detail system architecture from
 the point of view of how data flows through the system in Section
@@ -145,6 +143,21 @@ in Section \ref{sec:related}.
 \section{Problem Definition}
 \label{sec:problem-definition}
 
+\begin{table*}
+\centering
+\begin{tabular}{| l | l | l | l | l | l | l | l |}
+\hline
+\textbf{Timestamp} & \textbf{Page} & \textbf{Username} & \textbf{Gender} & \textbf{City} & \textbf{Characters Added} & \textbf{Characters Removed} \\ \hline
+2011-01-01T01:00:00Z & Justin Bieber & Boxer & Male & San Francisco & 1800 & 25 \\ \hline
+2011-01-01T01:00:00Z & Justin Bieber & Reach & Male & Waterloo & 2912 & 42 \\ \hline
+2011-01-01T02:00:00Z & Ke\$ha & Helz & Male & Calgary & 1953 & 17 \\ \hline
+2011-01-01T02:00:00Z & Ke\$ha & Xeno & Male & Taiyuan & 3194 & 170 \\ \hline
+\end{tabular}
+\caption{Sample Druid data for edits that have occurred on Wikipedia.}
+\label{tab:sample_data}
+\end{table*}
+
+
 Druid was originally designed to solve problems around ingesting and exploring
 large quantities of transactional events (log data). This form of timeseries
 data is commonly found in OLAP workflows and the nature of the data tends to be
@@ -160,20 +173,6 @@ there are a set of metric columns that contain values (usually numeric) that
 can be aggregated, such as the number of characters added or removed in an
 edit.
 
-\begin{table*}
-\centering
-\begin{tabular}{| l | l | l | l | l | l | l | l |}
-\hline
-\textbf{Timestamp} & \textbf{Page} & \textbf{Username} & \textbf{Gender} & \textbf{City} & \textbf{Characters Added} & \textbf{Characters Removed} \\ \hline
-2011-01-01T01:00:00Z & Justin Bieber & Boxer & Male & San Francisco & 1800 & 25 \\ \hline
-2011-01-01T01:00:00Z & Justin Bieber & Reach & Male & Waterloo & 2912 & 42 \\ \hline
-2011-01-01T02:00:00Z & Ke\$ha & Helz & Male & Calgary & 1953 & 17 \\ \hline
-2011-01-01T02:00:00Z & Ke\$ha & Xeno & Male & Taiyuan & 3194 & 170 \\ \hline
-\end{tabular}
-\caption{Sample Druid data for edits that have occurred on Wikipedia.}
-\label{tab:sample_data}
-\end{table*}
-
 Our goal is to rapidly compute drill-downs and aggregates over this data. We
 want to answer questions like “How many edits were made on the page Justin
 Bieber from males in San Francisco?” and “What is the average number of
@@ -218,21 +217,21 @@ designed to perform a specific set of things. We believe this design separates
 concerns and simplifies the complexity of the system. The different node types
 operate fairly independent of each other and there is minimal interaction
 among them. Hence, intra-cluster communication failures have minimal impact
-on data availability. To solve complex data analysis problems, the different
-node types come together to form a fully working system. The name Druid comes
-from the Druid class in many role-playing games: it is a shape-shifter, capable
-of taking on many different forms to fulfill various different roles in a
-group. The composition of and flow of data in a Druid cluster are shown in
-Figure~\ref{fig:cluster}.
+on data availability.
+
+To solve complex data analysis problems, the different
+node types come together to form a fully working system. The composition of and
+flow of data in a Druid cluster are shown in Figure~\ref{fig:cluster}. The name Druid comes from the Druid class in many role-playing games: it is a
+shape-shifter, capable of taking on many different forms to fulfill various
+different roles in a group.
 
 \begin{figure*}
 \centering
-\includegraphics[width = 4.51in]{cluster}
+\includegraphics[width = 4.5in]{cluster}
 \caption{An overview of a Druid cluster and the flow of data through the cluster.}
 \label{fig:cluster}
 \end{figure*}
 
-\newpage
 \subsection{Real-time Nodes}
 \label{sec:realtime}
 Real-time nodes encapsulate the functionality to ingest and query event
@@ -260,10 +259,11 @@ in \cite{o1996log} and is illustrated in Figure~\ref{fig:realtime_flow}.
 \begin{figure}
 \centering
 \includegraphics[width = 2.6in]{realtime_flow}
-\caption{Real-time nodes first buffer events in memory. On a periodic basis,
-the in-memory index is persisted to disk. On another periodic basis, all
-persisted indexes are merged together and handed off. Queries will hit the
-in-memory index and the persisted indexes.}
+\caption{Real-time nodes buffer events to an in-memory index, which is
+regularly persisted to disk. On a periodic basis, persisted indexes are then merged
+together before getting handed off.
+Queries will hit both the in-memory and persisted indexes.
+}
 \label{fig:realtime_flow}
 \end{figure}
 
@@ -428,9 +428,7 @@ caching the results would be unreliable.
 \begin{figure*}
 \centering
 \includegraphics[width = 4.5in]{caching}
-\caption{Broker nodes cache per segment results. Every Druid query is mapped to
-a set of segments. Queries often combine cached segment results with those that
-need to be computed on historical and real-time nodes.}
+\caption{Results are cached per segment. Queries combine cached results with results computed on historical and real-time nodes.}
 \label{fig:caching}
 \end{figure*}
 
@@ -802,7 +800,7 @@ involving all columns are very rare.
 
 \begin{table}
 \centering
-\begin{tabular}{| l | l | l |}
+\scriptsize\begin{tabular}{| l | l | l |}
 \hline
 \textbf{Data Source} & \textbf{Dimensions} & \textbf{Metrics} \\ \hline
 \texttt{a} & 25 & 21 \\ \hline
@@ -814,6 +812,7 @@ involving all columns are very rare.
 \texttt{g} & 26 & 18 \\ \hline
 \texttt{h} & 78 & 14 \\ \hline
 \end{tabular}
+\normalsize
 \caption{Characteristics of production data sources.}
 \label{tab:datasources}
 \end{table}
Binary file not shown.
Binary file not shown.