minor layout/wording changes to make it fit nicely

Xavier Léauté 2014-03-24 13:25:51 -07:00
parent b0790783b7
commit d206a9e5f7
5 changed files with 33 additions and 34 deletions

@@ -8,7 +8,7 @@ zip : sgmd0658-yang.zip
%.zip : %.pdf %.zip : %.pdf
@rm -f dummy.ps @rm -f dummy.ps
@touch dummy.ps @echo 1234 > dummy.ps
zip $@ $*.pdf $*.tex dummy.ps zip $@ $*.pdf $*.tex dummy.ps
clean : clean :

Binary file not shown.

@@ -61,8 +61,7 @@
\maketitle \maketitle
\begin{abstract} \begin{abstract}
Druid is an open Druid is an open source\footnote{\href{http://druid.io/}{http://druid.io/} \href{https://github.com/metamx/druid}{https://github.com/metamx/druid}}
source\footnote{\href{http://druid.io/}{http://druid.io/} \href{https://github.com/metamx/druid}{https://github.com/metamx/druid}}
data store designed for real-time exploratory analytics on large data sets. data store designed for real-time exploratory analytics on large data sets.
The system combines a column-oriented storage layout, a distributed, The system combines a column-oriented storage layout, a distributed,
shared-nothing architecture, and an advanced indexing structure to allow for shared-nothing architecture, and an advanced indexing structure to allow for
@@ -131,7 +130,6 @@ service, and attempts to help inform anyone who faces a similar problem about a
potential method of solving it. Druid is deployed in production at several potential method of solving it. Druid is deployed in production at several
technology technology
companies\footnote{\href{http://druid.io/druid.html}{http://druid.io/druid.html}}. companies\footnote{\href{http://druid.io/druid.html}{http://druid.io/druid.html}}.
The structure of the paper is as follows: we first describe the problem in The structure of the paper is as follows: we first describe the problem in
Section \ref{sec:problem-definition}. Next, we detail system architecture from Section \ref{sec:problem-definition}. Next, we detail system architecture from
the point of view of how data flows through the system in Section the point of view of how data flows through the system in Section
@@ -145,6 +143,21 @@ in Section \ref{sec:related}.
\section{Problem Definition} \section{Problem Definition}
\label{sec:problem-definition} \label{sec:problem-definition}
\begin{table*}
\centering
\begin{tabular}{| l | l | l | l | l | l | l | l |}
\hline
\textbf{Timestamp} & \textbf{Page} & \textbf{Username} & \textbf{Gender} & \textbf{City} & \textbf{Characters Added} & \textbf{Characters Removed} \\ \hline
2011-01-01T01:00:00Z & Justin Bieber & Boxer & Male & San Francisco & 1800 & 25 \\ \hline
2011-01-01T01:00:00Z & Justin Bieber & Reach & Male & Waterloo & 2912 & 42 \\ \hline
2011-01-01T02:00:00Z & Ke\$ha & Helz & Male & Calgary & 1953 & 17 \\ \hline
2011-01-01T02:00:00Z & Ke\$ha & Xeno & Male & Taiyuan & 3194 & 170 \\ \hline
\end{tabular}
\caption{Sample Druid data for edits that have occurred on Wikipedia.}
\label{tab:sample_data}
\end{table*}
Druid was originally designed to solve problems around ingesting and exploring Druid was originally designed to solve problems around ingesting and exploring
large quantities of transactional events (log data). This form of timeseries large quantities of transactional events (log data). This form of timeseries
data is commonly found in OLAP workflows and the nature of the data tends to be data is commonly found in OLAP workflows and the nature of the data tends to be
@@ -160,20 +173,6 @@ there are a set of metric columns that contain values (usually numeric) that
can be aggregated, such as the number of characters added or removed in an can be aggregated, such as the number of characters added or removed in an
edit. edit.
\begin{table*}
\centering
\begin{tabular}{| l | l | l | l | l | l | l | l |}
\hline
\textbf{Timestamp} & \textbf{Page} & \textbf{Username} & \textbf{Gender} & \textbf{City} & \textbf{Characters Added} & \textbf{Characters Removed} \\ \hline
2011-01-01T01:00:00Z & Justin Bieber & Boxer & Male & San Francisco & 1800 & 25 \\ \hline
2011-01-01T01:00:00Z & Justin Bieber & Reach & Male & Waterloo & 2912 & 42 \\ \hline
2011-01-01T02:00:00Z & Ke\$ha & Helz & Male & Calgary & 1953 & 17 \\ \hline
2011-01-01T02:00:00Z & Ke\$ha & Xeno & Male & Taiyuan & 3194 & 170 \\ \hline
\end{tabular}
\caption{Sample Druid data for edits that have occurred on Wikipedia.}
\label{tab:sample_data}
\end{table*}
Our goal is to rapidly compute drill-downs and aggregates over this data. We Our goal is to rapidly compute drill-downs and aggregates over this data. We
want to answer questions like “How many edits were made on the page Justin want to answer questions like “How many edits were made on the page Justin
Bieber from males in San Francisco?” and “What is the average number of Bieber from males in San Francisco?” and “What is the average number of
@@ -218,21 +217,21 @@ designed to perform a specific set of things. We believe this design separates
concerns and simplifies the complexity of the system. The different node types concerns and simplifies the complexity of the system. The different node types
operate fairly independent of each other and there is minimal interaction operate fairly independent of each other and there is minimal interaction
among them. Hence, intra-cluster communication failures have minimal impact among them. Hence, intra-cluster communication failures have minimal impact
on data availability. To solve complex data analysis problems, the different on data availability.
node types come together to form a fully working system. The name Druid comes
from the Druid class in many role-playing games: it is a shape-shifter, capable To solve complex data analysis problems, the different
of taking on many different forms to fulfill various different roles in a node types come together to form a fully working system. The composition of and
group. The composition of and flow of data in a Druid cluster are shown in flow of data in a Druid cluster are shown in Figure~\ref{fig:cluster}. The name Druid comes from the Druid class in many role-playing games: it is a
Figure~\ref{fig:cluster}. shape-shifter, capable of taking on many different forms to fulfill various
different roles in a group.
\begin{figure*} \begin{figure*}
\centering \centering
\includegraphics[width = 4.51in]{cluster} \includegraphics[width = 4.5in]{cluster}
\caption{An overview of a Druid cluster and the flow of data through the cluster.} \caption{An overview of a Druid cluster and the flow of data through the cluster.}
\label{fig:cluster} \label{fig:cluster}
\end{figure*} \end{figure*}
\newpage
\subsection{Real-time Nodes} \subsection{Real-time Nodes}
\label{sec:realtime} \label{sec:realtime}
Real-time nodes encapsulate the functionality to ingest and query event Real-time nodes encapsulate the functionality to ingest and query event
@@ -260,10 +259,11 @@ in \cite{o1996log} and is illustrated in Figure~\ref{fig:realtime_flow}.
\begin{figure} \begin{figure}
\centering \centering
\includegraphics[width = 2.6in]{realtime_flow} \includegraphics[width = 2.6in]{realtime_flow}
\caption{Real-time nodes first buffer events in memory. On a periodic basis, \caption{Real-time nodes buffer events to an in-memory index, which is
the in-memory index is persisted to disk. On another periodic basis, all regularly persisted to disk. On a periodic basis, persisted indexes are then merged
persisted indexes are merged together and handed off. Queries will hit the together before getting handed off.
in-memory index and the persisted indexes.} Queries will hit both the in-memory and persisted indexes.
}
\label{fig:realtime_flow} \label{fig:realtime_flow}
\end{figure} \end{figure}
@@ -428,9 +428,7 @@ caching the results would be unreliable.
\begin{figure*} \begin{figure*}
\centering \centering
\includegraphics[width = 4.5in]{caching} \includegraphics[width = 4.5in]{caching}
\caption{Broker nodes cache per segment results. Every Druid query is mapped to \caption{Results are cached per segment. Queries combine cached results with results computed on historical and real-time nodes.}
a set of segments. Queries often combine cached segment results with those that
need to be computed on historical and real-time nodes.}
\label{fig:caching} \label{fig:caching}
\end{figure*} \end{figure*}
@@ -802,7 +800,7 @@ involving all columns are very rare.
\begin{table} \begin{table}
\centering \centering
\begin{tabular}{| l | l | l |} \scriptsize\begin{tabular}{| l | l | l |}
\hline \hline
\textbf{Data Source} & \textbf{Dimensions} & \textbf{Metrics} \\ \hline \textbf{Data Source} & \textbf{Dimensions} & \textbf{Metrics} \\ \hline
\texttt{a} & 25 & 21 \\ \hline \texttt{a} & 25 & 21 \\ \hline
@@ -814,6 +812,7 @@ involving all columns are very rare.
\texttt{g} & 26 & 18 \\ \hline \texttt{g} & 26 & 18 \\ \hline
\texttt{h} & 78 & 14 \\ \hline \texttt{h} & 78 & 14 \\ \hline
\end{tabular} \end{tabular}
\normalsize
\caption{Characteristics of production data sources.} \caption{Characteristics of production data sources.}
\label{tab:datasources} \label{tab:datasources}
\end{table} \end{table}