mirror of https://github.com/apache/druid.git
next set of updates to paper
This commit is contained in:
parent d089e65682
commit 1989578e6e
\numberofauthors{6}
\author{
\alignauthor Fangjin Yang, Eric Tschetter, Xavier Léauté, Nelson Ray, Gian Merlino, Deep Ganguli\\
\email{\{fangjin, cheddar, xavier, nelson, gian, deep\}@metamarkets.com}
}
\date{21 March 2013}

At the time of writing, the query language does not support joins. Although the
storage format is able to support joins, we've targeted Druid at user-facing
workloads that must return in a matter of seconds, and as such, we've chosen to
not spend the time to implement joins, as it has been our experience that
requiring joins on your queries often limits the performance you can achieve.

\section{Performance}
\label{sec:benchmarks}
As Druid is a production system, we've chosen to share some of our performance
measurements from our production cluster. The date range of the data is one
month.

\subsection{Query Performance}
Druid query performance can vary significantly depending on the actual query
being issued. For example, determining the approximate cardinality of a given
dimension is a much more expensive operation than a simple sum of a metric
column. Similarly, sorting the values of a high cardinality dimension based on
a given metric is much more expensive than a simple count over a time range.
Furthermore, the time range of a query and the number of metric aggregators in
the query both contribute to query latency. Instead of going into full detail
about every possible query a user can issue, we've instead chosen to showcase a
higher level view of the average latencies we see in our production cluster. We
selected 8 of our most queried data sources, described in
Table~\ref{tab:datasources}.

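To make these relative costs concrete, the statements below are illustrative
SQL renderings of the kinds of operations discussed above (a simple metric sum,
a dimension cardinality estimate, and a sorted group-by over a high cardinality
dimension). Druid has its own query language; the table and column names here
are purely hypothetical and the SQL is shown only for exposition.

{\small
\begin{verbatim}
-- Simple sum of a metric over a time range (cheap)
SELECT sum(metric1) FROM datasource_a
WHERE timestamp >= ? AND timestamp < ?;

-- Cardinality of a dimension (more expensive;
-- Druid estimates this approximately)
SELECT count(DISTINCT high_card_dimension)
FROM datasource_a
WHERE timestamp >= ? AND timestamp < ?;

-- Sorting a high cardinality dimension by a
-- metric (most expensive)
SELECT high_card_dimension, sum(metric1) AS total
FROM datasource_a
WHERE timestamp >= ? AND timestamp < ?
GROUP BY high_card_dimension
ORDER BY total DESC LIMIT 100;
\end{verbatim}
}
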
\begin{table}
\centering
\caption{Druid Query Datasources}
\label{tab:datasources}
\begin{tabular}{| l | l | l |}
\hline
\textbf{Data Source} & \textbf{Dimensions} & \textbf{Metrics} \\ \hline
\texttt{a} & 10 & 10 \\ \hline
\texttt{b} & 10 & 10 \\ \hline
\texttt{c} & 10 & 10 \\ \hline
\texttt{d} & 10 & 10 \\ \hline
\texttt{e} & 10 & 10 \\ \hline
\texttt{f} & 10 & 10 \\ \hline
\texttt{g} & 10 & 10 \\ \hline
\texttt{h} & 10 & 10 \\ \hline
\end{tabular}
\end{table}

Some more details of the cluster:
\begin{itemize}
\item The results are from a ``hot'' tier in our production cluster.
\item There is approximately 10.5TB of RAM available in the ``hot'' tier and
approximately 10TB of segments loaded (including replication). Collectively,
there are about 50 billion Druid rows in this tier. Results for every data
source are not shown.
\item The hot tier uses Xeon E5-2670 processors and consists of 1302 processing
threads and 672 total cores (hyperthreaded).
\item A memory-mapped storage engine was used (the machines were configured to
memory map the data instead of loading it into the Java heap).
\end{itemize}

The average query latency is shown in Figure~\ref{fig:avg_query_latency} and
the queries per minute are shown in Figure~\ref{fig:queries_per_min}. We can
see that across the various data sources, the average query latency is
approximately 540ms. The 90th percentile query latency across these data
sources is $<$ 1s, the 95th percentile is $<$ 2s, and the 99th percentile is
$<$ 10s. The percentiles are shown in Figure~\ref{fig:query_percentiles}. It is
possible to decrease query latencies by adding additional hardware, but we have
not chosen to do so because infrastructure cost is still a consideration.

\begin{figure}
\centering
\includegraphics[width = 2.8in]{avg_query_latency}
\caption{Druid production cluster average query latency across multiple data sources.}
\label{fig:avg_query_latency}
\end{figure}

\begin{figure}
\centering
\includegraphics[width = 2.8in]{queries_per_min}
\caption{Druid production cluster queries per minute across multiple data sources.}
\label{fig:queries_per_min}
\end{figure}

\begin{figure}
\centering
\includegraphics[width = 2.8in]{query_percentiles}
\caption{Druid production cluster 90th, 95th, and 99th query latency percentiles for the 8 most queried data sources.}
\label{fig:query_percentiles}
\end{figure}

We also present our Druid benchmarks with TPC-H data. Although most of the
TPC-H queries do not directly apply to Druid, we've selected similar queries to
demonstrate Druid's query performance. For comparison, we also provide the
results of the same queries using MySQL with MyISAM (InnoDB was slower in our
tests). We selected MySQL as the base comparison because of its universal
popularity. We chose not to select another open source column store because we
were not confident we could correctly tune it to optimize performance. The
results for the 1 GB data set are shown in Figure~\ref{fig:tpch_1gb} and the
results for the 100 GB data set are in Figure~\ref{fig:tpch_100gb}.

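As an illustration of the kind of adaptation involved, an aggregate-only query
in the spirit of TPC-H Q1, restricted to operations Druid supports (no joins),
can be written as follows. This is shown as SQL for exposition only and is
simplified relative to the official TPC-H query; it is not a reproduction of
the exact queries we ran.

{\small
\begin{verbatim}
-- Illustrative TPC-H Q1 style aggregate (no joins),
-- simplified relative to the official query.
SELECT l_returnflag,
       l_linestatus,
       sum(l_quantity)      AS sum_qty,
       sum(l_extendedprice) AS sum_base_price,
       avg(l_discount)      AS avg_disc,
       count(*)             AS count_order
FROM lineitem
WHERE l_shipdate <= date '1998-09-02'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
\end{verbatim}
}
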
\begin{figure}
\centering
\includegraphics[width = 2.8in]{tpch_1gb}
\caption{Druid and MySQL (MyISAM) query times for TPC-H style queries on the 1 GB data set.}
\label{fig:tpch_1gb}
\end{figure}

\begin{figure}
\centering
\includegraphics[width = 2.8in]{tpch_100gb}
\caption{Druid and MySQL (MyISAM) query times for TPC-H style queries on the 100 GB data set.}
\label{fig:tpch_100gb}
\end{figure}

Finally, we present our results of scaling Druid to meet increasing data load
with the TPC-H 100 GB data set. We observe that when we increase the number of
cores from 8 to 48, we do not see linear scaling; instead, we see diminishing
marginal returns to performance as the cluster grows. The increase in speed of
a parallel computing system is often limited by the time needed for the
sequential operations of the system, in accordance with Amdahl's law
\cite{amdahl1967validity}. Our query results and query speedup are shown in
Figure~\ref{fig:tpch_scaling}.

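For reference, Amdahl's law gives the ideal speedup $S$ on $N$ cores when a
fraction $p$ of the work parallelizes perfectly and the remaining $1 - p$ is
sequential:
\begin{equation}
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}.
\end{equation}
For an illustrative $p = 0.95$, increasing the core count from 8 to 48 raises
the ideal speedup only from roughly $5.9\times$ to $14.3\times$, far less than
the $6\times$ increase in cores, which illustrates the kind of diminishing
returns we observe.
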
\begin{figure}
\centering
\includegraphics[width = 2.8in]{tpch_scaling}
\caption{Scaling Druid with the TPC-H 100 GB data set: query speedup as the number of cores increases from 8 to 48.}
\label{fig:tpch_scaling}
\end{figure}

\subsection{Data Ingestion Performance}
To showcase Druid's data ingestion latency, we selected several production
data sources of varying dimensions, metrics, and event volume. Our production
ingestion setup is as follows:

\begin{itemize}
\item Total RAM: 360 GB
\item Total CPU: 12 x Intel Xeon E5-2670 (96 cores)
\item Note: with this setup, several other data sources are being ingested and
many other Druid-related ingestion tasks are running across these machines.
\end{itemize}

Druid's data ingestion latency is heavily dependent on the complexity of the
data set being ingested. The data complexity is determined by the number of
dimensions in each event, the number of metrics in each event, and the types of
aggregations we want to perform on those metrics. With the most basic data set
(one that only has a timestamp column), our setup can ingest data at a rate of
800k events/sec/node, which is really just a measurement of how fast we can
deserialize events. Real world data sets are never this simple. A description
of the data sources we selected is shown in Table~\ref{tab:ingest_datasources}.

\begin{table}
\centering
\caption{Druid Ingestion Datasources}
\label{tab:ingest_datasources}
\begin{tabular}{| l | l | l | l |}
\hline
\textbf{Data Source} & \textbf{Dims} & \textbf{Mets} & \textbf{Peak Throughput (events/sec)} \\ \hline
\texttt{s} & 7 & 2 & 28334.60 \\ \hline
\texttt{t} & 10 & 7 & 68808.70 \\ \hline
\texttt{u} & 5 & 1 & 49933.93 \\ \hline
\texttt{v} & 30 & 10 & 22240.45 \\ \hline
\texttt{w} & 35 & 14 & 135763.17 \\ \hline
\texttt{x} & 28 & 6 & 46525.85 \\ \hline
\texttt{y} & 33 & 24 & 162462.41 \\ \hline
\texttt{z} & 33 & 24 & 95747.74 \\ \hline
\end{tabular}
\end{table}

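As a rough sketch of why the number of dimensions, metrics, and aggregator
types affects ingestion cost, ingestion is conceptually similar to maintaining
a streaming aggregation of the following form (the SQL is purely conceptual and
the schema and column names are hypothetical):

{\small
\begin{verbatim}
-- Conceptual sketch of ingestion-time aggregation;
-- the schema is hypothetical.
SELECT truncated_timestamp,
       dim_1, dim_2, dim_3,          -- dimensions
       count(*)      AS row_count,   -- aggregated
       sum(metric_1) AS metric_1,    -- metrics
       sum(metric_2) AS metric_2
FROM event_stream
GROUP BY truncated_timestamp, dim_1, dim_2, dim_3;
\end{verbatim}
}

Each additional dimension widens the grouping key and each additional metric
adds per-row aggregation work, so, all else being equal, richer schemas cost
more to ingest, even though the peak throughputs in
Table~\ref{tab:ingest_datasources} also reflect how much data each event
producer actually sends.
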
We can see from Table~\ref{tab:ingest_datasources} that ingestion throughput
varies significantly across data sources and is not always a simple function of
the number of dimensions and metrics. Some of the lower peak throughputs for
simpler data sets occur simply because the event producer was not sending a
tremendous amount of data. The ingestion rates of these data sources over a
span of time are shown in Figure~\ref{fig:ingestion_rate}.

\begin{figure}
\centering
\includegraphics[width = 2.8in]{ingestion_rate}
\caption{Druid production cluster ingestion rate for multiple data sources.}
\label{fig:ingestion_rate}
\end{figure}

\section{Druid in Production}