next set of updates to paper

fjy 2014-03-07 17:03:22 -08:00
parent d089e65682
commit 1989578e6e
12 changed files with 142 additions and 124 deletions


@ -18,8 +18,8 @@
\numberofauthors{6}
\author{
\alignauthor Fangjin Yang, Eric Tschetter, Gian Merlino, Nelson Ray, Xavier Léauté, Deep Ganguli, Himadri Singh\\
\email{\{fangjin, cheddar, gian, nelson, xavier, deep, himadri\}@metamarkets.com}
\alignauthor Fangjin Yang, Eric Tschetter, Xavier Léauté, Nelson Ray, Gian Merlino, Deep Ganguli\\
\email{\{fangjin, cheddar, xavier, nelson, gian, deep\}@metamarkets.com}
}
\date{21 March 2013}
@ -729,104 +729,133 @@ At the time of writing, the query language does not support joins. Although the
storage format is able to support joins, we've targeted Druid at user-facing
workloads that must return in a matter of seconds, and as such, we've chosen to
not spend the time to implement joins as it has been our experience that
requiring joins on your queries often limits the performance you can achieve.
Implementing joins and extending the Druid API to understand SQL is something
we'd like to do in future work.
requiring joins on your queries often limits the performance you can achieve.
\section{Experimental Results}
\label{sec:benchmarks}
To illustrate Druid's performance, we conducted a series of experiments that
focused on measuring Druid's query and data ingestion capabilities.
\section{Performance}
As Druid is a production system, we've chosen to share some of our performance
measurements from our production cluster. The date range of the data is one
month.
\subsection{Query Performance}
To benchmark Druid query performance, we created a large test cluster with 6TB
of uncompressed data, representing tens of billions of fact rows. The data set
contained more than a dozen dimensions, with cardinalities ranging from the
double digits to tens of millions. We computed four metrics for each row
(counts, sums, and averages). The data was sharded first on timestamp and then
on dimension values, creating thousands of shards with roughly 8 million fact
rows apiece.
Druid query performance can vary significantly depending on the actual query
being issued. For example, determining the approximate cardinality of a given
dimension is a much more expensive operation than a simple sum of a metric
column. Similarly, sorting the values of a high cardinality dimension based on
a given metric is much more expensive than a simple count over a time range.
Furthermore, the time range of a query and the number of metric aggregators in
the query will contribute to query latency. Rather than going into full detail
about every possible query a user can issue, we've chosen to showcase a
higher-level view of the average latencies we see in our production cluster. We
selected 8 of our most queried data sources, described in
Table~\ref{tab:datasources}.
The cluster used in the benchmark consisted of 100 historical nodes, each with
16 cores, 60GB of RAM, 10 GigE Ethernet, and 1TB of disk space. Collectively,
the cluster comprised 1600 cores, 6TB of RAM, sufficiently fast Ethernet, and
more than enough disk space.
\begin{table}
\centering
\caption{Druid Query Datasources}
\label{tab:datasources}
\begin{tabular}{| l | l | l |}
\hline
\textbf{Data Source} & \textbf{Dimensions} & \textbf{Metrics} \\ \hline
\texttt{a} & 10 & 10 \\ \hline
\texttt{b} & 10 & 10 \\ \hline
\texttt{c} & 10 & 10 \\ \hline
\texttt{d} & 10 & 10 \\ \hline
\texttt{e} & 10 & 10 \\ \hline
\texttt{f} & 10 & 10 \\ \hline
\texttt{g} & 10 & 10 \\ \hline
\texttt{h} & 10 & 10 \\ \hline
\end{tabular}
\end{table}
SQL statements for the queries we issued are included in
Table~\ref{tab:sql_queries}. These queries are meant to represent some common
queries that are made against Druid for typical data analysis workflows.
Although Druid has its own query language, we chose to translate the queries
into SQL to better describe what the queries are doing. Some more details of
the cluster and setup:
\begin{itemize}
\item The timestamp range of the queries encompassed all data.
\item Each machine had 16 cores, 60GB of RAM, and 1TB of local disk, and was
configured to use only 15 threads for processing queries.
\item The results are from a ``hot'' tier in our production cluster.
\item There is approximately 10.5TB of RAM available in the ``hot'' tier and approximately 10TB of segments loaded (including replication). Collectively, there are about 50 billion Druid rows in this tier. Results are not shown for every data source.
\item The hot tier uses Xeon E5-2670 processors and consists of 1302 processing threads and 672 total cores (hyperthreaded).
\item A memory-mapped storage engine was used (the machines were configured to memory map the data
instead of loading it into the Java heap).
\end{itemize}
\begin{table*}
\centering
\caption{Druid Queries}
\label{tab:sql_queries}
\begin{tabular}{| l | p{15cm} |}
\hline
\textbf{Query \#} & \textbf{Query} \\ \hline
1 & \texttt{SELECT count(*) FROM \_table\_ WHERE timestamp $\geq$ ? AND timestamp < ?} \\ \hline
2 & \texttt{SELECT count(*), sum(metric1) FROM \_table\_ WHERE timestamp $\geq$ ? AND timestamp < ?} \\ \hline
3 & \texttt{SELECT count(*), sum(metric1), sum(metric2), sum(metric3), sum(metric4) FROM \_table\_ WHERE timestamp $\geq$ ? AND timestamp < ?} \\ \hline
4 & \texttt{SELECT high\_card\_dimension, count(*) AS cnt FROM \_table\_ WHERE timestamp $\geq$ ? AND timestamp < ? GROUP BY high\_card\_dimension ORDER BY cnt LIMIT 100} \\ \hline
5 & \texttt{SELECT high\_card\_dimension, count(*) AS cnt, sum(metric1) FROM \_table\_ WHERE timestamp $\geq$ ? AND timestamp < ? GROUP BY high\_card\_dimension ORDER BY cnt LIMIT 100} \\ \hline
6 & \texttt{SELECT high\_card\_dimension, count(*) AS cnt, sum(metric1), sum(metric2), sum(metric3), sum(metric4) FROM \_table\_ WHERE timestamp $\geq$ ? AND timestamp < ? GROUP BY high\_card\_dimension ORDER BY cnt LIMIT 100} \\ \hline
\end{tabular}
\end{table*}
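As a point of reference, a native Druid query roughly equivalent to query 1
might be expressed along the following lines (the data source name and interval
shown here are illustrative, not taken from our production configuration):
\begin{verbatim}
{
  "queryType": "timeseries",
  "dataSource": "a",
  "granularity": "all",
  "intervals": ["2014-01-01/2014-02-01"],
  "aggregations": [
    {"type": "count", "name": "count"}
  ]
}
\end{verbatim}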
The average query latency is shown in Figure~\ref{fig:avg_query_latency} and
the queries issued per minute are shown in Figure~\ref{fig:queries_per_min}. We
can see that across the various data sources, the average query latency is
approximately 540ms. The 90th percentile query latency across these data
sources is $<$ 1s, the 95th percentile is $<$ 2s, and the 99th percentile is
$<$ 10s. The percentiles are shown in Figure~\ref{fig:query_percentiles}. It is
possible to decrease query latencies by adding additional hardware, but we have
not chosen to do so because infrastructure cost is still a consideration.
Figure~\ref{fig:cluster_scan_rate} shows the cluster scan rate and
Figure~\ref{fig:core_scan_rate} shows the core scan rate. In
Figure~\ref{fig:cluster_scan_rate} we also include projected linear scaling
based on the results of the 25 core cluster. In particular, we observe
diminishing marginal returns to performance as the cluster size grows. Under
linear scaling, the first SQL count query (query 1) would have achieved a speed
of 37 billion rows per second on our 75 node cluster. In fact, the speed was
26 billion rows per second. However, queries 2-6 maintain a near-linear
speedup up to 50 nodes: the core scan rates in Figure~\ref{fig:core_scan_rate}
remain nearly constant. The increase in speed of a parallel computing system
is often limited by the time needed for the sequential operations of the
system, in accordance with Amdahl's law \cite{amdahl1967validity}.
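For reference, Amdahl's law states that if a fraction $s$ of a computation is
inherently sequential, then the maximum speedup attainable with $N$ processors
is
\[
S(N) = \frac{1}{s + \frac{1-s}{N}},
\]
which approaches $1/s$ as $N$ grows, no matter how many nodes are added to the
cluster.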
\begin{figure}
\centering
\includegraphics[width = 2.8in]{avg_query_latency}
\caption{Druid production cluster average query latency across multiple data sources.}
\label{fig:avg_query_latency}
\end{figure}
\begin{figure}
\centering
\includegraphics[width = 2.8in]{cluster_scan_rate}
\caption{Druid cluster scan rate with lines indicating linear scaling from 25 nodes.}
\label{fig:cluster_scan_rate}
\end{figure}
\begin{figure}
\centering
\includegraphics[width = 2.8in]{queries_per_min}
\caption{Druid production cluster queries per minute across multiple data sources.}
\label{fig:queries_per_min}
\end{figure}
\begin{figure}
\centering
\includegraphics[width = 2.8in]{core_scan_rate}
\caption{Druid core scan rate.}
\label{fig:core_scan_rate}
\end{figure}
\begin{figure}
\centering
\includegraphics[width = 2.8in]{query_percentiles}
\caption{Druid production cluster 90th, 95th, and 99th query latency percentiles for 8 most queried data sources.}
\label{fig:query_percentiles}
\end{figure}
The first query listed in Table~\ref{tab:sql_queries} is a simple
count, achieving scan rates of 33M rows/second/core. We believe
the 75 node cluster was actually overprovisioned for the test
dataset, explaining the modest improvement over the 50 node cluster.
Druid's concurrency model is based on shards: one thread will scan one
shard. If the number of segments on a historical node modulo the number
of cores is small but nonzero (e.g. 17 segments and 15 cores), then many of the
cores will be idle during the last round of the computation.
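To make this concrete: scanning 17 segments with 15 threads requires
$\lceil 17/15 \rceil = 2$ rounds, but only 2 of the 15 cores are busy during
the second round, so core utilization over the scan is at most
$17 / (2 \times 15) \approx 57\%$.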
We also present our Druid benchmarks with TPC-H data. Although most of the
TPC-H queries do not directly apply to Druid, we've selected similar queries
to demonstrate Druid's query performance. For comparison, we also provide the
results of the same queries using MySQL with MyISAM (InnoDB was slower in our
tests). We selected MySQL as the baseline comparison because of its universal
popularity. We chose not to select another open source column store because we
were not confident we could correctly tune it to optimize performance. The
results for the 1 GB data set are shown in Figure~\ref{fig:tpch_1gb} and the
results of the 100 GB data set are in Figure~\ref{fig:tpch_100gb}.
When we include more aggregations, we see performance degrade. This is
because of the column-oriented storage format Druid employs. For the
\texttt{count(*)} queries, Druid only has to check the timestamp column to satisfy
the ``where'' clause. As we add metrics, it has to also load those metric
values and scan over them, increasing the amount of memory scanned.
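As a rough back-of-the-envelope illustration (ignoring compression and assuming
4-byte metric values, neither of which reflects our actual production
configuration), each additional metric adds on the order of 4 bytes per scanned
row, so a query over $10^9$ rows that aggregates four metrics must touch roughly
\[
4 \times 4\,\mathrm{B} \times 10^{9} \approx 16\,\mathrm{GB}
\]
of metric data beyond the timestamp column alone.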
\begin{figure}
\centering
\includegraphics[width = 2.8in]{tpch_1gb}
\caption{Druid and MySQL (MyISAM) benchmark results on the 1 GB TPC-H data set.}
\label{fig:tpch_1gb}
\end{figure}
\begin{figure}
\centering
\includegraphics[width = 2.8in]{tpch_100gb}
\caption{Druid and MySQL (MyISAM) benchmark results on the 100 GB TPC-H data set.}
\label{fig:tpch_100gb}
\end{figure}
Finally, we present our results of scaling Druid to meet increasing data load
with the TPC-H 100 GB data set. We observe that when we increased the number of
cores from 8 to 48, we did not see linear scaling; instead, we saw diminishing
marginal returns to performance as the cluster size grew. The increase in
speed of a parallel computing system is often limited by the time needed for
the sequential operations of the system, in accordance with Amdahl's law
\cite{amdahl1967validity}. Our query results and query speedup are shown in
Figure~\ref{fig:tpch_scaling}.
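As a purely illustrative application of Amdahl's law (the serial fraction here
is assumed, not measured), if 10\% of a query's work were sequential, increasing
the core count from 8 to 48 would yield a relative speedup of only
\[
\frac{0.1 + 0.9/8}{0.1 + 0.9/48} \approx 1.8,
\]
far short of the $6\times$ increase in cores.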
\begin{figure}
\centering
\includegraphics[width = 2.8in]{tpch_scaling}
\caption{Druid scaling benchmark results on the 100 GB TPC-H data set.}
\label{fig:tpch_scaling}
\end{figure}
\subsection{Data Ingestion Performance}
To measure Druid's data ingestion latency, we spun up a single real-time node
with the following configurations:
To showcase Druid's data ingestion latency, we selected several production
data sources of varying numbers of dimensions and metrics, and varying event
volumes. Our production ingestion setup is as follows:
\begin{itemize}
\item JVM arguments: -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+HeapDumpOnOutOfMemoryError
\item CPU: 2.3 GHz Intel Core i7
\item Total RAM: 360 GB
\item Total CPU: 12 x Intel Xeon E5-2670 (96 cores)
\item Note: with this setup, several other data sources were being ingested and many other Druid-related ingestion tasks were running concurrently across these machines.
\end{itemize}
Druid's data ingestion latency is heavily dependent on the complexity of the
@ -835,51 +864,40 @@ dimensions in each event, the number of metrics in each event, and the types of
aggregations we want to perform on those metrics. With the most basic data set
(one that only has a timestamp column), our setup can ingest data at a rate of
800k events/sec/node, which is really just a measurement of how fast we can
deserialize events. Real world data sets are never this simple. To simulate
real-world ingestion rates, we created a data set with 5 dimensions and a
single metric. 4 out of the 5 dimensions have a cardinality less than 100, and
we varied the cardinality of the final dimension. The results of varying the
cardinality of a dimension are shown in
Figure~\ref{fig:throughput_vs_cardinality}.
deserialize events. Real world data sets are never this simple. A description
of the data sources we selected is shown in Table~\ref{tab:ingest_datasources}.
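To put the basic ingestion rate in perspective, a sustained 800k events/sec/node
corresponds to roughly
\[
8 \times 10^{5}\ \mathrm{events/sec} \times 86{,}400\ \mathrm{sec/day} \approx
6.9 \times 10^{10}\ \mathrm{events/node/day},
\]
although, as the numbers below show, real-world schemas ingest at substantially
lower rates.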
\begin{table}
\centering
\caption{Druid Ingestion Datasources}
\label{tab:ingest_datasources}
\begin{tabular}{| l | l | l | l |}
\hline
\textbf{Data Source} & \textbf{Dimensions} & \textbf{Metrics} & \textbf{Peak Throughput (events/sec)} \\ \hline
\texttt{s} & 7 & 2 & 28334.60 \\ \hline
\texttt{t} & 10 & 7 & 68808.70 \\ \hline
\texttt{u} & 5 & 1 & 49933.93 \\ \hline
\texttt{v} & 30 & 10 & 22240.45 \\ \hline
\texttt{w} & 35 & 14 & 135763.17 \\ \hline
\texttt{x} & 28 & 6 & 46525.85 \\ \hline
\texttt{y} & 33 & 24 & 162462.41 \\ \hline
\texttt{z} & 33 & 24 & 95747.74 \\ \hline
\end{tabular}
\end{table}
We can see from Table~\ref{tab:ingest_datasources} that ingestion rates vary
significantly across data sources and are not always a function of the number
of dimensions and metrics. Some of the lower rates for simpler data sets are
simply because the event producer was not sending a tremendous amount of data.
A graphical representation of the ingestion rates of these data sources over a
span of time is shown in Figure~\ref{fig:ingestion_rate}.
\begin{figure}
\centering
\includegraphics[width = 2.8in]{throughput_vs_cardinality}
\caption{When we vary the cardinality of a single dimension, we can see monotonically decreasing throughput.}
\label{fig:throughput_vs_cardinality}
\end{figure}
In Figure~\ref{fig:throughput_vs_num_dims}, we instead vary the number of
dimensions in our data set. Each dimension has a cardinality less than 100. We
can see a similar decline in ingestion throughput as the number of dimensions
increases.
\begin{figure}
\centering
\includegraphics[width = 2.8in]{throughput_vs_num_dims}
\caption{Increasing the number of dimensions of our data set also leads to a decline in throughput.}
\label{fig:throughput_vs_num_dims}
\end{figure}
Finally, keeping our number of dimensions constant at 5, with four dimensions
having a cardinality in the 0-100 range and the final dimension having a
cardinality of 10,000, we can see a similar decline in throughput when we
increase the number of metrics/aggregators in the data set. We used random
types of metrics/aggregators in this experiment, ranging from longs and
doubles to other more complex types. The randomization introduces more noise
in the results, leading to a graph that is not strictly decreasing. These
results are shown in Figure~\ref{fig:throughput_vs_num_metrics}. For most real
world data sets, the number of metrics tends to be less than the number of
dimensions. Hence, we can see that introducing a few new metrics does not
impact the ingestion latency as severely as in the other graphs.
\begin{figure}
\centering
\includegraphics[width = 2.8in]{throughput_vs_num_metrics}
\caption{Adding new metrics to a data set decreases ingestion throughput. In most
real world data sets, the number of metrics in a data set tends to be lower
than the number of dimensions.}
\label{fig:throughput_vs_num_metrics}
\includegraphics[width = 2.8in]{ingestion_rate}
\caption{Druid production cluster ingestion rate for multiple data sources.}
\label{fig:ingestion_rate}
\end{figure}
\section{Druid in Production}
