next set of updates to paper

fjy 2014-03-07 17:03:22 -08:00
parent d089e65682
commit 1989578e6e
12 changed files with 142 additions and 124 deletions


@ -18,8 +18,8 @@
\numberofauthors{6}
\author{
\alignauthor Fangjin Yang, Eric Tschetter, Gian Merlino, Nelson Ray, Xavier Léauté, Deep Ganguli, Himadri Singh\\
\email{\{fangjin, cheddar, gian, nelson, xavier, deep, himadri\}@metamarkets.com}
\alignauthor Fangjin Yang, Eric Tschetter, Xavier Léauté, Nelson Ray, Gian Merlino, Deep Ganguli\\
\email{\{fangjin, cheddar, xavier, nelson, gian, deep\}@metamarkets.com}
}
\date{21 March 2013}
@ -729,104 +729,133 @@ At the time of writing, the query language does not support joins. Although the
storage format is able to support joins, we've targeted Druid at user-facing
workloads that must return in a matter of seconds, and as such, we've chosen to
not spend the time to implement joins as it has been our experience that
requiring joins on your queries often limits the performance you can achieve.
Implementing joins and extending the Druid API to understand SQL is something
we'd like to do in future work.
requiring joins on your queries often limits the performance you can achieve.
\section{Experimental Results}
\label{sec:benchmarks}
To illustrate Druid's performance, we conducted a series of experiments that
focused on measuring Druid's query and data ingestion capabilities.
\section{Performance}
As Druid is a production system, we've chosen to share some of our performance
measurements from our production cluster. The date range of the data is one
month.
\subsection{Query Performance}
To benchmark Druid query performance, we created a large test cluster with 6TB
of uncompressed data, representing tens of billions of fact rows. The data set
contained more than a dozen dimensions, with cardinalities ranging from the
double digits to tens of millions. We computed four metrics for each row
(counts, sums, and averages). The data was sharded first on timestamp and then
on dimension values, creating thousands of shards with roughly 8 million fact
rows apiece.
Druid query performance can vary significantly depending on the actual query
being issued. For example, determining the approximate cardinality of a given
dimension is a much more expensive operation than a simple sum of a metric
column. Similarly, sorting the values of a high cardinality dimension based on
a given metric is much more expensive than a simple count over a time range.
Furthermore, the time range of a query and the number of metric aggregators in
the query will contribute to query latency. Rather than going into full detail
about every possible query a user can issue, we've chosen to showcase a
higher-level view of the average latencies we see in our production cluster. We
selected 8 of our most queried data sources, described in
Table~\ref{tab:datasources}.
The cluster used in the benchmark consisted of 100 historical nodes, each with
16 cores, 60GB of RAM, 10 GigE Ethernet, and 1TB of disk space. Collectively,
the cluster comprised 1600 cores, 6TB of RAM, sufficiently fast Ethernet, and
more than enough disk space.
\begin{table}
\centering
\caption{Druid Query Datasources}
\label{tab:datasources}
\begin{tabular}{| l | l | l |}
\hline
\textbf{Data Source} & \textbf{Dimensions} & \textbf{Metrics} \\ \hline
\texttt{a} & 10 & 10 \\ \hline
\texttt{b} & 10 & 10 \\ \hline
\texttt{c} & 10 & 10 \\ \hline
\texttt{d} & 10 & 10 \\ \hline
\texttt{e} & 10 & 10 \\ \hline
\texttt{f} & 10 & 10 \\ \hline
\texttt{g} & 10 & 10 \\ \hline
\texttt{h} & 10 & 10 \\ \hline
\end{tabular}
\end{table}
SQL statements for the queries we issued are included in
Table~\ref{tab:sql_queries}. These queries are meant to represent some common
queries that are made against Druid for typical data analysis workflows.
Although Druid has its own query language, we chose to translate the queries
into SQL to better describe what the queries are doing. Some more details of
the cluster and setup:
\begin{itemize}
\item The timestamp range of the queries encompassed all data.
\item Each machine had 16 cores, 60GB of RAM, and 1TB of local disk, and was
configured to use only 15 threads for processing queries.
\item The results are from a ``hot'' tier in our production cluster.
\item There is approximately 10.5TB of RAM available in the ``hot'' tier and approximately 10TB of segments loaded (including replication). Collectively, there are about 50 billion Druid rows in this tier. Results are not shown for every data source.
\item The hot tier uses Xeon E5-2670 processors and consists of 1302 processing threads and 672 total cores (hyperthreaded).
\item A memory-mapped storage engine was used (the machines were configured to memory map the data
instead of loading it into the Java heap).
\end{itemize}
\begin{table*}
\centering
\caption{Druid Queries}
\label{tab:sql_queries}
\begin{tabular}{| l | p{15cm} |}
\hline
\textbf{Query \#} & \textbf{Query} \\ \hline
1 & \texttt{SELECT count(*) FROM \_table\_ WHERE timestamp $\geq$ ? AND timestamp < ?} \\ \hline
2 & \texttt{SELECT count(*), sum(metric1) FROM \_table\_ WHERE timestamp $\geq$ ? AND timestamp < ?} \\ \hline
3 & \texttt{SELECT count(*), sum(metric1), sum(metric2), sum(metric3), sum(metric4) FROM \_table\_ WHERE timestamp $\geq$ ? AND timestamp < ?} \\ \hline
4 & \texttt{SELECT high\_card\_dimension, count(*) AS cnt FROM \_table\_ WHERE timestamp $\geq$ ? AND timestamp < ? GROUP BY high\_card\_dimension ORDER BY cnt LIMIT 100} \\ \hline
5 & \texttt{SELECT high\_card\_dimension, count(*) AS cnt, sum(metric1) FROM \_table\_ WHERE timestamp $\geq$ ? AND timestamp < ? GROUP BY high\_card\_dimension ORDER BY cnt LIMIT 100} \\ \hline
6 & \texttt{SELECT high\_card\_dimension, count(*) AS cnt, sum(metric1), sum(metric2), sum(metric3), sum(metric4) FROM \_table\_ WHERE timestamp $\geq$ ? AND timestamp < ? GROUP BY high\_card\_dimension ORDER BY cnt LIMIT 100} \\ \hline
\end{tabular}
\end{table*}
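As a point of reference, a native Druid query roughly equivalent to query 1
might be expressed along the following lines (the data source name and interval
shown here are illustrative, not taken from our production configuration):
\begin{verbatim}
{
  "queryType": "timeseries",
  "dataSource": "a",
  "granularity": "all",
  "intervals": ["2014-01-01/2014-02-01"],
  "aggregations": [
    {"type": "count", "name": "count"}
  ]
}
\end{verbatim}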
The average query latency is shown in Figure~\ref{fig:avg_query_latency} and
the queries issued per minute are shown in Figure~\ref{fig:queries_per_min}. We
can see that across the various data sources, the average query latency is
approximately 540ms. The 90th percentile query latency across these data
sources is $<$ 1s, the 95th percentile is $<$ 2s, and the 99th percentile is
$<$ 10s. The percentiles are shown in Figure~\ref{fig:query_percentiles}. It is
possible to decrease query latencies by adding additional hardware, but we have
not chosen to do so because infrastructure cost is still a consideration.
Figure~\ref{fig:cluster_scan_rate} shows the cluster scan rate and
Figure~\ref{fig:core_scan_rate} shows the core scan rate. In
Figure~\ref{fig:cluster_scan_rate} we also include projected linear scaling
based on the results of the 25 core cluster. In particular, we observe
diminishing marginal returns to performance as the cluster size grows. Under
linear scaling, the first SQL count query (query 1) would have achieved a speed
of 37 billion rows per second on our 75 node cluster. In fact, the speed was
26 billion rows per second. However, queries 2-6 maintain a near-linear
speedup up to 50 nodes: the core scan rates in Figure~\ref{fig:core_scan_rate}
remain nearly constant. The increase in speed of a parallel computing system
is often limited by the time needed for the sequential operations of the
system, in accordance with Amdahl's law \cite{amdahl1967validity}.
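For reference, Amdahl's law states that if a fraction $s$ of a computation is
inherently sequential, then the maximum speedup attainable with $N$ processors
is
\[
S(N) = \frac{1}{s + \frac{1-s}{N}},
\]
which approaches $1/s$ as $N$ grows, no matter how many nodes are added to the
cluster.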
\begin{figure}
\centering
\includegraphics[width = 2.8in]{avg_query_latency}
\caption{Druid production cluster average query latency across multiple data sources.}
\label{fig:avg_query_latency}
\end{figure}
\begin{figure}
\centering
\includegraphics[width = 2.8in]{cluster_scan_rate}
\caption{Druid cluster scan rate with lines indicating linear scaling from 25 nodes.}
\label{fig:cluster_scan_rate}
\end{figure}
\begin{figure}
\centering
\includegraphics[width = 2.8in]{queries_per_min}
\caption{Druid production cluster queries per minute across multiple data sources.}
\label{fig:queries_per_min}
\end{figure}
\begin{figure}
\centering
\includegraphics[width = 2.8in]{core_scan_rate}
\caption{Druid core scan rate.}
\label{fig:core_scan_rate}
\end{figure}
\begin{figure}
\centering
\includegraphics[width = 2.8in]{query_percentiles}
\caption{Druid production cluster 90th, 95th, and 99th query latency percentiles for 8 most queried data sources.}
\label{fig:query_percentiles}
\end{figure}
The first query listed in Table~\ref{tab:sql_queries} is a simple
count, achieving scan rates of 33M rows/second/core. We believe
the 75 node cluster was actually overprovisioned for the test
dataset, explaining the modest improvement over the 50 node cluster.
Druid's concurrency model is based on shards: one thread will scan one
shard. If the number of segments on a historical node modulo the number
of cores is small but nonzero (e.g. 17 segments and 15 cores), then many of the
cores will be idle during the last round of the computation.
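To make this concrete: scanning 17 segments with 15 threads requires
$\lceil 17/15 \rceil = 2$ rounds, but only 2 of the 15 cores are busy during
the second round, so core utilization over the scan is at most
$17 / (2 \times 15) \approx 57\%$.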
We also present our Druid benchmarks with TPC-H data. Although most of the
TPC-H queries do not directly apply to Druid, we've selected similar queries
to demonstrate Druid's query performance. For comparison, we also provide the
results of the same queries using MySQL with MyISAM (InnoDB was slower in our
tests). We selected MySQL as the baseline comparison because of its universal
popularity. We chose not to select another open source column store because we
were not confident we could correctly tune it to optimize performance. The
results for the 1 GB data set are shown in Figure~\ref{fig:tpch_1gb} and the
results of the 100 GB data set are in Figure~\ref{fig:tpch_100gb}.
When we include more aggregations, we see performance degrade. This is
because of the column-oriented storage format Druid employs. For the
\texttt{count(*)} queries, Druid only has to check the timestamp column to satisfy
the ``where'' clause. As we add metrics, it has to also load those metric
values and scan over them, increasing the amount of memory scanned.
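As a rough back-of-the-envelope illustration (ignoring compression and assuming
4-byte metric values, neither of which reflects our actual production
configuration), each additional metric adds on the order of 4 bytes per scanned
row, so a query over $10^9$ rows that aggregates four metrics must touch roughly
\[
4 \times 4\,\mathrm{B} \times 10^{9} \approx 16\,\mathrm{GB}
\]
of metric data beyond the timestamp column alone.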
\begin{figure}
\centering
\includegraphics[width = 2.8in]{tpch_1gb}
\caption{Druid and MySQL (MyISAM) benchmark results on the 1 GB TPC-H data set.}
\label{fig:tpch_1gb}
\end{figure}
\begin{figure}
\centering
\includegraphics[width = 2.8in]{tpch_100gb}
\caption{Druid and MySQL (MyISAM) benchmark results on the 100 GB TPC-H data set.}
\label{fig:tpch_100gb}
\end{figure}
Finally, we present our results of scaling Druid to meet increasing data load
with the TPC-H 100 GB data set. We observe that when we increased the number of
cores from 8 to 48, we did not see linear scaling; instead, we saw diminishing
marginal returns to performance as the cluster size grew. The increase in
speed of a parallel computing system is often limited by the time needed for
the sequential operations of the system, in accordance with Amdahl's law
\cite{amdahl1967validity}. Our query results and query speedup are shown in
Figure~\ref{fig:tpch_scaling}.
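As a purely illustrative application of Amdahl's law (the serial fraction here
is assumed, not measured), if 10\% of a query's work were sequential, increasing
the core count from 8 to 48 would yield a relative speedup of only
\[
\frac{0.1 + 0.9/8}{0.1 + 0.9/48} \approx 1.8,
\]
far short of the $6\times$ increase in cores.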
\begin{figure}
\centering
\includegraphics[width = 2.8in]{tpch_scaling}
\caption{Druid scaling benchmark results on the 100 GB TPC-H data set.}
\label{fig:tpch_scaling}
\end{figure}
\subsection{Data Ingestion Performance}
To measure Druid's data ingestion latency, we spun up a single real-time node
with the following configurations:
To showcase Druid's data ingestion latency, we selected several production
data sources of varying numbers of dimensions and metrics, and varying event
volumes. Our production ingestion setup is as follows:
\begin{itemize}
\item JVM arguments: -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+HeapDumpOnOutOfMemoryError
\item CPU: 2.3 GHz Intel Core i7
\item Total RAM: 360 GB
\item Total CPU: 12 x Intel Xeon E5-2670 (96 cores)
\item Note: with this setup, several other data sources were being ingested and many other Druid-related ingestion tasks were running concurrently across these machines.
\end{itemize}
Druid's data ingestion latency is heavily dependent on the complexity of the
@ -835,51 +864,40 @@ dimensions in each event, the number of metrics in each event, and the types of
aggregations we want to perform on those metrics. With the most basic data set
(one that only has a timestamp column), our setup can ingest data at a rate of
800k events/sec/node, which is really just a measurement of how fast we can
deserialize events. Real world data sets are never this simple. To simulate
real-world ingestion rates, we created a data set with 5 dimensions and a
single metric. 4 out of the 5 dimensions have a cardinality less than 100, and
we varied the cardinality of the final dimension. The results of varying the
cardinality of a dimension are shown in
Figure~\ref{fig:throughput_vs_cardinality}.
deserialize events. Real world data sets are never this simple. A description
of the data sources we selected is shown in Table~\ref{tab:ingest_datasources}.
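To put the basic ingestion rate in perspective, a sustained 800k events/sec/node
corresponds to roughly
\[
8 \times 10^{5}\ \mathrm{events/sec} \times 86{,}400\ \mathrm{sec/day} \approx
6.9 \times 10^{10}\ \mathrm{events/node/day},
\]
although, as the numbers below show, real-world schemas ingest at substantially
lower rates.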
\begin{table}
\centering
\caption{Druid Ingestion Datasources}
\label{tab:ingest_datasources}
\begin{tabular}{| l | l | l | l |}
\hline
\textbf{Data Source} & \textbf{Dimensions} & \textbf{Metrics} & \textbf{Peak Throughput (events/sec)} \\ \hline
\texttt{s} & 7 & 2 & 28334.60 \\ \hline
\texttt{t} & 10 & 7 & 68808.70 \\ \hline
\texttt{u} & 5 & 1 & 49933.93 \\ \hline
\texttt{v} & 30 & 10 & 22240.45 \\ \hline
\texttt{w} & 35 & 14 & 135763.17 \\ \hline
\texttt{x} & 28 & 6 & 46525.85 \\ \hline
\texttt{y} & 33 & 24 & 162462.41 \\ \hline
\texttt{z} & 33 & 24 & 95747.74 \\ \hline
\end{tabular}
\end{table}
We can see from Table~\ref{tab:ingest_datasources} that ingestion rates vary
significantly across data sources and are not always a function of the number
of dimensions and metrics. Some of the lower rates for simpler data sets are
simply because the event producer was not sending a tremendous amount of data.
A graphical representation of the ingestion rates of these data sources over a
span of time is shown in Figure~\ref{fig:ingestion_rate}.
\begin{figure}
\centering
\includegraphics[width = 2.8in]{throughput_vs_cardinality}
\caption{When we vary the cardinality of a single dimension, we can see monotonically decreasing throughput.}
\label{fig:throughput_vs_cardinality}
\end{figure}
In Figure~\ref{fig:throughput_vs_num_dims}, we instead vary the number of
dimensions in our data set. Each dimension has a cardinality less than 100. We
can see a similar decline in ingestion throughput as the number of dimensions
increases.
\begin{figure}
\centering
\includegraphics[width = 2.8in]{throughput_vs_num_dims}
\caption{Increasing the number of dimensions of our data set also leads to a decline in throughput.}
\label{fig:throughput_vs_num_dims}
\end{figure}
Finally, keeping our number of dimensions constant at 5, with four dimensions
having a cardinality in the 0-100 range and the final dimension having a
cardinality of 10,000, we can see a similar decline in throughput when we
increase the number of metrics/aggregators in the data set. We used random
types of metrics/aggregators in this experiment, ranging from longs and
doubles to other more complex types. The randomization introduces more noise
in the results, leading to a graph that is not strictly decreasing. These
results are shown in Figure~\ref{fig:throughput_vs_num_metrics}. For most real
world data sets, the number of metrics tends to be less than the number of
dimensions. Hence, we can see that introducing a few new metrics does not
impact the ingestion latency as severely as in the other graphs.
\begin{figure}
\centering
\includegraphics[width = 2.8in]{throughput_vs_num_metrics}
\caption{Adding new metrics to a data set decreases ingestion throughput. In most
real world data sets, the number of metrics in a data set tends to be lower
than the number of dimensions.}
\label{fig:throughput_vs_num_metrics}
\includegraphics[width = 2.8in]{ingestion_rate}
\caption{Druid production cluster ingestion rate for multiple data sources.}
\label{fig:ingestion_rate}
\end{figure}
\section{Druid in Production}
