mirror of https://github.com/apache/druid.git
next set of updates to paper
This commit is contained in:
parent d089e65682
commit 1989578e6e
\numberofauthors{6}
\author{
\alignauthor Fangjin Yang, Eric Tschetter, Xavier Léauté, Nelson Ray, Gian Merlino, Deep Ganguli\\
\email{\{fangjin, cheddar, xavier, nelson, gian, deep\}@metamarkets.com}
}
\date{21 March 2013}

At the time of writing, the query language does not support joins. Although the
storage format is able to support joins, we've targeted Druid at user-facing
workloads that must return in a matter of seconds, and as such, we've chosen to
not spend the time to implement joins, as it has been our experience that
requiring joins on your queries often limits the performance you can achieve.

\section{Performance}
\label{sec:benchmarks}
As Druid is a production system, we've chosen to share some of our performance
measurements from our production cluster. The date range of the data is one
month.

\subsection{Query Performance}
Druid query performance can vary significantly depending on the actual query
being issued. For example, determining the approximate cardinality of a given
dimension is a much more expensive operation than a simple sum of a metric
column. Similarly, sorting the values of a high cardinality dimension based on
a given metric is much more expensive than a simple count over a time range.
Furthermore, the time range of a query and the number of metric aggregators in
the query both contribute to query latency. Instead of going into full detail
about every possible query a user can issue, we've instead chosen to showcase a
higher level view of the average latencies we see in our production cluster. We
selected 8 of our most queried data sources, described in
Table~\ref{tab:datasources}.

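To make these relative costs concrete, the statements below are illustrative
SQL renderings of the kinds of operations discussed above (a simple metric sum,
a dimension cardinality estimate, and a sorted group-by over a high cardinality
dimension). Druid has its own query language; the table and column names here
are purely hypothetical and the SQL is shown only for exposition.

{\small
\begin{verbatim}
-- Simple sum of a metric over a time range (cheap)
SELECT sum(metric1) FROM datasource_a
WHERE timestamp >= ? AND timestamp < ?;

-- Cardinality of a dimension (more expensive;
-- Druid estimates this approximately)
SELECT count(DISTINCT high_card_dimension)
FROM datasource_a
WHERE timestamp >= ? AND timestamp < ?;

-- Sorting a high cardinality dimension by a
-- metric (most expensive)
SELECT high_card_dimension, sum(metric1) AS total
FROM datasource_a
WHERE timestamp >= ? AND timestamp < ?
GROUP BY high_card_dimension
ORDER BY total DESC LIMIT 100;
\end{verbatim}
}
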
\begin{table}
\centering
\caption{Druid Query Datasources}
\label{tab:datasources}
\begin{tabular}{| l | l | l |}
\hline
\textbf{Data Source} & \textbf{Dimensions} & \textbf{Metrics} \\ \hline
\texttt{a} & 10 & 10 \\ \hline
\texttt{b} & 10 & 10 \\ \hline
\texttt{c} & 10 & 10 \\ \hline
\texttt{d} & 10 & 10 \\ \hline
\texttt{e} & 10 & 10 \\ \hline
\texttt{f} & 10 & 10 \\ \hline
\texttt{g} & 10 & 10 \\ \hline
\texttt{h} & 10 & 10 \\ \hline
\end{tabular}
\end{table}

Some more details of the cluster:
\begin{itemize}
\item The results are from a ``hot'' tier in our production cluster.
\item There is approximately 10.5TB of RAM available in the ``hot'' tier and
approximately 10TB of segments loaded (including replication). Collectively,
there are about 50 billion Druid rows in this tier. Results for every data
source are not shown.
\item The hot tier uses Xeon E5-2670 processors and consists of 1302 processing
threads and 672 total cores (hyperthreaded).
\item A memory-mapped storage engine was used (the machines were configured to
memory map the data instead of loading it into the Java heap).
\end{itemize}

The average query latency is shown in Figure~\ref{fig:avg_query_latency} and
the queries per minute are shown in Figure~\ref{fig:queries_per_min}. We can
see that across the various data sources, the average query latency is
approximately 540ms. The 90th percentile query latency across these data
sources is $<$ 1s, the 95th percentile is $<$ 2s, and the 99th percentile is
$<$ 10s. The percentiles are shown in Figure~\ref{fig:query_percentiles}. It is
possible to decrease query latencies by adding additional hardware, but we have
not chosen to do so because infrastructure cost is still a consideration.

\begin{figure}
\centering
\includegraphics[width = 2.8in]{avg_query_latency}
\caption{Druid production cluster average query latency across multiple data sources.}
\label{fig:avg_query_latency}
\end{figure}

\begin{figure}
\centering
\includegraphics[width = 2.8in]{queries_per_min}
\caption{Druid production cluster queries per minute across multiple data sources.}
\label{fig:queries_per_min}
\end{figure}

\begin{figure}
\centering
\includegraphics[width = 2.8in]{query_percentiles}
\caption{Druid production cluster 90th, 95th, and 99th query latency percentiles for the 8 most queried data sources.}
\label{fig:query_percentiles}
\end{figure}

We also present our Druid benchmarks with TPC-H data. Although most of the
TPC-H queries do not directly apply to Druid, we've selected similar queries to
demonstrate Druid's query performance. For comparison, we also provide the
results of the same queries using MySQL with MyISAM (InnoDB was slower in our
tests). We selected MySQL as the base comparison because of its universal
popularity. We chose not to select another open source column store because we
were not confident we could correctly tune it to optimize performance. The
results for the 1 GB data set are shown in Figure~\ref{fig:tpch_1gb} and the
results for the 100 GB data set are in Figure~\ref{fig:tpch_100gb}.

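As an illustration of the kind of adaptation involved, an aggregate-only query
in the spirit of TPC-H Q1, restricted to operations Druid supports (no joins),
can be written as follows. This is shown as SQL for exposition only and is
simplified relative to the official TPC-H query; it is not a reproduction of
the exact queries we ran.

{\small
\begin{verbatim}
-- Illustrative TPC-H Q1 style aggregate (no joins),
-- simplified relative to the official query.
SELECT l_returnflag,
       l_linestatus,
       sum(l_quantity)      AS sum_qty,
       sum(l_extendedprice) AS sum_base_price,
       avg(l_discount)      AS avg_disc,
       count(*)             AS count_order
FROM lineitem
WHERE l_shipdate <= date '1998-09-02'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;
\end{verbatim}
}
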
\begin{figure}
\centering
\includegraphics[width = 2.8in]{tpch_1gb}
\caption{Druid and MySQL (MyISAM) query times for TPC-H style queries on the 1 GB data set.}
\label{fig:tpch_1gb}
\end{figure}

\begin{figure}
\centering
\includegraphics[width = 2.8in]{tpch_100gb}
\caption{Druid and MySQL (MyISAM) query times for TPC-H style queries on the 100 GB data set.}
\label{fig:tpch_100gb}
\end{figure}

Finally, we present our results of scaling Druid to meet increasing data load
with the TPC-H 100 GB data set. We observe that when we increase the number of
cores from 8 to 48, we do not see linear scaling; instead, we see diminishing
marginal returns to performance as the cluster grows. The increase in speed of
a parallel computing system is often limited by the time needed for the
sequential operations of the system, in accordance with Amdahl's law
\cite{amdahl1967validity}. Our query results and query speedup are shown in
Figure~\ref{fig:tpch_scaling}.

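For reference, Amdahl's law gives the ideal speedup $S$ on $N$ cores when a
fraction $p$ of the work parallelizes perfectly and the remaining $1 - p$ is
sequential:
\begin{equation}
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}.
\end{equation}
For an illustrative $p = 0.95$, increasing the core count from 8 to 48 raises
the ideal speedup only from roughly $5.9\times$ to $14.3\times$, far less than
the $6\times$ increase in cores, which illustrates the kind of diminishing
returns we observe.
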
\begin{figure}
\centering
\includegraphics[width = 2.8in]{tpch_scaling}
\caption{Scaling Druid with the TPC-H 100 GB data set: query speedup as the number of cores increases from 8 to 48.}
\label{fig:tpch_scaling}
\end{figure}

\subsection{Data Ingestion Performance}
To showcase Druid's data ingestion latency, we selected several production
data sources of varying dimensions, metrics, and event volume. Our production
ingestion setup is as follows:

\begin{itemize}
\item Total RAM: 360 GB
\item Total CPU: 12 x Intel Xeon E5-2670 (96 cores)
\item Note: with this setup, several other data sources are being ingested and
many other Druid-related ingestion tasks are running across these machines.
\end{itemize}

Druid's data ingestion latency is heavily dependent on the complexity of the
data set being ingested. The data complexity is determined by the number of
dimensions in each event, the number of metrics in each event, and the types of
aggregations we want to perform on those metrics. With the most basic data set
(one that only has a timestamp column), our setup can ingest data at a rate of
800k events/sec/node, which is really just a measurement of how fast we can
deserialize events. Real world data sets are never this simple. A description
of the data sources we selected is shown in Table~\ref{tab:ingest_datasources}.

\begin{table}
\centering
\caption{Druid Ingestion Datasources}
\label{tab:ingest_datasources}
\begin{tabular}{| l | l | l | l |}
\hline
\textbf{Data Source} & \textbf{Dims} & \textbf{Mets} & \textbf{Peak Throughput (events/sec)} \\ \hline
\texttt{s} & 7 & 2 & 28334.60 \\ \hline
\texttt{t} & 10 & 7 & 68808.70 \\ \hline
\texttt{u} & 5 & 1 & 49933.93 \\ \hline
\texttt{v} & 30 & 10 & 22240.45 \\ \hline
\texttt{w} & 35 & 14 & 135763.17 \\ \hline
\texttt{x} & 28 & 6 & 46525.85 \\ \hline
\texttt{y} & 33 & 24 & 162462.41 \\ \hline
\texttt{z} & 33 & 24 & 95747.74 \\ \hline
\end{tabular}
\end{table}

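As a rough sketch of why the number of dimensions, metrics, and aggregator
types affects ingestion cost, ingestion is conceptually similar to maintaining
a streaming aggregation of the following form (the SQL is purely conceptual and
the schema and column names are hypothetical):

{\small
\begin{verbatim}
-- Conceptual sketch of ingestion-time aggregation;
-- the schema is hypothetical.
SELECT truncated_timestamp,
       dim_1, dim_2, dim_3,          -- dimensions
       count(*)      AS row_count,   -- aggregated
       sum(metric_1) AS metric_1,    -- metrics
       sum(metric_2) AS metric_2
FROM event_stream
GROUP BY truncated_timestamp, dim_1, dim_2, dim_3;
\end{verbatim}
}

Each additional dimension widens the grouping key and each additional metric
adds per-row aggregation work, so, all else being equal, richer schemas cost
more to ingest, even though the peak throughputs in
Table~\ref{tab:ingest_datasources} also reflect how much data each event
producer actually sends.
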
We can see from Table~\ref{tab:ingest_datasources} that ingestion throughput
varies significantly across data sources and is not always a simple function of
the number of dimensions and metrics. Some of the lower peak throughputs for
simpler data sets occur simply because the event producer was not sending a
tremendous amount of data. The ingestion rates of these data sources over a
span of time are shown in Figure~\ref{fig:ingestion_rate}.

\begin{figure}
\centering
\includegraphics[width = 2.8in]{ingestion_rate}
\caption{Druid production cluster ingestion rate for multiple data sources.}
\label{fig:ingestion_rate}
\end{figure}

\section{Druid in Production}