mirror of https://github.com/apache/druid.git
commit 7d16960e23 (parent a2f5d37081): more edits

Binary file not shown.
@@ -666,16 +666,16 @@ The exact query syntax depends on the query type and the information requested.
 A sample count query over a week of data is as follows:
 \begin{verbatim}
 {
-  "queryType" : "timeseries",
-  "dataSource" : "wikipedia",
-  "intervals" : "2013-01-01/2013-01-08",
-  "filter" : {
-    "type":"selector",
-    "dimension":"page",
-    "value":"Ke$ha"
-  },
-  "granularity" : "day",
-  "aggregations" : [ {"type":"count", "name":"rows"} ]
+  "queryType" : "timeseries",
+  "dataSource" : "wikipedia",
+  "intervals" : "2013-01-01/2013-01-08",
+  "filter" : {
+    "type" : "selector",
+    "dimension" : "page",
+    "value" : "Ke$ha"
+  },
+  "granularity" : "day",
+  "aggregations" : [{"type":"count", "name":"rows"}]
 }
 \end{verbatim}
 The query shown above will return a count of the number of rows in the Wikipedia datasource
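For reference, a timeseries query of this shape returns a JSON array with one entry per granularity bucket; a minimal sketch of the response shape (the count value here is illustrative, not from the paper):

\begin{verbatim}
[ {
  "timestamp" : "2013-01-01T00:00:00.000Z",
  "result" : { "rows" : 393298 }
}, ... ]
\end{verbatim}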
@@ -697,13 +697,10 @@ equal to "Ke\$ha". The results will be bucketed by day and will be a JSON array
 } ]
 \end{verbatim}
 Druid supports many types of aggregations including double sums, long sums,
-minimums, maximums, and several others. Druid also supports complex aggregations
-such as cardinality estimation and approximate quantile estimation. The
-results of aggregations can be combined in mathematical expressions to form
-other aggregations. The query API is highly customizable and can be extended to
-filter and group results based on almost any arbitrary condition. It is beyond
-the scope of this paper to fully describe the query API but more information
-can be found
+minimums, maximums, and complex aggregations such as cardinality estimation and
+approximate quantile estimation. The results of aggregations can be combined
+in mathematical expressions to form other aggregations. It is beyond the scope
+of this paper to fully describe the query API but more information can be found
 online\footnote{\href{http://druid.io/docs/latest/Querying.html}{http://druid.io/docs/latest/Querying.html}}.
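To make "combined in mathematical expressions" concrete, a minimal sketch of a Druid post-aggregation that divides one aggregate by another to produce an average; the metric names (value, totalValue, avgValue) are illustrative, not from the paper:

\begin{verbatim}
"aggregations" : [
  {"type" : "count", "name" : "rows"},
  {"type" : "doubleSum", "name" : "totalValue",
   "fieldName" : "value"}
],
"postAggregations" : [
  {"type" : "arithmetic", "name" : "avgValue", "fn" : "/",
   "fields" : [
     {"type" : "fieldAccess", "fieldName" : "totalValue"},
     {"type" : "fieldAccess", "fieldName" : "rows"}
   ]}
]
\end{verbatim}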
 
 At the time of writing, the query language does not support joins. Although the
@@ -720,15 +717,17 @@ cluster at Metamarkets. The date range of the data is for February 2014.
 
 \subsection{Query Performance}
 Druid query performance can vary significantly depending on the actual query
-being issued. For example, determining the approximate cardinality of a given
-dimension is a much more expensive operation than a simple sum of a metric
-column. Similarily, sorting the values of a high cardinality dimension based on
-a given metric is much more expensive than a simple count over a time range.
-Furthermore, the time range of a query and the number of metric aggregators in
-the query will contribute to query latency. Instead of going into full detail
-about every query issued in our production cluster, we've instead chosen to
-showcase a higher level view of average latencies in our cluster. We selected 8
-of our most queried data sources, described in Table~\ref{tab:datasources}.
+being issued. For example, sorting the values of a high cardinality dimension
+based on a given metric is much more expensive than a simple count over a time
+range. To showcase the average query latencies in a production Druid cluster,
+we selected 8 of our most queried data sources, described in
+Table~\ref{tab:datasources}. Approximately 30\% of the queries are standard
+aggregates involving different types of metrics and filters, 60\% of queries
+are ordered group bys over one or more dimensions with aggregates, and 10\% of
+queries are search queries and metadata retrieval queries. The number of
+columns scanned in aggregate queries roughly follows an exponential
+distribution. Queries involving a single column are very frequent, and queries
+involving all columns are very rare.
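Since ordered group bys dominate this workload, a minimal sketch of such a query may help; the dataSource, dimension, and metric names below are hypothetical:

\begin{verbatim}
{
  "queryType" : "groupBy",
  "dataSource" : "wikipedia",
  "intervals" : "2013-01-01/2013-01-08",
  "granularity" : "all",
  "dimensions" : ["page"],
  "aggregations" : [{"type" : "longSum", "name" : "edits",
                     "fieldName" : "count"}],
  "limitSpec" : {"type" : "default", "limit" : 10,
    "columns" : [{"dimension" : "edits",
                  "direction" : "descending"}]}
}
\end{verbatim}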
 
 \begin{table}
 \centering
@@ -763,10 +762,7 @@ the queries per minute is shown in Figure~\ref{fig:queries_per_min}. We can see
 that across the various data sources, the average query latency is approximately
 540ms. The 90th percentile query latency across these data sources is < 1s, the
 95th percentile is < 2s, and the 99th percentile is < 10s. The percentiles are
-shown in Figure~\ref{fig:query_percentiles}. It is very possible to possible to
-decrease query latencies by adding additional hardware, but we have not chosen
-to do so because infrastructure cost is still a consideration.
-
+shown in Figure~\ref{fig:query_percentiles}.
 \begin{figure}
 \centering
 \includegraphics[width = 2.8in]{avg_query_latency}
@@ -821,10 +817,9 @@ Finally, we present our results of scaling Druid to meet increasing data
 volumes with the TPC-H 100 GB data set. Our distributed cluster used Amazon EC2
 c3.2xlarge (Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz) instances for broker
 nodes. We observe that when we increased the number of cores from 8 to 48, we
-do not always display linear scaling. The increase in speed of a parallel
+do not always observe linear scaling, as the increase in speed of a parallel
 computing system is often limited by the time needed for the sequential
-operations of the system, in accordance with Amdahl's law
-\cite{amdahl1967validity}. Our query results and query speedup are shown in
+operations of the system. Our query results and query speedup are shown in
 Figure~\ref{fig:tpch_scaling}.
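To make the sequential bottleneck concrete, here is a worked instance of this bound (Amdahl's law); the 90\% parallel fraction is an assumed value for illustration, not a measurement from our cluster:

\[
S(n) = \frac{1}{(1-p) + p/n}, \qquad
\frac{S(48)}{S(8)}\Big|_{p=0.9}
= \frac{0.1 + 0.9/8}{0.1 + 0.9/48}
= \frac{0.2125}{0.11875} \approx 1.8,
\]

so a $6\times$ increase in cores yields less than a $2\times$ speedup when only 90\% of the work parallelizes.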
 
 \begin{figure}
@@ -880,7 +875,9 @@ rate that the data producer was delivering data. The results are shown in
 Figure~\ref{fig:ingestion_rate}. We define throughput as the number of events a
 real-time node can ingest and also make queryable. If too many events are sent
 to the real-time node, those events are blocked until the real-time node has
-capacity to accept them.
+capacity to accept them. The peak ingestion throughput we measured in production
+was 22914.43 events/sec/core on an Amazon EC2 cc2.8xlarge. The data source had
+30 dimensions and 19 metrics.
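As a rough back-of-the-envelope conversion of that per-core figure to a per-node figure (assuming the figure is per vCPU; a cc2.8xlarge exposes 32 vCPUs on 16 physical cores):

\[
22914.43~\text{events/sec/core} \times 32~\text{vCPUs}
\approx 733{,}000~\text{events/sec per node}.
\]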
 
 \begin{figure}
 \centering
@@ -889,8 +886,11 @@ capacity to accept them.
 \label{fig:ingestion_rate}
 \end{figure}
 
-The peak ingestion latency we measured in production was 22914.43 events/sec/core
-on an Amazon EC2 cc2.8xlarge. The data source had 30 dimensions and 19 metrics.
+The latency measurements we presented are sufficient to address our stated
+problems of interactivity. We would prefer the variability in the latencies to
+be lower. It is still possible to decrease latencies by adding additional
+hardware, but we have not chosen to do so because infrastructure cost is still
+a consideration to us.
 
 \section{Druid in Production}
 \label{sec:production}