more paper updates

2014-02-20 18:15:54 -08:00 · 2014-02-20 18:15:54 -08:00 · 1afcc71227
parent 7b0b90a860
commit 1afcc71227
2 changed files with 48 additions and 11 deletions
--- a/publications/whitepaper/druid.pdf
+++ b/publications/whitepaper/druid.pdf
--- a/publications/whitepaper/druid.tex
+++ b/publications/whitepaper/druid.tex
@@ -928,11 +928,46 @@ inverted indices to perform fast filters are also used in other data
 stores \cite{macnicol2004sybase}.

 \section{Druid in Production}
-Druid is run in production at several organizations and is often part of a more
-sophisticated data analytics stack. We've made multiple design decisions to
-allow for ease of usability, deployment, and monitoring.
+Over the last few years of using Druid, we've gained tremendous
+knowledge about handling production workloads, setting up correct operational
+monitoring, integrating Druid with other products as part of a more
+sophisticated data analytics stack, and distributing data to handle entire data
+center outages. One of the most important lessons we've learned is that no
+amount of testing can accurately simulate a production environment, and failures
+will occur for every imaginable and unimaginable reason. Interestingly, most of
+our most severe crashes were due to misunderstanding the impact a
+seemingly small feature would have on the overall system.
+
+Some of our more interesting observations include:
+\begin{itemize}
+\item Druid is most often used in production to power exploratory dashboards.
+Interestingly, because many users of exploratory dashboards are not from
+technical backgrounds, they often issue queries without understanding the
+impact on the underlying system. For example, some users become impatient that
+their queries for terabytes of data do not return in milliseconds and
+continuously refresh their dashboard view, generating heavy load on Druid. This
+type of usage forced Druid to better defend itself against expensive repetitive
+queries.
+
+\item Cluster query performance benefits from multitenancy. Hosting every
+production datasource in the same cluster leads to better data parallelization
+as additional nodes are added.
+
+\item Even if you provide users with the ability to arbitrarily explore data, they
+often only have a few questions in mind. Caching is extremely important, and in
+fact we see a very high percentage of our query results come from the broker cache.
+
+\item When using a memory-mapped storage engine, even a small amount of paging
+data from disk can severely impact query performance. SSDs can greatly mitigate
+this problem.
+
+\item Leveraging approximate algorithms can greatly reduce data storage costs and
+improve query performance. Many users do not care about exact answers to their
+questions and are comfortable with a few percentage points of error. 
+\end{itemize}

 \subsection{Operational Monitoring}
+Proper monitoring is critical to running a large-scale distributed cluster.
 Each Druid node is designed to periodically emit a set of operational metrics.
 These metrics may include system level data such as CPU usage, available
 memory, and disk capacity, JVM statistics such as garbage collection time, and
@@ -948,11 +983,10 @@ cluster has allowed us to find numerous production problems, such as gradual
 query speed degradations, less than optimally tuned hardware, and various other
 system bottlenecks. We also use a metrics cluster to analyze what queries are
 made in production. This analysis allows us to determine what our users are
-most often doing and we use this information to drive what optimizations we
-should implement.
+most often doing and we use this information to drive our road map.

 \subsection{Pairing Druid with a Stream Processor}
-As the time of writing, Druid can only understand fully denormalized data
+At the time of writing, Druid can only understand fully denormalized data
 streams. In order to provide full business logic in production, Druid can be
 paired with a stream processor such as Apache Storm \cite{marz2013storm}. A
 Storm topology consumes events from a data stream, retains only those that are
@@ -978,11 +1012,14 @@ be desired if one data center is situated much closer to users.
 In this paper, we presented Druid, a distributed, column-oriented, real-time
 analytical data store. Druid is designed to power high performance applications
 and is optimized for low query latencies. Druid supports streaming data
-ingestion and is fault-tolerant. We discussed how Druid was able to
-scan 27 billion rows in a second. We summarized key architecture aspects such
-as the storage format, query language, and general execution. In the future, we
-plan to cover the different algorithms we’ve developed for Druid and how other
-systems may plug into Druid in greater detail.
+ingestion and is fault-tolerant. We discussed Druid benchmarks and
+summarized key architecture aspects such
+as the storage format, query language, and general execution.
+
+In the future, we plan to extend the Druid query language to support full SQL.
+Doing so will require joins, a feature we've held off on implementing because
+we do our joins at the data processing layer. We are also interested in
+exploring more flexible data ingestion and support for less structured data.

 \balance