paper edits

fjy 2014-02-19 18:41:37 -08:00
parent c57cf6ab4d
commit 0107ce6a29
3 changed files with 138 additions and 29 deletions


@@ -366,3 +366,55 @@
year = {2013},
howpublished = "\url{http://www.elasticsearch.com/}"
}
@book{oehler2012ibm,
title={IBM Cognos TM1: The Official Guide},
author={Oehler, Karsten and Gruenes, Jochen and Ilacqua, Christopher and Perez, Manuel},
year={2012},
publisher={McGraw-Hill}
}
@book{schrader2009oracle,
title={Oracle Essbase \& Oracle OLAP},
author={Schrader, Michael and Vlamis, Dan and Nader, Mike and Claterbos, Chris and Collins, Dave and Campbell, Mitch and Conrad, Floyd},
year={2009},
publisher={McGraw-Hill, Inc.}
}
@book{lachev2005applied,
title={Applied Microsoft Analysis Services 2005: And Microsoft Business Intelligence Platform},
author={Lachev, Teo},
year={2005},
publisher={Prologika Press}
}
@article{o1996log,
title={The log-structured merge-tree (LSM-tree)},
author={O'Neil, Patrick and Cheng, Edward and Gawlick, Dieter and O'Neil, Elizabeth},
journal={Acta Informatica},
volume={33},
number={4},
pages={351--385},
year={1996},
publisher={Springer}
}
@inproceedings{o1997improved,
title={Improved query performance with variant indexes},
author={O'Neil, Patrick and Quass, Dallan},
booktitle={ACM SIGMOD Record},
volume={26},
number={2},
pages={38--49},
year={1997},
organization={ACM}
}
@inproceedings{cipar2012lazybase,
title={LazyBase: trading freshness for performance in a scalable database},
author={Cipar, James and Ganger, Greg and Keeton, Kimberly and Morrey III, Charles B. and Soules, Craig A. N. and Veitch, Alistair},
booktitle={Proceedings of the 7th ACM European Conference on Computer Systems},
pages={169--182},
year={2012},
organization={ACM}
}


@@ -76,12 +76,13 @@ could be fully leveraged for our requirements.
We ended up creating Druid, an open-source, distributed, column-oriented,
realtime analytical data store. In many ways, Druid shares similarities with
other OLAP systems \cite{oehler2012ibm, schrader2009oracle, lachev2005applied},
interactive query systems \cite{melnik2010dremel}, main-memory databases
\cite{farber2012sap}, and widely-known distributed data stores
\cite{chang2008bigtable, decandia2007dynamo, lakshman2010cassandra}. The
distribution and query model also borrow ideas from current-generation search
infrastructure \cite{linkedin2013senseidb, apache2013solr,
banon2013elasticsearch}.
This paper describes the architecture of Druid, explores the various design
decisions made in creating an always-on production system that powers a hosted
@@ -202,13 +203,14 @@ Zookeeper.
Real-time nodes maintain an in-memory index buffer for all incoming events.
These indexes are incrementally populated as new events are ingested and the
indexes are also directly queryable. Druid virtually behaves as a row store
for queries on events that exist in this JVM heap-based buffer. To avoid heap
overflow problems, real-time nodes persist their in-memory indexes to disk
either periodically or after some maximum row limit is reached. This persist
process converts data stored in the in-memory buffer to a column-oriented
storage format described in Section \ref{sec:storage-format}. Each persisted
index is immutable and real-time nodes load persisted indexes into off-heap
memory such that they can still be queried. This process is described in detail
in \cite{o1996log} and is illustrated in Figure~\ref{fig:realtime_flow}.
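
As a minimal sketch of this persist-on-threshold behavior (the class names and
row limit here are illustrative, not Druid's actual implementation), the core
loop looks like:
\begin{verbatim}
// Sketch only: an in-memory buffer that is snapshotted into an
// immutable, read-optimized index once a row limit is reached,
// mirroring an LSM-tree's write path.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RealtimeIndex {
  private static final int MAX_ROWS = 500_000;  // hypothetical limit
  private final List<String> inMemoryRows = new ArrayList<>();
  private final List<List<String>> persistedIndexes = new ArrayList<>();

  synchronized void add(String event) {
    inMemoryRows.add(event);
    if (inMemoryRows.size() >= MAX_ROWS) {
      persist();                                // also run on a timer
    }
  }

  // Freeze the buffer into an immutable snapshot (standing in for
  // the column-oriented on-disk format) and start a fresh buffer.
  private void persist() {
    persistedIndexes.add(
        Collections.unmodifiableList(new ArrayList<>(inMemoryRows)));
    inMemoryRows.clear();
  }
}
\end{verbatim}
Queries consult both the in-memory buffer and every persisted snapshot, the
same read path an LSM-tree exposes.
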
\begin{figure}
\centering
@@ -602,20 +604,22 @@ the two arrays.
\end{figure}
This approach of performing Boolean operations on large bitmap sets is commonly
used in search engines. The use of bitmap indices for OLAP workloads is
described in detail in \cite{o1997improved}. Bitmap compression algorithms are a
well-defined area of research and often utilize run-length encoding. Popular
algorithms include Byte-aligned Bitmap Code \cite{antoshenkov1995byte},
Word-Aligned Hybrid (WAH) code \cite{wu2006optimizing}, and Partitioned
Word-Aligned Hybrid (PWAH) compression \cite{van2011memory}. Druid opted to use
the Concise algorithm \cite{colantonio2010concise} as it can outperform WAH by
reducing the size of the compressed bitmaps by up to 50\%.
Figure~\ref{fig:concise_plot} illustrates the number of bytes using Concise
compression versus using an integer array. The results were generated on a
cc2.8xlarge system with a single thread, 2G heap, 512m young gen, and a forced
GC between each run. The data set is a single day's worth of data collected
from the Twitter garden hose \cite{twitter2013} data stream. The data set
contains 2,272,295 rows and 12 dimensions of varying cardinality. As an
additional comparison, we also resorted the data set rows to maximize
compression.
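
As a small sketch of these Boolean bitmap operations (using
\texttt{java.util.BitSet} in place of Concise-compressed bitmaps), a filter
such as \texttt{page = "A" OR page = "B"} reduces to a union of per-value
bitmaps:
\begin{verbatim}
// Sketch: per-value bitmaps for one dimension; BitSet stands in
// for a Concise-compressed bitmap.
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class BitmapIndex {
  private final Map<String, BitSet> valueToRows = new HashMap<>();

  void index(String dimValue, int rowId) {
    valueToRows.computeIfAbsent(dimValue, v -> new BitSet()).set(rowId);
  }

  // Rows where the dimension matches any of the given values (OR).
  BitSet filterOr(String... dimValues) {
    BitSet result = new BitSet();
    for (String v : dimValues) {
      BitSet rows = valueToRows.get(v);
      if (rows != null) {
        result.or(rows);
      }
    }
    return result;
  }
}
\end{verbatim}
Conjunctions across dimensions intersect the per-dimension results with
\texttt{BitSet.and} in the same fashion.
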
In the unsorted case, the total Concise size was 53,451,144 bytes and the total
integer array size was 127,248,520 bytes. Overall, Concise compressed sets are
about 42\% the size of the raw integer arrays.
@@ -887,7 +891,7 @@ algorithms mentioned in PowerDrill.
Although Druid builds on many of the same principles as other distributed
columnar data stores \cite{fink2012distributed}, many of these data stores are
designed to be more generic key-value stores \cite{lakshman2010cassandra} and do not
support computation directly in the storage layer. There are also other data
stores designed for some of the same data warehousing issues that Druid
is meant to solve. These systems include in-memory databases such as
@@ -896,6 +900,13 @@ stores lack Druid's low latency ingestion characteristics. Druid also has
native analytical features baked in, similar to \cite{paraccel2013}; however,
Druid allows system-wide rolling software updates with no downtime.
Druid is similar to \cite{stonebraker2005c, cipar2012lazybase} in that it has
two subsystems: a read-optimized subsystem in the historical nodes and a
write-optimized subsystem in the real-time nodes. Real-time nodes are designed
to ingest a high volume of append-heavy data and do not support data updates.
Unlike the two aforementioned systems, Druid is meant for OLAP transactions and
not OLTP transactions.
Druid's low-latency data ingestion features share some similarities with
Trident/Storm \cite{marz2013storm} and Streaming Spark
\cite{zaharia2012discretized}; however, both systems are focused on stream
@@ -916,7 +927,53 @@ of functionality as Druid, some of Druid's optimization techniques such as using
inverted indices to perform fast filters are also used in other data
stores \cite{macnicol2004sybase}.
\section{Druid in Production}
Druid is run in production at several organizations and is often part of a more
sophisticated data analytics stack. We've made multiple design decisions to
allow for ease of use, deployment, and monitoring.
\subsection{Operational Monitoring}
Each Druid node is designed to periodically emit a set of operational metrics.
These metrics may include system-level data such as CPU usage, available
memory, and disk capacity; JVM statistics such as garbage collection time and
heap usage; or node-specific metrics such as segment scan time, cache
hit rates, and data ingestion latencies. For each query, Druid nodes can also
emit metrics about the details of the query such as the number of filters
applied, or the interval of data requested.
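
A sketch of what one emitted query metric might carry (the field names are
illustrative, not Druid's exact metric schema):
\begin{verbatim}
// Hypothetical per-query metric event, built as a simple map that
// would be serialized and emitted off-node.
import java.util.LinkedHashMap;
import java.util.Map;

class MetricEvents {
  static Map<String, Object> queryMetric(String nodeType, String queryId,
                                         int numFilters, String interval,
                                         long queryTimeMs) {
    Map<String, Object> event = new LinkedHashMap<>();
    event.put("timestamp", System.currentTimeMillis());
    event.put("nodeType", nodeType);     // e.g. "historical"
    event.put("metric", "query/time");   // illustrative metric name
    event.put("value", queryTimeMs);
    event.put("queryId", queryId);
    event.put("numFilters", numFilters);
    event.put("interval", interval);     // data interval requested
    return event;
  }
}
\end{verbatim}
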
Metrics can be emitted from a production Druid cluster into a dedicated metrics
Druid cluster. Queries can be made to the metrics Druid cluster to explore
production cluster performance and stability. Leveraging a dedicated metrics
cluster has allowed us to find numerous production problems, such as gradual
query speed degradations, suboptimally tuned hardware, and various other
system bottlenecks. We also use a metrics cluster to analyze what queries are
made in production. This analysis allows us to determine what our users are
most often doing, and we use this information to prioritize which
optimizations to implement.
\subsection{Pairing Druid with a Stream Processor}
At the time of writing, Druid can only understand fully denormalized data
streams. In order to provide full business logic in production, Druid can be
paired with a stream processor such as Apache Storm \cite{marz2013storm}. A
Storm topology consumes events from a data stream, retains only those that are
``on-time'', and applies any relevant business logic. This could range from
simple transformations, such as id-to-name lookups, up to complex operations
such as multi-stream joins. The Storm topology forwards the processed event
stream to Druid in real-time. Storm handles the streaming data processing work,
and Druid is used for responding to queries on top of both real-time and
historical data.
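
A sketch of such a topology's central bolt, assuming the pre-Apache
\texttt{backtype.storm} API of the era; the on-time window and the id-to-name
lookup are hypothetical:
\begin{verbatim}
// Hypothetical Storm bolt: drop late events, apply an id-to-name
// lookup, and emit the cleaned event for delivery to Druid.
import java.util.Map;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class CleanEventBolt extends BaseBasicBolt {
  private static final long WINDOW_MS = 10 * 60 * 1000; // "on-time"
  private final Map<Long, String> idToName;

  public CleanEventBolt(Map<Long, String> idToName) {
    this.idToName = idToName;
  }

  @Override
  public void execute(Tuple event, BasicOutputCollector collector) {
    long timestamp = event.getLongByField("timestamp");
    if (System.currentTimeMillis() - timestamp > WINDOW_MS) {
      return;  // late event: drop rather than forward to Druid
    }
    long userId = event.getLongByField("userId");
    collector.emit(new Values(timestamp, idToName.get(userId)));
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("timestamp", "userName"));
  }
}
\end{verbatim}
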
\subsection{Multiple Data Center Distribution}
Large-scale production outages may affect not only single nodes, but entire
data centers as well. The tier configuration in Druid coordinator nodes allows
for segments to be replicated across multiple tiers. Hence, segments can be
exactly replicated across historical nodes in multiple data centers.
Similarly, query preference can be assigned to different tiers. It is possible
to have nodes in one data center act as a primary cluster (and receive all
queries) and have a redundant cluster in another data center. Such a setup may
be desired if one data center is situated much closer to users.
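
As an illustration of the decision such a rule encodes (this is a sketch, not
the coordinator's actual rule schema), tiered replication boils down to a map
from tier to replicant count:
\begin{verbatim}
// Illustrative only: per-tier replicant counts that a coordinator
// rule might encode for cross-data-center replication.
import java.util.LinkedHashMap;
import java.util.Map;

class TieredReplicationRule {
  // tier name -> number of segment copies to maintain in that tier
  private final Map<String, Integer> tieredReplicants = new LinkedHashMap<>();

  TieredReplicationRule() {
    tieredReplicants.put("primary_dc", 2);   // serves all queries
    tieredReplicants.put("failover_dc", 1);  // redundant data center
  }

  int replicantsFor(String tier) {
    Integer n = tieredReplicants.get(tier);
    return n == null ? 0 : n;
  }
}
\end{verbatim}
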
\section{Conclusions and Future Work}
\label{sec:conclusions}
In this paper, we presented Druid, a distributed, column-oriented, real-time
analytical data store. Druid is designed to power high-performance applications