New comparisons for Druid

2015-11-06 12:21:51 -08:00 · 2015-11-06 12:21:51 -08:00 · 0b319093df
parent 61139b9dfa
commit 0b319093df
6 changed files with 245 additions and 0 deletions
--- a/docs/content/comparisons/druid-vs-commericial-solutions.md
+++ b/docs/content/comparisons/druid-vs-commericial-solutions.md
@ -0,0 +1,62 @@
 ---
 layout: doc_page
 ---
 Druid vs Proprietary Commercial Solutions (Vertica/Redshift)
 ============================================================
 The proprietary database world has numerous solutions from numerous vendors. At a very high level, Druid is distinguished from most 
 of these solutions by the scale at which it can operate, the scale and performance of its streaming data ingestion, the performance of ad-hoc exploratory queries, 
 its support for multi-tenancy (which makes Druid well suited to powering user-facing applications), and how cost-effective it is to run at scale. Druid is an open source project and its development 
 is community led; the direction of the project is driven by the needs of the community. Below, we highlight comparisons with some specific 
 systems. We welcome contributions of additional comparisons.
 ## Druid vs Redshift
 ### How does Druid compare to Redshift?
 Redshift started out as ParAccel (Actian), which Amazon licenses and has since heavily modified.
 Aside from potential performance differences, there are some functional differences:
 ### Real-time data ingestion
 Because Druid is optimized to provide insight into massive quantities of streaming data, it is able to load and aggregate data in real time.
 Traditional data warehouses, including column stores, generally work only with batch ingestion and are not well suited to continuously streaming data in.
 ### Druid is a read oriented analytical data store
 Druid’s write semantics are not as fluid as Redshift’s, and Druid does not support full joins (it supports large table to small table joins). Redshift provides full SQL support, including joins and insert/update statements.
 ### Data distribution model
 Druid’s data distribution is segment-based and leverages a highly available "deep" storage such as S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of historical nodes does not result in data loss because new historical nodes can always be brought up by reading data from "deep" storage.
 In contrast, ParAccel’s data distribution model is hash-based. Expanding the cluster requires re-hashing the data across the nodes, which is difficult to do without downtime. Amazon’s Redshift works around this issue with a multi-step process:
 * set cluster into read-only mode
 * copy data from cluster to new cluster that exists in parallel
 * redirect traffic to new cluster
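The cost of re-hashing can be illustrated with a small sketch. It assumes the simplest possible scheme, where a row's node is `hash(key) % num_nodes`; this is an illustration only, not a description of how ParAccel or Redshift actually place data. The function names and key format are hypothetical.

```python
import zlib

def node_for(key: str, num_nodes: int) -> int:
    """Assign a row key to a node with naive modulo hashing (an assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    """Fraction of keys whose owning node changes when the cluster is resized."""
    moved = sum(
        1 for k in keys if node_for(k, old_nodes) != node_for(k, new_nodes)
    )
    return moved / len(keys)

# Hypothetical row keys; grow the cluster from 4 nodes to 5.
keys = [f"row-{i}" for i in range(10_000)]
moved = fraction_moved(keys, old_nodes=4, new_nodes=5)
print(f"{moved:.0%} of rows change nodes")
```

Under this naive scheme most rows change owners when a single node is added, which is why a resize tends to look like a full copy. In a segment-based model, by comparison, a new node only needs to load the segments assigned to it from deep storage.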
 ### Replication strategy
 Druid employs segment-level data distribution, meaning that more nodes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying. Replication is done automatically and without any impact on performance.
 ParAccel’s hash-based distribution generally means that replication is conducted via hot spares. This puts a numerical limit on the number of nodes you can lose without losing data, and this replication strategy often does not allow the hot spare to help share query load.
 ### Indexing strategy
 Along with column oriented structures, Druid uses indexing structures to speed up query execution when a filter is provided. Indexing structures do increase storage overhead (and make it more difficult to allow for mutation), but they also significantly speed up queries.
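As a rough illustration of why an index speeds up filtered queries, the sketch below builds a toy bitmap index: one bitset of row ids per distinct column value, so a filter becomes a bitwise operation instead of a row-by-row scan. It uses plain Python integers as bitsets and is not Druid's actual on-disk format; the column names and data are hypothetical.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Map each distinct value to an int used as a bitset of matching row ids."""
    index = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return index

def rows_in(bitmap: int):
    """Expand a bitset back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Two hypothetical dimension columns over the same six rows.
country = ["US", "CA", "US", "DE", "CA", "US"]
device = ["ios", "ios", "web", "web", "ios", "ios"]

country_idx = build_bitmap_index(country)
device_idx = build_bitmap_index(device)

# A filter like country = 'US' AND device = 'ios' is a bitwise AND of two bitmaps.
matches = rows_in(country_idx["US"] & device_idx["ios"])
print(matches)
```

The extra bitmaps are the storage overhead mentioned above, and they would have to be rebuilt if rows were mutated in place.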
 ParAccel does not appear to employ indexing strategies.
 ## Druid vs Vertica
 ### How does Druid compare to Vertica?
 Vertica is similar to ParAccel/Redshift described above in that it wasn’t built for real-time streaming data ingestion and it supports full SQL.
 The other big difference is that instead of employing indexing, Vertica tries to optimize processing by leveraging run-length encoding (RLE) and other compression techniques along with a "projection" system that creates materialized copies of the data in a different sort order (to maximize the effectiveness of RLE).
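The link between sort order and RLE can be seen in a minimal sketch. This is a generic run-length encoder, not Vertica's implementation, and the sample column is made up.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# A hypothetical low-cardinality column in arrival order.
column = ["b", "a", "c", "a", "b", "a", "c", "b", "a"]

unsorted_runs = rle_encode(column)         # no adjacent repeats: one run per row
sorted_runs = rle_encode(sorted(column))   # sorted copy: one run per distinct value

print(len(unsorted_runs), len(sorted_runs))
```

Storing a copy of the data sorted on a given column collapses that column into a few long runs, which is the effect a projection in a different sort order is meant to achieve.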
 We are unclear about how Vertica handles data distribution and replication, so we cannot speak to if/how Druid is different.
--- a/docs/content/comparisons/druid-vs-computing-frameworks.md
+++ b/docs/content/comparisons/druid-vs-computing-frameworks.md
@ -0,0 +1,46 @@
 ---
 layout: doc_page
 ---
 Druid vs General Computing Frameworks (Spark/Hadoop)
 ====================================================
 Druid is much more complementary to general computing frameworks than it is competitive. General compute engines can 
 very flexibly process and transform data. Druid acts as a query accelerator to these systems by indexing raw data into a custom 
 column format to speed up OLAP queries.
 ## Druid vs Hadoop (HDFS/MapReduce)
 Hadoop has shown the world that it’s possible to house your data warehouse on commodity hardware for a fraction of the price 
 of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things.
 1.  They can now query all of their data in a fairly flexible manner and answer any question they have
 2.  The queries take a long time
 The first is the joy that everyone feels the first time they get Hadoop running. The second is what they realize after they have used Hadoop interactively for a while, because Hadoop is optimized for throughput, not latency. 
 Druid is a complementary addition to Hadoop. Hadoop is great at storing and making accessible large amounts of individually low-value data. 
 Unfortunately, Hadoop is not great at providing query speed guarantees on top of that data, nor does it have very good 
 operational characteristics for a customer-facing production system. Druid, on the other hand, excels at taking high-value 
 summaries of the low-value data on Hadoop, making it available in a fast and always-on fashion, such that it could be exposed directly to a customer.
 Druid also requires some infrastructure to exist for [deep storage](../dependencies/deep-storage.html). 
 HDFS is one of the implemented options for this [deep storage](../dependencies/deep-storage.html).
 Please note that we are only comparing Druid to base Hadoop here, but we welcome comparisons of Druid with other systems or combinations 
 of systems in the Hadoop ecosystem.
 ## Druid vs Spark
 Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and
 can be viewed as a back-office analytics platform.  RDDs enable data reuse by persisting intermediate results
 in memory and enable Spark to provide fast computations for iterative algorithms.
 This is especially beneficial for certain workflows such as machine
 learning, where the same operation may be applied over and over
 again until some result is converged upon.  Spark provides analysts with
 the ability to run queries and analyze large amounts of data with a
 wide array of different algorithms.
 Druid is designed to power analytic applications and focuses on the latency of ingesting data and serving queries
 over that data. If you were to build an application where users could
 arbitrarily explore data, the latencies seen with Spark would likely be too high for an interactive experience.
--- a/docs/content/comparisons/druid-vs-key-value.md
+++ b/docs/content/comparisons/druid-vs-key-value.md
@ -0,0 +1,28 @@
 ---
 layout: doc_page
 ---
 Druid vs Key/Value Stores (HBase/Cassandra/OpenTSDB)
 ====================================================
 Druid is highly optimized for scans and aggregations, and it supports arbitrarily deep drill-downs into data sets. This same functionality 
 is supported in key/value stores in two ways:
 1. Pre-compute all permutations of possible user queries
 2. Range scans on event data
 When pre-computing results, the key is the exact parameters of the query, and the value is the result of the query.  
 The queries return extremely quickly, but at the cost of flexibility, as ad-hoc exploratory queries are not possible without 
 pre-computing every possible query permutation. Pre-computing all permutations of all ad-hoc queries leads to result sets 
 that grow exponentially with the number of columns of a data set, and pre-computing queries for complex real-world data sets 
 can require hours of pre-processing time.
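The exponential growth can be made concrete by counting only the sets of columns a query might group by, before filter values are even considered. The sketch below enumerates them for a hypothetical four-dimension data set; the dimension names are illustrative.

```python
from itertools import combinations

def grouping_sets(dimensions):
    """Every subset of dimensions that a query could group by."""
    return [
        subset
        for size in range(len(dimensions) + 1)
        for subset in combinations(dimensions, size)
    ]

# Hypothetical dimension columns.
dims = ["country", "device", "browser", "page"]
all_groupings = grouping_sets(dims)

# n dimensions give 2**n grouping sets; each one is a separate result to pre-compute.
print(len(all_groupings), 2 ** len(dims))
```

Each additional column doubles the number of grouping sets, and every grouping set holds one row per combination of values it contains, so real data sets grow out of reach quickly.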
 The other approach to using key/value stores for aggregations is to use the dimensions of an event as the key and the event measures as the value. 
 Aggregations are done by issuing range scans on this data. Time-series-specific databases such as OpenTSDB use this approach. 
 One of the limitations here is that the key/value storage model does not have indexes for any kind of filtering other than prefix ranges, 
 which can be used to filter a query down to a metric and time range, but cannot resolve complex predicates to narrow the exact data to scan. 
 When the number of rows to scan gets large, this limitation can greatly reduce performance. It is also harder to achieve good 
 locality with key/value stores because most don’t support pushing down aggregates to the storage layer.
 For arbitrary exploration of data (flexible data filtering), Druid's custom column format enables ad-hoc queries without pre-computation. The format 
 also enables fast scans on columns, which is important for good aggregation performance.
--- a/docs/content/comparisons/druid-vs-search-systems.md
+++ b/docs/content/comparisons/druid-vs-search-systems.md
@ -0,0 +1,21 @@
 ---
 layout: doc_page
 ---
 Druid vs Search Systems (Elasticsearch/Solr)
 ============================================
 We are not experts on search systems. If anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.
 Elasticsearch and Solr are search systems based on Apache Lucene. They provide full text search for schema-free documents 
 and provide access to raw event-level data. Search systems are increasingly adding more support for analytics and aggregations. 
 [Some members of the community](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ) have pointed out that 
 the resource requirements for data ingestion and aggregation in search systems are much higher than those of Druid.
 Search systems also do not support data summarization/roll-up at ingestion time. Roll-up can compact the data that needs to be 
 stored by up to 100x with real-world data sets, so search systems tend to have greater storage requirements.
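Roll-up at ingestion time can be pictured as a group-by applied before the data is stored. The sketch below truncates timestamps to the hour and sums a metric per combination of dimension values. It is a simplified model with hypothetical field names, not Druid's ingestion code.

```python
from collections import defaultdict
from datetime import datetime

def roll_up(events, dimensions, metric):
    """Collapse events into one row per (hour, dimension values).

    Returns a dict mapping each key to (sum of metric, number of raw events).
    """
    totals = defaultdict(lambda: [0, 0])
    for event in events:
        hour = event["timestamp"].replace(minute=0, second=0, microsecond=0)
        key = (hour,) + tuple(event[d] for d in dimensions)
        totals[key][0] += event[metric]
        totals[key][1] += 1
    return {key: tuple(value) for key, value in totals.items()}

# Hypothetical raw events.
events = [
    {"timestamp": datetime(2015, 11, 6, 12, 1), "page": "home", "country": "US", "bytes": 100},
    {"timestamp": datetime(2015, 11, 6, 12, 20), "page": "home", "country": "US", "bytes": 250},
    {"timestamp": datetime(2015, 11, 6, 12, 45), "page": "docs", "country": "US", "bytes": 80},
    {"timestamp": datetime(2015, 11, 6, 13, 5), "page": "home", "country": "US", "bytes": 40},
]

rolled = roll_up(events, dimensions=["page", "country"], metric="bytes")
print(len(events), "raw events ->", len(rolled), "stored rows")
```

The compaction ratio depends entirely on how often events share the same time bucket and dimension values, and the individual events can no longer be recovered from the stored rows.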
 Druid focuses on OLAP workflows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost, 
 and supports a wide range of analytic operations. Druid has some basic search support for structured event data, but does not support 
 full text search. Druid also does not support completely unstructured data. Measures must be defined in a Druid schema such that 
 summarization/roll-up can be done.
--- a/docs/content/comparisons/druid-vs-sql-on-hadoop.md
+++ b/docs/content/comparisons/druid-vs-sql-on-hadoop.md
@ -0,0 +1,58 @@
 ---
 layout: doc_page
 ---
 Druid vs SQL-on-Hadoop (Hive/Impala/Drill/Spark SQL/Presto)
 ===========================================================
 Druid is much more complementary to SQL-on-Hadoop engines than it is competitive. SQL-on-Hadoop engines provide an 
 execution engine for various data formats and data stores, and 
 many can be made to push computations down to Druid while providing a SQL interface to Druid.
 For a direct comparison between the technologies, and for deciding when to use only one or the other, it basically comes down to your 
 product requirements and what the systems were designed to do.  
 Druid was designed to
 1. be an always-on service
 1. ingest data in real-time
 1. handle slice-n-dice style ad-hoc queries
 SQL-on-Hadoop engines (as far as we are aware) were designed to replace Hadoop MapReduce with another, faster query layer 
 that is completely generic and plays well with the rest of the Hadoop ecosystem.
 What does this mean?  We can talk about it in terms of three general areas
 1. Queries
 1. Data Ingestion
 1. Query Flexibility
 ## Queries
 Druid segments store data in a custom column format. Segments are scanned directly as part of queries, and each Druid server 
 calculates a set of results that are eventually merged at the Broker level. This means that the only data transferred between servers 
 is queries and results, and all computation is done internally within the Druid servers.
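The scatter/gather pattern described here can be sketched as follows: each server aggregates over its own segments and returns only a small partial result, and a broker-like step merges the partials. This is a toy model with hypothetical function names and data, not Druid's query execution code.

```python
from collections import Counter

def scan_segment(rows, dimension, metric):
    """Per-server step: aggregate a metric by dimension over local rows."""
    partial = Counter()
    for row in rows:
        partial[row[dimension]] += row[metric]
    return partial

def merge_partials(partials):
    """Broker-like step: combine partial aggregates from all servers."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts key by key
    return merged

# Hypothetical rows held by two different servers.
server_a = [{"page": "home", "edits": 3}, {"page": "docs", "edits": 1}]
server_b = [{"page": "home", "edits": 2}, {"page": "blog", "edits": 5}]

partials = [scan_segment(rows, "page", "edits") for rows in (server_a, server_b)]
merged = merge_partials(partials)
print(dict(merged))
```

Only the partial aggregates cross the network in this model; the raw rows never leave the server that holds them.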
 Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats. 
 They are processes that stay on even if there is no query running (eliminating the JVM startup costs of Hadoop MapReduce), 
 and they have facilities to cache data locally so that it can be accessed and updated more quickly.  
 Many SQL-on-Hadoop engines have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still 
 some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly 
 how much of a performance impact this has.
 ## Data Ingestion
 Druid is built to allow for real-time ingestion of data.  You can ingest data and query it immediately upon ingestion; 
 the delay before an event is reflected in query results is dominated by how long it takes to deliver the event to Druid.
 SQL-on-Hadoop engines, being based on data in HDFS or some other backing store, are limited in their data ingestion rates by the 
 rate at which that backing store can make data available.  Generally, the backing store is the biggest bottleneck for 
 how quickly data can become available.
 ## Query Flexibility
 Druid's query language is fairly low level and maps to how Druid operates internally. Although Druid can be combined with a high level query 
 planner such as [Plywood](https://github.com/implydata/plywood) to support most SQL queries and analytic SQL queries (minus joins among large tables), 
 base Druid is less flexible than SQL-on-Hadoop solutions for generic processing.
 SQL-on-Hadoop engines support SQL-style queries with full joins.
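To give a feel for how low level Druid's query language is, the sketch below builds a native timeseries query as a JSON document. The field names follow the native query format as we understand it and should be checked against the querying documentation for your Druid version; the data source, dimension, and metric names are hypothetical.

```python
import json

def timeseries_query(data_source, interval, metric, dimension, value):
    """Build a native-style timeseries query: hourly sums of one metric, with one filter."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "hour",
        "intervals": [interval],
        # Roughly: WHERE <dimension> = <value>
        "filter": {"type": "selector", "dimension": dimension, "value": value},
        # Roughly: SUM(<metric>) AS total_<metric>
        "aggregations": [
            {"type": "longSum", "name": f"total_{metric}", "fieldName": metric}
        ],
    }

# Hypothetical data source and columns.
query = timeseries_query(
    "page_events", "2015-11-06/2015-11-07", "bytes", "country", "US"
)
body = json.dumps(query, indent=2)
print(body)
```

A document like this is typically sent to a Broker over HTTP. The equivalent SQL would be a single `SELECT` with a `WHERE` and a `GROUP BY` on the hour, which is the kind of translation a planner such as Plywood performs.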
--- a/docs/content/comparisons/druid-vs-storage-formats.md
+++ b/docs/content/comparisons/druid-vs-storage-formats.md
@ -0,0 +1,30 @@
 ---
 layout: doc_page
 ---
 Druid vs Storage Formats (Parquet/Kudu)
 =======================================
 The biggest difference between Druid and existing storage formats is that Druid includes an execution engine that can run 
 queries, and a time-optimized data management system for advanced data retention and distribution. 
 The Druid segment is a custom column format designed for fast aggregates and filters. Below we compare Druid's segment format to 
 other existing formats. 
 ## Druid vs Parquet
 Druid's storage format is highly optimized for linear scans. Although Druid has support for nested data, Parquet's storage format is much 
 more hierarchical, and is designed more for binary chunking. In theory, this should lead to faster scans in Druid.
 ## Druid vs Kudu
 Kudu's storage format enables single-row updates, whereas updating an existing Druid segment requires recreating the segment; hence, 
 the process for updating old values is slower in Druid. Kudu's requirements for maintaining extra head space to store 
 updates, as well as organizing data by id instead of time, have the potential to introduce some extra latency and to cause 
 data that is not needed to answer a query to be accessed at query time. Druid summarizes/rolls up data at ingestion time, which in practice reduces the raw data that needs to be 
 stored by an average of 40 times, and significantly increases the performance of scanning raw data. 
 This summarization process loses information about individual events, however. Druid segments also contain bitmap indexes for 
 fast filtering, which Kudu does not currently support. Druid's segment architecture is heavily geared towards fast aggregates and filters, and towards OLAP workflows. Appends are very 
 fast in Druid, whereas updates of older data are slower. This is by design, as the data Druid is good for is typically event data, which does not need to be updated too frequently. 
 Kudu supports arbitrary primary keys with uniqueness constraints, and efficient lookup by ranges of those keys. 
 Kudu chooses not to include the execution engine, but supports sufficient operations so as to allow node-local processing from the execution engines. 
 This means that Kudu can support multiple frameworks on the same data (e.g. MapReduce, Spark, and SQL).