Merge pull request #1928 from druid-io/new-compares

New comparisons for Druid
2015-11-20 16:40:59 -08:00 · 2015-11-20 16:40:59 -08:00 · c0580bf063
parent bec7dacd86 46bf1ba5ef
commit c0580bf063
10 changed files with 137 additions and 112 deletions
--- a/docs/content/comparisons/druid-vs-cassandra.md
+++ b/docs/content/comparisons/druid-vs-cassandra.md
@@ -1,13 +0,0 @@
----
-layout: doc_page
----
-
-Druid vs. Cassandra
-===================
-
-
-We are not experts on Cassandra, if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.  We will fix this page.
-
-Druid is highly optimized for scans and aggregations, it supports arbitrarily deep drill downs into data sets without the need to pre-compute, and it can ingest event streams in real-time and allow users to query events as they come in. Cassandra is a great key-value store and it has some features that allow you to use it to do more interesting things than what you can do with a pure key-value store. But, it is not built for the same use cases that Druid handles, namely regularly scanning over billions of entries per query.
-
-Furthermore, Druid is fully read-consistent. Druid breaks down a data set into immutable chunks known as segments. All replicants always present the exact same view for the piece of data they are holding and we don’t have to worry about data synchronization. The tradeoff is that Druid has limited semantics for write and update operations. Cassandra, similar to Amazon’s Dynamo, has an eventually consistent data model. Writes are always supported but updates to data may take some time before all replicas sync up (data reconciliation is done at read time). This model favors availability and scalability over consistency.
--- a/docs/content/comparisons/druid-vs-elasticsearch.md
+++ b/docs/content/comparisons/druid-vs-elasticsearch.md
@@ -5,8 +5,17 @@ layout: doc_page
 Druid vs Elasticsearch
 ======================

-We are not experts on Elasticsearch, if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.
+We are not experts on search systems; if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.

-Elasticsearch is a search server based on Apache Lucene. It provides full text search for schema-free documents and provides access to raw event level data. Elasticsearch also provides support for analytics and aggregations. Based on [user testimony](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ), the resource requirements for data ingestion and aggregation in Elasticsearch are higher than those of Druid.
+Elasticsearch is a search system based on Apache Lucene. It provides full-text search for schema-free documents
+and provides access to raw event-level data. Elasticsearch is steadily adding support for analytics and aggregations.
+[Some members of the community](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ) have pointed out that
+the resource requirements for data ingestion and aggregation in Elasticsearch are much higher than those of Druid.

-Druid focuses on OLAP work flows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost, and supports a wide range of analytic operations. Druid has some basic search support for structured event data.
+Elasticsearch also does not support data summarization/roll-up at ingestion time, which can compact the data that needs to be
+stored by up to 100x with real-world data sets. As a result, Elasticsearch has greater storage requirements.
+
+Druid focuses on OLAP workflows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost,
+and supports a wide range of analytic operations. Druid has some basic search support for structured event data, but does not support
+full-text search. Druid also does not support completely unstructured data. Measures must be defined in a Druid schema so that
+summarization/roll-up can be done.
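
The roll-up described in druid-vs-elasticsearch.md can be illustrated with a toy sketch. This is not Druid's ingestion code; the function name `roll_up`, the hourly bucketing, and the sample field names (`page`, `country`, `bytes`) are all assumptions made for illustration. It groups raw events by truncated timestamp plus dimension values and keeps only a row count and summed measures, which is why the measures have to be declared up front.

```python
from collections import defaultdict
from datetime import datetime


def roll_up(events, dimensions, measures):
    """Aggregate raw events by (hour bucket, dimension values).

    Toy illustration only; hypothetical helper, not Druid's implementation.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Truncate the timestamp to the hour so events in the same hour collapse.
        bucket = ts.replace(minute=0, second=0, microsecond=0).isoformat()
        key = (bucket,) + tuple(event[d] for d in dimensions)
        totals[key]["count"] += 1
        for m in measures:
            totals[key][m] += event[m]
    return {key: dict(value) for key, value in totals.items()}


# Hypothetical sample events.
events = [
    {"timestamp": "2015-11-20T16:01:10", "page": "home", "country": "US", "bytes": 120},
    {"timestamp": "2015-11-20T16:15:42", "page": "home", "country": "US", "bytes": 80},
    {"timestamp": "2015-11-20T16:59:03", "page": "docs", "country": "DE", "bytes": 300},
    {"timestamp": "2015-11-20T17:02:00", "page": "home", "country": "US", "bytes": 50},
]

rolled = roll_up(events, dimensions=["page", "country"], measures=["bytes"])
```

Here four raw events collapse into three stored rows. The individual events can no longer be recovered from the result, which is the trade-off roll-up makes.
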
--- a/docs/content/comparisons/druid-vs-hadoop.md
+++ b/docs/content/comparisons/druid-vs-hadoop.md
@@ -1,18 +0,0 @@
----
-layout: doc_page
----
-
-Druid vs Hadoop
-===============
-
-
-Hadoop has shown the world that it’s possible to house your data warehouse on commodity hardware for a fraction of the price of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things.
-
-1.  They can now query all of their data in a fairly flexible manner and answer any question they have
-2.  The queries take a long time
-
-The first one is the joy that everyone feels the first time they get Hadoop running. The latter is what they realize after they have used Hadoop interactively for a while because Hadoop is optimized for throughput, not latency. 
-
-Druid is a complementary addition to Hadoop. Hadoop is great at storing and making accessible large amounts of individually low-value data. Unfortunately, Hadoop is not great at providing query speed guarantees on top of that data, nor does it have very good operational characteristics for a customer-facing production system. Druid, on the other hand, excels at taking high-value summaries of the low-value data on Hadoop, making it available in a fast and always-on fashion, such that it could be exposed directly to a customer.
-
-Druid also requires some infrastructure to exist for [deep storage](../dependencies/deep-storage.html). HDFS is one of the implemented options for this [deep storage](../dependencies/deep-storage.html).
--- a/docs/content/comparisons/druid-vs-impala-or-shark.md
+++ b/docs/content/comparisons/druid-vs-impala-or-shark.md
@@ -1,49 +0,0 @@
----
-layout: doc_page
----
-
-Druid vs Impala or Shark
-========================
-
-The question of Druid versus Impala or Shark basically comes down to your product requirements and what the systems were designed to do.  
-
-Druid was designed to
-
-1. be an always on service
-1. ingest data in real-time
-1. handle slice-n-dice style ad-hoc queries
-
-Impala and Shark's primary design concerns (as far as I am aware) were to replace Hadoop MapReduce with another, faster, query layer that is completely generic and plays well with the other ecosystem of Hadoop technologies.  I will caveat this discussion with the statement that I am not an expert on Impala or Shark, nor am I intimately familiar with their roadmaps.  If anything is incorrect on this page, I'd be happy to change it, please send a note to the mailing list.
-
-What does this mean?  We can talk about it in terms of four general areas
-
-1. Fault Tolerance
-1. Query Speed
-1. Data Ingestion
-1. Query Flexibility
-
-## Fault Tolerance
-
-Druid pulls segments down from [Deep Storage](../dependencies/deep-storage.html) before serving queries on top of it.  This means that for the data to exist in the Druid cluster, it must exist as a local copy on a historical node.  If deep storage becomes unavailable for any reason, new segments will not be loaded into the system, but the cluster will continue to operate exactly as it was when the backing store disappeared. 
-
-Impala and Shark, on the other hand, pull their data in from HDFS (or some other Hadoop FileSystem) in response to a query.  This has implications for the operation of queries if you need to take HDFS down for a bit (say a software upgrade).  It's possible that data that has been cached in the nodes is still available when the backing file system goes down, but I'm not sure.
-
-This is just one example, but Druid was built to continue operating in the face of failures of any one of its various pieces.  The [Design](../design/design.html) describes these design decisions from the Druid side in more detail.
-
-## Query Speed
-
-Druid takes control of the data given to it, storing it in a column-oriented fashion, compressing it and adding indexing structures.  All of which add to the speed at which queries can be processed.  The column orientation means that we only look at the data that a query asks for in order to compute the answer.  Compression increases the data storage capacity of RAM and allows us to fit more data into quickly accessible RAM.  Indexing structures mean that as you add boolean filters to your queries, we do less processing and you get your result faster, whereas a lot of processing engines do *more* processing when filters are added.
-
-Impala/Shark can basically be thought of as daemon caching layers on top of HDFS.  They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce) and they have facilities to cache data locally so that it can be accessed and updated quicker.  But, I do not believe they go beyond caching capabilities to actually speed up queries.  So, at the end of the day, they don't change the paradigm from a brute-force, scan-everything query processing paradigm.
-
-## Data Ingestion
-
-Druid is built to allow for real-time ingestion of data.  You can ingest data and query it immediately upon ingestion, the latency between how quickly the event is reflected in the data is dominated by how long it takes to deliver the event to Druid.
-
-Impala/Shark, being based on data in HDFS or some other backing store, are limited in their data ingestion rates by the rate at which that backing store can make data available.  Generally, the backing store is the biggest bottleneck for how quickly data can become available.
-
-## Query Flexibility
-
-Druid supports timeseries and groupBy style queries.  It doesn't have support for joins, which makes it a lot less flexible for generic processing.
-
-Impala/Shark support SQL style queries with full joins.
--- a/docs/content/comparisons/druid-vs-key-value.md
+++ b/docs/content/comparisons/druid-vs-key-value.md
@@ -0,0 +1,28 @@
+---
+layout: doc_page
+---
+
+Druid vs. Key/Value Stores (HBase/Cassandra/OpenTSDB)
+====================================================
+
+Druid is highly optimized for scans and aggregations, and it supports arbitrarily deep drill-downs into data sets. Key/value stores
+can support the same functionality in two ways:
+
+1. Pre-compute all permutations of possible user queries
+2. Range scans on event data
+
+When pre-computing results, the key is the exact parameters of the query, and the value is the result of the query.
+The queries return extremely quickly, but at the cost of flexibility, as ad-hoc exploratory queries are not possible without
+pre-computing every possible query permutation. Pre-computing all permutations of all ad-hoc queries leads to result sets
+that grow exponentially with the number of columns of a data set, and pre-computing queries for complex real-world data sets
+can require hours of pre-processing time.
+
+The other approach to using key/value stores for aggregations is to use the dimensions of an event as the key and the event measures as the value.
+Aggregations are done by issuing range scans on this data. Timeseries-specific databases such as OpenTSDB use this approach.
+One of the limitations here is that the key/value storage model does not have indexes for any kind of filtering other than prefix ranges,
+which can be used to filter a query down to a metric and time range, but cannot resolve complex predicates to narrow the exact data to scan.
+When the number of rows to scan gets large, this limitation can greatly reduce performance. It is also harder to achieve good
+locality with key/value stores because most don’t support pushing down aggregates to the storage layer.
+
+For arbitrary exploration of data (flexible data filtering), Druid's custom column format enables ad-hoc queries without pre-computation. The format 
+also enables fast scans on columns, which is important for good aggregation performance.
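
The exponential growth claimed for the pre-compute approach in druid-vs-key-value.md can be made concrete with a small counting sketch. The counting model is a simplifying assumption, not something the page specifies: each dimension is either grouped on or ignored, and each filter either pins a dimension to one value or leaves it open. The function names are hypothetical.

```python
from itertools import combinations


def count_precomputed_groupings(dimensions):
    """Count the distinct dimension subsets a pre-compute job would have to cover."""
    return sum(
        1
        for size in range(len(dimensions) + 1)
        for _ in combinations(dimensions, size)
    )


def count_precomputed_keys(cardinalities):
    """Count keys when every dimension is either pinned to one value or left open.

    Assumed model for illustration: (cardinality + 1) choices per dimension.
    """
    total = 1
    for cardinality in cardinalities:
        total *= cardinality + 1
    return total


groupings_3 = count_precomputed_groupings(["page", "country", "device"])
groupings_10 = count_precomputed_groupings([f"dim{i}" for i in range(10)])
keys_5_dims = count_precomputed_keys([10] * 5)
```

Under this model the number of groupings doubles with every added column (2^n), and the number of stored keys multiplies by each column's cardinality plus one.
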
--- a/docs/content/comparisons/druid-vs-kudu.md
+++ b/docs/content/comparisons/druid-vs-kudu.md
@@ -0,0 +1,19 @@
+---
+layout: doc_page
+---
+
+Druid vs Kudu
+=============
+
+Kudu's storage format enables single-row updates, whereas updating an existing Druid segment requires recreating the segment; hence,
+the process for updating old values is slower in Druid. Kudu's requirements to maintain extra head space for storing
+updates and to organize data by id instead of time have the potential to introduce some extra latency and to cause access
+to data that is not needed to answer a query at query time. Druid summarizes (rolls up) data at ingestion time, which in practice reduces the raw data that needs to be
+stored by an average of 40 times, and significantly increases the performance of scanning raw data.
+This summarization process loses information about individual events, however. Druid segments also contain bitmap indexes for
+fast filtering, which Kudu does not currently support. Druid's segment architecture is heavily geared towards fast aggregates and filters, and towards OLAP workflows. Appends are very
+fast in Druid, whereas updates of older data are slower. This is by design, as the data Druid is good for is typically event data, which does not need to be updated too frequently.
+Kudu supports arbitrary primary keys with uniqueness constraints, and efficient lookup by ranges of those keys.
+Kudu chooses not to include an execution engine, but supports sufficient operations to allow node-local processing by external execution engines.
+This means that Kudu can support multiple frameworks on the same data (e.g., MR, Spark, and SQL).
+
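The bitmap indexes mentioned in druid-vs-kudu.md can be sketched in a few lines. This is a conceptual illustration only: it uses plain Python integers as uncompressed bitmaps, and the helper names and sample rows are hypothetical rather than taken from Druid. Each distinct column value maps to a bitmap whose set bits mark the rows containing that value, so boolean filters reduce to bitwise operations.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value in `column` to an int whose set bits mark matching rows."""
    index = {}
    for row_number, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << row_number)
    return index


def matching_rows(bitmap):
    """Expand a bitmap back into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical sample rows.
rows = [
    {"country": "US", "device": "mobile"},
    {"country": "DE", "device": "desktop"},
    {"country": "US", "device": "desktop"},
    {"country": "US", "device": "mobile"},
]
by_country = build_bitmap_index(rows, "country")
by_device = build_bitmap_index(rows, "device")

# country = 'US' AND device = 'mobile' is one bitwise AND.
us_mobile = matching_rows(by_country["US"] & by_device["mobile"])
# country = 'DE' OR device = 'mobile' is one bitwise OR.
de_or_mobile = matching_rows(by_country["DE"] | by_device["mobile"])
```

Only the rows selected by the combined bitmap need to be read afterwards, so adding filters narrows the scan.
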
--- a/docs/content/comparisons/druid-vs-redshift.md
+++ b/docs/content/comparisons/druid-vs-redshift.md
@@ -7,7 +7,7 @@ Druid vs Redshift

 ### How does Druid compare to Redshift?

-In terms of drawing a differentiation, Redshift is essentially ParAccel (Actian) which Amazon is licensing.
+In terms of drawing a differentiation, Redshift started out as ParAccel (Actian), which Amazon is licensing and has since heavily modified.

 Aside from potential performance differences, there are some functional differences:

@@ -19,11 +19,11 @@ Generally traditional data warehouses including column stores work only with bat

 ### Druid is a read oriented analytical data store

-It’s write semantics aren’t as fluid and does not support joins. ParAccel is a full database with SQL support including joins and insert/update statements.
+Druid’s write semantics are not as fluid, and Druid does not support full joins (it does support large-table-to-small-table joins). Redshift provides full SQL support including joins and insert/update statements.

 ### Data distribution model

-Druid’s data distribution, is segment based which exists on highly available "deep" storage, like S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of historical nodes does not result in data loss because new historical nodes can always be brought up by reading data from "deep" storage.
+Druid’s data distribution is segment-based and leverages a highly available "deep" storage such as S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of historical nodes does not result in data loss because new historical nodes can always be brought up by reading data from "deep" storage.

 To contrast, ParAccel’s data distribution model is hash-based. Expanding the cluster requires re-hashing the data across the nodes, making it difficult to perform without taking downtime. Amazon’s Redshift works around this issue with a multi-step process:

@@ -33,12 +33,12 @@ To contrast, ParAccel’s data distribution model is hash-based. Expanding the c

 ### Replication strategy

-Druid employs segment-level data distribution meaning that more nodes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying.
+Druid employs segment-level data distribution, meaning that more nodes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying. Replication is done automatically and without any impact on performance.

 ParAccel’s hash-based distribution generally means that replication is conducted via hot spares. This puts a numerical limit on the number of nodes you can lose without losing data, and this replication strategy often does not allow the hot spare to help share query load.

 ### Indexing strategy

-Along with column oriented structures, Druid uses indexing structures to speed up query execution when a filter is provided. Indexing structures do increase storage overhead (and make it more difficult to allow for mutation), but they can also significantly speed up queries.
+Along with column oriented structures, Druid uses indexing structures to speed up query execution when a filter is provided. Indexing structures do increase storage overhead (and make it more difficult to allow for mutation), but they also significantly speed up queries.

 ParAccel does not appear to employ indexing strategies.
--- a/docs/content/comparisons/druid-vs-spark.md
+++ b/docs/content/comparisons/druid-vs-spark.md
@@ -5,7 +5,7 @@ layout: doc_page
 Druid vs Spark
 ==============

-We are not experts on Spark, if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.
+Druid and Spark are complementary solutions as Druid can be used to accelerate OLAP queries in Spark.

 Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and
 can be viewed as a back-office analytics platform.  RDDs enable data reuse by persisting intermediate results
@@ -17,5 +17,5 @@ the ability to run queries and analyze large amounts of data with a
 wide array of different algorithms.

 Druid is designed to power analytic applications and focuses on the latencies to ingest data and serve queries
-over that data. If you were to build a web UI where users could
-arbitrarily explore data, the latencies seen by using Spark may be too slow for interactive use cases.
+over that data. If you were to build an application where users could
+arbitrarily explore data, the latencies seen by using Spark will likely be too high for an interactive experience.
--- a/docs/content/comparisons/druid-vs-sql-on-hadoop.md
+++ b/docs/content/comparisons/druid-vs-sql-on-hadoop.md
@@ -0,0 +1,64 @@
+---
+layout: doc_page
+---
+
+Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
+===========================================================
+
+SQL-on-Hadoop engines provide an
+execution engine for various data formats and data stores, and
+many can be made to push computations down to Druid, while providing a SQL interface to Druid.
+
+For a direct comparison between the technologies, and for deciding when to use only one or the other, things basically come down to your
+product requirements and what the systems were designed to do.
+
+Druid was designed to
+
+1. be an always on service
+1. ingest data in real-time
+1. handle slice-n-dice style ad-hoc queries
+
+SQL-on-Hadoop engines generally sidestep MapReduce, instead querying data directly from HDFS or, in some cases, other storage systems.
+Some of these engines (including Impala and Presto) can be colocated with HDFS data nodes and coordinate with them to achieve data locality for queries.
+What does this mean?  We can talk about it in terms of three general areas:
+
+1. Queries
+1. Data Ingestion
+1. Query Flexibility
+
+### Queries
+
+Druid segments store data in a custom column format. Segments are scanned directly as part of queries and each Druid server
+calculates a set of results that are eventually merged at the Broker level. This means that the only data transferred between servers
+is queries and results, and all computation is done internally within the Druid servers.
+
+Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats.
+They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce).
+Some SQL-on-Hadoop engines (Impala/Presto) have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still
+some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
+how large a performance impact this has.
+
+### Data Ingestion
+
+Druid is built to allow for real-time ingestion of data.  You can ingest data and query it immediately upon ingestion;
+the latency before an event is reflected in query results is dominated by how long it takes to deliver the event to Druid.
+
+SQL-on-Hadoop engines, being based on data in HDFS or some other backing store, are limited in their data ingestion rates by the
+rate at which that backing store can make data available.  Generally, the backing store is the biggest bottleneck for
+how quickly data can become available.
+
+### Query Flexibility
+
+Druid's query language is fairly low-level and maps to how Druid operates internally. Although Druid can be combined with a high-level query
+planner such as [Plywood](https://github.com/implydata/plywood) to support most SQL queries and analytic SQL queries (minus joins among large tables),
+base Druid is less flexible than SQL-on-Hadoop solutions for generic processing.
+
+SQL-on-Hadoop engines support SQL-style queries with full joins.
+
+## Druid vs Parquet
+
+Parquet is a column storage format that is designed to work with SQL-on-Hadoop engines. Parquet doesn't have a query execution engine, and instead 
+relies on external sources to pull data out of it.
+
+Druid's storage format is highly optimized for linear scans. Although Druid has support for nested data, Parquet's storage format is much
+more hierarchical, and is designed more for binary chunking. In theory, this should lead to faster scans in Druid.
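
The query flow described under "Queries" in druid-vs-sql-on-hadoop.md (each server computes a partial result over its own segments, and the Broker merges them) can be sketched as a simple scatter/gather. This is a minimal sketch under stated assumptions: the functions `scan_segment` and `broker_merge`, the in-memory "servers", and the sample rows are hypothetical and do not reflect Druid's actual classes or wire protocol.

```python
from collections import Counter


def scan_segment(segment_rows, dimension, measure):
    """Compute a partial group-by sum over the rows held by one server."""
    partial = Counter()
    for row in segment_rows:
        partial[row[dimension]] += row[measure]
    return partial


def broker_merge(partials):
    """Merge per-server partial results into the final answer."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values for matching keys
    return dict(merged)


# Hypothetical data held by two servers.
server_a = [{"page": "home", "views": 3}, {"page": "docs", "views": 1}]
server_b = [{"page": "home", "views": 2}]

result = broker_merge(
    scan_segment(rows, "page", "views") for rows in (server_a, server_b)
)
```

Only the small partial results cross the network in this sketch; the raw rows stay on the server that holds them.
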
--- a/docs/content/comparisons/druid-vs-vertica.md
+++ b/docs/content/comparisons/druid-vs-vertica.md
@@ -1,15 +0,0 @@
----
-layout: doc_page
----
-
-Druid vs Vertica
-================
-
-
-How does Druid compare to Vertica?
-
-Vertica is similar to ParAccel/Redshift ([Druid-vs-Redshift](../comparisons/druid-vs-redshift.html)) described above in that it wasn’t built for real-time streaming data ingestion and it supports full SQL.
-
-The other big difference is that instead of employing indexing, Vertica tries to optimize processing by leveraging run-length encoding (RLE) and other compression techniques along with a "projection" system that creates materialized copies of the data in a different sort order (to maximize the effectiveness of RLE).
-
-We are unclear about how Vertica handles data distribution and replication, so we cannot speak to if/how Druid is different.