rework compares again

fjy 2015-11-09 16:40:07 -08:00
parent 0b319093df
commit b99576d854
12 changed files with 69 additions and 234 deletions

@ -1,13 +0,0 @@
---
layout: doc_page
---
Druid vs. Cassandra
===================
We are not experts on Cassandra. If anything is incorrect about our portrayal, please let us know on the mailing list or via some other means, and we will fix this page.
Druid is highly optimized for scans and aggregations. It supports arbitrarily deep drill-downs into data sets without the need to pre-compute, and it can ingest event streams in real time and let users query events as they come in. Cassandra is a great key-value store, and it has some features that allow you to do more interesting things than you can with a pure key-value store. However, it is not built for the use cases that Druid handles, namely regularly scanning over billions of entries per query.
Furthermore, Druid is fully read-consistent. Druid breaks down a data set into immutable chunks known as segments. All replicas always present the exact same view of the data they hold, so we don't have to worry about data synchronization. The tradeoff is that Druid has limited semantics for write and update operations. Cassandra, similar to Amazon's Dynamo, has an eventually consistent data model. Writes are always supported, but updates to data may take some time before all replicas sync up (data reconciliation is done at read time). This model favors availability and scalability over consistency.

@ -1,62 +0,0 @@
---
layout: doc_page
---
Druid vs Proprietary Commercial Solutions (Vertica/Redshift)
============================================================
The proprietary database world has numerous solutions from numerous vendors. At a very high level, Druid is distinguished from most
of these solutions by the scale at which it can operate, the scale and performance it achieves when ingesting streaming data, the performance of ad-hoc exploratory queries,
its support for multi-tenancy (making Druid ideal for powering user-facing applications), and how cost-effective it is to run at scale. Druid is an open source project, and its development
is community led. The direction of the project is driven by the needs of the community. Below, we highlight comparisons between some specific
systems. We welcome contributions for additional comparisons.
## Druid vs Redshift
### How does Druid compare to Redshift?
Redshift started out as ParAccel (Actian), which Amazon licenses and has since heavily modified.
Aside from potential performance differences, there are some functional differences:
### Real-time data ingestion
Because Druid is optimized to provide insight against massive quantities of streaming data, it is able to load and aggregate data in real time.
Traditional data warehouses, including column stores, generally work only with batch ingestion and are not optimal for data that streams in regularly.
### Druid is a read oriented analytical data store
Druid's write semantics are not as fluid, and it does not support full joins (we support large table to small table joins). Redshift provides full SQL support including joins and insert/update statements.
### Data distribution model
Druid's data distribution is segment-based and leverages a highly available "deep" storage such as S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of historical nodes does not result in data loss because new historical nodes can always be brought up by reading data from "deep" storage.
To contrast, ParAccel's data distribution model is hash-based. Expanding the cluster requires re-hashing the data across the nodes, making it difficult to perform without taking downtime. Amazon's Redshift works around this issue with a multi-step process:
* set cluster into read-only mode
* copy data from cluster to new cluster that exists in parallel
* redirect traffic to new cluster
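To see why expanding a hash-distributed cluster is expensive, the following minimal sketch counts how many rows land on a different node when a cluster grows. It assumes naive modulo hashing over an MD5 digest, which is an illustration only and not ParAccel's or Redshift's actual placement scheme; `node_for` and `fraction_moved` are hypothetical helper names.

```python
import hashlib


def node_for(key: str, num_nodes: int) -> int:
    """Assign a row key to a node with naive modulo hashing (illustrative assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    """Return the share of keys whose node changes when the cluster is resized."""
    moved = sum(1 for k in keys if node_for(k, old_nodes) != node_for(k, new_nodes))
    return moved / len(keys)


keys = [f"row-{i}" for i in range(10_000)]
moved = fraction_moved(keys, old_nodes=4, new_nodes=5)
print(f"{moved:.0%} of rows change nodes when growing from 4 to 5 nodes")
```

Under this assumption, most rows must be copied to a different node on every resize. With segment-based distribution, a new node can instead load whole segments from deep storage without disturbing the segments already being served.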
### Replication strategy
Druid employs segment-level data distribution meaning that more nodes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying. Replication is done automatically and without any impact to performance.
ParAccel's hash-based distribution generally means that replication is conducted via hot spares. This puts a numerical limit on the number of nodes you can lose without losing data, and this replication strategy often does not allow the hot spare to help share query load.
### Indexing strategy
Along with column oriented structures, Druid uses indexing structures to speed up query execution when a filter is provided. Indexing structures do increase storage overhead (and make it more difficult to allow for mutation), but they also significantly speed up queries.
ParAccel does not appear to employ indexing strategies.
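As a rough illustration of how an index speeds up filtered queries, the sketch below builds a bitmap index per column value and answers an AND filter with a single bitwise operation. Plain Python integers stand in for the compressed bitmaps a real system would use, and `build_bitmap_index` and `rows_of` are hypothetical names, not Druid APIs.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an integer whose set bits mark matching row numbers."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return index


def rows_of(bitmap):
    """Expand a bitmap back into a sorted list of row numbers."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


country = ["US", "CA", "US", "DE", "CA", "US"]
device = ["ios", "ios", "web", "web", "ios", "ios"]

country_idx = build_bitmap_index(country)
device_idx = build_bitmap_index(device)

# country = 'US' AND device = 'ios' becomes one bitwise AND;
# only the surviving rows need to be read from the column data.
matches = rows_of(country_idx["US"] & device_idx["ios"])
print(matches)  # [0, 5]
```

Each additional filter narrows the bitmap before any column data is scanned, which is why adding filters reduces work. The cost is the extra storage for the index and the difficulty of keeping it correct if rows were mutated in place.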
## Druid vs Vertica
### How does Druid compare to Vertica?
Vertica is similar to ParAccel/Redshift described above in that it wasn't built for real-time streaming data ingestion and it supports full SQL.
The other big difference is that instead of employing indexing, Vertica tries to optimize processing by leveraging run-length encoding (RLE) and other compression techniques along with a "projection" system that creates materialized copies of the data in a different sort order (to maximize the effectiveness of RLE).
We are unclear about how Vertica handles data distribution and replication, so we cannot speak to if/how Druid is different.
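The link between sort order and run-length encoding mentioned above can be shown with a small sketch. This is a generic RLE over a Python list, written only to illustrate the idea; it is not Vertica's encoder, and `rle` is a hypothetical helper.

```python
from itertools import groupby


def rle(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


column = ["b", "a", "c", "a", "b", "a", "c", "b"]

unsorted_runs = rle(column)
sorted_runs = rle(sorted(column))

print(len(unsorted_runs))  # 8
print(sorted_runs)         # [('a', 3), ('b', 3), ('c', 2)]
```

The same eight values take eight runs in arrival order but only three once sorted, which is why keeping a copy of the data in a different sort order can make RLE far more effective for the columns that lead that order.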

@ -1,46 +0,0 @@
---
layout: doc_page
---
Druid vs General Computing Frameworks (Spark/Hadoop)
====================================================
Druid is much more complementary to general computing frameworks than it is competitive. General compute engines can
very flexibly process and transform data. Druid acts as a query accelerator to these systems by indexing raw data into a custom
column format to speed up OLAP queries.
## Druid vs Hadoop (HDFS/MapReduce)
Hadoop has shown the world that it's possible to house your data warehouse on commodity hardware for a fraction of the price
of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things.
1. They can now query all of their data in a fairly flexible manner and answer any question they have
2. The queries take a long time
The first is the joy that everyone feels the first time they get Hadoop running. The second is what they realize after they have used Hadoop interactively for a while, because Hadoop is optimized for throughput, not latency.
Druid is a complementary addition to Hadoop. Hadoop is great at storing and making accessible large amounts of individually low-value data.
Unfortunately, Hadoop is not great at providing query speed guarantees on top of that data, nor does it have very good
operational characteristics for a customer-facing production system. Druid, on the other hand, excels at taking high-value
summaries of the low-value data on Hadoop, making it available in a fast and always-on fashion, such that it could be exposed directly to a customer.
Druid also requires some infrastructure to exist for [deep storage](../dependencies/deep-storage.html).
HDFS is one of the implemented options for this [deep storage](../dependencies/deep-storage.html).
Please note we are only comparing Druid to base Hadoop here, but we welcome comparisons of Druid vs other systems or combinations
of systems in the Hadoop ecosystems.
## Druid vs Spark
Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and
can be viewed as a back-office analytics platform. RDDs enable data reuse by persisting intermediate results
in memory and enable Spark to provide fast computations for iterative algorithms.
This is especially beneficial for certain work flows such as machine
learning, where the same operation may be applied over and over
again until some result is converged upon. Spark provides analysts with
the ability to run queries and analyze large amounts of data with a
wide array of different algorithms.
Druid is designed to power analytic applications and focuses on the latencies to ingest data and serve queries
over that data. If you were to build an application where users could
arbitrarily explore data, the latencies seen with Spark will likely be too high for an interactive experience.

@ -5,8 +5,17 @@ layout: doc_page
Druid vs Elasticsearch Druid vs Elasticsearch
====================== ======================
We are not experts on Elasticsearch, if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means. We are not experts on search systems, if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.
Elasticsearch is a search server based on Apache Lucene. It provides full text search for schema-free documents and provides access to raw event level data. Elasticsearch also provides support for analytics and aggregations. Based on [user testimony](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ), the resource requirements for data ingestion and aggregation in Elasticsearch are higher than those of Druid. Elasticsearch is a search system based on Apache Lucene. It provides full text search for schema-free documents
and provides access to raw event level data. Elasticsearch is increasingly adding more support for analytics and aggregations.
[Some members of the community](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ) have pointed out
the resource requirements for data ingestion and aggregation in Elasticsearch are much higher than those of Druid.
Druid focuses on OLAP work flows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost, and supports a wide range of analytic operations. Druid has some basic search support for structured event data. Elasticsearch also does not support data summarization/roll-up at ingestion time, which can compact the data that needs to be
stored by up to 100x with real-world data sets. This leads to Elasticsearch having greater storage requirements.
Druid focuses on OLAP work flows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost,
and supports a wide range of analytic operations. Druid has some basic search support for structured event data, but does not support
full text search. Druid also does not support completely unstructured data. Measures must be defined in a Druid schema such that
summarization/roll-up can be done.
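To make the role of dimensions and measures in roll-up concrete, here is a minimal sketch of ingestion-time summarization. It assumes events are plain dictionaries with an integer `timestamp` in seconds; the field names and the `roll_up` helper are hypothetical and do not reflect Druid's ingestion API.

```python
from collections import defaultdict


def roll_up(events, granularity_s, dimensions, measures):
    """Collapse events sharing a time bucket and dimension values into one summed row."""
    rolled = defaultdict(lambda: defaultdict(int))
    for event in events:
        # Truncate the timestamp to the start of its time bucket.
        bucket = event["timestamp"] - event["timestamp"] % granularity_s
        key = (bucket,) + tuple(event[d] for d in dimensions)
        rolled[key]["count"] += 1
        for m in measures:
            rolled[key][m + "_sum"] += event[m]
    return {key: dict(row) for key, row in rolled.items()}


events = [
    {"timestamp": 0, "page": "home", "country": "US", "bytes": 100},
    {"timestamp": 12, "page": "home", "country": "US", "bytes": 250},
    {"timestamp": 47, "page": "home", "country": "US", "bytes": 50},
    {"timestamp": 59, "page": "docs", "country": "CA", "bytes": 300},
    {"timestamp": 61, "page": "home", "country": "US", "bytes": 75},
]

rows = roll_up(events, granularity_s=60, dimensions=["page", "country"], measures=["bytes"])
print(len(events), "events ->", len(rows), "stored rows")
```

Five events collapse into three stored rows here. The summed measures remain queryable, but the individual events can no longer be recovered, which is why measures have to be declared up front.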

@ -2,17 +2,24 @@
layout: doc_page layout: doc_page
--- ---
Druid vs Hadoop Druid vs Hadoop (HDFS/MapReduce)
=============== ================================
Hadoop has shown the world that it's possible to house your data warehouse on commodity hardware for a fraction of the price
Hadoop has shown the world that it's possible to house your data warehouse on commodity hardware for a fraction of the price of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things. of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things.
1. They can now query all of their data in a fairly flexible manner and answer any question they have 1. They can now query all of their data in a fairly flexible manner and answer any question they have
2. The queries take a long time 2. The queries take a long time
The first one is the joy that everyone feels the first time they get Hadoop running. The latter is what they realize after they have used Hadoop interactively for a while because Hadoop is optimized for throughput, not latency. The first one is the joy that everyone feels the first time they get Hadoop running. The latter is what they realize after they have used Hadoop interactively for a while because Hadoop is optimized for throughput, not latency.
Druid is a complementary addition to Hadoop. Hadoop is great at storing and making accessible large amounts of individually low-value data. Unfortunately, Hadoop is not great at providing query speed guarantees on top of that data, nor does it have very good operational characteristics for a customer-facing production system. Druid, on the other hand, excels at taking high-value summaries of the low-value data on Hadoop, making it available in a fast and always-on fashion, such that it could be exposed directly to a customer. Druid is a complementary addition to Hadoop. Hadoop is great at storing and making accessible large amounts of individually low-value data.
Unfortunately, Hadoop is not great at providing query speed guarantees on top of that data, nor does it have very good
operational characteristics for a customer-facing production system. Druid, on the other hand, excels at taking high-value
summaries of the low-value data on Hadoop, making it available in a fast and always-on fashion, such that it could be exposed directly to a customer.
Druid also requires some infrastructure to exist for [deep storage](../dependencies/deep-storage.html). HDFS is one of the implemented options for this [deep storage](../dependencies/deep-storage.html). Druid also requires some infrastructure to exist for [deep storage](../dependencies/deep-storage.html).
HDFS is one of the implemented options for this [deep storage](../dependencies/deep-storage.html).
Please note we are only comparing Druid to base Hadoop here, but we welcome comparisons of Druid vs other systems or combinations
of systems in the Hadoop ecosystems.

@ -1,49 +0,0 @@
---
layout: doc_page
---
Druid vs Impala or Shark
========================
The question of Druid versus Impala or Shark basically comes down to your product requirements and what the systems were designed to do.
Druid was designed to
1. be an always on service
1. ingest data in real-time
1. handle slice-n-dice style ad-hoc queries
Impala and Shark's primary design concerns (as far as I am aware) were to replace Hadoop MapReduce with another, faster query layer that is completely generic and plays well with the rest of the Hadoop ecosystem. I will caveat this discussion with the statement that I am not an expert on Impala or Shark, nor am I intimately familiar with their roadmaps. If anything is incorrect on this page, I'd be happy to change it; please send a note to the mailing list.
What does this mean? We can talk about it in terms of four general areas
1. Fault Tolerance
1. Query Speed
1. Data Ingestion
1. Query Flexibility
## Fault Tolerance
Druid pulls segments down from [Deep Storage](../dependencies/deep-storage.html) before serving queries on top of them. This means that for the data to exist in the Druid cluster, it must exist as a local copy on a historical node. If deep storage becomes unavailable for any reason, new segments will not be loaded into the system, but the cluster will continue to operate exactly as it was when the backing store disappeared.
Impala and Shark, on the other hand, pull their data in from HDFS (or some other Hadoop FileSystem) in response to a query. This has implications for the operation of queries if you need to take HDFS down for a bit (say a software upgrade). It's possible that data that has been cached in the nodes is still available when the backing file system goes down, but I'm not sure.
This is just one example, but Druid was built to continue operating in the face of failures of any one of its various pieces. The [Design](../design/design.html) describes these design decisions from the Druid side in more detail.
## Query Speed
Druid takes control of the data given to it, storing it in a column-oriented fashion, compressing it, and adding indexing structures, all of which add to the speed at which queries can be processed. The column orientation means that we only look at the data that a query asks for in order to compute the answer. Compression increases the data storage capacity of RAM and allows us to fit more data into quickly accessible RAM. Indexing structures mean that as you add boolean filters to your queries, we do less processing and you get your result faster, whereas a lot of processing engines do *more* processing when filters are added.
Impala/Shark can basically be thought of as daemon caching layers on top of HDFS. They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce), and they have facilities to cache data locally so that it can be accessed and updated more quickly. However, I do not believe they go beyond caching capabilities to actually speed up queries. So, at the end of the day, they do not move away from a brute-force, scan-everything query processing paradigm.
## Data Ingestion
Druid is built to allow for real-time ingestion of data. You can ingest data and query it immediately upon ingestion; the delay before an event is reflected in query results is dominated by how long it takes to deliver the event to Druid.
Impala/Shark, being based on data in HDFS or some other backing store, are limited in their data ingestion rates by the rate at which that backing store can make data available. Generally, the backing store is the biggest bottleneck for how quickly data can become available.
## Query Flexibility
Druid supports timeseries and groupBy style queries. It doesn't have support for joins, which makes it a lot less flexible for generic processing.
Impala/Shark support SQL style queries with full joins.

@ -0,0 +1,19 @@
---
layout: doc_page
---
Druid vs Kudu
=============
Kudu's storage format enables single-row updates, whereas updating an existing Druid segment requires recreating the segment; hence,
the process for updating old values is slower in Druid. Kudu's requirements for maintaining extra head space to store
updates and for organizing data by id instead of time have the potential to introduce some extra latency and to cause
data that is not needed to answer a query to be accessed at query time. Druid summarizes (rolls up) data at ingestion time, which in practice reduces the raw data that needs to be
stored by an average of 40 times and significantly increases the performance of scanning raw data.
This summarization process loses information about individual events, however. Druid segments also contain bitmap indexes for
fast filtering, which Kudu does not currently support. Druid's segment architecture is heavily geared towards fast aggregates and filters, and towards OLAP workflows. Appends are very
fast in Druid, whereas updates of older data are slower. This is by design, as the data Druid is good for is typically event data, which does not need to be updated too frequently.
Kudu supports arbitrary primary keys with uniqueness constraints, and efficient lookup by ranges of those keys.
Kudu chooses not to include the execution engine, but supports sufficient operations so as to allow node-local processing from the execution engines.
This means that Kudu can support multiple frameworks on the same data (e.g., MR, Spark, and SQL).

@ -7,7 +7,7 @@ Druid vs Redshift
### How does Druid compare to Redshift? ### How does Druid compare to Redshift?
In terms of drawing a differentiation, Redshift is essentially ParAccel (Actian) which Amazon is licensing. In terms of drawing a differentiation, Redshift started out as ParAccel (Actian), which Amazon is licensing and has since heavily modified.
Aside from potential performance differences, there are some functional differences: Aside from potential performance differences, there are some functional differences:
@ -19,11 +19,11 @@ Generally traditional data warehouses including column stores work only with bat
### Druid is a read oriented analytical data store ### Druid is a read oriented analytical data store
Its write semantics aren't as fluid, and it does not support joins. ParAccel is a full database with SQL support including joins and insert/update statements. Druid's write semantics are not as fluid, and it does not support full joins (we support large table to small table joins). Redshift provides full SQL support including joins and insert/update statements.
### Data distribution model ### Data distribution model
Druid's data distribution, is segment based which exists on highly available "deep" storage, like S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of historical nodes does not result in data loss because new historical nodes can always be brought up by reading data from "deep" storage. Druid's data distribution is segment-based and leverages a highly available "deep" storage such as S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of historical nodes does not result in data loss because new historical nodes can always be brought up by reading data from "deep" storage.
To contrast, ParAccel's data distribution model is hash-based. Expanding the cluster requires re-hashing the data across the nodes, making it difficult to perform without taking downtime. Amazon's Redshift works around this issue with a multi-step process: To contrast, ParAccel's data distribution model is hash-based. Expanding the cluster requires re-hashing the data across the nodes, making it difficult to perform without taking downtime. Amazon's Redshift works around this issue with a multi-step process:
@ -33,12 +33,12 @@ To contrast, ParAccels data distribution model is hash-based. Expanding the c
### Replication strategy ### Replication strategy
Druid employs segment-level data distribution meaning that more nodes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying. Druid employs segment-level data distribution meaning that more nodes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying. Replication is done automatically and without any impact to performance.
ParAccel's hash-based distribution generally means that replication is conducted via hot spares. This puts a numerical limit on the number of nodes you can lose without losing data, and this replication strategy often does not allow the hot spare to help share query load. ParAccel's hash-based distribution generally means that replication is conducted via hot spares. This puts a numerical limit on the number of nodes you can lose without losing data, and this replication strategy often does not allow the hot spare to help share query load.
### Indexing strategy ### Indexing strategy
Along with column oriented structures, Druid uses indexing structures to speed up query execution when a filter is provided. Indexing structures do increase storage overhead (and make it more difficult to allow for mutation), but they can also significantly speed up queries. Along with column oriented structures, Druid uses indexing structures to speed up query execution when a filter is provided. Indexing structures do increase storage overhead (and make it more difficult to allow for mutation), but they also significantly speed up queries.
ParAccel does not appear to employ indexing strategies. ParAccel does not appear to employ indexing strategies.

@ -1,21 +0,0 @@
---
layout: doc_page
---
Druid vs Search Systems (Elasticsearch/Solr)
============================================
We are not experts on search systems. If anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.
Elasticsearch and Solr are search systems based on Apache Lucene. They provide full text search for schema-free documents
and provide access to raw event level data. Search systems are increasingly adding more support for analytics and aggregations.
[Some members of the community](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ) have pointed out
that the resource requirements for data ingestion and aggregation in search systems are much higher than those of Druid.
Search systems also do not support data summarization/roll-up at ingestion time, which can compact the data that needs to be
stored by up to 100x with real-world data sets. This leads to search systems having greater storage requirements.
Druid focuses on OLAP work flows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost,
and supports a wide range of analytic operations. Druid has some basic search support for structured event data, but does not support
full text search. Druid also does not support completely unstructured data. Measures must be defined in a Druid schema such that
summarization/roll-up can be done.

@ -5,8 +5,6 @@ layout: doc_page
Druid vs Spark Druid vs Spark
============== ==============
We are not experts on Spark, if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.
Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and
can be viewed as a back-office analytics platform. RDDs enable data reuse by persisting intermediate results can be viewed as a back-office analytics platform. RDDs enable data reuse by persisting intermediate results
in memory and enable Spark to provide fast computations for iterative algorithms. in memory and enable Spark to provide fast computations for iterative algorithms.
@ -17,5 +15,5 @@ the ability to run queries and analyze large amounts of data with a
wide array of different algorithms. wide array of different algorithms.
Druid is designed to power analytic applications and focuses on the latencies to ingest data and serve queries Druid is designed to power analytic applications and focuses on the latencies to ingest data and serve queries
over that data. If you were to build a web UI where users could over that data. If you were to build an application where users could
arbitrarily explore data, the latencies seen by using Spark may be too slow for interactive use cases. arbitrarily explore data, the latencies seen by using Spark will likely be too slow for an interactive experience.

@ -2,7 +2,7 @@
layout: doc_page layout: doc_page
--- ---
Druid vs SQL-on-Hadoop (Hive/Impala/Drill/Spark SQL/Presto) Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
=========================================================== ===========================================================
Druid is much more complementary to SQL-on-Hadoop engines than it is competitive. SQL-on-Hadoop engines provide an Druid is much more complementary to SQL-on-Hadoop engines than it is competitive. SQL-on-Hadoop engines provide an
@ -27,7 +27,7 @@ What does this mean? We can talk about it in terms of three general areas
1. Data Ingestion 1. Data Ingestion
1. Query Flexibility 1. Query Flexibility
## Queries ### Queries
Druid segments stores data in a custom column format. Segments are scanned directly as part of queries and each Druid server Druid segments stores data in a custom column format. Segments are scanned directly as part of queries and each Druid server
calculates a set of results that are eventually merged at the Broker level. This means the data that is transferred between servers calculates a set of results that are eventually merged at the Broker level. This means the data that is transferred between servers
@ -40,7 +40,7 @@ Many SQL-on-Hadoop engines have daemon processes that can be run where the data
some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
how much of a performance impact this makes. how much of a performance impact this makes.
## Data Ingestion ### Data Ingestion
Druid is built to allow for real-time ingestion of data. You can ingest data and query it immediately upon ingestion, Druid is built to allow for real-time ingestion of data. You can ingest data and query it immediately upon ingestion,
the latency between how quickly the event is reflected in the data is dominated by how long it takes to deliver the event to Druid. the latency between how quickly the event is reflected in the data is dominated by how long it takes to deliver the event to Druid.
@ -49,10 +49,18 @@ SQL-on-Hadoop, being based on data in HDFS or some other backing store, are limi
rate at which that backing store can make data available. Generally, the backing store is the biggest bottleneck for rate at which that backing store can make data available. Generally, the backing store is the biggest bottleneck for
how quickly data can become available. how quickly data can become available.
## Query Flexibility ### Query Flexibility
Druid's query language is fairly low level and maps to how Druid operates internally. Although Druid can be combined with a high level query Druid's query language is fairly low level and maps to how Druid operates internally. Although Druid can be combined with a high level query
planner such as [Plywood](https://github.com/implydata/plywood) to support most SQL queries and analytic SQL queries (minus joins among large tables), planner such as [Plywood](https://github.com/implydata/plywood) to support most SQL queries and analytic SQL queries (minus joins among large tables),
base Druid is less flexible than SQL-on-Hadoop solutions for generic processing. base Druid is less flexible than SQL-on-Hadoop solutions for generic processing.
SQL-on-Hadoop support SQL style queries with full joins. SQL-on-Hadoop support SQL style queries with full joins.
## Druid vs Parquet
Parquet is a column storage format that is designed to work with SQL-on-Hadoop engines. Parquet doesn't have a query execution engine, and instead
relies on external sources to pull data out of it.
Druid's storage format is highly optimized for linear scans. Although Druid has support for nested data, Parquet's storage format is much
more hierarchical and is designed more for binary chunking. In theory, this should lead to faster scans in Druid.

@ -1,15 +0,0 @@
---
layout: doc_page
---
Druid vs Vertica
================
How does Druid compare to Vertica?
Vertica is similar to ParAccel/Redshift ([Druid-vs-Redshift](../comparisons/druid-vs-redshift.html)) described above in that it wasn't built for real-time streaming data ingestion and it supports full SQL.
The other big difference is that instead of employing indexing, Vertica tries to optimize processing by leveraging run-length encoding (RLE) and other compression techniques along with a "projection" system that creates materialized copies of the data in a different sort order (to maximize the effectiveness of RLE).
We are unclear about how Vertica handles data distribution and replication, so we cannot speak to if/how Druid is different.