From 0b319093df7b7f6433c56b324ffab46f9286a37a Mon Sep 17 00:00:00 2001
From: fjy
Date: Fri, 6 Nov 2015 12:21:51 -0800
Subject: [PATCH 1/4] New comparisons for Druid

---
 .../druid-vs-commericial-solutions.md | 62 +++++++++++++++++++
 .../druid-vs-computing-frameworks.md | 46 ++++++++++++++
 .../content/comparisons/druid-vs-key-value.md | 28 +++++++++
 .../comparisons/druid-vs-search-systems.md | 21 +++++++
 .../comparisons/druid-vs-sql-on-hadoop.md | 58 +++++++++++++++++
 .../comparisons/druid-vs-storage-formats.md | 30 +++++++++
 6 files changed, 245 insertions(+)
 create mode 100644 docs/content/comparisons/druid-vs-commericial-solutions.md
 create mode 100644 docs/content/comparisons/druid-vs-computing-frameworks.md
 create mode 100644 docs/content/comparisons/druid-vs-key-value.md
 create mode 100644 docs/content/comparisons/druid-vs-search-systems.md
 create mode 100644 docs/content/comparisons/druid-vs-sql-on-hadoop.md
 create mode 100644 docs/content/comparisons/druid-vs-storage-formats.md

diff --git a/docs/content/comparisons/druid-vs-commericial-solutions.md b/docs/content/comparisons/druid-vs-commericial-solutions.md
new file mode 100644
index 00000000000..6f1e0ba62e3
--- /dev/null
+++ b/docs/content/comparisons/druid-vs-commericial-solutions.md
@@ -0,0 +1,62 @@
+---
+layout: doc_page
+---
+
+Druid vs Proprietary Commercial Solutions (Vertica/Redshift)
+============================================================
+
+The proprietary database world has numerous solutions from numerous vendors. At a very high level, Druid is distinguished from most
+of these solutions by the scale at which it can operate, the scale and performance of its streaming data ingestion, the performance of ad-hoc exploratory queries,
+its support for multi-tenancy (making Druid ideal for powering user-facing applications), and how cost effective it is to run at scale. Druid is an open source project whose development
+is community led. The direction of the project is driven by the needs of the community. Below, we highlight comparisons between some specific
+systems. We welcome contributions for additional comparisons.
+
+## Druid vs Redshift
+
+### How does Druid compare to Redshift?
+
+In terms of drawing a differentiation, Redshift started out as ParAccel (Actian), which Amazon licensed and has since heavily modified.
+
+Aside from potential performance differences, there are some functional differences:
+
+### Real-time data ingestion
+
+Because Druid is optimized to provide insight against massive quantities of streaming data, it is able to load and aggregate data in real-time.
+
+Generally, traditional data warehouses, including column stores, work only with batch ingestion and are not optimal for regularly streaming data in.
+
+### Druid is a read oriented analytical data store
+
+Druid’s write semantics are not as fluid, and it does not support full joins (we support large-table-to-small-table joins). Redshift provides full SQL support, including joins and insert/update statements.
+
+### Data distribution model
+
+Druid’s data distribution is segment-based and leverages a highly available "deep" storage such as S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of historical nodes does not result in data loss because new historical nodes can always be brought up by reading data from "deep" storage.
+
+To contrast, ParAccel’s data distribution model is hash-based.
Expanding the cluster requires re-hashing the data across the nodes, making it difficult to perform without taking downtime. Amazon’s Redshift works around this issue with a multi-step process: + +* set cluster into read-only mode +* copy data from cluster to new cluster that exists in parallel +* redirect traffic to new cluster + +### Replication strategy + +Druid employs segment-level data distribution meaning that more nodes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying. Replication is done automatically and without any impact to performance. + +ParAccel’s hash-based distribution generally means that replication is conducted via hot spares. This puts a numerical limit on the number of nodes you can lose without losing data, and this replication strategy often does not allow the hot spare to help share query load. + +### Indexing strategy + +Along with column oriented structures, Druid uses indexing structures to speed up query execution when a filter is provided. Indexing structures do increase storage overhead (and make it more difficult to allow for mutation), but they also significantly speed up queries. + +ParAccel does not appear to employ indexing strategies. + +## Druid vs Vertica + +### How does Druid compare to Vertica? + +Vertica is similar to ParAccel/Redshift described above in that it wasn’t built for real-time streaming data ingestion and it supports full SQL. + +The other big difference is that instead of employing indexing, Vertica tries to optimize processing by leveraging run-length encoding (RLE) and other compression techniques along with a "projection" system that creates materialized copies of the data in a different sort order (to maximize the effectiveness of RLE). + +We are unclear about how Vertica handles data distribution and replication, so we cannot speak to if/how Druid is different. diff --git a/docs/content/comparisons/druid-vs-computing-frameworks.md b/docs/content/comparisons/druid-vs-computing-frameworks.md new file mode 100644 index 00000000000..329c3fe26eb --- /dev/null +++ b/docs/content/comparisons/druid-vs-computing-frameworks.md @@ -0,0 +1,46 @@ +--- +layout: doc_page +--- + +Druid vs General Computing Frameworks (Spark/Hadoop) +==================================================== + +Druid is much more complementary to general computing frameworks than it is competitive. General compute engines can +very flexibly process and transform data. Druid acts as a query accelerator to these systems by indexing raw data into a custom +column format to speed up OLAP queries. + +## Druid vs Hadoop (HDFS/MapReduce) + +Hadoop has shown the world that it’s possible to house your data warehouse on commodity hardware for a fraction of the price +of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things. + +1. They can now query all of their data in a fairly flexible manner and answer any question they have +2. The queries take a long time + +The first one is the joy that everyone feels the first time they get Hadoop running. The latter is what they realize after they have used Hadoop interactively for a while because Hadoop is optimized for throughput, not latency. + +Druid is a complementary addition to Hadoop. Hadoop is great at storing and making accessible large amounts of individually low-value data. 
+Unfortunately, Hadoop is not great at providing query speed guarantees on top of that data, nor does it have very good
+operational characteristics for a customer-facing production system. Druid, on the other hand, excels at taking high-value
+summaries of the low-value data on Hadoop, making it available in a fast and always-on fashion, such that it could be exposed directly to a customer.
+
+Druid also requires some infrastructure to exist for [deep storage](../dependencies/deep-storage.html).
+HDFS is one of the implemented options for this [deep storage](../dependencies/deep-storage.html).
+
+Please note we are only comparing Druid to base Hadoop here, but we welcome comparisons of Druid vs other systems or combinations
+of systems in the Hadoop ecosystem.
+
+## Druid vs Spark
+
+Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and
+can be viewed as a back-office analytics platform. RDDs enable data reuse by persisting intermediate results
+in memory and enable Spark to provide fast computations for iterative algorithms.
+This is especially beneficial for certain workflows such as machine
+learning, where the same operation may be applied over and over
+again until some result is converged upon. Spark provides analysts with
+the ability to run queries and analyze large amounts of data with a
+wide array of different algorithms.
+
+Druid is designed to power analytic applications and focuses on the latencies to ingest data and serve queries
+over that data. If you were to build an application where users could
+arbitrarily explore data, the latencies seen by using Spark will likely be too slow for an interactive experience.
diff --git a/docs/content/comparisons/druid-vs-key-value.md b/docs/content/comparisons/druid-vs-key-value.md
new file mode 100644
index 00000000000..485b01a02c1
--- /dev/null
+++ b/docs/content/comparisons/druid-vs-key-value.md
@@ -0,0 +1,28 @@
+---
+layout: doc_page
+---
+
+Druid vs. Key/Value Stores (HBase/Cassandra/OpenTSDB)
+====================================================
+
+Druid is highly optimized for scans and aggregations, and it supports arbitrarily deep drill-downs into data sets. This same functionality
+is supported in key/value stores in two ways:
+
+1. Pre-compute all permutations of possible user queries
+2. Range scans on event data
+
+When pre-computing results, the key is the exact parameters of the query, and the value is the result of the query.
+The queries return extremely quickly, but at the cost of flexibility, as ad-hoc exploratory queries are not possible when
+pre-computing every possible query permutation. Pre-computing all permutations of all ad-hoc queries leads to result sets
+that grow exponentially with the number of columns of a data set, and pre-computing queries for complex real-world data sets
+can require hours of pre-processing time.
+
+The other approach to using key/value stores for aggregations is to use the dimensions of an event as the key and the event measures as the value.
+Aggregations are done by issuing range scans on this data. Timeseries-specific databases such as OpenTSDB use this approach.
+One of the limitations here is that the key/value storage model does not have indexes for any kind of filtering other than prefix ranges,
+which can be used to filter a query down to a metric and time range, but cannot resolve complex predicates to narrow the exact data to scan.
+When the number of rows to scan gets large, this limitation can greatly reduce performance. It is also harder to achieve good
+locality with key/value stores because most don’t support pushing down aggregates to the storage layer.
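To make the trade-off concrete, here is a minimal Python sketch of the two approaches; the dimension names and the `metric|time|tags` key layout are invented for illustration and are not the API of any particular key/value store.

```python
# Hypothetical sketch of the two key/value approaches described above.
from itertools import combinations

dimensions = ["country", "device", "browser", "campaign", "gender"]

# Approach 1: pre-compute an aggregate for every dimension combination.
# The number of result sets to materialize doubles with every column added.
groupings = [c for r in range(len(dimensions) + 1)
             for c in combinations(dimensions, r)]
print(len(groupings))  # 2^5 = 32 pre-computed views for only 5 dimensions

# Approach 2: OpenTSDB-style keys of the form "metric|time|tags".
# A prefix range scan can narrow by metric and time only; any other
# predicate (e.g. country=US) must be checked row by row after the scan.
rows = {f"clicks|2015-11-0{d}|country={c}": 100
        for d in range(1, 8) for c in ("US", "FR", "DE")}
total = sum(v for k, v in rows.items()
            if k.startswith("clicks|2015-11-0")   # the key prefix helps here
            and "country=US" in k)                # full post-filter, no index
print(total)  # 700
```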
+
+For arbitrary exploration of data (flexible data filtering), Druid's custom column format enables ad-hoc queries without pre-computation. The format
+also enables fast scans on columns, which is important for good aggregation performance.
diff --git a/docs/content/comparisons/druid-vs-search-systems.md b/docs/content/comparisons/druid-vs-search-systems.md
new file mode 100644
index 00000000000..88942d6c2c0
--- /dev/null
+++ b/docs/content/comparisons/druid-vs-search-systems.md
@@ -0,0 +1,21 @@
+---
+layout: doc_page
+---
+
+Druid vs Search Systems (Elasticsearch/Solr)
+============================================
+
+We are not experts on search systems; if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.
+
+Elasticsearch and Solr are search systems based on Apache Lucene. They provide full text search for schema-free documents
+and provide access to raw event-level data. Search systems are increasingly adding more support for analytics and aggregations.
+[Some members of the community](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ) have pointed out that
+the resource requirements for data ingestion and aggregation in search systems are much higher than those of Druid.
+
+Search systems also do not support data summarization/roll-up at ingestion time, which can compact the data that needs to be
+stored by up to 100x with real-world data sets. This leads to search systems having greater storage requirements.
+
+Druid focuses on OLAP workflows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost,
+and supports a wide range of analytic operations. Druid has some basic search support for structured event data, but does not support
+full text search. Druid also does not support completely unstructured data. Measures must be defined in a Druid schema such that
+summarization/roll-up can be done.
diff --git a/docs/content/comparisons/druid-vs-sql-on-hadoop.md b/docs/content/comparisons/druid-vs-sql-on-hadoop.md
new file mode 100644
index 00000000000..4bee5560f19
--- /dev/null
+++ b/docs/content/comparisons/druid-vs-sql-on-hadoop.md
@@ -0,0 +1,58 @@
+---
+layout: doc_page
+---
+
+Druid vs SQL-on-Hadoop (Hive/Impala/Drill/Spark SQL/Presto)
+===========================================================
+
+Druid is much more complementary to SQL-on-Hadoop engines than it is competitive. SQL-on-Hadoop engines provide an
+execution engine for various data formats and data stores, and
+many can be made to push computations down to Druid, while providing a SQL interface to Druid.
+
+For a direct comparison between the technologies and when to only use one or the other, the basically comes down to your
+product requirements and what the systems were designed to do.
+
+Druid was designed to
+
+1. be an always-on service
+1. ingest data in real-time
+1. handle slice-n-dice style ad-hoc queries
+
+SQL-on-Hadoop engines (as far as we are aware) were designed to replace Hadoop MapReduce with another, faster, query layer
+that is completely generic and plays well with the other ecosystem of Hadoop technologies.
+
+What does this mean? We can talk about it in terms of three general areas
+
+1. Queries
+1. Data Ingestion
+1. Query Flexibility
+
+## Queries
+
+Druid segments store data in a custom column format. Segments are scanned directly as part of queries, and each Druid server
+calculates a set of results that are eventually merged at the Broker level. This means the data that is transferred between servers
+is queries and results, and all computation is done internally as part of the Druid servers.
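The following toy Python sketch illustrates the scatter/gather pattern described above; it is a simplification for illustration, not Druid's actual implementation.

```python
# Each "historical" computes a partial aggregate over its own segments,
# and the "broker" merges the small partial results. Only queries and
# results cross the network; raw rows never leave the servers.
segments_by_server = {
    "historical-1": [{"country": "US", "clicks": 10}, {"country": "FR", "clicks": 3}],
    "historical-2": [{"country": "US", "clicks": 7}, {"country": "DE", "clicks": 5}],
}

def scan_segments(rows):
    """Partial aggregation, done locally on each historical server."""
    out = {}
    for row in rows:
        out[row["country"]] = out.get(row["country"], 0) + row["clicks"]
    return out

def broker_merge(partials):
    """The broker only merges the per-server partial results."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged

partials = [scan_segments(rows) for rows in segments_by_server.values()]
print(broker_merge(partials))  # {'US': 17, 'FR': 3, 'DE': 5}
```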
+
+Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats.
+They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce)
+and they have facilities to cache data locally so that it can be accessed and updated quicker.
+Many SQL-on-Hadoop engines have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still
+some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
+how much of a performance impact this makes.
+
+## Data Ingestion
+
+Druid is built to allow for real-time ingestion of data. You can ingest data and query it immediately upon ingestion;
+how quickly the event is reflected in the data is dominated by how long it takes to deliver the event to Druid.
+
+SQL-on-Hadoop engines, being based on data in HDFS or some other backing store, are limited in their data ingestion rates by the
+rate at which that backing store can make data available. Generally, the backing store is the biggest bottleneck for
+how quickly data can become available.
+
+## Query Flexibility
+
+Druid's query language is fairly low level and maps to how Druid operates internally. Although Druid can be combined with a high-level query
+planner such as [Plywood](https://github.com/implydata/plywood) to support most SQL queries and analytic SQL queries (minus joins among large tables),
+base Druid is less flexible than SQL-on-Hadoop solutions for generic processing.
+
+SQL-on-Hadoop engines support SQL-style queries with full joins.
diff --git a/docs/content/comparisons/druid-vs-storage-formats.md b/docs/content/comparisons/druid-vs-storage-formats.md
new file mode 100644
index 00000000000..ac2467ffe38
--- /dev/null
+++ b/docs/content/comparisons/druid-vs-storage-formats.md
@@ -0,0 +1,30 @@
+---
+layout: doc_page
+---
+
+Druid vs Storage Formats (Parquet/Kudu)
+=======================================
+
+The biggest difference between Druid and existing storage formats is that Druid includes an execution engine that can run
+queries, and a time-optimized data management system for advanced data retention and distribution.
+The Druid segment is a custom column format designed for fast aggregates and filters. Below we compare Druid's segment format to
+other existing formats.
+
+## Druid vs Parquet
+
+Druid's storage format is highly optimized for linear scans. Although Druid has support for nested data, Parquet's storage format is much
+more hierarchical, and is designed more for binary chunking. In theory, this should lead to faster scans in Druid.
+
+## Druid vs Kudu
+
+Kudu's storage format enables single-row updates, whereas updates to existing Druid segments require recreating the segment; hence,
+the process for updating old values is slower in Druid. The requirements in Kudu of maintaining extra head space to store
+updates, as well as organizing data by id instead of time, have the potential to introduce some extra latency and reads
+of data that is not needed to answer a query at query time. Druid summarizes/rolls up data at ingestion time, which in practice reduces the raw data that needs to be
+stored by an average of 40 times, and significantly increases the performance of scanning raw data.
+This summarization process loses information about individual events, however.
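As an illustration of the summarization/roll-up idea referenced throughout these comparisons, here is a small Python sketch; the event fields and hourly granularity are hypothetical.

```python
# Ingestion-time roll-up: raw events that share a truncated timestamp
# and the same dimension values collapse into one stored row, with the
# measures pre-aggregated.
from collections import defaultdict

raw_events = [
    {"ts": "2015-11-09T16:40:07", "page": "a", "clicks": 1},
    {"ts": "2015-11-09T16:40:31", "page": "a", "clicks": 1},
    {"ts": "2015-11-09T16:59:59", "page": "a", "clicks": 1},
    {"ts": "2015-11-09T16:05:12", "page": "b", "clicks": 1},
]

rolled = defaultdict(int)
for event in raw_events:
    hour = event["ts"][:13]            # truncate to hour granularity
    rolled[(hour, event["page"])] += event["clicks"]

# 4 raw rows become 2 stored rows; the individual events are no longer
# recoverable, which is exactly the trade-off described above.
print(dict(rolled))  # {('2015-11-09T16', 'a'): 3, ('2015-11-09T16', 'b'): 1}
```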
+Druid segments also contain bitmap indexes for
+fast filtering, which Kudu does not currently support. Druid's segment architecture is heavily geared towards fast aggregates and filters, and for OLAP workflows. Appends are very
+fast in Druid, whereas updates of older data are slower. This is by design, as the data Druid is good for is typically event data that does not need to be updated too frequently.
+Kudu supports arbitrary primary keys with uniqueness constraints, and efficient lookup by ranges of those keys.
+Kudu chooses not to include an execution engine, but supports sufficient operations to allow node-local processing by external execution engines.
+This means that Kudu can support multiple frameworks on the same data (e.g. MR, Spark, and SQL).

From b99576d8541b0e7cfc0567c0d8a8c35190cf4c6e Mon Sep 17 00:00:00 2001
From: fjy
Date: Mon, 9 Nov 2015 16:40:07 -0800
Subject: [PATCH 2/4] rework compares again

---
 .../content/comparisons/druid-vs-cassandra.md | 13 ----
 .../druid-vs-commericial-solutions.md | 62 -------------------
 .../druid-vs-computing-frameworks.md | 46 --------------
 .../comparisons/druid-vs-elasticsearch.md | 15 ++++-
 docs/content/comparisons/druid-vs-hadoop.md | 19 ++++--
 .../comparisons/druid-vs-impala-or-shark.md | 49 ---------------
 docs/content/comparisons/druid-vs-kudu.md | 19 ++++++
 docs/content/comparisons/druid-vs-redshift.md | 22 +++----
 .../comparisons/druid-vs-search-systems.md | 21 -------
 docs/content/comparisons/druid-vs-spark.md | 6 +-
 .../comparisons/druid-vs-sql-on-hadoop.md | 16 +++--
 docs/content/comparisons/druid-vs-vertica.md | 15 -----
 12 files changed, 69 insertions(+), 234 deletions(-)
 delete mode 100644 docs/content/comparisons/druid-vs-cassandra.md
 delete mode 100644 docs/content/comparisons/druid-vs-commericial-solutions.md
 delete mode 100644 docs/content/comparisons/druid-vs-computing-frameworks.md
 delete mode 100644 docs/content/comparisons/druid-vs-impala-or-shark.md
 create mode 100644 docs/content/comparisons/druid-vs-kudu.md
 delete mode 100644 docs/content/comparisons/druid-vs-search-systems.md
 delete mode 100644 docs/content/comparisons/druid-vs-vertica.md

diff --git a/docs/content/comparisons/druid-vs-cassandra.md b/docs/content/comparisons/druid-vs-cassandra.md
deleted file mode 100644
index f8d1d59a473..00000000000
--- a/docs/content/comparisons/druid-vs-cassandra.md
+++ /dev/null
@@ -1,13 +0,0 @@
----
-layout: doc_page
----
-
-Druid vs. Cassandra
-===================
-
-
-We are not experts on Cassandra, if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means. We will fix this page.
-
-Druid is highly optimized for scans and aggregations, it supports arbitrarily deep drill downs into data sets without the need to pre-compute, and it can ingest event streams in real-time and allow users to query events as they come in. Cassandra is a great key-value store and it has some features that allow you to use it to do more interesting things than what you can do with a pure key-value store. But, it is not built for the same use cases that Druid handles, namely regularly scanning over billions of entries per query.
-
-Furthermore, Druid is fully read-consistent.
Druid breaks down a data set into immutable chunks known as segments. All replicants always present the exact same view for the piece of data they are holding and we don’t have to worry about data synchronization. The tradeoff is that Druid has limited semantics for write and update operations. Cassandra, similar to Amazon’s Dynamo, has an eventually consistent data model. Writes are always supported but updates to data may take some time before all replicas sync up (data reconciliation is done at read time). This model favors availability and scalability over consistency. diff --git a/docs/content/comparisons/druid-vs-commericial-solutions.md b/docs/content/comparisons/druid-vs-commericial-solutions.md deleted file mode 100644 index 6f1e0ba62e3..00000000000 --- a/docs/content/comparisons/druid-vs-commericial-solutions.md +++ /dev/null @@ -1,62 +0,0 @@ ---- -layout: doc_page ---- - -Druid vs Proprietary Commercial Solutions (Vertica/Redshift) -============================================================ - -The proprietary database world has numerous solutions by numerous vendors. At a very high level, Druid is distinguished from most -of these solutions by the scale at which it can operate at, the scale and performance seen with ingesting streaming data, the performance seen from issuing ad-hoc exploratory queries, -the support it has for multi-tenancy (making Druid ideal for powering user facing applications), and how cost effective it is to run at scale. Druid is an open source project and the development -is community led. The direction of the project is driven by the needs of the community. Below, we highlight comparisons between some specific -systems. We welcome contributions for additional comparisons. - -## Druid vs Redshift - -### How does Druid compare to Redshift? - -In terms of drawing a differentiation, Redshift started out as ParAccel (Actian), which Amazon is licensing and has since heavily modified. - -Aside from potential performance differences, there are some functional differences: - -### Real-time data ingestion - -Because Druid is optimized to provide insight against massive quantities of streaming data; it is able to load and aggregate data in real-time. - -Generally traditional data warehouses including column stores work only with batch ingestion and are not optimal for streaming data in regularly. - -### Druid is a read oriented analytical data store - -Druid’s write semantics are not as fluid and does not support full joins (we support large table to small table joins). Redshift provides full SQL support including joins and insert/update statements. - -### Data distribution model - -Druid’s data distribution is segment-based and leverages a highly available "deep" storage such as S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of historical nodes does not result in data loss because new historical nodes can always be brought up by reading data from "deep" storage. - -To contrast, ParAccel’s data distribution model is hash-based. Expanding the cluster requires re-hashing the data across the nodes, making it difficult to perform without taking downtime. 
Amazon’s Redshift works around this issue with a multi-step process: - -* set cluster into read-only mode -* copy data from cluster to new cluster that exists in parallel -* redirect traffic to new cluster - -### Replication strategy - -Druid employs segment-level data distribution meaning that more nodes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying. Replication is done automatically and without any impact to performance. - -ParAccel’s hash-based distribution generally means that replication is conducted via hot spares. This puts a numerical limit on the number of nodes you can lose without losing data, and this replication strategy often does not allow the hot spare to help share query load. - -### Indexing strategy - -Along with column oriented structures, Druid uses indexing structures to speed up query execution when a filter is provided. Indexing structures do increase storage overhead (and make it more difficult to allow for mutation), but they also significantly speed up queries. - -ParAccel does not appear to employ indexing strategies. - -## Druid vs Vertica - -### How does Druid compare to Vertica? - -Vertica is similar to ParAccel/Redshift described above in that it wasn’t built for real-time streaming data ingestion and it supports full SQL. - -The other big difference is that instead of employing indexing, Vertica tries to optimize processing by leveraging run-length encoding (RLE) and other compression techniques along with a "projection" system that creates materialized copies of the data in a different sort order (to maximize the effectiveness of RLE). - -We are unclear about how Vertica handles data distribution and replication, so we cannot speak to if/how Druid is different. diff --git a/docs/content/comparisons/druid-vs-computing-frameworks.md b/docs/content/comparisons/druid-vs-computing-frameworks.md deleted file mode 100644 index 329c3fe26eb..00000000000 --- a/docs/content/comparisons/druid-vs-computing-frameworks.md +++ /dev/null @@ -1,46 +0,0 @@ ---- -layout: doc_page ---- - -Druid vs General Computing Frameworks (Spark/Hadoop) -==================================================== - -Druid is much more complementary to general computing frameworks than it is competitive. General compute engines can -very flexibly process and transform data. Druid acts as a query accelerator to these systems by indexing raw data into a custom -column format to speed up OLAP queries. - -## Druid vs Hadoop (HDFS/MapReduce) - -Hadoop has shown the world that it’s possible to house your data warehouse on commodity hardware for a fraction of the price -of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things. - -1. They can now query all of their data in a fairly flexible manner and answer any question they have -2. The queries take a long time - -The first one is the joy that everyone feels the first time they get Hadoop running. The latter is what they realize after they have used Hadoop interactively for a while because Hadoop is optimized for throughput, not latency. - -Druid is a complementary addition to Hadoop. Hadoop is great at storing and making accessible large amounts of individually low-value data. -Unfortunately, Hadoop is not great at providing query speed guarantees on top of that data, nor does it have very good -operational characteristics for a customer-facing production system. 
Druid, on the other hand, excels at taking high-value
-summaries of the low-value data on Hadoop, making it available in a fast and always-on fashion, such that it could be exposed directly to a customer.
-
-Druid also requires some infrastructure to exist for [deep storage](../dependencies/deep-storage.html).
-HDFS is one of the implemented options for this [deep storage](../dependencies/deep-storage.html).
-
-Please note we are only comparing Druid to base Hadoop here, but we welcome comparisons of Druid vs other systems or combinations
-of systems in the Hadoop ecosystems.
-
-## Druid vs Spark
-
-Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and
-can be viewed as a back-office analytics platform. RDDs enable data reuse by persisting intermediate results
-in memory and enable Spark to provide fast computations for iterative algorithms.
-This is especially beneficial for certain work flows such as machine
-learning, where the same operation may be applied over and over
-again until some result is converged upon. Spark provides analysts with
-the ability to run queries and analyze large amounts of data with a
-wide array of different algorithms.
-
-Druid is designed to power analytic applications and focuses on the latencies to ingest data and serve queries
-over that data. If you were to build an application where users could
-arbitrarily explore data, the latencies seen by using Spark will likely be too slow for an interactive experience.
diff --git a/docs/content/comparisons/druid-vs-elasticsearch.md b/docs/content/comparisons/druid-vs-elasticsearch.md
index 362af084a5f..64064a32f43 100644
--- a/docs/content/comparisons/druid-vs-elasticsearch.md
+++ b/docs/content/comparisons/druid-vs-elasticsearch.md
@@ -5,8 +5,17 @@ layout: doc_page
 Druid vs Elasticsearch
 ======================
 
-We are not experts on Elasticsearch, if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.
+We are not experts on search systems; if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means.
 
-Elasticsearch is a search server based on Apache Lucene. It provides full text search for schema-free documents and provides access to raw event level data. Elasticsearch also provides support for analytics and aggregations. Based on [user testimony](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ), the resource requirements for data ingestion and aggregation in Elasticsearch are higher than those of Druid.
+Elasticsearch is a search system based on Apache Lucene. It provides full text search for schema-free documents
+and provides access to raw event-level data. Elasticsearch is increasingly adding more support for analytics and aggregations.
+[Some members of the community](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ) have pointed out that
+the resource requirements for data ingestion and aggregation in Elasticsearch are much higher than those of Druid.
 
-Druid focuses on OLAP work flows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost, and supports a wide range of analytic operations. Druid has some basic search support for structured event data.
+Elasticsearch also does not support data summarization/roll-up at ingestion time, which can compact the data that needs to be
+stored by up to 100x with real-world data sets.
This leads to Elasticsearch having greater storage requirements.
+
+Druid focuses on OLAP workflows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost,
+and supports a wide range of analytic operations. Druid has some basic search support for structured event data, but does not support
+full text search. Druid also does not support completely unstructured data. Measures must be defined in a Druid schema such that
+summarization/roll-up can be done.
diff --git a/docs/content/comparisons/druid-vs-hadoop.md b/docs/content/comparisons/druid-vs-hadoop.md
index 6a0745e49eb..d0209ace76f 100644
--- a/docs/content/comparisons/druid-vs-hadoop.md
+++ b/docs/content/comparisons/druid-vs-hadoop.md
@@ -2,17 +2,24 @@
 layout: doc_page
 ---
 
-Druid vs Hadoop
-===============
-
-
-Hadoop has shown the world that it’s possible to house your data warehouse on commodity hardware for a fraction of the price of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things.
+Druid vs Hadoop (HDFS/MapReduce)
+================================
+
+Hadoop has shown the world that it’s possible to house your data warehouse on commodity hardware for a fraction of the price
+of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things.
 
 1. They can now query all of their data in a fairly flexible manner and answer any question they have
 2. The queries take a long time
 
 The first one is the joy that everyone feels the first time they get Hadoop running. The latter is what they realize after they have used Hadoop interactively for a while because Hadoop is optimized for throughput, not latency.
 
-Druid is a complementary addition to Hadoop. Hadoop is great at storing and making accessible large amounts of individually low-value data. Unfortunately, Hadoop is not great at providing query speed guarantees on top of that data, nor does it have very good operational characteristics for a customer-facing production system. Druid, on the other hand, excels at taking high-value summaries of the low-value data on Hadoop, making it available in a fast and always-on fashion, such that it could be exposed directly to a customer.
+Druid is a complementary addition to Hadoop. Hadoop is great at storing and making accessible large amounts of individually low-value data.
+Unfortunately, Hadoop is not great at providing query speed guarantees on top of that data, nor does it have very good
+operational characteristics for a customer-facing production system. Druid, on the other hand, excels at taking high-value
+summaries of the low-value data on Hadoop, making it available in a fast and always-on fashion, such that it could be exposed directly to a customer.
 
-Druid also requires some infrastructure to exist for [deep storage](../dependencies/deep-storage.html). HDFS is one of the implemented options for this [deep storage](../dependencies/deep-storage.html).
+Druid also requires some infrastructure to exist for [deep storage](../dependencies/deep-storage.html).
+HDFS is one of the implemented options for this [deep storage](../dependencies/deep-storage.html).
+
+Please note we are only comparing Druid to base Hadoop here, but we welcome comparisons of Druid vs other systems or combinations
+of systems in the Hadoop ecosystem.
diff --git a/docs/content/comparisons/druid-vs-impala-or-shark.md b/docs/content/comparisons/druid-vs-impala-or-shark.md deleted file mode 100644 index 636540bc8fb..00000000000 --- a/docs/content/comparisons/druid-vs-impala-or-shark.md +++ /dev/null @@ -1,49 +0,0 @@ ---- -layout: doc_page ---- - -Druid vs Impala or Shark -======================== - -The question of Druid versus Impala or Shark basically comes down to your product requirements and what the systems were designed to do. - -Druid was designed to - -1. be an always on service -1. ingest data in real-time -1. handle slice-n-dice style ad-hoc queries - -Impala and Shark's primary design concerns (as far as I am aware) were to replace Hadoop MapReduce with another, faster, query layer that is completely generic and plays well with the other ecosystem of Hadoop technologies. I will caveat this discussion with the statement that I am not an expert on Impala or Shark, nor am I intimately familiar with their roadmaps. If anything is incorrect on this page, I'd be happy to change it, please send a note to the mailing list. - -What does this mean? We can talk about it in terms of four general areas - -1. Fault Tolerance -1. Query Speed -1. Data Ingestion -1. Query Flexibility - -## Fault Tolerance - -Druid pulls segments down from [Deep Storage](../dependencies/deep-storage.html) before serving queries on top of it. This means that for the data to exist in the Druid cluster, it must exist as a local copy on a historical node. If deep storage becomes unavailable for any reason, new segments will not be loaded into the system, but the cluster will continue to operate exactly as it was when the backing store disappeared. - -Impala and Shark, on the other hand, pull their data in from HDFS (or some other Hadoop FileSystem) in response to a query. This has implications for the operation of queries if you need to take HDFS down for a bit (say a software upgrade). It's possible that data that has been cached in the nodes is still available when the backing file system goes down, but I'm not sure. - -This is just one example, but Druid was built to continue operating in the face of failures of any one of its various pieces. The [Design](../design/design.html) describes these design decisions from the Druid side in more detail. - -## Query Speed - -Druid takes control of the data given to it, storing it in a column-oriented fashion, compressing it and adding indexing structures. All of which add to the speed at which queries can be processed. The column orientation means that we only look at the data that a query asks for in order to compute the answer. Compression increases the data storage capacity of RAM and allows us to fit more data into quickly accessible RAM. Indexing structures mean that as you add boolean filters to your queries, we do less processing and you get your result faster, whereas a lot of processing engines do *more* processing when filters are added. - -Impala/Shark can basically be thought of as daemon caching layers on top of HDFS. They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce) and they have facilities to cache data locally so that it can be accessed and updated quicker. But, I do not believe they go beyond caching capabilities to actually speed up queries. So, at the end of the day, they don't change the paradigm from a brute-force, scan-everything query processing paradigm. 
-
-## Data Ingestion
-
-Druid is built to allow for real-time ingestion of data. You can ingest data and query it immediately upon ingestion, the latency between how quickly the event is reflected in the data is dominated by how long it takes to deliver the event to Druid.
-
-Impala/Shark, being based on data in HDFS or some other backing store, are limited in their data ingestion rates by the rate at which that backing store can make data available. Generally, the backing store is the biggest bottleneck for how quickly data can become available.
-
-## Query Flexibility
-
-Druid supports timeseries and groupBy style queries. It doesn't have support for joins, which makes it a lot less flexible for generic processing.
-
-Impala/Shark support SQL style queries with full joins.
diff --git a/docs/content/comparisons/druid-vs-kudu.md b/docs/content/comparisons/druid-vs-kudu.md
new file mode 100644
index 00000000000..35bb05c20aa
--- /dev/null
+++ b/docs/content/comparisons/druid-vs-kudu.md
@@ -0,0 +1,19 @@
+---
+layout: doc_page
+---
+
+Druid vs Kudu
+=============
+
+Kudu's storage format enables single-row updates, whereas updates to existing Druid segments require recreating the segment; hence,
+the process for updating old values is slower in Druid. The requirements in Kudu of maintaining extra head space to store
+updates, as well as organizing data by id instead of time, have the potential to introduce some extra latency and reads
+of data that is not needed to answer a query at query time. Druid summarizes/rolls up data at ingestion time, which in practice reduces the raw data that needs to be
+stored by an average of 40 times, and significantly increases the performance of scanning raw data.
+This summarization process loses information about individual events, however. Druid segments also contain bitmap indexes for
+fast filtering, which Kudu does not currently support. Druid's segment architecture is heavily geared towards fast aggregates and filters, and for OLAP workflows. Appends are very
+fast in Druid, whereas updates of older data are slower. This is by design, as the data Druid is good for is typically event data that does not need to be updated too frequently.
+Kudu supports arbitrary primary keys with uniqueness constraints, and efficient lookup by ranges of those keys.
+Kudu chooses not to include an execution engine, but supports sufficient operations to allow node-local processing by external execution engines.
+This means that Kudu can support multiple frameworks on the same data (e.g. MR, Spark, and SQL).
+
diff --git a/docs/content/comparisons/druid-vs-redshift.md b/docs/content/comparisons/druid-vs-redshift.md
index faaaa3f0f33..8998301927a 100644
--- a/docs/content/comparisons/druid-vs-redshift.md
+++ b/docs/content/comparisons/druid-vs-redshift.md
@@ -5,25 +5,25 @@
 Druid vs Redshift
 =================
 
-###How does Druid compare to Redshift?
+### How does Druid compare to Redshift?
 
-In terms of drawing a differentiation, Redshift is essentially ParAccel (Actian) which Amazon is licensing.
+In terms of drawing a differentiation, Redshift started out as ParAccel (Actian), which Amazon licensed and has since heavily modified.
 
 Aside from potential performance differences, there are some functional differences:
 
-###Real-time data ingestion
+### Real-time data ingestion
 
 Because Druid is optimized to provide insight against massive quantities of streaming data, it is able to load and aggregate data in real-time.
 Generally, traditional data warehouses, including column stores, work only with batch ingestion and are not optimal for regularly streaming data in.
 
-###Druid is a read oriented analytical data store
+### Druid is a read oriented analytical data store
 
-It’s write semantics aren’t as fluid and does not support joins. ParAccel is a full database with SQL support including joins and insert/update statements.
+Druid’s write semantics are not as fluid, and it does not support full joins (we support large-table-to-small-table joins). Redshift provides full SQL support, including joins and insert/update statements.
 
-###Data distribution model
+### Data distribution model
 
-Druid’s data distribution, is segment based which exists on highly available "deep" storage, like S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of historical nodes does not result in data loss because new historical nodes can always be brought up by reading data from "deep" storage.
+Druid’s data distribution is segment-based and leverages a highly available "deep" storage such as S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of historical nodes does not result in data loss because new historical nodes can always be brought up by reading data from "deep" storage.
 
 To contrast, ParAccel’s data distribution model is hash-based. Expanding the cluster requires re-hashing the data across the nodes, making it difficult to perform without taking downtime. Amazon’s Redshift works around this issue with a multi-step process:
 
@@ -31,14 +31,14 @@ To contrast, ParAccel’s data distribution model is hash-based. Expanding the c
 * set cluster into read-only mode
 * copy data from cluster to new cluster that exists in parallel
 * redirect traffic to new cluster
 
-###Replication strategy
+### Replication strategy
 
-Druid employs segment-level data distribution meaning that more nodes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying.
+Druid employs segment-level data distribution meaning that more nodes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying. Replication is done automatically and without any impact to performance.
 
 ParAccel’s hash-based distribution generally means that replication is conducted via hot spares. This puts a numerical limit on the number of nodes you can lose without losing data, and this replication strategy often does not allow the hot spare to help share query load.
 
-###Indexing strategy
+### Indexing strategy
 
-Along with column oriented structures, Druid uses indexing structures to speed up query execution when a filter is provided. Indexing structures do increase storage overhead (and make it more difficult to allow for mutation), but they can also significantly speed up queries.
+Along with column oriented structures, Druid uses indexing structures to speed up query execution when a filter is provided. Indexing structures do increase storage overhead (and make it more difficult to allow for mutation), but they also significantly speed up queries.
 
 ParAccel does not appear to employ indexing strategies.
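To illustrate why such indexes help filtered queries, here is a minimal Python sketch of the bitmap-index idea; it ignores the compressed bitmap formats a real system would use.

```python
# One bitset per dimension value, so a filter becomes cheap bitwise
# AND/OR operations instead of a scan over every row.
rows = ["US:web", "US:mobile", "FR:web", "US:web", "DE:mobile"]

index = {}
for i, row in enumerate(rows):
    country, device = row.split(":")
    for value in (country, device):
        index[value] = index.get(value, 0) | (1 << i)

# WHERE country = 'US' AND device = 'web'
matches = index["US"] & index["web"]
print([i for i in range(len(rows)) if matches >> i & 1])  # [0, 3]
```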
diff --git a/docs/content/comparisons/druid-vs-search-systems.md b/docs/content/comparisons/druid-vs-search-systems.md deleted file mode 100644 index 88942d6c2c0..00000000000 --- a/docs/content/comparisons/druid-vs-search-systems.md +++ /dev/null @@ -1,21 +0,0 @@ ---- -layout: doc_page ---- - -Druid vs Search Systems (Elasticsearch/Solr) -============================================ - -We are not experts on search systems, if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means. - -Elasticsearch and Solr are a search systems based on Apache Lucene. They provides full text search for schema-free documents -and provides access to raw event level data. Search systems are increasingly adding more support for analytics and aggregations. -[Some members of the community](https://groups.google.com/forum/#!msg/druid-development/nlpwTHNclj8/sOuWlKOzPpYJ) have pointed out -the resource requirements for data ingestion and aggregation in search systems are much higher than those of Druid. - -Search systems also do not support data summarization/roll-up at ingestion time, which can compact the data that needs to be -stored up to 100x with real-world data sets. This leads to search systems having greater storage requirements. - -Druid focuses on OLAP work flows. Druid is optimized for high performance (fast aggregation and ingestion) at low cost, -and supports a wide range of analytic operations. Druid has some basic search support for structured event data, but does not support -full text search. Druid also does not support completely unstructured data. Measures must be defined in a Druid schema such that -summarization/roll-up can be done. diff --git a/docs/content/comparisons/druid-vs-spark.md b/docs/content/comparisons/druid-vs-spark.md index 9032c4e65d6..c8721dee2a0 100644 --- a/docs/content/comparisons/druid-vs-spark.md +++ b/docs/content/comparisons/druid-vs-spark.md @@ -5,8 +5,6 @@ layout: doc_page Druid vs Spark ============== -We are not experts on Spark, if anything is incorrect about our portrayal, please let us know on the mailing list or via some other means. - Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and can be viewed as a back-office analytics platform. RDDs enable data reuse by persisting intermediate results in memory and enable Spark to provide fast computations for iterative algorithms. @@ -17,5 +15,5 @@ the ability to run queries and analyze large amounts of data with a wide array of different algorithms. Druid is designed to power analytic applications and focuses on the latencies to ingest data and serve queries -over that data. If you were to build a web UI where users could -arbitrarily explore data, the latencies seen by using Spark may be too slow for interactive use cases. +over that data. If you were to build an application where users could +arbitrarily explore data, the latencies seen by using Spark will likely be too slow for an interactive experience. 
diff --git a/docs/content/comparisons/druid-vs-sql-on-hadoop.md b/docs/content/comparisons/druid-vs-sql-on-hadoop.md
index 4bee5560f19..94f8f9d5de0 100644
--- a/docs/content/comparisons/druid-vs-sql-on-hadoop.md
+++ b/docs/content/comparisons/druid-vs-sql-on-hadoop.md
@@ -2,7 +2,7 @@
 layout: doc_page
 ---
 
-Druid vs SQL-on-Hadoop (Hive/Impala/Drill/Spark SQL/Presto)
+Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
 ===========================================================
 
 Druid is much more complementary to SQL-on-Hadoop engines than it is competitive. SQL-on-Hadoop engines provide an
 execution engine for various data formats and data stores, and
 many can be made to push computations down to Druid, while providing a SQL interface to Druid.
@@ -27,7 +27,7 @@ What does this mean? We can talk about it in terms of three general areas
 1. Data Ingestion
 1. Query Flexibility
 
-## Queries
+### Queries
 
 Druid segments store data in a custom column format. Segments are scanned directly as part of queries, and each Druid server
 calculates a set of results that are eventually merged at the Broker level. This means the data that is transferred between servers
 is queries and results, and all computation is done internally as part of the Druid servers.
@@ -40,7 +40,7 @@ Many SQL-on-Hadoop engines have daemon processes that can be run where the data
 some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
 how much of a performance impact this makes.
 
-## Data Ingestion
+### Data Ingestion
 
 Druid is built to allow for real-time ingestion of data. You can ingest data and query it immediately upon ingestion;
 how quickly the event is reflected in the data is dominated by how long it takes to deliver the event to Druid.
@@ -49,10 +49,18 @@ SQL-on-Hadoop, being based on data in HDFS or some other backing store, are limi
 rate at which that backing store can make data available. Generally, the backing store is the biggest bottleneck for
 how quickly data can become available.
 
-## Query Flexibility
+### Query Flexibility
 
 Druid's query language is fairly low level and maps to how Druid operates internally. Although Druid can be combined with a high-level query
 planner such as [Plywood](https://github.com/implydata/plywood) to support most SQL queries and analytic SQL queries (minus joins among large tables),
 base Druid is less flexible than SQL-on-Hadoop solutions for generic processing.
 
 SQL-on-Hadoop engines support SQL-style queries with full joins.
+
+## Druid vs Parquet
+
+Parquet is a column storage format that is designed to work with SQL-on-Hadoop engines. Parquet doesn't have a query execution engine, and instead
+relies on external sources to pull data out of it.
+
+Druid's storage format is highly optimized for linear scans. Although Druid has support for nested data, Parquet's storage format is much
+more hierarchical, and is designed more for binary chunking. In theory, this should lead to faster scans in Druid.
diff --git a/docs/content/comparisons/druid-vs-vertica.md b/docs/content/comparisons/druid-vs-vertica.md
deleted file mode 100644
index 4a813385f03..00000000000
--- a/docs/content/comparisons/druid-vs-vertica.md
+++ /dev/null
@@ -1,15 +0,0 @@
----
-layout: doc_page
----
-
-Druid vs Vertica
-================
-
-
-How does Druid compare to Vertica?
-
-Vertica is similar to ParAccel/Redshift ([Druid-vs-Redshift](../comparisons/druid-vs-redshift.html)) described above in that it wasn’t built for real-time streaming data ingestion and it supports full SQL.
-
-The other big difference is that instead of employing indexing, Vertica tries to optimize processing by leveraging run-length encoding (RLE) and other compression techniques along with a "projection" system that creates materialized copies of the data in a different sort order (to maximize the effectiveness of RLE).
-
-We are unclear about how Vertica handles data distribution and replication, so we cannot speak to if/how Druid is different.

From 8a8bb0369e2a1506cd993b4af0c3aa67310632ea Mon Sep 17 00:00:00 2001
From: fjy
Date: Mon, 9 Nov 2015 16:56:43 -0800
Subject: [PATCH 3/4] address more comments

---
 docs/content/comparisons/druid-vs-spark.md | 2 ++
 docs/content/comparisons/druid-vs-sql-on-hadoop.md | 14 ++++++--------
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/docs/content/comparisons/druid-vs-spark.md b/docs/content/comparisons/druid-vs-spark.md
index c8721dee2a0..9a8cce18675 100644
--- a/docs/content/comparisons/druid-vs-spark.md
+++ b/docs/content/comparisons/druid-vs-spark.md
@@ -5,6 +5,8 @@ layout: doc_page
 Druid vs Spark
 ==============
 
+Druid and Spark are complementary solutions, as Druid can be used to accelerate OLAP queries in Spark.
+
 Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and
 can be viewed as a back-office analytics platform. RDDs enable data reuse by persisting intermediate results
 in memory and enable Spark to provide fast computations for iterative algorithms.
diff --git a/docs/content/comparisons/druid-vs-sql-on-hadoop.md b/docs/content/comparisons/druid-vs-sql-on-hadoop.md
index 94f8f9d5de0..2de6448a3c2 100644
--- a/docs/content/comparisons/druid-vs-sql-on-hadoop.md
+++ b/docs/content/comparisons/druid-vs-sql-on-hadoop.md
@@ -5,11 +5,11 @@ layout: doc_page
 Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
 ===========================================================
 
-Druid is much more complementary to SQL-on-Hadoop engines than it is competitive. SQL-on-Hadoop engines provide an
+SQL-on-Hadoop engines provide an
 execution engine for various data formats and data stores, and
 many can be made to push computations down to Druid, while providing a SQL interface to Druid.
 
-For a direct comparison between the technologies and when to only use one or the other, the basically comes down to your
+For a direct comparison between the technologies and when to only use one or the other, things basically come down to your
 product requirements and what the systems were designed to do.
 
 Druid was designed to
@@ -18,9 +18,8 @@ Druid was designed to
 1. ingest data in real-time
 1. handle slice-n-dice style ad-hoc queries
 
-SQL-on-Hadoop engines (as far as we are aware) were designed to replace Hadoop MapReduce with another, faster, query layer
-that is completely generic and plays well with the other ecosystem of Hadoop technologies.
-
+SQL-on-Hadoop engines generally sidestep Map/Reduce, instead querying data directly from HDFS or, in some cases, other storage systems.
+Some of these engines (including Impala and Presto) can be colocated with HDFS data nodes and coordinate with them to achieve data locality for queries.
 What does this mean? We can talk about it in terms of three general areas
 
 1. Queries
 1. Data Ingestion
 1. Query Flexibility
@@ -32,9 +31,8 @@ calculates a set of results that are eventually merged at the Broker level. This
 is queries and results, and all computation is done internally as part of the Druid servers.
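As a concrete illustration of this request/response flow, the hedged sketch below posts a native Druid timeseries query to a Broker over HTTP. The `wikipedia` datasource, the column names, and the `localhost:8082` endpoint are assumptions for illustration; adjust them for a real cluster.

```python
# The client sends a small JSON query to the Broker and gets back a
# small aggregated result; the segment scans happen inside the cluster.
import json
from urllib.request import Request, urlopen

query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "hour",
    "aggregations": [{"type": "longSum", "name": "edits", "fieldName": "count"}],
    "intervals": ["2015-11-01/2015-11-02"],
}

request = Request(
    "http://localhost:8082/druid/v2/?pretty",   # assumed Broker endpoint
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urlopen(request).read().decode("utf-8"))
```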
Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats. -They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce) -and they have facilities to cache data locally so that it can be accessed and updated quicker. -Many SQL-on-Hadoop engines have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still +They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce). +Some (Impala/Presto) SQL-on-Hadoop engines have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly how much of a performance impact this makes. From 46bf1ba5efa535a323c442e4f5a5c32eae3058e9 Mon Sep 17 00:00:00 2001 From: fjy Date: Mon, 9 Nov 2015 17:03:00 -0800 Subject: [PATCH 4/4] remove unneeded --- docs/content/comparisons/druid-vs-hadoop.md | 25 ---------------- .../comparisons/druid-vs-storage-formats.md | 30 ------------------- 2 files changed, 55 deletions(-) delete mode 100644 docs/content/comparisons/druid-vs-hadoop.md delete mode 100644 docs/content/comparisons/druid-vs-storage-formats.md diff --git a/docs/content/comparisons/druid-vs-hadoop.md b/docs/content/comparisons/druid-vs-hadoop.md deleted file mode 100644 index d0209ace76f..00000000000 --- a/docs/content/comparisons/druid-vs-hadoop.md +++ /dev/null @@ -1,25 +0,0 @@ ---- -layout: doc_page ---- - -Druid vs Hadoop (HDFS/MapReduce) -================================ - -Hadoop has shown the world that it’s possible to house your data warehouse on commodity hardware for a fraction of the price -of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things. - -1. They can now query all of their data in a fairly flexible manner and answer any question they have -2. The queries take a long time - -The first one is the joy that everyone feels the first time they get Hadoop running. The latter is what they realize after they have used Hadoop interactively for a while because Hadoop is optimized for throughput, not latency. - -Druid is a complementary addition to Hadoop. Hadoop is great at storing and making accessible large amounts of individually low-value data. -Unfortunately, Hadoop is not great at providing query speed guarantees on top of that data, nor does it have very good -operational characteristics for a customer-facing production system. Druid, on the other hand, excels at taking high-value -summaries of the low-value data on Hadoop, making it available in a fast and always-on fashion, such that it could be exposed directly to a customer. - -Druid also requires some infrastructure to exist for [deep storage](../dependencies/deep-storage.html). -HDFS is one of the implemented options for this [deep storage](../dependencies/deep-storage.html). - -Please note we are only comparing Druid to base Hadoop here, but we welcome comparisons of Druid vs other systems or combinations -of systems in the Hadoop ecosystems. 
diff --git a/docs/content/comparisons/druid-vs-storage-formats.md b/docs/content/comparisons/druid-vs-storage-formats.md deleted file mode 100644 index ac2467ffe38..00000000000 --- a/docs/content/comparisons/druid-vs-storage-formats.md +++ /dev/null @@ -1,30 +0,0 @@ ---- -layout: doc_page ---- - -Druid vs Storage Formats (Parquet/Kudu) -======================================= - -The biggest difference between Druid and existing storage formats is that Druid includes an execution engine that can run -queries, and a time-optimized data management system for advanced data retention and distribution,. -The Druid segment is a custom column format designed for fast aggregates and filters. Below we compare Druid's segment format to -other existing formats. - -## Druid vs Parquet - -Druid's storage format is highly optimized for linear scans. Although Druid has support for nested data, Parquet's storage format is much -more hierachical, and is more designed for binary chunking. In theory, this should lead to faster scans in Druid. - -## Druid vs Kudu - -Kudu's storage format enables single row updates, whereas updates to existing Druid segments requires recreating the segment, hence, -the process for updating old values is slower in Druid. The requirements in Kudu for maintaining extra head space to store -updates as well as organizing data by id instead of time has the potential to introduce some extra latency and accessing -of data that is not need to answer a query at query time. Druid summarizes/rollups up data at ingestion time, which in practice reduces the raw data that needs to be -stored by an average of 40 times, and increases performance of scanning raw data significantly. -This summarization processes loses information about individual events however. Druid segments also contain bitmap indexes for -fast filtering, which Kudu does not currently support. Druid's segment architecture is heavily geared towards fast aggregates and filters, and for OLAP workflows. Appends are very -fast in Druid, whereas updates of older data is slower. This is by design as the data Druid is good far is typically event data, and does not need to be updated too frequently. -Kudu supports arbitrary primary keys with uniqueness constraints, and efficient lookup by ranges of those keys. -Kudu chooses not to include the execution engine, but supports sufficient operations so as to allow node-local processing from the execution engines. -This means that Kudu can support multiple frameworks on the same data (eg MR, Spark, and SQL).