diff --git a/docs/content/comparisons/druid-vs-spark.md b/docs/content/comparisons/druid-vs-spark.md
index c8721dee2a0..9a8cce18675 100644
--- a/docs/content/comparisons/druid-vs-spark.md
+++ b/docs/content/comparisons/druid-vs-spark.md
@@ -5,6 +5,8 @@ layout: doc_page
 Druid vs Spark
 ==============
 
+Druid and Spark are complementary solutions, as Druid can be used to accelerate OLAP queries in Spark.
+
 Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and
 can be viewed as a back-office analytics platform. RDDs enable data reuse by persisting intermediate results
 in memory and enable Spark to provide fast computations for iterative algorithms.
diff --git a/docs/content/comparisons/druid-vs-sql-on-hadoop.md b/docs/content/comparisons/druid-vs-sql-on-hadoop.md
index 94f8f9d5de0..2de6448a3c2 100644
--- a/docs/content/comparisons/druid-vs-sql-on-hadoop.md
+++ b/docs/content/comparisons/druid-vs-sql-on-hadoop.md
@@ -5,11 +5,11 @@ layout: doc_page
 Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
 ===========================================================
 
-Druid is much more complementary to SQL-on-Hadoop engines than it is competitive. SQL-on-Hadoop engines provide an
+SQL-on-Hadoop engines provide an
 execution engine for various data formats and data stores, and many can be made to push down computations down to
 Druid, while providing a SQL interface to Druid.
 
-For a direct comparison between the technologies and when to only use one or the other, the basically comes down to your
+For a direct comparison between the technologies and when to use only one or the other, it basically comes down to your
 product requirements and what the systems were designed to do.
 
 Druid was designed to
@@ -18,9 +18,9 @@ Druid was designed to
 1. ingest data in real-time
 1. handle slice-n-dice style ad-hoc queries
 
-SQL-on-Hadoop engines (as far as we are aware) were designed to replace Hadoop MapReduce with another, faster, query layer
-that is completely generic and plays well with the other ecosystem of Hadoop technologies.
-
+SQL-on-Hadoop engines generally sidestep MapReduce, instead querying data directly from HDFS or, in some cases, other storage systems.
+Some of these engines (including Impala and Presto) can be colocated with HDFS data nodes and coordinate with them to achieve data locality for queries.
+
 What does this mean? We can talk about it in terms of three general areas
 
 1. Queries
@@ -34,9 +34,8 @@ calculates a set of results that are eventually merged at the Broker level. This
 are queries and results, and all computation is done internally as part of the Druid servers.
 
 Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats.
-They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce)
-and they have facilities to cache data locally so that it can be accessed and updated quicker.
-Many SQL-on-Hadoop engines have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still
+They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce).
+Some SQL-on-Hadoop engines (Impala/Presto) have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still
 some latency overhead (e.g.
 serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of
 exactly how much of a performance impact this makes.
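
To make the "SQL interface to Druid" point in the first hunk concrete: here is a minimal sketch of one such integration, using Apache Hive's Druid storage handler. This is illustrative only and not part of the patch; it assumes a Hive build that ships the Druid storage handler and is configured with a Druid Broker address, and the datasource name `wikipedia` and column `channel` are hypothetical.

```sql
-- Hedged sketch: expose an existing Druid datasource as a Hive external table.
-- Assumes the Druid storage handler is on Hive's classpath and that
-- hive.druid.broker.address.default points at a Druid Broker.
-- The datasource "wikipedia" and column "channel" are hypothetical.
CREATE EXTERNAL TABLE druid_wikipedia
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikipedia");

-- Where it can, Hive rewrites this into a native Druid query, so the
-- aggregation is pushed down to Druid rather than computed in Hive.
SELECT `channel`, COUNT(*) AS edits
FROM druid_wikipedia
GROUP BY `channel`;
```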