address more comments

fjy 2015-11-09 16:56:43 -08:00
parent b99576d854
commit 8a8bb0369e
2 changed files with 8 additions and 8 deletions


@@ -5,6 +5,8 @@ layout: doc_page
Druid vs Spark
==============
Druid and Spark are complementary solutions as Druid can be used to accelerate OLAP queries in Spark.
Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and
can be viewed as a back-office analytics platform. RDDs enable data reuse by persisting intermediate results
in memory and enable Spark to provide fast computations for iterative algorithms.


@@ -5,11 +5,11 @@ layout: doc_page
Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
===========================================================
SQL-on-Hadoop engines provide an
execution engine for various data formats and data stores, and
many can be made to push down computations to Druid, while providing a SQL interface to Druid.
For a direct comparison between the technologies and when to only use one or the other, it basically comes down to your
product requirements and what the systems were designed to do.
Druid was designed to
@@ -18,9 +18,8 @@ Druid was designed to
1. ingest data in real-time
1. handle slice-n-dice style ad-hoc queries
SQL-on-Hadoop engines generally sidestep Map/Reduce, instead querying data directly from HDFS or, in some cases, other storage systems.
Some of these engines (including Impala and Presto) can be colocated with HDFS data nodes and coordinate with them to achieve data locality for queries.
What does this mean? We can talk about it in terms of three general areas
1. Queries
@@ -34,9 +33,8 @@ calculates a set of results that are eventually merged at the Broker level. This
are queries and results, and all computation is done internally as part of the Druid servers.
Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats.
They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce).
Some (Impala/Presto) SQL-on-Hadoop engines have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still
some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
how much of a performance impact this makes.
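The queries-and-results interaction with a Druid Broker described above can be sketched as a native JSON query. The datasource, interval, and metric names below are hypothetical; the query shape follows Druid's native timeseries query:

```python
import json

def build_timeseries_query(datasource, interval, metric):
    """Build a minimal Druid native timeseries query.

    The structure follows Druid's native query API; the datasource and
    metric names used by the caller are hypothetical examples.
    """
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "hour",
        "intervals": [interval],
        "aggregations": [
            {"type": "longSum", "name": metric, "fieldName": metric}
        ],
    }

query = build_timeseries_query("edits", "2015-11-01/2015-11-09", "count")

# The Broker accepts this JSON over HTTP POST (by default at /druid/v2);
# all computation happens inside the Druid cluster and only results come
# back, e.g.: requests.post("http://broker:8082/druid/v2", json=query)
print(json.dumps(query, indent=2))
```

This is the sense in which the only input and output of a Druid cluster are queries and results: the client never touches the underlying storage or segment files.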