address more comments

fjy 2015-11-09 16:56:43 -08:00
parent b99576d854
commit 8a8bb0369e
2 changed files with 8 additions and 8 deletions


@@ -5,6 +5,8 @@ layout: doc_page
Druid vs Spark
==============
Druid and Spark are complementary solutions, as Druid can be used to accelerate OLAP queries in Spark.
Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and
can be viewed as a back-office analytics platform. RDDs enable data reuse by persisting intermediate results
in memory, which lets Spark provide fast computations for iterative algorithms.
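A minimal PySpark sketch of that reuse pattern (the input path, parsing scheme, and iteration count are hypothetical):

```python
# Hypothetical sketch: persist a parsed RDD in memory so an iterative
# computation reuses it instead of re-reading and re-parsing the input
# from HDFS on every pass.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-reuse-sketch")

points = (sc.textFile("hdfs:///data/points.txt")          # hypothetical path
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())                                      # keep the parsed RDD in memory

center = 0.0
for _ in range(10):
    # Each pass reads the cached RDD from memory rather than from HDFS.
    center = 0.5 * center + 0.5 * points.map(lambda p: p[0]).mean()
```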


@@ -5,11 +5,11 @@ layout: doc_page
Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
===========================================================
SQL-on-Hadoop engines provide an
execution engine for various data formats and data stores, and
many can be made to push computations down to Druid while providing a SQL interface to it.
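As a hypothetical sketch of what such pushdown can look like (the table, dimension, and metric names are made up; the JSON shape follows Druid's documented groupBy query format), an engine-side SQL aggregation can be answered by a single Druid groupBy query instead of a scan of raw data:

```python
# Hypothetical: the SQL a SQL-on-Hadoop engine might accept, and the Druid
# native groupBy query the computation could be pushed down to. Names are
# made up; the JSON shape follows Druid's documented groupBy query format.
sql = """
SELECT page, SUM(added) AS total_added
FROM wikipedia
WHERE "__time" >= '2015-09-12' AND "__time" < '2015-09-13'
GROUP BY page
"""

druid_query = {
    "queryType": "groupBy",
    "dataSource": "wikipedia",
    "granularity": "all",
    "dimensions": ["page"],
    "aggregations": [
        {"type": "longSum", "name": "total_added", "fieldName": "added"}
    ],
    "intervals": ["2015-09-12/2015-09-13"],
}
```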
For a direct comparison between the technologies and when to use only one or the other, it basically comes down to your
product requirements and what the systems were designed to do.
Druid was designed to
@@ -18,9 +18,8 @@ Druid was designed to
1. ingest data in real-time
1. handle slice-n-dice style ad-hoc queries
SQL-on-Hadoop engines generally sidestep Map/Reduce, instead querying data directly from HDFS or, in some cases, other storage systems.
Some of these engines (including Impala and Presto) can be colocated with HDFS data nodes and coordinate with them to achieve data locality for queries.
What does this mean? We can talk about it in terms of three general areas:
1. Queries
@@ -34,9 +33,8 @@ calculates a set of results that are eventually merged at the Broker level. This
are queries and results, and all computation is done internally as part of the Druid servers.
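A minimal sketch of that query-in/results-out interaction (the Broker host and datasource are hypothetical; port 8082 and the `/druid/v2/` endpoint are Druid's documented Broker defaults): the client posts a native query as JSON and receives merged results, with all scatter/gather happening inside the cluster.

```python
# Minimal sketch of the query/result interaction with a Druid Broker.
# The host and datasource are hypothetical; the endpoint and the timeseries
# query shape follow Druid's documented HTTP query API.
import json
from urllib.request import Request, urlopen

query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "hour",
    "aggregations": [{"type": "longSum", "name": "edits", "fieldName": "count"}],
    "intervals": ["2015-09-12/2015-09-13"],
}

req = Request(
    "http://broker.example.com:8082/druid/v2/",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    print(json.loads(resp.read()))  # results arrive already merged by the Broker
```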
Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats.
They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce).
Some SQL-on-Hadoop engines (such as Impala and Presto) have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still
some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
how large this performance impact is.