address more comments

fjy 2015-11-09 16:56:43 -08:00
parent b99576d854
commit 8a8bb0369e
2 changed files with 8 additions and 8 deletions


@@ -5,6 +5,8 @@ layout: doc_page
Druid vs Spark
==============
Druid and Spark are complementary solutions, as Druid can be used to accelerate OLAP queries in Spark.
Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and
can be viewed as a back-office analytics platform. RDDs enable data reuse by persisting intermediate results
in memory, which lets Spark provide fast computations for iterative algorithms.
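A minimal PySpark sketch of that reuse pattern (the input path, parsing scheme, and iteration count are hypothetical):

```python
# Hypothetical sketch: persist a parsed RDD in memory so an iterative
# computation reuses it instead of re-reading and re-parsing the input
# from HDFS on every pass.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-reuse-sketch")

points = (sc.textFile("hdfs:///data/points.txt")          # hypothetical path
            .map(lambda line: [float(x) for x in line.split(",")])
            .cache())                                      # keep the parsed RDD in memory

center = 0.0
for _ in range(10):
    # Each pass reads the cached RDD from memory rather than from HDFS.
    center = 0.5 * center + 0.5 * points.map(lambda p: p[0]).mean()
```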


@@ -5,11 +5,11 @@ layout: doc_page
Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
===========================================================
SQL-on-Hadoop engines provide an
execution engine for various data formats and data stores, and
many can be made to push computations down to Druid while providing a SQL interface to it.
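As a hypothetical sketch of what such pushdown can look like (the table, dimension, and metric names are made up; the JSON shape follows Druid's documented groupBy query format), an engine-side SQL aggregation can be answered by a single Druid groupBy query instead of a scan of raw data:

```python
# Hypothetical: the SQL a SQL-on-Hadoop engine might accept, and the Druid
# native groupBy query the computation could be pushed down to. Names are
# made up; the JSON shape follows Druid's documented groupBy query format.
sql = """
SELECT page, SUM(added) AS total_added
FROM wikipedia
WHERE "__time" >= '2015-09-12' AND "__time" < '2015-09-13'
GROUP BY page
"""

druid_query = {
    "queryType": "groupBy",
    "dataSource": "wikipedia",
    "granularity": "all",
    "dimensions": ["page"],
    "aggregations": [
        {"type": "longSum", "name": "total_added", "fieldName": "added"}
    ],
    "intervals": ["2015-09-12/2015-09-13"],
}
```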
For a direct comparison between the technologies and when to use only one or the other, it basically comes down to your
product requirements and what the systems were designed to do.
Druid was designed to
@@ -18,9 +18,8 @@ Druid was designed to
1. ingest data in real-time
1. handle slice-n-dice style ad-hoc queries
SQL-on-Hadoop engines generally sidestep Map/Reduce, instead querying data directly from HDFS or, in some cases, other storage systems.
Some of these engines (including Impala and Presto) can be colocated with HDFS data nodes and coordinate with them to achieve data locality for queries.
What does this mean? We can talk about it in terms of three general areas:
1. Queries
@@ -34,9 +33,8 @@ calculates a set of results that are eventually merged at the Broker level. This
are queries and results, and all computation is done internally as part of the Druid servers.
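A minimal sketch of that query-in/results-out interaction (the Broker host and datasource are hypothetical; port 8082 and the `/druid/v2/` endpoint are Druid's documented Broker defaults): the client posts a native query as JSON and receives merged results, with all scatter/gather happening inside the cluster.

```python
# Minimal sketch of the query/result interaction with a Druid Broker.
# The host and datasource are hypothetical; the endpoint and the timeseries
# query shape follow Druid's documented HTTP query API.
import json
from urllib.request import Request, urlopen

query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "hour",
    "aggregations": [{"type": "longSum", "name": "edits", "fieldName": "count"}],
    "intervals": ["2015-09-12/2015-09-13"],
}

req = Request(
    "http://broker.example.com:8082/druid/v2/",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    print(json.loads(resp.read()))  # results arrive already merged by the Broker
```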
Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats.
They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce).
Some SQL-on-Hadoop engines (such as Impala and Presto) have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still
some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
how large this performance impact is.