mirror of https://github.com/apache/druid.git
address more comments
This commit is contained in:
parent b99576d854
commit 8a8bb0369e

@@ -5,6 +5,8 @@ layout: doc_page

Druid vs Spark
==============

Druid and Spark are complementary solutions, as Druid can be used to accelerate OLAP queries in Spark.
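
Below is a minimal sketch of what that can look like in practice: a Spark driver posts a native Druid query to a Broker over HTTP and distributes the already-aggregated result as an RDD. This is not an official integration; the Broker URL, datasource name, and query are illustrative assumptions only.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

import scala.io.Source

import org.apache.spark.{SparkConf, SparkContext}

object DruidAcceleratedJob {
  // Hypothetical Broker endpoint and datasource; adjust for your cluster.
  val brokerUrl = "http://localhost:8082/druid/v2/"

  // A native Druid timeseries query; Druid returns pre-aggregated rows,
  // so Spark never has to scan the raw events itself.
  val druidQuery =
    """{"queryType":"timeseries","dataSource":"wikipedia","granularity":"hour",
      |"aggregations":[{"type":"longSum","name":"edits","fieldName":"count"}],
      |"intervals":["2015-09-12/2015-09-13"]}""".stripMargin

  def postJson(url: String, body: String): String = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
    Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("druid-accelerated-olap"))
    // The aggregated result set is small, so it can be parallelized into an
    // RDD and joined or enriched with other datasets inside Spark.
    val resultJson = postJson(brokerUrl, druidQuery)
    val rollups = sc.parallelize(Seq(resultJson))
    println(rollups.first().take(200))
    sc.stop()
  }
}
```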

Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and
can be viewed as a back-office analytics platform. RDDs enable data reuse by persisting intermediate results
in memory, which lets Spark provide fast computation for iterative algorithms.
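
As a rough illustration of that reuse (a sketch, not code from the Druid or Spark projects; the input path and computation are made up), caching an RDD keeps the parsed data in memory so repeated passes of an iterative loop avoid re-reading and re-parsing the input:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeReuse {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-reuse"))

    // cache() persists the parsed dataset in memory after the first action,
    // so later iterations skip re-reading and re-parsing the input file.
    val points = sc.textFile("hdfs:///tmp/points.csv") // illustrative path
      .map(_.split(",").map(_.toDouble))
      .cache()

    var weight = 0.0
    for (i <- 1 to 10) {
      // Each pass reuses the in-memory RDD instead of recomputing its lineage.
      weight += points.map(p => p(0) * 0.01).sum() / i
    }
    println(s"final weight: $weight")
    sc.stop()
  }
}
```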

@@ -5,11 +5,11 @@ layout: doc_page

Druid vs SQL-on-Hadoop (Impala/Drill/Spark SQL/Presto)
===========================================================

Druid is much more complementary to SQL-on-Hadoop engines than it is competitive. SQL-on-Hadoop engines provide an
execution engine for various data formats and data stores, and
many can be made to push computations down to Druid, while providing a SQL interface to Druid.

For a direct comparison between the technologies, and for deciding when to use only one or the other, it basically comes down to your
product requirements and what the systems were designed to do.

Druid was designed to

@@ -18,9 +18,8 @@ Druid was designed to

1. ingest data in real-time
1. handle slice-n-dice style ad-hoc queries

SQL-on-Hadoop engines generally sidestep MapReduce, instead querying data directly from HDFS or, in some cases, other storage systems.
Some of these engines (including Impala and Presto) can be colocated with HDFS data nodes and coordinate with them to achieve data locality for queries.

What does this mean? We can talk about it in terms of three general areas:

1. Queries

@@ -34,9 +33,8 @@ calculates a set of results that are eventually merged at the Broker level. This

are queries and results, and all computation is done internally as part of the Druid servers.

Most SQL-on-Hadoop engines are responsible for query planning and execution for underlying storage layers and storage formats.
They are processes that stay on even if there is no query running (eliminating the JVM startup costs from Hadoop MapReduce).
Some SQL-on-Hadoop engines (e.g. Impala and Presto) have daemon processes that can be run where the data is stored, virtually eliminating network transfer costs. There is still
some latency overhead (e.g. serde time) associated with pulling data from the underlying storage layer into the computation layer. We are unaware of exactly
how much of a performance impact this makes.