mirror of https://github.com/apache/druid.git
remove unneeded
This commit is contained in: parent 8a8bb0369e, commit 46bf1ba5ef
@@ -1,25 +0,0 @@
---
layout: doc_page
---

Druid vs Hadoop (HDFS/MapReduce)
================================

Hadoop has shown the world that it’s possible to house your data warehouse on commodity hardware for a fraction of the price
of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things.

1. They can now query all of their data in a fairly flexible manner and answer any question they have
2. The queries take a long time

The first is the joy that everyone feels the first time they get Hadoop running. The second is what they realize after they have used Hadoop interactively for a while: Hadoop is optimized for throughput, not latency.

Druid is a complementary addition to Hadoop. Hadoop is great at storing and making accessible large amounts of individually low-value data.
Unfortunately, Hadoop is not great at providing query speed guarantees on top of that data, nor does it have very good
operational characteristics for a customer-facing production system. Druid, on the other hand, excels at taking high-value
summaries of the low-value data on Hadoop, making it available in a fast and always-on fashion, such that it could be exposed directly to a customer.

Druid also requires some infrastructure to exist for [deep storage](../dependencies/deep-storage.html).
HDFS is one of the implemented options for this [deep storage](../dependencies/deep-storage.html).

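
As a concrete illustration, wiring Druid to HDFS deep storage involves properties along these lines in the common runtime configuration (the storage directory is a placeholder; see the linked deep storage page for the authoritative settings):

```properties
# Load the HDFS deep-storage extension and point segment storage at HDFS.
druid.extensions.loadList=["druid-hdfs-storage"]
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
```
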
Please note that we are only comparing Druid to base Hadoop here; we welcome comparisons of Druid vs. other systems or combinations of systems in the Hadoop ecosystem.

@@ -1,30 +0,0 @@
---
layout: doc_page
---

Druid vs Storage Formats (Parquet/Kudu)
=======================================

The biggest difference between Druid and existing storage formats is that Druid includes an execution engine that can run queries, along with a time-optimized data management system for advanced data retention and distribution. The Druid segment is a custom column format designed for fast aggregates and filters. Below we compare Druid's segment format to other existing formats.
## Druid vs Parquet
Druid's storage format is highly optimized for linear scans. Although Druid has support for nested data, Parquet's storage format is much more hierarchical and is designed more for binary chunking. In theory, this should lead to faster scans in Druid.
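
A toy sketch (plain Python, not Druid's actual segment code) of why linear scans favor a columnar layout: aggregating one metric touches a single flat array instead of every full row.

```python
# Toy illustration, not Druid's segment format: aggregating one field
# from row-oriented storage walks every full record, while a columnar
# layout scans a single contiguous array.

rows = [{"ts": i, "page": "a", "clicks": i % 3} for i in range(1000)]

# Row-oriented: visit every dict just to read one field.
row_sum = sum(r["clicks"] for r in rows)

# Column-oriented: the "clicks" column is one flat list.
columns = {"clicks": [r["clicks"] for r in rows]}
col_sum = sum(columns["clicks"])

assert row_sum == col_sum == 999
```
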
## Druid vs Kudu
Kudu's storage format enables single-row updates, whereas updates to existing Druid segments require recreating the segment, so the process of updating old values is slower in Druid. Kudu's requirement to maintain extra head space for storing updates, and its organization of data by id instead of time, can introduce extra latency and cause data that is not needed to answer a query to be accessed at query time. Druid summarizes/rolls up data at ingestion time, which in practice reduces the raw data that needs to be stored by an average factor of 40, and significantly increases the performance of scanning raw data. This summarization process loses information about individual events, however. Druid segments also contain bitmap indexes for fast filtering, which Kudu does not currently support. Druid's segment architecture is heavily geared towards fast aggregates and filters, and towards OLAP workflows. Appends are very fast in Druid, whereas updates of older data are slower. This is by design, as the data Druid is good for is typically event data that does not need to be updated too frequently.
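
The rollup described above can be sketched in a few lines of plain Python (the dimension and metric names here are illustrative, not Druid APIs); the actual storage savings depend on how many events share a timestamp bucket and dimension values:

```python
from collections import defaultdict

# Hedged sketch of ingestion-time rollup: events that share the same
# dimension value within the same truncated timestamp bucket collapse
# into one pre-aggregated row.

def rollup(events, granularity=3600):
    aggregated = defaultdict(int)
    for ts, page, clicks in events:
        bucket = ts - ts % granularity        # truncate timestamp to the bucket
        aggregated[(bucket, page)] += clicks  # sum the metric per (time, dims)
    return [(b, p, total) for (b, p), total in sorted(aggregated.items())]

events = [(10, "home", 1), (20, "home", 1), (30, "about", 1), (3700, "home", 1)]
print(rollup(events))
# → [(0, 'about', 1), (0, 'home', 2), (3600, 'home', 1)]
```

The two "home" events in the first hour collapse into a single stored row; queries then scan the pre-aggregated rows instead of the raw events.
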
Kudu supports arbitrary primary keys with uniqueness constraints, and efficient lookup by ranges of those keys.
Kudu chooses not to include an execution engine, but supports sufficient operations to allow node-local processing by external execution engines.
This means that Kudu can support multiple frameworks on the same data (e.g., MapReduce, Spark, and SQL).
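
The bitmap indexes mentioned earlier can be sketched very loosely (outside Druid, which uses compressed bitmap formats) as one bit per row per dimension value; combining two values' bitmaps answers a filter without scanning the rows themselves:

```python
# Toy bitmap index (illustrative; real systems use compressed bitmaps):
# one Python int per dimension value, with bit i set when row i holds
# that value. Combining filters becomes bitwise AND/OR on integers.

rows = ["us", "us", "de", "us", "de", "fr"]

index = {}
for i, country in enumerate(rows):
    index[country] = index.get(country, 0) | (1 << i)

def matching_rows(bitmap):
    # Decode the set bits back into row ids.
    return [i for i in range(len(rows)) if bitmap >> i & 1]

# Filter: country = 'us' OR country = 'fr'
print(matching_rows(index["us"] | index["fr"]))
# → [0, 1, 3, 5]
```
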