---
layout: doc_page
---
Druid vs Redshift
=================
###How does Druid compare to Redshift?
Redshift is essentially ParAccel (Actian), which Amazon licenses, so the comparison below is largely between Druid and ParAccel.
Aside from potential performance differences, there are some functional differences:
###Real-time data ingestion
Because Druid is optimized to provide insight into massive quantities of streaming data, it can load and aggregate data in real time.
Traditional data warehouses, including column stores, generally work only with batch ingestion and are not optimal for data that streams in regularly.
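
To make "load and aggregate data in real-time" concrete, here is a minimal Python sketch of aggregating events into time buckets as they arrive. It is an illustration only, not Druid's ingestion code: the `(timestamp, page, bytes)` event shape, the `ingest` function, and the 60-second bucket are all assumptions made for the example.

```python
from collections import defaultdict

def ingest(events, bucket_seconds=60):
    """Aggregate a stream of (timestamp, page, nbytes) events as they arrive.

    Hypothetical helper: rows are keyed by (time bucket, page) and hold
    running aggregates, so the raw events never need to be stored.
    """
    rollup = defaultdict(lambda: {"count": 0, "bytes": 0})
    for ts, page, nbytes in events:
        bucket = ts - ts % bucket_seconds  # truncate timestamp to its bucket start
        row = rollup[(bucket, page)]
        row["count"] += 1
        row["bytes"] += nbytes
    return dict(rollup)

# Four raw events collapse into three aggregated rows.
events = [(0, "a", 10), (30, "a", 5), (61, "a", 7), (59, "b", 1)]
table = ingest(events)
```

Because each event only updates one running row, the aggregated table can be queried at any point while the stream is still being consumed.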
###Druid is a read oriented analytical data store
Druid's write semantics aren't as fluid, and it does not support joins. ParAccel is a full database with SQL support, including joins and insert/update statements.
###Data distribution model
Druid's data distribution is segment-based, and segments live on highly available "deep" storage such as S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of historical nodes does not result in data loss, because new historical nodes can always be brought up by reading data from "deep" storage.
In contrast, ParAccel's data distribution model is hash-based. Expanding the cluster requires re-hashing the data across the nodes, which makes it difficult to do without downtime. Amazon's Redshift works around this issue with a multi-step process:
* set the cluster to read-only mode
* copy data from the cluster to a new cluster that exists in parallel
* redirect traffic to the new cluster
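
The cost difference between the two models can be sketched with a toy Python comparison. This is not how ParAccel, Redshift, or Druid place data internally: the modulo placement, the `rebalance_segments` helper, and the node and segment names are assumptions chosen to show why re-hashing moves most rows while segment rebalancing moves only a few whole segments.

```python
def hash_placement(keys, n_nodes):
    """Toy hash-based placement: row key k lives on node k % n_nodes."""
    return {k: k % n_nodes for k in keys}

def moved_fraction(before, after):
    """Fraction of keys whose node changed between two placements."""
    moved = sum(1 for k in before if before[k] != after[k])
    return moved / len(before)

def rebalance_segments(assignment, new_node):
    """Toy segment rebalance: move whole segments from the fullest node
    to new_node until the new node holds its even share."""
    assignment = {node: list(segs) for node, segs in assignment.items()}
    assignment[new_node] = []
    total = sum(len(segs) for segs in assignment.values())
    target = total // len(assignment)
    moved = 0
    while len(assignment[new_node]) < target:
        donor = max(assignment, key=lambda n: len(assignment[n]))
        assignment[new_node].append(assignment[donor].pop())
        moved += 1
    return assignment, moved

# Hash-based: growing from 4 to 5 nodes relocates most rows.
keys = range(10_000)
hash_moved = moved_fraction(hash_placement(keys, 4), hash_placement(keys, 5))

# Segment-based: 20 segments on 4 nodes, then a fifth node joins.
segments = {f"node{i}": [f"seg{i}_{j}" for j in range(5)] for i in range(4)}
balanced, seg_moved = rebalance_segments(segments, "node4")
```

In this toy setup 80% of the hashed rows change nodes, while only 4 of the 20 segments move. The segment case also never copies data between serving nodes in principle, since a new node can load its segments from "deep" storage.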
###Replication strategy
Druid employs segment-level data distribution, meaning that more nodes can be added and rebalanced without a staged swap. The replication strategy also makes all replicas available for querying.
ParAccel's hash-based distribution generally means that replication is conducted via hot spares. This limits the number of nodes you can lose without losing data, and this replication strategy often does not allow the hot spare to share the query load.
###Indexing strategy
Along with column-oriented structures, Druid uses indexing structures to speed up query execution when a filter is provided. Indexing structures do increase storage overhead (and make mutation more difficult to support), but they can also significantly speed up queries.
ParAccel does not appear to employ indexing strategies.
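
As an illustration of how an index can speed up filtered queries over columnar data, the following Python sketch builds a simple bitmap inverted index. The text above does not name the structure Druid uses, so treat the bitmap approach, the helper names, and the sample columns as assumptions made for the example.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Map each distinct value to an int whose set bits mark matching row ids."""
    index = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return index

def rows_of(bitmap):
    """Expand a bitmap into the sorted list of row ids it marks."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

# Three columns of a tiny, hypothetical table.
country = ["US", "DE", "US", "FR", "DE", "US"]
device = ["web", "web", "ios", "ios", "web", "web"]
clicks = [3, 1, 4, 1, 5, 9]

country_idx = build_bitmap_index(country)
device_idx = build_bitmap_index(device)

# Filter: country = 'US' AND device = 'web' becomes a single bitwise AND.
matching = rows_of(country_idx["US"] & device_idx["web"])
total_clicks = sum(clicks[i] for i in matching)
```

The filter touches only the two bitmaps and then reads `clicks` for the matching rows, instead of scanning every row. The trade-off mentioned above is visible too: the index is extra storage, and changing a row means updating every bitmap that references it.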