---
layout: doc_page
title: "Druid vs Redshift"
---
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Druid vs Redshift
### How does Druid compare to Redshift?
Redshift started out as ParAccel (Actian), which Amazon licenses and has since heavily modified.
Aside from potential performance differences, the two systems differ functionally in several ways:
### Real-time data ingestion
Because Druid is optimized to provide insight into massive quantities of streaming data, it can load and aggregate data in real time.
Traditional data warehouses, including column stores, generally work only with batch ingestion and are not well suited to data that streams in continuously.
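
Loading and aggregating at the same time can be pictured with a small sketch. This is an illustration only, not Druid's ingestion code: the `StreamingAggregator` class, the one-minute granularity, and the event fields (`timestamp`, `page`, `bytes`) are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def truncate_to_minute(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its minute."""
    return ts.replace(second=0, microsecond=0)


class StreamingAggregator:
    """Hypothetical sketch: aggregate events as they arrive, one at a time."""

    def __init__(self):
        # One aggregated row per (minute, page) pair.
        self.rows = defaultdict(lambda: {"count": 0, "bytes": 0})

    def ingest(self, event: dict) -> None:
        # The event is folded into its aggregate row immediately,
        # so it is visible to queries without waiting for a batch load.
        key = (truncate_to_minute(event["timestamp"]), event["page"])
        row = self.rows[key]
        row["count"] += 1
        row["bytes"] += event["bytes"]

    def total_for_page(self, page: str) -> dict:
        """Sum the aggregated rows for one page across all minutes."""
        total = {"count": 0, "bytes": 0}
        for (_, row_page), row in self.rows.items():
            if row_page == page:
                total["count"] += row["count"]
                total["bytes"] += row["bytes"]
        return total
```

In this sketch, two events for the same page that fall within the same minute collapse into a single stored row, and a query issued between the two `ingest` calls already sees the first event.
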
### Druid is a read oriented analytical data store
Druid's write semantics are not as fluid, and Druid does not support full joins (we support large table to small table joins). Redshift provides full SQL support, including joins and insert/update statements.
### Data distribution model
Druid's data distribution is segment-based and leverages highly available "deep" storage such as S3 or HDFS. Scaling up (or down) does not require massive copy actions or downtime; in fact, losing any number of Historical processes does not result in data loss, because new Historical processes can always be brought up by reading data from "deep" storage.
In contrast, ParAccel's data distribution model is hash-based. Expanding the cluster requires re-hashing the data across the nodes, which makes it difficult to perform without downtime. Amazon's Redshift works around this issue with a multi-step process:
* set cluster into read-only mode
* copy data from cluster to new cluster that exists in parallel
* redirect traffic to new cluster
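
The difference between the two distribution models can be sketched in a few lines. This is a rough illustration under stated assumptions, not either system's real algorithm: the hash-based side uses naive modulo placement (ParAccel's actual scheme may differ), and `rebalance_segments` is a hypothetical greedy balancer rather than Druid's own logic.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic hash, so results do not vary between runs."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def hash_placement(keys, node_count):
    """Assumed model: each row key lives on node hash(key) % node_count."""
    return {k: stable_hash(k) % node_count for k in keys}


def moved_fraction(keys, old_nodes, new_nodes):
    """Fraction of keys whose node changes when the cluster is resized."""
    before = hash_placement(keys, old_nodes)
    after = hash_placement(keys, new_nodes)
    return sum(1 for k in keys if before[k] != after[k]) / len(keys)


def rebalance_segments(assignment, new_node):
    """Hypothetical greedy balancer: hand whole segments to a new node.

    `assignment` maps node name -> list of segment ids. Only the segments
    returned in `moved` change owner; every other segment stays in place.
    """
    assignment = {node: list(segs) for node, segs in assignment.items()}
    assignment[new_node] = []
    total = sum(len(segs) for segs in assignment.values())
    target = total // len(assignment)
    moved = []
    while len(assignment[new_node]) < target:
        # Take one segment from whichever node currently holds the most.
        donor = max(assignment, key=lambda node: len(assignment[node]))
        segment = assignment[donor].pop()
        assignment[new_node].append(segment)
        moved.append(segment)
    return assignment, moved
```

Under the modulo assumption, growing a cluster from four to five nodes relocates most keys, because a key stays put only when its hash gives the same remainder for both node counts. In the segment sketch, adding a fifth node to four nodes holding five segments each moves only four segments, and the rest keep serving queries where they are.
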
### Replication strategy
Druid employs segment-level data distribution, meaning that more processes can be added and rebalanced without having to perform a staged swap. The replication strategy also makes all replicas available for querying. Replication is done automatically and without any impact on performance.
ParAccel's hash-based distribution generally means that replication is conducted via hot spares. This puts a numerical limit on the number of nodes you can lose without losing data, and this replication strategy often does not allow the hot spare to help share query load.
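
To illustrate what "all replicas available for querying" means in practice, here is a minimal sketch. The `ReplicatedSegments` class, its round-robin routing, and the node names are hypothetical; they are not taken from Druid's code.

```python
import itertools
from collections import defaultdict


class ReplicatedSegments:
    """Hypothetical sketch: every live replica of a segment serves queries."""

    def __init__(self, replicas):
        # replicas maps segment id -> list of nodes holding a copy.
        self.replicas = replicas
        self.down = set()
        self.served = defaultdict(int)
        self._counters = {seg: itertools.count() for seg in replicas}

    def route(self, segment):
        """Pick a live replica in round-robin order and record the hit."""
        live = [n for n in self.replicas[segment] if n not in self.down]
        if not live:
            raise LookupError(f"no live replica for {segment}")
        node = live[next(self._counters[segment]) % len(live)]
        self.served[node] += 1
        return node

    def fail(self, node):
        """Mark a node as lost; its replicas stop receiving queries."""
        self.down.add(node)
```

While both copies are up, queries for a segment alternate between them, so replication adds query capacity as well as safety. After `fail` is called on one node, the remaining copy absorbs the traffic. A hot spare, by contrast, would sit idle until the failure.
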
### Indexing strategy
Along with column-oriented structures, Druid uses indexing structures to speed up query execution when a filter is provided. Indexing structures do increase storage overhead (and make it more difficult to allow for mutation), but they also significantly speed up queries.
ParAccel does not appear to employ indexing strategies.
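
One common indexing structure of this kind is an inverted bitmap index, sketched below as an illustration of the trade-off. It uses plain Python integers as uncompressed bitmaps and is not Druid's implementation; the column names and values are made up.

```python
def build_bitmap_index(column):
    """Map each distinct value to a bitmap with bit i set if row i has it."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def rows_matching(bitmap):
    """Return the row ids whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def filtered_sum(metric, bitmap):
    """Sum a metric column over only the rows selected by the bitmap."""
    return sum(metric[i] for i in rows_matching(bitmap))


# Hypothetical columns of a five-row table.
country = ["US", "CA", "US", "DE", "CA"]
device = ["mobile", "desktop", "mobile", "mobile", "desktop"]
clicks = [3, 1, 4, 1, 5]

country_index = build_bitmap_index(country)
device_index = build_bitmap_index(device)

# WHERE country = 'US' AND device = 'mobile' becomes a bitwise AND.
selected = country_index["US"] & device_index["mobile"]
total_clicks = filtered_sum(clicks, selected)  # rows 0 and 2 -> 3 + 4 = 7
```

The filter is resolved by combining bitmaps rather than scanning every row, which is where the speed-up comes from. The cost is also visible: each index is extra data stored alongside the column, and changing a value in place would mean updating the bitmaps too.
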