Evaluate Druid

This page helps you evaluate Druid by answering the questions that come up most often.

Evaluating on a Single Machine

Most of the tutorials run multiple Druid services on a single machine in order to teach basic Druid concepts and work out kinks in data ingestion. The tutorial configurations are poor choices for an actual production cluster.

Capacity and Cost Planning

The best way to understand what your cluster will cost is to first understand how much data reduction you get when you create segments. We recommend indexing 1 GB of your data, creating segments from it, and measuring the resulting segment size. This shows how much your data rolls up and how many segments can be loaded on the hardware you have at your disposal.
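The extrapolation from a sample can be sketched in a few lines of Python. This is only a rough estimate under stated assumptions: it assumes the reduction ratio measured on the sample holds for the full data set (roll-up often changes with data volume and time granularity), and the function names, the `replication` parameter, and the per-node segment capacity are hypothetical inputs for illustration, not Druid settings.

```python
import math

GB = 10**9  # decimal gigabyte, used only to keep the example readable


def estimate_segment_footprint(raw_sample_bytes, segment_bytes,
                               total_raw_bytes, replication=1):
    """Extrapolate total segment size from a measured sample.

    Assumes (hypothetically) that the reduction ratio observed on the
    sample applies linearly to the whole data set.
    """
    if raw_sample_bytes <= 0 or segment_bytes <= 0:
        raise ValueError("sample and segment sizes must be positive")
    # How many bytes of raw data collapse into one byte of segment data.
    reduction_ratio = raw_sample_bytes / segment_bytes
    projected_segment_bytes = total_raw_bytes / reduction_ratio
    # Each replica of a segment occupies space on some historical node.
    return reduction_ratio, projected_segment_bytes * replication


def historicals_needed(total_segment_bytes, segment_capacity_per_node):
    """Number of nodes required to hold all segments (rounded up)."""
    return math.ceil(total_segment_bytes / segment_capacity_per_node)


# Example: a 1 GB sample produced 200 MB of segments; the full data set
# is 1000 GB, kept with 2 replicas, on nodes that can hold 300 GB each.
ratio, footprint = estimate_segment_footprint(
    raw_sample_bytes=1 * GB,
    segment_bytes=200 * 10**6,
    total_raw_bytes=1000 * GB,
    replication=2,
)
nodes = historicals_needed(footprint, 300 * GB)
print(f"reduction {ratio:.1f}x, {footprint / GB:.0f} GB of segments, {nodes} nodes")
```

Treat the result as a starting point for sizing and repeat the measurement on a larger sample before committing to hardware.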

Most of the cost of a Druid cluster is in historical nodes, followed by real-time indexing nodes if you have a high data intake. For high availability, you should have backup coordination nodes (coordinators and overlords). Coordination nodes can run on much cheaper hardware than the nodes that serve queries.

Selecting Hardware

Druid is designed to run on commodity hardware. We provide general guidelines on how to tune it for various deployments, as well as example hardware specs for a production cluster.

Benchmarking Druid

The best way to benchmark Druid is to follow the steps outlined in our blog post on the topic. The code to reproduce the results in the blog post is open source. The blog post covers Druid queries on TPC-H data, but you should be able to adapt the configuration parameters to your own data set. The post is somewhat outdated and uses an older version of Druid, but it is still mostly relevant as a demonstration of performance.

Colocating Druid Processes for a POC

Not all Druid node processes need to run on separate machines. You can set up a small cluster with colocated processes to load several gigabytes of data.

For an actual production setup, we recommend following the example production configuration instead. For a proof of concept, the processes can be laid out on three nodes as follows:

  1. node1: Coordinator + metadata store + ZooKeeper
  2. node2: Broker + Historical
  3. node3: Overlord

The coordination pieces (coordinator, metadata store, ZooKeeper) can be colocated on the same node. These processes do not require many resources, even for reasonably large clusters.

The query pieces (broker + historical) can be colocated. You can add more of these nodes if your data doesn't fit on a single machine. Make sure to allocate enough heap and off-heap (direct) memory to both processes, since they share the machine's RAM.
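A quick way to sanity-check the memory budget of a colocated broker and historical is sketched below. It relies on an assumption that is not stated on this page: that each process needs direct memory of roughly `buffer size * (processing threads + merge buffers + 1)`, a rule of thumb from Druid's configuration documentation that may differ in your version. The class, field names, and sample numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

MiB = 1024**2
GiB = 1024**3


@dataclass
class ProcessMemory:
    """Hypothetical description of one Druid process's memory settings."""
    name: str
    heap_bytes: int          # JVM heap (-Xmx)
    buffer_size_bytes: int   # size of one processing buffer
    num_threads: int         # number of processing threads
    num_merge_buffers: int = 0

    def direct_bytes(self) -> int:
        # Assumed rule of thumb: one buffer per processing thread,
        # one per merge buffer, plus one spare.
        return self.buffer_size_bytes * (
            self.num_threads + self.num_merge_buffers + 1
        )

    def total_bytes(self) -> int:
        return self.heap_bytes + self.direct_bytes()


def fits_on_machine(processes, machine_ram_bytes, reserved_bytes=0):
    """Return (fits, needed_bytes) for the colocated processes.

    reserved_bytes is RAM left aside for the OS and the page cache
    that memory-mapped segments depend on.
    """
    needed = sum(p.total_bytes() for p in processes)
    return needed <= machine_ram_bytes - reserved_bytes, needed


# Example values only; tune them to your own hardware and workload.
broker = ProcessMemory("broker", 4 * GiB, 256 * MiB,
                       num_threads=1, num_merge_buffers=2)
historical = ProcessMemory("historical", 2 * GiB, 256 * MiB,
                           num_threads=7, num_merge_buffers=2)

fits, needed = fits_on_machine([broker, historical],
                               machine_ram_bytes=16 * GiB,
                               reserved_bytes=4 * GiB)
print(f"need {needed / GiB:.1f} GiB, fits: {fits}")
```

If the check fails, either shrink the heap or buffer settings or move one of the processes to its own node.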

For small ingest workloads, you can run the overlord in local mode to load your data.