--- layout: doc_page --- # Clustering Druid is designed to be deployed as a scalable, fault-tolerant cluster. In this document, we'll set up a simple cluster and discuss how it can be further configured to meet your needs. This simple cluster will feature scalable, fault-tolerant servers for Historicals and MiddleManagers, and a single coordination server to host the Coordinator and Overlord processes. In production, we recommend deploying Coordinators and Overlords in a fault-tolerant configuration as well. ## Select hardware The Coordinator and Overlord processes can be co-located on a single server that is responsible for handling the metadata and coordination needs of your cluster. The equivalent of an AWS [m3.xlarge](https://aws.amazon.com/ec2/instance-types/#M3) is sufficient for most clusters. This hardware offers: - 4 vCPUs - 15 GB RAM - 80 GB SSD storage Historicals and MiddleManagers can be colocated on a single server to handle the actual data in your cluster. These servers benefit greatly from CPU, RAM, and SSDs. The equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3) is a good starting point. This hardware offers: - 8 vCPUs - 61 GB RAM - 160 GB SSD storage Druid Brokers accept queries and farm them out to the rest of the cluster. They also optionally maintain an in-memory query cache. These servers benefit greatly from CPU and RAM, and can also be deployed on the equivalent of an AWS [r3.2xlarge](https://aws.amazon.com/ec2/instance-types/#r3). This hardware offers: - 8 vCPUs - 61 GB RAM - 160 GB SSD storage You can consider co-locating any open source UIs or query libraries on the same server that the Broker is running on. Very large clusters should consider selecting larger servers. ## Select OS We recommend running your favorite Linux distribution. You will also need: * Java 8 Your OS package manager should be able to help for both Java. If your Ubuntu-based OS does not have a recent enough version of Java, WebUpd8 offers [packages for those OSes](http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html). ## Download the distribution First, download and unpack the release archive. It's best to do this on a single machine at first, since you will be editing the configurations and then copying the modified distribution out to all of your servers. ```bash curl -O http://static.druid.io/artifacts/releases/druid-#{DRUIDVERSION}-bin.tar.gz tar -xzf druid-#{DRUIDVERSION}-bin.tar.gz cd druid-#{DRUIDVERSION} ``` In this package, you'll find: * `LICENSE` - the license files. * `bin/` - scripts related to the [single-machine quickstart](quickstart.html). * `conf/*` - template configurations for a clustered setup. * `conf-quickstart/*` - configurations for the [single-machine quickstart](quickstart.html). * `extensions/*` - all Druid extensions. * `hadoop-dependencies/*` - Druid Hadoop dependencies. * `lib/*` - all included software packages for core Druid. * `quickstart/*` - files related to the [single-machine quickstart](quickstart.html). We'll be editing the files in `conf/` in order to get things running. ## Configure deep storage Druid relies on a distributed filesystem or large object (blob) store for data storage. The most commonly used deep storage implementations are S3 (popular for those on AWS) and HDFS (popular if you already have a Hadoop deployment). ### S3 In `conf/druid/_common/common.runtime.properties`, - Set `druid.extensions.loadList=["druid-s3-extensions"]`. - Comment out the configurations for local storage under "Deep Storage" and "Indexing service logs". - Uncomment and configure appropriate values in the "For S3" sections of "Deep Storage" and "Indexing service logs". After this, you should have made the following changes: ``` druid.extensions.loadList=["druid-s3-extensions"] #druid.storage.type=local #druid.storage.storageDirectory=var/druid/segments druid.storage.type=s3 druid.storage.bucket=your-bucket druid.storage.baseKey=druid/segments druid.s3.accessKey=... druid.s3.secretKey=... #druid.indexer.logs.type=file #druid.indexer.logs.directory=var/druid/indexing-logs druid.indexer.logs.type=s3 druid.indexer.logs.s3Bucket=your-bucket druid.indexer.logs.s3Prefix=druid/indexing-logs ``` ### HDFS In `conf/druid/_common/common.runtime.properties`, - Set `druid.extensions.loadList=["druid-hdfs-storage"]`. - Comment out the configurations for local storage under "Deep Storage" and "Indexing service logs". - Uncomment and configure appropriate values in the "For HDFS" sections of "Deep Storage" and "Indexing service logs". After this, you should have made the following changes: ``` druid.extensions.loadList=["druid-hdfs-storage"] #druid.storage.type=local #druid.storage.storageDirectory=var/druid/segments druid.storage.type=hdfs druid.storage.storageDirectory=/druid/segments #druid.indexer.logs.type=file #druid.indexer.logs.directory=var/druid/indexing-logs druid.indexer.logs.type=hdfs druid.indexer.logs.directory=/druid/indexing-logs ``` Also, - Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into `conf/druid/_common/`. ## Configure Tranquility Server (optional) Data streams can be sent to Druid through a simple HTTP API powered by Tranquility Server. If you will be using this functionality, then at this point you should [configure Tranquility Server](../ingestion/stream-ingestion.html#server). ## Configure Tranquility Kafka (optional) Druid can consuming streams from Kafka through Tranquility Kafka. If you will be using this functionality, then at this point you should [configure Tranquility Kafka](../ingestion/stream-ingestion.html#kafka). ## Configure for connecting to Hadoop (optional) If you will be loading data from a Hadoop cluster, then at this point you should configure Druid to be aware of your cluster: - Update `druid.indexer.task.hadoopWorkingPath` in `conf/middleManager/runtime.properties` to a path on HDFS that you'd like to use for temporary files required during the indexing process. `druid.indexer.task.hadoopWorkingPath=/tmp/druid-indexing` is a common choice. - Place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into `conf/druid/_common/core-site.xml`, `conf/druid/_common/hdfs-site.xml`, and so on. Note that you don't need to use HDFS deep storage in order to load data from Hadoop. For example, if your cluster is running on Amazon Web Services, we recommend using S3 for deep storage even if you are loading data using Hadoop or Elastic MapReduce. For more info, please see [batch ingestion](../ingestion/batch-ingestion.html). ## Configure addresses for Druid coordination In this simple cluster, you will deploy a single Druid Coordinator, a single Druid Overlord, a single ZooKeeper instance, and an embedded Derby metadata store on the same server. In `conf/druid/_common/common.runtime.properties`, replace "zk.service.host" with the address of the machine that runs your ZK instance: - `druid.zk.service.host` In `conf/druid/_common/common.runtime.properties`, replace "metadata.storage.*" with the address of the machine that you will use as your metadata store: - `druid.metadata.storage.connector.connectURI` - `druid.metadata.storage.connector.host`