diff --git a/docs/content/About-Experimental-Features.md b/docs/content/About-Experimental-Features.md index aeaef1c5c7a..72244df48e1 100644 --- a/docs/content/About-Experimental-Features.md +++ b/docs/content/About-Experimental-Features.md @@ -1,4 +1,4 @@ --- layout: doc_page --- -Experimental features are features we have developed but have not fully tested in a production environment. If you choose to try them out, there will likely to edge cases that we have not covered. We would love feedback on any of these features, whether they are bug reports, suggestions for improvements, or letting us know they work as intended. \ No newline at end of file +Experimental features are features we have developed but have not fully tested in a production environment. If you choose to try them out, there will likely be edge cases that we have not covered. We would love feedback on any of these features, whether they are bug reports, suggestions for improvement, or letting us know they work as intended. \ No newline at end of file diff --git a/docs/content/Batch-ingestion.md b/docs/content/Batch-ingestion.md index 4edab47866a..534e04fdb42 100644 --- a/docs/content/Batch-ingestion.md +++ b/docs/content/Batch-ingestion.md @@ -27,8 +27,8 @@ The interval is the [ISO8601 interval](http://en.wikipedia.org/wiki/ISO_8601#Tim { "dataSource": "the_data_source", "timestampSpec" : { - "timestampColumn": "ts", - "timestampFormat": "" + "column": "ts", + "format": "" }, "dataSpec": { "format": "", @@ -188,8 +188,8 @@ The schema of the Hadoop Index Task contains a task "type" and a Hadoop Index Co "config": { "dataSource" : "example", "timestampSpec" : { - "timestampColumn" : "timestamp", - "timestampFormat" : "auto" + "column" : "timestamp", + "format" : "auto" }, "dataSpec" : { "format" : "json", diff --git a/docs/content/Booting-a-production-cluster.md b/docs/content/Booting-a-production-cluster.md index 9b7ff40b00e..86d63ba6922 100644 --- a/docs/content/Booting-a-production-cluster.md +++ b/docs/content/Booting-a-production-cluster.md @@ -66,14 +66,10 @@ You can then use the EC2 dashboard to locate the instance and confirm that it ha If both the instance and the Druid cluster launch successfully, a few minutes later other messages to STDOUT should follow with information returned from EC2, including the instance ID: -<<<<<<< HEAD -# Apache Whirr -======= Started cluster of 1 instances Cluster{instances=[Instance{roles=[zookeeper, druid-mysql, druid-master, druid-broker, druid-compute, druid-realtime], publicIp= ... The final message will contain login information for the instance. ->>>>>>> master Note that the Whirr will return an exception if any of the nodes fail to launch, and the cluster will be destroyed.
To destroy the cluster manually, run the following command: diff --git a/docs/content/Historical.md b/docs/content/Historical.md index 361098be512..5f0b6d628bc 100644 --- a/docs/content/Historical.md +++ b/docs/content/Historical.md @@ -31,7 +31,7 @@ druid.zk.service.host=localhost druid.server.maxSize=10000000000 # Change these to make Druid faster -druid.processing.buffer.sizeBytes=10000000 +druid.processing.buffer.sizeBytes=100000000 druid.processing.numThreads=1 druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize"\: 10000000000}] diff --git a/docs/content/Ingestion-FAQ.md b/docs/content/Ingestion-FAQ.md new file mode 100644 index 00000000000..5fdb29733c4 --- /dev/null +++ b/docs/content/Ingestion-FAQ.md @@ -0,0 +1,38 @@ +--- +layout: doc_page +--- +## Where do my Druid segments end up after ingestion? + +Depending on what `druid.storage.type` is set to, Druid will upload segments to some [Deep Storage](Deep-Storage.html). Local disk is used as the default deep storage. + +## My realtime node is not handing segments off + +Make sure that the `druid.publish.type` on your real-time nodes is set to "db". Also make sure that `druid.storage.type` is set to a deep storage that makes sense. Some example configs: + +``` +druid.publish.type=db + +druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid +druid.db.connector.user=druid +druid.db.connector.password=diurd + +druid.storage.type=s3 +druid.storage.bucket=druid +druid.storage.baseKey=sample +``` + +## I don't see my Druid segments on my historical nodes +You can check the coordinator console located at :/cluster.html. Make sure that your segments have actually loaded on [historical nodes](Historical.html). If your segments are not present, check the coordinator logs for messages about capacity or replication errors. One reason segments are not downloaded is that historical nodes have maxSizes that are too small, making them incapable of downloading more data. You can change that with (for example): + +``` +-Ddruid.segmentCache.locations=[{"path":"/tmp/druid/storageLocation","maxSize":"500000000000"}] +-Ddruid.server.maxSize=500000000000 +``` + +## My queries are returning empty results + +You can check :/druid/v2/datasources/ for the dimensions and metrics that have been created for your datasource. Make sure that the names of the aggregators you use in your query match these metrics. Also make sure that the query interval you specify matches a valid time range where data exists. + +## More information + +Getting data into Druid can definitely be difficult for first-time users. Please don't hesitate to ask questions in our IRC channel or on our [Google Groups page](https://groups.google.com/forum/#!forum/druid-development). diff --git a/docs/content/Performance-FAQ.md b/docs/content/Performance-FAQ.md new file mode 100644 index 00000000000..8bc696d6825 --- /dev/null +++ b/docs/content/Performance-FAQ.md @@ -0,0 +1,18 @@ +--- +layout: doc_page +--- + +## What should I set my JVM heap to? +The size of the JVM heap really depends on the type of Druid node you are running. Below are a few considerations. + +[Broker nodes](Broker.html) can use the JVM heap as a query cache and thus, the size of the heap will affect the number of results that can be cached. Broker nodes do not require off-heap memory and generally, heap sizes can be set close to the maximum memory on the machine (leaving some room for JVM overhead).
The heap is used to merge results from different real-time and historical nodes, along with other computational processing. + +[Historical nodes](Historical.html) use off-heap memory to store intermediate results, and by default, all segments are memory mapped before they can be queried. The more off-heap memory is available, the more segments can be served without the possibility of data being paged onto disk. On historicals, the JVM heap is used for [GroupBy queries](GroupByQuery.html), some data structures used for intermediate computation, and general processing. + +[Coordinator nodes](Coordinator.html) do not require off-heap memory and the heap is used for loading information about all segments to determine what segments need to be loaded, dropped, moved, or replicated. + +## What is the intermediate computation buffer? +The intermediate computation buffer specifies a buffer size for the storage of intermediate results. The computation engine in both the Historical and Realtime nodes will use a scratch buffer of this size to do all of their intermediate computations off-heap. Larger values allow for more aggregations in a single pass over the data, while smaller values can require more passes depending on the query that is being executed. The default size is 1073741824 bytes (1GB). + +## What is server maxSize? +Server maxSize sets the maximum cumulative segment size (in bytes) that a node can hold. Changing this parameter will affect performance by controlling the memory/disk ratio on a node. Setting this parameter to a value greater than the total memory capacity on a node may cause disk paging to occur. This paging introduces query latency. \ No newline at end of file diff --git a/docs/content/Realtime.md b/docs/content/Realtime.md index f0d49dee466..93923b05be6 100644 --- a/docs/content/Realtime.md +++ b/docs/content/Realtime.md @@ -42,7 +42,7 @@ druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid druid.db.connector.user=druid druid.db.connector.password=diurd -druid.processing.buffer.sizeBytes=268435456 +druid.processing.buffer.sizeBytes=100000000 ``` The realtime module also uses several of the default modules in [Configuration](Configuration.html). For more information on the realtime spec file (or configuration file), see [realtime ingestion](Realtime-ingestion.html) page. diff --git a/docs/content/Tutorial:-All-About-Queries.md b/docs/content/Tutorial:-All-About-Queries.md index ab3135ecea1..2e275bf5131 100644 --- a/docs/content/Tutorial:-All-About-Queries.md +++ b/docs/content/Tutorial:-All-About-Queries.md @@ -194,6 +194,12 @@ Which gets us metrics about only those edits where the namespace is 'article': Check out [Filters](Filters.html) for more information. +What Types of Queries to Use +---------------------------- + +The type of query you should use depends on your use case. [TimeBoundary queries](TimeBoundaryQuery.html) are useful for understanding the time range of your data. [Timeseries queries](TimeseriesQuery.html) are useful for aggregates and filters over a time range, and offer significant speed improvements over [GroupBy queries](GroupByQuery.html). To find the top values for a given dimension, [TopN queries](TopNQuery.html) should likewise be preferred over GroupBy queries (a minimal example query is sketched below). + + ## Learn More ## You can learn more about querying at [Querying](Querying.html)! If you are ready to evaluate Druid more in depth, check out [Booting a production cluster](Booting-a-production-cluster.html)!
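To make the query-type guidance above concrete, here is a minimal sketch of a [TimeBoundary query](TimeBoundaryQuery.html). The `wikipedia` datasource name is assumed from the other tutorials in this changeset; the query shape itself is the minimal form described in the TimeBoundary documentation:

```json
{
    "queryType": "timeBoundary",
    "dataSource": "wikipedia"
}
```

Issued against the broker, this returns the earliest and latest event timestamps in the datasource, which is a quick way to find a valid `intervals` value for a subsequent Timeseries or TopN query.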
diff --git a/docs/content/Tutorial:-Loading-Your-Data-Part-1.md b/docs/content/Tutorial:-Loading-Your-Data-Part-1.md index 1f0992bb6bd..45ad2179cee 100644 --- a/docs/content/Tutorial:-Loading-Your-Data-Part-1.md +++ b/docs/content/Tutorial:-Loading-Your-Data-Part-1.md @@ -269,6 +269,8 @@ Next Steps This tutorial covered ingesting a small batch data set and loading it into Druid. In [Loading Your Data Part 2](Tutorial%3A-Loading-Your-Data-Part-2.html), we will cover how to ingest data using Hadoop for larger data sets. +Note: The index task and local firehose can be used to ingest your own data if the size of that data is relatively small (< 1GB). The index task is fairly slow, and we highly recommend using the Hadoop Index Task for ingesting larger quantities of data. + Additional Information ---------------------- diff --git a/docs/content/Tutorial:-Loading-Your-Data-Part-2.md b/docs/content/Tutorial:-Loading-Your-Data-Part-2.md index a9042143475..6dcd37fbc6a 100644 --- a/docs/content/Tutorial:-Loading-Your-Data-Part-2.md +++ b/docs/content/Tutorial:-Loading-Your-Data-Part-2.md @@ -264,8 +264,10 @@ Examining the contents of the file, you should find: "type" : "index_hadoop", "config": { "dataSource" : "wikipedia", - "timestampColumn" : "timestamp", - "timestampFormat" : "auto", + "timestampSpec" : { + "column" : "timestamp", + "format" : "auto" + }, "dataSpec" : { "format" : "json", "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] @@ -303,7 +305,8 @@ Examining the contents of the file, you should find: } ``` -If you are curious about what all this configuration means, see [here](Task.html) +If you are curious about what all this configuration means, see [here](Task.html). + To submit the task: ```bash diff --git a/docs/content/Tutorial:-The-Druid-Cluster.md b/docs/content/Tutorial:-The-Druid-Cluster.md index fbb4836d65f..c8568f4bb07 100644 --- a/docs/content/Tutorial:-The-Druid-Cluster.md +++ b/docs/content/Tutorial:-The-Druid-Cluster.md @@ -158,7 +158,7 @@ druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ druid.server.maxSize=10000000000 # Change these to make Druid faster -druid.processing.buffer.sizeBytes=268435456 +druid.processing.buffer.sizeBytes=100000000 druid.processing.numThreads=1 druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize"\: 10000000000}] @@ -250,7 +250,7 @@ druid.publish.type=noop # druid.db.connector.user=druid # druid.db.connector.password=diurd -druid.processing.buffer.sizeBytes=268435456 +druid.processing.buffer.sizeBytes=100000000 ``` Next Steps diff --git a/docs/content/toc.textile b/docs/content/toc.textile index 373641625b2..23b4023edaf 100644 --- a/docs/content/toc.textile +++ b/docs/content/toc.textile @@ -19,6 +19,7 @@ h2. Operations * "Extending Druid":./Modules.html * "Cluster Setup":./Cluster-setup.html * "Booting a Production Cluster":./Booting-a-production-cluster.html +* "Performance FAQ":./Performance-FAQ.html h2. Data Ingestion * "Realtime":./Realtime-ingestion.html @@ -27,6 +28,7 @@ h2. Data Ingestion * "Batch":./Batch-ingestion.html * "Indexing Service":./Indexing-Service.html ** "Tasks":./Tasks.html +* "Ingestion FAQ":./Ingestion-FAQ.html h2.
Querying * "Querying":./Querying.html diff --git a/examples/bin/examples/indexing/wikipedia_hadoop_config.json b/examples/bin/examples/indexing/wikipedia_hadoop_config.json index 82f365fdc6b..3b268056222 100644 --- a/examples/bin/examples/indexing/wikipedia_hadoop_config.json +++ b/examples/bin/examples/indexing/wikipedia_hadoop_config.json @@ -1,7 +1,9 @@ { "dataSource": "wikipedia", - "timestampColumn": "timestamp", - "timestampFormat": "iso", + "timestampSpec" : { + "column": "timestamp", + "format": "iso", + }, "dataSpec": { "format": "json", "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] diff --git a/examples/bin/examples/indexing/wikipedia_index_hadoop_task.json b/examples/bin/examples/indexing/wikipedia_index_hadoop_task.json index fbcafce83a2..46a4a689758 100644 --- a/examples/bin/examples/indexing/wikipedia_index_hadoop_task.json +++ b/examples/bin/examples/indexing/wikipedia_index_hadoop_task.json @@ -2,8 +2,10 @@ "type" : "index_hadoop", "config": { "dataSource" : "wikipedia", - "timestampColumn" : "timestamp", - "timestampFormat" : "auto", + "timestampSpec" : { + "column": "timestamp", + "format": "auto" + }, "dataSpec" : { "format" : "json", "dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"] diff --git a/examples/config/historical/runtime.properties b/examples/config/historical/runtime.properties index e460c539394..dbd093e0469 100644 --- a/examples/config/historical/runtime.properties +++ b/examples/config/historical/runtime.properties @@ -13,7 +13,7 @@ druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ druid.server.maxSize=10000000000 # Change these to make Druid faster -druid.processing.buffer.sizeBytes=268435456 +druid.processing.buffer.sizeBytes=100000000 druid.processing.numThreads=1 druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize"\: 10000000000}] \ No newline at end of file diff --git a/examples/config/realtime/runtime.properties b/examples/config/realtime/runtime.properties index 17c9db1fc23..b31bda951a1 100644 --- a/examples/config/realtime/runtime.properties +++ b/examples/config/realtime/runtime.properties @@ -14,4 +14,4 @@ druid.publish.type=noop # druid.db.connector.user=druid # druid.db.connector.password=diurd -druid.processing.buffer.sizeBytes=268435456 +druid.processing.buffer.sizeBytes=100000000 diff --git a/indexing-hadoop/src/main/java/io/druid/indexer/DeterminePartitionsJob.java b/indexing-hadoop/src/main/java/io/druid/indexer/DeterminePartitionsJob.java index a832d4511cd..30f9471e445 100644 --- a/indexing-hadoop/src/main/java/io/druid/indexer/DeterminePartitionsJob.java +++ b/indexing-hadoop/src/main/java/io/druid/indexer/DeterminePartitionsJob.java @@ -443,7 +443,11 @@ public class DeterminePartitionsJob implements Jobby final int index = bytes.getInt(); if (index >= numPartitions) { - throw new ISE("Not enough partitions, index[%,d] >= numPartitions[%,d]", index, numPartitions); + throw new ISE( + "Not enough partitions, index[%,d] >= numPartitions[%,d]. 
Please increase the number of reducers to the index size or check your config & settings!", + index, + numPartitions + ); } return index; @@ -453,7 +457,6 @@ public class DeterminePartitionsJob implements Jobby private static abstract class DeterminePartitionsDimSelectionBaseReducer extends Reducer { - protected static volatile HadoopDruidIndexerConfig config = null; @Override diff --git a/server/src/main/java/io/druid/guice/HttpClientModule.java b/server/src/main/java/io/druid/guice/HttpClientModule.java index 700aff7baee..5ba51bfcfa6 100644 --- a/server/src/main/java/io/druid/guice/HttpClientModule.java +++ b/server/src/main/java/io/druid/guice/HttpClientModule.java @@ -103,7 +103,7 @@ public class HttpClientModule implements Module private int numConnections = 5; @JsonProperty - private Period readTimeout = new Period("PT5M"); + private Period readTimeout = new Period("PT15M"); public int getNumConnections() {