a whole bunch of docs and fixes

This commit is contained in:
fjy 2014-01-13 18:01:56 -08:00
parent 35ed3f74bf
commit 3b17c4c03c
18 changed files with 97 additions and 25 deletions

View File

@ -1,4 +1,4 @@
---
layout: doc_page
---
Experimental features are features we have developed but have not fully tested in a production environment. If you choose to try them out, there will likely be edge cases that we have not covered. We would love feedback on any of these features, whether they are bug reports, suggestions for improvements, or letting us know they work as intended.
Experimental features are features we have developed but have not fully tested in a production environment. If you choose to try them out, there will likely be edge cases that we have not covered. We would love feedback on any of these features, whether they are bug reports, suggestions for improvement, or letting us know they work as intended.

View File

@ -27,8 +27,8 @@ The interval is the [ISO8601 interval](http://en.wikipedia.org/wiki/ISO_8601#Tim
{
"dataSource": "the_data_source",
"timestampSpec" : {
"timestampColumn": "ts",
"timestampFormat": "<iso, millis, posix, auto or any Joda time format>"
"column": "ts",
"format": "<iso, millis, posix, auto or any Joda time format>"
},
"dataSpec": {
"format": "<csv, tsv, or json>",
@ -188,8 +188,8 @@ The schema of the Hadoop Index Task contains a task "type" and a Hadoop Index Co
"config": {
"dataSource" : "example",
"timestampSpec" : {
"timestampColumn" : "timestamp",
"timestampFormat" : "auto"
"column" : "timestamp",
"format" : "auto"
},
"dataSpec" : {
"format" : "json",

View File

@ -66,14 +66,10 @@ You can then use the EC2 dashboard to locate the instance and confirm that it ha
If both the instance and the Druid cluster launch successfully, a few minutes later other messages to STDOUT should follow with information returned from EC2, including the instance ID:
<<<<<<< HEAD
# Apache Whirr
=======
Started cluster of 1 instances
Cluster{instances=[Instance{roles=[zookeeper, druid-mysql, druid-master, druid-broker, druid-compute, druid-realtime], publicIp= ...
The final message will contain login information for the instance.
>>>>>>> master
Note that Whirr will return an exception if any of the nodes fail to launch, and the cluster will be destroyed. To destroy the cluster manually, run the following command:

View File

@ -31,7 +31,7 @@ druid.zk.service.host=localhost
druid.server.maxSize=10000000000
# Change these to make Druid faster
druid.processing.buffer.sizeBytes=10000000
druid.processing.buffer.sizeBytes=100000000
druid.processing.numThreads=1
druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize"\: 10000000000}]

View File

@ -0,0 +1,38 @@
---
layout: doc_page
---
## Where do my Druid segments end up after ingestion?
Depending on what `druid.storage.type` is set to, Druid will upload segments to some [Deep Storage](Deep-Storage.html). Local disk is used as the default deep storage.
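For example, a minimal sketch of a local deep storage setup (the `druid.storage.storageDirectory` property name is assumed here; adjust the directory for your environment):
```
druid.storage.type=local
druid.storage.storageDirectory=/tmp/druid/localStorage
```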
## My realtime node is not handing segments off
Make sure that the `druid.publish.type` on your real-time nodes is set to "db". Also make sure that `druid.storage.type` is set to a deep storage that makes sense. Some example configs:
```
druid.publish.type=db
druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
druid.db.connector.user=druid
druid.db.connector.password=diurd
druid.storage.type=s3
druid.storage.bucket=druid
druid.storage.baseKey=sample
```
## I don't see my Druid segments on my historical nodes
You can check the coordinator console located at <COORDINATOR_IP>:<PORT>/cluster.html. Make sure that your segments have actually loaded on [historical nodes](Historical.html). If your segments are not present, check the coordinator logs for messages about capacity or replication errors. One common reason segments are not downloaded is that historical nodes have a maxSize that is too small, making them incapable of downloading more data. You can change that with (for example):
```
-Ddruid.segmentCache.locations=[{"path":"/tmp/druid/storageLocation","maxSize":"500000000000"}]
-Ddruid.server.maxSize=500000000000
```
## My queries are returning empty results
You can check <BROKER_IP>:<PORT>/druid/v2/datasources/<YOUR_DATASOURCE> for the dimensions and metrics that have been created for your datasource. Make sure that the names of the aggregators you use in your query match these metrics. Also make sure that the query interval you specify matches a valid time range where data exists.
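For example, from the command line (the broker host, port, and datasource name are placeholders):
```bash
curl http://<BROKER_IP>:<PORT>/druid/v2/datasources/<YOUR_DATASOURCE>
```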
## More information
Getting data into Druid can definitely be difficult for first-time users. Please don't hesitate to ask questions in our IRC channel or on our [Google Groups page](https://groups.google.com/forum/#!forum/druid-development).

View File

@ -0,0 +1,18 @@
---
layout: doc_page
---
## What should I set my JVM heap to?
The size of the JVM heap really depends on the type of Druid node you are running. Below are a few considerations.
[Broker nodes](Broker.html) can use the JVM heap as a query cache, so the size of the heap will affect the number of results that can be cached. Broker nodes do not require off-heap memory, and heap sizes can generally be set close to the maximum memory on the machine (leaving some room for JVM overhead). The heap is used to merge results from different real-time and historical nodes, along with other computational processing.
[Historical nodes](Historical.html) use off-heap memory to store intermediate results, and by default, all segments are memory mapped before they can be queried. The more off-heap memory is available, the more segments can be served without the possibility of data being paged onto disk. On historicals, the JVM heap is used for [GroupBy queries](GroupByQuery.html), some data structures used for intermediate computation, and general processing.
[Coordinator nodes](Coordinator.html) do not require off-heap memory; the heap is used to load information about all segments and determine which segments need to be loaded, dropped, moved, or replicated.
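As a rough, illustrative sketch of these considerations for a historical node (the numbers are placeholders and depend entirely on your hardware and segment sizes):
```
# Modest heap: most of the machine's memory should be left free
# for the OS page cache that backs memory-mapped segments.
-Xmx4g -Xms4g
# Enough direct memory for the off-heap processing buffers.
-XX:MaxDirectMemorySize=8g
```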
## What is the intermediate computation buffer?
The intermediate computation buffer specifies a buffer size for the storage of intermediate results. The computation engine in both the Historical and Realtime nodes will use a scratch buffer of this size to do all of its intermediate computations off-heap. Larger values allow for more aggregations in a single pass over the data, while smaller values can require more passes depending on the query that is being executed. The default size is 1073741824 bytes (1GB).
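This is the `druid.processing.buffer.sizeBytes` property seen in the configuration snippets above; for example, to set it explicitly to the default:
```
druid.processing.buffer.sizeBytes=1073741824
```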
## What is server maxSize?
Server maxSize sets the maximum cumulative segment size (in bytes) that a node can hold. Changing this parameter affects performance by controlling the memory/disk ratio on a node. Setting this parameter to a value greater than the total memory capacity on a node may cause disk paging to occur, and this paging introduces query latency.
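For example (numbers purely illustrative, echoing the values used elsewhere in these docs):
```
# A node with 64GB of RAM serving up to 500GB of segments
# has roughly a 1:8 memory-to-disk ratio for segment data.
druid.server.maxSize=500000000000
```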

View File

@ -42,7 +42,7 @@ druid.db.connector.connectURI=jdbc\:mysql\://localhost\:3306/druid
druid.db.connector.user=druid
druid.db.connector.password=diurd
druid.processing.buffer.sizeBytes=268435456
druid.processing.buffer.sizeBytes=100000000
```
The realtime module also uses several of the default modules in [Configuration](Configuration.html). For more information on the realtime spec file (or configuration file), see the [realtime ingestion](Realtime-ingestion.html) page.

View File

@ -194,6 +194,12 @@ Which gets us metrics about only those edits where the namespace is 'article':
Check out [Filters](Filters.html) for more information.
What Types of Queries to Use
----------------------------
The type of query you should use depends on your use case. [TimeBoundary queries](TimeBoundaryQuery.html) are useful for understanding the time range of your data. [Timeseries queries](TimeseriesQuery.html) are useful for aggregates and filters over a time range, and offer significant speed improvements over [GroupBy queries](GroupByQuery.html). To find the top values for a given dimension, prefer [TopN queries](TopNQuery.html) over GroupBy queries.
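For example, a TopN query that returns the five most-edited pages (the dataSource, dimension, and metric names are illustrative, loosely following the wikipedia example in this tutorial):
```
{
  "queryType": "topN",
  "dataSource": "wikipedia",
  "dimension": "page",
  "metric": "edit_count",
  "threshold": 5,
  "granularity": "all",
  "aggregations": [
    { "type": "longSum", "name": "edit_count", "fieldName": "count" }
  ],
  "intervals": [ "2013-01-01/2014-01-01" ]
}
```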
## Learn More ##
You can learn more about querying at [Querying](Querying.html)! If you are ready to evaluate Druid in more depth, check out [Booting a production cluster](Booting-a-production-cluster.html)!

View File

@ -269,6 +269,8 @@ Next Steps
This tutorial covered ingesting a small batch data set and loading it into Druid. In [Loading Your Data Part 2](Tutorial%3A-Loading-Your-Data-Part-2.html), we will cover how to ingest data using Hadoop for larger data sets.
Note: The index task and local firehose can be used to ingest your own data if the size of that data is relatively small (< 1G). The index task is fairly slow, and we highly recommend using the Hadoop Index Task for ingesting larger quantities of data.
Additional Information
----------------------

View File

@ -264,8 +264,10 @@ Examining the contents of the file, you should find:
"type" : "index_hadoop",
"config": {
"dataSource" : "wikipedia",
"timestampColumn" : "timestamp",
"timestampFormat" : "auto",
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
},
"dataSpec" : {
"format" : "json",
"dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]
@ -303,7 +305,8 @@ Examining the contents of the file, you should find:
}
```
If you are curious about what all this configuration means, see [here](Task.html)
If you are curious about what all this configuration means, see [here](Task.html).
To submit the task:
```bash

View File

@ -158,7 +158,7 @@ druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ
druid.server.maxSize=10000000000
# Change these to make Druid faster
druid.processing.buffer.sizeBytes=268435456
druid.processing.buffer.sizeBytes=100000000
druid.processing.numThreads=1
druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize"\: 10000000000}]
@ -250,7 +250,7 @@ druid.publish.type=noop
# druid.db.connector.user=druid
# druid.db.connector.password=diurd
druid.processing.buffer.sizeBytes=268435456
druid.processing.buffer.sizeBytes=100000000
```
Next Steps

View File

@ -19,6 +19,7 @@ h2. Operations
* "Extending Druid":./Modules.html
* "Cluster Setup":./Cluster-setup.html
* "Booting a Production Cluster":./Booting-a-production-cluster.html
* "Performance FAQ":./Performance-FAQ.html
h2. Data Ingestion
* "Realtime":./Realtime-ingestion.html
@ -27,6 +28,7 @@ h2. Data Ingestion
* "Batch":./Batch-ingestion.html
* "Indexing Service":./Indexing-Service.html
** "Tasks":./Tasks.html
* "Ingestion FAQ":./Ingestion-FAQ.html
h2. Querying
* "Querying":./Querying.html

View File

@ -1,7 +1,9 @@
{
"dataSource": "wikipedia",
"timestampColumn": "timestamp",
"timestampFormat": "iso",
"timestampSpec" : {
"column": "timestamp",
"format": "iso",
},
"dataSpec": {
"format": "json",
"dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]

View File

@ -2,8 +2,10 @@
"type" : "index_hadoop",
"config": {
"dataSource" : "wikipedia",
"timestampColumn" : "timestamp",
"timestampFormat" : "auto",
"timestampSpec" : {
"column": "timestamp",
"format": "auto"
},
"dataSpec" : {
"format" : "json",
"dimensions" : ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"]

View File

@ -13,7 +13,7 @@ druid.s3.accessKey=AKIAIMKECRUYKDQGR6YQ
druid.server.maxSize=10000000000
# Change these to make Druid faster
druid.processing.buffer.sizeBytes=268435456
druid.processing.buffer.sizeBytes=100000000
druid.processing.numThreads=1
druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize"\: 10000000000}]

View File

@ -14,4 +14,4 @@ druid.publish.type=noop
# druid.db.connector.user=druid
# druid.db.connector.password=diurd
druid.processing.buffer.sizeBytes=268435456
druid.processing.buffer.sizeBytes=100000000

View File

@ -443,7 +443,11 @@ public class DeterminePartitionsJob implements Jobby
final int index = bytes.getInt();
if (index >= numPartitions) {
throw new ISE("Not enough partitions, index[%,d] >= numPartitions[%,d]", index, numPartitions);
throw new ISE(
"Not enough partitions, index[%,d] >= numPartitions[%,d]. Please increase the number of reducers to the index size or check your config & settings!",
index,
numPartitions
);
}
return index;
@ -453,7 +457,6 @@ public class DeterminePartitionsJob implements Jobby
private static abstract class DeterminePartitionsDimSelectionBaseReducer
extends Reducer<BytesWritable, Text, BytesWritable, Text>
{
protected static volatile HadoopDruidIndexerConfig config = null;
@Override

View File

@ -103,7 +103,7 @@ public class HttpClientModule implements Module
private int numConnections = 5;
@JsonProperty
private Period readTimeout = new Period("PT5M");
private Period readTimeout = new Period("PT15M");
public int getNumConnections()
{