add test for batch indexing from hadoop

This commit is contained in:
Robin 2016-02-16 09:23:37 -06:00
parent 773d6fe86c
commit e7a7ecd65d
9 changed files with 3955 additions and 21 deletions

View File

@@ -1,6 +1,20 @@
Integration Testing
===================
To run integration tests, you have to specify the druid cluster the
tests should use.
Druid comes with the mvn profile integration-tests,
which sets up druid running in docker containers and uses that
cluster to run the integration tests.
To use a druid cluster that is already running, use the
mvn profile int-tests-config-file, which uses a configuration file
describing the cluster.
Integration Testing Using Docker
-------------------
## Installing Docker
Please refer to instructions at [https://github.com/druid-io/docker-druid/blob/master/docker-install.md](https://github.com/druid-io/docker-druid/blob/master/docker-install.md).
@@ -20,24 +34,22 @@ eval "$(docker-machine env integration)"
export DOCKER_IP=$(docker-machine ip integration)
```
-Running Integration tests
-=========================
+## Running tests
Make sure that you have at least 6GB of memory available before you run the tests.
## Starting docker tests
-To run all the tests using docker and mvn run the following command -
+To run all the tests using docker and mvn run the following command:
```
mvn verify -P integration-tests
```
-To run only a single test using mvn run the following command -
+To run only a single test using mvn run the following command:
```
mvn verify -P integration-tests -Dit.test=<test_name>
```
-## Configure and run integration tests using existing cluster
+Running Tests Using A Configuration File for Any Cluster
+-------------------
Make sure that you have at least 6GB of memory available before you run the tests.
To run tests on any druid cluster that is already running, create a configuration file:
@@ -54,23 +66,84 @@ To run tests on any druid cluster that is already running, create a configuratio
"zookeeper_hosts": "<comma-separated list of zookeeper_ip:zookeeper_port>"
}
-Set the environment variable CONFIG_FILE to the name of the configuration file -
+Set the environment variable CONFIG_FILE to the name of the configuration file:
```
export CONFIG_FILE=<config file name>
```
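As a concrete sketch of the two steps above, assuming placeholder values for the host, port, and file name (the file shown is deliberately incomplete; fill in the remaining keys from the template above):

```shell
# Write a partial cluster config; all values here are placeholders.
cat > my-cluster-config.json <<'EOF'
{
    "zookeeper_hosts": "10.0.0.5:2181"
}
EOF

# Point the test framework at the file.
export CONFIG_FILE=my-cluster-config.json
```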
-To run all the tests using mvn run the following command -
+To run all the tests using mvn run the following command:
```
mvn verify -P int-tests-config-file
```
-To run only a single test using mvn run the following command -
+To run only a single test using mvn run the following command:
```
mvn verify -P int-tests-config-file -Dit.test=<test_name>
```
Running a Test That Uses Hadoop
-------------------
The integration test that indexes from hadoop is not run as part
of the integration test run discussed above. This is because druid
test clusters might not, in general, have access to hadoop.
That's the case (for now, at least) when using the docker cluster set
up by the integration-tests profile, so the hadoop test
has to be run using a cluster specified in a configuration file.
The data file is
integration-tests/src/test/resources/hadoop/batch_hadoop.data.
Create a directory called batchHadoop1 in the hadoop file system
(anywhere you want) and put batch_hadoop.data into that directory
(as its only file).
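As a sketch of that layout, using the local filesystem as a stand-in so the commands can be tried anywhere (on a real cluster you would use the equivalent `hadoop fs -mkdir -p` and `hadoop fs -put` commands, and the parent directory `/tmp/druid-it` is just an assumption):

```shell
# Parent directory is an arbitrary choice; any path the cluster can read works.
PARENT=/tmp/druid-it

# Real cluster equivalent: hadoop fs -mkdir -p $PARENT/batchHadoop1
mkdir -p "$PARENT/batchHadoop1"

# Real cluster equivalent:
#   hadoop fs -put integration-tests/src/test/resources/hadoop/batch_hadoop.data $PARENT/batchHadoop1/
# A stand-in file is created here because the real data file ships in the druid source tree.
touch "$PARENT/batchHadoop1/batch_hadoop.data"
```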
Add this property to the configuration file (see above):
```
"hadoopTestDir": "<name_of_dir_containing_batchHadoop1>"
```
Run the test using mvn:
```
mvn verify -P int-tests-config-file -Dit.test=ITHadoopIndexTest
```
In some test environments, the machine where the tests need to be executed
cannot access the outside internet, so mvn cannot be run. In that case,
do the following instead of running the tests using mvn:
### Compile druid and the integration tests
On a machine that can do mvn builds:
```
cd druid
mvn clean package
cd integration-tests
mvn dependency:copy-dependencies package
```
### Put the compiled test code into your test cluster
Copy the integration-tests directory to the test cluster.
### Set CLASSPATH
```
TDIR=<directory containing integration-tests>/target
VER=<version of druid you built>
export CLASSPATH=$TDIR/dependency/*:$TDIR/druid-integration-tests-$VER.jar:$TDIR/druid-integration-tests-$VER-tests.jar
```
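For example, with placeholder values filled in (the directory and version below are assumptions; substitute the values from your own build):

```shell
# Both values are placeholders for illustration.
TDIR=/opt/druid/integration-tests/target
VER=0.9.0
# The dependency/* glob is left unexpanded on purpose; java expands classpath wildcards itself.
export CLASSPATH=$TDIR/dependency/*:$TDIR/druid-integration-tests-$VER.jar:$TDIR/druid-integration-tests-$VER-tests.jar
echo "$CLASSPATH"
```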
### Run the test
```
java -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.test.config.type=configFile -Ddruid.test.config.configFile=<pathname of configuration file> org.testng.TestNG -testrunfactory org.testng.DruidTestRunnerFactory -testclass io.druid.tests.hadoop.ITHadoopIndexTest
```
Writing a New Test
-===============
+-------------------
## What should we cover in integration tests
@@ -78,7 +151,7 @@ For every end-user functionality provided by druid we should have an integration
## Rules to be followed while writing a new integration test
-### Every Integration Test must follow these rules
+### Every Integration Test must follow these rules:
1) The name of the test must start with the prefix "IT"
2) A test should be independent of other tests
@@ -111,5 +184,5 @@ This will tell the test framework that the test class needs to be constructed us
Refer to ITIndexerTest as an example of how to use dependency injection
TODOS
-=======================
+-----------------------
1) Remove the patch for TestNG after resolution of Surefire-622

View File

@@ -45,6 +45,11 @@
<artifactId>druid-s3-extensions</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>io.druid.extensions</groupId>
<artifactId>druid-datasketches</artifactId>
<version>${project.parent.version}</version>
</dependency>
<dependency>
<groupId>io.druid.extensions</groupId>
<artifactId>druid-kafka-eight</artifactId>
@@ -190,6 +195,7 @@
-Dfile.encoding=UTF-8
-Dtestrunfactory=org.testng.DruidTestRunnerFactory
-Ddruid.test.config.dockerIp=${env.DOCKER_IP}
-Ddruid.test.config.hadoopDir=${env.HADOOP_DIR}
-Ddruid.zk.service.host=${env.DOCKER_IP}
</argLine>
<suiteXmlFiles>

View File

@@ -31,6 +31,10 @@ public class DockerConfigProvider implements IntegrationTestingConfigProvider
@NotNull
private String dockerIp;
@JsonProperty
@NotNull
private String hadoopDir;
@Override
public IntegrationTestingConfig get()
{
@@ -87,7 +91,10 @@ public class DockerConfigProvider implements IntegrationTestingConfigProvider
@Override
public String getProperty(String prop)
{
-throw new UnsupportedOperationException("DockerConfigProvider does not support getProperty()");
+if (prop.equals("hadoopTestDir")) {
+  return hadoopDir;
+}
+throw new UnsupportedOperationException("DockerConfigProvider does not support property " + prop);
}
};
}

View File

@@ -194,6 +194,11 @@ public class OverlordResourceTestClient
}
public void waitUntilTaskCompletes(final String taskID)
{
waitUntilTaskCompletes(taskID, 60000, 10);
}
public void waitUntilTaskCompletes(final String taskID, final int millisEach, final int numTimes)
{
RetryUtil.retryUntil(
new Callable<Boolean>()
@@ -209,9 +214,9 @@ public class OverlordResourceTestClient
}
},
true,
-60000,
-10,
-"Index Task to complete"
+millisEach,
+numTimes,
+taskID
);
}

View File

@@ -0,0 +1,114 @@
/*
* Licensed to Metamarkets Group Inc. (Metamarkets) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. Metamarkets licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package io.druid.tests.hadoop;
import com.google.common.base.Throwables;
import com.google.inject.Inject;
import com.metamx.common.ISE;
import com.metamx.common.logger.Logger;
import io.druid.indexing.common.TaskStatus;
import io.druid.testing.IntegrationTestingConfig;
import io.druid.testing.guice.DruidTestModuleFactory;
import io.druid.testing.utils.RetryUtil;
import io.druid.tests.indexer.AbstractIndexerTest;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Guice;
import org.testng.annotations.Test;
import java.util.concurrent.Callable;
@Guice(moduleFactory = DruidTestModuleFactory.class)
public class ITHadoopIndexTest extends AbstractIndexerTest
{
private static final Logger LOG = new Logger(ITHadoopIndexTest.class);
private static final String BATCH_TASK = "/hadoop/batch_hadoop_indexer.json";
private static final String BATCH_QUERIES_RESOURCE = "/hadoop/batch_hadoop_queries.json";
private static final String BATCH_DATASOURCE = "batchHadoop";
private boolean dataLoaded = false;
@Inject
private IntegrationTestingConfig config;
@BeforeClass
public void beforeClass()
{
loadData(config.getProperty("hadoopTestDir") + "/batchHadoop1");
dataLoaded = true;
}
@Test
public void testHadoopIndex() throws Exception
{
queryHelper.testQueriesFromFile(BATCH_QUERIES_RESOURCE, 2);
}
private void loadData(String hadoopDir)
{
String indexerSpec = "";
try {
LOG.info("indexerFile name: [%s]", BATCH_TASK);
indexerSpec = getTaskAsString(BATCH_TASK);
indexerSpec = indexerSpec.replaceAll("%%HADOOP_TEST_PATH%%", hadoopDir);
} catch (Exception e) {
LOG.error("could not read and modify indexer file: %s", e.getMessage());
throw Throwables.propagate(e);
}
try {
final String taskID = indexer.submitTask(indexerSpec);
LOG.info("TaskID for loading index task %s", taskID);
indexer.waitUntilTaskCompletes(taskID, 60000, 20);
RetryUtil.retryUntil(
new Callable<Boolean>()
{
@Override
public Boolean call() throws Exception
{
return coordinator.areSegmentsLoaded(BATCH_DATASOURCE);
}
},
true,
20000,
10,
"Segment-Load-Task-" + taskID
);
} catch (Exception e) {
LOG.error("data could not be loaded: %s", e.getMessage());
throw Throwables.propagate(e);
}
}
@AfterClass
public void afterClass()
{
if (dataLoaded)
{
try
{
unloadAndKillData(BATCH_DATASOURCE);
} catch (Exception e) {
LOG.warn("exception while removing segments: [%s]", e);
}
}
}
}

File diff suppressed because it is too large

View File

@@ -0,0 +1,72 @@
{
"type": "index_hadoop",
"spec": {
"dataSchema": {
"dataSource": "batchHadoop",
"parser": {
"type": "string",
"parseSpec": {
"type": "tsv",
"timestampSpec": {
"column": "timestamp",
"format": "yyyyMMddHH"
},
"dimensionsSpec": {
"dimensions": [
"location",
"product"
]
},
"columns": [
"timestamp",
"location",
"product",
"other_metric",
"user_id_sketch"
]
}
},
"metricsSpec": [
{
"type": "thetaSketch",
"name": "other_metric",
"fieldName": "other_metric",
"size": 16384
},
{
"type": "thetaSketch",
"name": "user_id_sketch",
"fieldName": "user_id_sketch",
"size": 16384
}
],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "DAY",
"queryGranularity": "DAY",
"intervals": [
"2014-10-20T00:00:00Z/P2W"
]
}
},
"ioConfig": {
"type": "hadoop",
"inputSpec": {
"type": "static",
"paths": "%%HADOOP_TEST_PATH%%"
}
},
"tuningConfig": {
"type": "hadoop",
"partitionsSpec": {
"assumeGrouped": true,
"targetPartitionSize": 75000,
"type": "hashed"
},
"jobProperties": {
"fs.permissions.umask-mode": "022"
},
"rowFlushBoundary": 10000
}
}
}

View File

@@ -0,0 +1,295 @@
[
{
"description": "segmentMetadata_query",
"query": {
"queryType": "segmentMetadata",
"dataSource": "batchHadoop",
"merge": "true",
"intervals": "2014-10/2014-12"
},
"expectedResults": [
{
"intervals": [
"2014-10-20T00:00:00.000Z/2014-11-03T00:00:00.000Z"
],
"id": "merged",
"columns": {
"location": {
"type": "STRING",
"size": 10140,
"hasMultipleValues": false,
"minValue": "location_1",
"maxValue": "location_5",
"cardinality": 5,
"errorMessage": null
},
"user_id_sketch": {
"type": "thetaSketch",
"size": 0,
"hasMultipleValues": false,
"minValue": null,
"maxValue": null,
"cardinality": null,
"errorMessage": null
},
"other_metric": {
"type": "thetaSketch",
"size": 0,
"hasMultipleValues": false,
"minValue": null,
"maxValue": null,
"cardinality": null,
"errorMessage": null
},
"__time": {
"type": "LONG",
"size": 10140,
"hasMultipleValues": false,
"minValue": null,
"maxValue": null,
"cardinality": null,
"errorMessage": null
},
"product": {
"type": "STRING",
"size": 9531,
"hasMultipleValues": false,
"minValue": "product_1",
"maxValue": "product_9",
"cardinality": 15,
"errorMessage": null
}
},
"size": 34881,
"numRows": 1014,
"aggregators": null
}
]
},
{
"description": "time_boundary_query",
"query": {
"queryType": "timeBoundary",
"dataSource": "batchHadoop"
},
"expectedResults": [
{
"result": {
"maxTime": "2014-11-02T00:00:00.000Z",
"minTime": "2014-10-20T00:00:00.000Z"
},
"timestamp": "2014-10-20T00:00:00.000Z"
}
]
},
{
"description": "unique_query3",
"query": {
"queryType": "groupBy",
"dataSource": {
"name": "batchHadoop",
"type": "table"
},
"dimensions": [],
"granularity": {
"period": "P1D",
"type": "period"
},
"intervals": [
"2014-10-19T00:00:00.000Z/2014-11-05T00:00:00.000Z"
],
"filter": {
"type": "or",
"fields": [
{
"type": "selector",
"dimension": "product",
"value": "product_1"
},
{
"type": "selector",
"dimension": "product",
"value": "product_7"
}
]
},
"aggregations": [
{
"type": "filtered",
"filter": {
"type": "selector",
"dimension": "product",
"value": "product_1"
},
"aggregator": {
"type": "thetaSketch",
"name": "unique1",
"fieldName": "user_id_sketch"
}
},
{
"type": "filtered",
"filter": {
"type": "selector",
"dimension": "product",
"value": "product_7"
},
"aggregator": {
"type": "thetaSketch",
"name": "unique7",
"fieldName": "user_id_sketch"
}
}
],
"postAggregations": [
{
"type": "thetaSketchEstimate",
"name": "final_unique",
"field": {
"type": "thetaSketchSetOp",
"name": "final_unique_sketch",
"func": "INTERSECT",
"fields": [
{
"type": "fieldAccess",
"fieldName": "unique1"
},
{
"type": "fieldAccess",
"fieldName": "unique7"
}
]
}
}
]
},
"expectedResults": [
{
"version": "v1",
"timestamp": "2014-10-20T00:00:00.000Z",
"event": {
"unique1": 16.0,
"unique7": 17.0,
"final_unique": 8.0
}
},
{
"version": "v1",
"timestamp": "2014-10-21T00:00:00.000Z",
"event": {
"unique1": 8.0,
"unique7": 17.0,
"final_unique": 5.0
}
},
{
"version": "v1",
"timestamp": "2014-10-22T00:00:00.000Z",
"event": {
"unique1": 14.0,
"unique7": 11.0,
"final_unique": 3.0
}
},
{
"version": "v1",
"timestamp": "2014-10-23T00:00:00.000Z",
"event": {
"unique1": 12.0,
"unique7": 14.0,
"final_unique": 3.0
}
},
{
"version": "v1",
"timestamp": "2014-10-24T00:00:00.000Z",
"event": {
"unique1": 14.0,
"unique7": 11.0,
"final_unique": 3.0
}
},
{
"version": "v1",
"timestamp": "2014-10-25T00:00:00.000Z",
"event": {
"unique1": 11.0,
"unique7": 14.0,
"final_unique": 4.0
}
},
{
"version": "v1",
"timestamp": "2014-10-26T00:00:00.000Z",
"event": {
"unique1": 16.0,
"unique7": 13.0,
"final_unique": 5.0
}
},
{
"version": "v1",
"timestamp": "2014-10-27T00:00:00.000Z",
"event": {
"unique1": 17.0,
"unique7": 13.0,
"final_unique": 3.0
}
},
{
"version": "v1",
"timestamp": "2014-10-28T00:00:00.000Z",
"event": {
"unique1": 15.0,
"unique7": 16.0,
"final_unique": 7.0
}
},
{
"version": "v1",
"timestamp": "2014-10-29T00:00:00.000Z",
"event": {
"unique1": 11.0,
"unique7": 15.0,
"final_unique": 3.0
}
},
{
"version": "v1",
"timestamp": "2014-10-30T00:00:00.000Z",
"event": {
"unique1": 14.0,
"unique7": 10.0,
"final_unique": 3.0
}
},
{
"version": "v1",
"timestamp": "2014-10-31T00:00:00.000Z",
"event": {
"unique1": 17.0,
"unique7": 12.0,
"final_unique": 5.0
}
},
{
"version": "v1",
"timestamp": "2014-11-01T00:00:00.000Z",
"event": {
"unique1": 18.0,
"unique7": 18.0,
"final_unique": 7.0
}
},
{
"version": "v1",
"timestamp": "2014-11-02T00:00:00.000Z",
"event": {
"unique1": 16.0,
"unique7": 17.0,
"final_unique": 3.0
}
}
]
}
]

View File

@@ -24,7 +24,9 @@
</listeners>
<test name="AllTests">
<packages>
-<package name="io.druid.tests.*"/>
+<package name="io.druid.tests.*">
+  <exclude name="io.druid.tests.hadoop"/>
+</package>
</packages>
</test>
</suite>