mirror of https://github.com/apache/druid.git
Add doc. on how to validate a deep storage implementation.
This commit is contained in: parent f0a19349bf, commit f912e45464

@@ -92,7 +92,7 @@ If your jar has this file, then when it is added to the classpath or as an exten
### Adding a new deep storage implementation

Check the `azure-storage`, `cassandra-storage`, `hdfs-storage` and `s3-extensions` modules for examples of how to do this.

The basic idea behind the extension is that you need to add bindings for your DataSegmentPusher and DataSegmentPuller objects. The way to add them is something like this (the HdfsStorageDruidModule does the equivalent for its HDFS classes):
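A minimal sketch of such a module, using hypothetical `MyDataSegmentPuller` and `MyDataSegmentPusher` implementations registered under a hypothetical `my-storage` type key (the HDFS module binds its `Hdfs*` classes under the `hdfs` key in the same way):

```
import com.fasterxml.jackson.databind.Module;
import com.google.inject.Binder;
import io.druid.guice.Binders;
import io.druid.guice.LazySingleton;
import io.druid.initialization.DruidModule;

import java.util.Collections;
import java.util.List;

public class MyDeepStorageDruidModule implements DruidModule
{
  @Override
  public void configure(Binder binder)
  {
    // Register the puller and pusher under a storage type key so that
    // druid.storage.type=my-storage selects this implementation at runtime.
    // MyDataSegmentPuller / MyDataSegmentPusher are placeholders for your
    // own implementations of DataSegmentPuller and DataSegmentPusher.
    Binders.dataSegmentPullerBinder(binder)
           .addBinding("my-storage")
           .to(MyDataSegmentPuller.class)
           .in(LazySingleton.class);

    Binders.dataSegmentPusherBinder(binder)
           .addBinding("my-storage")
           .to(MyDataSegmentPusher.class)
           .in(LazySingleton.class);
  }

  @Override
  public List<? extends Module> getJacksonModules()
  {
    // No custom Jackson serialization is needed for these bindings.
    return Collections.<Module>emptyList();
  }
}
```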
@@ -112,6 +112,57 @@ Binders.dataSegmentPusherBinder(binder)
`to(...).in(...);` is normal Guice stuff.

In addition to DataSegmentPusher and DataSegmentPuller, you can also bind the following (see the binding sketch after this list):

* DataSegmentKiller: Removes segments. Used as part of the Kill Task to delete unused segments, i.e. to garbage-collect segments that have either been superseded by newer versions or dropped from the cluster.
* DataSegmentMover: Allows migrating segments from one place to another. Currently this is only used as part of the MoveTask to move unused segments to a different S3 bucket or prefix in order to reduce the storage costs of unused data (e.g. moving them to Glacier or other cheaper storage).
* DataSegmentArchiver: A wrapper around the Mover that comes with a pre-configured target bucket/path, so those parameters do not need to be specified at runtime as part of the ArchiveTask.
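These optional components are registered the same way as the pusher and puller. A sketch continuing the `configure` method above, assuming `Binders` exposes killer/mover/archiver map binders analogous to the pusher/puller ones and using hypothetical `My*` implementations:

```
// Continuing the configure(Binder binder) method from the sketch above;
// MyDataSegmentKiller, MyDataSegmentMover and MyDataSegmentArchiver are
// placeholders for your own implementations of these interfaces.
Binders.dataSegmentKillerBinder(binder)
       .addBinding("my-storage")
       .to(MyDataSegmentKiller.class)
       .in(LazySingleton.class);

Binders.dataSegmentMoverBinder(binder)
       .addBinding("my-storage")
       .to(MyDataSegmentMover.class)
       .in(LazySingleton.class);

Binders.dataSegmentArchiverBinder(binder)
       .addBinding("my-storage")
       .to(MyDataSegmentArchiver.class)
       .in(LazySingleton.class);
```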
### Validating your deep storage implementation

**WARNING!** This is not a formal procedure, but a collection of hints to validate whether your new deep storage implementation is able to push, pull and kill segments.

It's recommended to use batch ingestion tasks to validate your implementation. The segment will automatically be handed off to a historical node after ~20 seconds. In this way, you can validate both pushing segments (at the realtime node) and pulling them (at the historical node).

* DataSegmentPusher

Wherever your data storage (cloud storage service, distributed file system, etc.) is, you should be able to see two new files: `descriptor.json` and `index.zip` after your ingestion task ends.

* DataSegmentPuller

About 20 seconds after your ingestion task ends, you should be able to see your historical node trying to load the new segment.

The following example was retrieved from a historical node configured to use Azure for deep storage:

```
2015-04-14T02:42:33,450 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - New request[LOAD: dde_2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z_2015-04-14T02:41:09.484Z] with zNode[/druid/dev/loadQueue/192.168.33.104:8081/dde_2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z_2015-04-14T02:41:09.484Z].
2015-04-14T02:42:33,451 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Loading segment dde_2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z_2015-04-14T02:41:09.484Z
2015-04-14T02:42:33,463 INFO [ZkCoordinator-0] io.druid.guice.JsonConfigurator - Loaded class[class io.druid.storage.azure.AzureAccountConfig] from props[druid.azure.] as [io.druid.storage.azure.AzureAccountConfig@759c9ad9]
2015-04-14T02:49:08,275 INFO [ZkCoordinator-0] com.metamx.common.CompressionUtils - Unzipping file[/opt/druid/tmp/compressionUtilZipCache1263964429587449785.zip] to [/opt/druid/zk_druid/dde/2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z/2015-04-14T02:41:09.484Z/0]
2015-04-14T02:49:08,276 INFO [ZkCoordinator-0] io.druid.storage.azure.AzureDataSegmentPuller - Loaded 1196 bytes from [dde/2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z/2015-04-14T02:41:09.484Z/0/index.zip] to [/opt/druid/zk_druid/dde/2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z/2015-04-14T02:41:09.484Z/0]
2015-04-14T02:49:08,277 WARN [ZkCoordinator-0] io.druid.segment.loading.SegmentLoaderLocalCacheManager - Segment [dde_2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z_2015-04-14T02:41:09.484Z] is different than expected size. Expected [0] found [1196]
2015-04-14T02:49:08,282 INFO [ZkCoordinator-0] io.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing segment[dde_2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z_2015-04-14T02:41:09.484Z] at path[/druid/dev/segments/192.168.33.104:8081/192.168.33.104:8081_historical__default_tier_2015-04-14T02:49:08.282Z_7bb87230ebf940188511dd4a53ffd7351]
2015-04-14T02:49:08,292 INFO [ZkCoordinator-0] io.druid.server.coordination.ZkCoordinator - Completed request [LOAD: dde_2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z_2015-04-14T02:41:09.484Z]
```
* DataSegmentKiller

The easiest way of testing segment killing is to mark a segment as not used and then start a killing task through the old Coordinator console.

To mark a segment as not used, you need to connect to your metadata storage and update the `used` column to `false` on the segment table rows.
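For example, a minimal JDBC sketch of marking a single segment as unused; the `druid_segments` table name assumes the default metadata table prefix, and the JDBC URL, credentials and segment id are placeholders for your environment:

```
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class MarkSegmentUnused
{
  public static void main(String[] args) throws Exception
  {
    // Placeholders: point these at your metadata store and at the segment
    // you want to mark as unused (the id column holds the segment identifier).
    String jdbcUrl = "jdbc:mysql://localhost:3306/druid";
    String user = "druid";
    String password = "diurd";
    String segmentId = "dde_2015-01-02T00:00:00.000Z_2015-01-03T00:00:00.000Z_2015-04-14T02:41:09.484Z";

    try (Connection connection = DriverManager.getConnection(jdbcUrl, user, password);
         PreparedStatement statement = connection.prepareStatement(
             "UPDATE druid_segments SET used = false WHERE id = ?")) {
      statement.setString(1, segmentId);
      int updated = statement.executeUpdate();
      System.out.println("Rows marked as unused: " + updated);
    }
  }
}
```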
To start a segment killing task, you need to access the old Coordinator console at `http://<COORDINATOR_IP>:<COORDINATOR_PORT>/old-console/kill.html`, then select the appropriate datasource and input a time range (e.g. `2000/3000`).

After the killing task ends, both the `descriptor.json` and `index.zip` files should be deleted from the data storage.

### Adding a new Firehose

There is an example of this in the `s3-extensions` module with the StaticS3FirehoseFactory.
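A new firehose factory is exposed to ingestion specs through Jackson polymorphic typing, i.e. the `DruidModule` registers the factory class as a named subtype. A minimal sketch, using a hypothetical `MyFirehoseFactory` registered under a hypothetical `my-firehose` type name (the `s3-extensions` module does the equivalent for StaticS3FirehoseFactory):

```
import com.fasterxml.jackson.databind.Module;
import com.fasterxml.jackson.databind.jsontype.NamedType;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.google.inject.Binder;
import io.druid.initialization.DruidModule;

import java.util.Collections;
import java.util.List;

public class MyFirehoseDruidModule implements DruidModule
{
  @Override
  public List<? extends Module> getJacksonModules()
  {
    // Register the factory under the type name used in ingestion specs,
    // e.g. "firehose" : { "type" : "my-firehose", ... }.
    // MyFirehoseFactory is a placeholder for your own FirehoseFactory class.
    return Collections.singletonList(
        new SimpleModule("MyFirehoseModule").registerSubtypes(
            new NamedType(MyFirehoseFactory.class, "my-firehose")
        )
    );
  }

  @Override
  public void configure(Binder binder)
  {
    // No Guice bindings are required just to expose the firehose type.
  }
}
```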