# Google Cloud Storage
To use this Apache Druid extension, make sure to include `druid-google-extensions` in the extensions load list.
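For example, the extension can be enabled via the extensions load list in the common runtime properties (the file path shown in the comment is illustrative):

```properties
# Illustrative: enable the Google extensions, e.g. in common.runtime.properties
druid.extensions.loadList=["druid-google-extensions"]
```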
## Deep Storage
Deep storage can be written to Google Cloud Storage either via this extension or the druid-hdfs-storage extension.
## Configuration

Property | Possible Values | Description | Default |
---|---|---|---|
`druid.storage.type` | `google` | | Must be set. |
`druid.google.bucket` | | GCS bucket name. | Must be set. |
`druid.google.prefix` | | GCS prefix. | No-prefix |
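Putting the properties above together, a deep storage configuration might look like the following sketch (the bucket and prefix values are placeholders, not defaults):

```properties
# Illustrative deep-storage settings; bucket and prefix are placeholders
druid.storage.type=google
druid.google.bucket=my-druid-bucket
druid.google.prefix=druid/segments
```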
## Google Cloud Storage batch ingestion input source

This extension also provides an input source for Druid native batch ingestion to support reading objects directly from Google Cloud Storage. Objects can be specified as a list of Google Cloud Storage URI strings. The Google Cloud Storage input source is splittable and can be used by native parallel index tasks, where each worker task of `index_parallel` will read a single object.
```json
...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "google",
        "uris": ["gs://foo/bar/file.json", "gs://bar/foo/file2.json"]
      },
      "inputFormat": {
        "type": "json"
      },
      ...
    },
...
```
```json
...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "google",
        "prefixes": ["gs://foo/bar", "gs://bar/foo"]
      },
      "inputFormat": {
        "type": "json"
      },
      ...
    },
...
```
```json
...
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "google",
        "objects": [
          { "bucket": "foo", "path": "bar/file1.json"},
          { "bucket": "bar", "path": "foo/file2.json"}
        ]
      },
      "inputFormat": {
        "type": "json"
      },
      ...
    },
...
```
property | description | default | required? |
---|---|---|---|
type | This should be `google`. | N/A | yes |
uris | JSON array of URIs where Google Cloud Storage objects to be ingested are located. | N/A | `uris` or `prefixes` or `objects` must be set |
prefixes | JSON array of URI prefixes for the locations of Google Cloud Storage objects to be ingested. | N/A | `uris` or `prefixes` or `objects` must be set |
objects | JSON array of Google Cloud Storage objects to be ingested. | N/A | `uris` or `prefixes` or `objects` must be set |
Google Cloud Storage object:
property | description | default | required? |
---|---|---|---|
bucket | Name of the Google Cloud Storage bucket | N/A | yes |
path | The path where data is located. | N/A | yes |
## Firehose

### StaticGoogleBlobStoreFirehose

This firehose ingests events, similar to the StaticS3Firehose, but from Google Cloud Storage.

As with the S3 blobstore, an object is assumed to be gzipped if its file extension is `.gz`.

This firehose is splittable and can be used by native parallel index tasks. Since each split represents an object in this firehose, each worker task of `index_parallel` will read an object.
Sample spec:
```json
"firehose" : {
    "type" : "static-google-blobstore",
    "blobs": [
        {
          "bucket": "foo",
          "path": "/path/to/your/file.json"
        },
        {
          "bucket": "bar",
          "path": "/another/path.json"
        }
    ]
}
```
This firehose provides caching and prefetching features. In `IndexTask`, a firehose can be read twice if `intervals` or `shardSpecs` are not specified, and in that case caching can be useful. Prefetching is preferred when a direct scan of objects is slow.
property | description | default | required? |
---|---|---|---|
type | This should be `static-google-blobstore`. | N/A | yes |
blobs | JSON array of Google Blobs. | N/A | yes |
maxCacheCapacityBytes | Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes. | 1073741824 | no |
maxFetchCapacityBytes | Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read. | 1073741824 | no |
prefetchTriggerBytes | Threshold to trigger prefetching Google Blobs. | maxFetchCapacityBytes / 2 | no |
fetchTimeout | Timeout for fetching a Google Blob. | 60000 | no |
maxFetchRetry | Maximum retry for fetching a Google Blob. | 3 | no |
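To illustrate the caching and prefetching options above, a firehose spec might set them explicitly. The values below are simply the documented defaults, shown only as a sketch; the bucket and path are placeholders:

```json
"firehose" : {
    "type" : "static-google-blobstore",
    "blobs": [
        {
          "bucket": "foo",
          "path": "/path/to/your/file.json"
        }
    ],
    "maxCacheCapacityBytes": 1073741824,
    "maxFetchCapacityBytes": 1073741824,
    "fetchTimeout": 60000,
    "maxFetchRetry": 3
}
```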
Google Blobs:
property | description | default | required? |
---|---|---|---|
bucket | Name of the Google Cloud bucket | N/A | yes |
path | The path where data is located. | N/A | yes |