Iceberg Ingest extension

Apache Iceberg is an open table format for huge analytic datasets. IcebergInputSource lets you ingest data stored in the Iceberg table format into Apache Druid. To use the Iceberg extension, add druid-iceberg-extensions to the list of loaded extensions. See Loading extensions for more information.

Iceberg manages most of its metadata in metadata files in object storage. However, it still depends on a metastore to manage some of its metadata. Iceberg refers to these metastores as catalogs. The Iceberg extension lets you connect to the following Iceberg catalog types:

  • Hive metastore catalog
  • Local catalog

Druid does not yet support AWS Glue or REST-based catalogs.

For a given catalog, the Iceberg input source reads the table name from the catalog, applies the filters, and extracts all the underlying live data files up to the latest snapshot. The data files can be in Parquet, ORC, or Avro format. They typically reside in a warehouse location, which can be in HDFS, S3, or the local filesystem. The druid-iceberg-extensions extension relies on the existing input source connectors in Druid to read the data files from the warehouse. The Iceberg input source therefore acts as an intermediate input source that provides file paths to other input source implementations.
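As a rough sketch of how these pieces fit together, an Iceberg input source in an ingestion spec might look like the following. The icebergCatalog, catalogProperties, and warehouseSource objects are described on this page. The tableName, namespace, warehousePath, and catalogUri field names, the hive catalog type, and all values shown are assumptions for illustration, so check the input source reference for the exact schema.

```json
{
  "type": "iceberg",
  "tableName": "example_table",
  "namespace": "example_namespace",
  "icebergCatalog": {
    "type": "hive",
    "warehousePath": "hdfs://namenode.example.com/path/to/warehouse",
    "catalogUri": "thrift://hive-metastore.example.com:9083"
  },
  "warehouseSource": {
    "type": "hdfs"
  }
}
```

In this sketch, the catalog resolves the table to a set of data file paths, and the warehouseSource object selects the existing Druid input source that reads those files.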

Hive metastore catalog

For Druid to seamlessly talk to the Hive metastore, ensure that the Hive configuration files such as hive-site.xml and core-site.xml are available in the Druid classpath for peon processes.
You can also specify Hive properties under the catalogProperties object in the ingestion spec.

The druid-iceberg-extensions extension currently supports only HDFS, S3, and local warehouse directories.

Read from HDFS warehouse

To read from an HDFS warehouse, load the druid-hdfs-storage extension. Druid extracts data file paths from the Hive metastore catalog and uses the HDFS input source to ingest these files. Set the type property of the warehouseSource object to hdfs in the ingestion spec.

For authenticating with Kerberized clusters, include principal and keytab properties in the catalogProperties object:

"catalogProperties": {
  "principal": "krb_principal",
  "keytab": "/path/to/keytab"
}

Only Kerberos-based authentication is currently supported.

Read from S3 warehouse

To read from an S3 warehouse, load the druid-s3-extensions extension. Druid extracts the data file paths from the Hive metastore catalog and uses S3InputSource to ingest these files. Set the type property of the warehouseSource object to s3 in the ingestion spec. If the S3 endpoint for the warehouse is different from the endpoint configured for deep storage, include the following properties in the warehouseSource object to define the S3 endpoint settings:

"warehouseSource": {
  "type": "s3",
  "endpointConfig": {
    "url": "S3_ENDPOINT_URL",
    "signingRegion": "us-east-1"
  },
  "clientConfig": {
    "protocol": "http",
    "disableChunkedEncoding": true,
    "enablePathStyleAccess": true,
    "forceGlobalBucketAccessEnabled": false
  },
  "properties": {
    "accessKeyId": {
      "type": "default",
      "password": "<ACCESS_KEY_ID>"
    },
    "secretAccessKey": {
      "type": "default",
      "password": "<SECRET_ACCESS_KEY>"
    }
  }
}

This extension uses the Hadoop AWS module to connect to S3 and retrieve the metadata and data file paths. The following properties are required in the catalogProperties object:

"catalogProperties": {
  "fs.s3a.access.key" : "S3_ACCESS_KEY",
  "fs.s3a.secret.key" : "S3_SECRET_KEY",
  "fs.s3a.endpoint" : "S3_API_ENDPOINT"
}

Since the Hadoop AWS connector uses the s3a filesystem client, specify the warehouse path with the s3a:// protocol instead of s3://.
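For illustration, a Hive catalog definition for an S3 warehouse might combine these settings as in the sketch below. Only the catalogProperties keys and the s3a:// scheme come from the text above. The hive catalog type, the warehousePath and catalogUri field names, the bucket name, and the metastore address are assumptions, so verify them against the input source reference before use.

```json
{
  "icebergCatalog": {
    "type": "hive",
    "warehousePath": "s3a://example-bucket/path/to/warehouse",
    "catalogUri": "thrift://hive-metastore.example.com:9083",
    "catalogProperties": {
      "fs.s3a.access.key": "S3_ACCESS_KEY",
      "fs.s3a.secret.key": "S3_SECRET_KEY",
      "fs.s3a.endpoint": "S3_API_ENDPOINT"
    }
  }
}
```

Note that the warehouse path uses s3a:// even though the warehouseSource type is s3. The catalog lookup goes through the Hadoop client, while the data files are read by the Druid S3 input source.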

Local catalog

Use the local catalog type for catalogs configured on the local filesystem. Set the icebergCatalog type to local. This catalog is suitable for demos or localized tests, but it is not recommended for production use cases. Set the warehouseSource type to local because this catalog only supports reading from a local filesystem.
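A minimal sketch of an input source that uses the local catalog could look like this. The local values for the icebergCatalog and warehouseSource types follow the description above. The tableName, namespace, and warehousePath field names and the path shown are assumptions for illustration.

```json
{
  "type": "iceberg",
  "tableName": "example_table",
  "namespace": "example_namespace",
  "icebergCatalog": {
    "type": "local",
    "warehousePath": "/tmp/iceberg/warehouse"
  },
  "warehouseSource": {
    "type": "local"
  }
}
```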

Known limitations

This section lists the known limitations that apply to the Iceberg extension.

  • This extension does not fully use Iceberg features such as snapshotting or schema evolution.
  • The Iceberg input source reads every live file in the Iceberg table up to the latest snapshot, which slows down the table scan. To limit the number of data files retrieved, use Iceberg filters on partition columns in the ingestion spec. Because Druid doesn't store the ID of the last ingested Iceberg snapshot, it cannot identify the files created between that snapshot and the latest snapshot.
  • The extension does not handle Iceberg schema evolution yet. If an existing Iceberg table column is deleted and recreated with the same name, ingesting the table into Druid may include the data that the column held before it was deleted.
  • The Hive catalog has not been tested on Hadoop 2.x.x and is not guaranteed to work with Hadoop 2.
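
To illustrate the partition-column filtering recommended in the list above, the following sketch restricts ingestion to one day of data. It assumes the input source accepts an icebergFilter object with an interval filter type, and that the table is partitioned on a hypothetical event_time column. The field names and values are assumptions, so consult the input source reference for the supported filters.

```json
{
  "icebergFilter": {
    "type": "interval",
    "filterColumn": "event_time",
    "intervals": [
      "2023-05-10T00:00:00.000Z/2023-05-11T00:00:00.000Z"
    ]
  }
}
```

A filter like this lets Iceberg prune data files by partition before Druid receives the file list, rather than scanning every live file in the table.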