17 KiB
Integeration of Tencent COS in Hadoop
Introduction
Tencent COS is a famous object storage system provided by Tencent Corp. Hadoop-COS is a client that makes the upper computing systems based on HDFS be able to use the COS as its underlying storage system. The big data-processing systems that have been identified for support are: Hadoop MR, Spark, Alluxio and etc. In addition, Druid also can use COS as its deep storage by configuring HDFS-Load-Plugin integerating with HADOOP-COS.
Features
-
Support Hadoop MapReduce and Spark write data into COS and read from it directly.
-
Implements the interfaces of the Hadoop file system and provides the pseudo-hierarchical directory structure same as HDFS.
-
Supports multipart uploads for a large file. Single file supports up to 19TB
-
High performance and high availability. The performance difference between Hadoop-COS and HDFS is not more than 30%.
Notes:
Object Storage is not a file system and it has some limitations:
- Object storage is a key-value storage and it does not support hierarchical directory naturally. Usually, using the directory separatory in object key to simulate the hierarchical directory, such as "/hadoop/data/words.dat".
- COS Object storage can not support the object's append operation currently. It means that you can not append content to the end of an existing object(file).
- Both
delete
andrename
operations are non-atomic, which means that the operations are interrupted, the operation result may be inconsistent state.
- Object storages have different authorization models:
- Directory permissions are reported as 777.
- File permissions are reported as 666.
- File owner is reported as the local current user.
- File group is also reported as the local current user.
- Supports multipart uploads for a large file(up to 40TB), but the number of part is limited as 10000.
- The number of files listed each time is limited to 1000.
Quick Start
Concepts
-
Bucket: A container for storing data in COS. Its name is made up of user-defined bucketname and user appid.
-
Appid: Unique resource identifier for the user dimension.
-
SecretId: ID used to authenticate the user
-
SecretKey: Key used to authenticate the user
-
Region: The region where a bucket locates.
-
CosN: Hadoop-COS uses
cosn
as its URI scheme, so CosN is often used to refer to Hadoop-COS.
Usage
System Requirements
Linux kernel 2.6+
Dependencies
- cos_api (version 5.4.10 or later )
- cos-java-sdk (version 2.0.6 recommended)
- joda-time (version 2.9.9 recommended)
- httpClient (version 4.5.1 or later recommended)
- Jackson: jackson-core, jackson-databind, jackson-annotations (version 2.9.8 or later)
- bcprov-jdk15on (version 1.59 recommended)
Configure Properties
URI and Region Properties
If you plan to use COS as the default file system for Hadoop or other big data systems, you need to configure fs.defaultFS
as the URI of Hadoop-COS in core-site.xml. Hadoop-COS uses cosn
as its URI scheme, and the bucket as its URI host. At the same time, you need to explicitly set fs.cosn.userinfo.region
to indicate the region your bucket locates.
NOTE:
-
For Hadoop-COS,
fs.defaultFS
is an option. If you are only temporarily using the COS as a data source for Hadoop, you do not need to set the property, just specify the full URI when you use it. For example:hadoop fs -ls cosn://testBucket-125236746/testDir/test.txt
. -
fs.cosn.userinfo.region
is an required property for Hadoop-COS. The reason is that Hadoop-COS must know the region of the using bucket in order to accurately construct a URL to access it. -
COS supports multi-region storage, and different regions have different access domains by default. It is recommended to choose the nearest storage region according to your own business scenarios, so as to improve the object upload and download speed. You can find the available region from https://intl.cloud.tencent.com/document/product/436/6224
The following is an example for the configuration format:
<property>
<name>fs.defaultFS</name>
<value>cosn://<bucket-appid></value>
<description>
Optional: If you don't want to use CosN as the default file system, you don't need to configure it.
</description>
</property>
<property>
<name>fs.cosn.bucket.region</name>
<value>ap-xxx</value>
<description>The region where the bucket is located</description>
</property>
User Authentication Properties
Each user needs to properly configure the credentials ( User's secreteId and secretKey ) properly to access the object stored in COS. These credentials can be obtained from the official console provided by Tencent Cloud.
<property>
<name>fs.cosn.credentials.provider</name>
<value>org.apache.hadoop.fs.auth.SimpleCredentialProvider</value>
<description>
This option allows the user to specify how to get the credentials.
Comma-separated class names of credential provider classes which implement
com.qcloud.cos.auth.COSCredentialsProvider:
1.org.apache.hadoop.fs.auth.SimpleCredentialProvider: Obtain the secret id and secret key
from fs.cosn.userinfo.secretId and fs.cosn.userinfo.secretKey in core-site.xml
2.org.apache.hadoop.fs.auth.EnvironmentVariableCredentialProvider: Obtain the secret id and secret key from system environment variables named COS_SECRET_ID and COS_SECRET_KEY
If unspecified, the default order of credential providers is:
1. org.apache.hadoop.fs.auth.SimpleCredentialProvider
2. org.apache.hadoop.fs.auth.EnvironmentVariableCredentialProvider
</description>
</property>
<property>
<name>fs.cosn.userinfo.secretId</name>
<value>xxxxxxxxxxxxxxxxxxxxxxxxx</value>
<description>Tencent Cloud Secret Id </description>
</property>
<property>
<name>fs.cosn.userinfo.secretKey</name>
<value>xxxxxxxxxxxxxxxxxxxxxxxx</value>
<description>Tencent Cloud Secret Key</description>
</property>
Integration Properties
You need to explicitly specify the A and B options in order for Hadoop to properly integrate the COS as the underlying file system
Only correctly set fs.cosn.impl
and fs.AbstractFileSystem.cosn.impl
to enable Hadoop to integrate COS as its underlying file system. fs.cosn.impl
must be set as org.apache.hadoop.fs.cos.CosFileSystem
and fs.AbstractFileSystem.cosn.impl
must be set as org.apache.hadoop.fs.cos.CosN
.
<property>
<name>fs.cosn.impl</name>
<value>org.apache.hadoop.fs.cosn.CosNFileSystem</value>
<description>The implementation class of the CosN Filesystem</description>
</property>
<property>
<name>fs.AbstractFileSystem.cosn.impl</name>
<value>org.apache.hadoop.fs.cos.CosN</value>
<description>The implementation class of the CosN AbstractFileSystem.</description>
</property>
Other Runtime Properties
Hadoop-COS provides rich runtime properties to set, and most of these do not require custom values because a well-run default value provided for them.
It is important to note that:
-
Hadoop-COS will generate some temporary files and consumes some disk space. All temporary files would be placed in the directory specified by option
fs.cosn.tmp.dir
(Default: /tmp/hadoop_cos); -
The default block size is 8MB and it means that you can only upload a single file up to 78GB into the COS blob storage system. That is mainly due to the fact that the multipart-upload can only support up to 10,000 blocks. For this reason, if needing to support larger single files, you must increase the block size accordingly by setting the property
fs.cosn.block.size
. For example, the size of the largest single file is 1TB, the block size is at least greater than or equal to (1 * 1024 * 1024 * 1024 * 1024)/10000 = 109951163. Currently, the maximum support file is 19TB (block size: 2147483648)
<property>
<name>fs.cosn.tmp.dir</name>
<value>/tmp/hadoop_cos</value>
<description>Temporary files would be placed here.</description>
</property>
<property>
<name>fs.cosn.buffer.size</name>
<value>33554432</value>
<description>The total size of the buffer pool.</description>
</property>
<property>
<name>fs.cosn.block.size</name>
<value>8388608</value>
<description>
Block size to use cosn filesysten, which is the part size for MultipartUpload. Considering the COS supports up to 10000 blocks, user should estimate the maximum size of a single file. For example, 8MB part size can allow writing a 78GB single file.
</description>
</property>
<property>
<name>fs.cosn.maxRetries</name>
<value>3</value>
<description>
The maximum number of retries for reading or writing files to COS, before throwing a failure to the application.
</description>
</property>
<property>
<name>fs.cosn.retry.interval.seconds</name>
<value>3</value>
<description>The number of seconds to sleep between each COS retry.</description>
</property>
Properties Summary
properties | description | default value | required |
---|---|---|---|
fs.defaultFS | Configure the default file system used by Hadoop. | None | NO |
fs.cosn.credentials.provider | This option allows the user to specify how to get the credentials. Comma-separated class names of credential provider classes which implement com.qcloud.cos.auth.COSCredentialsProvider: 1. org.apache.hadoop.fs.cos.auth.SimpleCredentialProvider: Obtain the secret id and secret key from fs.cosn.userinfo.secretId and fs.cosn.userinfo.secretKey in core-site.xml; 2. org.apache.hadoop.fs.auth.EnvironmentVariableCredentialProvider: Obtain the secret id and secret key from system environment variables named COSN_SECRET_ID and COSN_SECRET_KEY . If unspecified, the default order of credential providers is: 1. org.apache.hadoop.fs.auth.SimpleCredentialProvider; 2. org.apache.hadoop.fs.auth.EnvironmentVariableCredentialProvider. |
None | NO |
fs.cosn.userinfo.secretId/secretKey | The API key information of your account | None | YES |
fs.cosn.bucket.region | The region where the bucket is located. | None | YES |
fs.cosn.impl | The implementation class of the CosN filesystem. | None | YES |
fs.AbstractFileSystem.cosn.impl | The implementation class of the CosN AbstractFileSystem. | None | YES |
fs.cosn.tmp.dir | Temporary files generated by cosn would be stored here during the program running. | /tmp/hadoop_cos | NO |
fs.cosn.buffer.size | The total size of the buffer pool. Require greater than or equal to block size. | 33554432 | NO |
fs.cosn.block.size | The size of file block. Considering the limitation that each file can be divided into a maximum of 10,000 to upload, the option must be set according to the maximum size of used single file. For example, 8MB part size can allow writing a 78GB single file. | 8388608 | NO |
fs.cosn.upload_thread_pool | Number of threads used for concurrent uploads when files are streamed to COS. | CPU core number * 3 | NO |
fs.cosn.read.ahead.block.size | The size of each read-ahead block. | 524288 (512KB) | NO |
fs.cosn.read.ahead.queue.size | The length of readahead queue. | 10 | NO |
fs.cosn.maxRetries | The maxium number of retries for reading or writing files to COS, before throwing a failure to the application. | 3 | NO |
fs.cosn.retry.interval.seconds | The number of seconds to sleep between each retry | 3 | NO |
Command Usage
Command format: hadoop fs -ls -R cosn://bucket-appid/<path>
or hadoop fs -ls -R /<path>
, the latter requires the defaultFs option to be set as cosn
.
Example
Use CosN as the underlying file system to run the WordCount routine:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-x.x.x.jar wordcount cosn://example/mr/input.txt cosn://example/mr/output
If setting CosN as the default file system for Hadoop, you can run it as follows:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-x.x.x.jar wordcount /mr/input.txt /mr/output
Testing the hadoop-cos Module
To test CosN filesystem, the following two files which pass in authentication details to the test runner are needed.
- auth-keys.xml
- core-site.xml
These two files need to be created under the hadoop-cloud-storage-project/hadoop-cos/src/test/resource
directory.
auth-key.xml
COS credentials can specified in auth-key.xml
. At the same time, it is also a trigger for the CosN filesystem tests.
COS bucket URL should be provided by specifying the option: test.fs.cosn.name
.
An example of the auth-keys.xml
is as follow:
<configuration>
<property>
<name>test.fs.cosn.name</name>
<value>cosn://testbucket-12xxxxxx</value>
</property>
<property>
<name>fs.cosn.bucket.region</name>
<value>ap-xxx</value>
<description>The region where the bucket is located</description>
</property>
<property>
<name>fs.cosn.userinfo.secretId</name>
<value>AKIDXXXXXXXXXXXXXXXXXXXX</value>
</property>
<property>
<name>fs.cosn.userinfo.secretKey</name>
<value>xxxxxxxxxxxxxxxxxxxxxxxxx</value>
</property>
</configuration>
Without this file, all tests in this module will be skipped.
core-site.xml
This file pre-exists and sources the configurations created in auth-keys.xml. For most cases, no modification is needed, unless a specific, non-default property needs to be set during the testing.
contract-test-options.xml
All configurations related to support contract tests need to be specified in contract-test-options.xml
. Here is an example of contract-test-options.xml
.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<include xmlns="http://www.w3.org/2001/XInclude"
href="auth-keys.xml"/>
<property>
<name>fs.contract.test.fs.cosn</name>
<value>cosn://testbucket-12xxxxxx</value>
</property>
<property>
<name>fs.cosn.bucket.region</name>
<value>ap-xxx</value>
<description>The region where the bucket is located</description>
</property>
</configuration>
If the option fs.contract.test.fs.cosn
not definded in the file, all contract tests will be skipped.
Other issues
Performance Loss
The IO performance of COS is lower than HDFS in principle, even on virtual clusters running on Tencent CVM.
The main reason can be attributed to the following points:
-
HDFS replicates data for faster query.
-
HDFS is significantly faster for many “metadata” operations: listing the contents of a directory, calling getFileStatus() on path, creating or deleting directories.
-
HDFS stores the data on the local hard disks, avoiding network traffic if the code can be executed on that host. But access to the object storing in COS requires access to network almost each time. It is a critical point in damaging IO performance. Hadoop-COS also do a lot of optimization work for it, such as the pre-read queue, the upload buffer pool, the concurrent upload thread pool, etc.
-
File IO performing many seek calls/positioned read calls will also encounter performance problems due to the size of the HTTP requests made. Despite the pre-read cache optimizations, a large number of random reads can still cause frequent network requests.
-
On HDFS, both the
rename
andmv
for a directory or a file are an atomic and O(1)-level operation, but in COS, the operation need to combinecopy
anddelete
sequentially. Therefore, performing rename and move operations on a COS object is not only low performance, but also difficult to guarantee data consistency.
At present, using the COS blob storage system through Hadoop-COS occurs about 20% ~ 25% performance loss compared to HDFS. But, the cost of using COS is lower than HDFS, which includes both storage and maintenance costs.