<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
| Generated by Apache Maven Doxia at 2023-04-06
| Rendered using Apache Maven Stylus Skin 1.5
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop Azure Data Lake support &#x2013; Hadoop Azure Data Lake Support</title>
<style type="text/css" media="all">
@import url("./css/maven-base.css");
@import url("./css/maven-theme.css");
@import url("./css/site.css");
</style>
<link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20230406" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="banner">
<a href="http://hadoop.apache.org/" id="bannerLeft">
<img src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" />
</a>
<a href="http://www.apache.org/" id="bannerRight">
<img src="http://www.apache.org/images/asf_logo_wide.png" alt="" />
</a>
<div class="clear">
<hr/>
</div>
</div>
<div id="breadcrumbs">
<div class="xright"> <a href="http://wiki.apache.org/hadoop" class="externalLink">Wiki</a>
|
<a href="https://gitbox.apache.org/repos/asf/hadoop.git" class="externalLink">git</a>
&nbsp;| Last Published: 2023-04-06
&nbsp;| Version: 3.4.0-SNAPSHOT
</div>
<div class="clear">
<hr/>
</div>
</div>
<div id="leftColumn">
<div id="navcolumn">
<h5>General</h5>
<ul>
<li class="none">
<a href="../index.html">Overview</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/SingleCluster.html">Single Node Setup</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/ClusterSetup.html">Cluster Setup</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/CommandsManual.html">Commands Reference</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/FileSystemShell.html">FileSystem Shell</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/Compatibility.html">Compatibility Specification</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/DownstreamDev.html">Downstream Developer's Guide</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/AdminCompatibilityGuide.html">Admin Compatibility Guide</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/InterfaceClassification.html">Interface Classification</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/filesystem/index.html">FileSystem Specification</a>
</li>
</ul>
<h5>Common</h5>
<ul>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/CLIMiniCluster.html">CLI Mini Cluster</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/FairCallQueue.html">Fair Call Queue</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/NativeLibraries.html">Native Libraries</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/Superusers.html">Proxy User</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/RackAwareness.html">Rack Awareness</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/SecureMode.html">Secure Mode</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/ServiceLevelAuth.html">Service Level Authorization</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/HttpAuthentication.html">HTTP Authentication</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html">Credential Provider API</a>
</li>
<li class="none">
<a href="../hadoop-kms/index.html">Hadoop KMS</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/Tracing.html">Tracing</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/UnixShellGuide.html">Unix Shell Guide</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/registry/index.html">Registry</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/AsyncProfilerServlet.html">Async Profiler</a>
</li>
</ul>
<h5>HDFS</h5>
<ul>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">Architecture</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">User Guide</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HDFSCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html">NameNode HA With QJM</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html">NameNode HA With NFS</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/ObserverNameNode.html">Observer NameNode</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/Federation.html">Federation</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/ViewFs.html">ViewFs</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/ViewFsOverloadScheme.html">ViewFsOverloadScheme</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html">Snapshots</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html">Edits Viewer</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html">Image Viewer</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html">Permissions and HDFS</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html">Quotas and HDFS</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/LibHdfs.html">libhdfs (C API)</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/WebHDFS.html">WebHDFS (REST API)</a>
</li>
<li class="none">
<a href="../hadoop-hdfs-httpfs/index.html">HttpFS</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html">Short Circuit Local Reads</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html">Centralized Cache Management</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html">NFS Gateway</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html">Rolling Upgrade</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/ExtendedAttributes.html">Extended Attributes</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html">Transparent Encryption</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html">Multihoming</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html">Storage Policies</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/MemoryStorage.html">Memory Storage Support</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/SLGUserGuide.html">Synthetic Load Generator</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html">Erasure Coding</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html">Disk Balancer</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html">Upgrade Domain</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html">DataNode Admin</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html">Router Federation</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsProvidedStorage.html">Provided Storage</a>
</li>
</ul>
<h5>MapReduce</h5>
<ul>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Tutorial</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html">Compatibility with 1.x</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html">Encrypted Shuffle</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html">Pluggable Shuffle/Sort</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html">Distributed Cache Deploy</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/SharedCacheSupport.html">Support for YARN Shared Cache</a>
</li>
</ul>
<h5>MapReduce REST APIs</h5>
<ul>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredAppMasterRest.html">MR Application Master</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html">MR History Server</a>
</li>
</ul>
<h5>YARN</h5>
<ul>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/YARN.html">Architecture</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/YarnCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html">Capacity Scheduler</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/FairScheduler.html">Fair Scheduler</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html">ResourceManager Restart</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html">ResourceManager HA</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/ResourceModel.html">Resource Model</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/NodeLabel.html">Node Labels</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/NodeAttributes.html">Node Attributes</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html">Web Application Proxy</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/TimelineServer.html">Timeline Server</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html">Timeline Service V.2</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html">Writing YARN Applications</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html">YARN Application Security</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/NodeManager.html">NodeManager</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/DockerContainers.html">Running Applications in Docker Containers</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/RuncContainers.html">Running Applications in runC Containers</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html">Using CGroups</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/SecureContainer.html">Secure Containers</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/ReservationSystem.html">Reservation System</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html">Graceful Decommission</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html">Opportunistic Containers</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/Federation.html">YARN Federation</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/SharedCache.html">Shared Cache</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/UsingGpus.html">Using GPU</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/UsingFPGA.html">Using FPGA</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html">Placement Constraints</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/YarnUI2.html">YARN UI2</a>
</li>
</ul>
<h5>YARN REST APIs</h5>
<ul>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html">Introduction</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html">Resource Manager</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html">Node Manager</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/TimelineServer.html#Timeline_Server_REST_API_v1">Timeline Server</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html#Timeline_Service_v.2_REST_API">Timeline Service V.2</a>
</li>
</ul>
<h5>YARN Service</h5>
<ul>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html">Overview</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html">QuickStart</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/yarn-service/Concepts.html">Concepts</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html">Yarn Service API</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/yarn-service/ServiceDiscovery.html">Service Discovery</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/yarn-service/SystemServices.html">System Services</a>
</li>
</ul>
<h5>Hadoop Compatible File Systems</h5>
<ul>
<li class="none">
<a href="../hadoop-aliyun/tools/hadoop-aliyun/index.html">Aliyun OSS</a>
</li>
<li class="none">
<a href="../hadoop-aws/tools/hadoop-aws/index.html">Amazon S3</a>
</li>
<li class="none">
<a href="../hadoop-azure/index.html">Azure Blob Storage</a>
</li>
<li class="none">
<a href="../hadoop-azure-datalake/index.html">Azure Data Lake Storage</a>
</li>
<li class="none">
<a href="../hadoop-cos/cloud-storage/index.html">Tencent COS</a>
</li>
<li class="none">
<a href="../hadoop-huaweicloud/cloud-storage/index.html">Huaweicloud OBS</a>
</li>
</ul>
<h5>Auth</h5>
<ul>
<li class="none">
<a href="../hadoop-auth/index.html">Overview</a>
</li>
<li class="none">
<a href="../hadoop-auth/Examples.html">Examples</a>
</li>
<li class="none">
<a href="../hadoop-auth/Configuration.html">Configuration</a>
</li>
<li class="none">
<a href="../hadoop-auth/BuildingIt.html">Building</a>
</li>
</ul>
<h5>Tools</h5>
<ul>
<li class="none">
<a href="../hadoop-streaming/HadoopStreaming.html">Hadoop Streaming</a>
</li>
<li class="none">
<a href="../hadoop-archives/HadoopArchives.html">Hadoop Archives</a>
</li>
<li class="none">
<a href="../hadoop-archive-logs/HadoopArchiveLogs.html">Hadoop Archive Logs</a>
</li>
<li class="none">
<a href="../hadoop-distcp/DistCp.html">DistCp</a>
</li>
<li class="none">
<a href="../hadoop-federation-balance/HDFSFederationBalance.html">HDFS Federation Balance</a>
</li>
<li class="none">
<a href="../hadoop-gridmix/GridMix.html">GridMix</a>
</li>
<li class="none">
<a href="../hadoop-rumen/Rumen.html">Rumen</a>
</li>
<li class="none">
<a href="../hadoop-resourceestimator/ResourceEstimator.html">Resource Estimator Service</a>
</li>
<li class="none">
<a href="../hadoop-sls/SchedulerLoadSimulator.html">Scheduler Load Simulator</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/Benchmarking.html">Hadoop Benchmarking</a>
</li>
<li class="none">
<a href="../hadoop-dynamometer/Dynamometer.html">Dynamometer</a>
</li>
</ul>
<h5>Reference</h5>
<ul>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/release/">Changelog and Release Notes</a>
</li>
<li class="none">
<a href="../api/index.html">Java API docs</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/UnixShellAPI.html">Unix Shell API</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/Metrics.html">Metrics</a>
</li>
</ul>
<h5>Configuration</h5>
<ul>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/core-default.xml">core-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/hdfs-default.xml">hdfs-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs-rbf/hdfs-rbf-default.xml">hdfs-rbf-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml">mapred-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-common/yarn-default.xml">yarn-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-kms/kms-default.html">kms-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-hdfs-httpfs/httpfs-default.html">httpfs-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/DeprecatedProperties.html">Deprecated Properties</a>
</li>
</ul>
<a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
<img alt="Built by Maven" src="./images/logos/maven-feather.png"/>
</a>
</div>
</div>
<div id="bodyColumn">
<div id="contentBox">
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<h1>Hadoop Azure Data Lake Support</h1>
<ul>
<li><a href="#Introduction">Introduction</a>
<ul>
<li><a href="#Related_Documents">Related Documents</a></li></ul></li>
<li><a href="#Features">Features</a></li>
<li><a href="#Limitations">Limitations</a></li>
<li><a href="#Usage">Usage</a>
<ul>
<li><a href="#Concepts">Concepts</a>
<ul>
<li><a href="#OAuth2_Support">OAuth2 Support</a></li></ul></li>
<li><a href="#Configuring_Credentials_and_FileSystem">Configuring Credentials and FileSystem</a>
<ul>
<li><a href="#Using_Refresh_Tokens">Using Refresh Tokens</a></li>
<li><a href="#Using_Client_Keys">Using Client Keys</a></li>
<li><a href="#Using_MSI_.28Managed_Service_Identity.29">Using MSI (Managed Service Identity)</a></li></ul></li>
<li><a href="#Using_Device_Code_Auth_for_interactive_login">Using Device Code Auth for interactive login</a>
<ul>
<li><a href="#Protecting_the_Credentials_with_Credential_Providers">Protecting the Credentials with Credential Providers</a></li></ul></li>
<li><a href="#Accessing_adl_URLs">Accessing adl URLs</a></li>
<li><a href="#User.2FGroup_Representation">User/Group Representation</a></li></ul></li>
<li><a href="#Configurations_for_different_ADL_accounts">Configurations for different ADL accounts</a></li>
<li><a href="#Testing_the_azure-datalake-store_Module">Testing the azure-datalake-store Module</a></li></ul>
<section>
<h2><a name="Introduction"></a>Introduction</h2>
<p>The <code>hadoop-azure-datalake</code> module provides support for integration with the <a class="externalLink" href="https://azure.microsoft.com/en-in/documentation/services/data-lake-store/">Azure Data Lake Store</a>. This support comes via the JAR file <code>azure-datalake-store.jar</code>.</p><section>
<h3><a name="Related_Documents"></a>Related Documents</h3>
<ul>
<li><a href="troubleshooting_adl.html">Troubleshooting</a>.</li>
</ul></section></section><section>
<h2><a name="Features"></a>Features</h2>
<ul>
<li>Read and write data stored in an Azure Data Lake Storage account.</li>
<li>Reference file system paths using URLs using the <code>adl</code> scheme for Secure Webhdfs i.e. SSL encrypted access.</li>
<li>Can act as a source of data in a MapReduce job, or a sink.</li>
<li>Tested on both Linux and Windows.</li>
<li>Tested for scale.</li>
<li>API <code>setOwner()</code>, <code>setAcl</code>, <code>removeAclEntries()</code>, <code>modifyAclEntries()</code> accepts UPN or OID (Object ID) as user and group names.</li>
<li>Supports per-account configuration.</li>
</ul></section><section>
<h2><a name="Limitations"></a>Limitations</h2>
<p>Partial or no support for the following operations :</p>
<ul>
<li>Operation on Symbolic Links</li>
<li>Proxy Users</li>
<li>File Truncate</li>
<li>File Checksum</li>
<li>File replication factor</li>
<li>Home directory the active user on Hadoop cluster.</li>
<li>Extended Attributes(XAttrs) Operations</li>
<li>Snapshot Operations</li>
<li>Delegation Token Operations</li>
<li>User and group information returned as <code>listStatus()</code> and <code>getFileStatus()</code> is in the form of the GUID associated in Azure Active Directory.</li>
</ul></section><section>
<h2><a name="Usage"></a>Usage</h2><section>
<h3><a name="Concepts"></a>Concepts</h3>
<p>Azure Data Lake Storage access path syntax is:</p>
<div class="source">
<div class="source">
<pre>adl://&lt;Account Name&gt;.azuredatalakestore.net/
</pre></div></div>
<p>For details on using the store, see <a class="externalLink" href="https://azure.microsoft.com/en-in/documentation/articles/data-lake-store-get-started-portal/"><b>Get started with Azure Data Lake Store using the Azure Portal</b></a></p><section>
<h4><a name="OAuth2_Support"></a>OAuth2 Support</h4>
<p>Usage of Azure Data Lake Storage requires an OAuth2 bearer token to be present as part of the HTTPS header as per the OAuth2 specification. A valid OAuth2 bearer token must be obtained from the Azure Active Directory service for those valid users who have access to Azure Data Lake Storage Account.</p>
<p>Azure Active Directory (Azure AD) is Microsoft&#x2019;s multi-tenant cloud based directory and identity management service. See <a class="externalLink" href="https://azure.microsoft.com/en-in/documentation/articles/active-directory-whatis/"><i>What is ActiveDirectory</i></a>.</p>
<p>Following sections describes theOAuth2 configuration in <code>core-site.xml</code>.</p></section></section><section>
<h3><a name="Configuring_Credentials_and_FileSystem"></a>Configuring Credentials and FileSystem</h3>
<p>Credentials can be configured using either a refresh token (associated with a user), or a client credential (analogous to a service principal).</p><section>
<h4><a name="Using_Refresh_Tokens"></a>Using Refresh Tokens</h4>
<p>Add the following properties to the cluster&#x2019;s <code>core-site.xml</code></p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;fs.adl.oauth2.access.token.provider.type&lt;/name&gt;
&lt;value&gt;RefreshToken&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>Applications must set the Client id and OAuth2 refresh token from the Azure Active Directory service associated with the client id. See <a class="externalLink" href="https://github.com/AzureAD/azure-activedirectory-library-for-java"><i>Active Directory Library For Java</i></a>.</p>
<p><b>Do not share client id and refresh token, it must be kept secret.</b></p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;fs.adl.oauth2.client.id&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.adl.oauth2.refresh.token&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
</section><section>
<h4><a name="Using_Client_Keys"></a>Using Client Keys</h4><section>
<h5><a name="Generating_the_Service_Principal"></a>Generating the Service Principal</h5>
<ol style="list-style-type: decimal">
<li>Go to <a class="externalLink" href="https://portal.azure.com">the portal</a></li>
<li>Under services in left nav, look for Azure Active Directory and click it.</li>
<li>Using &#x201c;App Registrations&#x201d; in the menu, create &#x201c;Web Application&#x201d;. Remember the name you create here - that is what you will add to your ADL account as authorized user.</li>
<li>Go through the wizard</li>
<li>Once app is created, go to &#x201c;keys&#x201d; under &#x201c;settings&#x201d; for the app</li>
<li>Select a key duration and hit save. Save the generated keys.</li>
<li>Go back to the App Registrations page, and click on the &#x201c;Endpoints&#x201d; button at the top a. Note down the &#x201c;Token Endpoint&#x201d; URL</li>
<li>Note down the properties you will need to auth:
<ul>
<li>The &#x201c;Application ID&#x201d; of the Web App you created above</li>
<li>The key you just generated above</li>
<li>The token endpoint</li>
</ul>
</li>
</ol></section><section>
<h5><a name="Adding_the_service_principal_to_your_ADL_Account"></a>Adding the service principal to your ADL Account</h5>
<ol style="list-style-type: decimal">
<li>Go to the portal again, and open your ADL account</li>
<li>Select <code>Access control (IAM)</code></li>
<li>Add your user name you created in Step 6 above (note that it does not show up in the list, but will be found if you searched for the name)</li>
<li>Add &#x201c;Owner&#x201d; role</li>
</ol></section><section>
<h5><a name="Configure_core-site.xml"></a>Configure core-site.xml</h5>
<p>Add the following properties to your <code>core-site.xml</code></p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;fs.adl.oauth2.access.token.provider.type&lt;/name&gt;
&lt;value&gt;ClientCredential&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.adl.oauth2.refresh.url&lt;/name&gt;
&lt;value&gt;TOKEN ENDPOINT FROM STEP 7 ABOVE&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.adl.oauth2.client.id&lt;/name&gt;
&lt;value&gt;CLIENT ID FROM STEP 7 ABOVE&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.adl.oauth2.credential&lt;/name&gt;
&lt;value&gt;PASSWORD FROM STEP 7 ABOVE&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
</section></section><section>
<h4><a name="Using_MSI_.28Managed_Service_Identity.29"></a>Using MSI (Managed Service Identity)</h4>
<p>Azure VMs can be provisioned with &#x201c;service identities&#x201d; that are managed by the Identity extension within the VM. The advantage of doing this is that the credentials are managed by the extension, and do not have to be put into core-site.xml.</p>
<p>To use MSI, modify the VM deployment template to use the identity extension. Note the port number you specified in the template: this is the port number for the REST endpoint of the token service exposed to localhost by the identity extension in the VM. The default recommended port number is 50342 - if the recommended port number is used, then the msi.port setting below can be omitted in the configuration.</p><section>
<h5><a name="Configure_core-site.xml"></a>Configure core-site.xml</h5>
<p>Add the following properties to your <code>core-site.xml</code></p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;fs.adl.oauth2.access.token.provider.type&lt;/name&gt;
&lt;value&gt;Msi&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.adl.oauth2.msi.port&lt;/name&gt;
&lt;value&gt;PORT NUMBER FROM ABOVE (if different from the default of 50342)&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
</section></section></section><section>
<h3><a name="Using_Device_Code_Auth_for_interactive_login"></a>Using Device Code Auth for interactive login</h3>
<p><b>Note:</b> This auth method is suitable for running interactive tools, but will not work for jobs submitted to a cluster.</p>
<p>To use user-based login, Azure ActiveDirectory provides login flow using device code.</p>
<p>To use device code flow, user must first create a <b>Native</b> app registration in the Azure portal, and provide the client ID for the app as a config. Here are the steps:</p>
<ol style="list-style-type: decimal">
<li>Go to <a class="externalLink" href="https://portal.azure.com">the portal</a></li>
<li>Under services in left nav, look for Azure Active Directory and click on it.</li>
<li>Using &#x201c;App Registrations&#x201d; in the menu, create &#x201c;Native Application&#x201d;.</li>
<li>Go through the wizard</li>
<li>Once app is created, note down the &#x201c;Appplication ID&#x201d; of the app</li>
<li>Grant permissions to the app:
<ol style="list-style-type: decimal">
<li>Click on &#x201c;Permissions&#x201d; for the app, and then add &#x201c;Azure Data Lake&#x201d; and &#x201c;Windows Azure Service Management API&#x201d; permissions</li>
<li>Click on &#x201c;Grant Permissions&#x201d; to add the permissions to the app</li>
</ol>
</li>
</ol>
<p>Add the following properties to your <code>core-site.xml</code></p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;fs.adl.oauth2.devicecode.clientappid&lt;/name&gt;
&lt;value&gt;APP ID FROM STEP 5 ABOVE&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>It is usually not desirable to add DeviceCode as the default token provider type. But it can be used when using a local command:</p>
<div class="source">
<div class="source">
<pre> hadoop fs -Dfs.adl.oauth2.access.token.provider.type=DeviceCode -ls ...
</pre></div></div>
<p>Running this will print a URL and device code that can be used to login from any browser (even on a different machine, outside of the ssh session). Once the login is done, the command continues.</p><section>
<h4><a name="Protecting_the_Credentials_with_Credential_Providers"></a>Protecting the Credentials with Credential Providers</h4>
<p>In many Hadoop clusters, the <code>core-site.xml</code> file is world-readable. To protect these credentials, it is recommended that you use the credential provider framework to securely store them and access them.</p>
<p>All ADLS credential properties can be protected by credential providers. For additional reading on the credential provider API, see <a href="../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html">Credential Provider API</a>.</p><section>
<h5><a name="Provisioning"></a>Provisioning</h5>
<div class="source">
<div class="source">
<pre>hadoop credential create fs.adl.oauth2.client.id -value 123
-provider localjceks://file/home/foo/adls.jceks
hadoop credential create fs.adl.oauth2.refresh.token -value 123
-provider localjceks://file/home/foo/adls.jceks
</pre></div></div>
</section><section>
<h5><a name="Configuring_core-site.xml_or_command_line_property"></a>Configuring core-site.xml or command line property</h5>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;fs.adl.oauth2.access.token.provider.type&lt;/name&gt;
&lt;value&gt;RefreshToken&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;hadoop.security.credential.provider.path&lt;/name&gt;
&lt;value&gt;localjceks://file/home/foo/adls.jceks&lt;/value&gt;
&lt;description&gt;Path to interrogate for protected credentials.&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
</section><section>
<h5><a name="Running_DistCp"></a>Running DistCp</h5>
<div class="source">
<div class="source">
<pre>hadoop distcp
[-D fs.adl.oauth2.access.token.provider.type=RefreshToken
-D hadoop.security.credential.provider.path=localjceks://file/home/user/adls.jceks]
hdfs://&lt;NameNode Hostname&gt;:9001/user/foo/srcDir
adl://&lt;Account Name&gt;.azuredatalakestore.net/tgtDir/
</pre></div></div>
<p>NOTE: You may optionally add the provider path property to the <code>distcp</code> command line instead of added job specific configuration to a generic <code>core-site.xml</code>. The square brackets above illustrate this capability.`</p></section></section></section><section>
<h3><a name="Accessing_adl_URLs"></a>Accessing adl URLs</h3>
<p>After credentials are configured in <code>core-site.xml</code>, any Hadoop component may reference files in that Azure Data Lake Storage account by using URLs of the following format:</p>
<div class="source">
<div class="source">
<pre>adl://&lt;Account Name&gt;.azuredatalakestore.net/&lt;path&gt;
</pre></div></div>
<p>The schemes <code>adl</code> identifies a URL on a Hadoop-compatible file system backed by Azure Data Lake Storage. <code>adl</code> utilizes encrypted HTTPS access for all interaction with the Azure Data Lake Storage API.</p>
<p>For example, the following <a href="../hadoop-project-dist/hadoop-common/FileSystemShell.html">FileSystem Shell</a> commands demonstrate access to a storage account named <code>youraccount</code>.</p>
<div class="source">
<div class="source">
<pre>hadoop fs -mkdir adl://yourcontainer.azuredatalakestore.net/testDir
hadoop fs -put testFile adl://yourcontainer.azuredatalakestore.net/testDir/testFile
hadoop fs -cat adl://yourcontainer.azuredatalakestore.net/testDir/testFile
test file content
</pre></div></div>
</section><section>
<h3><a name="User.2FGroup_Representation"></a>User/Group Representation</h3>
<p>The <code>hadoop-azure-datalake</code> module provides support for configuring how User/Group information is represented during <code>getFileStatus()</code>, <code>listStatus()</code>, and <code>getAclStatus()</code> calls..</p>
<p>Add the following properties to <code>core-site.xml</code></p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;adl.feature.ownerandgroup.enableupn&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;description&gt;
When true : User and Group in FileStatus/AclStatus response is
represented as user friendly name as per Azure AD profile.
When false (default) : User and Group in FileStatus/AclStatus
response is represented by the unique identifier from Azure AD
profile (Object ID as GUID).
For performance optimization, Recommended default value.
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
</section></section><section>
<h2><a name="Configurations_for_different_ADL_accounts"></a>Configurations for different ADL accounts</h2>
<p>Different ADL accounts can be accessed with different ADL client configurations. This also allows for different login details.</p>
<ol style="list-style-type: decimal">
<li>All <code>fs.adl</code> options can be set on a per account basis.</li>
<li>The account specific option is set by replacing the <code>fs.adl.</code> prefix on an option with <code>fs.adl.account.ACCOUNTNAME.</code>, where <code>ACCOUNTNAME</code> is the name of the account.</li>
<li>When connecting to an account, all options explicitly set will override the base <code>fs.adl.</code> values.</li>
</ol>
<p>As an example, a configuration could have a base configuration to use the public account <code>adl://&lt;some-public-account&gt;.azuredatalakestore.net/</code> and an account-specific configuration to use some private account <code>adl://myprivateaccount.azuredatalakestore.net/</code></p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;fs.adl.oauth2.client.id&lt;/name&gt;
&lt;value&gt;CLIENTID&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.adl.oauth2.credential&lt;/name&gt;
&lt;value&gt;CREDENTIAL&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.adl.account.myprivateaccount.oauth2.client.id&lt;/name&gt;
&lt;value&gt;CLIENTID1&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.adl.account.myprivateaccount.oauth2.credential&lt;/name&gt;
&lt;value&gt;CREDENTIAL1&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
</section><section>
<h2><a name="Testing_the_azure-datalake-store_Module"></a>Testing the azure-datalake-store Module</h2>
<p>The <code>hadoop-azure</code> module includes a full suite of unit tests. Most of the tests will run without additional configuration by running <code>mvn test</code>. This includes tests against mocked storage, which is an in-memory emulation of Azure Data Lake Storage.</p>
<p>A selection of tests can run against the Azure Data Lake Storage. To run these tests, please create <code>src/test/resources/auth-keys.xml</code> with Adl account information mentioned in the above sections and the following properties.</p>
<div class="source">
<div class="source">
<pre>&lt;property&gt;
&lt;name&gt;fs.adl.test.contract.enable&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;test.fs.adl.name&lt;/name&gt;
&lt;value&gt;adl://yourcontainer.azuredatalakestore.net&lt;/value&gt;
&lt;/property&gt;
</pre></div></div></section>
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">
&#169; 2008-2023
Apache Software Foundation
- <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
</div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>