HBASE-21450 [documentation] Point spark doc at hbase-connectors spark
Signed-off-by: Guanghao Zhang <zghao@apache.org>
@@ -58,7 +58,7 @@ takes in HBase configurations and pushes them to the Spark executors. This allow
us to have an HBase Connection per Spark Executor in a static location.

For reference, Spark Executors can be on the same nodes as the Region Servers or
on different nodes; there is no dependence on co-location. Think of every Spark
Executor as a multi-threaded client application. This allows any Spark Tasks
running on the executors to access the shared Connection object.
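
A minimal sketch of that pattern, assuming the `HBaseContext` API described in this
chapter; the RDD contents, table `t1`, and column family `cf1` are illustrative only.

[source, scala]
----
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext

val sc = new SparkContext("local", "hbase-connection-example")
val config = HBaseConfiguration.create()

// The HBaseContext broadcasts the configuration so that each Spark Executor
// can hold a single, static HBase Connection shared by its tasks.
val hbaseContext = new HBaseContext(sc, config)

val rdd = sc.parallelize(Seq(("rowkey1", "value1"), ("rowkey2", "value2")))

// Each partition is handed the executor-local Connection.
hbaseContext.foreachPartition(rdd, (it, connection) => {
  val mutator = connection.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach { case (row, value) =>
    val put = new Put(Bytes.toBytes(row))
    put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("c1"), Bytes.toBytes(value))
    mutator.mutate(put)
  }
  mutator.flush()
  mutator.close()
})
----
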
@@ -134,7 +134,7 @@ try {
All functionality between Spark and HBase will be supported both in Scala and in
Java, with the exception of SparkSQL which will support any language that is
supported by Spark. For the remainder of this documentation we will focus on
Scala examples.

The examples above illustrate how to do a foreachPartition with a connection. A
number of other Spark base functions are supported out of the box:
@@ -148,7 +148,11 @@ access to HBase
`hBaseRDD`:: To simplify a distributed scan to create a RDD
// end::spark_base_functions[]

For examples of all these functionalities, see the
link:https://github.com/apache/hbase-connectors/tree/master/spark[hbase-spark integration]
in the link:https://github.com/apache/hbase-connectors[hbase-connectors] repository
(the hbase-spark connectors live outside of hbase core, in a related repository
maintained by the Apache HBase project).
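
For example, a distributed scan via `hBaseRDD` looks roughly like the sketch below,
assuming an `hbaseContext` built as shown earlier; the table name and caching value
are illustrative.

[source, scala]
----
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Scan

// Build an ordinary HBase Scan and let the HBaseContext turn it into an RDD of
// (rowkey, Result) pairs, scanned in parallel across the table's regions.
val scan = new Scan()
scan.setCaching(100)

val scanRdd = hbaseContext.hbaseRDD(TableName.valueOf("t1"), scan)
println("records scanned: " + scanRdd.count())
----
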

== Spark Streaming
https://spark.apache.org/streaming/[Spark Streaming] is a micro batching stream
@@ -157,12 +161,12 @@ companions in that HBase can help serve the following benefits alongside Spark
Streaming.

* A place to grab reference data or profile data on the fly
* A place to store counts or aggregates in a way that supports Spark Streaming's
promise of _only once processing_.

The link:https://github.com/apache/hbase-connectors/tree/master/spark[hbase-spark integration]
with Spark Streaming is similar to its normal Spark integration points, in that the following
commands are possible straight off a Spark Streaming DStream.

include::spark.adoc[tags=spark_base_functions]
@@ -202,10 +206,10 @@ dStream.hbaseBulkPut(
----

There are three inputs to the `hbaseBulkPut` function:
the hbaseContext that carries the configuration broadcast information linking us
to the HBase Connections in the executors, the table name of the table we are
putting data into, and a function that will convert a record in the DStream
into an HBase Put object.
====
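
Condensed to its shape, and assuming the same `dStream` and `hbaseContext` as in the
example above (with records laid out as `(rowKey, Seq((family, qualifier, value)))`),
the call is roughly:

[source, scala]
----
dStream.hbaseBulkPut(
  hbaseContext,                   // 1. carries the broadcast configuration to the executors
  TableName.valueOf("t1"),        // 2. the table we are putting data into
  (putRecord) => {                // 3. converts one DStream record into a Put
    val put = new Put(putRecord._1)
    putRecord._2.foreach { case (family, qualifier, value) =>
      put.addColumn(family, qualifier, value)
    }
    put
  })
----
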

== Bulk Load
@@ -213,11 +217,11 @@ to the HBase Connections in the executors
There are two options for bulk loading data into HBase with Spark. There is the
basic bulk load functionality that will work for cases where your rows have
millions of columns and cases where your columns are not consolidated and
partitioned before the map side of the Spark bulk load process.

There is also a thin record bulk load option with Spark. This second option is
designed for tables that have fewer than 10k columns per row. The advantage
of this second option is higher throughput and less overall load on the Spark
shuffle operation.

Both implementations work more or less like the MapReduce bulk load process in
@@ -225,7 +229,7 @@ that a partitioner partitions the rowkeys based on region splits and
the row keys are sent to the reducers in order, so that HFiles can be written
out directly from the reduce phase.

In Spark terms, the bulk load will be implemented around a Spark
`repartitionAndSortWithinPartitions` followed by a Spark `foreachPartition`.
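
Stripped of all HBase specifics, the underlying Spark pattern looks like the sketch
below; the `String` row keys and the injected region-aligned partitioner are
stand-ins for the byte-array keys and region-split-based partitioner the connector
actually uses.

[source, scala]
----
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Shape of the Spark bulk load: shuffle rows into region-aligned partitions,
// sort by row key within each partition, then write each sorted partition out
// in a single pass (the connector writes HFiles at that point; this sketch
// only iterates).
def bulkLoadShape(rows: RDD[(String, String)], regionPartitioner: Partitioner): Unit = {
  rows
    .repartitionAndSortWithinPartitions(regionPartitioner)
    .foreachPartition { sortedRows =>
      sortedRows.foreach { case (rowKey, value) =>
        // in the real implementation, each sorted cell is appended to an HFile writer
      }
    }
}
----
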

First let's look at an example of using the basic bulk load functionality
@@ -386,20 +390,24 @@ values for this row for all column families.

== SparkSQL/DataFrames

The link:https://github.com/apache/hbase-connectors/tree/master/spark[hbase-spark integration]
leverages the
link:https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html[DataSource API]
(link:https://issues.apache.org/jira/browse/SPARK-3247[SPARK-3247])
introduced in Spark-1.2.0, which bridges the gap between the simple HBase KV store and complex
relational SQL queries and enables users to perform complex data analytical work
on top of HBase using Spark. An HBase DataFrame is a standard Spark DataFrame, and is able to
interact with any other data sources such as Hive, Orc, Parquet, JSON, etc.
The link:https://github.com/apache/hbase-connectors/tree/master/spark[hbase-spark integration]
applies critical techniques such as partition pruning, column pruning,
predicate pushdown and data locality.

To use the
link:https://github.com/apache/hbase-connectors/tree/master/spark[hbase-spark integration]
connector, users need to define the Catalog for the schema mapping
between HBase and Spark tables, prepare the data and populate the HBase table,
then load the HBase DataFrame. After that, users can do integrated query and access records
in HBase tables with SQL queries. The following illustrates the basic procedure.

=== Define catalog
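
A minimal catalog sketch along the lines of the connector documentation; the table
`table1`, column family `cf1`, and column names are illustrative. Each Spark column
maps either to the row key (via the reserved `rowkey` family) or to a column
family/qualifier pair, with a type.

[source, scala]
----
// The catalog is a JSON string: Spark-side column names on the left, HBase
// coordinates (column family, qualifier) and a type on the right. The reserved
// family name "rowkey" maps a column to the row key itself.
def catalog = s"""{
       |"table":{"namespace":"default", "name":"table1"},
       |"rowkey":"key",
       |"columns":{
         |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
         |"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
         |"col2":{"cf":"cf1", "col":"col2", "type":"double"}
       |}
     |}""".stripMargin
----
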
@@ -564,8 +572,9 @@ sqlContext.sql("select count(col1) from table").show

.Native Avro support
====
The link:https://github.com/apache/hbase-connectors/tree/master/spark[hbase-spark integration]
connector supports different data formats like Avro, JSON, etc. The use case below
shows how Spark supports Avro. Users can persist the Avro record into HBase directly. Internally,
the Avro schema is converted to a native Spark Catalyst data type automatically.
Note that both key-value parts in an HBase table can be defined in Avro format.
@@ -687,4 +696,4 @@ The data frame `df` returned by `withCatalog` function could be used to access t
After loading the df DataFrame, users can query data. `registerTempTable` registers the df DataFrame
as a temporary table using the table name avrotable. The `sqlContext.sql` function allows the
user to execute SQL queries.
====
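
Pulled together, a sketch of that last step under the same assumptions as the
examples above (`catalog` already defined; verify the package path of
`HBaseTableCatalog` against your hbase-spark version):

[source, scala]
----
import org.apache.hadoop.hbase.spark.datasources.HBaseTableCatalog

// Load the HBase-backed DataFrame through the DataSource API, register it as a
// temporary table, then query it with plain SQL.
val df = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.hadoop.hbase.spark")
  .load()

df.registerTempTable("avrotable")
sqlContext.sql("select count(col1) from avrotable").show()
----
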