diff --git a/hadoop-hdds/docs/content/SparkOzoneFSK8S.md b/hadoop-hdds/docs/content/SparkOzoneFSK8S.md new file mode 100644 index 00000000000..ef88d618960 --- /dev/null +++ b/hadoop-hdds/docs/content/SparkOzoneFSK8S.md @@ -0,0 +1,184 @@ +--- +title: Spark in Kubernetes with OzoneFS +menu: + main: + parent: Recipes +--- + + +Using Ozone from Apache Spark +=== + +This recipe shows how Ozone object store can be used from Spark using: + + - OzoneFS (Hadoop compatible file system) + - Hadoop 2.7 (included in the Spark distribution) + - Kubernetes Spark scheduler + - Local spark client + + +## Requirements + +Download latest Spark and Ozone distribution and extract them. This method is +tested with the `spark-2.4.0-bin-hadoop2.7` distribution. + +You also need the following: + + * A container repository to push and pull the spark+ozone images. (In this recipe we will use the dockerhub) + * A repo/name for the custom containers (in this recipe _myrepo/ozone-spark_) + * A dedicated namespace in kubernetes (we use _yournamespace_ in this recipe) + +## Create the docker image for drivers + +### Create the base Spark driver/executor image + +First of all create a docker image with the Spark image creator. +Execute the following from the Spark distribution + +``` +./bin/docker-image-tool.sh -r myrepo -t 2.4.0 build +``` + +_Note_: if you use Minikube add the `-m` flag to use the docker daemon of the Minikube image: + +``` +./bin/docker-image-tool.sh -m -r myrepo -t 2.4.0 build +``` + +`./bin/docker-image-tool.sh` is an official Spark tool to create container images and this step will create multiple Spark container images with the name _myrepo/spark_. The first container will be used as a base container in the following steps. + +### Customize the docker image + +Create a new directory for customizing the created docker image. + +Copy the `ozone-site.xml` from the cluster: + +``` +kubectl cp om-0:/opt/hadoop/etc/hadoop/ozone-site.xml . +``` + +And create a custom `core-site.xml`: + +``` + + + fs.o3fs.impl + org.apache.hadoop.fs.ozone.OzoneFileSystem + + +``` + +Copy the `ozonefs.jar` file from an ozone distribution (__use the legacy version!__) + +``` +kubectl cp om-0:/opt/hadoop/share/ozone/lib/hadoop-ozone-filesystem-lib-legacy-0.4.0-SNAPSHOT.jar . +``` + + +Create a new Dockerfile and build the image: +``` +FROM myrepo/spark:2.4.0 +ADD core-site.xml /opt/hadoop/conf/core-site.xml +ADD ozone-site.xml /opt/hadoop/conf/ozone-site.xml +ENV HADOOP_CONF_DIR=/opt/hadoop/conf +ENV SPARK_EXTRA_CLASSPATH=/opt/hadoop/conf +ADD hadoop-ozone-filesystem-lib-legacy-0.4.0-SNAPSHOT.jar /opt/hadoop-ozone-filesystem-lib-legacy.jar +``` + +``` +docker build -t myrepo/spark-ozone +``` + +For remote kubernetes cluster you may need to push it: + +``` +docker push myrepo/spark-ozone +``` + +## Create a bucket and identify the ozonefs path + +Download any text file and put it to the `/tmp/alice.txt` first. + +``` +kubectl port-forward s3g-0 9878:9878 +aws s3api --endpoint http://localhost:9878 create-bucket --bucket=test +aws s3api --endpoint http://localhost:9878 put-object --bucket test --key alice.txt --body /tmp/alice.txt +kubectl exec -it scm-0 ozone sh bucket path test +``` + +The output of the last command is something like this: + +``` +Volume name for S3Bucket is : s3asdlkjqiskjdsks +Ozone FileSystem Uri is : o3fs://test.s3asdlkjqiskjdsks +``` + +Write down the ozone filesystem uri as it should be used with the spark-submit command. + +## Create service account to use + +``` +kubectl create serviceaccount spark -n yournamespace +kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=poc:yournamespace --namespace=yournamespace +``` +## Execute the job + +Execute the following spar-submit command, but change at least the following values: + + * the kubernetes master url (you can check your ~/.kube/config to find the actual value) + * the kubernetes namespace (yournamespace in this example) + * serviceAccountName (you can use the _spark_ value if you folllowed the previous steps) + * container.image (in this example this is myrepo/spark-ozone. This is pushed to the registry in the previous steps) + * location of the input file (o3fs://...), use the string which is identified earlier with the `ozone sh bucket path` command + +``` +bin/spark-submit \ + --master k8s://https://kubernetes:6443 \ + --deploy-mode cluster \ + --name spark-word-count \ + --class org.apache.spark.examples.JavaWordCount \ + --conf spark.executor.instances=1 \ + --conf spark.kubernetes.namespace=yournamespace \ + --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ + --conf spark.kubernetes.container.image=myrepo/spark-ozone \ + --conf spark.kubernetes.container.image.pullPolicy=Always \ + --jars /opt/hadoop-ozone-filesystem-lib-legacy.jar \ + local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar \ + o3fs://bucket.volume/alice.txt +``` + +Check the available `spark-word-count-...` pods with `kubectl get pod` + +Check the output of the calculation with `kubectl logs spark-word-count-1549973913699-driver` + +You should see the output of the wordcount job. For example: + +``` +... +name: 8 +William: 3 +this,': 1 +SOUP!': 1 +`Silence: 1 +`Mine: 1 +ordered.: 1 +considering: 3 +muttering: 3 +candle: 2 +... +```