NIFI-11000 Add compression example to CreateHadoopSequenceFile documentation

This closes #6801

Signed-off-by: David Handermann <exceptionfactory@apache.org>
2022-12-21 17:55:34 +01:00 · 2022-12-21 17:55:34 +01:00 · 4d3fcb6843
parent f32a60af33
commit 4d3fcb6843
1 changed file with 64 additions and 18 deletions
--- a/nifi-nar-bundles/nifi-hadoop-bundle/nifi-hdfs-processors/src/main/resources/docs/org.apache.nifi.processors.hadoop.CreateHadoopSequenceFile/additionalDetails.html
+++ b/nifi-nar-bundles/nifi-hadoop-bundle/nifi-hdfs-processors/src/main/resources/docs/org.apache.nifi.processors.hadoop.CreateHadoopSequenceFile/additionalDetails.html
@@ -23,24 +23,70 @@

    <body>
        <!-- Processor Documentation ================================================== -->
-        <h2>Description:</h2>
-        <p>This processor is used to create a Hadoop Sequence File, which essentially is a file of key/value pairs. The key 
-            will be a file name and the value will be the flow file content. The processor will take either a merged (a.k.a. packaged) flow 
-            file or a singular flow file. Historically, this processor handled the merging by type and size or time prior to creating a 
+        <h2>Description</h2>
+        <p>
+            This processor is used to create a Hadoop Sequence File, which essentially is a file of key/value pairs. The key
+            will be a file name and the value will be the flow file content. The processor will take either a merged (a.k.a. packaged) flow
+            file or a singular flow file. Historically, this processor handled the merging by type and size or time prior to creating a
            SequenceFile output; it no longer does this. If creating a SequenceFile that contains multiple files of the same type is desired,
            precede this processor with a <code>RouteOnAttribute</code> processor to segregate files of the same type and follow that with a
-            <code>MergeContent</code> processor to bundle up files. If the type of files is not important, just use the 
-            <code>MergeContent</code> processor. When using the <code>MergeContent</code> processor, the following Merge Formats are 
+            <code>MergeContent</code> processor to bundle up files. If the type of files is not important, just use the
+            <code>MergeContent</code> processor. When using the <code>MergeContent</code> processor, the following Merge Formats are
            supported by this processor:
-        <ul>
-            <li>TAR</li>
-            <li>ZIP</li>
-            <li>FlowFileStream v3</li>
-        </ul>
-        The created SequenceFile is named the same as the incoming FlowFile with the suffix '.sf'. For incoming FlowFiles that are 
-        bundled, the keys in the SequenceFile are the individual file names, the values are the contents of each file.
-    </p>
-    NOTE: The value portion of a key/value pair is loaded into memory. While there is a max size limit of 2GB, this could cause memory
-    issues if there are too many concurrent tasks and the flow file sizes are large.
-</body>
-</html>
+            <ul>
+                <li>TAR</li>
+                <li>ZIP</li>
+                <li>FlowFileStream v3</li>
+            </ul>
+            The created SequenceFile is named the same as the incoming FlowFile with the suffix '.sf'. For incoming FlowFiles that are
+            bundled, the keys in the SequenceFile are the individual file names, the values are the contents of each file.
+        </p>
+        <p>
+            NOTE: The value portion of a key/value pair is loaded into memory. While there is a max size limit of 2GB, this could cause memory
+            issues if there are too many concurrent tasks and the flow file sizes are large.
+        </p>
+
+        <h2>Using Compression</h2>
+        <p>
+            The value of the <code>Compression codec</code> property determines the compression library the processor uses to compress content.
+            The processor relies on third-party libraries for compression. These can be either Java libraries or native libraries.
+            For native libraries, the path of the folder containing them needs to be listed in the <code>LD_LIBRARY_PATH</code> environment variable so that NiFi can find them.
+        </p>
+        <h3>Example: using Snappy compression with native library on CentOS</h3>
+        <p>
+            <ol>
+                <li>
+                    Snappy compression needs to be installed on the server running NiFi:
+                    <br/>
+                    <code>sudo yum install snappy</code>
+                    <br/>
+                </li>
+                <li>
+                    Suppose that the server running NiFi has the native compression libraries in <code>/opt/lib/hadoop/lib/native</code>.
+                    (Native libraries have file extensions like <code>.so</code>, <code>.dll</code>, <code>.lib</code>, etc. depending on the platform.)
+                    <br/>
+                    The user running the NiFi process must be able to execute these files. One way to ensure this is to copy the files
+                    to a separate folder, e.g. <code>/opt/nativelibs</code>, and change their owner. If NiFi is run by the <code>nifi</code> user in the <code>nifi</code> group, then:
+                    <br/>
+                    <code>chown nifi:nifi /opt/nativelibs</code>
+                    <br/>
+                    <code>chown nifi:nifi /opt/nativelibs/*</code>
+                    <br/>
+                </li>
+                <li>
+                    The <code>LD_LIBRARY_PATH</code> needs to be set to contain the path to the folder <code>/opt/nativelibs</code>.
+                    <br/>
+                </li>
+                <li>
+                    NiFi needs to be restarted.
+                </li>
+                <li>
+                    The <code>Compression codec</code> property can be set to <code>SNAPPY</code> and a <code>Compression type</code> can be selected.
+                </li>
+                <li>
+                    The processor can be started.
+                </li>
+            </ol>
+        </p>
+    </body>
+</html>
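
Steps 2 and 3 of the added example leave the exact commands for setting LD_LIBRARY_PATH open. The sketch below shows one possible way to do it. It reuses the example folder /opt/nativelibs from the documentation. The helper name prepend_path is hypothetical, and where the export should live (for instance an environment script sourced before NiFi starts) is an assumption that the commit does not specify. The copy and chown commands appear only as comments because they need root privileges.

```shell
#!/bin/sh
# Sketch only: paths and the nifi user/group are the example values from the
# documentation above; adjust them for the actual installation.
NATIVE_DIR="/opt/nativelibs"

# Step 2 (run as root, shown as comments because they change the system):
#   mkdir -p /opt/nativelibs
#   cp /opt/lib/hadoop/lib/native/* /opt/nativelibs/
#   chown nifi:nifi /opt/nativelibs
#   chown nifi:nifi /opt/nativelibs/*

# prepend_path DIR LIST (hypothetical helper)
# Prints LIST with DIR prepended, unless DIR is already one of the
# colon-separated entries, in which case LIST is printed unchanged.
prepend_path() {
    dir="$1"
    current="$2"
    case ":$current:" in
        *":$dir:"*) printf '%s\n' "$current" ;;
        *) printf '%s\n' "${dir}${current:+:$current}" ;;
    esac
}

# Step 3: make the folder visible to the dynamic linker for any process
# started from this environment (NiFi must be restarted afterwards).
LD_LIBRARY_PATH="$(prepend_path "$NATIVE_DIR" "${LD_LIBRARY_PATH:-}")"
export LD_LIBRARY_PATH
```

Checking for an existing entry before prepending keeps the variable from growing each time the script is sourced, which matters if it ends up in a file that is read on every NiFi start.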