NIFI-11000 Add compression example to CreateHadoopSequenceFile documentation

This closes #6801

Signed-off-by: David Handermann <exceptionfactory@apache.org>
Peter Gyori 2022-12-21 17:55:34 +01:00 committed by exceptionfactory
parent f32a60af33
commit 4d3fcb6843
1 changed files with 64 additions and 18 deletions

@@ -23,24 +23,70 @@
<body>
<!-- Processor Documentation ================================================== -->
<h2>Description</h2>
<p>
This processor is used to create a Hadoop Sequence File, which essentially is a file of key/value pairs. The key
will be a file name and the value will be the flow file content. The processor will take either a merged (a.k.a. packaged) flow
file or a singular flow file. Historically, this processor handled the merging by type and size or time prior to creating a
SequenceFile output; it no longer does this. If creating a SequenceFile that contains multiple files of the same type is desired,
precede this processor with a <code>RouteOnAttribute</code> processor to segregate files of the same type and follow that with a
<code>MergeContent</code> processor to bundle up files. If the type of files is not important, just use the
<code>MergeContent</code> processor. When using the <code>MergeContent</code> processor, the following Merge Formats are
supported by this processor:
<ul>
<li>TAR</li>
<li>ZIP</li>
<li>FlowFileStream v3</li>
</ul>
The created SequenceFile is named the same as the incoming FlowFile with the suffix '.sf'. For incoming FlowFiles that are
bundled, the keys in the SequenceFile are the individual file names, and the values are the contents of each file.
</p>
<p>
NOTE: The value portion of a key/value pair is loaded into memory. While there is a max size limit of 2GB, this could cause memory
issues if there are too many concurrent tasks and the flow file sizes are large.
</p>
<h2>Using Compression</h2>
<p>
The value of the <code>Compression codec</code> property determines the compression library the processor uses to compress content.
Third-party libraries are used for compression. These third-party libraries can be Java libraries or native libraries.
For native libraries, the folder that contains them needs to be listed in the <code>LD_LIBRARY_PATH</code> environment variable so that NiFi can find the libraries.
</p>
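<p>
As a rough illustration of the lookup described above, the following sketch reports which folder listed in <code>LD_LIBRARY_PATH</code> contains a given native library.
It is not part of NiFi: the function name <code>find_native_lib</code> and the <code>libsnappy</code> prefix in the usage comment are assumptions made for this example.
</p>

```shell
# Hypothetical helper (not part of NiFi): print the first folder listed in
# LD_LIBRARY_PATH that contains a file whose name starts with the given prefix.
# The body runs in a subshell, so changing IFS does not affect the caller.
find_native_lib() (
    prefix="$1"
    IFS=':'
    for dir in ${LD_LIBRARY_PATH:-}; do
        # Skip empty entries such as those produced by a leading or trailing colon.
        [ -n "$dir" ] || continue
        for candidate in "$dir/$prefix"*; do
            if [ -e "$candidate" ]; then
                printf '%s\n' "$dir"
                exit 0
            fi
        done
    done
    # No listed folder contains a matching file.
    exit 1
)

# Example (assumed library name prefix):
#   find_native_lib libsnappy
```

<p>
A non-zero exit status only means that no folder in <code>LD_LIBRARY_PATH</code> holds a matching file; it says nothing about libraries installed elsewhere on the system.
</p>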
<h3>Example: using Snappy compression with native library on CentOS</h3>
<p>
<ol>
<li>
The Snappy compression library needs to be installed on the server running NiFi:
<br/>
<code>sudo yum install snappy</code>
<br/>
</li>
<li>
Suppose that the server running NiFi has the native compression libraries in <code>/opt/lib/hadoop/lib/native</code>.
(Native libraries have file extensions like <code>.so</code>, <code>.dll</code>, <code>.lib</code>, etc. depending on the platform.)
<br/>
We need to make sure that the user running the NiFi process can execute these files. For this purpose we can copy the files
to e.g. <code>/opt/nativelibs</code> and change their owner. If NiFi is run by the <code>nifi</code> user in the <code>nifi</code> group, then:
<br/>
<code>chown nifi:nifi /opt/nativelibs</code>
<br/>
<code>chown nifi:nifi /opt/nativelibs/*</code>
<br/>
</li>
<li>
The <code>LD_LIBRARY_PATH</code> environment variable needs to include the path to the folder <code>/opt/nativelibs</code>.
<br/>
</li>
<li>
NiFi needs to be restarted.
</li>
<li>
The <code>Compression codec</code> property can now be set to <code>SNAPPY</code>, and a <code>Compression type</code> can be selected.
</li>
<li>
The processor can be started.
</li>
</ol>
</p>
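<p>
The copy, ownership and <code>LD_LIBRARY_PATH</code> steps above could be scripted roughly as follows. This is a minimal sketch, not a procedure shipped with NiFi:
the function names <code>stage_native_libs</code> and <code>prepend_library_path</code> are hypothetical, and the paths in the usage comment are the example paths from the steps above.
</p>

```shell
# Hypothetical helper: copy the contents of src_dir into dest_dir and, if an
# owner such as nifi:nifi is given, change the owner of the copies.
stage_native_libs() {
    src_dir="$1"
    dest_dir="$2"
    owner="${3:-}"
    mkdir -p "$dest_dir" || return 1
    # "src/." copies the folder contents; -p keeps file modes and timestamps.
    cp -pR "$src_dir"/. "$dest_dir"/ || return 1
    if [ -n "$owner" ]; then
        chown -R "$owner" "$dest_dir" || return 1
    fi
}

# Hypothetical helper: put dir at the front of LD_LIBRARY_PATH unless it is
# already listed, then export the variable for child processes.
prepend_library_path() {
    dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Usage with the example paths from the steps above (changing the owner
# typically requires root):
#   stage_native_libs /opt/lib/hadoop/lib/native /opt/nativelibs nifi:nifi
#   prepend_library_path /opt/nativelibs
```

<p>
The exported variable is only visible to processes started from the same shell, so it has to be set in the environment that actually launches NiFi before NiFi is restarted.
</p>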
</body>
</html>