mirror of https://github.com/apache/nifi.git
NIFI-11000 Add compression example to CreateHadoopSequenceFile documentation
This closes #6801
Signed-off-by: David Handermann <exceptionfactory@apache.org>
parent f32a60af33
commit 4d3fcb6843
<body>
<!-- Processor Documentation ================================================== -->
<h2>Description</h2>
<p>
This processor is used to create a Hadoop Sequence File, which is essentially a file of key/value pairs. The key
will be a file name and the value will be the flow file content. The processor will take either a merged (a.k.a. packaged) flow
file or a singular flow file. Historically, this processor handled the merging by type and size or time prior to creating a
SequenceFile output; it no longer does this. If a SequenceFile that contains multiple files of the same type is desired,
precede this processor with a <code>RouteOnAttribute</code> processor to segregate files of the same type and follow that with a
<code>MergeContent</code> processor to bundle up the files. If the type of the files is not important, just use the
<code>MergeContent</code> processor. When using the <code>MergeContent</code> processor, the following Merge Formats are
supported by this processor:
<ul>
<li>TAR</li>
<li>ZIP</li>
<li>FlowFileStream v3</li>
</ul>
The created SequenceFile is named the same as the incoming FlowFile, with the suffix '.sf'. For incoming FlowFiles that are
bundled, the keys in the SequenceFile are the individual file names and the values are the contents of each file.
</p>
<p>
NOTE: The value portion of a key/value pair is loaded into memory. While there is a max size limit of 2GB, this could cause memory
issues if there are too many concurrent tasks and the flow file sizes are large.
</p>
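<p>
For a quick check of what was written, a SequenceFile produced by this processor can be printed as text with the Hadoop command line. This is only an illustrative sketch: the path below is a made-up example, and binary values are shown via their string form rather than as the original bytes.
</p>
<pre>
# Print the key/value pairs of a SequenceFile written by this processor.
# /data/out/example.txt.sf is a hypothetical output path.
hadoop fs -text /data/out/example.txt.sf
</pre>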
<h2>Using Compression</h2>
<p>
The value of the <code>Compression codec</code> property determines the compression library the processor uses to compress content.
Compression is performed by third-party libraries, which can be either Java libraries or native libraries.
In the case of native libraries, the path of their parent folder needs to be included in an environment variable called <code>LD_LIBRARY_PATH</code> so that NiFi can find the libraries.
</p>
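<p>
As a minimal sketch of that requirement (the folder below is a placeholder, and <code>bin/nifi-env.sh</code> is just one common place to set environment variables for the NiFi process), the variable can be exported before NiFi is started:
</p>
<pre>
# Example only: make the folder containing the native libraries visible to NiFi,
# e.g. by adding this line to bin/nifi-env.sh and restarting NiFi.
export LD_LIBRARY_PATH=/path/to/native-libraries:$LD_LIBRARY_PATH
</pre>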
<h3>Example: using Snappy compression with native library on CentOS</h3>
<p>
<ol>
<li>
Snappy compression needs to be installed on the server running NiFi:
<br/>
<code>sudo yum install snappy</code>
<br/>
</li>
<li>
Suppose that the server running NiFi has the native compression libraries in <code>/opt/lib/hadoop/lib/native</code>.
(Native libraries have file extensions like <code>.so</code>, <code>.dll</code>, <code>.lib</code>, etc. depending on the platform.)
<br/>
We need to make sure that the files can be executed by the user running the NiFi process. For this purpose we can make a copy of these files
to e.g. <code>/opt/nativelibs</code> and change their owner. If NiFi is executed by the <code>nifi</code> user in the <code>nifi</code> group, then:
<br/>
<code>chown nifi:nifi /opt/nativelibs</code>
<br/>
<code>chown nifi:nifi /opt/nativelibs/*</code>
<br/>
(The copy and ownership commands are collected into a single sketch after this list.)
</li>
<li>
<code>LD_LIBRARY_PATH</code> needs to be set to contain the path to the <code>/opt/nativelibs</code> folder.
<br/>
</li>
<li>
NiFi needs to be restarted.
</li>
<li>
The <code>Compression codec</code> property can be set to <code>SNAPPY</code> and a <code>Compression type</code> can be selected.
</li>
<li>
The processor can be started.
</li>
</ol>
</p>
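<p>
A consolidated sketch of steps 2 and 3 above, using the example source folder, the <code>/opt/nativelibs</code> target, and the <code>nifi</code> user/group from the list; adjust these values for your environment.
</p>
<pre>
# Copy the native compression libraries to a folder owned by the NiFi user
sudo mkdir -p /opt/nativelibs
sudo cp /opt/lib/hadoop/lib/native/libsnappy* /opt/nativelibs/

# Give ownership to the user and group running NiFi
sudo chown nifi:nifi /opt/nativelibs
sudo chown nifi:nifi /opt/nativelibs/*

# Add the folder to LD_LIBRARY_PATH where NiFi's environment is set
# (see the sketch above), then restart NiFi, e.g. bin/nifi.sh restart
export LD_LIBRARY_PATH=/opt/nativelibs:$LD_LIBRARY_PATH
</pre>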
</body>
</html>