diff --git a/nifi-docs/src/main/asciidoc/administration-guide.adoc b/nifi-docs/src/main/asciidoc/administration-guide.adoc index 9f7d56d2eb..7ddbc7893c 100644 --- a/nifi-docs/src/main/asciidoc/administration-guide.adoc +++ b/nifi-docs/src/main/asciidoc/administration-guide.adoc @@ -1136,11 +1136,11 @@ below marked with an asterisk (*) in such a way that upgrading will be easier. F at the end of this section. Note that values for periods of time and data sizes must include the unit of measure, for example "10 sec" or "10 MB", not simply "10". -==== Core Properties + +=== Core Properties + The first section of the _nifi.properties_ file is for the Core Properties. These properties apply to the core framework as a whole. -|==== +|=== |*Property*|*Description* |nifi.version|The version number of the current release. If upgrading but reusing this file, be sure to update this value. |nifi.flow.configuration.file*|The location of the flow configuration file (i.e., the file that contains what is currently displayed on the NiFi graph). The default value is ./conf/flow.xml.gz. @@ -1169,10 +1169,10 @@ Providing three total locations, including _nifi.nar.library.directory_. |nifi.nar.working.directory|The location of the nar working directory. The default value is ./work/nar and probably should be left as is. |nifi.documentation.working.directory|The documentation working directory. The default value is ./work/docs/components and probably should be left as is. |nifi.processor.scheduling.timeout|Time to wait for a Processor's life-cycle operation (@OnScheduled and @OnUnscheduled) to finish before other life-cycle operation (e.g., stop) could be invoked. Default is 1 minute. -|==== +|=== -==== State Management + +=== State Management + The State Management section of the Properties file provides a mechanism for configuring local and cluster-wide mechanisms for components to persist state. See the <> section for more information on how this is used. 
@@ -1187,7 +1187,7 @@ for components to persist state. See the <> section for more i |==== -==== H2 Settings +=== H2 Settings The H2 Settings section defines the settings for the H2 database, which keeps track of user access and flow controller history. @@ -1198,7 +1198,7 @@ The H2 Settings section defines the settings for the H2 database, which keeps tr |==== -==== FlowFile Repository +=== FlowFile Repository The FlowFile repository keeps track of the attributes and current state of each FlowFile in the system. By default, this repository is installed in the same root installation directory as all the other repositories; however, it is advisable @@ -1213,7 +1213,7 @@ to configure it on a separate drive if available. |nifi.flowfile.repository.always.sync|If set to _true_, any change to the repository will be synchronized to the disk, meaning that NiFi will ask the operating system not to cache the information. This is very expensive and can significantly reduce NiFi performance. However, if it is _false_, there could be the potential for data loss if either there is a sudden power loss or the operating system crashes. The default value is _false_. |==== -==== Swap Management +=== Swap Management NiFi keeps FlowFile information in memory (the JVM) but during surges of incoming data, the FlowFile information can start to take up so much of the JVM that system performance @@ -1230,7 +1230,7 @@ available again. These properties govern how that process occurs. |nifi.swap.out.threads|The number of threads to use for swapping out. The default value is 4. |==== -==== Content Repository +=== Content Repository The Content Repository holds the content for all the FlowFiles in the system. By default, it is installed in the same root installation directory as all the other repositories; however, administrators will likely want to configure it on a separate @@ -1243,7 +1243,7 @@ FlowFile Repository, if also on that disk, could become corrupt. 
To avoid this s |nifi.content.repository.implementation|The Content Repository implementation. The default value is org.apache.nifi.controller.repository.FileSystemRepository and should only be changed with caution. To store flowfile content in memory instead of on disk (at the risk of data loss in the event of power/machine failure), set this property to org.apache.nifi.controller.repository.VolatileContentRepository. |==== -==== File System Content Repository Properties +=== File System Content Repository Properties |==== |*Property*|*Description* @@ -1268,7 +1268,7 @@ this property specifies the maximum amount of time to keep the archived data. It |nifi.content.viewer.url|The URL for a web-based content viewer if one is available. It is blank by default. |==== -==== Volatile Content Repository Properties +=== Volatile Content Repository Properties |==== |*Property*|*Description* @@ -1276,7 +1276,7 @@ this property specifies the maximum amount of time to keep the archived data. It |nifi.volatile.content.repository.block.size|The Content Repository block size. The default value is 32KB. |==== -==== Provenance Repository +=== Provenance Repository The Provenance Repository contains the information related to Data Provenance. The next three sections are for Provenance Repository properties. @@ -1285,7 +1285,7 @@ The Provenance Repository contains the information related to Data Provenance. T |nifi.provenance.repository.implementation|The Provenance Repository implementation. The default value is org.apache.nifi.provenance.PersistentProvenanceRepository and should only be changed with caution. To store provenance events in memory instead of on disk (at the risk of data loss in the event of power/machine failure), set this property to org.apache.nifi.provenance.VolatileProvenanceRepository. 
|==== -==== Persistent Provenance Repository Properties +=== Persistent Provenance Repository Properties |==== |*Property*|*Description* @@ -1317,14 +1317,14 @@ Providing three total locations, including _nifi.provenance.repository.director |nifi.provenance.repository.max.attribute.length|Indicates the maximum length that a FlowFile attribute can be when retrieving a Provenance Event from the repository. If the length of any attribute exceeds this value, it will be truncated when the event is retrieved. The default is 65536. |==== -==== Volatile Provenance Repository Properties +=== Volatile Provenance Repository Properties |==== |*Property*|*Description* |nifi.provenance.repository.buffer.size|The Provenance Repository buffer size. The default value is 100000. |==== -==== Component Status Repository +=== Component Status Repository The Component Status Repository contains the information for the Component Status History tool in the User Interface. These properties govern how that tool works. @@ -1344,7 +1344,7 @@ of 576. [[site_to_site_properties]] -==== Site to Site Properties +=== Site to Site Properties These properties govern how this instance of NiFi communicates with remote instances of NiFi when Remote Process Groups are configured in the dataflow. Remote Process Groups can choose transport protocol from RAW and HTTP. Properties named with _nifi.remote.input.socket.*_ are RAW transport protocol specific. Similarly, _nifi.remote.input.http.*_ are HTTP transport protocol specific properties. @@ -1361,7 +1361,7 @@ Whether a Site-to-Site client uses HTTP or HTTPS is determined by _nifi.remote.i |nifi.remote.input.http.transaction.ttl|Specify how long a transaction can stay alive on server. If a Site-to-Site client didn't proceed to next action for this period of time, the transaction is discarded from remote NiFi instance. 
For example, a client creates a transaction but doesn't send or receive flow files, or send or received flow files but doesn't confirm that transaction. By default, it is set to 30 seconds.| |==== -==== Web Properties +=== Web Properties These properties pertain to the web-based User Interface. @@ -1376,7 +1376,7 @@ These properties pertain to the web-based User Interface. |nifi.web.jetty.threads|The number of Jetty threads. The default value is 200. |==== -==== Security Properties +=== Security Properties These properties pertain to various security features in NiFi. Many of these properties are covered in more detail in the Security Configuration section of this Administrator's Guide. @@ -1400,7 +1400,7 @@ in the file specified in `nifi.login.identity.provider.configuration.file`. Sett |nifi.security.ocsp.responder.certificate|This is the location of the OCSP responder certificate if one is being used. It is blank by default. |==== -==== Cluster Common Properties +=== Cluster Common Properties When setting up a NiFi cluster, these properties should be configured the same way on all nodes. @@ -1418,7 +1418,7 @@ from the remote node before considering the communication with the node a failur to the cluster. It provides an additional layer of security. This value is blank by default, meaning that no firewall file is to be used. |==== -==== Cluster Node Properties +=== Cluster Node Properties Configure these properties for cluster nodes. @@ -1432,7 +1432,7 @@ in the cluster. This property defaults to 10, but for large clusters, this value |==== [[claim_management]] -==== Claim Management +=== Claim Management Whenever a request is made to change the dataflow, it is important that all nodes in the NiFi cluster are kept in-sync. In order to allow for this, NiFi employs a two-phase commit. The request @@ -1453,7 +1453,7 @@ unlocking. 
|==== -==== ZooKeeper Properties +=== ZooKeeper Properties NiFi depends on Apache ZooKeeper for determining which node in the cluster should play the role of Primary Node and which node should play the role of Cluster Coordinator. These properties must be configured in order for NiFi @@ -1474,7 +1474,7 @@ that is specified. |==== [[kerberos_properties]] -==== Kerberos Properties +=== Kerberos Properties |==== |*Property*|*Description* diff --git a/nifi-docs/src/main/asciidoc/nifi-in-depth.adoc b/nifi-docs/src/main/asciidoc/nifi-in-depth.adoc index 29f3bd6246..88b1031f16 100644 --- a/nifi-docs/src/main/asciidoc/nifi-in-depth.adoc +++ b/nifi-docs/src/main/asciidoc/nifi-in-depth.adoc @@ -45,7 +45,7 @@ A snapshot is automatically taken periodically by the system, which creates a ne The period between system checkpoints is configurable in the nifi.properties file (documented in the NiFi System Administrator's Guide). The default is a two-minute interval. -===== Effect of System Failure on Transactions +==== Effect of System Failure on Transactions NiFi protects against hardware and system failures by keeping a record of what was happening on each node at that time in their respective FlowFile Repo. As mentioned above, the FlowFile Repo is NiFi's Write-Ahead Log. When the node comes back online, it works to restore its state by first checking for the "snapshot" and ".partial" files. The node either accepts the "snapshot" and deletes the ".partial" (if it exists), or renames the ".partial" file to "snapshot" if the "snapshot" file doesn't exist. If the Node was in the middle of writing content when it went down, nothing is corrupted, thanks to the Copy On Write (mentioned below) and Immutability (mentioned above) paradigms. Since FlowFile transactions never modify the original content (pointed to by the content pointer), the original is safe. When NiFi goes down, the write claim for the change is orphaned and then cleaned up by the background garbage collection. 
This provides a “rollback” to the last known stable state. @@ -54,7 +54,7 @@ The Node then restores its state from the FlowFile. For a more in-depth, step-by This setup, in terms of transactional units of work, allows NiFi to be very resilient in the face of adversity, ensuring that even if NiFi is suddenly killed, it can pick back up without any loss of data. -===== Deeper View: FlowFiles in Memory and on Disk +==== Deeper View: FlowFiles in Memory and on Disk The term "FlowFile" is a bit of a misnomer. This would lead one to believe that each FlowFile corresponds to a file on disk, but that is not true. There are two main locations that the FlowFile attributes exist, the Write-Ahead Log that is explained above and a hash map in working memory. This hash map has a reference to all of the FlowFiles actively being used in the Flow. The object referenced by this map is the same one that is used by processors and held in connections queues. Since the FlowFile object is held in memory, all which has to be done for the Processor to get the FlowFile is to ask the ProcessSession to grab it from the queue. When a change occurs to the FlowFile, the delta is written out to the Write-Ahead Log and the object in memory is modified accordingly. This allows the system to quickly work with FlowFiles while also keeping track of what has happened and what will happen when the session is committed. This provides a very robust and durable system. @@ -67,7 +67,7 @@ The Content Repository is simply a place in local storage where the content of a In the same way the JVM Heap has a garbage collection process to reclaim unreachable objects when space is needed, there exists a dedicated thread in NiFi to analyze the Content repo for un-used content (more info in the " Deeper View: Deletion After Checkpointing" section). After a FlowFile's content is identified as no longer in use it will either be deleted or archived. 
If archiving is enabled in nifi.properties then the FlowFile’s content will exist in the Content Repo either until it is aged off (deleted after a certain amount of time) or deleted due to the Content Repo taking up too much space. The conditions for archiving and/or deleting are configured in the nifi.properties file ("nifi.content.repository.archive.max.retention.period", "nifi.content.repository.archive.max.usage.percentage") and outlined in the Admin guide. Refer to the "Data Egress" section for more information on the deletion of content. -===== Deeper View: Content Claim +==== Deeper View: Content Claim In general, when talking about a FlowFile, the reference to its content can simply be referred to as a "pointer" to the content. Though, the underlying implementation of the FlowFile Content reference has multiple layers of complexity. The Content Repository is made up of a collection of files on disk. These files are binned into Containers and Sections. A Section is a subdirectory of a Container. A Container can be thought of as a “root directory” for the Content Repository. The Content Repository, though, can be made up of many Containers. This is done so that NiFi can take advantage of multiple physical partitions in parallel. NiFi is then capable of reading from, and writing to, all of these disks in parallel, in order to achieve data rates of hundreds of Megabytes or even Gigabytes per second of disk throughput on a single node. "Resource Claims" are Java objects that point to specific files on disk (this is done by keeping track of the file ID, the section the file is in, and the container the section is a part of). To keep track of the FlowFile's contents, the FlowFile has a "Content Claim" object. This Content Claim has a reference to the Resource Claim that contains the content, the offset of the content within the file, and the length of the content. 
To access the content, the Content Repository drills down to the specific file on disk using the Resource Claim's properties and then seeks to the offset specified by the Content Claim before streaming content from the file. @@ -86,7 +86,7 @@ Note: Since provenance events are snapshots of the FlowFile, as it exists in the For a look at the design decisions behind the Provenance Repository check out this link: https://cwiki.apache.org/confluence/display/NIFI/Persistent+Provenance+Repository+Design -===== Deeper View: Provenance Log Files +==== Deeper View: Provenance Log Files Each provenance event has two maps, one for the attributes before the event and one for the updated attribute values. In general, provenance events don't store the updated values of the attributes as they existed when the event was emitted but instead, the attribute values when the session is committed. The events are cached and saved until the session is committed and once the session is committed the events are emitted with the attributes associated with the FlowFile when the session is committed. The exception to this rule is the "SEND" event, in which case the event contains the attributes as they existed when the event was emitted. This is done because if the attributes themselves were also sent, it is important to have an accurate account of exactly what information was sent. As NiFi is running, there is a rolling group of 16 provenance log files. As provenance events are emitted they are written to one of the 16 files (there are multiple files to increase throughput). The log files are periodically rolled over (the default timeframe is every 30 seconds). This means the newly created provenance events start writing to a new group of 16 log files and the original ones are processed for long term storage. First the rolled over logs are merged into one file. Then the file is optionally compressed (determined by the "nifi.provenance.repository.compress.on.rollover" property). 
Lastly the events are indexed using Lucene and made available for querying. This batched approach for indexing means provenance events aren't available immediately for querying but in return this dramatically increases performance because committing a transaction and indexing are very expensive tasks. @@ -97,11 +97,11 @@ The Provenance Repo is a Lucene index that is broken into multiple shards. This === General Repository Notes -===== Multiple Physical Storage Points +==== Multiple Physical Storage Points For the Provenance and Content repos, there is the option to stripe the information across multiple physical partitions. An admin would do this if they wanted to federate reads and writes across multiple disks. The repo (Content or Provenance) is still one logical store but writes will be striped across multiple volumes/partitions automatically by the system. The directories are specified in the nifi.properties file. -===== Best Practice +==== Best Practice It is considered a best practice to analyze the contents of a FlowFile as few times as possible and instead extract key information from the contents into the attributes of the FlowFile; then read/write information from the FlowFile attributes. One example of this is the ExtractText processor, which extracts text from the FlowFile Content and puts it as an attribute so other processors can make use of it. This provides far better performance than continually processing the entire content of the FlowFile, as the attributes are kept in-memory and updating the FlowFile repository is much faster than updating the Content repository, given the amount of data stored in each. @@ -117,7 +117,7 @@ Note: To use this flow you need to configure a couple options. 
First a Distribut keytool -importkeystore -srckeystore /Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/security/cacerts -destkeystore myTrustStore -===== WebCrawler Template: +=== WebCrawler Template: Note that it is not uncommon for bulletins with messages such as "Connection timed out" to appear on the InvokeHttp processor due to the random nature of web crawling. @@ -144,14 +144,14 @@ The ProvenanceReporter documents the changes that occurred which includes a CLON image::PassByReference.png["Pass By Reference"] -===== Extended Routing Use-cases: +=== Extended Routing Use-cases: In addition to routing FlowFiles based on attributes, some processors also route based on content. While it is not as efficient, sometimes it is necessary because you want to split up the content of the FlowFile into multiple FlowFiles. One example is the SplitText processor. This processor analyzes the content looking for end line characters and creates new FlowFiles containing a configurable number of lines. The Web Crawler flow uses this to split the potential URLs into single lines for URL extraction and to act as requests for InvokeHttp. One benefit of the SplitText processor is that since the processor is splitting contiguous chunks (no FlowFile content is disjoint or overlapping) the processor can do this routing without copying any content. All it does is create new FlowFiles, each with a pointer to a section of the original FlowFile’s content. This is made possible by the content demarcation and split facilities built into the NiFi API. While not always feasible to split in this manner when it is feasible the performance benefits are considerable. RouteText is a processor that shows why copying content can be needed for certain styles of routing. This processor analyzes each line and routes it to one or more relationships based on configurable properties. 
When more than one line gets routed to the same relationship (for the same input FlowFile), those lines get combined into one FlowFile. Since the lines could be disjoint (lines 1 and 100 route to the same relationship) and one pointer cannot describe the FlowFile's content accurately, the processor must copy the contents to a new location. For example, in the Web Crawler flow, the RouteText processor routes all lines that contain "nifi" to the "NiFi" relationship. So when there is one input FlowFile that has "nifi" multiple times on the web page, only one email will be sent (via the subsequent PutEmail processor). -===== Funnels +=== Funnels The funnel is a component that takes input from one or more connections and routes them to one or more destinations. The typical use-cases of which are described in the User Guide. Regardless of use-case, if there is only one processor downstream from the funnel then there are not any provenance events emitted by the funnel and it appears to be invisible in the Provenance graph. If there are multiple downstream processors, like the one in WebCrawler, then a clone event occurs. Referring to the graphic below, you can see that a new FlowFile (F2) is cloned from the original FlowFile (F1) and, just like the Routing above, the new FlowFile just has a pointer to the same content (the content is not copied). From a developer point of view, you can view a Funnel just as a very simple processor. When it is scheduled to run, it simply does a "ProcessSession.get()" and then "ProcessSession.transfer()" to the output connection. If there is more than one output connection (like the example below) then a "ProcessSession.clone()" is run. Finally a "ProcessSession.commit()" is called, completing the transaction. 
@@ -167,7 +167,7 @@ Note: For the sake of focusing on the Copy on Write event, the FlowFile's (F1) p image::CopyOnWrite.png["Copy On Write"] -===== Extended Copy on Write Use-case +==== Extended Copy on Write Use-case A unique case of Copy on Write is the MergeContent processor. Just about every processor only acts on one FlowFile at a time. The MergeContent processor is unique in that it takes in multiple FlowFiles and combines them into one. Currently, MergeContent has multiple different Merge Strategies but all of them require the contents of the input FlowFiles to be copied to a new merged location. After MergeContent finishes, it emits a provenance event of type "JOIN" that establishes that the given parents were joined together to create a new child FlowFile. @@ -183,14 +183,14 @@ Note: For the sake of focusing on the ATTRIBUTES_MODIFIED event the FlowFile's ( image::UpdatingAttributes.png["Updating Attributes"] -===== Typical Use-case Note +==== Typical Use-case Note In addition to adding arbitrary attributes via UpdateAttribute, extracting information from the content of a FlowFile into the attributes is a very common use-case. One such example in the Web Crawler flow is the ExtractText processor. We cannot use the URL when it is embedded within the content of the FlowFile, so we must extract the URL from the contents of the FlowFile and place it as an attribute. This way we can use the Expression Language to reference this attribute in the URL Property of InvokeHttp. === Data Egress Eventually data in NiFi will reach a point where it has either been loaded into another system and we can stop processing it, or we filtered the FlowFile out and determined we no longer care about it. Either way, the FlowFile will eventually be "DROPPED". "DROP" is a provenance event meaning that we are no longer processing the FlowFile in the Flow and it is available for deletion. It remains in the FlowFile Repository until the next repository checkpoint. 
The Provenance Repository keeps the Provenance events for an amount of time stated in nifi.properties (default is 24 hours). The content in the Content Repo is marked for deletion once the FlowFile leaves NiFi and the background checkpoint processing of the Write-Ahead Log to compact/remove occurs. That is unless another FlowFile references the same content or if archiving is enabled in nifi.properties. If archiving is enabled, the content exists until either the max percentage of disk is reached or max retention period is reached (also set in nifi.properties). -===== Deeper View: Deletion After Checkpointing +==== Deeper View: Deletion After Checkpointing Note: This section relies heavily on information from the "Deeper View: Content Claim" section above. Once the “.partial” file is synchronized with the underlying storage mechanism and renamed to be the new snapshot (detailed in the FlowFile Repo section) there is a callback to the FlowFile Repo to release all the old content claims (this is done after checkpointing so that content is not lost if something goes wrong). The FlowFile Repo knows which Content Claims can be released and notifies the Resource Claim Manager. The Resource Claim Manager keeps track of all the content claims that have been released and which resource claims are ready to be deleted (a resource claim is ready to be deleted when there are no longer any FlowFiles referencing it in the flow). @@ -198,12 +198,12 @@ Once the “.partial” file is synchronized with the underlying storage mechani Periodically the Content Repo asks the Resource Claim Manager which Resource Claims can be cleaned up. The Content Repo then makes the decision whether the Resource Claims should be archived or deleted (based on the value of the "nifi.content.repository.archive.enabled" property in the “nifi.properties” file). If archiving is disabled then the file is simply deleted from the disk. 
Otherwise, a background thread runs to see when archives should be deleted (based on the conditions above). This background thread keeps a list of the 10,000 oldest content claims and deletes them until below the necessary threshold. If it runs out of content claims it scans the repo for the oldest content to re-populate the list. This provides a model that is efficient in terms of both Java heap utilization as well as disk I/O utilization. -===== Associating Disparate Data +==== Associating Disparate Data One of the features of the Provenance Repository is that it allows efficient access to events that occur sequentially. A NiFi Reporting Task could then be used to iterate over these events and send them to an external service. If other systems are also sending similar types of events to this external system, it may be necessary to associate a NiFi FlowFile with another piece of information. For instance, if GetSFTP is used to retrieve data, NiFi refers to that FlowFile using its own, unique UUID. However, if the system that placed the file there referred to the file by filename, NiFi should have a mechanism to indicate that these are the same piece of data. This is accomplished by calling the ProvenanceReporter.associate() method and providing both the UUID of the FlowFile and the alternate name (the filename, in this example). Since the determination that two pieces of data are the same may be flow-dependent, it is often necessary for the DataFlow Manager to make this association. A simple way of doing this is to use the UpdateAttribute processor and configure it to set the "alternate.identifier" attribute. This automatically emits the "associate" event, using whatever value is added as the “alternate.identifier” attribute. -= Closing Remarks +== Closing Remarks Utilizing the copy-on-write, pass-by-reference, and immutability concepts in conjunction with the three repositories, NiFi is a fast, efficient, and robust enterprise dataflow platform. 
This document has covered specific implementations of pluggable interfaces. These include the Write-Ahead Log based implementation of the FlowFile Repository, the File based Provenance Repository, and the File based Content Repository. These implementations are the NiFi defaults but are pluggable so that, if needed, users can write their own to fulfill certain use-cases. Hopefully this document has given you a better understanding of the low-level functionality of NiFi and the decisions behind them. If there is something you wish to have explained more in depth or you feel should be included please feel free to send an email to the Apache NiFi Developer mailing list (dev@nifi.apache.org).