In traditional NiFi, FlowFile content is stored on disk, not in memory. As a result, it is capable of handling data of any size
as long as it fits on the disk. However, in Stateless, FlowFile contents are stored in memory, in the JVM heap. As
a result, it is generally not advisable to attempt to load massive files, such as a 100 GB dataset, into Stateless NiFi.
Doing so will often result in an OutOfMemoryError, or at a minimum cause significant garbage collection, which can degrade
performance.
## Feature Comparisons
As mentioned above, Stateless NiFi offers a different set of features and tradeoffs from traditional NiFi.
Here, we summarize the key differences. This comparison is not exhaustive but provides a quick look at how
the two runtimes operate.
| Feature | Traditional NiFi | Stateless NiFi |
|---------|------------------|----------------|
| Data Durability | Data is reliably stored on disk in the FlowFile and Content Repositories | Data is stored in-memory and must be consumed from the source again upon restart |
| Data Ordering | Data is ordered independently in each Connection based on the selected Prioritizers | Data flows through the system in the order it was received (First-In, First-Out / FIFO) |
| Site-to-Site | Supports full Site-to-Site capabilities, including Server and Client roles | Can push to, or pull from, a NiFi instance but cannot receive incoming Site-to-Site connections. I.e., works as a client but not a server. |
| Form Factor | Large form factor. Designed to take advantage of many cores and disks. | Light-weight form factor. Easily embedded into another application. Single-threaded processing. |
| Heap Considerations | Typically, many processors in use by many users. FlowFile content should not be loaded into heap because it can easily cause heap exhaustion. | Smaller dataflows use less heap. Flow operates on only one or a few FlowFiles at a time and holds FlowFile contents in memory in the Java heap. |
| Data Provenance | Fully stored, indexed data provenance that can be browsed through the UI and exported via Reporting Tasks | Limited Data Provenance capabilities, with events stored in memory. Events cannot be viewed but can be exported using Reporting Tasks. However, because they are held in memory, they are lost upon restart and may roll off before they can be exported. |
| Embeddability | While technically possible to embed traditional NiFi, it is not recommended, as it launches a heavy-weight User Interface, deals with complex authentication and authorization, and several file-based external dependencies, which can be difficult to manage. | Has minimal external dependencies (directory containing extensions and a working directory to use for temporary storage) and is much simpler to manage. Embeddability is an important feature of Stateless NiFi. |
## Running Stateless NiFi
Stateless NiFi can be used as a library and embedded into other applications. However, it can also be run directly
from the command-line from a NiFi build using the `bin/nifi.sh` script.
To do so requires three files:
- The engine configuration properties file
- The dataflow configuration properties file
- The dataflow itself (which may exist as a file, or point to a flow in a NiFi registry)
Stateless NiFi accepts two separate configuration files: an engine configuration file and a dataflow configuration file.
This is done because typically the engine configuration will be the same for all flows that are run, so it can be created
only once. The dataflow configuration will be different for each dataflow that is to be run.
All properties in the Engine Configuration file are prefixed with `nifi.stateless.`. Below is a list of property names,
descriptions, and example values:
| Property Name | Description | Example Value |
|---------------|-------------|---------------|
| nifi.stateless.nar.directory | The location of a directory containing all NiFi Archives (NARs) that are necessary for running the dataflow | /var/lib/nifi/lib |
| nifi.stateless.working.directory | The location of a directory where Stateless should store its expanded NAR files and use for temporary storage | /var/lib/nifi/work/stateless |
The following properties may be used for configuring security parameters:
| Property Name | Description | Example Value |
|---------------|-------------|---------------|
| nifi.stateless.security.truststore | Filename of a Truststore to use for Site-to-Site or for interacting with NiFi Registry or Extension Clients | /etc/certs/truststore.jks |
| nifi.stateless.security.truststorePasswd | The password of the Truststore. | do-not-use-this-password |
| nifi.stateless.security.keystore | Filename of a Keystore to use for Site-to-Site or for interacting with NiFi Registry or Extension Clients | /etc/certs/keystore.jks |
| nifi.stateless.security.keystorePasswd | The password of the Keystore | do-not-use-this-password-either |
| nifi.stateless.security.keyPasswd | An optional password for the key in the Keystore. If not specified, the password of the Keystore itself will be used. | password |
| nifi.stateless.sensitive.props.key | The dataflow does not hold sensitive passwords, but some processors may have a need to encrypt data before storing it. This key is used to allow processors to encrypt and decrypt data. At present, the only Processor supported by the community that makes use of this feature is the GetJMSTopic processor, which is deprecated. However, it is provided here for completeness. | Some Passphrase That's Difficult to Guess |
| nifi.stateless.kerberos.krb5.file | The KRB5 file to use for interacting with Kerberos. This is only necessary if the dataflow interacts with a Kerberized data source/sink. If not specified, will default to `/etc/krb5.conf` | /etc/krb5.conf |
When Stateless NiFi is started, it parses the provided dataflow and determines which bundles/extensions are necessary
to run the dataflow. If an extension is not available, or the version referenced by the flow is not available, Stateless
may attempt to download the extensions automatically. To do this, one or more Extension Clients must be configured.
Each client is configured using several properties, which are all tied together using a 'key'. For example, if we have four properties,
`nifi.stateless.extension.client.ABC.type`, `nifi.stateless.extension.client.ABC.baseUrl`,
`nifi.stateless.extension.client.XYZ.type`, and `nifi.stateless.extension.client.XYZ.baseUrl`, then we know that
the first `type` property refers to the same client as the first `baseUrl` property because they both have the 'key'
`ABC`. Similarly, the second `type` and `baseUrl` properties refer to the same client because they have the same 'key':
`XYZ`.
Any extension that is downloaded will be stored in the directory specified by the `nifi.stateless.nar.directory` property described above.
| Property Name | Description | Example Value |
|---------------|-------------|---------------|
| nifi.stateless.extension.client.\<key>.type | The type of Extension Client. Currently, the only supported value is 'nexus'. | nexus |
| nifi.stateless.extension.client.\<key>.baseUrl | The Base URL to use when connecting to the service. The example here is for Maven Central. | https://repo1.maven.org/maven2/ |
| nifi.stateless.extension.client.\<key>.timeout | The amount of time to wait to connect to the system or receive data from the system. | 30 secs |
| nifi.stateless.extension.client.\<key>.useSslContext | If the Base URL indicates that the HTTPS protocol is to be used, this property dictates whether the SSL Context defined above is to be used or not. If not, then the default Java truststore information will be used. | false |
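
As a sketch of how the key ties a client's properties together, the fragment below defines two Extension Clients. The keys `ABC` and `XYZ` are arbitrary labels, and the second base URL is a hypothetical internal repository used only for illustration:

```properties
# Client "ABC": Maven Central, relying on the default Java truststore
nifi.stateless.extension.client.ABC.type=nexus
nifi.stateless.extension.client.ABC.baseUrl=https://repo1.maven.org/maven2/
nifi.stateless.extension.client.ABC.timeout=30 secs
nifi.stateless.extension.client.ABC.useSslContext=false

# Client "XYZ": hypothetical internal Nexus instance (the URL is an assumption),
# using the SSL Context defined by the security properties
nifi.stateless.extension.client.XYZ.type=nexus
nifi.stateless.extension.client.XYZ.baseUrl=https://nexus.example.com/repository/nars/
nifi.stateless.extension.client.XYZ.timeout=30 secs
nifi.stateless.extension.client.XYZ.useSslContext=true
```
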
A full example of the Engine Configuration may look as follows:
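
The sketch below simply combines the example values from the tables above. The client key `mvn-central` is an arbitrary label chosen for illustration, and the passwords are placeholders rather than recommended values:

```properties
# Where to find NARs and where to expand them
nifi.stateless.nar.directory=/var/lib/nifi/lib
nifi.stateless.working.directory=/var/lib/nifi/work/stateless

# Security parameters (placeholder passwords)
nifi.stateless.security.truststore=/etc/certs/truststore.jks
nifi.stateless.security.truststorePasswd=do-not-use-this-password
nifi.stateless.security.keystore=/etc/certs/keystore.jks
nifi.stateless.security.keystorePasswd=do-not-use-this-password-either
nifi.stateless.security.keyPasswd=password
nifi.stateless.sensitive.props.key=Some Passphrase That's Difficult to Guess
nifi.stateless.kerberos.krb5.file=/etc/krb5.conf

# One Extension Client; "mvn-central" is an arbitrary key
nifi.stateless.extension.client.mvn-central.type=nexus
nifi.stateless.extension.client.mvn-central.baseUrl=https://repo1.maven.org/maven2/
nifi.stateless.extension.client.mvn-central.timeout=30 secs
nifi.stateless.extension.client.mvn-central.useSslContext=false
```
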
The flow's location must be provided either by specifying a NiFi Registry URL, Bucket ID, and Flow ID (and optional version);
by specifying a local filename for the flow; by specifying a URL for the flow; or by including a "stringified" version of the JSON flow definition itself.
Note that if using a local filename, the format of the file is not the same as
the `flow.xml.gz` file that NiFi uses but rather is the `Versioned Flow Snapshot` format that is used by the NiFi Registry.
The easiest way to export a flow from NiFi onto local disk for use by Stateless NiFi is to right-click on a Process Group or
the canvas in NiFi and choose `Download Flow`.
The following properties are supported for specifying the location of a flow:
| Property Name | Description | Example Value |
|---------------|-------------|---------------|
| nifi.stateless.registry.url | The URL of the NiFi Registry to source the dataflow from. If specified, the `flow.bucketId` and the `flow.id` must also be specified. | https://nifi-registry/ |
| nifi.stateless.flow.bucketId | The UUID of the bucket in NiFi Registry that holds the flow. | 00000000-0000-0000-0000-000000000011 |
| nifi.stateless.flow.id | The UUID of the flow in NiFi Registry. | 00000000-0000-0000-0000-000000000044 |
| nifi.stateless.flow.version | The version of the dataflow to run. If not specified, will use the latest version of the flow. | 5 |
| nifi.stateless.flow.snapshot.file | Instead of using the NiFi Registry to source the flow, the flow can be a local file. In this case, this provides the filename of the file. | /var/lib/nifi/flows/my-flow.json |
| nifi.stateless.flow.snapshot.url | A URL that contains the Flow Definition to use. | https://gist.github.com/apache/223389cb6cbbd82985fbb8d429b58899 |
| nifi.stateless.flow.snapshot.url.use.ssl.context | A boolean value indicating whether or not the SSL Context that is defined in the Engine Configuration properties file should be used when downloading the flow | false |
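
As an illustration, a dataflow configuration would set only one of the following groups of properties. The values are the example values from the table above, and the commented-out lines show the alternatives:

```properties
# Option 1: source the flow from NiFi Registry
nifi.stateless.registry.url=https://nifi-registry/
nifi.stateless.flow.bucketId=00000000-0000-0000-0000-000000000011
nifi.stateless.flow.id=00000000-0000-0000-0000-000000000044
# Optional; the latest version is used when this is omitted
nifi.stateless.flow.version=5

# Option 2: source the flow from a local Versioned Flow Snapshot file
#nifi.stateless.flow.snapshot.file=/var/lib/nifi/flows/my-flow.json

# Option 3: source the flow from a URL
#nifi.stateless.flow.snapshot.url=https://gist.github.com/apache/223389cb6cbbd82985fbb8d429b58899
#nifi.stateless.flow.snapshot.url.use.ssl.context=false
```
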
Stateless NiFi also allows the user to provide one or more Parameter Contexts to use in the dataflow:
| Property Name | Description | Example Value |
|---------------|-------------|---------------|
| nifi.stateless.parameters.\<key> | The name of the Parameter Context. This must match the name of a Parameter Context that is referenced within the dataflow. | My Parameter Context |
| nifi.stateless.parameters.\<key>.\<parameter name> | The name of a Parameter to use, the value of the property being the value of the Parameter | My Value |
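For example, to create a Parameter Context with the name "Kafka Parameter Context" and 2 parameters, "Kafka Topic" and "Kafka Brokers", we would define three properties: one `nifi.stateless.parameters.<key>` property that names the context and one `nifi.stateless.parameters.<key>.<parameter name>` property for each parameter.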
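
A minimal sketch of such a Parameter Context follows. The key `kafka` is an arbitrary label and the topic value `Sensor Data` is purely illustrative. The spaces in the parameter names are backslash-escaped on the assumption that the file is read as a standard Java properties file, where an unescaped space would end the key:

```properties
# The key "kafka" ties the three properties together
nifi.stateless.parameters.kafka=Kafka Parameter Context
# Parameter names containing spaces are escaped (assumes standard Java properties parsing)
nifi.stateless.parameters.kafka.Kafka\ Topic=Sensor Data
nifi.stateless.parameters.kafka.Kafka\ Brokers=kafka-01:9092,kafka-02:9092,kafka-03:9092
```
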
There are times, however, when we do not want to provide the list of Parameters in the dataflow properties file. We may want to fetch the Parameters from some file or
an external service. For this reason, Stateless supports a notion of a Parameter Provider. A Parameter Provider is an extension point that can be used to retrieve Parameters
from elsewhere. For information on how to configure a Parameter Provider, see the [Passing Parameters](#passing-parameters) section below.
When a stateless dataflow is triggered, it can also be important to consider how much data should be allowed to enter the dataflow for a given invocation.
Typically, this consists of a single FlowFile at a time or a single batch of FlowFiles at a time, depending on the source processor. However, some processors may
require additional data in order to perform their tasks. For example, if we have a dataflow whose source processor brings in a single message from a JMS Queue, and
later in the flow have a MergeContent processor, that MergeContent processor may not be able to perform its function with just one message. As a result, the source
processor will be triggered again. This process will continue until either the MergeContent processor is able to make progress and empty its incoming FlowFile Queues
OR until some threshold has been reached. These thresholds can be configured using the following properties:
| Property Name | Description | Example Value |
|---------------|-------------|---------------|
| nifi.stateless.transaction.thresholds.flowfiles | The maximum number of FlowFiles that source processors should bring into the flow each time the dataflow is triggered. | 1000 |
| nifi.stateless.transaction.thresholds.bytes | The maximum amount of data for all FlowFiles' contents. | 100 MB |
| nifi.stateless.transaction.thresholds.time | The amount of time between when the dataflow was triggered and when the source processors should stop being triggered. | 1 sec |
For example, to ensure that the source processors are not triggered to bring in more than 1 MB of data and not more than 10 FlowFiles, we can use:
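
A sketch of that configuration, using only the threshold properties listed above with the limits named in the preceding sentence:

```properties
# Stop triggering sources once 10 FlowFiles have been brought in...
nifi.stateless.transaction.thresholds.flowfiles=10
# ...or once 1 MB of FlowFile content has been brought in, whichever comes first
nifi.stateless.transaction.thresholds.bytes=1 MB
```
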
With this configuration, each time the dataflow is triggered, the source processor (or all sources, cumulatively, if there is more than one) will not be triggered again after it has brought
10 FlowFiles OR 1 MB worth of FlowFile content (regardless of whether that 1 MB came from a single FlowFile or from the sum of all FlowFiles) into the flow.
Note, however, that if the source were to bring in 1,000 FlowFiles and 50 MB of data in a single invocation, that would be allowed, but the component would no longer be triggered until the dataflow is triggered again.
The dataflow configuration also allows for defining Reporting Tasks. Similarly, multiple properties for a given Reporting Task
are tied together with a common key. The following properties are supported:
| Property Name | Description | Example Value |
|---------------|-------------|---------------|
| nifi.stateless.reporting.task.\<key>.name | The name of the Reporting Task | Log Status |
| nifi.stateless.reporting.task.\<key>.type | The type of the Reporting Task. This may be the fully qualified classname or the simple name, if only a single class exists with the simple name | ControllerStatusReportingTask |
| nifi.stateless.reporting.task.\<key>.bundle | The bundle that holds the Reporting Task. If not specified, the bundle will be automatically identified, if there exists exactly one bundle with the reporting task. However, if no Bundle is specified, none will be downloaded and if more than 1 is already available, the Reporting Task cannot be created. The format is \<group id>:\<artifact id>:\<version> | org.apache.nifi:nifi-standard-nar:1.12.1 |
| nifi.stateless.reporting.task.\<key>.properties.\<property name> | One or more Reporting Task properties may be configured using this syntax | Any valid value for the corresponding property |
| nifi.stateless.reporting.task.\<key>.frequency | How often the Reporting Task should be triggered | 2 sec |
An example Reporting Task that will log stats to the log file every 30 seconds is as follows:
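
A minimal sketch, assuming the arbitrary key `stats` and the type and bundle shown as example values in the table above:

```properties
# "stats" is an arbitrary key that groups the properties of one Reporting Task
nifi.stateless.reporting.task.stats.name=Log Status
nifi.stateless.reporting.task.stats.type=ControllerStatusReportingTask
nifi.stateless.reporting.task.stats.bundle=org.apache.nifi:nifi-standard-nar:1.12.1
# Trigger the task every 30 seconds
nifi.stateless.reporting.task.stats.frequency=30 sec
```
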
There is one additional property that is supported in the dataflow configuration:
| Property Name | Description | Example Value |
|---------------|-------------|---------------|
| nifi.stateless.failure.port.names | A comma-delimited list of Output Port names. If a FlowFile is routed to any of these Output Ports, it is considered a failure and will roll back the entire session. | Unknown Kafka Type, Parse Failure, Failed to Write to HDFS |
This property allows the user to enter one or more ports that should be considered failures. The value is a comma-separated list of names of Output Ports. In the example above, if a FlowFile is
routed to the "Unknown Kafka Type" port, the "Parse Failure" port, or the "Failed to Write to HDFS" port, then the flow is considered a failure. The entire session will be rolled back, and the source
Processor will not acknowledge the data from the source. As a result, the next time that the dataflow is triggered, it will consume the same data again.
While used as an illustrative example here, it may not make sense to route data to a "Parse Failure" Output Port and consider that a failure. Data that fails to parse is unlikely
to parse successfully the next time around, so such a configuration may result in constantly consuming the same data and attempting to process it over and over again. However, it may make sense if the use case
dictates that no more data may be processed until such a message on the Kafka queue has been properly dealt with.
##### Full Examples
An example of a fully formed dataflow configuration file that will import a dataflow from NiFi Registry is as follows:
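
The sketch below assembles such a file from the properties described in the preceding tables. The keys `kafka` and `stats` are arbitrary labels, the topic value `Sensor Data` is illustrative only, and the escaped spaces assume the file is read as a standard Java properties file:

```properties
# Source the flow from NiFi Registry
nifi.stateless.registry.url=https://nifi-registry/
nifi.stateless.flow.bucketId=00000000-0000-0000-0000-000000000011
nifi.stateless.flow.id=00000000-0000-0000-0000-000000000044
nifi.stateless.flow.version=5

# Parameter Context referenced by the flow; "kafka" is an arbitrary key
nifi.stateless.parameters.kafka=Kafka Parameter Context
nifi.stateless.parameters.kafka.Kafka\ Topic=Sensor Data
nifi.stateless.parameters.kafka.Kafka\ Brokers=kafka-01:9092,kafka-02:9092,kafka-03:9092

# Limit how much data may enter the flow per invocation
nifi.stateless.transaction.thresholds.flowfiles=1000
nifi.stateless.transaction.thresholds.bytes=100 MB
nifi.stateless.transaction.thresholds.time=1 sec

# Output Ports whose use rolls back the entire session
nifi.stateless.failure.port.names=Unknown Kafka Type, Parse Failure, Failed to Write to HDFS

# Reporting Task that logs status every 30 seconds; "stats" is an arbitrary key
nifi.stateless.reporting.task.stats.name=Log Status
nifi.stateless.reporting.task.stats.type=ControllerStatusReportingTask
nifi.stateless.reporting.task.stats.bundle=org.apache.nifi:nifi-standard-nar:1.12.1
nifi.stateless.reporting.task.stats.frequency=30 sec
```
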
When a value for "Kafka Brokers" is passed using the `-p` syntax without naming a Parameter Context, any Parameter named "Kafka Brokers" will be resolved to `kafka-01:9092,kafka-02:9092,kafka-03:9092`, regardless of the name
of the Parameter Context that contains it.
If a given Parameter is referenced and is not defined using the `-p` syntax, an environment variable may also be used to provide the value. However, environment variables typically are
allowed to contain only letters, numbers, and underscores in their names. As a result, it is important that the Parameters' names also adhere to that same rule, or the environment variable cannot be used to supply the value.
At times, none of the built-in capabilities for resolving Parameters are ideal, though. In these situations, we can use a custom Parameter Provider in order to source Parameter values from elsewhere.
To configure a custom Parameter Provider, we must configure it similarly to Reporting Tasks, using a common key to indicate which Parameter Provider the property belongs to.
The following properties are supported:
| Property Name | Description | Example Value |
|---------------|-------------|---------------|
| nifi.stateless.parameter.provider.\<key>.name | The name of the Parameter Provider | My Secret Parameter Provider |
| nifi.stateless.parameter.provider.\<key>.type | The type of the Parameter Provider. This may be the fully qualified classname or the simple name, if only a single class exists with the simple name | MySecretParameterProvider |
| nifi.stateless.parameter.provider.\<key>.bundle | The bundle that holds the Parameter Provider. If not specified, the bundle will be automatically identified, if there exists exactly one bundle with the Parameter Provider. However, if no Bundle is specified, none will be downloaded and if more than 1 is already available, the Parameter Provider cannot be created. The format is \<group id>:\<artifact id>:\<version> | org.apache.nifi:nifi-standard-nar:1.14.0 |
| nifi.stateless.parameter.provider.\<key>.properties.\<property name> | One or more Parameter Provider properties may be configured using this syntax | Any valid value for the corresponding property |
An example Parameter Provider might be configured as follows:
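
A minimal sketch, reusing the name and type shown as example values in the table above. The key `secrets`, the bundle coordinates, and the `secrets-file` property are hypothetical and stand in for whatever a real custom provider defines:

```properties
# "secrets" is an arbitrary key that groups the properties of one Parameter Provider
nifi.stateless.parameter.provider.secrets.name=My Secret Parameter Provider
nifi.stateless.parameter.provider.secrets.type=MySecretParameterProvider
# Hypothetical bundle coordinates in <group id>:<artifact id>:<version> format
nifi.stateless.parameter.provider.secrets.bundle=com.example:my-secret-parameter-provider-nar:1.0.0
# Hypothetical provider-specific property
nifi.stateless.parameter.provider.secrets.properties.secrets-file=/etc/nifi/secrets.properties
```
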