NIFI-13854 Updated Getting Started Guide for 2.0.0 (#9362)

Signed-off-by: David Handermann <exceptionfactory@apache.org>
Michael Moser 2024-10-11 17:06:00 -04:00 committed by GitHub
parent 8e867be875
commit fa2a01c823
3 changed files with 58 additions and 88 deletions


@@ -16,7 +16,7 @@
//
= Getting Started with Apache NiFi
Apache NiFi Team <dev@nifi.apache.org>
:homepage: http://nifi.apache.org
:homepage: https://nifi.apache.org
:linkattrs:
@@ -60,16 +60,8 @@ dataflows.
WARNING: Before proceeding, check the Admin Guide to confirm you have the <<administration-guide.adoc#system_requirements,minimum system requirements>> to install and run NiFi.
NiFi can be downloaded from the link:http://nifi.apache.org/download.html[NiFi Downloads page^]. There are two packaging options
available:
- a "tarball" (tar.gz) that is tailored more to Linux
- a zip file that is more applicable for Windows users
macOS users may also use the tarball or can install via link:https://brew.sh[Homebrew^] by simply running the command `brew install nifi` from the command line terminal.
For users who are not running macOS or do not have Homebrew installed, after downloading the version of NiFi that you
would like to use, simply extract the archive to the location that you wish to run the application from.
NiFi can be downloaded from the link:https://nifi.apache.org/download/[NiFi Downloads page^]. After downloading the version of NiFi
that you would like to use, simply extract the zip archive to the location that you wish to run the application from.
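For example, on a Linux or macOS host, downloading and extracting might look like the following (a minimal sketch; the 2.0.0 archive name, the standard Apache download layout, and the `/opt` destination are assumptions, so adjust to the version and location you actually use):

----
# Download the NiFi binary distribution (version shown is an example)
curl -LO https://dlcdn.apache.org/nifi/2.0.0/nifi-2.0.0-bin.zip

# Extract the archive to the location NiFi will run from
unzip nifi-2.0.0-bin.zip -d /opt
cd /opt/nifi-2.0.0
----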
For information on how to configure the instance of NiFi (for example, to configure security, data storage
configuration, or the port that NiFi is running on), see the link:administration-guide.html[Admin Guide].
@@ -83,7 +75,7 @@ appropriate for your operating system.
=== For Windows Users
For Windows users, navigate to the folder where NiFi was installed. Within this folder is a subfolder
named `bin`. Navigate to this subfolder and run `nifi.cmd start` file.
named `bin`. Navigate to this subfolder and run `nifi.cmd start`.
This will launch NiFi and leave it running in the foreground. To shut down NiFi, select the window that
was launched and hold the Ctrl key while pressing C.
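As a quick illustration, a typical session in a Command Prompt might look like this (a sketch; the `C:\nifi` installation path is an assumption):

----
cd C:\nifi\bin
nifi.cmd start
rem NiFi now runs in the foreground of this window.
rem Press Ctrl+C in this window to shut it down.
----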
@@ -104,18 +96,6 @@ be used.
If NiFi was installed with Homebrew, run the commands `nifi start` or `nifi stop` from anywhere in your file system to start or stop NiFi.
=== Installing as a Service
Currently, installing NiFi as a service is supported only for Linux and macOS users. To install the application
as a service, navigate to the installation directory in a Terminal window and execute the command `bin/nifi.sh install`
to install the service with the default name `nifi`. To specify a custom name for the service, execute the command
with an optional second argument that is the name of the service. For example, to install NiFi as a service with the
name `dataflow`, use the command `bin/nifi.sh install dataflow`.
Once installed, the service can be started and stopped using the appropriate commands, such as `sudo service nifi start`
and `sudo service nifi stop`. Additionally, the running status can be checked via `sudo service nifi status`.
== I Started NiFi. Now What?
@@ -188,14 +168,16 @@ for the Processor. The properties that are available depend on the type of Proce
for each type. Properties that are in bold are required properties. The Processor cannot be started until all required
properties have been configured. The most important property to configure for GetFile is the directory from which
to pick up files. If we set the directory name to `./data-in`, this will cause the Processor to start picking up
any data in the `data-in` subdirectory of the NiFi Home directory. We can choose to configure several different
Properties for this Processor. If unsure what a particular Property does, we can hover over the Help icon (
image:iconInfo.png["Help"]
any data in the `data-in` subdirectory of the NiFi Home directory.
We can choose to configure several different Properties for this Processor.
If unsure what a particular Property does, we can hover over the Info icon (
image:iconInfo2.png["Info"]
)
next to the Property Name with the mouse in order to read a description of the property. Additionally, the
tooltip that is displayed when hovering over the Help icon will provide the default value for that property,
if one exists, information about whether or not the property supports the Expression Language (see the
<<ExpressionLanguage>> section below), and previously configured values for that property.
tooltip that is displayed will provide the default value for that property if one exists,
information about whether the property supports the Expression Language (see the <<ExpressionLanguage>> section below),
whether the property is sensitive and will be encrypted at rest, and history of previously configured values for that property.
In order for this property to be valid, create a directory named `data-in` in the NiFi home directory and then
click the `Apply` button to close the dialog.
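For example, from a terminal in the NiFi home directory (a minimal sketch; the sample file name is arbitrary):

----
# Create the input directory that GetFile will watch
mkdir data-in

# Drop in a sample file for the Processor to pick up once it is started
echo "hello nifi" > data-in/sample.txt
----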
@@ -220,7 +202,7 @@ transfers to the `success` Relationship.
In order to address this, let's add another Processor that we can connect the GetFile Processor to, by following
the same steps above. This time, however, we will simply log the attributes that exist for the FlowFile. To do this,
we will add a LogAttributes Processor.
we will add a LogAttribute Processor.
We can now send the output of the GetFile Processor to the LogAttribute Processor. Hover over the GetFile Processor
with the mouse and a Connection Icon (
@@ -256,8 +238,8 @@ image:iconStop.png[Stopped]
). The LogAttribute Processor, however, is now invalid because its `success` Relationship has not been connected to
anything. Let's address this by signaling that data that is routed to `success` by LogAttribute should be "Auto Terminated,"
meaning that NiFi should consider the FlowFile's processing complete and "drop" the data. To do this, we configure the
LogAttribute Processor. On the Settings tab, in the right-hand side we can check the box next to the `success` Relationship
to Auto Terminate the data. Clicking `OK` will close the dialog and show that both Processors are now stopped.
LogAttribute Processor. On the Relationships tab, we can check the `terminate` box next to the `success` Relationship
to Auto Terminate the data. Clicking the `Apply` button will close the dialog and show that both Processors are now stopped.
=== Starting and Stopping Processors
@@ -281,10 +263,11 @@ corner of the Processor, but nothing is shown there if there are currently no ta
With each Processor having the ability to expose multiple different Properties and Relationships, it can be challenging
to remember how all of the different pieces work for each Processor. To address this, you are able to right-click
on a Processor and choose the `Usage` menu item. This will provide you with the Processor's usage information, such as a
on a Processor and choose the `View Documentation` menu item. This will provide you with the Processor's usage information, such as a
description of the Processor, the different Relationships that are available, when the different Relationships are used,
Properties that are exposed by the Processor and their documentation, as well as which FlowFile Attributes (if any) are
expected on incoming FlowFiles and which Attributes (if any) are added to outgoing FlowFiles.
expected on incoming FlowFiles and which Attributes (if any) are added to outgoing FlowFiles. Some Processors also describe
specific configurations needed to accomplish common use cases.
=== Other Components
@@ -310,7 +293,7 @@ categorizing them by their functions.
=== Data Transformation
- *CompressContent*: Compress or Decompress Content
- *ConvertCharacterSet*: Convert the character set used to encode the content from one character set to another
- *EncryptContent*: Encrypt or Decrypt Content
- *EncryptContentAge* / *EncryptContentPGP*: Encrypt or Decrypt Content
- *ReplaceText*: Use Regular Expressions to modify textual Content
- *TransformXml*: Apply an XSLT transform to XML Content
- *JoltTransformJSON*: Apply a JOLT specification to transform JSON Content
@@ -318,7 +301,7 @@ categorizing them by their functions.
=== Routing and Mediation
- *ControlRate*: Throttle the rate at which data can flow through one part of the flow
- *DetectDuplicate*: Monitor for duplicate FlowFiles, based on some user-defined criteria. Often used in conjunction
with HashContent
with CryptographicHashContent
- *DistributeLoad*: Load balance or sample data by distributing only a portion of data to each user-defined Relationship
- *MonitorActivity*: Sends a notification when a user-defined period of time elapses without any data coming through a particular
point in the flow. Optionally sends a notification when dataflow resumes.
@@ -335,8 +318,6 @@ categorizing them by their functions.
=== Database Access
- *ExecuteSQL*: Executes a user-defined SQL SELECT command, writing the results to a FlowFile in Avro format
- *PutSQL*: Updates a database by executing the SQL DML statement defined by the FlowFile's content
- *SelectHiveQL*: Executes a user-defined HiveQL SELECT command against an Apache Hive database, writing the results to a FlowFile in Avro or CSV format
- *PutHiveQL*: Updates a Hive database by executing the HiveQL DDM statement defined by the FlowFile's content
[[AttributeExtraction]]
=== Attribute Extraction
@@ -348,8 +329,7 @@ categorizing them by their functions.
Content or extract the value into the user-named Attribute.
- *ExtractText*: User supplies one or more Regular Expressions that are then evaluated against the textual content of the FlowFile, and the
values that are extracted are then added as user-named Attributes.
- *HashAttribute*: Performs a hashing function against the concatenation of a user-defined list of existing Attributes.
- *HashContent*: Performs a hashing function against the content of a FlowFile and adds the hash value as an Attribute.
- *CryptographicHashContent*: Performs a hashing function against the content of a FlowFile and adds the hash value as an Attribute.
- *IdentifyMimeType*: Evaluates the content of a FlowFile in order to determine what type of file the FlowFile encapsulates. This Processor is
capable of detecting many different MIME Types, such as images, word processor documents, text, and compression formats just to name
a few.
@@ -374,12 +354,9 @@ categorizing them by their functions.
the data from one location to another location and is not to be used for copying the data.
- *GetSFTP*: Downloads the contents of a remote file via SFTP into NiFi and then deletes the original file. This Processor is expected to move
the data from one location to another location and is not to be used for copying the data.
- *GetJMSQueue*: Downloads a message from a JMS Queue and creates a FlowFile based on the contents of the JMS message. The JMS Properties are
optionally copied over as Attributes, as well.
- *GetJMSTopic*: Downloads a message from a JMS Topic and creates a FlowFile based on the contents of the JMS message. The JMS Properties are
optionally copied over as Attributes, as well. This Processor supports both durable and non-durable subscriptions.
- *GetHTTP*: Downloads the contents of a remote HTTP- or HTTPS-based URL into NiFi. The Processor will remember the ETag and Last-Modified Date
in order to ensure that the data is not continually ingested.
- *ConsumeJMS*: Downloads a message from a JMS Queue or Topic and creates a FlowFile based on the contents of the JMS message. The JMS Properties are
optionally copied over as Attributes, as well. This Processor also supports durable topic subscriptions.
- *InvokeHTTP*: Can download data from a remote HTTP server. See the <<HTTP>> section below.
- *ListenHTTP*: Starts an HTTP (or HTTPS) Server and listens for incoming connections. For any incoming POST request, the contents of the request
are written out as a FlowFile, and a 200 response is returned.
- *ListenUDP*: Listens for incoming UDP packets and creates a FlowFile per packet or per bundle of packets (depending on configuration) and
@@ -387,16 +364,16 @@ categorizing them by their functions.
- *GetHDFS*: Monitors a user-specified directory in HDFS. Whenever a new file enters HDFS, it is copied into NiFi and deleted from HDFS. This
Processor is expected to move the file from one location to another location and is not to be used for copying the data. This Processor is also
expected to be run On Primary Node only, if run within a cluster. In order to copy the data from HDFS and leave it intact, or to stream the data
from multiple nodes in the cluster, see the ListHDFS Processor.
from multiple nodes in the cluster, see the ListHDFS Processor. _HDFS components are available via a NiFi plugin extension._
- *ListHDFS* / *FetchHDFS*: ListHDFS monitors a user-specified directory in HDFS and emits a FlowFile containing the filename for each file that it
encounters. It then persists this state across the entire NiFi cluster by way of a Distributed Cache. These FlowFiles can then be fanned out across
the cluster and sent to the FetchHDFS Processor, which is responsible for fetching the actual content of those files and emitting FlowFiles that contain
the content fetched from HDFS.
the content fetched from HDFS. _HDFS components are available via a NiFi plugin extension._
- *FetchS3Object*: Fetches the contents of an object from the Amazon Web Services (AWS) Simple Storage Service (S3). The outbound FlowFile contains the contents
received from S3.
- *GetKafka*: Fetches messages from Apache Kafka, specifically for 0.8.x versions. The messages can be emitted as a FlowFile per message or can be batched together using a user-specified delimiter.
- *ConsumeKafka*: Fetches messages from Apache Kafka. The messages can be emitted as a FlowFile per message or can be batched together using a user-specified delimiter.
- *GetMongo*: Executes a user-specified query against MongoDB and writes the contents to a new FlowFile.
- *GetTwitter*: Allows Users to register a filter to listen to the Twitter "garden hose" or Enterprise endpoint, creating a FlowFile for each tweet
- *ConsumeTwitter*: Allows Users to register a filter to listen to the X/Twitter "garden hose" or Enterprise endpoint, creating a FlowFile for each post
that is received.
=== Data Egress / Sending Data
@@ -404,12 +381,13 @@ categorizing them by their functions.
- *PutFile*: Writes the contents of a FlowFile to a directory on the local (or network attached) file system.
- *PutFTP*: Copies the contents of a FlowFile to a remote FTP Server.
- *PutSFTP*: Copies the contents of a FlowFile to a remote SFTP Server.
- *PutJMS*: Sends the contents of a FlowFile as a JMS message to a JMS broker, optionally adding JMS Properties based on Attributes.
- *InvokeHTTP*: Sends the contents of a FlowFile to a remote HTTP server. See the <<HTTP>> section below.
- *PublishJMS*: Sends the contents of a FlowFile as a JMS message to a JMS broker, optionally adding JMS Properties based on Attributes.
- *PutSQL*: Executes the contents of a FlowFile as a SQL DML statement (INSERT, UPDATE, or DELETE). The contents of the FlowFile must be a valid
SQL statement. Attributes can be used as parameters so that the contents of the FlowFile can be a parameterized SQL statement, avoiding
SQL injection attacks (see the sketch after this list).
- *PutKafka*: Sends the contents of a FlowFile as a message to Apache Kafka, specifically for 0.8.x versions. The FlowFile can be sent as a single message or a delimiter, such as a
new-line can be specified, in order to send many messages for a single FlowFile.
- *PublishKafka*: Sends the contents of a FlowFile as a message to Apache Kafka. The FlowFile can be sent as a single message, or a delimiter, such as a
new-line, can be specified in order to send many messages for a single FlowFile.
- *PutMongo*: Sends the contents of a FlowFile to Mongo as an INSERT or an UPDATE.
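As an illustration of the PutSQL parameter convention mentioned above, a FlowFile might carry a parameterized statement as its content and supply the values through attributes (a sketch; `sql.args.N.type` holds the numeric JDBC type, such as 12 for VARCHAR, and the example values are hypothetical):

----
FlowFile content:
    INSERT INTO USERS (USERNAME, EMAIL) VALUES (?, ?)

FlowFile attributes:
    sql.args.1.type  = 12
    sql.args.1.value = mmoser
    sql.args.2.type  = 12
    sql.args.2.value = users@nifi.apache.org
----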
=== Splitting and Aggregation
@@ -433,18 +411,12 @@ categorizing them by their functions.
- *SplitContent*: Splits a single FlowFile into potentially many FlowFiles, similarly to SegmentContent. However, with SplitContent, the splitting
is not performed on arbitrary byte boundaries but rather a byte sequence is specified on which to split the content.
[[HTTP]]
=== HTTP
- *GetHTTP*: Downloads the contents of a remote HTTP- or HTTPS-based URL into NiFi. The Processor will remember the ETag and Last-Modified Date
in order to ensure that the data is not continually ingested.
- *ListenHTTP*: Starts an HTTP (or HTTPS) Server and listens for incoming connections. For any incoming POST request, the contents of the request
are written out as a FlowFile, and a 200 response is returned.
- *InvokeHTTP*: Performs an HTTP Request that is configured by the user. This Processor is much more versatile than the GetHTTP and PostHTTP
but requires a bit more configuration. This Processor cannot be used as a Source Processor and is required to have incoming FlowFiles in order
to be triggered to perform its task.
- *PostHTTP*: Performs an HTTP POST request, sending the contents of the FlowFile as the body of the message. This is often used in conjunction
with ListenHTTP in order to transfer data between two different instances of NiFi in cases where Site-to-Site cannot be used (for instance,
when the nodes cannot access each other directly and are able to communicate through an HTTP proxy).
*Note*: HTTP is available as a link:user-guide.html#site-to-site[Site-to-Site] transport protocol in addition to the existing RAW socket transport. It also supports HTTP Proxy. Using HTTP Site-to-Site is recommended since it's more scalable, and can provide bi-directional data transfer using input/output ports with better user authentication and authorization.
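To illustrate the ListenHTTP Processor described above, any HTTP client can deliver data to it (a sketch; a Listening Port of `8011` is an assumption, and `contentListener` is the Processor's default Base Path):

----
# POST a small payload; ListenHTTP writes the request body out as a new FlowFile
curl -X POST --data 'hello nifi' http://localhost:8011/contentListener
# A 200 response indicates the FlowFile was created successfully
----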
- *InvokeHTTP*: Can send a wide variety of HTTP Requests to a server, as configured by the user. A GET request can download data from an HTTP server.
A POST request can send the contents of a FlowFile in the body of the request to an HTTP server.
- *HandleHttpRequest* / *HandleHttpResponse*: The HandleHttpRequest Processor is a Source Processor that starts an embedded HTTP(S) server
similarly to ListenHTTP. However, it does not send a response to the client. Instead, the FlowFile is sent out with the body of the HTTP request
as its contents and attributes for all of the typical Servlet parameters, headers, etc. as Attributes. The HandleHttpResponse then is able to
@@ -522,8 +494,8 @@ While this may seem confusing at first, the section below on <<ExpressionLanguag
here.
In addition to always adding a defined set of Attributes, the UpdateAttribute Processor has an Advanced UI that allows the user
to configure a set of rules for which Attributes should be added when. To access this capability, in the Configure dialog's
Properties tab, click the `Advanced` button at the bottom of the dialog. This will provide a UI that is tailored specifically
to configure a set of rules for which Attributes should be added when. To access this capability, right-click on the UpdateAttribute
Processor and choose the `Advanced` menu item. This will provide a UI that is tailored specifically
to this Processor, rather than the simple Properties table that is provided for all Processors. Within this UI, the user can
essentially configure a rules engine, specifying rules that must match in order to have the configured Attributes added
to the FlowFile.
@@ -553,10 +525,10 @@ to that Relationship. All other FlowFiles will be routed to 'unmatched'.
As we extract Attributes from FlowFiles' contents and add user-defined Attributes, they don't do us much good as an operator unless
we have some mechanism by which we can use them. The NiFi Expression Language allows us to access and manipulate FlowFile Attribute
values as we configure our flows. Not all Processor properties allow the Expression Language to be used, but many do. In order to
determine whether or not a property supports the Expression Language, a user can hover over the Help icon (
image:iconInfo.png["Help"]
) in the Properties tab of the Processor Configure dialog. This will provide a tooltip that shows a description of the property, the
default value, if any, and whether or not the property supports the Expression Language.
determine whether or not a property supports the Expression Language, a user can hover over the Info icon (
image:iconInfo2.png["Info"]
) in the Properties tab of the Processor Configure dialog. This will provide a tooltip that shows a description of the property
and whether the property supports the Expression Language.
For properties that do support the Expression Language, it is used by adding an expression within the opening `${` tag and the closing
`}` tag. An expression can be as simple as an attribute name. For example, to reference the `uuid` Attribute, we can simply use the
@@ -601,9 +573,9 @@ are currently queued across the entire flow, as well as the total size of those
If the NiFi instance is in a cluster, we will also see an indicator here telling us how many nodes are in the cluster and how many are currently
connected. In this case, the number of active threads and the queue size are indicative of the sum across all nodes that are currently connected.
It is important to note that active threads only captures threads by objects that are in the graph (processors, processor groups, remote processor groups, funnels, etc.).
When broken down by node in the cluster (Global Menu -> Cluster), the active thread count is more comprehensive and includes these as well as any
other threads (reporting tasks, controller services, etc.)
It is important to note that this active thread count only captures threads used by Processors that are on the graph.
When broken down by node in the cluster (Global Menu -> Cluster), the active thread count is more comprehensive and includes these plus any
other threads (Input and Output Ports, Funnels, Remote Process Groups, Reporting Tasks, etc.).
=== Component Statistics
@@ -612,10 +584,10 @@ by the component. These statistics provide information about how much data has b
window and allows us to see things like the number of FlowFiles that have been consumed by a Processor, as well as the number of FlowFiles
that have been emitted by the Processor.
The connections between Processors also expose the number of items that are currently queued.
The connections between Processors also expose several statistics about items that pass through the connection.
It may also be valuable to see historical values for these metrics and, if clustered, how the different nodes compare to one another.
In order to see this information, we can right-click on a component and choose the `Stats` menu item. This will show us a graph that spans
In order to see this information, we can right-click on a component and choose the `View Status History` menu item. This will show us a graph that spans
the time since NiFi was started, or up to 24 hours, whichever is less. The amount of time that is shown here can be extended or reduced
by changing the configuration in the properties file.
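For reference, the relevant entries in `nifi.properties` look something like the following (a sketch showing typical defaults; consult the Admin Guide for the authoritative settings):

----
# Number of status snapshots retained per component (at the default frequency, 1440 covers 24 hours)
nifi.components.status.repository.buffer.size=1440
# How often a snapshot of component status is captured
nifi.components.status.snapshot.frequency=1 min
----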
@@ -656,9 +628,9 @@ choose which Attributes will be important to your specific dataflows and make th
[[EventDetails]]
=== Event Details
Once we have performed our search, our table will be populated only with the events that match the search criteria. From here, we
can choose the Info icon (
image:iconDetails.png[Details Icon]
) on the left-hand side of the table to view the details of that event:
can click the kebab icon (
image:iconKebab.png["Menu"]
) on the right-hand side of the table and choose `View Details` for that event:
image:event-details.png[Event Details]
@@ -692,10 +664,10 @@ this iterative development of the flow until it is processing the data exactly a
=== Lineage Graph
In addition to viewing the details of a Provenance event, we can also view the lineage of the FlowFile involved by clicking on the Lineage Icon (
image:iconLineage.png[Lineage]
) from the table view.
In addition to viewing the details of a Provenance event, we can also view the lineage of the FlowFile involved.
Click the kebab icon (
image:iconKebab.png["Menu"]
) on the right-hand side of the table and choose `Show Lineage` for that event.
This provides us with a graphical representation of exactly what happened to that piece of data as it traversed the system:
image:lineage-graph-annotated.png[Lineage Graph]
@@ -722,7 +694,7 @@ addition to this Getting Started Guide:
lengthy discussions of all of the different components that comprise the application. This guide is written with the NiFi Operator as its
audience. It provides information on each of the different components available in NiFi and explains how to use the different features
provided by the application.
- link:administration-guide.html[Administration Guide] - A guide for setting up and administering Apache NiFi for production environments.
- link:administration-guide.html[Administrator's Guide] - A guide for setting up and administering Apache NiFi for production environments.
This guide provides information about the different system-level settings, such as setting up clusters of NiFi and securing access to the
web UI and data.
- link:expression-language-guide.html[Expression Language Guide] - A far more exhaustive guide for understanding the Expression Language than
@@ -734,12 +706,10 @@ addition to this Getting Started Guide:
- link:https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide[Contributor's Guide^] - A guide for explaining how to contribute
work back to the Apache NiFi community so that others can make use of it.
Several blog postings have also been added to the Apache NiFi blog site:
link:https://blogs.apache.org/nifi/[https://blogs.apache.org/nifi/^]
In addition to the blog and guides provided here, you can browse the different
link:https://nifi.apache.org/mailing_lists.html[NiFi Mailing Lists^] or send an e-mail to one of the mailing lists at
In addition to the guides provided here, you can browse the different
link:https://nifi.apache.org/community/contact/[NiFi Mailing Lists^] or send an e-mail to one of the mailing lists at
link:mailto:users@nifi.apache.org[users@nifi.apache.org] or
link:mailto:dev@nifi.apache.org[dev@nifi.apache.org].
Many of the members of the NiFi community are also available on Twitter and actively monitor for tweets that mention @apachenifi.
Many of the members of the NiFi community are available on link:https://apachenifi.slack.com[Apache NiFi on Slack^]
and also actively monitor X/Twitter for posts that mention @apachenifi.

Binary file not shown (new image added; 301 B)

Binary file not shown (new image added; 166 B)