mirror of https://github.com/apache/nifi.git
NIFI-162 I copy-edited the Overview
Signed-off-by: joewitt <joewitt@apache.org>
parent b7afc6999e
commit 7e70fd53e9
@@ -22,9 +22,9 @@ Apache NiFi Team <dev@nifi.incubator.apache.org>
 What is Apache NiFi?
 --------------------
 Put simply NiFi was built to automate the flow of data between systems. While
-the term 'dataflow' is used in a variety of contexts we'll use it here
+the term 'dataflow' is used in a variety of contexts, we'll use it here
 to mean the automated and managed flow of information between systems. This
-problem space has been around ever since enterprises had more than one system
+problem space has been around ever since enterprises had more than one system,
 where some of the systems created data and some of the systems consumed data.
 The problems and solution patterns that emerged have been discussed and
 articulated extensively. A comprehensive and readily consumed form is found in
@@ -45,7 +45,7 @@ What is noise one day becomes signal the next::
 Priorities of an organization change - rapidly. Enabling new flows and changing existing ones must be fast.
 
 Systems evolve at different rates::
-The protocols and formats used by a given system can change anytime and often irrespective of the systems around them. Dataflow exists to connect what is essentially a massively distributed system of components loosely or not-at-all designed to work together.
+The protocols and formats used by a given system can change anytime and often irrespective of the systems around them. Dataflow exists to connect what is essentially a massively distributed system of components that are loosely or not-at-all designed to work together.
 
 Compliance and security::
 Laws, regulations, and policies change. Business to business agreements change. System to system and system to user interactions must be secure, trusted, accountable.
@@ -60,7 +60,7 @@ success of a given enterprise. These include things like; Service Oriented
 Architecture <<soa>>, the rise of the API <<api>><<api2>>, Internet of Things <<iot>>,
 and Big Data <<bigdata>>. In addition, the level of rigor necessary for
 compliance, privacy, and security is constantly on the rise. Even still with
-all of these new concepts coming about the patterns and needs of dataflow is
+all of these new concepts coming about, the patterns and needs of dataflow are
 still largely the same. The primary differences then are the scope of
 complexity, the rate of change necessary to adapt, and that at scale
 the edge case becomes common occurrence. NiFi is built to help tackle these
@@ -78,21 +78,21 @@ the main NiFi concepts and how they map to FBP:
 | NiFi Term | FBP Term| Description
 
 | FlowFile | Information Packet |
-A FlowFile represents the objects moving through the system and for each one NiFi
-keeps track of a Map of key/value pair attribute strings and its associated
+A FlowFile represents each object moving through the system and for each one, NiFi
+keeps track of a map of key/value pair attribute strings and its associated
 content of zero or more bytes.
 
 | FlowFile Processor | Black Box |
-Processors are what actually performs work. In <<eip>> terms a processor is
-doing some combination of data Routing, Transformation, or mediation between
-systems. Processors have access to attributes of a given flow file and its
+Processors actually perform the work. In <<eip>> terms a processor is
+doing some combination of data Routing, Transformation, or Mediation between
+systems. Processors have access to attributes of a given FlowFile and its
 content stream. Processors can operate on zero or more FlowFiles in a given unit of work
 and either commit that work or rollback.
 
 | Connection | Bounded Buffer |
 Connections provide the actual linkage between processors. These act as queues
 and allow various processes to interact at differing rates. These queues then
-can be prioritized dynamically and can have upper bounds on load which enables
+can be prioritized dynamically and can have upper bounds on load which enable
 back pressure.
 
 | Flow Controller | Scheduler |
@@ -103,7 +103,7 @@ between processors.
 
 | Process Group | subnet |
 A Process Group is a specific set of processes and their connections which can
-receive data via input ports and which can send data out via output ports. In
+receive data via input ports and send data out via output ports. In
 this manner process groups allow creation of entirely new components simply by
 composition of other components.
 
@@ -153,10 +153,10 @@ image::nifi-arch-cluster.png["NiFi Cluster Architecture Diagram"]
 A NiFi cluster is comprised of one or more 'NiFi Nodes' (Node) controlled
 by a single NiFi Cluster Manager (NCM). The design of clustering is a simple
 master/slave model where the NCM is the master and the Nodes are the slaves.
-The NCM's reason for existence is to keep track of which Nodes are in the flow,
+The NCM's reason for existence is to keep track of which Nodes are in the cluster,
 their status, and to replicate requests to modify or observe the
 flow. Fundamentally then the NCM keeps the state of the cluster consistent.
-While the model is that of master and slave if the master dies the Nodes are all
+While the model is that of master and slave, if the master dies the Nodes are all
 instructed to continue operating as they were to ensure the data flow remains live.
 The absence of the NCM simply means new nodes cannot come on-line and flow changes
 cannot occur until the NCM is restored.
@@ -164,7 +164,7 @@ cannot occur until the NCM is restored.
 Performance Expections and Characteristics of NiFi
 --------------------------------------------------
 NiFi is designed to fully leverage the capabilities of the underlying host system
-its is operating on. This maximization of resources is particularly strong with
+it is operating on. This maximization of resources is particularly strong with
 regard to CPU and disk. Many more details will
 be provided on best practices and configuration tips in the Administration Guide.
 
@@ -173,22 +173,22 @@ The throughput or latency
 one can expect to see will vary greatly on how the system is configured. Given
 that there are pluggable approaches to most of the major NiFi subsystems the
 performance will depend on the implementation. But, for something concrete and broadly
-applicable lets consider the out of the box default implementations that are used.
+applicable, let's consider the out-of-the-box default implementations that are used.
 These are all persistent with guaranteed delivery and do so using local disk. So
-being conservative assume roughly 50 MB/s read/write rate on modest disks or RAID volumes
-within a typical server. NiFi for a large class of data flows then should be able to
-efficiently reach one hundred or more MB/s of throughput. That is because linear growth
-is expected for each physical parition and content repository added to NiFi. This will
+being conservative, assume roughly 50 MB/s read/write rate on modest disks or RAID volumes
+within a typical server. NiFi for a large class of dataflows then should be able to
+efficiently reach 100 or more MB/s of throughput. That is because linear growth
+is expected for each physical partition and content repository added to NiFi. This will
 bottleneck at some point on the FlowFile repository and provenance repository.
 We plan to provide a benchmarking/performance test template to
 include in the build which will allow users to easily test their system and
 to identify where bottlenecks are and at which point they might become a factor. It
-should also make it easy for system administrators to make changes and to verity the impact.
+should also make it easy for system administrators to make changes and to verify the impact.
 
 For CPU::
-The FlowController acts as the engine dictating when a given processor will be
+The Flow Controller acts as the engine dictating when a particular processor will be
 given a thread to execute. Processors should be written to return the thread
-as soon as they're done executing their task. The FlowController can be given a
+as soon as they're done executing their task. The Flow Controller can be given a
 configuration value indicating how many threads there should be for the various
 thread pools it maintains. The ideal number of threads to use will depend on the
 resources of the host system in terms of numbers of cores, whether that system is
@@ -205,7 +205,7 @@ how well the application will run over time.
 High Level Overview of Key NiFi Features
 ----------------------------------------
 Guaranteed Delivery::
-A core philosophy of NiFi has been that even at very high scale guaranteed delivery
+A core philosophy of NiFi has been that even at very high scale, guaranteed delivery
 is a must. This is achieved through effective use of a purpose-built persistent
 write-ahead log and content repository. Together they are designed in such a way
 as to allow for very high transaction rates, effective load-spreading, copy-on-write,
@@ -218,12 +218,12 @@ as it reaches a specified age (its value has perished).
 
 Prioritized Queuing::
 NiFi allows the setting of one or more prioritization schemes for how data is
-retrieved from a queue. The default is oldest first but there are times when
+retrieved from a queue. The default is oldest first, but there are times when
 data should be pulled newest first, largest first, or some other custom scheme.
 
-Flow Specific QoS (latency v throughput, loss tolerance, etc..)::
+Flow Specific QoS (latency v throughput, loss tolerance, etc.)::
 There are points of a dataflow where the data is absolutely critical and it is
-loss intolerant. There are times when it must be processed and delivered within
+loss intolerant. There are also times when it must be processed and delivered within
 seconds to be of any value. NiFi enables the fine-grained flow specific configuration
 of these concerns.
 
@@ -237,21 +237,21 @@ Recovery / Recording a rolling buffer of fine-grained history::
 NiFi's content repository is designed to act as a rolling buffer of history. Data
 is removed only as it ages off the content repository or as space is needed. This
 combined with the data provenance capability makes for an incredibly useful basis
-to enable click-to-content, download of content, replay, and all at a specific
-point in and objects lifecycle which can even span generations.
+to enable click-to-content, download of content, and replay, all at a specific
+point in an object's lifecycle which can even span generations.
 
 Visual Command and Control::
 Dataflows can become quite complex. Being able to visualize those flows and express
-them visually can help greatly to reduce that complexity and to identify areas which
+them visually can help greatly to reduce that complexity and to identify areas that
 need to be simplified. NiFi enables not only the visual establishment of dataflows but
 it does so in real-time. Rather than being 'design and deploy' it is much more like
-molding clay. If you make a change to the dataflow that change is taking effect. Changes
+molding clay. If you make a change to the dataflow that change immediately takes effect. Changes
 are fine-grained and isolated to the affected components. You don't need to stop an entire
 flow or set of flows just to make some specific modification.
 
 Flow Templates::
 Dataflows tend to be highly pattern oriented and while there are often many different
-ways to solve a problem it helps greatly to be able to share those best practices. Templates
+ways to solve a problem, it helps greatly to be able to share those best practices. Templates
 allow subject matter experts to build and publish their flow designs and for others to benefit
 and collaborate on them.
 
@@ -263,8 +263,8 @@ Security::
 either side of the sender/recipient equation.
 User to system;;
 NiFi enables 2-Way SSL authentication and provides pluggable authorization so that it can properly control
-a users access and at particular levels (read-only, dataflow manager, admin). If a user enters a
-sensitive property like a password into the flow it is immediately encrypted server side and never again exposed
+a user's access and at particular levels (read-only, dataflow manager, admin). If a user enters a
+sensitive property like a password into the flow, it is immediately encrypted server side and never again exposed
 on the client side even in its encrypted form.
 
 Designed for Extension::
@@ -275,12 +275,12 @@ Designed for Extension::
 For any component based system one problem that can quickly occur is dependency nightmares. NiFi addresses this by providing a custom class loader model
 ensuring that each extension bundle is exposed to a very limited set of dependencies. As a result extensions can be built with little concern for whether
 they might conflict with another extension. The concept of these extension bundles is called 'NiFi Archives' and will be discussed in greater detail
-in the developers guide.
+in the developer's guide.
 Clustering (scale-out)::
 NiFi is designed to scale-out through the use of clustering many nodes together as described above. If a single node is provisioned and configured
 to handle hundreds of MB/s then a modest cluster could be configured to handle GB/s. This then brings about interesting challenges of load balancing
-and fail-over between NiFi and the systems from which it gets data. Use of asynchronous queuing based protocols like messaging services, Kafka, etc.. can
-help. Use of NiFi's 'site-to-site' feature is also very effective as it is a protocol that allows NiFi and a client (could be another NiFi cluster) to talk to eachother, share information
+and fail-over between NiFi and the systems from which it gets data. Use of asynchronous queuing based protocols like messaging services, Kafka, etc., can
+help. Use of NiFi's 'site-to-site' feature is also very effective as it is a protocol that allows NiFi and a client (could be another NiFi cluster) to talk to each other, share information
 about loading, and to exchange data on specific authorized ports.
 
 # References
@@ -292,4 +292,4 @@ Clustering (scale-out)::
 - [[[iot]]] Wikipedia. Internet of Things [online]. Retrieved: 27 Dec 2014, from: http://en.wikipedia.org/wiki/Internet_of_Things
 - [[[bigdata]]] Wikipedia. Big Data [online]. Retrieved: 27 Dec 2014, from: http://en.wikipedia.org/wiki/Big_data
 - [[[fbp]]] Wikipedia. Flow Based Programming [online]. Retrieved: 28 Dec 2014, from: http://en.wikipedia.org/wiki/Flow-based_programming#Concepts
 - [[[seda]]] Matt Welsh. Harvard. SEDA: An Architecture for Highly Concurrent Server Applications [online]. Retrieved: 28 Dec 2014, from: http://www.eecs.harvard.edu/~mdw/proj/seda/