NIFI-162 I copy-edited the Overview
Signed-off-by: joewitt <joewitt@apache.org>
parent b7afc6999e
commit 7e70fd53e9
@@ -22,9 +22,9 @@ Apache NiFi Team <dev@nifi.incubator.apache.org>
 What is Apache NiFi?
 --------------------
 Put simply NiFi was built to automate the flow of data between systems. While
-the term 'dataflow' is used in a variety of contexts we'll use it here
+the term 'dataflow' is used in a variety of contexts, we'll use it here
 to mean the automated and managed flow of information between systems. This
-problem space has been around ever since enterprises had more than one system
+problem space has been around ever since enterprises had more than one system,
 where some of the systems created data and some of the systems consumed data.
 The problems and solution patterns that emerged have been discussed and
 articulated extensively. A comprehensive and readily consumed form is found in
@@ -45,7 +45,7 @@ What is noise one day becomes signal the next::
 Priorities of an organization change - rapidly. Enabling new flows and changing existing ones must be fast.
 
 Systems evolve at different rates::
-The protocols and formats used by a given system can change anytime and often irrespective of the systems around them. Dataflow exists to connect what is essentially a massively distributed system of components loosely or not-at-all designed to work together.
+The protocols and formats used by a given system can change anytime and often irrespective of the systems around them. Dataflow exists to connect what is essentially a massively distributed system of components that are loosely or not-at-all designed to work together.
 
 Compliance and security::
 Laws, regulations, and policies change. Business to business agreements change. System to system and system to user interactions must be secure, trusted, accountable.
@@ -60,7 +60,7 @@ success of a given enterprise. These include things like; Service Oriented
 Architecture <<soa>>, the rise of the API <<api>><<api2>>, Internet of Things <<iot>>,
 and Big Data <<bigdata>>. In addition, the level of rigor necessary for
 compliance, privacy, and security is constantly on the rise. Even still with
-all of these new concepts coming about the patterns and needs of dataflow is
+all of these new concepts coming about, the patterns and needs of dataflow are
 still largely the same. The primary differences then are the scope of
 complexity, the rate of change necessary to adapt, and that at scale
 the edge case becomes common occurrence. NiFi is built to help tackle these
@@ -78,21 +78,21 @@ the main NiFi concepts and how they map to FBP:
 | NiFi Term | FBP Term| Description
 
 | FlowFile | Information Packet |
-A FlowFile represents the objects moving through the system and for each one NiFi
-keeps track of a Map of key/value pair attribute strings and its associated
+A FlowFile represents each object moving through the system and for each one, NiFi
+keeps track of a map of key/value pair attribute strings and its associated
 content of zero or more bytes.
 
 | FlowFile Processor | Black Box |
-Processors are what actually performs work. In <<eip>> terms a processor is
-doing some combination of data Routing, Transformation, or mediation between
-systems. Processors have access to attributes of a given flow file and its
+Processors actually perform the work. In <<eip>> terms a processor is
+doing some combination of data Routing, Transformation, or Mediation between
+systems. Processors have access to attributes of a given FlowFile and its
 content stream. Processors can operate on zero or more FlowFiles in a given unit of work
 and either commit that work or rollback.
 
 | Connection | Bounded Buffer |
 Connections provide the actual linkage between processors. These act as queues
 and allow various processes to interact at differing rates. These queues then
-can be prioritized dynamically and can have upper bounds on load which enables
+can be prioritized dynamically and can have upper bounds on load which enable
 back pressure.
 
 | Flow Controller | Scheduler |
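To make the FlowFile and Processor rows above concrete, here is a minimal sketch of a custom processor written against NiFi's public processor API. The class name, attribute key, and relationship description are illustrative, not part of NiFi.

[source,java]
----
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

// Illustrative processor: tags each FlowFile with an attribute and routes it onward.
public class TagFlowFileProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("FlowFiles that were tagged")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        // One unit of work: pull zero or one FlowFile from an incoming connection.
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Attributes are the key/value string map that travels with the content.
        flowFile = session.putAttribute(flowFile, "tagged.by", "TagFlowFileProcessor");
        // Route onward; the framework commits the session or rolls it back on failure,
        // which is the commit/rollback behaviour the table describes.
        session.transfer(flowFile, REL_SUCCESS);
    }
}
----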
@@ -103,7 +103,7 @@ between processors.
 
 | Process Group | subnet |
 A Process Group is a specific set of processes and their connections which can
-receive data via input ports and which can send data out via output ports. In
+receive data via input ports and send data out via output ports. In
 this manner process groups allow creation of entirely new components simply by
 composition of other components.
 
@@ -153,10 +153,10 @@ image::nifi-arch-cluster.png["NiFi Cluster Architecture Diagram"]
 A NiFi cluster is comprised of one or more 'NiFi Nodes' (Node) controlled
 by a single NiFi Cluster Manager (NCM). The design of clustering is a simple
 master/slave model where the NCM is the master and the Nodes are the slaves.
-The NCM's reason for existence is to keep track of which Nodes are in the flow,
+The NCM's reason for existence is to keep track of which Nodes are in the cluster,
 their status, and to replicate requests to modify or observe the
 flow. Fundamentally then the NCM keeps the state of the cluster consistent.
-While the model is that of master and slave if the master dies the Nodes are all
+While the model is that of master and slave, if the master dies the Nodes are all
 instructed to continue operating as they were to ensure the data flow remains live.
 The absence of the NCM simply means new nodes cannot come on-line and flow changes
 cannot occur until the NCM is restored.
@@ -164,7 +164,7 @@ cannot occur until the NCM is restored.
 Performance Expections and Characteristics of NiFi
 --------------------------------------------------
 NiFi is designed to fully leverage the capabilities of the underlying host system
-its is operating on. This maximization of resources is particularly strong with
+it is operating on. This maximization of resources is particularly strong with
 regard to CPU and disk. Many more details will
 be provided on best practices and configuration tips in the Administration Guide.
 
@@ -173,22 +173,22 @@ The throughput or latency
 one can expect to see will vary greatly on how the system is configured. Given
 that there are pluggable approaches to most of the major NiFi subsystems the
 performance will depend on the implementation. But, for something concrete and broadly
-applicable lets consider the out of the box default implementations that are used.
+applicable, let's consider the out-of-the-box default implementations that are used.
 These are all persistent with guaranteed delivery and do so using local disk. So
-being conservative assume roughly 50 MB/s read/write rate on modest disks or RAID volumes
-within a typical server. NiFi for a large class of data flows then should be able to
-efficiently reach one hundred or more MB/s of throughput. That is because linear growth
-is expected for each physical parition and content repository added to NiFi. This will
+being conservative, assume roughly 50 MB/s read/write rate on modest disks or RAID volumes
+within a typical server. NiFi for a large class of dataflows then should be able to
+efficiently reach 100 or more MB/s of throughput. That is because linear growth
+is expected for each physical partition and content repository added to NiFi. This will
 bottleneck at some point on the FlowFile repository and provenance repository.
 We plan to provide a benchmarking/performance test template to
 include in the build which will allow users to easily test their system and
 to identify where bottlenecks are and at which point they might become a factor. It
-should also make it easy for system administrators to make changes and to verity the impact.
+should also make it easy for system administrators to make changes and to verify the impact.
 
 For CPU::
-The FlowController acts as the engine dictating when a given processor will be
+The Flow Controller acts as the engine dictating when a particular processor will be
 given a thread to execute. Processors should be written to return the thread
-as soon as they're done executing their task. The FlowController can be given a
+as soon as they're done executing their task. The Flow Controller can be given a
 configuration value indicating how many threads there should be for the various
 thread pools it maintains. The ideal number of threads to use will depend on the
 resources of the host system in terms of numbers of cores, whether that system is
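As a back-of-the-envelope check of the throughput reasoning above, the sketch below multiplies the conservative 50 MB/s per-volume rate quoted in the text by a hypothetical number of content repository partitions; the partition count is purely illustrative.

[source,java]
----
// Rough, illustrative estimate of aggregate content-repository throughput.
public class ThroughputEstimate {
    public static void main(String[] args) {
        final double perVolumeMBps = 50.0;  // conservative read/write rate from the text
        final int contentPartitions = 3;    // hypothetical number of physical partitions/volumes

        // Linear growth is expected per partition, until another repository becomes the bottleneck.
        final double estimatedMBps = perVolumeMBps * contentPartitions;
        System.out.printf("Estimated sustained throughput: %.0f MB/s%n", estimatedMBps);
    }
}
----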
@@ -205,7 +205,7 @@ how well the application will run over time.
 High Level Overview of Key NiFi Features
 ----------------------------------------
 Guaranteed Delivery::
-A core philosophy of NiFi has been that even at very high scale guaranteed delivery
+A core philosophy of NiFi has been that even at very high scale, guaranteed delivery
 is a must. This is achieved through effective use of a purpose-built persistent
 write-ahead log and content repository. Together they are designed in such a way
 as to allow for very high transaction rates, effective load-spreading, copy-on-write,
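The guarantee described above rests on the write-ahead pattern: record the change durably before acknowledging it. The sketch below illustrates that general pattern only; it is not NiFi's actual FlowFile or content repository implementation.

[source,java]
----
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Generic write-ahead sketch: append a record and force it to disk before
// the caller is allowed to treat the corresponding work as committed.
public class SimpleWriteAheadLog {

    private final Path logFile;

    public SimpleWriteAheadLog(final Path logFile) {
        this.logFile = logFile;
    }

    public void append(final String record) throws IOException {
        try (FileChannel channel = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            channel.write(ByteBuffer.wrap((record + System.lineSeparator()).getBytes(StandardCharsets.UTF_8)));
            // Durability point: the data is flushed to disk before we acknowledge.
            channel.force(true);
        }
    }
}
----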
@@ -218,12 +218,12 @@ as it reaches a specified age (its value has perished).
 
 Prioritized Queuing::
 NiFi allows the setting of one or more prioritization schemes for how data is
-retrieved from a queue. The default is oldest first but there are times when
+retrieved from a queue. The default is oldest first, but there are times when
 data should be pulled newest first, largest first, or some other custom scheme.
 
-Flow Specific QoS (latency v throughput, loss tolerance, etc..)::
+Flow Specific QoS (latency v throughput, loss tolerance, etc.)::
 There are points of a dataflow where the data is absolutely critical and it is
-loss intolerant. There are times when it must be processed and delivered within
+loss intolerant. There are also times when it must be processed and delivered within
 seconds to be of any value. NiFi enables the fine-grained flow specific configuration
 of these concerns.
 
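A prioritization scheme ultimately boils down to a comparator over the queued items. The sketch below shows the oldest-first and largest-first orderings mentioned above using a plain Java PriorityQueue; the QueuedItem type is illustrative and is not NiFi's prioritizer API.

[source,java]
----
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative prioritization: the comparator decides which queued item is retrieved next.
public class PrioritizedQueueDemo {

    record QueuedItem(String id, long enqueueTimeMillis, long sizeBytes) { }

    // Default-style behaviour from the text: oldest first.
    static final Comparator<QueuedItem> OLDEST_FIRST =
            Comparator.comparingLong(QueuedItem::enqueueTimeMillis);

    // Alternative scheme from the text: largest first.
    static final Comparator<QueuedItem> LARGEST_FIRST =
            Comparator.comparingLong(QueuedItem::sizeBytes).reversed();

    public static void main(String[] args) {
        PriorityQueue<QueuedItem> queue = new PriorityQueue<>(LARGEST_FIRST);
        queue.add(new QueuedItem("a", 1000, 10));
        queue.add(new QueuedItem("b", 2000, 500));
        System.out.println(queue.poll().id()); // prints "b": the largest item is pulled first
    }
}
----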
@@ -237,21 +237,21 @@ Recovery / Recording a rolling buffer of fine-grained history::
 NiFi's content repository is designed to act as a rolling buffer of history. Data
 is removed only as it ages off the content repository or as space is needed. This
 combined with the data provenance capability makes for an incredibly useful basis
-to enable click-to-content, download of content, replay, and all at a specific
-point in and objects lifecycle which can even span generations.
+to enable click-to-content, download of content, and replay, all at a specific
+point in an object's lifecycle which can even span generations.
 
 Visual Command and Control::
 Dataflows can become quite complex. Being able to visualize those flows and express
-them visually can help greatly to reduce that complexity and to identify areas which
+them visually can help greatly to reduce that complexity and to identify areas that
 need to be simplified. NiFi enables not only the visual establishment of dataflows but
 it does so in real-time. Rather than being 'design and deploy' it is much more like
-molding clay. If you make a change to the dataflow that change is taking effect. Changes
+molding clay. If you make a change to the dataflow that change immediately takes effect. Changes
 are fine-grained and isolated to the affected components. You don't need to stop an entire
 flow or set of flows just to make some specific modification.
 
 Flow Templates::
 Dataflows tend to be highly pattern oriented and while there are often many different
-ways to solve a problem it helps greatly to be able to share those best practices. Templates
+ways to solve a problem, it helps greatly to be able to share those best practices. Templates
 allow subject matter experts to build and publish their flow designs and for others to benefit
 and collaborate on them.
 
@@ -263,8 +263,8 @@ Security::
 either side of the sender/recipient equation.
 User to system;;
 NiFi enables 2-Way SSL authentication and provides pluggable authorization so that it can properly control
-a users access and at particular levels (read-only, dataflow manager, admin). If a user enters a
-sensitive property like a password into the flow it is immediately encrypted server side and never again exposed
+a user's access and at particular levels (read-only, dataflow manager, admin). If a user enters a
+sensitive property like a password into the flow, it is immediately encrypted server side and never again exposed
 on the client side even in its encrypted form.
 
 Designed for Extension::
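For the user-to-system point above, 2-Way SSL means the server both presents its own certificate and requires one from the client. The snippet below shows that requirement using the standard javax.net.ssl API; it is a generic illustration, not NiFi's actual server configuration.

[source,java]
----
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLServerSocket;

// Generic illustration of requiring client certificates (mutual TLS).
public class MutualTlsServerSketch {

    public static SSLServerSocket open(final int port) throws Exception {
        // Assumes key and trust material are configured via the default SSLContext.
        SSLContext sslContext = SSLContext.getDefault();
        SSLServerSocket serverSocket =
                (SSLServerSocket) sslContext.getServerSocketFactory().createServerSocket(port);
        // Two-way SSL: the client must also present a certificate the server trusts.
        serverSocket.setNeedClientAuth(true);
        return serverSocket;
    }
}
----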
@@ -275,12 +275,12 @@ Designed for Extension::
 For any component based system one problem that can quickly occur is dependency nightmares. NiFi addresses this by providing a custom class loader model
 ensuring that each extension bundle is exposed to a very limited set of dependencies. As a result extensions can be built with little concern for whether
 they might conflict with another extension. The concept of these extension bundles is called 'NiFi Archives' and will be discussed in greater detail
-in the developers guide.
+in the developer's guide.
 Clustering (scale-out)::
 NiFi is designed to scale-out through the use of clustering many nodes together as described above. If a single node is provisioned and configured
 to handle hundreds of MB/s then a modest cluster could be configured to handle GB/s. This then brings about interesting challenges of load balancing
-and fail-over between NiFi and the systems from which it gets data. Use of asynchronous queuing based protocols like messaging services, Kafka, etc.. can
-help. Use of NiFi's 'site-to-site' feature is also very effective as it is a protocol that allows NiFi and a client (could be another NiFi cluster) to talk to eachother, share information
+and fail-over between NiFi and the systems from which it gets data. Use of asynchronous queuing based protocols like messaging services, Kafka, etc., can
+help. Use of NiFi's 'site-to-site' feature is also very effective as it is a protocol that allows NiFi and a client (could be another NiFi cluster) to talk to each other, share information
 about loading, and to exchange data on specific authorized ports.
 
 # References
@@ -292,4 +292,4 @@ Clustering (scale-out)::
 - [[[iot]]] Wikipedia. Internet of Things [online]. Retrieved: 27 Dec 2014, from: http://en.wikipedia.org/wiki/Internet_of_Things
 - [[[bigdata]]] Wikipedia. Big Data [online]. Retrieved: 27 Dec 2014, from: http://en.wikipedia.org/wiki/Big_data
 - [[[fbp]]] Wikipedia. Flow Based Programming [online]. Retrieved: 28 Dec 2014, from: http://en.wikipedia.org/wiki/Flow-based_programming#Concepts
-- [[[seda]]] Matt Welsh. Harvard. SEDA: An Architecture for Highly Concurrent Server Applications [online]. Retrieved: 28 Dec 2014, from: http://www.eecs.harvard.edu/~mdw/proj/seda/
+- [[[seda]]] Matt Welsh. Harvard. SEDA: An Architecture for Highly Concurrent Server Applications [online]. Retrieved: 28 Dec 2014, from: http://www.eecs.harvard.edu/~mdw/proj/seda/