Apache NiFi in a Nutshell

A brief introduction to Apache NiFi, a powerful, scalable, open-source, flow-based data ingestion and distribution framework.

Overview

Apache NiFi is an open-source, easy-to-use, and reliable system for processing and distributing data. Data can be propagated from almost any source to almost any destination. NiFi operates on the principle of configuration over coding: you can build anything from a simple ETL pipeline to a complex data lake feed without writing a single line of code. NiFi can operate on batches as well as streams of data. It seamlessly ingests data from multiple data sources and provides mechanisms to handle different schemas in the data.

Common terminologies in NiFi

Before digging deeper into what NiFi is and how it works, let’s get to know some common terminologies used in NiFi.

NiFi Architecture

Figure: 1 NiFi architecture diagram, adapted from the NiFi documentation
  • Flow Controller: The flow controller serves as the brain of NiFi. It controls the running of NiFi extensions and schedules the allocation of resources for them.
  • Extensions: These can be considered various plugins that allow NiFi to communicate with other systems.
  • FlowFile Repository: The FlowFile Repository is where NiFi keeps track of the state of what it knows about a given FlowFile that is presently active in the flow.
  • Content Repository: The Content Repository is where the actual content bytes of a given FlowFile live.
  • Provenance Repository: The Provenance Repository is where all provenance event data is stored.
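The three repositories above are ordinary directories on disk, and their locations are set in conf/nifi.properties. The property names below are the standard ones from a default NiFi installation; the paths shown are the shipped defaults, relative to the NiFi home directory:

```properties
# FlowFile Repository: tracks the state and attributes of active FlowFiles
nifi.flowfile.repository.directory=./flowfile_repository

# Content Repository: stores the actual content bytes of each FlowFile
nifi.content.repository.directory.default=./content_repository

# Provenance Repository: stores provenance (lineage) event data
nifi.provenance.repository.directory.default=./provenance_repository
```

In production these are commonly pointed at separate disks, since the content and provenance repositories can grow large and have different I/O patterns.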

Unboxing Apache NiFi

When you start NiFi, you land on its web interface. The web UI provides a platform on which you can create automated dataflows, as well as visualize, edit, monitor, and administer those dataflows. Without writing any code, NiFi’s user interface lets you build your pipeline by dragging and dropping components onto the canvas. The screenshots of the NiFi application below highlight the different segments of the UI.

Figure: 2 Apache NiFi web UI
Figure: 3 Components Toolbar
  • Input Port provides a mechanism for transferring data into a Process Group from sources outside of it.
  • Output Port provides a mechanism for transferring data from a Process Group to destinations outside of it. All Input/Output Ports within a Process Group must have unique names.
  • Process Group can be used to logically group a set of components so that the dataflow is easier to understand and maintain.
  • Remote Process Group is similar to a Process Group; the only difference is that a Remote Process Group references a remote instance of NiFi.
  • Funnel is a NiFi component that is used to combine the data from several Connections into a single Connection.
  • Template helps to reuse a dataflow in the same or a different NiFi instance.
  • Labels are used to provide documentation for parts of a dataflow.
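Everything you build from these components on the canvas is also reachable through NiFi's REST API. As a sketch, assuming a default unsecured instance on localhost:8080, you can list the contents of the root Process Group like this:

```shell
# Base URL of the NiFi REST API (default unsecured HTTP setup)
NIFI_API="http://localhost:8080/nifi-api"

# "root" is an alias NiFi accepts for the top-level Process Group.
# The response is a JSON document describing the processors, ports,
# connections, funnels, and labels on the canvas.
curl -sf "$NIFI_API/flow/process-groups/root" || echo "Request failed (is NiFi running?)"
```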
Figure: 4 Global Menu

Type of Available Processors

NiFi ships with many Processors out of the box, along with the capability to write custom processors. These Processors, the building blocks of NiFi, provide the capability to consume data from various sources; route, transform, process, split, and aggregate it; and distribute it to almost any system. The table below lists some of the frequently used Processors, categorized by function.

Figure: 5 Type of Available Processors
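You can also ask a running instance which processor types it has installed. A sketch, again assuming a default unsecured instance on localhost:8080:

```shell
NIFI_API="http://localhost:8080/nifi-api"

# Returns a JSON listing of every installed processor type,
# including its class name, bundle, and descriptive tags.
curl -sf "$NIFI_API/flow/processor-types" || echo "Request failed (is NiFi running?)"
```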

NiFi Templates

NiFi allows us to build very large and complex DataFlows using basic components like Processor, Funnel, Input/Output Port, Process Group, and Remote Process Group. These components can be considered the basic building blocks for constructing a DataFlow. At times, though, wiring up these small building blocks can become tedious if the same logic needs to be repeated several times. Templates let you package a piece of dataflow once and reuse it.

Creating a Template

To create a Template, select the components that are to be part of the template, and then click the “Create Template” button in the Operate Palette (on the left-hand side of the NiFi canvas). Clicking this button without selecting anything will create a Template that contains all of the contents of the current Process Group. Each template must have a unique name.

Importing or Uploading a Template

To use a Template received by exporting from another NiFi, the first step is to import the template into this instance of NiFi. You may import templates into the canvas as a whole dataflow or to any Process Group.
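The same import can be scripted against the REST API. A sketch assuming a default unsecured instance on localhost:8080; the file name is a hypothetical exported template, not one from this article:

```shell
NIFI_API="http://localhost:8080/nifi-api"
TEMPLATE_FILE="my_flow.xml"   # hypothetical template exported from another NiFi

# Upload the template XML into the root Process Group.
# The multipart form field must be named "template".
curl -sf -F "template=@$TEMPLATE_FILE" \
  "$NIFI_API/process-groups/root/templates/upload" \
  || echo "Upload failed (is NiFi running and the file present?)"
```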

Instantiating or Adding a Template

Once a Template has been created or imported into the NiFi instance, it is ready to be instantiated, or added to the canvas. This is accomplished by dragging the Template icon from the Components Toolbar onto the canvas, then choosing the template from the dialog box, which lists the templates present in the current NiFi instance.

Managing Templates

The ability to export or import a dataflow, partially or completely, is one of the most powerful features of NiFi. Select Templates from the Global Menu to open a dialog that displays all of the Templates currently available; from there you can filter the templates to see only those of interest, export them, or delete them.
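These management operations are also exposed over the REST API. A sketch assuming a default unsecured instance on localhost:8080; the template id is a placeholder you would take from the listing call:

```shell
NIFI_API="http://localhost:8080/nifi-api"

# List all templates known to this instance (JSON; each entry has an id).
curl -sf "$NIFI_API/flow/templates" || echo "Listing failed (is NiFi running?)"

# Placeholder id; substitute one from the listing above.
TEMPLATE_ID="00000000-0000-0000-0000-000000000000"

# Export (download) a template as XML.
curl -sf "$NIFI_API/templates/$TEMPLATE_ID/download" -o exported_template.xml \
  || echo "Download failed"

# Delete a template.
curl -sf -X DELETE "$NIFI_API/templates/$TEMPLATE_ID" || echo "Delete failed"
```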

Building Apache NiFi data flow

Starting NiFi

You can launch NiFi via Docker or install it on your local machine.

To run it in Docker:

docker run -d -h nifi -p 8080:8080 --name nifi_latest --memory=4g -v /docker/apache/nifi apache/nifi:latest

To install it locally, download the NiFi binary archive, then:
  1. Extract it to a specific folder (e.g. c:\Users\username\nifi).
  2. Find and execute run-nifi.bat or run.sh in “your nifi folder\bin\”, depending on your operating environment.
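NiFi can take a minute or two to start, so it is worth polling the web UI before opening it. A small sketch, assuming the default unsecured UI on localhost:8080:

```shell
# Returns success once the NiFi web UI responds.
nifi_is_up() {
  curl -sf -o /dev/null "http://localhost:8080/nifi/"
}

# Poll a few times; increase the attempts/interval for slower machines.
for attempt in 1 2 3; do
  if nifi_is_up; then
    echo "NiFi UI is reachable"
    break
  fi
  sleep 2
done
```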

Building Nifi dataflow

We will build a real-world-like scenario to get a feel for NiFi. You can find the template file and the other files required to run this example in the GitHub repository.
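Since the flow starts with a GetFile processor, you need an input directory with some data for it to pick up. A sketch with illustrative paths and contents (not taken from the article's repository):

```shell
# Hypothetical input directory to configure as GetFile's "Input Directory".
INPUT_DIR="/tmp/nifi-demo/input"
mkdir -p "$INPUT_DIR"

# Drop a small sample record file for the flow to ingest.
cat > "$INPUT_DIR/users.csv" <<'EOF'
id,name,email
1,Alice,alice@example.com
2,Bob,bob@example.com
EOF

# By default GetFile removes files it picks up, so each file dropped
# here enters the pipeline exactly once.
ls "$INPUT_DIR"
```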

Figure: 6 Add Processor
Figure: 7 Configure GetFile Processor
Figure: 8 Connecting Processors
Figure: 9 Configure SplitRecordProcessor
Figure: 10 Adding Controller Service
Figure: 11 Process group configuration screen
Figure: 12 DBCPConnectionPool Configuration
Figure: 13 Configure ConvertJsonToSQL Processor
Figure: 14 Logger data parsing NiFi setup

Common use cases or applications

NiFi empowers you to quickly start moving data from many different types of source systems to various types of target systems, including HDFS, databases, streams, etc. This is particularly important in Big Data, where the aim is to ingest from a variety of data sources such as ERP, CRM, files, HTTP links, IoT data, etc. Ingesting data from various sources into a Big Data platform for further analysis requires a well-rounded, scalable, fault-tolerant solution to handle the entire “data flow” logistics of an enterprise. Enterprises are also looking for tools and technologies that support rapid development, ease of use for developers, reliability in data delivery, scalability to handle large data sets, and lineage tracking.

  • It offers real-time control, which helps you manage the movement of data between any source and destination.
  • Even though NiFi is not limited to being an ETL tool, it can serve as a managed ETL service.
Figure: 15 High-level system design of data collection and integration

Alternatives to Apache NiFi

Following are some of the alternatives to NiFi among open-source and cloud solutions.

Cloud Solutions:

What’s Next?

If you’ve made it this far, you have obtained an overall idea of what NiFi is and how to build a working platform to leverage in your applications. This article only scratches the tip of the iceberg; there is a lot more to cover, such as data provenance, logging, variables and parameters, labels, versioning, creating custom processors and controllers, the NiFi Registry, etc. The supporting tools for extensive automation are also a huge area to cover.
