Airflow orchestration for big data organisations

Airflow Orchestration is the most powerful platforms used for orchestrating workflows by Data Scientist and Engineers. Airflow was already gaining momentum among the data community going beyond hard-core data engineers.

Airflow Overview

Airflow, an open-source tool for authoring and orchestrating big data workflows. Basically, a platform that can programmatically schedules and monitor workflows. It triggers task execution based on schedule interval and execution time. Think of it as a tool to coordinate work done by other services. While following the specified dependencies, the airflow scheduler executes your tasks on an array of workers.

Today, Airflow is used to solve a variety of data ingestion, preparation and consumption problems. Key problems such as behavioural analytical systems, CRMs, data warehouses, data lakes and BI tools solved by Airflow is Integrating data between disparate systems which are used for deeper analytics and AI. Airflow can also orchestrate complex ML workflows, plus it is designed as a configuration-as-a-code system and heavily customized with plugins.

An Approach- DAG (Directed Acyclic Graph)

In the workflow orchestration space, Airflow has become a go-to framework. It bridges the gap between GUI-based and purely programmatic orchestration. Its pipelines are code (Python) defined and pipeline management is via a GUI has written in a flask. Airflow decouples the processing stages from the orchestration.

A collection of inter-dependent processes is controlled first and foremost by a dependency DAG in which a graph where data flows in only one direction among the connected processes. Any number of arbitrarily complex dependencies can be modelled using the pythonic DAG design API.

As we assume that the user has this background, beyond the basics of configuring the DAGs, no attention has been given to the more detailed configurations of an Airflow DAG, such as starting dates and retries.

The flow we are using in this example is as follows:

– save-bash — print ‘Hello World’ the STDOUT and redirect it to a file called out.txt

– print-file — print the output of out.txt file to STDOUT

– copy-file — copy out.txt to out_copy.txt

– delete-files — delete the out.txt and out_copy.txt files

The workflow can be visualized as a DAG in the given above example.  The rectangle around the diagram represents that the inside one is a self-contained DAG.

SINGLE DAG APPROACH

Apache Airflow can be used to orchestrate the dependent execution of the pipelines when locating a bunch of Spark jobs that are interrelated. These dependencies are naturally expressed as a DAG (Directed Acyclic Graph) with the Airflow Python API.

TRIGGERED DAGS APPROACH

This approach is more advanced, but not overly so. This additional complexity is necessary if:

-You have complicated flows you need to break down.

-There are flows that should not reprocess computationally expensive stages.

-It is generally desirable to decouple the execution and orchestration of a flow.

DAG DEFINITION FILE

The Airflow Python script is just a configuration file specifying the DAG’s structure as code. As mentioned above it is a set of all the tasks you want to run in a workflow, so two core things which this file is handling are the right order and the right time. This contains the schedule level parameters, defined operations, tasks, and arguments.

import os
from airflow import DAG
from datetime import datetime
from airflow.utils.dates import days_ago
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

DAG_ID = os.path.basename(__file__).replace(\".pyc\", \"\").replace(\".py\", \"\")
DAG_OWNER_NAME = \"FACTSPAN\"
SCHEDULE_INTERVAL = None
START_DATE = days_ago(1)

def my_func():
    print(\'Hey Folks!\')
default_args = {
    \'owner\': DAG_OWNER_NAME,
    \'depends_on_past\': False,
    \'start_date\': START_DATE,
    \'email\': [\'******@factspan.com\'],
    \'email_on_failure\':False,
    \'email_on_retry\':False,
    \'retries\':5
}
dag = DAG(DAG_ID, default_args=default_args,
schedule_interval=SCHEDULE_INTERVAL, start_date=START_DATE)

t1=DummyOperator(task_id=\'dummy_task\',retries=3)
t2=PythonOperator(task_id=\'python_task\',python_callable=my_func)
t1>>t2
Operator

While DAGs describes how to run a workflow, the Operator defines an individual task that needs to be performed. A DAG consists of operators. An operator is simply a Python class with an “execute()” method, which gets called when it is being run. Operators generally run independently, except for those dependencies where the DAG makes sure that operators run in the correct order.

TYPES OF THE OPERATOR:

–  BashOpertor – executes a bash command

– PythonOperator – calls an arbitrary Python function

– EmailOperator – sends an email

– SimpleHttpOperator – sends an HTTP request

– DAG run for a specific dag_id

– MySqlOperator, SqliteOperator, OracleOperator, JdbcOperator, etc. – executes a SQL command

Sensor

Sensors are a special kind of operator. When they run, they will check to see if certain criteria are met before they complete and let their downstream tasks execute; and also fail when they time out. The basic function of this sensor is to monitor the execution. The sensor operator is a Python class with a “poke()” method, which gets called repeatedly until “True” is returned.

Dependencies

Setting up dependencies between tasks is the next most important step after following all the above steps. Basically defining the order in which tasks must be executed.

Task dependencies are set using the set_upstream and set_downstream operators (Though, in version ≥ 1.8, it’s also possible to use the bitshift operators << and >> to perform the same operations more concisely). A task can have multiple dependencies. There are different ways of defining dependencies

t2.set_upstream(t1)
t2<<t1
t1>>t2>>t3
t1.set_downstream([t2,t3])
t1>>[t2,t3]
t1>>(t2,d3)
[t2,t3]<<t1

After this vibrant overview of Airflow, we are looking forward to the next part uncovering most of the enigmas of Airflow Orchestration.

About the Author

Tanya Sharma is a budding business analyst at Factspan who likes to solve real-life problems and enjoys data analysis and working with different analytics and BI tools.  She has previously worked with Google’s partners and clients like NASA, World Bank, NOAA, Walmart, Target etc across multiple domains. When not doing her day job, she can be found sketching or wandering off to an offbeat getaway, and in bonus, she is always available for a quick chat.

Featured content

Enhancing Data Processing with Aggregate Functions...

Implementing Sustainable Model Design with Green A...

Snowflake Copilot

Streamline SQL Workflow with Snowflake Copilot...

GCP vs. AWS vs. Azure (2024)...

Exploring Data Mesh – PoV...

Choosing the Right Cloud Data Engineering & Analytics Platform: Databricks vs. Snowflake

Databricks vs. Snowflake (2024)...

Enhancing Retail Data Quality with Apache Airflow ...

Data governance consulting

Data Governance – Starter Kit...

Snowflake tutorial

Quick Tutorial on DataFrame Updates in Snowpark...

Building Gen AI for Enterprise – PoV...

Scroll to Top