Airflow orchestration for big data organisations

Airflow is one of the most powerful platforms used by data scientists and engineers for orchestrating workflows. It has been gaining momentum across the data community, reaching well beyond hard-core data engineers.

Airflow manages this complexity while keeping the system scalable and performant. In this article series, we will walk you through an overview of Airflow, its approach, its core concepts and objects, and how they are used when writing a pipeline.

Airflow Overview

Airflow is an open-source tool for authoring and orchestrating big data workflows: a platform that programmatically schedules and monitors workflows. It triggers task execution based on the schedule interval and execution time. Think of it as a tool to coordinate work done by other services. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.

Today, Airflow is used to solve a variety of data ingestion, preparation and consumption problems. A key problem it solves is integrating data between disparate systems, such as behavioural analytics systems, CRMs, data warehouses, data lakes and BI tools, so that the data can be used for deeper analytics and AI. Airflow can also orchestrate complex ML workflows; it is designed as a configuration-as-code system and can be heavily customized with plugins.

An Approach – DAG (Directed Acyclic Graph)

In the workflow orchestration space, Airflow has become a go-to framework. It bridges the gap between GUI-based and purely programmatic orchestration: its pipelines are defined as Python code, while pipeline management happens in a GUI written in Flask. Airflow decouples the processing stages from the orchestration.

A collection of inter-dependent processes is controlled first and foremost by a dependency DAG: a graph in which data flows in only one direction among the connected processes. Any number of arbitrarily complex dependencies can be modelled using the pythonic DAG design API.

As we assume the reader has this background, we will not go beyond the basics of configuring DAGs into the more detailed settings of an Airflow DAG, such as start dates and retries.

The flow we are using in this example is as follows:

– save-bash – prints ‘Hello World’ to STDOUT and redirects it to a file called out.txt

– print-file – prints the contents of the out.txt file to STDOUT

– copy-file – copies out.txt to out_copy.txt

– delete-files – deletes the out.txt and out_copy.txt files

In the example above, the workflow can be visualized as a DAG. The rectangle around the diagram indicates that everything inside it is a self-contained DAG.
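A minimal sketch of this flow as Airflow code could look like the following. The task ids mirror the flow; the /tmp file paths, the linear ordering, and the DAG name are illustrative assumptions, not taken from the original example:

```python
# Illustrative DAG for the four-step flow described above.
# File paths and the strictly linear ordering are assumptions.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

with DAG(dag_id="hello_world_flow",
         start_date=days_ago(1),
         schedule_interval=None) as dag:

    save_bash = BashOperator(
        task_id="save-bash",
        bash_command="echo 'Hello World' | tee /tmp/out.txt",
    )
    print_file = BashOperator(
        task_id="print-file",
        bash_command="cat /tmp/out.txt",
    )
    copy_file = BashOperator(
        task_id="copy-file",
        bash_command="cp /tmp/out.txt /tmp/out_copy.txt",
    )
    delete_files = BashOperator(
        task_id="delete-files",
        bash_command="rm /tmp/out.txt /tmp/out_copy.txt",
    )

    # One possible ordering consistent with the description.
    save_bash >> print_file >> copy_file >> delete_files
```

Each task only redirects, reads, copies or deletes a file; the DAG object itself just records the ordering for the scheduler.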


When you have a set of interrelated Spark jobs, Apache Airflow can be used to orchestrate their dependent execution. These dependencies are naturally expressed as a DAG (Directed Acyclic Graph) with the Airflow Python API.


This approach is more advanced, but not overly so. The additional complexity is worthwhile if:

– You have complicated flows you need to break down.

– There are flows that should not reprocess computationally expensive stages.

– It is generally desirable to decouple the execution and orchestration of a flow.


The Airflow Python script is really just a configuration file specifying the DAG’s structure as code. As mentioned above, it is the set of all the tasks you want to run in a workflow, so the two core things this file handles are running tasks in the right order and at the right time. It contains the schedule-level parameters, defined operations, tasks, and arguments.

import os
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

# Derive the DAG id from the file name, e.g. my_dag.py -> my_dag
DAG_ID = os.path.basename(__file__).replace(".pyc", "").replace(".py", "")
DAG_OWNER_NAME = 'airflow'      # not defined in the original snippet
SCHEDULE_INTERVAL = '@daily'    # not defined in the original snippet
START_DATE = days_ago(1)

def my_func():
    print('Hey Folks!')

default_args = {
    'owner': DAG_OWNER_NAME,
    'depends_on_past': False,
    'start_date': START_DATE,
    'email': ['******'],
}

dag = DAG(DAG_ID, default_args=default_args,
          schedule_interval=SCHEDULE_INTERVAL, start_date=START_DATE)

# The imported operators can now be attached to the DAG:
start = DummyOperator(task_id='start', dag=dag)
greet = PythonOperator(task_id='greet', python_callable=my_func, dag=dag)
start >> greet



While a DAG describes how to run a workflow, an operator defines an individual task that needs to be performed. A DAG consists of operators. An operator is simply a Python class with an “execute()” method, which gets called when the task runs. Operators generally run independently of one another; the DAG simply makes sure that dependent operators run in the correct order.


– BashOperator – executes a bash command

– PythonOperator – calls an arbitrary Python function

– EmailOperator – sends an email

– SimpleHttpOperator – sends an HTTP request

– TriggerDagRunOperator – triggers a DAG run for a specific dag_id

– MySqlOperator, SqliteOperator, OracleOperator, JdbcOperator, etc. – execute a SQL command
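To make the “execute()” contract concrete, here is a deliberately simplified, Airflow-free sketch of the pattern. The stand-in base class and the GreetOperator are illustrative inventions; in a real DAG you would subclass airflow.models.BaseOperator instead:

```python
# Simplified stand-in for Airflow's BaseOperator, for illustration only.
class BaseOperator:
    def __init__(self, task_id):
        self.task_id = task_id

    def execute(self, context):
        # Each concrete operator overrides this; a worker calls it
        # when the task instance actually runs.
        raise NotImplementedError

class GreetOperator(BaseOperator):
    """A toy operator: its whole job happens inside execute()."""
    def __init__(self, task_id, name):
        super().__init__(task_id)
        self.name = name

    def execute(self, context):
        message = "Hello, {}!".format(self.name)
        print(message)
        return message

# What the worker conceptually does at run time:
task = GreetOperator(task_id="greet", name="Folks")
result = task.execute(context={})
```

The DAG never calls execute() itself; it only records which task_ids depend on which, and the scheduler drives the calls in that order.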


Sensors are a special kind of operator. When they run, they check whether certain criteria are met before they complete and let their downstream tasks execute; they fail if they time out. The basic function of a sensor is to monitor for a condition, such as a file landing or a partition appearing. A sensor is a Python class with a “poke()” method, which gets called repeatedly until “True” is returned.
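The poke-until-true behaviour can be illustrated without Airflow itself. This stand-alone sketch mimics what a sensor does with its poke interval and timeout; the class and its predicate are illustrative, not the real Airflow API:

```python
import time

class FileSensorSketch:
    """Toy stand-in for an Airflow sensor, illustrating the poke() contract."""
    def __init__(self, predicate, poke_interval=1.0, timeout=5.0):
        self.predicate = predicate          # the criteria to check
        self.poke_interval = poke_interval  # seconds between pokes
        self.timeout = timeout              # fail after this long

    def poke(self):
        # A real sensor would check e.g. a file, a partition, or an HTTP status.
        return self.predicate()

    def run(self):
        deadline = time.monotonic() + self.timeout
        while not self.poke():              # called repeatedly ...
            if time.monotonic() > deadline:
                raise TimeoutError("sensor timed out")
            time.sleep(self.poke_interval)
        return True                         # ... until True is returned

# A predicate that only succeeds on the third poke:
attempts = {"n": 0}
def ready():
    attempts["n"] += 1
    return attempts["n"] >= 3

sensor = FileSensorSketch(ready, poke_interval=0.01, timeout=2.0)
print(sensor.run())  # prints: True
```

If the condition never becomes true before the timeout, the sensor raises instead of returning, which is how a timed-out sensor task ends up failed and blocks its downstream tasks.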


After all of the above steps, the next most important step is setting up dependencies between tasks: defining the order in which the tasks must be executed.

Task dependencies are set using the set_upstream and set_downstream methods (though, in version ≥ 1.8, it is also possible to use the bitshift operators << and >> to perform the same operations more concisely). A task can have multiple dependencies, and there are several ways of defining them.
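As a small illustrative fragment (the DAG name and the DummyOperator placeholder tasks are assumptions made for the sketch), both styles express the same chain:

```python
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.dates import days_ago

with DAG("dependency_demo", start_date=days_ago(1),
         schedule_interval=None) as dag:
    t1 = DummyOperator(task_id="t1")
    t2 = DummyOperator(task_id="t2")
    t3 = DummyOperator(task_id="t3")

    # Method style, available in all versions:
    t1.set_downstream(t2)   # t2 runs after t1

    # Bitshift style (version >= 1.8), read left to right:
    t2 >> t3                # t3 runs after t2
```

Either way, the scheduler sees the same graph: t1 before t2 before t3.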


After this overview of Airflow, we look forward to the next part, which will uncover more of the enigmas of Airflow orchestration.

About the Author

Tanya Sharma is a budding business analyst at Factspan who likes to solve real-life problems and enjoys data analysis and working with different analytics and BI tools. She has previously worked with Google’s partners and clients like NASA, World Bank, NOAA, Walmart, and Target across multiple domains. When not doing her day job, she can be found sketching or wandering off to an offbeat getaway, and, as a bonus, she is always available for a quick chat.
