What is Apache Airflow?

Apache Airflow is an open-source platform for orchestrating, scheduling, and monitoring data pipelines. Originally developed at Airbnb in 2014, it has since become one of the most widely used tools in modern data engineering. The key advantage: workflows are defined not as configuration files but as Python code – meaning they can be versioned, tested, and reused like any other software.

DAGs – The Heart of Airflow

The central concept is the DAG (Directed Acyclic Graph). A DAG describes a set of tasks and the dependencies between them. Execution always flows in one direction – no task can depend, directly or indirectly, on itself.

  • extract_data (PythonOperator) – starting point
  • validate_schema (PythonOperator) – runs in parallel with load_staging
  • load_staging (SQLExecuteQueryOperator) – runs in parallel with validate_schema
  • run_dbt (BashOperator) – waits for both previous steps
  • notify (EmailOperator) – final step

Example: A simple Airflow DAG with parallel steps followed by a dbt transformation
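
A minimal sketch of this pipeline as Airflow code (assuming a recent Airflow 2.x install; the connection ID, SQL file, dbt path, and e-mail address are placeholders, not part of the original example):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.email import EmailOperator
    from airflow.operators.python import PythonOperator
    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

    def extract():
        ...  # placeholder: pull data from the source system

    def validate():
        ...  # placeholder: check the incoming schema

    with DAG(
        dag_id="elt_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        extract_data = PythonOperator(task_id="extract_data", python_callable=extract)
        validate_schema = PythonOperator(task_id="validate_schema", python_callable=validate)
        load_staging = SQLExecuteQueryOperator(
            task_id="load_staging",
            conn_id="warehouse",          # assumed Airflow connection
            sql="sql/load_staging.sql",   # assumed SQL file next to the DAG
        )
        run_dbt = BashOperator(
            task_id="run_dbt",
            bash_command="dbt run --project-dir /opt/dbt",  # assumed dbt project path
        )
        notify = EmailOperator(
            task_id="notify",
            to="data-team@example.com",   # assumed recipient
            subject="ELT pipeline finished",
            html_content="All steps completed successfully.",
        )

        # validate_schema and load_staging run in parallel; run_dbt waits for both
        extract_data >> [validate_schema, load_staging] >> run_dbt >> notify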

Architecture and Core Components
Webserver

The web interface (UI) for managing, monitoring, and manually controlling DAGs and individual task executions.

Scheduler

Monitors all DAGs, evaluates their schedules and dependencies, and passes due tasks to the executor.

Executor

Executes the actual tasks – depending on the configuration either locally (SequentialExecutor), distributed across worker machines (CeleryExecutor), or in Kubernetes pods (KubernetesExecutor).

Metadata Database

A relational database (e.g. PostgreSQL) in which Airflow stores the state of all DAG runs and task instances, along with connections, variables, and other metadata.

Worker

In distributed setups (e.g. Celery), workers are independent processes that receive and execute tasks.

DAG Folder

A shared directory where all Python DAG definitions are stored and read by the scheduler.
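
These components are wired together through Airflow's central configuration. As a rough illustration, the key settings can also be read programmatically – a minimal sketch assuming Airflow 2.x (the metadata-database key moved from the [core] to the [database] section in Airflow 2.3):

    # Inspect how an Airflow installation is configured (Airflow 2.x)
    from airflow.configuration import conf

    print(conf.get("core", "executor"))      # e.g. LocalExecutor, CeleryExecutor, KubernetesExecutor
    print(conf.get("core", "dags_folder"))   # the DAG folder parsed by the scheduler
    print(conf.get("database", "sql_alchemy_conn"))  # connection string of the metadata database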

Operators – The Building Blocks of Tasks

Each task is defined by an operator that encapsulates the actual work. Airflow ships with many ready-made operators, including:

  • PythonOperator – executes a Python function
  • BashOperator – executes shell commands
  • SQLExecuteQueryOperator – executes SQL against a database
  • EmailOperator – sends notifications

Additional operators for AWS, GCP, Azure, Snowflake, Databricks, Slack, and many more are available via provider packages.
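
As a small illustration of the provider mechanism (the package, connection ID, and SQL shown here are assumed examples), installing a provider makes its operators importable like any built-in one:

    # Assumed example: after `pip install apache-airflow-providers-snowflake`,
    # the Snowflake operator can be used inside a `with DAG(...)` block:
    from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

    create_events_table = SnowflakeOperator(
        task_id="create_events_table",
        snowflake_conn_id="snowflake_default",  # assumed connection configured in Airflow
        sql="CREATE TABLE IF NOT EXISTS raw.events (id INT, payload VARIANT);",
    )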

Airflow and dbt – A Natural Complement

Airflow and dbt Core complement each other perfectly: dbt handles the transformation in the data warehouse, while Airflow controls the overarching process – when and in what order extraction, loading, transformation, and notifications are executed. Both tools share the principle of “code over configuration”.
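
A common pattern – sketched here with assumed project and profile paths – is to give dbt its own Airflow tasks so that the dbt tests gate everything downstream:

    # Inside a DAG definition: run the dbt models, then the dbt tests
    from airflow.operators.bash import BashOperator

    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt",   # assumed paths
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/dbt --profiles-dir /opt/dbt",  # assumed paths
    )

    dbt_run >> dbt_test  # tests only run if the models built successfully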

Supported Platforms

Through the provider system, Airflow integrates natively with virtually all modern data platforms:

  • Amazon Web Services (AWS)
  • Google Cloud Platform (GCP)
  • Microsoft Azure
  • Snowflake
  • Databricks
  • PostgreSQL / MySQL
  • Oracle DB
  • Kubernetes
  • Apache Spark
  • dbt Core
  • Slack / Email
  • HTTP / REST APIs

Deployment Options

  • Self-Hosted
    Maximum control, requires own setup and maintenance.

  • Docker / Kubernetes
    De facto standard – containerized operation with dynamic scaling via the KubernetesExecutor.

  • Managed Services (Astronomer, Amazon MWAA, Google Cloud Composer)
    Reduced operational overhead, ideal for production environments.

Strengths and Limitations

Airflow's strengths are its flexibility, its large integration ecosystem, and the ability to express complex, dynamic pipelines in Python. As a batch orchestrator, however, it is not designed for real-time streaming – tools like Apache Kafka or Flink are better suited for that. For smaller teams, the initial setup effort can be an argument in favor of a managed service.

Conclusion

Apache Airflow is the industry standard for orchestrating data pipelines. In combination with dbt Core and modern cloud DWH platforms, it forms the backbone of the modern analytics engineering stack.