What is Apache Airflow?
Apache Airflow is an open-source platform for orchestrating, scheduling, and monitoring data pipelines. Originally developed at Airbnb in 2014, today it is one of the most widely used tools in modern data engineering. Its key advantage: workflows are defined not as configuration files but as Python code – meaning they can be versioned, tested, and reused like any other software.
DAGs – The Heart of Airflow
The central concept is the DAG (Directed Acyclic Graph). A DAG describes a set of tasks and their dependencies. Execution always flows in one direction – no task can depend on itself, directly or indirectly, so cycles are impossible.
| Task | Operator | Role |
|---|---|---|
| extract_data | PythonOperator | Starting point |
| validate_schema | PythonOperator | Parallel to load_staging |
| load_staging | SQLExecuteQueryOperator | Parallel to validate_schema |
| run_dbt | BashOperator | Waits for both previous steps |
| notify | EmailOperator | Final step |
Example: A simple Airflow DAG with parallel steps followed by a dbt transformation
Architecture and Core Components
Webserver
The web interface (UI) for managing, monitoring, and manually controlling DAGs and individual task executions.
Scheduler
Monitors all DAGs, evaluates their schedules and dependencies, and passes due tasks to the executor.
Executor
Runs the actual tasks – depending on the configuration locally (SequentialExecutor, LocalExecutor), distributed across workers (CeleryExecutor), or in Kubernetes pods (KubernetesExecutor).
Metadata Database
A relational database (e.g. PostgreSQL) in which Airflow stores the state of all DAG runs and task instances; task logs are typically written to files or remote storage.
Worker
In distributed setups (e.g. Celery), workers are independent processes that receive and execute tasks.
DAG Folder
A shared directory where all Python DAG definitions are stored and read by the scheduler.
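These components meet in Airflow's central configuration file, `airflow.cfg`. A minimal sketch, assuming Airflow 2.x with the LocalExecutor and a PostgreSQL metadata database – the hostname and credentials are placeholders:

```ini
[core]
; where the scheduler looks for DAG definitions (the "DAG folder")
dags_folder = /opt/airflow/dags
; which executor runs the tasks
executor = LocalExecutor

[database]
; the metadata database holding DAG-run and task state (placeholder credentials)
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
```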
Operators – The Building Blocks of Tasks
Each task is defined by an operator that encapsulates the actual work. Airflow comes with many built-in operators, including:
- PythonOperator – executes a Python function
- BashOperator – executes shell commands
- SQLExecuteQueryOperator – executes SQL against a database
- EmailOperator – sends notifications
Additional operators for AWS, GCP, Azure, Snowflake, Databricks, Slack, and many more are available via provider packages.
Airflow and dbt – A Natural Complement
Airflow and dbt Core complement each other perfectly: dbt handles the transformation in the data warehouse, while Airflow controls the overarching process – when and in what order extraction, loading, transformation, and notifications are executed. Both tools share the principle of “code over configuration”.
Supported Platforms
Through the provider system, Airflow integrates natively with virtually all modern data platforms:
- Amazon Web Services (AWS)
- Google Cloud Platform (GCP)
- Microsoft Azure
- Snowflake
- Databricks
- PostgreSQL / MySQL
- Oracle DB
- Kubernetes
- Apache Spark
- dbt Core
- Slack / Email
- HTTP / REST APIs
Deployment Options
- Self-Hosted – maximum control, requires own setup and maintenance.
- Docker / Kubernetes – de facto standard: containerized operation with dynamic scaling via the KubernetesExecutor.
- Managed Services (Astronomer, Amazon MWAA, Google Cloud Composer) – reduced operational overhead, ideal for production environments.
Strengths and Limitations
Airflow excels through flexibility, a large integration ecosystem, and the ability to map complex dynamic pipelines in Python. As a batch orchestrator, however, it is not designed for real-time streaming – tools like Apache Kafka or Flink are better suited for that. For smaller teams, the initial setup effort may be an argument in favor of a managed service.
Conclusion
Apache Airflow is the industry standard for orchestrating data pipelines. In combination with dbt Core and modern cloud DWH platforms, it forms the backbone of the modern analytics engineering stack.