What Does DAG Mean?

DAG stands for Directed Acyclic Graph. The term comes from graph theory but has central practical significance in data engineering: it describes the dependency structure of tasks or transformation steps that must be executed in a defined order.

A graph consists of nodes and edges. In data engineering, each node corresponds to a task or a data model, and each edge represents a dependency between two tasks. Two additional properties make a graph a DAG:

Directed
Each edge points in exactly one direction – from a prerequisite task to the task that depends on it.

Acyclic
There are no cycles. No node can lead back to itself through a chain of dependencies.

Directed and Acyclic – What Does That Mean in Practice?

The difference between an arbitrary directed graph and a DAG is best illustrated visually. A cycle would mean that task A depends on B, B depends on C – and C depends on A again. That would be an unresolvable dependency loop for which no execution order exists.

NOT A DAG – CYCLE PRESENT
Task A → Task B
Task B → Task C
Task C → Task A (Cycle!)
C → A creates a loop – no execution order possible

VALID DAG – ACYCLIC
Task A → Task B
Task A → Task C
Task B → End
Task C → End
Clear direction, no cycle – unambiguous execution order
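The two example graphs above can be checked programmatically. The following sketch uses Python's standard-library graphlib (Python 3.9+), whose TopologicalSorter raises a CycleError for the cyclic graph and yields a valid execution order for the acyclic one; the node names mirror the diagrams.

```python
from graphlib import TopologicalSorter, CycleError

# Mapping: node -> set of its prerequisites (predecessors)
cyclic = {"A": {"C"}, "B": {"A"}, "C": {"B"}}            # A → B → C → A
valid_dag = {"B": {"A"}, "C": {"A"}, "End": {"B", "C"}}  # the valid DAG above

try:
    list(TopologicalSorter(cyclic).static_order())
except CycleError:
    print("Cycle detected – no execution order possible")

order = list(TopologicalSorter(valid_dag).static_order())
print(order)  # A comes first, End last; B and C may appear in either order
```

This is exactly the check an orchestrator performs before running anything: if the dependency graph contains a cycle, no schedule can be produced.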

Why Are DAGs So Important in Data Engineering?

Data pipelines almost always consist of multiple steps that depend on each other: raw data must be loaded before it can be transformed; transformations must be complete before reports can be updated. A DAG makes these dependencies explicit and machine-readable.

This brings three key advantages: First, a valid execution order can be automatically derived from it. Second, independent steps can be executed in parallel, significantly reducing runtime. Third, when an error occurs, it is immediately clear which downstream steps are affected and which do not need to be recalculated.
Core advantage: A DAG is not merely a visualization – it is a formal data structure from which tools like Airflow or dbt automatically derive execution plans, parallelization, and dependency checks.

DAGs in Practice: dbt and Airflow

Both tools, introduced in previous articles, are built around the DAG concept – but at different levels:

Tool           | Level          | What a node represents        | What an edge represents
dbt Core       | Transformation | A SQL model (view or table)   | A ref() dependency between models
Apache Airflow | Orchestration  | A task (Python, SQL, Bash, …) | A >> dependency between tasks

In dbt, the DAG is created implicitly through the ref() function: as soon as model B references model A, dbt knows that A must be executed first. Airflow, on the other hand, defines the DAG explicitly in Python code – the developer specifies which tasks precede which others.
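The explicit style can be sketched with a minimal toy Task class – this is not Airflow's real API, only an illustration of how a >> operator can record edges in plain Python, mirroring Airflow's task-dependency syntax.

```python
# Minimal toy sketch (NOT Airflow's actual implementation) of how an
# explicit DAG can be built in Python with a >> operator.
class Task:
    def __init__(self, name):
        self.name = name
        self.downstream = []  # tasks that depend on this one

    def __rshift__(self, other):
        # "self >> other" means: other depends on self
        self.downstream.append(other)
        return other  # returning other allows chaining: a >> b >> c

load = Task("load_raw_data")
transform = Task("transform")
report = Task("update_report")

load >> transform >> report  # explicit edge definitions

print([t.name for t in load.downstream])       # ['transform']
print([t.name for t in transform.downstream])  # ['update_report']
```

In dbt, by contrast, no such operator exists in user code: the edges fall out of the ref() calls inside the SQL models themselves.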

A Simple Example

A typical analytics workflow can be represented as a DAG with five nodes. Step 1 must be completed first; steps 2 and 3 are independent of each other and can run in parallel; step 4 waits for both; step 5 concludes.

Step 1 | Load raw data        | Prerequisite for everything that follows
Step 2 | Validate schema      | Parallel to step 3
Step 3 | Populate staging     | Parallel to step 2
Step 4 | Transform dbt models | Waits for steps 2 and 3
Step 5 | Update report        | Conclusion
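The parallelism in this workflow can be derived mechanically from the graph. The following sketch (step names shortened for brevity) groups the five steps into execution "waves" – sets of steps whose prerequisites are all finished and which can therefore run simultaneously.

```python
# Sketch: deriving parallel execution waves for the five-step workflow.
graph = {  # node -> set of prerequisites
    "load": set(),
    "validate": {"load"},
    "staging": {"load"},
    "transform": {"validate", "staging"},
    "report": {"transform"},
}

waves = []
done = set()
while len(done) < len(graph):
    # all steps whose prerequisites are already finished
    ready = {n for n, deps in graph.items() if n not in done and deps <= done}
    waves.append(sorted(ready))
    done |= ready

print(waves)
# [['load'], ['staging', 'validate'], ['transform'], ['report']]
```

The second wave contains two steps – exactly the parallelism described above: validation and staging depend only on the load step, not on each other.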

Conclusion

The DAG is the conceptual foundation on which modern data engineering tools are built. It makes dependencies between tasks explicit, enables parallel execution, and prevents unresolvable cycles. Anyone who understands dbt Core or Apache Airflow works with DAGs on a daily basis – often without explicitly naming them.