Python · Advanced

Build a Data Lineage Graph with NetworkX

Track data lineage across ETL pipeline stages using a NetworkX directed graph, where nodes are datasets and edges are the transformations between them.

python
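import networkx as nx

# Directed graph: each node is a dataset, each edge is a transformation.
G = nx.DiGraph()

# Register the datasets up front (add_edges_from would also create them implicitly).
G.add_nodes_from(['raw_events', 'clean_events', 'enriched_events', 'fact_orders', 'dim_customer'])

# (source, target, attributes): 'op' names the step that produces the target.
edges = [
    ('raw_events',      'clean_events',    {'op': 'deduplicate'}),
    ('clean_events',    'enriched_events', {'op': 'join customer'}),
    ('enriched_events', 'fact_orders',     {'op': 'aggregate'}),
    ('dim_customer',    'enriched_events', {'op': 'join'}),
]
G.add_edges_from(edges)

# Every upstream dataset that fact_orders depends on, directly or indirectly.
# ancestors() returns a set, so sort it for stable output.
print('Ancestors:', sorted(nx.ancestors(G, 'fact_orders')))
# One valid build order; raises NetworkXUnfeasible if the graph has a cycle.
print('Topological order:', list(nx.topological_sort(G)))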
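
To view the lineage as a diagram, one option is to emit Graphviz DOT text from the edges and their `op` attributes. The sketch below is only an illustration: `to_dot` is a hypothetical helper written by hand (it is not a NetworkX function), and it assumes every edge carries an `op` attribute as in the snippet above. The graph is rebuilt so the example runs on its own.

```python
import networkx as nx

# Rebuild the same small lineage graph so this sketch is self-contained.
G = nx.DiGraph()
G.add_edges_from([
    ('raw_events',      'clean_events',    {'op': 'deduplicate'}),
    ('clean_events',    'enriched_events', {'op': 'join customer'}),
    ('enriched_events', 'fact_orders',     {'op': 'aggregate'}),
    ('dim_customer',    'enriched_events', {'op': 'join'}),
])

def to_dot(graph):
    """Hypothetical helper: render a DiGraph as Graphviz DOT text."""
    lines = ['digraph lineage {']
    # Sort by (source, target) so the output is stable between runs.
    for src, dst, attrs in sorted(graph.edges(data=True), key=lambda e: (e[0], e[1])):
        label = attrs.get('op', '')  # edges without an 'op' get an empty label
        lines.append(f'  "{src}" -> "{dst}" [label="{label}"];')
    lines.append('}')
    return '\n'.join(lines)

dot_text = to_dot(G)
print(dot_text)
```

The resulting text can be pasted into any Graphviz renderer. Building the string by hand keeps the example free of extra dependencies.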

Use Cases

  • data governance
  • impact analysis
  • ETL documentation
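
The use cases above could be sketched roughly as follows, on the same example graph. `impacted_by` and `inputs_of` are hypothetical helper names introduced for illustration. The acyclicity check is one plausible governance rule, not something the original snippet prescribes.

```python
import networkx as nx

# Same example lineage graph, rebuilt so this sketch runs on its own.
G = nx.DiGraph()
G.add_edges_from([
    ('raw_events',      'clean_events',    {'op': 'deduplicate'}),
    ('clean_events',    'enriched_events', {'op': 'join customer'}),
    ('enriched_events', 'fact_orders',     {'op': 'aggregate'}),
    ('dim_customer',    'enriched_events', {'op': 'join'}),
])

# Data governance: lineage should not contain circular dependencies.
is_valid = nx.is_directed_acyclic_graph(G)

def impacted_by(graph, dataset):
    """Hypothetical helper: datasets downstream of `dataset`, in rebuild order."""
    downstream = nx.descendants(graph, dataset)
    return [n for n in nx.topological_sort(graph) if n in downstream]

def inputs_of(graph, dataset):
    """Hypothetical helper: (source, op) pairs that feed directly into `dataset`."""
    return sorted((src, op) for src, _, op in graph.in_edges(dataset, data='op'))

# Impact analysis: what must be rebuilt if dim_customer changes?
impacted = impacted_by(G, 'dim_customer')
# ETL documentation: which inputs and operations produce enriched_events?
inputs = inputs_of(G, 'enriched_events')

print('Acyclic:', is_valid)
print('Impacted by dim_customer:', impacted)
print('Inputs of enriched_events:', inputs)
```

`nx.descendants` is the downstream counterpart of `nx.ancestors`. Filtering the topological order through it gives a rebuild sequence in which each dataset comes after its inputs.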
