Notes on Building a Databricks Pipeline That Doesn't Fall Over

These are notes from building a data pipeline at Airbus that processed 10TB+ of aviation operational data on Databricks. None of this is novel. All of it was learned the hard way.

If you're a data science student about to do your first internship at a company with real data, this might save you a week of debugging.

1. Your notebook works at 1GB. It will betray you at 100GB.

The most common failure mode I saw: someone prototypes a pipeline on a sample dataset, it works fine, and then it falls over spectacularly on the full dataset. The culprits were usually the things covered in the sections below.

2. Partition strategy matters more than you think

We partitioned by date and aircraft ID. Sounds obvious in retrospect. The first version partitioned only by date, which meant queries filtering by aircraft had to scan every partition. The second version partitioned by both, which cut query time dramatically.

The general principle: partition by the columns you filter on most often. Not by the columns that make the most logical sense. These are sometimes different things.

# Good: partition by what you actually filter on
df.write.partitionBy("flight_date", "aircraft_id") \
  .parquet("/mnt/data/operations/")

# Less good: partition by what seems "correct"
df.write.partitionBy("data_source", "year") \
  .parquet("/mnt/data/operations/")

Also: too many partitions is nearly as bad as too few. If your partition column has 500,000 unique values, you'll create 500,000 tiny files and your reads will be slow because of metadata overhead. Aim for partitions in the hundreds, maybe low thousands.
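The trade-off above is easy to sanity-check before writing anything. Here is a back-of-envelope helper (a hypothetical function, not part of the original pipeline) that estimates file count and average file size for a candidate partition column:

```python
def partition_file_estimate(total_bytes, n_partition_values, files_per_partition=1):
    """Return (total_files, avg_file_mb) for a candidate partition column."""
    total_files = n_partition_values * files_per_partition
    avg_file_mb = total_bytes / total_files / (1024 ** 2)
    return total_files, avg_file_mb

# 10 TiB across 500,000 distinct values: ~21 MB per file, and half a
# million files of metadata to list on every read. Too fine-grained.
files_bad, mb_bad = partition_file_estimate(10 * 1024 ** 4, 500_000)

# The same 10 TiB across ~2,000 date/aircraft combinations: ~5 GB per
# partition, which Spark can then split into sensibly sized files.
files_ok, mb_ok = partition_file_estimate(10 * 1024 ** 4, 2_000)
```

The exact targets vary by workload, but if the estimate comes out in the tens of megabytes per file across hundreds of thousands of files, the partition column is too fine.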

3. Delta Lake is worth the overhead

We switched from raw Parquet to Delta Lake midway through the project. The features that made the overhead worth it were the ones below: file compaction with Z-ordering, and time travel for debugging bad loads.

-- Compact small files and optimise for common query patterns
OPTIMIZE operations.flight_data
ZORDER BY (aircraft_id, flight_date)

-- Query an earlier version of the table (e.g. the state before last night's run)
SELECT * FROM operations.flight_data VERSION AS OF 42

4. Scheduling: make it boring

Our pipeline ran nightly via Databricks Workflows. The most important thing I did was make it boring: predictable schedule, idempotent writes, verbose logs.

The goal is that when something goes wrong — and it will — you can diagnose it from the logs without re-running anything.
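Concretely, "boring" meant each run emitted one structured summary line you could grep months later. A minimal sketch of that idea, with illustrative field names rather than the ones from the actual Airbus pipeline:

```python
import json
import time

def run_summary(run_id, rows_read, rows_written, started_at):
    """Emit one structured log line per pipeline run."""
    summary = {
        "run_id": run_id,
        "rows_read": rows_read,
        "rows_written": rows_written,
        "rows_dropped": rows_read - rows_written,
        "duration_s": round(time.time() - started_at, 1),
        # An empty write is almost always a bug upstream, not real data.
        "status": "ok" if rows_written > 0 else "suspicious_empty_write",
    }
    print(json.dumps(summary))  # one grep-able line per run
    return summary
```

The point is not the specific fields; it's that a failed night should be diagnosable from the log line alone, without re-running the job.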

5. The 40% came from one stupid thing

The "reduced processing time by 40%" line on my CV sounds impressive. The actual fix was embarrassing: the previous pipeline was reading the entire raw dataset every run, even though only the last day's data was new. I added incremental loading — check the max timestamp in the target table, only ingest rows newer than that.

# Instead of this:
raw_df = spark.read.parquet("/mnt/raw/")

# Do this:
from pyspark.sql.functions import col

last_loaded = spark.sql("""
  SELECT MAX(_pipeline_timestamp)
  FROM operations.flight_data
""").collect()[0][0]

raw_df = spark.read.parquet("/mnt/raw/")
if last_loaded is not None:  # None on the very first run: fall back to a full load
    raw_df = raw_df.filter(col("ingestion_time") > last_loaded)
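The rule the Spark job applies can be distilled to a few lines of plain Python: keep only rows newer than the target table's high-water mark, and fall back to a full load on the first run, when no mark exists yet. (A sketch with hypothetical names, not the production code.)

```python
def incremental_rows(rows, high_water_mark):
    """rows: iterable of dicts with an 'ingestion_time' key."""
    if high_water_mark is None:  # first run: no target table yet, load everything
        return list(rows)
    return [r for r in rows if r["ingestion_time"] > high_water_mark]
```

Getting the first-run case right matters: comparing against a NULL high-water mark silently filters out every row.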

That's it. That's the 40%. Most performance improvements in real pipelines aren't clever algorithms — they're avoiding doing work you've already done.

Summary

If you're building your first serious data pipeline, the things that matter most are not the things your coursework emphasises. Partition strategy, incremental loading, monitoring, and file compaction will account for more of your performance than any choice of algorithm. The boring infrastructure work is the work that makes everything else possible.
