These are notes from building a data pipeline at Airbus that processed 10TB+ of aviation operational data on Databricks. None of this is novel. All of it was learned the hard way.
If you're a data science student about to do your first internship at a company with real data, this might save you a week of debugging.
1. Your notebook works at 1GB. It will betray you at 100GB.
The most common failure mode I saw: someone prototypes a pipeline on a sample dataset, it works fine, then it falls over spectacularly on the full dataset. Usually because of one of three things:
- `.collect()` somewhere it shouldn't be. This pulls the entire distributed dataset into the driver node's memory. Fine at 1GB, death at 100GB. Search your codebase for every `.collect()`, `.toPandas()`, and `.toLocalIterator()`: each one is a potential bomb.
- Unpartitioned writes. Writing a large DataFrame without partitioning means one node does all the work.
- Implicit broadcast joins. Spark tries to broadcast small tables automatically. When your "small" table grows past the threshold, the join silently switches strategy and performance craters.
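The threshold behaviour can be sketched in plain Python. The 10MB default matches Spark's `spark.sql.autoBroadcastJoinThreshold` setting; the function itself is an illustration of the size check, not Spark's actual planner logic:

```python
# Spark broadcasts a table when its estimated size is at or below
# spark.sql.autoBroadcastJoinThreshold (10MB by default). This sketch
# mirrors that check; the sizes below are illustrative.
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10MB, Spark's default

def will_broadcast(table_size_bytes: int,
                   threshold: int = DEFAULT_BROADCAST_THRESHOLD) -> bool:
    """Mirror Spark's size check: small tables get broadcast automatically."""
    return table_size_bytes <= threshold

# A "small" lookup table that quietly grows past the threshold flips the
# join strategy without any change to your code:
assert will_broadcast(2 * 1024 * 1024)       # 2MB lookup: broadcast join
assert not will_broadcast(50 * 1024 * 1024)  # 50MB later: sort-merge join
```

If a lookup table is genuinely small, it is safer to say so explicitly with a `broadcast()` hint in the join, so the plan doesn't silently change when the table grows.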
2. Partition strategy matters more than you think
We partitioned by date and aircraft ID. Sounds obvious in retrospect. The first version partitioned only by date, which meant queries filtering by aircraft had to scan every partition. The second version partitioned by both, which cut query time dramatically.
The general principle: partition by the columns you filter on most often. Not by the columns that make the most logical sense. These are sometimes different things.
```python
# Good: partition by what you actually filter on
df.write.partitionBy("flight_date", "aircraft_id") \
    .parquet("/mnt/data/operations/")

# Less good: partition by what seems "correct"
df.write.partitionBy("data_source", "year") \
    .parquet("/mnt/data/operations/")
```
Also: too many partitions is nearly as bad as too few. If your partition column has 500,000 unique values, you'll create 500,000 tiny files and your reads will be slow because of metadata overhead. Aim for partitions in the hundreds, maybe low thousands.
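The arithmetic is worth making explicit: partitioning by multiple columns multiplies their cardinalities. The numbers below are illustrative, and the result is an upper bound, since only column combinations that actually occur in the data produce directories:

```python
# A rough sketch of partition explosion: partitionBy(col1, col2, ...) can
# create up to the product of the columns' distinct-value counts.
def estimated_partition_count(*cardinalities: int) -> int:
    """Upper bound on partition directories for a multi-column partitionBy."""
    n = 1
    for c in cardinalities:
        n *= c
    return n

# A single 500,000-value column is already too many:
print(estimated_partition_count(500_000))   # 500000

# ~365 dates x ~300 aircraft is ~109,500 at worst; in practice far fewer,
# since not every aircraft flies every day:
print(estimated_partition_count(365, 300))  # 109500
```

Running this estimate before the first big write is cheaper than discovering the file explosion after it.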
3. Delta Lake is worth the overhead
We switched from raw Parquet to Delta Lake midway through the project. Three things that made it worth it:
- ACID transactions. Before Delta, a failed write could leave the table in a corrupt state. After Delta, writes are atomic.
- `OPTIMIZE` and `ZORDER`. These compact small files and co-locate related data. Ran `OPTIMIZE` nightly and saw 3-5x read speedup on our most common queries.
- Time travel. Being able to query the table as it was yesterday is invaluable when someone asks "did the numbers change?" and you need to prove they didn't (or explain why they did).
```sql
-- Compact small files and optimise for common query patterns
OPTIMIZE operations.flight_data
ZORDER BY (aircraft_id, flight_date)

-- Query yesterday's version of the table
SELECT * FROM operations.flight_data VERSION AS OF 42
```
4. Scheduling: make it boring
Our pipeline ran nightly via Databricks Workflows. The most important thing I did was make it boring:
- Every step logs its row count before and after transformation. If the count drops by more than 5%, the job sends an alert instead of continuing.
- Every output table has a `_pipeline_timestamp` column so we can always trace which run produced which data.
- Retry logic on the ingestion step only (where transient failures are common). No retries on transformation (where failures mean bugs).
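The row-count guard from the first bullet can be sketched in a few lines. The 5% threshold matches the text; the function name and the alert mechanism (here just a boolean, in the real pipeline a scheduler alert) are illustrative:

```python
# A minimal sketch of the row-count guard: compare counts before and after
# a transformation and flag drops beyond a tolerance instead of continuing.
def counts_within_tolerance(before: int, after: int,
                            max_drop: float = 0.05) -> bool:
    """Return True if the post-transformation row count is acceptable."""
    if before == 0:
        return after == 0
    drop = (before - after) / before
    return drop <= max_drop

assert counts_within_tolerance(1_000_000, 980_000)       # 2% drop: continue
assert not counts_within_tolerance(1_000_000, 900_000)   # 10% drop: alert
```

In a real job, the `False` branch would raise or page rather than return, so the pipeline halts before writing bad data downstream.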
The goal is that when something goes wrong — and it will — you can diagnose it from the logs without re-running anything.
5. The 40% came from one stupid thing
The "reduced processing time by 40%" line on my CV sounds impressive. The actual fix was embarrassing: the previous pipeline was reading the entire raw dataset every run, even though only the last day's data was new. I added incremental loading — check the max timestamp in the target table, only ingest rows newer than that.
```python
from pyspark.sql.functions import col

# Instead of this:
raw_df = spark.read.parquet("/mnt/raw/")

# Do this. The .collect() here is safe: the aggregate returns a single row.
# Note: on the very first run last_loaded is None, so fall back to a full
# read in that case.
last_loaded = spark.sql("""
    SELECT MAX(_pipeline_timestamp)
    FROM operations.flight_data
""").collect()[0][0]

raw_df = spark.read.parquet("/mnt/raw/") \
    .filter(col("ingestion_time") > last_loaded)
```
That's it. That's the 40%. Most performance improvements in real pipelines aren't clever algorithms — they're avoiding doing work you've already done.
Summary
If you're building your first serious data pipeline, the things that matter most are not the things your coursework emphasises. Partition strategy, incremental loading, monitoring, and file compaction will account for more of your performance than any choice of algorithm. The boring infrastructure work is the work that makes everything else possible.