PySpark Functional Programming: Stop Writing Imperative Spark Pipelines
On a recent project I had to review a set of PySpark notebooks in Microsoft Fabric — 14 notebooks, some over 3,000 lines long, hundreds of cells, multiple data domains crammed into a single file. The code worked, but reading it felt like archaeology. Every notebook started the same way: df = spark.read..., then df = df.withColumn(...) repeated dozens of times, sprinkled with display(df) calls and bare except: blocks. I kept asking myself: how did we end up writing Spark code like this?