The importance of pipelines: Why work hard when you can work smart?
Imagine waking up every morning knowing you have to manually retrieve, clean, and upload data. It’s like making coffee by picking the beans, roasting them, grinding them, and then brewing the coffee, all before you've had your first sip. Exhausting, right?
Data pipelines exist to save us from this never-ending grind. They help automate the flow of data, ensuring that everything runs smoothly and efficiently without needing a person to hit “Run” on a SQL query every morning. But what happens when pipelines don’t exist? Let’s step into the chaotic world of John Doe.
The story of John Doe: The daily data marathon
John Doe is a data scientist working for the Open Data team at XYZ. Every morning, he logs into his database, writes queries, filters the data, and uploads it to the Open Data Portal. One dataset, two datasets, three… twenty. By the time John finishes, it’s already afternoon, and he still hasn’t had lunch. At first, this was fine. But as the number of datasets grew, John’s workflow became a bottleneck. More queries, more filters, more uploads. The process that once took an hour now consumes his entire day.
And the consequences?
Delays in data publishing: The public isn’t getting fresh data on time.
Errors creep in: A tired John is more likely to make mistakes.
Burnout is real: Who wants to spend all day running SQL queries manually?
John wonders, “There must be a better way…”

The story of Jane Doe: A smart fix, but not quite there
Meanwhile, over at ABC, Jane Doe faces the same challenge. But she’s got a trick up her sleeve. Instead of running queries manually, she writes Python scripts to automate the process. With just a few keystrokes, her script pulls, transforms, and prepares the data. Efficient, right?
Well… almost.
Jane still has to manually upload the data. Every. Single. Time. But it’s fine—until the day she takes a well-earned vacation.
The problem?
No one else has her scripts.
No one else has her credentials.
No data gets uploaded while she’s gone.
The result? A total data blackout.
When Jane returns, her inbox is flooded with messages: “Where’s the latest dataset?” “Why wasn’t the portal updated?” “Did something break?” She realizes her “automated” solution isn’t scalable: It still depends on her being there.

The case for automation: Why do the work when Airflow can?
What do John and Jane have in common?
They’re both spending unnecessary time on manual processes.
They’re both single points of failure.
Their workflows don’t scale with more data and more demand.
So, what’s the solution? Automated pipelines using Apache Airflow!
Why automated pipelines make a difference
Queries, transformations, and uploads happen like magic—no babysitting required. The best part? No single points of failure. If John takes a day off, data still flows. If Jane finally takes that long-overdue vacation, the Open Data Portal won’t even notice. Everything runs on schedule, updating like clockwork, so there’s no more waiting around or playing catch-up.
Thanks to Airflow, John doesn’t need to spend his mornings babysitting SQL queries, and Jane isn’t rushing to upload files before heading out. Their data pipelines now run on autopilot.
So, why settle for chaos when bliss is just an automated pipeline away?
Open Data team × Data Engineering: The perfect recipe for automation
John and Jane had been battling their daily data struggles for too long. Automation was the answer, but getting there wasn’t just about writing code. It required a strong partnership between the Open Data team and the Data Engineering team, where each side played a crucial role in making the transition successful.
While the Data Engineering team focused on building the infrastructure, ensuring that pipelines could scale and data could move securely, the Open Data team worked on understanding what users actually needed. We weren’t just making things more efficient behind the scenes. We were improving the data itself to make it more accessible, insightful, and relevant.
As automation took shape, we weren’t just moving data faster; we were shaping datasets based on real-world needs. The Data Engineering team provided the technical foundation, laying out the tools, pipelines, and processes. The Open Data team, in turn, used these ingredients to “bake the perfect batch of Open Data cookies” for the public.
This collaboration ensured that automation wasn’t just about reducing manual work. It was about building a reliable, optimized way to deliver structured data and a single source of truth, not only for external users but for internal ones as well.
Now, let’s take a look at how all the pieces came together to transform our workflow.
This is how we roll at the MTA
John and Jane have finally found their solution: automation. But what exactly does this look like at the MTA?
Before we dive into the automation, give our blog post How we build Analytics at scale at the MTA a read.
To automate the process of uploading datasets to the Open Data Portal, the MTA’s data engineers built a robust infrastructure that extracts, transforms, and loads data without manual intervention. No more late uploads, no more burnout—just a smooth and efficient pipeline.

This may look like a lot, but let’s walk through it together.
Before we could even think about automating the workflow, we had to tackle a fundamental problem: How do we securely transfer data? Manually downloading and uploading files wasn’t just tedious; it was risky. The MTA shares our data through New York State's Open Data Portal at data.ny.gov. Their IT team provided us with an SFTP endpoint for our automated uploads. So, working backwards, we knew the last step in our process needed to end with dropping data there.
You’re probably wondering, “What is SFTP?” SFTP (Secure File Transfer Protocol) is like a high-security armored truck for your data, ensuring that it moves safely between the source and destination. Instead of relying on error-prone manual uploads, our automated pipelines encrypt and securely transfer datasets, guaranteeing that every file arrives intact.
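Under the hood, an SFTP upload is only a few lines of code. Here’s a minimal sketch using the paramiko library; the host, credentials, and file paths are placeholders for illustration, not the MTA’s actual configuration (in practice you’d use key-based authentication and managed secrets).

```python
# A minimal sketch of an SFTP upload with paramiko.
# Host, credentials, and paths below are placeholders, not real endpoints.
import paramiko

def upload_via_sftp(local_path: str, remote_path: str) -> None:
    """Upload a single file to an SFTP endpoint over an encrypted channel."""
    transport = paramiko.Transport(("sftp.example.ny.gov", 22))  # placeholder host
    try:
        transport.connect(username="open_data_bot", password="***")  # placeholder credentials
        sftp = paramiko.SFTPClient.from_transport(transport)
        sftp.put(local_path, remote_path)  # encrypted transfer over SSH
        sftp.close()
    finally:
        transport.close()

# Example: push today's extract to the portal's drop folder
# upload_via_sftp("exports/ridership.csv", "/uploads/ridership.csv")
```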
Now that we had a secure way to transfer data, the next question was: How do we automate the process so that John and Jane never have to hit ‘Run’ again?
The answer? Apache Airflow. Think of Airflow as a personal data butler—one that runs tasks on a schedule and makes sure everything happens exactly when it should. With Airflow, we built an automated system that extracts data from multiple databases, runs transformations using SQL queries, and loads final datasets to the Open Data Portal, all without human intervention.
No more 9 a.m. SQL query runs. No more panic when Jane is out of office. Airflow ensures everything runs as scheduled.
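Conceptually, the whole pipeline boils down to a handful of scheduled, dependent tasks. Here’s a simplified sketch of what such a DAG could look like, assuming a recent Airflow 2.x release; the DAG name, schedule, and placeholder task bodies are ours for illustration, not the MTA’s production code.

```python
# A simplified sketch of a daily Open Data DAG in Apache Airflow.
# DAG id, schedule, and task bodies are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull raw data from the source databases

def transform():
    ...  # run the SQL transformations that prepare the public dataset

def upload():
    ...  # push the final files to the Open Data Portal via SFTP

with DAG(
    dag_id="open_data_daily_refresh",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # runs every morning, no human required
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    upload_task = PythonOperator(task_id="upload", python_callable=upload)

    # Run in order: extract, then transform, then upload
    extract_task >> transform_task >> upload_task
```

A production setup would lean on purpose-built operators and managed connections rather than bare PythonOperator tasks, but the shape is the same: a few dependent steps on a daily schedule.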
Before datasets can be published, they need to be extracted from their original source. This used to be a manual task: John and Jane would query the source databases to extract the data. Now, thanks to in-house SQL loaders built by data engineers at the MTA, this process is fully automated.

These loaders query the source databases and store the extracted data in Delta table format in the Data Lake, providing a seamless hand-off to the next steps in the pipeline.
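The loaders themselves are in-house tools, but the core idea can be sketched with PySpark: read from a source database over JDBC and land the result as a Delta table in the lake. The connection string, table name, and lake path below are placeholders, and the sketch assumes a Spark session configured with the Delta Lake extensions and a JDBC driver on the classpath.

```python
# A rough approximation of a SQL loader step using PySpark.
# Connection details, table names, and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open_data_loader").getOrCreate()

# Read the source table over JDBC (placeholder connection details)
raw = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db.example.com/ridership")
    .option("dbtable", "public.daily_ridership")
    .option("user", "loader")
    .option("password", "***")
    .load()
)

# Land the extract in the Data Lake as a Delta table for downstream steps
(
    raw.write.format("delta")
    .mode("overwrite")
    .save("s3://data-lake/raw/daily_ridership")  # placeholder lake path
)
```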
With data now extracted and stored, the next challenge was making sure it was ready for public use.
Scheduled SQL transformations now clean and reshape each dataset before it goes public. No more copy-pasting SQL queries. No more data wrangling by hand. Just automated, scalable transformations that ensure users get clean data every time.
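As a rough illustration, a transformation step might register the raw Delta extract and run the same kind of SQL John once ran by hand, now on a schedule. The query, column names, and paths here are illustrative only, not the published schema.

```python
# A minimal sketch of an automated transformation step using Spark SQL.
# Table paths, view names, and the query itself are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("open_data_transform").getOrCreate()

# Register the raw Delta extract so it can be queried with plain SQL
raw = spark.read.format("delta").load("s3://data-lake/raw/daily_ridership")
raw.createOrReplaceTempView("daily_ridership_raw")

# The same kind of query John used to run by hand, now run on a schedule
clean = spark.sql("""
    SELECT
        service_date,
        mode,
        SUM(ridership) AS total_ridership
    FROM daily_ridership_raw
    WHERE ridership IS NOT NULL
    GROUP BY service_date, mode
""")

# Write the publish-ready dataset back to the Data Lake
clean.write.format("delta").mode("overwrite") \
    .save("s3://data-lake/curated/daily_ridership")
```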
The last and most important step in the pipeline is getting the transformed datasets onto the Open Data Portal. This used to be the most painful part of Jane’s job: She had to manually upload CSV files one by one, every single day. And if she wasn’t there? No updates. No data. No progress.
Time to pull out the big guns: Operation Delta-to-SFTP!

Data engineers at the MTA built a Delta-to-SFTP module that takes care of all of Jane’s concerns. This module performs reverse ETL (extract, transform, load), pulling data from the Data Lake and storing it in Blob Storage.

Once extraction is done, this module fetches the latest transformed CSV files from Blob Storage, uploads them directly to the Open Data Portal via SFTP, and runs on schedule, without anyone lifting a finger.
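To make the idea concrete, here’s a simplified sketch of that reverse-ETL step using the open-source deltalake and paramiko packages: read the curated Delta table, write out the CSV the portal expects (staged locally here, where the real module stages in Blob Storage), and push it over SFTP. Every name and path is a placeholder; the actual Delta-to-SFTP module is an internal tool with far more error handling and scheduling logic.

```python
# A simplified sketch of a Delta-to-SFTP "reverse ETL" step.
# All hosts, credentials, and paths are placeholders for illustration.
import paramiko
from deltalake import DeltaTable

def delta_to_sftp(delta_path: str, staging_csv: str, remote_path: str) -> None:
    # 1. Extract: read the curated Delta table from the Data Lake
    df = DeltaTable(delta_path).to_pandas()

    # 2. Stage: write the CSV file the Open Data Portal expects
    df.to_csv(staging_csv, index=False)

    # 3. Load: deliver the file to the portal's SFTP endpoint
    transport = paramiko.Transport(("sftp.example.ny.gov", 22))  # placeholder host
    try:
        transport.connect(username="open_data_bot", password="***")  # placeholder credentials
        sftp = paramiko.SFTPClient.from_transport(transport)
        sftp.put(staging_csv, remote_path)
        sftp.close()
    finally:
        transport.close()

# delta_to_sftp(
#     "s3://data-lake/curated/daily_ridership",  # placeholder lake path
#     "/tmp/daily_ridership.csv",
#     "/uploads/daily_ridership.csv",
# )
```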
The result? A seamless, fully automated pipeline that ensures Open Data is always updated, secure, and accessible.
Now that we've shared how automation revolutionized Open Data uploads, it's your turn! We’d love to hear how you're implementing automated processes in your analytical workflows. Whether you're streamlining data collection or improving reporting accuracy, share your story with us at opendata@mtahq.org.
About the authors
Rahnuma Tarannum is a Data Science Specialist on the Data & Analytics team, focusing on Open Data. Check out the MTA’s open datasets by visiting data.ny.gov.
Shubham Mittal is a Data Engineer on the Data & Analytics team.