Jul 5, 2025
Building a Lightweight Data Validation Framework with PyTest and GitHub Actions
“In every data pipeline, there’s a moment of quiet risk — the part where you assume the input data is fine.” — a data engineer, moments before everything broke.
In analytics and machine learning pipelines, bad data is a silent killer. A single null in an ID field might break joins. A value like age = 130 could pollute forecasts. But many teams still rely on hope instead of testing.
This guide walks you through building a lightweight data validation framework using PyTest for writing tests and GitHub Actions for CI/CD automation. You’ll get:
A realistic dirty dataset
A suite of reusable PyTest checks
A GitHub Actions pipeline that enforces QA with every push
Real outputs, logs, and test results
No overengineered tools. Just clean, scalable validation that actually works.
🚨 Why This Matters
Let’s say you receive a weekly extract from your CRM vendor. It looks fine. But:
4% of rows have signup_date after last_login
7% are missing emails
Some users are None years old
One row has income = -1000
Now imagine you join this into a dashboard seen by leadership.
That’s why validation before ingestion matters. If you’re building pipelines without automated checks, every deployment is a gamble.
🧪 Why PyTest?
PyTest is the gold standard for testing in Python — but most data teams overlook its power.
Key benefits:
✅ Declarative: just use assert
🔁 Reusable fixtures (perfect for loading shared data)
📂 Auto-discovery (test_*.py files, test_* functions)
🧾 Clean failure messages
🛠️ Works with notebooks, pipelines, Airflow, and CI/CD
Install it:
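PyTest installs with pip; pandas is included here because the examples below assume it for loading the dataset:

```shell
pip install pytest pandas
```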
Run it:
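From the project root, point PyTest at the tests folder:

```shell
pytest tests/
```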
PyTest scans the tests/ folder for files starting with test_ and runs every function that starts with test_.
📊 Simulating Dirty Data
Let’s generate a dataset with realistic problems:
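A minimal sketch of such a generator, injecting exactly the problems described above. The data/users.csv path, column names, and row count are assumptions for illustration:

```python
import os

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100

# Start with a clean, plausible user table
df = pd.DataFrame({
    "user_id": range(1, n + 1),
    "email": [f"user{i}@example.com" for i in range(n)],
    "age": rng.integers(18, 70, size=n).astype(float),
    "income": rng.integers(20_000, 120_000, size=n).astype(float),
    "signup_date": pd.to_datetime("2024-01-01")
    + pd.to_timedelta(rng.integers(0, 365, size=n), unit="D"),
})
df["last_login"] = df["signup_date"] + pd.to_timedelta(
    rng.integers(0, 200, size=n), unit="D"
)

# Inject the dirt: missing emails, impossible date order,
# missing ages, and a negative income
df.loc[df.sample(frac=0.07, random_state=1).index, "email"] = None
df.loc[df.sample(frac=0.04, random_state=2).index, "last_login"] -= pd.Timedelta(days=500)
df.loc[df.sample(n=3, random_state=3).index, "age"] = None
df.loc[0, "income"] = -1000

os.makedirs("data", exist_ok=True)
df.to_csv("data/users.csv", index=False)
```

Fixing the random seeds keeps the "dirt" reproducible, so test failures are deterministic while you develop the checks.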
🔍 Writing PyTest Checks
✅ Sample Test Output
Running pytest tests/ prints a pass/fail marker per test, followed by a full traceback and assertion message for every failure.
Each failure shows exactly which rows failed, making debugging much faster.
⚙️ GitHub Actions: Automating Validation
We’ll add a .github/workflows/data-validation.yml file:
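A minimal workflow along these lines (the Python version, trigger branches, and paths are assumptions you should adapt):

```yaml
name: Data Validation

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run tests
        run: pytest tests/
```

Because pytest exits non-zero on any failure, a single failing check fails the "Run tests" step and marks the whole workflow red.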
✅ Live Feedback
After pushing to GitHub, open the Actions tab to watch the workflow run. Expand the “Run tests” step to see which tests failed and why, just like a failing unit test.
🧱 Folder Structure
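A layout consistent with the paths used above (file names are assumptions):

```
.
├── .github/
│   └── workflows/
│       └── data-validation.yml
├── data/
│   └── users.csv
├── tests/
│   └── test_users.py
└── requirements.txt
```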
requirements.txt:
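Only two dependencies are needed for this setup (pin versions in a real project):

```
pandas
pytest
```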
🌍 Real-World Use Cases
✅ Validate CRM exports before ingesting into Snowflake
✅ Block PRs with failing data tests using GitHub branch protection
✅ Detect schema drift in ML training data
✅ Prevent stakeholder dashboards from silently breaking
🔚 Final Thoughts
You don’t need a heavyweight tool to enforce data quality. With just PyTest and GitHub Actions, you can:
Define clear, reusable validation logic
Catch silent data bugs before they spread
Integrate QA into your GitOps or analytics workflows
Start with small, testable rules. Add more as needed. And treat your data like code — with versioning, reviews, and validation.
If this helped, give it a clap — or better, show your implementation and tag me. Happy validating!



