Husein Ghadiali

Data Scientist

New York, NY



Jul 5, 2025

Building a Lightweight Data Validation Framework with PyTest and GitHub Actions

“In every data pipeline, there’s a moment of quiet risk — the part where you assume the input data is fine.” — a data engineer, moments before everything broke.


In analytics and machine learning pipelines, bad data is a silent killer. A single null in an ID field might break joins. A value like age = 130 could pollute forecasts. But many teams still rely on hope instead of testing.

This guide walks you through building a lightweight data validation framework using PyTest for writing tests and GitHub Actions for CI/CD automation. You’ll get:

  • A realistic dirty dataset

  • A suite of reusable PyTest checks

  • A GitHub Actions pipeline that enforces QA with every push

  • Real outputs, logs, and test results

No overengineered tools. Just clean, scalable validation that actually works.

🚨 Why This Matters

Let’s say you receive a weekly extract from your CRM vendor. It looks fine. But:

  • 4% of rows have signup_date after last_login

  • 7% are missing emails

  • Some users are None years old

  • One row has income = -1000

Now imagine you join this into a dashboard seen by leadership.

That’s why validation before ingestion matters. If you’re building pipelines without automated checks, every deployment is a gamble.
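Even before a full test suite, a few pandas expressions will surface every one of the problems above. Here's a minimal sketch using a tiny hand-made stand-in for the CRM extract (the toy DataFrame is my own, not from a real vendor feed):

```python
import pandas as pd

# Tiny stand-in for the weekly CRM extract described above
df = pd.DataFrame({
    "email": ["a@x.com", None, "b@x.com"],
    "age": [25, None, 30],
    "income": [55000, -1000, 72000],
    "signup_date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-03-01"]),
    "last_login": pd.to_datetime(["2022-02-01", "2022-01-10", "2022-01-15"]),
})

# Share of rows where signup_date falls after last_login
bad_dates = (df["signup_date"] > df["last_login"]).mean()

# Share of rows missing an email
missing_emails = df["email"].isna().mean()

# Count of rows with negative income
negative_income = (df["income"] < 0).sum()

print(bad_dates, missing_emails, negative_income)
```

Expressions like these are exactly what the PyTest suite below wraps in assertions, so a violation fails loudly instead of flowing downstream.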

🧪 Why PyTest?

PyTest is the gold standard for testing in Python — but most data teams overlook its power.

Key benefits:

  • ✅ Declarative: just use assert

  • 🔁 Reusable fixtures (perfect for loading shared data)

  • 📂 Auto-discovery (test_*.py, test_*)

  • 🧾 Clean failure messages

  • 🛠️ Works with notebooks, pipelines, Airflow, and CI/CD

Install it:

pip install pytest

Run it:

pytest tests

PyTest scans the tests/ folder for files starting with test_ and runs every function that starts with test_.
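As a concrete example of that convention, a file such as tests/test_smoke.py (the filename is my own choice) gets picked up automatically, and a bare assert is all PyTest needs:

```python
# tests/test_smoke.py -- discovered because both the file name
# and the function name start with "test_"
def test_ids_are_positive():
    ids = [1, 2, 3]  # stand-in for a real ID column
    assert all(i > 0 for i in ids), "Non-positive IDs found"
```

No base classes, no registration: drop the file in tests/ and it runs on the next `pytest tests` invocation.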

📊 Simulating Dirty Data

Let’s generate a dataset with realistic problems:

# src/etl.py
import pandas as pd
import numpy as np

def load_dirty_data():
    np.random.seed(42)
    n_rows = 100
    data = {
        "id": np.arange(1, n_rows + 1),
        "email": [f"user{i}@example.com" if i % 10 != 0 else None for i in range(1, n_rows + 1)],
        "age": [np.random.choice([25, 30, 45, 60, -5, 130, None]) for _ in range(n_rows)],
        "gender": [np.random.choice(["M", "F", "Other", "X", None]) for _ in range(n_rows)],
        "signup_date": pd.date_range(start="2022-01-01", periods=n_rows),
        "last_login": [pd.Timestamp("2022-01-01") + pd.to_timedelta(np.random.randint(-10, 300), unit='D') for _ in range(n_rows)],
        "country": [np.random.choice(["US", "UK", "IN", "Unknown", ""]) for _ in range(n_rows)],
        "income": [np.random.choice([55000, 72000, 96000, None, -1000]) for _ in range(n_rows)]
    }
    df = pd.DataFrame(data)
    df.loc[np.random.choice(n_rows, 5, replace=False), 'id'] = None
    duplicates = df.sample(5, random_state=1)
    df = pd.concat([df, duplicates], ignore_index=True)
    return df

🔍 Writing PyTest Checks

# tests/test_data_quality.py
import pytest
from src.etl import load_dirty_data

@pytest.fixture(scope='module')
def df():
    return load_dirty_data()

def test_no_missing_ids(df):
    assert df['id'].notnull().all(), "Missing 'id' values found"

def test_no_missing_emails(df):
    assert df['email'].notnull().all(), "Missing 'email' values found"

def test_valid_ages(df):
    valid = df['age'].between(0, 120).fillna(False)
    assert valid.all(), f"Invalid ages at rows: {df[~valid].index.tolist()}"

def test_valid_genders(df):
    valid = df['gender'].isin({'M', 'F', 'Other'}).fillna(False)
    assert valid.all(), f"Unexpected genders at: {df[~valid].index.tolist()}"

def test_login_after_signup(df):
    valid = df['last_login'] >= df['signup_date']
    assert valid.all(), f"Signup/login mismatch rows: {df[~valid].index.tolist()}"

def test_known_countries(df):
    valid = df['country'].isin({'US', 'UK', 'IN'}).fillna(False)
    assert valid.all(), f"Unknown countries: {df[~valid].index.tolist()}"

def test_non_negative_income(df):
    valid = df['income'].apply(lambda x: x is None or x >= 0)
    assert valid.all(), "Negative income values found"

def test_no_duplicates(df):
    dupes = df[df.duplicated()]
    assert len(dupes) == 0, f"Found {len(dupes)} duplicate rows"

✅ Sample Test Output

Running pytest tests/ returns:

FAILED tests/test_data_quality.py::test_no_missing_ids - AssertionError: Missing 'id' values found
FAILED tests/test_data_quality.py::test_valid_ages - AssertionError: Invalid ages at rows: [4, 13, 29]
FAILED tests/test_data_quality.py::test_valid_genders - AssertionError: Unexpected genders at: [22, 46, 98]

Each failure shows exactly which rows failed, making debugging much faster.
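That row-index pattern is easy to factor into a reusable helper so every check reports offending rows the same way. A possible sketch (the check_no_nulls name is my own, not from the test suite above):

```python
import pandas as pd

def check_no_nulls(df, column):
    """Assert that `column` has no nulls; on failure, the message
    lists the exact row indices that violated the check."""
    bad = df[df[column].isna()].index.tolist()
    assert not bad, f"Missing '{column}' values at rows: {bad}"

df = pd.DataFrame({"id": [1, None, 3, None]})
try:
    check_no_nulls(df, "id")
except AssertionError as e:
    print(e)  # Missing 'id' values at rows: [1, 3]
```

Because the message is built before the assert, PyTest shows it verbatim in the failure summary, so you can jump straight to the bad rows.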

⚙️ GitHub Actions: Automating Validation

We’ll add a .github/workflows/data-validation.yml file:

name: Data Validation

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install pandas numpy pytest
      - name: Run tests
        run: |
          pytest tests

✅ Live Feedback

After pushing to GitHub, go to the Actions tab. You’ll see:

✔️ Checkout code
✔️ Set up Python 3.10
✔️ Install dependencies
❌ Run tests (4 failed, 4 passed)

Click “Run tests” to see which tests failed and why — just like a failing unit test.

🧱 Folder Structure

data-validation-pipeline/
├── src/
│   └── etl.py
├── tests/
│   └── test_data_quality.py
├── .github/
│   └── workflows/
│       └── data-validation.yml
└── requirements.txt

requirements.txt:

pandas
numpy
pytest

🌍 Real-World Use Cases

  • ✅ Validate CRM exports before ingesting into Snowflake

  • ✅ Block PRs with failing data tests using GitHub branch protection

  • ✅ Detect schema drift in ML training data

  • ✅ Prevent stakeholder dashboards from silently breaking
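For the schema-drift case, one lightweight approach is to pin the dtypes of a known-good extract and compare on every run. A sketch (the EXPECTED_DTYPES dict and detect_schema_drift name are assumptions for illustration, not part of the framework above):

```python
import pandas as pd

# Expected schema, pinned when the pipeline was last known-good
EXPECTED_DTYPES = {"id": "int64", "age": "float64", "country": "object"}

def detect_schema_drift(df, expected=EXPECTED_DTYPES):
    """Return {column: (expected_dtype, actual_dtype)} for every
    mismatch, including columns that disappeared entirely."""
    drift = {}
    for col, dtype in expected.items():
        actual = str(df[col].dtype) if col in df.columns else "missing"
        if actual != dtype:
            drift[col] = (dtype, actual)
    return drift

df = pd.DataFrame({"id": [1, 2], "age": ["25", "30"], "country": ["US", "UK"]})
print(detect_schema_drift(df))  # {'age': ('float64', 'object')}
```

Wrapped in a test_ function, this turns a vendor silently switching a numeric column to strings into an immediate CI failure rather than a downstream surprise.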

🔚 Final Thoughts

You don’t need a heavyweight tool to enforce data quality. With just PyTest and GitHub Actions, you can:

  • Define clear, reusable validation logic

  • Catch silent data bugs before they spread

  • Integrate QA into your GitOps or analytics workflows

Start with small, testable rules. Add more as needed. And treat your data like code — with versioning, reviews, and validation.


If this helped, give it a clap — or better, show your implementation and tag me. Happy validating!
