Husein Ghadiali

Data Scientist

New York, NY



Jul 5, 2025

Building a Lightweight Data Validation Framework with PyTest and GitHub Actions

“In every data pipeline, there’s a moment of quiet risk — the part where you assume the input data is fine.” — a data engineer, moments before everything broke.


In analytics and machine learning pipelines, bad data is a silent killer. A single null in an ID field might break joins. A value like age = 130 could pollute forecasts. But many teams still rely on hope instead of testing.

This guide walks you through building a lightweight data validation framework using PyTest for writing tests and GitHub Actions for CI/CD automation. You’ll get:

  • A realistic dirty dataset

  • A suite of reusable PyTest checks

  • A GitHub Actions pipeline that enforces QA with every push

  • Real outputs, logs, and test results

No overengineered tools. Just clean, scalable validation that actually works.

🚨 Why This Matters

Let’s say you receive a weekly extract from your CRM vendor. It looks fine. But:

  • 4% of rows have signup_date after last_login

  • 7% are missing emails

  • Some users are None years old

  • One row has income = -1000

Now imagine you join this into a dashboard seen by leadership.

That’s why validation before ingestion matters. If you’re building pipelines without automated checks, every deployment is a gamble.
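Even before a full test suite, a few pandas expressions will surface every one of the problems above. Here's a minimal sketch using a tiny hand-made stand-in for the CRM extract (the toy DataFrame is my own, not from a real vendor feed):

```python
import pandas as pd

# Tiny stand-in for the weekly CRM extract described above
df = pd.DataFrame({
    "email": ["a@x.com", None, "b@x.com"],
    "age": [25, None, 30],
    "income": [55000, -1000, 72000],
    "signup_date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-03-01"]),
    "last_login": pd.to_datetime(["2022-02-01", "2022-01-10", "2022-01-15"]),
})

# Share of rows where signup_date falls after last_login
bad_dates = (df["signup_date"] > df["last_login"]).mean()

# Share of rows missing an email
missing_emails = df["email"].isna().mean()

# Count of rows with negative income
negative_income = (df["income"] < 0).sum()

print(bad_dates, missing_emails, negative_income)
```

Expressions like these are exactly what the PyTest suite below wraps in assertions, so a violation fails loudly instead of flowing downstream.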

🧪 Why PyTest?

PyTest is the gold standard for testing in Python — but most data teams overlook its power.

Key benefits:

  • ✅ Declarative: just use assert

  • 🔁 Reusable fixtures (perfect for loading shared data)

  • 📂 Auto-discovery (test_*.py, test_*)

  • 🧾 Clean failure messages

  • 🛠️ Works with notebooks, pipelines, Airflow, and CI/CD

Install it:

pip install pytest

Run it:

pytest tests

PyTest scans the tests/ folder for files starting with test_ and runs every function that starts with test_.
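As a concrete example of that convention, a file such as tests/test_smoke.py (the filename is my own choice) gets picked up automatically, and a bare assert is all PyTest needs:

```python
# tests/test_smoke.py -- discovered because both the file name
# and the function name start with "test_"
def test_ids_are_positive():
    ids = [1, 2, 3]  # stand-in for a real ID column
    assert all(i > 0 for i in ids), "Non-positive IDs found"
```

No base classes, no registration: drop the file in tests/ and it runs on the next `pytest tests` invocation.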

📊 Simulating Dirty Data

Let’s generate a dataset with realistic problems:

# src/etl.py
import pandas as pd
import numpy as np

def load_dirty_data():
    np.random.seed(42)
    n_rows = 100
    data = {
        "id": np.arange(1, n_rows + 1),
        "email": [f"user{i}@example.com" if i % 10 != 0 else None for i in range(1, n_rows + 1)],
        "age": [np.random.choice([25, 30, 45, 60, -5, 130, None]) for _ in range(n_rows)],
        "gender": [np.random.choice(["M", "F", "Other", "X", None]) for _ in range(n_rows)],
        "signup_date": pd.date_range(start="2022-01-01", periods=n_rows),
        "last_login": [pd.Timestamp("2022-01-01") + pd.to_timedelta(np.random.randint(-10, 300), unit='D') for _ in range(n_rows)],
        "country": [np.random.choice(["US", "UK", "IN", "Unknown", ""]) for _ in range(n_rows)],
        "income": [np.random.choice([55000, 72000, 96000, None, -1000]) for _ in range(n_rows)]
    }
    df = pd.DataFrame(data)
    df.loc[np.random.choice(n_rows, 5, replace=False), 'id'] = None
    duplicates = df.sample(5, random_state=1)
    df = pd.concat([df, duplicates], ignore_index=True)
    return df

🔍 Writing PyTest Checks

# tests/test_data_quality.py
import pytest
from src.etl import load_dirty_data

@pytest.fixture(scope='module')
def df():
    return load_dirty_data()

def test_no_missing_ids(df):
    assert df['id'].notnull().all(), "Missing 'id' values found"

def test_no_missing_emails(df):
    assert df['email'].notnull().all(), "Missing 'email' values found"

def test_valid_ages(df):
    valid = df['age'].between(0, 120).fillna(False)
    assert valid.all(), f"Invalid ages at rows: {df[~valid].index.tolist()}"

def test_valid_genders(df):
    valid = df['gender'].isin({'M', 'F', 'Other'}).fillna(False)
    assert valid.all(), f"Unexpected genders at: {df[~valid].index.tolist()}"

def test_login_after_signup(df):
    valid = df['last_login'] >= df['signup_date']
    assert valid.all(), f"Signup/login mismatch rows: {df[~valid].index.tolist()}"

def test_known_countries(df):
    valid = df['country'].isin({'US', 'UK', 'IN'}).fillna(False)
    assert valid.all(), f"Unknown countries: {df[~valid].index.tolist()}"

def test_non_negative_income(df):
    valid = df['income'].apply(lambda x: x is None or x >= 0)
    assert valid.all(), "Negative income values found"

def test_no_duplicates(df):
    dupes = df[df.duplicated()]
    assert len(dupes) == 0, f"Found {len(dupes)} duplicate rows"

✅ Sample Test Output

Running pytest tests/ returns:

FAILED tests/test_data_quality.py::test_no_missing_ids - AssertionError: Missing 'id' values found
FAILED tests/test_data_quality.py::test_valid_ages - AssertionError: Invalid ages at rows: [4, 13, 29]
FAILED tests/test_data_quality.py::test_valid_genders - AssertionError: Unexpected genders at: [22, 46, 98]

Each failure shows exactly which rows failed, making debugging much faster.
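That row-index pattern is easy to factor into a reusable helper so every check reports offending rows the same way. A possible sketch (the check_no_nulls name is my own, not from the test suite above):

```python
import pandas as pd

def check_no_nulls(df, column):
    """Assert that `column` has no nulls; on failure, the message
    lists the exact row indices that violated the check."""
    bad = df[df[column].isna()].index.tolist()
    assert not bad, f"Missing '{column}' values at rows: {bad}"

df = pd.DataFrame({"id": [1, None, 3, None]})
try:
    check_no_nulls(df, "id")
except AssertionError as e:
    print(e)  # Missing 'id' values at rows: [1, 3]
```

Because the message is built before the assert, PyTest shows it verbatim in the failure summary, so you can jump straight to the bad rows.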

⚙️ GitHub Actions: Automating Validation

We’ll add a .github/workflows/data-validation.yml file:

name: Data Validation

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install pandas numpy pytest
      - name: Run tests
        run: |
          pytest tests

✅ Live Feedback

After pushing to GitHub, go to the Actions tab. You’ll see:

✔️ Checkout code
✔️ Set up Python 3.10
✔️ Install dependencies
❌ Run tests (4 failed, 4 passed)

Click “Run tests” to see which tests failed and why — just like a failing unit test.

🧱 Folder Structure

data-validation-pipeline/
├── src/
│   └── etl.py
├── tests/
│   └── test_data_quality.py
├── .github/
│   └── workflows/
│       └── data-validation.yml
└── requirements.txt

requirements.txt:

pandas
numpy
pytest

🌍 Real-World Use Cases

  • ✅ Validate CRM exports before ingesting into Snowflake

  • ✅ Block PRs with failing data tests using GitHub branch protection

  • ✅ Detect schema drift in ML training data

  • ✅ Prevent stakeholder dashboards from silently breaking
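For the schema-drift case, one lightweight approach is to pin the dtypes of a known-good extract and compare on every run. A sketch (the EXPECTED_DTYPES dict and detect_schema_drift name are assumptions for illustration, not part of the framework above):

```python
import pandas as pd

# Expected schema, pinned when the pipeline was last known-good
EXPECTED_DTYPES = {"id": "int64", "age": "float64", "country": "object"}

def detect_schema_drift(df, expected=EXPECTED_DTYPES):
    """Return {column: (expected_dtype, actual_dtype)} for every
    mismatch, including columns that disappeared entirely."""
    drift = {}
    for col, dtype in expected.items():
        actual = str(df[col].dtype) if col in df.columns else "missing"
        if actual != dtype:
            drift[col] = (dtype, actual)
    return drift

df = pd.DataFrame({"id": [1, 2], "age": ["25", "30"], "country": ["US", "UK"]})
print(detect_schema_drift(df))  # {'age': ('float64', 'object')}
```

Wrapped in a test_ function, this turns a vendor silently switching a numeric column to strings into an immediate CI failure rather than a downstream surprise.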

🔚 Final Thoughts

You don’t need a heavyweight tool to enforce data quality. With just PyTest and GitHub Actions, you can:

  • Define clear, reusable validation logic

  • Catch silent data bugs before they spread

  • Integrate QA into your GitOps or analytics workflows

Start with small, testable rules. Add more as needed. And treat your data like code — with versioning, reviews, and validation.


If this helped, give it a clap — or better, show your implementation and tag me. Happy validating!
