🚀 ValidateX

A powerful, extensible data quality validation framework for Python.

ValidateX provides a comprehensive suite of tools for validating, profiling, and monitoring data quality across Pandas, Polars, and PySpark DataFrames, as well as Push-Down SQL engines. Inspired by Great Expectations, it offers a simpler, more focused approach with modern, production-ready HTML reports and an intuitive API.

🖼️ Report Preview

[Screenshot: ValidateX report overview]

[Screenshot: Column Health Summary with mini bar charts]

[Screenshot: Expectations table, severity-tagged with human-readable output]


🤔 Why ValidateX?

Feature ValidateX Great Expectations Pandera
Easy Setup pip install → validate in 5 lines ⚠️ Heavy multi-step context setup ✅ Decorator & Schema API
HTML Report Generator ✅ Modern dark-theme report ✅ Data docs site ❌ Basic schema error logs
Data Quality Score ✅ Weighted (0–100) business score
PySpark Support ✅ Distributed DataFrame support ⚠️ Limited PySpark engine
Polars Support ✅ Multi-threaded Rust support ⚠️
Push-Down SQL Native ✅ Postgres, Snowflake, BigQuery, DuckDB
Data Drift (PSI) ✅ Built-in PSI & Schema shifts ❌ Separate plugins
Airflow Operator ValidateXOperator built-in ⚠️ External provider
Webhook Alerts ✅ Slack & Teams built-in ⚠️ Complex webhook setups

ValidateX is not a replacement for Great Expectations — it's a focused alternative for teams that want production-grade data validation without the overhead.


⚡ Performance Benchmarks

Execution performance measured on standard datasets across Pandas, Polars, and PySpark engines (run via python -m benchmarks.benchmark_suite):

| Dataset Size | Engine | Execution Time | Peak Memory | Setup / API Lines |
|---|---|---|---|---|
| 100,000 rows | Pandas | 0.04s | 2.1 MB | 5 lines |
| 100,000 rows | Polars | 0.03s | < 1 MB | 5 lines |
| 100,000 rows | PySpark | 55.65s (local JVM init) | < 1 MB | 5 lines |
| 1,000,000 rows | Pandas | 0.13s | 33.2 MB | 5 lines |
| 1,000,000 rows | Polars | 0.21s | < 1 MB | 5 lines |
| 1,000,000 rows | PySpark | 52.21s | < 1 MB | 5 lines |
| 10,000,000 rows | Pandas | 2.22s | 268.5 MB | 5 lines |
| 10,000,000 rows | Polars | 4.77s | < 1 MB | 5 lines |
| 10,000,000 rows | PySpark | Distributed Cluster | Distributed | 5 lines |

💼 Real-World Industry Showcases

Explore production-ready validation examples tailored for major data domains in examples/showcase/.


🎯 Who Is This For?

  • Startup data teams — Ship data quality checks in minutes, not days
  • ML engineers — Validate feature stores and training data before model runs
  • CI/CD pipelines — Gate deployments on data quality with a single CLI command
  • Analytics teams — Catch data issues before they reach dashboards
  • dbt users — Lightweight validation alongside your transformation layer
  • Data platform teams — Monitor data quality across dozens of tables

✨ Features

| Feature | Description |
|---|---|
| 50+ Built-in Expectations | Column-level, table-level, format, statistical, and sequential cross-validations |
| Push-Down SQL Native | Execute core validation via SQLAlchemy directly on Postgres, Snowflake, or BigQuery |
| Quad-Engine Support | Pandas, Polars, PySpark, and SQL execution engines |
| 🔔 Webhook Alerting | Built-in Slack (Block Kit) and Microsoft Teams (MessageCard) notifications |
| 🎯 Data Quality Score | Weighted score (0–100) based on severity of checks |
| 🔴🟡🔵 Severity Levels | Critical / Warning / Info classification for every expectation |
| 📊 Column Health Summary | At-a-glance per-column health with mini bar charts |
| 📈 Data Drift Detection | Calculate Population Stability Index (PSI) and schema shifts between datasets |
| 🧩 Airflow Integration | Natively gate data pipelines via ValidateXOperator |
| Data Profiling | Auto-analyse datasets and suggest expectations |
| YAML/JSON Config | Define expectations declaratively |
| CLI Interface | Run validations from the command line |
| Clean Output | All values are native Python types, with zero NumPy leakage |

📦 Installation

# Basic install
pip install validatex

# With PySpark support
pip install "validatex[spark]"

# With database support
pip install "validatex[database]"

# Full install
pip install "validatex[all]"

# Development
pip install "validatex[dev]"

🏁 Quick Start

Python API

import pandas as pd
import validatex as vx

# Create your data
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 42],
    "email": ["alice@test.com", "bob@test.com", "charlie@test.com",
              "diana@test.com", "eve@test.com"],
    "status": ["active", "active", "inactive", "active", "pending"],
})

# Build an expectation suite
suite = (
    vx.ExpectationSuite("user_quality")
    .add("expect_column_to_not_be_null", column="user_id")
    .add("expect_column_values_to_be_unique", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
    .add("expect_column_values_to_be_in_set",
         column="status", value_set=["active", "inactive", "pending"])
    .add("expect_column_values_to_match_regex",
         column="email", regex=r"^[\w.]+@[\w]+\.\w+$")
)

# Validate
result = vx.validate(df, suite)

# Print summary (includes Quality Score)
print(result.summary())

# Generate reports
result.to_html("report.html")
result.to_json_file("report.json")

CLI

# Initialize a project
validatex init

# Profile a dataset
validatex profile --data data.csv --suggest --output auto_suite.yaml

# Run validation
validatex validate --data data.csv --suite suite.yaml --report report.html

# Run checkpoint
validatex run --checkpoint checkpoint.yaml

# List available expectations
validatex list-expectations

📚 Documentation Hub

Explore comprehensive guides, FAQs, and migration documentation:

📰 Technical Articles & Guides


🗄️ Push-Down SQL Native Validation

ValidateX can validate data directly inside your database, even at terabyte scale, without loading DataFrames into Python memory. Under the hood, it generates native aggregate queries (such as SELECT COUNT(*)) that run in the database engine.

import validatex as vx
from sqlalchemy import create_engine

# 1. Connect to any database (PostgreSQL, Snowflake, BigQuery, etc.)
engine = create_engine("postgresql://user:pass@host/db")

# 2. Build your expectation suite
suite = (
    vx.ExpectationSuite("users_table_checks")
    .add("expect_table_row_count_to_be_between", min_value=1_000_000, max_value=5_000_000)
    .add("expect_column_to_not_be_null", column="email")
    .add("expect_column_values_to_be_unique", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=18, max_value=120)
)

# 3. Validate directly against the SQL table (Zero Pandas overhead!)
result = vx.validate(
    data="prod_users",     # Just pass the table name
    suite=suite,
    engine="sql",          # Tells ValidateX to use Push-Down SQL
    sql_engine=engine      # The SQLAlchemy database connection
)

print(f"Data Quality Score: {result.compute_quality_score()}/100")
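To make the push-down idea concrete, here is a minimal sketch of how a null check can be expressed as a single aggregate query, so that only two numbers travel back to Python. It uses the standard-library sqlite3 module so it runs anywhere; the null_check helper and the returned dictionary keys are illustrative assumptions, not ValidateX internals.

```python
import sqlite3

# In-memory stand-in for a production table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prod_users (user_id INTEGER, email TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO prod_users VALUES (?, ?, ?)",
    [(1, "a@test.com", 25), (2, None, 30), (3, "c@test.com", 17)],
)


def null_check(conn, table, column):
    """Hypothetical helper: count rows and NULLs in one aggregate query."""
    # Identifiers are interpolated directly for brevity; real code should
    # quote or allow-list table and column names.
    sql = (
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) "
        f"FROM {table}"
    )
    total, nulls = conn.execute(sql).fetchone()
    nulls = nulls or 0  # SUM returns NULL on an empty table
    return {"success": nulls == 0, "element_count": total, "unexpected_count": nulls}


result = null_check(conn, "prod_users", "email")
print(result)
```

The database scans the table once and returns a single row, which is why this style of check scales with the warehouse rather than with Python memory.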

🤖 Automate with CI/CD

ValidateX is designed to be lightweight and CI-friendly. You can easily integrate it into your GitHub Actions, GitLab CI, or Jenkins pipelines to gate deployments on data quality.

Example: GitHub Actions

name: Data Quality Validation
on: [push, pull_request]

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          
      - name: Install ValidateX
        run: pip install validatex
        
      - name: Run Data Validation
        run: |
          validatex validate \
            --data data/production_data.csv \
            --suite tests/data_quality/suite.yaml \
            --report dq_report.html
            
      - name: Upload validation report
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: validatex-report
          path: dq_report.html

🧩 Apache Airflow Integration

ValidateX includes a native Apache Airflow operator that gates your ETL pipelines on the Data Quality Score.

from validatex.integrations.airflow import ValidateXOperator

# This task will FAIL the Airflow DAG if the data quality score is < 95.0
validate_data = ValidateXOperator(
    task_id="ensure_data_quality",
    suite=suite,
    data_path="s3://my-bucket/daily_users.parquet",
    data_format="parquet",
    min_score=95.0, 
    report_path="/tmp/validatex_daily_report.html"
)

🔔 Slack & Teams Webhook Alerts

Send real-time notifications to engineering channels whenever validations run or data quality thresholds fail.

import validatex as vx

result = vx.validate(df, suite)

# Send Slack alert (Block Kit format)
result.send_slack(webhook_url="https://hooks.slack.com/services/...", notify_on="failure")

# Send Teams alert (MessageCard format)
result.send_teams(webhook_url="https://outlook.office.com/webhook/...", notify_on="failure")

Or trigger alerts automatically via the CLI:

validatex validate --data data.csv --suite suite.yaml --slack-webhook $SLACK_URL --notify-on failure
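For readers curious what a Block Kit alert looks like on the wire, the following is a rough sketch of the notify-on gating and payload shape, using only the standard library. The function names, the message wording, and the "success" and "always" values for notify_on are assumptions made for illustration; only "failure" appears in the examples above, and ValidateX's actual payload may differ.

```python
import json
import urllib.request


def should_notify(success, notify_on="failure"):
    """Decide whether to send an alert.

    "failure" comes from the examples above; "success" and "always"
    are hypothetical values added for illustration.
    """
    if notify_on == "always":
        return True
    if notify_on == "failure":
        return not success
    if notify_on == "success":
        return success
    raise ValueError(f"unknown notify_on value: {notify_on!r}")


def build_slack_payload(suite_name, success, score):
    """Build a Slack incoming-webhook payload using Block Kit blocks."""
    status = "PASSED" if success else "FAILED"
    return {
        # Plain-text fallback shown in notifications.
        "text": f"ValidateX suite '{suite_name}' {status} (score {score}/100)",
        "blocks": [
            {"type": "header",
             "text": {"type": "plain_text", "text": f"ValidateX: {suite_name}"}},
            {"type": "section",
             "text": {"type": "mrkdwn",
                      "text": f"*Status:* {status}\n*Quality score:* {score}/100"}},
        ],
    }


def post_json(webhook_url, payload):
    """POST the payload to a webhook URL (defined for completeness, not called here)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


payload = build_slack_payload("user_quality", success=False, score=66.67)
if should_notify(success=False, notify_on="failure"):
    print(json.dumps(payload, indent=2))  # swap for post_json(url, payload) to send
```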

📈 Data Drift Detection (PSI)

Detect distribution changes between two datasets. ValidateX calculates the Population Stability Index (PSI) per column and reports schema changes natively, without heavy dependencies.

import validatex as vx

# Compare Yesterday's data vs Today's data
detector = vx.DriftDetector(psi_threshold=0.2)
report = detector.compare(yesterday_df, today_df)

print(report.summary())

Output:

============================================================
  ValidateX Data Drift Report
============================================================
[1] Schema Changes:
  No schema changes detected.
[2] Feature Drift (PSI):
  🔴 DRIFTED | income               | PSI: 5.6120 (numerical)
  🟢 STABLE  | age                  | PSI: 0.0034 (numerical)
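As background on the number in the report, PSI compares the share of values falling in each bin of a baseline against the share in the new data: PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the baseline and current bin proportions. The sketch below implements that textbook formula in pure Python with equal-width bins taken from the baseline; the binning strategy, the epsilon floor, and the function name are assumptions, and ValidateX's own implementation may bin differently.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = list(range(100))
shifted = [v + 50 for v in baseline]

print(f"same data: {psi(baseline, baseline):.4f}")
print(f"shifted:   {psi(baseline, shifted):.4f}")
```

Identical samples give a PSI of zero, while the shifted sample lands well above the 0.2 threshold used in the example above, so it would be flagged as drifted.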

🎯 Data Quality Score

ValidateX computes a weighted quality score (0–100) based on the severity of each expectation:

| Severity | Weight | Example Expectations |
|---|---|---|
| 🔴 Critical | ×3 | Null checks, uniqueness, column existence, row count |
| 🟡 Warning | ×2 | Range checks, set membership, regex, type checks |
| 🔵 Info | ×1 | Mean/stdev bounds, string lengths, distinct values |

Formula: Score = 100 × (weighted_passed / weighted_total)

A critical failure impacts the score 3× more than an info-level check. This gives decision-makers a single number to assess data health.

result = vx.validate(df, suite)
score = result.compute_quality_score()
print(f"Data Quality Score: {score}/100")
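The formula can be worked through by hand with a small standalone sketch. The weights come from the severity table above; the function name, the (severity, passed) input shape, the rounding, and the choice to score an empty suite as 100 are assumptions for illustration.

```python
# Weights from the severity table above.
SEVERITY_WEIGHTS = {"critical": 3, "warning": 2, "info": 1}


def quality_score(results):
    """results: iterable of (severity, passed) pairs."""
    weighted_total = 0
    weighted_passed = 0
    for severity, passed in results:
        weight = SEVERITY_WEIGHTS[severity]
        weighted_total += weight
        if passed:
            weighted_passed += weight
    if weighted_total == 0:
        return 100.0  # assumption: nothing to fail means a perfect score
    return round(100 * weighted_passed / weighted_total, 2)


checks = [
    ("critical", True),
    ("critical", False),  # one failed critical check
    ("warning", True),
    ("info", True),
]
score = quality_score(checks)
print(score)  # 66.67
```

Here weighted_total is 3 + 3 + 2 + 1 = 9 and weighted_passed is 3 + 2 + 1 = 6, so the score is 100 × 6 / 9 ≈ 66.67. Had the failing check been info-level instead, the score would have been 100 × 8 / 9 ≈ 88.89.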

Custom Severity

Override the default severity on any expectation via meta:

expectations:
  - expectation_type: expect_column_mean_to_be_between
    column: revenue
    kwargs:
      min_value: 1000
      max_value: 50000
    meta:
      severity: critical   # Override default "info" → "critical"

📊 Column Health Summary

The HTML report includes a Column Health Summary that aggregates all expectations per column:

| Column | Checks | Passed | Failed | Health | Null % | Unique % |
|---|---|---|---|---|---|---|
| user_id | 3 | 3 | 0 | 100% ███ | 0.0% | 100.0% ███ |
| email | 4 | 4 | 0 | 100% ███ | 0.0% | 100.0% ███ |
| status | 1 | 1 | 0 | 100% ███ | | |

Each metric includes a mini CSS bar chart for instant visual scanning.

for col in result.column_health():
    print(f"{col.column}: {col.health_score}% health, "
          f"{col.passed}/{col.checks} passed")
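The per-column aggregation behind this summary can be sketched in a few lines. This is an illustrative reconstruction rather than ValidateX's code: the (column, passed) input shape and the dictionary keys are assumptions, chosen to mirror the checks, passed, and health_score attributes used in the loop above.

```python
from collections import defaultdict


def column_health(results):
    """Aggregate (column, passed) pairs into per-column health stats."""
    tally = defaultdict(lambda: {"checks": 0, "passed": 0})
    for column, passed in results:
        tally[column]["checks"] += 1
        tally[column]["passed"] += int(passed)
    return {
        column: {
            **counts,
            "failed": counts["checks"] - counts["passed"],
            "health_score": round(100 * counts["passed"] / counts["checks"], 1),
        }
        for column, counts in tally.items()
    }


health = column_health([
    ("user_id", True), ("user_id", True), ("user_id", True),
    ("email", True), ("email", False),
])
for column, stats in health.items():
    print(f"{column}: {stats['health_score']}% health, "
          f"{stats['passed']}/{stats['checks']} passed")
```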

📋 Available Expectations

Column-Level (36)

| Expectation | Severity | Description |
|---|---|---|
| expect_column_to_exist | 🔴 Critical | Column exists in DataFrame |
| expect_column_to_not_be_null | 🔴 Critical | No null values |
| expect_column_values_to_be_unique | 🔴 Critical | All values unique |
| expect_column_values_to_be_between | 🟡 Warning | Values within range |
| expect_column_values_to_be_in_set | 🟡 Warning | Values in allowed set |
| expect_column_values_to_not_be_in_set | 🟡 Warning | Values not in forbidden set |
| expect_column_values_to_match_regex | 🟡 Warning | Values match regex pattern |
| expect_column_values_to_be_of_type | 🟡 Warning | Column dtype matches |
| expect_column_values_to_be_dateutil_parseable | 🟡 Warning | Values parseable as dates |
| expect_column_value_lengths_to_be_between | 🔵 Info | String lengths within range |
| expect_column_max_to_be_between | 🔵 Info | Column max within bounds |
| expect_column_min_to_be_between | 🔵 Info | Column min within bounds |
| expect_column_mean_to_be_between | 🔵 Info | Column mean within bounds |
| expect_column_stdev_to_be_between | 🔵 Info | Column std dev within bounds |
| expect_column_distinct_values_to_be_in_set | 🔵 Info | All distinct values in set |
| expect_column_proportion_of_unique_values_to_be_between | 🔵 Info | Uniqueness ratio in range |
| expect_column_values_to_not_match_regex | 🟡 Warning | Values do not match regex |
| expect_column_values_to_be_valid_email | 🟡 Warning | Values parse as valid emails |
| expect_column_values_to_be_json_parseable | 🟡 Warning | Values are parseable JSON |
| expect_column_sum_to_be_between | 🔵 Info | Column sum within bounds |
| expect_column_median_to_be_between | 🔵 Info | Column median within bounds |
| expect_column_value_lengths_to_equal | 🔵 Info | String lengths exact match |
| expect_column_quantile_values_to_be_between | 🔵 Info | Per-quantile range checks |
| expect_column_null_percentage_to_be_less_than | 🟡 Warning | Null rate < threshold |
| expect_column_values_to_be_positive | 🟡 Warning | All values > 0 |
| expect_column_values_to_be_negative | 🟡 Warning | All values < 0 |
| expect_column_values_to_be_in_range_of_std_devs | 🔵 Info | Outlier / Z-score detection |
| expect_column_correlation_to_be_between | 🔵 Info | Pearson correlation in range |
| expect_column_values_to_have_no_whitespace | 🟡 Warning | No leading/trailing whitespace |
| expect_column_values_to_be_valid_url | 🟡 Warning | Valid HTTP/HTTPS/FTP URLs |
| expect_column_values_to_be_valid_ip_address | 🟡 Warning | Valid IPv4 / IPv6 addresses |
| expect_column_values_to_be_valid_uuid | 🟡 Warning | Valid UUID (any version) |
| expect_column_values_to_be_valid_iso_date | 🟡 Warning | Valid ISO 8601 dates |
| expect_column_values_to_be_valid_phone_number | 🟡 Warning | Valid international phone |
| expect_column_values_to_be_all_uppercase | 🔵 Info | All values UPPERCASED |
| expect_column_values_to_be_all_lowercase | 🔵 Info | All values lowercased |

Table-Level (5)

| Expectation | Severity | Description |
|---|---|---|
| expect_table_row_count_to_equal | 🔴 Critical | Exact row count |
| expect_table_row_count_to_be_between | 🔴 Critical | Row count in range |
| expect_table_columns_to_match_ordered_list | 🔴 Critical | Column order matches |
| expect_table_columns_to_match_set | 🔴 Critical | Column names match (unordered) |
| expect_table_column_count_to_equal | 🔴 Critical | Exact column count |

Aggregate / Cross-Column (4)

| Expectation | Severity | Description |
|---|---|---|
| expect_column_pair_values_a_to_be_greater_than_b | 🟡 Warning | Column A > Column B |
| expect_column_pair_values_to_be_equal | 🟡 Warning | Two columns equal |
| expect_multicolumn_sum_to_equal | 🟡 Warning | Row-wise sum equals target |
| expect_compound_columns_to_be_unique | 🔴 Critical | Compound key uniqueness |

Sequential / Time-Series (2)

| Expectation | Severity | Description |
|---|---|---|
| expect_column_values_to_be_increasing | 🔵 Info | Monotonically increasing |
| expect_column_values_to_be_decreasing | 🔵 Info | Monotonically decreasing |

Conditional / Cross-Row (3)

| Expectation | Severity | Description |
|---|---|---|
| expect_column_values_to_be_null_when | 🟡 Warning | Column must be null given condition |
| expect_column_values_to_be_not_null_when | 🔴 Critical | Column must not be null given condition |
| expect_column_values_to_satisfy | 🟡 Warning | Pass a Python lambda as custom validation |

📊 Data Profiling

import pandas as pd
from validatex import DataProfiler

df = pd.read_csv("data.csv")
profiler = DataProfiler()

# Profile
profile = profiler.profile(df)
print(profile.summary())

# Auto-suggest expectations
suite = profiler.suggest_expectations(df, suite_name="auto_suite")
suite.save("auto_suite.yaml")

🔧 YAML Suite Configuration

suite_name: my_data_quality
meta:
  description: "Quality checks for production data"

expectations:
  - expectation_type: expect_column_to_not_be_null
    column: id
    meta:
      severity: critical

  - expectation_type: expect_column_values_to_be_between
    column: age
    kwargs:
      min_value: 0
      max_value: 150

  - expectation_type: expect_column_values_to_be_in_set
    column: status
    kwargs:
      value_set: ["active", "inactive"]

🏗️ Architecture

validatex/
├── core/
│   ├── expectation.py     # Base class + registry
│   ├── result.py          # ValidationResult, QualityScore, Severity, ColumnHealth
│   ├── suite.py           # ExpectationSuite (fluent API)
│   └── validator.py       # Validation orchestrator
├── expectations/
│   ├── column_expectations.py     # 36 column-level checks
│   ├── table_expectations.py      # 5 table-level checks
│   └── aggregate_expectations.py  # 4 cross-column checks
├── datasources/
│   ├── csv_source.py      # CSV files
│   ├── parquet_source.py  # Parquet files
│   ├── database_source.py # SQL databases (SQLAlchemy)
│   └── dataframe_source.py # Direct DataFrames
├── profiler/
│   └── profiler.py        # Auto-profiling & suggestion engine
├── reporting/
│   ├── html_report.py     # Production HTML reports
│   └── json_report.py     # JSON reports
├── config/
│   └── loader.py          # YAML/JSON config loading
└── cli/
    └── main.py            # CLI (validate, run, profile, init, list-expectations)

🧪 Testing

# Run all tests (66 tests)
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=validatex --cov-report=html

# Unit tests only
pytest tests/unit/ -v

# Integration tests
pytest tests/integration/ -v

🤝 Creating Custom Expectations

from dataclasses import dataclass, field
from validatex.core.expectation import Expectation, register_expectation
from validatex.core.result import ExpectationResult

@register_expectation
@dataclass
class ExpectColumnValuesToBePositive(Expectation):
    """Expect all values in a numeric column to be positive."""

    expectation_type: str = field(
        init=False, default="expect_column_values_to_be_positive"
    )

    def _validate_pandas(self, df) -> ExpectationResult:
        # Nulls are ignored; only non-null values are checked.
        series = df[self.column].dropna()
        total = len(series)
        # Zero counts as unexpected too: "positive" means strictly > 0.
        non_positive_mask = series <= 0
        unexpected_count = int(non_positive_mask.sum())
        pct = (unexpected_count / total * 100) if total > 0 else 0.0

        return self._build_result(
            success=(unexpected_count == 0),
            element_count=total,
            unexpected_count=unexpected_count,
            unexpected_percent=pct,
            # Cap the sample of offending values at 20.
            unexpected_values=series[non_positive_mask].tolist()[:20],
        )

🧹 Clean Output

ValidateX converts all internal types to native Python before rendering. You'll never see np.int64(20) in reports or JSON — only clean 20.

result = vx.validate(df, suite)
data = result.to_dict()

# Observed values are always clean:
# {'min': 20, 'max': 69}        ← NOT {'min': np.int64(20), ...}
# "Unique: 100/100 (100.0%)"    ← NOT "100 unique out of 100"
# "Distinct values: 3"          ← NOT "{'unique_values': 3}"
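The kind of sanitization described here can be approximated with a small recursive converter. The sketch below is an assumption about the general technique, not ValidateX's implementation: it relies on the fact that NumPy scalars expose an .item() method returning the equivalent built-in value, and it uses a stand-in class so the example runs without NumPy installed.

```python
class FakeInt64:
    """Stand-in for np.int64 so the sketch runs without NumPy installed."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


def to_native(value):
    """Recursively replace NumPy-style scalars with built-in Python values."""
    if isinstance(value, dict):
        return {key: to_native(v) for key, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_native(v) for v in value]
    item = getattr(value, "item", None)
    if callable(item):
        return item()  # NumPy scalars return a plain int/float/bool here
    return value


observed = {"min": FakeInt64(20), "max": FakeInt64(69), "sample": [FakeInt64(1), 2]}
clean = to_native(observed)
print(clean)  # {'min': 20, 'max': 69, 'sample': [1, 2]}
```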

🚀 Roadmap

  • 50+ built-in expectations (column, table, aggregate, statistical, sequential)
  • Pandas, PySpark, and push-down SQL engine support
  • Severity modeling (Critical / Warning / Info)
  • Weighted data quality score (0–100)
  • Column health summary with mini charts
  • Modern HTML reports with dark theme
  • Data Drift Detection (Population Stability Index / Schema checks)
  • Apache Airflow Integration via ValidateXOperator
  • Sequential & Time-Series Anomaly features
  • Data profiler with auto-suggestion
  • CLI with validate, profile, run, init commands
  • YAML/JSON declarative configuration
  • Native Python type sanitization
  • Slack / Teams notifications on failure
  • GitHub Action template for CI/CD
  • Polars engine support
  • Baseline history tracking & trend charts
  • Great Expectations suite import/migration
  • Web dashboard for multi-dataset monitoring
  • dbt integration plugin

Versioning

ValidateX follows Semantic Versioning.

  • MAJOR version for incompatible API changes
  • MINOR version for backwards-compatible new functionality
  • PATCH version for backwards-compatible bug fixes

📄 License

MIT License


Built with ❤️ by the ValidateX Team
If this project helps you, consider giving it a ⭐