🚀 ValidateX

A powerful, extensible data quality validation framework for Python.

Badges represent (from left to right): CI/CD Build Status, Code Coverage, Test Count, Latest PyPI Release, Supported Python Versions, License, and Code Style.

ValidateX provides a comprehensive suite of tools for validating, profiling, and monitoring data quality across Pandas, Polars, and PySpark DataFrames, as well as Push-Down SQL engines. Inspired by Great Expectations, it offers a simpler, more focused approach with modern, production-ready HTML reports and an intuitive API.

🖼️ Report Preview

Column Health Summary with mini bar charts

Severity-tagged Expectations with human-readable output

🤔 Why ValidateX?

Feature	ValidateX	Great Expectations	Pandera
Easy Setup	✅ `pip install` → validate in 5 lines	⚠️ Heavy multi-step context setup	✅ Decorator & Schema API
HTML Report Generator	✅ Modern dark-theme report	✅ Data docs site	❌ Basic schema error logs
Data Quality Score	✅ Weighted (0–100) business score	❌	❌
PySpark Support	✅ Distributed DataFrame support	✅	⚠️ Limited PySpark engine
Polars Support	✅ Multi-threaded Rust support	✅	⚠️
Push-Down SQL Native	✅ Postgres, Snowflake, BigQuery, DuckDB	✅	❌
Data Drift (PSI)	✅ Built-in PSI & Schema shifts	❌ Separate plugins	❌
Airflow Operator	✅ `ValidateXOperator` built-in	⚠️ External provider	❌
Webhook Alerts	✅ Slack & Teams built-in	⚠️ Complex webhook setups	❌

ValidateX is not a replacement for Great Expectations — it's a focused alternative for teams that want production-grade data validation without the overhead.

⚡ Performance Benchmarks

Execution performance was measured on standard datasets across the Pandas, Polars, and PySpark engines. To reproduce the numbers, run python -m benchmarks.benchmark_suite:

Dataset Size	Engine	Execution Time	Peak Memory	Setup / API Lines
100,000 rows	Pandas	0.04s	2.1 MB	5 lines
100,000 rows	Polars	0.03s	< 1 MB	5 lines
100,000 rows	PySpark	55.65s (local JVM init)	< 1 MB	5 lines
1,000,000 rows	Pandas	0.13s	33.2 MB	5 lines
1,000,000 rows	Polars	0.21s	< 1 MB	5 lines
1,000,000 rows	PySpark	52.21s	< 1 MB	5 lines
10,000,000 rows	Pandas	2.22s	268.5 MB	5 lines
10,000,000 rows	Polars	4.77s	< 1 MB	5 lines
10,000,000 rows	PySpark	Distributed Cluster	Distributed	5 lines

💼 Real-World Industry Showcases

Explore production-ready validation examples tailored for major data domains in examples/showcase/:

👤 Customer 360 Validation — Demographic bounds, email pattern checks, loyalty tier rules.
🛒 Sales ETL Pipeline Validation — Order revenue consistency, non-negative quantities, ISO timestamps.
🏦 Banking Transaction Audit — UUID ledger checks, strict currency sets, Z-score outlier detection.
🏥 Healthcare Claims Quality Gate — ICD-10 diagnosis formatting, admission/discharge date ordering.
📦 Retail Inventory & Supply Chain — Multi-warehouse inventory, SKU regex checks, reorder point auditing.

🎯 Who Is This For?

Startup data teams — Ship data quality checks in minutes, not days
ML engineers — Validate feature stores and training data before model runs
CI/CD pipelines — Gate deployments on data quality with a single CLI command
Analytics teams — Catch data issues before they reach dashboards
dbt users — Lightweight validation alongside your transformation layer
Data platform teams — Monitor data quality across dozens of tables

✨ Features

Feature	Description
50+ Built-in Expectations	Column-level, table-level, format, statistical, and sequential cross-validations
Push-Down SQL Native	Execute core validation via SQLAlchemy directly on Postgres, Snowflake, or BigQuery
Quad-Engine Support	Pandas, Polars, PySpark, and SQL execution engines
🔔 Webhook Alerting	Built-in Slack (Block Kit) and Microsoft Teams (MessageCard) notifications
🎯 Data Quality Score	Weighted score (0–100) based on severity of checks
🔴🟡🔵 Severity Levels	Critical / Warning / Info classification for every expectation
📊 Column Health Summary	At-a-glance per-column health with mini bar charts
📈 Data Drift Detection	Calculate Population Stability Index (PSI) and schema shifts between datasets
🧩 Airflow Integration	Natively gate data pipelines via `ValidateXOperator`
Data Profiling	Auto-analyse datasets and suggest expectations
YAML/JSON Config	Define expectations declaratively
CLI Interface	Run validations from the command line
Clean Output	All values are native Python types — zero NumPy leakage

📦 Installation

# Basic install
pip install validatex

# With PySpark support
pip install "validatex[spark]"

# With database support
pip install "validatex[database]"

# Full install
pip install "validatex[all]"

# Development
pip install "validatex[dev]"

🏁 Quick Start

Python API

import pandas as pd
import validatex as vx

# Create your data
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 42],
    "email": ["alice@test.com", "bob@test.com", "charlie@test.com",
              "diana@test.com", "eve@test.com"],
    "status": ["active", "active", "inactive", "active", "pending"],
})

# Build an expectation suite
suite = (
    vx.ExpectationSuite("user_quality")
    .add("expect_column_to_not_be_null", column="user_id")
    .add("expect_column_values_to_be_unique", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
    .add("expect_column_values_to_be_in_set",
         column="status", value_set=["active", "inactive", "pending"])
    .add("expect_column_values_to_match_regex",
         column="email", regex=r"^[\w.]+@[\w]+\.\w+$")
)

# Validate
result = vx.validate(df, suite)

# Print summary (includes Quality Score)
print(result.summary())

# Generate reports
result.to_html("report.html")
result.to_json_file("report.json")
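
The email pattern in the suite above is deliberately simple, so it helps to know what it accepts before using it on real data. The following is a small standard-library sketch (it does not call ValidateX) that runs the same regex over a few hypothetical sample addresses:

```python
import re

# Pattern copied from the suite above.
EMAIL_PATTERN = re.compile(r"^[\w.]+@[\w]+\.\w+$")

# Hypothetical sample values, chosen to probe the pattern's limits.
samples = [
    "alice@test.com",       # plain address
    "first.last@test.com",  # dot in the local part
    "user+tag@test.com",    # plus sign is not allowed in the local part
    "bob@mail.test.com",    # subdomain adds a second dot after the @
    "no-at-sign",           # not an address at all
]

matches = {s: bool(EMAIL_PATTERN.match(s)) for s in samples}
for address, ok in matches.items():
    print(f"{address:22} {'match' if ok else 'no match'}")
```

Plus-addressed and subdomain addresses are rejected by this pattern, so a production suite may prefer a looser regex or the built-in expect_column_values_to_be_valid_email expectation.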

CLI

# Initialize a project
validatex init

# Profile a dataset
validatex profile --data data.csv --suggest --output auto_suite.yaml

# Run validation
validatex validate --data data.csv --suite suite.yaml --report report.html

# Run checkpoint
validatex run --checkpoint checkpoint.yaml

# List available expectations
validatex list-expectations

📚 Documentation Hub

Explore comprehensive guides, FAQs, and migration documentation:

🔁 Migration Guide (Great Expectations & AWS Deequ) — Step-by-step instructions for converting legacy suites to ValidateX.
🏛️ Architectural Decisions & Safeguards (ADR) — Compute alignment, memory safeguards, and RunStore.
🛠️ Custom Expectations Developer Guide — Extend ValidateX with domain-specific rules.
❓ Frequently Asked Questions (FAQ) — Security, engine performance, custom expectations, and alert setup.
⚙️ GitHub Action Setup — Automate data quality gates in CI/CD.

📰 Technical Articles & Guides

🛡️ 10 Data Quality Checks Every Data Engineer Needs
⚡ Validate Pandas DataFrames in Minutes
⚖️ ValidateX vs Great Expectations Architectural Comparison
📊 How to Generate Beautiful HTML Data Quality Reports

🗄️ Push-Down SQL Native Validation

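ValidateX can validate terabytes of data directly inside your database without loading it into Python memory as DataFrames. Under the hood, each expectation is translated into a native aggregate query (such as SELECT COUNT(*)) that runs on the database engine, so only the small result row travels back to Python.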

import validatex as vx
from sqlalchemy import create_engine

# 1. Connect to any database (PostgreSQL, Snowflake, BigQuery, etc.)
engine = create_engine("postgresql://user:pass@host/db")

# 2. Build your expectation suite
suite = (
    vx.ExpectationSuite("users_table_checks")
    .add("expect_table_row_count_to_be_between", min_value=1_000_000, max_value=5_000_000)
    .add("expect_column_to_not_be_null", column="email")
    .add("expect_column_values_to_be_unique", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=18, max_value=120)
)

# 3. Validate directly against the SQL table (Zero Pandas overhead!)
result = vx.validate(
    data="prod_users",     # Just pass the table name
    suite=suite,
    engine="sql",          # Tells ValidateX to use Push-Down SQL
    sql_engine=engine      # The SQLAlchemy database connection
)

print(f"Data Quality Score: {result.compute_quality_score()}/100")
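
To illustrate the push-down idea, here is a minimal sketch of how a null check could be expressed as a single aggregate query. It uses the standard-library sqlite3 module and an in-memory table instead of SQLAlchemy, and the helper null_count_sql is a hypothetical name; the SQL that ValidateX actually generates may differ.

```python
import sqlite3


def null_count_sql(table: str, column: str) -> str:
    """Build an aggregate query that counts rows and NULLs in one pass."""
    # Identifiers are only quoted here; real code should also validate them.
    return (
        "SELECT COUNT(*) AS total, "
        f'SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END) AS nulls '
        f'FROM "{table}"'
    )


# Small in-memory table standing in for a production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prod_users (user_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO prod_users VALUES (?, ?)",
    [(1, "a@test.com"), (2, None), (3, "c@test.com")],
)

# Only one row with two integers comes back, regardless of table size.
total, nulls = conn.execute(null_count_sql("prod_users", "email")).fetchone()
print(f"rows={total}, null emails={nulls}, passed={nulls == 0}")
conn.close()
```

The point of the design is that the database scans the table and returns two numbers, so validation cost in Python stays constant as the table grows.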

🤖 Automate with CI/CD

ValidateX is designed to be lightweight and CI-friendly. You can easily integrate it into your GitHub Actions, GitLab CI, or Jenkins pipelines to gate deployments on data quality.

Example: GitHub Actions

name: Data Quality Validation
on: [push, pull_request]

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          
      - name: Install ValidateX
        run: pip install validatex
        
      - name: Run Data Validation
        run: |
          validatex validate \
            --data data/production_data.csv \
            --suite tests/data_quality/suite.yaml \
            --report dq_report.html
            
      - name: Archive production artifacts
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: validatex-report
          path: dq_report.html

🧩 Apache Airflow Integration

ValidateX includes a native Apache Airflow operator that gates your ETL pipelines on the Data Quality Score: the task fails when the score falls below min_score.

from validatex.integrations.airflow import ValidateXOperator

# `suite` is an ExpectationSuite, built as in the Quick Start example.
# This task will FAIL the Airflow DAG run if the data quality score is < 95.0
validate_data = ValidateXOperator(
    task_id="ensure_data_quality",
    suite=suite,
    data_path="s3://my-bucket/daily_users.parquet",
    data_format="parquet",
    min_score=95.0,
    report_path="/tmp/validatex_daily_report.html",
)

🔔 Slack & Teams Webhook Alerts

Send real-time notifications to engineering channels whenever validations run or data quality thresholds fail.

import validatex as vx

result = vx.validate(df, suite)

# Send Slack alert (Block Kit format)
result.send_slack(webhook_url="https://hooks.slack.com/services/...", notify_on="failure")

# Send Teams alert (MessageCard format)
result.send_teams(webhook_url="https://outlook.office.com/webhook/...", notify_on="failure")

Or trigger alerts automatically via the CLI:

validatex validate --data data.csv --suite suite.yaml --slack-webhook $SLACK_URL --notify-on failure
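
For readers who want to see what such an alert involves, the sketch below builds a Slack Block Kit message and decides whether to send it. It is a standard-library illustration only: the function names, the message layout, and the "always" value for notify_on are assumptions, not ValidateX's actual implementation.

```python
import json
import urllib.request


def should_notify(success: bool, notify_on: str) -> bool:
    """'failure' sends only for failed runs; 'always' is an assumed extra mode."""
    return notify_on == "always" or (notify_on == "failure" and not success)


def build_slack_payload(suite_name: str, success: bool, score: float) -> dict:
    """Return a Block Kit message with a header block and a score section."""
    status = "PASSED" if success else "FAILED"
    return {
        "blocks": [
            {"type": "header",
             "text": {"type": "plain_text",
                      "text": f"ValidateX: {suite_name} {status}"}},
            {"type": "section",
             "text": {"type": "mrkdwn",
                      "text": f"*Data Quality Score:* {score}/100"}},
        ]
    }


def post_json(webhook_url: str, payload: dict) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


payload = build_slack_payload("user_quality", success=False, score=72.73)
if should_notify(success=False, notify_on="failure"):
    # Printed instead of sent; call post_json(url, payload) with a real webhook.
    print(json.dumps(payload, indent=2))
```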

📈 Data Drift Detection (PSI)

Stop guessing whether distributions have changed. ValidateX calculates the Population Stability Index (PSI) for each feature and reports schema changes between two datasets, natively and without heavy dependencies.

import validatex as vx

# Compare Yesterday's data vs Today's data
detector = vx.DriftDetector(psi_threshold=0.2)
report = detector.compare(yesterday_df, today_df)

print(report.summary())

Output:

============================================================
  ValidateX Data Drift Report
============================================================
[1] Schema Changes:
  No schema changes detected.
[2] Feature Drift (PSI):
  🔴 DRIFTED | income               | PSI: 5.6120 (numerical)
  🟢 STABLE  | age                  | PSI: 0.0034 (numerical)
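
As background on the metric in the report above: PSI compares the share of values that fall in each bin of a baseline sample with the share in a current sample, summing (actual - expected) × ln(actual / expected) over the bins. The sketch below is a dependency-free illustration; the equal-width binning, the bin count, and the eps floor are assumptions, and ValidateX's own binning may differ.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the baseline range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical samples: an unchanged copy and a copy shifted by +50.
baseline = [float(x) for x in range(100)]
same = list(baseline)
shifted = [x + 50.0 for x in baseline]

print(f"identical: {psi(baseline, same):.4f}")
print(f"shifted:   {psi(baseline, shifted):.4f}")
```

An unchanged sample scores exactly 0, while the shifted sample lands far above the 0.2 threshold used in the example above.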

🎯 Data Quality Score

ValidateX computes a weighted quality score (0–100) based on the severity of each expectation:

Severity	Weight	Example Expectations
🔴 Critical	×3	Null checks, uniqueness, column existence, row count
🟡 Warning	×2	Range checks, set membership, regex, type checks
🔵 Info	×1	Mean/stdev bounds, string lengths, distinct values

Formula: Score = 100 × (weighted_passed / weighted_total)

A critical failure impacts the score 3× more than an info-level check. This gives decision-makers a single number to assess data health.

result = vx.validate(df, suite)
score = result.compute_quality_score()
print(f"Data Quality Score: {score}/100")
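
To make the formula concrete, here is a small worked sketch in plain Python. It applies the weights from the table above to a hypothetical five-check suite; the function name, the rounding to two decimals, and the handling of an empty suite are assumptions rather than ValidateX internals.

```python
# Severity weights from the table above.
WEIGHTS = {"critical": 3, "warning": 2, "info": 1}


def quality_score(results):
    """results: iterable of (severity, passed) pairs."""
    results = list(results)
    total = sum(WEIGHTS[sev] for sev, _ in results)
    passed = sum(WEIGHTS[sev] for sev, ok in results if ok)
    # An empty suite has nothing to fail; treating it as 100 is an assumption.
    return 100.0 if total == 0 else round(100 * passed / total, 2)


checks = [
    ("critical", True),  # null check
    ("critical", True),  # uniqueness
    ("warning", True),   # range check
    ("warning", True),   # set membership
    ("info", True),      # mean bounds
]

# Same suite with exactly one failing check of different severity.
info_fails = checks[:-1] + [("info", False)]
critical_fails = [("critical", False)] + checks[1:]

print(quality_score(checks))          # all checks pass
print(quality_score(info_fails))      # one info-level failure
print(quality_score(critical_fails))  # one critical failure
```

With a total weight of 11, the single info failure lowers the score to 90.91 (10/11), while the single critical failure lowers it to 72.73 (8/11): three times the penalty.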

Custom Severity

Override the default severity on any expectation via meta:

expectations:
  - expectation_type: expect_column_mean_to_be_between
    column: revenue
    kwargs:
      min_value: 1000
      max_value: 50000
    meta:
      severity: critical   # Override default "info" → "critical"

📊 Column Health Summary

The HTML report includes a Column Health Summary that aggregates all expectations per column:

Column	Checks	Passed	Health	Null %	Unique %
user_id	3	3	100% ███	0.0%	100.0% ███
email	4	4	100% ███	0.0%	100.0% ███
status	1	1	100% ███	—	—

Each metric includes a mini CSS bar chart for instant visual scanning.

for col in result.column_health():
    print(f"{col.column}: {col.health_score}% health, "
          f"{col.passed}/{col.checks} passed")
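
The aggregation behind such a summary can be pictured as grouping expectation results by column. The sketch below is a plain-Python illustration that reuses the attribute names from the loop above (column, checks, passed, health_score); the ColumnHealth class shown here and the summarize helper are hypothetical stand-ins, not the library's own code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ColumnHealth:
    column: str
    checks: int
    passed: int

    @property
    def health_score(self) -> float:
        # Percentage of checks on this column that passed.
        return round(100 * self.passed / self.checks, 1) if self.checks else 0.0


def summarize(results):
    """results: iterable of (column, passed) pairs, one per expectation."""
    tally = defaultdict(lambda: [0, 0])  # column -> [checks, passed]
    for column, ok in results:
        tally[column][0] += 1
        tally[column][1] += int(ok)
    return [ColumnHealth(col, c, p) for col, (c, p) in tally.items()]


# Hypothetical per-expectation outcomes.
results = [
    ("user_id", True), ("user_id", True), ("user_id", True),
    ("email", True), ("email", False),
    ("status", True),
]
health = summarize(results)
for col in health:
    print(f"{col.column}: {col.health_score}% health, "
          f"{col.passed}/{col.checks} passed")
```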

📋 Available Expectations

Column-Level (36)

Expectation	Severity	Description
`expect_column_to_exist`	🔴 Critical	Column exists in DataFrame
`expect_column_to_not_be_null`	🔴 Critical	No null values
`expect_column_values_to_be_unique`	🔴 Critical	All values unique
`expect_column_values_to_be_between`	🟡 Warning	Values within range
`expect_column_values_to_be_in_set`	🟡 Warning	Values in allowed set
`expect_column_values_to_not_be_in_set`	🟡 Warning	Values not in forbidden set
`expect_column_values_to_match_regex`	🟡 Warning	Values match regex pattern
`expect_column_values_to_be_of_type`	🟡 Warning	Column dtype matches
`expect_column_values_to_be_dateutil_parseable`	🟡 Warning	Values parseable as dates
`expect_column_value_lengths_to_be_between`	🔵 Info	String lengths within range
`expect_column_max_to_be_between`	🔵 Info	Column max within bounds
`expect_column_min_to_be_between`	🔵 Info	Column min within bounds
`expect_column_mean_to_be_between`	🔵 Info	Column mean within bounds
`expect_column_stdev_to_be_between`	🔵 Info	Column std dev within bounds
`expect_column_distinct_values_to_be_in_set`	🔵 Info	All distinct values in set
`expect_column_proportion_of_unique_values_to_be_between`	🔵 Info	Uniqueness ratio in range
`expect_column_values_to_not_match_regex`	🟡 Warning	Values do not match regex
`expect_column_values_to_be_valid_email`	🟡 Warning	Values parse as valid emails
`expect_column_values_to_be_json_parseable`	🟡 Warning	Values are parseable JSON
`expect_column_sum_to_be_between`	🔵 Info	Column sum within bounds
`expect_column_median_to_be_between`	🔵 Info	Column median within bounds
`expect_column_value_lengths_to_equal`	🔵 Info	String lengths exact match
`expect_column_quantile_values_to_be_between`	🔵 Info	Per-quantile range checks
`expect_column_null_percentage_to_be_less_than`	🟡 Warning	Null rate < threshold
`expect_column_values_to_be_positive`	🟡 Warning	All values > 0
`expect_column_values_to_be_negative`	🟡 Warning	All values < 0
`expect_column_values_to_be_in_range_of_std_devs`	🔵 Info	Outlier / Z-score detection
`expect_column_correlation_to_be_between`	🔵 Info	Pearson correlation in range
`expect_column_values_to_have_no_whitespace`	🟡 Warning	No leading/trailing whitespace
`expect_column_values_to_be_valid_url`	🟡 Warning	Valid HTTP/HTTPS/FTP URLs
`expect_column_values_to_be_valid_ip_address`	🟡 Warning	Valid IPv4 / IPv6 addresses
`expect_column_values_to_be_valid_uuid`	🟡 Warning	Valid UUID (any version)
`expect_column_values_to_be_valid_iso_date`	🟡 Warning	Valid ISO 8601 dates
`expect_column_values_to_be_valid_phone_number`	🟡 Warning	Valid international phone
`expect_column_values_to_be_all_uppercase`	🔵 Info	All values UPPERCASED
`expect_column_values_to_be_all_lowercase`	🔵 Info	All values lowercased

Table-Level (5)

Expectation	Severity	Description
`expect_table_row_count_to_equal`	🔴 Critical	Exact row count
`expect_table_row_count_to_be_between`	🔴 Critical	Row count in range
`expect_table_columns_to_match_ordered_list`	🔴 Critical	Column order matches
`expect_table_columns_to_match_set`	🔴 Critical	Column names match (unordered)
`expect_table_column_count_to_equal`	🔴 Critical	Exact column count

Aggregate / Cross-Column (4)

Expectation	Severity	Description
`expect_column_pair_values_a_to_be_greater_than_b`	🟡 Warning	Column A > Column B
`expect_column_pair_values_to_be_equal`	🟡 Warning	Two columns equal
`expect_multicolumn_sum_to_equal`	🟡 Warning	Row-wise sum equals target
`expect_compound_columns_to_be_unique`	🔴 Critical	Compound key uniqueness

Sequential / Time-Series (2)

Expectation	Severity	Description
`expect_column_values_to_be_increasing`	🔵 Info	Monotonically increasing
`expect_column_values_to_be_decreasing`	🔵 Info	Monotonically decreasing

Conditional / Cross-Row (3)

Expectation	Severity	Description
`expect_column_values_to_be_null_when`	🟡 Warning	Column must be null given condition
`expect_column_values_to_be_not_null_when`	🔴 Critical	Column must not be null given condition
`expect_column_values_to_satisfy`	🟡 Warning	Pass a Python lambda as custom validation

📊 Data Profiling

import pandas as pd
from validatex import DataProfiler

df = pd.read_csv("data.csv")
profiler = DataProfiler()

# Profile
profile = profiler.profile(df)
print(profile.summary())

# Auto-suggest expectations
suite = profiler.suggest_expectations(df, suite_name="auto_suite")
suite.save("auto_suite.yaml")
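
One way to think about auto-suggestion is as a set of simple rules applied to each column's observed values. The following sketch is a hypothetical, dependency-free illustration of that idea (the rules and the suggest_expectations function here are assumptions, not the profiler's actual logic); it emits dictionaries shaped like the YAML suite entries used elsewhere in this README.

```python
def suggest_expectations(columns):
    """columns: dict mapping column name -> list of values (None means null)."""
    suggestions = []
    for name, values in columns.items():
        non_null = [v for v in values if v is not None]

        # Rule 1: no nulls observed -> suggest a not-null check.
        if len(non_null) == len(values):
            suggestions.append(
                {"expectation_type": "expect_column_to_not_be_null", "column": name}
            )

        # Rule 2: every observed value is distinct -> suggest uniqueness.
        if non_null and len(set(non_null)) == len(non_null):
            suggestions.append(
                {"expectation_type": "expect_column_values_to_be_unique", "column": name}
            )

        # Rule 3: numeric column -> suggest the observed min/max as bounds.
        if non_null and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
        ):
            suggestions.append({
                "expectation_type": "expect_column_values_to_be_between",
                "column": name,
                "kwargs": {"min_value": min(non_null), "max_value": max(non_null)},
            })
    return suggestions


# Hypothetical sample data.
data = {
    "user_id": [1, 2, 3, 4],
    "age": [25, 30, None, 30],
    "status": ["active", "active", "inactive", "active"],
}
suggested = suggest_expectations(data)
for s in suggested:
    print(s)
```

Suggested bounds come from a single sample, so they are best treated as a starting point to review rather than as final thresholds.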

🔧 YAML Suite Configuration

suite_name: my_data_quality
meta:
  description: "Quality checks for production data"

expectations:
  - expectation_type: expect_column_to_not_be_null
    column: id
    meta:
      severity: critical

  - expectation_type: expect_column_values_to_be_between
    column: age
    kwargs:
      min_value: 0
      max_value: 150

  - expectation_type: expect_column_values_to_be_in_set
    column: status
    kwargs:
      value_set: ["active", "inactive"]

🏗️ Architecture

validatex/
├── core/
│   ├── expectation.py     # Base class + registry
│   ├── result.py          # ValidationResult, QualityScore, Severity, ColumnHealth
│   ├── suite.py           # ExpectationSuite (fluent API)
│   └── validator.py       # Validation orchestrator
├── expectations/
│   ├── column_expectations.py     # 16 column-level checks
│   ├── table_expectations.py      # 5 table-level checks
│   └── aggregate_expectations.py  # 4 cross-column checks
├── datasources/
│   ├── csv_source.py      # CSV files
│   ├── parquet_source.py  # Parquet files
│   ├── database_source.py # SQL databases (SQLAlchemy)
│   └── dataframe_source.py # Direct DataFrames
├── profiler/
│   └── profiler.py        # Auto-profiling & suggestion engine
├── reporting/
│   ├── html_report.py     # Production HTML reports
│   └── json_report.py     # JSON reports
├── config/
│   └── loader.py          # YAML/JSON config loading
└── cli/
    └── main.py            # CLI (validate, run, profile, init, list-expectations)

🧪 Testing

# Run all tests (66 tests)
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=validatex --cov-report=html

# Unit tests only
pytest tests/unit/ -v

# Integration tests
pytest tests/integration/ -v

🤝 Creating Custom Expectations

from dataclasses import dataclass, field
from validatex.core.expectation import Expectation, register_expectation
from validatex.core.result import ExpectationResult

@register_expectation
@dataclass
class ExpectColumnValuesToBePositive(Expectation):
    """Expect all values in a numeric column to be positive."""

    expectation_type: str = field(
        init=False, default="expect_column_values_to_be_positive"
    )

    def _validate_pandas(self, df) -> ExpectationResult:
        # Nulls are ignored; only non-null values are checked.
        series = df[self.column].dropna()
        total = len(series)
        # Zero is not positive, so flag every value <= 0.
        non_positive_mask = series <= 0
        unexpected_count = int(non_positive_mask.sum())
        pct = (unexpected_count / total * 100) if total > 0 else 0.0

        return self._build_result(
            success=(unexpected_count == 0),
            element_count=total,
            unexpected_count=unexpected_count,
            unexpected_percent=pct,
            # Cap the sample of offending values at 20 to keep results small.
            unexpected_values=series[non_positive_mask].tolist()[:20],
        )

🧹 Clean Output

ValidateX converts all internal types to native Python before rendering. You'll never see np.int64(20) in reports or JSON — only clean 20.

result = vx.validate(df, suite)
data = result.to_dict()

# Observed values are always clean:
# {'min': 20, 'max': 69}        ← NOT {'min': np.int64(20), ...}
# "Unique: 100/100 (100.0%)"    ← NOT "100 unique out of 100"
# "Distinct values: 3"          ← NOT "{'unique_values': 3}"
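
A conversion like this can be sketched with duck typing: NumPy scalars expose an item() method that returns the equivalent built-in value. The example below is an illustration under that assumption, not ValidateX's actual code; to_native is a hypothetical helper, and FakeInt64 stands in for np.int64 so the sketch runs without NumPy installed.

```python
def to_native(value):
    """Recursively convert scalar wrappers and containers to built-in types."""
    if isinstance(value, dict):
        return {k: to_native(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_native(v) for v in value]
    # NumPy scalars (np.int64, np.float64, ...) expose .item(); multi-element
    # arrays would need .tolist() instead and are not handled here.
    if hasattr(value, "item") and callable(value.item):
        return value.item()
    return value


class FakeInt64:
    """Stand-in for np.int64 so this sketch has no NumPy dependency."""

    def __init__(self, v):
        self._v = v

    def item(self):
        return self._v

    def __repr__(self):
        return f"np.int64({self._v})"


observed = {"min": FakeInt64(20), "max": FakeInt64(69), "values": [FakeInt64(1), 2.5]}
clean = to_native(observed)
print(observed)  # wrapper reprs leak into the output
print(clean)     # only built-in ints, floats, lists, and dicts remain
```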

🚀 Roadmap

Versioning

ValidateX follows Semantic Versioning.

MAJOR version for incompatible API changes
MINOR version for backwards-compatible new functionality
PATCH version for backwards-compatible bug fixes

📄 License

MIT License

Built with ❤️ by the ValidateX Team
If this project helps you, consider giving it a ⭐
