🚀 ValidateX

A powerful, extensible data quality validation framework for Python.

Badges represent (from left to right): CI/CD Build Status, Code Coverage, Test Count, Latest PyPI Release, Supported Python Versions, License, and Code Style.

ValidateX provides a comprehensive suite of tools for validating, profiling, and monitoring data quality across Pandas, Polars, and PySpark DataFrames, as well as Push-Down SQL engines. Inspired by Great Expectations, it offers a simpler, more focused approach with modern, production-ready HTML reports and an intuitive API.

🖼️ Report Preview

Column Health Summary with mini bar charts

Severity-tagged Expectations with human-readable output

🤔 Why ValidateX?

Feature	ValidateX	Great Expectations	Pandera
Easy Setup	✅ `pip install` → validate in 5 lines	⚠️ Heavy multi-step context setup	✅ Decorator & Schema API
HTML Report Generator	✅ Modern dark-theme report	✅ Data docs site	❌ Basic schema error logs
Data Quality Score	✅ Weighted (0–100) business score	❌	❌
PySpark Support	✅ Distributed DataFrame support	✅	⚠️ Limited PySpark engine
Polars Support	✅ Multi-threaded Rust support	✅	⚠️
Push-Down SQL Native	✅ Postgres, Snowflake, BigQuery, DuckDB	✅	❌
Data Drift (PSI)	✅ Built-in PSI & Schema shifts	❌ Separate plugins	❌
Airflow Operator	✅ `ValidateXOperator` built-in	⚠️ External provider	❌
Webhook Alerts	✅ Slack & Teams built-in	⚠️ Complex webhook setups	❌

ValidateX is not a replacement for Great Expectations — it's a focused alternative for teams that want production-grade data validation without the overhead.

⚡ Performance Benchmarks

Execution performance measured on standard datasets across Pandas, Polars, and PySpark engines (run via `python -m benchmarks.benchmark_suite`):

Dataset Size	Engine	Execution Time	Peak Memory	Setup / API Lines
100,000 rows	Pandas	0.04s	2.1 MB	5 lines
100,000 rows	Polars	0.03s	< 1 MB	5 lines
100,000 rows	PySpark	55.65s (local JVM init)	< 1 MB	5 lines
1,000,000 rows	Pandas	0.13s	33.2 MB	5 lines
1,000,000 rows	Polars	0.21s	< 1 MB	5 lines
1,000,000 rows	PySpark	52.21s	< 1 MB	5 lines
10,000,000 rows	Pandas	2.22s	268.5 MB	5 lines
10,000,000 rows	Polars	4.77s	< 1 MB	5 lines
10,000,000 rows	PySpark	Distributed Cluster	Distributed	5 lines

💼 Real-World Industry Showcases

Explore production-ready validation examples tailored for major data domains in examples/showcase/:

👤 Customer 360 Validation — Demographic bounds, email pattern checks, loyalty tier rules.
🛒 Sales ETL Pipeline Validation — Order revenue consistency, non-negative quantities, ISO timestamps.
🏦 Banking Transaction Audit — UUID ledger checks, strict currency sets, Z-score outlier detection.
🏥 Healthcare Claims Quality Gate — ICD-10 diagnosis formatting, admission/discharge date ordering.
📦 Retail Inventory & Supply Chain — Multi-warehouse inventory, SKU regex checks, reorder point auditing.

🎯 Who Is This For?

Startup data teams — Ship data quality checks in minutes, not days
ML engineers — Validate feature stores and training data before model runs
CI/CD pipelines — Gate deployments on data quality with a single CLI command
Analytics teams — Catch data issues before they reach dashboards
dbt users — Lightweight validation alongside your transformation layer
Data platform teams — Monitor data quality across dozens of tables

✨ Features

Feature	Description
50+ Built-in Expectations	Column-level, table-level, format, statistical, and sequential cross-validations
Push-Down SQL Native	Execute core validation via SQLAlchemy directly on Postgres, Snowflake, or BigQuery
Quad-Engine Support	Pandas, Polars, PySpark, and SQL execution engines
🔔 Webhook Alerting	Built-in Slack (Block Kit) and Microsoft Teams (MessageCard) notifications
🎯 Data Quality Score	Weighted score (0–100) based on severity of checks
🔴🟡🔵 Severity Levels	Critical / Warning / Info classification for every expectation
📊 Column Health Summary	At-a-glance per-column health with mini bar charts
📈 Data Drift Detection	Calculate Population Stability Index (PSI) and schema shifts between datasets
🧩 Airflow Integration	Natively gate data pipelines via `ValidateXOperator`
Data Profiling	Auto-analyse datasets and suggest expectations
YAML/JSON Config	Define expectations declaratively
CLI Interface	Run validations from the command line
Clean Output	All values are native Python types — zero NumPy leakage

📦 Installation

# Basic install
pip install validatex

# With PySpark support
pip install "validatex[spark]"

# With database support
pip install "validatex[database]"

# Full install
pip install "validatex[all]"

# Development
pip install "validatex[dev]"

🏁 Quick Start

Python API

import pandas as pd
import validatex as vx

# Create your data
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 42],
    "email": ["alice@test.com", "bob@test.com", "charlie@test.com",
              "diana@test.com", "eve@test.com"],
    "status": ["active", "active", "inactive", "active", "pending"],
})

# Build an expectation suite
suite = (
    vx.ExpectationSuite("user_quality")
    .add("expect_column_to_not_be_null", column="user_id")
    .add("expect_column_values_to_be_unique", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
    .add("expect_column_values_to_be_in_set",
         column="status", value_set=["active", "inactive", "pending"])
    .add("expect_column_values_to_match_regex",
         column="email", regex=r"^[\w.]+@[\w]+\.\w+$")
)

# Validate
result = vx.validate(df, suite)

# Print summary (includes Quality Score)
print(result.summary())

# Generate reports
result.to_html("report.html")
result.to_json_file("report.json")

CLI

# Initialize a project
validatex init

# Profile a dataset
validatex profile --data data.csv --suggest --output auto_suite.yaml

# Run validation
validatex validate --data data.csv --suite suite.yaml --report report.html

# Run checkpoint
validatex run --checkpoint checkpoint.yaml

# List available expectations
validatex list-expectations

📚 Documentation Hub

Explore comprehensive guides, FAQs, and migration documentation:

🔁 Migration Guide (Great Expectations & AWS Deequ) — Step-by-step instructions for converting legacy suites to ValidateX.
🏛️ Architectural Decisions & Safeguards (ADR) — Compute alignment, memory safeguards, and RunStore.
🛠️ Custom Expectations Developer Guide — Extend ValidateX with domain-specific rules.
❓ Frequently Asked Questions (FAQ) — Security, engine performance, custom expectations, and alert setup.
⚙️ GitHub Action Setup — Automate data quality gates in CI/CD.

📰 Technical Articles & Guides

🛡️ 10 Data Quality Checks Every Data Engineer Needs
⚡ Validate Pandas DataFrames in Minutes
⚖️ ValidateX vs Great Expectations Architectural Comparison
📊 How to Generate Beautiful HTML Data Quality Reports

🗄️ Push-Down SQL Native Validation

ValidateX can validate terabytes of data directly inside your database without loading it into DataFrames in Python memory. Each check is translated into an optimized native query (such as `SELECT COUNT(*)`) that the database itself executes.

import validatex as vx
from sqlalchemy import create_engine

# 1. Connect to any database (PostgreSQL, Snowflake, BigQuery, etc.)
engine = create_engine("postgresql://user:pass@host/db")

# 2. Build your expectation suite
suite = (
    vx.ExpectationSuite("users_table_checks")
    .add("expect_table_row_count_to_be_between", min_value=1_000_000, max_value=5_000_000)
    .add("expect_column_to_not_be_null", column="email")
    .add("expect_column_values_to_be_unique", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=18, max_value=120)
)

# 3. Validate directly against the SQL table (Zero Pandas overhead!)
result = vx.validate(
    data="prod_users",     # Just pass the table name
    suite=suite,
    engine="sql",          # Tells ValidateX to use Push-Down SQL
    sql_engine=engine      # The SQLAlchemy database connection
)

print(f"Data Quality Score: {result.compute_quality_score()}/100")
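To make the push-down idea concrete, here is a minimal standalone sketch that uses the standard-library `sqlite3` module in place of a production database. The query shape is an assumption chosen for illustration, not the SQL that ValidateX actually generates. It shows how the null, uniqueness, and range checks from the suite above could be folded into one aggregate query, so that only a handful of counts ever reach Python.

```python
import sqlite3

# In-memory stand-in for the "prod_users" table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prod_users (user_id INTEGER, email TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO prod_users VALUES (?, ?, ?)",
    [(1, "a@example.com", 34), (2, None, 51), (3, "c@example.com", 17)],
)

# One aggregate query: the database scans the rows and returns only counts,
# so no row data is materialised on the Python side.
row = conn.execute(
    """
    SELECT COUNT(*) AS total_rows,
           SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_emails,
           COUNT(*) - COUNT(DISTINCT user_id) AS duplicate_ids,
           SUM(CASE WHEN age NOT BETWEEN 18 AND 120 THEN 1 ELSE 0 END)
               AS age_out_of_range
    FROM prod_users
    """
).fetchone()

total_rows, null_emails, duplicate_ids, age_out_of_range = row
print(total_rows, null_emails, duplicate_ids, age_out_of_range)
```

Each count maps to one expectation: a non-zero `null_emails`, `duplicate_ids`, or `age_out_of_range` would mark the corresponding check as failed.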

🤖 Automate with CI/CD

ValidateX is designed to be lightweight and CI-friendly. You can easily integrate it into your GitHub Actions, GitLab CI, or Jenkins pipelines to gate deployments on data quality.

Example: GitHub Actions

name: Data Quality Validation
on: [push, pull_request]

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          
      - name: Install ValidateX
        run: pip install validatex
        
      - name: Run Data Validation
        run: |
          validatex validate \
            --data data/production_data.csv \
            --suite tests/data_quality/suite.yaml \
            --report dq_report.html
            
      - name: Upload data quality report
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: validatex-report
          path: dq_report.html

🧩 Apache Airflow Integration

ValidateX includes a native Apache Airflow operator that gates your ETL pipelines on the Data Quality Score: the task fails when the score falls below a configured minimum.

from validatex.integrations.airflow import ValidateXOperator

# This task will FAIL the Airflow DAG if the data quality score is < 95.0
validate_data = ValidateXOperator(
    task_id="ensure_data_quality",
    suite=suite,
    data_path="s3://my-bucket/daily_users.parquet",
    data_format="parquet",
    min_score=95.0, 
    report_path="/tmp/validatex_daily_report.html"
)

🔔 Slack & Teams Webhook Alerts

Send real-time notifications to engineering channels whenever validations run or data quality thresholds fail.

import validatex as vx

result = vx.validate(df, suite)

# Send Slack alert (Block Kit format)
result.send_slack(webhook_url="https://hooks.slack.com/services/...", notify_on="failure")

# Send Teams alert (MessageCard format)
result.send_teams(webhook_url="https://outlook.office.com/webhook/...", notify_on="failure")

Or trigger alerts automatically via the CLI:

validatex validate --data data.csv --suite suite.yaml --slack-webhook "$SLACK_URL" --notify-on failure
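For readers unfamiliar with Block Kit, the sketch below builds the kind of JSON body a Slack incoming webhook accepts. The function name `build_slack_payload` and the message layout are hypothetical and are not taken from ValidateX; only the Block Kit block types (`header`, `section`, `mrkdwn`) come from Slack's format.

```python
import json


def build_slack_payload(suite_name, score, failed_checks):
    """Build a Block Kit message body (hypothetical layout, for illustration)."""
    status = "FAILED" if failed_checks else "PASSED"
    blocks = [
        {
            "type": "header",
            "text": {"type": "plain_text",
                     "text": f"Data quality {status}: {suite_name}"},
        },
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": f"*Quality score:* {score}/100"},
        },
    ]
    if failed_checks:
        # One bullet per failed expectation, rendered as inline code.
        lines = "\n".join(f"• `{name}`" for name in failed_checks)
        blocks.append({"type": "section",
                       "text": {"type": "mrkdwn", "text": lines}})
    return {"blocks": blocks}


payload = build_slack_payload(
    "user_quality", 66.67, ["expect_column_to_not_be_null"]
)
body = json.dumps(payload)  # the string that would be POSTed to the webhook URL
print(body)
```

The sketch stops short of sending the request, so it can be run safely without a real webhook URL.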

📈 Data Drift Detection (PSI)

Stop guessing whether distributions have changed. ValidateX calculates the Population Stability Index (PSI) for each feature and detects schema changes natively, without heavy dependencies.

import validatex as vx

# Compare Yesterday's data vs Today's data
detector = vx.DriftDetector(psi_threshold=0.2)
report = detector.compare(yesterday_df, today_df)

print(report.summary())

Output:

============================================================
  ValidateX Data Drift Report
============================================================
[1] Schema Changes:
  No schema changes detected.
[2] Feature Drift (PSI):
  🔴 DRIFTED | income               | PSI: 5.6120 (numerical)
  🟢 STABLE  | age                  | PSI: 0.0034 (numerical)
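As background on the metric itself, PSI compares the share of values falling into each bin of a baseline sample with the share in a current sample: PSI = Σ (actual − expected) × ln(actual / expected). The sketch below is a plain-Python illustration under stated assumptions (equal-width bins over the baseline range, a small epsilon for empty bins); the binning strategy ValidateX uses internally may differ, so the numbers will not necessarily match its report.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Assumptions: equal-width bins over the baseline's range; values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i % 100 for i in range(1000)]
shifted = [v + 50 for v in baseline]

print(psi(baseline, list(baseline)))  # identical samples
print(psi(baseline, shifted))         # distribution moved to the right
```

With the `psi_threshold=0.2` used above, the identical sample would be reported as stable and the shifted sample as drifted.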

🎯 Data Quality Score

ValidateX computes a weighted quality score (0–100) based on the severity of each expectation:

Severity	Weight	Example Expectations
🔴 Critical	×3	Null checks, uniqueness, column existence, row count
🟡 Warning	×2	Range checks, set membership, regex, type checks
🔵 Info	×1	Mean/stdev bounds, string lengths, distinct values

Formula: Score = 100 × (weighted_passed / weighted_total)

A critical failure impacts the score 3× more than an info-level check. This gives decision-makers a single number to assess data health.

result = vx.validate(df, suite)
score = result.compute_quality_score()
print(f"Data Quality Score: {score}/100")
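The formula can also be checked by hand. The following standalone sketch applies the documented weights (critical ×3, warning ×2, info ×1) to a list of pass/fail outcomes. The function name and the input shape are illustrative, and treating an empty suite as a score of 100 is an assumption rather than documented behaviour.

```python
# Severity weights as documented in the table above.
SEVERITY_WEIGHTS = {"critical": 3, "warning": 2, "info": 1}


def quality_score(results):
    """results: iterable of (severity, passed) pairs. Returns a 0-100 score."""
    results = list(results)
    weighted_total = sum(SEVERITY_WEIGHTS[sev] for sev, _ in results)
    if weighted_total == 0:
        return 100.0  # assumption: an empty suite counts as fully healthy
    weighted_passed = sum(SEVERITY_WEIGHTS[sev] for sev, ok in results if ok)
    return round(100 * weighted_passed / weighted_total, 2)


# One failed critical check out of four expectations:
# weighted_total = 3 + 3 + 2 + 1 = 9, weighted_passed = 3 + 2 + 1 = 6
critical_failure = [
    ("critical", True), ("critical", False), ("warning", True), ("info", True),
]
# Same suite, but the failure is on the info-level check instead:
# weighted_passed = 3 + 3 + 2 = 8
info_failure = [
    ("critical", True), ("critical", True), ("warning", True), ("info", False),
]

print(quality_score(critical_failure))  # 66.67
print(quality_score(info_failure))      # 88.89
```

The critical failure removes 3 of 9 weighted points, while the info-level failure removes only 1 of 9.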

Custom Severity

Override the default severity on any expectation via meta:

expectations:
  - expectation_type: expect_column_mean_to_be_between
    column: revenue
    kwargs:
      min_value: 1000
      max_value: 50000
    meta:
      severity: critical   # Override default "info" → "critical"

📊 Column Health Summary

The HTML report includes a Column Health Summary that aggregates all expectations per column:

Column	Checks	Passed	Health	Null %	Unique %
user_id	3	3	100% ███	0.0%	100.0% ███
email	4	4	100% ███	0.0%	100.0% ███
status	1	1	100% ███	—	—

Each metric includes a mini CSS bar chart for instant visual scanning.

for col in result.column_health():
    print(f"{col.column}: {col.health_score}% health, "
          f"{col.passed}/{col.checks} passed")
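The aggregation behind the Checks, Passed, and Health columns can be pictured as a simple group-by over expectation results. This is a minimal sketch with a hypothetical input shape (one `(column, passed)` pair per expectation), not the library's implementation.

```python
from collections import defaultdict


def column_health(results):
    """Group (column, passed) pairs into per-column check counts and health %."""
    totals = defaultdict(lambda: [0, 0])  # column -> [checks, passed]
    for column, passed in results:
        totals[column][0] += 1
        totals[column][1] += int(passed)
    return {
        col: {"checks": n, "passed": p, "health": round(100 * p / n, 1)}
        for col, (n, p) in totals.items()
    }


health = column_health([
    ("user_id", True), ("user_id", True), ("user_id", False),
    ("status", True),
])
for col, stats in health.items():
    print(col, stats)
```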

📋 Available Expectations

Column-Level (36)

Expectation	Severity	Description
`expect_column_to_exist`	🔴 Critical	Column exists in DataFrame
`expect_column_to_not_be_null`	🔴 Critical	No null values
`expect_column_values_to_be_unique`	🔴 Critical	All values unique
`expect_column_values_to_be_between`	🟡 Warning	Values within range
`expect_column_values_to_be_in_set`	🟡 Warning	Values in allowed set
`expect_column_values_to_not_be_in_set`	🟡 Warning	Values not in forbidden set
`expect_column_values_to_match_regex`	🟡 Warning	Values match regex pattern
`expect_column_values_to_be_of_type`	🟡 Warning	Column dtype matches
`expect_column_values_to_be_dateutil_parseable`	🟡 Warning	Values parseable as dates
`expect_column_value_lengths_to_be_between`	🔵 Info	String lengths within range
`expect_column_max_to_be_between`	🔵 Info	Column max within bounds
`expect_column_min_to_be_between`	🔵 Info	Column min within bounds
`expect_column_mean_to_be_between`	🔵 Info	Column mean within bounds
`expect_column_stdev_to_be_between`	🔵 Info	Column std dev within bounds
`expect_column_distinct_values_to_be_in_set`	🔵 Info	All distinct values in set
`expect_column_proportion_of_unique_values_to_be_between`	🔵 Info	Uniqueness ratio in range
`expect_column_values_to_not_match_regex`	🟡 Warning	Values do not match regex
`expect_column_values_to_be_valid_email`	🟡 Warning	Values parse as valid emails
`expect_column_values_to_be_json_parseable`	🟡 Warning	Values are parseable JSON
`expect_column_sum_to_be_between`	🔵 Info	Column sum within bounds
`expect_column_median_to_be_between`	🔵 Info	Column median within bounds
`expect_column_value_lengths_to_equal`	🔵 Info	String lengths exact match
`expect_column_quantile_values_to_be_between`	🔵 Info	Per-quantile range checks
`expect_column_null_percentage_to_be_less_than`	🟡 Warning	Null rate < threshold
`expect_column_values_to_be_positive`	🟡 Warning	All values > 0
`expect_column_values_to_be_negative`	🟡 Warning	All values < 0
`expect_column_values_to_be_in_range_of_std_devs`	🔵 Info	Outlier / Z-score detection
`expect_column_correlation_to_be_between`	🔵 Info	Pearson correlation in range
`expect_column_values_to_have_no_whitespace`	🟡 Warning	No leading/trailing whitespace
`expect_column_values_to_be_valid_url`	🟡 Warning	Valid HTTP/HTTPS/FTP URLs
`expect_column_values_to_be_valid_ip_address`	🟡 Warning	Valid IPv4 / IPv6 addresses
`expect_column_values_to_be_valid_uuid`	🟡 Warning	Valid UUID (any version)
`expect_column_values_to_be_valid_iso_date`	🟡 Warning	Valid ISO 8601 dates
`expect_column_values_to_be_valid_phone_number`	🟡 Warning	Valid international phone
`expect_column_values_to_be_all_uppercase`	🔵 Info	All values UPPERCASED
`expect_column_values_to_be_all_lowercase`	🔵 Info	All values lowercased

Table-Level (5)

Expectation	Severity	Description
`expect_table_row_count_to_equal`	🔴 Critical	Exact row count
`expect_table_row_count_to_be_between`	🔴 Critical	Row count in range
`expect_table_columns_to_match_ordered_list`	🔴 Critical	Column order matches
`expect_table_columns_to_match_set`	🔴 Critical	Column names match (unordered)
`expect_table_column_count_to_equal`	🔴 Critical	Exact column count

Aggregate / Cross-Column (4)

Expectation	Severity	Description
`expect_column_pair_values_a_to_be_greater_than_b`	🟡 Warning	Column A > Column B
`expect_column_pair_values_to_be_equal`	🟡 Warning	Two columns equal
`expect_multicolumn_sum_to_equal`	🟡 Warning	Row-wise sum equals target
`expect_compound_columns_to_be_unique`	🔴 Critical	Compound key uniqueness

Sequential / Time-Series (2)

Expectation	Severity	Description
`expect_column_values_to_be_increasing`	🔵 Info	Monotonically increasing
`expect_column_values_to_be_decreasing`	🔵 Info	Monotonically decreasing

Conditional / Cross-Row (3)

Expectation	Severity	Description
`expect_column_values_to_be_null_when`	🟡 Warning	Column must be null given condition
`expect_column_values_to_be_not_null_when`	🔴 Critical	Column must not be null given condition
`expect_column_values_to_satisfy`	🟡 Warning	Pass a Python lambda as custom validation

📊 Data Profiling

import pandas as pd
from validatex import DataProfiler

df = pd.read_csv("data.csv")
profiler = DataProfiler()

# Profile
profile = profiler.profile(df)
print(profile.summary())

# Auto-suggest expectations
suite = profiler.suggest_expectations(df, suite_name="auto_suite")
suite.save("auto_suite.yaml")

🔧 YAML Suite Configuration

suite_name: my_data_quality
meta:
  description: "Quality checks for production data"

expectations:
  - expectation_type: expect_column_to_not_be_null
    column: id
    meta:
      severity: critical

  - expectation_type: expect_column_values_to_be_between
    column: age
    kwargs:
      min_value: 0
      max_value: 150

  - expectation_type: expect_column_values_to_be_in_set
    column: status
    kwargs:
      value_set: ["active", "inactive"]

🏗️ Architecture

validatex/
├── core/
│   ├── expectation.py     # Base class + registry
│   ├── result.py          # ValidationResult, QualityScore, Severity, ColumnHealth
│   ├── suite.py           # ExpectationSuite (fluent API)
│   └── validator.py       # Validation orchestrator
├── expectations/
│   ├── column_expectations.py     # Column-level checks (36 listed above)
│   ├── table_expectations.py      # 5 table-level checks
│   └── aggregate_expectations.py  # 4 cross-column checks
├── datasources/
│   ├── csv_source.py      # CSV files
│   ├── parquet_source.py  # Parquet files
│   ├── database_source.py # SQL databases (SQLAlchemy)
│   └── dataframe_source.py # Direct DataFrames
├── profiler/
│   └── profiler.py        # Auto-profiling & suggestion engine
├── reporting/
│   ├── html_report.py     # Production HTML reports
│   └── json_report.py     # JSON reports
├── config/
│   └── loader.py          # YAML/JSON config loading
└── cli/
    └── main.py            # CLI (validate, run, profile, init, list-expectations)

🧪 Testing

# Run all tests (66 tests)
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=validatex --cov-report=html

# Unit tests only
pytest tests/unit/ -v

# Integration tests
pytest tests/integration/ -v

🤝 Creating Custom Expectations

from dataclasses import dataclass, field
from validatex.core.expectation import Expectation, register_expectation
from validatex.core.result import ExpectationResult

@register_expectation
@dataclass
class ExpectColumnValuesToBePositive(Expectation):
    """Expect all values in a numeric column to be positive."""

    expectation_type: str = field(
        init=False, default="expect_column_values_to_be_positive"
    )

    def _validate_pandas(self, df) -> ExpectationResult:
        series = df[self.column].dropna()
        total = len(series)
        # Zero and negative values both fail a "positive" check.
        non_positive_mask = series <= 0
        unexpected_count = int(non_positive_mask.sum())
        pct = (unexpected_count / total * 100) if total > 0 else 0.0

        return self._build_result(
            success=(unexpected_count == 0),
            element_count=total,
            unexpected_count=unexpected_count,
            unexpected_percent=pct,
            unexpected_values=series[non_positive_mask].tolist()[:20],
        )

🧹 Clean Output

ValidateX converts all internal types to native Python before rendering. You'll never see `np.int64(20)` in reports or JSON, only a clean `20`.

result = vx.validate(df, suite)
data = result.to_dict()

# Observed values are always clean:
# {'min': 20, 'max': 69}        ← NOT {'min': np.int64(20), ...}
# "Unique: 100/100 (100.0%)"    ← NOT "100 unique out of 100"
# "Distinct values: 3"          ← NOT "{'unique_values': 3}"
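One common way to achieve this kind of clean-up is a recursive converter that calls `.item()` on NumPy scalars, which returns the equivalent built-in Python value. The helper below is a hypothetical sketch of that technique, not ValidateX's actual code. It falls back to plain integers when NumPy is not installed, so it runs either way.

```python
def to_native(value):
    """Recursively replace NumPy-style scalars with built-in Python values."""
    if isinstance(value, dict):
        return {to_native(k): to_native(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_native(v) for v in value]
    # NumPy scalars (np.int64, np.float64, np.bool_) expose .item().
    # Note: multi-element arrays also have .item() but raise on it, so this
    # sketch assumes arrays were converted to lists beforehand.
    item = getattr(value, "item", None)
    if callable(item):
        return item()
    return value


try:
    import numpy as np
    observed = {"min": np.int64(20), "max": np.int64(69)}
except ImportError:  # NumPy not installed: use plain ints for the demo
    observed = {"min": 20, "max": 69}

cleaned = to_native(observed)
print(cleaned, type(cleaned["min"]).__name__)
```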

🚀 Roadmap

Versioning

ValidateX follows Semantic Versioning.

MAJOR version for incompatible API changes
MINOR version for backwards-compatible new functionality
PATCH version for backwards-compatible bug fixes

📄 License

MIT License

Built with ❤️ by the ValidateX Team
If this project helps you, consider giving it a ⭐
