A powerful, extensible data quality validation framework for Python.
Badges represent (from left to right): CI/CD Build Status, Code Coverage, Test Count, Latest PyPI Release, Supported Python Versions, License, and Code Style.
ValidateX provides a comprehensive suite of tools for validating, profiling, and monitoring data quality across Pandas, Polars, and PySpark DataFrames, as well as Push-Down SQL engines. Inspired by Great Expectations, it offers a simpler, more focused approach with modern, production-ready HTML reports and an intuitive API.
- Report Preview
- Why ValidateX?
- Performance Benchmarks
- Real-World Industry Showcases
- Who Is This For?
- Features
- Installation
- Quick Start
- Documentation Hub
- Automate with CI/CD
- Data Quality Score
- Available Expectations
- Roadmap
Report preview screenshots: Column Health Summary with mini bar charts; Severity-tagged Expectations with human-readable output.
| Feature | ValidateX | Great Expectations | Pandera |
|---|---|---|---|
| Easy Setup | `pip install`, then validate in 5 lines | | Decorator & Schema API |
| HTML Report Generator | Modern dark-theme report | Data docs site | Basic schema error logs |
| Data Quality Score | Weighted (0-100) business score | | |
| PySpark Support | Distributed DataFrame support | | |
| Polars Support | Multi-threaded Rust support | | |
| Push-Down SQL Native | Postgres, Snowflake, BigQuery, DuckDB | | |
| Data Drift (PSI) | Built-in PSI & Schema shifts | Separate plugins | |
| Airflow Operator | `ValidateXOperator` built-in | | |
| Webhook Alerts | Slack & Teams built-in | | |
ValidateX is not a replacement for Great Expectations; it's a focused alternative for teams that want production-grade data validation without the overhead.
Execution performance measured on standard datasets across Pandas, Polars, and PySpark engines (run via python -m benchmarks.benchmark_suite):
| Dataset Size | Engine | Execution Time | Peak Memory | Setup / API Lines |
|---|---|---|---|---|
| 100,000 rows | Pandas | 0.04s | 2.1 MB | 5 lines |
| 100,000 rows | Polars | 0.03s | < 1 MB | 5 lines |
| 100,000 rows | PySpark | 55.65s (local JVM init) | < 1 MB | 5 lines |
| 1,000,000 rows | Pandas | 0.13s | 33.2 MB | 5 lines |
| 1,000,000 rows | Polars | 0.21s | < 1 MB | 5 lines |
| 1,000,000 rows | PySpark | 52.21s | < 1 MB | 5 lines |
| 10,000,000 rows | Pandas | 2.22s | 268.5 MB | 5 lines |
| 10,000,000 rows | Polars | 4.77s | < 1 MB | 5 lines |
| 10,000,000 rows | PySpark | Distributed Cluster | Distributed | 5 lines |
Explore production-ready validation examples tailored for major data domains in examples/showcase/:
- Customer 360 Validation: Demographic bounds, email pattern checks, loyalty tier rules.
- Sales ETL Pipeline Validation: Order revenue consistency, non-negative quantities, ISO timestamps.
- Banking Transaction Audit: UUID ledger checks, strict currency sets, Z-score outlier detection.
- Healthcare Claims Quality Gate: ICD-10 diagnosis formatting, admission/discharge date ordering.
- Retail Inventory & Supply Chain: Multi-warehouse inventory, SKU regex checks, reorder point auditing.
- Startup data teams: Ship data quality checks in minutes, not days
- ML engineers: Validate feature stores and training data before model runs
- CI/CD pipelines: Gate deployments on data quality with a single CLI command
- Analytics teams: Catch data issues before they reach dashboards
- dbt users: Lightweight validation alongside your transformation layer
- Data platform teams: Monitor data quality across dozens of tables
| Feature | Description |
|---|---|
| 50+ Built-in Expectations | Column-level, table-level, format, statistical, and sequential cross-validations |
| Push-Down SQL Native | Execute core validation via SQLAlchemy directly on Postgres, Snowflake, or BigQuery |
| Quad-Engine Support | Pandas, Polars, PySpark, and SQL execution engines |
| Webhook Alerting | Built-in Slack (Block Kit) and Microsoft Teams (MessageCard) notifications |
| Data Quality Score | Weighted score (0-100) based on severity of checks |
| Severity Levels | Critical / Warning / Info classification for every expectation |
| Column Health Summary | At-a-glance per-column health with mini bar charts |
| Data Drift Detection | Calculate Population Stability Index (PSI) and schema shifts between datasets |
| Airflow Integration | Natively gate data pipelines via `ValidateXOperator` |
| Data Profiling | Auto-analyse datasets and suggest expectations |
| YAML/JSON Config | Define expectations declaratively |
| CLI Interface | Run validations from the command line |
| Clean Output | All values are native Python types, with zero NumPy leakage |
# Basic install
pip install validatex
# With PySpark support
pip install "validatex[spark]"
# With database support
pip install "validatex[database]"
# Full install
pip install "validatex[all]"
# Development
pip install "validatex[dev]"

import pandas as pd
import validatex as vx
# Create your data
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 42],
    "email": ["alice@test.com", "bob@test.com", "charlie@test.com",
              "diana@test.com", "eve@test.com"],
    "status": ["active", "active", "inactive", "active", "pending"],
})
# Build an expectation suite
suite = (
    vx.ExpectationSuite("user_quality")
    .add("expect_column_to_not_be_null", column="user_id")
    .add("expect_column_values_to_be_unique", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
    .add("expect_column_values_to_be_in_set",
         column="status", value_set=["active", "inactive", "pending"])
    .add("expect_column_values_to_match_regex",
         column="email", regex=r"^[\w.]+@[\w]+\.\w+$")
)
# Validate
result = vx.validate(df, suite)
# Print summary (includes Quality Score)
print(result.summary())
# Generate reports
result.to_html("report.html")
result.to_json_file("report.json")

# Initialize a project
validatex init
# Profile a dataset
validatex profile --data data.csv --suggest --output auto_suite.yaml
# Run validation
validatex validate --data data.csv --suite suite.yaml --report report.html
# Run checkpoint
validatex run --checkpoint checkpoint.yaml
# List available expectations
validatex list-expectations

Explore comprehensive guides, FAQs, and migration documentation:
- Migration Guide (Great Expectations & AWS Deequ): Step-by-step instructions for converting legacy suites to ValidateX.
- Architectural Decisions & Safeguards (ADR): Compute alignment, memory safeguards, and RunStore.
- Custom Expectations Developer Guide: Extend ValidateX with domain-specific rules.
- Frequently Asked Questions (FAQ): Security, engine performance, custom expectations, and alert setup.
- GitHub Action Setup: Automate data quality gates in CI/CD.
- 10 Data Quality Checks Every Data Engineer Needs
- Validate Pandas DataFrames in Minutes
- ValidateX vs Great Expectations Architectural Comparison
- How to Generate Beautiful HTML Data Quality Reports
ValidateX can validate terabytes of data directly inside your database without ever loading DataFrames into Python memory. This generates optimized native queries (like SELECT COUNT(*)) under the hood.
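To make the push-down idea concrete, here is a minimal sketch of how a null check and a uniqueness check could each be reduced to a single aggregate query, so that only one number travels back to Python. It uses the standard-library `sqlite3` module instead of SQLAlchemy, and the helper names `count_nulls` and `count_duplicates` are hypothetical. This illustrates the technique and is not ValidateX's actual query generator.

```python
import sqlite3


def count_nulls(conn, table, column):
    """Count NULLs inside the database; only a single integer is returned."""
    # Identifiers cannot be bound as parameters, so this sketch assumes
    # that `table` and `column` come from a trusted suite definition.
    query = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
    return conn.execute(query).fetchone()[0]


def count_duplicates(conn, table, column):
    """Count non-null values that repeat an earlier value in the column."""
    query = f'SELECT COUNT("{column}") - COUNT(DISTINCT "{column}") FROM "{table}"'
    return conn.execute(query).fetchone()[0]


# Tiny in-memory table standing in for a production table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prod_users (user_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO prod_users VALUES (?, ?)",
    [(1, "a@test.com"), (2, None), (2, "c@test.com")],
)

null_count = count_nulls(conn, "prod_users", "email")
duplicate_count = count_duplicates(conn, "prod_users", "user_id")
print(f"null emails: {null_count}, duplicate user_ids: {duplicate_count}")
```

Because each check is an aggregate, the cost on the Python side stays constant regardless of table size; the database does the scanning.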
import validatex as vx
from sqlalchemy import create_engine
# 1. Connect to any database (PostgreSQL, Snowflake, BigQuery, etc.)
engine = create_engine("postgresql://user:pass@host/db")
# 2. Build your expectation suite
suite = (
vx.ExpectationSuite("users_table_checks")
.add("expect_table_row_count_to_be_between", min_value=1_000_000, max_value=5_000_000)
.add("expect_column_to_not_be_null", column="email")
.add("expect_column_values_to_be_unique", column="user_id")
.add("expect_column_values_to_be_between", column="age", min_value=18, max_value=120)
)
# 3. Validate directly against the SQL table (Zero Pandas overhead!)
result = vx.validate(
data="prod_users", # Just pass the table name
suite=suite,
engine="sql", # Tells ValidateX to use Push-Down SQL
sql_engine=engine # The SQLAlchemy database connection
)
print(f"Data Quality Score: {result.compute_quality_score()}/100")

ValidateX is designed to be lightweight and CI-friendly. You can easily integrate it into your GitHub Actions, GitLab CI, or Jenkins pipelines to gate deployments on data quality.
Example: GitHub Actions
name: Data Quality Validation
on: [push, pull_request]
jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install ValidateX
        run: pip install validatex
      - name: Run Data Validation
        run: |
          validatex validate \
            --data data/production_data.csv \
            --suite tests/data_quality/suite.yaml \
            --report dq_report.html
      - name: Archive production artifacts
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: validatex-report
          path: dq_report.html

ValidateX includes a native Apache Airflow operator to completely gate your ETL pipelines based on Data Quality Scores.
from validatex.integrations.airflow import ValidateXOperator
# This task will FAIL the Airflow DAG if the data quality score is < 95.0
validate_data = ValidateXOperator(
    task_id="ensure_data_quality",
    suite=suite,
    data_path="s3://my-bucket/daily_users.parquet",
    data_format="parquet",
    min_score=95.0,
    report_path="/tmp/validatex_daily_report.html"
)

Send real-time notifications to engineering channels whenever validations run or data quality thresholds fail.
import validatex as vx
result = vx.validate(df, suite)
# Send Slack alert (Block Kit format)
result.send_slack(webhook_url="https://hooks.slack.com/services/...", notify_on="failure")
# Send Teams alert (MessageCard format)
result.send_teams(webhook_url="https://outlook.office.com/webhook/...", notify_on="failure")

Or trigger alerts automatically via the CLI:

validatex validate --data data.csv --suite suite.yaml --slack-webhook "$SLACK_URL" --notify-on failure

Stop guessing whether distributions have changed. Calculate the Population Stability Index (PSI) and exact schema changes natively, without heavy dependencies.
import validatex as vx
# Compare Yesterday's data vs Today's data
detector = vx.DriftDetector(psi_threshold=0.2)
report = detector.compare(yesterday_df, today_df)
print(report.summary())

Output:
============================================================
ValidateX Data Drift Report
============================================================
[1] Schema Changes:
No schema changes detected.
[2] Feature Drift (PSI):
🔴 DRIFTED | income | PSI: 5.6120 (numerical)
🟢 STABLE | age | PSI: 0.0034 (numerical)
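For readers unfamiliar with PSI, the sketch below shows one common way to compute it for a numerical feature: bin both samples on the baseline's range, then sum `(actual - expected) * ln(actual / expected)` over the bins. The function name `psi`, the equal-width binning, and the `eps` floor for empty bins are assumptions made for illustration; ValidateX's `DriftDetector` may bin and smooth differently.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` relative to `expected`."""
    lo, hi = min(expected), max(expected)
    # Fall back to a width of 1.0 when the baseline is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped to the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = list(range(100))
shifted = [v + 50 for v in baseline]

same_score = psi(baseline, baseline)
shifted_score = psi(baseline, shifted)
print(f"identical: {same_score:.4f}, shifted: {shifted_score:.4f}")
```

With the `psi_threshold=0.2` used in the example above, the identical sample would be reported as stable and the shifted one as drifted.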
ValidateX computes a weighted quality score (0-100) based on the severity of each expectation:
| Severity | Weight | Example Expectations |
|---|---|---|
| 🔴 Critical | ×3 | Null checks, uniqueness, column existence, row count |
| 🟡 Warning | ×2 | Range checks, set membership, regex, type checks |
| 🔵 Info | ×1 | Mean/stdev bounds, string lengths, distinct values |

Formula: Score = 100 × (weighted_passed / weighted_total)

A critical failure impacts the score 3× more than an info-level check. This gives decision-makers a single number to assess data health.
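As a worked example of the formula, the following sketch applies the ×3 / ×2 / ×1 weights to a small list of results. `CheckOutcome` and `quality_score` are hypothetical names used only here, and treating an empty suite as a score of 100 is an assumption; the library's own `compute_quality_score()` may round or handle edge cases differently.

```python
from dataclasses import dataclass

# Severity weights taken from the table above.
SEVERITY_WEIGHTS = {"critical": 3, "warning": 2, "info": 1}


@dataclass
class CheckOutcome:
    """Hypothetical record of one expectation result (not a ValidateX class)."""
    name: str
    severity: str
    passed: bool


def quality_score(outcomes):
    """Return 100 * weighted_passed / weighted_total, rounded to 2 decimals."""
    weighted_total = sum(SEVERITY_WEIGHTS[o.severity] for o in outcomes)
    if weighted_total == 0:
        return 100.0  # assumption: an empty suite counts as fully healthy
    weighted_passed = sum(
        SEVERITY_WEIGHTS[o.severity] for o in outcomes if o.passed
    )
    return round(100 * weighted_passed / weighted_total, 2)


outcomes = [
    CheckOutcome("not_null(user_id)", "critical", True),
    CheckOutcome("between(age)", "warning", True),
    CheckOutcome("mean(revenue)", "info", False),
]
print(quality_score(outcomes))  # 100 * 5 / 6 -> 83.33
```

If the critical check had failed instead of the info check, the same suite would score 100 × 3 / 6 = 50.0, which is the 3× effect described above.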
result = vx.validate(df, suite)
score = result.compute_quality_score()
print(f"Data Quality Score: {score}/100")

Override the default severity on any expectation via meta:

expectations:
  - expectation_type: expect_column_mean_to_be_between
    column: revenue
    kwargs:
      min_value: 1000
      max_value: 50000
    meta:
      severity: critical  # Override default "info" -> "critical"

The HTML report includes a Column Health Summary that aggregates all expectations per column:
| Column | Checks | Passed | Failed | Health | Null % | Unique % |
|---|---|---|---|---|---|---|
| user_id | 3 | 3 | 0 | 100% ███ | 0.0% | 100.0% ███ |
| | 4 | 4 | 0 | 100% ███ | 0.0% | 100.0% ███ |
| status | 1 | 1 | 0 | 100% ███ | - | - |
Each metric includes a mini CSS bar chart for instant visual scanning.
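The per-column aggregation behind such a summary can be sketched in a few lines: group individual check results by column, count passes and failures, and derive a health percentage. The `(column, passed)` tuple input and the function name `summarize_column_health` are assumptions for illustration and do not reflect ValidateX's internal data model.

```python
from collections import defaultdict


def summarize_column_health(results):
    """Aggregate (column, passed) pairs into per-column health figures."""
    tally = defaultdict(lambda: {"checks": 0, "passed": 0})
    for column, passed in results:
        tally[column]["checks"] += 1
        tally[column]["passed"] += int(passed)

    summary = {}
    for column, counts in tally.items():
        summary[column] = {
            "checks": counts["checks"],
            "passed": counts["passed"],
            "failed": counts["checks"] - counts["passed"],
            # Health is the share of passed checks, as a percentage.
            "health": round(100 * counts["passed"] / counts["checks"], 1),
        }
    return summary


results = [
    ("user_id", True),
    ("user_id", True),
    ("user_id", True),
    ("status", True),
    ("age", True),
    ("age", False),
]
health = summarize_column_health(results)
for column, row in health.items():
    print(f"{column}: {row['health']}% health, {row['passed']}/{row['checks']} passed")
```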
for col in result.column_health():
    print(f"{col.column}: {col.health_score}% health, "
          f"{col.passed}/{col.checks} passed")

| Expectation | Severity | Description |
|---|---|---|
| `expect_column_to_exist` | 🔴 Critical | Column exists in DataFrame |
| `expect_column_to_not_be_null` | 🔴 Critical | No null values |
| `expect_column_values_to_be_unique` | 🔴 Critical | All values unique |
| `expect_column_values_to_be_between` | 🟡 Warning | Values within range |
| `expect_column_values_to_be_in_set` | 🟡 Warning | Values in allowed set |
| `expect_column_values_to_not_be_in_set` | 🟡 Warning | Values not in forbidden set |
| `expect_column_values_to_match_regex` | 🟡 Warning | Values match regex pattern |
| `expect_column_values_to_be_of_type` | 🟡 Warning | Column dtype matches |
| `expect_column_values_to_be_dateutil_parseable` | 🟡 Warning | Values parseable as dates |
| `expect_column_value_lengths_to_be_between` | 🔵 Info | String lengths within range |
| `expect_column_max_to_be_between` | 🔵 Info | Column max within bounds |
| `expect_column_min_to_be_between` | 🔵 Info | Column min within bounds |
| `expect_column_mean_to_be_between` | 🔵 Info | Column mean within bounds |
| `expect_column_stdev_to_be_between` | 🔵 Info | Column std dev within bounds |
| `expect_column_distinct_values_to_be_in_set` | 🔵 Info | All distinct values in set |
| `expect_column_proportion_of_unique_values_to_be_between` | 🔵 Info | Uniqueness ratio in range |
| `expect_column_values_to_not_match_regex` | 🟡 Warning | Values do not match regex |
| `expect_column_values_to_be_valid_email` | 🟡 Warning | Values parse as valid emails |
| `expect_column_values_to_be_json_parseable` | 🟡 Warning | Values are parseable JSON |
| `expect_column_sum_to_be_between` | 🔵 Info | Column sum within bounds |
| `expect_column_median_to_be_between` | 🔵 Info | Column median within bounds |
| `expect_column_value_lengths_to_equal` | 🔵 Info | String lengths exact match |
| `expect_column_quantile_values_to_be_between` | 🔵 Info | Per-quantile range checks |
| `expect_column_null_percentage_to_be_less_than` | 🟡 Warning | Null rate < threshold |
| `expect_column_values_to_be_positive` | 🟡 Warning | All values > 0 |
| `expect_column_values_to_be_negative` | 🟡 Warning | All values < 0 |
| `expect_column_values_to_be_in_range_of_std_devs` | 🔵 Info | Outlier / Z-score detection |
| `expect_column_correlation_to_be_between` | 🔵 Info | Pearson correlation in range |
| `expect_column_values_to_have_no_whitespace` | 🟡 Warning | No leading/trailing whitespace |
| `expect_column_values_to_be_valid_url` | 🟡 Warning | Valid HTTP/HTTPS/FTP URLs |
| `expect_column_values_to_be_valid_ip_address` | 🟡 Warning | Valid IPv4 / IPv6 addresses |
| `expect_column_values_to_be_valid_uuid` | 🟡 Warning | Valid UUID (any version) |
| `expect_column_values_to_be_valid_iso_date` | 🟡 Warning | Valid ISO 8601 dates |
| `expect_column_values_to_be_valid_phone_number` | 🟡 Warning | Valid international phone |
| `expect_column_values_to_be_all_uppercase` | 🔵 Info | All values UPPERCASED |
| `expect_column_values_to_be_all_lowercase` | 🔵 Info | All values lowercased |

| Expectation | Severity | Description |
|---|---|---|
| `expect_table_row_count_to_equal` | 🔴 Critical | Exact row count |
| `expect_table_row_count_to_be_between` | 🔴 Critical | Row count in range |
| `expect_table_columns_to_match_ordered_list` | 🔴 Critical | Column order matches |
| `expect_table_columns_to_match_set` | 🔴 Critical | Column names match (unordered) |
| `expect_table_column_count_to_equal` | 🔴 Critical | Exact column count |

| Expectation | Severity | Description |
|---|---|---|
| `expect_column_pair_values_a_to_be_greater_than_b` | 🟡 Warning | Column A > Column B |
| `expect_column_pair_values_to_be_equal` | 🟡 Warning | Two columns equal |
| `expect_multicolumn_sum_to_equal` | 🟡 Warning | Row-wise sum equals target |
| `expect_compound_columns_to_be_unique` | 🔴 Critical | Compound key uniqueness |

| Expectation | Severity | Description |
|---|---|---|
| `expect_column_values_to_be_increasing` | 🔵 Info | Monotonically increasing |
| `expect_column_values_to_be_decreasing` | 🔵 Info | Monotonically decreasing |

| Expectation | Severity | Description |
|---|---|---|
| `expect_column_values_to_be_null_when` | 🟡 Warning | Column must be null given condition |
| `expect_column_values_to_be_not_null_when` | 🔴 Critical | Column must not be null given condition |
| `expect_column_values_to_satisfy` | 🟡 Warning | Pass a Python lambda as custom validation |
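The conditional expectations in the last table can be read as row-level implications: whenever a condition holds, the column must (or must not) be populated. The plain-Python sketch below illustrates that semantics on a list of dicts. The function names, the row representation, and the callable `condition` argument are all assumptions; the parameters that ValidateX's own expectations accept are not shown in this document.

```python
def violations_not_null_when(rows, column, condition):
    """Rows where `condition` holds but `column` is missing or None."""
    return [row for row in rows if condition(row) and row.get(column) is None]


def violations_null_when(rows, column, condition):
    """Rows where `condition` holds but `column` is populated."""
    return [row for row in rows if condition(row) and row.get(column) is not None]


def is_shipped(row):
    return row["status"] == "shipped"


def is_pending(row):
    return row["status"] == "pending"


orders = [
    {"status": "shipped", "shipped_at": "2024-01-05"},
    {"status": "shipped", "shipped_at": None},   # violates "not null when shipped"
    {"status": "pending", "shipped_at": None},
]

missing = violations_not_null_when(orders, "shipped_at", is_shipped)
premature = violations_null_when(orders, "shipped_at", is_pending)
print(f"{len(missing)} shipped order(s) without a ship date")
print(f"{len(premature)} pending order(s) with a ship date")
```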
import pandas as pd
from validatex import DataProfiler
df = pd.read_csv("data.csv")
profiler = DataProfiler()
# Profile
profile = profiler.profile(df)
print(profile.summary())
# Auto-suggest expectations
suite = profiler.suggest_expectations(df, suite_name="auto_suite")
suite.save("auto_suite.yaml")

suite_name: my_data_quality
meta:
  description: "Quality checks for production data"
expectations:
  - expectation_type: expect_column_to_not_be_null
    column: id
    meta:
      severity: critical
  - expectation_type: expect_column_values_to_be_between
    column: age
    kwargs:
      min_value: 0
      max_value: 150
  - expectation_type: expect_column_values_to_be_in_set
    column: status
    kwargs:
      value_set: ["active", "inactive"]

validatex/
├── core/
│   ├── expectation.py             # Base class + registry
│   ├── result.py                  # ValidationResult, QualityScore, Severity, ColumnHealth
│   ├── suite.py                   # ExpectationSuite (fluent API)
│   └── validator.py               # Validation orchestrator
├── expectations/
│   ├── column_expectations.py     # 16 column-level checks
│   ├── table_expectations.py      # 5 table-level checks
│   └── aggregate_expectations.py  # 4 cross-column checks
├── datasources/
│   ├── csv_source.py              # CSV files
│   ├── parquet_source.py          # Parquet files
│   ├── database_source.py         # SQL databases (SQLAlchemy)
│   └── dataframe_source.py        # Direct DataFrames
├── profiler/
│   └── profiler.py                # Auto-profiling & suggestion engine
├── reporting/
│   ├── html_report.py             # Production HTML reports
│   └── json_report.py             # JSON reports
├── config/
│   └── loader.py                  # YAML/JSON config loading
└── cli/
    └── main.py                    # CLI (validate, run, profile, init, list-expectations)
# Run all tests (66 tests)
pytest tests/ -v
# Run with coverage
pytest tests/ -v --cov=validatex --cov-report=html
# Unit tests only
pytest tests/unit/ -v
# Integration tests
pytest tests/integration/ -v

from dataclasses import dataclass, field

from validatex.core.expectation import Expectation, register_expectation
from validatex.core.result import ExpectationResult


@register_expectation
@dataclass
class ExpectColumnValuesToBePositive(Expectation):
    """Expect all values in a numeric column to be positive."""

    expectation_type: str = field(
        init=False, default="expect_column_values_to_be_positive"
    )

    def _validate_pandas(self, df) -> ExpectationResult:
        # Nulls are dropped first, so only non-null values are checked.
        series = df[self.column].dropna()
        total = len(series)
        # Zero is not positive, so it counts as unexpected as well.
        non_positive_mask = series <= 0
        unexpected_count = int(non_positive_mask.sum())
        pct = (unexpected_count / total * 100) if total > 0 else 0.0
        return self._build_result(
            success=(unexpected_count == 0),
            element_count=total,
            unexpected_count=unexpected_count,
            unexpected_percent=pct,
            # Cap the sample of offending values at 20.
            unexpected_values=series[non_positive_mask].tolist()[:20],
        )

ValidateX converts all internal types to native Python before rendering. You'll never see np.int64(20) in reports or JSON, only a clean 20.
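One way such sanitization can work is a recursive pass that unwraps anything exposing an `.item()` method, which is how NumPy scalars convert to built-in Python values. The sketch below is an assumption about the general technique, not ValidateX's code; `to_native` is a hypothetical helper, and `FakeInt64` is a stand-in used so the example runs without NumPy installed.

```python
def to_native(value):
    """Recursively convert scalar wrappers into plain Python values."""
    if isinstance(value, dict):
        return {key: to_native(val) for key, val in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_native(val) for val in value]
    # NumPy scalars expose .item(), which returns the built-in equivalent.
    # (Multi-element arrays would need .tolist() instead; not handled here.)
    item = getattr(value, "item", None)
    if callable(item):
        return item()
    return value


class FakeInt64:
    """Stand-in for a NumPy scalar: it only mimics the .item() method."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


observed = {"min": FakeInt64(20), "max": FakeInt64(69), "sample": (FakeInt64(1), 2)}
clean = to_native(observed)
print(clean)
```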
result = vx.validate(df, suite)
data = result.to_dict()
# Observed values are always clean:
# {'min': 20, 'max': 69}, NOT {'min': np.int64(20), ...}
# "Unique: 100/100 (100.0%)", NOT "100 unique out of 100"
# "Distinct values: 3", NOT "{'unique_values': 3}"

- 50+ built-in expectations (column, table, aggregate, statistical, sequential)
- Pandas, PySpark, and SQL push-down engine support
- Severity modeling (Critical / Warning / Info)
- Weighted data quality score (0-100)
- Column health summary with mini charts
- Modern HTML reports with dark theme
- Data Drift Detection (Population Stability Index / Schema checks)
- Apache Airflow Integration via `ValidateXOperator`
- Sequential & Time-Series Anomaly features
- Data profiler with auto-suggestion
- CLI with validate, profile, run, init commands
- YAML/JSON declarative configuration
- Native Python type sanitization
- Slack / Teams notifications on failure
- GitHub Action template for CI/CD
- Polars engine support
- Baseline history tracking & trend charts
- Great Expectations suite import/migration
- Web dashboard for multi-dataset monitoring
- dbt integration plugin
ValidateX follows Semantic Versioning.
- MAJOR version for incompatible API changes
- MINOR version for backwards-compatible new functionality
- PATCH version for backwards-compatible bug fixes
MIT License
Built with ❤️ by the ValidateX Team
If this project helps you, consider giving it a ⭐


