kaviarasanmani/ValidateX

🚀 ValidateX

A powerful, extensible data quality validation framework for Python.

Badges (left to right): build status (tests and CI), code coverage, test passing rate, latest PyPI version, PyPI monthly downloads, supported Python versions, GitHub stars, MIT license, and code style (black).

ValidateX provides a comprehensive suite of tools for validating, profiling, and monitoring data quality across Pandas, Polars, and PySpark DataFrames, as well as Push-Down SQL engines. Inspired by Great Expectations, it offers a simpler, more focused approach with modern, production-ready HTML reports and an intuitive API.

πŸ–ΌοΈ Report Preview

ValidateX Report β€” Overview

Column Health Summary

Column Health Summary with mini bar charts

Expectations Table

Severity-tagged Expectations with human-readable output


🤔 Why ValidateX?

| Feature | ValidateX | Great Expectations | Pandera |
|---|---|---|---|
| Easy Setup | ✅ pip install → validate in 5 lines | ⚠️ Heavy multi-step context setup | ✅ Decorator & Schema API |
| HTML Report Generator | ✅ Modern dark-theme report | ✅ Data docs site | ❌ Basic schema error logs |
| Data Quality Score | ✅ Weighted (0–100) business score | ❌ | ❌ |
| PySpark Support | ✅ Distributed DataFrame support | ✅ | ⚠️ Limited PySpark engine |
| Polars Support | ✅ Multi-threaded Rust support | ✅ | ⚠️ |
| Push-Down SQL Native | ✅ Postgres, Snowflake, BigQuery, DuckDB | ✅ | ❌ |
| Data Drift (PSI) | ✅ Built-in PSI & Schema shifts | ❌ Separate plugins | ❌ |
| Airflow Operator | ✅ ValidateXOperator built-in | ⚠️ External provider | ❌ |
| Webhook Alerts | ✅ Slack & Teams built-in | ⚠️ Complex webhook setups | ❌ |

ValidateX is not a replacement for Great Expectations; it is a focused alternative for teams that want production-grade data validation without the overhead.


⚡ Performance Benchmarks

Execution performance measured on standard datasets across Pandas, Polars, and PySpark engines (run via python -m benchmarks.benchmark_suite):

| Dataset Size | Engine | Execution Time | Peak Memory | Setup / API Lines |
|---|---|---|---|---|
| 100,000 rows | Pandas | 0.04s | 2.1 MB | 5 lines |
| 100,000 rows | Polars | 0.03s | < 1 MB | 5 lines |
| 100,000 rows | PySpark | 55.65s (local JVM init) | < 1 MB | 5 lines |
| 1,000,000 rows | Pandas | 0.13s | 33.2 MB | 5 lines |
| 1,000,000 rows | Polars | 0.21s | < 1 MB | 5 lines |
| 1,000,000 rows | PySpark | 52.21s | < 1 MB | 5 lines |
| 10,000,000 rows | Pandas | 2.22s | 268.5 MB | 5 lines |
| 10,000,000 rows | Polars | 4.77s | < 1 MB | 5 lines |
| 10,000,000 rows | PySpark | Distributed Cluster | Distributed | 5 lines |

💼 Real-World Industry Showcases

Explore production-ready validation examples tailored for major data domains in examples/showcase/.


🎯 Who Is This For?

  • Startup data teams: ship data quality checks in minutes, not days
  • ML engineers: validate feature stores and training data before model runs
  • CI/CD pipelines: gate deployments on data quality with a single CLI command
  • Analytics teams: catch data issues before they reach dashboards
  • dbt users: lightweight validation alongside your transformation layer
  • Data platform teams: monitor data quality across dozens of tables

✨ Features

| Feature | Description |
|---|---|
| 50+ Built-in Expectations | Column-level, table-level, format, statistical, and sequential cross-validations |
| Push-Down SQL Native | Execute core validation via SQLAlchemy directly on Postgres, Snowflake, or BigQuery |
| Quad-Engine Support | Pandas, Polars, PySpark, and SQL execution engines |
| 🔔 Webhook Alerting | Built-in Slack (Block Kit) and Microsoft Teams (MessageCard) notifications |
| 🎯 Data Quality Score | Weighted score (0–100) based on severity of checks |
| 🔴🟡🔵 Severity Levels | Critical / Warning / Info classification for every expectation |
| 📊 Column Health Summary | At-a-glance per-column health with mini bar charts |
| 📈 Data Drift Detection | Calculate Population Stability Index (PSI) and schema shifts between datasets |
| 🧩 Airflow Integration | Natively gate data pipelines via ValidateXOperator |
| Data Profiling | Auto-analyse datasets and suggest expectations |
| YAML/JSON Config | Define expectations declaratively |
| CLI Interface | Run validations from the command line |
| Clean Output | All values are native Python types, with zero NumPy leakage |

📦 Installation

# Basic install
pip install validatex

# With PySpark support
pip install "validatex[spark]"

# With database support
pip install "validatex[database]"

# Full install
pip install "validatex[all]"

# Development
pip install "validatex[dev]"

🏁 Quick Start

Python API

import pandas as pd
import validatex as vx

# Create your data
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "age": [25, 30, 35, 28, 42],
    "email": ["alice@test.com", "bob@test.com", "charlie@test.com",
              "diana@test.com", "eve@test.com"],
    "status": ["active", "active", "inactive", "active", "pending"],
})

# Build an expectation suite
suite = (
    vx.ExpectationSuite("user_quality")
    .add("expect_column_to_not_be_null", column="user_id")
    .add("expect_column_values_to_be_unique", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=0, max_value=150)
    .add("expect_column_values_to_be_in_set",
         column="status", value_set=["active", "inactive", "pending"])
    .add("expect_column_values_to_match_regex",
         column="email", regex=r"^[\w.]+@[\w]+\.\w+$")
)

# Validate
result = vx.validate(df, suite)

# Print summary (includes Quality Score)
print(result.summary())

# Generate reports
result.to_html("report.html")
result.to_json_file("report.json")
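
The email pattern used in the suite above is intentionally simple, so it helps to know which address shapes it accepts before relying on it. The sketch below uses only Python's `re` module; the sample addresses are illustrative and not taken from the project.

```python
import re

# Pattern copied from the Quick Start suite above.
EMAIL_PATTERN = re.compile(r"^[\w.]+@[\w]+\.\w+$")

# Illustrative sample addresses (assumptions, not project data).
samples = [
    "alice@test.com",       # plain address
    "bob.smith@test.com",   # dot in the local part
    "carol@mail.test.com",  # subdomain
    "dave@my-site.com",     # hyphen in the domain
    "eve+news@test.com",    # plus-tag in the local part
]


def matches(value: str) -> bool:
    """Return True if the value matches the Quick Start email pattern."""
    return EMAIL_PATTERN.match(value) is not None


results = {address: matches(address) for address in samples}
for address, ok in results.items():
    print(f"{address:22} {'match' if ok else 'no match'}")
```

The first two addresses match; the subdomain, hyphenated-domain, and plus-tag addresses do not. If such addresses are valid in your data, widen the regex or consider expect_column_values_to_be_valid_email from the expectations list.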

CLI

# Initialize a project
validatex init

# Profile a dataset
validatex profile --data data.csv --suggest --output auto_suite.yaml

# Run validation
validatex validate --data data.csv --suite suite.yaml --report report.html

# Run checkpoint
validatex run --checkpoint checkpoint.yaml

# List available expectations
validatex list-expectations

📚 Documentation Hub

Explore comprehensive guides, FAQs, and migration documentation.

📰 Technical Articles & Guides


πŸ—„οΈ Push-Down SQL Native Validation

ValidateX can validate terabytes of data directly inside your database without ever loading DataFrames into Python memory. This generates optimized native queries (like SELECT COUNT(*)) under the hood.

import validatex as vx
from sqlalchemy import create_engine

# 1. Connect to any database (PostgreSQL, Snowflake, BigQuery, etc.)
engine = create_engine("postgresql://user:pass@host/db")

# 2. Build your expectation suite
suite = (
    vx.ExpectationSuite("users_table_checks")
    .add("expect_table_row_count_to_be_between", min_value=1_000_000, max_value=5_000_000)
    .add("expect_column_to_not_be_null", column="email")
    .add("expect_column_values_to_be_unique", column="user_id")
    .add("expect_column_values_to_be_between", column="age", min_value=18, max_value=120)
)

# 3. Validate directly against the SQL table (Zero Pandas overhead!)
result = vx.validate(
    data="prod_users",     # Just pass the table name
    suite=suite,
    engine="sql",          # Tells ValidateX to use Push-Down SQL
    sql_engine=engine      # The SQLAlchemy database connection
)

print(f"Data Quality Score: {result.compute_quality_score()}/100")
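
To make the push-down idea concrete, here is a minimal sketch of the kind of aggregate queries such checks reduce to. It uses the standard-library `sqlite3` module and hand-written SQL; the exact queries ValidateX generates are not shown in this README, so treat the SQL, the table contents, and the `scalar` helper as assumptions for illustration.

```python
import sqlite3

# In-memory database standing in for a production table (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prod_users (user_id INTEGER, email TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO prod_users VALUES (?, ?, ?)",
    [
        (1, "a@test.com", 25),
        (2, None, 40),          # null email
        (3, "c@test.com", 17),  # age below 18
        (3, "d@test.com", 33),  # duplicate user_id
    ],
)


def scalar(sql: str) -> int:
    """Run a query that returns a single number and return that number."""
    return conn.execute(sql).fetchone()[0]


# Each check is one aggregate query; no rows are pulled into Python.
row_count = scalar("SELECT COUNT(*) FROM prod_users")
null_emails = scalar("SELECT COUNT(*) FROM prod_users WHERE email IS NULL")
duplicate_ids = scalar("SELECT COUNT(*) - COUNT(DISTINCT user_id) FROM prod_users")
out_of_range_ages = scalar(
    "SELECT COUNT(*) FROM prod_users WHERE age NOT BETWEEN 18 AND 120"
)

print(row_count, null_emails, duplicate_ids, out_of_range_ages)
```

Because every check returns a single number, the cost on the Python side stays constant regardless of table size.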

🤖 Automate with CI/CD

ValidateX is designed to be lightweight and CI-friendly. You can easily integrate it into your GitHub Actions, GitLab CI, or Jenkins pipelines to gate deployments on data quality.

Example: GitHub Actions

name: Data Quality Validation
on: [push, pull_request]

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          
      - name: Install ValidateX
        run: pip install validatex
        
      - name: Run Data Validation
        run: |
          validatex validate \
            --data data/production_data.csv \
            --suite tests/data_quality/suite.yaml \
            --report dq_report.html
            
      - name: Archive production artifacts
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: validatex-report
          path: dq_report.html

🧩 Apache Airflow Integration

ValidateX includes a native Apache Airflow operator that gates your ETL pipelines on the Data Quality Score.

from validatex.integrations.airflow import ValidateXOperator

# This task will FAIL the Airflow DAG if the data quality score is < 95.0
validate_data = ValidateXOperator(
    task_id="ensure_data_quality",
    suite=suite,
    data_path="s3://my-bucket/daily_users.parquet",
    data_format="parquet",
    min_score=95.0, 
    report_path="/tmp/validatex_daily_report.html"
)

🔔 Slack & Teams Webhook Alerts

Send real-time notifications to engineering channels whenever validations run or data quality thresholds fail.

import validatex as vx

result = vx.validate(df, suite)

# Send Slack alert (Block Kit format)
result.send_slack(webhook_url="https://hooks.slack.com/services/...", notify_on="failure")

# Send Teams alert (MessageCard format)
result.send_teams(webhook_url="https://outlook.office.com/webhook/...", notify_on="failure")

Or trigger alerts automatically via the CLI:

validatex validate --data data.csv --suite suite.yaml --slack-webhook $SLACK_URL --notify-on failure

📈 Data Drift Detection (PSI)

Stop guessing whether distributions have changed. ValidateX calculates the Population Stability Index (PSI) and detects schema changes between two datasets natively, without heavy dependencies.

import validatex as vx

# Compare Yesterday's data vs Today's data
detector = vx.DriftDetector(psi_threshold=0.2)
report = detector.compare(yesterday_df, today_df)

print(report.summary())

Output:

============================================================
  ValidateX Data Drift Report
============================================================
[1] Schema Changes:
  No schema changes detected.
[2] Feature Drift (PSI):
  🔴 DRIFTED | income               | PSI: 5.6120 (numerical)
  🟢 STABLE  | age                  | PSI: 0.0034 (numerical)
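
For readers unfamiliar with PSI, the following is a minimal pure-Python sketch of the standard formula, PSI = Σ (actual_i − expected_i) × ln(actual_i / expected_i), summed over bins. The equal-width binning, the `eps` floor for empty bins, and the function name `psi` are assumptions made for illustration; the README does not state how ValidateX bins values internally.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index using equal-width bins from the baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp values outside the baseline range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [float(v) for v in range(100)]
shifted = [v + 50 for v in baseline]

same_score = psi(baseline, baseline)
shift_score = psi(baseline, shifted)
print(f"identical: {same_score:.4f}, shifted: {shift_score:.4f}")
```

Identical samples give a PSI of zero, while the shifted sample lands well above the 0.2 threshold passed to DriftDetector in the example above.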

🎯 Data Quality Score

ValidateX computes a weighted quality score (0–100) based on the severity of each expectation:

| Severity | Weight | Example Expectations |
|---|---|---|
| 🔴 Critical | ×3 | Null checks, uniqueness, column existence, row count |
| 🟡 Warning | ×2 | Range checks, set membership, regex, type checks |
| 🔵 Info | ×1 | Mean/stdev bounds, string lengths, distinct values |

Formula: Score = 100 × (weighted_passed / weighted_total)

A critical failure impacts the score 3× more than an info-level check. This gives decision-makers a single number to assess data health.

result = vx.validate(df, suite)
score = result.compute_quality_score()
print(f"Data Quality Score: {score}/100")
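
The formula above can be worked through by hand. The sketch below re-implements it in plain Python for a four-check suite; the function name `quality_score`, the rounding to two decimals, and the handling of an empty suite are assumptions, not the library's actual code.

```python
from typing import List, Tuple

# Weights from the severity table above.
SEVERITY_WEIGHTS = {"critical": 3, "warning": 2, "info": 1}


def quality_score(results: List[Tuple[str, bool]]) -> float:
    """Score = 100 x (weighted_passed / weighted_total), rounded to 2 places."""
    weighted_total = sum(SEVERITY_WEIGHTS[sev] for sev, _ in results)
    if weighted_total == 0:
        return 100.0  # assumption: an empty suite counts as fully healthy
    weighted_passed = sum(SEVERITY_WEIGHTS[sev] for sev, ok in results if ok)
    return round(100 * weighted_passed / weighted_total, 2)


# Four checks with total weight 3 + 3 + 2 + 1 = 9; exactly one fails each time.
critical_fails = [("critical", False), ("critical", True), ("warning", True), ("info", True)]
warning_fails = [("critical", True), ("critical", True), ("warning", False), ("info", True)]
info_fails = [("critical", True), ("critical", True), ("warning", True), ("info", False)]

print(quality_score(critical_fails))  # 6/9 of the weight passed
print(quality_score(warning_fails))   # 7/9 of the weight passed
print(quality_score(info_fails))      # 8/9 of the weight passed
```

With one failure in each case, the score drops to 66.67, 77.78, or 88.89 depending on whether the failing check is critical, warning, or info.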

Custom Severity

Override the default severity on any expectation via meta:

expectations:
  - expectation_type: expect_column_mean_to_be_between
    column: revenue
    kwargs:
      min_value: 1000
      max_value: 50000
    meta:
      severity: critical   # Override default "info" → "critical"

📊 Column Health Summary

The HTML report includes a Column Health Summary that aggregates all expectations per column:

| Column | Checks | Passed | Failed | Health | Null % | Unique % |
|---|---|---|---|---|---|---|
| user_id | 3 | 3 | 0 | 100% ███ | 0.0% | 100.0% ███ |
| email | 4 | 4 | 0 | 100% ███ | 0.0% | 100.0% ███ |
| status | 1 | 1 | 0 | 100% ███ | - | - |

Each metric includes a mini CSS bar chart for instant visual scanning.

for col in result.column_health():
    print(f"{col.column}: {col.health_score}% health, "
          f"{col.passed}/{col.checks} passed")

📋 Available Expectations

Column-Level (36)

| Expectation | Severity | Description |
|---|---|---|
| expect_column_to_exist | 🔴 Critical | Column exists in DataFrame |
| expect_column_to_not_be_null | 🔴 Critical | No null values |
| expect_column_values_to_be_unique | 🔴 Critical | All values unique |
| expect_column_values_to_be_between | 🟡 Warning | Values within range |
| expect_column_values_to_be_in_set | 🟡 Warning | Values in allowed set |
| expect_column_values_to_not_be_in_set | 🟡 Warning | Values not in forbidden set |
| expect_column_values_to_match_regex | 🟡 Warning | Values match regex pattern |
| expect_column_values_to_be_of_type | 🟡 Warning | Column dtype matches |
| expect_column_values_to_be_dateutil_parseable | 🟡 Warning | Values parseable as dates |
| expect_column_value_lengths_to_be_between | 🔵 Info | String lengths within range |
| expect_column_max_to_be_between | 🔵 Info | Column max within bounds |
| expect_column_min_to_be_between | 🔵 Info | Column min within bounds |
| expect_column_mean_to_be_between | 🔵 Info | Column mean within bounds |
| expect_column_stdev_to_be_between | 🔵 Info | Column std dev within bounds |
| expect_column_distinct_values_to_be_in_set | 🔵 Info | All distinct values in set |
| expect_column_proportion_of_unique_values_to_be_between | 🔵 Info | Uniqueness ratio in range |
| expect_column_values_to_not_match_regex | 🟡 Warning | Values do not match regex |
| expect_column_values_to_be_valid_email | 🟡 Warning | Values parse as valid emails |
| expect_column_values_to_be_json_parseable | 🟡 Warning | Values are parseable JSON |
| expect_column_sum_to_be_between | 🔵 Info | Column sum within bounds |
| expect_column_median_to_be_between | 🔵 Info | Column median within bounds |
| expect_column_value_lengths_to_equal | 🔵 Info | String lengths exact match |
| expect_column_quantile_values_to_be_between | 🔵 Info | Per-quantile range checks |
| expect_column_null_percentage_to_be_less_than | 🟡 Warning | Null rate < threshold |
| expect_column_values_to_be_positive | 🟡 Warning | All values > 0 |
| expect_column_values_to_be_negative | 🟡 Warning | All values < 0 |
| expect_column_values_to_be_in_range_of_std_devs | 🔵 Info | Outlier / Z-score detection |
| expect_column_correlation_to_be_between | 🔵 Info | Pearson correlation in range |
| expect_column_values_to_have_no_whitespace | 🟡 Warning | No leading/trailing whitespace |
| expect_column_values_to_be_valid_url | 🟡 Warning | Valid HTTP/HTTPS/FTP URLs |
| expect_column_values_to_be_valid_ip_address | 🟡 Warning | Valid IPv4 / IPv6 addresses |
| expect_column_values_to_be_valid_uuid | 🟡 Warning | Valid UUID (any version) |
| expect_column_values_to_be_valid_iso_date | 🟡 Warning | Valid ISO 8601 dates |
| expect_column_values_to_be_valid_phone_number | 🟡 Warning | Valid international phone |
| expect_column_values_to_be_all_uppercase | 🔵 Info | All values UPPERCASED |
| expect_column_values_to_be_all_lowercase | 🔵 Info | All values lowercased |

Table-Level (5)

| Expectation | Severity | Description |
|---|---|---|
| expect_table_row_count_to_equal | 🔴 Critical | Exact row count |
| expect_table_row_count_to_be_between | 🔴 Critical | Row count in range |
| expect_table_columns_to_match_ordered_list | 🔴 Critical | Column order matches |
| expect_table_columns_to_match_set | 🔴 Critical | Column names match (unordered) |
| expect_table_column_count_to_equal | 🔴 Critical | Exact column count |

Aggregate / Cross-Column (4)

| Expectation | Severity | Description |
|---|---|---|
| expect_column_pair_values_a_to_be_greater_than_b | 🟡 Warning | Column A > Column B |
| expect_column_pair_values_to_be_equal | 🟡 Warning | Two columns equal |
| expect_multicolumn_sum_to_equal | 🟡 Warning | Row-wise sum equals target |
| expect_compound_columns_to_be_unique | 🔴 Critical | Compound key uniqueness |

Sequential / Time-Series (2)

| Expectation | Severity | Description |
|---|---|---|
| expect_column_values_to_be_increasing | 🔵 Info | Monotonically increasing |
| expect_column_values_to_be_decreasing | 🔵 Info | Monotonically decreasing |
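
A monotonic check compares each value with its successor. The sketch below shows the idea in plain Python; whether ValidateX treats repeated values as passing is not stated in this README, so the `strictly` flag is a hypothetical parameter used only to show both interpretations.

```python
from typing import Sequence


def is_increasing(values: Sequence[float], strictly: bool = False) -> bool:
    """Return True if every value is >= (or >, if strictly) its predecessor."""
    pairs = zip(values, values[1:])
    if strictly:
        return all(a < b for a, b in pairs)
    return all(a <= b for a, b in pairs)


timestamps = [1, 2, 2, 3]  # illustrative data with one tie
print(is_increasing(timestamps))                 # ties allowed
print(is_increasing(timestamps, strictly=True))  # ties rejected
```

A decreasing check is the same comparison with the operators reversed.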

Conditional / Cross-Row (3)

| Expectation | Severity | Description |
|---|---|---|
| expect_column_values_to_be_null_when | 🟡 Warning | Column must be null given condition |
| expect_column_values_to_be_not_null_when | 🔴 Critical | Column must not be null given condition |
| expect_column_values_to_satisfy | 🟡 Warning | Pass a Python lambda as custom validation |
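
Conditional expectations only inspect the rows where a condition holds. The README does not show how ValidateX expects the condition to be written, so the sketch below models it as a plain Python callable over row dictionaries; the helper name `not_null_when` and the sample rows are hypothetical.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# Illustrative order rows: shipped orders should always have a ship date.
rows: List[Row] = [
    {"status": "cancelled", "shipped_at": None},
    {"status": "shipped", "shipped_at": "2024-01-05"},
    {"status": "shipped", "shipped_at": None},
]


def not_null_when(data: List[Row], column: str,
                  condition: Callable[[Row], bool]) -> List[Row]:
    """Return the rows where the condition holds but the column is null."""
    return [row for row in data if condition(row) and row[column] is None]


violations = not_null_when(rows, "shipped_at", lambda r: r["status"] == "shipped")
print(f"{len(violations)} violating row(s)")
```

The cancelled order is skipped because the condition is false for it, so only the shipped order without a date is reported.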

📊 Data Profiling

import pandas as pd
from validatex import DataProfiler

df = pd.read_csv("data.csv")
profiler = DataProfiler()

# Profile
profile = profiler.profile(df)
print(profile.summary())

# Auto-suggest expectations
suite = profiler.suggest_expectations(df, suite_name="auto_suite")
suite.save("auto_suite.yaml")

🔧 YAML Suite Configuration

suite_name: my_data_quality
meta:
  description: "Quality checks for production data"

expectations:
  - expectation_type: expect_column_to_not_be_null
    column: id
    meta:
      severity: critical

  - expectation_type: expect_column_values_to_be_between
    column: age
    kwargs:
      min_value: 0
      max_value: 150

  - expectation_type: expect_column_values_to_be_in_set
    column: status
    kwargs:
      value_set: ["active", "inactive"]

πŸ—οΈ Architecture

validatex/
β”œβ”€β”€ core/
β”‚   β”œβ”€β”€ expectation.py     # Base class + registry
β”‚   β”œβ”€β”€ result.py          # ValidationResult, QualityScore, Severity, ColumnHealth
β”‚   β”œβ”€β”€ suite.py           # ExpectationSuite (fluent API)
β”‚   └── validator.py       # Validation orchestrator
β”œβ”€β”€ expectations/
β”‚   β”œβ”€β”€ column_expectations.py     # 16 column-level checks
β”‚   β”œβ”€β”€ table_expectations.py      # 5 table-level checks
β”‚   └── aggregate_expectations.py  # 4 cross-column checks
β”œβ”€β”€ datasources/
β”‚   β”œβ”€β”€ csv_source.py      # CSV files
β”‚   β”œβ”€β”€ parquet_source.py  # Parquet files
β”‚   β”œβ”€β”€ database_source.py # SQL databases (SQLAlchemy)
β”‚   └── dataframe_source.py # Direct DataFrames
β”œβ”€β”€ profiler/
β”‚   └── profiler.py        # Auto-profiling & suggestion engine
β”œβ”€β”€ reporting/
β”‚   β”œβ”€β”€ html_report.py     # Production HTML reports
β”‚   └── json_report.py     # JSON reports
β”œβ”€β”€ config/
β”‚   └── loader.py          # YAML/JSON config loading
└── cli/
    └── main.py            # CLI (validate, run, profile, init, list-expectations)

🧪 Testing

# Run all tests (66 tests)
pytest tests/ -v

# Run with coverage
pytest tests/ -v --cov=validatex --cov-report=html

# Unit tests only
pytest tests/unit/ -v

# Integration tests
pytest tests/integration/ -v

🤝 Creating Custom Expectations

from dataclasses import dataclass, field
from validatex.core.expectation import Expectation, register_expectation
from validatex.core.result import ExpectationResult

@register_expectation
@dataclass
class ExpectColumnValuesToBePositive(Expectation):
    """Expect all values in a numeric column to be positive."""

    expectation_type: str = field(
        init=False, default="expect_column_values_to_be_positive"
    )

    def _validate_pandas(self, df) -> ExpectationResult:
        # Nulls are ignored; only non-null values are checked.
        series = df[self.column].dropna()
        total = len(series)
        # Zero is not positive, so flag every value <= 0.
        non_positive_mask = series <= 0
        unexpected_count = int(non_positive_mask.sum())
        pct = (unexpected_count / total * 100) if total > 0 else 0.0

        return self._build_result(
            success=(unexpected_count == 0),
            element_count=total,
            unexpected_count=unexpected_count,
            unexpected_percent=pct,
            # Report at most 20 offending values.
            unexpected_values=series[non_positive_mask].tolist()[:20],
        )

🧹 Clean Output

ValidateX converts all internal types to native Python before rendering. You'll never see np.int64(20) in reports or JSON, only a clean 20.

result = vx.validate(df, suite)
data = result.to_dict()

# Observed values are always clean:
# {'min': 20, 'max': 69}        ← NOT {'min': np.int64(20), ...}
# "Unique: 100/100 (100.0%)"    ← NOT "100 unique out of 100"
# "Distinct values: 3"          ← NOT "{'unique_values': 3}"
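
One way such sanitization can work is a recursive walk that calls `.item()` on NumPy scalars, which is the method NumPy provides for converting a scalar to its built-in Python equivalent. The sketch below is an assumption about the general technique, not ValidateX's actual implementation; `to_native` is a hypothetical helper, and `FakeInt64` is a stand-in for `np.int64` so the example runs without NumPy installed.

```python
from typing import Any


def to_native(value: Any) -> Any:
    """Recursively convert NumPy-style scalars to built-in Python types."""
    if isinstance(value, dict):
        return {key: to_native(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_native(item) for item in value]
    # NumPy scalars expose .item(); multi-element arrays would need
    # .tolist() instead and are out of scope for this sketch.
    if hasattr(value, "item") and callable(value.item):
        return value.item()
    return value


class FakeInt64:
    """Stand-in for np.int64 so this sketch has no NumPy dependency."""

    def __init__(self, raw: int) -> None:
        self._raw = raw

    def item(self) -> int:
        return self._raw

    def __repr__(self) -> str:
        return f"np.int64({self._raw})"


observed = {"min": FakeInt64(20), "max": FakeInt64(69), "bins": [FakeInt64(3)]}
clean = to_native(observed)
print(observed)
print(clean)
```

After conversion, the dictionary contains only built-in `int` values and can be passed to `json.dumps` without a custom encoder.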

🚀 Roadmap

  • 50+ built-in expectations (column, table, aggregate, statistical, sequential)
  • Pandas, PySpark, and SQL push-down engine support
  • Severity modeling (Critical / Warning / Info)
  • Weighted data quality score (0–100)
  • Column health summary with mini charts
  • Modern HTML reports with dark theme
  • Data Drift Detection (Population Stability Index / Schema checks)
  • Apache Airflow Integration via ValidateXOperator
  • Sequential & Time-Series Anomaly features
  • Data profiler with auto-suggestion
  • CLI with validate, profile, run, init commands
  • YAML/JSON declarative configuration
  • Native Python type sanitization
  • Slack / Teams notifications on failure
  • GitHub Action template for CI/CD
  • Polars engine support
  • Baseline history tracking & trend charts
  • Great Expectations suite import/migration
  • Web dashboard for multi-dataset monitoring
  • dbt integration plugin

Versioning

ValidateX follows Semantic Versioning.

  • MAJOR version for incompatible API changes
  • MINOR version for backwards-compatible new functionality
  • PATCH version for backwards-compatible bug fixes

📄 License

MIT License


Built with ❤️ by the ValidateX Team
If this project helps you, consider giving it a ⭐
