Managing scientific datasets, their relationships, and metadata across research workflows can be complex and error-prone. The Scientific Dataset Catalog provides a centralized system for tracking datasets, their lineage, collections, and rich metadata throughout the research lifecycle.
It helps you organize datasets into collections, track how datasets relate to each other (lineage relationships), and store rich metadata, and it includes a Python client for programmatic access to all of these capabilities.
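The three ideas above (collections, lineage relationships, and metadata) can be pictured with a small in-memory sketch. This is not the catalog's Python client or its schema: the `Dataset` and `Collection` classes, the `ancestors` helper, and all field names and sample values are hypothetical, chosen only to illustrate how the concepts fit together. See the Schema Documentation for the actual data models.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical stand-in for a catalog dataset record (not the real schema)."""
    dataset_id: str
    metadata: dict = field(default_factory=dict)  # free-form key/value metadata
    parents: list = field(default_factory=list)   # IDs of datasets this one derives from


@dataclass
class Collection:
    """Hypothetical named grouping of dataset IDs."""
    name: str
    dataset_ids: list = field(default_factory=list)


def ancestors(dataset_id, datasets):
    """Return the IDs of every upstream dataset reachable through lineage links."""
    seen = set()
    stack = list(datasets[dataset_id].parents)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(datasets[current].parents)
    return seen


# Made-up example: raw data -> aligned data -> count matrix.
raw = Dataset("raw-reads", metadata={"assay": "scRNA-seq"})
aligned = Dataset("aligned", parents=["raw-reads"])
counts = Dataset("counts", metadata={"format": "h5ad"}, parents=["aligned"])

datasets = {d.dataset_id: d for d in (raw, aligned, counts)}
study = Collection("pilot-study", dataset_ids=list(datasets))

print(sorted(ancestors("counts", datasets)))  # ['aligned', 'raw-reads']
```

Storing only direct parents on each record keeps lineage cheap to write, and the full upstream history can still be recovered by walking the links, as `ancestors` does here.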
Want to integrate dataset catalog functionality into your Python workflows? → Quick Start | Full Documentation
Want to understand dataset metadata structure and relationships? → Schema Documentation
Prefer to work in Claude Code? Install the catalog plugin. → Installation
Want to contribute to the codebase? → Development Guide
This repo ships a Claude Code plugin named catalog, distributed through the dataset-catalog marketplace defined in .claude-plugin/marketplace.json. To install it, run the following commands from inside a Claude Code session:
/plugin marketplace add chanzuckerberg/dataset-catalog
/plugin install catalog@dataset-catalog
Ready to start using the Python client? The fastest way to get up and running:
→ Installation & Quick Start Guide
The guide walks you through installation, getting an API token, and making your first few API calls.
- Python Client Usage Guide - Comprehensive guide covering datasets, collections, lineage, async usage, and error handling
- Interactive Examples - Jupyter notebooks with step-by-step walkthroughs
- API Token Setup - How to generate and use API tokens
- Dataset Catalog API - The backend service this client connects to
- Schema Documentation - Detailed data models and relationships
- Development Setup - Local development and testing
- Issues & Feedback - Report bugs or request features
This project adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to opensource@chanzuckerberg.com.