This repository contains a modular, enterprise‑level Python application that:
- Retrieves all approved drugs from the ChEMBL database
- Filters for those first approved in 2019 or later, sorted by approval year and name
- Fetches UniProt accession numbers for each drug’s protein targets
- Retrieves UniProt keywords (functional annotations) for each target
- Outputs a consolidated CSV table linking drugs → targets → keywords
This pipeline demonstrates how to integrate two public REST APIs (ChEMBL and UniProt) into an end‑to‑end data workflow, with caching, progress reporting, and clean, reusable code.
- Python 3.8+
- Git (optional, for version control)
- Internet access to query the ChEMBL and UniProt services
- Clone this repository (or download the source files):

      git clone <your-repo-url>
      cd <your-repo-folder>

- Create and activate a Python virtual environment:

      python3 -m venv venv
      # macOS/Linux
      source venv/bin/activate
      # Windows (PowerShell)
      .\venv\Scripts\Activate.ps1

- Install the required packages:

      pip install -r requirements.txt

Run the pipeline with:

    python -m src.main

This will:
- Fetch all approved drugs (`max_phase=4`) from ChEMBL
- Filter to those approved ≥ 2019 and sort by year & name
- For each drug, retrieve all UniProt target accessions
- For each accession, fetch UniProt keywords
- Write the results to `drugs_targets_keywords.csv` in the project root
A progress bar and INFO‑level logs will report progress and elapsed time.
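
The filter-and-sort step listed above can be sketched in plain Python. This is a minimal illustration, not the repository's code: it assumes each drug is a dict carrying ChEMBL's `first_approval` and `pref_name` fields, and the function name `filter_recent_drugs` is hypothetical. The pipeline itself performs this step with pandas.

```python
from typing import Any, Dict, List


def filter_recent_drugs(
    drugs: List[Dict[str, Any]], cutoff_year: int = 2019
) -> List[Dict[str, Any]]:
    """Keep drugs first approved in `cutoff_year` or later, sorted by year, then name."""
    recent = [
        d for d in drugs
        # Assumption: records with no recorded approval year are skipped.
        if d.get("first_approval") is not None
        and int(d["first_approval"]) >= cutoff_year
    ]
    return sorted(
        recent,
        key=lambda d: (int(d["first_approval"]), (d.get("pref_name") or "").upper()),
    )


# Hand-written sample records, not real ChEMBL output.
sample = [
    {"pref_name": "DRUG B", "first_approval": 2020},
    {"pref_name": "DRUG A", "first_approval": 2020},
    {"pref_name": "OLD DRUG", "first_approval": 1998},
    {"pref_name": "NO YEAR", "first_approval": None},
    {"pref_name": "DRUG C", "first_approval": 2019},
]
print([d["pref_name"] for d in filter_recent_drugs(sample)])
```

Sorting on a `(year, name)` tuple gives the "by year & name" ordering in a single pass.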

    .
    ├── README.md
    ├── requirements.txt
    ├── drugs_targets_keywords.csv   # (generated output)
    └── src
        ├── main.py                  # entry point & pipeline orchestration
        ├── chembl_client
        │   ├── __init__.py
        │   └── client.py            # ChemblClient: approved drugs & targets
        └── uniprot_client
            ├── __init__.py
            └── client.py            # UniProtClient: keyword retrieval

`ChemblClient`
- `get_approved_drugs(max_phase=4, fields=…)`: Returns a list of approved drugs with the specified fields.
- `get_target_accessions(molecule_chembl_id)`: Uses the ChEMBL `/target/{tid}` endpoint to retrieve all protein components for each mechanism‑of‑action entry and extracts UniProt accessions.
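
As a rough sketch of the extraction step, the helper below pulls UniProt accessions out of one target record. It assumes the record is the parsed JSON of a ChEMBL target with a `target_components` list whose items carry an `accession` field. The function name `extract_accessions` and the sample record are hypothetical and are not taken from `client.py`.

```python
from typing import Any, Dict, List


def extract_accessions(target_record: Dict[str, Any]) -> List[str]:
    """Return de-duplicated UniProt accessions from one target record, in order."""
    seen = set()
    accessions = []
    for component in target_record.get("target_components") or []:
        accession = component.get("accession")
        # Components without an accession are skipped.
        if accession and accession not in seen:
            seen.add(accession)
            accessions.append(accession)
    return accessions


# Hand-written sample shaped like a target record; IDs are placeholders.
sample_target = {
    "target_chembl_id": "CHEMBL_EXAMPLE",
    "target_components": [
        {"accession": "P00001"},
        {"accession": "P00002"},
        {"accession": "P00001"},  # duplicate, dropped
        {"accession": None},      # no accession, dropped
    ],
}
print(extract_accessions(sample_target))
```

De-duplicating here avoids repeated UniProt lookups when several mechanism‑of‑action entries point at the same protein.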
`UniProtClient`
- `get_keywords(accession)`: Calls the EBI Proteins REST API (`/proteins/{acc}`) to fetch curated `keywords` for the given UniProt accession.
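
A keyword lookup with a simple in-memory cache might look like the sketch below. It assumes the Proteins API returns JSON with a `keywords` list of `{"value": ...}` objects and that the endpoint lives under `https://www.ebi.ac.uk/proteins/api/proteins/`. The names `fetch_protein`, `parse_keywords`, and `CachedKeywordLookup` are hypothetical, and the actual `UniProtClient` may be structured differently.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, List

# Assumed endpoint template for the EBI Proteins API.
PROTEINS_API = "https://www.ebi.ac.uk/proteins/api/proteins/{acc}"


def fetch_protein(accession: str) -> Dict[str, Any]:
    """Fetch one protein entry as JSON (performs a network call)."""
    request = urllib.request.Request(
        PROTEINS_API.format(acc=accession),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def parse_keywords(entry: Dict[str, Any]) -> List[str]:
    """Extract keyword strings from a parsed protein entry."""
    return [kw["value"] for kw in entry.get("keywords") or [] if "value" in kw]


class CachedKeywordLookup:
    """Remembers keywords per accession so each protein is fetched once."""

    def __init__(self, fetch: Callable[[str], Dict[str, Any]] = fetch_protein):
        self._fetch = fetch
        self._cache: Dict[str, List[str]] = {}

    def get_keywords(self, accession: str) -> List[str]:
        if accession not in self._cache:
            self._cache[accession] = parse_keywords(self._fetch(accession))
        return self._cache[accession]


# Offline demonstration with a stand-in fetcher and placeholder data.
def fake_fetch(accession: str) -> Dict[str, Any]:
    return {"accession": accession, "keywords": [{"value": "Kinase"}]}


lookup = CachedKeywordLookup(fetch=fake_fetch)
print(lookup.get_keywords("P00001"))
```

Passing the fetcher in as a parameter keeps the parsing and caching logic testable without network access.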
- `main.py` orchestrates the full workflow, including:
  - Logging setup (INFO & WARNING)
  - Progress reporting with `tqdm`
  - Data loading & filtering with `pandas`
  - Writing `drugs_targets_keywords.csv`
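
The orchestration loop could be sketched roughly as follows. The function `build_rows` and its callable parameters are hypothetical stand-ins for the two clients, and skipping drugs with no targets is an assumption rather than documented behavior. The sketch falls back to a no-op wrapper when `tqdm` is not installed.

```python
import logging
from typing import Any, Callable, Dict, Iterable, List

try:
    from tqdm import tqdm  # progress bar, if installed
except ImportError:
    def tqdm(iterable, **kwargs):  # no-op fallback so the sketch still runs
        return iterable

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("pipeline")


def build_rows(
    drugs: Iterable[Dict[str, Any]],
    get_targets: Callable[[str], List[str]],
    get_keywords: Callable[[str], List[str]],
) -> List[Dict[str, Any]]:
    """Link each drug to each of its UniProt targets and that target's keywords."""
    rows = []
    for drug in tqdm(list(drugs), desc="Drugs"):
        chembl_id = drug["molecule_chembl_id"]
        accessions = get_targets(chembl_id)
        if not accessions:
            # Assumption: drugs without protein targets are logged and skipped.
            logger.warning("No UniProt targets for %s", chembl_id)
            continue
        for accession in accessions:
            rows.append({
                "ChEMBL ID": chembl_id,
                "Drug Name": drug.get("pref_name"),
                "Approval Year": drug.get("first_approval"),
                "UniProt Accession": accession,
                "UniProt Keywords": get_keywords(accession),
            })
    logger.info("Built %d rows", len(rows))
    return rows


# Offline demonstration with placeholder data and stand-in lookups.
demo_drugs = [
    {"molecule_chembl_id": "CHEMBL_A", "pref_name": "DRUG A", "first_approval": 2020},
    {"molecule_chembl_id": "CHEMBL_B", "pref_name": "DRUG B", "first_approval": 2021},
]
demo_targets = {"CHEMBL_A": ["P00001", "P00002"], "CHEMBL_B": []}
demo_rows = build_rows(
    demo_drugs,
    get_targets=lambda cid: demo_targets[cid],
    get_keywords=lambda acc: ["Kinase"],
)
print(len(demo_rows))
```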
drugs_targets_keywords.csv
A CSV table with columns: `ChEMBL ID`, `Drug Name`, `Approval Year`, `UniProt Accession`, `UniProt Keywords`. Each row links one drug to one UniProt target and its list of keywords.
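
For illustration, rows with these columns could be written as in the sketch below. It uses the standard-library `csv` module rather than pandas, and joining the keyword list with `"; "` is an assumption, since the serialization of the list is not specified here. The function name `write_rows` is hypothetical.

```python
import csv
import io
from typing import Any, Dict, Iterable, TextIO

COLUMNS = [
    "ChEMBL ID",
    "Drug Name",
    "Approval Year",
    "UniProt Accession",
    "UniProt Keywords",
]


def write_rows(rows: Iterable[Dict[str, Any]], out: TextIO) -> None:
    """Write one CSV line per drug/target pair, with a header row."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        record = dict(row)
        # Assumption: the keyword list is flattened into one "; "-separated cell.
        record["UniProt Keywords"] = "; ".join(row["UniProt Keywords"])
        writer.writerow(record)


# Placeholder row written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_rows(
    [{
        "ChEMBL ID": "CHEMBL_EXAMPLE",
        "Drug Name": "EXAMPLE DRUG",
        "Approval Year": 2020,
        "UniProt Accession": "P00001",
        "UniProt Keywords": ["Kinase", "Membrane"],
    }],
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead, open the target file with `newline=""`, as the `csv` module documentation recommends, and pass the file object as `out`.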
- Configuration: externalize parameters (e.g. cutoff year) into `config.yaml`
- Unit Tests: add `pytest` tests for each client and the end‑to‑end script
- Logging to File: configure `logging.FileHandler` for persistent logs
- Dockerization: wrap the pipeline in a Docker container for reproducibility
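
As one possible starting point for the file-logging item above, the sketch below attaches a `logging.FileHandler` alongside console output. The function name `configure_logging`, the logger name, and the log path are all hypothetical. The demonstration writes to a temporary directory so that it leaves the project tree untouched.

```python
import logging
import os
import tempfile


def configure_logging(log_path: str, name: str = "pipeline") -> logging.Logger:
    """Send INFO-and-above records to both the console and a log file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")

    # Replace existing handlers so repeated calls do not duplicate log lines.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setFormatter(formatter)
    console_handler = logging.StreamHandler()
    console_handler.setFormatter(formatter)

    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    return logger


# Demonstration: log to a file in a temporary directory.
log_file = os.path.join(tempfile.mkdtemp(), "pipeline.log")
log = configure_logging(log_file)
log.info("Pipeline started")
log.warning("No targets found for %s", "CHEMBL_EXAMPLE")
```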
- ChEMBL Web Services: https://chembl.gitbook.io/chembl-interface-documentation/web-services/chembl-data-web-services
- EBI Proteins API: https://www.ebi.ac.uk/proteins/api/doc/#!/proteins/get_proteins__accession_
- chembl_webresource_client Python package: https://pypi.org/project/chembl_webresource_client/