VictorChibueze-stud/chembl-uniprot-pipeline

Drug → Target → Keyword Pipeline

This repository contains a modular Python application that:

  1. Retrieves all approved drugs from the ChEMBL database
  2. Filters for those first approved in 2019 or later, sorted by approval year and name
  3. Fetches UniProt accession numbers for each drug’s protein targets
  4. Retrieves UniProt keywords (functional annotations) for each target
  5. Outputs a consolidated CSV table linking drugs → targets → keywords

This pipeline demonstrates how to integrate two public REST APIs (ChEMBL and UniProt) into an end‑to‑end data workflow, with caching, progress reporting, and clean, reusable code.
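This README does not describe how the caching is implemented. One way it could work, as a minimal sketch: wrap the per-accession lookup in `functools.lru_cache` so that an accession shared by several drugs is fetched only once per run. The names `make_cached_lookup` and `fetch` are hypothetical and are not taken from the repository.

```python
from functools import lru_cache


def make_cached_lookup(fetch):
    """Wrap a per-accession fetch function in an in-memory cache.

    `fetch` is a hypothetical callable that takes a UniProt accession and
    returns an iterable of keyword strings (for example, a function that
    performs the HTTP request).
    """

    @lru_cache(maxsize=None)
    def lookup(accession):
        # Store a tuple so the cached value cannot be mutated by callers.
        return tuple(fetch(accession))

    return lookup


# Usage with a stand-in fetcher that records how often it is called.
calls = []


def fake_fetch(accession):
    calls.append(accession)
    return ["Kinase", "Transferase"]


cached = make_cached_lookup(fake_fetch)
first = cached("P00533")
second = cached("P00533")  # served from the cache; fake_fetch is not called again
```

An in-memory cache like this lasts only for the lifetime of the process; a persistent cache would need a different approach.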


📋 Prerequisites

  • Python 3.8+
  • Git (optional, for version control)
  • Internet access to query the ChEMBL and UniProt services

🛠 Installation

  1. Clone this repository (or download the source files):

    git clone <your-repo-url>
    cd <your-repo-folder>
  2. Create and activate a Python virtual environment:

    python3 -m venv venv
    # macOS/Linux
    source venv/bin/activate
    # Windows (PowerShell)
    .\venv\Scripts\Activate.ps1
  3. Install the required packages:

    pip install -r requirements.txt

⚙️ Usage

Run the pipeline with:

    python -m src.main

This will:

  1. Fetch all approved drugs (max_phase=4) from ChEMBL
  2. Filter to drugs first approved in 2019 or later and sort by approval year and name
  3. For each drug, retrieve all UniProt target accessions
  4. For each accession, fetch UniProt keywords
  5. Write the results to drugs_targets_keywords.csv in the project root

A progress bar and INFO‑level logs will report progress and elapsed time.
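The filter-and-sort step (step 2 above) can be sketched in plain Python. The repository does this with pandas; the standard-library version below is only an illustration. It assumes each drug record is a dict carrying the ChEMBL molecule fields `first_approval` and `pref_name`. The function name `filter_recent_approvals` and the sample records are invented for the example.

```python
def filter_recent_approvals(drugs, cutoff_year=2019):
    """Keep drugs first approved in `cutoff_year` or later, sorted by year then name."""
    recent = [
        d for d in drugs
        # Records without a first approval year are dropped.
        if d.get("first_approval") is not None and d["first_approval"] >= cutoff_year
    ]
    return sorted(recent, key=lambda d: (d["first_approval"], d.get("pref_name") or ""))


# Placeholder records (not real ChEMBL entries).
sample = [
    {"molecule_chembl_id": "CHEMBL_B", "pref_name": "BETADRUG", "first_approval": 2020},
    {"molecule_chembl_id": "CHEMBL_A", "pref_name": "ALPHADRUG", "first_approval": 2020},
    {"molecule_chembl_id": "CHEMBL_C", "pref_name": "OLDDRUG", "first_approval": 1998},
    {"molecule_chembl_id": "CHEMBL_D", "pref_name": "NODATE", "first_approval": None},
    {"molecule_chembl_id": "CHEMBL_E", "pref_name": "EARLY", "first_approval": 2019},
]

result = filter_recent_approvals(sample)
names = [d["pref_name"] for d in result]
```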


📂 Project Structure

.
├── README.md
├── requirements.txt
├── drugs_targets_keywords.csv   # (generated output)
└── src
    ├── main.py                  # entry point & pipeline orchestration
    ├── chembl_client
    │   ├── __init__.py
    │   └── client.py            # ChemblClient: approved drugs & targets
    └── uniprot_client
        ├── __init__.py
        └── client.py            # UniProtClient: keyword retrieval

🧩 Module Descriptions

chembl_client/client.py

  • ChemblClient
    • get_approved_drugs(max_phase=4, fields=…)
      Returns a list of approved drugs with specified fields.
    • get_target_accessions(molecule_chembl_id)
      For each mechanism‑of‑action entry of the drug, queries the ChEMBL /target/{tid} endpoint for the target's protein components and extracts their UniProt accessions.
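The extraction step can be sketched as a small function over an already-downloaded target record. It assumes the ChEMBL target JSON exposes a `target_components` list whose entries carry an `accession` field. The helper name `extract_accessions` and the sample record are hypothetical, and the actual `ChemblClient` may handle this differently.

```python
def extract_accessions(target_record):
    """Return the unique UniProt accessions of a target's components, in order."""
    accessions = []
    for component in target_record.get("target_components") or []:
        accession = component.get("accession")
        # Skip components without an accession and avoid duplicates.
        if accession and accession not in accessions:
            accessions.append(accession)
    return accessions


# Placeholder record shaped like a ChEMBL target response (values are invented).
sample_target = {
    "target_chembl_id": "CHEMBL_T1",
    "target_components": [
        {"accession": "P00001", "component_type": "PROTEIN"},
        {"accession": "P00002", "component_type": "PROTEIN"},
        {"accession": None, "component_type": "PROTEIN"},
        {"accession": "P00001", "component_type": "PROTEIN"},
    ],
}

accessions = extract_accessions(sample_target)
empty = extract_accessions({"target_chembl_id": "CHEMBL_T2"})
```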

uniprot_client/client.py

  • UniProtClient
    • get_keywords(accession)
      Calls the EBI Proteins REST API (/proteins/{acc}) to fetch curated keywords for the given UniProt accession.
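As a rough sketch of the parsing side of this call: the code below assumes the Proteins API JSON for an entry contains a `keywords` list of objects with a `value` field. The helper name `extract_keywords` and the sample record are illustrative and not the repository's code.

```python
def extract_keywords(protein_record):
    """Return the keyword strings from a parsed Proteins API entry."""
    return [
        keyword["value"]
        for keyword in protein_record.get("keywords") or []
        if keyword.get("value")
    ]


# Placeholder record shaped like the assumed response (values are invented).
sample_protein = {
    "accession": "P00001",
    "keywords": [{"value": "Kinase"}, {"value": "Membrane"}, {}],
}

keywords = extract_keywords(sample_protein)
no_keywords = extract_keywords({"accession": "P00002"})
```

Returning an empty list for entries without keywords lets the pipeline still write a row for that target.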

main.py

  • Orchestrates the full workflow, including:
    • Logging setup (INFO & WARNING)
    • Progress reporting with tqdm
    • Data loading & filtering with pandas
    • Writing drugs_targets_keywords.csv

💾 Output

  • drugs_targets_keywords.csv
    A CSV table with columns:
    ChEMBL ID,Drug Name,Approval Year,UniProt Accession,UniProt Keywords
    
    Each row links one drug to one UniProt target and its list of keywords.
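The row layout described above can be illustrated with the standard `csv` module. The column names come from this README. The `"; "` separator used to pack the keyword list into one cell is an assumption, as are the helper name `write_rows` and the sample row.

```python
import csv
import io

COLUMNS = ["ChEMBL ID", "Drug Name", "Approval Year", "UniProt Accession", "UniProt Keywords"]


def write_rows(handle, rows):
    """Write one CSV row per (drug, target) pair to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(COLUMNS)
    for chembl_id, name, year, accession, keywords in rows:
        # Assumed convention: keywords are joined into a single cell.
        writer.writerow([chembl_id, name, year, accession, "; ".join(keywords)])


# Write to an in-memory buffer; a real run would open drugs_targets_keywords.csv.
buffer = io.StringIO()
write_rows(buffer, [("CHEMBL_X", "EXAMPLEDRUG", 2020, "P00000", ["Kinase", "Membrane"])])
output = buffer.getvalue()
```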

🔍 Next Steps & Extensions

  • Configuration: externalize parameters (e.g. cutoff year) into config.yaml
  • Unit Tests: add pytest tests for each client and the end‑to‑end script
  • Logging to File: configure logging.FileHandler for persistent logs
  • Dockerization: wrap the pipeline in a Docker container for reproducibility
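As one example of these extensions, persistent logging could be added along the following lines. This is a sketch only: the helper name `add_file_logging`, the logger name, the log format, and the file name `pipeline.log` are all assumptions rather than part of the current code.

```python
import logging
import os
import tempfile


def add_file_logging(logger, path, level=logging.INFO):
    """Attach a FileHandler to `logger` so records are also written to `path`."""
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setLevel(level)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return handler


# Demonstration in a temporary directory so no project files are touched.
log_path = os.path.join(tempfile.mkdtemp(), "pipeline.log")
logger = logging.getLogger("pipeline_example")
logger.setLevel(logging.INFO)

handler = add_file_logging(logger, log_path)
logger.info("Fetched %d drugs", 3)

# Detach and close the handler so the file is flushed and released.
logger.removeHandler(handler)
handler.close()

with open(log_path, encoding="utf-8") as fh:
    contents = fh.read()
```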
