Automated pipeline that pulls physicochemical properties from the PubChem PUG REST API and visualizes the drug-likeness of a compound library.
Evaluating the drug-likeness of a compound library is the first step in compound prioritization. It involves retrieving conventional physicochemical parameters for each compound (molecular weight (MW), LogP, topological polar surface area (TPSA), and hydrogen-bond donor/acceptor (HBD/HBA) counts) and screening them against Lipinski's Rule of Five and for likely oral bioavailability. Querying these values by hand in public databases at every library update is slow, laborious, and prone to transcription errors, which makes it a major bottleneck during compound prioritization.
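As a rough illustration of the screening step, the sketch below counts Rule of Five violations for a single compound. The Lipinski cut-offs (MW > 500 Da, LogP > 5, HBD > 5, HBA > 10) are the commonly cited ones. The TPSA cut-off of 140 Å² and the tolerance of one violation are assumptions for illustration, and the scripts in this repository may use different values. The names `Descriptors`, `lipinski_violations`, and `passes_screen` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Physicochemical descriptors for one compound (hypothetical container)."""
    mw: float    # molecular weight, Da
    logp: float  # octanol-water partition coefficient (e.g. XLogP)
    hbd: int     # hydrogen-bond donor count
    hba: int     # hydrogen-bond acceptor count
    tpsa: float  # topological polar surface area, square angstroms


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of Five criteria are violated."""
    checks = [d.mw > 500, d.logp > 5, d.hbd > 5, d.hba > 10]
    return sum(checks)


def passes_screen(d: Descriptors, max_violations: int = 1,
                  tpsa_cutoff: float = 140.0) -> bool:
    """Assumed screen: at most one Lipinski violation and TPSA within the cut-off."""
    return lipinski_violations(d) <= max_violations and d.tpsa <= tpsa_cutoff
```

A small, polar-balanced molecule such as `Descriptors(mw=180.16, logp=1.2, hbd=1, hba=4, tpsa=63.6)` has no violations and passes this screen, while a compound that exceeds all four Lipinski limits does not.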
This repository automates the retrieval and initial analysis of physicochemical properties for a compound library. The main script, fetch_pubchem_properties.py, queries the PubChem PUG REST API in two stages: it first resolves each compound name to a CID, then retrieves the properties for that CID. Rate limiting keeps the queries within PubChem's access limits, and retry-with-backoff keeps intermittent failures from becoming permanent ones. The R script, visualize_properties.R, then plots the library against the Lipinski and TPSA thresholds.
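The two-stage lookup with rate limiting and retry-with-backoff can be sketched as follows. This is a minimal illustration, not the code in fetch_pubchem_properties.py: it uses the standard library's `urllib` to stay dependency-free (the repository lists `requests`), and the helper names, the set of retried status codes, the retry count, and the delays are all assumptions. The 0.2 s spacing reflects PubChem's published limit of five requests per second.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PROPS = "MolecularWeight,XLogP,TPSA,HBondDonorCount,HBondAcceptorCount"
MIN_INTERVAL = 0.2                      # seconds between requests (5 per second)
RETRYABLE = (429, 500, 502, 503, 504)   # assumed set of transient HTTP errors


def cid_url(name):
    """Stage 1 URL: resolve a compound name to its CID(s)."""
    return f"{BASE}/compound/name/{urllib.parse.quote(name, safe='')}/cids/JSON"


def property_url(cid):
    """Stage 2 URL: fetch the selected properties for one CID."""
    return f"{BASE}/compound/cid/{cid}/property/{PROPS}/JSON"


def http_get_json(url):
    """Perform one GET request and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def get_with_backoff(url, fetch=http_get_json, retries=4,
                     base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying transient failures with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Non-transient errors (e.g. 404 for an unknown name) are not retried.
            if err.code not in RETRYABLE or attempt == retries:
                raise
        except urllib.error.URLError:
            # Network-level failure (DNS, connection reset, ...).
            if attempt == retries:
                raise
        sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...


def fetch_properties(name, fetch=http_get_json, sleep=time.sleep):
    """Resolve a name to its first CID, then return that CID's property record."""
    cids = get_with_backoff(cid_url(name), fetch=fetch, sleep=sleep)
    cid = cids["IdentifierList"]["CID"][0]
    sleep(MIN_INTERVAL)                   # stay under the request-rate limit
    table = get_with_backoff(property_url(cid), fetch=fetch, sleep=sleep)
    return table["PropertyTable"]["Properties"][0]
```

Passing `fetch` and `sleep` as parameters is a design choice made for this sketch: it lets the retry logic be exercised with a fake fetcher and without real delays or network access.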
| Component | Function |
|---|---|
| Python 3.10+ / requests, pandas | API communication, data parsing, tabular formatting |
| R 4.3+ / ggplot2, dplyr, readr | Visualization of chemical space against drug-likeness thresholds |
| PubChem PUG REST API | Source for physicochemical properties |
pip install -r requirements.txt
python fetch_pubchem_properties.py --input data/mock_data/compound_library_mock.csv --output results/compound_properties.csv
Rscript visualize_properties.R

pubchem-metabolite-descriptor-fetcher/
│
├── data/
│   └── mock_data/
│       └── compound_library_mock.csv   # Example library for testing
│
├── results/                            # Output directory (generated on execution)
│   ├── compound_properties.csv
│   ├── lipinski_scatter.pdf
│   └── tpsa_barplot.pdf
│
├── fetch_pubchem_properties.py # Main data retrieval script
├── visualize_properties.R # Visualization script
├── requirements.txt
├── .gitignore
└── README.md
Run the pipeline against the mock compound library, then visualize:
python fetch_pubchem_properties.py --input data/mock_data/compound_library_mock.csv --output results/compound_properties.csv
Rscript visualize_properties.R results/compound_properties.csv results/
