Accepted at KDD 2026 (Oral)
32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining · Aug 9–13, 2026 · Jeju Island, Republic of Korea
POLO (Preference-guided multi-turn Optimization for Lead Optimization) enables LLMs to learn from complete optimization trajectories rather than isolated steps. This repository contains the implementation of POLO, which introduces Preference-Guided Policy Optimization (PGPO), a novel reinforcement learning algorithm that extracts learning signals at two complementary levels: trajectory-level optimization reinforces successful strategies, while turn-level preference learning provides dense comparative feedback by ranking intermediate molecules within each trajectory.
This implementation is built on top of the RAGEN (Reinforcement learning Agent training ENvironment) framework.
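As a rough illustration of the two signal levels described above, the sketch below derives a trajectory-level score and turn-level preference pairs from the oracle scores of a single trajectory. It is not the repository's PGPO implementation: the function names, the choice of the best score as the trajectory-level signal, and the `margin` threshold are all assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Tuple


def trajectory_return(turn_scores: List[float]) -> float:
    """Trajectory-level signal. Assumption: the best oracle score reached."""
    return max(turn_scores)


def turn_preference_pairs(
    turn_scores: List[float], margin: float = 0.0
) -> List[Tuple[int, int]]:
    """Turn-level signal: (preferred_turn, dispreferred_turn) index pairs.

    Two turns are compared only if their scores differ by more than
    `margin` (a hypothetical threshold to skip near-ties).
    """
    pairs = []
    for i, j in combinations(range(len(turn_scores)), 2):
        diff = turn_scores[i] - turn_scores[j]
        if diff > margin:
            pairs.append((i, j))
        elif -diff > margin:
            pairs.append((j, i))
    return pairs


# Hypothetical oracle scores, one per turn of a single trajectory
scores = [0.41, 0.55, 0.48, 0.72]
best = trajectory_return(scores)
pairs = turn_preference_pairs(scores, margin=0.05)
```

In this example four oracle calls produce six pairwise comparisons, which is the sense in which ranking intermediate molecules gives denser feedback than a single reward per trajectory.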
- Dual-level Learning: Trajectory-level optimization combined with turn-level preference learning
- Sample Efficiency: Achieves an 84% success rate on single-property tasks with only 500 oracle evaluations
- Multi-objective Optimization: Supports both single and multi-property optimization tasks
- Preference-Guided Policy Optimization (PGPO): Novel RL algorithm that fully exploits each oracle evaluation
- Distributed Training: Built on Ray for scalable distributed reinforcement learning
POLO/
├── config/                   # Configuration files
│   ├── base.yaml             # Base training configuration
│   ├── molecule_opt.yaml     # Molecular optimization specific config
│   └── ppo_trainer.yaml      # PPO algorithm configuration
├── data/                     # Sample data
│   └── qed/                  # QED optimization test data
│       └── val/              # Validation dataset
├── ragen/                    # Core implementation
│   ├── env/                  # Environment implementations
│   │   └── molecule_opt/     # Molecular optimization environment
│   ├── llm_agent/            # LLM agent components
│   ├── trainer/              # Training algorithms
│   └── workers/              # Distributed workers
├── train.py                  # Main training script
└── run_molecule_opt.sh       # Example run script
To run molecular optimization training with default settings:
python train.py --config-name molecule_opt

The framework uses Hydra for configuration management. Key parameters can be modified in config/molecule_opt.yaml:
# Task specification (examples)
molecule_opt_task: "qed" # Single objective
molecule_opt_task: "qed+logp" # Multi-objective
molecule_opt_task: "drd2+qed+sa" # Complex multi-objective
# Training parameters
train_size: 128 # Number of training molecules
number_of_gpus: 2 # GPUs for training
trainer:
  total_training_steps: 100 # Training iterations
  save_freq: 20 # Checkpoint frequency

- qed: Drug-likeness (Quantitative Estimate of Drug-likeness) [↑]
- logp: Lipophilicity (LogP) [↑]
- sa: Synthetic Accessibility [↓]
- jnk3: JNK3 inhibition activity [↑]
- drd2: DRD2 (dopamine receptor) activity [↑]
Note: [↑] indicates properties to maximize, [↓] indicates properties to minimize.
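The property names and directions listed above can be captured in a small lookup table. The sketch below shows one way a task string such as "qed+sa" might be split into its properties and objectives. The `DIRECTIONS` table and `parse_task` helper are hypothetical names for illustration, not the parser used in this repository.

```python
from typing import Dict, List, Tuple

# Optimization direction per property, copied from the list above
DIRECTIONS: Dict[str, str] = {
    "qed": "max",
    "logp": "max",
    "sa": "min",
    "jnk3": "max",
    "drd2": "max",
}


def parse_task(task: str) -> List[Tuple[str, str]]:
    """Split a task string like 'qed+sa' into (property, direction) pairs."""
    props = [p.strip().lower() for p in task.split("+")]
    unknown = [p for p in props if p not in DIRECTIONS]
    if unknown:
        raise ValueError(f"Unsupported properties: {unknown}")
    return [(p, DIRECTIONS[p]) for p in props]


objectives = parse_task("drd2+qed+sa")
```

Rejecting unknown names up front is a design choice in this sketch, so that a typo in the task string fails before any oracle evaluations are spent.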
Combine properties using + in the task specification:
- "qed+logp": Optimize both drug-likeness and lipophilicity
- "qed+sa": Optimize drug-likeness while maintaining synthesizability
- "drd2+qed+logp": Target-specific activity with drug-like properties
Input data should be in Parquet format with a smiles column containing initial molecular structures. Example structure:
data/
└── {task}/
    ├── train/
    │   └── {task}_train_{size}.parquet
    └── val/
        └── {task}_val_32.parquet
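A minimal sketch of preparing input files in the layout above follows. The helper names are hypothetical, and it assumes that `{size}` in the file name is the number of molecules and that pandas is installed together with a Parquet engine such as pyarrow.

```python
from pathlib import Path
from typing import List


def dataset_path(task: str, split: str, size: int, root: str = "data") -> Path:
    """Build data/{task}/{split}/{task}_{split}_{size}.parquet."""
    return Path(root) / task / split / f"{task}_{split}_{size}.parquet"


def write_smiles_parquet(smiles: List[str], path: Path) -> None:
    """Write the starting molecules to a Parquet file with one 'smiles' column."""
    import pandas as pd  # imported here so the path helper works without pandas

    path.parent.mkdir(parents=True, exist_ok=True)
    pd.DataFrame({"smiles": smiles}).to_parquet(path, index=False)


train_path = dataset_path("qed", "train", 128)
val_path = dataset_path("qed", "val", 32)
```

Calling `write_smiles_parquet(["CCO", "c1ccccc1"], train_path)` would then create the training file. The two SMILES strings here (ethanol and benzene) are placeholders, not data from this repository.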
This implementation builds upon the RAGEN framework. For more details about RAGEN, see:
@misc{ragen,
title={RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning},
author={Zihan Wang and Kangrui Wang and Qineng Wang and Pingyue Zhang and Linjie Li and Zhengyuan Yang and Xing Jin and Kefan Yu and Minh Nhat Nguyen and Licheng Liu and Eli Gottlieb and Yiping Lu and Kyunghyun Cho and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
year={2025},
eprint={2504.20073},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.20073},
}

This project is released under the MIT License.