POLO: Preference-Guided Multi-Turn Reinforcement Learning for Lead Optimization

Accepted at KDD 2026 (Oral)
32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining · Aug 9–13, 2026 · Jeju Island, Republic of Korea

Overview

POLO (Preference-guided multi-turn Optimization for Lead Optimization) enables LLMs to learn from complete optimization trajectories rather than isolated steps. This repository contains the implementation of POLO and its core algorithm, Preference-Guided Policy Optimization (PGPO), a novel reinforcement learning algorithm that extracts learning signals at two complementary levels. Trajectory-level optimization reinforces successful strategies, while turn-level preference learning provides dense comparative feedback by ranking the intermediate molecules within each trajectory.
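To make the turn-level idea concrete, the following is a minimal sketch of how ranking the molecules within one trajectory can yield comparison pairs. It is an illustration only, not the PGPO implementation from this repository: the function name `turn_preference_pairs`, the `margin` parameter, and the example scores are all assumptions made for this sketch.

```python
from itertools import combinations
from typing import List, Tuple


def turn_preference_pairs(scores: List[float], margin: float = 0.0) -> List[Tuple[int, int]]:
    """Return (preferred_turn, rejected_turn) index pairs for one trajectory.

    scores[t] is the oracle score of the molecule proposed at turn t
    (higher is better). `margin` is a hypothetical threshold that drops
    near-ties; it is not taken from the paper.
    """
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] - scores[j] > margin:
            pairs.append((i, j))
        elif scores[j] - scores[i] > margin:
            pairs.append((j, i))
    return pairs


# Made-up scores for a four-turn trajectory (one oracle call per turn).
trajectory_scores = [0.41, 0.55, 0.48, 0.72]
pairs = turn_preference_pairs(trajectory_scores, margin=0.05)
print(pairs)
```

A trajectory with T scored turns can produce up to T(T-1)/2 comparisons, which is one way to read the claim that each oracle evaluation is reused for dense feedback.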

This implementation is built on top of the RAGEN (Reinforcement learning Agent training ENvironment) framework.

Key Features

Dual-level Learning: Trajectory-level optimization combined with turn-level preference learning
Sample Efficiency: Achieves an 84% success rate on single-property tasks with only 500 oracle evaluations
Multi-objective Optimization: Supports both single and multi-property optimization tasks
Preference-Guided Policy Optimization (PGPO): Novel RL algorithm that fully exploits each oracle evaluation
Distributed Training: Built on Ray for scalable distributed reinforcement learning

Directory Structure

POLO/
├── config/                 # Configuration files
│   ├── base.yaml          # Base training configuration
│   ├── molecule_opt.yaml  # Molecular optimization specific config
│   └── ppo_trainer.yaml   # PPO algorithm configuration
├── data/                  # Sample data
│   └── qed/              # QED optimization test data
│       └── val/          # Validation dataset
├── ragen/                 # Core implementation
│   ├── env/              # Environment implementations
│   │   └── molecule_opt/ # Molecular optimization environment
│   ├── llm_agent/        # LLM agent components
│   ├── trainer/          # Training algorithms
│   └── workers/          # Distributed workers
├── train.py              # Main training script
└── run_molecule_opt.sh   # Example run script

Usage

Basic Training

To run molecular optimization training with default settings:

python train.py --config-name molecule_opt

Configuration

The framework uses Hydra for configuration management. Key parameters can be modified in config/molecule_opt.yaml:

# Task specification (set exactly one; alternatives are commented out)
molecule_opt_task: "qed"             # Single objective
# molecule_opt_task: "qed+logp"      # Multi-objective
# molecule_opt_task: "drd2+qed+sa"   # Complex multi-objective

# Training parameters
train_size: 128                    # Number of training molecules
number_of_gpus: 2                   # GPUs for training
trainer:
  total_training_steps: 100        # Training iterations
  save_freq: 20                     # Checkpoint frequency
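
Because the configuration is managed by Hydra, the same keys can usually be overridden on the command line with `key=value` arguments instead of editing the YAML file. The sketch below assembles such a command in Python. It assumes the keys shown above are exposed unchanged in the composed config; `build_train_command` is a hypothetical helper, not part of the repository.

```python
from typing import List


def build_train_command(task: str, total_steps: int, save_freq: int) -> List[str]:
    """Build an argument list for train.py using Hydra `key=value` overrides.

    The keys mirror config/molecule_opt.yaml as shown above. If Hydra rejects
    the unquoted task value, wrap it in quotes inside the override.
    """
    return [
        "python", "train.py",
        "--config-name", "molecule_opt",
        f"molecule_opt_task={task}",
        f"trainer.total_training_steps={total_steps}",
        f"trainer.save_freq={save_freq}",
    ]


cmd = build_train_command("qed+logp", total_steps=100, save_freq=20)
print(" ".join(cmd))

# To launch training from Python instead of a shell:
# import subprocess
# subprocess.run(cmd, check=True)
```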

Supported Properties

qed: Drug-likeness (Quantitative Estimate of Drug-likeness) [↑]
logp: Lipophilicity (LogP) [↑]
sa: Synthetic Accessibility [↓]
jnk3: JNK3 inhibition activity [↑]
drd2: DRD2 (dopamine receptor) activity [↑]

Note: [↑] indicates properties to maximize, [↓] indicates properties to minimize.

Multi-objective Tasks

Combine properties using + in the task specification:

"qed+logp": Optimize both drug-likeness and lipophilicity
"qed+sa": Optimize drug-likeness while maintaining synthesizability
"drd2+qed+logp": Target-specific activity with drug-like properties
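
The task string and the direction markers above can be read together as follows. This is a minimal sketch of one way to parse a task specification, not the repository's parser: `PROPERTY_DIRECTIONS`, `parse_task`, and `signed_gains` are hypothetical names, and the property values in the example are made up.

```python
from typing import Dict, List, Tuple

# Direction of improvement for each supported property, as listed above:
# +1 means maximize, -1 means minimize.
PROPERTY_DIRECTIONS: Dict[str, int] = {
    "qed": +1,
    "logp": +1,
    "sa": -1,
    "jnk3": +1,
    "drd2": +1,
}


def parse_task(task: str) -> List[Tuple[str, int]]:
    """Split a task string such as "drd2+qed+sa" into (property, direction) pairs."""
    names = [name.strip().lower() for name in task.split("+")]
    unknown = [name for name in names if name not in PROPERTY_DIRECTIONS]
    if unknown:
        raise ValueError(f"Unsupported properties: {unknown}")
    return [(name, PROPERTY_DIRECTIONS[name]) for name in names]


def signed_gains(task: str, before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Per-property change, signed so that a positive value is always an improvement."""
    return {
        name: direction * (after[name] - before[name])
        for name, direction in parse_task(task)
    }


print(parse_task("drd2+qed+sa"))
gains = signed_gains("qed+sa", {"qed": 0.5, "sa": 4.0}, {"qed": 0.75, "sa": 3.0})
print(gains)
```

Flipping the sign for minimized properties lets downstream code treat every property uniformly: a lower `sa` and a higher `qed` both show up as positive gains.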

Data Format

Input data should be in Parquet format with a smiles column containing the initial molecules as SMILES strings. Expected directory layout:

data/
└── {task}/
    ├── train/
    │   └── {task}_train_{size}.parquet
    └── val/
        └── {task}_val_32.parquet
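
As a rough illustration of this layout, the sketch below derives the expected file paths for a task and writes a list of SMILES strings to Parquet. The helper names `dataset_paths` and `write_smiles_parquet` are hypothetical, the example molecules are placeholders, and writing requires pandas with a Parquet engine such as pyarrow.

```python
from pathlib import Path
from typing import Dict, List


def dataset_paths(task: str, train_size: int, root: str = "data") -> Dict[str, Path]:
    """Return the train/val Parquet paths that follow the layout shown above."""
    base = Path(root) / task
    return {
        "train": base / "train" / f"{task}_train_{train_size}.parquet",
        # The validation file name uses a fixed size of 32 in the layout above.
        "val": base / "val" / f"{task}_val_32.parquet",
    }


def write_smiles_parquet(smiles: List[str], path: Path) -> None:
    """Write SMILES strings to a Parquet file with a single `smiles` column."""
    import pandas as pd  # imported lazily; needs pandas plus pyarrow or fastparquet

    path.parent.mkdir(parents=True, exist_ok=True)
    pd.DataFrame({"smiles": smiles}).to_parquet(path, index=False)


paths = dataset_paths("qed", train_size=128)
print(paths["train"].as_posix())
print(paths["val"].as_posix())

# Example (placeholder molecules: ethanol and benzene):
# write_smiles_parquet(["CCO", "c1ccccc1"], paths["train"])
```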

Acknowledgments

This implementation builds upon the RAGEN framework. For more details about RAGEN, see:

@misc{ragen,
      title={RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning}, 
      author={Zihan Wang and Kangrui Wang and Qineng Wang and Pingyue Zhang and Linjie Li and Zhengyuan Yang and Xing Jin and Kefan Yu and Minh Nhat Nguyen and Licheng Liu and Eli Gottlieb and Yiping Lu and Kyunghyun Cho and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
      year={2025},
      eprint={2504.20073},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.20073}, 
}

License

This project is released under the MIT License.
