REAL-Lab-NU/POLO

POLO: Preference-Guided Multi-Turn Reinforcement Learning for Lead Optimization

Accepted at KDD 2026 (Oral)
32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining · Aug 9–13, 2026 · Jeju Island, Republic of Korea

Overview

POLO (Preference-guided multi-turn Optimization for Lead Optimization) enables LLMs to learn from complete optimization trajectories rather than isolated steps. This repository contains the implementation of POLO, which introduces Preference-Guided Policy Optimization (PGPO), a novel reinforcement learning algorithm that extracts learning signals at two complementary levels. Trajectory-level optimization reinforces successful strategies, while turn-level preference learning provides dense comparative feedback by ranking the intermediate molecules within each trajectory.
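
To make the turn-level idea concrete, here is a minimal sketch of how the scored molecules in one trajectory could be turned into preference pairs. It is an illustration under stated assumptions, not the PGPO implementation in this repository: the function name `turn_level_preference_pairs`, the `margin` parameter, and the example scores are all hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def turn_level_preference_pairs(
    scores: List[float], margin: float = 0.0
) -> List[Tuple[int, int]]:
    """Return (preferred_turn, rejected_turn) index pairs for one trajectory.

    `scores[t]` is the oracle score of the molecule proposed at turn t,
    with higher meaning better. Pairs whose score gap does not exceed
    `margin` are skipped because they carry no clear preference.
    """
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        gap = scores[i] - scores[j]
        if gap > margin:
            pairs.append((i, j))
        elif -gap > margin:
            pairs.append((j, i))
    return pairs


# Hypothetical QED scores for a four-turn trajectory (not real results).
trajectory_scores = [0.41, 0.55, 0.48, 0.72]
pairs = turn_level_preference_pairs(trajectory_scores)
print(pairs)
```

In this sketch, a trajectory with T scored molecules yields up to T(T-1)/2 comparisons, which is one way to read the claim that each oracle evaluation contributes more than a single scalar reward.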

This implementation is built on top of the RAGEN (Reinforcement learning Agent training ENvironment) framework.

Key Features

  • Dual-level Learning: Trajectory-level optimization combined with turn-level preference learning
  • Sample Efficiency: Achieves an 84% success rate on single-property tasks with only 500 oracle evaluations
  • Multi-objective Optimization: Supports both single and multi-property optimization tasks
  • Preference-Guided Policy Optimization (PGPO): Novel RL algorithm that fully exploits each oracle evaluation
  • Distributed Training: Built on Ray for scalable distributed reinforcement learning

Directory Structure

POLO/
├── config/                 # Configuration files
│   ├── base.yaml          # Base training configuration
│   ├── molecule_opt.yaml  # Molecular optimization specific config
│   └── ppo_trainer.yaml   # PPO algorithm configuration
├── data/                  # Sample data
│   └── qed/              # QED optimization test data
│       └── val/          # Validation dataset
├── ragen/                 # Core implementation
│   ├── env/              # Environment implementations
│   │   └── molecule_opt/ # Molecular optimization environment
│   ├── llm_agent/        # LLM agent components
│   ├── trainer/          # Training algorithms
│   └── workers/          # Distributed workers
├── train.py              # Main training script
└── run_molecule_opt.sh   # Example run script

Usage

Basic Training

To run molecular optimization training with default settings:

python train.py --config-name molecule_opt

Configuration

The framework uses Hydra for configuration management. Key parameters can be modified in config/molecule_opt.yaml:

# Task specification (set exactly one; alternatives are shown commented out)
molecule_opt_task: "qed"             # Single objective
# molecule_opt_task: "qed+logp"      # Multi-objective
# molecule_opt_task: "drd2+qed+sa"   # Complex multi-objective

# Training parameters
train_size: 128                    # Number of training molecules
number_of_gpus: 2                   # GPUs for training
trainer:
  total_training_steps: 100        # Training iterations
  save_freq: 20                     # Checkpoint frequency
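
Because the framework uses Hydra, the same parameters can usually be set per run with `key=value` overrides instead of editing the YAML file. The sketch below builds such a command line from a nested dictionary. The helper name `hydra_overrides` is hypothetical, and the sketch assumes the keys sit at the same paths in the composed config as in the snippet above, which has not been checked against the repository.

```python
from typing import Any, Dict, List


def hydra_overrides(params: Dict[str, Any], prefix: str = "") -> List[str]:
    """Flatten a nested dict into Hydra-style `a.b=value` override strings."""
    overrides = []
    for key, value in params.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so nested sections become dotted paths.
            overrides.extend(hydra_overrides(value, dotted))
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


# Values mirror the configuration snippet above.
settings = {
    "molecule_opt_task": "qed",
    "train_size": 128,
    "trainer": {"total_training_steps": 100, "save_freq": 20},
}
command = ["python", "train.py", "--config-name", "molecule_opt"]
command += hydra_overrides(settings)
print(" ".join(command))
```

The helper does no quoting or escaping, so values containing spaces or other special characters may need explicit quoting before they are passed to a shell.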

Supported Properties

  • qed: Drug-likeness (Quantitative Estimate of Drug-likeness) [↑]
  • logp: Lipophilicity (LogP) [↑]
  • sa: Synthetic Accessibility [↓]
  • jnk3: JNK3 inhibition activity [↑]
  • drd2: DRD2 (dopamine receptor) activity [↑]

Note: [↑] indicates properties to maximize, [↓] indicates properties to minimize.

Multi-objective Tasks

Combine properties using + in the task specification:

  • "qed+logp": Optimize both drug-likeness and lipophilicity
  • "qed+sa": Optimize drug-likeness while maintaining synthesizability
  • "drd2+qed+logp": Target-specific activity with drug-like properties
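
As an illustration of the task syntax, the following sketch splits a specification on `+` and attaches the optimization direction of each property from the Supported Properties list. The names `PROPERTY_DIRECTION` and `parse_task` are hypothetical and are not taken from the repository's code.

```python
from typing import Dict, List, Tuple

# Direction of each supported property, as listed under Supported Properties.
PROPERTY_DIRECTION: Dict[str, str] = {
    "qed": "max",
    "logp": "max",
    "sa": "min",
    "jnk3": "max",
    "drd2": "max",
}


def parse_task(task: str) -> List[Tuple[str, str]]:
    """Split a task string such as "drd2+qed+sa" into (property, direction) pairs."""
    names = [part.strip().lower() for part in task.split("+")]
    unknown = [name for name in names if name not in PROPERTY_DIRECTION]
    if unknown:
        raise ValueError(f"Unsupported properties: {unknown}")
    return [(name, PROPERTY_DIRECTION[name]) for name in names]


objectives = parse_task("drd2+qed+sa")
print(objectives)
```

Validating the specification up front gives a clear error for a misspelled property before any training resources are allocated.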

Data Format

Input data should be stored as Parquet files with a smiles column containing the initial molecular structures. The expected directory layout is:

data/
└── {task}/
    ├── train/
    │   └── {task}_train_{size}.parquet
    └── val/
        └── {task}_val_32.parquet
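
The naming pattern above can be expressed as a small path helper. This is a sketch only: the function name `dataset_paths` is hypothetical, the default validation size of 32 is taken from the example layout, and it is an assumption that the training code resolves files in exactly this way.

```python
from pathlib import Path
from typing import Dict


def dataset_paths(
    task: str, train_size: int, root: str = "data", val_size: int = 32
) -> Dict[str, Path]:
    """Build train/val Parquet paths that follow the layout shown above."""
    task_dir = Path(root) / task
    return {
        "train": task_dir / "train" / f"{task}_train_{train_size}.parquet",
        "val": task_dir / "val" / f"{task}_val_{val_size}.parquet",
    }


paths = dataset_paths("qed", 128)
print(paths["train"].as_posix())
print(paths["val"].as_posix())
```

Checking that a file actually contains a smiles column would additionally require a Parquet reader, which is left out to keep the sketch dependency-free.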

Acknowledgments

This implementation builds upon the RAGEN framework. For more details about RAGEN, see:

@misc{ragen,
      title={RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning}, 
      author={Zihan Wang and Kangrui Wang and Qineng Wang and Pingyue Zhang and Linjie Li and Zhengyuan Yang and Xing Jin and Kefan Yu and Minh Nhat Nguyen and Licheng Liu and Eli Gottlieb and Yiping Lu and Kyunghyun Cho and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
      year={2025},
      eprint={2504.20073},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.20073}, 
}

License

This project is released under the MIT License.
