Reinforcement Learning in Generative Multimodal AI

Introduction

Generative multimodal artificial intelligence (AI) has progressed rapidly in recent years, driven by large-scale pre-training and the emergence of powerful foundation models. These models show strong capabilities in perception, reasoning, and content synthesis, but they are trained predominantly with supervised objectives, which often fail to capture task-specific goals and user intent. Reinforcement learning (RL) has therefore emerged as a key training framework for improving generative multimodal models.

This repository collects research papers on reinforcement learning in generative multimodal AI. We primarily focus on three categories of models:

  • Multimodal understanding models, which perceive and reason over visual inputs and produce natural language responses.
  • Visual generation models, which synthesize visual content conditioned on textual prompts or inputs from other modalities.
  • Unified models, which use a single framework to support both visual understanding and visual generation, accepting multimodal inputs and producing outputs in visual or textual form.

Papers

Autoregression-based RL

Diffusion-based RL