Training a PPO agent to autonomously land a spacecraft using Deep Reinforcement Learning.
- Env: LunarLander-v2 (Gymnasium / Box2D)
- Observation space: 8 continuous values (position, velocity, angle, angular velocity, leg contact)
- Action space: 4 discrete actions (do nothing, fire left, fire main, fire right)
- Goal: Land between the flags with minimal fuel. Score ≥ 200 = solved.
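To make the observation and action layout concrete, here is a small sketch that names the eight observation components and the four action indices. The ordering follows the Gymnasium LunarLander documentation; the `LanderObs` class, the `ACTIONS` tuple, and the `describe` helper are illustrative names, not part of this repository.

```python
from typing import NamedTuple


class LanderObs(NamedTuple):
    """Illustrative (hypothetical) named view of one LunarLander observation."""
    x: float                   # horizontal position
    y: float                   # vertical position
    vx: float                  # horizontal velocity
    vy: float                  # vertical velocity
    angle: float               # lander tilt
    angular_velocity: float
    left_leg_contact: float    # 1.0 when the left leg touches the ground
    right_leg_contact: float   # 1.0 when the right leg touches the ground


# Index in this tuple == discrete action id passed to env.step().
ACTIONS = ("do nothing", "fire left engine", "fire main engine", "fire right engine")


def describe(obs, action):
    """Return a one-line summary of an observation/action pair."""
    o = LanderObs(*obs)
    legs = int(o.left_leg_contact) + int(o.right_leg_contact)
    return f"{ACTIONS[action]} at ({o.x:+.2f}, {o.y:+.2f}), {legs} leg(s) down"


example = describe([0.0, 1.4, 0.1, -0.5, 0.02, 0.0, 0.0, 0.0], 2)
print(example)
```
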
PPO is a policy gradient method that clips its surrogate objective to prevent destructively large policy updates, which keeps training stable. It supports both discrete and continuous action spaces; this environment uses discrete actions.
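The clipping idea can be sketched for a single sample in plain Python. This is a minimal illustration, not the library's implementation: `ratio` is the new-to-old policy probability ratio, `advantage` is the estimated advantage, and `clip_range=0.2` is assumed here because it is the Stable Baselines3 default (the value is not listed in this README).

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage, ratio far above 1 + eps: the gain is capped,
# so there is no incentive to move the policy even further.
capped_gain = clipped_surrogate(1.5, 2.0)

# Negative advantage, ratio above 1 + eps: the unclipped (worse) value is kept,
# so the update is still penalised in full.
full_penalty = clipped_surrogate(1.5, -1.0)

# Ratio inside the clip range: the objective is just ratio * advantage.
unclipped = clipped_surrogate(1.1, 2.0)

print(capped_gain, full_penalty, unclipped)
```

Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic bound: improvements are capped, while deteriorations are not.
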
| Hyperparameter | Value |
|---|---|
| Learning rate | 3e-4 |
| Timesteps | 500,000 |
| Batch size | 64 |
| Gamma (discount) | 0.999 |
| GAE Lambda | 0.98 |
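The table maps directly onto keyword arguments of the Stable Baselines3 `PPO` class. The sketch below shows one way the training script might be wired up; the `train` function, the `save_path` default, and the `MlpPolicy` choice are assumptions, since the contents of `train.py` are not shown here.

```python
# Hyperparameters from the table, keyed by SB3's PPO argument names.
PPO_KWARGS = {
    "learning_rate": 3e-4,
    "batch_size": 64,
    "gamma": 0.999,
    "gae_lambda": 0.98,
}
TOTAL_TIMESTEPS = 500_000

# A discount of 0.999 corresponds to an effective horizon of roughly
# 1 / (1 - gamma) = 1000 steps.
EFFECTIVE_HORIZON = 1.0 / (1.0 - PPO_KWARGS["gamma"])


def train(env_id="LunarLander-v2", save_path="ppo_lunarlander"):
    """Hypothetical training entry point; the repository's train.py may differ."""
    # Imported lazily so the config can be inspected without the heavy dependencies.
    import gymnasium as gym
    from stable_baselines3 import PPO

    env = gym.make(env_id)
    model = PPO("MlpPolicy", env, verbose=1, **PPO_KWARGS)
    model.learn(total_timesteps=TOTAL_TIMESTEPS)
    model.save(save_path)  # assumed file name, written as ppo_lunarlander.zip
    return model
```

Calling `train()` runs the full 500,000 timesteps; any PPO argument not listed in the table is left at its library default in this sketch.
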
Agent achieves mean reward > 200 after ~300k timesteps, consistently landing successfully.
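A check against the 200-point threshold could look like the sketch below. It is an assumption about how evaluation might be done, not the contents of `evaluate.py`: the `is_solved` and `evaluate` helpers, the model path, and the episode count are all hypothetical, while `evaluate_policy` and `Monitor` are existing Stable Baselines3 utilities.

```python
from statistics import mean

SOLVED_THRESHOLD = 200.0  # "Score >= 200 = solved"


def is_solved(episode_rewards, threshold=SOLVED_THRESHOLD):
    """True when the mean episode reward reaches the threshold."""
    return mean(episode_rewards) >= threshold


def evaluate(model_path="ppo_lunarlander", n_episodes=10):
    """Hypothetical evaluation helper; returns the per-episode rewards."""
    # Imported lazily so is_solved() can be used without the heavy dependencies.
    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.evaluation import evaluate_policy
    from stable_baselines3.common.monitor import Monitor

    env = Monitor(gym.make("LunarLander-v2"))
    model = PPO.load(model_path)
    rewards, _lengths = evaluate_policy(
        model,
        env,
        n_eval_episodes=n_episodes,
        deterministic=True,
        return_episode_rewards=True,
    )
    return rewards


print(is_solved([210.0, 250.0, 190.0]))
```

With a trained model on disk, `is_solved(evaluate())` gives a quick pass/fail answer; averaging over more episodes gives a steadier estimate.
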

    pip install stable-baselines3==2.3.2 "gymnasium[box2d]==0.29.1"
    python train.py          # Train the agent (~15-20 mins)
    python evaluate.py       # Evaluate trained model
    python plot_results.py   # Plot reward curve

- Python 3.11
- PyTorch
- Stable Baselines3
- Gymnasium (Box2D)
- Matplotlib

