Intelligent, Fault-Tolerant Routing for 2D Torus Optical Interconnects
Next-generation data centres and High-Performance Computing (HPC) environments rely heavily on scalable, high-speed optical interconnects. The 2D Torus topology is widely adopted due to its symmetric structure and short average hop counts. However, traditional deterministic routing algorithms (like XY routing) struggle to handle dynamic congestion, non-uniform traffic workloads, and transient link failures, leading to severe performance bottlenecks.
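To make the limitation concrete, here is a minimal sketch of dimension-ordered (XY) routing on a square torus. The function names and the (x, y) coordinate convention are illustrative assumptions and are not taken from the repository. Note that the chosen hop depends only on the source and destination, never on buffer occupancy or link state, which is why this style of routing cannot react to congestion or failures.

```python
def xy_next_hop(src, dst, size):
    """Return the next (x, y) node on a dimension-ordered path in a size x size torus."""
    (x, y), (dx, dy) = src, dst
    if x != dx:
        # Resolve the X dimension first, taking the shorter way around the ring.
        step = 1 if (dx - x) % size <= size // 2 else -1
        return ((x + step) % size, y)
    if y != dy:
        # Only once X matches do we move along Y, again the shorter way round.
        step = 1 if (dy - y) % size <= size // 2 else -1
        return (x, (y + step) % size)
    return src  # already at the destination


def xy_path(src, dst, size):
    """Follow xy_next_hop until the destination is reached."""
    path = [src]
    while path[-1] != dst:
        path.append(xy_next_hop(path[-1], dst, size))
    return path
```

On a 4x4 torus, a packet from (0, 0) to (3, 1) uses the wraparound link in X and then a single hop in Y, regardless of how loaded those links are.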
GRL-Torus is an open-source research framework that leverages the power of Artificial Intelligence to solve this problem. It introduces a novel, adaptive routing algorithm combining two powerful Machine Learning paradigms:
- GraphSAGE Encoder: A Graph Neural Network (GNN) that learns rich, structural node embeddings to capture the global topology and real-time congestion state of the network.
- Dueling DQN Agent: A Deep Reinforcement Learning (DRL) agent that uses the GNN embeddings to make optimal, hop-by-hop routing decisions, dynamically balancing load and minimising packet latency.
Key features:
- Cycle-Accurate Discrete-Event Simulator: Built on SimPy, the simulator provides high-fidelity modelling of 2D Torus networks, tracking directional buffers, link propagation delays, and virtual channels.
- Advanced ML Architecture: A two-stage architecture featuring supervised pre-training (with Dijkstra labels) followed by Reinforcement Learning fine-tuning using Prioritised Experience Replay (PER).
- Traffic & Fault Generation: Evaluate models against Uniform, Hotspot, and Adversarial traffic patterns, with configurable transient link/node failure injection.
- Robust Baselines: Compare GRL against standard XY routing, Odd-Even (turn models), and Valiant Load Balancing.
- Stunning React Visualiser: An interactive, glassmorphic web dashboard built with Vite + React to visualise real-time packet routing and congestion across the Torus grid.
- Statistical Rigour: Automated generation of Mann-Whitney U tests, Welch's t-tests, and Cohen's d effect sizes, alongside publication-ready Matplotlib figures.
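As a rough illustration of two of the statistics named in the last bullet, the sketch below computes Cohen's d (pooled standard deviation form) and Welch's t statistic with its Welch-Satterthwaite degrees of freedom, using only the standard library. The function names are hypothetical, and the repository's own scripts may well rely on SciPy or another implementation instead.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    # Squared standard errors of the two sample means.
    sa, sb = statistics.variance(a) / na, statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, df
```

Both functions take two lists of per-seed measurements (for example, average latency of two routers over the same seeds). Unlike Student's t-test, Welch's variant does not assume the two groups share a variance.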
GRL-Torus employs a hybrid architecture where the GNN acts as the "eyes" of the network, and the DQN acts as the "brain".
graph LR
A["Network State<br/>(Buffers, Links)"] --> B["GraphSAGE Encoder<br/>(Extracts 64-dim Embeddings)"]
B --> C["Dueling DQN Agent<br/>(Evaluates Action Q-Values)"]
C --> D["Routing Decision<br/>(N, S, E, W, or Hold)"]
D --> A
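The last two stages of the loop above can be sketched in a few lines. A dueling head splits its estimate into a state value V(s) and per-action advantages A(s, a), then recombines them; the mean-subtracted form Q = V + (A - mean(A)) used here is the common formulation and is assumed, not confirmed, to match the repository. In a real model V and A would be produced by network heads fed with the 64-dim embedding; here they are plain numbers, and all names are illustrative.

```python
ACTIONS = ["N", "S", "E", "W", "Hold"]


def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) as Q = V + (A - mean(A))."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def greedy_action(q_values):
    """Pick the action with the largest Q-value (first one wins on ties)."""
    best = max(range(len(q_values)), key=q_values.__getitem__)
    return ACTIONS[best]


# Made-up numbers purely for illustration.
q = dueling_q_values(1.0, [0.5, -0.5, 2.0, 0.0, -2.0])
choice = greedy_action(q)
```

Subtracting the mean advantage keeps V and A identifiable: without it, a constant could be shifted between the two streams without changing Q.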
For an in-depth breakdown of the ML models, refer to our Model Specification Documentation.
The GRL-Torus framework has been thoroughly evaluated across multiple grid sizes (4x4, 8x8), traffic patterns, and failure scenarios.
GRL dynamically adapts to congestion, achieving lower average latency than Valiant load balancing and the Odd-Even turn model, especially under non-uniform traffic workloads (like Hotspot).

By intelligently avoiding congested links and minimising dropped packets, GRL sustains significantly higher throughput than deterministic baselines like XY routing.

GRL drastically reduces packet loss under Hotspot and Adversarial traffic patterns, showing that it can steer traffic away from critical choke points.

When subjected to a 10% random link failure rate, the GRL router routes packets around dead links and suffers less latency degradation than standard adaptive algorithms.
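For readers who want to picture the failure scenario, the sketch below builds the link set of a square torus and fails a fixed fraction of links with a seeded random generator. It makes two loud assumptions: that "10% failure rate" means a fixed fraction of links rather than a per-link probability, and that failures are permanent for the run rather than transient. The helper names are hypothetical and do not come from the repository.

```python
import random


def torus_links(size):
    """Undirected links of a size x size torus, each stored once as a frozenset."""
    links = set()
    for x in range(size):
        for y in range(size):
            links.add(frozenset({(x, y), ((x + 1) % size, y)}))  # east neighbour
            links.add(frozenset({(x, y), (x, (y + 1) % size)}))  # north neighbour
    return links


def inject_link_failures(links, failure_rate, seed):
    """Return (alive, failed) after failing a fixed fraction of links."""
    rng = random.Random(seed)
    # Sort first so the same seed always selects the same links.
    ordered = sorted(links, key=lambda link: sorted(link))
    n_failed = round(failure_rate * len(ordered))
    failed = set(rng.sample(ordered, n_failed))
    return links - failed, failed
```

A 4x4 torus has 32 links, so a 10% rate under these assumptions removes 3 of them; reusing the seed reproduces the same failed set across routers, which keeps comparisons fair.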

- Python 3.10+
- Node.js 18+ (for the visualiser)
# Clone the repository
git clone https://github.com/Ansh/grl-torus.git
cd grl-torus
# Install Python dependencies
pip install -r requirements.txt
# Install the visualiser's Node.js dependencies
cd demo
npm install

The repository is built to be fully reproducible. You can run individual scripts or execute the entire pipeline from scratch.
First, we train the GraphSAGE model using Dijkstra-optimal routing labels. This gives the RL agent a "warm start" by teaching it the fundamental graph structure.
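To give a feel for what a Dijkstra-optimal label might look like, the sketch below runs Dijkstra from a destination over a square torus and records, for every other node, the direction of a least-cost next hop. The direction convention (N increases y, E increases x), the unit link cost, and all function names are assumptions made for illustration; the repository's label format may differ.

```python
import heapq

DIRECTIONS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def unit_cost(u, v):
    """Every link costs 1; swap in a congestion-aware cost if desired."""
    return 1.0


def neighbours(node, size):
    """Yield (direction, neighbour) pairs with torus wraparound."""
    x, y = node
    for name, (dx, dy) in DIRECTIONS.items():
        yield name, ((x + dx) % size, (y + dy) % size)


def distances_to(dst, size, link_cost=unit_cost):
    """Dijkstra from dst; costs are assumed symmetric, so these are distances to dst."""
    dist = {dst: 0.0}
    heap = [(0.0, dst)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for _, v in neighbours(u, size):
            nd = d + link_cost(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def optimal_labels(dst, size, link_cost=unit_cost):
    """Map each node to the direction of a least-cost next hop toward dst."""
    dist = distances_to(dst, size, link_cost)
    labels = {}
    for node in dist:
        if node != dst:
            labels[node] = min(
                neighbours(node, size),
                key=lambda nv: link_cost(node, nv[1]) + dist[nv[1]],
            )[0]
    return labels
```

Each (node, destination, label) triple could then serve as one supervised example. Where several next hops tie, this sketch keeps the first in E, W, N, S order, which is an arbitrary choice.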
python scripts/train_gnn.py --grid-size 4 --epochs 50 --batch-size 32

Next, train the RL agent using Prioritised Experience Replay. You can choose to freeze the GNN weights or joint-train both networks.
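As background for this step, here is a minimal list-based sketch of proportional Prioritised Experience Replay: transitions are sampled with probability proportional to priority^alpha, and importance-sampling weights correct the resulting bias. The class name, the alpha, beta, and eps values, and the choice to normalise weights by the batch maximum are all illustrative assumptions; the repository's replay buffer is likely to differ (for instance by using a sum-tree for faster sampling).

```python
import random


class SimplePER:
    """List-based proportional prioritised replay (no sum-tree, O(N) sampling)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-3, seed=0):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []
        self.pos = 0
        self.rng = random.Random(seed)

    def add(self, transition):
        # New transitions get the current max priority so they are likely sampled soon.
        priority = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(priority)
        else:
            # Buffer full: overwrite the oldest slot.
            self.data[self.pos] = transition
            self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        n = len(self.data)
        idx = self.rng.choices(range(n), weights=probs, k=batch_size)
        # Importance-sampling weights, normalised by the batch maximum.
        weights = [(n * probs[i]) ** -beta for i in idx]
        max_w = max(weights)
        return idx, [self.data[i] for i in idx], [w / max_w for w in weights]

    def update(self, indices, td_errors):
        # Priority tracks the magnitude of the latest TD error.
        for i, err in zip(indices, td_errors):
            self.priorities[i] = abs(err) + self.eps
```

In a training loop, the returned weights would scale each sample's loss, and `update` would be called with the fresh TD errors after every gradient step.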
python scripts/train_dqn.py --grid-size 4 --episodes 500 --gnn-checkpoint results/checkpoints/gnn_best_4x4.pt

Evaluate the trained models against baselines across different traffic patterns and failure rates. This generates a consolidated CSV of raw metrics.
python scripts/run_experiments.py --topologies 4 8 --routers xy odd_even valiant gnn grl --traffic uniform hotspot adversarial --seeds 0 1 2 3 4

Run statistical tests (Mann-Whitney U, Welch's t-test) and generate publication-ready Matplotlib/Seaborn charts.
python scripts/run_statistical_tests.py
python scripts/generate_figures.py --format both
python scripts/run_baseline_comparison.py

Tip (One-Click Automation): You can run the entire pipeline (Training → Experiments → Analysis → Figures) automatically by executing python scripts/orchestrate_all.py.
We have built a glassmorphic React dashboard to visualise packet routing in real time. You can watch the agent route packets, avoid congestion, and react to link failures live.
cd demo
npm run dev

Open http://localhost:5173 in your browser to interact with the dashboard.
grl-torus/
├── conf/              # Hydra configuration files (YAML)
├── demo/              # React UI Dashboard (Vite + React + TS)
├── docs/              # Detailed documentation (Algorithms, Models, Setup)
├── results/           # Checkpoints, Logs, CSVs, Figures, LaTeX Tables
├── scripts/           # CLI entry points (train, evaluate, viz, orchestrate)
├── src/               # Core Python Packages
│   ├── experiments/   # Metric collection and Grid Runner
│   ├── models/        # GNN (GraphSAGE), DQN, Replay Buffer (PER)
│   ├── routers/       # Baselines (XY, Odd-Even) and ML Routers (GRL)
│   ├── sim/           # SimPy Engine, Packets, Torus Graph, Traffic Gen
│   └── viz/           # Matplotlib plotting scripts and Table generators
├── tests/             # Pytest Unit & Integration tests
└── requirements.txt   # Python dependencies
For more detailed technical documentation, please explore the docs/ folder.
Ansh - Vellore Institute of Technology (Chennai)
School of Electronics Engineering
Target Venue: IEEE Networking Letters 2026