Geometry-First Visual Intelligence: using differential geometry instead of raw pixels for hand gesture recognition.
Paper: "Geometry-First Visual Intelligence: Deep Geometric Networks and Quantum Geometric Networks for Gesture Recognition"
Author: Amit Rana, Independent Researcher, Santa Clara, CA
ORCID: 0009-0008-5998-6560
Preprint: https://doi.org/10.5281/zenodo.19842048
Standard gesture recognition feeds raw video frames into deep neural networks, so the system learns pixel statistics with no explicit geometric understanding. The Deep Geometric Network (DGN) takes the opposite approach.
Before any neural network sees the data, we compute explicit differential-geometric properties directly from hand landmarks:
| Feature Group | Dimensions | What It Captures |
|---|---|---|
| Ricci curvature of finger trajectories | 32-D | How sharply each joint bends over time |
| Bézier motion arc coefficients | 32-D | The shape of the path each fingertip traces |
| Joint angular velocities | 32-D | How fast each joint rotates |
| Skeletal topological weights | 32-D | Connectivity structure of the hand graph |
| Total | 128-D | Per-frame geometric snapshot |
These 128 scalar values are not learned; they follow from mathematical definitions. Every dimension is named and interpretable. A logistic regression on these features alone reaches 52.61% validation accuracy across 27 gesture classes, about 14× the 3.7% expected from random guessing, with no deep learning at all.
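Because every dimension is named, the 128-D vector can be sliced by feature group. The sketch below assumes the four groups are stored contiguously in the order of the table above. That ordering, and the names `FEATURE_GROUPS` and `split_features`, are illustrative assumptions and are not taken from the repository.

```python
# Hypothetical layout: four contiguous 32-D groups, in the table's order.
FEATURE_GROUPS = {
    "ricci_curvature": slice(0, 32),
    "bezier_coefficients": slice(32, 64),
    "angular_velocity": slice(64, 96),
    "topological_weights": slice(96, 128),
}


def split_features(frame_vector):
    """Split one 128-D per-frame vector into its named 32-D groups."""
    if len(frame_vector) != 128:
        raise ValueError(f"expected 128 values, got {len(frame_vector)}")
    return {name: list(frame_vector[s]) for name, s in FEATURE_GROUPS.items()}


frame = [float(i) for i in range(128)]  # placeholder values, not real features
groups = split_features(frame)
print({name: len(values) for name, values in groups.items()})
```

With a layout like this, a downstream rule can refer to a group by name instead of by a raw index range.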
Adding a temporal encoder (BiLSTM or Mamba SSM) over 36-frame sequences raises accuracy to 65.69% (BiLSTM) and 65.77% (Mamba), competitive with skeleton-based state-of-the-art methods, while operating on a representation about 5,000× more compact than raw video.
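As a rough picture of what a temporal encoder over these features could look like, here is a minimal PyTorch sketch of a bidirectional LSTM classifier. The class name `BiLSTMClassifier`, the hidden size of 64, and the mean pooling over time are assumptions made for illustration; the architecture and hyperparameters used in the paper may differ, and the Mamba variant is not sketched here.

```python
import torch
from torch import nn


class BiLSTMClassifier(nn.Module):
    """Bidirectional LSTM over per-frame geometric features (illustrative)."""

    def __init__(self, feature_dim=128, hidden_dim=64, num_classes=27):
        super().__init__()
        self.lstm = nn.LSTM(
            feature_dim, hidden_dim, batch_first=True, bidirectional=True
        )
        # Forward and backward states are concatenated, hence 2 * hidden_dim.
        self.head = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, frames, feature_dim)
        outputs, _ = self.lstm(x)     # (batch, frames, 2 * hidden_dim)
        pooled = outputs.mean(dim=1)  # average over the time axis
        return self.head(pooled)      # (batch, num_classes) logits


model = BiLSTMClassifier()
dummy = torch.randn(4, 36, 128)  # random stand-in for 4 sequences of 36 frames
logits = model(dummy)
print(logits.shape)  # torch.Size([4, 27])
```

The input is only 36 × 128 values per clip, which is why a model of this size is cheap to train compared with one that consumes raw frames.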
Static classifiers:

| Model | Validation Accuracy |
|---|---|
| Logistic Regression | 52.61% |
| Neural Network Baseline | 61.70% |
| DGN — geometric MLP | 61.64% |

Temporal encoders over 36-frame sequences:

| Model | Validation Accuracy |
|---|---|
| DGN + Transformer | 60.28% |
| DGN + bidirectional LSTM | 65.69% |
| DGN + Mamba (best) | 65.77% |
Dataset: Jester — 148,092 videos, 27 gesture classes.
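To put the reported accuracies in context, the snippet below compares them with the chance level for 27 classes. It assumes uniform guessing over balanced classes, which is a simplification; the accuracies are copied from the tables above.

```python
NUM_CLASSES = 27
chance = 100.0 / NUM_CLASSES  # uniform-guess accuracy in percent

# Validation accuracies as reported in the tables above.
reported = {
    "Logistic Regression": 52.61,
    "DGN + bidirectional LSTM": 65.69,
    "DGN + Mamba": 65.77,
}

ratios = {name: acc / chance for name, acc in reported.items()}
print(f"chance: {chance:.2f}%")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x chance")
```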
dgn-gesture-recognition/
├── training/
│ ├── colab_temporal_feature_extraction.ipynb ← Feature extraction pipeline (run on Colab)
│ ├── run_static_classifiers.py ← Table I: logistic regression + MLP baselines
│ ├── run_flattened_mlp.py ← Flat MLP baseline
│ ├── augment_flow_features.py ← 192-D flow-augmented features
│ └── eval_temporal_checkpoints.py ← Checkpoint evaluation utility
├── results/
│ ├── static_results_verified.json ← Table I numbers (verified)
│ └── flow_results.json ← Flow feature ablation results
└── paper/
  ├── generate_paper.py ← Generates IEEE-formatted .docx
  ├── generate_figures.py ← Generates all 4 paper figures
  ├── sections/ ← Paper section text files
  └── figures/ ← Pre-generated PNG figures
Not included in this repo (by design):
- Raw Jester video data (download from the 20BN website)
- Pre-extracted feature NPZ files (~700 MB — too large for GitHub)
- Trained model checkpoints
- Quantum extension (QGN) — described in the paper, not released here
Install the dependencies:
pip install mediapipe opencv-python numpy scipy scikit-learn torch python-docx matplotlib

Open training/colab_temporal_feature_extraction.ipynb in Google Colab.
- Mount your Google Drive and point it at the Jester dataset frames
- Outputs: temporal_ricci_bezier_instance_0.npz, shape (148092, 36, 128)
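Once the notebook has produced the NPZ file, a quick shape check can catch a truncated or mislabeled archive early. The sketch below writes a tiny stand-in file so it runs without the real data. The array key `"features"` is an assumption; print `archive.files` on the real file to see the actual key names.

```python
import os
import tempfile

import numpy as np

# Stand-in file with the same per-video layout but only 5 videos, so the
# sketch runs without the real archive. The key "features" is hypothetical.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "temporal_ricci_bezier_instance_0.npz")
np.savez_compressed(path, features=np.zeros((5, 36, 128), dtype=np.float32))

with np.load(path) as archive:
    keys = archive.files            # names of the arrays stored in the file
    features = archive["features"]  # replace with the real key

num_videos, num_frames, feature_dim = features.shape
if (num_frames, feature_dim) != (36, 128):
    raise ValueError(f"unexpected per-video shape: {features.shape[1:]}")
print(keys, features.shape)
```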
Reproduce the Table I baselines:
python training/run_static_classifiers.py

Generate the paper figures:
cd paper && python generate_figures.py

Generate the paper (.docx):
cd paper && python generate_paper.py

Raw Video Frames
↓
MediaPipe Hand Landmarks (21 keypoints × 3D)
↓
┌──────────────────────────────────────────┐
│ Differential Geometry Extraction │
│ • Ricci curvature (finger trajectories) │
│ • Bézier arc parameterization │
│ • Angular velocities (joint rotations) │
│ • Topological connectivity weights │
└──────────────────────────────────────────┘
↓
128-D Geometric Feature Vector (per frame)
↓
36-Frame Temporal Sequence
↓
BiLSTM / Mamba SSM Encoder
↓
27-Class Gesture Output
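To make the extraction stage of the pipeline above more concrete, here is a minimal sketch of one of the simpler quantities, a joint angle and its angular velocity, computed from three 3D landmarks. The finite-difference formula, the helper names, the made-up coordinates, and the 12 fps frame interval are all assumptions for illustration; the repository's extraction notebook may define these features differently.

```python
import math


def joint_angle(a, b, c):
    """Angle in radians at joint b, between segments b->a and b->c."""
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = math.sqrt(sum(ui * ui for ui in u)) * math.sqrt(sum(vi * vi for vi in v))
    if norm == 0.0:
        raise ValueError("landmarks coincide; angle is undefined")
    # Clamp to guard against rounding just outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def angular_velocity(angle_prev, angle_next, dt):
    """Finite-difference rate of change of a joint angle, in rad/s."""
    return (angle_next - angle_prev) / dt


# Made-up positions for three consecutive index-finger landmarks
# (MediaPipe indices 5, 6, 7: MCP, PIP, DIP) in two successive frames.
frame0 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]  # straight finger
frame1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]  # bent 90 degrees

theta0 = joint_angle(*frame0)
theta1 = joint_angle(*frame1)
omega = angular_velocity(theta0, theta1, dt=1.0 / 12.0)  # 12 fps is assumed
print(theta0, theta1, omega)
```

A straight finger gives an angle of π at the middle joint and a right-angle bend gives π/2, so the velocity here is negative: the joint is closing.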
Every feature has a name and a mathematical definition. Unlike convolutional embeddings, these representations can be inspected, composed with symbolic rules, and reasoned about directly — making DGN a natural front-end for Neuro-Symbolic AI systems.
- Interpretability: "Dimension 7 is the Ricci curvature of the index fingertip trajectory" is a statement that can be verified and composed with other rules. No comparable statement can be made about a single dimension of a 512-D CNN embedding.
- Compactness: 128 scalars versus ~1M+ pixels per frame, about 5,000× smaller. Small enough to run on a CPU.
- Quantum-readiness: Geometric scalars (angles, curvatures) map directly to quantum rotation gate parameters (θ, φ on the Bloch sphere). This is the native input format for variational quantum circuits, with no forced encoding and no information loss. This direction is explored further in the companion paper.
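The quantum-readiness point can be illustrated with standard angle encoding: two scalars in radians set the polar and azimuthal angles of a single-qubit state. The sketch below simulates this with plain complex arithmetic. The function name `encode_qubit`, the choice of which feature plays which angle, and the sample values are assumptions; the source does not state that QGN uses this exact encoding.

```python
import cmath
import math


def encode_qubit(theta, phi):
    """Amplitudes of RZ(phi) RY(theta)|0> for a single qubit."""
    amp0 = cmath.exp(-1j * phi / 2) * math.cos(theta / 2)
    amp1 = cmath.exp(1j * phi / 2) * math.sin(theta / 2)
    return amp0, amp1


# Two made-up geometric scalars, already expressed in radians.
bend_angle = math.pi / 2       # used as the polar angle theta
curvature_angle = math.pi / 4  # used as the azimuthal angle phi

amp0, amp1 = encode_qubit(bend_angle, curvature_angle)
prob1 = abs(amp1) ** 2  # probability of measuring |1>
print(round(prob1, 3))  # 0.5
```

Features that are not naturally angles would first need rescaling into a suitable range, which is a design choice left open here.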
Preprint available on Zenodo. Citation will be updated upon journal publication.
@article{rana2026dgn,
title = {Geometry-First Visual Intelligence: Deep Geometric Networks and
Quantum Geometric Networks for Gesture Recognition},
author = {Rana, Amit},
year = {2026},
doi = {10.5281/zenodo.19842048},
url = {https://doi.org/10.5281/zenodo.19842048},
note = {Preprint, Zenodo}
}
Code released under the MIT License. The paper text and figures are copyright Amit Rana. All rights reserved.