This repository contains a full-stack LLM fine-tuning project where Llama-3.2-1B-Instruct was optimized for high-fidelity instruction following using the Alpaca-Cleaned dataset.
[Link to your Hugging Face Space]
The goal was to fine-tune a lightweight LLM to follow complex instructions while maintaining a minimal memory footprint. Using Unsloth and 4-bit QLoRA, I reduced VRAM usage by 75% and accelerated training by 2x compared to standard LoRA implementations.
- Base Model: Llama-3.2-1B-Instruct
- Fine-Tuning Technique: QLoRA (Quantized Low-Rank Adaptation)
- Quantization: 4-bit NormalFloat (NF4)
- Optimization Library: Unsloth
- Hardware: NVIDIA Tesla T4 GPU (16GB VRAM)
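As a rough sanity check on why 4-bit quantization matters on a 16GB card, the sketch below estimates the memory taken by the base weights alone at 16-bit and 4-bit precision. The parameter count is an assumption (approximately 1.24 billion, not stated in this README), and the estimate ignores quantization constants, activations, LoRA adapters, and optimizer state, so real VRAM usage will be higher.

```python
# Back-of-the-envelope estimate of weight storage at two precisions.
# ASSUMPTION: ~1.24e9 parameters (approximate; not taken from this README).
PARAMS = 1.24e9


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_memory_gb(PARAMS, 16)  # 16-bit baseline
nf4_gb = weight_memory_gb(PARAMS, 4)    # 4-bit storage, ignoring per-block constants
reduction = 1 - nf4_gb / fp16_gb

print(f"16-bit weights: {fp16_gb:.2f} GB")  # 2.48 GB
print(f"4-bit weights:  {nf4_gb:.2f} GB")   # 0.62 GB
print(f"Reduction:      {reduction:.0%}")   # 75%
```

The 75% figure here follows directly from the 4/16 bit ratio and applies to weight storage only.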
| Parameter | Value |
|---|---|
| LoRA Rank (r) | 16 |
| LoRA Alpha | 16 |
| Learning Rate | 2e-4 |
| Batch Size | 2 |
| Gradient Accumulation | 4 |
| Optimizer | AdamW 8-bit |
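A few derived quantities follow from the table above. The sketch below computes the effective batch size, the number of examples consumed over the 60-step run, and the standard LoRA scaling factor (alpha / r). It assumes a single GPU and the conventional alpha / r scaling; the 2048 x 2048 layer shape is hypothetical and used only to illustrate how few parameters a rank-16 adapter adds.

```python
# Quantities derived from the hyperparameter table.
lora_rank = 16
lora_alpha = 16
batch_size = 2
grad_accum_steps = 4
train_steps = 60  # from the training summary in this README

# Examples contributing to each optimizer step (single GPU assumed).
effective_batch = batch_size * grad_accum_steps  # 8

# Total training examples consumed across the run.
examples_seen = effective_batch * train_steps  # 480

# Conventional LoRA scaling applied to the low-rank update.
lora_scaling = lora_alpha / lora_rank  # 1.0


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight.

    The update is B @ A with A of shape (r, d_in) and B of shape (d_out, r).
    """
    return r * (d_in + d_out)


# HYPOTHETICAL layer shape, for illustration only.
adapter = lora_params(2048, 2048, lora_rank)  # 65536
full = 2048 * 2048                            # 4194304

print(effective_batch, examples_seen, lora_scaling)
print(f"Adapter is {adapter / full:.2%} of the full matrix")  # 1.56%
```

With alpha equal to the rank, the adapter update is applied at unit scale.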
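The model was trained for 60 steps. Training cross-entropy loss declined steadily from 2.03 to 1.39, which indicates the model was adapting to the instruction data. Training loss alone cannot rule out overfitting, so the held-out evaluation that follows serves as a separate check.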
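One way to read the loss figures is as perplexity, which is the exponential of the cross-entropy. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats (natural log), which is the usual convention in PyTorch-based trainers but is not stated in this README.

```python
import math

# Reported training loss at the start and end of the run.
start_loss = 2.03
end_loss = 1.39

# ASSUMPTION: loss is mean per-token cross-entropy in nats.
start_ppl = math.exp(start_loss)
end_ppl = math.exp(end_loss)

# Relative drop in loss over the run.
relative_drop = (start_loss - end_loss) / start_loss

print(f"Perplexity: {start_ppl:.2f} -> {end_ppl:.2f}")  # 7.61 -> 4.01
print(f"Loss reduced by {relative_drop:.1%}")           # 31.5%
```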
Evaluated on a held-out test set of 15 samples:
- ROUGE-1: 0.464
- ROUGE-2: 0.262
- ROUGE-L: 0.386 (longest-common-subsequence overlap with the reference response; a common metric for generated text)
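To make the metrics above concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L F1 scores. It is not the evaluation script used in this project: it assumes lowercase whitespace tokenization with no stemming, so its numbers can differ from library implementations. The example sentences are made up for illustration.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Simplification: lowercase and split on whitespace, no stemming.
    return text.lower().split()


def f1(matches: int, n_pred: int, n_ref: int) -> float:
    """Harmonic mean of precision (matches/n_pred) and recall (matches/n_ref)."""
    if matches == 0 or n_pred == 0 or n_ref == 0:
        return 0.0
    precision = matches / n_pred
    recall = matches / n_ref
    return 2 * precision * recall / (precision + recall)


def ngrams(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(pred: str, ref: str, n: int = 1) -> float:
    """ROUGE-N F1: clipped n-gram overlap between prediction and reference."""
    p, r = ngrams(tokenize(pred), n), ngrams(tokenize(ref), n)
    overlap = sum((p & r).values())  # multiset intersection clips counts
    return f1(overlap, sum(p.values()), sum(r.values()))


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(pred: str, ref: str) -> float:
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    p, r = tokenize(pred), tokenize(ref)
    return f1(lcs_length(p, r), len(p), len(r))


# Made-up example pair.
pred = "the cat sat on the mat"
ref = "the cat lay on the mat"
print(round(rouge_n(pred, ref, 1), 3))  # 0.833
print(round(rouge_n(pred, ref, 2), 3))  # 0.6
print(round(rouge_l(pred, ref), 3))     # 0.833
```

Corpus-level scores such as the ones reported above are typically the average of per-sample scores.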
| Prompt | Fine-Tuned Response |
|---|---|
| Explain Recursion to a 5-year old | "Recursion is a super cool way that computers solve big problems by breaking them into smaller, identical pieces..." |
| 3 Healthy Breakfast Ideas | "1. Oatmeal with berries... 2. Greek yogurt parfait... 3. Avocado toast..." |
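Prompts like those in the table are usually wrapped in a template before being passed to the model. The sketch below uses the widely circulated Alpaca prompt format, on the assumption that training followed it; this README does not confirm which template was used, and `build_prompt` is a hypothetical helper rather than a function from this repository.

```python
# Alpaca-style templates. ASSUMPTION: the project may use a different
# template (for example, the model's own chat template).
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str, input_text: str = "") -> str:
    """Hypothetical helper: fill the template, omitting the Input section when empty."""
    if input_text.strip():
        return ALPACA_WITH_INPUT.format(instruction=instruction, input=input_text)
    return ALPACA_NO_INPUT.format(instruction=instruction)


prompt = build_prompt("Explain Recursion to a 5-year old")
print(prompt)
```

The prompt ends at the response marker so that the model's generation supplies the answer.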
The model was exported to GGUF format with Q4_K_M quantization. This allows it to run on consumer CPUs under macOS, Windows, or Linux via Ollama or LM Studio.
File size reduction:
- Original FP16: ~2.5 GB
- Quantized GGUF: ~700 MB
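The reduction implied by these two sizes can be worked out directly. The sketch below uses the approximate figures listed above and treats 1 GB as 1000 MB, so the results are approximate as well.

```python
# Approximate file sizes from the list above (1 GB taken as 1000 MB).
fp16_mb = 2500
gguf_mb = 700

compression_ratio = fp16_mb / gguf_mb  # how many times smaller
size_reduction = 1 - gguf_mb / fp16_mb  # fraction of the original size saved

print(f"Compression ratio: {compression_ratio:.1f}x")  # 3.6x
print(f"Size reduction:    {size_reduction:.0%}")      # 72%
```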
- Clone the repo.
- Install dependencies: `pip install -r requirements.txt`
- Run inference: `python scripts/inference.py`
