Automated hyperparameter search for optimal Gabliteration configurations.
Paper: Gabliteration: Adaptive Multi-Directional Neural Weight Modification
Author: Gökdeniz Gülmez (2025)
This script automates the process of finding optimal Gabliteration parameters by:
- Automatically loading datasets from HuggingFace (mlabonne/harmful_behaviors and mlabonne/harmless_alpaca)
- Testing multiple random parameter configurations
- Evaluating each configuration's effectiveness (refusal rate reduction)
- Measuring model similarity to original (KL divergence)
- Ranking configurations by combined score
- Allowing you to select and save the best version
pip install gabliteration
This installs all dependencies and makes the gabliteration command available system-wide.
The tool automatically downloads these datasets from HuggingFace:
- mlabonne/harmful_behaviors: Harmful prompts for training
- mlabonne/harmless_alpaca: Harmless prompts for comparison
No local files needed!
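To inspect the prompts yourself, a minimal sketch along these lines may help. It assumes the Hugging Face `datasets` package is installed and that both datasets expose a `train` split with a `text` column. The helper name `load_prompts` is hypothetical and is not part of the gabliteration package.

```python
HARMFUL_DATASET = "mlabonne/harmful_behaviors"
HARMLESS_DATASET = "mlabonne/harmless_alpaca"


def load_prompts(name, limit=100, split="train", column="text"):
    """Return up to `limit` prompt strings from a Hugging Face dataset.

    Assumption: the dataset has the given split and column; adjust if not.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(name, split=split)
    # Take only the first `limit` rows, then read the prompt column.
    subset = dataset.select(range(min(limit, len(dataset))))
    return list(subset[column])
```

Calling `load_prompts(HARMFUL_DATASET, limit=5)` downloads the dataset on first use and returns five prompt strings, provided the split and column assumptions hold.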
Test your favorite model with:
# Basic usage
gabliterate --model "Nanbeige/Nanbeige4-3B-Thinking-2511"
# With custom parameters
gabliterate --model "meta-llama/Llama-3.2-1B-Instruct" --num-versions 50 --batch-size 4
# Full options
gabliterate --model "Qwen/Qwen3-4B-Instruct-2507" \
--num-versions 100 \
--test-samples 200 \
--max-tokens 150 \
--batch-size 4 \
--kl-samples 15
CLI Options:
- --model, -m (required): Hugging Face model name or path
- --num-versions, -n: Number of configurations to test (default: 100)
- --test-samples, -t: Test samples for refusal evaluation (default: 100)
- --max-tokens: Max tokens to generate during evaluation (default: 100)
- --batch-size, -b: Batch size for evaluation (default: 2)
- --kl-samples: KL divergence samples (default: 10)
Run gabliteration --help to see all options.
The script will:
- Test each configuration
- Print real-time results:
Testing Version 5/10
Config: Samples: 100, Skip: [2, 1], Layer: 0.52, Scale: 0.65, λ: 0.10, k: 2, Adaptive: True, β: 0.45
KL Divergence: 0.0234
Refusal Rate: 12.0% (12/100)
Score: 1.2234
After all tests, you'll see:
TOP 10 BEST CONFIGURATIONS
Rank Refusal KL Div Score Config
----------------------------------------------------------------------
1 8.0% 0.0189 0.8189 Samples: 150, Skip: [2, 1], ...
2 12.0% 0.0234 1.2234 Samples: 100, Skip: [1, 2], ...
...
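The ranking shown above amounts to sorting results by ascending score. The following is a minimal sketch of that step, not the package's actual code: the function names and the dictionary keys (borrowed from the `results` block of gabliteration_config.json) are assumptions.

```python
def rank_results(results, top_n=10):
    """Sort result dicts by ascending score (lower is better) and keep top_n."""
    return sorted(results, key=lambda r: r["score"])[:top_n]


def format_table(ranked):
    """Render ranked results as a fixed-width text table."""
    lines = [f"{'Rank':<6}{'Refusal':<10}{'KL Div':<10}{'Score':<10}"]
    for rank, r in enumerate(ranked, start=1):
        lines.append(
            f"{rank:<6}{r['refusal_rate']:<10.1%}"
            f"{r['kl_divergence']:<10.4f}{r['score']:<10.4f}"
        )
    return "\n".join(lines)


# Two results taken from the sample output in this README.
results = [
    {"refusal_rate": 0.12, "kl_divergence": 0.0234, "score": 1.2234},
    {"refusal_rate": 0.08, "kl_divergence": 0.0189, "score": 0.8189},
]
ranked = rank_results(results)
print(format_table(ranked))
```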
After all tests complete, the script automatically:
- Selects the best configuration (lowest combined score)
- Recreates and saves the gabliterated model
- Saves all configuration details in gabliteration_config.json
- Generates a model-specific README.md
Qwen_Qwen3-4B-Instruct-2507-gabliterated-v1-20250102_143022/
├── config.json # Model config
├── model.safetensors # Model weights
├── tokenizer.json # Tokenizer
├── tokenizer_config.json # Tokenizer config
└── gabliteration_config.json # ⭐ Gabliteration parameters & results
The gabliteration_config.json contains:
{
"model_name": "Qwen/Qwen3-4B-Instruct-2507",
"version_id": 1,
"timestamp": "20250102_143022",
"gabliteration_config": {
"num_prompt_samples": 150,
"skip_begin_layers": 2,
"skip_end_layers": 1,
"layer_fraction": 0.52,
"base_scale_factor": 0.65,
"regularization": 0.1,
"n_directions": 2,
"adaptive_layer_scale": true,
"beta": 0.5
},
"results": {
"kl_divergence": 0.0189,
"refusal_rate": 0.08,
"score": 0.8189
},
"all_results": [...] // Full results from all tested versions
}
Refusal Rate:
- What: Percentage of test prompts that trigger refusal responses
- Lower is better: 0% means no refusals, 100% means all prompts refused
- Target: Aim for <10% for effective gabliteration
KL Divergence:
- What: Measures how different the modified model is from the original
- Lower is better: Smaller values = model behaves more similarly to original
- Target: Keep <0.05 to preserve model quality
Combined Score:
- What: Combined metric = 10×RefusalRate + KLDivergence, with RefusalRate expressed as a fraction (0.12 for 12%)
- Lower is better: Balances refusal reduction with model preservation
- Weights refusal rate 10x more than KL: Primary goal is reducing refusals
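The three metrics can be illustrated with a short worked example. This is a sketch under stated assumptions rather than the package's implementation: how a response is flagged as a refusal is not specified here, so the flags are taken as given, and the KL direction (original relative to modified) is assumed. The function names are hypothetical.

```python
import math


def refusal_rate(refused_flags):
    """Fraction of test prompts flagged as refused (0.0 to 1.0)."""
    return sum(refused_flags) / len(refused_flags)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability distributions of equal length.

    Assumption: p is the original model's distribution, q the modified one.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def combined_score(refusal, kl):
    """Score as defined above: 10 x refusal rate (as a fraction) + KL divergence."""
    return 10.0 * refusal + kl


# Reproduce the sample log entry: 12 refusals out of 100 prompts, KL = 0.0234.
flags = [True] * 12 + [False] * 88
rate = refusal_rate(flags)
score = combined_score(rate, 0.0234)
print(f"Refusal Rate: {rate:.1%}  Score: {score:.4f}")
```

With these inputs the score matches the 1.2234 shown in the sample output, which confirms that the refusal rate enters the formula as a fraction rather than a percentage.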
The script randomly samples from these ranges:
| Parameter | Range | Paper Default | Description |
|---|---|---|---|
| num_prompt_samples | [50, 75, 100, 150, 200] | 100 | Training samples for direction extraction |
| skip_begin_layers | [1, 2, 3] | 2 | Skip initial layers (preserve embeddings) |
| skip_end_layers | [1, 2, 3] | 1 | Skip final layers (preserve output) |
| layer_fraction | [0.3, 0.7] | 0.5 | Which layer to extract directions from |
| base_scale_factor | [0.2, 0.8] | 0.3 | Modification strength (α_base) |
| regularization | [0.05, 0.1, 0.15, 0.2] | 0.1 | Ridge regularization (λ) |
| n_directions | [1, 2, 3] | 1 | Number of refusal directions (k) |
| adaptive_layer_scale | [True, False] | True | Use adaptive scaling |
| beta | [0.3, 0.7] | 0.5 | Adaptive strength (β) |
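To make the sampling concrete, here is a minimal sketch of drawing one random configuration from the ranges in the table. `SearchConfig` is a hypothetical stand-in for the package's GabliterationConfig, whose real fields and sampling logic may differ. The sketch also assumes that the two-element float ranges are continuous intervals while the other entries are discrete choices.

```python
import random
from dataclasses import dataclass


@dataclass
class SearchConfig:
    """Hypothetical stand-in for the package's GabliterationConfig."""

    num_prompt_samples: int
    skip_begin_layers: int
    skip_end_layers: int
    layer_fraction: float
    base_scale_factor: float
    regularization: float
    n_directions: int
    adaptive_layer_scale: bool
    beta: float

    @classmethod
    def random(cls, rng=None):
        """Draw one configuration from the search space in the table."""
        rng = rng or random.Random()
        return cls(
            num_prompt_samples=rng.choice([50, 75, 100, 150, 200]),
            skip_begin_layers=rng.choice([1, 2, 3]),
            skip_end_layers=rng.choice([1, 2, 3]),
            # Assumed continuous intervals, rounded for readable logs.
            layer_fraction=round(rng.uniform(0.3, 0.7), 2),
            base_scale_factor=round(rng.uniform(0.2, 0.8), 2),
            regularization=rng.choice([0.05, 0.1, 0.15, 0.2]),
            n_directions=rng.choice([1, 2, 3]),
            adaptive_layer_scale=rng.choice([True, False]),
            beta=round(rng.uniform(0.3, 0.7), 2),
        )


# A seeded generator makes a search run reproducible.
configs = [SearchConfig.random(random.Random(seed)) for seed in range(5)]
for cfg in configs:
    print(cfg)
```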
Increase the number of versions tested:
gabliterate --model "Qwen/Qwen3-4B-Instruct-2507" --num-versions 200
Fine-tune evaluation settings:
gabliterate --model "meta-llama/Llama-3.2-1B-Instruct" \
--test-samples 300 \
--kl-samples 25 \
--max-tokens 200
Adjust batch size for faster evaluation:
gabliterate --model "Nanbeige/Nanbeige4-3B-Thinking-2511" \
--batch-size 8 \
--num-versions 100
Clone the repository and edit the GabliterationConfig.random() method in the source code to customize the hyperparameter search space.
- Each version creates a new model copy
- Memory is cleared between versions
- Use smaller models for faster testing
- Reduce --test-samples if memory is tight
- Use GPU/CUDA if available (automatically detected)
- Increase --batch-size for faster evaluation
- Reduce --test-samples for faster evaluation
- Start with fewer --num-versions to test the pipeline
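The notes above say memory is cleared between versions without showing how. One common pattern, which may differ from what the package actually does, is to drop references to the model, run the garbage collector, and release PyTorch's cached GPU memory. The helper name `free_memory` is hypothetical.

```python
import gc


def free_memory():
    """Run garbage collection and, if CUDA is available, empty PyTorch's cache.

    Returns True when the CUDA cache was emptied, False otherwise.
    """
    gc.collect()
    try:
        # Imported lazily so the helper also works where PyTorch is absent.
        import torch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


cleared = free_memory()
print("CUDA cache cleared" if cleared else "No CUDA cache to clear")
```

For this to release anything, the caller must first delete its own references to the previous model copy (for example with `del model`).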
- Quick Test (5 minutes):
  gabliterate --model "your-model" --num-versions 5 --test-samples 50
- Standard Search (30 minutes):
  gabliterate --model "your-model" --num-versions 20 --test-samples 100
- Thorough Search (2+ hours):
  gabliterate --model "your-model" --num-versions 50 --test-samples 200
gabliterate --model "your-model" --num-versions 10 --batch-size 1 --test-samples 50
- Reduce --num-versions
- Use a smaller model
- Reduce --batch-size
- Reduce --test-samples
Ensure the package is installed:
pip install gabliteration
pip show gabliteration # Verify installation
- The random configurations may need different ranges
- Try multiple runs with different --num-versions
- Check that the model exhibits refusal behavior in the first place
If you use this implementation, please cite:
@article{gulmez2025gabliteration,
title={Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models},
author={G{\"u}lmez, G{\"o}kdeniz},
journal={arXiv preprint arXiv:2512.18901},
year={2025}
}
Same license as the base models being modified (typically Apache 2.0 or similar).
For issues or questions:
- GitHub: Check the original Gabliteration repository
- Paper: https://arxiv.org/abs/2512.18901
- Email: goekdenizguelmez-ml@gmail.com
