# gemma-4-26B-A4B-it-uncensored
Uncensored version of google/gemma-4-26B-A4B-it with refusal behavior removed.
## Results

| Metric | Before | After |
|---|---|---|
| Refusals (mlabonne, 100 prompts) | 98/100 | 1/100 effective (3 flagged, 2 refusal-then-comply) |
| Refusals (cross-dataset, 686 prompts) | — | 5/686 (0.7%) |
| KL Divergence | 0 (baseline) | 0.09 |
| Quality (harmless response length ratio) | 1.0 | ~1.01 (no degradation) |
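The KL divergence row measures distributional drift from the base model on harmless prompts. The exact evaluation setup is not specified here, but a minimal sketch of one common way to compute it (mean per-token KL between the base and abliterated next-token distributions; function name and tensor layout are assumptions):

```python
import torch
import torch.nn.functional as F

def mean_kl(base_logits: torch.Tensor, ablit_logits: torch.Tensor) -> float:
    """Mean KL(base || abliterated) over next-token distributions.

    Both tensors: (num_tokens, vocab_size), collected on the same
    harmless prompts from the base and abliterated models.
    """
    logp_base = F.log_softmax(base_logits.float(), dim=-1)
    logp_ablit = F.log_softmax(ablit_logits.float(), dim=-1)
    kl = (logp_base.exp() * (logp_base - logp_ablit)).sum(dim=-1)
    return kl.mean().item()
```

A value near zero (0.09 here) indicates the abliterated model's output distribution stays close to the base model on benign inputs.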
## Cross-Dataset Validation
Tested against 4 independent prompt datasets to verify generalization:
| Dataset | Prompts | Refusals |
|---|---|---|
| JailbreakBench | 100 | 1/100 |
| tulu-harmbench | 320 | 1/320 |
| NousResearch/RefusalDataset | 166 | 0/166 |
| mlabonne/harmful_behaviors | 100 | 3/100 |
| Total | 686 | 5/686 (0.7%) |
Every flagged refusal was manually audited. Most are "refusal-then-comply" false positives, where the model adds an AI-identity disclaimer and then answers the question anyway.
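A simple way to flag candidate refusals for audit is marker-string matching on the response prefix. This is a hypothetical sketch of such a heuristic; the marker list and function names are illustrative, not taken from the repo:

```python
# Illustrative refusal-flagging heuristic (not the repo's actual audit code).
REFUSAL_MARKERS = [
    "i can't", "i cannot", "i won't", "i'm sorry", "as an ai",
    "i'm not able to", "i must decline",
]

def flag_refusal(response: str) -> bool:
    """Flag a response for manual audit if a refusal marker appears early."""
    head = response.strip().lower()[:200]  # refusals tend to open the response
    return any(marker in head for marker in REFUSAL_MARKERS)

def is_refusal_then_comply(response: str, min_length: int = 200) -> bool:
    """Heuristic for false positives: a disclaimer followed by a long,
    substantive answer anyway."""
    return flag_refusal(response) and len(response) > min_length
```

String matching over-flags by design, which is why each hit still needs a manual pass.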
## Method
Norm-preserving biprojected abliteration on the dense pathway (o_proj + shared mlp.down_proj), plus Expert-Granular Abliteration (EGA) on all 128 MoE expert down_proj slices per layer.
EGA (OBLITERATUS) hooks the MoE routers during probing to compute per-expert routing weights for harmful vs harmless prompts, then applies norm-preserving projection (grimjim) to each expert individually. Dense-only abliteration leaves 29/100 refusals; adding EGA drops it to 3/100.
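The norm-preserving projection step can be sketched as follows. This is a minimal sketch under stated assumptions (the function name, tensor layout, and the column-norm restoration are my reading of the grimjim-style technique, not the repo's actual API): the refusal direction is projected out of the weight's output space, then each column is rescaled to its original L2 norm so weight magnitudes are preserved.

```python
import torch

def norm_preserving_abliterate(weight: torch.Tensor, direction: torch.Tensor,
                               scale: float = 1.0) -> torch.Tensor:
    """Project the refusal direction out of a Linear weight's output space,
    then restore each column's original L2 norm.

    weight:    (out_features, in_features); y = weight @ x writes to the residual stream
    direction: (out_features,) refusal direction in residual space
    """
    r = direction / direction.norm()
    col_norms = weight.norm(dim=0, keepdim=True)              # original per-column norms
    projected = weight - scale * torch.outer(r, r @ weight)   # (I - s * r r^T) W
    new_norms = projected.norm(dim=0, keepdim=True).clamp_min(1e-8)
    return projected * (col_norms / new_norms)                # norm-preserving rescale
```

For EGA, the same function would be applied to each of the 128 expert `down_proj` slices per layer, with each expert's contribution weighted by its harmful-vs-harmless routing statistics.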
### Pipeline

- Load model in bf16 with LoRA adapters on `o_proj` and `mlp.down_proj`
- Collect residual activations for 400 harmful + 400 harmless prompts (mlabonne datasets)
- Winsorize activations at the 99.5th percentile (clamps GeGLU outlier activations in the Gemma family)
- Compute per-layer refusal direction: `normalize(mean(harmful) - mean(harmless))`
- Orthogonalize each direction against the harmless mean (double-pass Gram-Schmidt)
- Apply norm-preserving weight modification to `o_proj` and `down_proj` in all layers
- Hook MoE routers, collect per-expert routing weights for harmful vs harmless prompts
- Apply the same norm-preserving modification to all 128 expert `down_proj` slices per layer
- Merge LoRA adapters into base weights for clean tensor names
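The direction-computation steps above (winsorize, mean difference, Gram-Schmidt against the harmless mean) can be sketched as follows. Function name and tensor layout are assumptions for illustration, not the repo's API:

```python
import torch

def refusal_directions(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor,
                       winsor_q: float = 0.995) -> torch.Tensor:
    """Per-layer refusal directions from cached residual activations.

    harmful_acts / harmless_acts: (num_layers, num_prompts, hidden_dim)
    Returns unit directions of shape (num_layers, hidden_dim).
    """
    def winsorize(x: torch.Tensor) -> torch.Tensor:
        # Clamp extreme magnitudes (GeGLU outliers) at the chosen quantile.
        hi = x.abs().quantile(winsor_q).item()
        return x.clamp(-hi, hi)

    harmful = winsorize(harmful_acts).mean(dim=1)    # (L, D)
    harmless = winsorize(harmless_acts).mean(dim=1)  # (L, D)

    d = harmful - harmless
    d = d / d.norm(dim=-1, keepdim=True)

    # Double-pass Gram-Schmidt: orthogonalize against the harmless mean twice
    # to clean up floating-point residue from the first pass.
    h = harmless / harmless.norm(dim=-1, keepdim=True)
    for _ in range(2):
        d = d - (d * h).sum(dim=-1, keepdim=True) * h
        d = d / d.norm(dim=-1, keepdim=True)
    return d
```

Note the sketch clamps at a single global quantile for simplicity; a per-layer quantile is a straightforward variation.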
## Parameters
| Parameter | Value |
|---|---|
| Layers abliterated | 100% |
| Scale | 1.0 |
| Winsorization | 0.995 |
| Experts abliterated | 100% (128/128 per layer) |
| Expert scale | 1.0 |
## How this differs from vanilla heretic
- Norm-preserving biprojection instead of standard projection (preserves weight magnitudes)
- Per-layer refusal directions instead of one global direction
- Deterministic single-pass instead of 50-trial Optuna search (faster, same or better results)
- LoRA merge before save for clean GGUF-compatible tensor names
- Expert-Granular Abliteration for MoE expert weights (not supported in heretic)
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "TrevorJS/gemma-4-26B-A4B-it-uncensored",
    dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("TrevorJS/gemma-4-26B-A4B-it-uncensored")

messages = [{"role": "user", "content": "Your prompt here"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
## Reproduction

Full code and experiment data: abliteration research repo

```bash
python scripts/ega.py --model google/gemma-4-26B-A4B-it \
    --top-pct 100 --strip-topic-markers --skip-prefix --batch-size 4 \
    --save output_dir
```