# SauerkrautLM-Doom-MultiVec-1.3M
A tiny 1.3M parameter model that plays DOOM, outperforming LLMs up to 92,000x its size.
This model is a ModernBERT-Hash encoder with depth-aware token representations and an attention pooling classification head, trained on 31K human gameplay demonstrations to select actions from ASCII game frame representations in real time.
## Core Features and Innovations
- 178 frags in 10 episodes (17.8 per episode) in VizDoom's `defend_the_center` scenario, more than all tested LLMs combined (13 frags total)
- 31ms inference on CPU, enabling real-time gameplay at 35 FPS
- Depth-aware ASCII encoding: VizDoom depth buffer encoded as learned 16-bin token embeddings fused with character embeddings
- ModernBERT-Hash architecture: Hash embeddings + local/global attention + Flash Attention 2 support
- Character-level tokenizer: 75 tokens, no BPE, preserving full spatial granularity of ASCII frames
## David vs. Goliath: 1.3M Parameters vs. 120 Billion
With 1.3 million parameters -- less than 1/92,000th the size of Nemotron-120B -- SauerkrautLM-Doom-MultiVec achieves:
- 178 frags vs 0 for GPT-4o-mini (proprietary)
- 178 frags vs 3 for Nemotron-120B (120B, 92,000x larger)
- 178 frags vs 2 for Qwen3.5-27B (27B, 20,000x larger)
- 178 frags vs 8 for Gemini Flash Lite (proprietary)
All LLMs are vision-capable multimodal models evaluated on text (ASCII + depth), their strongest modality. Our tiny text-only model outperforms them on their home turf.
## Model Overview

- Model: VAGOsolutions/SauerkrautLM-Doom-MultiVec-1.3M
- Architecture: ModernBERT-Hash encoder + Attention Pooling + Linear Classifier
- Task: Real-time DOOM action classification from ASCII frames
- Training Data: 31,645 human gameplay demonstrations with depth annotations
- License: Apache 2.0
- Model Size: 1.3M parameters (~5MB FP32)
## Model Description
- Model Type: Multi-vector encoder with attention pooling classification head
- Encoder: 5-layer ModernBERT with hash embeddings (H=128, 4 heads)
- Tokenizer: Character-level, 75 tokens (no BPE)
- Depth Bins: 16 learned depth embeddings added to token representations
- Actions: 4 (shoot, move_forward, turn_left, turn_right)
- Max Sequence Length: 1,200 tokens
- Training Loss: KL-divergence on soft action scores
- Inference Latency: 31ms (CPU), 29ms (GPU)
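The KL-divergence objective on soft action scores can be sketched as follows. This is an illustrative reconstruction, not the repository's actual training code; the function name and target construction are assumptions.

```python
import torch
import torch.nn.functional as F

def kl_action_loss(logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """KL(target || model) over the 4 action classes.

    logits: (batch, 4) raw model outputs
    soft_targets: (batch, 4) soft action probabilities from demonstrations
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")

logits = torch.tensor([[2.0, 0.5, 0.1, 0.1]])
targets = torch.tensor([[0.7, 0.2, 0.05, 0.05]])
loss = kl_action_loss(logits, targets)
```

Soft targets (rather than one-hot labels) let the model learn that several actions can be simultaneously reasonable in a given frame.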
## Architecture

```
DoomMultiVecClassifier(
  encoder: ModernBertHashModel(
    embeddings: HashEmbedding(75 vocab, 16 proj -> 128 dim)
    depth_embedding: DepthEmbedding(16 bins x 128 dim)
    layers: 5x TransformerLayer(H=128, heads=4, FFN=512)
  )
  attention_pool: Linear(128 -> 1)   # learned attention weights
  classifier: Linear(128 -> 4)       # action probabilities
)
```
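The attention-pooling head can be sketched in PyTorch as below. Names and masking details are illustrative, not the repository's actual code.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Pool token representations with learned attention, then classify."""

    def __init__(self, hidden: int = 128, num_actions: int = 4):
        super().__init__()
        self.score = nn.Linear(hidden, 1)            # learned attention weights
        self.classifier = nn.Linear(hidden, num_actions)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor):
        # hidden_states: (B, L, H); attention_mask: (B, L), 1 = real token
        scores = self.score(hidden_states).squeeze(-1)           # (B, L)
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)    # (B, L, 1)
        pooled = (weights * hidden_states).sum(dim=1)            # (B, H)
        return self.classifier(pooled)                           # (B, num_actions)

pool = AttentionPool()
out = pool(torch.randn(2, 10, 128), torch.ones(2, 10, dtype=torch.long))
```

Unlike mean pooling, the learned scores let the model focus on the few tokens that matter (e.g. an enemy sprite) while ignoring walls and floor.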
## ModernBERT-Hash: Advancing Hash Embeddings to Modern Architectures
This model introduces hash embeddings to the ModernBERT architecture, a combination that has not been explored before. Previous work on hash embeddings for tiny language models (NeuML's BERT-Hash, Svenstrup et al. 2017) applied the technique to the original BERT architecture from 2018. We bring hash embeddings to ModernBERT (Warner et al. 2024), which provides several architectural advantages:
- Rotary Position Embeddings (RoPE) instead of learned absolute positions, enabling better generalization across sequence lengths
- Alternating local + global attention: Layers alternate between sliding-window local attention (w=128) and full global attention, matching the spatial structure of ASCII frames where local patterns (adjacent characters) and global context (arena layout) both matter
- Flash Attention 2 support for efficient GPU training with long sequences (~1,100 tokens per frame)
- Pre-normalization with RMSNorm for more stable training of tiny models
The hash embedding layer replaces the standard embedding table (V x H) with a two-stage projection: a compact lookup (V x P) followed by a linear projection (P x H), where P=16 is the projection dimension. For our 75-token vocabulary this reduces embedding parameters from 9,600 to 4,480 (53% reduction). While modest for a tiny vocabulary, the same architecture scales to standard vocabularies: at 30K tokens, hash embeddings reduce embedding parameters by 97% (from 3.8M to 120K), which is the key enabler for sub-1M parameter language models.
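A minimal sketch of the two-stage factorization described above. This is illustrative only: a plain V x P lookup plus P x H projection gives 3,248 parameters, so the released layer's reported 4,480 evidently includes additional components (e.g. multiple hash functions or importance weights) not modeled here.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Two-stage embedding: compact V x P lookup, then P x H projection."""

    def __init__(self, vocab: int = 75, proj: int = 16, hidden: int = 128):
        super().__init__()
        self.lookup = nn.Embedding(vocab, proj)             # V x P = 75 x 16
        self.project = nn.Linear(proj, hidden, bias=False)  # P x H = 16 x 128

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.project(self.lookup(input_ids))

emb = FactorizedEmbedding()
full_table = 75 * 128                                       # standard table: 9,600
factored = sum(p.numel() for p in emb.parameters())         # 1,200 + 2,048 = 3,248
out = emb(torch.randint(0, 75, (1, 5)))
```

The savings grow with vocabulary size: the V x P lookup scales with V while the P x H projection is a fixed cost, which is why the technique pays off most at 30K-token vocabularies.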
Combined with depth embeddings (16 learned bins added to token representations), the model receives both spatial (what the character looks like) and distance (how far away it is) information at the token level, a novel input representation for game state encoding.
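Assuming the depth embeddings are added element-wise to the character embeddings, as "added to token representations" suggests, the token-level fusion can be pictured as:

```python
import torch
import torch.nn as nn

# Illustrative fusion of character and depth information per token;
# the actual module layout in the released model may differ.
char_emb = nn.Embedding(75, 128)    # character vocabulary
depth_emb = nn.Embedding(16, 128)   # 16 learned depth bins

input_ids = torch.randint(0, 75, (1, 1024))    # one flattened ASCII frame
depth_bins = torch.randint(0, 16, (1, 1024))   # per-token quantized depth

tokens = char_emb(input_ids) + depth_emb(depth_bins)  # (1, 1024, 128)
```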
## Input Pipeline
From VizDoom game frame to model input: (a) RGB frame, (b) grayscale, (c) depth buffer, (d) ASCII brightness map (40x25), (e) depth bins with 16 quantization levels (red=near, green=far), (f) combined ASCII + depth representation as fed to the model.
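A hypothetical reconstruction of steps (b), (d), and (e) of this pipeline; the actual brightness ramp and binning thresholds used in training may differ.

```python
import numpy as np

RAMP = " .:-=+*#%@"  # assumed brightness ramp, dark -> bright (10 levels)

def frame_to_ascii(gray: np.ndarray) -> str:
    """Map a (25, 40) uint8 grayscale frame to a 40x25 ASCII string."""
    idx = (gray.astype(np.float32) / 255.0 * (len(RAMP) - 1)).round().astype(int)
    return "\n".join("".join(RAMP[i] for i in row) for row in idx)

def depth_to_bins(depth: np.ndarray, bins: int = 16) -> np.ndarray:
    """Quantize a (25, 40) depth buffer into 16 discrete bins (0 = near)."""
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)   # normalize to [0, 1]
    return np.clip((d * bins).astype(int), 0, bins - 1)

ascii_frame = frame_to_ascii(np.zeros((25, 40), dtype=np.uint8))
bin_map = depth_to_bins(np.arange(1000, dtype=np.float32).reshape(25, 40))
```

Note that a 40x25 frame rendered with 24 newline separators is 1,024 characters, which matches the ~1,100-token sequence length after special tokens.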
## Benchmark: DOOM `defend_the_center`
All agents receive identical input: ASCII frame (40x25) + depth map. Game settings match training conditions (640x480, HUD on, real-time pacing). Frags are tracked via VizDoom's per-step reward signal.
| Agent | Params | Episodes | Avg Survival | Max Survival | Total Frags | Latency |
|---|---|---|---|---|---|---|
| SauerkrautLM-Doom-MultiVec-1.3M | 1.3M | 10 | 388 | 525 | 178 | 31ms |
| GPT-4o-mini | proprietary | 10 | 104 | 228 | 0 | 646ms |
| Nemotron-120B | 120B | 5 | 88 | 104 | 3 | 8.9s |
| Qwen3.5-27B | 27B | 3 | 87 | 109 | 2 | 13.3s |
| Gemini Flash Lite | proprietary | 10 | 81 | 97 | 8 | 920ms |
178 frags vs 13 for all LLMs combined. GPT-4o-mini scores zero frags across 10 episodes (pure evasion). Our model actively engages enemies, turns to face them, and fires -- playing DOOM as intended.
## Quick Start
```python
import torch
from transformers import AutoTokenizer

# Load model
model_path = "VAGOsolutions/SauerkrautLM-Doom-MultiVec-1.3M"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

from doom_multivec.model.classifier import DoomMultiVecClassifier

state = torch.load(f"{model_path}/model.pt", map_location="cpu")
model = DoomMultiVecClassifier(model_path, pool_mode="attention", num_actions=4)
model.load_state_dict(state)
model.eval()

# Classify an ASCII frame
ascii_frame = "." * 1024  # placeholder for a 40x25 ASCII frame (1,024 chars)
encoded = tokenizer(ascii_frame, return_tensors="pt", max_length=1100,
                    padding="max_length", truncation=True)
with torch.no_grad():
    result = model(encoded["input_ids"], encoded["attention_mask"])
probs = torch.softmax(result["logits"], dim=-1)[0]

actions = ["shoot", "move_forward", "turn_left", "turn_right"]
for action, prob in zip(actions, probs):
    print(f"  {action}: {prob:.3f}")
```
## Watch the Model Play DOOM
```bash
pip install vizdoom
python scripts/play_doom_visual.py --model models/doom-multivec-trained --scenario defend_the_center
```
## Run the LLM Benchmark
```bash
pip install openai
export OPENAI_API_KEY="your-key"

python scripts/benchmark.py --agent multivec --episodes 10 --realtime
python scripts/benchmark.py --agent gpt4mini --episodes 10 --realtime
```
## Parameter Budget
| Component | Parameters | % of Total |
|---|---|---|
| Hash Embeddings (75 vocab, 16 proj) | 4,480 | 0.3% |
| Depth Embeddings (17 bins x 128) | 2,176 | 0.2% |
| Transformer Layers (x 5) | 1,312,000 | 99.4% |
| Attention Pool + Classifier | 644 | 0.05% |
| Total | 1,319,300 | 100% |
## Paper

**Playing DOOM with 1.3M Parameters: Specialized Small Models vs Large Language Models for Real-Time Game Control**
David Golchinfar (VAGO Solutions, Germany), Daryoush Vaziri (University of Applied Sciences Bonn-Rhein-Sieg, Germany), Alexander Marquardt (CARE Laboratory, NAIST, Japan)
Available in the project repository under paper/doom_multivec.pdf.
## Acknowledgements
This work was developed using VizDoom as the game platform, PyLate for initial multi-vector experiments, and the HuggingFace ecosystem for model development. The ModernBERT-Hash architecture builds on NeuML's BERT-Hash models. Training data includes human gameplay demonstrations and frames from the arnaudstiegler GameNGen reproduction datasets on HuggingFace.
DOOM is a registered trademark of id Software LLC. This project is not affiliated with or endorsed by id Software.
## Citation

```bibtex
@misc{SauerkrautLM-Doom-MultiVec,
  title={SauerkrautLM-Doom-MultiVec-1.3M: Playing DOOM with 1.3M Parameters},
  author={David Golchinfar and Daryoush Vaziri and Alexander Marquardt},
  url={https://huggingface.co/VAGOsolutions/SauerkrautLM-Doom-MultiVec-1.3M},
  year={2026}
}
```