Nemotron-Cascade-2-30B-A3B-FP8

FP8 W8A8 quantization of Nemotron-Cascade-2-30B-A3B — a 32B hybrid Mamba-MoE reasoning model (3.2B active). Quantized with NVIDIA ModelOpt using the same selective quantization recipe as NVIDIA's official Nano FP8: MoE experts and Mamba GEMMs in FP8, attention and sensitive layers in BF16, KV cache in FP8.

Benchmarks

Computed with NVIDIA-NeMo/Evaluator, reusing the evaluation config from Nemotron-3-Super-120B:

| Benchmark | Nemotron-Cascade-2-30B-A3B (reproduced results) | Nemotron-Cascade-2-30B-A3B-FP8 (this model) |
|---|---|---|
| AIME 2025 (avg@8) | 98.8 | 96.7 |
| AIME 2026 (avg@8) | 94.2 | 95.0 |
| HMMT Feb 2025 (avg@8) | 92.9 | 93.8 |

Given the low sample count (8 rollouts per problem), a deviation of ±2% across runs is expected; within that noise, the BF16 and FP8 variants are equivalent in reasoning performance.

Quantization Details

  • Method: FP8 Post-Training Quantization (PTQ), without Quantization-Aware Distillation (QAD)
  • Format: FP8 E4M3 — static per-tensor weight scaling, dynamic per-token activation scaling at inference
  • KV cache: FP8
  • Tooling: NVIDIA ModelOpt
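The scaling scheme above can be illustrated with a minimal NumPy sketch: a single static scale for a weight tensor, and a per-row (per-token) scale for activations, both mapped onto the FP8 E4M3 dynamic range. The actual cast to the E4M3 grid (which introduces the real rounding error) is omitted here; this only shows how the scales are derived.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_quantize_per_tensor(w):
    """Static per-tensor scaling: one scale for the whole weight tensor,
    computed once at quantization time."""
    scale = max(np.abs(w).max(), 1e-12) / E4M3_MAX
    q = np.clip(w / scale, -E4M3_MAX, E4M3_MAX)  # real FP8 would cast here
    return q, scale

def fp8_quantize_per_token(x):
    """Dynamic per-token scaling: one scale per row (token), recomputed
    at inference time from the activation itself."""
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-12) / E4M3_MAX
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    return q, scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = fp8_quantize_per_tensor(w)
# Without the E4M3 cast, dequantization recovers the input up to fp rounding.
assert np.allclose(q * s, w, atol=1e-5)
```

Per-tensor weight scales can be folded into the GEMM epilogue, while the dynamic per-token activation scales avoid the need for activation calibration statistics at the cost of an extra reduction per row.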

Selective Quantization Recipe

Follows the Nano-architecture selective quantization recipe from the Nemotron 3 Nano Technical Report (Section 4). Same recipe as NVIDIA's official FP8 checkpoint. Sensitive components are kept in higher precision:

| Component | Precision | Rationale |
|---|---|---|
| MoE expert GEMMs (routed + shared) | FP8 | All 23 MoE layers, 128 routed + 2 shared experts each |
| Mamba GEMMs (non-adjacent) | FP8 | 17 of 23 Mamba layers |
| Attention layers (all 6) | BF16 | Most sensitive; kept in BF16 per NVIDIA's sensitivity analysis |
| Mamba layers adjacent to attention (6) | BF16 | Layers {4, 11, 18, 25, 32, 41}, found sensitive in ablations |
| Mamba 1D conv | BF16 | All layers |
| Router gates | FP32 | Routing precision must not degrade |
| Embeddings & lm_head | BF16 | Not quantized |
| KV cache | FP8 | All 6 attention layers |
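In ModelOpt, a recipe like this is expressed as a pattern-keyed `quant_cfg` dict where sensitive modules are explicitly disabled. The sketch below is a hypothetical reconstruction of such a config: the wildcard module patterns are illustrative and do not necessarily match this model's actual module paths, and `num_bits=(4, 3)` is ModelOpt's convention for FP8 E4M3 (4 exponent bits, 3 mantissa bits).

```python
# Hypothetical ModelOpt-style selective quantization config (sketch only).
# Module-name patterns below are illustrative, not the model's real paths.
FP8_SELECTIVE_CFG = {
    "quant_cfg": {
        # FP8 E4M3, static per-tensor weights, dynamic per-token activations
        "*weight_quantizer": {"num_bits": (4, 3), "axis": None},
        "*input_quantizer": {"num_bits": (4, 3), "type": "dynamic"},
        # Sensitive components kept in higher precision (quantization disabled):
        "*self_attention*": {"enable": False},  # all 6 attention layers stay BF16
        "*conv1d*": {"enable": False},          # Mamba 1D conv stays BF16
        "*router*": {"enable": False},          # router gates stay FP32
        "*lm_head*": {"enable": False},         # lm_head / embeddings untouched
        "default": {"enable": False},           # everything unmatched stays BF16
    }
}
```

Such a config would typically be passed to `modelopt.torch.quantization.quantize` together with a calibration forward loop; per-layer exclusions (e.g. the 6 attention-adjacent Mamba layers) would be added as further disabled patterns.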

Calibration

  • Dataset: 4,000 samples from nvidia/Nemotron-Cascade-2-SFT-Data
  • Domain mix: math (1000), swe (900), terminal_agent (500), science (500), chat (400), conversational_agent (300), instruction_following (300), safety (100)
  • Sequence length: Up to 12,288 tokens (no padding, natural length per sample)
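As a quick sanity check, the domain counts above can be written out directly; they sum to the stated 4,000 calibration samples, with math and software-engineering data making up nearly half the mix:

```python
# Calibration domain mix, counts taken from the card above.
domain_mix = {
    "math": 1000,
    "swe": 900,
    "terminal_agent": 500,
    "science": 500,
    "chat": 400,
    "conversational_agent": 300,
    "instruction_following": 300,
    "safety": 100,
}

total = sum(domain_mix.values())
assert total == 4000

# Fraction of the calibration set contributed by each domain.
fractions = {k: v / total for k, v in domain_mix.items()}
```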

Usage

SGLang

python -m sglang.launch_server \
    --model chankhavu/Nemotron-Cascade-2-30B-A3B-FP8 \
    --trust-remote-code \
    --tool-call-parser qwen3_coder \
    --reasoning-parser nano_v3

vLLM

vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-FP8 \
    --mamba_ssm_cache_dtype float32 \
    --max-model-len 262144 \
    --trust-remote-code \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser nemotron_v3
