# Nemotron-Cascade-2-30B-A3B-FP8
FP8 W8A8 quantization of Nemotron-Cascade-2-30B-A3B — a 32B hybrid Mamba-MoE reasoning model (3.2B active). Quantized with NVIDIA ModelOpt using the same selective quantization recipe as NVIDIA's official Nano FP8: MoE experts and Mamba GEMMs in FP8, attention and sensitive layers in BF16, KV cache in FP8.
## Benchmarks
Computed with NVIDIA-NeMo/Evaluator, using the evaluation config from Nemotron-3-Super-120B:
| Benchmark | Nemotron-Cascade-2-30B-A3B (reproduced results) | Nemotron-Cascade-2-30B-A3B-FP8 (this model) |
|---|---|---|
| AIME 2025 (avg@8) | 98.8 | 96.7 |
| AIME 2026 (avg@8) | 94.2 | 95.0 |
| HMMT Feb 2025 (avg@8) | 92.9 | 93.8 |
With the low sample count (8 rollouts per problem), a deviation of ±2% across runs is expected, so the BF16 and FP8 variants can be considered equivalent in reasoning performance.
## Quantization Details
- Method: FP8 Post-Training Quantization (PTQ), without Quantization-Aware Distillation (QAD)
- Format: FP8 E4M3 — static per-tensor weight scaling, dynamic per-token activation scaling at inference
- KV cache: FP8
- Tooling: NVIDIA ModelOpt
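The two scaling modes above differ in when and at what granularity the scale factors are computed. A minimal illustrative sketch (not the ModelOpt implementation; E4M3's largest finite value is 448):

```python
# Illustrative FP8 E4M3 scale computation: static per-tensor scaling for
# weights (computed once at PTQ time) vs. dynamic per-token scaling for
# activations (computed on the fly at inference).
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_tensor_scale(weights):
    """One static scale factor for the whole weight tensor."""
    amax = max(abs(w) for row in weights for w in row)
    return amax / E4M3_MAX

def per_token_scales(activations):
    """One dynamic scale factor per token (row of the activation matrix)."""
    return [max(abs(x) for x in row) / E4M3_MAX for row in activations]

weights = [[0.02, -0.11], [0.45, 0.07]]
acts = [[1.5, -3.0], [0.2, 0.9]]

w_scale = per_tensor_scale(weights)   # single scale: 0.45 / 448
a_scales = per_token_scales(acts)     # one scale per token: [3.0/448, 0.9/448]
```

Per-token activation scaling keeps outlier tokens from forcing a coarse scale onto the entire batch, which is why it is the common choice for dynamic activation quantization.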
### Selective Quantization Recipe
Follows the Nano-architecture selective quantization recipe from the Nemotron 3 Nano Technical Report (Section 4), the same recipe used for NVIDIA's official FP8 checkpoint. Sensitive components are kept in higher precision:
| Component | Precision | Rationale |
|---|---|---|
| MoE expert GEMMs (routed + shared) | FP8 | All 23 MoE layers, 128 routed + 2 shared experts each |
| Mamba GEMMs (non-adjacent) | FP8 | 17 of 23 Mamba layers |
| Attention layers (all 6) | BF16 | Most sensitive — kept BF16 per NVIDIA sensitivity analysis |
| Mamba layers adjacent to attention (6) | BF16 | Layers {4, 11, 18, 25, 32, 41} — found sensitive in ablations |
| Mamba 1D conv | BF16 | All layers |
| Router gates | FP32 | Routing precision must not degrade |
| Embeddings & lm_head | BF16 | Not quantized |
| KV cache | FP8 | All 6 attention layers |
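In ModelOpt, a selective recipe like this is typically expressed as a quantization config that uses wildcard module-name patterns to disable quantizers on sensitive layers. The sketch below is illustrative only: the pattern strings are assumptions, not copied from the released quantization script.

```python
# Sketch of a ModelOpt-style selective FP8 config. num_bits=(4, 3) denotes
# the E4M3 format; module-name wildcards are HYPOTHETICAL examples and
# would need to match the actual model's module names.
FP8_SELECTIVE_CFG = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": (4, 3), "axis": None},      # FP8 E4M3 weights
        "*input_quantizer": {"num_bits": (4, 3), "type": "dynamic"},  # per-token activations
        # Sensitive components stay in higher precision:
        "*self_attention*": {"enable": False},  # all 6 attention layers remain BF16
        "*conv1d*": {"enable": False},          # Mamba 1D conv remains BF16
        "*router*": {"enable": False},          # router gates remain FP32
        "*lm_head*": {"enable": False},         # embeddings / lm_head not quantized
    },
    "algorithm": "max",
}
```

In practice such a config is passed to ModelOpt's PTQ entry point together with a calibration forward loop; the disabled patterns are what turn a blanket FP8 recipe into the selective one described above.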
### Calibration
- Dataset: 4,000 samples from nvidia/Nemotron-Cascade-2-SFT-Data
- Domain mix: math (1000), swe (900), terminal_agent (500), science (500), chat (400), conversational_agent (300), instruction_following (300), safety (100)
- Sequence length: Up to 12,288 tokens (no padding, natural length per sample)
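Assembling the calibration set from the stated domain counts can be sketched as follows (a minimal illustration; the dataset access details are assumptions):

```python
import random

# Domain mix from the calibration description: 4,000 samples total.
DOMAIN_COUNTS = {
    "math": 1000, "swe": 900, "terminal_agent": 500, "science": 500,
    "chat": 400, "conversational_agent": 300,
    "instruction_following": 300, "safety": 100,
}

def build_calib_mix(samples_by_domain, counts, seed=0):
    """Draw the requested number of samples per domain, without padding:
    each sample keeps its natural length (up to the 12,288-token cap)."""
    rng = random.Random(seed)
    mix = []
    for domain, n in counts.items():
        mix.extend(rng.sample(samples_by_domain[domain], n))
    return mix
```

A fixed seed keeps the calibration set reproducible across quantization runs.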
## Usage

### SGLang
```shell
python -m sglang.launch_server \
  --model chankhavu/Nemotron-Cascade-2-30B-A3B-FP8 \
  --trust-remote-code \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nano_v3
```
### vLLM
```shell
vllm serve chankhavu/Nemotron-Cascade-2-30B-A3B-FP8 \
  --mamba_ssm_cache_dtype float32 \
  --max-model-len 262144 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser nemotron_v3
```
## Acknowledgments
- Quantization recipe based on the Nemotron 3 Nano Technical Report
- Quantized with NVIDIA ModelOpt
## Base Model

nvidia/Nemotron-Cascade-2-30B-A3B