# Qwen3-TTS 0.6B ONNX INT8 Quantized

This repository provides INT8 quantized ONNX models for Qwen3-TTS 0.6B, optimized for efficient inference.

## Model Details

### Compression Results

| Model | Original | Quantized | Compression |
|---|---|---|---|
| talker_prefill | 1.69 GB | 448 MB | 75% |
| talker_decode | 1.69 GB | 448 MB | 75% |
| text_project | 1.21 GB | 317 MB | 75% |
| tokenizer12hz_decode | 436 MB | 221 MB | 52% |
| code_predictor | 420 MB | 111 MB | 75% |
| tokenizer12hz_encode | 184 MB | 76 MB | 61% |
| code_predictor_embed | 120 MB | 31 MB | 75% |
| speaker_encoder | 34 MB | 9.3 MB | 73% |
| codec_embed | 12 MB | 3.1 MB | 75% |
| **Total** | **6.1 GB** | **1.6 GB** | **73%** |
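The recurring ~75% figure matches the arithmetic of weight-only INT8 quantization: each FP32 weight (4 bytes) becomes one INT8 byte, so weight storage shrinks roughly fourfold; the tokenizer models compress less, likely because some of their tensors remain in higher precision. A quick sanity check of the expected ratio:

```python
# Expected size reduction when FP32 weights (4 bytes each) are stored as
# INT8 (1 byte each), ignoring scales/zero-points and non-quantized tensors.
fp32_bytes, int8_bytes = 4, 1
reduction = 1 - int8_bytes / fp32_bytes
print(f"{reduction:.0%}")  # 75%
```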

## Usage

### Requirements

```bash
pip install onnxruntime numpy
```

### Loading Models

```python
import onnxruntime as ort

# Load a quantized model
session = ort.InferenceSession(
    "text_project_q.onnx",
    providers=["CPUExecutionProvider"]
)

# Run inference
outputs = session.run(None, {"input_ids": input_ids})
```
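ONNX Runtime is strict about input dtypes and shapes, so the `input_ids` feed above must already be a correctly typed array. A minimal sketch of preparing one (the name `input_ids` and the `int64` dtype are typical for token inputs but should be verified against `session.get_inputs()` for the actual model; the token IDs here are hypothetical):

```python
import numpy as np

# Token IDs for ONNX text models are usually int64 with a leading batch dim.
token_ids = [151644, 872, 198]           # hypothetical token IDs
input_ids = np.array([token_ids], dtype=np.int64)

print(input_ids.shape, input_ids.dtype)  # (1, 3) int64
```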

### Full Pipeline

For the complete TTS pipeline, you'll need:

  1. The tokenizer files from Qwen3-TTS-12Hz-0.6B-Base
  2. The Rust DLL for audio preprocessing (from the original repo)
  3. Reference audio for voice cloning

See the original repository for the complete pipeline example.

## Model Files

```text
quantized_int4/
β”œβ”€β”€ codec_embed_q.onnx           # 3.1 MB
β”œβ”€β”€ speaker_encoder_q.onnx       # 9.3 MB
β”œβ”€β”€ code_predictor_embed_q.onnx  # 31 MB
β”œβ”€β”€ code_predictor_q.onnx        # 111 MB
β”œβ”€β”€ tokenizer12hz_encode_q.onnx  # 76 MB
β”œβ”€β”€ tokenizer12hz_decode_q.onnx  # 221 MB
β”œβ”€β”€ text_project_q.onnx          # 317 MB
β”œβ”€β”€ talker_decode_q.onnx         # 448 MB
└── talker_prefill_q.onnx        # 448 MB
```

## Test Results (Linux, ONNX Runtime 1.23.2)

| Model | Status | Notes |
|---|---|---|
| text_project_q.onnx | βœ… Works | Text β†’ embedding |
| codec_embed_q.onnx | βœ… Works | Code embedding |
| code_predictor_q.onnx | βœ… Works | Sub-code prediction |
| code_predictor_embed_q.onnx | βœ… Works | Code predictor embedding |
| talker_prefill_q.onnx | βœ… Works | Initial generation |
| talker_decode_q.onnx | βœ… Works | Autoregressive decoding |
| speaker_encoder_q.onnx | ⚠️ Fails | Requires ConvInteger support |
| tokenizer12hz_encode_q.onnx | ⚠️ Fails | Requires ConvInteger support |
| tokenizer12hz_decode_q.onnx | ⚠️ Fails | Requires ConvInteger support |

## Known Limitations

- **ConvInteger ops**: the audio tokenizer and speaker encoder models use `ConvInteger` (opset 10) ops, which require either:
  - an ONNX Runtime build with MLAS optimizations, or
  - a GPU execution provider (CUDA, DirectML)
- **Voice cloning**: requires reference-audio processing from the original Rust DLL
- **Full pipeline**: for complete TTS, you need the non-quantized tokenizer models from the original repo
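Whether the `ConvInteger`-based models can run locally depends on which execution providers your ONNX Runtime build exposes. A minimal sketch of building a provider list that prefers a GPU and falls back to CPU; the `available` list is simulated here, and in a real script it would come from `ort.get_available_providers()`:

```python
# Simulated result of onnxruntime.get_available_providers(); replace with
# the real query in an actual script.
available = ["CPUExecutionProvider"]

# Prefer GPU providers (which the ConvInteger models may need), keeping
# CPU as the fallback; order in this list is the order ORT will try them.
preferred = ["CUDAExecutionProvider", "DmlExecutionProvider", "CPUExecutionProvider"]
providers = [p for p in preferred if p in available]

print(providers)  # with only CPU available: ['CPUExecutionProvider']
```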

## Credits

This work is based on:

1. **Qwen3-TTS** by the Qwen Team at Alibaba Cloud
   - Original PyTorch model and training
   - Apache 2.0 license
2. **zukky/Qwen3-TTS-ONNX-DLL** by @zukky
   - ONNX conversion with single-file embedded weights
   - Rust DLL for preprocessing and tokenization
   - Python pipeline example

## License

Apache-2.0 (following the original Qwen3-TTS license)

## Citation

```bibtex
@misc{qwen3tts2024,
  title={Qwen3-TTS: A Text-to-Speech Model},
  author={Qwen Team},
  year={2024},
  publisher={Alibaba Cloud}
}
```