# Qwen3-TTS 0.6B ONNX INT8 Quantized
This repository provides INT8 quantized ONNX models for Qwen3-TTS 0.6B, optimized for efficient inference.
## Model Details
- Original Model: Qwen/Qwen3-TTS by the Qwen Team at Alibaba
- ONNX Conversion: zukky/Qwen3-TTS-ONNX-DLL
- Quantization: Dynamic INT8 quantization using ONNX Runtime
## Compression Results
| Model | Original Size | Quantized Size | Size Reduction |
|---|---|---|---|
| talker_prefill | 1.69 GB | 448 MB | 75% |
| talker_decode | 1.69 GB | 448 MB | 75% |
| text_project | 1.21 GB | 317 MB | 75% |
| tokenizer12hz_decode | 436 MB | 221 MB | 52% |
| code_predictor | 420 MB | 111 MB | 75% |
| tokenizer12hz_encode | 184 MB | 76 MB | 61% |
| code_predictor_embed | 120 MB | 31 MB | 75% |
| speaker_encoder | 34 MB | 9.3 MB | 73% |
| codec_embed | 12 MB | 3.1 MB | 75% |
| **Total** | **6.1 GB** | **1.6 GB** | **73%** |
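Dynamic INT8 quantization stores each weight as an 8-bit integer plus a shared floating-point scale, which is where the roughly 4x (75%) size reduction on the large matmul-heavy models comes from. A minimal pure-Python sketch of symmetric per-tensor weight quantization (illustrative only; ONNX Runtime's `quantize_dynamic` applies this per operator, with its own calibration details):

```python
def quantize_weights_int8(weights):
    """Symmetric per-tensor INT8: 4 bytes/weight -> 1 byte/weight + one scale."""
    scale = max(abs(w) for w in weights) / 127.0  # map largest magnitude to 127
    q = [round(w / scale) for w in weights]       # int8 values in [-127, 127]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights at inference time."""
    return [v * scale for v in q]

q, scale = quantize_weights_int8([0.5, -1.27, 0.02])
approx = dequantize(q, scale)  # close to the original weights
```

Activations are quantized on the fly at run time ("dynamic"), so no calibration dataset is needed, at some accuracy cost relative to static quantization.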
## Usage
### Requirements

```bash
pip install onnxruntime numpy
```
### Loading Models

```python
import onnxruntime as ort

# Load a quantized model
session = ort.InferenceSession(
    "text_project_q.onnx",
    providers=["CPUExecutionProvider"],
)

# Run inference; input names and shapes vary per model and can be
# inspected via session.get_inputs()
outputs = session.run(None, {"input_ids": input_ids})
```
### Full Pipeline
For the complete TTS pipeline, you'll need:
- The tokenizer files from Qwen3-TTS-12Hz-0.6B-Base
- The Rust DLL for audio preprocessing (from the original repo)
- Reference audio for voice cloning
See the original repository for the complete pipeline example.
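A rough outline of how the stages chain, inferred from the model names and the notes in the test-results table below (hypothetical; actual input/output names, shapes, and ordering must be taken from the original repository's pipeline example):

```
text tokens      -- text_project -------------------> text embeddings
reference audio  -- tokenizer12hz_encode,
                    speaker_encoder ----------------> speaker/prompt features
embeddings + features -- talker_prefill -----------> initial state
loop: talker_decode + code_predictor(_embed) ------> codec codes
codec codes -- codec_embed, tokenizer12hz_decode --> waveform
```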
## Model Files

```
quantized_int4/
├── codec_embed_q.onnx            # 3.1 MB
├── speaker_encoder_q.onnx        # 9.3 MB
├── code_predictor_embed_q.onnx   # 31 MB
├── code_predictor_q.onnx         # 111 MB
├── tokenizer12hz_encode_q.onnx   # 76 MB
├── tokenizer12hz_decode_q.onnx   # 221 MB
├── text_project_q.onnx           # 317 MB
├── talker_decode_q.onnx          # 448 MB
└── talker_prefill_q.onnx         # 448 MB
```
## Test Results (Linux, ONNX Runtime 1.23.2)
| Model | Status | Notes |
|---|---|---|
| text_project_q.onnx | ✅ Works | Text → embedding |
| codec_embed_q.onnx | ✅ Works | Code embedding |
| code_predictor_q.onnx | ✅ Works | Sub-code prediction |
| code_predictor_embed_q.onnx | ✅ Works | Code predictor embedding |
| talker_prefill_q.onnx | ✅ Works | Initial generation |
| talker_decode_q.onnx | ✅ Works | Autoregressive decoding |
| speaker_encoder_q.onnx | ⚠️ Fails | Requires ConvInteger support |
| tokenizer12hz_encode_q.onnx | ⚠️ Fails | Requires ConvInteger support |
| tokenizer12hz_decode_q.onnx | ⚠️ Fails | Requires ConvInteger support |
## Known Limitations

- ConvInteger ops: The audio tokenizer and speaker encoder models use `ConvInteger` (opset 10) ops, which require either:
  - an ONNX Runtime build with MLAS optimizations, or
  - a GPU execution provider (CUDA, DirectML)
- Voice cloning: Requires reference audio processing from the original DLL
- Full pipeline: For complete TTS, you need the non-quantized tokenizer models from the original repo
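Per the ONNX spec, `ConvInteger` convolves 8-bit integer inputs with zero-point offsets and accumulates in int32, which is why it needs dedicated integer kernels (MLAS) or a GPU provider rather than generic float convolution. An illustrative pure-Python sketch of its 1-D semantics (simplified; the real op also supports a weight zero point, strides, padding, and N-D inputs):

```python
def conv_integer_1d(x, w, x_zero_point=0):
    """Minimal 1-D ConvInteger: subtract the input zero point, accumulate as int."""
    out = []
    for i in range(len(x) - len(w) + 1):
        acc = 0  # int32 accumulator in the real op
        for j, wj in enumerate(w):
            acc += (x[i + j] - x_zero_point) * wj
        out.append(acc)
    return out

conv_integer_1d([10, 11, 12, 13], [1, 1], x_zero_point=10)  # -> [1, 3, 5]
```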
## Credits
This work is based on:
- **Qwen3-TTS** by the Qwen Team at Alibaba Cloud
  - Original PyTorch model and training
  - Apache 2.0 License
- **zukky/Qwen3-TTS-ONNX-DLL** by @zukky
  - ONNX conversion with single-file embedded weights
  - Rust DLL for preprocessing and tokenization
  - Python pipeline example
## License
Apache-2.0 (following the original Qwen3-TTS license)
## Citation

```bibtex
@misc{qwen3tts2024,
  title={Qwen3-TTS: A Text-to-Speech Model},
  author={Qwen Team},
  year={2024},
  publisher={Alibaba Cloud}
}
```