# Qwen3-TTS 0.6B ONNX INT8 Quantized

This repository provides INT8 quantized ONNX models for Qwen3-TTS 0.6B, optimized for efficient inference.

## Model Details

### Compression Results

| Model | Original | Quantized | Compression |
|---|---|---|---|
| talker_prefill | 1.69 GB | 448 MB | 75% |
| talker_decode | 1.69 GB | 448 MB | 75% |
| text_project | 1.21 GB | 317 MB | 75% |
| tokenizer12hz_decode | 436 MB | 221 MB | 52% |
| code_predictor | 420 MB | 111 MB | 75% |
| tokenizer12hz_encode | 184 MB | 76 MB | 61% |
| code_predictor_embed | 120 MB | 31 MB | 75% |
| speaker_encoder | 34 MB | 9.3 MB | 73% |
| codec_embed | 12 MB | 3.1 MB | 75% |
| **Total** | **6.1 GB** | **1.6 GB** | **73%** |
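The recurring ~75% figure matches the arithmetic of weight-only INT8 quantization: each FP32 weight (4 bytes) becomes one INT8 byte, so weight storage shrinks roughly fourfold; the tokenizer models compress less, likely because some of their tensors remain in higher precision. A quick sanity check of the expected ratio:

```python
# Expected size reduction when FP32 weights (4 bytes each) are stored as
# INT8 (1 byte each), ignoring scales/zero-points and non-quantized tensors.
fp32_bytes, int8_bytes = 4, 1
reduction = 1 - int8_bytes / fp32_bytes
print(f"{reduction:.0%}")  # 75%
```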

## Usage

### Requirements

```bash
pip install onnxruntime numpy
```

### Loading Models

```python
import onnxruntime as ort

# Load a quantized model
session = ort.InferenceSession(
    "text_project_q.onnx",
    providers=["CPUExecutionProvider"]
)

# Run inference
outputs = session.run(None, {"input_ids": input_ids})
```
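ONNX Runtime is strict about input dtypes and shapes, so the `input_ids` feed above must already be a correctly typed array. A minimal sketch of preparing one (the name `input_ids` and the `int64` dtype are typical for token inputs but should be verified against `session.get_inputs()` for the actual model; the token IDs here are hypothetical):

```python
import numpy as np

# Token IDs for ONNX text models are usually int64 with a leading batch dim.
token_ids = [151644, 872, 198]           # hypothetical token IDs
input_ids = np.array([token_ids], dtype=np.int64)

print(input_ids.shape, input_ids.dtype)  # (1, 3) int64
```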

### Full Pipeline

For the complete TTS pipeline, you'll need:

  1. The tokenizer files from Qwen3-TTS-12Hz-0.6B-Base
  2. The Rust DLL for audio preprocessing (from the original repo)
  3. Reference audio for voice cloning

See the original repository for the complete pipeline example.

## Model Files

```text
quantized_int4/
β”œβ”€β”€ codec_embed_q.onnx           # 3.1 MB
β”œβ”€β”€ speaker_encoder_q.onnx       # 9.3 MB
β”œβ”€β”€ code_predictor_embed_q.onnx  # 31 MB
β”œβ”€β”€ code_predictor_q.onnx        # 111 MB
β”œβ”€β”€ tokenizer12hz_encode_q.onnx  # 76 MB
β”œβ”€β”€ tokenizer12hz_decode_q.onnx  # 221 MB
β”œβ”€β”€ text_project_q.onnx          # 317 MB
β”œβ”€β”€ talker_decode_q.onnx         # 448 MB
└── talker_prefill_q.onnx        # 448 MB
```

## Test Results (Linux, ONNX Runtime 1.23.2)

| Model | Status | Notes |
|---|---|---|
| text_project_q.onnx | βœ… Works | Text β†’ embedding |
| codec_embed_q.onnx | βœ… Works | Code embedding |
| code_predictor_q.onnx | βœ… Works | Sub-code prediction |
| code_predictor_embed_q.onnx | βœ… Works | Code predictor embedding |
| talker_prefill_q.onnx | βœ… Works | Initial generation |
| talker_decode_q.onnx | βœ… Works | Autoregressive decoding |
| speaker_encoder_q.onnx | ⚠️ Fails | Requires ConvInteger support |
| tokenizer12hz_encode_q.onnx | ⚠️ Fails | Requires ConvInteger support |
| tokenizer12hz_decode_q.onnx | ⚠️ Fails | Requires ConvInteger support |

## Known Limitations

- **ConvInteger ops**: the audio tokenizer and speaker encoder models use `ConvInteger` (opset 10) ops, which require either:
  - an ONNX Runtime build with MLAS optimizations, or
  - a GPU execution provider (CUDA, DirectML)
- **Voice cloning**: requires reference-audio processing from the original Rust DLL
- **Full pipeline**: for complete TTS, you need the non-quantized tokenizer models from the original repo
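Whether the `ConvInteger`-based models can run locally depends on which execution providers your ONNX Runtime build exposes. A minimal sketch of building a provider list that prefers a GPU and falls back to CPU; the `available` list is simulated here, and in a real script it would come from `ort.get_available_providers()`:

```python
# Simulated result of onnxruntime.get_available_providers(); replace with
# the real query in an actual script.
available = ["CPUExecutionProvider"]

# Prefer GPU providers (which the ConvInteger models may need), keeping
# CPU as the fallback; order in this list is the order ORT will try them.
preferred = ["CUDAExecutionProvider", "DmlExecutionProvider", "CPUExecutionProvider"]
providers = [p for p in preferred if p in available]

print(providers)  # with only CPU available: ['CPUExecutionProvider']
```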

## Credits

This work is based on:

1. **Qwen3-TTS** by the Qwen Team at Alibaba Cloud
   - Original PyTorch model and training
   - Apache 2.0 license
2. **zukky/Qwen3-TTS-ONNX-DLL** by @zukky
   - ONNX conversion with single-file embedded weights
   - Rust DLL for preprocessing and tokenization
   - Python pipeline example

## License

Apache-2.0 (following the original Qwen3-TTS license)

## Citation

```bibtex
@misc{qwen3tts2024,
  title={Qwen3-TTS: A Text-to-Speech Model},
  author={Qwen Team},
  year={2024},
  publisher={Alibaba Cloud}
}
```