Model Card for Indic-Mio

Indic-Mio is an open-source Text-to-Speech (TTS) model that supports all 22 scheduled Indian languages and English. It produces natural-sounding speech at 44 kHz with a real-time factor (RTF) below 0.1. Zero-shot voice cloning is supported through speaker embeddings in the codec, and the model also handles code-mixed sentences well.
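
RTF is synthesis time divided by the duration of the audio produced, so an RTF below 0.1 means one second of speech takes under 0.1 s to generate. A minimal sketch of the arithmetic follows; the function name and the example timings are illustrative assumptions, not measurements of this model.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical timings: 0.8 s of compute for a 10 s clip.
rtf = real_time_factor(0.8, 10.0)
print(f"RTF = {rtf:.2f}")
```

Lower is faster: any value below 1.0 means the model synthesizes speech faster than it plays back.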

This model is a fine-tuned version of Aratako/MioTTS-0.6B, which uses MioCodec for speech tokenization.

Prompting

For emotion and style control, place the tags at the end of the sentence.

For example: मुझे यह फिल्म बहुत पसंद आई! <happy> or I am not sure if I can do this. <confused>

Tags for Indian languages: <happy>, <sad>, <angry>, <disgust>, <fear>, <surprise>
Tags for English: <happy>, <sad>, <enunciated>, <confused>, <angry>, <whisper>

To stress a word, wrap it in asterisks (*). For example: No! I could *never* do it!
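
The tagging and stress conventions above can be applied programmatically. The sketch below is a hypothetical helper (build_prompt and its parameters are not part of the model or any library); it only assembles the input string and checks the tag against the lists given above.

```python
INDIC_TAGS = {"happy", "sad", "angry", "disgust", "fear", "surprise"}
ENGLISH_TAGS = {"happy", "sad", "enunciated", "confused", "angry", "whisper"}


def build_prompt(text, tag=None, english=False, stress=()):
    """Wrap stressed words in asterisks and append an optional style tag."""
    for word in stress:
        # Plain substring replacement; only the first occurrence is marked.
        text = text.replace(word, f"*{word}*", 1)
    if tag is None:
        return text
    allowed = ENGLISH_TAGS if english else INDIC_TAGS
    if tag not in allowed:
        raise ValueError(f"unsupported tag: {tag}")
    # The tag goes at the end of the sentence, as described above.
    return f"{text} <{tag}>"


print(build_prompt("No! I could never do it!", tag="angry", english=True, stress=["never"]))
```

The returned string is what you would pass as the user message content when calling the model.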

Inference

Approach 1: With MioTTS-Inference (recommended)

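Install vLLM and set up MioTTS-Inference. The vllm serve command keeps running in the foreground, so run the remaining commands from a separate terminal.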

vllm serve SPRINGLab/Indic-Mio --gpu-memory-utilization 0.5
cd MioTTS-Inference
python run_server.py
python run_gradio.py

Approach 2: Directly with Transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
from miocodec import MioCodec
import torch

model_name = "SPRINGLab/Indic-Mio"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

text = "नमस्ते, आप कैसे हैं?"
messages = [{"role": "user", "content": text}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,  # enable sampling so temperature and top_p take effect
    temperature=0.9,
    top_p=0.9,
)

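# Drop the prompt tokens and keep only the newly generated ones.
generated = output[0][inputs["input_ids"].shape[1]:]
# Speech tokens occupy IDs [speech_offset, speech_offset + 12800);
# subtract the offset to map them back to codec indices.
speech_offset = 151669
audio_codes = [t.item() - speech_offset for t in generated
               if speech_offset <= t.item() < speech_offset + 12800]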

# Decode the codec indices to a waveform with MioCodec.
# NOTE: the sample rate passed to sf.write must match the output rate of the
# codec checkpoint loaded here; check the checkpoint name against that rate.

codec = MioCodec.from_pretrained("Aratako/MioCodec-25Hz-24kHz")
codes_tensor = torch.tensor([audio_codes], dtype=torch.long).unsqueeze(0)  # [1, 1, T]
with torch.no_grad():  # inference only, no gradients needed
    wav = codec.decode(codes_tensor)  # -> [1, 1, num_samples]

import soundfile as sf
sf.write("output.wav", wav.squeeze().float().cpu().numpy(), 44100)
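
The token-filtering step in the snippet above can be tried without loading the model. The following sketch reuses the offset and range from that snippet on synthetic token IDs; the function names are hypothetical, and the 25 frames-per-second rate is an assumption read from the "25Hz" in the codec name.

```python
SPEECH_OFFSET = 151669   # first speech-token ID, as in the snippet above
CODEBOOK_SIZE = 12800    # width of the speech-token range, as in the snippet above
FRAME_RATE_HZ = 25       # assumption: taken from the "25Hz" in the codec name


def extract_audio_codes(token_ids):
    """Keep only speech tokens and shift them to codec indices."""
    return [t - SPEECH_OFFSET for t in token_ids
            if SPEECH_OFFSET <= t < SPEECH_OFFSET + CODEBOOK_SIZE]


def approx_duration_seconds(num_codes):
    """Estimate clip length, assuming one code per codec frame."""
    return num_codes / FRAME_RATE_HZ


# Synthetic IDs: three in the speech range, plus one below and one above it.
ids = [42, 151669, 151670, 164468, 164469]
codes = extract_audio_codes(ids)
print(codes)                          # [0, 1, 12799]
print(approx_duration_seconds(1024))  # 40.96
```

Under the one-code-per-frame assumption, max_new_tokens=1024 caps a single generation at roughly 41 seconds of audio.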

Training

This model was trained on a single NVIDIA A6000 ADA GPU in less than 6 hours.

For Indian languages, the IndicTTS, Rasa, and Syspin datasets were used. For American English, LibriTTS and Expresso were used, and for Indian English, the SPICOR dataset.

Fine-tuning

This model is robust yet flexible. You can fine-tune it on your own dataset for better performance on specific languages, accents, speakers, styles or emotions. Just a few steps of LoRA fine-tuning can significantly improve the performance for your target task.
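
As a starting point, LoRA adapters can be attached with the peft library. This is a sketch under stated assumptions, not the authors' training recipe: the rank, alpha, dropout, and target module names are illustrative guesses for a Qwen-style decoder, and preparing training data in the model's text-plus-speech-token format is not shown.

```python
# Hypothetical starting hyperparameters; tune them for your data.
LORA_KWARGS = dict(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Attention projection names used by Qwen-style decoder layers (assumed).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)


def build_lora_model(model_name="SPRINGLab/Indic-Mio"):
    """Load the base model and wrap it with trainable LoRA adapters."""
    # Imported lazily so the config above can be inspected without a GPU stack.
    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    model = get_peft_model(model, LoraConfig(**LORA_KWARGS))
    model.print_trainable_parameters()  # shows how small the trainable subset is
    return model
```

The wrapped model can then be passed to a standard causal-language-model training loop; only the adapter weights receive gradients.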

Citations

If you use this model, please cite this Hugging Face repository as follows:

@misc{indic-mio-tts,
  title={Indic-Mio TTS},
  author={Advait Joglekar},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/SPRINGLab/Indic-Mio}}
}