arxiv:2604.25441

Praxy Voice: Voice-Prompt Recovery + BUPS for Commercial-Class Indic TTS from a Frozen Non-Indic Base at Zero Commercial-Training-Data Cost

Published on Apr 28 · Submitted by Venkata Pushpak Teja Menta on Apr 30

AI-generated summary

Researchers enhance a non-Indic-native text-to-speech system by implementing a Brahmic Unified Phoneme Space, LoRA adaptation, and voice-prompt recovery techniques to achieve commercial-quality output for Indic languages without requiring new acoustic decoders or commercial training data.

Abstract

Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.
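
As a concrete illustration of the recipe, here is a minimal sketch of the Config B voice-prompt recovery step, assuming the open-source Chatterbox Python API and that its generate() call exposes the exaggeration, temperature, and min_p knobs quoted above; the reference-clip path is a placeholder, not a file from the release.

```python
# Minimal sketch of "Config B" voice-prompt recovery, assuming
# Chatterbox's generate() exposes these sampling knobs; the clip
# path below is a placeholder, not a file from the release.
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# An 8-11 s same-language reference clip drives the acoustic recovery.
wav = model.generate(
    "iso-15919 romanised text produced by BUPS goes here",
    audio_prompt_path="telugu_reference_9s.wav",  # placeholder clip
    exaggeration=0.7,  # Config B override
    temperature=0.6,   # Config B override
    min_p=0.1,         # Config B override
)
ta.save("out.wav", wav, model.sr)
```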

Community

Paper author · Paper submitter

Indic TTS from a frozen non-Indic base, at zero commercial-training-data cost.

We start with Chatterbox (English-trained, MIT) and ship commercial-class Telugu/Hindi/Tamil TTS using three contributions:

  1. BUPS: a Brahmic Unified Phoneme Space that romanises Devanagari/Telugu/Tamil through a shared ISO-15919 layer, letting one English-trained base address all three scripts (romanisation sketched after this list).
  2. Voice-prompt recovery + Config-B sampling: a generation-time recipe that reconstructs Indic phonology from any 8–15s reference clip without retraining the base.
  3. R6 LoRA: a 16-rank adapter on T3 transformer attention, fine-tuned on the publicly licensed IndicVoices subset only (adapter shape sketched after this list).
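
Contribution 1 is a deterministic script-to-ISO-15919 romanisation. As a rough stand-in (not the paper's BUPS tables), the off-the-shelf indic_transliteration package's ISO scheme shows the shape of the transform:

```python
# Rough stand-in for the BUPS romanisation step: deterministic
# script -> ISO-15919 transliteration via the indic_transliteration
# package, NOT the paper's own BUPS mapping.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

SOURCE_SCHEMES = {
    "hi": sanscript.DEVANAGARI,
    "te": sanscript.TELUGU,
    "ta": sanscript.TAMIL,
}

def to_iso15919(text: str, lang: str) -> str:
    """Romanise Indic text so a Latin-only tokeniser can process it."""
    return transliterate(text, SOURCE_SCHEMES[lang], sanscript.ISO)

print(to_iso15919("తెలుగు", "te"))  # ISO-15919 Latin output
```

And contribution 3's adapter shape, expressed with Hugging Face peft. The target-module names, alpha, and dropout below are assumptions about t3's attention internals, not read from the release:

```python
# Sketch of an R6-style config: rank-16 LoRA on the attention
# projections of the t3 text-token predictor only. lora_alpha,
# dropout, and target_modules are assumed, not from the release.
from peft import LoraConfig, get_peft_model

r6_style = LoraConfig(
    r=16,                                                     # 16-rank, per the post
    lora_alpha=32,                                            # assumed scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    lora_dropout=0.05,                                        # assumed
    task_type="CAUSAL_LM",
)
# applied to the text-token predictor only, e.g.:
# model.t3 = get_peft_model(model.t3, r6_style)
```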

Headline finding (PSP benchmark, arXiv:2604.25476): Praxy R6 beats Sarvam Bulbul on Telugu retroflex collapse (26.7% vs 33.3%) and outperforms a commercial trio on Tamil-zha (71% vs 86%), while staying within striking distance of ElevenLabs/Cartesia on Hindi LLM-WER. Apache-2.0 weights, MIT code, live HF Space demo.
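
The released MIT router picks among these branches at request time. Below is a minimal sketch of that routing logic, with a naive script-mixing check standing in for whatever code-mix detector the release actually uses; the branch labels are hypothetical, not identifiers from the repo.

```python
# Minimal sketch of the three-branch routing described above. The
# code-mix test (Latin + Brahmic letters in one string) is a naive
# stand-in; branch labels are hypothetical, not from the repo.
import unicodedata

BRAHMIC = ("DEVANAGARI", "TELUGU", "TAMIL")

def is_code_mixed(text: str) -> bool:
    """Heuristic: Latin and Brahmic letters appear in the same string."""
    names = [unicodedata.name(c, "") for c in text if c.isalpha()]
    has_latin = any(n.startswith("LATIN") for n in names)
    has_indic = any(n.startswith(BRAHMIC) for n in names)
    return has_latin and has_indic

def route(text: str, lang: str) -> str:
    if is_code_mixed(text):
        return "indicf5_native_script"    # branch 3: IndicF5 + transliteration
    if lang == "hi":
        return "chatterbox_vanilla_cfgB"  # branch 2: vanilla base + Config B
    return "chatterbox_r6_lora_cfgB"      # branch 1: R6 LoRA + Config B (te/ta)

assert route("నమస్కారం", "te") == "chatterbox_r6_lora_cfgB"
assert route("meeting నమస్కారం", "te") == "indicf5_native_script"
assert route("नमस्ते", "hi") == "chatterbox_vanilla_cfgB"
```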



Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 1

Collections including this paper 0
