Building a Voice-First Hindi Tutor: Technical Lessons from Helping Diaspora Kids Talk to Their Grandparents
TL;DR: My nephew in Singapore was losing his Hindi. Video calls with Dadi were getting awkward. I built an AI tutor so diaspora kids (ages 5-10) can practice real conversations daily. Went through 3 LLM providers, 4 STT experiments, and learned that building for children is 10x harder than building for adults. Here's the technical journey.
It Started at a Family Dinner
Singapore, 2024. My nephew, surrounded by Mandarin at school, was struggling to talk to his grandmother on video calls. The typical pattern: wave awkwardly, say "नमस्ते Dadi," then silence, with his mom translating everything.
The gap was obvious: Traditional apps teach vocabulary. Video calls happen weekly at best. But conversation—real, natural conversation—needs daily practice with a patient companion who lets him talk about dinosaurs and cartoons in Hindi without judgment.
So, between jobs, I spent two weeks building the first version of what he needed. Naive as I was, I thought it would be straightforward.
It wasn't.
This is the technical story of building a voice-first Hindi conversation tutor for diaspora kids trying to hold onto their heritage language.
The Real Problem: Not Just Latency, But Context
There are dozens of storytelling apps in Hindi. Plenty of reading apps like Google's Read Along and Kutuki. But zero conversational practice apps that actually work for kids speaking Hindi in a diaspora context.
Why is diaspora different?
Code-switching is the norm:
"मैं school जा रहा हूं" (I'm going to school) "मुझे dinosaurs बहुत पसंद हैं" (I really like dinosaurs)
Cultural vocabulary matters:
- They need to know "Dadi" vs "Nani" (not just "grandmother")
- Understand why we light diyas on Diwali
- Be able to talk about rotis and parathas, not just "bread"
English is dominant:
- Hindi is the "special occasion" language
- Kids are self-conscious about making mistakes
- They give up easily if it feels like a test
Building for this context meant rethinking everything about the standard STT → LLM → TTS pipeline.
Architecture Evolution: 3 LLM Providers, 4 STT Attempts
The Naive Start (August 2024)
ElevenLabs STT → OpenAI GPT-4 → ElevenLabs TTS
Hypothesis: Use the best-in-class for each component.
Reality:
- ElevenLabs STT was terrible at Hindi (90% accuracy for English, 40% for Hindi)
- GPT-4 was powerful but slow (1.5s per response)
- Total latency: 4-6 seconds
- My nephew lost interest before hearing responses
Lesson 1: "Best-in-class" for English ≠ best-in-class for Indic languages.
The Hindi-Specialized Stack (September 2024)
Sarvam STT → OpenAI GPT-4 → ElevenLabs TTS
Why Sarvam? Indian startup specializing in Hindi ASR. Better at children's speech patterns and understanding context.
Why keep ElevenLabs TTS? I tried Sarvam's TTS: robotic, with 1s+ latency. For kids, voice quality is non-negotiable. They need a warm, engaging voice that feels like talking to a real person, not a robot.
Result: Better accuracy, still too slow (3-4s latency).
The Speed Obsession (September 2024)
Sarvam STT → Groq Llama 3.1 8B → ElevenLabs TTS
Breakthrough insight: For "talk to Dadi about your day" conversations, GPT-4 is overkill.
Optimizations:
- Smaller, faster model (Llama 3.1 8B)
- Parallel API calls (evaluation + response generation)
- Cost dropped 10x
# Sequential execution (SLOW)
transcription = await sarvam_stt(audio)       # 800ms
evaluation = await evaluate(transcription)    # 1200ms
response = await generate_response(...)       # 1500ms
audio = await elevenlabs_tts(response)        # 800ms
# Total: 4300ms

# Parallel execution (FAST)
transcription = await sarvam_stt(audio)       # 800ms
# Evaluation and response generation run simultaneously
evaluation, response = await asyncio.gather(
    evaluate(transcription),                  # 1200ms
    generate_response(...),                   # 1500ms
)
audio = await elevenlabs_tts(response)        # 800ms
# Total: 3100ms (28% faster)
Result: 44% latency reduction overall (smaller model plus parallel calls). Kids stayed engaged.
Trade-off: Slightly less nuanced responses, but 5-year-olds didn't notice.
Current Stack (December 2024 - Present)
Sarvam STT → Google Gemini 2.0 Flash Lite → ElevenLabs TTS
Why switch from Groq to Gemini?
- Comparable speed (700ms inference)
- Better at Hindi context and cultural nuance
- Native streaming support for typewriter effect
- More reliable (Groq had occasional timeouts)
- Cost-effective ($0.075 per 1M tokens; see the comparison table below)
The winning pipeline:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as executor:
    eval_future = executor.submit(
        evaluate_response,       # Grammar + context check
        user_text, tutor_question
    )
    conv_future = executor.submit(
        generate_response,       # Actual conversation
        conversation_history, child_name
    )
    evaluation = eval_future.result()
    conversation = conv_future.result()
This parallel execution saves 1-2 seconds per turn. For kids, that's the difference between "this is fun!" and "I'm bored."
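One reason Gemini won the slot is the native streaming mentioned above: the response can be forwarded to the browser token by token for the typewriter effect. A minimal sketch of how that side might look with the google-generativeai SDK; GEMINI_API_KEY and build_tutor_prompt() are placeholders, not the production code.
import google.generativeai as genai

genai.configure(api_key=GEMINI_API_KEY)  # assumed config value
model = genai.GenerativeModel("gemini-2.0-flash-lite")

def stream_tutor_reply(conversation_history, child_name):
    # build_tutor_prompt() is a hypothetical helper that folds in history + child name
    prompt = build_tutor_prompt(conversation_history, child_name)
    for chunk in model.generate_content(prompt, stream=True):
        if chunk.text:
            yield chunk.text  # forwarded to the browser for the typewriter effect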
The STT Nightmare: Why Indic Languages Are Still Hard
Experiment 1: Chromium Native ASR 🚫
Hypothesis: Use browser's built-in speech recognition. Zero API costs, instant feedback with live transcription.
Reality with kids learning Hindi:
Live transcription updates as the model corrects itself:
First pass: "मैं school जा रहा"
Second pass: "मैं स्कूल..."
Final: "मैं school जा रहा हूं"
Kids thought the first transcription was correct. They'd stop speaking, confused why it was changing.
Verdict: Optimizing for perceived latency ≠ optimizing for learning clarity.
Experiment 2: Whisper (The Hype vs Reality) 🚫
Everyone said "just use Whisper for multilingual ASR."
For diaspora kids speaking Hindi:
- No context understanding for children's voices
- "बाल" (hair) vs "बॉल" (ball) - consistently wrong
- Proper nouns mangled: "Rohan" → "रोहन" (correct) but "Aarav" → "आरव" (wrong)
- Indian names and places: disaster
- Code-switching: "I like पिज़्ज़ा" → transcribed as gibberish
Example failure:
Child says: "मैं आज अपने friend के साथ park गया"
Whisper output: "main aaj apne friend ke saath park gaya" (all Roman script)
Needed: "मैं आज अपने friend के साथ park गया" (mixed script)
Lesson 2: Cutting-edge ≠ good for your specific use case.
Experiment 3: Google Cloud STT (The Dark Horse) ✅
I almost skipped it. Google doesn't market it aggressively. Seemed old-school compared to Whisper.
Surprise: It was perfect for diaspora kids.
Why it worked:
- Context-aware: Distinguishes between similar words based on sentence meaning
- Handles pauses: Kids think while speaking (long pauses are normal)
- Code-switching friendly: Properly handles English words in Hindi sentences
- Indian vocabulary: Knows "Dadi," "Nani," "paratha" aren't random sounds
- Forgiving pronunciation: Doesn't need perfect Hindi to understand intent
Example:
Child says (with a 2s pause mid-sentence): "मुझे... cricket खेलना बहुत पसंद है"
Google STT: correctly transcribes, pause intact
Whisper: often truncates or misses the second half
Trade-off: Doesn't correct pronunciation (but that was never the goal—conversation practice is the goal).
Current Setup: Google Cloud Primary, Sarvam Fallback
def transcribe_audio(audio_bytes, provider='google'):
    try:
        if provider == 'google':
            # Primary: Google Cloud STT (context-aware + decent in Hindi)
            # google_stt: wrapper around the Google Cloud Speech client
            return google_stt.transcribe(audio_bytes)
        return sarvam.transcribe(audio_bytes)
    except Exception as e:
        logger.warning(f"Google Cloud STT failed: {e}")
        # Fallback: Sarvam
        return sarvam.transcribe(audio_bytes)
Why dual-provider?
- Google Cloud STT: Better for messy, pause-heavy, code-switched speech
- Sarvam: Better for clear, confident speech
- Fallback ensures 99.9% uptime (critical for kids—they don't retry)
Voice UX: Designing for Heritage Language Learners
Challenge: These Aren't Native Speakers
Most voice apps assume fluent users. Diaspora kids:
- Think in English, translate to Hindi
- Take long pauses mid-sentence
- Need encouragement, not correction
- Get discouraged easily
Decision 1: Manual Recording > Voice Activity Detection
Conventional wisdom: Use VAD for seamless conversation.
Reality with 5-10 year olds:
- 3-5 second pauses while thinking of Hindi words
- Sudden outbursts mid-silence ("oh wait, I know!")
- Environmental noise (siblings yelling, TV, barking dogs)
- Self-conscious about "wasting recording time"
Our solution: Big, obvious record button with visual feedback.
// Simple is better for kids
recordButton.addEventListener('click', () => {
    if (isRecording) {
        stopRecording();
        processingIndicator.show();
    } else {
        startRecording();
        animatedMic.start();
    }
});
Visual cues matter:
- Red pulsing ring during recording
- "Listening..." text appears
- Animated mic icon to show it's working
- Clear "Stop Recording" state
Result: Kids understand the turn-taking model. Zero confusion.
Decision 2: The 10-Sentence Conversation Structure
Problem discovered during testing: Kids got tired after 15+ exchanges. Conversations dragged. They'd leave mid-conversation.
Goal: Keep them wanting more, not exhausted.
Solution: Structured phases with automatic wrap-up:
def get_phase_instruction(sentences_count, is_farewell=False):
    if is_farewell:
        return "Give warm goodbye with homework for parents"
    if sentences_count >= 10:
        return "Wrap up conversation with encouragement"
    if sentences_count == 9:
        return "Start transitioning to conclusion"
    return ""  # Continue naturally
The conversation arc:
Sentences 1-2: Warm greeting, establish topic
- "नमस्ते! आज हम खाने के बारे में बात करेंगे।" (Hi! Today we'll talk about food.)
Sentences 3-8: Natural back-and-forth
- Tutor asks questions, kid responds
- Gentle encouragement and corrections
Sentence 9: Wrap-up signal
- "वाह! तुमने आज बहुत अच्छी हिंदी बोली।" (Wow! You spoke great Hindi today.)
Sentence 10+: Graceful conclusion
- "अब जाकर मम्मी या पापा को बताओ..." (Now go tell Mom or Dad...)
- Gives them "homework" (share what they learned)
- Makes them feel accomplished
Why this works:
- Finite endpoint (kids like knowing when it'll end)
- Natural conclusion (not abrupt cutoff)
- Homework for parents (encourages family conversation)
- Feel accomplished, not exhausted
Farewell handling: If kid says "bye" or "अलविदा" at sentence 5:
if is_farewell:
    return immediate_warm_goodbye()  # Don't force them to continue
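Putting the pieces together, a sentence counter and a simple farewell check can decide which instruction gets folded into the tutor prompt each turn. A rough sketch; build_system_prompt() and the farewell word list are illustrative, not the production values.
FAREWELL_WORDS = {"bye", "goodbye", "अलविदा", "टाटा"}  # illustrative, not exhaustive

def detect_farewell(user_text):
    text = user_text.lower()
    return any(word in text for word in FAREWELL_WORDS)

def build_system_prompt(sentences_count, user_text):
    # Hypothetical helper: base persona plus whatever phase instruction applies
    is_farewell = detect_farewell(user_text)
    phase = get_phase_instruction(sentences_count, is_farewell=is_farewell)
    base = "You are a warm, patient Hindi tutor talking to a 5-10 year old child."
    return f"{base} {phase}".strip()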
Decision 3: Streaming with Typewriter Effect
Old approach: Wait for full response → display all at once → play audio
Better approach: Stream text as it generates, audio comes after
// Text appears immediately, audio later
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
    const {done, value} = await reader.read();
    if (done) break;
    const text = decoder.decode(value);
    displayTextTypewriter(text);  // Char-by-char animation
}
// Audio completes in background
await playAudio(audioBytes);
Why it works:
- Kids see instant feedback ("it's thinking about what I said!")
- Perceived latency < actual latency
- They start reading while audio generates
- Better sense of conversation flow
Actual latency breakdown:
User stops speaking
↓
500ms: Processing audio
↓
800ms: STT transcription
↓
0ms: Start streaming text (parallel LLM + TTS)
↓
700ms: LLM generates full response
↓
600ms: TTS completes audio
↓
Total: 2.6s but feels like 1.3s due to streaming
The Latency Battle: India + Diaspora Reality
The Mystery: Why Is Production 3x Slower?
Local testing (MacBook, Bangalore): 1.5-2s latency ✅
Production (Heroku, USA): 4-6s latency 🚫
Initial reaction: "Is Heroku throttling me?"
Actual root causes:
1. Geographic API distribution
Sarvam API: Mumbai servers
OpenAI API: US West servers
ElevenLabs API: US East servers
Round-trip time for API chain:
- Bangalore → Mumbai: 30ms
- Mumbai → US West: 200ms
- US West → US East: 70ms
- US East → User: variable
Production total: +400ms just in network hops
2. Heroku cold starts (free tier pain)
First request after 30min inactivity: 8-10s
Subsequent requests: 4-6s
3. Sequential API calls in initial architecture
Solutions Implemented
Optimization 1: Parallel API execution
# Impact: -28% latency

# Before
transcription = await stt(audio)        # 800ms
evaluation = await evaluate(text)       # 1200ms
response = await generate(context)      # 1500ms
audio = await tts(response)             # 800ms
# Total: 4300ms

# After
transcription = await stt(audio)        # 800ms
# Run simultaneously
evaluation, response = await asyncio.gather(
    evaluate(text),                     # 1200ms
    generate(context)                   # 1500ms
)
# (takes 1500ms, not 2700ms)
audio = await tts(response)             # 800ms
# Total: 3100ms
Optimization 2: Keep Heroku warm
# Heroku free dynos sleep after 30 minutes of inactivity, so keep them warm
@app.route('/health')
def health_check():
    return {'status': 'healthy', 'timestamp': time.time()}

# External service: UptimeRobot pings /health every 5 minutes
Optimization 3: Redis session caching
# Before: DB hit on every request (100ms)
# After: Redis hit (5ms)
def get_session_store():
    if redis_url:
        try:
            return RedisSessionStore(redis_url)
        except Exception:
            return FileSessionStore()  # Fallback
    return FileSessionStore()
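For reference, a Redis-backed store like this can be little more than JSON blobs with a 24-hour expiry. A minimal sketch assuming redis-py; the key naming and serialization are illustrative rather than the exact production class.
import json
import redis

class RedisSessionStore:
    TTL_SECONDS = 24 * 60 * 60  # active conversations expire after 24 hours

    def __init__(self, redis_url):
        self.redis = redis.from_url(redis_url)

    def save(self, session_id, state):
        # setex writes the value and its TTL in one call
        self.redis.setex(f"session:{session_id}", self.TTL_SECONDS, json.dumps(state))

    def load(self, session_id):
        raw = self.redis.get(f"session:{session_id}")
        return json.loads(raw) if raw else None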
The Provider Speed Race
We tried every major LLM provider:
| Provider | Model | Latency | Quality | Cost/1M | Verdict |
|---|---|---|---|---|---|
| OpenAI | GPT-4 | 1500ms | Excellent | $15 | Slow + expensive |
| OpenAI | GPT-4o-mini | 800ms | Very Good | $0.15 | Good but costly |
| Groq | Llama 3.1 8B | 600ms | Good | $0.05 | Fast! |
| Groq | Llama 3.3 70B | 900ms | Excellent | $0.59 | Quality jump |
| Google | Gemini 2.0 Flash Lite | 700ms | Excellent | $0.075 | Winner |
Why Gemini won:
- Fast inference (700ms avg)
- Excellent at Hindi conversation context
- Streaming support built-in
- Much cheaper than GPT-4o-mini
- More reliable than Groq (fewer timeouts)
Lesson 3: For kid-focused conversations in Hindi, model quality matters less than you think. Speed + cultural context matters more.
UI/UX: Why Building for Kids Is 10x Harder
The "This Is Boring" Problem
First version was built like a normal app:
- Clean white interface
- Minimal animations
- Text instructions
- Standard buttons
Kids' reaction: 😐 Left within 2 minutes.
The Airplane Insight
Flying Singapore → Bangalore, I watched the in-flight entertainment system. When you switch to "Kids Mode":
Adult mode: Clean, text-heavy, functional
Kids mode:
- Bright colors everywhere
- Animations on every interaction
- Dinosaurs, rockets, stars floating around
- Progress bars designed like games
- Sounds for every action
Aha moment: Kids need constant positive reinforcement or they disengage.
What Actually Works for 5-10 Year Olds
1. Visual reward system (The Star Economy)
// Points for every interaction
const POINTS = {
    sentence: 10,       // Every sentence in Hindi
    quality_bonus: 20,  // Every 5 good responses
    completion: 50,     // Finish conversation
    milestone: 30       // Special achievements
};

// Update with animation
function updateRewards(points) {
    const newPoints = currentPoints + points;
    animateNumberChange(starsElement, currentPoints, newPoints);
    currentPoints = newPoints;
    playSound('chime.mp3');
    // Milestone celebration
    if (goodResponses % 5 === 0) {
        showCelebration();  // Confetti + special animation
        playSound('applause.mp3');
    }
}
Why it works: Kids are motivated by immediate, visual progress. Stars accumulate visibly. They count them proudly.
2. Positive-only feedback (Never punish)
def get_feedback_type(grammar_score, context_score):
    if grammar_score >= 7 and context_score >= 7:
        return "green"  # "Great Hindi!" ✅
    if grammar_score >= 5 or context_score >= 5:
        return "amber"  # "Try saying..." 🔄
    # NEVER red/negative
    return "amber"      # Always give them a path forward
Green bubble: "बहुत अच्छा! That was great Hindi!"
Amber bubble: "Good! You can also say: [correction]"
Red bubble: ❌ Never. Kids shut down.
3. Minimal header during learning
Before (distracting):
[App Title] [Dashboard] [Nav] [Sentences: 5] [Points: 30] [Profile ▼]
After (focused):
[← Back] [⭐ 47] [🐶]
Result: 60% longer session times. Kids stayed in the conversation.
4. Fun animal avatars (kid-tested)
Instead of initials or photos:
const avatars = ['🐶', '🐱', '🐼', '🦊', '🐨', '🐯'];
// Rotate through on each visit
const avatar = avatars[sessionCount % avatars.length];
Kids love their animal identity. They ask "which animal am I today?"
5. Celebration animations at milestones
// Every 5 good responses
if (goodResponseCount % 5 === 0) {
    showFullScreenCelebration({
        confetti: true,
        message: "Amazing! 5 great sentences!",
        sound: 'applause.mp3',
        stars: +20
    });
}
Before celebration system: 4-minute average session
After celebration system: 12-minute average session
Kids kept talking to hit the next milestone.
The Grammar vs Context Challenge
Why Standard Grammar Checking Fails
Naive approach:
def evaluate(user_response):
    prompt = "Rate this Hindi sentence grammar (1-10): " + user_response
    return llm(prompt)
Problem: No context means wrong evaluations.
Example:
Tutor: "तुम्हें कौन सा खाना पसंद है?" (What food do you like?)
Kid: "हाँ" (Yes)
Grammar check: 10/10 ✅
Context check: ❌ Completely wrong answer
Better: Contextual Evaluation
def evaluate_response(user_response, tutor_question, conversation_history):
    prompt = f"""
    You're evaluating a child (age 5-10) learning Hindi.

    Tutor asked: "{tutor_question}"
    Child said: "{user_response}"
    Previous context: {conversation_history[-3:]}

    Evaluate:
    1. Grammar (1-10): Are sentences grammatically correct?
    2. Context (1-10): Does the response make sense for the question?
    3. Code-switching: Count English words used
    4. Encouragement: What to say to motivate them?
    5. Correction: If needed, suggest better phrasing

    Return JSON:
    {{
        "grammar_score": int,
        "context_score": int,
        "english_word_count": int,
        "feedback_type": "green" | "amber",
        "encouragement": str,
        "corrected_response": str | null
    }}
    """
    return gemini_json_mode(prompt)
Better evaluation:
{
    "grammar_score": 8,
    "context_score": 4,
    "english_word_count": 0,
    "feedback_type": "amber",
    "encouragement": "Good try! But let's answer the question.",
    "corrected_response": "मुझे पिज़्ज़ा पसंद है"  (I like pizza)
}
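The gemini_json_mode() call above can be a thin wrapper that asks Gemini for JSON output and parses it. A minimal sketch assuming the google-generativeai SDK; the production generation settings may differ.
import json
import google.generativeai as genai

def gemini_json_mode(prompt):
    model = genai.GenerativeModel(
        "gemini-2.0-flash-lite",
        generation_config={"response_mime_type": "application/json"},
    )
    response = model.generate_content(prompt)
    # Gemini returns a JSON string when response_mime_type is application/json
    return json.loads(response.text)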
The Code-Switching Dilemma
Diaspora kids naturally code-switch:
"मैं school जा रहा हूं और मेरे friends के साथ lunch खाऊंगा" (I'm going to school and will eat lunch with my friends)
Design decision we debated:
- Strict: Mark as incorrect, force Hindi-only
- Lenient: Accept it completely
- Guiding: Accept but gently suggest alternatives
We chose #3 (guiding):
def handle_code_switching(english_word_count, user_response):
    if english_word_count == 0:
        return {
            "feedback": "बहुत अच्छा! Pure Hindi!",
            "type": "green",
            "bonus_points": 5
        }
    elif english_word_count <= 2:
        return {
            "feedback": "Good! Next time try: 'स्कूल' instead of 'school'",
            "type": "green",  # Still positive
            "suggestion": get_hindi_alternatives(user_response)
        }
    else:  # 3+ English words
        return {
            "feedback": "Let's try that again in more Hindi",
            "type": "amber",
            "corrected_response": translate_to_hindi(user_response)
        }
Why this works:
- Doesn't discourage them (they're learning!)
- Gently pushes toward more Hindi
- Acknowledges their effort
- Gives concrete alternatives
Real example:
Kid: "मैं park में जाकर मेरे dost के साथ football खेला"
Response: "Great sentence! ⭐ You can also say:
'मैं पार्क में जाकर मेरे दोस्त के साथ फुटबॉल खेला'
Next time try using Hindi words for 'park', 'dost', and 'football'!"
Type: Green (still rewarded)
Points: +10
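The get_hindi_alternatives() helper referenced above can start as nothing fancier than a curated word map. A minimal sketch with an illustrative dictionary; the real mapping needs to be much larger.
# Illustrative mapping; the production list would be much larger
HINDI_ALTERNATIVES = {
    "school": "स्कूल",
    "park": "पार्क",
    "friend": "दोस्त",
    "dost": "दोस्त",
    "football": "फुटबॉल",
    "lunch": "दोपहर का खाना",
}

def get_hindi_alternatives(user_response):
    suggestions = {}
    for word in user_response.split():
        key = word.strip(".,!?").lower()
        if key in HINDI_ALTERNATIVES:
            suggestions[word] = HINDI_ALTERNATIVES[key]
    return suggestions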
Gamification: What Works vs What Doesn't
❌ What Didn't Work
1. Story Co-Creation
Initial idea: "Let's create a Panchatantra story together!"
Problems:
- Too open-ended for beginners
- Kids didn't know "what happens next"
- Requires advanced Hindi skills
- Paralysis of choice
Example failure:
Tutor: "एक कौवा प्यासा था। अब क्या हुआ?" (A crow was thirsty. What happened next?)
Kid: "...um... I don't know... पानी?" (water?)
Tutor: "हाँ! कहाँ पानी मिला?" (Yes! Where did he find water?)
Kid: [long pause] "I don't know" [exits]
2. Free-Form Conversation
Initial idea: "Talk about whatever you want!"
Problems:
- Kids don't know what to say
- AI veers into random topics
- No sense of progress
- No clear goal
3. Correction Pop-ups (Too Harsh)
Initial design: Immediately show corrections after mistakes
Problem: Felt like a test, not practice. Kids became self-conscious.
✅ What Actually Works
1. Structured conversation types
Each topic has clear scope and goals:
"My Family" (मेरा परिवार):
- Goal: Learn family member names in Hindi
- Scope: Dadi, Nani, Chacha, Bua relationships
- Duration: 8-10 exchanges
- Outcome: Can introduce family on video call
"Food Talk" (खाने की बातें):
- Goal: Describe favorite foods
- Scope: Meals, snacks, festivals foods
- Cultural element: Why do we eat certain foods?
- Outcome: Can talk about dinner with grandparents
Why structured works:
- Clear goal (kids know what they're working toward)
- Sense of progress (visible topic completion)
- Cultural context (learning why, not just what)
- Real application (will actually use this with Dadi)
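Concretely, each structured topic can be a small config the conversation engine reads. A sketch based on the two topics above; the field names are illustrative.
TOPICS = {
    "my_family": {
        "title_hi": "मेरा परिवार",
        "goal": "Learn family member names in Hindi",
        "scope": ["Dadi", "Nani", "Chacha", "Bua"],
        "target_exchanges": (8, 10),
        "outcome": "Can introduce family on a video call",
    },
    "food_talk": {
        "title_hi": "खाने की बातें",
        "goal": "Describe favorite foods",
        "scope": ["meals", "snacks", "festival foods"],
        "cultural_element": "Why do we eat certain foods?",
        "outcome": "Can talk about dinner with grandparents",
    },
}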
2. Milestone-based rewards
REWARD_STRUCTURE = {
    'sentence': 10,           # Base points
    'quality_milestone': 20,  # Every 5 good responses
    'completion': 50,         # Finish conversation
    'streak': 30,             # Multiple days in a row
}

def calculate_reward(metrics):
    points = REWARD_STRUCTURE['sentence']
    if metrics['good_response_count'] % 5 == 0:
        points += REWARD_STRUCTURE['quality_milestone']
        trigger_celebration()  # Visual reward
    return points
Result: Kids chase milestones ("just 2 more sentences till the celebration!")
3. Parent analytics (hidden motivator)
Dashboard shows parents:
- Conversations this week vs last week
- Good response percentage
- Topics covered
- Progress over time
Why it matters: Parents encourage kids when they see progress. Social proof works even for kids.
Example parent reaction:
"Beta, you had 4 conversations this week! That's amazing! And look, 70% were perfect Hindi sentences!"
Technical Decisions Worth Discussing
HTTP + SSE > WebSockets (For Turn-Taking)
Everyone: "Voice apps need WebSockets for real-time!"
Our reasoning: Kids do turn-taking, not simultaneous conversation.
# Simple HTTP endpoint
@app.route('/api/process_audio_stream', methods=['POST'])
def process_audio():
    audio = request.files['audio']

    def generate_response():
        # Streaming via SSE
        for chunk in stream_conversation(audio):
            yield f"data: {json.dumps(chunk)}\n\n"

    return Response(
        generate_response(),
        mimetype='text/event-stream'
    )
Benefits:
- Simpler architecture
- Easier debugging
- Better error handling
- Works with standard HTTP load balancers
- SSE gives us streaming without WebSocket complexity
Trade-off: Can't do simultaneous input/output (but we don't need it)
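For completeness, here is roughly what stream_conversation() could yield into that SSE response: text chunks first, then one audio event once TTS finishes. The event shapes are illustrative, not the exact production protocol, and the sketch reuses the streaming helper sketched earlier.
import base64

def stream_conversation(audio_file, conversation_history, child_name):
    transcription = transcribe_audio(audio_file.read())
    yield {"type": "transcription", "text": transcription}

    reply_text = ""
    for chunk in stream_tutor_reply(conversation_history, child_name):
        reply_text += chunk
        yield {"type": "text", "chunk": chunk}  # drives the typewriter effect

    # TTS once the full text is known (synchronous here for simplicity)
    audio_bytes = elevenlabs_tts(reply_text)
    yield {"type": "audio", "mp3_base64": base64.b64encode(audio_bytes).decode()}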
Dual-Layer Session Management
def get_session_store():
    """
    Production:  Redis (24hr TTL)
    Fallback:    FileSessionStore
    Persistence: PostgreSQL (forever)
    """
    if redis_url:
        try:
            store = RedisSessionStore(redis_url)
            store.redis.ping()  # fail fast if Redis is unreachable
            return store
        except Exception:
            logger.warning("Redis failed, using FileStore")
            return FileSessionStore()
    return FileSessionStore()
Why three layers?
| Layer | Purpose | TTL | Use Case |
|---|---|---|---|
| Redis | Fast access | 24 hours | Active conversation state |
| FileStore | Dev + fallback | 24 hours | Local dev, Redis failure |
| PostgreSQL | History | Forever | Parent dashboard, analytics |
Principle: Resilience > performance optimization. If Redis fails at 2am, conversations continue with FileStore.
SQLite → PostgreSQL (Ship Fast, Scale Later)
# Works in both dev and prod
database_url = os.getenv(
    'DATABASE_URL',
    'sqlite:///hindi_tutor.db'  # Dev fallback
)

# Heroku gives a postgres:// URL, SQLAlchemy needs postgresql://
if database_url.startswith('postgres://'):
    database_url = database_url.replace('postgres://', 'postgresql://', 1)
Development: SQLite (zero setup, file-based)
Production: PostgreSQL (Heroku managed)
Same ORM, different backend. Ship fast locally, scale when needed.
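Wiring that URL into SQLAlchemy is the usual engine/session setup; a minimal sketch:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine(database_url)  # SQLite locally, PostgreSQL on Heroku
SessionLocal = sessionmaker(bind=engine)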
The iOS Audio Hell
This deserves its own post.
The Problem: Safari Audio Just Doesn't Work
Issues discovered:
- First audio playback: Silent (AudioContext requires user gesture)
- Volume controls: Don't work (Web Audio API needed)
- Recording: Random lag (WebKit MediaRecorder quirks)
- OAuth redirects: Break audio state entirely
Standard audio approach:
// This FAILS on iOS
const audio = new Audio(audioUrl);
audio.play(); // Silent on first try
The Solution (After Pulling Hair Out)
Step 1: Unlock AudioContext on ANY user interaction
let audioContext = null;
let audioUnlocked = false;

function unlockAudio() {
    if (audioUnlocked) return;
    audioContext = new (window.AudioContext || window.webkitAudioContext)();
    // Play a silent one-sample buffer to "unlock" audio
    audioContext.resume().then(() => {
        const buffer = audioContext.createBuffer(1, 1, 22050);
        const source = audioContext.createBufferSource();
        source.buffer = buffer;
        source.connect(audioContext.destination);
        source.start(0);
        audioUnlocked = true;
        console.log('iOS audio unlocked');
    });
}

// Critical: Call on FIRST user interaction
document.addEventListener('click', unlockAudio, { once: true });
document.addEventListener('touchstart', unlockAudio, { once: true });
Step 2: Use Web Audio API for all playback
async function playAudioiOS(audioData) {
    if (!audioUnlocked) {
        console.error('Audio not unlocked yet');
        return;
    }
    // Decode audio data
    const arrayBuffer = await audioData.arrayBuffer();
    const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
    // Create source
    const source = audioContext.createBufferSource();
    source.buffer = audioBuffer;
    // Volume control (finally works!)
    const gainNode = audioContext.createGain();
    gainNode.gain.value = volumeSlider.value;
    // Connect: source → gain → destination
    source.connect(gainNode);
    gainNode.connect(audioContext.destination);
    // Play
    source.start(0);
}
Step 3: Handle OAuth redirects
// Before OAuth redirect, save audio state
function preserveAudioState() {
sessionStorage.setItem('audioUnlocked', audioUnlocked);
sessionStorage.setItem('volume', volumeSlider.value);
}
// After OAuth callback, restore
function restoreAudioState() {
audioUnlocked = sessionStorage.getItem('audioUnlocked') === 'true';
if (audioUnlocked) {
unlockAudio();
}
}
Lesson 4: Mobile web audio in 2025 is still a mess. Test on actual iPhones, not just simulators.
Measuring Success: Metrics That Matter
Latency Breakdown (Current)
User stops speaking
↓
Audio processing: 50ms
↓
STT (Google Cloud): 800ms
↓
[Parallel execution starts]
├─ Evaluation (Gemini): 600ms
└─ Response (Gemini): 400ms
↓
TTS (ElevenLabs): 300ms
↓
Network overhead: 200ms
↓
Total: ~1.7 seconds ✅
Target: < 2 seconds (kids stay engaged)
Achieved: 1.7s average, 1.0s best case
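One way to sanity-check numbers like these is to time each stage in the request handler. A minimal sketch using a context manager; the stage names are illustrative.
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = round((time.perf_counter() - start) * 1000)  # milliseconds

# Usage inside the request handler:
# with timed("stt"):
#     text = transcribe_audio(audio_bytes)
# with timed("tts"):
#     audio = elevenlabs_tts(reply_text)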
Key Lessons for Voice AI Builders
1. Don't Trust English-Optimized Benchmarks
- Whisper is SOTA for English ASR → Failed for Hindi kids
- GPT-4 is best LLM → Too slow for real-time voice
- WebSockets are standard → HTTP+SSE worked better
Takeaway: Benchmark on YOUR use case, not general leaderboards.
2. Optimize for Perceived Latency
Actual latency: 2.5s
Perceived latency: 1.3s (via streaming text)
Techniques:
- Show text immediately while audio generates
- Animated "thinking" indicators
- Progress bars during processing
- Smooth transitions between states
3. Kids Are a Different Species
What works for adults:
- Minimal UI
- Text-heavy
- Functional design
- Assumes motivation
What works for kids:
- Visual rewards
- Game-like progression
- Constant encouragement
4. For Indic Languages, Go Specialized
Winners:
- Google Cloud STT (context-aware + decent for Hindi)
- Sarvam STT (Hindi-specialized)
Losers:
- Whisper (English-optimized)
- Generic APIs (no cultural context)
5. Voice UX ≠ Chat UX
Voice-specific needs:
- Turn-taking clarity (record button > VAD)
- Audio quality > speed (for TTS)
- Long pause handling (kids think)
- Visual feedback (they can't see processing)
What's Next for the Product
Short-term (Next 3 Months)
1. Age-specific models
- 5-6 year olds: Simpler vocabulary, shorter sentences
- 7-8 year olds: More complex topics
- 9-10 year olds: Cultural context, stories
2. Phoneme practice module
- Difficult Hindi sounds: ड vs द, ट vs त, ऋ
- Minimal pairs practice
- Pronunciation feedback
3. Parent dashboard v2
- Conversation playback (audio clips)
- Weekly progress reports via email
- Suggested topics based on child's level
Long-term Vision
1. Sibling mode
- Multiple kids in one household
- Shared family subscription
- Individual progress tracking
2. Heritage language expansion
- Tamil for Tamil diaspora
- Gujarati, Telugu, Bengali
- Same architecture, different languages
3. Cultural curriculum
- Festival explanations (why Diwali? why Holi?)
- Story traditions (Panchatantra, Akbar-Birbal)
- Family vocabulary (why "Mami" ≠ "Chachi")
Goal: Become the default tool for diaspora families keeping heritage languages alive.
Open Questions (Still Figuring Out)
1. How do we measure actual learning vs engagement?
- High engagement doesn't always = learning
- Need longitudinal studies with real families
- Considering: pre/post conversation assessments
2. Optimal conversation length by age?
- 10 sentences works for 5-6 year olds
- Do 8-9 year olds need longer conversations?
- How to dynamically adjust?
3. Should we enforce Hindi-only or accept code-switching?
- Current: Accept with gentle guidance
- Alternative: Strict Hindi after level 3
- Need: More user research
4. How to prevent reward gaming?
- Kids figure out: Say anything → get points
- Considering: Quality threshold for rewards
- Balance: Don't discourage, but don't reward gibberish
Overall Summary
Time invested: 4 months of focused work
Lines of code: ~15,000
Git commits: 220+
LLM providers tried: 3 (OpenAI, Groq, Gemini)
STT providers tried: 4 (Elevenlabs, Whisper, Sarvam, Google)
UI redesigns: 6 major iterations
Biggest lesson:
Building for kids is humbling. They're brutally honest users—if it's boring, they leave. If it's too slow, they get frustrated. If it doesn't work, they try once and never return.
But when a 6-year-old says "मुझे यह बहुत पसंद है!" (I really like this!) after finishing a conversation?
When a parent messages: "She asked to practice before calling Dadi. First time ever."
When you see a family preserving their heritage language across continents?
Worth every commit. Worth every latency optimization. Worth every iOS audio bug.
Join the Journey
We're currently building with 100 founding families who get free early access. If you're a diaspora parent trying to keep Hindi alive, we'd love your feedback.
Built with:
- Backend: Flask + Python
- STT: Google Cloud Speech (Sarvam as fallback)
- LLM: Google Gemini 2.0 Flash Lite
- TTS: ElevenLabs
- Infrastructure: Heroku + Redis + PostgreSQL
Open questions for the community:
- What other heritage languages need this?
- How do we measure actual learning?
- Best practices for voice UX with kids?
Building something in voice AI or heritage language tech? Let's chat. This space needs more builders.
P.S. If you're building for kids, test with real kids early. They will humble you in ways adults never will.