Building a Voice-First Hindi Tutor: Technical Lessons from Helping Diaspora Kids Talk to Their Grandparents
TL;DR: My nephew in Singapore was losing his Hindi. Video calls with Dadi were getting awkward. I built an AI tutor so diaspora kids (ages 5-10) can practice real conversations daily. Went through 3 LLM providers, 4 STT experiments, and learned that building for children is 10x harder than building for adults. Here's the technical journey.
It Started at a Family Dinner
Singapore, 2024. My nephew, surrounded by Mandarin at school, was struggling to talk to his grandmother on video calls. The typical pattern: wave awkwardly, say "नमस्ते Dadi," then silence, with his mom translating everything.
The gap was obvious: Traditional apps teach vocabulary. Video calls happen weekly at best. But conversation—real, natural conversation—needs daily practice with a patient companion who lets him talk about dinosaurs and cartoons in Hindi without judgment.
So, between jobs, I spent two weeks building the first version of what he needed. Naive as I was, I thought it would be straightforward.
It wasn't.
This is the technical story of building a voice-first Hindi conversation tutor for diaspora kids trying to hold onto their heritage language.
The Real Problem: Not Just Latency, But Context
There are dozens of storytelling apps in Hindi. Plenty of reading apps like Google's Read Along and Kutuki. But zero conversational practice apps that actually work for kids speaking Hindi in a diaspora context.
Why is diaspora different?
Code-switching is the norm:
"मैं school जा रहा हूं" (I'm going to school) "मुझे dinosaurs बहुत पसंद हैं" (I really like dinosaurs)
Cultural vocabulary matters:
- They need to know "Dadi" vs "Nani" (not just "grandmother")
- Understand why we light diyas on Diwali
- Be able to talk about rotis and parathas, not just "bread"
English is dominant:
- Hindi is the "special occasion" language
- Kids are self-conscious about making mistakes
- They give up easily if it feels like a test
Building for this context meant rethinking everything about the standard STT → LLM → TTS pipeline.
Architecture Evolution: 3 LLM Providers, 4 STT Attempts
The Naive Start (August 2024)
ElevenLabs STT → OpenAI GPT-4 → ElevenLabs TTS
Hypothesis: Use the best-in-class for each component.
Reality:
- ElevenLabs STT was terrible at Hindi (90% accuracy for English, 40% for Hindi)
- GPT-4 was powerful but slow (1.5s per response)
- Total latency: 4-6 seconds
- My nephew lost interest before hearing responses
Lesson 1: "Best-in-class" for English ≠ best-in-class for Indic languages.
The Hindi-Specialized Stack (September 2024)
Sarvam STT → OpenAI GPT-4 → ElevenLabs TTS
Why Sarvam? Indian startup specializing in Hindi ASR. Better at children's speech patterns and understanding context.
Why keep ElevenLabs TTS? I tried Sarvam's TTS: robotic, with 1s+ latency. For kids, voice quality is non-negotiable. They need a warm, engaging voice that feels like talking to a real person, not a robot.
Result: Better accuracy, still too slow (3-4s latency).
The Speed Obsession (September 2024)
Sarvam STT → Groq Llama 3.1 8B → ElevenLabs TTS
Breakthrough insight: For "talk to Dadi about your day" conversations, GPT-4 is overkill.
Optimizations:
- Smaller, faster model (Llama 3.1 8B)
- Parallel API calls (evaluation + response generation)
- Cost dropped 10x
# Sequential execution (SLOW)
transcription = await sarvam_stt(audio)       # 800ms
evaluation = await evaluate(transcription)    # 1200ms
response = await generate_response(...)       # 1500ms
audio = await elevenlabs_tts(response)        # 800ms
# Total: 4300ms

# Parallel execution (FAST)
transcription = await sarvam_stt(audio)       # 800ms
# Evaluation and response generation run simultaneously
evaluation, response = await asyncio.gather(
    evaluate(transcription),                  # 1200ms
    generate_response(...),                   # 1500ms
)
audio = await elevenlabs_tts(response)        # 800ms
# Total: 3100ms (28% faster)
Result: 44% latency reduction overall (smaller model plus parallel calls). Kids stayed engaged.
Trade-off: Slightly less nuanced responses, but 5-year-olds didn't notice.
Current Stack (December 2024 - Present)
Sarvam STT → Google Gemini 2.0 Flash Lite → ElevenLabs TTS
Why switch from Groq to Gemini?
- Comparable speed (700ms inference)
- Better at Hindi context and cultural nuance
- Native streaming support for typewriter effect
- More reliable (Groq had occasional timeouts)
- Cost-effective ($0.075 per 1M tokens; see the comparison table below)
The winning pipeline:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as executor:
    eval_future = executor.submit(
        evaluate_response,       # Grammar + context check
        user_text, tutor_question
    )
    conv_future = executor.submit(
        generate_response,       # Actual conversation
        conversation_history, child_name
    )
    evaluation = eval_future.result()
    conversation = conv_future.result()
This parallel execution saves 1-2 seconds per turn. For kids, that's the difference between "this is fun!" and "I'm bored."
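One reason Gemini won the slot is the native streaming mentioned above: the response can be forwarded to the browser token by token for the typewriter effect. A minimal sketch of how that side might look with the google-generativeai SDK; GEMINI_API_KEY and build_tutor_prompt() are placeholders, not the production code.
import google.generativeai as genai

genai.configure(api_key=GEMINI_API_KEY)  # assumed config value
model = genai.GenerativeModel("gemini-2.0-flash-lite")

def stream_tutor_reply(conversation_history, child_name):
    # build_tutor_prompt() is a hypothetical helper that folds in history + child name
    prompt = build_tutor_prompt(conversation_history, child_name)
    for chunk in model.generate_content(prompt, stream=True):
        if chunk.text:
            yield chunk.text  # forwarded to the browser for the typewriter effect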
The STT Nightmare: Why Indic Languages Are Still Hard
Experiment 1: Chromium Native ASR 🚫
Hypothesis: Use browser's built-in speech recognition. Zero API costs, instant feedback with live transcription.
Reality with kids learning Hindi:
Live transcription updates as the model corrects itself:
First pass: "मैं school जा रहा"
Second pass: "मैं स्कूल..."
Final: "मैं school जा रहा हूं"
Kids thought the first transcription was correct. They'd stop speaking, confused why it was changing.
Verdict: Optimizing for perceived latency ≠ optimizing for learning clarity.
Experiment 2: Whisper (The Hype vs Reality) 🚫
Everyone said "just use Whisper for multilingual ASR."
For diaspora kids speaking Hindi:
- No context understanding for children's voices
- "बाल" (hair) vs "बॉल" (ball) - consistently wrong
- Proper nouns mangled: "Rohan" → "रोहन" (correct) but "Aarav" → "आरव" (wrong)
- Indian names and places: disaster
- Code-switching: "I like पिज़्ज़ा" → transcribed as gibberish
Example failure:
Child says: "मैं आज अपने friend के साथ park गया"
Whisper output: "main aaj apne friend ke saath park gaya" (all Roman script)
Needed: "मैं आज अपने friend के साथ park गया" (mixed script)
Lesson 2: Cutting-edge ≠ good for your specific use case.
Experiment 3: Google Cloud STT (The Dark Horse) ✅
I almost skipped it. Google doesn't market it aggressively. Seemed old-school compared to Whisper.
Surprise: It was perfect for diaspora kids.
Why it worked:
- Context-aware: Distinguishes between similar words based on sentence meaning
- Handles pauses: Kids think while speaking (long pauses are normal)
- Code-switching friendly: Properly handles English words in Hindi sentences
- Indian vocabulary: Knows "Dadi," "Nani," "paratha" aren't random sounds
- Forgiving pronunciation: Doesn't need perfect Hindi to understand intent
Example:
Child says (with a 2s pause mid-sentence): "मुझे... cricket खेलना बहुत पसंद है"
Google STT: correctly transcribes, pause intact
Whisper: often truncates or misses the second half
Trade-off: Doesn't correct pronunciation (but that was never the goal—conversation practice is the goal).
Current Setup: Google Cloud Primary, Sarvam Fallback
def transcribe_audio(audio_bytes, provider='google'):
    try:
        if provider == 'google':
            # Primary: Google Cloud STT (context-aware + decent in Hindi)
            # google_stt: wrapper around the Google Cloud Speech client
            return google_stt.transcribe(audio_bytes)
        return sarvam.transcribe(audio_bytes)
    except Exception as e:
        logger.warning(f"Google Cloud STT failed: {e}")
        # Fallback: Sarvam
        return sarvam.transcribe(audio_bytes)
Why dual-provider?
- Google Cloud STT: Better for messy, pause-heavy, code-switched speech
- Sarvam: Better for clear, confident speech
- Fallback ensures 99.9% uptime (critical for kids—they don't retry)
Voice UX: Designing for Heritage Language Learners
Challenge: These Aren't Native Speakers
Most voice apps assume fluent users. Diaspora kids:
- Think in English, translate to Hindi
- Take long pauses mid-sentence
- Need encouragement, not correction
- Get discouraged easily
Decision 1: Manual Recording > Voice Activity Detection
Conventional wisdom: Use VAD for seamless conversation.
Reality with 5-10 year olds:
- 3-5 second pauses while thinking of Hindi words
- Sudden outbursts mid-silence ("oh wait, I know!")
- Environmental noise (siblings yelling, TV, barking dogs)
- Self-conscious about "wasting recording time"
Our solution: Big, obvious record button with visual feedback.
// Simple is better for kids
recordButton.addEventListener('click', () => {
    if (isRecording) {
        stopRecording();
        processingIndicator.show();
    } else {
        startRecording();
        animatedMic.start();
    }
});
Visual cues matter:
- Red pulsing ring during recording
- "Listening..." text appears
- Animated mic icon to show it's working
- Clear "Stop Recording" state
Result: Kids understand the turn-taking model. Zero confusion.
Decision 2: The 10-Sentence Conversation Structure
Problem discovered during testing: Kids got tired after 15+ exchanges. Conversations dragged. They'd leave mid-conversation.
Goal: Keep them wanting more, not exhausted.
Solution: Structured phases with automatic wrap-up:
def get_phase_instruction(sentences_count, is_farewell=False):
    if is_farewell:
        return "Give warm goodbye with homework for parents"
    if sentences_count >= 10:
        return "Wrap up conversation with encouragement"
    if sentences_count == 9:
        return "Start transitioning to conclusion"
    return ""  # Continue naturally
The conversation arc:
Sentences 1-2: Warm greeting, establish topic
- "नमस्ते! आज हम खाने के बारे में बात करेंगे।" (Hi! Today we'll talk about food.)
Sentences 3-8: Natural back-and-forth
- Tutor asks questions, kid responds
- Gentle encouragement and corrections
Sentence 9: Wrap-up signal
- "वाह! तुमने आज बहुत अच्छी हिंदी बोली।" (Wow! You spoke great Hindi today.)
Sentence 10+: Graceful conclusion
- "अब जाकर मम्मी या पापा को बताओ..." (Now go tell Mom or Dad...)
- Gives them "homework" (share what they learned)
- Makes them feel accomplished
Why this works:
- Finite endpoint (kids like knowing when it'll end)
- Natural conclusion (not abrupt cutoff)
- Homework for parents (encourages family conversation)
- Feel accomplished, not exhausted
Farewell handling: If kid says "bye" or "अलविदा" at sentence 5:
if is_farewell:
    return immediate_warm_goodbye()  # Don't force them to continue
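Putting the pieces together, a sentence counter and a simple farewell check can decide which instruction gets folded into the tutor prompt each turn. A rough sketch; build_system_prompt() and the farewell word list are illustrative, not the production values.
FAREWELL_WORDS = {"bye", "goodbye", "अलविदा", "टाटा"}  # illustrative, not exhaustive

def detect_farewell(user_text):
    text = user_text.lower()
    return any(word in text for word in FAREWELL_WORDS)

def build_system_prompt(sentences_count, user_text):
    # Hypothetical helper: base persona plus whatever phase instruction applies
    is_farewell = detect_farewell(user_text)
    phase = get_phase_instruction(sentences_count, is_farewell=is_farewell)
    base = "You are a warm, patient Hindi tutor talking to a 5-10 year old child."
    return f"{base} {phase}".strip()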
Decision 3: Streaming with Typewriter Effect
Old approach: Wait for full response → display all at once → play audio
Better approach: Stream text as it generates, audio comes after
// Text appears immediately, audio later
const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
    const {done, value} = await reader.read();
    if (done) break;
    const text = decoder.decode(value);
    displayTextTypewriter(text);  // Char-by-char animation
}
// Audio completes in background
await playAudio(audioBytes);
Why it works:
- Kids see instant feedback ("it's thinking about what I said!")
- Perceived latency < actual latency
- They start reading while audio generates
- Better sense of conversation flow
Actual latency breakdown:
User stops speaking
↓
500ms: Processing audio
↓
800ms: STT transcription
↓
0ms: Start streaming text (parallel LLM + TTS)
↓
700ms: LLM generates full response
↓
600ms: TTS completes audio
↓
Total: 2.6s but feels like 1.3s due to streaming
The Latency Battle: India + Diaspora Reality
The Mystery: Why Is Production 3x Slower?
Local testing (MacBook, Bangalore): 1.5-2s latency ✅
Production (Heroku, USA): 4-6s latency 🚫
Initial reaction: "Is Heroku throttling me?"
Actual root causes:
1. Geographic API distribution
Sarvam API: Mumbai servers
OpenAI API: US West servers
ElevenLabs API: US East servers
Round-trip time for API chain:
- Bangalore → Mumbai: 30ms
- Mumbai → US West: 200ms
- US West → US East: 70ms
- US East → User: variable
Production total: +400ms just in network hops
2. Heroku cold starts (free tier pain)
First request after 30min inactivity: 8-10s
Subsequent requests: 4-6s
3. Sequential API calls in initial architecture
Solutions Implemented
Optimization 1: Parallel API execution
# Impact: -28% latency

# Before
transcription = await stt(audio)        # 800ms
evaluation = await evaluate(text)       # 1200ms
response = await generate(context)      # 1500ms
audio = await tts(response)             # 800ms
# Total: 4300ms

# After
transcription = await stt(audio)        # 800ms
# Run simultaneously
evaluation, response = await asyncio.gather(
    evaluate(text),                     # 1200ms
    generate(context)                   # 1500ms
)
# (takes 1500ms, not 2700ms)
audio = await tts(response)             # 800ms
# Total: 3100ms
Optimization 2: Keep Heroku warm
# Heroku free dynos sleep after 30 minutes of inactivity, so keep them warm
@app.route('/health')
def health_check():
    return {'status': 'healthy', 'timestamp': time.time()}

# External service: UptimeRobot pings /health every 5 minutes
Optimization 3: Redis session caching
# Before: DB hit on every request (100ms)
# After: Redis hit (5ms)
def get_session_store():
    if redis_url:
        try:
            return RedisSessionStore(redis_url)
        except Exception:
            return FileSessionStore()  # Fallback
    return FileSessionStore()
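For reference, a Redis-backed store like this can be little more than JSON blobs with a 24-hour expiry. A minimal sketch assuming redis-py; the key naming and serialization are illustrative rather than the exact production class.
import json
import redis

class RedisSessionStore:
    TTL_SECONDS = 24 * 60 * 60  # active conversations expire after 24 hours

    def __init__(self, redis_url):
        self.redis = redis.from_url(redis_url)

    def save(self, session_id, state):
        # setex writes the value and its TTL in one call
        self.redis.setex(f"session:{session_id}", self.TTL_SECONDS, json.dumps(state))

    def load(self, session_id):
        raw = self.redis.get(f"session:{session_id}")
        return json.loads(raw) if raw else None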
The Provider Speed Race
We tried every major LLM provider:
| Provider | Model | Latency | Quality | Cost/1M | Verdict |
|---|---|---|---|---|---|
| OpenAI | GPT-4 | 1500ms | Excellent | $15 | Slow + expensive |
| OpenAI | GPT-4o-mini | 800ms | Very Good | $0.15 | Good but costly |
| Groq | Llama 3.1 8B | 600ms | Good | $0.05 | Fast! |
| Groq | Llama 3.3 70B | 900ms | Excellent | $0.59 | Quality jump |
| Google | Gemini 2.0 Flash Lite | 700ms | Excellent | $0.075 | Winner |
Why Gemini won:
- Fast inference (700ms avg)
- Excellent at Hindi conversation context
- Streaming support built-in
- Much cheaper than GPT-4o-mini
- More reliable than Groq (fewer timeouts)
Lesson 3: For kid-focused conversations in Hindi, model quality matters less than you think. Speed + cultural context matters more.
UI/UX: Why Building for Kids Is 10x Harder
The "This Is Boring" Problem
First version was built like a normal app:
- Clean white interface
- Minimal animations
- Text instructions
- Standard buttons
Kids' reaction: 😐 Left within 2 minutes.
The Airplane Insight
Flying Singapore → Bangalore, I watched the in-flight entertainment system. When you switch to "Kids Mode":
Adult mode: Clean, text-heavy, functional
Kids mode:
- Bright colors everywhere
- Animations on every interaction
- Dinosaurs, rockets, stars floating around
- Progress bars designed like games
- Sounds for every action
Aha moment: Kids need constant positive reinforcement or they disengage.
What Actually Works for 5-10 Year Olds
1. Visual reward system (The Star Economy)
// Points for every interaction
const POINTS = {
    sentence: 10,       // Every sentence in Hindi
    quality_bonus: 20,  // Every 5 good responses
    completion: 50,     // Finish conversation
    milestone: 30       // Special achievements
};

// Update with animation
function updateRewards(points) {
    const newPoints = currentPoints + points;
    animateNumberChange(starsElement, currentPoints, newPoints);
    currentPoints = newPoints;
    playSound('chime.mp3');
    // Milestone celebration
    if (goodResponses % 5 === 0) {
        showCelebration();  // Confetti + special animation
        playSound('applause.mp3');
    }
}
Why it works: Kids are motivated by immediate, visual progress. Stars accumulate visibly. They count them proudly.
2. Positive-only feedback (Never punish)
def get_feedback_type(grammar_score, context_score):
    if grammar_score >= 7 and context_score >= 7:
        return "green"  # "Great Hindi!" ✅
    if grammar_score >= 5 or context_score >= 5:
        return "amber"  # "Try saying..." 🔄
    # NEVER red/negative
    return "amber"      # Always give them a path forward
Green bubble: "बहुत अच्छा! That was great Hindi!"
Amber bubble: "Good! You can also say: [correction]"
Red bubble: ❌ Never. Kids shut down.
3. Minimal header during learning
Before (distracting):
[App Title] [Dashboard] [Nav] [Sentences: 5] [Points: 30] [Profile ▼]
After (focused):
[← Back] [⭐ 47] [🐶]
Result: 60% longer session times. Kids stayed in the conversation.
4. Fun animal avatars (kid-tested)
Instead of initials or photos:
const avatars = ['🐶', '🐱', '🐼', '🦊', '🐨', '🐯'];
// Rotate through on each visit
const avatar = avatars[sessionCount % avatars.length];
Kids love their animal identity. They ask "which animal am I today?"
5. Celebration animations at milestones
// Every 5 good responses
if (goodResponseCount % 5 === 0) {
    showFullScreenCelebration({
        confetti: true,
        message: "Amazing! 5 great sentences!",
        sound: 'applause.mp3',
        stars: +20
    });
}
Before celebration system: 4-minute average session
After celebration system: 12-minute average session
Kids kept talking to hit the next milestone.
The Grammar vs Context Challenge
Why Standard Grammar Checking Fails
Naive approach:
def evaluate(user_response):
    prompt = "Rate this Hindi sentence grammar (1-10): " + user_response
    return llm(prompt)
Problem: No context means wrong evaluations.
Example:
Tutor: "तुम्हें कौन सा खाना पसंद है?" (What food do you like?)
Kid: "हाँ" (Yes)
Grammar check: 10/10 ✅
Context check: ❌ Completely wrong answer
Better: Contextual Evaluation
def evaluate_response(user_response, tutor_question, conversation_history):
    prompt = f"""
    You're evaluating a child (age 5-10) learning Hindi.

    Tutor asked: "{tutor_question}"
    Child said: "{user_response}"
    Previous context: {conversation_history[-3:]}

    Evaluate:
    1. Grammar (1-10): Are sentences grammatically correct?
    2. Context (1-10): Does the response make sense for the question?
    3. Code-switching: Count English words used
    4. Encouragement: What to say to motivate them?
    5. Correction: If needed, suggest better phrasing

    Return JSON:
    {{
        "grammar_score": int,
        "context_score": int,
        "english_word_count": int,
        "feedback_type": "green" | "amber",
        "encouragement": str,
        "corrected_response": str | null
    }}
    """
    return gemini_json_mode(prompt)
Better evaluation:
{
    "grammar_score": 8,
    "context_score": 4,
    "english_word_count": 0,
    "feedback_type": "amber",
    "encouragement": "Good try! But let's answer the question.",
    "corrected_response": "मुझे पिज़्ज़ा पसंद है"  (I like pizza)
}
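The gemini_json_mode() call above can be a thin wrapper that asks Gemini for JSON output and parses it. A minimal sketch assuming the google-generativeai SDK; the production generation settings may differ.
import json
import google.generativeai as genai

def gemini_json_mode(prompt):
    model = genai.GenerativeModel(
        "gemini-2.0-flash-lite",
        generation_config={"response_mime_type": "application/json"},
    )
    response = model.generate_content(prompt)
    # Gemini returns a JSON string when response_mime_type is application/json
    return json.loads(response.text)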
The Code-Switching Dilemma
Diaspora kids naturally code-switch:
"मैं school जा रहा हूं और मेरे friends के साथ lunch खाऊंगा" (I'm going to school and will eat lunch with my friends)
Design decision we debated:
- Strict: Mark as incorrect, force Hindi-only
- Lenient: Accept it completely
- Guiding: Accept but gently suggest alternatives
We chose #3 (guiding):
def handle_code_switching(english_word_count, user_response):
    if english_word_count == 0:
        return {
            "feedback": "बहुत अच्छा! Pure Hindi!",
            "type": "green",
            "bonus_points": 5
        }
    elif english_word_count <= 2:
        return {
            "feedback": "Good! Next time try: 'स्कूल' instead of 'school'",
            "type": "green",  # Still positive
            "suggestion": get_hindi_alternatives(user_response)
        }
    else:  # 3+ English words
        return {
            "feedback": "Let's try that again in more Hindi",
            "type": "amber",
            "corrected_response": translate_to_hindi(user_response)
        }
Why this works:
- Doesn't discourage them (they're learning!)
- Gently pushes toward more Hindi
- Acknowledges their effort
- Gives concrete alternatives
Real example:
Kid: "मैं park में जाकर मेरे dost के साथ football खेला"
Response: "Great sentence! ⭐ You can also say:
'मैं पार्क में जाकर मेरे दोस्त के साथ फुटबॉल खेला'
Next time try using Hindi words for 'park', 'dost', and 'football'!"
Type: Green (still rewarded)
Points: +10
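The get_hindi_alternatives() helper referenced above can start as nothing fancier than a curated word map. A minimal sketch with an illustrative dictionary; the real mapping needs to be much larger.
# Illustrative mapping; the production list would be much larger
HINDI_ALTERNATIVES = {
    "school": "स्कूल",
    "park": "पार्क",
    "friend": "दोस्त",
    "dost": "दोस्त",
    "football": "फुटबॉल",
    "lunch": "दोपहर का खाना",
}

def get_hindi_alternatives(user_response):
    suggestions = {}
    for word in user_response.split():
        key = word.strip(".,!?").lower()
        if key in HINDI_ALTERNATIVES:
            suggestions[word] = HINDI_ALTERNATIVES[key]
    return suggestions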
Gamification: What Works vs What Doesn't
❌ What Didn't Work
1. Story Co-Creation
Initial idea: "Let's create a Panchatantra story together!"
Problems:
- Too open-ended for beginners
- Kids didn't know "what happens next"
- Requires advanced Hindi skills
- Paralysis of choice
Example failure:
Tutor: "एक कौवा प्यासा था। अब क्या हुआ?" (A crow was thirsty. What happened next?)
Kid: "...um... I don't know... पानी?" (water?)
Tutor: "हाँ! कहाँ पानी मिला?" (Yes! Where did he find water?)
Kid: [long pause] "I don't know" [exits]
2. Free-Form Conversation
Initial idea: "Talk about whatever you want!"
Problems:
- Kids don't know what to say
- AI veers into random topics
- No sense of progress
- No clear goal
3. Correction Pop-ups (Too Harsh)
Initial design: Immediately show corrections after mistakes
Problem: Felt like a test, not practice. Kids became self-conscious.
✅ What Actually Works
1. Structured conversation types
Each topic has clear scope and goals:
"My Family" (मेरा परिवार):
- Goal: Learn family member names in Hindi
- Scope: Dadi, Nani, Chacha, Bua relationships
- Duration: 8-10 exchanges
- Outcome: Can introduce family on video call
"Food Talk" (खाने की बातें):
- Goal: Describe favorite foods
- Scope: Meals, snacks, festivals foods
- Cultural element: Why do we eat certain foods?
- Outcome: Can talk about dinner with grandparents
Why structured works:
- Clear goal (kids know what they're working toward)
- Sense of progress (visible topic completion)
- Cultural context (learning why, not just what)
- Real application (will actually use this with Dadi)
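Concretely, each structured topic can be a small config the conversation engine reads. A sketch based on the two topics above; the field names are illustrative.
TOPICS = {
    "my_family": {
        "title_hi": "मेरा परिवार",
        "goal": "Learn family member names in Hindi",
        "scope": ["Dadi", "Nani", "Chacha", "Bua"],
        "target_exchanges": (8, 10),
        "outcome": "Can introduce family on a video call",
    },
    "food_talk": {
        "title_hi": "खाने की बातें",
        "goal": "Describe favorite foods",
        "scope": ["meals", "snacks", "festival foods"],
        "cultural_element": "Why do we eat certain foods?",
        "outcome": "Can talk about dinner with grandparents",
    },
}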
2. Milestone-based rewards
REWARD_STRUCTURE = {
    'sentence': 10,           # Base points
    'quality_milestone': 20,  # Every 5 good responses
    'completion': 50,         # Finish conversation
    'streak': 30,             # Multiple days in a row
}

def calculate_reward(metrics):
    points = REWARD_STRUCTURE['sentence']
    if metrics['good_response_count'] % 5 == 0:
        points += REWARD_STRUCTURE['quality_milestone']
        trigger_celebration()  # Visual reward
    return points
Result: Kids chase milestones ("just 2 more sentences till the celebration!")
3. Parent analytics (hidden motivator)
Dashboard shows parents:
- Conversations this week vs last week
- Good response percentage
- Topics covered
- Progress over time
Why it matters: Parents encourage kids when they see progress. Social proof works even for kids.
Example parent reaction:
"Beta, you had 4 conversations this week! That's amazing! And look, 70% were perfect Hindi sentences!"
Technical Decisions Worth Discussing
HTTP + SSE > WebSockets (For Turn-Taking)
Everyone: "Voice apps need WebSockets for real-time!"
Our reasoning: Kids do turn-taking, not simultaneous conversation.
# Simple HTTP endpoint
@app.route('/api/process_audio_stream', methods=['POST'])
def process_audio():
    audio = request.files['audio']

    def generate_response():
        # Streaming via SSE
        for chunk in stream_conversation(audio):
            yield f"data: {json.dumps(chunk)}\n\n"

    return Response(
        generate_response(),
        mimetype='text/event-stream'
    )
Benefits:
- Simpler architecture
- Easier debugging
- Better error handling
- Works with standard HTTP load balancers
- SSE gives us streaming without WebSocket complexity
Trade-off: Can't do simultaneous input/output (but we don't need it)
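For completeness, here is roughly what stream_conversation() could yield into that SSE response: text chunks first, then one audio event once TTS finishes. The event shapes are illustrative, not the exact production protocol, and the sketch reuses the streaming helper sketched earlier.
import base64

def stream_conversation(audio_file, conversation_history, child_name):
    transcription = transcribe_audio(audio_file.read())
    yield {"type": "transcription", "text": transcription}

    reply_text = ""
    for chunk in stream_tutor_reply(conversation_history, child_name):
        reply_text += chunk
        yield {"type": "text", "chunk": chunk}  # drives the typewriter effect

    # TTS once the full text is known (synchronous here for simplicity)
    audio_bytes = elevenlabs_tts(reply_text)
    yield {"type": "audio", "mp3_base64": base64.b64encode(audio_bytes).decode()}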
Dual-Layer Session Management
def get_session_store():
    """
    Production:  Redis (24hr TTL)
    Fallback:    FileSessionStore
    Persistence: PostgreSQL (forever)
    """
    if redis_url:
        try:
            store = RedisSessionStore(redis_url)
            store.redis.ping()  # fail fast if Redis is unreachable
            return store
        except Exception:
            logger.warning("Redis failed, using FileStore")
            return FileSessionStore()
    return FileSessionStore()
Why three layers?
| Layer | Purpose | TTL | Use Case |
|---|---|---|---|
| Redis | Fast access | 24 hours | Active conversation state |
| FileStore | Dev + fallback | 24 hours | Local dev, Redis failure |
| PostgreSQL | History | Forever | Parent dashboard, analytics |
Principle: Resilience > performance optimization. If Redis fails at 2am, conversations continue with FileStore.
SQLite → PostgreSQL (Ship Fast, Scale Later)
# Works in both dev and prod
database_url = os.getenv(
    'DATABASE_URL',
    'sqlite:///hindi_tutor.db'  # Dev fallback
)

# Heroku gives a postgres:// URL, SQLAlchemy needs postgresql://
if database_url.startswith('postgres://'):
    database_url = database_url.replace('postgres://', 'postgresql://', 1)
Development: SQLite (zero setup, file-based)
Production: PostgreSQL (Heroku managed)
Same ORM, different backend. Ship fast locally, scale when needed.
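Wiring that URL into SQLAlchemy is the usual engine/session setup; a minimal sketch:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine(database_url)  # SQLite locally, PostgreSQL on Heroku
SessionLocal = sessionmaker(bind=engine)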
The iOS Audio Hell
This deserves its own post.
The Problem: Safari Audio Just Doesn't Work
Issues discovered:
- First audio playback: Silent (AudioContext requires user gesture)
- Volume controls: Don't work (Web Audio API needed)
- Recording: Random lag (WebKit MediaRecorder quirks)
- OAuth redirects: Break audio state entirely
Standard audio approach:
// This FAILS on iOS
const audio = new Audio(audioUrl);
audio.play(); // Silent on first try
The Solution (After Pulling Hair Out)
Step 1: Unlock AudioContext on ANY user interaction
let audioContext = null;
let audioUnlocked = false;

function unlockAudio() {
    if (audioUnlocked) return;
    audioContext = new (window.AudioContext || window.webkitAudioContext)();
    // Play a silent one-sample buffer to "unlock" audio
    audioContext.resume().then(() => {
        const buffer = audioContext.createBuffer(1, 1, 22050);
        const source = audioContext.createBufferSource();
        source.buffer = buffer;
        source.connect(audioContext.destination);
        source.start(0);
        audioUnlocked = true;
        console.log('iOS audio unlocked');
    });
}

// Critical: Call on FIRST user interaction
document.addEventListener('click', unlockAudio, { once: true });
document.addEventListener('touchstart', unlockAudio, { once: true });
Step 2: Use Web Audio API for all playback
async function playAudioiOS(audioData) {
    if (!audioUnlocked) {
        console.error('Audio not unlocked yet');
        return;
    }
    // Decode audio data
    const arrayBuffer = await audioData.arrayBuffer();
    const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
    // Create source
    const source = audioContext.createBufferSource();
    source.buffer = audioBuffer;
    // Volume control (finally works!)
    const gainNode = audioContext.createGain();
    gainNode.gain.value = volumeSlider.value;
    // Connect: source → gain → destination
    source.connect(gainNode);
    gainNode.connect(audioContext.destination);
    // Play
    source.start(0);
}
Step 3: Handle OAuth redirects
// Before OAuth redirect, save audio state
function preserveAudioState() {
sessionStorage.setItem('audioUnlocked', audioUnlocked);
sessionStorage.setItem('volume', volumeSlider.value);
}
// After OAuth callback, restore
function restoreAudioState() {
audioUnlocked = sessionStorage.getItem('audioUnlocked') === 'true';
if (audioUnlocked) {
unlockAudio();
}
}
Lesson 4: Mobile web audio in 2025 is still a mess. Test on actual iPhones, not just simulators.
Measuring Success: Metrics That Matter
Latency Breakdown (Current)
User stops speaking
↓
Audio processing: 50ms
↓
STT (Google Cloud): 800ms
↓
[Parallel execution starts]
├─ Evaluation (Gemini): 600ms
└─ Response (Gemini): 400ms
↓
TTS (ElevenLabs): 300ms
↓
Network overhead: 200ms
↓
Total: ~1.7 seconds ✅
Target: < 2 seconds (kids stay engaged)
Achieved: 1.7s average, 1.0s best case
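One way to sanity-check numbers like these is to time each stage in the request handler. A minimal sketch using a context manager; the stage names are illustrative.
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = round((time.perf_counter() - start) * 1000)  # milliseconds

# Usage inside the request handler:
# with timed("stt"):
#     text = transcribe_audio(audio_bytes)
# with timed("tts"):
#     audio = elevenlabs_tts(reply_text)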
Key Lessons for Voice AI Builders
1. Don't Trust English-Optimized Benchmarks
- Whisper is SOTA for English ASR → Failed for Hindi kids
- GPT-4 is best LLM → Too slow for real-time voice
- WebSockets are standard → HTTP+SSE worked better
Takeaway: Benchmark on YOUR use case, not general leaderboards.
2. Optimize for Perceived Latency
Actual latency: 2.5s
Perceived latency: 1.3s (via streaming text)
Techniques:
- Show text immediately while audio generates
- Animated "thinking" indicators
- Progress bars during processing
- Smooth transitions between states
3. Kids Are a Different Species
What works for adults:
- Minimal UI
- Text-heavy
- Functional design
- Assumes motivation
What works for kids:
- Visual rewards
- Game-like progression
- Constant encouragement
4. For Indic Languages, Go Specialized
Winners:
- Google Cloud STT (context-aware + decent for Hindi)
- Sarvam STT (Hindi-specialized)
Losers:
- Whisper (English-optimized)
- Generic APIs (no cultural context)
5. Voice UX ≠ Chat UX
Voice-specific needs:
- Turn-taking clarity (record button > VAD)
- Audio quality > speed (for TTS)
- Long pause handling (kids think)
- Visual feedback (they can't see processing)
What's Next for the Product
Short-term (Next 3 Months)
1. Age-specific models
- 5-6 year olds: Simpler vocabulary, shorter sentences
- 7-8 year olds: More complex topics
- 9-10 year olds: Cultural context, stories
2. Phoneme practice module
- Difficult Hindi sounds: ड vs द, ट vs त, ऋ
- Minimal pairs practice
- Pronunciation feedback
3. Parent dashboard v2
- Conversation playback (audio clips)
- Weekly progress reports via email
- Suggested topics based on child's level
Long-term Vision
1. Sibling mode
- Multiple kids in one household
- Shared family subscription
- Individual progress tracking
2. Heritage language expansion
- Tamil for Tamil diaspora
- Gujarati, Telugu, Bengali
- Same architecture, different languages
3. Cultural curriculum
- Festival explanations (why Diwali? why Holi?)
- Story traditions (Panchatantra, Akbar-Birbal)
- Family vocabulary (why "Mami" ≠ "Chachi")
Goal: Become the default tool for diaspora families keeping heritage languages alive.
Open Questions (Still Figuring Out)
1. How do we measure actual learning vs engagement?
- High engagement doesn't always = learning
- Need longitudinal studies with real families
- Considering: pre/post conversation assessments
2. Optimal conversation length by age?
- 10 sentences works for 5-6 year olds
- Do 8-9 year olds need longer conversations?
- How to dynamically adjust?
3. Should we enforce Hindi-only or accept code-switching?
- Current: Accept with gentle guidance
- Alternative: Strict Hindi after level 3
- Need: More user research
4. How to prevent reward gaming?
- Kids figure out: Say anything → get points
- Considering: Quality threshold for rewards
- Balance: Don't discourage, but don't reward gibberish
Overall Summary
Time invested: 4 months of focused work
Lines of code: ~15,000
Git commits: 220+
LLM providers tried: 3 (OpenAI, Groq, Gemini)
STT providers tried: 4 (Elevenlabs, Whisper, Sarvam, Google)
UI redesigns: 6 major iterations
Biggest lesson:
Building for kids is humbling. They're brutally honest users—if it's boring, they leave. If it's too slow, they get frustrated. If it doesn't work, they try once and never return.
But when a 6-year-old says "मुझे यह बहुत पसंद है!" (I really like this!) after finishing a conversation?
When a parent messages: "She asked to practice before calling Dadi. First time ever."
When you see a family preserving their heritage language across continents?
Worth every commit. Worth every latency optimization. Worth every iOS audio bug.
Join the Journey
We're currently building with 100 founding families who get free early access. If you're a diaspora parent trying to keep Hindi alive, we'd love your feedback.
Built with:
- Backend: Flask + Python
- STT: Google Cloud Speech (Sarvam as fallback)
- LLM: Google Gemini 2.0 Flash Lite
- TTS: ElevenLabs
- Infrastructure: Heroku + Redis + PostgreSQL
Open questions for the community:
- What other heritage languages need this?
- How do we measure actual learning?
- Best practices for voice UX with kids?
Building something in voice AI or heritage language tech? Let's chat. This space needs more builders.
P.S. If you're building for kids, test with real kids early. They will humble you in ways adults never will.