Choo Choo Story Time generates personalized, narrated bedtime stories in under 90 seconds by orchestrating 4 AI services from a single parent input. Solo-built as founder and full-stack engineer.
I'm Viktor Zhai. I designed, built, and shipped Choo Choo Story Time — a personalized children's storytelling app that turns a parent's 30-second behavioral description into a fully narrated, illustrated bedtime story. I owned every layer: product vision, iOS frontend, Python backend, AI prompt engineering, deployment, and content safety.
My approach: ship fast, measure, iterate. I used Claude Code as my AI pair-programming partner throughout development — not just for code generation, but for architecture decisions, prompt iteration, and automated content pipelines. I treat AI as a force multiplier, not a crutch.
The app is live on TestFlight with a production backend on Railway. I also built an automated YouTube content pipeline that produces 4K story videos end-to-end using AI (script, artwork, narration, music, video assembly).
It's 7:30 PM. A 4-year-old just threw a block and yelled "NO!" during cleanup. The parent is stressed, tired, and has 10 minutes before bedtime. They don't want a parenting article — they want something that works right now and turns this rough moment into a calm bedtime.
RAG-powered parenting guidance card in 3–5 seconds. Actionable strategies based on the specific behavior, not generic advice.
A bedtime story where the child's own toy is the hero, teaching the lesson through action — not a lecture. Ready in 60–90 seconds.
AI-narrated audio with background music and hand-drawn cover art. The parent can just press play and breathe.
The generation flow, end to end:

- MomentInputViewModel validates input and tracks the analytics event via Mixpanel
- POST /generate-structured-response with JWT auth + rate limiting queries the parenting_scripts corpus
- StorySummaryViewModel — parent edits tone, moral, setting, length, language
- StoryAPIClient.triggerStoryGeneration() dispatches the structured payload
- POST /trigger-story-generation returns a jobId instantly (<500ms)
- StoryGenerationManager polls status every 2s with progress smoothing (eased interpolation, caps at 98%)

Stories are not generated from a monolithic prompt. I built a dynamic prompt builder that composes the final ~3,500-word prompt from 7 modular files at request time. This means changing a safety rule touches one file, not four templates. Each module is independently testable and version-controlled.
Assembly: Core_Principles + Age_{3-5|6-8}_Specs + Template_{type}_{age} + Variables → Gemini API
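The assembly step above can be sketched as a small composer. This is a minimal illustration, not the production builder: the file names follow the assembly line, but the directory layout and the {{placeholder}} convention are assumptions.

```python
from pathlib import Path

def compose_prompt(modules: list[str], variables: dict[str, str]) -> str:
    """Join module texts, then fill {{placeholder}} slots from variables."""
    prompt = "\n\n".join(modules)
    for key, value in variables.items():
        prompt = prompt.replace("{{" + key + "}}", value)
    return prompt

def build_story_prompt(prompt_dir: Path, age_group: str, story_type: str,
                       variables: dict[str, str]) -> str:
    """Load the modular files named in the assembly line and compose them:
    Core_Principles + Age_{age}_Specs + Template_{type}_{age} + Variables.
    """
    names = [
        "Core_Principles.md",                      # safety rules live here
        f"Age_{age_group}_Specs.md",               # e.g. "3-5" or "6-8"
        f"Template_{story_type}_{age_group}.md",   # story-type template
    ]
    modules = [(prompt_dir / name).read_text() for name in names]
    return compose_prompt(modules, variables)
```

Because composition happens at request time, editing Core_Principles.md changes every story type and age group at once.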
Every generated story passes through two independent safety layers before reaching a child. Layer 1: prompt-level constraints baked into Core_Principles (violence, separation anxiety, age-appropriateness). Layer 2: a separate Gemini moderation pass that validates the generated output against child safety criteria. Stories that fail moderation are rejected and regenerated — never delivered.
Why two layers? Prompt constraints are necessary but not sufficient. LLMs can drift. The post-generation moderation pass catches edge cases that slipped through prompt instructions. This is the same defense-in-depth pattern used in production content moderation systems.
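The reject-and-regenerate loop can be sketched like this. The generation and moderation calls are passed in as plain callables (stand-ins for the actual Gemini calls), and the retry budget is a hypothetical value:

```python
MAX_ATTEMPTS = 3  # hypothetical retry budget

def generate_safe_story(generate, moderate, prompt: str) -> str:
    """Defense in depth: layer 1 (prompt-level constraints) is already baked
    into `prompt`; this loop adds layer 2, an independent moderation pass.
    Stories that fail moderation are regenerated, never delivered.
    """
    for _attempt in range(MAX_ATTEMPTS):
        story = generate(prompt)      # stand-in for the Gemini generation call
        verdict = moderate(story)     # stand-in for the separate moderation pass
        if verdict.get("safe", False):
            return story
    raise RuntimeError("story failed moderation after retries; nothing delivered")
```

Keeping moderation as a separate call (rather than a self-check in the same prompt) is what makes the two layers independent: a drift in generation cannot weaken the check.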
The fast path (3–5 second response) uses Retrieval-Augmented Generation over a curated corpus of parenting strategies. When a parent describes a behavioral challenge, the Supabase Edge Function retrieves relevant strategy documents via vector similarity search, then generates a structured, bilingual guidance card grounded in the retrieved context.
This isn't generic chatbot output. The RAG corpus is curated from evidence-based parenting frameworks, and the retrieval ensures the guidance is specific to the described behavior. The bilingual output (English + Mandarin Chinese) serves our primary user base.
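The retrieval step can be illustrated in pure Python. In production this ranking happens server-side in pgvector; the toy corpus and embeddings below are invented for the example:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec: list[float], corpus, k: int = 3) -> list[str]:
    """corpus: list of (strategy_text, embedding) pairs. Returns the top-k
    strategies by similarity — what a pgvector ORDER BY ... LIMIT k query
    does at database scale."""
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

The retrieved strategy texts are then placed into the generation prompt, so the guidance card is grounded in the curated corpus rather than the model's general knowledge.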
The TTS stage isn't a simple text-to-speech call. The system selects voices based on language, tone (calm/energetic/playful), and narrator gender. ElevenLabs generates the narration, then pydub handles audio processing: volume normalization, background music mixing with mood-matched tracks, and final MP3 encoding. Voice selection is configurable per-language with fallback chains.
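A sketch of the two pieces described above. The voice IDs and chain table are hypothetical, and mix_narration is illustrative pydub usage (the exact gain and fade values are assumptions, not the production settings):

```python
# Fallback chains per (language, tone); all IDs here are hypothetical.
VOICE_CHAINS = {
    ("en", "calm"): ["voice_en_calm_a", "voice_en_calm_b"],
    ("en", "energetic"): ["voice_en_energetic_a"],
    ("zh", "calm"): ["voice_zh_calm_a", "voice_en_calm_a"],  # cross-language fallback
}
AVAILABLE = {"voice_en_calm_b", "voice_en_energetic_a", "voice_zh_calm_a"}

def select_voice(language: str, tone: str) -> str:
    """Walk the fallback chain until an available voice is found."""
    for voice_id in VOICE_CHAINS.get((language, tone), []):
        if voice_id in AVAILABLE:
            return voice_id
    raise LookupError(f"no voice configured for {language}/{tone}")

def mix_narration(narration_path: str, music_path: str, out_path: str) -> None:
    """Normalize the narration, duck the mood-matched track under it,
    overlay the two, and export the final MP3."""
    from pydub import AudioSegment          # optional dependency, imported lazily
    from pydub.effects import normalize
    narration = normalize(AudioSegment.from_file(narration_path))
    music = AudioSegment.from_file(music_path) - 18  # ~18 dB under the speech
    mixed = narration.overlay(music.fade_out(2000))
    mixed.export(out_path, format="mp3")
```

Encoding the chains as data rather than branching logic makes adding a language or tone a config change, not a code change.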
I also built a separate YouTube content pipeline that produces full 4K story videos using ElevenLabs narration with forced-alignment page synchronization — the audio timing drives the visual page turns automatically.
Beyond the app, I built a fully automated pipeline that produces publish-ready YouTube story videos. Each video goes through 5 stages (script, artwork, narration, music, and video assembly), all orchestrated by AI.
I used Claude Code as an AI pair-programming partner throughout the entire project — not just for code completion, but as a collaborator across the full development lifecycle:
- Architecture design: planned the split-latency system, Edge Function → Worker pattern, and prompt modularization strategy through iterative conversation.
- Implementation: built features across SwiftUI, FastAPI, and infrastructure, with Claude handling boilerplate while I focused on business logic and product decisions.
- Prompt iteration: refined the Master Prompt System through dozens of generation → evaluate → revise cycles.
- Custom agents & skills: built specialized Claude Code agents for story writing, video production, code review, and QA, each with domain-specific instructions.
The key insight: AI tools are most powerful when you bring strong opinions about what to build. Claude accelerated the how, but every product decision, safety constraint, and architecture tradeoff was mine.
Parents in crisis need help now. The synchronous RAG path gives useful output in 3–5 seconds. The async pipeline runs in background. Two AI outputs from one input — optimized for the parent's emotional state.
iOS gets an instant jobId (<500ms) instead of holding an HTTP connection open for 90 seconds. This enables background recovery, progress polling, and graceful degradation if any stage fails.
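The trigger/poll pattern can be sketched framework-agnostically. The in-memory job store, the threading, and the exact easing curve are all assumptions made for the example (production state lives in the database, and the smoothing runs client-side in Swift):

```python
import threading
import uuid

JOBS: dict[str, dict] = {}  # in-memory stand-in for the database-backed job table

def trigger_story_generation(payload: dict, run_pipeline) -> str:
    """Return a jobId immediately; the pipeline runs in the background."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "queued", "progress": 0.0}

    def worker():
        JOBS[job_id]["status"] = "running"
        try:
            result = run_pipeline(payload, job_id)
            JOBS[job_id].update(status="done", result=result, progress=1.0)
        except Exception as exc:  # failed stage -> graceful degradation
            JOBS[job_id].update(status="failed", error=str(exc))

    threading.Thread(target=worker, daemon=True).start()
    return job_id  # handed back to iOS in well under 500ms

def eased_progress(raw: float) -> float:
    """Client-side smoothing: ease-out curve, capped at 98% until completion
    so the bar never sits at 100% while the job is still finishing."""
    return min(0.98, 1 - (1 - raw) ** 2)
```

Polling every 2s against the job record gives the client a single source of truth, so the app can resume a job after backgrounding instead of losing an open connection.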
Research-backed observational learning (Paw Patrol, Peppa Pig pattern). The toy teaches through action — therapeutic distance lets kids process difficult emotions without feeling lectured. Embedded in prompt architecture.
2 age groups × 2 story types = 4 combos, but core principles and age specs are reused. Changing a safety rule touches one file. Each module is independently testable and version-controlled.
iOS app with MVVM, async/await, local persistence, Mixpanel & Sentry analytics
Python backend with Alembic migrations, async pipeline orchestration, deployed on Railway
Story generation, content moderation, cover art illustration with model fallback chains
Multilingual TTS with voice selection, BGM mixing, forced alignment for video sync
PostgreSQL + pgvector for RAG, Edge Functions, JWT auth, file storage
Production deployment with auto-deploy from main branch, env management
AI pair-programming partner for architecture, implementation, prompt engineering, and custom agents
Automated 4K YouTube video pipeline with Ken Burns effects and page-sync narration