I built a production AI agent that turns messy parent input into grounded, multi-step outcomes.
For Hey Kiddo, that agent is a parenting copilot: it interprets a parent's situation,
retrieves relevant guidance, decides the response path, orchestrates story, voice, and
illustration tools, validates safety, and delivers a finished experience in under 90 seconds.
This case study focuses on the agent loop: interpret ambiguous human input, ground it with retrieval,
choose the next action, coordinate tool calls, verify outputs, and recover from failures without
breaking the user experience.
Agent Responsibilities
Translate messy parent input into a structured task state the runtime can act on.
Retrieve grounded parenting context and return fast guidance before long-running generation begins.
Decide whether to stay in the quick guidance path or continue into full multimodal story creation.
Orchestrate generation, safety checks, narration, music, illustration, and asset publication across multiple tools and services.
Persist job state so the agent can recover from app suspension, retries, and partial failures.
Hand off long-running work by returning a jobId quickly instead of holding a long mobile request open.
Grounded
Retrieval keeps the first response anchored in curated parenting context rather than freeform generation.
Multimodal
The agent coordinates text, narration, music, and illustration as one task instead of isolated API calls.
Failure Isolation
Model fallback plus isolated TTS, BGM, and cover-art failures prevent a weak stage from killing delivery.
Agent Thesis
The value is task completion under constraints: grounded context, tool use, guardrails, and recovery all inside one runtime.
Agent Runtime
Perceive, retrieve, decide, act, verify, recover.
The product experience depends on an agent runtime with two lanes: a fast grounded response lane and a
longer execution lane for multimodal asset delivery.
Agent Control Loop
User + Client
Parent describes a stressful moment; the client captures structure, validates input, and packages runtime state.
The app starts tracking only when the parent asks the agent to continue beyond quick guidance.
↓
Grounding + Policy
JWT auth, rate limiting, and pgvector retrieval over parenting_scripts establish safe, grounded context.
Gemini returns structured bilingual guidance, giving the parent immediate value before any heavy execution begins.
3–5 sec grounded response. The parent can stop here.
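The fast-lane retrieval described above can be sketched as a pgvector similarity query. The `parenting_scripts` table is from the architecture; the column names (`content`, `embedding`) and parameter helper are assumptions for illustration, and the SQL would be executed through a driver such as psycopg:

```python
# Assumed schema: parenting_scripts(content text, embedding vector).
# The <=> operator is pgvector's cosine-distance operator.
RETRIEVAL_SQL = """
SELECT content, embedding <=> %(qvec)s::vector AS distance
FROM parenting_scripts
ORDER BY distance
LIMIT %(k)s
"""

def retrieval_params(query_embedding: list[float], k: int = 4) -> dict:
    """Build query parameters; pgvector accepts the '[x,y,...]' literal form."""
    qvec = "[" + ",".join(f"{v:.6f}" for v in query_embedding) + "]"
    return {"qvec": qvec, "k": k}
```

The top-k rows then become the grounding context passed to Gemini for the structured bilingual response.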
Data Layer
PostgreSQL + pgvector for retrieval
Storage for audio, images, and derived assets
Job table for status, retries, and public URLs
↓
Decision + Handoff
If the parent wants a full story, the runtime writes a pending job row and returns jobId in under 500ms.
Execution moves off-request so the agent can keep working without blocking the client connection.
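A minimal sketch of that handoff, with an in-memory dict standing in for the PostgreSQL job table (the field names are assumptions, not the production schema):

```python
import time
import uuid

JOBS: dict[str, dict] = {}  # stand-in for the job table

def enqueue_story_job(parent_input: dict) -> str:
    """Persist a pending job row and return its id without blocking on generation."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {
        "status": "pending",   # pending -> running -> complete | failed
        "input": parent_input,
        "retries": 0,
        "public_urls": {},     # filled in by the finalize stage
        "created_at": time.time(),
    }
    return job_id  # returned to the client well under the 500ms budget
```

The write-then-return shape is what keeps the mobile request short: the 8-stage pipeline picks the row up asynchronously.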
Agent Runtime (Railway) — 8 Stages
1. Interpret Input: normalize context, infer intent, and map growth focus
2. Build Runtime Context: assemble prompt modules, request variables, and tool config
3. Generate Story: Gemini narrative generation
4. Verify Safety: Gemini policy and moderation pass
5. Narrate: ElevenLabs language-aware voice
6. Compose Audio: pydub music blend + normalization
7. Illustrate: Imagen 3 hand-drawn cover art
8. Finalize State: Supabase assets → public URLs
Resilience: Fallbacks, retries, and stage isolation keep the agent moving when one model or asset step degrades.
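The stage-isolation pattern can be sketched as a small wrapper: critical stages (story, safety) propagate failure, while non-critical stages (narration, BGM, cover art) fall back so the job still delivers. The function and parameter names are illustrative, not the production API:

```python
def run_stage(name: str, fn, *, critical: bool, retries: int = 2, fallback=None):
    """Run one pipeline stage with retries; isolate failures of optional stages."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt < retries:
                continue           # retry the stage
            if critical:
                raise              # story text / safety must succeed
            return fallback        # e.g. ship the story without cover art
```

For example, `run_stage("illustrate", make_cover, critical=False, fallback=None)` lets a weak Imagen call degrade to a story without art instead of killing delivery.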
↓
State + Recovery
StoryGenerationManager polls every 2s, smooths progress, and keeps the UI aligned with the agent's actual state.
Active jobs are saved locally and resumed after app suspension or foreground return.
Completed assets are stored in SwiftData and surfaced with notification + playback UI.
60–90 sec total for full story, audio, and cover-art delivery.
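The progress smoothing the client applies on each 2-second poll can be sketched as a monotonic mapping from job stage to a progress value, so retries never move the bar backward. The stage names and fractions are assumptions:

```python
# Assumed stage -> progress mapping; the real values live in the client.
STAGE_PROGRESS = {
    "pending": 0.05, "story": 0.30, "safety": 0.40, "narration": 0.60,
    "audio_mix": 0.75, "cover_art": 0.90, "complete": 1.0,
}

def smoothed_progress(reported_stage: str, last_shown: float) -> float:
    """Map the polled stage to progress without ever regressing the UI."""
    return max(last_shown, STAGE_PROGRESS.get(reported_stage, last_shown))
```

An unknown or repeated stage simply holds the last shown value, which is what keeps the UI aligned with the agent's actual state across retries and app suspension.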
Tool-Using Agent
The runtime does more than generate text. It coordinates retrieval, safety, narration, audio, illustration, and storage as one job.
Fast + Slow Lanes
Grounded guidance arrives immediately, while long-running multimodal execution continues asynchronously with durable state.
Guarded Execution
Grounding, policy checks, retries, and failure isolation make the runtime safer and more reliable than a one-shot generation flow.
Agent Behaviors
The value comes from runtime decisions, not just model calls.
The hard part was not API access. It was building a runtime that can interpret intent, use tools,
enforce guardrails, and finish the task even when individual services wobble.
Interpretation + Planning
The runtime turns ambiguous parent input into a structured task state instead of sending raw prose straight into generation.
Normalizes behavior, context, pressure, and growth goals into agent-readable structure.
Builds runtime context from modular prompt files and request variables.
Supports request-time personalization without maintaining separate hard-coded flows.
Keeps the system adaptable as new behaviors, tones, and story modes are added.
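The shape of that structured task state can be sketched with a dataclass. Field names and values are assumptions for illustration; in production, Gemini infers the fields rather than the trivial stub shown here:

```python
from dataclasses import dataclass, field

@dataclass
class TaskState:
    behavior: str                       # e.g. "refusing bedtime"
    context: str                        # where/when it happened
    pressure: str                       # how stressed the parent sounds
    growth_goals: list[str] = field(default_factory=list)
    language: str = "en"

def interpret_input(raw: str) -> TaskState:
    # Stub: production uses the model to infer each field from free-form prose.
    return TaskState(behavior=raw.strip().lower(), context="unknown", pressure="unknown")
```

Because downstream stages consume `TaskState` rather than raw prose, new behaviors and story modes extend the schema instead of forking the flow.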
Grounding + Guardrails
The fast lane is grounded by retrieval, and the slow lane is checked again before anything reaches the user.
pgvector retrieval over curated parenting guidance corpus.
Structured bilingual response instead of raw chatbot prose.
Prompt-level safety constraints plus second-pass Gemini moderation.
Unsafe generations are rejected and regenerated before delivery.
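The reject-and-regenerate loop reduces to a bounded retry around two calls. Here `generate` and `moderate` stand in for the Gemini generation and second-pass moderation calls and are assumptions:

```python
def safe_generate(generate, moderate, max_attempts: int = 3) -> str:
    """Regenerate until the story passes moderation or the retry budget runs out."""
    for _ in range(max_attempts):
        story = generate()
        if moderate(story):        # True = passes the policy check
            return story
    raise RuntimeError("story failed moderation after retries")
```

Bounding the attempts matters: an unbounded loop against a model that keeps producing borderline output would stall the job past its delivery budget.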
Tool Use + Multimodal Execution
The runtime coordinates multiple specialized tools to complete a single user task.
ElevenLabs voice selection by language, tone, and narrator style.
pydub normalization and background music mixing.
Imagen 3 cover art generation for each completed story.
Separate video pipeline publishes long-form story content to YouTube.
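The normalization and music-blend step works in decibels, and the underlying math can be sketched in plain Python. The target levels are assumptions; with pydub this roughly corresponds to `segment.apply_gain(...)` on `segment.max_dBFS` and `voice.overlay(bgm)`:

```python
def normalize_gain_db(peak_dbfs: float, target_dbfs: float = -1.0) -> float:
    """Gain (dB) to apply so a clip peaks at target_dbfs."""
    return target_dbfs - peak_dbfs

def bgm_duck_db(narration_dbfs: float, headroom_db: float = 12.0) -> float:
    """Level to set background music at, keeping it headroom_db under the voice."""
    return narration_dbfs - headroom_db
```

Ducking the music a fixed headroom under the narration is what keeps the voice intelligible after the blend, regardless of how loud the source track is.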