Running text-to-speech at scale means pre-generating thousands of audio segments rather than synthesizing every request live.
The Three Tiers
Tier 1 — Static Cache: Common phrases served pre-rendered from the CDN edge. Latency: 15ms.
Tier 2 — Semantic Cache: Similar sentences share prosody models, with variable segments spliced into cached audio. Latency: 45ms.
Tier 3 — Live Generation: Novel sentences fall through to the inference cluster. Latency: 180ms.
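The routing logic above can be sketched as a single fall-through lookup. This is a minimal illustration, not the production system: the class and method names are hypothetical, and `difflib` string similarity stands in for the real embedding-based semantic match.

```python
import difflib


class TieredTTS:
    """Sketch of the three-tier lookup: exact cache, fuzzy match, live synth."""

    def __init__(self, static_cache, threshold=0.9):
        self.static_cache = static_cache  # Tier 1: pre-generated phrases
        self.threshold = threshold        # Tier 2: similarity cutoff

    def get_audio(self, text):
        # Tier 1: exact match served from the static cache.
        hit = self.static_cache.get(text)
        if hit is not None:
            return hit, "static"
        # Tier 2: fuzzy match against cached phrases; in practice this
        # would be a nearest-neighbor search over sentence embeddings.
        match = difflib.get_close_matches(
            text, self.static_cache, n=1, cutoff=self.threshold)
        if match:
            return self.static_cache[match[0]], "semantic"
        # Tier 3: fall through to live generation (stubbed here).
        return self.synthesize(text), "live"

    def synthesize(self, text):
        return f"<audio for {text!r}>"


tts = TieredTTS({"hello world": b"\x00\x01"})
print(tts.get_audio("hello world")[1])   # exact phrase -> "static"
print(tts.get_audio("hello worlds")[1])  # near match  -> "semantic"
print(tts.get_audio("goodbye")[1])       # novel text  -> "live"
```

Each tier only runs when the cheaper one above it misses, which is what keeps the common case at CDN latency.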
Combined effect: p95 latency dropped from 800ms to 90ms, and monthly compute costs fell by 73%.