ASCII TTS Example — "Warm Machine Oracle"

A self-contained Python demo (mode #6: TTS narration ASCII video) that:

Synthesizes a short narration locally with espeak-ng (no API keys).
Renders a colored ASCII-glyph grid reactive to the narration's energy.
Muxes everything into an MP4 with ffmpeg.

The output sits at output/tts_ascii_example.mp4 — about 10–12 seconds at 960×540, 24 fps.

Creative concept

Mood: warm machine oracle — a glowing quote being typed into a living terminal.

Visual arc:

Amber data fog. Quiet, low-frequency noise field in amber. The terminal "hums".
Cyan/purple signal rings. Concentric rings ripple outward from the centre, modulated by the narration's RMS and transients.
Punctuation star-map iris. Around the line "small text can hold a whole universe", the background briefly irises into a star map built from .,'*+oO` glyphs.
Title card resolve. A glowing title settles into place: WARM MACHINE ORACLE — ascii / tts / signal.

A typewriter quote overlay types the line in sync with the narration timing, backed by a soft scanline/grain/bloom CRT-ish postprocess.

Requirements

The script expects these locally available binaries and Python packages (installed system-wide in this environment):

espeak-ng (or espeak as fallback) — local TTS, no API keys.
ffmpeg, ffprobe — encode + verify.
Python 3.10+, numpy, Pillow. scipy is not required.

No pip install step is performed — the script will use what is present.
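A quick preflight check can confirm the requirements above before a render. This is a minimal sketch, not part of render_tts_ascii.py: the function name check_environment is hypothetical, and it only tests that the binaries are on PATH and the packages are importable.

```python
import importlib.util
import shutil


def check_environment():
    """Return (tts_binary, missing), where missing lists unavailable tools."""
    missing = []
    # Prefer espeak-ng and fall back to espeak, mirroring the list above.
    tts = shutil.which("espeak-ng") or shutil.which("espeak")
    if tts is None:
        missing.append("espeak-ng (or espeak)")
    for binary in ("ffmpeg", "ffprobe"):
        if shutil.which(binary) is None:
            missing.append(binary)
    # Pillow is imported under the module name PIL.
    for module in ("numpy", "PIL"):
        if importlib.util.find_spec(module) is None:
            missing.append(module)
    return tts, missing


if __name__ == "__main__":
    tts_binary, missing_items = check_environment()
    if missing_items:
        print("missing:", ", ".join(missing_items))
    else:
        print("all requirements found; TTS binary:", tts_binary)
```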

Run

cd /root/src/ascii-tts-example
python3 render_tts_ascii.py

You should see numbered progress lines for each pipeline stage. On success the script prints [OK] ...tts_ascii_example.mp4 — video+audio, ~11s and exits 0.

Output

output/tts_ascii_example.mp4   final video (H.264 + AAC, 960x540, 24fps)
_tmp/narration_raw.wav         raw espeak-ng synthesis
_tmp/narration_pad.wav         padded/trimmed audio used for mux
_tmp/frames/f_00000.png ...    individual rendered frames
_tmp/logs/*.log                full stdout+stderr for every subprocess

The _tmp/ directory is safe to delete after the render.

How it works (high level)

TTS — espeak-ng -v en+m3 -s 148 -p 38 synthesizes a slightly slow, lower-pitched male voice for the "oracle" feel.
Audio features — the WAV is decoded with wave, converted to mono float32, and per-video-frame RMS + transient (positive-rectified energy diff with exponential decay) signals are computed. These drive intensity and a centre-radiating pulse.
Field synthesis — three numpy fields are summed with trapezoidal section weights: amber fog (sum of low-frequency sinusoids), cyan/purple rings (sin(r·k - t·ω) with falloff), and a sparse twinkling star map.
Glyph grid — character sprites are pre-rasterised into an (n_chars, cell_h, cell_w) uint8 alpha atlas. Each frame indexes into the atlas with vectorised numpy lookups (sprites[grid_indices]), then transpose/reshape into a single full-canvas alpha image — no Python-level per-cell drawing loop.
Per-cell colour — each section contributes a weighted colour (amber/teal-purple/hot-amber), normalised by the active section weights and multiplied by intensity.
Adaptive tonemap — 96th-percentile normalisation keeps frames from flattening when the field's overall energy is low.
Postprocess — alternate-row scanlines, additive monochrome grain, cheap "bloom" via Gaussian-blurred bright channel, and a soft vignette.
Overlays — typewriter quote (with blinking cursor and a glowing blurred copy behind the sharp text) and a final title card, both drawn on top of the postprocessed image.
Font / palette safety — preferred glyph palettes use Unicode block characters (▒▓█●★); each is probed against the chosen font via font.getmask(ch) and ASCII-safe fallbacks are used if any are missing.
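The audio-features step can be sketched as follows. This is an illustration under assumptions rather than the script's actual code: the function name frame_features and the decay value of 0.85 are hypothetical, and the input is assumed to be mono float samples already decoded from the WAV.

```python
import numpy as np


def frame_features(samples, sample_rate, fps=24, decay=0.85):
    """Per-video-frame RMS plus a decaying transient envelope.

    samples: 1-D mono float array in [-1, 1].
    decay:   assumed per-frame decay factor for the transient tail.
    """
    samples = np.asarray(samples, dtype=np.float32)
    n_frames = int(len(samples) * fps / sample_rate)
    if n_frames == 0:
        empty = np.zeros(0, dtype=np.float32)
        return empty, empty.copy()

    # Slice boundaries spread evenly so frames do not drift against the audio.
    edges = np.linspace(0, len(samples), n_frames + 1).astype(int)
    rms = np.array(
        [np.sqrt(np.mean(samples[a:b] ** 2)) for a, b in zip(edges[:-1], edges[1:])],
        dtype=np.float32,
    )

    # Positive-rectified energy difference: only increases in RMS count.
    rise = np.maximum(np.diff(rms, prepend=rms[0]), 0.0)

    # Exponential decay so each onset leaves a short tail.
    transient = np.zeros_like(rms)
    level = 0.0
    for i, r in enumerate(rise):
        level = max(float(r), level * decay)
        transient[i] = level
    return rms, transient
```

The RMS curve suits slow intensity changes, while the transient envelope spikes on onsets and fades, which is the kind of signal a centre-radiating pulse needs.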
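The glyph-grid lookup can be shown in a few lines of numpy. The sketch below assumes an atlas and index grid shaped as described above; the function name compose_alpha is hypothetical, and the atlas in the usage note is synthetic rather than rasterised from a font.

```python
import numpy as np


def compose_alpha(sprites, grid_indices):
    """Tile per-cell sprites into one full-canvas alpha image.

    sprites:      (n_chars, cell_h, cell_w) uint8 alpha atlas.
    grid_indices: (rows, cols) integer array choosing one glyph per cell.
    """
    rows, cols = grid_indices.shape
    _, cell_h, cell_w = sprites.shape
    # Fancy indexing gathers one sprite per cell: (rows, cols, cell_h, cell_w).
    cells = sprites[grid_indices]
    # Move cell_h next to rows and cell_w next to cols, then flatten each pair.
    return cells.transpose(0, 2, 1, 3).reshape(rows * cell_h, cols * cell_w)


# Usage with a tiny synthetic atlas: glyph 0 is blank, glyph 1 is a gradient.
atlas = np.zeros((2, 2, 3), dtype=np.uint8)
atlas[1] = [[1, 2, 3], [4, 5, 6]]
canvas = compose_alpha(atlas, np.array([[0, 1], [1, 0]]))
```

The transpose is what keeps each sprite contiguous in the output: without it, the reshape would interleave rows from different cells.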

ffmpeg pipeline notes

The script never pipes raw frames to ffmpeg's stdin. Frames are written as PNGs and ffmpeg reads them through the f_%05d.png image-sequence pattern (a printf-style numbered pattern, not a shell glob). All ffmpeg/ffprobe invocations redirect both stdout and stderr to log files in _tmp/logs/, so no pipe can fill up and block the child process. The audio is padded (or trimmed) in Python to exactly match the video duration before muxing, so the mux command is a simple -map 0:v -map 1:a -t <dur> with no filter graph.
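The pad-or-trim step and the mux invocation can be sketched as below. The helper names (fit_audio, mux_command, run_logged) are hypothetical, and the encoder options libx264, yuv420p and aac are assumptions chosen to match the H.264 + AAC output; the script's real arguments may differ.

```python
import subprocess

import numpy as np


def fit_audio(samples, sample_rate, duration_s):
    """Pad with trailing silence, or trim, so the audio lasts duration_s."""
    samples = np.asarray(samples)
    target = int(round(duration_s * sample_rate))
    if len(samples) >= target:
        return samples[:target]
    pad = np.zeros(target - len(samples), dtype=samples.dtype)
    return np.concatenate([samples, pad])


def mux_command(frames_dir, audio_path, out_path, duration_s, fps=24):
    """Build an ffmpeg argument list for a PNG sequence plus a WAV file."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", f"{frames_dir}/f_%05d.png",   # numbered sequence, not a glob
        "-i", str(audio_path),
        "-map", "0:v", "-map", "1:a",
        "-c:v", "libx264", "-pix_fmt", "yuv420p",  # assumed encoder settings
        "-c:a", "aac",
        "-t", f"{duration_s:.3f}",
        str(out_path),
    ]


def run_logged(cmd, log_path):
    """Run cmd with stdout and stderr sent to a log file; return the exit code."""
    with open(log_path, "w") as log:
        result = subprocess.run(
            cmd, stdin=subprocess.DEVNULL, stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode
```

Because the child writes straight to a file, the parent never has to drain a pipe while waiting for ffmpeg to exit.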

Tuning

Narration text — edit NARRATION near the top of the script.
Duration window — DUR_MIN/DUR_MAX clamp the final video length.
Voice — change the -s (speed), -p (pitch), -v (voice) arguments to espeak-ng inside generate_tts().
Look — palette colours (C_AMBER, C_TEAL, etc.) and glyph palettes (PALETTE_*) live at the top of the file.
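To make the voice and duration knobs concrete, here is a hedged sketch of how such parameters might be wired up. build_espeak_cmd and clamp_duration are hypothetical names, not the script's own functions, and the 10 and 12 second bounds are assumed from the stated output length; the -v, -s, -p and -w options are standard espeak-ng flags.

```python
def build_espeak_cmd(text, wav_path, binary="espeak-ng",
                     voice="en+m3", speed=148, pitch=38):
    """Build an espeak-ng argument list that writes speech to wav_path."""
    return [
        binary,
        "-v", voice,        # voice variant
        "-s", str(speed),   # speed in words per minute
        "-p", str(pitch),   # pitch, 0 to 99
        "-w", str(wav_path),
        text,
    ]


def clamp_duration(narration_s, dur_min=10.0, dur_max=12.0):
    """Clamp the video length to a window (bounds here are assumed values)."""
    return max(dur_min, min(dur_max, narration_s))
```

Raising speed shortens the narration, so a faster voice is more likely to hit the lower duration bound and be padded with silence.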
