Pipeline

This page documents the algorithm and the defaults songsee uses to produce repeatable, high-quality images. It complements Visualizations (what each mode shows) and Rendering (how the canvas gets composed).

#Stages

input → decode → mono mixdown → optional slice → window → FFT
       ↓
       per-mode features (mel, chroma, mfcc, hpss, …)
       ↓
       percentile normalize → palette map → grid compose → encode

Every stage is deterministic. The same input file with the same flags always produces the same output bytes.

#Decode

  • WAV (PCM 8/16/24/32-bit, 32/64-bit float, WAVEFORMATEXTENSIBLE) and MP3 are decoded in pure Go via the bundled decoders.
  • Anything else falls through to ffmpeg, which is asked for 32-bit float little-endian mono at --sample-rate Hz (default 44100).
  • Stereo or multichannel input is averaged to mono.
  • --start and --duration (both in seconds) slice the decoded sample buffer before windowing.
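
The mixdown and slice steps can be sketched as follows. This is a minimal illustration, not songsee's actual code: the function names mixToMono and sliceSeconds are hypothetical, the interleaved sample layout is an assumption, and treating a non-positive duration as "to the end of the buffer" is also an assumption.

```go
package main

// mixToMono averages interleaved multichannel samples into a single channel.
// The interleaved layout (L R L R ...) is an assumption of this sketch.
func mixToMono(interleaved []float32, channels int) []float32 {
	if channels <= 1 {
		return interleaved
	}
	frames := len(interleaved) / channels
	mono := make([]float32, frames)
	for i := 0; i < frames; i++ {
		var sum float32
		for c := 0; c < channels; c++ {
			sum += interleaved[i*channels+c]
		}
		mono[i] = sum / float32(channels)
	}
	return mono
}

// sliceSeconds returns the samples in [start, start+duration) seconds.
// A duration <= 0 is treated as "until the end" (assumption).
func sliceSeconds(samples []float32, sampleRate int, start, duration float64) []float32 {
	from := int(start * float64(sampleRate))
	if from < 0 {
		from = 0
	}
	if from > len(samples) {
		from = len(samples)
	}
	to := len(samples)
	if duration > 0 {
		end := from + int(duration*float64(sampleRate))
		if end < to {
			to = end
		}
	}
	return samples[from:to]
}
```

For example, a stereo buffer {1, 3, 2, 4} mixes down to {2, 3}, and slicing 10 samples at a 4 Hz sample rate from 0.5 s for 1 s keeps samples 2 through 5.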

See Decoding for input formats, sample rate, ffmpeg lookup, and stdin usage.

#Windowing and FFT

  • Window: Hann, applied per frame.
  • Window size: --window samples (default 2048, must be a power of two).
  • Hop size: --hop samples (default 512).
  • Frame count: 1 + (len(samples) - window + hop - 1) / hop, using integer division.
  • Bin count: window / 2 + 1.
  • Bin spacing: sampleRate / window Hz per bin.
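
The frame geometry above can be checked with a few lines of Go. This is a sketch under stated assumptions: the function names are hypothetical, the periodic form of the Hann window is assumed (the symmetric form divides by n-1 instead), and frameCount assumes the input is at least one window long.

```go
package main

import "math"

// hann returns an n-sample Hann window in its periodic form (assumption).
func hann(n int) []float64 {
	w := make([]float64, n)
	for i := range w {
		w[i] = 0.5 - 0.5*math.Cos(2*math.Pi*float64(i)/float64(n))
	}
	return w
}

// frameCount applies the formula from the list above with integer division.
// It assumes numSamples >= window; shorter inputs are not covered here.
func frameCount(numSamples, window, hop int) int {
	return 1 + (numSamples-window+hop-1)/hop
}

// binCount is the number of non-negative frequency bins of a real FFT.
func binCount(window int) int {
	return window/2 + 1
}

// binSpacing is the width of one FFT bin in Hz.
func binSpacing(sampleRate, window int) float64 {
	return float64(sampleRate) / float64(window)
}
```

With the defaults (window 2048, hop 512, 44100 Hz), one second of audio gives 84 frames of 1025 bins each, spaced about 21.5 Hz apart.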

Magnitude is converted to decibels with 20·log10(mag + 1e-9) for the base spectrogram. Per-feature pipelines (mel, chroma, mfcc) use linear power instead.
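
A small sketch of the two conversions. The function names are hypothetical, and linear power is taken here to mean the squared magnitude of a complex FFT bin, which is the usual definition but is not spelled out above.

```go
package main

import "math"

// magToDB converts a linear magnitude to decibels. The 1e-9 offset
// keeps log10 finite when the magnitude is zero.
func magToDB(mag float64) float64 {
	return 20 * math.Log10(mag+1e-9)
}

// power returns the linear power of a complex bin (re + i*im),
// assumed here to be the squared magnitude.
func power(re, im float64) float64 {
	return re*re + im*im
}
```

With this offset, a silent bin lands at roughly -180 dB rather than negative infinity.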

#Per-mode features

mode         source                       notes
spectrogram  STFT magnitude in dB         clamped to 5th–98th percentile
mel          mel-warped power             log-magnitude; clamped 5th–98th percentile
chroma       12-bin pitch class           folds octaves; clamped 10th–98th percentile
mfcc         DCT of mel power             strips pitch, keeps timbre
hpss         median filters on STFT       9-frame harmonic + 9-frame percussive kernels
selfsim      cosine sim on chroma frames  gamma 1.4; clamped 10th–98th percentile
loudness     per-frame RMS                clamped to 95th percentile
tempogram    onset autocorrelation        30–240 BPM, 256 bins
flux         frame-to-frame STFT delta    clamped to 95th percentile
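
Two of the simpler sources in the table, per-frame RMS (loudness) and cosine similarity between feature frames (selfsim), can be sketched as below. The function names are hypothetical, and returning 0 for an all-zero frame is an assumption. The gamma 1.4 step for selfsim is not shown because the table does not say how it is applied.

```go
package main

import "math"

// frameRMS returns the root-mean-square level of one frame of samples.
func frameRMS(frame []float64) float64 {
	if len(frame) == 0 {
		return 0
	}
	var sum float64
	for _, s := range frame {
		sum += s * s
	}
	return math.Sqrt(sum / float64(len(frame)))
}

// cosineSim returns the cosine similarity of two feature frames,
// e.g. two 12-bin chroma vectors. A zero-energy frame yields 0 (assumption).
func cosineSim(a, b []float64) float64 {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	var dot, na, nb float64
	for i := 0; i < n; i++ {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}
```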

The percentile sampling reservoir is capped at 20 000 values per panel for speed; this is dense enough that boundaries are stable across runs.
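
One way to get capped, repeatable percentile bounds is shown below. This is only a sketch: how songsee actually fills its reservoir is not described here, so the fixed-stride subsampling and the floor-index percentile rule are both assumptions, as are the names percentileBounds and clamp.

```go
package main

import "sort"

// maxSamples is the per-panel reservoir cap quoted in the text.
const maxSamples = 20000

// percentileBounds returns the values at the lo and hi percentiles (0-100).
// It looks at no more than maxSamples values, picked with a fixed stride so
// that the result is the same on every run (stride sampling is an assumption).
func percentileBounds(values []float64, lo, hi float64) (float64, float64) {
	if len(values) == 0 {
		return 0, 0
	}
	stride := (len(values) + maxSamples - 1) / maxSamples
	sample := make([]float64, 0, maxSamples)
	for i := 0; i < len(values); i += stride {
		sample = append(sample, values[i])
	}
	sort.Float64s(sample)
	at := func(p float64) float64 {
		idx := int(p * float64(len(sample)-1) / 100)
		return sample[idx]
	}
	return at(lo), at(hi)
}

// clamp limits v to the range [lo, hi].
func clamp(v, lo, hi float64) float64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}
```

A spectrogram panel would call percentileBounds(values, 5, 98) and then clamp every cell to the returned pair before normalization.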

#Rendering

  • Each panel maps (time × bin) cells onto pixels at the panel's width × height.
  • Values are normalized into [0, 1] against the per-panel min/max (after the percentile clamp), then passed through the chosen palette.
  • Heatmap panels (mel, chroma, mfcc, selfsim, hpss halves, tempogram) render with flipVert so low frequencies are at the bottom.
  • Multiple panels compose into a ceil(sqrt(n))-column grid with an 8 px gap (see Rendering).
  • Encoder: PNG (lossless) or JPEG (quality 95).
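
The normalization and grid rules above can be sketched as follows. The function names are hypothetical. Mapping a flat panel (max equal to min) to 0 and deriving the row count as ceil(n / columns) are assumptions; only the ceil(sqrt(n)) column count comes from the text.

```go
package main

import "math"

// normalize maps v into [0, 1] against the panel's [lo, hi] range.
// A flat panel (hi <= lo) maps to 0 (assumption).
func normalize(v, lo, hi float64) float64 {
	if hi <= lo {
		return 0
	}
	t := (v - lo) / (hi - lo)
	return math.Min(1, math.Max(0, t))
}

// flipRow maps pixel row y to its vertically flipped row, so that
// index 0 (the lowest bin) ends up at the bottom of the panel.
func flipRow(y, height int) int {
	return height - 1 - y
}

// gridShape returns the column and row counts for n panels, using
// ceil(sqrt(n)) columns and as many rows as needed (row rule assumed).
func gridShape(n int) (cols, rows int) {
	if n <= 0 {
		return 0, 0
	}
	cols = int(math.Ceil(math.Sqrt(float64(n))))
	rows = (n + cols - 1) / cols
	return cols, rows
}
```

Under these rules, five panels lay out as 3 columns by 2 rows, and four panels as 2 by 2.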

#CLI defaults

--format       jpg
--width        1920
--height       1080
--window       2048
--hop          512
--sample-rate  44100
--style        classic
--viz          spectrogram

Full reference: CLI.