Jon Hardie, Software Developer

Spectastiq: a web component for exploring audio spectrograms

An outline of some of the architectural decisions made during the development of Spectastiq, a standalone web component that allows the exploration of pre-recorded audio in frequency space inside your web browser.

Visit here to play with Spectastiq and see what it can do.

Background

Spectastiq began life as a command‑line utility that batch‑processes .wav files from DOC AR4 and AudioMoth acoustic recorders into uniformly styled spectrogram images. These images are then displayed in KiwiViz, a visualisation tool for kiwi calls (of individual kiwi!) at the Atarau Sanctuary.

Building on this foundation, I evolved Spectastiq into a reusable web component that renders spectrograms directly in the browser at runtime. This was motivated by the needs of Cacophony Browse’s bird‑classification interface. (It turns out the best way to classify sounds in audio is to make it a computer vision problem).

In this workflow, expert users select and tag birdsong regions, contributing to a dataset that is used to train an AI model to automatically recognise native New Zealand birds.

Having a real-world consumer of the component was essential in shaping the customisation hooks I'd expose via a JavaScript API and HTML attributes.

Design goals

Fast initial load time

A performant FFT implementation would be important, especially for that crucial initial load of the full audio spectrogram.

Touch-first interactions

Pinch‑to‑zoom, panning, and other gestures must feel natural on mobile devices.

Smooth interaction

No jank or dropped frames, even on low-spec hardware.

Self-contained

Multiple instances should safely co-exist on a page without interference or global state clashes.

Architecture

Generating a spectrogram requires access to the raw PCM audio samples.

For loading an audio file and turning it into a Float32Array of PCM audio samples, Spectastiq currently leans entirely on the browser's built-in WebAudio API.

An asynchronous call to audioContext.decodeAudioData(myAudioFileBytes) is made, and then we wait a while. This returns an AudioBuffer. Of course you can't just play back an AudioBuffer directly, silly: to do that you need to create an AudioBufferSourceNode, assign the decoded AudioBuffer to its bufferNode.buffer property, and then call bufferNode.connect(audioContext.destination) (where 'destination' is your speakers). The WebAudio API is almost universally panned, but it seems unlikely anything better will be usable anytime soon, so let's just move on.

For a speedy startup time, the most important factor is that we can load an audio file once and then use the same data to generate a spectrogram image of that audio, as well as play it back through our speakers. There is some copying around of data involved, but the memory footprint ends up peaking around 3x the size of the decompressed PCM data.

To get a Float32Array for spectral analysis, we use the WebAudio API again: audioBuffer.getChannelData(0). At this time Spectastiq only uses a single channel to display a spectrogram, so if your audio is stereo, only the first channel will be used.
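
Putting those pieces together, the load path looks roughly like the following sketch (names like loadAudio are mine, not part of Spectastiq's API, and a real implementation also has to cope with autoplay policies that can leave the AudioContext suspended until a user gesture):

  // Fetch and decode an audio file, then expose one channel of PCM samples
  // for spectrogram analysis and wire the decoded buffer up for playback.
  async function loadAudio(url) {
    const audioContext = new AudioContext();
    const fileBytes = await (await fetch(url)).arrayBuffer();

    // decodeAudioData only resolves once the entire file has been decoded.
    const audioBuffer = await audioContext.decodeAudioData(fileBytes);

    // A Float32Array of PCM samples from the first channel, used for the FFT.
    const samples = audioBuffer.getChannelData(0);

    // Playback: an AudioBufferSourceNode wraps the decoded buffer and is
    // connected to the destination (your speakers).
    const bufferNode = audioContext.createBufferSource();
    bufferNode.buffer = audioBuffer;
    bufferNode.connect(audioContext.destination);

    return { audioContext, samples, play: () => bufferNode.start() };
  }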

Once we have this decoded audio buffer, we can start making some meaningful architectural choices.

Spectrogram rendering is achieved by performing a Fast Fourier Transform (FFT) over the samples of your audio. This is a math-intensive process that is indistinguishable from magic. It will take some time to complete, so it's important to do this work off the main UI thread, otherwise the browser will become unresponsive.

There is an excellent crate called RustFFT available for Rust, and it's possible to compile this to WebAssembly and expose its functionality to JavaScript using wasm_bindgen.

An FFT takes a sequence of numeric samples of some length and returns a histogram of 'frequency buckets', which — in terms of audio — represent the intensity of the sound at each frequency.

If we break up our audio samples array into small chunks — let's say of 2048 samples each, or about 43ms at 48kHz — and perform an FFT on each chunk in turn, we'll have an array-of-arrays of the frequencies found in each chunk of the sound, and their corresponding intensities (approximately, how loud each frequency sounds).

This information can be interpreted as a 2D array - an image that can be displayed.

[Figure: naïve FFT stepping approach]
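
In code, the naïve version amounts to something like this sketch, where fftMagnitudes stands in for whatever FFT routine you're using (in Spectastiq's case, the RustFFT/WebAssembly module) and returns half a chunk's worth of frequency-bucket intensities:

  const CHUNK_SIZE = 2048; // roughly 43ms of audio at 48kHz

  // Naïve approach: one FFT per non-overlapping chunk of samples, giving one
  // column of spectrogram pixels per chunk.
  function naiveSpectrogram(samples, fftMagnitudes) {
    const columns = [];
    for (let start = 0; start + CHUNK_SIZE <= samples.length; start += CHUNK_SIZE) {
      const chunk = samples.subarray(start, start + CHUNK_SIZE);
      columns.push(fftMagnitudes(chunk)); // CHUNK_SIZE / 2 frequency buckets
    }
    return columns; // a 2D array: time columns x frequency buckets
  }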

Unfortunately this naïve approach doesn't look very good. We're just not capturing enough temporal resolution.

It turns out that to get a smooth-looking representation of the frequencies found in a sound, it's necessary to slide a window the length of our chunk size over the audio. For each increment the window moves across the sample data, we take an FFT over the window and accumulate the frequency intensity data from each FFT step into a single set of buckets representing that slice - what can be thought of in our image as a vertical column of pixels. That's complicated to follow in prose, so I've tried (and quite possibly failed) to illustrate it here:

[Figure: basic sliding FFT window]

This still looks pretty rough. There's another step most spectrogram-generating software uses to further smooth things out, and that's to apply a 'filter', or 'window function', also known as a 'kernel'. This just means that for every increment when sliding the sampling window across the data, a weighting is applied to all the values inside the window. This weighting is usually some kind of bell-shaped distribution (a Hann window, for example).

[Figure: sliding FFT window with a window function applied]
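
As a sketch (again with fftMagnitudes standing in for the real FFT, and with an illustrative hop size - Spectastiq's actual parameters differ), the sliding, windowed version looks something like this:

  const WINDOW_SIZE = 2048;
  const HOP = WINDOW_SIZE / 4; // how far the window slides each step (illustrative)
  const STEPS_PER_COLUMN = WINDOW_SIZE / HOP;

  // A bell-shaped weighting (a Hann window) the same length as the FFT window.
  const hann = new Float32Array(WINDOW_SIZE);
  for (let i = 0; i < WINDOW_SIZE; i++) {
    hann[i] = 0.5 * (1 - Math.cos((2 * Math.PI * i) / (WINDOW_SIZE - 1)));
  }

  // For each output column, slide the window across that column's span of
  // samples, weight each window by the kernel, FFT it, and accumulate the
  // resulting magnitudes into the column's frequency buckets.
  function windowedSpectrogram(samples, fftMagnitudes) {
    const columns = [];
    const weighted = new Float32Array(WINDOW_SIZE);
    for (let colStart = 0; colStart + 2 * WINDOW_SIZE <= samples.length; colStart += WINDOW_SIZE) {
      const column = new Float32Array(WINDOW_SIZE / 2);
      for (let step = 0; step < STEPS_PER_COLUMN; step++) {
        const start = colStart + step * HOP;
        for (let i = 0; i < WINDOW_SIZE; i++) {
          weighted[i] = samples[start + i] * hann[i];
        }
        const magnitudes = fftMagnitudes(weighted);
        for (let b = 0; b < column.length; b++) {
          column[b] += magnitudes[b] / STEPS_PER_COLUMN;
        }
      }
      columns.push(column);
    }
    return columns;
  }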

Okay, so there's some output data that corresponds to the frequency information in the audio. We can draw that into a nice image, but this is all happening on a single CPU thread, and is not as fast as it could be.

Because we have that sliding window function, each column of the spectrogram has a dependency on some sample data from a slightly earlier time-step. At the beginning of the audio, where there are no previous samples to look back at when evaluating the first columns, there's a visible 'shadow' before things really get going.

Spectastiq splits up the audio into a number of slices (in Rust parlance) and dispatches those slices to an equal number of WebWorker threads, each running the same FFT-processing WebAssembly module.

The tricky bit in parallelising this work was that these chunks need to overlap, with each chunk containing the last WINDOW_SIZE samples of the previous chunk as a kind of prelude to prime the FFT calculation. Done correctly, this prevents the early columns of pixels in each spectrogram chunk from containing that ugly shadow discontinuity. When each worker returns its processed frequency data, this initial prelude is not included.

[Figure: chunks of overlapping audio data]

In Spectastiq's implementation, the FFT WINDOW_SIZE is always 2048 samples, but this may become customisable in the future. For now, 2048 samples gives a reasonable trade-off between the temporal resolution of the audio and how fine-grained the frequency buckets are.
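
A rough sketch of that slicing and dispatch (fft-worker.js is a hypothetical worker script wrapping the WebAssembly FFT; Spectastiq's real worker protocol is more involved):

  const WINDOW_SIZE = 2048;

  // Split the samples into one slice per worker. Every slice after the first
  // carries an extra WINDOW_SIZE samples from the end of the previous slice as
  // a prelude, so the sliding window never starts 'cold' and leaves a shadow.
  function sliceWithPrelude(samples, numWorkers) {
    const sliceLength = Math.ceil(samples.length / numWorkers);
    const slices = [];
    for (let w = 0; w < numWorkers; w++) {
      const start = Math.max(0, w * sliceLength - WINDOW_SIZE);
      const end = Math.min(samples.length, (w + 1) * sliceLength);
      slices.push({ samples: samples.slice(start, end), hasPrelude: w > 0 });
    }
    return slices;
  }

  // Each worker FFTs its slice and posts back the frequency data for its own
  // span, with the prelude columns dropped.
  function processInParallel(samples, numWorkers) {
    return Promise.all(
      sliceWithPrelude(samples, numWorkers).map(
        (slice) =>
          new Promise((resolve) => {
            const worker = new Worker("fft-worker.js"); // hypothetical worker script
            worker.onmessage = (event) => resolve(event.data);
            // The slice's backing buffer is transferred, not copied.
            worker.postMessage(slice, [slice.samples.buffer]);
          })
      )
    );
  }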

Rendering

After all of the worker threads have returned, and there's a new spectrogram buffer available, it needs to be rendered to the screen.

Substituting intermediate placeholder images

When the user zooms and pans around the audio file, Spectastiq aims to have those interactions happen at your device's full native frame rate.

While awaiting delivery of a new higher-definition spectrogram image for the current time-range of audio you're viewing, Spectastiq draws the appropriate section using the best-resolution previous image it has, as a placeholder until the new sharp image is ready to be swapped in.

At the limit, this image is a portion of the spectrogram image of the entire audio clip that was rendered at initialisation time.

Because of this design decision, when a user is panning or zooming the spectrogram the image may appear momentarily blurry, but it makes for a more responsive user experience overall.

Drawing to a canvas

One could decide to use a basic HTML <canvas> with a "2d" context and employ the drawImage API to 'blit' a portion of your spectrogram image buffer, stretched out to cover the visible area. An early version of Spectastiq did in fact use this method, but I discovered an (under-specified?) limitation to this API at higher zoom levels.

canvasContext.drawImage() rounding issues

drawImage as implemented by browsers doesn't support cropping to sub-pixel offsets for the rectangular portion you're drawing. When calling something like drawImage(imageBufferSource, 2.5, 0, 10.5, sourceHeight, 0, 0, targetWidth, targetHeight), those fractional values on the x-axis are going to get clamped.

In practice this means that you'll often experience a jarring 'popping' effect when the high-resolution image eventually replaces the cropped low-resolution placeholder.

The placeholder would be offset slightly in x from the spectrogram data that actually represented that time range, with the effect growing more noticeable the deeper you crop into your placeholder source.

WebGL2 context

The "webgl2" context provided by HTML canvas doesn't share this limitation.

GPU texture sampling units are quite capable of doing the correct sub-pixel sampling of a texture.

The current version of Spectastiq uploads the Float32Array buffer created by the FFT worker threads to the GPU as-is, thanks to the ability to use floating point textures.

OES_texture_float_linear is a WebGL2 extension that is almost universally available, and it allows linear filtering of floating-point textures. It means Spectastiq doesn't need to do any work to quantize the Float32Array returned by the FFT into 8-bit grayscale values - we can just use a single-channel float32 texture and do all that work in a GLSL shader, taking advantage of GPU parallelism.

That same shader can also cheaply map our float values to an RGB colour gradient texture. In the canvas 2d version of this, we needed to reference a lookup table for each pixel we output.

A final additional benefit of this approach: when drawing the spectrogram at a lower resolution than the backing texture, the GPU's texel sampler gives us smooth interpolation between values for free.
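
The texture setup boils down to something like this sketch (not Spectastiq's actual code), where gl is a 'webgl2' context and spectrogramData is the Float32Array assembled from the worker results, width columns by height frequency buckets:

  // Linear filtering of float textures needs this (near-universally supported)
  // extension; without it you'd fall back to NEAREST filtering or quantizing.
  gl.getExtension("OES_texture_float_linear");

  const texture = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_2D, texture);

  // Upload the FFT output as-is: a single-channel 32-bit float texture.
  gl.texImage2D(
    gl.TEXTURE_2D, 0, gl.R32F,
    width, height, 0,
    gl.RED, gl.FLOAT,
    spectrogramData
  );

  // Let the GPU's texel sampler interpolate between values for us.
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.LINEAR);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.LINEAR);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);

  // The fragment shader then samples this texture at sub-pixel offsets and
  // maps each float value through a colour-gradient texture to produce RGB.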

Removing jank

Initial versions of Spectastiq had the 'minimap' timeline range selector background drawn to a separate canvas.

The 'grabby handles' and draggable playhead indicators were contained inside DOM elements that were manipulated with CSS.

This strategy worked well on desktop, but introduced unacceptable jank on my low-spec reference device due to reflows and repaints when dragging the slider handles.

I was unable to completely eliminate this DOM overhead, even when forcing Chrome to promote these elements to GPU-accelerated layers using CSS tricks.

Ultimately I was able to remove all the frame hitches when interacting with the spectrogram by drawing all the UI onto a 2d canvas layer above the spectrogram WebGL canvas, and handling all interaction hit-testing of UI elements myself.

I'm not thrilled about this. It sacrifices any accessibility that might have previously been available. I ultimately decided the performance gains were necessary in this instance.
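
The hit-testing itself is nothing exotic - conceptually it's just comparing pointer coordinates against the rectangles of the drawn UI elements, roughly like this (overlayCanvas, the element rectangles and beginDrag are all stand-ins):

  // Screen-space rectangles for the UI elements drawn onto the overlay canvas.
  const uiElements = [
    { name: "left-handle", x: 40, y: 0, width: 16, height: 60 },
    { name: "right-handle", x: 300, y: 0, width: 16, height: 60 },
    { name: "playhead", x: 170, y: 0, width: 4, height: 60 },
  ];

  overlayCanvas.addEventListener("pointerdown", (event) => {
    const rect = overlayCanvas.getBoundingClientRect();
    const x = event.clientX - rect.left;
    const y = event.clientY - rect.top;
    // Grab the first element whose rectangle contains the pointer.
    const hit = uiElements.find(
      (el) => x >= el.x && x <= el.x + el.width && y >= el.y && y <= el.y + el.height
    );
    if (hit) beginDrag(hit, x, y); // hypothetical drag handler
  });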

Memory copy overhead in typical operation

In normal Spectastiq usage, there's a decent amount of copying of data between all of these WebWorker threads. On modern desktop systems, this doesn't really have a noticeable impact on the user experience.

On my reference Android hardware this is no longer the case, especially for longer audio clips.

I think the memory subsystem is just so much slower on this device that memcpy has a non-negligible impact.

It's not a problem in terms of actual UI jank — all this work is still happening safely off the main UI thread — but it does mean that each frame of the newly rendered spectrogram takes a while to resolve.

There is a way that we can avoid copying chunks of the decoded audio buffer between the main UI thread and all these workers. The backing buffers can be changed to SharedArrayBuffers, but there are many caveats that apply when attempting this.

In order to use SharedArrayBuffer, the page you're embedding Spectastiq in must be served from a 'secure context' and must be 'cross-origin isolated'. This means having a valid HTTPS certificate, and serving your page with a couple of additional security headers present.
It's a hassle, but it has been deemed necessary to mitigate the security risks of Spectre and similar speculative-execution attacks.
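
Concretely, the headers in question are the pair that makes a page 'cross-origin isolated'. As a minimal sketch, a bare-bones Node server (with HTTPS terminated in front of it, or running on localhost) would set them like so:

  // The two response headers that unlock SharedArrayBuffer in modern browsers.
  import { createServer } from "node:http";
  import { readFile } from "node:fs/promises";

  createServer(async (req, res) => {
    res.setHeader("Cross-Origin-Opener-Policy", "same-origin");
    res.setHeader("Cross-Origin-Embedder-Policy", "require-corp");
    res.setHeader("Content-Type", "text/html");
    res.end(await readFile("./index.html")); // a page embedding Spectastiq
  }).listen(8080);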

If you are in a position where you can enable SharedArrayBuffer support, doing so improves performance on my low-spec reference hardware quite dramatically.

Why roll your own FFT?

The WebAudio API does expose native FFT functionality (via its AnalyserNode), but its implementation runs on a single thread.

While it should be possible to parallelise it — following the same slicing strategy I used in Spectastiq — I ultimately preferred the extra control that comes with running the FFT calculations in WebAssembly.

With WASM I have full control over the exact parameters and windowing functions of the FFT.

Prior art

The Cornell eBird media library website offers a comparable way to explore birdsong spectrograms, but it makes a very different set of trade-offs to Spectastiq.

It pre-renders everything.
Instead of computing an FFT on the fly, eBird generates a series of JPEG spectrogram strips ahead of time, at multiple zoom levels. The browser JavaScript component fetches those images and stitches them together, aligning them with the audio timeline as you scroll or play.

This makes a lot of sense for less powerful devices, as images and audio can be downloaded in parallel by the browser.

Major cons for this approach are the need to host and run a server-side preprocessing pipeline, and additional storage requirements.

User experience is also a bit more restrictive: you can't change the colour palette the spectrograms are viewed in, and at the maximum zoom level some visual fidelity is lost due to obvious JPEG artifacting.

Future work

When Spectastiq renders long audio clips, the main thing that slows it down is the bottleneck at decodeAudioData().

All browsers run this function on a single thread, and it only returns once the entire file has been parsed and decoded. It doesn't give us a stream of the decoded audio as it goes, so we can't start playback or begin rendering a spectrogram eagerly.

There's no technical reason why you couldn't break the file into pieces and decode those pieces in parallel — it's just fiddly.

Most browser‑supported audio formats (MP3, AAC, Opus, etc.) are streaming formats: they consist of a sequence of audio frames that are independent of one another. If you locate the nearest frame boundary in the compressed stream, you can start decoding from that point.

An approach to doing this work in parallel might look like the following:

  1. Split the file into equal, slightly overlapping chunks (the amount of overlap could be informed by the max length of an audio frame in a given format).
  2. Some format-specific parsing code would seek to the earliest audio frame header in each chunk, and then prepend a valid stream header.
  3. Decode parts in parallel using AudioWorklet, and then join them all back together.
  4. Optionally, you could also start creating the initial spectrogram as soon as you had at least one decoded chunk, though juggling this would also introduce additional complexity.

If the network is faster than your serial decode speed, you could perhaps even download the audio file chunks concurrently using HTTP range requests.
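
The download half of that idea is already easy to sketch with standard fetch and Range headers (the chunk sizing here is arbitrary, and ignores the frame-alignment fiddliness described above):

  // Download a file as several byte ranges in parallel; the server needs to
  // support Range requests and expose Content-Length.
  async function fetchInRanges(url, numChunks) {
    const head = await fetch(url, { method: "HEAD" });
    const totalBytes = Number(head.headers.get("Content-Length"));
    const chunkSize = Math.ceil(totalBytes / numChunks);

    return Promise.all(
      Array.from({ length: numChunks }, (_, i) => {
        const start = i * chunkSize;
        const end = Math.min(totalBytes - 1, start + chunkSize - 1);
        return fetch(url, { headers: { Range: `bytes=${start}-${end}` } })
          .then((response) => response.arrayBuffer());
      })
    );
  }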

This is all quite conceptually simple, but in practice it is probably a fair bit of work to get right.

Usage

Spectastiq is under an open-source licence, and is free to use and extend however you fancy.

I had fun designing and building it, so I hope it's useful, especially to those presenting research into birdsong.
If it does end up being useful to the work you're doing, please reach out - I'd love to hear about it!
