
CNNs for Signal Feature Extraction

1D CNNs on raw waveforms versus 2D CNNs on spectrograms (mel, log-mel, STFT, CWT), self-supervised speech feature extractors (wav2vec, HuBERT), VGGish embeddings, and the choice of representation as the dominant accuracy lever.


Why This Matters

The accuracy of a deep audio or vibration classifier is determined by the input representation more than by the architecture. A ResNet-50 on log-mel spectrograms typically beats a much larger 1D CNN on raw waveform for sound-event detection, while a smaller 1D model wins on synthetic-aperture radar where the relevant features are in the raw IQ phase. The right question is not "which architecture", it is "which time-frequency representation, at which window length, with which normalization".

The shift from handcrafted MFCCs to learned representations did not eliminate the choice; it pushed it down a layer. wav2vec 2.0 (Baevski et al., NeurIPS 2020, arXiv:2006.11477) and HuBERT (Hsu et al., IEEE/ACM Trans. Audio Speech Lang. Process. 29, 2021, arXiv:2106.07447) learn frame-level features from unlabeled speech that transfer across downstream tasks, but the convolutional front end still operates on raw waveform with a fixed receptive field of 25 ms and a 20 ms stride, which is exactly the time-frequency trade-off an STFT would have made.

Core Ideas

A 1D CNN over raw waveform learns its own analysis filters in the first convolutional layer. With kernel sizes of 250 to 400 samples and a stride of 80 on 16 kHz audio, the first layer behaves like a learned filterbank with roughly 16 to 25 ms windows. Visualizing the filters reveals bandpass shapes that resemble a mel filterbank with task-dependent center frequencies. SincNet (Ravanelli and Bengio, 2018) constrains the first layer to parameterized sinc bandpass filters, which reduces parameter count without losing accuracy.
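
The "learned filterbank" view can be made concrete with a minimal numpy sketch: a first Conv1d layer with kernel 400 and stride 80 is exactly a framing step followed by a matrix multiply. The 64-filter weight matrix here is random, standing in for learned weights.

```python
import numpy as np

sr = 16000                       # sample rate (Hz)
kernel, stride = 400, 80         # 25 ms window, 5 ms hop, as in the text
x = np.random.randn(sr)          # 1 s of toy "waveform"

# Frame the signal exactly as a strided Conv1d would see it...
n_frames = (len(x) - kernel) // stride + 1
frames = np.stack([x[i * stride : i * stride + kernel] for i in range(n_frames)])

# ...then project each frame through 64 analysis filters (random here,
# learned in a real model). Each output channel is one "band".
W = np.random.randn(64, kernel) * 0.01
features = frames @ W.T

print(features.shape)  # (196, 64): 196 frames, 64 learned bands
```

This is why window length, stride, and channel count in layer 1 dominate: they fix the time-frequency grid before any deeper layer sees the data.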

A 2D CNN on a spectrogram treats the time-frequency surface as an image. The standard pipeline computes an STFT with a 25 ms Hann window and 10 ms hop, a mel filterbank with 64 to 128 bands, and a logarithm. Log compression is critical: linear-scale input degrades classification by 5 to 15 percentage points on AudioSet. VGGish (Hershey et al., ICASSP 2017, arXiv:1609.09430), trained on 70 million YouTube clips, still produces useful 128-dimensional general-purpose audio embeddings in 2026.
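
The pipeline in the paragraph (25 ms Hann window, 10 ms hop, 64 mel bands, log compression) fits in a short numpy sketch. A real pipeline would use librosa or torchaudio; the triangular-filterbank construction below is the standard HTK-style recipe, written out for transparency.

```python
import numpy as np

sr, n_fft, hop, n_mels = 16000, 400, 160, 64   # 25 ms window, 10 ms hop

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers equally spaced on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

x = np.random.randn(sr)                         # 1 s toy signal
win = np.hanning(n_fft)
n_frames = 1 + (len(x) - n_fft) // hop
spec = np.stack([np.abs(np.fft.rfft(win * x[i * hop : i * hop + n_fft])) ** 2
                 for i in range(n_frames)])     # power spectrogram
logmel = np.log(spec @ mel_filterbank(sr, n_fft, n_mels).T + 1e-6)
print(logmel.shape)                             # (98, 64)
```

The final `np.log` is the compression step the paragraph calls critical; dropping it feeds the 2D CNN a heavy-tailed linear-power surface.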

Continuous wavelet transforms give a multi-resolution view: short windows at high frequency, long windows at low frequency. They help when the signal has both transients and slow oscillations in the same window, where a single STFT window length compromises one or the other. CWT is rarely used in production speech but appears in fault-detection and biomedical pipelines.
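
A naive numpy implementation of a complex-Morlet CWT illustrates the multi-resolution property: the wavelet support shrinks as the analysis frequency rises. Production code would use PyWavelets or similar; the width factor (10 scales) and `w0 = 6` here are common illustrative choices, not a standard.

```python
import numpy as np

def morlet_cwt(x, sr, freqs, w0=6.0):
    """Naive CWT with a complex Morlet wavelet.
    Returns a (len(freqs), len(x)) scalogram; the effective window
    shrinks as frequency grows, unlike a fixed-window STFT."""
    out = np.empty((len(freqs), len(x)), dtype=complex)
    for i, f in enumerate(freqs):
        s = w0 * sr / (2 * np.pi * f)       # scale for this center frequency
        n = int(10 * s) | 1                 # odd support, ~10 scales wide
        t = (np.arange(n) - n // 2) / s
        psi = np.exp(1j * w0 * t) * np.exp(-t**2 / 2) / np.sqrt(s)
        out[i] = np.convolve(x, np.conj(psi)[::-1], mode="same")  # cross-correlate
    return np.abs(out)

sr = 1000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 50 * t)              # pure 50 Hz tone
scalo = morlet_cwt(x, sr, freqs=np.array([10.0, 50.0, 200.0]))
print(scalo.shape)                          # (3, 1000)
# Mid-signal, the matched 50 Hz row dominates the off-band rows.
```

Feeding `scalo` (log-compressed) to a 2D CNN is the drop-in replacement for a mel spectrogram in the fault-detection and biomedical pipelines mentioned above.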

Self-supervised pretraining changed the regime above 100 hours of unlabeled audio. wav2vec 2.0 quantizes the convolutional output into a discrete codebook and trains a transformer with a contrastive objective over masked timesteps. HuBERT replaces the contrastive loss with cross-entropy over k-means cluster targets, which is more stable to train. Fine-tuning either model with a small task head reaches LibriSpeech-competitive WER with one to ten hours of labeled data, where a from-scratch CNN would need hundreds.
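
A toy numpy sketch of the InfoNCE-style contrastive loss over masked timesteps clarifies the wav2vec 2.0 objective. Random vectors stand in for the quantized targets `z` and transformer context outputs `c`; the distractor count (10) is illustrative, smaller than the paper's 100, and the 0.1 temperature follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, K = 50, 16, 10        # frames, feature dim, distractors per masked frame

# Toy stand-ins: z = quantized front-end targets, c = transformer context
# outputs. Here c roughly predicts z, as a trained model's would.
z = rng.standard_normal((T, D))
c = z + 0.1 * rng.standard_normal((T, D))

def cos(a, b):
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

masked = rng.choice(T, size=15, replace=False)
losses = []
for t in masked:
    # Sample K distractor targets from other timesteps.
    distractors = z[rng.choice(np.delete(np.arange(T), t), size=K, replace=False)]
    logits = np.concatenate([[cos(c[t], z[t])], cos(c[t][None], distractors)]) / 0.1
    # InfoNCE: the true target (index 0) should beat the distractors.
    losses.append(-logits[0] + np.log(np.exp(logits).sum()))
print(float(np.mean(losses)))   # small, since c ~ z by construction
```

HuBERT's change is local to this loop: replace the cosine logits with a softmax over k-means cluster IDs and the InfoNCE term with plain cross-entropy.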

Common Confusions

Watch Out

Mel scale is a perceptual approximation, not an information-theoretic optimum

The mel scale was fit to human pitch perception experiments. It compresses high frequencies and expands low ones. For non-speech audio (machine sounds, sonar, animal vocalizations) the mel scale can discard discriminative high-frequency content. A constant-Q transform or a learned filterbank often beats mel for these domains.
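
The compression is easy to quantify with the standard HTK mel formula, m = 2595 log10(1 + f/700): equal steps in Hz shrink dramatically on the mel axis as frequency grows, which is exactly the high-frequency resolution a machine-sound or sonar model may need.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale: m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

# The same 1 kHz step in Hz, measured at the bottom and top of the band:
lo = hz_to_mel(1000) - hz_to_mel(0)      # ~1000 mel
hi = hz_to_mel(8000) - hz_to_mel(7000)   # ~138 mel: heavily compressed
print(round(float(lo)), round(float(hi)))  # 1000 138
```

A mel filterbank therefore allocates roughly seven times fewer bands to the 7-8 kHz octave-slice than to the 0-1 kHz one, which is what a constant-Q or learned filterbank avoids.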

Watch Out

A 1D CNN on waveform is not necessarily 'end-to-end' in the useful sense

The first-layer filterbank still defines the temporal resolution. End-to-end means the filterbank is learned jointly, not that the model is free of inductive bias. Window length, stride, and number of channels in layer 1 are still the dominant hyperparameters.

Last reviewed: April 18, 2026
