
CNNs for Signal Feature Extraction

1D CNNs on raw waveforms versus 2D CNNs on spectrograms (mel, log-mel, STFT, CWT), self-supervised speech feature extractors (wav2vec, HuBERT), VGGish embeddings, and the choice of representation as the dominant accuracy lever.


Why This Matters

The accuracy of a deep audio or vibration classifier is determined by the input representation more than by the architecture. A ResNet-50 on log-mel spectrograms typically beats a much larger 1D CNN on raw waveform for sound-event detection, while a smaller 1D model wins on synthetic-aperture radar where the relevant features are in the raw IQ phase. The right question is not "which architecture", it is "which time-frequency representation, at which window length, with which normalization".

The shift from handcrafted MFCCs to learned representations did not eliminate the choice; it pushed it down a layer. wav2vec 2.0 (Baevski et al., NeurIPS 2020, arXiv:2006.11477) and HuBERT (Hsu et al., IEEE/ACM Trans. Audio Speech Lang. Process. 29, 2021, arXiv:2106.07447) learn frame-level features from unlabeled speech that transfer across downstream tasks, but the convolutional front end still operates on raw waveform with a fixed receptive field of 25 ms and a 20 ms stride, which is exactly the time-frequency trade-off an STFT would have made.

Core Ideas

A 1D CNN over raw waveform learns its own analysis filters in the first convolutional layer. With kernel sizes of 250 to 400 samples and a stride of 80 on 16 kHz audio, the first layer behaves like a learned filterbank with roughly 16 to 25 ms windows. Visualizing the filters reveals bandpass shapes that resemble a mel filterbank with task-dependent center frequencies. SincNet (Ravanelli and Bengio, 2018) constrains the first layer to parameterized sinc bandpass filters, which reduces parameter count without losing accuracy.
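
The "learned filterbank" view can be made concrete with a minimal numpy sketch: a first Conv1d layer with kernel 400 and stride 80 is exactly a framing step followed by a matrix multiply. The 64-filter weight matrix here is random, standing in for learned weights.

```python
import numpy as np

sr = 16000                       # sample rate (Hz)
kernel, stride = 400, 80         # 25 ms window, 5 ms hop, as in the text
x = np.random.randn(sr)          # 1 s of toy "waveform"

# Frame the signal exactly as a strided Conv1d would see it...
n_frames = (len(x) - kernel) // stride + 1
frames = np.stack([x[i * stride : i * stride + kernel] for i in range(n_frames)])

# ...then project each frame through 64 analysis filters (random here,
# learned in a real model). Each output channel is one "band".
W = np.random.randn(64, kernel) * 0.01
features = frames @ W.T

print(features.shape)  # (196, 64): 196 frames, 64 learned bands
```

This is why window length, stride, and channel count in layer 1 dominate: they fix the time-frequency grid before any deeper layer sees the data.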

A 2D CNN on a spectrogram treats the time-frequency surface as an image. The standard pipeline computes an STFT with a 25 ms Hann window and 10 ms hop, a mel filterbank with 64 to 128 bands, and a logarithm. Log compression is critical: linear-scale input degrades classification by 5 to 15 percentage points on AudioSet. VGGish (Hershey et al., ICASSP 2017, arXiv:1609.09430), trained on 70 million YouTube clips, still produces useful 128-dimensional general-purpose audio embeddings in 2026.
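
The pipeline in the paragraph (25 ms Hann window, 10 ms hop, 64 mel bands, log compression) fits in a short numpy sketch. A real pipeline would use librosa or torchaudio; the triangular-filterbank construction below is the standard HTK-style recipe, written out for transparency.

```python
import numpy as np

sr, n_fft, hop, n_mels = 16000, 400, 160, 64   # 25 ms window, 10 ms hop

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers equally spaced on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

x = np.random.randn(sr)                         # 1 s toy signal
win = np.hanning(n_fft)
n_frames = 1 + (len(x) - n_fft) // hop
spec = np.stack([np.abs(np.fft.rfft(win * x[i * hop : i * hop + n_fft])) ** 2
                 for i in range(n_frames)])     # power spectrogram
logmel = np.log(spec @ mel_filterbank(sr, n_fft, n_mels).T + 1e-6)
print(logmel.shape)                             # (98, 64)
```

The final `np.log` is the compression step the paragraph calls critical; dropping it feeds the 2D CNN a heavy-tailed linear-power surface.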

Continuous wavelet transforms give a multi-resolution view: short windows at high frequency, long windows at low frequency. They help when the signal has both transients and slow oscillations in the same window, where a single STFT window length compromises one or the other. CWT is rarely used in production speech but appears in fault-detection and biomedical pipelines.
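
A naive numpy implementation of a complex-Morlet CWT illustrates the multi-resolution property: the wavelet support shrinks as the analysis frequency rises. Production code would use PyWavelets or similar; the width factor (10 scales) and `w0 = 6` here are common illustrative choices, not a standard.

```python
import numpy as np

def morlet_cwt(x, sr, freqs, w0=6.0):
    """Naive CWT with a complex Morlet wavelet.
    Returns a (len(freqs), len(x)) scalogram; the effective window
    shrinks as frequency grows, unlike a fixed-window STFT."""
    out = np.empty((len(freqs), len(x)), dtype=complex)
    for i, f in enumerate(freqs):
        s = w0 * sr / (2 * np.pi * f)       # scale for this center frequency
        n = int(10 * s) | 1                 # odd support, ~10 scales wide
        t = (np.arange(n) - n // 2) / s
        psi = np.exp(1j * w0 * t) * np.exp(-t**2 / 2) / np.sqrt(s)
        out[i] = np.convolve(x, np.conj(psi)[::-1], mode="same")  # cross-correlate
    return np.abs(out)

sr = 1000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 50 * t)              # pure 50 Hz tone
scalo = morlet_cwt(x, sr, freqs=np.array([10.0, 50.0, 200.0]))
print(scalo.shape)                          # (3, 1000)
# Mid-signal, the matched 50 Hz row dominates the off-band rows.
```

Feeding `scalo` (log-compressed) to a 2D CNN is the drop-in replacement for a mel spectrogram in the fault-detection and biomedical pipelines mentioned above.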

Self-supervised pretraining changed the regime above 100 hours of unlabeled audio. wav2vec 2.0 quantizes the convolutional output into a discrete codebook and trains a transformer with a contrastive objective over masked timesteps. HuBERT replaces the contrastive loss with cross-entropy over k-means cluster targets, which is more stable to train. Fine-tuning either model with a small task head reaches LibriSpeech-competitive WER with one to ten hours of labeled data, where a from-scratch CNN would need hundreds.
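
A toy numpy sketch of the InfoNCE-style contrastive loss over masked timesteps clarifies the wav2vec 2.0 objective. Random vectors stand in for the quantized targets `z` and transformer context outputs `c`; the distractor count (10) is illustrative, smaller than the paper's 100, and the 0.1 temperature follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, K = 50, 16, 10        # frames, feature dim, distractors per masked frame

# Toy stand-ins: z = quantized front-end targets, c = transformer context
# outputs. Here c roughly predicts z, as a trained model's would.
z = rng.standard_normal((T, D))
c = z + 0.1 * rng.standard_normal((T, D))

def cos(a, b):
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

masked = rng.choice(T, size=15, replace=False)
losses = []
for t in masked:
    # Sample K distractor targets from other timesteps.
    distractors = z[rng.choice(np.delete(np.arange(T), t), size=K, replace=False)]
    logits = np.concatenate([[cos(c[t], z[t])], cos(c[t][None], distractors)]) / 0.1
    # InfoNCE: the true target (index 0) should beat the distractors.
    losses.append(-logits[0] + np.log(np.exp(logits).sum()))
print(float(np.mean(losses)))   # small, since c ~ z by construction
```

HuBERT's change is local to this loop: replace the cosine logits with a softmax over k-means cluster IDs and the InfoNCE term with plain cross-entropy.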

Common Confusions

Watch Out

Mel scale is a perceptual approximation, not an information-theoretic optimum

The mel scale was fit to human pitch perception experiments. It compresses high frequencies and expands low ones. For non-speech audio (machine sounds, sonar, animal vocalizations) the mel scale can discard discriminative high-frequency content. A constant-Q transform or a learned filterbank often beats mel for these domains.
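
The compression is easy to quantify with the standard HTK mel formula, m = 2595 log10(1 + f/700): equal steps in Hz shrink dramatically on the mel axis as frequency grows, which is exactly the high-frequency resolution a machine-sound or sonar model may need.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale: m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

# The same 1 kHz step in Hz, measured at the bottom and top of the band:
lo = hz_to_mel(1000) - hz_to_mel(0)      # ~1000 mel
hi = hz_to_mel(8000) - hz_to_mel(7000)   # ~138 mel: heavily compressed
print(round(float(lo)), round(float(hi)))  # 1000 138
```

A mel filterbank therefore allocates roughly seven times fewer bands to the 7-8 kHz octave-slice than to the 0-1 kHz one, which is what a constant-Q or learned filterbank avoids.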

Watch Out

A 1D CNN on waveform is not necessarily 'end-to-end' in the useful sense

The first-layer filterbank still defines the temporal resolution. End-to-end means the filterbank is learned jointly, not that the model is free of inductive bias. Window length, stride, and number of channels in layer 1 are still the dominant hyperparameters.

Last reviewed: April 18, 2026
