MNIST Audio Classification with an FFT Convolutional Neural Net using NetDesigner

Published: 16 May 2026
Channel: GradientN

Hi everyone, thanks for tuning in. This is Nate, a Murf synthetic voice, talking to you from GradientN. In this video we set up and train a voice classification neural net using the Audio MNIST dataset, downloaded from Kaggle. The link and reference are in the description box below. NNetDesigner is available for download at gradientn.com!

The fully connected net has a sound input layer, two hidden layers, and output nodes connected to the sound files for digit one, digit two, digit three, and so on.

We’re going to add an FFT, a Fast Fourier Transform, and a convolution to the fully connected net. We’ll delete the connections between the input layer and the net. Then we’ll place a process node and a convolution, and connect it all together.

We’ll set up the FFT by selecting the process node we just inserted and then bringing up the Process Editor. We’ll set the FFT size to 128, the FFT data stride to 32, and the down convert rate to 16.
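The three settings above can be sketched in NumPy. This is only an illustration under stated assumptions: the video does not describe how the tool processes the data internally, so here the down convert step simply keeps every 16th sample (no anti-aliasing filter), each FFT is reduced to its 128 magnitudes, and no window function is applied. The function name `fft_frames` is hypothetical.

```python
import numpy as np

def fft_frames(waveform, fft_size=128, fft_stride=32, down_convert=16):
    """Return an array of shape (num_frames, fft_size) of FFT magnitudes."""
    # Assumption: naive decimation, keeping every `down_convert`-th sample.
    x = np.asarray(waveform, dtype=float)[::down_convert]
    if len(x) < fft_size:
        return np.empty((0, fft_size))
    # Step a window of fft_size samples across the data, fft_stride at a time.
    num_frames = (len(x) - fft_size) // fft_stride + 1
    frames = np.stack([x[i * fft_stride : i * fft_stride + fft_size]
                       for i in range(num_frames)])
    return np.abs(np.fft.fft(frames, axis=1))

# One second of a synthetic 48000 Hz tone standing in for a recording.
t = np.arange(48000) / 48000.0
spectra = fft_frames(np.sin(2 * np.pi * 440.0 * t))
print(spectra.shape)  # (90, 128)
```

With these settings, one second of audio becomes 3000 samples after down conversion, and stepping a 128-sample FFT by 32 samples yields 90 FFTs.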

For our convolution, we’ll leave the input convolution size at 4 FFTs and the input convolution stride at 4. We’ll set the input frame size to 8. A frame size of 8 means that we will be feeding our fully connected output net two convolutions (frame size 8 divided by convolution size 4) per training step. Finally, we’ll leave the output convolution size at 2, giving us a two to one reduction through the convolution (convolution input size 4 divided by output size 2). The convolution will have an input frame size of 128 (the FFT size) by 8 (input frame size) and an output frame size of 128 (again, the FFT size) by 4 (output size 2 times 2 convolutions).
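The frame arithmetic in this step can be checked with a few lines of Python. This is a bookkeeping sketch only; the function and parameter names are hypothetical and are not taken from the tool. It uses the usual sliding-window count, which equals frame size divided by convolution size when the stride matches the convolution size, as it does here.

```python
def conv_frame_shapes(fft_size=128, frame_size=8, conv_size=4,
                      conv_stride=4, conv_out=2):
    """Return (num_convolutions, input_shape, output_shape)."""
    # Number of convolution positions that fit in one input frame.
    num_convs = (frame_size - conv_size) // conv_stride + 1
    input_shape = (fft_size, frame_size)
    # Each convolution position contributes conv_out outputs.
    output_shape = (fft_size, conv_out * num_convs)
    return num_convs, input_shape, output_shape

num_convs, in_shape, out_shape = conv_frame_shapes()
print(num_convs, in_shape, out_shape)  # 2 (128, 8) (128, 4)
```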

We can set our FFT size. We can also set our FFT stride, the increment we use to step the FFT across the waveform data. We can also set our convolution size. In this case, our convolution is set to take 4 FFTs at a time. We can also set the convolution stride. Like the FFT stride, the convolution stride sets how far we move the convolution over the data for each step. And, while not shown, we can set the number of convolutions, or frame size, to feed our fully connected net.
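To make the stepping concrete, here is a rough sketch of a convolution that takes 4 FFTs at a time and moves by the convolution stride. The video does not say what operation the convolution performs, so this assumes a simple linear map from 4 FFTs to 2 outputs, shared across all frequency bins; the random weights are untrained placeholders and `apply_convolution` is a hypothetical name.

```python
import numpy as np

def apply_convolution(frames, weights, conv_stride=4):
    """Slide a (conv_out, conv_size) weight matrix along the FFT axis.

    frames:  (num_ffts, fft_size) array, one row per FFT.
    weights: (conv_out, conv_size) array, hypothetical learned weights.
    Returns an array of shape (num_positions * conv_out, fft_size).
    """
    conv_out, conv_size = weights.shape
    outputs = []
    for start in range(0, frames.shape[0] - conv_size + 1, conv_stride):
        window = frames[start:start + conv_size]   # (conv_size, fft_size)
        outputs.append(weights @ window)           # (conv_out, fft_size)
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
frame = rng.random((8, 128))    # 8 FFTs of size 128 (placeholder data)
weights = rng.random((2, 4))    # 4 FFTs in, 2 out (untrained placeholder)
result = apply_convolution(frame, weights)
print(result.shape)  # (4, 128)
```

The result is stored here as 4 rows of 128 values, which is the same data as the 128 by 4 output frame described in the video, just transposed.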

The Audio MNIST waveforms have a sample rate of 48000 Hz. Using a down sample rate of 16 gives us an FFT sample rate of 3000 Hz (48000 divided by 16). See reference at NI Audio for more details.
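The consequences of these numbers follow from standard sampling arithmetic, worked through below. The FFT size of 128 and stride of 32 come from the Process Editor settings earlier in the video; the variable names are illustrative. Whether the tool filters the signal before down sampling is not stated, so the Nyquist figure is simply the highest frequency a 3000 Hz sample rate can represent.

```python
sample_rate_hz = 48000      # Audio MNIST sample rate, from the video
down_convert = 16
fft_size = 128
fft_stride = 32

fft_rate_hz = sample_rate_hz / down_convert      # 3000.0
nyquist_hz = fft_rate_hz / 2                     # 1500.0
bin_spacing_hz = fft_rate_hz / fft_size          # 23.4375
window_ms = 1000 * fft_size / fft_rate_hz        # about 42.7
hop_ms = 1000 * fft_stride / fft_rate_hz         # about 10.7

print(fft_rate_hz, nyquist_hz, bin_spacing_hz)   # 3000.0 1500.0 23.4375
print(round(window_ms, 1), round(hop_ms, 1))     # 42.7 10.7
```

So each FFT covers roughly 43 milliseconds of audio, and consecutive FFTs start about 11 milliseconds apart.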

We’ll bring out the training graph and then start training. Training will take a while, and so we can fast forward through training a little, and then jump to the end.

We stopped training once it looked like the net had stabilized.

From the classification matrices we see that training accuracy is 77.7% and validation accuracy is 66.3%. This is considerably better than for the FFT fully connected net from our previous video, which was 36.2% for training accuracy and 34.3% for validation accuracy.
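For reference, an accuracy figure like these is conventionally read off a classification (confusion) matrix as the diagonal total divided by the grand total. The sketch below assumes rows are true classes and columns are predicted classes, and uses made-up counts for three classes; these are not the results from the video, and how the tool lays out its matrices is not confirmed here.

```python
import numpy as np

def accuracy(confusion):
    """Fraction of samples on the diagonal (correctly classified)."""
    confusion = np.asarray(confusion)
    return np.trace(confusion) / confusion.sum()

# Made-up 3-class counts for illustration only (not results from the video).
toy = [[8, 1, 1],
       [2, 7, 1],
       [0, 3, 7]]
print(round(accuracy(toy), 3))  # 0.733
```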

We are showing graphically how the net performed. The top graph shows performance for our current net design, with an FFT, a convolution, and a fully connected net. The middle graph shows performance for the FFT plus fully connected from our previous video. The stacked graphs show performance for each digit with a different color. The bottom graph shows the audio waveform for the corresponding digit.

For file Subject01_One_0.wav, we can clearly see the performance difference between the two nets. The previous net performs reasonably well identifying digit one, in blue, in localized regions, but the current net with the convolution performs well over the whole waveform.

For digit two, in orange, we can see that the current convolutional net again outperforms the previous FFT plus fully connected net.

For digit Three, in gray, we can see that the current net outperforms the previous net in the voiced region of the waveform. However, both nets still struggle in the unvoiced region, especially during the ‘th’ sound, where the nets strongly indicate Six, in green.

Four, in yellow, is identified clearly in the voiced region for both nets.

Again, five, in light blue, is identified better with the net with the convolution.

For Six, in green, the current net identifies the voiced portion and the ‘x’ sound of the waveform better, but actually performs slightly worse at the start of the ‘s’ sound at the beginning.

The current net performs better for digit seven, in dark blue, across the full waveform.

Again, the current net performs better for eight, in burnt orange, across the waveform.

The current net with the convolution also performs better for nine, in gray.

Zero, in brown, is clearly identified by the current net.

Link to Audio MNIST dataset:
https://www.kaggle.com/datasets/alanchn31/...

Reference
Becker, Sören; Ackermann, Marcel; Lapuschkin, Sebastian; Müller, Klaus-Robert; Samek, Wojciech. 2018. “Interpreting and Explaining Deep Neural Networks for Classification of Audio Signals”, CoRR, abs/1807.03418