
Deep Audio


Github Repository

Signal Processing with Tensorflow

The challenge is to build a machine learning model and the accompanying code to count the number of Capuchinbird calls within a given clip. This can be done in a variety of ways, and we recommend doing some research into the various methods of audio recognition.

Project Setup

The dataset is provided on kaggle.com and consists of three sets of bird call recordings:

  • raw (RAW Data)
  • positives (Cut out positive bird calls)
  • negatives (Cut out negatives)

Copy them into your data folder. Open a Jupyter Notebook and create a notebook called signal-processing:

jupyter notebook

Now we can install the Python dependencies from inside the notebook:

!pip install tensorflow tensorflow-gpu tensorflow-io matplotlib

Once installed, import them into your notebook:

import os
from matplotlib import pyplot as plt
import tensorflow as tf
import tensorflow_io as tfio

Data Loading

To work with the recorded audio files we first select the corresponding data paths - in the example below we pick one file that contains the signal we are looking for and one that only contains background noise:

CAPUCHIN_FILE = os.path.join('data', 'positives', 'XC3776-3.wav')
NOT_CAPUCHIN_FILE = os.path.join('data', 'negatives', 'afternoon-birds-song-in-forest-0.wav')

We then read a single audio stream (mono), resampled to 16 kHz, from those files by feeding the file path into the following function, which uses:

  • decode_wav : Read single channel from stereo file
  • squeeze : Since all the data is single channel (mono), drop the channels axis from array
  • resample : Reduce audio data to 16kHz
def load_wav_16k_mono(filename):
    # Load the encoded wav file
    file_contents = tf.io.read_file(filename)
    # Decode wav (tensor by channels)
    wav, sample_rate = tf.audio.decode_wav(file_contents, desired_channels=1)
    # Remove the trailing channel axis
    wav = tf.squeeze(wav, axis=-1)
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)
    # Resample from the original sample rate (e.g. 44100 Hz) down to 16000 Hz
    wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)
    return wav

We can run the function over both the positive and negative file:

wave = load_wav_16k_mono(CAPUCHIN_FILE)
nwave = load_wav_16k_mono(NOT_CAPUCHIN_FILE)

And overlay both waves in a plot:
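A minimal matplotlib sketch for such an overlay (with the default color cycle the first trace is drawn in blue, the second in orange):

plt.plot(wave)
plt.plot(nwave)
plt.show()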

(Figure: overlaid waveform plot of the positive and negative recordings)

The blue plot represents the positive signal - the correct bird call - while the orange plot is a negative sample representing our baseline background noise. This now gives us an image representation of our audio that we can use to train our neural network.

Creating the Dataset

First we need to define the path to all our positive and negative audio files:

POS = os.path.join('data', 'positives')
NEG = os.path.join('data', 'negatives')

Now we can store all the data paths inside Tensorflow Datasets:

pos = tf.data.Dataset.list_files(POS+'/*.wav')
neg = tf.data.Dataset.list_files(NEG+'/*.wav')

We can now label our data by adding ones and zeros to each filepath inside the dataset depending on whether it is a positive or negative sample:

positives = tf.data.Dataset.zip((pos, tf.data.Dataset.from_tensor_slices(tf.ones(len(pos)))))

negatives = tf.data.Dataset.zip((neg, tf.data.Dataset.from_tensor_slices(tf.zeros(len(neg)))))


After successfully labelling our data we can now merge everything into a single dataset:

data = positives.concatenate(negatives)

Calculating Average Length of a Birdcall

To be able to count bird calls in our RAW data we first have to know the average length of a single call. We can do this by loading all of our parsed positive recordings into their waveforms and recording their lengths:

lengths = []
for file in os.listdir(os.path.join('data', 'positives')):
    tensor_wave = load_wav_16k_mono(os.path.join('data', 'positives', file))
    lengths.append(len(tensor_wave))

By appending each file length to the lengths array we can now do some basic maths to calculate the Min, Max and AVG length of a positive birdcall:
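A minimal sketch of those calculations (the concrete numbers in the comments come from this run of the dataset):

tf.math.reduce_mean(lengths)   # average length in samples, here 54156
tf.math.reduce_min(lengths)    # shortest positive clip, 32000 samples (2 s)
tf.math.reduce_max(lengths)    # longest positive clip, around 80000 samples (5 s)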


This means that the average birdcall is 54156 samples / 16000 Hz = 3.38 s, and the calls range between roughly 2 s (min) and 5 s (max) in length.

Converting Data into a Spectrogram

The following function takes an audio file and converts it to 16 kHz mono. Since we know that the average call is about 3 s in length we can limit our data to the first 48000 samples of each parsed audio file.

Since our minimum file length was 32000 samples, we need to make sure that every shorter file is padded with zeros up to a length of 48000 using the tf.zeros function:

def preprocess(file_path, label):
    # Load the file as 16kHz mono
    wav = load_wav_16k_mono(file_path)
    # Only keep the first 3 secs (48000 samples)
    wav = wav[:48000]
    # If the file is shorter than 3s, pad with zeros
    zero_padding = tf.zeros([48000] - tf.shape(wav), dtype=tf.float32)
    wav = tf.concat([zero_padding, wav], 0)
    # Use the Short-time Fourier Transform
    spectrogram = tf.signal.stft(wav, frame_length=320, frame_step=32)
    # Convert to absolute values (magnitude, no negatives)
    spectrogram = tf.abs(spectrogram)
    # Add a channel dimension (needed by the CNN later)
    spectrogram = tf.expand_dims(spectrogram, axis=2)
    return spectrogram, label

To create the spectrogram we can use the Short-time Fourier Transformation provided by Tensorflow.

Note: the dimensions of each spectrogram are (1491, 257, 1) - the height and width of the image representation plus an additional channel added with the tf.expand_dims function. This channel does not hold any information here but is expected to exist by the Tensorflow model we are going to use later.

Test Function

Pick a random audio file from the positives dataset:

filepath, label = positives.shuffle(buffer_size=10000).as_numpy_iterator().next()

And run it through the pre-processing function:

spectrogram, label = preprocess(filepath, label)
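As a quick sanity check (not part of the original notebook) we can confirm the dimensions mentioned in the note above:

spectrogram.shape
# TensorShape([1491, 257, 1])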

Let's see what the spectrum actually looks like:

plt.figure(figsize=(30,20))
plt.imshow(tf.transpose(spectrogram)[0])
plt.show()

(Figure: spectrogram of a positive sample - a Capuchinbird call)

For comparison, this is the spectrogram of a negative sample:

(Figure: spectrogram of a negative sample - background noise)

Now all we have to do is train a Tensorflow model that is able to distinguish between those two image representations of our audio data.

Preparing a Testing & Training Dataset

We already wrapped all of our data - positives and negatives - into the data variable, so we can now map over this dataset and feed each file to the preprocessing function:

data = data.map(preprocess)
data = data.cache()
data = data.shuffle(buffer_size=1000)
data = data.batch(16)
data = data.prefetch(8)

To optimize training, the data is cached and shuffled, batched into groups of 16 spectrograms, and 8 batches are prefetched. These values can be increased if your CPU/GPU can handle it.

To split our data into training and testing data we can run:

train = data.take(36)
test = data.skip(36).take(15)

With len(data) = 51 batches, the split ratio is close to 70:30. We can verify the content of each set as follows:

Training Set

samples, labels = train.as_numpy_iterator().next()
labels

array([0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
dtype=float32)

Testing Set

samples, labels = test.as_numpy_iterator().next()
labels

array([0., 0., 1., 0., 0., 1., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0.],
dtype=float32)

Build the Deep Learning Model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten

Sequential Model

model = Sequential()
model.add(Conv2D(8, (3,3), activation='relu', input_shape=(1491,257,1)))
model.add(Conv2D(8, (3,3), activation='relu'))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile('Adam', loss='BinaryCrossentropy', metrics=[tf.keras.metrics.Recall(),tf.keras.metrics.Precision()])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 1489, 255, 8)      80

conv2d_1 (Conv2D)            (None, 1487, 253, 8)      584

flatten (Flatten)            (None, 3009688)           0

dense (Dense)                (None, 32)                96310048

dense_1 (Dense)              (None, 1)                 33

=================================================================
Total params: 96,310,745
Trainable params: 96,310,745
Non-trainable params: 0
_________________________________________________________________

Training the Model

hist = model.fit(train, epochs=4, validation_data=test)

Already after 4 epochs we are getting precision and recall values close to 100%:

Epoch 4/4
36/36 [==============================] - 6s 177ms/step - loss: 0.0091 - recall: 0.9870 - precision: 1.0000 - val_loss: 0.0157 - val_recall: 0.9844 - val_precision: 1.0000
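The training history stored in hist can be plotted as well; a minimal sketch, using the metric names from the log output above:

plt.title('Loss')
plt.plot(hist.history['loss'], 'r')
plt.plot(hist.history['val_loss'], 'b')
plt.show()

plt.title('Precision & Recall')
plt.plot(hist.history['precision'], 'r')
plt.plot(hist.history['recall'], 'b')
plt.show()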


Making Predictions

We can now take one batch of 16 samples from our testing data - data we excluded from the training dataset - and run a prediction against it:

X_test, y_test = test.as_numpy_iterator().next()


The yhat variable gives us the probability of each sample being the recording of a birdcall:

yhat = model.predict(X_test)

Every value that is close to 1.00000000e+00 is a recording that, with a very high probability, contains the birdcall we are looking for:

array([[9.99920249e-01],
[3.07901189e-08],
[1.00000000e+00],
[2.84484369e-25],
[9.62874225e-09],
[9.99707282e-01],
[1.53503152e-05],
[2.34340667e-03],
[8.77155067e-28],
[1.00000000e+00],
[1.14530545e-08],
[5.61734714e-06],
[3.27430301e-20],
[1.99461836e-11],
[0.00000000e+00],
[0.00000000e+00]], dtype=float32)

To make this more readable we can binarize the predictions using a confidence threshold of 0.5:

yhat = [1 if prediction > 0.5 else 0 for prediction in yhat]

Which looks like this:

[1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

We can verify that this result is correct by calculating the sum of yhat and comparing it with the sum of y_test - which are our labels that we set to 1 for positives and 0 for negatives:
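A minimal sketch of that check (both sums should report the same number of positives, here 4):

sum(yhat)
# 4
y_test.sum()
# 4.0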


And it seems that we were able to identify all 4 of them! But we should also rule out false-positives/negatives by comparing both arrays:
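One way to do that element-wise comparison (a small sketch, assuming numpy is available):

import numpy as np

(np.array(yhat) == y_test).all()
# True when there are no false positives or false negatives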


And we have a match - all the zeros and ones are where they belong.

Putting the Model to Work

We now have a well-performing model that is able to recognize bird calls, so we can let it loose on the RAW forest recordings in the data/raw directory. First we need a function that loads these recordings - which this time come in mp3 containers. Instead of the tf.io.read_file function we now use tfio.audio.AudioIOTensor. We handle the stereo recording by summing both channels and dividing every value by 2, i.e. averaging them into a single mono channel:

def load_mp3_16k_mono(filename):
    # Load an mp3 file, convert it to a float tensor and resample it to 16 kHz single-channel audio
    res = tfio.audio.AudioIOTensor(filename)
    # Convert to tensor and combine the stereo channels into one by averaging
    tensor = res.to_tensor()
    tensor = tf.math.reduce_sum(tensor, axis=1) / 2
    # Extract the sample rate and cast it
    sample_rate = res.rate
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)
    # Resample to 16 kHz
    wav = tfio.audio.resample(tensor, rate_in=sample_rate, rate_out=16000)
    return wav

Test Function

We can test the loading function by loading a single file:

mp3 = os.path.join('data', 'raw', 'recording_00.mp3')
wav = load_mp3_16k_mono(mp3)

Since the RAW recordings are much longer than the neatly parsed training recordings - which were already cut down to the length of a typical bird call - we now slice the file into short sequences of the same length, 48000 samples (3 s at 16 kHz):

audio_slices = tf.keras.utils.timeseries_dataset_from_array(wav, wav, sequence_length=48000, sequence_stride=48000, batch_size=1)

We can check the number of slices that were generated with len(audio_slices) - in case of the file recording_00.mp3 we end up with 60 sequences.
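For example (the exact count depends on the length of the recording):

len(audio_slices)
# 60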

Data Preprocessing

The pre-processing step is almost identical to before. Since each batch contains a single sequence, we take the first element of the batch, zero-pad it up to 48000 samples if it is shorter (only the last slice can be), and then generate the spectrogram as before:

def preprocess_mp3(sample, index):
    # Take the single sequence out of the batch
    sample = sample[0]
    # Pad shorter slices with zeros up to 48000 samples
    zero_padding = tf.zeros([48000] - tf.shape(sample), dtype=tf.float32)
    wav = tf.concat([zero_padding, sample], 0)
    # Short-time Fourier Transform and channel dimension, as before
    spectrogram = tf.signal.stft(wav, frame_length=320, frame_step=32)
    spectrogram = tf.abs(spectrogram)
    spectrogram = tf.expand_dims(spectrogram, axis=2)
    return spectrogram

Now we can create the audio slices and run them through the preprocessing function:

audio_slices = tf.keras.utils.timeseries_dataset_from_array(wav, wav, sequence_length=48000, sequence_stride=48000, batch_size=1)
audio_slices = audio_slices.map(preprocess_mp3)
audio_slices = audio_slices.batch(64)

Running a Prediction

With the 60 slices in place we can now run our prediction model against them - this time I will bump the confidence threshold up to 90% to make sure we don't catch any background noise:

yhat = model.predict(audio_slices)
yhat = [1 if prediction > 0.9 else 0 for prediction in yhat]

The yhat variable now contains 60 zeros and ones, depending on whether the tested sequence contained a bird call or not:

len(yhat)
60

yhat
[0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0]

But we can see that there are consecutive detections, most likely caused by a single bird call being sliced in two. Currently we would count 8 bird calls:

tf.math.reduce_sum(yhat)
<tf.Tensor: shape=(), dtype=int32, numpy=8>

Grouping Consecutive Hits

from itertools import groupby

# Collapse runs of consecutive identical values, then count the remaining ones
yhat = [key for key, group in groupby(yhat)]
calls = tf.math.reduce_sum(yhat).numpy()
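To illustrate what the groupby step does, here is a small standalone example with made-up values:

from itertools import groupby

detections = [0, 0, 1, 1, 0, 0, 1, 0]
collapsed = [key for key, group in groupby(detections)]  # [0, 1, 0, 1, 0]
sum(collapsed)
# 2 -> two distinct calls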

This reduces the number of detections to 5 - which we can confirm by listening to the recording_00.mp3 file and counting using finger technology:

calls
5

Processing the RAW Data

Now with everything in place we can loop through every file in the raw directory and count our detections:

results = {}
for file in os.listdir(os.path.join('data', 'raw')):
    FILEPATH = os.path.join('data', 'raw', file)

    wav = load_mp3_16k_mono(FILEPATH)
    audio_slices = tf.keras.utils.timeseries_dataset_from_array(wav, wav, sequence_length=48000, sequence_stride=48000, batch_size=1)
    audio_slices = audio_slices.map(preprocess_mp3)
    audio_slices = audio_slices.batch(64)

    yhat = model.predict(audio_slices)

    results[file] = yhat

This returns the raw prediction scores for every slice in every recording.


Again we can make this more readable by turning every prediction with a confidence above 99% into a one and the rest into zeros:

class_preds = {}
for file, logits in results.items():
    class_preds[file] = [1 if prediction > 0.99 else 0 for prediction in logits]
class_preds

And group all consecutive detections to make sure we don't double-count them:

postprocessed = {}
for file, scores in class_preds.items():
    postprocessed[file] = tf.math.reduce_sum([key for key, group in groupby(scores)]).numpy()
postprocessed

Now we end up with the sum of all calls inside each file:

{'recording_00.mp3': 5,
'recording_01.mp3': 0,
'recording_02.mp3': 0,
'recording_03.mp3': 0,
'recording_04.mp3': 4,
'recording_05.mp3': 0,
'recording_06.mp3': 5,
'recording_07.mp3': 2,
'recording_08.mp3': 25,
'recording_09.mp3': 0,
'recording_10.mp3': 5,
'recording_11.mp3': 3,
'recording_12.mp3': 0,
'recording_13.mp3': 0,
'recording_14.mp3': 0,
'recording_15.mp3': 2,
'recording_17.mp3': 3,
'recording_18.mp3': 5,
'recording_19.mp3': 0,
'recording_20.mp3': 0,
'recording_21.mp3': 1,
'recording_22.mp3': 2,
'recording_23.mp3': 5,
'recording_24.mp3': 0,
'recording_25.mp3': 16,
'recording_26.mp3': 2,
'recording_27.mp3': 0,
'recording_28.mp3': 16,
'recording_29.mp3': 0,
'recording_30.mp3': 2,
'recording_31.mp3': 1,
'recording_32.mp3': 2,
'recording_34.mp3': 4,
'recording_35.mp3': 0,
'recording_36.mp3': 0,
'recording_37.mp3': 3,
'recording_38.mp3': 1,
'recording_39.mp3': 9,
'recording_40.mp3': 1,
'recording_41.mp3': 0,
'recording_42.mp3': 0,
'recording_43.mp3': 5,
'recording_44.mp3': 1,
'recording_45.mp3': 3,
'recording_46.mp3': 17,
'recording_47.mp3': 16,
'recording_48.mp3': 4,
'recording_49.mp3': 0,
'recording_51.mp3': 3,
'recording_52.mp3': 0,
'recording_53.mp3': 0,
'recording_54.mp3': 3,
'recording_55.mp3': 0,
'recording_56.mp3': 16,
'recording_57.mp3': 3,
'recording_58.mp3': 0,
'recording_59.mp3': 15,
'recording_60.mp3': 4,
'recording_61.mp3': 11,
'recording_62.mp3': 0,
'recording_63.mp3': 17,
'recording_64.mp3': 2,
'recording_65.mp3': 5,
'recording_66.mp3': 0,
'recording_16.mp3': 5,
'recording_33.mp3': 0,
'recording_50.mp3': 0,
'recording_67.mp3': 0,
'recording_68.mp3': 1,
'recording_69.mp3': 1,
'recording_70.mp3': 4,
'recording_71.mp3': 5,
'recording_72.mp3': 4,
'recording_73.mp3': 0,
'recording_74.mp3': 0,
'recording_75.mp3': 1,
'recording_76.mp3': 0,
'recording_77.mp3': 3,
'recording_78.mp3': 10,
'recording_79.mp3': 0,
'recording_80.mp3': 1,
'recording_81.mp3': 5,
'recording_82.mp3': 0,
'recording_83.mp3': 0,
'recording_84.mp3': 16,
'recording_85.mp3': 0,
'recording_86.mp3': 17,
'recording_87.mp3': 24,
'recording_88.mp3': 0,
'recording_89.mp3': 5,
'recording_90.mp3': 0,
'recording_91.mp3': 0,
'recording_92.mp3': 0,
'recording_93.mp3': 5,
'recording_94.mp3': 3,
'recording_95.mp3': 4,
'recording_96.mp3': 1,
'recording_97.mp3': 4,
'recording_98.mp3': 21,
'recording_99.mp3': 5}

Export the Results

import csv

with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(['recording', 'capuchin_calls'])
    for key, value in postprocessed.items():
        writer.writerow([key, value])