Building Speaker Recognition Systems and Diarization Using d-vectors


Karan Purohit · Published in Saarthi.ai · 7 min read · Feb 11, 2020

Photo by Hrayr Movsisyan on Unsplash

In this article, you will learn how to build a Speaker Recognition System from scratch, and a Speaker Diarization System on top of it.

What is Speaker Recognition?

Speaker Recognition helps us answer the question “Who is speaking?”. The process involved in any Speaker Recognition System can be split into two parts: Speaker Identification and Speaker Verification.

First, let us understand what these terms mean.

Speaker Identification

The goal of Speaker Identification is to match a voice sample from an unknown speaker with one of several enrolled (labelled) speakers. We take the speech sample of an unknown speaker and determine which enrolled speaker best matches it. Here, the user is not claiming any identity.

Speaker Verification

In the Speaker Verification process, the system takes the speech of an unknown speaker along with his/her claimed identity, and it determines whether the claimed identity matches the speech. That is why, in this case, the voice sample is compared only with the speaker model of the claimed identity.

DNN-based d-vectors

As our approach is to use d-vectors to build our Speaker Identification and Diarization systems, let’s also take a minute to understand where d-vectors come from and what they are.

As the name suggests, a d-vector is a vector computed from audio. If you calculate the d-vector for a speaker, you can use it as a voice fingerprint for that speaker.

A d-vector is extracted using a DNN. The DNN takes stacked filterbank features as input, and the d-vector is the averaged activation of the last hidden layer of this DNN. You will see how to obtain the d-vector later on in this article.
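To make this concrete, here is a minimal sketch of the averaging step, assuming a trained PyTorch model whose forward pass returns the last-hidden-layer activations for each input frame (the function and the model interface below are illustrative, not taken from the original article):

import torch

def extract_dvector(model, frames):
    # model  : trained speaker DNN returning last-hidden-layer activations (hypothetical API)
    # frames : tensor of stacked filterbank features, shape (num_frames, feat_dim)
    model.eval()
    with torch.no_grad():
        hidden_out = model(frames)        # (num_frames, hidden_dim)
    d_vector = hidden_out.mean(dim=0)     # average over frames -> (hidden_dim,)
    return d_vector / d_vector.norm()     # length-normalise for cosine scoring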

Data

As with any deep learning project, you need to invest most of your time in data collection and preparation. Here, all you need are voice samples of different speakers; a voice sample can contain any utterance spoken by that speaker. You will need at least 12–15 seconds of training material for each speaker, and test sentences lasting 2–6 seconds are sufficient.

Collecting a high-quality dataset for speaker recognition is a challenge in itself. To build a baseline, we collected data from YouTube videos, targeting only videos that feature a single speaker.

As we are building a Speaker Recognition System for the Hindi language, it took us a fair amount of time to collect enough suitable data.

Now, let’s look at how we can pre-process this data.

Data Preprocessing

Photo by Chris Liverani on Unsplash

After collecting the audio samples, it is time to sample them. We cut clips of random lengths between 5 and 10 seconds. Random sampling helps the model generalize.
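As an illustration, random 5–10 second clips can be cut with librosa and NumPy roughly like this. File names and the number of clips are made up, and librosa.output.write_wav matches the older librosa API used elsewhere in this article (it was removed in librosa 0.8+):

import numpy as np
import librosa

# hypothetical long recording of one speaker, assumed longer than 10 seconds
audio, sr = librosa.load('speaker1_full.wav', sr=8000, mono=True)

for i in range(10):                                  # number of clips is arbitrary
    clip_len = np.random.randint(5 * sr, 10 * sr)    # random length between 5 and 10 s
    start = np.random.randint(0, len(audio) - clip_len)
    clip = audio[start:start + clip_len]
    librosa.output.write_wav(f'speaker1_clip{i}.wav', clip, sr=8000)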

If you have been working on speech recognition systems for some time, you will have noticed that much of the research relies on simulated data. Researchers often create simulated data and add it to real-world data to improve their results.

To increase the number of samples in your dataset, you can always use data augmentation. For speech data there are many ways to augment; in our project, adding white noise, changing pitch, and adding background noises proved most useful. There are already some resources available to guide you.

Data augmentation

After going through several Python libraries for speech, I found Librosa best for our use case. It has many methods that come in handy when preprocessing the data.

Apart from Librosa, for data augmentation, nlpaug is also quite good. To make things easy for you, I have provided some code snippets for data augmentation.

Note that all audio data should be mono with an 8000 Hz sampling rate. While augmenting, also make sure the ambient sounds are mono at 8000 Hz.

Adding white noise

(Audio: sample audio)

import librosa
import nlpaug.augmenter.audio as naa

file = 'sample.wav'
out = 'noised_sample.wav'
audio, sampling_rate = librosa.load(file, sr=8000)
aug_noise = naa.NoiseAug(noise_factor=0.008)
augmented_noise = aug_noise.substitute(audio)
librosa.output.write_wav(out, augmented_noise, sr=8000)

(Audio: after adding white noise to the sample audio)

Changing pitch

file = 'sample.wav'
out = 'pitched.wav'
audio, sampling_rate = librosa.load(file, sr=8000)
aug_pitch = naa.PitchAug(sampling_rate=sampling_rate, pitch_range=(1, 2))
augmented_pitch = aug_pitch.substitute(audio)
librosa.output.write_wav(out, augmented_pitch, sr=8000)

(Audio: after changing the pitch of the sample audio)

You can change the parameters noise_factor and pitch_range based on your requirements.

Adding background sounds

You can add background sounds like the sound of a chair, a clock, a door, a barking dog, moving cars, etc. to your audio. There are thousands of background noise samples out there that can make your dataset more real-world-oriented.

To add a background sound, you need to superimpose the background noise onto your audio sample. For this, both audio files should be the same length, which is usually not the case. That’s why you need to pad the shorter audio sample with zeros. Simple, isn’t it?

import numpy as np
import librosa

file = 'sample.wav'        # speech clip
out = 'ambient_sound.wav'
n_file = 'noise.wav'       # background noise sample (path is illustrative)

y1, sample_rate1 = librosa.load(n_file, mono=True, sr=8000)   # noise
y2, sample_rate2 = librosa.load(file, mono=True, sr=8000)     # speech

if y1.shape[0] > y2.shape[0]:      # noise is longer, so pad the speech
    r = np.zeros(y1.shape)         # padding with zeros
    r[:y2.shape[0]] = y2
    amp = (r + y1) / 2             # superimpose the two signals

librosa.output.write_wav(out, amp, sr=int((sample_rate1 + sample_rate2) / 2))

Tip 💡: While augmenting your data, add noise at random. Don’t inject white noise or ambient sounds into every sample of your data. Instead, select your audio files at random and pick the type of noise randomly too!
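A rough sketch of that random selection, where the clips/ folder and the 30% augmentation ratio are purely illustrative:

import glob
import random

files = glob.glob('clips/*.wav')
to_augment = random.sample(files, k=int(0.3 * len(files)))   # augment only ~30% of clips

for f in to_augment:
    noise_type = random.choice(['white_noise', 'pitch', 'background'])
    # call the corresponding augmentation from the snippets above
    print(f, '->', noise_type)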

Time to blow your GPU 🔥

If you have come this far, congrats!

It’s time to take some rest and let your GPU do the job.

To get the d-vectors, we are going to use SincNet. SincNet is a neural network architecture for processing raw audio samples. To understand the architecture of the network and the math behind it, you can read our blog post linked below. We chose SincNet because of its:

Fast convergence
Fewer parameters to train
Interpretability

Quick read: An overview of Speaker Recognition with SincNet

The GitHub repo of SincNet covers all the information required to train the model. There is a cw_len (context window length) parameter that can be tuned based on the dataset to get good results. After the first 30–40 epochs, you will see your model converging.

Once your model is ready, you can do speaker identification by simply passing a sample audio file through the trained network. You will get as output the speaker id that corresponds to that speaker.
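For illustration, assuming the trained network outputs per-frame logits over the enrolled speakers (the wrapper below is hypothetical, not part of the SincNet repo), identification reduces to an argmax over the averaged posteriors:

import torch

def identify_speaker(model, frames):
    # model  : trained classifier producing per-frame logits over speakers (hypothetical API)
    # frames : tensor of framed features for the test utterance, shape (num_frames, feat_dim)
    model.eval()
    with torch.no_grad():
        logits = model(frames)                              # (num_frames, num_speakers)
        posteriors = torch.softmax(logits, dim=1).mean(dim=0)
    return int(posteriors.argmax())                         # id of the best-matching speaker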

You get the d-vector from the output of the last hidden layer. Once you have d-vectors, you can do speaker verification. First, prepare the d-vectors for the speakers already present in your database.

You can use your test data to get d-vectors. Once a sample audio file is passed through the network, you calculate its d-vector and find the cosine similarity between the d-vector of the sample audio and the d-vector of the claimed speaker. If the similarity is above a threshold, you can say the sample audio is verified as the claimed speaker.
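A minimal verification sketch, assuming the d-vectors are 1-D PyTorch tensors and the threshold has been tuned on held-out data (the 0.7 below is only a placeholder):

import torch
import torch.nn.functional as F

def verify_speaker(test_dvector, claimed_dvector, threshold=0.7):
    # accept the claim if the cosine similarity exceeds the threshold
    similarity = F.cosine_similarity(test_dvector.unsqueeze(0),
                                     claimed_dvector.unsqueeze(0)).item()
    return similarity > threshold, similarity

# usage (illustrative): accepted, score = verify_speaker(dvec_test, enrolled_dvectors['speaker_3'])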

Speaker Diarization

Photo by Joshua Ness on Unsplash

In short, Speaker Diarization means finding “who spoke when” in a given audio file, essentially segmenting the audio by speaker. If there are only two speakers in the audio, the task is relatively easy compared to audio with more than two speakers.

Our approach is not end-to-end Speaker Diarization. Limitations of our approach:

There should be only two speakers in the audio
You already have a d-vector for each speaker

So, given an audio file, if you already have the d-vectors of both speakers, you can use them to find the points where the speaker changes. From the given audio, we take a 2-second window with a stride of 0.2 seconds and slide it through the audio. The optimal window length and stride were found by testing various values. After this, we calculate the d-vector of every window obtained by sliding.

After getting the d-vector for each window, we find the cosine similarity between each window’s d-vector and each speaker’s d-vector. Based on the cosine similarity, we can find the switching points between speakers.
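Putting the sliding window and cosine scoring together, a rough sketch might look like the following, assuming 8000 Hz audio, a get_dvector helper like the one sketched earlier, and the two enrolled speakers’ d-vectors (all names here are illustrative):

import librosa
import torch
import torch.nn.functional as F

def diarize(wav_path, dvec_a, dvec_b, get_dvector, sr=8000, win_s=2.0, stride_s=0.2):
    # label each 2 s window as speaker A or B by cosine similarity to the enrolled d-vectors
    audio, _ = librosa.load(wav_path, sr=sr, mono=True)
    win, stride = int(win_s * sr), int(stride_s * sr)
    labels = []
    for start in range(0, len(audio) - win, stride):
        d = get_dvector(audio[start:start + win])           # d-vector of this window
        sim_a = F.cosine_similarity(d.unsqueeze(0), dvec_a.unsqueeze(0)).item()
        sim_b = F.cosine_similarity(d.unsqueeze(0), dvec_b.unsqueeze(0)).item()
        labels.append(('A' if sim_a > sim_b else 'B', start / sr))
    # a switching point is wherever the label differs from the previous window
    changes = [t for (l, t), (lp, _) in zip(labels[1:], labels[:-1]) if l != lp]
    return labels, changes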


