Active and Semi-Supervised Learning for Speech Recognition


Type

Thesis

Authors

Kreyssig, Florian 

Abstract

Recent years have seen significant advances in speech recognition technology, which can largely be attributed to the combination of the rise of deep learning in speech recognition and an increase in computing power. The increase in computing power enabled the training of models on ever-expanding data sets, and deep learning allowed for better exploitation of these large data sets.

For commercial products, training on multiple thousands of hours of transcribed audio is common practice. However, the manual transcription of audio comes with a significant cost, and the development of high-performance systems is typically limited to commercially viable tasks and languages. To promote the use of speech recognition technology across different languages and make it more accessible, it is crucial to minimise the amount of transcribed audio required for training. This thesis addresses this issue by exploring various approaches to reduce the reliance on transcribed data in training automatic speech recognition systems through novel methods for active learning and for semi-supervised learning.

For active learning, this thesis proposes a method based on a Bayesian framework, termed NBest-BALD. NBest-BALD builds on Bayesian Active Learning by Disagreement (BALD) and selects utterances based on the mutual information between the model parameters θ and the prediction w for an utterance x_i, given the labelled data D_l, i.e. I[θ, w | D_l, x_i]. Monte-Carlo Dropout is used to approximate sampling from the posterior over the model parameters, and an N-best list is used to approximate the entropy over the hypothesis space. Experiments on English conversational telephone speech showed that NBest-BALD outperforms random sampling and prior active learning methods that use confidence scores or the NBest-Entropy as the informativeness measure. NBest-BALD increases the absolute Word Error Rate (WER) reduction obtained from selecting more data by up to 14% compared to random selection.
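To make the selection criterion concrete, a minimal NumPy sketch of the informativeness score is given below. It assumes that, for each utterance, posterior probabilities over the N-best hypotheses have been computed for several Monte-Carlo dropout samples of the model; the function name and data layout are illustrative, not taken from the thesis.

```python
import numpy as np

def nbest_bald_score(hyp_probs):
    """Approximate I[theta, w | D_l, x_i] for one utterance (sketch).

    hyp_probs: array of shape (S, N) -- posterior probabilities of the
    N-best hypotheses under S Monte-Carlo dropout samples of the model.
    Each row is renormalised so the N-best list stands in for the full
    hypothesis space.
    """
    p = hyp_probs / hyp_probs.sum(axis=1, keepdims=True)
    p_mean = p.mean(axis=0)
    # Entropy of the dropout-averaged predictive distribution.
    h_of_mean = -np.sum(p_mean * np.log(p_mean + 1e-12))
    # Expected entropy of the per-sample predictive distributions.
    mean_of_h = -np.sum(p * np.log(p + 1e-12), axis=1).mean()
    # Mutual information: large when individual samples are confident
    # but disagree with each other, i.e. epistemic uncertainty is high.
    return h_of_mean - mean_of_h
```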

Furthermore, a novel method for encouraging representativeness in active data selection for speech recognition was developed. The method first builds a histogram over the lengths (in words) of the utterances in the pool. To select an utterance, a length is sampled from the histogram, and the utterance with the highest informativeness within the corresponding histogram bin is chosen. This ensures that the selected data set has a distribution of utterance lengths similar to that of the overall data set. For mini-batch acquisition in active learning on English conversational telephone speech, the method significantly improves the performance of active learning for the first batch. Histogram-based sampling increases the absolute WER reduction obtained from selecting more data by up to 57% compared to random selection and by up to 50% compared to an approach that uses informativeness alone.
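A minimal sketch of one such selection step follows, assuming the candidate pool is given as (utterance id, length, informativeness) triples; the bin count and data layout are assumptions made for illustration.

```python
import random
from collections import defaultdict

def histogram_select(pool, num_bins=20):
    """Pick one utterance from `pool`, a list of (utt_id, length, info)
    triples, so that selected lengths mirror the pool's distribution."""
    lengths = [length for _, length, _ in pool]
    lo, hi = min(lengths), max(lengths)
    width = (hi - lo) / num_bins or 1.0   # guard against zero-width bins
    bins = defaultdict(list)
    for utt in pool:
        idx = min(int((utt[1] - lo) / width), num_bins - 1)
        bins[idx].append(utt)
    # Sample a bin in proportion to its occupancy, so the selected set
    # follows the length histogram of the whole pool.
    keys = list(bins)
    chosen = random.choices(keys, weights=[len(bins[k]) for k in keys])[0]
    # Within the bin, take the most informative utterance.
    best = max(bins[chosen], key=lambda u: u[2])
    pool.remove(best)
    return best
```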

A further contribution to active learning in speech recognition was the definition of a cost function that takes into account the sequential nature of conversations and meetings. The cost function combines a Real-Time Factor (RTF) applied to the utterance length (in seconds) with an overhead t1 for each utterance and an overhead t2 for each chunk of consecutive utterances. Given this cost function, the level of granularity at which data should be selected was examined: selection at the utterance level, as fixed-length chunks of consecutive utterances, as variable-length chunks of consecutive utterances, and at the side level. The overhead t2 affects utterance-level selection (on which previous methods in the literature rely) the most, and this level of granularity yielded the worst speech recognition performance. This result showed that it is crucial to focus on selection methods that can take a better cost function into account.
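In code, the cost function could be sketched as follows; the concrete values for the RTF and for the overheads t1 and t2 are placeholders rather than figures from the thesis.

```python
def transcription_cost(chunks, rtf=5.0, t1=2.0, t2=10.0):
    """Estimated annotator time (in seconds) for transcribing a selection.

    chunks: list of chunks of consecutive utterances, each given as a
    list of utterance durations in seconds. rtf, t1 and t2 are
    illustrative values, not taken from the thesis.
    """
    cost = 0.0
    for chunk in chunks:
        cost += t2                        # overhead per chunk of consecutive utterances
        for duration in chunk:
            cost += t1 + rtf * duration   # per-utterance overhead + transcription time
    return cost

# Under utterance-level selection every utterance forms its own chunk,
# so t2 is paid for each one:
# transcription_cost([[3.2], [5.1]]) > transcription_cost([[3.2, 5.1]])
```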

For semi-supervised learning, the novel algorithm cosine-distance virtual adversarial training (CD-VAT) was developed. Whilst not directed at speech recognition, this technique was inspired by initial work towards using consistency regularisation for speech recognition. CD-VAT allows for semi-supervised training of speaker-discriminative acoustic embeddings without requiring that the set of speakers be the same for the labelled and the unlabelled data. CD-VAT is a form of consistency regularisation in which the supervised training loss is interpolated with an unsupervised loss: the CD-VAT loss, which smooths the model's embeddings with respect to the input, as measured by the cosine distance between the embeddings computed with and without adversarial noise. For a large-scale speaker verification task, it was shown that CD-VAT recovers 32.5% of the Equal Error Rate (EER) improvement that would be obtained if all speaker labels were available for the unlabelled data.
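A sketch of the unsupervised loss term is shown below, assuming a PyTorch model that maps an input batch to speaker embeddings; the single power-iteration step and the hyperparameters xi and eps follow the standard VAT recipe and are illustrative. The full training loss would interpolate this term with the supervised loss, as described above.

```python
import torch
import torch.nn.functional as F

def cd_vat_loss(model, x, xi=1e-6, eps=2.0):
    """Cosine-distance VAT loss on an unlabelled batch `x` (sketch).

    Finds a small adversarial perturbation of the input and penalises
    the cosine distance between the embeddings computed with and
    without that perturbation.
    """
    with torch.no_grad():
        clean_emb = model(x)                     # embeddings without noise
    # One power-iteration step to estimate the adversarial direction.
    d = torch.randn_like(x)
    d = xi * F.normalize(d.flatten(1), dim=1).view_as(x)
    d.requires_grad_(True)
    dist = 1.0 - F.cosine_similarity(clean_emb, model(x + d), dim=-1).mean()
    grad = torch.autograd.grad(dist, d)[0]
    r_adv = eps * F.normalize(grad.flatten(1), dim=1).view_as(x)
    # Consistency term: smooth the embeddings against the perturbation.
    return 1.0 - F.cosine_similarity(clean_emb, model(x + r_adv), dim=-1).mean()
```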

For semi-supervised learning for speech recognition, this thesis proposes two methods to improve the input tokenisation used to derive the training targets for masked-prediction pre-training, a form of self-supervised learning. The first method is biased self-supervised learning. Instead of clustering the embeddings of a model trained purely without supervision, it clusters the embeddings of a model that was fine-tuned for a small number of updates on the small amount of supervised data that is available in any semi-supervised learning scenario. This fine-tuning ensures that the self-supervised learning task is specialised towards the task for which the model will ultimately be used. Experiments on English read speech showed that biased self-supervised learning can reduce the WER by up to 24% over the unbiased baseline. The second method replaces the K-Means clustering algorithm previously used to tokenise the input with a Hidden Markov Model (HMM). After training, the tokenisation of the input is performed using the Viterbi algorithm. The result is a tokenisation algorithm that takes the sequential nature of the data into account and can temporally smooth the tokenisation. On the same English read speech task, the HMM-based tokenisation reduces the WER by up to 6% compared to the tokenisation that uses K-Means.
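The contrast between the two tokenisers can be sketched as follows, with scikit-learn and hmmlearn standing in for the thesis implementation; the number of tokens and the diagonal covariance are illustrative choices.

```python
from sklearn.cluster import KMeans
from hmmlearn import hmm

def tokenise(features, n_tokens=100, use_hmm=True):
    """Map a (T, D) matrix of frame-level embeddings to a length-T
    sequence of discrete pre-training targets (sketch)."""
    if use_hmm:
        # HMM tokenisation: the transition model lets Viterbi decoding
        # temporally smooth the token sequence across frames.
        model = hmm.GaussianHMM(n_components=n_tokens, covariance_type="diag")
        model.fit(features)
        return model.predict(features)   # Viterbi decoding by default
    # K-Means baseline: each frame is assigned a token independently.
    return KMeans(n_clusters=n_tokens).fit_predict(features)
```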

Date

2023-03-31

Advisors

Woodland, Philip

Keywords

Active Learning, Machine Learning, Semi-supervised learning, Speech Processing, Speech recognition

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

EPSRC (2107214)
EPSRC Studentship