Detection of the uniqueness of a human voice: towards machine learning for improved data efficiency

Kinkiri, Saritha (2021) Detection of the uniqueness of a human voice: towards machine learning for improved data efficiency. PhD thesis, University of Greenwich.

Saritha Kinkiri 2021.pdf - Published Version
Available under License Creative Commons Attribution.

Abstract

The aim of this thesis is to characterise the voice features that can establish the identity of the person who is speaking, independent of the language used. The fundamental goal of the work is to understand how humans recognise a speaker. The work considers voice parameters such as speech rate, natural pauses, intended or unintended speaker pauses, fundamental frequencies, phoneme generation, and volume, since the combination of all these parameters cannot easily be imitated by another person. It is assumed that different speakers speak differently; however, it is important to remember that the same speaker's voice also changes over time. For example, a speaker cannot say the same thing in exactly the same way time after time. These variations in speech are nevertheless audible and can be measured by using combinations of voice parameters.

The aim is to eliminate speakers who are not the person being sought. Individuals use words to communicate with each other, and they use the same method to communicate with machines. Humans successfully use speech-to-text software to talk to telephones instead of tapping words on a keyboard. However, although machines have proven good at converting speech to text, they are not good at identifying who is speaking.

Problems remain in recognising an individual from their speech in a way that is reliable, repeatable and robust; otherwise the speaker could, for example, find themselves locked out of their voice-accessed online account. The risks are also asymmetric: if one in 100 people is locked out of an account, that is not too serious, as customer services will ask for answers to security questions. However, if one in 100 people gets into a bank account fraudulently, this is a bigger problem.
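The asymmetry between the two kinds of error can be illustrated with a minimal sketch. Everything in it is an assumption for illustration rather than material from the thesis: the similarity scores are invented, and the function names `error_rates` and `expected_cost` and the cost weights are hypothetical.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_reject_rate, false_accept_rate) at a decision threshold.

    A genuine speaker is rejected when their score falls below the threshold;
    an impostor is accepted when their score reaches it.
    """
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    return (false_rejects / len(genuine_scores),
            false_accepts / len(impostor_scores))


def expected_cost(frr, far, cost_reject=1.0, cost_accept=100.0):
    """Weight the two error rates by (hypothetical) unequal costs."""
    return frr * cost_reject + far * cost_accept


# Invented similarity scores, for illustration only.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.5, 0.3]

for threshold in (0.45, 0.6):
    frr, far = error_rates(genuine, impostor, threshold)
    print(threshold, frr, far, expected_cost(frr, far))
```

With these invented numbers, raising the threshold removes the fraudulent acceptance without locking out any additional genuine speakers, so the weighted cost drops sharply. Real score distributions overlap more, and the threshold then has to trade one error against the other.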

A speaker’s voice varies in frequency, tone, and volume sufficiently to identify an individual uniquely. Other factors also contribute to this uniqueness: the size and shape of the mouth, throat, nose, and vocal cords. Sound is produced by air passing from the lungs through the throat and vocal cords and then the mouth. A voice makes different sounds depending on the position of the mouth and throat. It is the variation of these attributes that allows for identification.

Speaker recognition systems are already available, but their overall accuracy is limited by several issues: features are extracted from very short time windows of speech, and models fail to capture useful information about a speaker because current speech recognition systems and extracted features are language-dependent. By using the voice parameters, the work here was able to eliminate 80 percent of the population when identifying a person. Recognising 1 out of 100 is difficult, but identifying 1 out of 5 is comparatively easy.
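The idea of narrowing the pool of candidates before identifying a speaker can be sketched as follows. This is not the method of the thesis: the `VoiceProfile` fields, the tolerance values, the `eliminate` function and all of the data are hypothetical, chosen only to show how a combination of voice parameters might rule candidates out.

```python
from dataclasses import dataclass


@dataclass
class VoiceProfile:
    name: str
    speech_rate: float  # syllables per second (hypothetical measurement)
    mean_f0: float      # mean fundamental frequency in Hz (hypothetical)


def eliminate(candidates, observed, rate_tol=0.5, f0_tol=15.0):
    """Keep only candidates whose parameters lie within tolerance of the observation."""
    return [
        c for c in candidates
        if abs(c.speech_rate - observed.speech_rate) <= rate_tol
        and abs(c.mean_f0 - observed.mean_f0) <= f0_tol
    ]


# Invented enrolment data, for illustration only.
enrolled = [
    VoiceProfile("A", 4.1, 118.0),
    VoiceProfile("B", 5.3, 121.0),
    VoiceProfile("C", 4.0, 205.0),
    VoiceProfile("D", 3.2, 110.0),
    VoiceProfile("E", 4.3, 125.0),
]
observed = VoiceProfile("unknown", 4.2, 120.0)

shortlist = eliminate(enrolled, observed)
print([c.name for c in shortlist])
```

Each parameter on its own leaves several candidates standing; it is the conjunction of the two checks that shrinks the pool, which is why a combination of parameters is harder to match by chance or by imitation.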

Item Type: Thesis (PhD)
Uncontrolled Keywords: Speech recognition, voice recognition
Subjects: Q Science > Q Science (General)
Faculty / School / Research Centre / Research Group: Faculty of Engineering & Science
Faculty of Engineering & Science > School of Engineering (ENG)
Last Modified: 10 Sep 2023 16:44
URI: http://gala.gre.ac.uk/id/eprint/44082
