
Speech synthesizer for patients with laryngeal defects using articulatory system for European language

Abstract

A speech disorder is a condition in which a person has difficulty producing the speech sounds needed to communicate with others. An estimated 5-11% of people worldwide suffer from speech disorders, and the speech impaired lack an intuitive way to communicate with most of the population. Sign language varies by region, is not understood by the majority of people who are not hearing impaired, and is not particularly intuitive. The aim of this project is to create a seamless device that improves the independence and quality of life of the speech impaired. To do this, a model is designed around a tongue articulatory circuit, giving a device that can help those with speech disabilities caused by laryngeal defects. The target is to generate speech from its most basic unit, the phoneme, so the system is language independent and can generate any language depending on user input.

Problem statement

The goal of this project was to create a device that aids speech production for the speech impaired with laryngeal defects. The synthesizer needs to be ubiquitous and language independent; to achieve this, speech must be generated from the most basic unit of a language, the phoneme, which allows the system to generate any language depending on user input. User input is obtained from the position of the user's tongue, which must be correlated and mapped to the corresponding phoneme in order to predict what the user is trying to convey and synthesize the desired speech signal.

Project workflow

Analysis of formants

Formants are frequencies amplified by the vocal tract, and their combination produces phonetic sounds. Typically two formants, F1 and F2, are prominent; the higher formants (F3-F5) give a person a unique voice quality. The difference (Δ) between the main formant frequencies that form a vowel or consonant varies only slightly with the speaker. We wanted to verify this and conducted some experiments to understand these formants better.

We observed that the difference between F1 and F2 remained largely consistent for a given base frequency (which typically varies with age and gender), and that these formant frequencies appear in the Fourier transform with a much larger amplitude than the other frequencies.

The first formant (F1) is inversely proportional to the height h of the tongue body:

F1 ∝ 1/h

The second formant (F2) is related to both the tongue height and the frontal gap l:

F2 ∝ 1/l
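
As a rough illustration of the Fourier-transform check described above, the sketch below estimates F1 and F2 from a recorded vowel by picking the two strongest low-frequency peaks of its spectrum. It is a minimal sketch assuming a mono WAV recording; the file name, peak-picking settings and 3 kHz search range are illustrative assumptions rather than values from our experiments.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import find_peaks

# Hypothetical recording of a sustained vowel (mono WAV).
rate, samples = wavfile.read("vowel_a.wav")
samples = samples.astype(float)

# Magnitude spectrum of the recording.
spectrum = np.abs(np.fft.rfft(samples))
freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)

# Restrict to the range where F1 and F2 are expected (assumption: below 3 kHz).
mask = freqs < 3000
spectrum, freqs = spectrum[mask], freqs[mask]

# Pick spectral peaks and keep the two strongest as rough F1/F2 estimates.
peaks, _ = find_peaks(spectrum, distance=20)
strongest = peaks[np.argsort(spectrum[peaks])[-2:]]
f1, f2 = sorted(freqs[strongest])
print(f"Estimated F1 ≈ {f1:.0f} Hz, F2 ≈ {f2:.0f} Hz, ΔF = {f2 - f1:.0f} Hz")
```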

Approach and methodology

First, the required phoneme is given as input. This phoneme is directly mapped to mouth parameters such as tongue height, tongue advancement and lip roundness, all of which are measured using sensors. The formants are then estimated by passing these parameters to a physical model, and the resulting synthesized output can be compared against the original phoneme that was given as input.
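
A minimal sketch of this pipeline is shown below, using the inverse relations F1 ∝ 1/h and F2 ∝ 1/l from the previous section as a stand-in for the physical model. The parameter table, proportionality constants and comparison function are illustrative assumptions, not the actual model used.

```python
# Sketch of the phoneme -> articulatory parameters -> formants pipeline.
# Hypothetical articulatory targets per phoneme: tongue height h (cm)
# and frontal gap l (cm). In the real system these come from sensors.
PHONEME_PARAMS = {
    "a": {"h": 1.0, "l": 2.5},
    "i": {"h": 2.0, "l": 1.0},
    "u": {"h": 2.0, "l": 2.0},
}

K1, K2 = 700.0, 2200.0  # assumed proportionality constants (Hz*cm)

def estimate_formants(h: float, l: float) -> tuple[float, float]:
    """Crude stand-in for the physical model: F1 ∝ 1/h, F2 ∝ 1/l."""
    return K1 / h, K2 / l

def compare(target: str, measured_h: float, measured_l: float) -> float:
    """Distance between the formants of the target phoneme and the
    formants implied by the measured mouth parameters."""
    ref = estimate_formants(**PHONEME_PARAMS[target])
    est = estimate_formants(measured_h, measured_l)
    return sum(abs(r - e) for r, e in zip(ref, est))

print(compare("a", measured_h=1.1, measured_l=2.4))
```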

Making a synthesiser

A rudimentary synthesizer was created using voice samples collected from a speaker uttering individual phonemes. The English language has 44 phonemes; all of these sounds were recorded and named with the appropriate IPA nomenclature. The synthesizer was built with the steps listed below.

Collection of voice data: The English language has 44 phonemes; this count includes only the main phonemes, without the emphasis and other variations that form allophones. Care was taken to produce only these sounds and not the full vowels by which the sounds are usually represented. The audio files were renamed to match the corresponding IPA phoneme.

Cleaning of voice data: The recorded samples had gaps that made the synthesizer behave in an undesirable manner and made the sounds seem incoherent. To clean the data, the audio samples were scanned and all segments below a certain loudness threshold were eliminated.

Synthesis: To synthesize audio, text input is taken in the form of IPA phonetic symbols, and the sound output is obtained by concatenating the audio from the files with the corresponding names (a sketch of the cleaning and concatenation steps follows).
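
The sketch below illustrates the cleaning and concatenation steps described above. It is a minimal sketch assuming one mono WAV file per phoneme, named by its IPA symbol in a phonemes/ directory; the directory layout and silence threshold are assumptions, not the exact values used in the project.

```python
import numpy as np
from scipy.io import wavfile

SILENCE_THRESHOLD = 500  # assumed loudness threshold (16-bit amplitude units)

def load_phoneme(symbol: str, directory: str = "phonemes") -> tuple[int, np.ndarray]:
    """Load a phoneme recording and strip samples below the loudness threshold."""
    rate, samples = wavfile.read(f"{directory}/{symbol}.wav")
    cleaned = samples[np.abs(samples) >= SILENCE_THRESHOLD]
    return rate, cleaned

def synthesize(ipa_symbols: list[str], out_path: str = "output.wav") -> None:
    """Concatenate the cleaned phoneme recordings in the given order."""
    rate = None
    pieces = []
    for symbol in ipa_symbols:
        rate, cleaned = load_phoneme(symbol)
        pieces.append(cleaned)
    wavfile.write(out_path, rate, np.concatenate(pieces))

# Example: synthesize the word "cat" from its phonemes /k/, /æ/, /t/.
synthesize(["k", "æ", "t"])
```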

Simulations

VTL lab simulation

The significant parameters used in testing are LD, LP, TCX and TCY, where a linear combination of Lip Distance (LD) and Lip Protrusion (LP) constitutes the phonetic/acoustic parameter Lip Rounding (LR), and TCX and TCY correspond to Tongue Advancement (TA) and Tongue Body Height (TH). A small illustrative sketch of this combination follows the notes below.

  • The shortened initials of each parameter correspond to those in the table in the Appendix

  • The window shown below depicts the generation of the vocal tract shape using the various articulatory parameters for the vowel /a/
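
As a small illustration of the parameter mapping, the snippet below combines lip distance and lip protrusion into a single lip-rounding value; the equal weighting and the 0-100 scaling are assumptions for illustration, not the weights used in VocalTractLab.

```python
def lip_rounding(ld: float, lp: float, w_ld: float = 0.5, w_lp: float = 0.5) -> float:
    """Combine normalized lip distance (LD) and lip protrusion (LP), each in
    the range 0-1, into a lip-rounding (LR) percentage. The weights are
    illustrative assumptions, not values from the VocalTractLab model."""
    return 100.0 * (w_ld * (1.0 - ld) + w_lp * lp)

# Example: a narrow lip opening with strong protrusion gives high rounding.
print(lip_rounding(ld=0.2, lp=0.9))
```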

Experimentation and prototyping

The main challenge overcome by this new system is obtaining steadier values for tongue height and tongue advancement, two of the most important parameters:

  1. Tongue Height measurement: Tongue height was measured using an analog IR sensor; the advantage of this approach is that the sensor gives a stable voltage reading over a large range (10-80 cm in this case). The movement of the tongue was designed to stay between 10 and 14 cm from the sensor, so the sensor operates in a linear range and the analog values can be mapped linearly and used directly.

  2. Tongue advancement: Tongue advancement was detected using the IR sensor in combination with a Hall effect sensor. To account for sensitivity to noise and scattering effects, the IR sensor readings are averaged over 100 iterations. The Hall effect sensor simultaneously returns an analog value corresponding to a specific height measurement; if this value is offset from the IR-measured value, the difference is divided by a coefficient found through experimentation, which yields the tongue advancement value.

  3. Lip Rounding: Lip rounding is handled using a simple potentiometer, and the 10-bit analog-to-digital converter (ADC) values are mapped to a 0-100 percentage of roundness (a sketch of these readings follows this list).
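
A minimal sketch of these sensor readings is shown below. The ADC interface is a placeholder, and the pin assignments, calibration constants and Hall-offset coefficient are illustrative assumptions, not the values found in our experiments.

```python
# Sketch of the sensor processing described above. read_adc() stands in for
# whatever analog-read interface the microcontroller provides.

IR_PIN, HALL_PIN, POT_PIN = 0, 1, 2
ADC_MAX = 1023                   # 10-bit ADC
IR_SLOPE, IR_OFFSET = 0.02, 8.0  # assumed linear fit: cm per ADC count, cm
HALL_COEFF = 3.5                 # assumed coefficient from experimentation

def read_adc(pin: int) -> int:
    """Placeholder for the board's analog read; returns a 10-bit value."""
    raise NotImplementedError("replace with the actual ADC read")

def tongue_height_cm() -> float:
    """Average 100 IR readings to suppress noise, then map linearly to cm."""
    avg = sum(read_adc(IR_PIN) for _ in range(100)) / 100.0
    return IR_OFFSET + IR_SLOPE * avg

def tongue_advancement() -> float:
    """Divide the offset between the Hall-implied and IR-measured heights by
    an experimentally found coefficient to estimate advancement."""
    ir_height = tongue_height_cm()
    hall_height = IR_OFFSET + IR_SLOPE * read_adc(HALL_PIN)  # assumed mapping
    return (hall_height - ir_height) / HALL_COEFF

def lip_rounding_percent() -> float:
    """Map the 10-bit potentiometer reading to a 0-100 roundness percentage."""
    return 100.0 * read_adc(POT_PIN) / ADC_MAX
```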

About

This project was undertaken as part of my final semester undergraduate Capstone project alongside Anirudh Krishnan, Mukund B, and M Saiesh Reddy, under the guidance of Prof. Kaustav Bhowmick.

