Graduate School of Information Science and Technology, The University of Tokyo

2010/11/01

Practical speech applications based on a unique paradigm

Associate Professor Nobuaki Minematsu
(Department of Information and Communication Engineering)

Associate Professor Minematsu, who specializes in speech science and engineering, has set the goal of developing “a system to recognize anyone's utterances using the voice of only one speaker.” This goal grew out of a simple question he asked himself: “What happens if the tallest man and the shortest man in the world meet and say 'good morning' to each other?” Their voices are acoustically very different, and neither has ever heard so large a difference before. Nevertheless, they easily perceive something common in their utterances, and a conversation starts. It is extremely difficult to reproduce this ability on computers with today's technology, because speech recognition based on statistical models built from voices collected from many speakers has its limitations. He reasoned that truly human-like speech recognition would become possible if we could find a sound pattern underlying two utterances that are linguistically the same but acoustically different. He therefore focused on “timbre movement patterns” in utterances in pursuit of such a speaker-invariant pattern.

Infants acquire a number of words by imitating their parents' utterances (vocal imitation), even though they cannot yet perceive individual units of sound (phonemes). It is also true that infants do not impersonate their parents' voices. What acoustic aspects, then, do infants imitate? Developmental psychology states that they imitate a holistic sound pattern underlying a given utterance, called “speech Gestalt”. This pattern must be independent of the size of the speaker! To capture it, a sound (timbre) stream is first converted into a finite number of distributions (events), and a distance matrix between these distributions is then calculated. Focusing only on the distances between events means following only the movement while ignoring the individual sounds. The key factor is how the distance is measured: it should be invariant across speakers. Gestalt is a psychological term; psychology textbooks give no physical definition of it, and as it stands the concept has no direct connection to technology. Associate Professor Minematsu defined Gestalt mathematically and proposed a method of extracting a Gestalt pattern from an utterance.
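The idea of a speaker-invariant distance matrix can be illustrated with a minimal sketch. Several choices here are assumptions made for illustration and are not stated in this article: each event is modeled as a Gaussian fitted to a fixed-length segment of the feature stream, the distance is the Bhattacharyya distance (which does not change under an invertible affine transform), and the difference between two speakers is approximated by such a transform. The function names `bhattacharyya` and `structure_matrix` are hypothetical.

```python
import numpy as np


def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussian distributions."""
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    # Mahalanobis-like term on the means.
    mean_term = 0.125 * diff @ np.linalg.solve(cov, diff)
    # Log-determinant term on the covariances.
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    cov_term = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return mean_term + cov_term


def structure_matrix(frames, n_events):
    """Distance matrix between events of a feature (timbre) stream.

    Assumption: the stream is cut into n_events consecutive segments
    and each segment is summarized by one Gaussian (mean, covariance).
    """
    segments = np.array_split(frames, n_events)
    stats = [(s.mean(axis=0), np.cov(s, rowvar=False)) for s in segments]
    dist = np.zeros((n_events, n_events))
    for i in range(n_events):
        for j in range(i + 1, n_events):
            dist[i, j] = dist[j, i] = bhattacharyya(*stats[i], *stats[j])
    return dist


# Synthetic "utterance": 200 frames of 3-dimensional features with a
# slowly drifting mean (made-up data, not real speech).
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 3)) + np.linspace(0.0, 4.0, 200)[:, None]

# Assumed stand-in for a different speaker: an invertible affine map.
A = np.array([[2.0, 0.5, 0.0],
              [0.1, 1.5, 0.3],
              [0.0, 0.2, 0.8]])
b = np.array([1.0, -2.0, 0.5])
other_speaker = frames @ A.T + b

d_original = structure_matrix(frames, n_events=5)
d_transformed = structure_matrix(other_speaker, n_events=5)

# Largest entry-wise difference between the two matrices.
print(np.abs(d_original - d_transformed).max())
```

Under these assumptions the two matrices agree up to rounding error, even though the individual feature vectors of the two streams differ: only the relations between events are kept, which corresponds to following the movement while ignoring the individual sounds.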

This research takes on the challenge of unraveling the universal and invariant structures behind what we can directly see or hear. Associate Professor Minematsu, an expert in speech, pursues a science of humans and languages from an engineer's perspective and aims to realize the system described above.

URL: http://www.gavo.t.u-tokyo.ac.jp/~mine/japanese/index.html
