Automatic speech recognition

Utterances of the same word or sentence may differ considerably in their acoustic parameters. It is easy to understand why this should be so: to begin with, speakers differ considerably from one another. The length and shape of the vocal tract of a young child are very different from those of an adult male, with corresponding consequences for the resonance frequencies. The same goes for the vocal folds, where the differences result in a higher or lower average pitch. In addition to these anatomical characteristics there are behavioural differences: even in a small country like the Netherlands there are several clearly distinguishable regional differences in the way some sounds are articulated. Last but not least, words can be spoken slowly or fast, which gives rise to substantial differences in the duration of the speech signals.

Humans who have learned a language have surprisingly little difficulty in coping with all this variation: we are trained to hear the invariant linguistic message despite it. Probably the fact that utterances are directly relevant to our normal daily activities and relations has been instrumental during the learning process. For machine speech recognition, however, this variation, little of which is easy to predict, is the single most important problem that must be solved.

Automatic speech recognition is best treated as a problem in Information Theory: during human speech production a message is encoded and transmitted through a channel that is at best partially known and that is often noisy. At the receiving end the recogniser's task is to decode the message. One obvious way to approach the problem is to build probabilistic models of all relevant messages, and to compute the likelihood that a given signal corresponds to (the model of) each of the possible messages. This is precisely the way in which all presently existing automatic speech recognition devices attempt to solve the problem.

Messages are defined in terms of words. The way in which words are modelled depends very much on the number of different words in the vocabulary, and on the way in which words can be combined to form complex messages. If the number of words in the lexicon is small (e.g. only the ten digits 0, ..., 9), it is best to build models of full words. If the number of words is much larger, and words can be strung together into continuous speech, the number of possible utterances quickly becomes too large to model individually. In that case it is necessary to build models of the 45 or so different speech sounds that are used to form all the words in a language. Words are then modelled as sequences of sounds, or more precisely, as sequences of sound models.
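
To make the contrast concrete, the following Python sketch shows how word models can be composed from a small inventory of sound models. The phone symbols and the three-word lexicon are invented for illustration and do not come from the text:

# A minimal sketch (hypothetical phone inventory and lexicon) of how
# words are modelled as sequences of sound models rather than as
# whole units.

# Pronunciations as phone sequences; the symbols are illustrative only.
LEXICON = {
    "zero": ["z", "ih", "r", "ow"],
    "one":  ["w", "ah", "n"],
    "two":  ["t", "uw"],
}

def word_model(word):
    """Compose a word model by concatenating its phone models."""
    # In a real recogniser each phone symbol would index a trained
    # acoustic model (e.g. a 3-state HMM); here we just return the
    # symbol sequence.
    return LEXICON[word]

# With ~45 phone models we can build a model for any word in the
# lexicon, instead of training a separate whole-word model per entry.
print(word_model("zero"))   # ['z', 'ih', 'r', 'ow']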

In decoding messages the prior probability of the words also comes into play [3]. This gives rise to a Bayesian approach: what we are interested in is p(w|X), i.e. the probability of the word sequence w given the sequence of acoustic observations X. Since it is impossible to estimate p(w|X) directly from training data, the Bayesian inversion formula is used:

p(w|X) = p(X|w) · p(w) / p(X)
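
A toy numerical illustration of this formula, with invented likelihoods and priors for two competing word sequences, shows how the prior p(w) can overrule the acoustic evidence:

# A toy numerical illustration of the Bayesian inversion formula above.
# The candidate word sequences, likelihoods and priors are all invented.

candidates = {
    # word sequence: (p(X|w), p(w))
    "recognise speech":   (0.020, 0.70),
    "wreck a nice beach": (0.025, 0.30),
}

# p(X) is the same for every candidate: sum over all hypotheses.
p_X = sum(p_X_w * p_w for p_X_w, p_w in candidates.values())

for w, (p_X_w, p_w) in candidates.items():
    posterior = p_X_w * p_w / p_X
    print(f"p({w!r}|X) = {posterior:.3f}")

# "recognise speech" wins (posterior 0.651 vs 0.349) despite its lower
# acoustic likelihood, because its prior probability p(w) is higher.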

p(X|w) can be estimated if we have a sufficient number of tokens of the words w, spoken by a relevant set of speakers. If the recogniser will be used by a single speaker, the training speech is best produced by that speaker. In practice, however, it is easier to record limited amounts of training speech from a very large number of speakers and to train speaker-independent models. If necessary, these models can be adapted to a specific speaker using a small amount of adaptation speech.

The a priori probability of hearing a given utterance, i.e. p(w), is often estimated from written texts. This is especially true for dictation applications, where enormous amounts of computer-readable text are available. p(X) is the a priori probability of observing a specific acoustic signal.
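
As a rough illustration, p(w) is commonly factored into n-gram probabilities estimated by counting in a text corpus. The following minimal sketch estimates bigram probabilities; the tiny "corpus" is invented for the example:

# A minimal sketch of estimating p(w) from written text with a bigram
# language model.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """Maximum-likelihood estimate of p(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

# p(w) for a word sequence is then approximated as a product of
# bigram terms.
print(p_bigram("cat", "the"))   # 2/3: "the" is followed by "cat" twice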

The most popular type of acoustic model for p(X|w) is the so-called Hidden Markov Model (HMM). An HMM consists of a small number of (hidden) states. During training, the state transition probabilities and the distributions of the acoustic observations in each state are estimated simultaneously for each basic unit (e.g. a sound or a word). Alternatively, Artificial Neural Networks are used to build speech recognisers. Hybrid recognisers use neural nets to compute the probabilities of all speech sounds for each spectral slice, and HMMs to combine these frame-based probabilities into word and utterance level decoding.
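
For illustration, a minimal discrete HMM and the forward computation of p(X|w) might look as follows; all probabilities are invented, and a real acoustic model would use continuous observation densities over spectral features rather than a two-symbol alphabet:

# A minimal discrete HMM sketch showing how p(X|w) is computed by
# summing over all hidden state sequences with the forward algorithm.

states = [0, 1, 2]                        # e.g. a 3-state phone model
init   = [1.0, 0.0, 0.0]                  # always start in state 0
trans  = [[0.6, 0.4, 0.0],                # left-to-right transitions
          [0.0, 0.7, 0.3],
          [0.0, 0.0, 1.0]]
emit   = [{"a": 0.8, "b": 0.2},           # per-state observation probs
          {"a": 0.3, "b": 0.7},
          {"a": 0.5, "b": 0.5}]

def forward_likelihood(obs):
    """Return p(obs | model) by the forward algorithm."""
    alpha = [init[s] * emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        alpha = [sum(alpha[r] * trans[r][s] for r in states) * emit[s][o]
                 for s in states]
    return sum(alpha)

print(forward_likelihood(["a", "b", "b"]))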

In the Information Theoretic framework, speech recognition boils down to searching for the (sequence of) words that maximises the likelihood p(X|w) · p(w). Several efficient decoding algorithms have been developed to implement this search, the most popular of which are the Viterbi Beam Search and A* or Stack Decoding. Operational speech recognisers exploit numerous heuristics to speed up the search. Grammars and other linguistic knowledge that help to prune the search space are widely and successfully used, despite the fact that they may introduce search errors. The denominator term p(X) is conventionally neglected in the search; this does not affect the result in terms of the word (sequence) that maximises the likelihood. However, since the likelihood is not scaled (and since the heuristic search may yield a sub-optimal result), it is not possible to make statements about the absolute goodness of fit of the best matching model to the acoustic observations. In several applications this creates a need for independent confidence measures.
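
A bare-bones Viterbi decoder over the same kind of model (all numbers invented) could be sketched as follows. Unlike the forward algorithm, it keeps only the single best path into each state; a beam search would additionally discard states whose score falls too far below the current best:

# A minimal Viterbi decoder in the log domain, returning the best
# hidden state sequence instead of the summed likelihood.
import math

log = math.log
states = [0, 1]
log_init  = [log(0.6), log(0.4)]
log_trans = [[log(0.7), log(0.3)],
             [log(0.4), log(0.6)]]
log_emit  = [{"a": log(0.9), "b": log(0.1)},
             {"a": log(0.2), "b": log(0.8)}]

def viterbi(obs):
    """Return (best log score, best state sequence) for obs."""
    scores = [log_init[s] + log_emit[s][obs[0]] for s in states]
    backptr = []
    for o in obs[1:]:
        # For each state, keep only the best predecessor.
        step = [max((scores[r] + log_trans[r][s], r) for r in states)
                for s in states]
        backptr.append([r for _, r in step])
        scores = [sc + log_emit[s][o] for s, (sc, _) in enumerate(step)]
    # Trace the best path back from the best final state.
    best = max(states, key=lambda s: scores[s])
    path = [best]
    for bp in reversed(backptr):
        path.append(bp[path[-1]])
    return scores[best], path[::-1]

print(viterbi(["a", "b", "b"]))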

We do not really know how humans recognise speech, but there are many reasons to assume that they, too, perform some kind of probabilistic decoding. Humans probably use more efficient techniques to extract the information that is encoded in the dynamic changes of the signal. They can also bring enormous amounts of prior knowledge about the topic, the speaker, the language, etc. to bear as prior probabilities, in ways we do not yet know how to integrate into automatic speech recognition.
