Automatic speech synthesis






 

In a way, speech synthesis is the mirror process of speech recognition: in speech synthesis messages that exist in some discrete symbolic form are converted into audio signals which can be played to listeners. Ideally, one would like to accomplish this by simulating the processes in humans who speak (or read aloud, if the message exists in printed form). In actual practice, simulating the extremely complex speech production processes is far beyond our present capabilities, so that simpler approaches must be used.

If the number of utterances that must be produced in an application is small, and if the words and sentences do not change over time, the most obvious solution is to play pre-recorded utterances. If the number of utterances is too large to record, or if the utterances change relatively quickly over time, but the number of different words is small and the grammar of the phrases is constant, speech output can be generated by concatenating pre-recorded words and phrases. However, concatenation is more difficult than it might seem. Smooth, continuous speech is not just a sequence of invariant symbols (words). In a fluent utterance the words blend into one another without perceptible boundaries. Moreover, in continuous speech the so-called prosodic features (pitch, loudness, temporal structure) must vary continuously. Especially for pitch and loudness, there is no guarantee that straightforward concatenation of words and phrases recorded in isolation will result in continuous parameter tracks.
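
As a rough illustration of the continuity problem, the following sketch concatenates pre-recorded word files with a short linear crossfade at each join. The file names, the crossfade length and the use of NumPy/SciPy are assumptions made for the example; a crossfade can soften an abrupt waveform jump, but it cannot repair mismatched pitch or loudness across the join.

    import numpy as np
    from scipy.io import wavfile

    def concatenate_words(paths, crossfade_ms=20):
        """Join pre-recorded word files, e.g. ["the.wav", "train.wav"] (hypothetical names)."""
        rate = None
        pieces = []
        for path in paths:
            sr, audio = wavfile.read(path)
            rate = rate or sr
            pieces.append(audio.astype(np.float32))
        fade = int(rate * crossfade_ms / 1000)
        out = pieces[0]
        for nxt in pieces[1:]:
            # Short linear crossfade: smooths the waveform discontinuity at the
            # join, but leaves pitch and loudness mismatches untouched.
            ramp = np.linspace(0.0, 1.0, fade)
            out[-fade:] = out[-fade:] * (1.0 - ramp) + nxt[:fade] * ramp
            out = np.concatenate([out, nxt[fade:]])
        return rate, out.astype(np.int16)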

If the number of words becomes too large, or if each new message may contain new words, synthesis by concatenation of words and phrases is no longer possible. In that case, much smaller sound segments must be concatenated. Conventionally, so-called diphones have been used as the basic building blocks. Diphones are speech segments that run from the middle of one sound to the middle of the next. Because diphones are cut out of recordings of speech expressly produced for the purpose, these building blocks implicitly encode the very complex ways in which the acoustic parameters vary as a result of the continuous movements of the articulators [4].
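
A minimal sketch of how such an inventory might be cut from a phone-aligned recording is given below. The annotation format (label, start sample, end sample) is an assumption made for the example; real systems work from time-aligned label files and keep many instances of each diphone.

    def extract_diphones(audio, phone_segments):
        """phone_segments: list of (label, start_sample, end_sample) tuples in utterance order."""
        diphones = {}
        for (l1, s1, e1), (l2, s2, e2) in zip(phone_segments, phone_segments[1:]):
            mid1 = (s1 + e1) // 2      # middle of the first sound
            mid2 = (s2 + e2) // 2      # middle of the second sound
            # Keep only one instance per diphone pair in this simplified sketch.
            diphones[(l1, l2)] = audio[mid1:mid2]
        return diphones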

Unfortunately, diphone concatenation has the same inherent problem as any concatenation technique: all parameters must be continuous across the diphone boundaries. In actual practice, neither the spectral nor the prosodic parameters are guaranteed to match when diphones are joined. Moreover, in fluent utterances the prosodic parameters must be adjusted to produce natural intonation. Attempts have been made to use speech coding techniques (such as Linear Predictive Coding, or LPC) to adjust the spectral parameters at diphone boundaries and to change the overall prosodic parameters to obtain appropriate prosody. However, due to non-linear interactions between the excitation signals and the vocal tract filter in human speech production, it has not yet been possible to devise parametric representations in which all parameters are truly orthogonal, so that individual parameters could be changed considerably while retaining a natural-sounding quality. Therefore, recent research activities are aimed at techniques for finding the optimal sequence of diphones in a very large database of candidate diphones. Optimality is determined by an estimate of the perceptual degradation due to the parameter adaptations needed to form the utterance to be synthesised by concatenating and adapting these specific building blocks.
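
The search described above can be formulated as a dynamic-programming problem over candidate units, with a target cost for how well a candidate fits the required diphone and a join cost for the mismatch at each boundary. The sketch below illustrates only that formulation; the cost functions are placeholders standing in for estimates of perceptual degradation (spectral and prosodic mismatch), not the measures of any particular system, and units are assumed to be hashable identifiers.

    def select_units(slots, candidates, target_cost, join_cost):
        """slots: required diphone labels in order;
        candidates[label]: hashable unit identifiers available for that label."""
        best = [{} for _ in slots]          # best[i][unit] = (total cost, previous unit)
        for u in candidates[slots[0]]:
            best[0][u] = (target_cost(slots[0], u), None)
        for i in range(1, len(slots)):
            for u in candidates[slots[i]]:
                cost, prev = min(
                    ((c + join_cost(p, u) + target_cost(slots[i], u), p)
                     for p, (c, _) in best[i - 1].items()),
                    key=lambda t: t[0],
                )
                best[i][u] = (cost, prev)
        # Trace back the cheapest path through the lattice of candidates.
        u, (cost, _) = min(best[-1].items(), key=lambda kv: kv[1][0])
        path = [u]
        for i in range(len(slots) - 1, 0, -1):
            u = best[i][u][1]
            path.append(u)
        return list(reversed(path)), cost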

In reading existing messages through speech synthesis a number of additional problems must be solved, all related in some way to the vast amount of knowledge that humans acquire when they learn to read. For one thing, many languages, including English and Dutch, do not have a one-to-one relation between spelling and pronunciation. Consequently, complex sets of letter-to-sound rules are needed, some of which may even require syntactic or semantic analysis. The word ‘read’ is an example: its pronunciation depends on its syntactic function.
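
To make the ‘read’ example concrete, the sketch below treats such words as context-dependent exceptions looked up by part of speech before general letter-to-sound rules are applied. The phone strings and tag names are illustrative assumptions, not an actual rule set.

    # Hypothetical exception table: pronunciation depends on syntactic function.
    EXCEPTIONS = {
        ("read", "VERB_PAST"): "r eh d",   # "she read it yesterday"
        ("read", "VERB_PRES"): "r iy d",   # "they read every morning"
    }

    def letter_to_sound(word, pos_tag, rules):
        """rules: fallback function mapping a spelling to a phone string."""
        key = (word.lower(), pos_tag)
        if key in EXCEPTIONS:
            return EXCEPTIONS[key]
        return rules(word)     # general letter-to-sound rules for regular words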

Syntactic and semantic analysis is also necessary to determine which words in a sentence must be stressed. SGML or HTML markup information can (and must) be used to help the speech synthesis system select the appropriate intonation patterns. For instance, in reading a newspaper article (e.g., in a reading machine for the blind, or in a service where company press releases are made accessible via the telephone), the ease of understanding is increased substantially if the headlines are read with a different intonation from the body text.
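
A sketch of how markup could steer prosody is given below: each tag is mapped to a small set of prosodic settings (pitch range, speaking rate, pause length). The tag names and numeric values are hypothetical and only illustrate the idea of markup-dependent intonation patterns.

    # Hypothetical mapping from markup tags to prosodic settings; the numbers
    # are illustrative, not values taken from any real system.
    INTONATION = {
        "h1": {"pitch_range": 1.4, "rate": 0.9, "pause_after_s": 0.8},  # headline
        "p":  {"pitch_range": 1.0, "rate": 1.0, "pause_after_s": 0.4},  # body text
    }

    def prosody_for(tag):
        # Unknown tags fall back to the body-text settings.
        return INTONATION.get(tag, INTONATION["p"])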

 

