Natural interfaces with speech technology

Speech is supposed to be the essential modality for easy, user-friendly interfaces and for communication with ‘intelligent agents’. We present an overview of the three main speech technologies: speech recognition, speaker recognition, and speech synthesis. Next we mention some boundary conditions for successful applications with speech technology.

 

Introduction

 

Expectations for the use of speech technology in (multi-modal) interfaces for a wide range of products and services are high. Many magazines have run feature articles about applications of the technology, suggesting that it is ready for wide-scale deployment [1]. Yet the number of actual applications that are crucially dependent on speech technology has remained relatively small, all over the world.

Why is it that speech seems to be a very good candidate for simple and user-friendly interfaces? There are many reasons. Speech can be seen as a simple and natural way to control all kinds of functions: just say what you want. The voice can also be used for authentication: speech serves as a behavioural biometric feature, a ‘voice signature’ that makes PIN codes superfluous. Speech as output is attractive as well: listening to spoken messages may be very convenient for users in all kinds of environments.

Speech is probably the most natural means of communication between human beings, and it can be used in all types of communicative situations. Speech can be used locally (while sitting in front of your PC), but also over long distances, via the telephone. Speech does not need a special input device, like a keyboard or a touch screen: a microphone (e.g. in a (mobile) phone) is enough to enter commands and data. In addition, people are using more and more very small terminals with very small keypads (e.g. web phones, cell phones or electronic organisers). For these devices, speech is a much more pleasant input mode than typing. Furthermore, the number of mobile phones and communication devices is growing rapidly. While driving a car (a typical eyes- and hands-busy situation) it is much safer to use speech to control non-critical functions, like the radio, the navigation system and the telephone (voice dialling). People are more and more on the road, and when travelling they want to be able to access the same information as in the office, like their e-mail. Market analyses indicate that people increasingly want to receive personalised information, like the prices of the stocks they own or consider buying, or information about traffic jams on the roads towards their destination. If this information is given over the phone, the only way to provide it is by using synthetic speech, since it is much too expensive to have operators read e-mail messages aloud.

 

However, the most important reason for the high expectations seems to be that speech technology, which is necessary for all the applications mentioned above, has matured considerably over the last few years. This triggers thinking about new services and products that could not be built without speech technology in the interface. This holds for all three types of speech technology: automatic speech recognition, automatic speaker recognition, and automatic text-to-speech synthesis. In the next section we will briefly describe the general characteristics of speech (2.1), automatic speech recognition (2.2), speaker recognition (2.3), and speech synthesis (2.4). In the succeeding sections we will present types of applications and critical success factors for these applications. In doing so, we intend to explain why speech technology has not yet fulfilled all its promises, or in other words, why speech is perhaps less easy to use in human-machine communication than in communication between intelligent human beings.

Speech

 

Speech is perhaps the most complex human behaviour. Speech can be described on several levels, e.g. as an acoustic signal and as a tightly structured system of symbols and meanings.

As an acoustic signal, speech is generated by the articulatory apparatus and perceived by the auditory system. Like all mechanical systems, the articulators move with finite speed, thereby generating an acoustic signal with continuously changing parameters. It appears that a large part of the information in a speech signal is encoded in the way the signal parameters change, rather than in terms of instantaneous, static parameter values. This matches the perceptual capabilities of humans (and of all other vertebrates): when trying to spot potential dangers (or potential dinners), changes in the physical environment are far more important than stationary states.

The linguistic information in speech signals is always described in terms of discrete and static symbols, like speech sounds, syllables and words. Most likely, this matches the way in which humans handle concepts.

Nevertheless, speech is one of the major examples where processing of continuous signals and discrete symbols must be integrated.

We are used to describing meaning in terms of words, and words in terms of syllables and sounds (if not letters, when we deal with written representations of language). Speech sounds are generated in the vocal tract, i.e., the complex non-uniform tube formed by the throat (pharynx), the oral cavity (the shape of which can be changed by means of tongue and jaw movements) and the nasal cavity, the shape of which is constant (but very different between speakers). This non-uniform, dynamically changing tube is acoustically excited at the very far end (by air pulses released through the vibrating vocal folds) or closer to the near end (the lips), by the turbulence caused by air that is forced across sharp ridges. Different speech sounds correspond to different shapes of the vocal tract, and to different excitation sources.
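To make this source-filter picture concrete, the following sketch (Python with NumPy) passes a periodic pulse train, standing in for the voiced excitation at the glottis, through two second-order resonators that mimic vocal-tract formants. It is a minimal illustration only; the sampling rate, formant frequencies and bandwidths are rough textbook values for an /a/-like vowel, not figures taken from this text.

import numpy as np

SAMPLE_RATE = 8000  # Hz; telephone-band rate, an assumption for this sketch

def pulse_train(duration_s=0.5, f0_hz=120):
    """Voiced excitation: one pulse per pitch period (a crude stand-in for glottal pulses)."""
    n_samples = int(duration_s * SAMPLE_RATE)
    excitation = np.zeros(n_samples)
    excitation[::SAMPLE_RATE // f0_hz] = 1.0
    return excitation

def resonator(signal, freq_hz, bandwidth_hz):
    """Apply one second-order resonance (a single 'formant') to the signal."""
    r = np.exp(-np.pi * bandwidth_hz / SAMPLE_RATE)   # pole radius set by the bandwidth
    theta = 2.0 * np.pi * freq_hz / SAMPLE_RATE       # pole angle set by the frequency
    a1, a2 = 2.0 * r * np.cos(theta), -r * r
    out = np.zeros_like(signal)
    for i in range(len(signal)):
        # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]; initial conditions are zero
        out[i] = signal[i] + a1 * out[i - 1] + a2 * out[i - 2]
    return out

# An /a/-like vowel: excitation at the 'far end' of the tube, shaped by two formants.
vowel = resonator(resonator(pulse_train(), 730, 90), 1090, 110)
print(vowel[:5])

Changing the resonance frequencies (the tube shape) while keeping the same excitation yields a different vowel, which is exactly the correspondence described above.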

 

Speech signals are conventionally described in terms of spectro-temporal characteristics (cf. Fig. 1). The most convenient way to represent the dynamic spectro-temporal information is by stacking short-time spectra, i.e., spectra of short segments of the signal, which can be considered stationary; 50 to 100 short-time spectra per second are sufficient for a precise description of the information in a speech signal.

 

Fig. 1. Oscillogram and spectrogram of a short utterance.
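As an illustration of how such a stack of short-time spectra can be computed, the following Python/NumPy sketch cuts a signal into short, quasi-stationary frames and takes a magnitude spectrum per frame. The frame length (25 ms) and hop (10 ms, i.e. about 100 spectra per second) are common choices assumed here, not values prescribed by this text.

import numpy as np

def short_time_spectra(signal, sample_rate=8000, frame_ms=25, hop_ms=10):
    """Stack magnitude spectra of short frames (rows = time, columns = frequency)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 200 samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 80 samples -> ~100 frames per second
    window = np.hamming(frame_len)                   # taper each frame to reduce spectral leakage
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    spectra = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * hop_len : i * hop_len + frame_len] * window
        spectra[i] = np.abs(np.fft.rfft(frame))      # magnitude spectrum of one short segment
    return spectra

# Example: one second of a synthetic 440 Hz tone sampled at 8 kHz.
t = np.arange(8000) / 8000.0
spectrogram = short_time_spectra(np.sin(2.0 * np.pi * 440.0 * t))
print(spectrogram.shape)   # roughly one hundred short-time spectra for one second of signal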

 

The vocal tract, in turn, is usually described in terms of its resonance frequencies. This description is especially useful for vowel sounds, which have their excitation at the far end of the tube. The resonances are often termed ‘formants’. Two such formants suffice to uniquely identify a vowel (cf. Fig. 2).

 

Fig. 2. The vowels in an F1-F2 plane [2].
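The claim that two formants suffice to identify a vowel can be illustrated with a toy nearest-neighbour lookup in the F1-F2 plane. The reference formant values below are rough averages for a male speaker and are illustrative assumptions, not data taken from [2].

# Rough average (F1, F2) values in Hz for three vowels; hypothetical reference points.
REFERENCE_FORMANTS_HZ = {
    "i": (270, 2290),   # as in 'beet'
    "a": (730, 1090),   # as in 'father'
    "u": (300, 870),    # as in 'boot'
}

def identify_vowel(f1_hz, f2_hz):
    """Return the reference vowel whose (F1, F2) point lies closest in the F1-F2 plane."""
    return min(
        REFERENCE_FORMANTS_HZ,
        key=lambda v: (f1_hz - REFERENCE_FORMANTS_HZ[v][0]) ** 2
                      + (f2_hz - REFERENCE_FORMANTS_HZ[v][1]) ** 2,
    )

print(identify_vowel(300, 2200))   # -> 'i'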

 

For unvoiced sounds, which have their excitation somewhere in the middle of the tube, a description in terms of resonances alone is not really adequate; the same holds for nasal sounds, because there the nasal and oral cavities are connected in parallel. Unvoiced sounds and nasals are characterised by a combination of resonances and anti-resonances. If the excitation signal were exactly white, resonances and anti-resonances would correspond to maxima and minima in the spectral envelope. However, in most real-world conditions the relation between spectral maxima and minima and the (anti-)resonances is far from trivial. This is the reason why formant representations are seldom used as parameters for automatic speech and speaker recognition.

 

For technological research, speech spectra are usually represented on a Mel scale instead of the linear frequency scale (the Mel scale is approximately linear up to 1 kHz and logarithmic at higher frequencies), because the capability of the vertebrate auditory system to discriminate frequencies diminishes as the frequencies get higher. For many applications in speech and speaker recognition the Mel-scale spectra are subjected to a Discrete Fourier Transform to obtain the so-called cepstral coefficients. For most purposes, 10 to 15 cepstral coefficients are sufficient to describe the spectral envelope in the frequency band of interest (for telephone speech, DC to 4 kHz).
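The following sketch (Python/NumPy) shows one plausible way to carry out the Mel warping and the cepstral step described above: the short-time power spectrum is pooled into bands spaced equally on the Mel scale, and a cosine transform (a real-valued relative of the Fourier transform that is commonly used at this point) of the log band energies yields the cepstral coefficients. The filterbank size (24 bands) and the rectangular band shape are simplifying assumptions, not prescriptions from this text.

import numpy as np

def hz_to_mel(f_hz):
    """Mel scale: approximately linear below 1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_cepstrum(power_spectrum, sample_rate=8000, n_bands=24, n_ceps=13):
    """Pool a short-time power spectrum into Mel bands and return cepstral coefficients."""
    n_bins = len(power_spectrum)
    nyquist = sample_rate / 2.0
    # Band edges equally spaced on the Mel scale between 0 Hz and the Nyquist frequency.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(nyquist), n_bands + 2))
    edges_bin = np.floor((n_bins - 1) * edges_hz / nyquist).astype(int)
    # Energy per Mel band (rectangular bands; triangular filters are the more usual choice).
    band_energy = np.array([
        power_spectrum[edges_bin[b] : edges_bin[b + 2] + 1].sum() + 1e-10
        for b in range(n_bands)
    ])
    log_energy = np.log(band_energy)
    # Cosine transform of the log Mel spectrum yields the cepstral coefficients.
    k = np.arange(n_ceps)
    m = np.arange(n_bands)
    basis = np.cos(np.pi * np.outer(k, m + 0.5) / n_bands)
    return basis @ log_energy

# Example: 13 cepstral coefficients from a 101-bin short-time power spectrum.
print(np.round(mel_cepstrum(np.ones(101)), 3))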

 

