FIPA | 96/06/18 23:22
FOUNDATION FOR INTELLIGENT PHYSICAL AGENTS | nyws029
Source: Chris Ellis (Aurora Project)
The Aurora Project
The Aurora Project is a joint initiative set up to establish a global standard for distributed speech recognition. A number of companies, including Alcatel, Ericsson, IBM, Matra, Nokia, Nortel and Siemens, are working on the details of a proposed European standard, under the guidance of the European Commission and ETSI. They announced the project at Telecom 95 last October, and are now inviting other suppliers and operators to join the project, in preparation for the second stage, the drafting of a global standard.
Today, we speak to our colleagues, but we have to type to our computers. The primary objective of speech recognition is to give everyone easy access to the full range of computer services and communication systems, without the need to be able to type or to be near a keyboard. By using a client/server approach in combination with the latest recognition systems, distributed speech recognition (DSR) will deliver the price/performance levels and access flexibility that will begin to make this practicable and affordable. As just one example of a spectrum of possible new applications, you will be able to dictate your meeting notes directly into your enhanced cellular handset immediately after a meeting, and the draft text will already be in your personal computer, ready for editing, by the time you return to your office (or hotel room, or home).
Over the last few years, much progress has been made in the development of large vocabulary, continuous speech recognition systems. The latest systems now work with considerable accuracy, provided they are supplied with high-quality speech input. Users require no special techniques (such as leaving gaps between words), and the systems can support large (20,000+ word), generalised vocabularies. However, it is also clear that they will continue to demand high-performance processors and large amounts of RAM to run effectively. This puts full-function speech recognition in a cost class of its own compared with any other office application, and also beyond the capacity of any handheld device.
To improve price/performance, one possibility is to run the speech recognition system on a shared, multi-user server. Systems supporting extended-vocabulary, speaker-independent, discrete-utterance recognition, in applications such as database inquiry and name dialling, operate well over conventional 3.1kHz bandwidth telephone channels, and will be developed further.
However, today's large vocabulary continuous speech recognition systems are usually very sensitive to input quality; they perform well only with a high-quality microphone providing a speech bandwidth of 4.5kHz or more. These systems could achieve high accuracy in a network environment if the network channel limitations could be overcome, particularly those of cellular links.
One answer is to move the speech recognition front-end (i.e., the computationally simple feature extraction module) to the handset end of the line. Imagine a modified digital cellular or cordless phone with a high-quality microphone and a low-power-consumption DSP to carry out the feature extraction. This would be able to pass the feature codes at digital cellular data rates back to powerful, multi-user recognition servers. At relatively low incremental hardware cost, high-quality speech is captured at the handset, transformed into coded data, and transmitted with full error correction to the appropriate recognition server on the other side of the digital cordless or cellular network. The server might be on customer premises, accessed locally or remotely, or in an operator's network, or front-ending an information provider's service. As far as the recognition system in the remote server is concerned, the 'speech data' is just as high in quality as that produced by the co-located front-end of a monolithic system. With a noisy line, the quality of recognition will not degrade; only the response time will suffer slightly as transmission times lengthen due to error correction and retransmission. The output from recognition can be passed back to the user for further text-editing, or directly on to other applications expecting text input, such as the user's word processor back in the office.
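As a rough illustration of the split described above (feature extraction on the handset, coded feature data sent over an error-corrected channel, recognition on a remote server), a minimal sketch in Python follows. The transport (a plain TCP socket), the frame and coefficient counts, the host name and port, and the placeholder feature routine are all assumptions chosen for illustration; they are not details of the Aurora proposal.

    import socket
    import struct

    FRAME_MS = 10        # assumed frame length: one feature vector every 10 ms
    NUM_COEFFS = 13      # assumed number of coefficients per feature vector

    def extract_features(pcm_frame):
        """Placeholder for the handset-side front-end (e.g. cepstral analysis).

        Takes one frame of 16-bit PCM samples and returns NUM_COEFFS values.
        On a real handset this work would run on the low-power DSP.
        """
        # Dummy transform: an energy-like summary repeated NUM_COEFFS times.
        avg = sum(abs(s) for s in pcm_frame) / max(len(pcm_frame), 1)
        return [float(avg)] * NUM_COEFFS

    def send_features(sock, features):
        """Ship one feature vector to the recognition server, length-prefixed."""
        payload = struct.pack(f"<{len(features)}f", *features)
        sock.sendall(struct.pack("<I", len(payload)) + payload)

    def handset_loop(pcm_frames, host="recognition-server.example", port=9000):
        """Handset side: only coded feature data, not raw audio, leaves the device."""
        with socket.create_connection((host, port)) as sock:
            for frame in pcm_frames:
                send_features(sock, extract_features(frame))

The key design point the sketch tries to show is that the handset never transmits raw audio: only the compact, error-protected feature stream crosses the network, so a noisy radio channel affects latency rather than recognition accuracy.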
There is general agreement that the feature extraction front-ends of most speech recognition systems are similar, and that a common design can be achieved covering a substantial range of applications. It is this feature extraction function that drops the effective bit-rate from, say, 160 kilobits per second to around 8 to 16 kilobits per second for further processing. Feature extraction typically needs very little memory, and can run on most of the low-power-consumption DSPs already used in cellular handsets. This leads to the view that it is both desirable and practicable to standardise the feature extraction function, to give us what might loosely be called a 'speech recognition codec'. This standard would not be confined to running over GSM or DECT, but would be relevant wherever a distributed speech recognition system is appropriate (in LAN systems, for example), and it may even be useful in monolithic products. Standardising the front-end of speech recognition could actively encourage innovation in the rest of the system, which is usually software-based. This will also give users continuity of access over time to the very latest speech recognition capability, without the constant need to upgrade or replace their handsets, workstations or local software.
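To make the bit-rate reduction concrete, the short calculation below reproduces figures in the range quoted above. The sampling rate, frame rate, coefficient count and coefficient size are assumed values typical of cepstral front-ends, not parameters taken from the proposed standard.

    # Illustrative bit-rate comparison; all parameter values are assumptions.
    SAMPLE_RATE_HZ = 10_000   # ~4.5 kHz speech bandwidth needs >9 kHz sampling
    BITS_PER_SAMPLE = 16
    raw_bit_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE            # 160,000 bit/s

    FRAMES_PER_SEC = 100      # one feature vector every 10 ms
    COEFFS_PER_FRAME = 13     # typical cepstral front-end size
    BITS_PER_COEFF = 8        # quantised coefficients
    feature_bit_rate = FRAMES_PER_SEC * COEFFS_PER_FRAME * BITS_PER_COEFF  # 10,400 bit/s

    print(f"raw PCM:      {raw_bit_rate / 1000:.0f} kbit/s")
    print(f"feature data: {feature_bit_rate / 1000:.1f} kbit/s")

Under these assumptions the front-end output fits comfortably within digital cellular data rates, which is what makes the distributed arrangement practicable.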
The project coordinator, Mr Chris Ellis, may be contacted on +44 1926 311513.
Mr. Chris Ellis
Jove Communications Ltd.
56 Strathearn Road
Leamington Spa, CV32 5NW
UNITED KINGDOM
Tel.: +44 1926 311513
Fax: +44 1926 311532
Email: chrisellis@bcs.org.uk