"Voice recognition and speech synthesis technologies may not have developed to the degree some science fiction writers hoped, but they have nevertheless seen some startling successes. ... Voice synthesis has been around for a long time. Bell Labs demonstrated a computer-based speech synthesis system running on an IBM 704 in 1961, a demonstration seen by the author Arthur C. Clarke, giving him the inspiration for the talking computer HAL 9000 in his book and film '2001: A Space Odyssey'. Forty-five years later, voice synthesis technology can be found in products as diverse as talking dolls, car information systems and various text-to-speech conversion services, such as the one recently launched by BT. Many of these modern systems can convert text into a computer-synthesised voice of quite respectable quality. ... Voice recognition has turned out to be a much harder task than researchers realised when work began on the problem over forty years ago. However, limited voice recognition applications are starting to creep into everyday use: voice-input telephone menu systems are now commonplace, speech-to-text dictaphones are increasingly used for note-taking by doctors and lawyers, and voice input has started to appear in computer games systems. The success of some of these limited-application voice recognition systems has recently prompted the big software heavyweights, Microsoft and IBM, to make further investments. ... However, there are still a lot of technological hurdles to overcome; to understand what these are, we need to delve further into the technology. ... Speech recognition - Speech recognition, on the other hand, is a much harder task, and commercial off-the-shelf systems have only been available since the 1990s. Because every person's voice is different, and words can be spoken with a range of different nuances, tones and emotions, the computational task of successfully recognising spoken words is considerable, and it has been the subject of many years of continuing research around the world. A variety of approaches are used (dynamic algorithms, neural networks and knowledge bases), with the most widely used underlying technology being the Hidden Markov Model. These techniques all attempt to search for the most likely word sequence, given that the acoustic signal will also contain a lot of background noise."
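To make the Hidden Markov Model approach concrete, the sketch below runs Viterbi decoding over a toy model in Python. Everything in it (the two-word 'yes'/'no' vocabulary, the crude acoustic labels, and all probabilities) is an invented illustration, not the method of any particular recogniser; a real system would compute per-frame acoustic likelihoods over phoneme-level models rather than look up a handful of symbols:

    # Minimal Viterbi decoding over a Hidden Markov Model: find the most
    # likely hidden state (word) sequence given a sequence of noisy
    # acoustic observations. All values below are toy numbers.

    states = ["yes", "no"]                     # hypothetical word models
    start_p = {"yes": 0.5, "no": 0.5}          # initial state probabilities
    trans_p = {                                # word-to-word transitions
        "yes": {"yes": 0.7, "no": 0.3},
        "no":  {"yes": 0.4, "no": 0.6},
    }
    emit_p = {                                 # P(acoustic label | word)
        "yes": {"s_sound": 0.6, "o_sound": 0.1, "noise": 0.3},
        "no":  {"s_sound": 0.1, "o_sound": 0.6, "noise": 0.3},
    }

    def viterbi(obs):
        """Return the most likely state sequence for the observations."""
        # best[t][s] = (probability of the best path ending in state s
        #               at time t, predecessor state at time t-1)
        best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
        for o in obs[1:]:
            best.append({
                s: max(
                    (best[-1][prev][0] * trans_p[prev][s] * emit_p[s][o], prev)
                    for prev in states
                )
                for s in states
            })
        # Backtrack from the most probable final state.
        last = max(states, key=lambda s: best[-1][s][0])
        path = [last]
        for t in range(len(best) - 1, 0, -1):
            path.append(best[t][path[-1]][1])
        return list(reversed(path))

    print(viterbi(["s_sound", "noise", "o_sound"]))  # ['yes', 'yes', 'no']

The search illustrates the point the excerpt makes about noise: the 'noise' frame on its own favours neither word, so the decoder relies on the transition probabilities and the surrounding frames to pick the most plausible overall sequence.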
Are you talking to me? Speech recognition: Technology that understands human speech could be about to enter the mainstream. The Economist Technology Quarterly (June 7, 2007). "Speech recognition has taken a long time to move from the laboratory to the marketplace. Researchers at Bell Labs first developed a system that recognised numbers spoken over a telephone in 1952, but in the ensuing decades the technology has generally offered more promise than product, more science fiction than function. ... Optimistic forecasts from market-research firms also suggest that the technology is on the rise. ... An area of great interest at the moment is in that of voice-driven 'mobile search' technology, in which search terms are spoken into a mobile device rather than typed in using a tiny keyboard. ... The resulting lower cost and greater reliability mean that speech-based systems can even save companies money. Last August, for example, Lloyds TSB, a British bank, switched all of its 70m annual incoming calls over to a speech-recognition system based on technology from Nuance and Nortel, a Canadian telecoms-equipment firm. ... Another promising area is in-car use. ... There are military uses, too. ..."
Computer Vision and Speech. "If you are interested in computers with human capabilities, vision and speech open an entirely new world of computers that can see and talk like we do. Computer vision is the moody input cousin of computer graphics: in graphics, you have all the time you can afford to program the rendering, but visual input is an unpredictable and messy reality. Computer speech is both input and output, as in systems capable of spoken dialogue. Viewed as enabling technologies, computer speech arguably holds the lead over computer vision. Even though a speech signal is enormously rich in information, and we are still far from mastering important aspects of it, such as online recognition and generation of speech prosody, it is still much easier to shut up the people in a room in order to get a clear speech signal than it is to control the room's lighting conditions and to identify and track all of its 3-D contents independently of the viewing angle. Given the state of the art, it makes good sense that the papers in this issue of Crossroads are about speech or vision. Two articles address different stages of the process of making computers understand what is commonly called the speaker's communicative intention, i.e., what the speaker really wishes to say by uttering a sequence of words. Deepti Singh and Frank Boland [Voice Activity Detection] discuss approaches to the important pre-(speech)-recognition problem of detecting if and when the acoustic signal includes speech in the first place. ... Nitin Madnani's introduction [Getting Started on Natural Language Processing with Python] to natural language processing, or NLP, is likely to tempt computer scientists to try out NLP for themselves. ..."
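Voice activity detection, the pre-recognition problem Singh and Boland discuss, lends itself to a small worked example. The following is a minimal energy-based sketch under stated assumptions: the frame length, threshold and toy signal are invented, and real detectors combine frame energy with spectral and periodicity features rather than relying on a single fixed threshold:

    # Minimal energy-based voice activity detection (VAD): decide, frame
    # by frame, whether the acoustic signal contains speech at all.

    import math

    def frame_energies(samples, frame_len=160):
        """Split the signal into frames and compute each frame's RMS energy."""
        frames = [samples[i:i + frame_len]
                  for i in range(0, len(samples), frame_len)]
        return [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames if f]

    def detect_speech(samples, threshold=0.1, frame_len=160):
        """Return a per-frame list of booleans: True where speech is likely."""
        return [e > threshold for e in frame_energies(samples, frame_len)]

    # Toy signal: one frame of near-silence, then one louder 'voiced' frame.
    signal = [0.01] * 160 + [0.5 * math.sin(i / 3.0) for i in range(160)]
    print(detect_speech(signal))  # [False, True]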
"Speech recognition software matches strings of phonemes -- the sounds that make up words -- to words in a vocabulary database. The software finds close matches and presents the best one. The software does not understand word meaning, however. This makes it difficult to distinguish among words that sound the same or similar. The Open Mind Common Sense Project database contains more than 700,000 facts that MIT Media Lab researchers have been collecting from the public since the fall of 2000. These are based on common sense like the knowledge that a dog is a type of pet rather than the knowledge that a dog is a type of mammal. The researchers used the phrase database to reorder the close matches returned by speech recognition software. ... 'One surprising thing about testing interfaces like this is that sometimes, even if they don't get the absolutely correct answer, users like them a lot better,' said [Henry] Lieberman. 'This is because they make plausible mistakes, for example 'tennis clay court' for 'tennis player', rather than completely arbitrary mistakes that a statistical recognizer might make, for example 'tennis slayer','