Speech recognition refers to a computer's ability to understand the words you say: it translates the sounds of your voice into predefined words. Don't confuse it with voice recognition, which identifies speakers based on their speaking styles. We all have very distinct ways of speaking; that is why your son sounds different from your wife on the phone. The human voice is like a fingerprint: it is specific to you, and voice recognition technology lets computers pick out its unique characteristics and match them to speakers. It is used mainly for purposes such as biometric authentication. In short, voice recognition helps a computer identify the person who uttered the words, while speech recognition lets the computer understand what was said, not who said it.
Speech recognition has long been considered a better input method than alternatives like the mouse and keyboard. Voice is a natural interface, and voice assistance is going to be pervasive in our lives. Instead of communicating by typing or pointing and clicking, it is more natural and easier to interact with a system through spoken words. A voice-enabled Internet could empower a whole new set of users, such as the visually impaired, people who cannot read or write, children and the elderly.
Being used as an input medium is just one simple use case of speech recognition; there are several other exciting applications of this technology. For instance, captioning video content, which is currently done by human experts, is a valuable application.
The goal of a speech recognition tool is that if you say 'hello world', or give it the raw audio wave of those words recorded on a device, it should print out the phrase 'Hello world'. Though you still need a human for perfectly accurate transcription, recent advances in speech recognition technology offer much better accuracy in comprehending speech. The main factor making this possible is the latest progress in machine learning.
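To see what that raw input actually looks like, here is a minimal sketch using only Python's standard library. Rather than recording real speech, it synthesises a short tone (the filename tone.wav and the parameter choices are arbitrary, made up for illustration); the point is that the audio reaching a recognition engine is just a long stream of numeric samples.

```python
# A raw audio wave is simply a sequence of numbered samples.
# We synthesise half a second of a 440 Hz tone and save it as a
# mono WAV file -- the kind of file a recogniser would take as input.
import math
import struct
import wave

SAMPLE_RATE = 16000   # samples per second, common for speech audio
DURATION = 0.5        # half a second of audio
FREQ = 440.0          # tone frequency in Hz

# Generate 16-bit signed samples of a sine wave
n_samples = int(SAMPLE_RATE * DURATION)
samples = [int(32767 * math.sin(2 * math.pi * FREQ * i / SAMPLE_RATE))
           for i in range(n_samples)]

# Write them out as a mono, 16-bit WAV file
with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)        # mono
    f.setsampwidth(2)        # 2 bytes = 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes(struct.pack("<" + "h" * n_samples, *samples))

print(n_samples, "samples written")
```

A speech recognition engine's job is to map sequences of samples like these to words, which is far harder than producing them.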
Machine learning (ML) is generally described as a field of study that gives computers the ability to learn without being explicitly programmed. Here you don't write rules to create a solution (such as 'if the photograph contains such and such features, it is a cat' or 'if the message contains these words or originates from this source, mark it as spam'). Instead, you supply data, and the ML system learns from it and generates a model that can automatically recognise similar content. For instance, if you feed the system a set of human-labelled images of different types of fish, it will produce a model that can distinguish fish images from other images. If you provide a database of labelled images of handwritten words, you can generate a system that recognises handwriting. ML is invading every aspect of our lives, and one area where it is having a huge impact is speech recognition.
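The spam example above can be sketched in a few lines of Python using the scikit-learn library (assumed to be installed); the messages and labels here are toy data invented purely for illustration. Note that no hand-written rules appear anywhere: the model infers which words matter from the labelled examples.

```python
# Learning from labelled data instead of writing rules:
# a tiny spam classifier trained on four hand-labelled messages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy hand-labelled training set: 1 = spam, 0 = not spam
messages = [
    "win a free prize now",
    "limited offer claim your free reward",
    "meeting rescheduled to monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]

# Turn raw text into word-count features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# The model learns which words correlate with each label
model = MultinomialNB()
model.fit(X, labels)

# Classify a new, unseen message
prediction = model.predict(vectorizer.transform(["claim your free prize"]))[0]
print("spam" if prediction == 1 else "not spam")
```

A real spam filter would of course be trained on millions of messages; the principle, though, is exactly the same, and it carries over directly to mapping labelled audio to text.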
If you wish to create a speech recognition tool that can accurately transcribe audio into text, you need to feed it a huge database of human-labelled audio; the larger the corpus, the better the models get. Given the complexities in the way people speak, building such a corpus is not an easy task. Every language is spoken with a large variety of speech patterns, speeds and accents, and even within a single language we generally find different accents and styles of speech.
A good speech recognition engine should be able to comprehend voice inputs from all kinds of people, which means the audio database should contain recorded speech from people across different countries and different age and gender groups. This is where Mozilla's open source project, Common Voice, assumes significance.
Common Voice is Mozilla's initiative to crowdsource a large public dataset of human voices covering all languages, accents and genders. The Mozilla Foundation launched this audio library project to make it easier to create voice-controlled applications. The project's goal is to gather thousands of hours of audio in a variety of accents and make it freely available to anyone who needs it. It uses crowdsourcing to collect data from the community, and is set to become the world's largest repository of human voice data for machine learning. Because anyone can contribute, the project captures a wide variety of voices reflecting the real world: different accents, old and young, male and female. It has already collected 365 validated hours of data so far.
How to contribute?
When you access the site, you will find a page with two sections. One section lets you donate your voice if you have a microphone: click the 'Speak up' button, read the displayed sentence aloud, and create an audio snippet by recording your voice.
To submit an audio snippet, you need to create a profile. With each submission, you record three sentences. After recording, the application gives you a chance to review the clips; if you are satisfied with the output, submit them to Common Voice.
Every recording also needs to be verified by independent contributors, so if you don't want to record anything, you can still contribute to this open source project by validating sentences: read the displayed sentence, listen to someone's recording of it, and then accept or reject it.