What exactly is Speech Recognition?
What you need to know about speech recognition
When we talk about speech recognition, we usually mean software that can recognize spoken words and write them down, so that in the end you have everything that was said in written form. It is also often referred to as “speech-to-text”. Early software of this kind was very limited and could convert only a small number of phrases. Over time, the technology behind speech recognition has become much more sophisticated, and it can now recognize different languages and even different accents. But of course, there is still work to be done in this field.
It is also important to note that speech recognition isn’t the same as voice recognition, even though people sometimes use the two terms interchangeably. Voice recognition is used to identify the person who is speaking, not to capture what was said.
A short history of speech recognition and related technology
In this article, we will briefly explain the history and technology behind the rise of speech recognition.
Ever since the dawn of the digital age, people have wanted to communicate with machines. After the first digital computers were invented, numerous scientists and engineers tried in various ways to bring speech recognition to them. A crucial year in this process was 1962, when IBM revealed Shoebox, a basic speech recognition machine that could do simple arithmetic. When the user spoke into a microphone, the machine could recognize the digits zero through nine and six control words such as “plus” or “minus”. Over time, the technology developed, and today interacting with computers by voice is a very common feature. Well-known voice assistants such as Siri and Alexa rely on speech recognition. It is important to note that these voice-driven devices depend on artificial intelligence (AI) and machine learning.
When artificial intelligence (AI) is mentioned, it might sound like something from a science fiction movie, but the truth is that AI plays a great role in our world today. In fact, it is already very present in our everyday life, since many programs and apps use it. In the first half of the 20th century, however, thinking machines were still the stuff of science fiction. That changed in the 1950s, when the concepts of AI became more prominent and attracted the interest of many scientists and philosophers. In 1950, the ambitious British mathematician Alan Turing proposed that machines could solve problems and make decisions by themselves, based on the information available to them. The problem was that computers could not yet store that data, which is a crucial step in the development of artificial intelligence. All they could do back then was execute simple commands.
Another important name in the development of AI is John McCarthy, who coined the term “artificial intelligence”. McCarthy described AI as “the science and engineering of making intelligent machines”. The term came to prominence at a seminal conference at Dartmouth College in 1956. From then on, AI started to develop at a rapid pace.
Today, artificial intelligence in its various forms is present everywhere. It has reached mass adoption, mainly thanks to the growing volume of data exchanged worldwide every day, more advanced algorithms, and improvements in storage and computing power. AI is used for many purposes, for example translation, transcription, speech, face and object recognition, analysis of medical images, natural language processing, various social network filters and so on. Remember the 1997 chess match between grandmaster Garry Kasparov and IBM’s Deep Blue chess computer?
Machine learning is a very important branch of artificial intelligence. In short, it refers to systems that can learn and improve from their own experience, that is, from data. This works through the recognition of patterns. To do that, the system has to be trained: its algorithm receives large amounts of data as input, and at some point it becomes able to identify patterns in that data. The end goal of this process is to enable these computer systems to learn independently, without the need for human intervention or assistance.
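To make the idea of training a little more concrete, here is a minimal sketch in Python of a perceptron, one of the simplest learning algorithms. It is only an illustration: the toy data, the function names and the parameters are assumptions made for this example, and real speech recognition systems use far larger models and data sets.

```python
def predict(weights, bias, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Adjust weights and bias whenever a prediction disagrees with its label."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)
            # No change when the prediction is right (error == 0).
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical toy data: the label is 1 only when both inputs are 1.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, s) for s in samples]
```

Nobody tells the program the rule. It starts with all weights at zero and corrects itself each time it gets an example wrong, until its predictions match the labels.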
Another thing that is very important to mention alongside machine learning is deep learning. One of the most important tools in deep learning is the so-called artificial neural network. These networks are advanced algorithms, loosely modeled on the structure and function of the human brain. However, they are static and symbolic, unlike the biological brain, which is plastic and more analogue. In short, deep learning is a very specialized form of machine learning, primarily based on artificial neural networks. The goal of deep learning is to closely replicate human learning processes. Deep learning technology is very useful, and it plays an important role in various devices that are controlled by voice: tablets, TVs, smartphones, fridges and so on. Artificial neural networks are also used in recommendation systems, a kind of filtering system that aims to predict the items a user is likely to buy in the future. Deep learning is also widely used in the medical field. It is very important to cancer researchers, because it can help to detect cancer cells automatically.
Now let us come back to speech recognition. This technology, as we already mentioned, aims to identify the words and phrases of spoken language and then convert them into a format that a machine can read. Basic programs identify only a small number of key phrases, but more advanced speech recognition software can decipher all kinds of natural speech. Speech recognition is convenient in most cases, but it sometimes runs into problems when the quality of the recording is poor or when background noise makes it difficult to understand the speaker. It may also struggle when the speaker has a strong accent or speaks a dialect. Speech recognition is constantly developing, but it is still not quite perfect. Not everything is about words, either. Machines still cannot do many things that humans can: for example, they are not able to read body language or the tone of someone’s voice. However, as these advanced algorithms process more data, some of these challenges seem to become less difficult.
Who knows what the future will bring? It is hard to predict where speech recognition will end up. For example, Google is already having a lot of success in implementing speech recognition in Google Translate, and the machines are constantly learning and developing. Maybe one day they will replace human translators completely. Or maybe not: everyday speech situations may be too complex for any machine that cannot read the depths of the human soul.
When to use speech recognition?
Nowadays almost everyone has a smartphone or a tablet, and speech recognition is a common feature of those devices. It is used to convert a person’s speech into action. If you want to call your grandmother, it is enough to say “call Grandma”, and your smartphone dials the number without you having to scroll through your contact list. This is speech recognition. Other good examples are Alexa and Siri, which have this feature built in. Google also gives you the option to search for anything by voice, without typing.
Maybe you are now curious about how all of this works. Well, for it to work, sensors such as microphones have to be built into the device, so that the sound waves of the spoken words can be picked up, analyzed and converted to a digital format. The digital information then has to be compared with other information that is stored in some sort of repository of words and expressions. When there is a match, the software can recognize the command and act accordingly.
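The matching step can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how any particular product works: the repository, the three-number “feature vectors” and the distance threshold are all made up for the example, and modern systems rely on statistical models rather than a simple lookup.

```python
import math

# Hypothetical repository: each known command is stored as a made-up
# feature vector standing in for its digitized sound.
TEMPLATES = {
    "call grandma": [0.9, 0.1, 0.4],
    "play music": [0.2, 0.8, 0.5],
    "stop": [0.1, 0.1, 0.9],
}


def recognize(features, templates=TEMPLATES, max_distance=0.3):
    """Return the closest stored command, or None if nothing is close enough."""
    best_command, best_distance = None, math.inf
    for command, template in templates.items():
        distance = math.dist(features, template)  # Euclidean distance
        if distance < best_distance:
            best_command, best_distance = command, distance
    return best_command if best_distance <= max_distance else None


heard = recognize([0.85, 0.15, 0.4])    # close to the "call grandma" template
unknown = recognize([0.5, 0.5, 0.5])    # not close to any stored template
```

The threshold matters: without it, the software would always pick some command, even when the input sounds like nothing it knows.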
One more thing that needs to be mentioned at this point is the so-called WER (word error rate). It is calculated by dividing the number of errors (words that were replaced, left out or wrongly added) by the total number of words that were actually spoken. So, to put it in simple terms, it is a measure of accuracy. The goal is of course to have a low WER, because this means that the transcription of the spoken word is more accurate.
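Here is a minimal sketch in Python of how a word error rate can be computed. It assumes the usual approach of counting the smallest number of word substitutions, deletions and insertions needed to turn the correct text into the transcript. The function name and the simple lowercase, whitespace-based word splitting are choices made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: a spoken word is missing
                curr[j - 1] + 1,     # insertion: an extra word was added
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of three spoken words gives a WER of one third.
wer = word_error_rate("call grandma now", "call grandpa now")
```

Note that the result can be greater than 1 if the transcript contains many extra words, because the number of errors is divided by the number of words in the correct text, not in the transcript.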
Speech recognition is now more in demand than ever. If you need to convert the spoken word from, let’s say, a recorded audio file to text, you can turn to Gglot. We are a transcription service provider that offers accurate transcriptions for a fair price. So, don’t hesitate to get in touch via our user-friendly website.