Why has it taken until the last few years for speech recognition to be adopted in day-to-day use? The technology has many hidden industrial applications, but as a real-time user interface for day-to-day use, i.e. talking to your computer, adoption has been unbelievably slow. When I was studying in the 90s, I read about a sort of reverse Turing test, which demonstrated one reason why. Volunteers believed they were talking to a computer, but responses were actually provided by a human being typing “behind the curtain”. The observations and subsequent interviews showed that, back then, people simply didn’t like it.
So, what’s the problem?
We have a Google Home in the house, and we basically only use it to set kitchen timers and find out the outside temperature (so we know how many layers to put on – we live on the Arctic Circle, and -25 to -30°C is normal). That’s it. I don’t see much of a use for anything else, as our computers and smartphones are both easier to use and faster than any voice assistant or voice input.
The key to modern voice assistants is that they are basically glorified command line interfaces – they need a command and parameters. What makes them so hard to use is that these commands and parameters are pretty much entirely undiscoverable and ever-changing, unlike actual command line interfaces where they are easily discoverable and static. If voice input and voice assistants really want to take off, we’ll need to make some serious advances in not just recording our voices and mapping them to commands and parameters, but in actually understanding what we as humans are saying.
We’re a long way off from that.
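To make the “glorified command line” point concrete, here is a minimal sketch of the command-plus-parameters structure I mean. The phrase patterns and command names are made up purely for illustration; no real assistant is this simple, but the shape is the same.

```python
import re

# Hypothetical phrase patterns: each maps a fixed template to a
# command name plus the parameters ("slots") it expects.
COMMANDS = [
    (re.compile(r"set (?:a )?timer for (?P<minutes>\d+) minutes?"), "set_timer"),
    (re.compile(r"what(?:'s| is) the temperature outside"), "outside_temperature"),
    (re.compile(r"play (?P<title>.+)"), "play_media"),
]

def interpret(utterance: str):
    """Map a transcribed utterance to (command, parameters), or None."""
    text = utterance.lower().strip()
    for pattern, command in COMMANDS:
        match = pattern.fullmatch(text)
        if match:
            return command, match.groupdict()
    return None  # anything off-script is simply "not understood"

print(interpret("Set a timer for 10 minutes"))
# ('set_timer', {'minutes': '10'})
print(interpret("Remind me when the pasta is done"))
# None: no pattern covers it, and nothing tells the user that
```

That pattern list is the whole “vocabulary”, and it can change with any server-side update; unlike man pages or --help output, there is nothing the user can query to discover it.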
I admit I’m getting kinda tired of being asked “Do you want to read the Xfce Terminal manual online?” every time I fat-finger my laptop keyboard. Or, when using MS Office, of the text under my mouse suddenly turning upside down because some UX designer thought that was a cool demo 15 years ago. But at least the old-school interfaces don’t tell me “I’m sorry Dave, I cannot allow you to jeopardize the mission.”
May I remind everyone of the Katalavox project (Katala = Catherine, in Alsace), which has existed since 1985:
https://www.lemonde.fr/archives/article/1985/10/30/le-katalavox-sera-fabrique-aux-etats-unis_2736506_1819218.html
https://cpcrulez.fr/games-div-martine_kempf.htm
https://en.wikipedia.org/wiki/Martine_Kempf
So, I’d better say 35 years instead of 25…
I think the main point is that the PC needs different input depending on the use case.
One of the big ones is that speech is loud. I can use my laptop in front of my family without anyone really being interrupted. If I had to talk for everything, I couldn’t reasonably do that. The other, as mentioned in the article, is that multi-tasking is easier with keyboard or mouse input.
But I have found some speech recognition to be really useful. For one, Android Auto in the car has been great.
I also have IPTV and love the fact that I can just say “Play Seinfeld”, and it takes me to something appropriate instead of trying to navigate the horrible TV menu system. And by and large it’s pretty good, at least for my expectations.
As mentioned though, my two biggest annoyances are:
1. Discoverability. Most don’t make it easy to ‘learn’ all the possible voice commands.
2. Prompts when there is ambiguity are hard to navigate as well. For example, on Android Auto, if I want to call or message a person and I have more than one number for them, it’s painful to walk through the process where it reads out each number entry and then prompts me to say which one (I still have trouble navigating that). And it has no context: if I just sent a WhatsApp message to Bob’s mobile 1, it doesn’t realize that the next message to Bob should probably go to mobile 1 as well (see the sketch below).
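Just to illustrate what I mean by context, here is a rough sketch; the contact data and the helper are made up, and this is not how Android Auto actually works.

```python
# Hypothetical contact book: one person, several numbers.
CONTACTS = {
    "bob": {"mobile 1": "+1-555-0101", "mobile 2": "+1-555-0102"},
}

# Remember which number was used last per contact, so a follow-up
# "message Bob" doesn't trigger the whole disambiguation dialogue again.
last_used: dict[str, str] = {}

def pick_number(name: str, ask) -> str:
    numbers = CONTACTS[name]
    if len(numbers) == 1:
        label = next(iter(numbers))
    elif name in last_used:
        label = last_used[name]      # reuse the previous choice as context
    else:
        label = ask(list(numbers))   # only prompt the first time it's ambiguous
    last_used[name] = label
    return numbers[label]

# The first call has to ask; the second silently reuses "mobile 1".
print(pick_number("bob", ask=lambda options: options[0]))
print(pick_number("bob", ask=lambda options: exit("should not prompt again")))
```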
But I have started to use voice input in limited ways. I’m a bit paranoid, so I only use the assistants where you have a microphone button to press before it starts listening.
Computers should go back to being just a shell and an OS, with voice commands translated into typing on a terminal with user-defined aliases. Anything else is a cop-out: either voice recognition is a peripheral no different from the keyboard or mouse, or you sell an entire machine that ships with, and is made to be used entirely through, voice recognition, like a 750k self-driving car or something.
“The key to modern voice assistants is that they are basically glorified command line interfaces – they need a command and parameters. What makes them so hard to use is that these commands and parameters are pretty much entirely undiscoverable and ever-changing, unlike actual command line interfaces where they are easily discoverable and static. If voice input and voice assistants really want to take off, we’ll need to make some serious advances in not just recording our voices and mapping them to commands and parameters, but in actually understanding what we as humans are saying.”
Wouldn’t that be solved with smarter machine learning which “understands” what the user wants?
Judging by things like GPT-3 we are making some progress.
I remember using Siri on a desktop computer for the first time.
I couldn’t use the keyboard but I had a mouse, so I tried asking Siri to help troubleshoot the keyboard:
Me: “Siri, my keyboard isn’t working.”
Siri: “I didn’t understand that.”
Me: “I can’t use my keyboard.”
Siri: “Ok, you can’t.”
Certainly the tech is there — has been there for some time.
My only guess as to why we don’t talk to computers in a Star Trek-like way yet is business interests and other politics that may have squelched opportunities along the way.
My vision of talking to a computer is something along the lines of
“Computer, compile this program for me, with libraries lmath and lpthread”
“Computer, list all processes active, and filter out those starting with a”
“Computer, remove all files ending in dot pdf from directory Downloads”
Which I’ll admit gets old quick if you’re giving commands to your computer every 5 seconds, like typing messages on Facebook, switching between tabs of documentation because your attention span is so low you can’t remember something you read 30 minutes ago, ordering something from Amazon because going to a store to buy it is too much of a hassle, etc. Clearly anyone used to interacting with computers via a GUI will be forced to admit a graphical interface is more efficient than spending five minutes thinking about the proper command to give the computer; at the very least, it is easier to use.
But at the same time, a chip that recognizes a person’s voice and transcribes it into words perfectly shouldn’t be impossible; filtering on that person’s specific voice so that other people can’t mess with the PC also shouldn’t be impossible; and another chip that turns those words into shell commands shouldn’t be a problem. All that’s left is for programmers to have the skills to use a system where they basically give out one order every ten minutes. I’d say the main problem is that to be able to talk to your computer, you need the computer to be sane and not running a couple thousand processes at a time. It’s hard enough to talk to someone who’s pretending to listen while thinking about something else; how hard, then, must it be to talk to a PC that’s doing 10,000 other things while pretending to talk to you?
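For what it’s worth, the “turn those words into shell commands” step really isn’t the hard part once the phrasings are fixed. A toy translation table for the three spoken commands above could look like the sketch below; the patterns and the resulting commands are mine, purely illustrative and brittle by design, and “program.c” stands in for whatever “this program” refers to.

```python
import re
import shlex

# Toy translation table for the three spoken commands above.
# Each rule naively rewrites one fixed phrasing into one shell command.
RULES = [
    (re.compile(r"compile this program for me, with libraries (?P<libs>.+)", re.IGNORECASE),
     lambda m: "gcc program.c " + " ".join("-" + lib for lib in m["libs"].replace(" and ", " ").split())),
    (re.compile(r"list all processes active, and filter out those starting with (?P<prefix>\w+)", re.IGNORECASE),
     lambda m: f"ps -e -o comm= | grep -v '^{m['prefix']}'"),
    (re.compile(r"remove all files ending in dot (?P<ext>\w+) from directory (?P<dir>\w+)", re.IGNORECASE),
     lambda m: f"rm {shlex.quote(m['dir'])}/*.{m['ext']}"),
]

def to_shell(spoken: str):
    """Translate one fixed spoken phrase into a shell command, or None."""
    text = re.sub(r"^computer,\s*", "", spoken, flags=re.IGNORECASE)
    for pattern, build in RULES:
        match = pattern.fullmatch(text)
        if match:
            return build(match)
    return None

print(to_shell("Computer, compile this program for me, with libraries lmath and lpthread"))
# gcc program.c -lmath -lpthread
print(to_shell("Computer, remove all files ending in dot pdf from directory Downloads"))
# rm Downloads/*.pdf
```

The brittleness is the point: say the sentence any other way and you get nothing back, which is exactly the discoverability problem again.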
The whole concept of speech technology is based on the wrong premise that all of us speak and pronounce English perfectly. We don’t. Personal assistants will never be useful unless they learn how we speak and pronounce words individually, which means we would have to spend a lot of time teaching our computers how to understand us. Apart from some simple commands and questions, Siri and Cortana don’t understand me. At all. They should be able to learn my personal language, and only then will we see some progress.