Seven years and eight months have passed since the release of the first really popular commercial virtual assistant (VA). Yet, after all this time, virtual assistants do only marginally better.
Sure, they understand better, they speak better, they have learned some new tricks; but in the end, they are still a funny but useless experience. After the first fun moments of experimentation when you start talking to them – that is, when you keep asking them for silly jokes or dumb questions – they quickly go back to being pretty dumb objects. I am pretty sure that the vast majority of users use a VA just for timers, weather and – occasionally – asking for the events on their calendar.
If we look back, we can see that the world of VAs has experienced seven years of winter. Nothing has really changed, and we still use Siri/Alexa/Google Assistant for the same things as ever.
Companies are still pushing hard in the VA direction (in Italy, Google Home and Alexa arrived only recently, and TV ads are frequent) and I get why: the winter may end at any moment. In AI there is always the sensation that we are one groundbreaking discovery away.
Virtual Assistants are still plagued by four unsolved AI problems. While we wait for the winter to pass, let’s look at them.
Natural Language is Not Natural
The selling point of Virtual Assistants is that you can interact with them in natural language. However, if you have a VA at home, you know that natural interaction is not natural at all.
Talking with a Virtual Assistant is like talking to a dog. You do not talk to them as you talk to another human being; after some time, you start adopting a very specific language pattern.
Me: Alexa, turn on the light at 100%.
Alexa: Ok.
Me: Alexa, turn down “night table” by 50%.
Alexa: Ok.
When I interact with my lighting system, I need to start talking like a robot. The command does not work if I combine both parts or if I do not use this exact grammatical structure.1 The result is not natural: it is a weirdly specific, despotic command, totally different from the way I would talk to a human for the same task.
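To see why, here is a minimal sketch of how a rigid, template-based command matcher might work. Everything below is hypothetical and heavily simplified – not Alexa’s actual implementation, just the general shape of pattern-based intent matching:

```python
import re

# Hypothetical, simplified intent templates: each utterance must match
# one rigid pattern exactly, or the assistant gives up.
INTENT_PATTERNS = [
    (re.compile(r"turn on (?P<device>.+) at (?P<level>\d+)%"), "set_level_on"),
    (re.compile(r"turn down (?P<device>.+) by (?P<level>\d+)%"), "decrease_level"),
]

def match_intent(utterance: str):
    """Return (intent, slots) if the utterance fits a template, else None."""
    for pattern, intent in INTENT_PATTERNS:
        m = pattern.fullmatch(utterance.lower().strip())
        if m:
            return intent, m.groupdict()
    return None  # anything off-template fails, however natural it sounds

print(match_intent("turn on the light at 100%"))
# ('set_level_on', {'device': 'the light', 'level': '100'})
print(match_intent("turn on the light and dim the night table"))
# None: combining two commands breaks the rigid template
```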
The problem is magnified by the fact that VAs constantly add confirmation messages after each command, because their understanding is not 100% reliable. This constant ping-pong of acknowledgment chit-chat (Alexa, do that; Ok, I have done that; and so on, forever) is really unnatural.2
And God forbid you dare to give multiple commands at once! A natural command is a long sequence of requests: “Alexa, set up a meeting with Joe at 3 pm, call Sue and tell her I’ll be late to dinner, and then make sure my kids know that I’ll be working late and dinner is in the fridge they just need to put it in the microwave for 10 minutes on medium.”3 Instead, with current Virtual Assistants you need to spell out each command one at a time, probably repeating “Alexa” a lot. It is frustrating.
Long-Term Context
The above problem is rooted in the very limited cognitive abilities of Virtual Assistants. They have limited conceptual reasoning capabilities. The biggest limitation, in my opinion, is that they are bad at understanding and using context, especially medium- to long-term context.
Me: Siri, when is the next Manchester United game?
Siri: It’s on <some date>.
Me: Ok, put it on my calendar.
Siri: Ok. What do you want me to put on your calendar?
Google Assistant is better at this, but not by much. Virtual Assistants should be able to catch most information from context. Would it not be cool if we could ask our Virtual Assistant things like “Hey, delete the event I set this morning.” or “About that, can you set an appointment for when it opens?”, where that is whatever you were talking about in the recent previous interaction?
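A minimal sketch of what such context tracking could look like – everything here is a toy of my own invention, while real assistants use far more sophisticated dialogue-state tracking:

```python
from collections import deque

class DialogueContext:
    """Toy short-term memory: remembers recently mentioned entities so
    that pronouns like "it" or "that" can be resolved later."""

    def __init__(self, max_items: int = 10):
        self.recent = deque(maxlen=max_items)

    def mention(self, entity: str):
        self.recent.appendleft(entity)

    def resolve(self, utterance: str):
        # Naive heuristic: a pronoun refers to the most recent entity.
        if any(p in utterance.lower().split() for p in ("it", "that")):
            return self.recent[0] if self.recent else None
        return None

ctx = DialogueContext()
ctx.mention("next Manchester United game on <some date>")
print(ctx.resolve("Ok, put it on my calendar"))
# 'next Manchester United game on <some date>'
```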
Context is also a big privacy issue, especially for cloud-based VAs: context works fine if the VA can store and retrieve any detail of our past interactions. It would be cool if we could own such data.
Conversation Recovery
Context is hard and ambiguous. That’s why Virtual Assistants should be able to do a more natural, context-based conversation recovery. The most frustrating thing to do with a VA is correcting something they did not understand. The fastest way is usually to start over.
That’s not fine. Instead of making me repeat the whole command from the start, a Virtual Assistant should ask for the specific thing it did not understand. If I ask “Add a meeting tomorrow at 4pm with title Developers Meeting” and I stumble on the time, I want the VA to ask me “Sorry, I didn’t get it. At what time?”; if I stumble on the title, it should ask me for the title again. Do not guess, and do not fail completely.
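Slot-filling dialogue systems already suggest the right shape for this. Here is a sketch of the idea; the slot names, prompts, and flow are made up for illustration:

```python
# Hypothetical slot-filling loop: re-ask only for the slot that failed,
# instead of forcing the user to repeat the whole command.
REQUIRED_SLOTS = {
    "date": "On what day?",
    "time": "Sorry, I didn't get it. At what time?",
    "title": "What should the title be?",
}

def fill_meeting_slots(understood: dict, ask) -> dict:
    """`understood` holds the slots parsed so far; `ask` prompts the user."""
    for slot, reprompt in REQUIRED_SLOTS.items():
        while not understood.get(slot):
            understood[slot] = ask(reprompt)  # targeted re-prompt
    return understood

# The VA stumbled on the time but kept everything else it understood:
answers = iter(["4 pm"])
slots = fill_meeting_slots(
    {"date": "tomorrow", "time": None, "title": "Developers Meeting"},
    ask=lambda prompt: (print(prompt), next(answers))[1],
)
print(slots)
# {'date': 'tomorrow', 'time': '4 pm', 'title': 'Developers Meeting'}
```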
No Common Sense
I will explain using Boris Katz’s4 words:
Say your robot is helping you pack, and you tell it: “This book would not fit in the red box because it is too *small*.” Clearly, you want your robot to understand that the red box is too small, so that you can continue to have a meaningful conversation. However, if you tell the robot: “This book would not fit in the red box because it is too *big*,” you want your robot to understand that the book is too big.
Common Sense is the hardest thing to teach a machine. You can get some common sense understanding using statistical learning (for instance, by mapping several words to the concept of smallness). Statistical Machine Learning is actually pretty good at encoding knowledge that we cannot explain very well (e.g., what the difference is between a dog and a cat).
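As a toy illustration of that “mapping words to smallness” idea: with hand-made 2D vectors standing in for learned embeddings, words used in similar contexts land close together. The words and numbers below are invented for the example; real systems learn hundreds of dimensions from text statistics:

```python
import math

# Toy, hand-made 2D "embeddings" for a handful of size words.
vectors = {
    "small": (0.9, 0.1),
    "tiny": (0.85, 0.2),
    "little": (0.8, 0.15),
    "big": (-0.9, 0.1),
    "huge": (-0.85, 0.2),
}

def cosine(a, b):
    """Cosine similarity: 1 means same direction, -1 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# "tiny" maps close to "small", and far from "big":
print(cosine(vectors["tiny"], vectors["small"]))  # ~0.99
print(cosine(vectors["tiny"], vectors["big"]))    # ~-0.94
```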
However, when you throw syntax and language into the mix (and language is intrinsically intertwined with knowledge representation), things quickly become intractable.
Conclusion
In the end, I do not know how long the winter will last. Solving these problems ranges from needing major improvements to needing revolutionary AI breakthroughs. Google seems pretty close to taking the next major step, but until then I will keep asking my VAs to set timers and tell me the weather for tomorrow.
Most of my examples come from Italian commands. I know that English Natural Language Understanding is better, but I am aware of similar issues: I just do not remember them and have no way to try them. ↩︎
I am aware of the usability reasons behind that, for now; I just think it is not how human beings talk. ↩︎
Katz is an MIT researcher who has worked extensively in the field of conversational AI. ↩︎