3 lessons from a chatbot blunder

Rensen Ho
3 min read · Oct 6, 2021

It started with a Reddit discussion and then cracked the Internet up: a virtual assistant for citizens accidentally dispensed safe-sex advice in response to a COVID-19 query.

How did that happen? Many people tried to explain it:

  • whether there’s NLP or not
  • the order of the words in an utterance
  • variations such as “son” and “daughter”
  • insufficient training phrases
  • whether the model is overfitted or not
  • misinterpreting COVID-19 as a sexually transmitted disease

While some of us chuckled and some of us tried to pinpoint the exact cause, I believe this episode holds important lessons about how agile teams can take care of their chatbots.

Chatbots behave probabilistically. Verify with BDD or smoke test.

In Extreme Programming, developers are most familiar with Test-Driven Development (TDD) to ensure that functionality works as expected before releasing it to production. But chatbots don’t behave like apps and web apps. Each time new intents or training phrases are introduced, the virtual agent retrains itself to produce a new NLU model, which means the wiring between what the user says and what the bot answers may have shifted in ways you don’t expect. So how do you do TDD on chatbots? You don’t. Consider Behaviour-Driven Development (BDD), a communication strategy in the given-when-then format between squad members (between techies and business). Instead of applying it between squad members, apply BDD between the bot and the user: you check dialogues, not functionalities, so that you can assess the quality of your chatbot.
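To make that concrete, here is a minimal given-when-then sketch in pytest. The detect_intent() helper is a hypothetical stand-in for whatever NLU endpoint your platform exposes (Dialogflow, Rasa, Watson Assistant and the like all offer one), and the intent names are made up for illustration.

```python
import pytest


def detect_intent(utterance: str) -> dict:
    """Hypothetical wrapper around your bot's NLU endpoint.

    Expected to return the matched intent and the bot's reply text,
    e.g. {"intent": "covid19.symptoms", "reply": "..."}.
    """
    raise NotImplementedError("wire this up to your chatbot platform")


# Given a citizen asking about COVID-19,
# when the utterance mentions a family member such as a son or daughter,
# then the bot should still answer from the COVID-19 intent
# and never drift into unrelated advice.
@pytest.mark.parametrize("utterance", [
    "My son has a fever and a cough, is it COVID-19?",
    "My daughter was exposed to COVID-19, what should she do?",
])
def test_covid_query_stays_on_topic(utterance):
    result = detect_intent(utterance)
    assert result["intent"] == "covid19.symptoms"     # then: the right intent fires
    assert "safe sex" not in result["reply"].lower()  # then: no off-topic advice
```

Each scenario reads as a dialogue contract rather than a unit test of a function, which is the point of applying BDD between the bot and the user.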

It’s never done after launch. Keep calm and train on.

In the case of COVID-19, user expectations are at an all-time high, much higher than what we imagine chatbots are capable of. Sometimes chatbots fail us due to poor design, and sometimes our expectations fail them because we don’t understand virtual assistants. Either way, launching a chatbot is a cold start; it’s hard to get it right and perfect from day one. It’s a high bar for teams and management to meet, but it’s not worth getting upset over quickly. Teams need time to identify new dialogues and to train and develop more complex agents. Designers can minimise unhappy flows with error-correction or deflection strategies. Developers can run a round of regression tests to ensure that dialogues still work as expected after the NLU model is updated. I’m reluctant to say that chatbots simply get better with higher customer usage: humans are still involved in labelling or re-labelling responses and in helping the virtual agent undergo supervised learning.
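As a sketch of what that regression round could look like, the script below replays a small golden set of utterances against the freshly retrained model and reports any intent that has drifted. Again, detect_intent() and the intent names are hypothetical placeholders for your own platform’s API.

```python
def detect_intent(utterance: str) -> dict:
    """Hypothetical wrapper around your bot's NLU endpoint (as in the earlier sketch)."""
    raise NotImplementedError("wire this up to your chatbot platform")


# A golden set of utterances with the intents they should resolve to.
# Grow this list every time a new dialogue or blunder is discovered.
GOLDEN_SET = [
    ("Where can I get tested for COVID-19?", "covid19.testing"),
    ("My son has a fever and a cough", "covid19.symptoms"),
    ("How do I register for vaccination?", "covid19.vaccination"),
]


def regression_report(golden_set):
    """Return every utterance whose predicted intent no longer matches."""
    drifted = []
    for utterance, expected in golden_set:
        actual = detect_intent(utterance)["intent"]
        if actual != expected:
            drifted.append((utterance, expected, actual))
    return drifted


if __name__ == "__main__":
    for utterance, expected, actual in regression_report(GOLDEN_SET):
        print(f"DRIFT: {utterance!r} expected {expected}, got {actual}")
```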

Look at codified innocent answers from a different angle.

It is every conversation designer’s best intention to write a concise and accurate answer for every potential question. But even a response that seems neutral or innocent can surface out of context. Good chatbots need context, not just tree-based flows. I won’t claim that outlining all the cases is easy. Still, it’s worth treating as critical, especially when the “safe sex” advice or a similar anecdote is always a possibility under a machine learning model. Once in a while, play devil’s advocate. Find out what the bot says. Users love to poke fun at bots, so user messages containing vulgarities, sexual references, “complaint letters”, personally identifiable information (PII) and unethical questions are all possible. What will your chatbot say?
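One way to play devil’s advocate is to keep a small probe list of the awkward inputs users will inevitably send and to check that the bot routes them to a safe fallback rather than a codified answer that could backfire. The probes and the intent names below are illustrative, and detect_intent() is the same hypothetical wrapper as in the earlier sketches.

```python
def detect_intent(utterance: str) -> dict:
    """Hypothetical wrapper around your bot's NLU endpoint."""
    raise NotImplementedError("wire this up to your chatbot platform")


# Devil's-advocate probes: vulgarities, complaint letters, PII, unethical asks.
# (Kept mild here; a real list would be broader and reviewed regularly.)
PROBES = [
    "You are a useless bot",
    "Here is my ID number and home address, can you check my records?",
    "How do I fake a vaccination certificate?",
]

SAFE_INTENTS = {"fallback.default", "handover.human"}  # illustrative names


def check_probes(probes):
    """Flag probes that land on a codified answer instead of a safe fallback."""
    return [p for p in probes if detect_intent(p)["intent"] not in SAFE_INTENTS]
```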

Opinions expressed are solely my own and do not express the views or opinions of my employer. If you enjoyed this, subscribe to my updates or connect with me over LinkedIn.

