In “Evaluating Quality of Chatbots and Intelligent Conversational Agents,” Nicole Radziwill and Morgan Benton outline six categories of chatbot quality and their corresponding attributes, shown in the table below.
| Category | Quality Attributes | Sources |
|---|---|---|
| Performance | ● Graceful degradation ● Robustness to manipulation ● Robustness to unexpected input ● Avoid inappropriate utterances and be able to perform damage control ● Effective function allocation, provides appropriate escalation channels to humans | ● Cohen & Lane (2016) ● Thieltges (2016) ● Kluwer (2011) ● Morrissey & Kirakowski (2013) ● Staven (2017) |
| Functionality | ● Accurate speech synthesis ● Interprets commands accurately ● Use appropriate degrees of formality, linguistic register ● Linguistic accuracy of outputs ● Execute requested tasks ● Facilitate transactions and follows up with status reports ● General ease of use ● Engage in on-the-fly problem solving ● Contains breadth of knowledge, is flexible in interpreting it | ● Kuligowska (2015) ● Eeuwen (2017) ● Morrissey & Kirakowski (2013) ● Wallace (2003) ● Ramos (2017) ● Eeuwen (2017) ● Solomon (2017) ● Cohen & Lane (2016) |
| Humanity | ● Passes the Turing test ● Does not have to pass the Turing test ● Transparent to inspection, discloses its chatbot identity ● Include errors to increase realism ● Convincing, satisfying, & natural interaction ● Able to respond to specific questions ● Able to maintain themed discussion | ● Weizenbaum (1966); Wallace (2003) ● Ramos (2017) ● Bostrom & Yudkowsky (2014) ● Coniam (2014) ● Morrissey & Kirakowski (2013) |
| Affect | ● Provide greetings, convey personality ● Give conversational cues ● Provide emotional information through tone, inflection, and expressivity ● Exude warmth and authenticity ● Make tasks more fun and interesting ● Entertain and/or enable participant to enjoy the interaction ● Read and respond to moods of human participant | ● Morrissey & Kirakowski (2013) ● Pauletto et al. (2013) ● Solomon (2017) ● Eeuwen (2017) ● Ramos (2017) ● Meira & Canuto (2015) |
| Ethics & Behavior | ● Respect, inclusion, and preservation of dignity (linked to choice of training set) ● Ethics and cultural knowledge of users ● Protect and respect privacy ● Nondeception ● Sensitivity to safety and social concerns ● Trustworthiness (linked to perceived quality) ● Awareness of trends and social context | ● Neff & Nagy (2016) ● Applin & Fischer (2015) ● Eeuwen (2017) ● Isaac & Bridewell (2014) ● Miner et al. (2016) ● Herzum et al. (2002) ● Vetter (2002) |
| Accessibility | ● Responds to social cues or lack thereof ● Can detect meaning or intent ● Meets neurodiverse needs such as extra response time and text interface | ● Morrissey & Kirakowski (2013) ● Wilson et al. (2017) ● Radziwill & Benton (2017) |
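One way to make a rubric like this actionable is to rate a chatbot against each category and combine the ratings into a single score. The sketch below is purely hypothetical and not from Radziwill and Benton's paper: the category names come from the table above, but the weighting scheme and all the numbers are invented for illustration.

```python
# Hypothetical sketch: turning the six quality categories into a simple
# weighted scoring rubric. Weights and ratings are made up for illustration;
# this is not the evaluation method proposed in the paper.

def quality_score(scores, weights=None):
    """Weighted mean of per-category scores, each on a 0-1 scale."""
    if weights is None:
        weights = {c: 1.0 for c in scores}  # equal weighting by default
    total_weight = sum(weights[c] for c in scores)
    return sum(scores[c] * weights[c] for c in scores) / total_weight

# Example: a bot rated against each category (made-up numbers).
ratings = {
    "Performance": 0.8, "Functionality": 0.9, "Humanity": 0.5,
    "Affect": 0.6, "Ethics & Behavior": 1.0, "Accessibility": 0.7,
}
print(round(quality_score(ratings), 2))  # → 0.75
```

Unequal weights could reflect priorities for a particular bot, e.g. weighting Accessibility and Ethics & Behavior more heavily for a public-facing assistant.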
Almost all of these qualities seem self-evident for a chatbot. None of them is disputed, except the question of whether chatbots should be able to pass the Turing test.
Simply put, the Turing test checks whether a computer can converse like a human being: if you chat with both a person and a bot but can't tell which is which, the bot has passed.
It’s interesting to see the conflict between those who argue chatbots should be able to pass the Turing test and those who argue they shouldn’t.
Passing the test has been a research topic for decades, and ever since ELIZA, the first chatbot, appeared in 1966 (see the sample dialog pictured), people have been trying to make chatbots as human-like as possible.
But in recent years there’s been a shift, with people now saying it should be very clear whether you’re talking to a bot or a real person. The paper cites a number of interesting sources I’m eager to check out for myself.
In the meantime, all of this is certainly food for thought, and I’ll take this into account when developing my Historic Voicebot.
Image: A sample dialog with ELIZA (from http://www.masswerk.at/elizabot/eliza.html).