Anh Nguyen — Auburn University
Artificial Intelligence (AI) empowers every sector of the economy and our everyday lives. While often accurate, AI systems sometimes make unusual mistakes that humans don't typically make. We study how humans and AI can team up, and how human-AI teams perform on image classification. After a series of human-subject experiments on Gorilla, we report several interesting findings, including that some types of visual heatmap-based explanation that AIs make for their decisions can confuse humans and consistently cause users to perform worse. Furthermore, we find that some types of exemplar-based explanation often cause users to overtrust AIs' decisions even when they are wrong. We will discuss some pros and cons of different AI explanations tested on human users and interesting experimental insights from our Gorilla studies.
Full Transcript:
Anh Nguyen 0:00
Okay, yeah. Okay, so thank you again for having me here. I'm very happy to be part of the discussion so far. It's very exciting. And now, um, I hope to share with you a story of how we evaluate the effectiveness of human and AI collaborations. So I'm an assistant professor at Auburn University. So AI has been a ubiquitous ingredient behind many, many applications, many industry sectors that you have seen. And there's the need for Explainable AI. People are constantly asking: do we need to know what's going on internally inside these huge, giant neural networks, right? We need to know why your self-driving car decides to stop when there's a green light. So for all these behaviours, we want some explanations. The question is why, right? So for example, you can see this is two years ago, an image generator at CVPR, a top-tier conference in computer vision. It shows this huge racial bias. It's a network trained to do upsampling, so generating from low-res images, here of Barack Obama, and it's supposed to synthesise for you somewhat the same identity, but with a better-fidelity, better-quality image. However, it tends to bias towards white men. This is a text generator by OpenAI. It's a machine where you talk to it, you give it some text, and it will talk back to you; it outputs some text. But it also has this huge racial and gender bias. For example, if you say "The man worked as", it will tell you "a car salesman at the local Walmart" as a completion of your sentence. Now, if you say "The woman worked as", it will tell you a lot of other things that are really disturbing. So it has a huge gender bias, and the whole AI is a black box. Recently, just in the last year, there were multiple false arrests by police, because the AI, when it searches through the footage from cameras, will find the wrong suspect for shoplifting or other crimes. 
And a lot of people were put in jail for the wrong reasons, just because the AI said so. This is borrowed from DARPA: the current generation of AI systems offers tremendous benefits, but their effectiveness will be limited by the machine's ability to explain its decisions and actions to users. So this is traditional AI, where you give it some data, and you let it learn how to solve the task by itself. And then the output is this pre-trained black-box model, and it gives you either a decision or a recommendation, and then you decide what to do with it. So for high-stakes decisions, humans are the ultimate decision makers. So my research involves how we build this explanation interface that allows AIs to talk to humans, back and forth, so that together they can achieve better accuracy or performance than each of them alone. And the bottom line is that human-AI teamwork is needed because neither humans nor AIs can solve the task by themselves completely and efficiently, right? So one task I will talk about today is fine-grained bird identification. So we have here in this CUB dataset 200 classes. Your job is, given an image, let's say this photo, to pick one of the classes among the 200. So pine warbler, for example. So given this image, pick one of the classes, and the correct one here is American goldfinch. It's actually not an easy task for lay users, unless you are a bird expert and you spend a lot of time with birds.
4:15
Of course, asking users to choose one of the 200 classes also requires users to know about all the classes. So this is a pretty daunting and overwhelming task that's hard to scale beyond expert users. So here we reformulate the task into a two-stage process. So given an input image, we give it to the AI and the AI provides some prediction, some decision. Let's say it thinks this is 60% American goldfinch. And now in stage two, users get to see this decision. In this case, it's correct. And they can compare with some exemplars of American goldfinch, and users get to agree or disagree. So it's a binary, yes-or-no decision. In some cases the AI will be wrong. Like here, it says, "I'm 30% confident that this is an evening grosbeak." It's similar, but it's not exactly the same. And humans should disagree in this case. All right. So the reason we reformulated the task in this way is that the AI is very, very accurate in this task of 200-way categorization: it's 85% accurate. However, it's not 100% accurate. So, how accurate are humans? If you have any guess, it's time to make it now. We actually ran multiple Gorilla studies to get this number. And actually, it's 65%. It turns out to be not an easy task, although it's binary, it's just two options, and 50% is random chance. And humans scored on average 65% on this task. It looks like the following: before each trial, each question, we provide examples of the class that the AI predicted, which could be correct or wrong. We don't know. Let's say horned puffin here. So we provide six examples. And given a photo that the AI predicted to be horned puffin, humans have to select yes or no. Right? So this is the task, and humans score 65% accuracy. Now, one question that we study in our research is, if this is the case, right, the interaction between human and AI is pretty limited. 
It's one-way. So how can we get more information out of the AIs, for example, some explanation of why this is a horned puffin, so that humans can improve their accuracy? It leads to our second research question: do AI explanations help improve user accuracy? So here, we conducted previous research where we first invented the methods that give AIs the capability of providing explanations. And then we ran the human study to evaluate the explanations afterwards. So given the query image, the AI normally gives you some prediction, like junco, with some confidence, 97%, and the user decides yes or no; this is a judgement call. Our explainable AIs provide additional information. Here it says, "I think this is a junco because it's similar to other junco examples in the beak, in the chest, or in the tail." So we provide here visual correspondences between the query image and examples of the class that the model thinks it is. Okay, so this is the main explanation that we invented in this study.
8:29
It turns out that without any explanation... So in the Gorilla study, we have six methods. Our main methods are these two, and the other three are updated versions of the first one, just to understand the effects. But in this talk, I just compare the main baseline and our main treatment. So humans without any explanation, so no further information, are 65% accurate. But if you provide explanations, they improve, although slightly, but consistently improve their performance to 67% and 69%. And there's a statistically significant difference between these two groups, with around 60 users per method. In total, we have 355 users for the whole study. The way we set up this Gorilla study is that we have five training examples for this bird task, so we teach users how to do the task. These are followed by five quality-control examples. It's called validation, and if the users pass it, we then let them proceed onto the test of 30 questions. If they fail, we just reject them and invite them out of the study. It's the end. And the acceptance rate into this study is only around 23%, since around 1,100 users participated in the training and screening, but only 355 made it past the quality control. We use Prolific; we hire native English speakers, and they came from a lot of places in the world. We paid a total of $13.50 per hour. And the whole study is estimated to take 20 minutes, although it varies from around 10 to 245. And Gorilla is a very, very efficient tool for this type of study. This is our first study, third paper on this topic. Now, although with only one user study, we can actually perform a test on two human-AI interaction models. So in the first model, here, we provide the AI with an input image and it provides some decision, and the human gets to say yes or no, that this is American goldfinch. 
In the second model, the AI will provide you the confidence and, based on the confidence, you will let the AI make a decision by itself. Or, when the AI is not so confident, we leave it to the human. So this is the second model. And we want to test whether the explanation can help improve the human accuracy and the whole system accuracy in model two. So with model two, we test different ranges of confidence scores. And it turns out that we can automate, basically leave for the AI to decide, 75% of the data. And humans only need to work on 25% of the data. So that's pretty interesting. And furthermore, we find that if you team the humans with the AI, so let both of them make decisions on complementary sets of the data, then the team performance can actually be better than the AI alone. And it's much better than humans alone. So we are very excited about this work because it is the first work in this area that shows that humans and AI teaming up can actually achieve some improvement in visual recognition. One question you might ask is what made the humans more accurate when they see AI explanations, right? So we cut multiple slices deeper into the data we get from Gorilla. So when the AI is correct, you can see that the blue bar is no explanation. So the performance is similar to the red and the brown, which are the treatments with explanations. However, when the AI is wrong, this is where the explanations actually shine. They benefit the users: there's a gap of roughly 10 points between when there's no explanation and when there's an explanation provided. Basically, for example, when the AI is incorrect, it thinks that this is an olive-sided flycatcher, which actually is not true. It's actually a Sayornis. But
13:29
for the baseline treatment, when we provide this explanation, "this is an olive-sided flycatcher because it looks like these birds", all users, four out of four, that saw this explanation actually wrongly accepted it. Interestingly, for our explanation, when we provide this correspondence and these explanations, all three out of three users for this example correctly rejected, thinking, "Hey, this is wrong, I do not accept this." So in conclusion, we find for the first time that humans can improve their accuracy when using AI explanations for bird identification. With model two of human-AI interaction, we can offload around 75% of the data to the AI, letting users just label the rest, and that human-AI splitting improves the total system accuracy compared to AI alone and humans alone. This work was done with Giang Nguyen and Mohammad Taesiri, my PhD students. There are more interesting questions you can think of, right? For example, because we perform these online behavioural studies, we do not know exactly what makes users more accurate in the AI-wrong cases. We only have some hypotheses, but we do not actually observe what's going on. And the second is that the improvement in human accuracy is from two to four percent. So it's still modest at this point. So how can we improve it further? And whether the explanations help human experts is a separate question. We share the paper and code, and also the Gorilla screen settings, on our website, if anyone is interested in replicating. Thank you very much.
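The second interaction model from the talk, where the AI decides on its own when it is confident and defers to a human otherwise, can be sketched as a simple confidence threshold. This is a minimal Python illustration, not the authors' code: the `Trial` record, the `team_accuracy` function, the 0.8 threshold, and every number in the toy data are hypothetical and chosen only to show how team accuracy and the automated fraction would be computed.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    confidence: float    # AI confidence in its predicted label, in [0, 1]
    ai_correct: bool     # whether the AI's predicted label is right
    human_correct: bool  # whether the human's accept/reject decision is right


def team_accuracy(trials, threshold):
    """Return (accuracy, automated_fraction) for one confidence threshold.

    Trials at or above the threshold are decided by the AI alone;
    the remaining trials are handed to the human.
    """
    if not trials:
        raise ValueError("need at least one trial")
    automated = sum(t.confidence >= threshold for t in trials)
    hits = sum(
        t.ai_correct if t.confidence >= threshold else t.human_correct
        for t in trials
    )
    return hits / len(trials), automated / len(trials)


# Made-up toy data (assumption): NOT results from the study.
trials = [
    Trial(0.97, True, True),
    Trial(0.90, True, False),
    Trial(0.85, True, True),
    Trial(0.60, True, True),
    Trial(0.30, False, True),
    Trial(0.40, False, True),
    Trial(0.55, False, False),
    Trial(0.95, True, True),
]

print("AI alone:   ", team_accuracy(trials, 0.0))   # every trial automated
print("Human alone:", team_accuracy(trials, 1.01))  # no trial automated
print("Team @ 0.8: ", team_accuracy(trials, 0.8))
```

Sweeping `threshold` over a range of values, as the talk describes for confidence scores, would show how much of the data can be left to the AI before the combined accuracy starts to drop.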
15:30
And that was amazing. I had no idea what you were going to talk about. That was absolutely amazing. I love this, because for so long the question has been whether AI is going to do it all, and you've gone: no, no, no, that's the wrong question. It's AIs and humans working together. And I'm quite certain there are a lot of people in the room who would love an AI that helps them mark psychology essays come marking time, and can at least get a rough categorization of the essays, so it just needs to be checked by humans. So I think that's such a different approach to thinking about how AI and humans work together. So thank you for sharing that with us today. I do have a question for you. How did you maximise the quality of the answers that you got from your human participants?
16:20
Right, this is a very, very interesting question. And we spent a lot of time thinking about this. And we play around, and we also use code to try to improve it. So I have some backup slides here, where is it, here, where we list, not an exhaustive list, but some of the things that we did and that we never thought of at the beginning. So it takes multiple trials to get it right. So for example, we use quality-control examples, from five to 10. For this bird task we use five; for other tasks, we use 10. And we reject users if they do not pass, and, maybe cruelly, we also do not pay them. So we say this in advance on the introduction screen: if you do not pass, we will not pay you. Okay, so this is the first thing; this is important in scaring away some people who are not too serious. Second, we ask users on the first screen to stay on the screen and not switch tasks, because many times they come back two hours later on a task that should take 20 minutes. So it is very important to limit this and tell them that they are not allowed to switch tasks. We also do not allow studies on phones or tablets; they need to use a computer, and specifically Chrome, to avoid any issues. Something we have to hack into the programme, but which is nicely doable using JavaScript on Gorilla, is that we can set a delay of 5,000 milliseconds before displaying the continue button, like before the yes-or-no buttons. So they need to look at the example first, before they can just hit next, next, next, next. That's not possible with our setup. Also, this is from one of the prior studies, we filter out the humans who perform too fast. For example, for a task that is 30 minutes, where we estimate on average 20 minutes, we will filter out people with eight minutes or less. So that's all, yes.
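Some of the screening rules listed in this answer can be applied after data collection as a simple filter. The following is a rough Python sketch under stated assumptions, not the study's actual pipeline: the `Participant` fields, the `keep` function, and the pass criterion of five correct quality-control answers are hypothetical (the talk does not say how many validation questions must be right), and the sample records are invented. Only the device restriction and the "eight minutes or less" cutoff come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    pid: str
    validation_correct: int  # quality-control questions answered correctly
    duration_min: float      # total time spent on the study, in minutes
    device: str              # "desktop", "phone", or "tablet"


def keep(p, min_validation_correct=5, min_duration_min=8.0):
    """Return True if the participant passes all screening rules.

    min_validation_correct is an assumed pass criterion;
    min_duration_min mirrors the "eight minutes or less" rule.
    """
    if p.device != "desktop":
        return False  # phones and tablets are not allowed
    if p.validation_correct < min_validation_correct:
        return False  # failed the quality-control questions
    if p.duration_min <= min_duration_min:
        return False  # finished implausibly fast
    return True


# Invented sample records (assumption), for illustration only.
participants = [
    Participant("p1", 5, 19.5, "desktop"),
    Participant("p2", 3, 22.0, "desktop"),
    Participant("p3", 5, 7.0, "desktop"),
    Participant("p4", 5, 25.0, "phone"),
    Participant("p5", 5, 8.0, "desktop"),
    Participant("p6", 5, 120.0, "desktop"),
]

kept = [p.pid for p in participants if keep(p)]
print(kept)
```

Note that this sketch does not drop slow participants such as `p6`; the talk handles that case up front by telling users not to switch tasks rather than by filtering afterwards.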
18:32
That's really great. I've taken a screenshot of your top tips. So I think these are rules that a lot of people could use to improve their data quality. And thank you so much for your time today.

