Evaluating the effectiveness of human-AI collaboration


Anh Nguyen — Auburn University

Artificial Intelligence (AI) empowers every sector of the economy and our everyday lives. While often accurate, AI systems sometimes make unusual mistakes that humans don’t typically make. We study how humans and AI can team up, and how such human-AI teams perform on image classification. After a series of human-subject experiments on Gorilla, we report several interesting findings, including that some types of visual heatmap-based explanations that AIs give for their decisions can confuse humans and consistently cause users to perform worse. Furthermore, we find that some types of exemplar-based explanations often cause users to overtrust AIs’ decisions even when they are wrong. We will discuss some pros and cons of different AI explanations tested on human users and interesting experimental insights from our Gorilla studies.

Full Transcript:

Anh Nguyen 0:00
Okay, so thank you again for having me here. I’m very happy to be part of the discussion so far. It’s very exciting. And now I hope to share with you a story of how we evaluate the effectiveness of human-AI collaboration. So I’m an assistant professor at Auburn University. AI has been a ubiquitous ingredient behind many, many applications and many industry sectors that you have seen. And there’s the need for explainable AI. People are constantly asking: do we need to know what’s going on internally inside these huge, giant neural networks? We need to know why your self-driving car decides to stop when there’s a green light. So for all these behaviours, we want some explanations. The question is why. So for example, you can see this is from two years ago: an image generator at CVPR, a top-tier conference in computer vision, shows this huge racial bias. It’s a network trained to do upsampling, so generating from a low-res image, here of Barack Obama, and it’s supposed to synthesise for you an image of the same identity but with better fidelity and quality. However, it tends to be biased towards white men. This is a text generator by OpenAI. It’s a machine where you talk to it, you give it some text, and it will talk back to you; it outputs some text. But it also has this huge racial and gender bias. For example, if you say “The man worked as”, it will tell you “a car salesman at the local Walmart”, completing your sentence. Now, if you say “The woman worked as”, it will tell you a lot of other things that are really disturbing. So it has a huge gender bias, and the whole AI is a black box. Recently, just in the last year, there were multiple false arrests by police, because when the AI searches through the footage from cameras, it will find the wrong suspect for shoplifting or other crimes.
And a lot of people were put in jail for the wrong reasons, just because the AI said so. This is borrowed from DARPA: the current generation of AI systems offers tremendous benefits, but their effectiveness will be limited by the machine’s ability to explain its decisions and actions to users. So this is traditional AI, where you give it some data and you let it learn how to solve the task by itself. The output is this pretrained black-box model, and it gives you either a decision or a recommendation, and then you decide what to do with it. For high-stakes decisions, humans are the ultimate decision makers. So my research involves how we build this explanation interface that allows AIs and humans to talk back and forth, so that together they can achieve better accuracy or performance than each of them alone. And the bottom line is that human-AI teamwork is needed because neither humans nor AIs can solve the task completely and efficiently by themselves. So the one task I will talk about today is fine-grained bird identification. We have here, in this CUB dataset, 200 classes. Your job is, given an image, let’s say this photo, to pick one of the classes among the 200, so Pine Warbler, for example. The correct one here is American Goldfinch. It’s actually not an easy task for lay users, unless you are a bird expert and you spend a lot of time with birds.

Of course, asking users to choose one of the 200 classes also requires users to know about all the classes. So this is a pretty daunting and overwhelming task that’s hard to scale beyond expert users. So here we reformulate the task into a two-stage process. Given an input image, we give it to the AI, and the AI provides some prediction, some decision. Let’s say it thinks this is 60% American Goldfinch. And now, in stage two, users get to see this decision. In this case, it’s correct. They can compare with some exemplars of American Goldfinch, and users get to agree or disagree. So it’s a binary, yes-or-no decision. In some cases the AI will be wrong, like here, where it says, “I’m 30% confident that this is an Evening Grosbeak.” It’s similar, but it’s not exactly the same, and humans should disagree in this case. All right. So the reason we reformulated the task in this way is that the AI is very, very accurate in this task of 200-way categorization: it’s 85% accurate. However, it’s not 100% accurate. So, how accurate are humans? If you have any guess, it’s time to make it now. We actually ran multiple Gorilla studies to get this number, and actually, it’s 65%. It turns out to be not an easy task, although it’s binary, just two options, and 50% is random chance. Humans score on average 65% on this task. It looks like the following: before each trial, each question, we provide examples of the class that the AI predicted, which could be correct or wrong. We don’t know. Let’s say Horned Puffin here. So we provide six examples, and users are given a photo that the AI predicted to be a Horned Puffin. Humans have to select yes or no. So this is the task, and humans score 65% accuracy. Now, one question that we study in our research is: if this is the case, the interaction between human and AI is pretty limited.
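The yes-or-no scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Trial` record, the function names, and the toy trials are hypothetical and are not taken from the study's code or data.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    true_label: str     # ground-truth bird class
    ai_label: str       # class predicted by the AI
    user_accepts: bool  # True if the user answered "yes"


def is_user_correct(trial: Trial) -> bool:
    """A 'yes' is right only when the AI is right; a 'no' only when it is wrong."""
    ai_is_right = trial.ai_label == trial.true_label
    return trial.user_accepts == ai_is_right


def user_accuracy(trials: list[Trial]) -> float:
    """Fraction of trials on which the user's yes/no answer was correct."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(is_user_correct(t) for t in trials) / len(trials)


# Made-up toy trials for illustration, not data from the study.
toy_trials = [
    Trial("American Goldfinch", "American Goldfinch", True),   # correct accept
    Trial("American Goldfinch", "Evening Grosbeak", True),     # wrong accept
    Trial("American Goldfinch", "Evening Grosbeak", False),    # correct reject
    Trial("Horned Puffin", "Horned Puffin", False),            # wrong reject
]

print(user_accuracy(toy_trials))  # 0.5
```

With this rule, a user who always clicks "yes" scores exactly the AI's own accuracy on the shown trials, which is one reason the trials in which the AI is wrong are the informative ones.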
It’s one way. So how can we get more information out of the AIs, for example some explanation of why this is a Horned Puffin, so that humans can improve their accuracy? It leads to our second research question: do AI explanations help improve user accuracy? So here we conducted research where we first invented methods that give AIs the capability of providing explanations, and then we ran the human study to evaluate the explanations as a post-process. So for the query image, the AI normally gives you some prediction, like junco, with some confidence, 97%, and the user decides yes or no, whether this is a junco. Our explainable AIs provide additional information. Here it says, “I think this is a junco because it’s similar to other junco examples in the beak, in the chest, or in the tail.” So we provide here visual correspondence between the query image and examples of the class that the model thinks it is. Okay, so this is the main explanation that we invented in this study.

It turns out that we doubt any explanation. So in the Gorilla study, we have six methods. Our main methods are these two, and the other three are updated versions of the first one, just to understand the effects. But in this talk, I just compare the main baseline and our main treatment. So humans without any explanation, with no further information, are 65% accurate. But if you provide explanations, they improve, slightly but consistently, to 67% and 69%. And there’s statistical significance between these two groups. The users per method are around 60. In total, we have 355 users for the whole study. The way we set up this Gorilla setting is: we have five training examples for this bird task, so we teach users how to do the task. Then we provide five quality-control examples. It’s called validation, and if the users pass it, we then let them proceed to the test of 30 questions. If they fail, we just reject them and invite them out of the study. It’s the end. And the acceptance rate into this study is only around 23, since around 1,100 users participated in the training and screening, but only 355 made it past the quality control. We use Prolific, we hire native English speakers, and they came from a lot of places in the world. We paid a total of $13.50 per hour. The whole study is estimated to be 20 minutes, although it varies from around 10 to 245. And Gorilla is a very, very efficient tool for this type of study. This is our first study, third paper on this topic. Now, although with only one user study, we actually can perform a test on two human-AI interaction models. In the first model, here, we provide the AI with an input image and it provides some decision, and the human gets to say yes or no to “this is an American Goldfinch”.
In the second model, the AI will provide you the confidence, and based on the confidence, you let the AI make the decision by itself, or, when the AI is not so confident, we leave it to the human. So this is the second model. And we want to test whether the explanations can help improve the human accuracy and the whole-system accuracy in model two. With model two, we tested a range of confidence scores, and it turns out that we can automate, basically leave for the AI to decide, 75% of the data, and humans only need to work on 25% of the data. So that’s pretty interesting. Furthermore, we find that if you team the humans with the AI, so let each of them make decisions on complementary subsets of the data, then the team performance can actually be better than the AI alone, and it’s much better than humans alone. So we are very excited about this work because it is the first work in this area that shows that humans and AI teaming up can actually achieve some improvement in visual recognition. One question you might ask is what made the humans more accurate when they see AI explanations. So we took multiple deeper slices of the data we got from Gorilla. When the AI is correct, you can see the blue bar, which is no explanation, performs similarly to the red and the brown, which are the treatments with explanations. However, when the AI is wrong, this is where the explanations actually shine. They benefit the users: there’s a more than 10-point gap between no explanation and when an explanation is provided. For example, when the AI is incorrect, it is thinking that this is an Olive-sided Flycatcher, which actually is not true. It’s actually a scientist. But

for the baseline treatment, where we provide this explanation, “This is an Olive-sided Flycatcher because it looks like these birds”, all four out of four users who saw this explanation actually wrongly accepted it. Interestingly, for our explanation, where we provide this correspondence, all three out of three users for this example correctly rejected it and said, “Hey, AI, this is wrong, I do not accept this.” So in conclusion, we find for the first time that humans can improve their accuracy when using AI explanations for bird identification. In the model-two human-AI interaction, we can offload around 75% of the data to the AI and let users label just the rest, and that human-AI splitting improves the total system accuracy compared to the AI alone and humans alone. This work was done with Giang Nguyen and Mohammad Taesiri, my PhD students. There are more interesting questions you can think of. For example, because we perform these online behavioural studies, we do not know exactly what makes users more accurate in the AI-wrong cases. We only have some hypotheses, but we do not actually observe what’s going on. The second is that the improvement in human accuracy is from two to four percent, so it’s still modest at this point. So how can we improve it further? And whether the explanations help human experts is a separate question. We share the paper and code, and also the Gorilla screen settings, on our website, if anyone is interested in replicating. Thank you very much.
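The second interaction model from the talk, where confident predictions are left to the AI and the rest go to a human, can be sketched as a simple confidence threshold. The function name, the tuple layout, the threshold value, and the toy samples below are all assumptions made for illustration; they are not the study's code, data, or results.

```python
def team_accuracy(samples, threshold):
    """Route each sample to the AI or to a human based on AI confidence.

    samples: list of (ai_confidence, ai_is_correct, human_is_correct) tuples,
             where human_is_correct means the human's yes/no verdict was right.
    Returns (team accuracy, fraction of samples decided by the AI alone).
    """
    if not samples:
        raise ValueError("need at least one sample")
    correct = 0
    ai_handled = 0
    for confidence, ai_is_correct, human_is_correct in samples:
        if confidence >= threshold:
            # Confident prediction: accept the AI's decision without review.
            ai_handled += 1
            correct += int(ai_is_correct)
        else:
            # Low confidence: leave the decision to the human.
            correct += int(human_is_correct)
    n = len(samples)
    return correct / n, ai_handled / n


# Made-up toy samples for illustration only.
toy_samples = [
    (0.97, True, True),
    (0.90, True, False),
    (0.85, True, True),
    (0.30, False, True),
]

print(team_accuracy(toy_samples, threshold=0.8))  # (1.0, 0.75)
```

In this toy set the AI alone and the human alone would each be right on three of four samples, while the split gets all four, because each side's mistake falls in the part handled by the other. Sweeping `threshold` over a range of values is one way to look for the operating point at which most of the data can be automated.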

That was amazing. I had no idea what you were going to talk about. That was absolutely amazing. I love this, because for so long we’ve all gone, “Will AI do it all?”, and you’ve gone, “No, no, no, no, that’s the wrong question.” It’s AIs and humans working together. And I’m quite certain there are a lot of people in the room who would love an AI that helps them mark psychology essays come marking time, and can at least get a rough categorization of the essays, so it just needs to be checked by humans. So I think that’s such a different approach to thinking about how AI and humans work together. So thank you for sharing that with us today. I do have a question for you. How did you maximise the quality of the answers that you got from your human participants?

Right, this is a very, very interesting question, and we spent a lot of time thinking about this. We played around, and we also used code to try to improve it. So I have some backup slides here, where we list, not an exhaustive list, but some of the things that we did and that we never thought of at the beginning. It took multiple trials to get it right. So for example, we use five to 10 quality-control examples. For this bird task, we use five; for other tasks, we use 10. We reject users if they do not pass, and, maybe cruelly, we also do not pay them. We say this in advance on the introduction screen: if you do not pass, we will not pay you. Okay, so this is the first thing. This is important in scaring away some people who are not too serious. Second, we ask users on the first screen to stay on the screen and not switch tasks, because many times they come back two hours later on a task which should take 20 minutes. So it is very important to limit this and tell them that they are not allowed to switch tasks. We also do not allow studies on phones or tablets; they need to use a computer, and specifically Chrome, to avoid any issues. Something we had to hack into the programme, but which is nicely doable using JavaScript on Gorilla, is that we can set a time, 5,000 milliseconds, before displaying the continue button, or the yes-or-no buttons. So they need to look at the example first; they cannot just hit next, next, next, next. That’s not possible with our setup. Also, this is from one of the prior studies: we filter out the humans who perform too fast. For example, for a task that is 30 minutes, where we estimate on average 20 minutes, we will filter out people with eight minutes or less. So that’s all. Yes.
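Some of the checks listed in this answer (passing the quality-control questions, using a computer rather than a phone or tablet, and not finishing implausibly fast) could be applied as a filter over participant records after data collection. The sketch below assumes a hypothetical `Participant` record and field names; only the eight-minute cut-off comes from the answer above, and the speaker attributes it to a prior study.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    participant_id: str
    passed_validation: bool  # passed the quality-control questions
    device: str              # e.g. "computer", "phone", "tablet"
    minutes_taken: float     # total time spent on the study


# Cut-off quoted in the talk: drop people who took eight minutes or less.
MIN_MINUTES = 8.0


def keep_participant(p: Participant) -> bool:
    """Keep only participants who satisfy all three quality checks."""
    return (
        p.passed_validation
        and p.device == "computer"
        and p.minutes_taken > MIN_MINUTES
    )


# Made-up records for illustration only.
toy_participants = [
    Participant("p1", True, "computer", 21.0),
    Participant("p2", True, "computer", 6.5),   # too fast
    Participant("p3", False, "computer", 19.0), # failed validation
    Participant("p4", True, "phone", 25.0),     # wrong device
]

kept = [p.participant_id for p in toy_participants if keep_participant(p)]
print(kept)  # ['p1']
```

The minimum viewing delay before the buttons appear is enforced inside the task itself, so it is not part of this post-hoc filter.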

That’s really great. I’ve taken a screenshot of your top tips. I think these are rules that a lot of people could use to improve their data quality. Thank you so much for your time today.

