Evaluating the effectiveness of human-AI collaboration

Anh Nguyen — Auburn University

Artificial Intelligence (AI) empowers every sector of the economy and our everyday lives. While often accurate, AI systems sometimes make unusual mistakes that humans don’t typically make. We study how humans and AI can team up, and how well human-AI teams perform on image classification. After a series of human-subject experiments on Gorilla, we report several interesting findings, including that some types of visual heatmap-based explanations that AIs give for their decisions can confuse humans and consistently cause users to perform worse. Furthermore, we find that some types of exemplar-based explanations often cause users to overtrust AIs’ decisions even when they are wrong. We will discuss some pros and cons of different AI explanations tested on human users and interesting experimental insights from our Gorilla studies.

Full Transcript:

Anh Nguyen 0:00
Okay, yeah. Okay, so thank you again for having me here. I’m very happy to be part of the discussion so far. It’s very exciting. And now, um, I hope to share with you a story of how we evaluate the effectiveness of human-AI collaboration. So I’m an assistant professor at Auburn University. So AI has been a ubiquitous ingredient behind many, many applications and many industry sectors that you have seen. And there’s the need for explainable AI. People are constantly asking: do we need to know what’s going on internally inside these huge, giant neural networks, right? We need to know why your self-driving car decides to stop when there’s a green light. So for all these behaviours, we want some explanations. The question is why, right? So for example, you can see this is two years ago, an image generator at CVPR, a top-tier conference in computer vision. It shows this huge racial bias. It’s a network trained to do upsampling, so generating from a low-res image, here of Barack Obama, and it’s supposed to synthesise for you somewhat the same identity, but with a better-fidelity, higher-quality image. However, it tends to bias towards white men. This is a text generator by OpenAI. It’s a machine where you talk to it, you give it some text, and it will talk back to you, it outputs some text. But it also has this huge racial and gender bias. For example, if you say “The man worked as”, it will tell you “a car salesman at a local Walmart” as a way of completing your sentence. Now, if you say “The woman worked as”, it will tell you a lot of other things that are really disturbing. So it has a huge gender bias, and the whole AI is a black box. Recently, just in the last year, there were multiple false arrests by police, because the AI, when it searches through the footage from cameras, will find the wrong suspect for shoplifting or other crimes. 
And a lot of people were put in jail for the wrong reasons, just because the AI said so. And this is borrowed from DARPA: the current generation of AI systems offers tremendous benefits, but their effectiveness will be limited by the machine’s ability to explain its decisions and actions to users. So this is traditional AI, where you give it some data, and you let it learn how to solve the task by itself. And then the output is this pre-trained black-box model, and it gives either a decision or a recommendation, and then you decide what to do with it. So for high-stakes decisions, humans are the ultimate decision makers. So my research involves how we build this explanation interface that allows AIs to talk to humans, back and forth, so that they both can achieve better accuracy or performance than each of them alone. And the bottom line is that human-AI teamwork is needed because neither humans nor AIs can solve the task by themselves completely and efficiently. So the one task I will talk about in this talk today is fine-grained bird identification. So we have here, in this CUB dataset, 200 classes. Your job is, given an image, let’s say this photo, to pick one of the classes among the 200, so Pine Warbler, for example. So given this image, pick one of the classes, and the correct one here is American Goldfinch. It’s actually not an easy task for lay users, unless you are a bird expert and you spend a lot of time with birds.

4:15
Of course, also, asking users to choose one of the 200 classes requires users to know about all the classes. So this is a pretty daunting and overwhelming task that’s hard to scale beyond expert users. So here we reformulate the task into a two-stage process. So given an input image, we give it to the AI, and the AI provides some prediction, some decision. Let’s say it thinks this is 60% American Goldfinch. And now in stage two, users get to see this decision. In this case, it’s correct. And they can compare with some exemplars of American Goldfinch, and users get to agree or disagree. So it’s a binary, yes-or-no decision. In some cases the AI will be wrong. Like here, it says, I’m 30% confident that this is Evening Grosbeak. It’s similar, but it’s not exactly the same. And humans should disagree in this case. All right. So the reason we reformulated the task in this way is because the AI is very, very accurate in this task of 200-way categorization. It’s 85% accurate. However, it’s not 100% accurate. So, how accurate are humans? If you have any guess, it’s time to make it now. We actually ran multiple Gorilla studies to get this number. And actually, it’s 65%. It turns out to be not an easy task, although it’s binary, it’s just two options, and 50% is random chance. And humans score on average 65% on this task. It looks like the following: before each trial, each question, we actually provide examples of the class that the AI predicted, which could be correct or wrong. We don’t know. Let’s say Horned Puffin here. So we provide six examples. And users are given a photo that the AI predicted to be Horned Puffin. Humans have to select yes or no, right? So this is the task, and humans score 65% accuracy. Now, one question that we study in our research is, if this is the case, right, the interaction between human and AI is pretty limited. 
It’s one way. How can we get more information out of the AIs, for example, some explanation of why this is Horned Puffin, so that humans can improve their accuracy? It leads to our second research question: do AI explanations help improve user accuracy? So here, we conducted previous research where we first invented the methods that give AIs a nice capability of providing explanations. And then we ran the human study to evaluate the explanations as a post-process. So given the query image, the AI normally gives you some prediction, like junco, with some confidence, 97%, and the user decides yes or no, whether this is a junco. Our explainable AIs provide additional information. Here it says: I think this is a junco because it’s similar to other junco examples in the beak, in the chest, or in the tail. So we provide here visual correspondences between the query image and the examples of the class that the model thinks it is. Okay, so this is the main explanation that we invented in this study.
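As a rough illustration of the two-stage task described above, the sketch below scores a user's yes/no decisions: a decision counts as correct when the user accepts a correct AI prediction or rejects a wrong one. The `Trial` class, the function names, and the four toy trials are hypothetical and are not taken from the study's code or data.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One yes/no verification trial (hypothetical structure)."""
    true_label: str      # ground-truth bird class
    ai_label: str        # class predicted by the AI in stage one
    human_accepts: bool  # stage two: did the user answer "yes"?


def human_is_correct(trial: Trial) -> bool:
    # The user is right when "yes" is given to a correct AI prediction
    # or "no" is given to a wrong one.
    ai_correct = trial.ai_label == trial.true_label
    return trial.human_accepts == ai_correct


def verification_accuracy(trials: list[Trial]) -> float:
    if not trials:
        raise ValueError("need at least one trial")
    return sum(human_is_correct(t) for t in trials) / len(trials)


# Toy data, invented for illustration only.
trials = [
    Trial("American Goldfinch", "American Goldfinch", True),  # correct accept
    Trial("American Goldfinch", "Evening Grosbeak", False),   # correct reject
    Trial("Horned Puffin", "Horned Puffin", False),           # wrong reject
    Trial("American Goldfinch", "Evening Grosbeak", True),    # wrong accept
]
accuracy = verification_accuracy(trials)
print(f"verification accuracy: {accuracy:.2f}")
```

Because every trial has only two possible answers, random guessing lands at 50% on this metric, which is the reference point for the 65% human score mentioned in the talk.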

8:29
It turns out that without any explanation... So in the Gorilla study, we have six methods. And our main methods are these two, and the other three are updated versions of the first one, just to understand the effects. But in this talk, I just compare the main baseline and our main treatments. So humans without any explanation, so with no further information, are 65% accurate. But if you provide explanations, they improve consistently, albeit slightly, but they consistently improve their performance to 67% and 69%. And there’s a statistically significant difference between these groups, and there are around 60 users per method. In total, we have 355 users for the whole study. The way we set up this Gorilla study is we have five training examples for this bird task, so we teach users how to do the task. And then we provide five quality-control examples. It’s called validation, and if users pass it, we then let them proceed to the test of 30 questions. If they fail, we just reject them and invite them out of the study. It’s the end. And the acceptance rate into this study is only around 32%, since around 1,100 users participated in the training and screening, but only 355 made it past the quality control. We use Prolific, we hire native English speakers, and they come from a lot of places in the world. We paid a total of $13.5 per hour. And the whole study is estimated to be 20 minutes, although it varies from around 10 to 245. And Gorilla is a very, very efficient tool for this type of study. This is our first study, third paper on this topic. Now, although with only one user study, we actually can perform a test on two human-AI interaction models. So in the first model, here, we provide the AI with an input image and it provides some decision, and the human gets to say yes or no, whether this is American Goldfinch. 
In the second model, the AI will provide you the confidence, and based on the confidence, you will let the AI make a decision by itself. Or when the AI is not so confident, we leave it to the human. So this is the second model. And we want to test whether the explanations can help improve the human accuracy and the whole system accuracy in model two. So with model two, we test different ranges of confidence scores. And it turns out that we can automate, basically leave for the AI to decide, 75% of the data. And humans only need to work on 25% of the data. So that’s pretty interesting. And furthermore, we find that if you team the humans with the AI, so let both of them make decisions on complementary sets, then the team performance actually can be better than the AI alone. And it’s much better than humans alone. So we are very excited about this work because it is the first work in this area that shows that humans and AIs teaming up can actually achieve some improvement in visual recognition. One question you might ask is what made the humans more accurate when they see AI explanations, right? So we cut multiple slices deeper into the data we get from Gorilla. So when the AI is correct, you can see that the blue bar is no explanation. So the performance is similar to the red and the brown, which are the treatments with explanations. However, when the AI is wrong, this is where the explanations actually shine. They benefit the users. There’s a gap of more than 10 points between no explanation and when there’s an explanation provided. Basically, for example, when the AI is incorrect, it’s thinking that this is olive flycatcher, which actually is not true. It’s actually a [unclear]. But

13:29
for other treatments, the baseline treatments, when we provide this explanation, “this is olive flycatcher because it looks like these birds”, all users, four out of four, that see this explanation actually wrongly accepted. But interestingly, for our explanation, when we provide this correspondence and these explanations, all three out of three users for this example correctly rejected, thinking, hey, AI, this is wrong, I do not accept this. So in conclusion, we find for the first time that humans can improve their accuracy when using AI explanations for bird identification. In model-two human-AI interaction, we can offload around 75% of the data to the AI, letting users just label the rest, and that human-AI splitting improves the whole system, the total system accuracy, compared to AI alone and humans alone. This one was done with Giang Nguyen and Mohammad Taesiri, my PhD students. There are more interesting questions you can think of, right? For example, because we perform these online behavioural studies, we do not know exactly what makes users more accurate in the AI-wrong cases. We only have some hypotheses, but we do not actually observe what’s going on. And the second is the improvement in human accuracy. It’s from two to four percent, so it’s still modest at this point. So how can we improve it further? And whether the explanations help human experts is a separate question. We share the paper and code and also the Gorilla screen settings on our website, if anyone is interested in replicating. Thank you very much.
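The second interaction model from the talk, where confident AI predictions are accepted automatically and the rest go to a human, can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the `Sample` class, the `team_metrics` function, the 0.8 threshold, and the four toy samples are all hypothetical, and the numbers it prints are not results from the study.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One image with its outcomes (hypothetical structure)."""
    ai_confidence: float  # AI confidence in its predicted class, in [0, 1]
    ai_correct: bool      # was the AI prediction right?
    human_correct: bool   # was the human's yes/no verification right?


def team_metrics(samples: list[Sample], threshold: float) -> tuple[float, float]:
    """Return (fraction of samples handled by the AI alone, team accuracy)."""
    if not samples:
        raise ValueError("need at least one sample")
    n_automated = 0
    n_correct = 0
    for s in samples:
        if s.ai_confidence >= threshold:
            # Confident prediction: the AI decides by itself.
            n_automated += 1
            n_correct += s.ai_correct
        else:
            # Low confidence: the decision is left to the human.
            n_correct += s.human_correct
    return n_automated / len(samples), n_correct / len(samples)


# Toy data, invented for illustration only.
samples = [
    Sample(0.97, True, True),
    Sample(0.90, True, False),
    Sample(0.85, True, True),
    Sample(0.40, False, True),
]
automated_fraction, team_accuracy = team_metrics(samples, threshold=0.8)
ai_alone = sum(s.ai_correct for s in samples) / len(samples)
human_alone = sum(s.human_correct for s in samples) / len(samples)
print(automated_fraction, team_accuracy, ai_alone, human_alone)
```

Sweeping `threshold` over a range of values trades off how much data is automated against how often a wrong but confident AI prediction slips through, which is the kind of comparison across confidence ranges that the talk describes.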

15:30
And that was amazing. I had no idea what you were going to talk about. That was absolutely amazing. I love this, because for so long the question has been, will AI do it all and we’ll be gone? No, no, no, no, that’s the wrong question. It’s AIs and humans working together. And I’m quite certain there are a lot of people in the room who would love an AI that helps them mark psychology essays come marking time, and can at least get a rough categorization of the essays, so it just needs to be checked by humans. So I think that’s such a different approach to thinking about how AI and humans work together. So thank you for sharing that with us today. I do have a question for you. How did you maximise the quality of the answers that you got from your human participants?

16:20
Right, this is a very, very interesting question. And we spent a lot of time thinking about this, and we also use code to try to improve it. So I have some backup slides here, where is it, here, where we list, not an exhaustive list, but some of the things that we did and that we never thought of at the beginning. So it takes multiple trials to get it right. So for example, we use quality-control examples, from five to 10. For this bird task, we use five; for other tasks, we use 10. And we reject users if they do not pass, and, maybe cruelly, we also do not pay them. So we say this in advance on the introduction screen: if you do not pass, we will not pay you. Okay, so this is the first thing, and this is important in scaring away some people who are not too serious. Second is we ask users on the first screen to stay on the screen and not switch tasks, because many times they come back two hours later on a task which should take 20 minutes. So it is very important to limit it and tell them that they are not allowed to switch tasks. We also do not allow studies on phones or tablets. They need to use a computer, and specifically Chrome, to avoid any issues. Something we had to hack into the programme, but which is nicely doable using JavaScript on Gorilla, is we can set a timer of 5,000 milliseconds before you display the continue button, like before the yes or no buttons. So they need to look at the example first, and they cannot just hit next, next, next, next. That’s not possible with our setup. Also, this is from one of the prior studies, we filter out the humans who perform too fast. For example, for a task that is 30 minutes, where we estimate on average 20 minutes, we will filter out people with eight minutes or less. So that’s all, yes.
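Two of the data-quality rules mentioned above, rejecting users who fail the quality-control examples and filtering out users who finish in eight minutes or less, could be applied after data collection along these lines. The `Participant` class, its field names, and the toy records are hypothetical; this is not the export format of Gorilla or Prolific, and the eight-minute cut-off is simply the figure quoted in the talk.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """One participant's summary record (hypothetical structure)."""
    participant_id: str
    passed_validation: bool  # passed the quality-control examples?
    minutes_taken: float     # total time spent on the study


def keep_participant(p: Participant, min_minutes: float = 8.0) -> bool:
    # Keep only users who passed validation and took strictly longer
    # than the cut-off ("eight minutes or less" is filtered out).
    return p.passed_validation and p.minutes_taken > min_minutes


# Toy data, invented for illustration only.
participants = [
    Participant("p1", True, 21.0),   # kept
    Participant("p2", True, 7.5),    # too fast
    Participant("p3", False, 19.0),  # failed validation
    Participant("p4", True, 8.0),    # exactly at the cut-off, filtered out
]
kept_ids = [p.participant_id for p in participants if keep_participant(p)]
print(kept_ids)
```

The other rules in the list, such as the 5,000-millisecond delay before the answer buttons appear and the restriction to desktop Chrome, are enforced in the experiment itself rather than in post-hoc filtering, so they are not shown here.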

18:32
That’s really great. I’ve taken a screenshot of your top tips. So I think these are rules that a lot of people could use to improve their data quality. And thank you so much for your time today.


BeOnline is the conference to learn all about online behavioral research. It's the ideal place to discover the challenges and benefits of online research and to learn from pioneers. If that sounds interesting to you, then click the button below to register for the 2022 conference on Tuesday July 5th. You will be the first to know when we release new content and timings for BeOnline 2022.
