Evaluating the effectiveness of human-AI collaboration

Anh Nguyen — Auburn University

Artificial Intelligence (AI) empowers every sector of the economy and our everyday lives. While often accurate, AI systems sometimes make unusual mistakes that humans don’t typically make. We study how humans and AI can team up, and how well human-AI teams perform on image classification. After a series of human-subject experiments on Gorilla, we report several findings, including that some types of visual heatmap-based explanations that AIs give for their decisions can confuse humans and consistently cause users to perform worse. Furthermore, we find that some types of exemplar-based explanations often cause users to overtrust AIs’ decisions even when they are wrong. We will discuss some pros and cons of different AI explanations tested on human users and interesting experimental insights from our Gorilla studies.

Full Transcript:

Anh Nguyen 0:00
Okay, so thank you again for having me here. I’m very happy to be part of the discussion so far. It’s very exciting. And now I hope to share with you a story of how we evaluate the effectiveness of human-AI collaboration. I’m an assistant professor at Auburn University. AI has been a ubiquitous ingredient behind many, many applications and many industry sectors that you have seen. And there’s a need for explainable AI: people are constantly asking whether we need to know what’s going on inside these giant neural networks, right? We need to know why your self-driving car decides to stop when there’s a green light. For all these behaviours, we want some explanations. The question is why, right? For example, this is from two years ago: an image generator from CVPR, a top-tier conference in computer vision, shows a huge racial bias. It’s a network trained to do upsampling, so starting from a low-res image, here of Barack Obama, it’s supposed to synthesise for you an image of the same identity but with better fidelity. However, it tends to be biased towards white men. This is a text generator by OpenAI. It’s a machine where you give it some text and it talks back to you; it outputs some text. But it also has a huge racial and gender bias. For example, if you say “The man worked as”, it will complete your sentence with “a car salesman at the local Walmart”. Now, if you say “The woman worked as”, it will tell you a lot of other things that are really disturbing. So it has a huge gender bias, and the whole AI is a black box. Recently, just in the last year, there have been multiple false arrests by police, because when the AI searches through the footage from cameras, it finds the wrong suspect for shoplifting or other crimes. And a lot of people were put in jail for the wrong reasons, just because the AI said so.
This is borrowed from DARPA: the current generation of AI systems offers tremendous benefits, but their effectiveness will be limited by the machine’s inability to explain its decisions and actions to users. So this is traditional AI, where you give it some data and let it learn how to solve the task by itself. The output is this pre-trained black-box model, which gives you either a decision or a recommendation, and then you decide what to do with it. For high-stakes decisions, humans are the ultimate decision maker. So my research involves how we build this explanation interface that allows AIs to talk to humans, back and forth, so that together they can achieve better accuracy or performance than each of them alone. And the bottom line is that human-AI teamwork is needed because neither humans nor AIs can solve the task by themselves completely and efficiently. The one task I will talk about today is fine-grained bird identification. We have here, in this CUB dataset, 200 classes. Your job is, given an image, let’s say this photo, to pick one of the classes among the 200, pine warbler, for example. So given this image, pick one of the classes, and the correct one here is American goldfinch. It’s actually not an easy task for lay users, unless you are a bird expert and spend a lot of time with birds.

4:15
Of course, asking lay users to choose one of the 200 classes requires users to know about all the classes. So this is a pretty daunting and overwhelming task that’s hard to scale beyond expert users. So here we reformulate the task into a two-stage process. Given an input image, we give it to the AI and the AI provides some prediction, some decision. Let’s say it thinks this is American goldfinch with 60% confidence. And now in stage two, users get to see this decision. In this case, it’s correct. They can compare with some exemplars of American goldfinch, and users get to agree or disagree. So it’s a binary yes-or-no decision. In some cases the AI will be wrong. Like here, it says, “I’m 30% confident that this is an evening grosbeak.” It’s similar, but it’s not exactly the same, and humans should disagree in this case. All right. The reason we reformulated the task in this way is that the AI is very, very accurate in this task of 200-way categorization: it’s 85% accurate. However, it’s not 100% accurate. So, how accurate are humans? If you have any guess, it’s time to make it now. We actually ran multiple Gorilla studies to get this number, and it’s 65%. It turns out not to be an easy task, although it’s binary, with just two options, and 50% is random chance. Humans score on average 65% on this task. It looks like the following: before each trial, each question, we provide examples of the class that the AI predicted, which could be correct or wrong, we don’t know. Let’s say horned puffin here. So we provide six examples, and users are given a photo that the AI predicted to be a horned puffin. Humans have to select yes or no, right? So this is the task, and humans are 65% accurate. Now, one question that we study in our research is, if this is the case, the interaction between human and AI is pretty limited.
It’s one-way. How can we get more information out of the AIs, for example, some explanation of why this is a horned puffin, so that humans can improve their accuracy? This leads to our second research question: do AI explanations help improve user accuracy? In previous research, we first invented methods that give AIs a nice capability of providing explanations, and then we ran the human study to evaluate the explanations as a post-process. Given the query image, the AI normally gives you some prediction, like junco, with some confidence, 97%, and the user decides yes or no on whether this is a junco. Our explainable AIs provide additional information. Here it says, “I think this is a junco because it’s similar to other junco examples in the beak, in the chest, or in the tail.” So we provide a visual correspondence between the query image and the examples of junco, which is what the model thinks it is. Okay, so this is the main explanation that we invented in this study.
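
The accept-or-reject scoring described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the study's own code: the `Trial` fields and the function names are hypothetical, and the only rule taken from the talk is that a user should say yes when the AI's prediction is right and no when it is wrong.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Trial:
    """One yes/no question shown to a participant (hypothetical structure)."""
    ai_label: str       # class predicted by the AI
    true_label: str     # ground-truth class of the photo
    user_accepts: bool  # True if the participant answered "yes"


def user_is_correct(trial: Trial) -> bool:
    # The user is right when they accept a correct AI prediction
    # or reject an incorrect one.
    ai_is_correct = trial.ai_label == trial.true_label
    return trial.user_accepts == ai_is_correct


def user_accuracy(trials: Sequence[Trial]) -> float:
    # Fraction of trials on which the yes/no decision was right.
    if not trials:
        raise ValueError("need at least one trial")
    return sum(user_is_correct(t) for t in trials) / len(trials)
```

Because every trial has only two possible answers, a participant who guesses at random would be expected to land near 50% on this measure, which is the chance level mentioned in the talk.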

8:29
So in the Gorilla study, we have six methods. Our main methods are these two, and the other three are updated versions of the first one, just to understand the effects. But in this talk, I just compare the main baseline and our main treatments. Humans without any explanation, so with no further information, are 65% accurate. But if you provide explanations, they improve, slightly but consistently, to 67% and 69%. There’s statistical significance between these groups, and there are around 60 users per method. In total, we have 355 users for the whole study. The way we set up this Gorilla study is that we have five training examples for this bird task, so we teach users how to do the task. Then we provide five quality-control examples, which we call validation, and if users pass it, we move them on to the test of 30 questions. If they fail, we just reject them and invite them out of the study. The acceptance rate into this study is only around 23%, since around 1,100 users participated in the training and screening, but only 355 made it past the quality control. We use Prolific, we hire native English speakers, and they came from a lot of places in the world. We paid $13.50 per hour, and the whole study is estimated to take 20 minutes, although it varies from around 10 to 245. And Gorilla is a very, very efficient tool for this type of study. This is our first study, third paper on this topic. Now, although we have only one user study, we can actually test two human-AI interaction models. In the first model, we provide the AI with an input image, it provides some decision, and the human gets to say yes or no on whether this is American goldfinch.
In the second model, the AI provides you its confidence, and based on the confidence, you let the AI make the decision by itself, or, when the AI is not so confident, we leave it to the human. So this is the second model. We want to test whether the explanations can help improve the human accuracy and the whole-system accuracy in model two. With model two, we test different ranges of confidence scores, and it turns out that we can automate, basically leave for the AI to decide, 75% of the data, and humans only need to work on 25% of the data. So that’s pretty interesting. Furthermore, we find that if you team the humans with the AI, so let each of them make decisions on complementary subsets of the data, then the team performance can actually be better than the AI alone, and it’s much better than humans alone. We are very excited about this work because it is the first work in this area that shows that humans and AI teaming up can actually achieve some improvement in visual recognition. One question you might ask is what made the humans more accurate when they see AI explanations, right? So we cut multiple slices deeper into the data we get from Gorilla. When the AI is correct, you can see that the blue bar, which is no explanation, performs similarly to the red and the brown, which are the treatments with explanations. However, when the AI is wrong, this is where the explanations actually shine. They benefit the users: there’s a gap of more than 10 points between no explanation and when an explanation is provided. For example, here the AI is incorrect, thinking that this is an olive flycatcher, which is not true. It’s actually a [unclear]. But

13:29
for the baseline treatments, when we provide this explanation, “This is an olive flycatcher because it looks like these birds”, all four out of four users that saw this explanation wrongly accepted it. Interestingly, for our explanation, when we provide these correspondences, all three out of three users for this example correctly rejected it and told the AI, “Hey, this is wrong, I do not accept this.” So in conclusion, we find for the first time that humans can improve their accuracy when using AI explanations for bird identification. In the model-two human-AI interaction, we can offload around 75% of the data to the AI and let users label just the rest, and that human-AI splitting improves the total system accuracy compared to AI alone and humans alone. This work was done with Giang Nguyen and Mohammad Taesiri, my PhD students. There are more interesting questions you can think of, right? For example, because we perform these online behavioural studies, we do not know exactly what makes users more accurate in the AI-wrong cases. We only have some hypotheses, but we do not actually observe what’s going on. The second is that the improvement in human accuracy is from two to four percentage points, so it’s still modest at this point. So how can we improve it further? And whether the explanations help human experts is a separate question. We share the paper and code and also the Gorilla screen settings on our website, if anyone is interested in replicating. Thank you very much.
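
The second interaction model, where confident predictions stay with the AI and the rest go to a human, might look roughly like the sketch below. It is an illustration only: the `Sample` fields, the `split_and_score` name, the threshold value, and the toy data are all assumptions, and "correct" for the human means the yes/no decision on the AI's label was right.

```python
from typing import NamedTuple, Sequence, Tuple


class Sample(NamedTuple):
    """One image with outcomes recorded in advance (hypothetical structure)."""
    confidence: float    # AI confidence in its top-1 prediction, in [0, 1]
    ai_correct: bool     # whether the AI's prediction matches the true class
    human_correct: bool  # whether a human's accept/reject decision was right


def split_and_score(samples: Sequence[Sample],
                    threshold: float) -> Tuple[float, float]:
    """Return (fraction handled by the AI alone, team accuracy).

    Samples at or above the threshold are decided by the AI on its own;
    samples below it are deferred to the human.
    """
    if not samples:
        raise ValueError("need at least one sample")
    automated = [s for s in samples if s.confidence >= threshold]
    deferred = [s for s in samples if s.confidence < threshold]
    n_correct = (sum(s.ai_correct for s in automated)
                 + sum(s.human_correct for s in deferred))
    return len(automated) / len(samples), n_correct / len(samples)


# Toy data, invented purely to exercise the function.
toy = [
    Sample(0.97, True, True),
    Sample(0.90, True, False),
    Sample(0.60, False, True),
    Sample(0.30, False, True),
]
fraction_automated, team_acc = split_and_score(toy, threshold=0.5)
```

Sweeping `threshold` over a range of values and recording both numbers is one way to look for an operating point like the 75%/25% split mentioned in the talk.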

15:30
That was amazing. I had no idea what you were going to talk about. That was absolutely amazing. I love this [unclear], and you’ve gone: no, no, no, no, that’s the wrong question. It’s AIs and humans working together. And I’m quite certain there are a lot of people in the room who would love an AI that helps them mark psychology essays come marking time, and can at least get a rough categorization of the essays, so it just needs to be checked by humans. I think that’s such a different approach to thinking about how AI and humans work together. So thank you for sharing that with us today. I do have a question for you. How did you maximise the quality of the answers that you got from your human participants?

16:20
Right, this is a very, very interesting question. We spent a lot of time thinking about this, and we also use code to try to improve it. So I have some backup slides here, where is it, here, where we give a non-exhaustive list of some of the things that we did and never thought of at the beginning. It takes multiple trials to get it right. For example, we use five to 10 quality-control examples. For this bird task we use five; for other tasks we use 10. We reject users if they do not pass, and, maybe cruelly, we also do not pay them. We say this in advance on the introduction screen: if you do not pass, we will not pay you. So this is the first thing, and it is important in scaring away some people who are not too serious. Second, we ask users on the first screen to stay on the screen and not switch tasks, because many times they come back two hours later on a task that should take 20 minutes. So it is very important to limit that and tell them that they are not allowed to switch tasks. We also do not allow studies on phones or tablets; they need to use a computer, and specifically Chrome, to avoid any issues. Something we have to hack into the programme, but which is nicely doable using JavaScript on Gorilla, is that we can set a timer of 5,000 milliseconds before displaying the continue button, or the yes-or-no buttons. So they need to look at the example first, and they cannot just hit next, next, next, next; that’s not possible with our setup. Also, and this is from one of the prior studies, we filter out the humans who perform too fast. For example, for a task that is 30 minutes, where we estimate 20 minutes on average, we will filter out people with eight minutes or less. So that’s all, yes.
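
Several of the screening rules listed above could be applied to exported results after the fact. The Python sketch below is one hedged way to do that: the `Participant` fields and function names are hypothetical, the eight-minute cutoff is the figure quoted in the answer, and what counts as passing validation is left to the caller because the talk does not state the pass mark.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Participant:
    """One participant's record (hypothetical export format)."""
    participant_id: str
    passed_validation: bool  # passed the quality-control questions
    device: str              # e.g. "computer", "phone", "tablet"
    minutes_taken: float     # total time spent on the study


def keep_participant(p: Participant, min_minutes: float = 8.0) -> bool:
    # Reject anyone who failed the quality-control questions.
    if not p.passed_validation:
        return False
    # Only computers are allowed; phones and tablets are excluded.
    if p.device != "computer":
        return False
    # Drop participants who finished in min_minutes or less.
    return p.minutes_taken > min_minutes


def filter_participants(participants: Sequence[Participant],
                        min_minutes: float = 8.0) -> List[Participant]:
    return [p for p in participants if keep_participant(p, min_minutes)]
```

The on-screen delay before the yes-or-no buttons appear is a separate, in-task measure, implemented in JavaScript on Gorilla according to the answer, and is not modelled here.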

18:32
That’s really great. I’ve taken a screenshot of your top tips. I think these are rules that a lot of people could use to improve their data quality. Thank you so much for your time today.

Get on the Registration List

BeOnline is the conference to learn all about online behavioral research. It's the ideal place to discover the challenges and benefits of online research and to learn from pioneers. If that sounds interesting to you, then click the button below to register for the 2023 conference on Thursday July 6th. You will be the first to know when we release new content and timings for BeOnline 2023.
