GazeScorer: Automated gaze direction scoring from videos collected online through a conventional webcam.

Alex Fras­er, Uni­ver­si­ty of Oxford


As online research has become more prevalent, researchers are investigating the possibility of replicating techniques that go beyond simple behavioural measurements. One method that has captured the imagination of researchers is leveraging the webcam to collect eye-tracking data. Several packages have been developed for collecting such data, but they have significant limitations due to extensive and potentially frustrating calibration procedures. Unfortunately, this can limit the accessibility of these packages when collecting data with specific populations, such as children and participants with neuro-developmental difficulties. To overcome this, we have looked at how gaze detection studies are conducted with infants, where researchers manually score gaze direction from videos to minimise data loss.

Using these methods, we have developed GazeScorer, an automated gaze scoring package that can distinguish a Left, Right, and Central gaze location using basic image processing. Using videos collected through a Gorilla-hosted experiment, we have demonstrated a good level of inter-rater reliability between GazeScorer and a manual scorer. This opens the possibility of a hybrid-scoring system with minimal manual intervention in the short term. Future development will focus on utilising live webcam footage for data collection through the browser. This software would provide a potential resource for researchers who would benefit from gaze-based responses, but do not require high spatial resolution.

Full Tran­script:

Alex Fras­er:
Great. So thanks for hav­ing me. And I’m real­ly excit­ed to tell you about the project that we’ve been work­ing on for the last year and a bit. But I thought I’d start by dis­cussing my jour­ney into online research and where I started.

Alex Fras­er:
So in 2017 my department shut down without any real notice, and we lost access to lab space right in the middle of my PhD. I moved a lot of my research online and we were really impressed by the amount of data we were able to collect in such a short period of time. But we also wanted to look at how we could take this data collection into the homes of participants who are not able to get to the lab quite so easily, such as small children and diverse populations.

Alex Fras­er:
But we were also inter­est­ed in how we could move beyond the reac­tion times and the accu­ra­cy scores that we were get­ting very reli­ably in the brows­er. And so we con­tact­ed Goril­la and they had been work­ing on this at the same time, and they built us an exper­i­ment for us to work with. And we start­ed doing some pilot­ing, but we found that the cal­i­bra­tion was quite long and involved and it made it quite dif­fi­cult for us to col­lect … Get the kind of eye track­ing that we’d hoped to be able to do with these populations.

Alex Fras­er:
So we took a bit of a shot in the dark and we started a new project with the Oxford Research Software Engineering team to see what we could do about making our own pipeline for analyzing webcam data. And when we sat down with them and established what we wanted, the main thing that we needed was something that had a very limited calibration. So we really wanted to minimize the calibration as much as possible, to make it as simple as we could for us to collect data.

Alex Fras­er:
And as we dis­cussed it more, we start­ed think­ing more about what we want­ed. And so we were focused so much on try­ing to repli­cate an eye track­er, and a lab-based eye track­er, but we thought we may take a step back and actu­al­ly con­sid­er doing some­thing maybe a bit more sim­plis­tic, but maybe more reli­able. And we looked to the man­u­al scor­ing that is often done in infant research, and we won­dered how we could do that a bit more effi­cient­ly and a bit less labor inten­sive. And so we decid­ed to actu­al­ly focus more on gaze ori­en­ta­tion and so cod­i­fy­ing a left and right look, com­pared to actu­al­ly try­ing for a pre­cise gaze location.

Alex Fras­er:
So doing this meant collecting a lot of video footage of participants following the target stimuli. And Sylvia discussed how she did that with children in the previous talk, but with the adults it was a lot simpler. We could just send them the same procedure and they generally were able to comply themselves and we didn’t have to supervise them as they were doing it. And we did this in Gorilla and collected a lot of footage online and we ended up with a series of videos like this. And what we were able to do is trim these videos down and synchronize them with the target stimuli. So we know approximately where they’re looking as they are watching the stimuli.

Alex Fras­er:
Then we need­ed to break down the images into indi­vid­ual frames. And once we had those indi­vid­ual frames, we were able to do more image pro­cess­ing than [inaudi­ble 00:03:12] we were using the web­cam footage inde­pen­dent­ly. But also, we need­ed to set down a base­line, ground truth, that we could com­pare to our auto­mat­ic scor­er. So to do this we went back to the man­u­al scor­ing that we were try­ing to repli­cate. And so we put all of these images online into anoth­er Goril­la exper­i­ment, and we had an inde­pen­dent naive researcher who came in and man­u­al­ly scored all of the videos for their gaze ori­en­ta­tion. Which took a fair amount of time and a lot of effort, but we got that done.

Alex Fras­er:
And this meant that we were able to do a good com­par­i­son to our auto­mat­ed scor­er. And look­ing at how we actu­al­ly did our auto­mat­ed scor­ing, the first thing we need to do is iden­ti­fy the face in the image. So once we detect the face, we were able to cut it down and we could plot land­marks on to each image of every face. And specif­i­cal­ly what we need­ed was the eye loca­tion, and we could actu­al­ly iso­late the eye and work with that independently.
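
The talk does not name the face detector or the landmark model, so the following is only a minimal sketch of the last step described here: cutting the eye region out of a frame once landmark points are available. The function name `crop_eye`, the `pad` margin, and the toy landmark coordinates are assumptions made for illustration, not GazeScorer's actual code.

```python
from typing import List, Tuple

Point = Tuple[int, int]   # (x, y) pixel coordinates
Image = List[List[int]]   # rows of grayscale values, 0 (black) to 255 (white)


def crop_eye(image: Image, eye_landmarks: List[Point], pad: int = 2) -> Image:
    """Cut out the bounding box around one eye's landmarks, plus a small margin.

    Hypothetical helper: the landmarks are assumed to come from whatever
    face-landmark model is used upstream.
    """
    height, width = len(image), len(image[0])
    xs = [x for x, _ in eye_landmarks]
    ys = [y for _, y in eye_landmarks]
    # Clamp the padded box so it never leaves the image.
    left = max(min(xs) - pad, 0)
    right = min(max(xs) + pad, width - 1)
    top = max(min(ys) - pad, 0)
    bottom = min(max(ys) + pad, height - 1)
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Toy 10 x 20 "face" image in which every pixel stores its own column index.
face = [[x for x in range(20)] for _ in range(10)]
landmarks = [(5, 4), (8, 3), (11, 4), (8, 5)]  # made-up eye outline
eye = crop_eye(face, landmarks, pad=1)
print(len(eye), len(eye[0]))  # rows and columns of the cropped eye region
```

Working on a small crop like this keeps the later steps cheap, which matters given how few pixels the eye occupies in a typical webcam frame.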

Alex Fras­er:
And as you can see, the eye itself is actually very small, only about 30 to 40 pixels, so there’s not a lot of space for us to work with. But what we were able to do is identify the iris by looking for essentially the darkest space that was within the target. And when we processed this we ended up with a shape like this. And when we have this shape we can then identify the middle of the shape, and we classify this as being the middle of the iris.
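
One simple way to picture the "darkest space, then middle of the shape" idea is a threshold followed by a centroid. The sketch below assumes a grayscale eye crop stored as a list of rows; the function name `iris_centre` and the `tolerance` value are illustrative assumptions, and the real package may process the shape differently.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # rows of grayscale values, 0 (black) to 255 (white)


def iris_centre(eye: Image, tolerance: int = 20) -> Optional[Tuple[float, float]]:
    """Return the (x, y) centroid of the darkest pixels in an eye crop.

    Pixels within `tolerance` grey levels of the darkest pixel are treated
    as the iris shape; `tolerance` is an assumed, tunable value.
    """
    if not eye or not eye[0]:
        return None
    darkest = min(min(row) for row in eye)
    cutoff = darkest + tolerance
    xs, ys = [], []
    for y, row in enumerate(eye):
        for x, value in enumerate(row):
            if value <= cutoff:
                xs.append(x)
                ys.append(y)
    # The darkest pixel always passes the cutoff, so xs is never empty here.
    return (sum(xs) / len(xs), sum(ys) / len(ys))


# Toy 5 x 9 eye crop: bright background with a dark 3 x 3 "iris".
toy_eye = [[200] * 9 for _ in range(5)]
for y in range(1, 4):
    for x in range(5, 8):
        toy_eye[y][x] = 30

centre = iris_centre(toy_eye)
print(centre)  # centroid of the dark block
```

A centroid is robust to a ragged outline, but dark eyelashes or shadows inside the crop would pull it off centre, which is one reason a tight eye crop helps.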

Alex Fras­er:
Now I said we [inaudible 00:04:39] wanted to try and minimize any calibration, and this is where we replace what a traditional eye tracking calibration would be. So instead of doing the traditional calibration that you would expect, we just specify where a central gaze is, so we know where the participant is looking when they’re looking straight ahead. We do this within the first frame of any video, and then we can assign a buffer around the center of that image. This is what we do instead of a calibration: any movement outside of the buffer region is codified as a look towards the left or the right.
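
The buffer rule can be sketched in a few lines. This assumes each frame has already been reduced to a horizontal iris position in pixels, and that the first frame shows a central gaze; the names `classify_gaze` and `score_video` and the 3-pixel buffer are hypothetical choices for the example, not values from the talk.

```python
from typing import List


def classify_gaze(iris_x: float, centre_x: float, buffer_px: float) -> str:
    """Label one frame as 'left', 'right' or 'central'.

    Labels refer to image coordinates; whether image-left is the
    participant's left depends on whether the webcam footage is mirrored.
    """
    offset = iris_x - centre_x
    if abs(offset) <= buffer_px:
        return "central"
    return "left" if offset < 0 else "right"


def score_video(iris_xs: List[float], buffer_px: float = 3.0) -> List[str]:
    """Score every frame against the iris position in the first frame."""
    if not iris_xs:
        return []
    centre_x = iris_xs[0]  # first frame is assumed to be a central gaze
    return [classify_gaze(x, centre_x, buffer_px) for x in iris_xs]


# Made-up iris positions (pixels) for five consecutive frames.
positions = [15.0, 15.5, 10.0, 14.0, 20.5]
labels = score_video(positions)
print(labels)
```

The width of the buffer trades off sensitivity against noise: a narrow buffer picks up small shifts but also jitter in the iris estimate, while a wide one only registers clear looks to either side.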

Alex Fras­er:
And so now that we have an automated scoring and a manual scoring, we can actually compare the two. So in these visual plots, we can see how the top row, which is the manual scorer, is codifying the gaze, and the bottom row, the automated scorer, is codifying it over the same period. There’s a little bit of a lag, but generally they follow the same pattern and converge in their gaze orientation. So to quantify this a little bit more we did a Cohen’s kappa comparison between the two. And we set the minimum value that we wanted to accept at 0.6, which is a generally accepted threshold for Cohen’s kappa scores.
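
For readers unfamiliar with the statistic, Cohen’s kappa is the observed agreement corrected for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences; the frame labels are invented for the example and are not data from the study.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same frames."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of frames")
    n = len(rater_a)
    # Observed agreement: share of frames with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for ten frames; the raters disagree on one frame.
manual = ["left"] * 3 + ["central"] * 4 + ["right"] * 3
automated = ["left"] * 2 + ["central"] * 5 + ["right"] * 3

kappa = cohens_kappa(manual, automated)
print(round(kappa, 3), kappa >= 0.6)
```

Here the raters agree on 9 of 10 frames (p_o = 0.9) and chance agreement is 0.35, so kappa is 0.55 / 0.65, roughly 0.85, which would clear the 0.6 threshold.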

Alex Fras­er:
And what we found when we looked at the data among the adults is that the vast majority of participants got above a Cohen’s kappa value of 0.6. If anything, a lot of the participants are scoring well above 0.8, approaching near-perfect agreement. This is only focusing on the static frames where the target is at its most extreme position, but it is still showing very good agreement. There are a couple of participants where one eye under-performed compared to the other, but in general we are doing very well. And looking at the sample that Sylvia collected before me, we see very similar patterns of results in the children, where the majority are getting very good agreement in both eyes, but there are a couple where the agreement is lower in one eye than the other. That being said, we’re still seeing very high agreement in the optimal conditions we’re looking to work with.

Alex Fras­er:
So to give you a bit of a summary, and what we hope to do with this moving forward. Basically, how did we perform? I think we did very well. There was generally quite good agreement between the automatic and the manual scorer in these optimal conditions that we put down. When we look at the more difficult parts, where the eye is actually in movement because it’s following a target, the performance isn’t quite as good. And we’re looking at how we can improve that and what elements may impact the movement. And hopefully we can improve the algorithm. But this is basically just a first pass at the problem. And hopefully we’ll be able to improve this before we give people access in the near future.

Alex Fras­er:
But the other thing I wanted to discuss is where we fit in within current online research. We heard a lot of great work yesterday using WebGazer, and from the team working with MouseView. And in fact, we saw Kat Ellis, who showed that she managed to get some good results with kids with Fragile X, who were able to go through the calibration that we were having trouble with. And so this is all very impressive, and we’re hoping that we can just be another resource that will fit into this new environment of online research. And hopefully people will be able to do good things with this in the future.

Alex Fras­er:
Yeah, thank you to every­one on the team who’s worked with us. And our pre-print is avail­able with a bit more detail about the data that I pre­sent­ed here. Feel free to email me if you have any questions.

Speak­er 2:
Thanks very much, Alex. We have got a cou­ple of ques­tions in the chats, so I’m going to ask you one of them.

But then if you go down to them in the Q&A, that’d be really kind. Thank you. So the first one is from Catherine Ellis and she says, “How still do participants have to be?” So for example, if you were working with children, how careful would you need to be about that?

Alex Fras­er:
We still need to maintain relatively little movement, so we do need to minimize movement as much as possible. But because we’re capturing all the features of the face as we are, one of the future things that we hope to be able to do is compensate for movement more.

Alex Fras­er:
So yeah, as I said, this is still the very ear­ly, very pre­lim­i­nary stuff. Once we can work more with the face land­marks and account­ing for move­ment, we’ll be able to estab­lish ways of com­pen­sat­ing for move­ment a lit­tle bit bet­ter. So that’s the goal in the future.

Speak­er 2:
Bril­liant. Thank you very much.


