Overview of our AI models
Data annotation
The process of developing our AI models starts with data annotation. We begin by collecting video recordings of people's in-the-wild reactions to a wide variety of video content. We collect these reaction videos across different continents, age groups, and sexes so that our subject sample is globally representative.
We then annotate these videos with a large pool of qualified annotators, asking them to label the videos for attention and emotion signals. Following a "wisdom of the crowds" approach, a video frame receives an attention or emotion label only if a majority of the 3-7 annotators assigned to it agree on that specific frame.
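As an illustration of the majority-vote rule (not our production code), the sketch below aggregates per-frame annotator labels; the 3-7 annotator count comes from the description above, while the label strings and function name are hypothetical.

```python
from collections import Counter

def aggregate_frame_label(annotations: list[str]) -> str | None:
    """Majority-vote aggregation for a single video frame.

    `annotations` holds the labels given by the 3-7 annotators assigned
    to the frame (e.g. "attentive" / "not_attentive"). A label is kept
    only if a strict majority of annotators agree; otherwise the frame
    stays unlabeled and may later be sent for reannotation.
    """
    if not 3 <= len(annotations) <= 7:
        raise ValueError("expected 3-7 annotators per frame")
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes > len(annotations) / 2 else None

# Example: 4 of 5 annotators agree, so the frame is labeled "attentive".
print(aggregate_frame_label(["attentive"] * 4 + ["not_attentive"]))
```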
Note
When annotating attention, we ask our annotators to decide whether subjects watching video content are paying attention to it or not. We define a subject as attentive when they pay close attention to the screen, as inferred from their head pose, gaze, and posture.
Note
When annotating emotions, we ask our annotators to decide whether subjects watching video content show facial expressions that resemble a set of listed emotions. The list is constrained to "universal emotions", that is, emotions whose expressions and meaning are homogeneous across cultures. This constraint ensures that our emotion labels generalize across human populations.
As a next step, we pass the attention and emotion labels through our standardized quality-assurance pipeline, which filters out unreliable annotators and flags videos whose attention and emotion labels have not reached a statistically reliable conclusion. Videos flagged for reannotation are assigned additional human annotators who help resolve ambiguous attention and emotion labels.
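A simplified sketch of how a video could be flagged for reannotation, based on how many of its frames reached a majority decision; the threshold and function name are assumptions, not our actual quality-assurance criteria.

```python
def needs_reannotation(frame_labels: list[str | None],
                       min_decided_fraction: float = 0.9) -> bool:
    """Flag a video for reannotation.

    `frame_labels` are the per-frame results of the majority vote
    (None means the annotators did not reach a majority). If too many
    frames are undecided, the video is assigned additional annotators.
    The 0.9 threshold is illustrative only.
    """
    decided = sum(label is not None for label in frame_labels)
    return decided / len(frame_labels) < min_decided_fraction
```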
Our annotated data set is constantly growing, and currently contains over 1.5 billion attention and emotion labels from over 7.1 million in-the-wild video recordings from all around the world.
Building Machine Learning Models
The next step after the collection of annotated attention and emotion data is training our machine learning models. This is the domain of our Computer Vision Team, who use this immense quantity of annotated videos of human behavior to build AI models that reproduce (and in some cases outperform) the function of human annotators. These models predict attention and emotion signals from video recordings of behavior without human intervention. To avoid model bias, they are trained on carefully selected subsets of the whole annotated dataset, with samples balanced across dimensions such as skin color, head pose, partial face occlusion, and device type (desktop vs. mobile).
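The balanced-subset selection could look roughly like the sketch below; the dimension names come from the text, while the metadata layout and sampling logic are illustrative assumptions rather than our training pipeline.

```python
import random
from collections import defaultdict

def balanced_subset(samples: list[dict], dimension: str, per_group: int,
                    seed: int = 0) -> list[dict]:
    """Draw an equally sized random sample for every value of `dimension`.

    Each sample is a dict of metadata, e.g.
    {"video_id": "...", "device_type": "mobile", "head_pose": "frontal"}.
    Grouping on one dimension at a time is a simplification; a real
    pipeline would balance across several dimensions jointly.
    """
    groups: dict[str, list[dict]] = defaultdict(list)
    for sample in samples:
        groups[sample[dimension]].append(sample)
    rng = random.Random(seed)
    subset = []
    for group in groups.values():
        subset.extend(rng.sample(group, min(per_group, len(group))))
    return subset

# Example: balance training data across device types.
# train_set = balanced_subset(all_samples, dimension="device_type", per_group=100_000)
```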
Model offering
Our model offering can be understood in terms of a visual engagement funnel. The lowest level of engagement with a given piece of content is not being present to see it at all, and the highest level is exhibiting facial reactions to it. The funnel can be described in four steps:
1. Presence
When the subject is present in front of the device with any body part visible
2. Face detection
When the subject's face is visible in front of the device
3. Visual attention on screen
When the subject pays attention to the device screen
Note
We currently have two AI models that identify visual attention on screen from camera images. Each is better suited to different devices:
- Attention, better suited for desktop devices. The person is in front of the camera with their face fully visible. The AI infers whether the person is paying close attention to the screen from their head pose, gaze, posture, etc. The model considers users "not attentive" when they are engaged in other activities, such as eating or talking.
- Eyes on screen, better suited for mobile devices. The person is in front of the camera and at least the top half of their face is visible. Visual attention to the device screen is inferred from their head pose and gaze. No other restrictions are applied.

In both cases, attention may not be evaluated reliably for poor-quality images, for example in very dark environments.
4. Facial reactions that imply emotional experience
When the subject exhibits facial reactions representative of universal emotions
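To make the funnel concrete, here is a hypothetical sketch that maps per-frame model outputs onto the funnel steps; the names and boolean interface are assumptions and do not reflect the actual SDK API.

```python
from enum import IntEnum

class EngagementLevel(IntEnum):
    """The visual engagement funnel described above."""
    ABSENT = 0
    PRESENT = 1          # any body part in front of the device
    FACE_DETECTED = 2    # face visible to the camera
    ATTENTIVE = 3        # visual attention on the screen
    FACIAL_REACTION = 4  # facial expression of a universal emotion

def engagement_level(presence: bool, face: bool,
                     attention: bool, reaction: bool) -> EngagementLevel:
    """Map per-frame model outputs to the deepest funnel step reached.

    The boolean inputs stand in for the outputs of the Presence, Face
    detection, Attention / Eyes-on-screen, and emotion models.
    """
    if not presence:
        return EngagementLevel.ABSENT
    if not face:
        return EngagementLevel.PRESENT
    if not attention:
        return EngagementLevel.FACE_DETECTED
    return EngagementLevel.FACIAL_REACTION if reaction else EngagementLevel.ATTENTIVE
```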
Model performance
You can see an overview of model performance in Table 1 below.
Note
Balanced accuracy is calculated as the average of two quantities:
- The percentage of actual positive cases (e.g. the subject paying attention) that our model labels as positive
- The percentage of actual negative cases (e.g. the subject not paying attention) that our model labels as negative
The balanced accuracy metric is a commonly used model performance indicator that is unbiased with respect to class imbalance (i.e. how many attentive periods and non-attentive periods occur relative to each other). It is intended to provide a holistic view of model performance and good comparability across different classifiers and domains.
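For concreteness, balanced accuracy can be computed from a confusion matrix as in the generic sketch below; the numbers in the example are arbitrary and not taken from our evaluations.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Average of sensitivity (true positive rate) and specificity
    (true negative rate), as described above."""
    sensitivity = tp / (tp + fn)   # share of actual positives labeled positive
    specificity = tn / (tn + fp)   # share of actual negatives labeled negative
    return (sensitivity + specificity) / 2

# Example with made-up counts: 80% of positive and 88% of negative frames
# classified correctly gives a balanced accuracy of 0.84.
print(balanced_accuracy(tp=80, fn=20, tn=88, fp=12))  # 0.84
```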
| Model Name | Description | Balanced Accuracy | Available in Android SDK | Available in JavaScript SDK |
|---|---|---|---|---|
| Presence | Subject is present in front of the device camera with at least one body part visible | 0.70 | Yes | No |
| Face detection | Subject's face is detectable within the camera feed | N/A | Yes | No |
| Attention | Subject is looking at the device screen. Optimized for desktop. | 0.64 | Yes | Yes |
| Eyes-on-screen | Subject is looking at the device screen. Optimized for mobile. | 0.84 | Yes | No |
| Happiness | Subject is smiling. | 0.84 | Yes | Yes |
| Surprise | Subject appears surprised. | 0.72 | Yes | Yes |
| Confusion | Subject appears confused. | 0.75 | Yes | Yes |
| Contempt | Subject appears contemptuous. | 0.78 | Yes | Yes |
| Disgust | Subject appears disgusted. | 0.75 | Yes | Yes |
| Empathy | Subject appears empathetic / sad. | 0.73 | Yes | Yes |
Table 1: Our current model offering
Our SDKs
We serve our models to clients by integrating them into Software Development Kits (SDKs). We currently offer an Android SDK for mobile-based attention and emotion inference and a JavaScript SDK for web-based attention and emotion inference.
To read more about getting started with our SDKs, visit our Getting Started guide: