Overview of our AI models
Data annotation
The process of developing our AI models starts with data annotation. We begin by collecting video recordings of people's in-the-wild reactions to a wide variety of video content. We collect these reaction videos across different continents, age groups, and sexes so that our subject sample is globally representative.
We then annotate these videos with a large pool of qualified annotators, asking them to label the videos for attention and emotion signals. Following a "wisdom of the crowds" approach, a video frame receives an attention or emotion label only if a majority of the 3-7 annotators assigned to it agree on that specific frame.
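As an illustration of the majority-vote rule (not our production code), the sketch below aggregates per-frame annotator labels; the 3-7 annotator count comes from the description above, while the label strings and function name are hypothetical.

```python
from collections import Counter

def aggregate_frame_label(annotations: list[str]) -> str | None:
    """Majority-vote aggregation for a single video frame.

    `annotations` holds the labels given by the 3-7 annotators assigned
    to the frame (e.g. "attentive" / "not_attentive"). A label is kept
    only if a strict majority of annotators agree; otherwise the frame
    stays unlabeled and may later be sent for reannotation.
    """
    if not 3 <= len(annotations) <= 7:
        raise ValueError("expected 3-7 annotators per frame")
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes > len(annotations) / 2 else None

# Example: 4 of 5 annotators agree, so the frame is labeled "attentive".
print(aggregate_frame_label(["attentive"] * 4 + ["not_attentive"]))
```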
Note
When annotating attention, we ask our annotators to decide whether subjects watching video content are paying attention to it or not. We define a subject as attentive when they pay close attention to the screen, as inferred from their head pose, gaze, and posture.
Note
When annotating emotions, we ask our annotators to decide whether subjects watching video content show facial expressions that resemble a set of listed emotions. The list is constrained to "universal emotions", that is, emotions whose expressions and meaning are homogeneous across cultures. This constraint ensures that our emotion labels generalize across human populations.
As a next step, we pass the attention and emotion labels through our standardized quality-assurance pipeline, which filters out unreliable annotators and flags videos whose attention and emotion labels have not reached a statistically reliable conclusion. Videos flagged for reannotation are assigned additional human annotators who help resolve ambiguous attention and emotion labels.
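A simplified sketch of how a video could be flagged for reannotation, based on how many of its frames reached a majority decision; the threshold and function name are assumptions, not our actual quality-assurance criteria.

```python
def needs_reannotation(frame_labels: list[str | None],
                       min_decided_fraction: float = 0.9) -> bool:
    """Flag a video for reannotation.

    `frame_labels` are the per-frame results of the majority vote
    (None means the annotators did not reach a majority). If too many
    frames are undecided, the video is assigned additional annotators.
    The 0.9 threshold is illustrative only.
    """
    decided = sum(label is not None for label in frame_labels)
    return decided / len(frame_labels) < min_decided_fraction
```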
Our annotated data set is constantly growing, and currently contains over 1.5 billion attention and emotion labels from over 7.1 million in-the-wild video recordings from all around the world.
Building Machine Learning Models
The next step after the collection of annotated attention and emotion data is training our machine learning models. This is the domain of our Computer Vision Team, who use this immense quantity of annotated videos of human behavior to build AI models that reproduce (and in some cases outperform) the function of human annotators. These models predict attention and emotion signals from video recordings of behavior without human intervention. To avoid model bias, they are trained on carefully selected subsets of the whole annotated dataset, with samples balanced across dimensions such as skin color, head pose, partial face occlusion, and device type (desktop vs. mobile).
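The balanced-subset selection could look roughly like the sketch below; the dimension names come from the text, while the metadata layout and sampling logic are illustrative assumptions rather than our training pipeline.

```python
import random
from collections import defaultdict

def balanced_subset(samples: list[dict], dimension: str, per_group: int,
                    seed: int = 0) -> list[dict]:
    """Draw an equally sized random sample for every value of `dimension`.

    Each sample is a dict of metadata, e.g.
    {"video_id": "...", "device_type": "mobile", "head_pose": "frontal"}.
    Grouping on one dimension at a time is a simplification; a real
    pipeline would balance across several dimensions jointly.
    """
    groups: dict[str, list[dict]] = defaultdict(list)
    for sample in samples:
        groups[sample[dimension]].append(sample)
    rng = random.Random(seed)
    subset = []
    for group in groups.values():
        subset.extend(rng.sample(group, min(per_group, len(group))))
    return subset

# Example: balance training data across device types.
# train_set = balanced_subset(all_samples, dimension="device_type", per_group=100_000)
```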
Model offering
Our model offering can be understood in terms of a visual engagement funnel. The lowest level of engagement with a given piece of content is not being present to see it at all, and the highest level is exhibiting facial reactions to it. The funnel can be described in four steps:
1. Presence
When the subject is present in front of the device with any body part visible
2. Face detection
When the subject's face is visible in front of the device
3. Visual attention on screen
When the subject pays attention to the device screen
Note
We currently have two AI models that identify visual attention on screen from camera images. Each is better suited to different devices:
- Attention, better suited for desktop devices. The person is in front of the camera with their face fully visible. The AI infers whether the person is paying close attention to the screen from their head pose, gaze, posture, etc. The model considers users "not attentive" when they are engaged in other activities, such as eating or talking.
- Eyes on screen, better suited for mobile devices. The person is in front of the camera and at least the top half of their face is visible. Visual attention to the device screen is inferred from their head pose and gaze. No other restrictions are applied.

In both cases, attention may not be evaluated reliably for poor-quality images, for example in very dark environments.
4. Facial reactions that imply emotional experience
When the subject exhibits facial reactions representative of universal emotions
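To make the funnel concrete, here is a hypothetical sketch that maps per-frame model outputs onto the funnel steps; the names and boolean interface are assumptions and do not reflect the actual SDK API.

```python
from enum import IntEnum

class EngagementLevel(IntEnum):
    """The visual engagement funnel described above."""
    ABSENT = 0
    PRESENT = 1          # any body part in front of the device
    FACE_DETECTED = 2    # face visible to the camera
    ATTENTIVE = 3        # visual attention on the screen
    FACIAL_REACTION = 4  # facial expression of a universal emotion

def engagement_level(presence: bool, face: bool,
                     attention: bool, reaction: bool) -> EngagementLevel:
    """Map per-frame model outputs to the deepest funnel step reached.

    The boolean inputs stand in for the outputs of the Presence, Face
    detection, Attention / Eyes-on-screen, and emotion models.
    """
    if not presence:
        return EngagementLevel.ABSENT
    if not face:
        return EngagementLevel.PRESENT
    if not attention:
        return EngagementLevel.FACE_DETECTED
    return EngagementLevel.FACIAL_REACTION if reaction else EngagementLevel.ATTENTIVE
```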
Model performance
You can see an overview of model performance in Table 1 below.
Note
Balanced accuracy is calculated as the average of two quantities:
- The percentage of actual positive cases (e.g. the subject paying attention) that our model labels as positive
- The percentage of actual negative cases (e.g. the subject not paying attention) that our model labels as negative
The balanced accuracy metric is a commonly used model performance indicator that is unbiased with respect to class imbalance (i.e. how many attentive periods and non-attentive periods occur relative to each other). It is intended to provide a holistic view of model performance and good comparability across different classifiers and domains.
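For concreteness, balanced accuracy can be computed from a confusion matrix as in the generic sketch below; the numbers in the example are arbitrary and not taken from our evaluations.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Average of sensitivity (true positive rate) and specificity
    (true negative rate), as described above."""
    sensitivity = tp / (tp + fn)   # share of actual positives labeled positive
    specificity = tn / (tn + fp)   # share of actual negatives labeled negative
    return (sensitivity + specificity) / 2

# Example with made-up counts: 80% of positive and 88% of negative frames
# classified correctly gives a balanced accuracy of 0.84.
print(balanced_accuracy(tp=80, fn=20, tn=88, fp=12))  # 0.84
```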
| Model Name | Description | Balanced Accuracy | Available in Android SDK | Available in JavaScript SDK |
|---|---|---|---|---|
| Presence | Subject is present in front of the device camera with at least one body part visible | 0.70 | Yes | No |
| Face detection | Subject's face is detectable within the camera feed | N/A | Yes | No |
| Attention | Subject is looking at the device screen. Optimized for desktop. | 0.64 | Yes | Yes |
| Eyes-on-screen | Subject is looking at the device screen. Optimized for mobile. | 0.84 | Yes | No |
| Happiness | Subject is smiling. | 0.84 | Yes | Yes |
| Surprise | Subject appears surprised. | 0.72 | Yes | Yes |
| Confusion | Subject appears confused. | 0.75 | Yes | Yes |
| Contempt | Subject appears contemptuous. | 0.78 | Yes | Yes |
| Disgust | Subject appears disgusted. | 0.75 | Yes | Yes |
| Empathy | Subject appears empathetic / sad. | 0.73 | Yes | Yes |
Table 1: Our current model offering
Our SDKs
We serve our models to clients by integrating them into Software Development Kits (SDKs). We currently offer an Android SDK for mobile-based attention and emotion inference and a JavaScript SDK for web-based attention and emotion inference.
To read more about getting started with our SDKs, visit our Getting Started guide: