    Overview of our AI models

    Data annotation

    The development of our AI models starts with data annotation. Before any data can be annotated, we collect video recordings of people’s in-the-wild reactions to a wide variety of video content. We record such reactions across different continents and across all age groups and sexes to achieve a globally representative subject sample.


    We then annotate these videos with a large pool of qualified annotators, who label the videos for attention and emotion signals. Employing a “wisdom of the crowds” approach, we label a video frame with an attention or emotion signal only if the 3-7 annotators assigned to that frame reach a majority agreement on it.
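    As a rough illustration of this per-frame majority vote, the sketch below (TypeScript) shows how a handful of boolean annotator votes could be aggregated into a single frame label. The function name and the handling of ties are our own illustration, not the production pipeline.

        // Hypothetical per-frame label aggregation via majority vote.
        // Each annotator marks a frame as positive (true) or negative (false)
        // for a given signal (e.g. "attention" or "happiness").
        function aggregateFrameLabel(votes: boolean[]): boolean | null {
          const positives = votes.filter((v) => v).length;
          const negatives = votes.length - positives;
          if (positives > negatives) return true;   // majority: signal present
          if (negatives > positives) return false;  // majority: signal absent
          return null;                              // no majority: undecided
        }

        // Example: 3 of 5 annotators mark the frame as attentive -> labeled attentive.
        const frameLabel = aggregateFrameLabel([true, true, false, true, false]); // true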

    Note

    When annotating attention, we ask our annotators to decide whether subjects watching video content are paying attention to it or not. We define a subject to be attentive when they pay close attention to the screen, as inferred from their head pose, gaze, and posture.

    Note

    When annotating emotions, we ask our annotators to decide whether subjects watching video content show facial expressions that resemble any emotion from a predefined list. The list is constrained to “universal emotions”, that is, emotions with cross-culturally homogeneous expressions and meaning. This constraint ensures that our emotion labels generalize across all human populations.

    As a next step, we pass the attention and emotion labels through our standardized quality assurance pipeline, which filters out unreliable annotators and selects for reannotation any videos whose attention and emotion labels have not reached a statistically reliable conclusion. Videos selected for reannotation are assigned additional human annotators who help resolve ambiguous attention and emotion labels.
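    One simple, hypothetical trigger for reannotation is to flag a video when too large a share of its frames ended up without a clear majority label. The threshold and function below are illustrative assumptions, not a description of our actual QA pipeline.

        // Hypothetical reannotation trigger: flag a video when the share of
        // frames without a clear majority label exceeds a chosen threshold.
        function needsReannotation(
          frameLabels: Array<boolean | null>,
          maxUndecidedShare = 0.1,
        ): boolean {
          const undecided = frameLabels.filter((label) => label === null).length;
          return undecided / frameLabels.length > maxUndecidedShare;
        }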

    Our annotated dataset is constantly growing and currently contains over 1.5 billion attention and emotion labels from over 7.1 million in-the-wild video recordings from around the world.

    Building Machine Learning Models


    The next step after collecting annotated attention and emotion data is training our machine learning models. This is the remit of our Computer Vision Team, who use the immense quantity of annotated videos of human behavior to build AI models that reproduce (and in some cases outperform) the function of human annotators. These models predict attention and emotion signals from video recordings of behavior without human intervention. To mitigate model bias, they are trained on carefully selected subsets of the whole annotated dataset, with samples balanced across dimensions such as skin color, head pose, partial face occlusion, and device type (desktop vs. mobile).
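    To make the balanced-sampling idea concrete, the sketch below (TypeScript) draws an equal number of recordings from each group along one dimension. The types, the grouping key, and the per-group count are illustrative assumptions rather than our actual training recipe.

        // Hypothetical balanced subsampling: draw the same number of recordings
        // from each group of a chosen dimension (device type in this sketch;
        // the same idea applies to skin-tone, head-pose or occlusion bins).
        interface Recording {
          id: string;
          deviceType: "desktop" | "mobile";
        }

        function balancedSubset(recordings: Recording[], perGroup: number): Recording[] {
          const groups = new Map<string, Recording[]>();
          for (const r of recordings) {
            const bucket = groups.get(r.deviceType) ?? [];
            bucket.push(r);
            groups.set(r.deviceType, bucket);
          }
          // Take at most `perGroup` recordings from every group.
          return [...groups.values()].flatMap((bucket) => bucket.slice(0, perGroup));
        }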

    Model offering

    Our model offering can be understood in terms of a visual engagement funnel. The lowest level of engagement with a given piece of content means not being present to see it at all, and the highest level means exhibiting facial reactions to it. The engagement funnel can be described in 4 separate steps (a conceptual sketch follows the list):

    1. Presence

    When the subject is present in front of the device with any body part visible

    2. Face detection

    When the subject is present in front of the device with their face visible

    3. Visual attention on screen

    When the subject pays visual attention to the device screen


    Note

    We currently have two AI models that identify visual attention on screen from camera images. Each is better suited to a different type of device:

    • Attention, better suited for desktop devices. The person is in front of the camera with their face fully visible. The model infers whether the person is paying close attention to the screen from their head pose, gaze, posture, etc., and considers users "not attentive" when they are engaged in other activities, like eating or talking.
    • Eyes on screen, better suited for mobile devices. The person is in front of the camera with at least the top half of their face visible. Visual attention to the device screen is inferred from their head pose and gaze; no other restrictions apply.

    In both cases, attention may not be evaluated properly on poor-quality images, for example in very dark environments.

    4. Facial reactions that imply emotional experience

    When the subject exhibits facial reactions representative of universal emotions
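    As a conceptual sketch of the funnel, the TypeScript below maps hypothetical per-frame model outputs to the funnel step reached; the attention signal stands in for whichever attention model fits the device (see the note above). This is not the SDK API; every name in it is an assumption made for illustration.

        // Hypothetical per-frame funnel evaluation. Each field stands in for the
        // corresponding model's output; all names here are illustrative only.
        interface FrameSignals {
          presence: boolean;      // any body part in front of the camera
          faceDetected: boolean;  // face visible in the camera feed
          attentive: boolean;     // Attention (desktop) or Eyes on screen (mobile)
          emotions: string[];     // e.g. ["happiness", "surprise"]
        }

        // Returns how far down the engagement funnel the subject gets on this frame.
        function engagementLevel(s: FrameSignals): 0 | 1 | 2 | 3 | 4 {
          if (!s.presence) return 0;             // not present at all
          if (!s.faceDetected) return 1;         // present, face not visible
          if (!s.attentive) return 2;            // face visible, not attending
          if (s.emotions.length === 0) return 3; // attentive, no facial reaction
          return 4;                              // attentive with an emotional reaction
        }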

    Model performance

    You can see an overview of model performance in Table 1 below.

    Note

    Balanced accuracy is calculated as the average of two quantities:

    • What percentage of the actual positive cases (e.g. the subject paying attention) are labeled as positive by our model
    • What percentage of the actual negative cases (e.g. the subject not paying attention) are labeled as negative by our model

    The balanced accuracy metric is a commonly used model performance indicator that is unbiased with respect to class imbalance (i.e. how many attentive periods and non-attentive periods occur relative to each other). It is intended to provide a holistic view of model performance and good comparability across different classifiers and domains.
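    As a minimal worked example, balanced accuracy is the mean of the true positive rate and the true negative rate of a binary classifier. The TypeScript sketch below shows the generic calculation; it is not code from our SDKs.

        // Balanced accuracy = (true positive rate + true negative rate) / 2,
        // computed from the counts of a binary confusion matrix.
        function balancedAccuracy(tp: number, fn: number, tn: number, fp: number): number {
          const truePositiveRate = tp / (tp + fn); // actual positives labeled positive
          const trueNegativeRate = tn / (tn + fp); // actual negatives labeled negative
          return (truePositiveRate + trueNegativeRate) / 2;
        }

        // Example: 80 of 100 attentive frames and 60 of 100 non-attentive frames
        // are classified correctly -> balanced accuracy of (0.8 + 0.6) / 2 = 0.7.
        balancedAccuracy(80, 20, 60, 40); // 0.7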

    Model Name | Description | Balanced Accuracy | Available in Android SDK | Available in JavaScript SDK
    --- | --- | --- | --- | ---
    Presence | Subject is present in front of the device camera with at least one body part visible. | 0.70 | Yes | No
    Face detection | Subject's face is detectable within the camera feed. | N/A | Yes | No
    Attention | Subject is looking at the device screen; optimized for desktop. | 0.64 | Yes | Yes
    Eyes on screen | Subject is looking at the device screen; optimized for mobile. | 0.84 | Yes | No
    Happiness | Subject is smiling. | 0.84 | Yes | Yes
    Surprise | Subject appears surprised. | 0.72 | Yes | Yes
    Confusion | Subject appears confused. | 0.75 | Yes | Yes
    Contempt | Subject appears contemptuous. | 0.78 | Yes | Yes
    Disgust | Subject appears disgusted. | 0.75 | Yes | Yes
    Empathy | Subject appears empathetic / sad. | 0.73 | Yes | Yes

    Table 1: Our current model offering

    Our SDKs

    We serve our models to clients by integrating them into Software Development Kits (SDKs). We currently offer an Android SDK for mobile-based attention and emotion inference and a JavaScript SDK for web-based attention and emotion inference.

    To read more about getting started with our SDKs, visit our Getting Started guide.
