Emotion Analysis Guide
This guide covers how to extract and interpret emotion and sentiment data from Valossa AI analysis results, including face-based emotions, speech sentiment, and voice emotion.
When Valossa AI reports "emotion", "mood", or "sentiment", these terms refer to apparent, external signs that can be described with emotion-related vocabulary. They must not be interpreted as indicating the internal emotional states of a person. AI-detected emotions reflect observable visual and auditory patterns, not psychological assessments.
Prerequisites
Your Valossa subscription must include emotion analytics. Face valence, named emotions, speech sentiment, and voice emotion are all included in Transcribe Pro Vision MAX, which is also available as a free trial with no sales call needed. For higher-volume or custom configurations, contact Valossa sales.
The Metadata Reader CLI tool is especially powerful for emotion data — it can generate sentiment visualizations and show per-second valence without writing any code:
# Per-second emotion data for all faces
python -m metareader list-detections-by-second --type "human.face" core_metadata.json
# Generate facial sentiment timeline chart (requires matplotlib)
python -m metareader plot --sentiment core_metadata.json
# Bar chart of detection frequencies
python -m metareader plot --barh core_metadata.json
Four Types of Emotion Data
| Type | Source | Data Location | Description |
|---|---|---|---|
| Face valence | Facial expressions | by_second for human.face | Positivity/negativity of facial expression (-1.0 to 1.0) |
| Named emotions | Facial expressions | by_second for human.face | Specific emotion labels (joy, sadness, anger, etc.) |
| Speech valence | Meaning of spoken words | audio.speech attributes | Positivity/negativity of speech content (-1.0 to 1.0) |
| Voice emotion | Voice prosodics (tone/pitch) | by_second for audio.voice_emotion | Valence and arousal from how the voice sounds |
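Before extracting anything, it can help to check which of these sources actually appear in a given result file. A minimal sketch using the same core_metadata.json file and detection type keys as the examples below; note that this only confirms the detection types are present, not that every emotion attribute is enabled in your plan:

import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

by_type = metadata["detection_groupings"]["by_detection_type"]

# Face detections can exist even if face emotion attributes are not enabled,
# so treat this as a first sanity check only.
print("Face detections:         ", "yes" if by_type.get("human.face") else "no")
print("Speech detections:       ", "yes" if by_type.get("audio.speech") else "no")
print("Voice emotion detections:", "yes" if by_type.get("audio.voice_emotion") else "no")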
Extracting Face Emotions
Face Valence Over Time
import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

face_ids = metadata["detection_groupings"]["by_detection_type"].get("human.face", [])
if not face_ids:
    print("No faces detected")
    exit()

# Track the most prominent face
main_face_id = face_ids[0]
valence_timeline = []
for second_idx, second_data in enumerate(metadata["detection_groupings"]["by_second"]):
    for item in second_data:
        if item["d"] == main_face_id:
            sen = item.get("a", {}).get("sen", {})
            if "val" in sen:
                valence_timeline.append({
                    "second": second_idx,
                    "valence": sen["val"]
                })

print(f"Face valence over time (Face ID: {main_face_id}):")
for entry in valence_timeline:
    # Draw a "+" bar for positive valence, a "-" bar for negative valence
    bar = "+" * int(max(0, entry["valence"]) * 20) or "-" * int(abs(min(0, entry["valence"])) * 20)
    print(f"  Second {entry['second']:4d}: {entry['valence']:+.2f} {bar}")
Named Emotions
emotion_counts = {}
for second_idx, second_data in enumerate(metadata["detection_groupings"]["by_second"]):
    for item in second_data:
        if item["d"] == main_face_id:
            emotions = item.get("a", {}).get("sen", {}).get("emo", [])
            for emo in emotions:
                name = emo["value"]
                emotion_counts[name] = emotion_counts.get(name, 0) + 1

print("\nEmotion frequency for main face:")
for emotion, count in sorted(emotion_counts.items(), key=lambda x: -x[1]):
    print(f"  {emotion}: {count} seconds")
Available Emotions
V2 (current, 13 emotions): joy, mild joy, sadness, serious expression, fear, tension/anxiousness, disgust, displeasure, anger, concentration/displeasure, surprise, startlement, neutral
V1 (legacy, 6 emotions): happiness, sadness, anger, disgust, surprise, neutral
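To locate a particular expression rather than count overall frequency, scan the per-second data for its label. A sketch that collects the seconds where a chosen V2 label (here "joy", purely as an example) is reported for the main face:

target_emotion = "joy"  # any label from the V2 list above
matching_seconds = []
for second_idx, second_data in enumerate(metadata["detection_groupings"]["by_second"]):
    for item in second_data:
        if item["d"] == main_face_id:
            emotions = item.get("a", {}).get("sen", {}).get("emo", [])
            if any(emo["value"] == target_emotion for emo in emotions):
                matching_seconds.append(second_idx)

print(f"\nSeconds showing '{target_emotion}': {matching_seconds}")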
Extracting Speech Sentiment
Speech valence reflects the emotional tone of the content of spoken words (currently English only):
speech_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech", [])

print("\nSpeech sentiment:")
for det_id in speech_ids:
    detection = metadata["detections"][det_id]
    text = detection["label"]
    valence = detection.get("a", {}).get("sen", {}).get("val")
    start = detection["occs"][0]["ss"] if detection.get("occs") else 0
    if valence is not None:
        sentiment = "positive" if valence > 0.1 else "negative" if valence < -0.1 else "neutral"
        display_text = text[:60] + "..." if len(text) > 60 else text
        print(f"  [{start:.1f}s] ({sentiment}, {valence:+.2f}) \"{display_text}\"")
Extracting Voice Emotion
Voice emotion captures apparent emotion from how the voice sounds (tone, pitch, rhythm), independent of what is being said:
voice_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.voice_emotion", [])
if voice_ids:
    voice_det_id = voice_ids[0]
    print("\nVoice emotion (valence and arousal):")
    for second_idx, second_data in enumerate(metadata["detection_groupings"]["by_second"]):
        for item in second_data:
            if item["d"] == voice_det_id and "a" in item:
                valence = item["a"].get("val", 0)
                arousal = item["a"].get("aro", 0)
                print(f"  Second {second_idx}: valence={valence:+.3f}, arousal={arousal:.3f}")
Voice emotion provides:
- Valence (-1.0 to 1.0): How positive or negative the voice sounds
- Arousal (0.0 to 1.0): How energetic or excited the voice sounds
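One common way to read the two values together is a simple valence/arousal quadrant: positive and energetic reads as excited, negative and low-energy as subdued, and so on. The labels and the 0.5 arousal threshold below are illustrative choices, not part of the Valossa metadata:

def describe_voice(valence, arousal, arousal_threshold=0.5):
    """Illustrative quadrant label for a single voice emotion sample."""
    if valence >= 0:
        return "excited / enthusiastic" if arousal >= arousal_threshold else "calm / content"
    return "agitated / tense" if arousal >= arousal_threshold else "subdued / flat"

print(describe_voice(0.4, 0.8))   # excited / enthusiastic
print(describe_voice(-0.3, 0.2))  # subdued / flat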
Combined Emotion Dashboard
Build a second-by-second emotion overview combining all sources:
def build_emotion_timeline(metadata, face_id, voice_det_id=None):
    """Build a combined per-second emotion timeline from face and voice data."""
    duration = int(metadata["media_info"]["technical"]["duration_s"])
    timeline = []
    for second in range(duration):
        entry = {"second": second, "face_valence": None, "face_emotion": None,
                 "voice_valence": None, "voice_arousal": None}
        if second < len(metadata["detection_groupings"]["by_second"]):
            for item in metadata["detection_groupings"]["by_second"][second]:
                if item["d"] == face_id:
                    sen = item.get("a", {}).get("sen", {})
                    entry["face_valence"] = sen.get("val")
                    emos = sen.get("emo", [])
                    if emos:
                        entry["face_emotion"] = emos[0]["value"]
                if voice_det_id and item["d"] == voice_det_id:
                    entry["voice_valence"] = item.get("a", {}).get("val")
                    entry["voice_arousal"] = item.get("a", {}).get("aro")
        timeline.append(entry)
    return timeline
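For example, assuming the main_face_id and voice_ids resolved in the earlier snippets, the timeline can be built and filtered to the seconds that carry any emotion data:

timeline = build_emotion_timeline(
    metadata,
    face_id=main_face_id,
    voice_det_id=voice_ids[0] if voice_ids else None,
)

for entry in timeline:
    if entry["face_valence"] is not None or entry["voice_valence"] is not None:
        print(entry)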
Related Resources
- Sentiment & Emotion Reference -- Metadata format details
- Faces & Identity -- Face detection basics
- Speech & Transcription -- Speech detection details
- Metadata Reader -- CLI tool for sentiment visualization and per-second emotion extraction