Sentiment and Emotion
Valossa AI detects sentiment and emotion from both visual (face) and audio (speech, voice) modalities. There are four distinct types of sentiment and emotion data.
When Valossa AI reports "emotion", "mood", or "sentiment", these terms refer to apparent, external signs that can be described with emotion-related vocabulary. They must not be interpreted as indicating the internal emotional states of a person. AI-detected emotions reflect observable patterns, not psychological assessments.
Overview of Sentiment Types
| Type | Source | Location in Metadata | Scope |
|---|---|---|---|
| Face valence | Facial expression analysis | by_second for human.face detections | Per face, per second |
| Named facial expressions | Facial expression analysis | by_second for human.face detections | Per face, per second |
| Speech valence | Meaning of spoken words | audio.speech detection attributes | Per speech segment |
| Voice emotion | Voice prosodics (tone, pitch) | by_second for audio.voice_emotion detection | Per second |
These features require face and speech emotion analytics to be activated for your subscription.
Face Valence
Valence describes the apparent emotional positivity or negativity of a face at a specific moment, ranging from -1.0 (most negative) to 1.0 (most positive), with 0.0 being neutral.
Face valence data is in the by_second structure for human.face detections:
{
  "d": "9",
  "o": ["51"],
  "a": {
    "sen": {
      "val": -0.82
    }
  }
}
| Field | Description |
|---|---|
| a.sen.val | Valence value (-1.0 to 1.0) |
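Reading this field from a by_second entry is a matter of walking the nested keys defensively, since not every entry carries sentiment data. A minimal sketch (the sample item mirrors the format above; the helper name is illustrative):

```python
def face_valence(item):
    """Return the a.sen.val valence of a by_second face item, or None if absent."""
    return item.get("a", {}).get("sen", {}).get("val")

# A by_second entry shaped like the example above.
item = {"d": "9", "o": ["51"], "a": {"sen": {"val": -0.82}}}
print(face_valence(item))  # -0.82
print(face_valence({"d": "9", "o": ["51"]}))  # None (no sentiment data this second)
```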
Named Facial Expressions
Multiple facial expressions can be recognized on a face at the same second, each with a confidence score.
V2 Expressions (Current)
Most subscriptions use V2 face expressions (apparent emotions) with this complete 13-label vocabulary:
- joy
- mild joy
- sadness
- serious expression
- fear
- tension/anxiousness
- disgust
- displeasure
- anger
- concentration/displeasure
- surprise
- startlement
- neutral
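When filtering or validating emotion identifiers, the vocabulary above can be captured as a constant set. A minimal sketch (the constant and function names are illustrative, not part of the API; the labels are copied verbatim from the list above):

```python
# The complete 13-label V2 vocabulary, verbatim from the list above.
V2_EXPRESSIONS = frozenset({
    "joy", "mild joy", "sadness", "serious expression", "fear",
    "tension/anxiousness", "disgust", "displeasure", "anger",
    "concentration/displeasure", "surprise", "startlement", "neutral",
})

def is_v2_expression(label):
    """Check whether an emotion identifier belongs to the V2 vocabulary."""
    return label in V2_EXPRESSIONS

print(is_v2_expression("startlement"))  # True
print(is_v2_expression("happiness"))   # False ("happiness" is a V1-only label)
```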
V1 Emotions (Legacy)
Some long-standing subscriptions (pre-December 2020) may use V1 with 6 named expressions:
- happiness
- sadness
- anger
- disgust
- surprise
- neutral
Data Format
Named facial expressions appear alongside valence in the sen structure:
{
  "d": "1",
  "o": ["1"],
  "a": {
    "sen": {
      "emo": [
        { "c": 0.772, "e": "disgust" }
      ],
      "val": -0.796
    }
  }
}
| Field | Description |
|---|---|
| a.sen.emo | Array of detected apparent emotions |
| a.sen.emo[].e | Emotion identifier string |
| a.sen.emo[].c | Confidence (0.0 to 1.0) |
The emo array may contain multiple expressions if more than one is detected simultaneously. Entries are ranked by confidence.
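Because entries are ranked by confidence, the first element of emo is already the dominant expression; taking the maximum by confidence gives the same result without relying on ordering. A minimal sketch (the helper name is illustrative):

```python
def top_expression(sen):
    """Return (label, confidence) for the dominant expression in a sen structure,
    or None when the emo array is absent or empty. Entries are documented as
    ranked by confidence, so sen["emo"][0] would also work; max() is used here
    for robustness against unordered input."""
    emo = sen.get("emo", [])
    if not emo:
        return None
    best = max(emo, key=lambda entry: entry["c"])
    return best["e"], best["c"]

# A sen structure with two simultaneous expressions (values illustrative).
sen = {"emo": [{"c": 0.772, "e": "disgust"}, {"c": 0.31, "e": "displeasure"}], "val": -0.796}
print(top_expression(sen))  # ('disgust', 0.772)
```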
Speech Valence
Speech valence is derived from the meaning of the spoken words, not from the sound of the voice, and is available for English only. It appears in the a.sen.val field of audio.speech detections:
{
  "t": "audio.speech",
  "label": "we profoundly believe that justice will win despite the looming challenges",
  "a": {
    "sen": {
      "val": 0.307
    }
  }
}
This indicates the text content has a mildly positive sentiment. The value range is -1.0 to 1.0.
When enabled for a job, this field is typically present on each audio.speech segment. It is a text-level sentiment score and should not be confused with audio.voice_emotion, which is based on vocal delivery rather than transcript meaning.
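Since the score is per segment, summarizing the overall tone of a video's transcript means aggregating across segments that carry the field. A minimal sketch, assuming a list of audio.speech detection dicts shaped like the example above (the helper name and sample values are illustrative):

```python
def mean_speech_valence(detections):
    """Average a.sen.val over audio.speech detections that carry a score.

    Segments without a sen.val field (e.g. non-English speech) are skipped.
    Returns None when no segment has a score.
    """
    vals = [
        d["a"]["sen"]["val"]
        for d in detections
        if d.get("t") == "audio.speech" and "val" in d.get("a", {}).get("sen", {})
    ]
    return sum(vals) / len(vals) if vals else None

segments = [
    {"t": "audio.speech", "label": "we profoundly believe that justice will win",
     "a": {"sen": {"val": 0.307}}},
    {"t": "audio.speech", "label": "the situation looks grim",
     "a": {"sen": {"val": -0.6}}},
]
print(mean_speech_valence(segments))  # -0.1465
```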
Voice Emotion
Voice emotion detects emotional states from voice prosodics (tone, pitch, rhythm) rather than from the content of the words. This is fundamentally different from speech valence.
Voice emotion data is in a single audio.voice_emotion detection, with per-second values in by_second:
{
  "d": "1034",
  "o": ["1688"],
  "a": {
    "val": -0.022,
    "aro": 0.655
  }
}
| Field | Description |
|---|---|
a.val | Voice emotional valence (-1.0 to 1.0) |
a.aro | Voice emotional arousal (0.0 to 1.0, where 1.0 is maximum arousal) |
The audio.voice_emotion detection has occurrences (occs) indicating the time segments where voice emotion was detected. The per-second val and aro values are computed for the mixed audio track, not separately per diarized speaker.
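A common way to interpret the per-second pair is to bucket it into a rough descriptive label, as in the "distressed tone" example below. A minimal sketch; the thresholds and labels are illustrative choices, not part of the Valossa metadata:

```python
def describe_voice_emotion(val, aro, aro_threshold=0.5):
    """Map a (valence, arousal) pair to a rough descriptive label.

    val: voice emotional valence, -1.0 to 1.0.
    aro: voice emotional arousal, 0.0 to 1.0.
    The 0.5 arousal threshold is an arbitrary illustrative cutoff.
    """
    tone = "negative" if val < 0 else "positive" if val > 0 else "neutral"
    energy = "high-arousal" if aro >= aro_threshold else "low-arousal"
    return f"{energy} {tone}"

# The per-second values from the example above: near-neutral valence, high arousal.
print(describe_voice_emotion(-0.022, 0.655))  # high-arousal negative
```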
Comparison of Sentiment Types
| Type | What It Measures | Example |
|---|---|---|
| Face valence | How positive/negative a face looks | A frowning face: -0.7 |
| Named facial expressions | Discrete categories of apparent emotion read from facial expressions | "joy" with confidence 0.85 |
| Speech valence | Positive/negative meaning of spoken words | "I love this" = positive valence |
| Voice emotion | Emotional tone of the voice itself | High arousal, negative valence = distressed tone |
Code Example
Python: Extract Emotion Timeline for a Face
import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

# Find all face detection IDs
face_ids = metadata["detection_groupings"]["by_detection_type"].get("human.face", [])

if not face_ids:
    print("No faces detected")
else:
    target_face_id = face_ids[0]  # First (most prominent) face
    for second_idx, second_data in enumerate(metadata["detection_groupings"]["by_second"]):
        for item in second_data:
            if item["d"] == target_face_id and "sen" in item.get("a", {}):
                sen = item["a"]["sen"]
                valence = sen.get("val", "N/A")
                emotions = sen.get("emo", [])
                emotion_str = ", ".join(f"{e['e']}({e['c']:.2f})" for e in emotions)
                print(f"Second {second_idx}: valence={valence}, emotions=[{emotion_str}]")
Related Resources
- Emotion Analysis Guide -- Complete workflow for emotion analysis
- Faces & Identity -- Face detection details
- Speech & Transcription -- Speech-to-text details