Sentiment and Emotion

Valossa AI detects sentiment and emotion from both visual (face) and audio (speech, voice) modalities. There are four distinct types of sentiment and emotion data.

Important Caveat

When Valossa AI reports "emotion", "mood", or "sentiment", these terms refer to apparent, external signs that can be described with emotion-related vocabulary. They must not be interpreted as indicating the internal emotional states of a person. AI-detected emotions reflect observable patterns, not psychological assessments.

Overview of Sentiment Types

| Type | Source | Location in Metadata | Scope |
|---|---|---|---|
| Face valence | Facial expression analysis | `by_second` for `human.face` detections | Per face, per second |
| Named facial expressions | Facial expression analysis | `by_second` for `human.face` detections | Per face, per second |
| Speech valence | Meaning of spoken words | `audio.speech` detection attributes | Per speech segment |
| Voice emotion | Voice prosodics (tone, pitch) | `by_second` for `audio.voice_emotion` detection | Per second |
**Note:** These features require face and speech emotion analytics to be activated for your subscription.

Face Valence

Valence describes the emotional positivity or negativity of a person at a specific moment, ranging from -1.0 (most negative) to 1.0 (most positive), with 0.0 being neutral.

Face valence data is in the by_second structure for human.face detections:

```json
{
  "d": "9",
  "o": ["51"],
  "a": {
    "sen": {
      "val": -0.82
    }
  }
}
```

| Field | Description |
|---|---|
| `a.sen.val` | Valence value (-1.0 to 1.0) |
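Since valence is a continuous score, a common downstream step is bucketing it into coarse labels for display or filtering. Below is a minimal sketch; the ±0.1 neutral band is an illustrative threshold chosen for this example, not something defined by the Valossa API:

```python
def valence_label(val, neutral_band=0.1):
    """Map a valence value in [-1.0, 1.0] to a coarse label.

    The +/-0.1 neutral band is an illustrative choice, not part of the API.
    """
    if val > neutral_band:
        return "positive"
    if val < -neutral_band:
        return "negative"
    return "neutral"

print(valence_label(-0.82))  # -> "negative" (the sample value above)
```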

Named Facial Expressions

Multiple facial expressions can be recognized on a face at the same second, each with a confidence score.

V2 Expressions (Current)

Most subscriptions use V2 face expressions (apparent emotions) with this complete 13-label vocabulary:

  • joy
  • mild joy
  • sadness
  • serious expression
  • fear
  • tension/anxiousness
  • disgust
  • displeasure
  • anger
  • concentration/displeasure
  • surprise
  • startlement
  • neutral

V1 Emotions (Legacy)

Some long-standing subscriptions (pre-December 2020) may use V1 with 6 named expressions:

  • happiness
  • sadness
  • anger
  • disgust
  • surprise
  • neutral

Data Format

Named facial expressions appear alongside valence in the sen structure:

```json
{
  "d": "1",
  "o": ["1"],
  "a": {
    "sen": {
      "emo": [
        { "c": 0.772, "e": "disgust" }
      ],
      "val": -0.796
    }
  }
}
```

| Field | Description |
|---|---|
| `a.sen.emo` | Array of detected apparent emotions |
| `a.sen.emo[].e` | Emotion identifier string |
| `a.sen.emo[].c` | Confidence (0.0 to 1.0) |

The emo array may contain multiple expressions if more than one is detected simultaneously. Entries are ranked by confidence.
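When you only need one dominant expression per second, take the highest-confidence entry of `emo`. A small sketch operating on the `sen` dict shown above; entries are documented to be ranked by confidence already, but sorting defensively costs little:

```python
def top_expression(sen):
    """Return (label, confidence) for the strongest expression, or None.

    `sen` is the per-second "sen" dict from a human.face detection.
    """
    emo = sen.get("emo", [])
    if not emo:
        return None
    best = max(emo, key=lambda e: e["c"])
    return best["e"], best["c"]

sen = {"emo": [{"c": 0.772, "e": "disgust"}], "val": -0.796}
print(top_expression(sen))  # -> ('disgust', 0.772)
```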

Speech Valence

Speech valence is derived from the meaning of the spoken words, not from the sound of the voice. It is available for English only and appears in the `a.sen.val` field of `audio.speech` detections:

```json
{
  "t": "audio.speech",
  "label": "we profoundly believe that justice will win despite the looming challenges",
  "a": {
    "sen": {
      "val": 0.307
    }
  }
}
```

This indicates the text content has a mildly positive sentiment. The value range is -1.0 to 1.0.

When enabled for a job, this field is typically present on each audio.speech segment. It is a text-level sentiment score and should not be confused with audio.voice_emotion, which is based on vocal delivery rather than transcript meaning.
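One simple aggregate is the average text-level sentiment across all speech segments in a video. The sketch below assumes that detections are stored under `metadata["detections"]` keyed by detection ID, and that `audio.speech` IDs are listed in `by_detection_type`; the `detections` layout is an assumption of this example, not confirmed on this page:

```python
def mean_speech_valence(metadata):
    """Average the text-level sentiment over all speech segments.

    Assumes a metadata["detections"] dict keyed by detection ID
    (an assumption of this sketch), with audio.speech IDs listed
    under detection_groupings.by_detection_type.
    """
    ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech", [])
    vals = [
        metadata["detections"][det_id].get("a", {}).get("sen", {}).get("val")
        for det_id in ids
    ]
    vals = [v for v in vals if v is not None]
    return sum(vals) / len(vals) if vals else None
```

Segments without a `sen.val` field (e.g. non-English speech) are skipped rather than counted as zero.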

Voice Emotion

Voice emotion detects emotional states from voice prosodics (tone, pitch, rhythm) rather than from the content of the words. This is fundamentally different from speech valence.

Voice emotion data is in a single audio.voice_emotion detection, with per-second values in by_second:

```json
{
  "d": "1034",
  "o": ["1688"],
  "a": {
    "val": -0.022,
    "aro": 0.655
  }
}
```

| Field | Description |
|---|---|
| `a.val` | Voice emotional valence (-1.0 to 1.0) |
| `a.aro` | Voice emotional arousal (0.0 to 1.0, where 1.0 is maximum arousal) |

The audio.voice_emotion detection has occurrences (occs) indicating the time segments where voice emotion was detected. The per-second val and aro values are computed for the mixed audio track, not separately per diarized speaker.
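Combining the two axes makes the per-second values actionable: for instance, seconds with high arousal and negative valence suggest a tense or distressed vocal tone. A sketch over the `by_second` structure; the 0.6 arousal cutoff is an illustrative threshold, not defined by the API:

```python
def tense_seconds(by_second, voice_emotion_id, aro_min=0.6):
    """List second indices where the mixed audio track sounds tense:
    arousal at or above aro_min combined with negative valence.

    The 0.6 default cutoff is illustrative, not part of the API.
    """
    hits = []
    for idx, second in enumerate(by_second):
        for item in second:
            if item["d"] != voice_emotion_id:
                continue
            a = item.get("a", {})
            if a.get("aro", 0.0) >= aro_min and a.get("val", 0.0) < 0.0:
                hits.append(idx)
    return hits
```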

Comparison of Sentiment Types

| Type | What It Measures | Example |
|---|---|---|
| Face valence | How positive or negative a face looks | A frowning face: -0.7 |
| Named facial expressions | Specific categories of apparent facial expressions of emotion | "joy" with confidence 0.85 |
| Speech valence | Positive or negative meaning of the spoken words | "I love this" = positive valence |
| Voice emotion | Emotional tone of the voice itself | High arousal, negative valence = distressed tone |

Code Example

Python: Extract Emotion Timeline for a Face

```python
import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

# Find all face detection IDs
face_ids = metadata["detection_groupings"]["by_detection_type"].get("human.face", [])
if not face_ids:
    print("No faces detected")
else:
    target_face_id = face_ids[0]  # First (most prominent) face

    for second_idx, second_data in enumerate(metadata["detection_groupings"]["by_second"]):
        for item in second_data:
            if item["d"] == target_face_id and "sen" in item.get("a", {}):
                sen = item["a"]["sen"]
                valence = sen.get("val", "N/A")
                emotions = sen.get("emo", [])
                emotion_str = ", ".join(f"{e['e']}({e['c']:.2f})" for e in emotions)
                print(f"Second {second_idx}: valence={valence}, emotions=[{emotion_str}]")
```