Glossary

A

Ad Score A numeric score (0.0–1.0) indicating how well a specific IAB advertisement category is suited for ad placement near a given section of video content. Available in topic.iab.section detections. Higher scores indicate better ad suitability.

API Key A confidential string that authenticates your requests to the Valossa Core API. Found in Valossa Portal under My Account → Subscriptions and API Keys. Each subscription has its own key. Keep it secret and use HTTPS at all times.

Arousal In voice emotion analysis, arousal measures the intensity or energy level of an emotional state (0.0 to 1.0, where 1.0 is maximum arousal). Paired with valence to characterize voice emotion.

audio.context A detection type for non-speech sounds detected in the audio track (e.g., music, explosions, laughter, applause). Has confidence scores and occurrences.

audio.speech A detection type containing speech-to-text transcript segments, roughly sentence-length. Ordered chronologically in by_detection_type.

audio.speech_detailed A detection type containing individual words with precise timestamps, confidence scores, and speaker IDs for diarization. More granular than audio.speech.

audio.voice_emotion A detection type capturing emotional tone from voice prosody (pitch, rhythm, tempo), not from word meaning. Provides valence and arousal values per second.

B

by_detection_type A subfield of detection_groupings that groups detection IDs by their detection type (e.g., all human.face IDs, all visual.context IDs). Detections are sorted by relevance (most prominent first), except speech detections, which are ordered chronologically.
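
As a minimal sketch of how this grouping might be consumed, the snippet below resolves the IDs listed under one detection type to their labels. The sample `metadata` dictionary is hypothetical and heavily trimmed, and the helper `labels_for_type` is an illustrative name. The sketch assumes `by_detection_type` maps each type name to a list of detection IDs.

```python
# Hypothetical, heavily trimmed Core metadata; real files contain many more fields.
metadata = {
    "detections": {
        "1": {"t": "visual.context", "label": "car"},
        "2": {"t": "visual.context", "label": "road"},
        "3": {"t": "human.face", "label": "face"},
    },
    "detection_groupings": {
        # Assumed shape: detection type -> list of detection IDs.
        "by_detection_type": {
            "visual.context": ["1", "2"],
            "human.face": ["3"],
        }
    },
}


def labels_for_type(metadata, detection_type):
    """Return labels of all detections of one type, in grouping order."""
    groups = metadata["detection_groupings"]["by_detection_type"]
    detections = metadata["detections"]
    return [detections[det_id]["label"] for det_id in groups.get(detection_type, [])]


print(labels_for_type(metadata, "visual.context"))
```

Because the grouping holds only IDs, every lookup goes through the `detections` object, and a type that is absent from the file simply yields an empty list.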

by_second A subfield of detection_groupings containing an array where each element corresponds to one second of video. Used to answer "What is detected at time X?" Also contains per-second emotion and color data not available in occurrences.
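
The "What is detected at time X?" lookup could be sketched as follows. The sample data is hypothetical, and the key name `d` used to reference a detection ID inside each per-second entry is an assumption made for illustration; check the metadata reference for the actual entry layout.

```python
# Hypothetical sample: one list per second of video. Each entry is assumed
# to reference a detection through a key named "d" (an assumption here).
metadata = {
    "detections": {
        "1": {"t": "visual.context", "label": "car"},
        "7": {"t": "audio.context", "label": "music"},
    },
    "detection_groupings": {
        "by_second": [
            [{"d": "1"}],               # second 0
            [{"d": "1"}, {"d": "7"}],   # second 1
            [],                         # second 2: nothing detected
        ]
    },
}


def labels_at(metadata, second):
    """Return labels of detections active during the given whole second."""
    by_second = metadata["detection_groupings"]["by_second"]
    if not 0 <= second < len(by_second):
        return []  # outside the video's duration
    return [metadata["detections"][item["d"]]["label"] for item in by_second[second]]


print(labels_at(metadata, 1))
```

Indexing the array by the whole second keeps the lookup constant-time, which is the main reason to prefer this view over scanning every detection's occurrences.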

C

Callback (Webhook) An HTTP request sent by Valossa to a URL you specify when a job completes. An alternative to polling with job_status. Not enabled by default — contact Valossa to configure.

categ An optional field on a detection containing category tags (e.g., violence, food_drink, sport). Used to filter detections by broad topic or content compliance status.

cid Valossa Concept ID — a short alphanumeric string uniquely identifying a concept in the Valossa Concept Ontology. Present on visual.context and audio.context detections. Can be used to reliably search for a specific concept regardless of label language.

Confidence A float value (0.0–1.0) indicating how certain the AI is about a detection or attribute. The metadata only includes detections above 0.5 confidence. Expressed as c, c_max, or field-specific confidence values depending on context.

Content Compliance A category of detections flagged as potentially sensitive or inappropriate — including explicit content, violence, weapons, substance use, and offensive speech. Identified by the content_compliance tag in the categ field.
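
A rough sketch of filtering detections by category tag is shown below. The sample detections and their labels are hypothetical, and the nested `tags` list inside `categ` is an assumed shape; only the tag names themselves come from this glossary.

```python
# Hypothetical detections; the nested "tags" list inside "categ" is an assumed shape.
detections = {
    "4": {"t": "visual.context", "label": "handgun",
          "categ": {"tags": ["content_compliance", "violence"]}},
    "5": {"t": "visual.context", "label": "pizza",
          "categ": {"tags": ["food_drink"]}},
    "6": {"t": "visual.context", "label": "sky"},  # categ is optional
}


def with_tag(detections, tag):
    """Return IDs of detections whose categ tags include the given tag."""
    return [
        det_id
        for det_id, det in detections.items()
        if tag in det.get("categ", {}).get("tags", [])
    ]


print(with_tag(detections, "content_compliance"))
```

Since `categ` is optional, the lookup falls back to empty defaults rather than assuming the field is present on every detection.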

Core Metadata The primary output of Valossa video analysis, in JSON format. Contains all detections, detection groupings, media info, and segmentations. Version 1.8.1 as of February 2026. See Metadata Overview.

D

Detection A single identified concept in the video — e.g., a specific face, an object like "car", a sound like "guitar", or a speech segment. Stored in the detections object keyed by detection ID. Has at minimum a t (type) and label field.

Detection Groupings The detection_groupings section of the Core metadata JSON, containing four organized views of detections: by_detection_type, by_second, by_detection_property, and by_frequency.

Detection ID A string key (often numeric-looking, e.g., "42") that uniquely identifies a detection within the detections object. Used as a reference in detection_groupings. Must be treated as a string, not an integer.
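
The string-versus-integer distinction can be illustrated with a small sketch. The JSON fragment is hypothetical; the point is that JSON object keys are always parsed as strings, so an integer lookup does not match.

```python
import json

# Hypothetical fragment of a Core metadata file.
raw = '{"detections": {"42": {"t": "visual.context", "label": "car"}}}'
detections = json.loads(raw)["detections"]

print(detections["42"]["label"])   # string key: works
print(42 in detections)            # integer key: not found
```

Keeping IDs as strings end to end also avoids surprises if an ID ever turns out not to be numeric-looking.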

Detection Type A string identifier categorizing what kind of thing was detected (e.g., visual.context, human.face, audio.speech). The prefix indicates the modality: visual.*, audio.*, human.*, transcript.*, topic.*, external.*.
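
One way to use the modality prefix is to split the type string at its first dot. The following is a small sketch; the helper name `modality` and the sample list of types are illustrative.

```python
from collections import Counter


def modality(detection_type):
    """Return the modality prefix of a detection type, e.g. 'audio'."""
    return detection_type.split(".", 1)[0]


# Hypothetical list of detection types found in one metadata file.
types = ["visual.context", "human.face", "audio.speech", "audio.context", "topic.iab"]
counts = Counter(modality(t) for t in types)
print(counts)
```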

Diarization Speaker separation in speech analysis: attributing each transcript word to the speaker who said it. Available via speaker IDs in audio.speech_detailed detections.

E

ext_refs An optional field on a detection containing references to the detected concept in external ontologies (Wikidata, Google Knowledge Graph, IAB). Enables linking Valossa detections to external knowledge bases.

F

Face Gallery A collection of trained face identities used to recognize specific people in video analysis. Valossa provides a built-in celebrities gallery. You can also create custom galleries via the Face Training API.

frames_faces A specialized metadata type (separate JSON file) containing per-frame bounding box coordinates for detected faces. Download with type=frames_faces in the job_results call.

frames_objects A specialized metadata type containing per-frame bounding box coordinates for localized visual objects (e.g., logos). Download with type=frames_objects.

G

GKG Google Knowledge Graph — an external ontology used by Valossa to reference detected concepts. GKG IDs appear in ext_refs.gkg.id on visual.context detections.

H

human.face A detection type for a single detected face, including gender, similarity matches to face gallery identities, and emotion data. One human.face detection represents one face identity across the entire video.

human.face_group A detection type grouping faces that frequently appear together (suggesting meaningful co-presence, e.g., an interview pair). Does not represent individual face identities.

I

IAB Category A content topic classification from the Interactive Advertising Bureau (IAB) Content Taxonomy. Valossa supports IAB v2.1 (via topic.iab) and IAB v2.2 (via topic.iab.section). Over 510 categories covering the full IAB taxonomy.

J

Job A video analysis task in the Valossa system. Created with new_job, tracked with job_status, and results retrieved with job_results. Identified by a UUID job_id.

job_id A UUID string that uniquely identifies a video analysis job. Returned by new_job and used in all subsequent calls (job_status, job_results, cancel_job, delete_job).
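
As a hedged sketch, request URLs for the status and results calls might be assembled as below. The base URL, endpoint names, and the `type` parameter are taken from this glossary; the query parameter names `api_key` and `job_id` are assumptions, and the key and job ID are placeholders. No request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://api-eu.valossa.com/core/1.0/"


def endpoint_url(endpoint, api_key, job_id, **extra):
    """Build a request URL; the api_key/job_id parameter names are assumed."""
    params = {"api_key": api_key, "job_id": job_id, **extra}
    return f"{BASE_URL}{endpoint}?{urlencode(params)}"


# Placeholder credentials and job ID, for illustration only.
key = "YOUR_API_KEY"
job = "00000000-0000-0000-0000-000000000000"

print(endpoint_url("job_status", key, job))
print(endpoint_url("job_results", key, job, type="frames_faces"))
```

Using `urlencode` rather than string concatenation ensures that any special characters in the values are escaped correctly.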

M

Metadata Type The format variant of a Valossa result file: core (default), frames_faces, seconds_objects, frames_objects, or speech subtitle (speech_to_text_srt). Specified via the type parameter in job_results.

moccs (Merged Occurrences) An optional field on human.face detections in frames_faces metadata, containing time ranges where a face is visible while merged with other co-appearing faces — useful for detecting interaction scenes.

N

Novelty Word A keyword detected as noteworthy or distinguishing from speech or text content. An established NLP term for topic-relevant keywords, as opposed to proper names. Found in audio.keyword.novelty_word and transcript.keyword.novelty_word detections.

O

Occurrences (occs) An array field on a detection listing all time ranges where the detected concept appears in the video. Each occurrence has ss (start second), se (end second), shs/she (start and end shot indices), and optionally c_max. Used to answer "When does X appear?"
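
The "When does X appear?" question can be sketched with the `ss` and `se` fields alone. The detection below is hypothetical, and the helper names `total_duration` and `visible_at` are illustrative.

```python
# Hypothetical detection with two occurrences (times in seconds).
detection = {
    "t": "visual.context",
    "label": "car",
    "occs": [
        {"ss": 2.0, "se": 5.5, "c_max": 0.91},
        {"ss": 10.0, "se": 12.0, "c_max": 0.78},
    ],
}


def total_duration(detection):
    """Sum the length of all occurrences, in seconds."""
    return sum(occ["se"] - occ["ss"] for occ in detection.get("occs", []))


def visible_at(detection, t):
    """Return True if any occurrence covers time t (seconds)."""
    return any(occ["ss"] <= t <= occ["se"] for occ in detection.get("occs", []))


print(total_duration(detection))
print(visible_at(detection, 11.0), visible_at(detection, 7.0))
```

Both helpers use `.get("occs", [])` because some detection types, such as topic.iab, carry no occurrences at all.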

S

seconds_objects A specialized metadata type containing per-second bounding box coordinates for localized visual objects. Download with type=seconds_objects.

Segmentation / Shots Shot boundary detection — identifying where one camera shot ends and another begins. Available in the segmentations field of Core metadata.

similar_to An array attribute on human.face detections listing face gallery matches. Each item has name (person name) and c (confidence). Multiple matches may appear if a face resembles several gallery entries.
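
When several gallery matches are listed, a consumer will often want only the strongest one. A minimal sketch follows, assuming the array has already been read from a human.face detection; the names in the sample and the `min_confidence` parameter are hypothetical.

```python
# Hypothetical similar_to array taken from one human.face detection.
similar_to = [
    {"name": "Person A", "c": 0.62},
    {"name": "Person B", "c": 0.88},
]


def best_match(similar_to, min_confidence=0.0):
    """Return the highest-confidence gallery match, or None if none qualifies."""
    candidates = [m for m in similar_to if m["c"] >= min_confidence]
    return max(candidates, key=lambda m: m["c"], default=None)


print(best_match(similar_to))
print(best_match(similar_to, min_confidence=0.95))
```

Raising `min_confidence` trades recall for precision: fewer faces get a name, but the names that remain are more trustworthy.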

similar_to_face_id A UUID identifying a specific trained face in a face gallery. Used in by_detection_property to group all human.face detections that matched the same gallery identity.

SRT SubRip Text format — a standard subtitle file format with timestamps and text. Valossa speech-to-text results are downloadable as SRT (and VTT). Uses Unix line endings (LF only).

Subscription A Valossa service plan that activates a specific combination of AI features (detection types) for your API key. Examples: Contextual Video Metadata, Video Moderation Metadata, Face Analysis with Emotions, Automatic Captions.

T

topic.iab A detection type representing the IAB content category of the entire video (not time-specific). No occurrences. Sorted by relevance.

topic.iab.section A detection type representing IAB content categories for time-based sections within the video. Has occurrences with start/end times. Includes Ad Score values.

V

Valossa Core API The primary REST API for submitting video analysis jobs and retrieving results. Base URL: https://api-eu.valossa.com/core/1.0/. Six endpoints: new_job, job_status, job_results, list_jobs, cancel_job, delete_job.

Valossa Core Metadata See Core Metadata.

Valossa Face Training API A REST API for managing custom face galleries and uploading training images. Part of the Valossa API product. Available with face-enabled subscriptions.

valossaupload:// A special URL scheme generated after a successful file upload via the Valossa upload API. Reference this URL in the new_job request instead of an HTTP URL when uploading files directly.

Valence A measure of emotional positivity/negativity ranging from -1.0 (most negative) to 1.0 (most positive). Available for face sentiment (human.face in by_second), speech sentiment (audio.speech detections), and voice emotion (audio.voice_emotion).

visual.context The primary visual detection type, covering objects, scenes, activities, animals, and other visual concepts. Includes confidence scores, occurrences, Valossa concept IDs, and external ontology references.

visual.object.localized A detection type for visual objects with spatial bounding box information (currently logo detection). Spatial coordinates are in the separate seconds_objects or frames_objects metadata files.

visual.text_region.* Detection types for OCR (Optical Character Recognition) — detecting text and numbers visible in video frames. Variants: full_frame_analysis, lower_third, middle_third, upper_third, keyword.compliance.

VTT WebVTT format — a web-standard subtitle/caption format. Valossa speech-to-text results are available in VTT format from Valossa Portal.

W

Wikidata An open knowledge base used by Valossa to reference detected concepts. Wikidata IDs appear in ext_refs.wikidata.id on visual.context detections (e.g., Q146 for "cat").
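
A language-independent search by Wikidata ID might look like the sketch below. The sample detections are hypothetical; the path `ext_refs.wikidata.id` and the ID `Q146` are the ones mentioned above.

```python
# Hypothetical detections; only the fields needed for the lookup are shown.
detections = {
    "11": {"t": "visual.context", "label": "cat",
           "ext_refs": {"wikidata": {"id": "Q146"}}},
    "12": {"t": "visual.context", "label": "sofa"},  # ext_refs is optional
}


def find_by_wikidata(detections, wikidata_id):
    """Return IDs of detections that reference the given Wikidata entity."""
    return [
        det_id
        for det_id, det in detections.items()
        if det.get("ext_refs", {}).get("wikidata", {}).get("id") == wikidata_id
    ]


print(find_by_wikidata(detections, "Q146"))
```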
