Core Concepts
This page explains the key concepts in Valossa AI's data model. Understanding these concepts is essential for working with the API and interpreting analysis results.
Jobs
A Job represents a single video analysis request. Every video you submit for analysis creates a job with a unique job_id (a UUID string).
Job Lifecycle
- Submitted -- You send a new_job request with a video URL and your API key.
- Downloading -- The system downloads your video file from the provided URL. (You can also upload a file for analysis directly in the Valossa Portal.)
- Processing -- The AI models analyze the video's visual, audio, and speech content.
- Finished -- Analysis is complete. Results are available for download.
- Error -- Something went wrong (invalid URL, unsupported format, etc.).
You can monitor a job's progress with the job_status endpoint and retrieve results with job_results once the status is finished.
Jobs can also be cancelled (if still in the waiting queue) with cancel_job or deleted entirely with delete_job.
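If it helps to see the lifecycle in code, here is a minimal polling sketch in Python using the requests library. The endpoint names (new_job, job_status, job_results) are the ones described above, but the base URL, request body, and response field names are illustrative assumptions; consult the API reference for the exact calls.

```python
import time
import requests

# NOTE: the base URL and request/response field names below are
# illustrative assumptions; only the endpoint names come from this page.
API_BASE = "https://api.valossa.com/core/1.0"  # assumed base URL
API_KEY = "your-api-key"

# Submit a video for analysis (Submitted -> Downloading -> Processing).
new_job = requests.post(f"{API_BASE}/new_job", json={
    "api_key": API_KEY,
    "media": {"video": {"url": "https://example.com/video.mp4"}},  # assumed request shape
}).json()
job_id = new_job["job_id"]

# Poll job_status until the job reaches a terminal state.
while True:
    status = requests.get(f"{API_BASE}/job_status",
                          params={"api_key": API_KEY, "job_id": job_id}).json()
    if status["status"] in ("finished", "error"):
        break
    time.sleep(60)  # analysis takes time; poll sparingly

# Download the Core metadata once the status is "finished".
if status["status"] == "finished":
    core_metadata = requests.get(f"{API_BASE}/job_results",
                                 params={"api_key": API_KEY, "job_id": job_id}).json()
```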
Detections
A Detection is a single recognized concept found in the video. Detections represent things like:
- A visual object (e.g., "car", "sunglasses")
- A face (with optional identity matching)
- An audio event (e.g., "guitar", "applause")
- A speech transcript segment
- A topic or category (e.g., an IAB advertising category)
- A keyword extracted from speech
Each detection has a unique detection ID (a string) within its metadata file and contains at minimum:
- t -- The detection type identifier (e.g., visual.context, human.face)
- label -- A human-readable label for the detected concept
Optional fields include occs (occurrences), a (attributes), cid (concept identifier), ext_refs (external references), and categ (category tags).
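As a quick illustration, the sketch below walks through a downloaded Core metadata file and prints each detection's mandatory and optional fields. It assumes the detections live in a top-level detections object keyed by detection ID (see JSON Structure for the authoritative layout); the file name is just an example.

```python
import json

with open("core_metadata.json") as f:  # example file name
    metadata = json.load(f)

for det_id, det in metadata["detections"].items():
    # Mandatory fields: detection type and human-readable label.
    print(det_id, det["t"], det["label"])

    # Optional fields appear only when relevant for the detection type.
    for occ in det.get("occs", []):
        print("  occurrence:", occ["ss"], "->", occ["se"])
    if "cid" in det:
        print("  concept id:", det["cid"])
```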
Detection Types
Detection types are organized in a hierarchical naming scheme. The prefix indicates the source modality:
| Prefix | Source | Examples |
|---|---|---|
| visual.* | Video frames (visual content) | visual.context, visual.color, visual.object.localized |
| audio.* | Audio track | audio.context, audio.speech, audio.voice_emotion |
| human.* | Human-centered detections | human.face, human.face_group |
| transcript.* | User-provided transcript | transcript.keyword.novelty_word, transcript.keyword.name.person |
| topic.* | Video-level or section-level topics | topic.iab, topic.iab.section, topic.general |
| external.* | User-provided title/description | external.keyword.novelty_word |
For the full list of all 36+ detection types, see Detection Types.
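Because the type names are hierarchical, filtering detections by modality is a simple prefix check. A small sketch, reusing the metadata dict from the previous example:

```python
# Collect all detections whose type starts with a given modality prefix.
def detections_by_prefix(metadata, prefix):
    return {
        det_id: det
        for det_id, det in metadata["detections"].items()
        if det["t"].startswith(prefix)
    }

visual = detections_by_prefix(metadata, "visual.")
audio = detections_by_prefix(metadata, "audio.")
```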
Occurrences
An Occurrence represents a continuous time segment during which a detection is visible or audible in the video. A single detection can have multiple occurrences.
For example, if a "car" is detected from 0:05 to 0:12 and again from 1:30 to 1:45, the visual.context detection for "car" will have two occurrence objects in its occs array.
Each occurrence contains:
| Field | Type | Description |
|---|---|---|
| id | string | Unique occurrence ID within this metadata |
| ss | float | Start second (seconds from video start) |
| se | float | End second |
| c_max | float | Maximum detection confidence during this occurrence (for visual.context and audio.context) |
| c | float | Confidence for the entire occurrence (for topic.iab.section) |
| shs | integer | Shot index where the occurrence starts |
| she | integer | Shot index where the occurrence ends |
Not all detection types have occurrences. Video-level detections like topic.iab, topic.general, and external.keyword.* do not have time-bound occurrences.
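The sketch below, continuing from the metadata dict above, collects the time segments of one detection. The detection ID "42" is hypothetical, and the confidence falls back from c_max to c because different detection types use different fields, as noted in the table.

```python
def occurrence_segments(detection):
    """Return (start, end, confidence) tuples for a detection's occurrences."""
    segments = []
    for occ in detection.get("occs", []):  # video-level detections have no occs
        segments.append((occ["ss"], occ["se"], occ.get("c_max", occ.get("c"))))
    return segments

car = metadata["detections"]["42"]  # "42" is a made-up detection ID
for start, end, conf in occurrence_segments(car):
    print(f"{start:.1f}s - {end:.1f}s (confidence {conf})")
```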
Detection Groupings
The detection_groupings section organizes detections for efficient querying. It contains four sub-structures:
by_detection_type
Groups all detection IDs by their type. Within each type, detections are sorted by relevance (most prominent first).
"by_detection_type": {
"visual.context": ["42", "86", "15", ...],
"human.face": ["64", "65", ...],
"audio.context": ["12", "8", ...]
}
Use this to answer: "What visual objects were detected?" or "Who appeared in the video?"
by_second
An array where each element represents one second of the video. Each second contains an array of detection references with per-second confidence values.
"by_second": [
[{"d": "42", "c": 0.87, "o": ["267"]}, {"d": "64", "o": ["114"]}],
[{"d": "42", "c": 0.91, "o": ["267"]}, {"d": "12", "o": ["97"] "c": 0.65}],
...
]
Use this to answer: "What is happening at second 45 of the video?"
The d field references a detection ID, c provides the confidence at that specific second, and o lists the relevant occurrence IDs.
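To answer the second-45 question above, index into by_second and resolve each reference (a sketch, continuing the earlier examples):

```python
second = 45  # any second within the video's duration
for ref in metadata["detection_groupings"]["by_second"][second]:
    det = metadata["detections"][ref["d"]]
    print(det["t"], det["label"], ref.get("c"))  # c may be absent for some types
```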
by_detection_property
Groups detections by shared properties. Currently used for human.face detections grouped by similar_to_face_id, which merges occurrences of the same recognized person across multiple face detections.
by_frequency
Summarizes simultaneously appearing visual concepts into thematic groups.
Metadata Types
Valossa produces several types of metadata, each downloaded separately:
| Type | Content | Download Parameter |
|---|---|---|
| core | All detections, groupings, segmentations | Default (no parameter needed) |
| frames_faces | Per-frame face bounding boxes | type=frames_faces |
| seconds_objects | Per-second object bounding boxes | type=seconds_objects |
| frames_objects | Per-frame object bounding boxes | type=frames_objects |
| speech_to_text_srt | Speech transcript in SRT format | type=speech_to_text_srt |
The Core metadata is always available. Other types require that the corresponding AI features were enabled for your subscription.
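As a rough sketch of how the download parameter is used, reusing the API_BASE, API_KEY, and job_id from the earlier job example (the exact query format may differ, so check the API reference):

```python
# Request the speech transcript in SRT format instead of the Core JSON.
resp = requests.get(f"{API_BASE}/job_results", params={
    "api_key": API_KEY,
    "job_id": job_id,
    "type": "speech_to_text_srt",
})
with open("transcript.srt", "wb") as f:
    f.write(resp.content)
```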
Putting It Together
Here is how these concepts relate in a typical workflow:
- You create a Job by submitting a video.
- The AI produces Detections of various Detection Types (visual objects, faces, speech, etc.).
- Each detection may have Occurrences indicating when it appears in the video.
- Detection Groupings organize detections for efficient access -- by type, by time, or by property.
- You download the Metadata (Core, frames_faces, etc.) and parse the JSON in your application.
For the detailed JSON structure, see JSON Structure.