Core Concepts
This page explains the key concepts in Valossa AI's data model. Understanding these concepts is essential for working with the API and interpreting analysis results.
Jobs
A Job represents a single video analysis request. Every video you submit for analysis creates a job with a unique job_id (a UUID string).
Job Lifecycle
- Submitted -- You send a new_job request with a video URL and your API key.
- Downloading -- The system downloads your video file from the provided URL. (You can also upload a file for analysis directly in the Valossa Portal.)
- Processing -- The AI models analyze the video's visual, audio, and speech content.
- Finished -- Analysis is complete. Results are available for download.
- Error -- Something went wrong (invalid URL, unsupported format, etc.).
You can monitor a job's progress with the job_status endpoint and retrieve results with job_results once the status is finished.
Jobs can also be cancelled (if still in the waiting queue) with cancel_job or deleted entirely with delete_job.
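If it helps to see the lifecycle in code, here is a minimal polling sketch in Python using the requests library. The endpoint names (new_job, job_status, job_results) are the ones described above, but the base URL, request body, and response field names are illustrative assumptions; consult the API reference for the exact calls.

```python
import time
import requests

# NOTE: the base URL and request/response field names below are
# illustrative assumptions; only the endpoint names come from this page.
API_BASE = "https://api.valossa.com/core/1.0"  # assumed base URL
API_KEY = "your-api-key"

# Submit a video for analysis (Submitted -> Downloading -> Processing).
new_job = requests.post(f"{API_BASE}/new_job", json={
    "api_key": API_KEY,
    "media": {"video": {"url": "https://example.com/video.mp4"}},  # assumed request shape
}).json()
job_id = new_job["job_id"]

# Poll job_status until the job reaches a terminal state.
while True:
    status = requests.get(f"{API_BASE}/job_status",
                          params={"api_key": API_KEY, "job_id": job_id}).json()
    if status["status"] in ("finished", "error"):
        break
    time.sleep(60)  # analysis takes time; poll sparingly

# Download the Core metadata once the status is "finished".
if status["status"] == "finished":
    core_metadata = requests.get(f"{API_BASE}/job_results",
                                 params={"api_key": API_KEY, "job_id": job_id}).json()
```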
Detections
A Detection is a single recognized concept found in the video. Detections represent things like:
- A visual object (e.g., "car", "sunglasses")
- A face (with optional identity matching)
- An audio event (e.g., "guitar", "applause")
- A speech transcript segment
- A topic or category (e.g., an IAB advertising category)
- A keyword extracted from speech
Each detection has a unique detection ID (a string) within its metadata file and contains at minimum:
- t -- The detection type identifier (e.g., visual.context, human.face)
- label -- A human-readable label for the detected concept
Optional fields include occs (occurrences), a (attributes), cid (concept identifier), ext_refs (external references), and categ (category tags).
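As a quick illustration, the sketch below walks through a downloaded Core metadata file and prints each detection's mandatory and optional fields. It assumes the detections live in a top-level detections object keyed by detection ID (see JSON Structure for the authoritative layout); the file name is just an example.

```python
import json

with open("core_metadata.json") as f:  # example file name
    metadata = json.load(f)

for det_id, det in metadata["detections"].items():
    # Mandatory fields: detection type and human-readable label.
    print(det_id, det["t"], det["label"])

    # Optional fields appear only when relevant for the detection type.
    for occ in det.get("occs", []):
        print("  occurrence:", occ["ss"], "->", occ["se"])
    if "cid" in det:
        print("  concept id:", det["cid"])
```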
Detection Types
Detection types are organized in a hierarchical naming scheme. The prefix indicates the source modality:
| Prefix | Source | Examples |
|---|---|---|
| visual.* | Video frames (visual content) | visual.context, visual.color, visual.object.localized |
| audio.* | Audio track | audio.context, audio.speech, audio.voice_emotion |
| human.* | Human-centered detections | human.face, human.face_group |
| transcript.* | User-provided transcript | transcript.keyword.novelty_word, transcript.keyword.name.person |
| topic.* | Video-level or section-level topics | topic.iab, topic.iab.section, topic.general |
| external.* | User-provided title/description | external.keyword.novelty_word |
For the full list of all 36+ detection types, see Detection Types.
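Because the type names are hierarchical, filtering detections by modality is a simple prefix check. A small sketch, reusing the metadata dict from the previous example:

```python
# Collect all detections whose type starts with a given modality prefix.
def detections_by_prefix(metadata, prefix):
    return {
        det_id: det
        for det_id, det in metadata["detections"].items()
        if det["t"].startswith(prefix)
    }

visual = detections_by_prefix(metadata, "visual.")
audio = detections_by_prefix(metadata, "audio.")
```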
Occurrences
An Occurrence represents a continuous time segment during which a detection is visible or audible in the video. A single detection can have multiple occurrences.
For example, if a "car" is detected from 0:05 to 0:12 and again from 1:30 to 1:45, the visual.context detection for "car" will have two occurrence objects in its occs array.
Each occurrence contains:
| Field | Type | Description |
|---|---|---|
| id | string | Unique occurrence ID within this metadata |
| ss | float | Start second (seconds from video start) |
| se | float | End second |
| c_max | float | Maximum detection confidence during this occurrence (for visual.context and audio.context) |
| c | float | Confidence for the entire occurrence (for topic.iab.section) |
| shs | integer | Shot index where the occurrence starts |
| she | integer | Shot index where the occurrence ends |
Not all detection types have occurrences. Video-level detections like topic.iab, topic.general, and external.keyword.* do not have time-bound occurrences.
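The sketch below, continuing from the metadata dict above, collects the time segments of one detection. The detection ID "42" is hypothetical, and the confidence falls back from c_max to c because different detection types use different fields, as noted in the table.

```python
def occurrence_segments(detection):
    """Return (start, end, confidence) tuples for a detection's occurrences."""
    segments = []
    for occ in detection.get("occs", []):  # video-level detections have no occs
        segments.append((occ["ss"], occ["se"], occ.get("c_max", occ.get("c"))))
    return segments

car = metadata["detections"]["42"]  # "42" is a made-up detection ID
for start, end, conf in occurrence_segments(car):
    print(f"{start:.1f}s - {end:.1f}s (confidence {conf})")
```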
Detection Groupings
The detection_groupings section organizes detections for efficient querying. It contains four sub-structures:
by_detection_type
Groups all detection IDs by their type. Within each type, detections are sorted by relevance (most prominent first).
"by_detection_type": {
"visual.context": ["42", "86", "15", ...],
"human.face": ["64", "65", ...],
"audio.context": ["12", "8", ...]
}
Use this to answer: "What visual objects were detected?" or "Who appeared in the video?"
by_second
An array where each element represents one second of the video. Each second contains an array of detection references with per-second confidence values.
"by_second": [
[{"d": "42", "c": 0.87, "o": ["267"]}, {"d": "64", "o": ["114"]}],
[{"d": "42", "c": 0.91, "o": ["267"]}, {"d": "12", "o": ["97"] "c": 0.65}],
...
]
Use this to answer: "What is happening at second 45 of the video?"
The d field references a detection ID, c provides the confidence at that specific second, and o lists the relevant occurrence IDs.
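To answer the second-45 question above, index into by_second and resolve each reference (a sketch, continuing the earlier examples):

```python
second = 45  # any second within the video's duration
for ref in metadata["detection_groupings"]["by_second"][second]:
    det = metadata["detections"][ref["d"]]
    print(det["t"], det["label"], ref.get("c"))  # c may be absent for some types
```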
by_detection_property
Groups detections by shared properties. Currently used for human.face detections grouped by similar_to_face_id, which merges occurrences of the same recognized person across multiple face detections.
by_frequency
Summarizes simultaneously appearing visual concepts into thematic groups.
Metadata Types
Valossa produces several types of metadata, each downloaded separately:
| Type | Content | Download Parameter |
|---|---|---|
| core | All detections, groupings, segmentations | Default (no parameter needed) |
| frames_faces | Per-frame face bounding boxes | type=frames_faces |
| seconds_objects | Per-second object bounding boxes | type=seconds_objects |
| frames_objects | Per-frame object bounding boxes | type=frames_objects |
| speech_to_text_srt | Speech transcript in SRT format | type=speech_to_text_srt |
The Core metadata is always available. Other types require that the corresponding AI features were enabled for your subscription.
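As a rough sketch of how the download parameter is used, reusing the API_BASE, API_KEY, and job_id from the earlier job example (the exact query format may differ, so check the API reference):

```python
# Request the speech transcript in SRT format instead of the Core JSON.
resp = requests.get(f"{API_BASE}/job_results", params={
    "api_key": API_KEY,
    "job_id": job_id,
    "type": "speech_to_text_srt",
})
with open("transcript.srt", "wb") as f:
    f.write(resp.content)
```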
Putting It Together
Here is how these concepts relate in a typical workflow:
- You create a Job by submitting a video.
- The AI produces Detections of various Detection Types (visual objects, faces, speech, etc.).
- Each detection may have Occurrences indicating when it appears in the video.
- Detection Groupings organize detections for efficient access -- by type, by time, or by property.
- You download the Metadata (Core, frames_faces, etc.) and parse the JSON in your application.
For the detailed JSON structure, see JSON Structure.