Core Concepts

This page explains the key concepts in Valossa AI's data model. Understanding these concepts is essential for working with the API and interpreting analysis results.

Jobs

A Job represents a single video analysis request. Every video you submit for analysis creates a job with a unique job_id (a UUID string).

Job Lifecycle

  1. Submitted -- You send a new_job request with a video URL and your API key.
  2. Downloading -- The system downloads your video file from the provided URL. (You can also upload a file for analysis directly in the Valossa Portal.)
  3. Processing -- The AI models analyze the video's visual, audio, and speech content.
  4. Finished -- Analysis is complete. Results are available for download.
  5. Error -- Something went wrong (invalid URL, unsupported format, etc.).

You can monitor a job's progress with the job_status endpoint and retrieve results with job_results once the status is finished.

Jobs can also be cancelled (if still in the waiting queue) with cancel_job or deleted entirely with delete_job.
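Below is a minimal sketch of this submit-poll-download loop in Python. The base URL, query parameter names (api_key, video_url), and response field names are assumptions made for illustration; check the API reference for the exact request format.

# Minimal job-submission and polling sketch. The base URL, query parameter
# names, and response field names below are assumptions for illustration.
import time
import requests

API_BASE = "https://api.example.com/valossa"   # hypothetical base URL
API_KEY = "your-api-key"

# 1. Submit a new job (assumed endpoint and parameter names).
resp = requests.get(f"{API_BASE}/new_job",
                    params={"api_key": API_KEY,
                            "video_url": "https://example.com/video.mp4"})
job_id = resp.json()["job_id"]

# 2. Poll job_status until the job is finished or fails.
while True:
    status = requests.get(f"{API_BASE}/job_status",
                          params={"api_key": API_KEY, "job_id": job_id}).json()
    if status["status"] in ("finished", "error"):
        break
    time.sleep(60)  # polling interval is a design choice, not an API requirement

# 3. Download the Core metadata once the status is "finished".
if status["status"] == "finished":
    metadata = requests.get(f"{API_BASE}/job_results",
                            params={"api_key": API_KEY, "job_id": job_id}).json()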

Detections

A Detection is a single recognized concept found in the video. Detections represent things like:

  • A visual object (e.g., "car", "sunglasses")
  • A face (with optional identity matching)
  • An audio event (e.g., "guitar", "applause")
  • A speech transcript segment
  • A topic or category (e.g., an IAB advertising category)
  • A keyword extracted from speech

Each detection is identified by a detection ID (a string) that is unique within its metadata file, and contains at minimum:

  • t -- The detection type identifier (e.g., visual.context, human.face)
  • label -- A human-readable label for the detected concept

Optional fields include occs (occurrences), a (attributes), cid (concept identifier), ext_refs (external references), and categ (category tags).
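The snippet below shows how these fields might be read in Python. It assumes the detections section of the Core metadata is a map from detection ID to detection object, as suggested by the ID references used in detection_groupings; see the JSON Structure page for the authoritative layout.

# Iterating over detections in the Core metadata. Assumes "detections" is a
# map from detection ID to detection object (an assumption for illustration).
import json

with open("core_metadata.json") as f:
    metadata = json.load(f)

for det_id, det in metadata["detections"].items():
    det_type = det["t"]                # detection type identifier, e.g. "visual.context"
    label = det["label"]               # human-readable label
    occurrences = det.get("occs", [])  # optional: time-bound occurrences
    concept_id = det.get("cid")        # optional: concept identifier
    print(f"{det_id}: {label} ({det_type}), {len(occurrences)} occurrence(s), cid={concept_id}")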

Detection Types

Detection types are organized in a hierarchical naming scheme. The prefix indicates the source modality:

Prefix        Source                               Examples
visual.*      Video frames (visual content)        visual.context, visual.color, visual.object.localized
audio.*       Audio track                          audio.context, audio.speech, audio.voice_emotion
human.*       Human-centered detections            human.face, human.face_group
transcript.*  User-provided transcript             transcript.keyword.novelty_word, transcript.keyword.name.person
topic.*       Video-level or section-level topics  topic.iab, topic.iab.section, topic.general
external.*    User-provided title/description      external.keyword.novelty_word

For the full list of all 36+ detection types, see Detection Types.
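Because of this naming scheme, the source modality can be read directly from the first component of the type string. A small helper, sketched here as an illustration of the convention:

def modality(detection_type: str) -> str:
    """Return the prefix of a detection type, e.g. 'visual' for 'visual.context'."""
    return detection_type.split(".", 1)[0]

modality("visual.object.localized")   # -> "visual"
modality("topic.iab.section")         # -> "topic"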

Occurrences

An Occurrence represents a continuous time segment during which a detection is visible or audible in the video. A single detection can have multiple occurrences.

For example, if a "car" is detected from 0:05 to 0:12 and again from 1:30 to 1:45, the visual.context detection for "car" will have two occurrence objects in its occs array.

Each occurrence contains:

Field  Type     Description
id     string   Unique occurrence ID within this metadata
ss     float    Start second (seconds from video start)
se     float    End second
c_max  float    Maximum detection confidence during this occurrence (for visual.context and audio.context)
c      float    Confidence for the entire occurrence (for topic.iab.section)
shs    integer  Shot index where the occurrence starts
she    integer  Shot index where the occurrence ends

Not all detection types have occurrences. Video-level detections like topic.iab, topic.general, and external.keyword.* do not have time-bound occurrences.
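As a sketch, the time segments of a single detection can be printed like this. It assumes a detection object parsed from the Core metadata, as in the earlier example; the field names follow the table above.

# Print the time segments of one detection. Video-level detections have no
# "occs" array, so the loop simply does nothing for them.
def print_occurrences(detection: dict) -> None:
    label = detection["label"]
    for occ in detection.get("occs", []):
        start, end = occ["ss"], occ["se"]
        confidence = occ.get("c_max", occ.get("c"))  # whichever the type provides
        print(f"{label}: {start:.1f}s - {end:.1f}s (confidence {confidence})")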

Detection Groupings

The detection_groupings section organizes detections for efficient querying. It contains four sub-structures:

by_detection_type

Groups all detection IDs by their type. Within each type, detections are sorted by relevance (most prominent first).

"by_detection_type": {
"visual.context": ["42", "86", "15", ...],
"human.face": ["64", "65", ...],
"audio.context": ["12", "8", ...]
}

Use this to answer: "What visual objects were detected?" or "Who appeared in the video?"
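For example, listing the most prominent visual concepts might look like the sketch below, assuming metadata is the parsed Core JSON and that the listed detection IDs index into the detections map described earlier.

# List the top visual concepts, most prominent first (assumed JSON layout).
groupings = metadata["detection_groupings"]

for det_id in groupings["by_detection_type"].get("visual.context", [])[:10]:
    detection = metadata["detections"][det_id]
    print(detection["label"])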

by_second

An array where each element represents one second of the video. Each second contains an array of detection references with per-second confidence values.

"by_second": [
[{"d": "42", "c": 0.87, "o": ["267"]}, {"d": "64", "o": ["114"]}],
[{"d": "42", "c": 0.91, "o": ["267"]}, {"d": "12", "o": ["97"] "c": 0.65}],
...
]

Use this to answer: "What is happening at second 45 of the video?"

The d field references a detection ID, c provides the confidence at that specific second, and o lists the relevant occurrence IDs.
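A lookup for a single second might then look like this sketch, again assuming metadata is the parsed Core JSON; element i of by_second covers the i-th second from the start of the video.

# What was detected at second 45? (assumed JSON layout, as above)
second = 45
for ref in metadata["detection_groupings"]["by_second"][second]:
    detection = metadata["detections"][ref["d"]]
    confidence = ref.get("c")  # per-second confidence, when available
    print(detection["label"], confidence)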

by_detection_property

Groups detections by shared properties. Currently used for human.face detections grouped by similar_to_face_id, which merges occurrences of the same recognized person across multiple face detections.

by_frequency

Summarizes simultaneously appearing visual concepts into thematic groups.

Metadata Types

Valossa produces several types of metadata, each downloaded separately:

Type                Content                                    Download Parameter
core                All detections, groupings, segmentations   Default (no parameter needed)
frames_faces        Per-frame face bounding boxes              type=frames_faces
seconds_objects     Per-second object bounding boxes           type=seconds_objects
frames_objects      Per-frame object bounding boxes            type=frames_objects
speech_to_text_srt  Speech transcript in SRT format            type=speech_to_text_srt

The Core metadata is always available. Other types require that the corresponding AI features were enabled for your subscription.
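As an illustration, a non-core metadata type would be requested with the type download parameter. This sketch reuses the hypothetical API_BASE, API_KEY, and job_id from the job-submission example above; the endpoint and parameter names remain assumptions, while the type values come from the table.

# Download the speech transcript in SRT format (assumed endpoint/parameters).
import requests

srt = requests.get(f"{API_BASE}/job_results",
                   params={"api_key": API_KEY,
                           "job_id": job_id,
                           "type": "speech_to_text_srt"})
with open("transcript.srt", "wb") as f:
    f.write(srt.content)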

Putting It Together

Here is how these concepts relate in a typical workflow:

  1. You create a Job by submitting a video.
  2. The AI produces Detections of various Detection Types (visual objects, faces, speech, etc.).
  3. Each detection may have Occurrences indicating when it appears in the video.
  4. Detection Groupings organize detections for efficient access -- by type, by time, or by property.
  5. You download the Metadata (Core, frames_faces, etc.) and parse the JSON in your application.

For the detailed JSON structure, see JSON Structure.