JSON Structure
This page describes the top-level structure of the Valossa Core metadata JSON file and provides examples for each section.
Top-Level Structure
{
"version_info": { ... },
"job_info": { ... },
"media_info": { ... },
"detections": { ... },
"detection_groupings": {
"by_detection_property": { ... },
"by_detection_type": { ... },
"by_frequency": { ... },
"by_second": [ ... ]
},
"segmentations": { ... }
}
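As a minimal sketch, a consumer could check which of the top-level sections listed above are present before reading further. The helper name `missing_top_level_keys` is hypothetical, and treating all six keys as expected is an assumption; this page does not state which sections are guaranteed to exist.

```python
import json

# Top-level keys as listed in the structure above.
EXPECTED_KEYS = (
    "version_info", "job_info", "media_info",
    "detections", "detection_groupings", "segmentations",
)

def missing_top_level_keys(metadata: dict) -> list[str]:
    """Return the expected top-level keys that are absent from the metadata."""
    return [key for key in EXPECTED_KEYS if key not in metadata]

# Inline sample standing in for a real metadata file. With a real file you
# would parse its contents with json.load() instead.
sample = json.loads(
    '{"version_info": {}, "job_info": {}, "media_info": {}, "detections": {}}'
)
print(missing_top_level_keys(sample))
```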
version_info
Contains version information about the metadata format and processing backend.
{
"metadata_type": "core",
"metadata_format": "1.8.1",
"backend": "3.1.1"
}
| Field | Description |
|---|---|
| metadata_type | Type identifier: core, frames_faces, seconds_objects, or frames_objects |
| metadata_format | Version of the metadata format (see Changelog) |
| backend | Version of the processing backend |
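A reader of the file may want to confirm the metadata type and format version before relying on particular fields. The sketch below assumes that version strings are dot-separated integers, as in the example above; `parse_version` is a hypothetical helper, not part of any Valossa library.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version string such as "1.8.1" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

version_info = {
    "metadata_type": "core",
    "metadata_format": "1.8.1",
    "backend": "3.1.1",
}

is_core = version_info["metadata_type"] == "core"
format_version = parse_version(version_info["metadata_format"])

# Tuples compare element by element, so (1, 8, 1) >= (1, 8, 0) holds.
print(is_core, format_version >= (1, 8, 0))
```

Comparing tuples rather than raw strings avoids ordering mistakes such as "1.10.0" sorting before "1.8.1".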
job_info
Contains information about the analysis request and job.
{
"job_id": "167d6a67-fb99-438c-a44c-c22c98229b93",
"request": {
"media": {
"video": { "url": "https://example.com/video.mp4" },
"transcript": { "url": null },
"description": null,
"language": "en-US",
"title": "My motorbike vid"
},
"analysis_parameters": {
"face_min_relative_height": 0.1
}
}
}
| Field | Description |
|---|---|
| job_id | UUID of the analysis job |
| request.media | The original media parameters from the new_job request |
| request.analysis_parameters | Any analysis parameters that were applied |
media_info
Contains technical information about the analyzed video and customer-provided metadata.
{
"technical": {
"duration_s": 1234.56789,
"fps": 24,
"resolution": { "x_pixels": 1280, "y_pixels": 720 },
"video_codec": { "name_long": "H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10", "name_short": "h264", "bps": 1609644 },
"audio_codec": { "name_long": "AAC (Advanced Audio Coding)", "name_short": "aac", "bps": 192023 },
"n_audio_channels": 2
},
"from_customer": {
"title": "My motorbike vid",
"description": "Weekend ride through the mountains"
}
}
| Field | Description |
|---|---|
| technical.duration_s | Video duration in seconds |
| technical.fps | Frames per second |
| technical.resolution | Video resolution: x_pixels (width) and y_pixels (height) |
| technical.video_codec | Video codec info: name_long (full name), name_short (e.g., "h264"), bps (bitrate in bits per second) |
| technical.audio_codec | Audio codec info: name_long, name_short (e.g., "aac"), bps. May be absent if the video has no audio track. |
| technical.n_audio_channels | Number of audio channels (e.g., 2 for stereo) |
| from_customer.title | Title provided in the new_job request |
| from_customer.description | Description provided in the new_job request (if any) |
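Because audio_codec may be absent, code that reads media_info should not assume the key exists. The following sketch uses a trimmed copy of the example above with the audio fields removed; `summarize_media` is a hypothetical helper, and the frame count is only an estimate that assumes a constant frame rate.

```python
media_info = {
    "technical": {
        "duration_s": 1234.56789,
        "fps": 24,
        "resolution": {"x_pixels": 1280, "y_pixels": 720},
        "video_codec": {"name_short": "h264", "bps": 1609644},
        # "audio_codec" deliberately omitted to simulate a video with no audio track
    }
}

def summarize_media(media_info: dict) -> dict:
    """Derive a few convenience values from media_info["technical"]."""
    tech = media_info["technical"]
    resolution = tech["resolution"]
    return {
        # Estimate only: duration times fps, assuming a constant frame rate.
        "approx_frames": round(tech["duration_s"] * tech["fps"]),
        "aspect_ratio": resolution["x_pixels"] / resolution["y_pixels"],
        # .get() returns None when the key is missing instead of raising.
        "has_audio": tech.get("audio_codec") is not None,
    }

summary = summarize_media(media_info)
print(summary)
```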
detections
An associative object where each key is a detection ID (string) and each value is a detection object. Detection IDs are unique within the metadata file.
{
"42": {
"t": "visual.context",
"label": "sunglasses",
"cid": "abc123",
"ext_refs": {
"wikidata": { "id": "Q130397" },
"gkg": { "id": "/m/0jyfg" }
},
"categ": {
"tags": ["fashion_wear"]
},
"occs": [
{
"id": "267",
"ss": 60.227,
"se": 66.191,
"c_max": 0.804,
"c_med": 0.75,
"shs": 47,
"she": 48
}
]
},
"64": {
"t": "human.face",
"label": "face",
"a": {
"gender": { "c": 0.929, "value": "female" },
"s_visible": 4.4,
"similar_to": [
{
"c": 0.928,
"name": "Jane Doe",
"gallery": { "id": "a3ead7b4-..." },
"gallery_face": { "id": "f6a728c6-...", "name": "Jane Doe" }
}
]
},
"occs": [
{ "id": "123", "ss": 28.333, "se": 33.567, "shs": 8, "she": 9 }
]
}
}
Common Detection Fields
| Field | Type | Always Present | Description |
|---|---|---|---|
| t | string | Yes | Detection type identifier (e.g., visual.context) |
| label | string | Yes | Human-readable label |
| occs | array | No | Array of occurrences |
| a | object | No | Attributes specific to the detection type |
| c | float | No | Detection-level confidence (for audio.speech_detailed) |
| cid | string | No | Valossa Concept Ontology identifier |
| ext_refs | object | No | External ontology references (Wikidata, Google Knowledge Graph, IAB) |
| categ | object | No | Category tags for the detection |
| tr | object | No | Label translations. Keys are language codes (e.g., "fi"), values are translated label strings. Present on visual.context and audio.context when translations are available. |
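Since only t and label are always present, code that walks the detections object should treat every other field as optional. The sketch below sums occurrence durations per detection ID. It assumes that ss and se are the start and end times of an occurrence in seconds, which this page does not define; the detection "900" and its type are invented purely to show a detection without occs.

```python
detections = {
    "42": {
        "t": "visual.context",
        "label": "sunglasses",
        "occs": [{"id": "267", "ss": 60.227, "se": 66.191, "c_max": 0.804}],
    },
    "64": {
        "t": "human.face",
        "label": "face",
        "occs": [{"id": "123", "ss": 28.333, "se": 33.567}],
    },
    # Hypothetical entry (made-up ID, type, and label) with no "occs" field.
    "900": {"t": "example.type", "label": "example"},
}

def occurrence_seconds(detections: dict) -> dict[str, float]:
    """Map each detection ID to the summed length of its occurrences."""
    totals = {}
    for det_id, det in detections.items():
        occs = det.get("occs", [])  # optional field: default to no occurrences
        totals[det_id] = sum((occ["se"] - occ["ss"] for occ in occs), 0.0)
    return totals

durations = occurrence_seconds(detections)
for det_id, seconds in durations.items():
    print(det_id, detections[det_id]["label"], round(seconds, 3))
```

Keying the result by detection ID rather than by label keeps separate detections that share a label, such as several faces, from being merged.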
detection_groupings
Provides organized views of the detections for efficient querying.
by_detection_type
Groups detection IDs by their type. Within each type, detections are sorted by relevance (most prominent first).
{
"by_detection_type": {
"visual.context": ["42", "86", "15"],
"human.face": ["64", "65"],
"audio.context": ["12", "8"],
"audio.speech": ["100", "101", "102"]
}
}
audio.speech and audio.speech_detailed detections are ordered by time, not by prominence. The audio.speech_detailed listing follows the chronological order of words as spoken in the video.
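Because the IDs within each type are already sorted by relevance, taking the first few entries gives the most prominent detections of that type. A minimal sketch follows; `top_labels` is a hypothetical helper, and the labels for detections "86" and "15" are made up for illustration because the example above does not show them.

```python
metadata = {
    "detections": {
        "42": {"t": "visual.context", "label": "sunglasses"},
        "86": {"t": "visual.context", "label": "motorcycle"},  # made-up label
        "15": {"t": "visual.context", "label": "road"},        # made-up label
    },
    "detection_groupings": {
        "by_detection_type": {"visual.context": ["42", "86", "15"]}
    },
}

def top_labels(metadata: dict, det_type: str, n: int = 3) -> list[str]:
    """Return the labels of the first n detections listed for a type."""
    by_type = metadata["detection_groupings"]["by_detection_type"]
    ids = by_type.get(det_type, [])  # a type with no detections may be missing
    return [metadata["detections"][det_id]["label"] for det_id in ids[:n]]

print(top_labels(metadata, "visual.context", n=2))
print(top_labels(metadata, "human.face"))
```

For audio.speech and audio.speech_detailed the same slice returns the earliest entries rather than the most prominent ones, since those lists are ordered by time.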
by_second
An array where index n corresponds to second n of the video. Each element is an array of detection references for that second.
{
"by_second": [
[
{ "d": "42", "c": 0.87, "o": ["267"] },
{ "d": "64", "o": ["123"], "a": { "sz": { "h": 0.188 } } }
],
[
{ "d": "42", "c": 0.91, "o": ["267"] },
{ "d": "12", "c": 0.65, "o": ["8"] }
]
]
}
| Field | Description |
|---|---|
| d | Detection ID (reference into detections) |
| c | Confidence at this second (for visual.context, audio.context) |
| o | Array of occurrence IDs active during this second |
| a | Per-second attributes (e.g., face size, color data, emotion data) |
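The array layout makes a lookup by time a direct index operation. The sketch below uses the example data above and returns the detection IDs and confidences for a given timestamp. It assumes that element n covers the interval from second n up to, but not including, second n + 1; `detections_at` is a hypothetical helper name.

```python
by_second = [
    [
        {"d": "42", "c": 0.87, "o": ["267"]},
        {"d": "64", "o": ["123"], "a": {"sz": {"h": 0.188}}},
    ],
    [
        {"d": "42", "c": 0.91, "o": ["267"]},
        {"d": "12", "c": 0.65, "o": ["8"]},
    ],
]

def detections_at(by_second: list, t_seconds: float) -> list[tuple]:
    """Return (detection ID, confidence or None) pairs for a timestamp."""
    index = int(t_seconds)  # assumption: element n covers [n, n + 1)
    if index < 0 or index >= len(by_second):
        return []  # timestamp falls outside the analyzed range
    # "c" is not present for every detection type, so use .get().
    return [(entry["d"], entry.get("c")) for entry in by_second[index]]

print(detections_at(by_second, 0.4))
print(detections_at(by_second, 1.5))
```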
by_detection_property
Groups detections by shared properties. Supported groupings include human.face detections grouped by similar_to_face_id, and cross-type groupings such as refined_from_multiple_detection_types, which is used for minor detection. See Faces & Identity for details.
{
"by_detection_property": {
"human.face": {
"similar_to_face_id": {
"cb6f580b-fa3f-4ed4-...": {
"moccs": [
{ "ss": 5.0, "se": 10.0 },
{ "ss": 21.0, "se": 35.0 }
],
"det_ids": ["3", "4"]
}
}
}
}
}
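One use of this grouping is to total the time associated with each similar_to_face_id entry. The sketch below sums the moccs ranges from the example above. It assumes that each moccs item is a non-overlapping time range with ss and se in seconds, which this page does not spell out; `seconds_per_face` is a hypothetical helper, and the truncated ID is kept exactly as it appears in the example.

```python
by_detection_property = {
    "human.face": {
        "similar_to_face_id": {
            "cb6f580b-fa3f-4ed4-...": {
                "moccs": [
                    {"ss": 5.0, "se": 10.0},
                    {"ss": 21.0, "se": 35.0},
                ],
                "det_ids": ["3", "4"],
            }
        }
    }
}

def seconds_per_face(by_detection_property: dict) -> dict[str, float]:
    """Sum the moccs time ranges for every similar_to_face_id entry."""
    faces = by_detection_property.get("human.face", {}).get("similar_to_face_id", {})
    return {
        face_id: sum((m["se"] - m["ss"] for m in group.get("moccs", [])), 0.0)
        for face_id, group in faces.items()
    }

totals = seconds_per_face(by_detection_property)
print(totals)
```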
by_frequency
Summarizes simultaneously appearing visual concepts into thematic groups. Contains lists of co-occurring detection IDs.
segmentations
Contains time-based segmentation data, primarily shot boundaries.
{
"segmentations": {
"detected_shots": [
{ "ss": 0.083, "se": 5.214, "fs": 0, "fe": 122, "sdur": 5.131 },
{ "ss": 5.214, "se": 10.177, "fs": 123, "fe": 241, "sdur": 4.963 }
]
}
}
See Segmentation & Shots for details on the shot boundary format.
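As a rough sketch, the shot containing a given timestamp can be found with a binary search over the shot start times. This assumes that detected_shots is sorted by ss, that ss and se are shot start and end times in seconds, and that each shot covers the range from ss up to, but not including, se; none of this is stated on this page. `shot_index_at` is a hypothetical helper.

```python
from bisect import bisect_right

detected_shots = [
    {"ss": 0.083, "se": 5.214, "fs": 0, "fe": 122, "sdur": 5.131},
    {"ss": 5.214, "se": 10.177, "fs": 123, "fe": 241, "sdur": 4.963},
]

def shot_index_at(shots: list, t_seconds: float):
    """Return the index of the shot containing t_seconds, or None."""
    starts = [shot["ss"] for shot in shots]  # assumption: sorted ascending
    # bisect_right gives the number of shots starting at or before t_seconds.
    index = bisect_right(starts, t_seconds) - 1
    if index >= 0 and t_seconds < shots[index]["se"]:
        return index
    return None  # before the first shot, after the last, or in a gap

print(shot_index_at(detected_shots, 6.0))
print(shot_index_at(detected_shots, 0.01))
```

In the example data, sdur matches se minus ss for both shots (5.214 - 0.083 = 5.131 and 10.177 - 5.214 = 4.963).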