JSON Structure

This page describes the top-level structure of the Valossa Core metadata JSON file and provides examples for each section.

Top-Level Structure

{
  "version_info": { ... },
  "job_info": { ... },
  "media_info": { ... },
  "detections": { ... },
  "detection_groupings": {
    "by_detection_property": { ... },
    "by_detection_type": { ... },
    "by_frequency": { ... },
    "by_second": [ ... ]
  },
  "segmentations": { ... }
}
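
As a minimal sketch of how a consumer might sanity-check a file against the structure above, the following Python snippet reports which of the listed top-level keys are missing. The function name `missing_top_level_keys` and the tiny inline sample are illustrative assumptions, not part of any Valossa tooling; a real file would be read from disk with `json.load`.

```python
import json

# Top-level keys as listed in the structure above.
EXPECTED_KEYS = (
    "version_info",
    "job_info",
    "media_info",
    "detections",
    "detection_groupings",
    "segmentations",
)


def missing_top_level_keys(metadata: dict) -> list:
    """Return the expected top-level keys that are absent from the metadata."""
    return [key for key in EXPECTED_KEYS if key not in metadata]


# Tiny stand-in document (hypothetical); not a complete metadata file.
sample = json.loads(
    '{"version_info": {}, "job_info": {}, "media_info": {}, "detections": {}}'
)
missing = missing_top_level_keys(sample)
print(missing)
```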

version_info

Contains version information about the metadata format and processing backend.

{
  "metadata_type": "core",
  "metadata_format": "1.8.1",
  "backend": "3.1.1"
}

Field | Description
--- | ---
metadata_type | Type identifier: core, frames_faces, seconds_objects, or frames_objects
metadata_format | Version of the metadata format (see Changelog)
backend | Version of the processing backend
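
A reader of the file may want to confirm the metadata type and format version before parsing further. The sketch below assumes that version strings are dot-separated integers, as in the example above; the helper `parse_version` and the minimum version `MIN_FORMAT` are hypothetical and chosen only for illustration.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version string such as "1.8.1" into a tuple of ints.

    Assumes every dot-separated part is a plain integer.
    """
    return tuple(int(part) for part in text.split("."))


version_info = {
    "metadata_type": "core",
    "metadata_format": "1.8.1",
    "backend": "3.1.1",
}

# Hypothetical minimum format version this reader claims to support.
MIN_FORMAT = (1, 8, 0)

is_core = version_info["metadata_type"] == "core"
format_ok = parse_version(version_info["metadata_format"]) >= MIN_FORMAT
print(is_core, format_ok)
```

Comparing tuples of integers avoids the string-comparison pitfall where "1.10.0" would sort before "1.9.0".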

job_info

Contains information about the analysis request and job.

{
  "job_id": "167d6a67-fb99-438c-a44c-c22c98229b93",
  "request": {
    "media": {
      "video": { "url": "https://example.com/video.mp4" },
      "transcript": { "url": null },
      "description": null,
      "language": "en-US",
      "title": "My motorbike vid"
    },
    "analysis_parameters": {
      "face_min_relative_height": 0.1
    }
  }
}

Field | Description
--- | ---
job_id | UUID of the analysis job
request.media | The original media parameters from the new_job request
request.analysis_parameters | Any analysis parameters that were applied

media_info

Contains technical information about the analyzed video and customer-provided metadata.

{
  "technical": {
    "duration_s": 1234.56789,
    "fps": 24,
    "resolution": { "x_pixels": 1280, "y_pixels": 720 },
    "video_codec": { "name_long": "H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10", "name_short": "h264", "bps": 1609644 },
    "audio_codec": { "name_long": "AAC (Advanced Audio Coding)", "name_short": "aac", "bps": 192023 },
    "n_audio_channels": 2
  },
  "from_customer": {
    "title": "My motorbike vid",
    "description": "Weekend ride through the mountains"
  }
}

Field | Description
--- | ---
technical.duration_s | Video duration in seconds
technical.fps | Frames per second
technical.resolution | Video resolution: x_pixels (width) and y_pixels (height)
technical.video_codec | Video codec info: name_long (full name), name_short (e.g., "h264"), bps (bitrate in bits per second)
technical.audio_codec | Audio codec info: name_long, name_short (e.g., "aac"), bps. May be absent if the video has no audio track.
technical.n_audio_channels | Number of audio channels (e.g., 2 for stereo)
from_customer.title | Title provided in the new_job request
from_customer.description | Description provided in the new_job request (if any)
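
Because technical.audio_codec may be absent, code that reads media_info should not index into it unconditionally. The following sketch shows one defensive way to pull out a few fields; the function `summarize_media` and the shape of its return value are illustrative assumptions.

```python
def summarize_media(media_info: dict) -> dict:
    """Collect a few technical fields, tolerating a missing audio track."""
    technical = media_info.get("technical", {})
    resolution = technical.get("resolution", {})
    audio = technical.get("audio_codec")  # may be absent for silent videos
    return {
        "duration_s": technical.get("duration_s"),
        "size": (resolution.get("x_pixels"), resolution.get("y_pixels")),
        "video_codec": technical.get("video_codec", {}).get("name_short"),
        "audio_codec": audio.get("name_short") if audio else None,
    }


media_info = {
    "technical": {
        "duration_s": 1234.56789,
        "fps": 24,
        "resolution": {"x_pixels": 1280, "y_pixels": 720},
        "video_codec": {"name_short": "h264", "bps": 1609644},
        "audio_codec": {"name_short": "aac", "bps": 192023},
        "n_audio_channels": 2,
    }
}

summary = summarize_media(media_info)
print(summary)
```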

detections

An associative object where each key is a detection ID (string) and each value is a detection object. Detection IDs are unique within the metadata file.

{
  "42": {
    "t": "visual.context",
    "label": "sunglasses",
    "cid": "abc123",
    "ext_refs": {
      "wikidata": { "id": "Q130397" },
      "gkg": { "id": "/m/0jyfg" }
    },
    "categ": {
      "tags": ["fashion_wear"]
    },
    "occs": [
      {
        "id": "267",
        "ss": 60.227,
        "se": 66.191,
        "c_max": 0.804,
        "c_med": 0.75,
        "shs": 47,
        "she": 48
      }
    ]
  },
  "64": {
    "t": "human.face",
    "label": "face",
    "a": {
      "gender": { "c": 0.929, "value": "female" },
      "s_visible": 4.4,
      "similar_to": [
        {
          "c": 0.928,
          "name": "Jane Doe",
          "gallery": { "id": "a3ead7b4-..." },
          "gallery_face": { "id": "f6a728c6-...", "name": "Jane Doe" }
        }
      ]
    },
    "occs": [
      { "id": "123", "ss": 28.333, "se": 33.567, "shs": 8, "she": 9 }
    ]
  }
}

Common Detection Fields

Field | Type | Always Present | Description
--- | --- | --- | ---
t | string | Yes | Detection type identifier (e.g., visual.context)
label | string | Yes | Human-readable label
occs | array | No | Array of occurrences
a | object | No | Attributes specific to the detection type
c | float | No | Detection-level confidence (for audio.speech_detailed)
cid | string | No | Valossa Concept Ontology identifier
ext_refs | object | No | External ontology references (Wikidata, Google Knowledge Graph, IAB)
categ | object | No | Category tags for the detection
tr | object | No | Label translations. Keys are language codes (e.g., "fi"), values are translated label strings. Present on visual.context and audio.context when translations are available.
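
Since only t and label are always present, code that walks the detections object should treat every other field as optional. The sketch below totals the occurrence time per detection. It assumes that ss and se in each occurrence are start and end times in seconds, which is consistent with the examples on this page but is not stated in the table above. The detection "999" and the function name are hypothetical.

```python
def total_occurrence_seconds(detection: dict) -> float:
    """Sum (se - ss) over a detection's occurrences; 0 if occs is absent."""
    return sum(occ["se"] - occ["ss"] for occ in detection.get("occs", []))


detections = {
    "42": {
        "t": "visual.context",
        "label": "sunglasses",
        "occs": [{"id": "267", "ss": 60.227, "se": 66.191}],
    },
    "64": {
        "t": "human.face",
        "label": "face",
        "occs": [{"id": "123", "ss": 28.333, "se": 33.567}],
    },
    # Hypothetical detection without an occs array, to exercise the default.
    "999": {"t": "visual.context", "label": "example without occs"},
}

screen_time = {
    det_id: round(total_occurrence_seconds(det), 3)
    for det_id, det in detections.items()
}
print(screen_time)
```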

detection_groupings

Provides organized views of the detections for efficient querying.

by_detection_type

Groups detection IDs by their type. Within each type, detections are sorted by relevance (most prominent first).

{
  "by_detection_type": {
    "visual.context": ["42", "86", "15"],
    "human.face": ["64", "65"],
    "audio.context": ["12", "8"],
    "audio.speech": ["100", "101", "102"]
  }
}
Note: audio.speech and audio.speech_detailed detections are ordered by time, not by prominence. The audio.speech_detailed listing follows the chronological order of words as spoken in the video.
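
Because the IDs within each type are already sorted by relevance, the most prominent detections of a type can be read by slicing the list and looking each ID up in detections. A minimal sketch follows; the function `top_labels` and the labels for detections "86" and "15" are made up for illustration.

```python
def top_labels(metadata: dict, detection_type: str, limit: int = 3) -> list:
    """Return labels of the first `limit` detections listed for a type."""
    by_type = metadata["detection_groupings"]["by_detection_type"]
    ids = by_type.get(detection_type, [])
    return [metadata["detections"][det_id]["label"] for det_id in ids[:limit]]


metadata = {
    "detections": {
        "42": {"t": "visual.context", "label": "sunglasses"},
        # Labels below are placeholders, not taken from a real file.
        "86": {"t": "visual.context", "label": "helmet"},
        "15": {"t": "visual.context", "label": "road"},
    },
    "detection_groupings": {
        "by_detection_type": {"visual.context": ["42", "86", "15"]}
    },
}

top_two = top_labels(metadata, "visual.context", limit=2)
print(top_two)
```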

by_second

An array where index n corresponds to second n of the video. Each element is an array of detection references for that second.

{
  "by_second": [
    [
      { "d": "42", "c": 0.87, "o": ["267"] },
      { "d": "64", "o": ["123"], "a": { "sz": { "h": 0.188 } } }
    ],
    [
      { "d": "42", "c": 0.91, "o": ["267"] },
      { "d": "12", "c": 0.65, "o": ["8"] }
    ]
  ]
}

Field | Description
--- | ---
d | Detection ID (reference into detections)
c | Confidence at this second (for visual.context, audio.context)
o | Array of occurrence IDs active during this second
a | Per-second attributes (e.g., face size, color data, emotion data)
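
To answer "what is on screen at second n", index into by_second and resolve each d reference against detections. The sketch below returns (label, confidence) pairs and treats c as optional, since not every reference carries it. The function `active_at` and the label for detection "12" are illustrative assumptions.

```python
def active_at(metadata: dict, second: int) -> list:
    """Return (label, confidence) pairs for detections active at a second.

    Confidence is None when the reference has no "c" field. Seconds outside
    the array yield an empty list.
    """
    by_second = metadata["detection_groupings"]["by_second"]
    if not 0 <= second < len(by_second):
        return []
    detections = metadata["detections"]
    return [
        (detections[ref["d"]]["label"], ref.get("c"))
        for ref in by_second[second]
    ]


metadata = {
    "detections": {
        "42": {"t": "visual.context", "label": "sunglasses"},
        "64": {"t": "human.face", "label": "face"},
        # Placeholder label; the page does not show detection "12".
        "12": {"t": "audio.context", "label": "music"},
    },
    "detection_groupings": {
        "by_second": [
            [
                {"d": "42", "c": 0.87, "o": ["267"]},
                {"d": "64", "o": ["123"], "a": {"sz": {"h": 0.188}}},
            ],
            [
                {"d": "42", "c": 0.91, "o": ["267"]},
                {"d": "12", "c": 0.65, "o": ["8"]},
            ],
        ]
    },
}

first_second = active_at(metadata, 0)
print(first_second)
```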

by_detection_property

Groups detections by shared properties. Supported groupings include human.face detections grouped by similar_to_face_id, and cross-type groupings such as refined_from_multiple_detection_types, which is used for minor detection. See Faces & Identity for details.

{
  "by_detection_property": {
    "human.face": {
      "similar_to_face_id": {
        "cb6f580b-fa3f-4ed4-...": {
          "moccs": [
            { "ss": 5.0, "se": 10.0 },
            { "ss": 21.0, "se": 35.0 }
          ],
          "det_ids": ["3", "4"]
        }
      }
    }
  }
}
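
One possible use of this grouping is to total the time associated with each face ID across its moccs entries. The sketch below assumes that ss and se are start and end times in seconds; the function name `face_group_seconds` is hypothetical, and the truncated face ID is copied as shown in the example above.

```python
def face_group_seconds(metadata: dict) -> dict:
    """Map each similar_to_face_id key to the summed duration of its moccs."""
    groups = (
        metadata["detection_groupings"]["by_detection_property"]
        .get("human.face", {})
        .get("similar_to_face_id", {})
    )
    return {
        face_id: sum(m["se"] - m["ss"] for m in group.get("moccs", []))
        for face_id, group in groups.items()
    }


metadata = {
    "detection_groupings": {
        "by_detection_property": {
            "human.face": {
                "similar_to_face_id": {
                    # ID left truncated, exactly as in the example above.
                    "cb6f580b-fa3f-4ed4-...": {
                        "moccs": [
                            {"ss": 5.0, "se": 10.0},
                            {"ss": 21.0, "se": 35.0},
                        ],
                        "det_ids": ["3", "4"],
                    }
                }
            }
        }
    }
}

seconds_per_face = face_group_seconds(metadata)
print(seconds_per_face)
```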

by_frequency

Summarizes simultaneously appearing visual concepts into thematic groups. Contains lists of co-occurring detection IDs.

segmentations

Contains time-based segmentation data, primarily shot boundaries.

{
  "segmentations": {
    "detected_shots": [
      { "ss": 0.083, "se": 5.214, "fs": 0, "fe": 122, "sdur": 5.131 },
      { "ss": 5.214, "se": 10.177, "fs": 123, "fe": 241, "sdur": 4.963 }
    ]
  }
}

See Segmentation & Shots for details on the shot boundary format.
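
As an illustration of working with detected_shots, the sketch below finds the shot that contains a given timestamp using a binary search over shot start times. It assumes the shots are sorted by ss and treats each shot as the half-open interval from ss up to (but not including) se; both are assumptions made for this example, and the function `shot_index_at` is hypothetical.

```python
from bisect import bisect_right
from typing import Optional


def shot_index_at(shots: list, t: float) -> Optional[int]:
    """Return the index of the shot whose [ss, se) interval contains t."""
    starts = [shot["ss"] for shot in shots]  # assumed sorted ascending
    i = bisect_right(starts, t) - 1          # last shot starting at or before t
    if i >= 0 and t < shots[i]["se"]:
        return i
    return None


detected_shots = [
    {"ss": 0.083, "se": 5.214, "fs": 0, "fe": 122, "sdur": 5.131},
    {"ss": 5.214, "se": 10.177, "fs": 123, "fe": 241, "sdur": 4.963},
]

print(shot_index_at(detected_shots, 3.0))
print(shot_index_at(detected_shots, 5.214))
```

In the example data, sdur matches se minus ss for both shots, so the duration can be read directly rather than recomputed.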