JSON Structure

This page describes the top-level structure of the Valossa Core metadata JSON file and provides examples for each section.

Top-Level Structure

{
  "version_info": { ... },
  "job_info": { ... },
  "media_info": { ... },
  "detections": { ... },
  "detection_groupings": {
    "by_detection_property": { ... },
    "by_detection_type": { ... },
    "by_frequency": { ... },
    "by_second": [ ... ]
  },
  "segmentations": { ... }
}
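As a rough illustration of the layout above, the following sketch checks a parsed metadata object for the six top-level sections. The helper name `missing_sections` and the inline sample are hypothetical; a real file would normally be read with `json.load`.

```python
import json

# Top-level keys as listed in the structure above.
EXPECTED_SECTIONS = (
    "version_info",
    "job_info",
    "media_info",
    "detections",
    "detection_groupings",
    "segmentations",
)


def missing_sections(metadata: dict) -> list:
    """Return the expected top-level keys that are absent from metadata."""
    return [key for key in EXPECTED_SECTIONS if key not in metadata]


# Small inline stand-in for a real metadata file (illustrative only).
sample = json.loads('{"version_info": {}, "job_info": {}, "detections": {}}')
print(missing_sections(sample))
```

Whether every section is present in every file is not stated here, so treating a missing key as a warning rather than an error may be the safer choice.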

version_info

Contains version information about the metadata format and processing backend.

{
  "metadata_type": "core",
  "metadata_format": "1.8.1",
  "backend": "3.1.1"
}

Field	Description
`metadata_type`	Type identifier: `core`, `frames_faces`, `seconds_objects`, or `frames_objects`
`metadata_format`	Version of the metadata format (see Changelog)
`backend`	Version of the processing backend
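A reader might use these fields to guard against unexpected input. The sketch below assumes that version strings are dot-separated integers, as in the example; the minimum version `(1, 8, 0)` and the helper name `parse_version` are arbitrary choices for illustration.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as "1.8.1" into (1, 8, 1)."""
    return tuple(int(part) for part in version.split("."))


version_info = {
    "metadata_type": "core",
    "metadata_format": "1.8.1",
    "backend": "3.1.1",
}

# Only proceed with Core metadata; other types have a different payload.
is_core = version_info["metadata_type"] == "core"

# Tuples compare element by element, so this is a numeric version check.
format_version = parse_version(version_info["metadata_format"])
is_recent_enough = format_version >= (1, 8, 0)  # hypothetical minimum

print(is_core, format_version, is_recent_enough)
```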

job_info

Contains information about the analysis request and job.

{
  "job_id": "167d6a67-fb99-438c-a44c-c22c98229b93",
  "request": {
    "media": {
      "video": { "url": "https://example.com/video.mp4" },
      "transcript": { "url": null },
      "description": null,
      "language": "en-US",
      "title": "My motorbike vid"
    },
    "analysis_parameters": {
      "face_min_relative_height": 0.1
    }
  }
}

Field	Description
`job_id`	UUID of the analysis job
`request.media`	The original media parameters from the `new_job` request
`request.analysis_parameters`	Any analysis parameters that were applied

media_info

Contains technical information about the analyzed video and customer-provided metadata.

{
  "technical": {
    "duration_s": 1234.56789,
    "fps": 24,
    "resolution": { "x_pixels": 1280, "y_pixels": 720 },
    "video_codec": { "name_long": "H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10", "name_short": "h264", "bps": 1609644 },
    "audio_codec": { "name_long": "AAC (Advanced Audio Coding)", "name_short": "aac", "bps": 192023 },
    "n_audio_channels": 2
  },
  "from_customer": {
    "title": "My motorbike vid",
    "description": "Weekend ride through the mountains"
  }
}

Field	Description
`technical.duration_s`	Video duration in seconds
`technical.fps`	Frames per second
`technical.resolution`	Video resolution: `x_pixels` (width) and `y_pixels` (height)
`technical.video_codec`	Video codec info: `name_long` (full name), `name_short` (e.g., `"h264"`), `bps` (bitrate in bits per second)
`technical.audio_codec`	Audio codec info: `name_long`, `name_short` (e.g., `"aac"`), `bps`. May be absent if the video has no audio track.
`technical.n_audio_channels`	Number of audio channels (e.g., `2` for stereo)
`from_customer.title`	Title provided in the `new_job` request
`from_customer.description`	Description provided in the `new_job` request (if any)
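Because `technical.audio_codec` may be absent, code that reads `media_info` should not index it unconditionally. A minimal sketch follows; the helper name `summarize_media` is hypothetical, and the frame count is only an estimate that assumes a constant frame rate.

```python
def summarize_media(media_info: dict) -> dict:
    """Collect a few technical fields, tolerating a missing audio track."""
    tech = media_info["technical"]
    res = tech["resolution"]
    audio = tech.get("audio_codec")  # absent when the video has no audio
    return {
        "duration_s": tech["duration_s"],
        # Estimate only: assumes the frame rate is constant.
        "approx_frames": round(tech["duration_s"] * tech["fps"]),
        "resolution": f'{res["x_pixels"]}x{res["y_pixels"]}',
        "video_codec": tech["video_codec"]["name_short"],
        "audio_codec": audio["name_short"] if audio else None,
    }


media_info = {
    "technical": {
        "duration_s": 1234.56789,
        "fps": 24,
        "resolution": {"x_pixels": 1280, "y_pixels": 720},
        "video_codec": {"name_short": "h264", "bps": 1609644},
        # No "audio_codec" key: simulates a silent video.
    }
}

summary = summarize_media(media_info)
print(summary)
```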

detections

An associative object where each key is a detection ID (string) and each value is a detection object. Detection IDs are unique within the metadata file.

{
  "42": {
    "t": "visual.context",
    "label": "sunglasses",
    "cid": "abc123",
    "ext_refs": {
      "wikidata": { "id": "Q130397" },
      "gkg": { "id": "/m/0jyfg" }
    },
    "categ": {
      "tags": ["fashion_wear"]
    },
    "occs": [
      {
        "id": "267",
        "ss": 60.227,
        "se": 66.191,
        "c_max": 0.804,
        "c_med": 0.75,
        "shs": 47,
        "she": 48
      }
    ]
  },
  "64": {
    "t": "human.face",
    "label": "face",
    "a": {
      "gender": { "c": 0.929, "value": "female" },
      "s_visible": 4.4,
      "similar_to": [
        {
          "c": 0.928,
          "name": "Jane Doe",
          "gallery": { "id": "a3ead7b4-..." },
          "gallery_face": { "id": "f6a728c6-...", "name": "Jane Doe" }
        }
      ]
    },
    "occs": [
      { "id": "123", "ss": 28.333, "se": 33.567, "shs": 8, "she": 9 }
    ]
  }
}

Common Detection Fields

Field	Type	Always Present	Description
`t`	string	Yes	Detection type identifier (e.g., `visual.context`)
`label`	string	Yes	Human-readable label
`occs`	array	No	Array of occurrences
`a`	object	No	Attributes specific to the detection type
`c`	float	No	Detection-level confidence (for `audio.speech_detailed`)
`cid`	string	No	Valossa Concept Ontology identifier
`ext_refs`	object	No	External ontology references (Wikidata, Google Knowledge Graph, IAB)
`categ`	object	No	Category tags for the detection
`tr`	object	No	Label translations. Keys are language codes (e.g., `"fi"`), values are translated label strings. Present on `visual.context` and `audio.context` when translations are available.
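Since only `t` and `label` are always present, code that walks `detections` should treat every other field as optional. The sketch below flattens detections into one row per occurrence. It assumes `ss` and `se` are the occurrence start and end times in seconds; the helper name `iter_occurrences` is hypothetical.

```python
def iter_occurrences(detections: dict):
    """Yield one flat record per occurrence across all detections."""
    for det_id, det in detections.items():
        # "occs" is optional, so fall back to an empty list.
        for occ in det.get("occs", []):
            yield {
                "det_id": det_id,
                "type": det["t"],
                "label": det["label"],
                "start_s": occ["ss"],
                "end_s": occ["se"],
                # Not every occurrence carries a confidence value.
                "c_max": occ.get("c_max"),
            }


# Trimmed copy of the example detections above.
detections = {
    "42": {
        "t": "visual.context",
        "label": "sunglasses",
        "occs": [{"id": "267", "ss": 60.227, "se": 66.191, "c_max": 0.804}],
    },
    "64": {
        "t": "human.face",
        "label": "face",
        "occs": [{"id": "123", "ss": 28.333, "se": 33.567}],
    },
}

rows = list(iter_occurrences(detections))
for row in rows:
    print(row)
```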

detection_groupings

Provides organized views of the detections for efficient querying.

by_detection_type

Groups detection IDs by their type. Within each type, detections are sorted by relevance (most prominent first).

{
  "by_detection_type": {
    "visual.context": ["42", "86", "15"],
    "human.face": ["64", "65"],
    "audio.context": ["12", "8"],
    "audio.speech": ["100", "101", "102"]
  }
}

Note: `audio.speech` and `audio.speech_detailed` detections are ordered by time, not by prominence. The `audio.speech_detailed` listing follows the chronological order of words as spoken in the video.
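The grouping holds only IDs, so each entry has to be looked up in `detections` to get its label. A minimal sketch, with `labels_by_type` as a hypothetical helper name and a sample reduced to the two detections shown earlier:

```python
def labels_by_type(metadata: dict, det_type: str) -> list:
    """Resolve the IDs listed for det_type into labels, keeping their order."""
    grouping = metadata["detection_groupings"]["by_detection_type"]
    detections = metadata["detections"]
    # A type with no detections is simply missing from the grouping.
    return [detections[det_id]["label"] for det_id in grouping.get(det_type, [])]


metadata = {
    "detections": {
        "42": {"t": "visual.context", "label": "sunglasses"},
        "64": {"t": "human.face", "label": "face"},
    },
    "detection_groupings": {
        "by_detection_type": {
            "visual.context": ["42"],
            "human.face": ["64"],
        }
    },
}

print(labels_by_type(metadata, "visual.context"))
print(labels_by_type(metadata, "audio.context"))
```

Because the list order is preserved, the first label returned for a prominence-sorted type is the most prominent one.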

by_second

An array where index n corresponds to second n of the video. Each element is an array of detection references for that second.

{
  "by_second": [
    [
      { "d": "42", "c": 0.87, "o": ["267"] },
      { "d": "64", "o": ["123"], "a": { "sz": { "h": 0.188 } } }
    ],
    [
      { "d": "42", "c": 0.91, "o": ["267"] },
      { "d": "12", "c": 0.65, "o": ["8"] }
    ]
  ]
}

Field	Description
`d`	Detection ID (reference into `detections`)
`c`	Confidence at this second (for `visual.context`, `audio.context`)
`o`	Array of occurrence IDs active during this second
`a`	Per-second attributes (e.g., face size, color data, emotion data)
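To answer "what is on screen at second n", index into `by_second` and resolve each `d` reference. The sketch below uses a hypothetical helper `detections_at` and returns an empty list for seconds outside the array.

```python
def detections_at(metadata: dict, second: int) -> list:
    """Return (label, confidence) pairs for the given second of the video."""
    by_second = metadata["detection_groupings"]["by_second"]
    if not 0 <= second < len(by_second):
        return []
    detections = metadata["detections"]
    # "c" is only present for some detection types, hence .get().
    return [
        (detections[ref["d"]]["label"], ref.get("c"))
        for ref in by_second[second]
    ]


metadata = {
    "detections": {
        "42": {"t": "visual.context", "label": "sunglasses"},
        "64": {"t": "human.face", "label": "face"},
    },
    "detection_groupings": {
        "by_second": [
            [
                {"d": "42", "c": 0.87, "o": ["267"]},
                {"d": "64", "o": ["123"], "a": {"sz": {"h": 0.188}}},
            ],
            [{"d": "42", "c": 0.91, "o": ["267"]}],
        ]
    },
}

print(detections_at(metadata, 0))
print(detections_at(metadata, 1))
```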

by_detection_property

Groups detections by shared properties. It supports `human.face` detections grouped by `similar_to_face_id`, as well as cross-type groupings such as `refined_from_multiple_detection_types` for minor detection. See Faces & Identity for details.

{
  "by_detection_property": {
    "human.face": {
      "similar_to_face_id": {
        "cb6f580b-fa3f-4ed4-...": {
          "moccs": [
            { "ss": 5.0, "se": 10.0 },
            { "ss": 21.0, "se": 35.0 }
          ],
          "det_ids": ["3", "4"]
        }
      }
    }
  }
}
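One possible use of this grouping is to total the time covered per face identity. The sketch below assumes that each `moccs` entry is a time range with `ss` and `se` in seconds, which the example suggests but does not state. The helper name `seconds_per_face` and the key `"face-id-placeholder"` are hypothetical.

```python
def seconds_per_face(metadata: dict) -> dict:
    """Sum the length of the moccs ranges for each similar_to_face_id group."""
    groups = (
        metadata["detection_groupings"]["by_detection_property"]
        .get("human.face", {})
        .get("similar_to_face_id", {})
    )
    return {
        face_id: sum(m["se"] - m["ss"] for m in group["moccs"])
        for face_id, group in groups.items()
    }


metadata = {
    "detection_groupings": {
        "by_detection_property": {
            "human.face": {
                "similar_to_face_id": {
                    # Placeholder key; real keys are face IDs.
                    "face-id-placeholder": {
                        "moccs": [
                            {"ss": 5.0, "se": 10.0},
                            {"ss": 21.0, "se": 35.0},
                        ],
                        "det_ids": ["3", "4"],
                    }
                }
            }
        }
    }
}

totals = seconds_per_face(metadata)
print(totals)
```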

by_frequency

Summarizes simultaneously appearing visual concepts into thematic groups. Contains lists of co-occurring detection IDs.

segmentations

Contains time-based segmentation data, primarily shot boundaries.

{
  "segmentations": {
    "detected_shots": [
      { "ss": 0.083, "se": 5.214, "fs": 0, "fe": 122, "sdur": 5.131 },
      { "ss": 5.214, "se": 10.177, "fs": 123, "fe": 241, "sdur": 4.963 }
    ]
  }
}

See Segmentation & Shots for details on the shot boundary format.
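A common task is finding the shot that contains a given timestamp. The sketch below treats each shot as the half-open interval from `ss` up to but not including `se`; that boundary convention is an assumption, and `shot_index_at` is a hypothetical helper name.

```python
def shot_index_at(shots: list, t: float):
    """Return the index of the shot containing time t, or None if no shot does."""
    for index, shot in enumerate(shots):
        # Half-open interval so a boundary time belongs to one shot only.
        if shot["ss"] <= t < shot["se"]:
            return index
    return None


# Shots copied from the example above.
detected_shots = [
    {"ss": 0.083, "se": 5.214, "fs": 0, "fe": 122, "sdur": 5.131},
    {"ss": 5.214, "se": 10.177, "fs": 123, "fe": 241, "sdur": 4.963},
]

print(shot_index_at(detected_shots, 6.0))
print(shot_index_at(detected_shots, 5.214))
print(shot_index_at(detected_shots, 0.0))
```

For long videos with many shots, a binary search over the `ss` values would avoid the linear scan.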
