Occurrences

Occurrences represent the time segments during which a detection is present in the video. They are stored in the occs array of each detection.

Occurrence Fields

Field | Type | Description
id | string | Unique occurrence ID within this metadata file
ss | float | Start second: seconds from the beginning of the video
se | float | End second: seconds from the beginning of the video
c_max | float | Maximum confidence during this occurrence (for visual.context and audio.context)
c_med | float | Median confidence during this occurrence (for visual.context). Complements c_max by showing the typical confidence rather than the peak.
c | float | Confidence for the entire occurrence (for topic.iab.section)
shs | integer | Shot start: 0-based index into the detected_shots array in segmentations
she | integer | Shot end: 0-based index into the detected_shots array
fs | integer | Frame start: 0-based frame number (for audio.speech_detailed and visual.text_region.keyword.compliance)
fe | integer | Frame end: 0-based frame number
sdur | float | Segment duration in seconds (for audio.speech_detailed). Equals se - ss.
a | object | Per-occurrence attributes (for topic.iab.section). See Type-Specific Occurrence Fields below.
scr | float | Score for the occurrence (for highlight detections). Ranges from 0.0 to 1.0.

Example

A visual.context detection for "umbrella" that appears twice in the video:

{
"t": "visual.context",
"label": "umbrella",
"cid": "abc123",
"occs": [
{
"id": "45",
"ss": 0.3,
"se": 3.6,
"c_max": 0.912,
"c_med": 0.87,
"shs": 0,
"she": 1
},
{
"id": "46",
"ss": 64.4,
"se": 68.2,
"c_max": 0.876,
"c_med": 0.801,
"shs": 22,
"she": 23
}
]
}

This means:

  • First appearance: from 0.3s to 3.6s, with peak confidence 0.912 (median 0.87), spanning shots 0-1
  • Second appearance: from 64.4s to 68.2s, with peak confidence 0.876 (median 0.801), spanning shots 22-23

Confidence Behavior

Confidence values differ by detection type:

  • visual.context and audio.context: Occurrences have c_max (the maximum confidence observed at any frame/second during the occurrence) and c_med (the median confidence). Since confidence varies from moment to moment, these summary statistics are reported in the occurrence. For per-second confidence, use the c field in the by_second structure.
  • topic.iab.section: Occurrences have c (a single confidence for the entire occurrence).
  • human.face: Occurrences do not have a confidence field. Face confidence is only available in the similar_to items (gallery match confidence).
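
The per-type rules above can be sketched as a small lookup helper. This is a minimal illustration, not part of the metadata format: the function name occurrence_confidence and the (field, value) return shape are assumptions, and the human.face occurrence is made up.

```python
def occurrence_confidence(detection_type, occ):
    """Return (field_name, value) for the confidence carried by one occurrence.

    Hypothetical helper: the field choice follows the list above.
    """
    if detection_type in ("visual.context", "audio.context"):
        # Peak confidence; c_med may be present alongside it.
        return ("c_max", occ.get("c_max"))
    if detection_type == "topic.iab.section":
        # A single confidence for the whole occurrence.
        return ("c", occ.get("c"))
    # human.face and other types: no confidence on the occurrence itself.
    return (None, None)


umbrella_occ = {"id": "45", "ss": 0.3, "se": 3.6, "c_max": 0.912, "c_med": 0.87}
section_occ = {"id": "1687", "ss": 7.262, "se": 31.642, "c": 0.779}
face_occ = {"id": "7", "ss": 1.0, "se": 2.0}  # made-up occurrence for illustration

print(occurrence_confidence("visual.context", umbrella_occ))
print(occurrence_confidence("topic.iab.section", section_occ))
print(occurrence_confidence("human.face", face_occ))
```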

Type-Specific Occurrence Fields

Some detection types include additional fields on their occurrences beyond the common set.

audio.speech_detailed

Word-level speech occurrences include frame-level timing and duration:

{
"id": "1064",
"ss": 7.262,
"se": 7.482,
"fs": 174,
"fe": 179,
"sdur": 0.22,
"shs": 5,
"she": 5
}

Field | Description
fs | Start frame number (0-based)
fe | End frame number (0-based)
sdur | Segment duration in seconds
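
Because sdur is described as equal to se - ss, a quick consistency check is possible. The sketch below uses the word-level occurrence shown above; the helper name duration_matches and the 1 ms tolerance are assumptions chosen for illustration.

```python
import math

word_occ = {
    "id": "1064", "ss": 7.262, "se": 7.482,
    "fs": 174, "fe": 179, "sdur": 0.22, "shs": 5, "she": 5,
}


def duration_matches(occ, tolerance=1e-3):
    """Check that sdur agrees with se - ss within a small tolerance.

    Floating-point subtraction rarely gives an exact match, so compare
    with an absolute tolerance instead of ==.
    """
    return math.isclose(occ["sdur"], occ["se"] - occ["ss"], abs_tol=tolerance)


print(duration_matches(word_occ))
```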

visual.text_region.keyword.compliance

OCR compliance keyword occurrences also include frame references:

{
"id": "902",
"ss": 9.009,
"se": 13.513,
"fs": 216,
"fe": 323,
"shs": 6,
"she": 7
}

topic.iab.section

IAB section occurrences include per-occurrence attributes indicating the source modalities and optional ad score:

{
"id": "1687",
"ss": 7.262,
"se": 31.642,
"c": 0.779,
"a": {
"sources": ["speech"],
"ad_score": 0.85
},
"shs": 5,
"she": 10
}

Field | Description
a.sources | Array of modalities that contributed to this classification. Values: "speech", "visual", "ocr"
a.ad_score | Optional. Ad suitability score for this section (0.0 to 1.0)
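
A minimal sketch of filtering IAB section occurrences by contributing modality and, optionally, by ad score. The helper name sections_from_source and the second occurrence are made up for illustration. Since a.ad_score is optional, occurrences without it are skipped whenever a minimum score is requested.

```python
def sections_from_source(occs, source, min_ad_score=None):
    """Select occurrences whose a.sources contains the given modality."""
    selected = []
    for occ in occs:
        attrs = occ.get("a", {})
        if source not in attrs.get("sources", []):
            continue
        if min_ad_score is not None:
            ad_score = attrs.get("ad_score")  # optional field, may be absent
            if ad_score is None or ad_score < min_ad_score:
                continue
        selected.append(occ)
    return selected


section_occs = [
    {"id": "1687", "ss": 7.262, "se": 31.642, "c": 0.779,
     "a": {"sources": ["speech"], "ad_score": 0.85}, "shs": 5, "she": 10},
    # Made-up occurrence without an ad score
    {"id": "1700", "ss": 40.0, "se": 55.0, "c": 0.6,
     "a": {"sources": ["visual", "ocr"]}, "shs": 12, "she": 15},
]

speech_ids = [occ["id"] for occ in sections_from_source(section_occs, "speech")]
scored_visual = sections_from_source(section_occs, "visual", min_ad_score=0.5)
print(speech_ids, scored_visual)
```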

highlight

Highlight occurrences include a score indicating the significance of the segment:

{
"id": "12932",
"ss": 3.545,
"se": 43.377,
"shs": 3,
"she": 51,
"scr": 1.0
}

Field | Description
scr | Highlight score (0.0 to 1.0), indicating how significant or interesting the segment is
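
One way to use scr is to rank highlight occurrences and keep the strongest ones. The sketch below is illustrative only: top_highlights is a hypothetical helper, and the two lower-scoring occurrences are made up.

```python
def top_highlights(occs, limit=3):
    """Return up to `limit` highlight occurrences, highest scr first."""
    return sorted(occs, key=lambda occ: occ["scr"], reverse=True)[:limit]


highlight_occs = [
    {"id": "a", "ss": 50.0, "se": 60.0, "scr": 0.4},   # made up
    {"id": "12932", "ss": 3.545, "se": 43.377, "shs": 3, "she": 51, "scr": 1.0},
    {"id": "b", "ss": 70.0, "se": 75.0, "scr": 0.75},  # made up
]

best = top_highlights(highlight_occs, limit=2)
for occ in best:
    print(f"{occ['ss']}s - {occ['se']}s (score: {occ['scr']})")
```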

Which Detections Have Occurrences?

Not all detection types include occurrences:

Has Occurrences | Detection Types
Yes | visual.context, visual.object.localized, audio.context, audio.speech, audio.speech_detailed, audio.voice_emotion, audio.keyword.*, transcript.keyword.*, human.face, human.face_group, topic.iab.section, visual.text_region.*, highlight
No | topic.iab, topic.general, topic.genre, external.keyword.*, visual.color

Detections without occurrences are either video-level (topics, external keywords) or have their temporal data represented differently (color is in by_second only).
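
Since some detections carry no occs array, code that walks occurrences should tolerate its absence. A minimal sketch, assuming detections are stored as a dict keyed by detection ID (as in the code examples on this page); the helper name and both sample detections are illustrative.

```python
def occurrences_of(detection):
    """Return the occurrence list, or an empty list for video-level detections."""
    return detection.get("occs", [])


detections = {
    "1": {"t": "visual.context", "label": "umbrella",
          "occs": [{"id": "45", "ss": 0.3, "se": 3.6}]},
    "2": {"t": "topic.iab", "label": "Automotive"},  # made up; no occs array
}

with_occs = [det_id for det_id, det in detections.items() if occurrences_of(det)]
print(with_occs)
```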

Shot References

The shs and she fields connect occurrences to the shot boundary data in segmentations.detected_shots. This is useful for workflows that need to clip entire shots rather than just occurrence segments.

For example, when detecting inappropriate content, you might want to flag the entire shot containing the occurrence rather than just the specific seconds.

# Get the shot that contains the start of an occurrence
shot_index = occurrence["shs"]
shot = metadata["segmentations"]["detected_shots"][shot_index]
shot_start = shot["ss"]
shot_end = shot["se"]
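
When an occurrence spans several shots, the clip range runs from the start of the shs shot to the end of the she shot. The sketch below assumes each shot entry has ss and se fields, as in the snippet above; the shot boundaries are made up and shot_span is a hypothetical helper.

```python
def shot_span(occurrence, detected_shots):
    """Return (start, end) in seconds covering every shot the occurrence touches."""
    first_shot = detected_shots[occurrence["shs"]]
    last_shot = detected_shots[occurrence["she"]]
    return first_shot["ss"], last_shot["se"]


# Made-up shot boundaries for illustration only
detected_shots = [
    {"ss": 0.0, "se": 2.1},
    {"ss": 2.1, "se": 5.4},
    {"ss": 5.4, "se": 9.0},
]
occurrence = {"id": "45", "ss": 0.3, "se": 3.6, "shs": 0, "she": 1}

clip_start, clip_end = shot_span(occurrence, detected_shots)
print(f"Clip {clip_start}s - {clip_end}s")
```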

Occurrences vs. by_second

The two structures answer different questions:

Structure | Question Answered | Access Pattern
occs (in a detection) | "When does this specific thing appear?" | Start from a detection, read its time segments
by_second (in detection_groupings) | "What is happening at this specific second?" | Start from a time position, see all active detections

When to Use Occurrences

Use occurrences when you need to:

  • Find all time segments where a specific concept appears
  • Calculate total screen time for a detection
  • Generate clips containing a specific object or person
  • Get the peak confidence for a detection's appearance
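
Two of the tasks above, total screen time and peak confidence, can be sketched with the "umbrella" detection from the earlier example. The helper names are illustrative, and the sum assumes the occurrences of a single detection do not overlap in time.

```python
def total_screen_time(detection):
    """Sum the durations of all occurrences, in seconds."""
    return sum(occ["se"] - occ["ss"] for occ in detection.get("occs", []))


def peak_confidence(detection):
    """Highest c_max across occurrences, or None if no occurrence has one."""
    peaks = [occ["c_max"] for occ in detection.get("occs", []) if "c_max" in occ]
    return max(peaks) if peaks else None


umbrella = {
    "t": "visual.context",
    "label": "umbrella",
    "occs": [
        {"id": "45", "ss": 0.3, "se": 3.6, "c_max": 0.912, "c_med": 0.87},
        {"id": "46", "ss": 64.4, "se": 68.2, "c_max": 0.876, "c_med": 0.801},
    ],
}

print(f"Screen time: {total_screen_time(umbrella):.1f}s")
print(f"Peak confidence: {peak_confidence(umbrella)}")
```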

When to Use by_second

Use by_second when you need to:

  • Build a timeline view of all detections
  • Find what is happening at a specific moment in the video
  • Read per-second confidence values
  • Access rapidly changing data (face emotions, colors) that only makes sense on a per-second basis
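
Reading per-second confidence for one detection can be sketched as a scan over by_second. This assumes by_second is a list indexed by second, where each entry is a list of items with a d field (detection ID) and an optional c field, as in the code examples on this page. The sample data and the helper name confidence_timeline are made up.

```python
def confidence_timeline(by_second, detection_id):
    """Collect (second, confidence) pairs for one detection."""
    timeline = []
    for second, items in enumerate(by_second):
        for item in items:
            # Skip items for other detections and items without a confidence
            if item["d"] == detection_id and "c" in item:
                timeline.append((second, item["c"]))
    return timeline


# Made-up by_second data: one list of items per second of video
by_second = [
    [{"d": "1", "c": 0.81}, {"d": "2"}],
    [{"d": "1", "c": 0.91}],
    [],
    [{"d": "2", "c": 0.5}],
]

timeline = confidence_timeline(by_second, "1")
print(timeline)
```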

Code Example

Python: List All Occurrences of a Detection

import json

# Load the metadata file for the video
with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

# Look up the IDs of all visual.context detections
visual_ids = metadata["detection_groupings"]["by_detection_type"].get("visual.context", [])

for det_id in visual_ids[:5]:  # First 5 detections
    detection = metadata["detections"][det_id]
    print(f"\n{detection['label']}:")
    # Not every detection type has an occs array
    if "occs" in detection:
        for occ in detection["occs"]:
            print(f"  {occ['ss']:.1f}s - {occ['se']:.1f}s (confidence: {occ.get('c_max', 'N/A')})")

Python: What is Happening at Second 45?

# Reuses the metadata dict loaded in the previous example
second = 45
by_second = metadata["detection_groupings"]["by_second"]

# Guard against videos shorter than the requested second
if second < len(by_second):
    for item in by_second[second]:
        det = metadata["detections"][item["d"]]
        confidence = item.get("c", "N/A")  # c is not present for every item
        print(f"  {det['t']}: {det['label']} (confidence: {confidence})")