Video Tagging Guide
This guide explains how to extract visual, audio, and color tags from Valossa AI metadata for building video indexes, search systems, and content catalogs.
Before writing custom code, you can quickly list all tags using the Metadata Reader CLI tool:
# List all detections (tags) sorted by relevance
python -m metareader list-detections core_metadata.json
# Get a screentime-based summary of top tags
python -m metareader summary core_metadata.json
# See what's detected second by second
python -m metareader list-detections-by-second core_metadata.json
Overview
Valossa AI produces three main types of content tags:
| Tag Type | Detection Type | Description |
|---|---|---|
| Visual tags | visual.context | Objects, scenes, actions, and visual concepts |
| Audio tags | audio.context | Sounds, music genres, environmental audio |
| Color tags | visual.color | Dominant colors per second |
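All three types are grouped under detection_groupings.by_detection_type, which maps a detection type name to a list of detection IDs. As a quick sanity check, a minimal sketch like the following prints every detection type present in a metadata file along with its detection count:
import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

# by_detection_type maps each detection type name to a list of detection IDs
by_type = metadata["detection_groupings"]["by_detection_type"]

for detection_type, det_ids in sorted(by_type.items()):
    print(f"{detection_type}: {len(det_ids)} detections")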
Extracting Visual Tags
Visual tags (visual.context) are the most numerous and diverse detection type. They cover thousands of concepts: objects, animals, scenes, actions, and more.
import json
with open("core_metadata.json", "r") as f:
metadata = json.load(f)
visual_ids = metadata["detection_groupings"]["by_detection_type"].get("visual.context", [])
# Detections are sorted by relevance (most prominent first)
print(f"Total visual detections: {len(visual_ids)}")
print("\nTop 20 visual tags:")
for det_id in visual_ids[:20]:
detection = metadata["detections"][det_id]
label = detection["label"]
categories = detection.get("categ", {}).get("tags", [])
# Get total screen time from occurrences
total_time = sum(
occ["se"] - occ["ss"]
for occ in detection.get("occs", [])
)
# Get peak confidence
max_confidence = max(
(occ.get("c_max", 0) for occ in detection.get("occs", [])),
default=0
)
print(f" {label} (confidence: {max_confidence:.2f}, screen time: {total_time:.1f}s, categories: {categories})")
Filtering by Category
Use detection categories to filter visual tags by theme:
# Get only food-related visual detections
food_tags = []
for det_id in visual_ids:
    detection = metadata["detections"][det_id]
    if "food_drink" in detection.get("categ", {}).get("tags", []):
        food_tags.append(detection["label"])

print(f"Food tags: {food_tags}")
Using External References
Many visual detections include references to external ontologies:
for det_id in visual_ids[:5]:
detection = metadata["detections"][det_id]
if "ext_refs" in detection:
refs = detection["ext_refs"]
wikidata = refs.get("wikidata", {}).get("id", "N/A")
gkg = refs.get("gkg", {}).get("id", "N/A")
print(f"{detection['label']}: Wikidata={wikidata}, GKG={gkg}")
Extracting Audio Tags
Audio tags (audio.context) detect sounds, music types, and environmental audio events:
audio_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.context", [])
print(f"Total audio detections: {len(audio_ids)}")
print("\nAudio tags:")
for det_id in audio_ids:
detection = metadata["detections"][det_id]
print(f" {detection['label']}")
for occ in detection.get("occs", []):
print(f" {occ['ss']:.1f}s - {occ['se']:.1f}s (confidence: {occ.get('c_max', 'N/A')})")
audio.context detects sounds (music, applause, laughter, etc.), not speech. For speech content, see Speech-to-Text Guide.
Extracting Color Tags
Color data is stored as a single visual.color detection, with per-second color values in the by_second structure:
# Find the visual.color detection ID
color_ids = metadata["detection_groupings"]["by_detection_type"].get("visual.color", [])
if color_ids:
color_det_id = color_ids[0]
for second_idx, second_data in enumerate(metadata["detection_groupings"]["by_second"]):
for item in second_data:
if item["d"] == color_det_id and "a" in item and "rgb" in item["a"]:
colors = item["a"]["rgb"]
dominant = colors[0] if colors else None
if dominant:
print(f"Second {second_idx}: #{dominant['v']} ({dominant['f']*100:.0f}% of frame)")
Each color entry includes:
- v: RGB hex color value (lowercase, e.g., "112c58")
- f: Fraction of the frame covered by this color (0.0 to 1.0)
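To get an overall color profile for the whole video, one rough approach (an illustrative heuristic, not an official Valossa aggregation) is to sum each color's frame fraction across all seconds and rank the totals:
from collections import defaultdict

# Sum each color's per-second frame fraction as a rough "screen coverage" weight
color_weights = defaultdict(float)

if color_ids:
    color_det_id = color_ids[0]
    for second_data in metadata["detection_groupings"]["by_second"]:
        for item in second_data:
            if item["d"] == color_det_id and "rgb" in item.get("a", {}):
                for color in item["a"]["rgb"]:
                    color_weights[color["v"]] += color["f"]

# Top 5 colors by accumulated weight
top_colors = sorted(color_weights.items(), key=lambda kv: kv[1], reverse=True)[:5]
for hex_value, weight in top_colors:
    print(f"#{hex_value}: weight {weight:.1f}")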
Building a Tag Summary
Combine all tag types into a structured summary:
def build_tag_summary(metadata, max_tags=10):
summary = {
"visual": [],
"audio": [],
"topics": []
}
# Visual tags (top N by relevance)
visual_ids = metadata["detection_groupings"]["by_detection_type"].get("visual.context", [])
for det_id in visual_ids[:max_tags]:
det = metadata["detections"][det_id]
summary["visual"].append(det["label"])
# Audio tags
audio_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.context", [])
for det_id in audio_ids[:max_tags]:
det = metadata["detections"][det_id]
summary["audio"].append(det["label"])
# Topic tags
for topic_type in ["topic.general", "topic.iab"]:
topic_ids = metadata["detection_groupings"]["by_detection_type"].get(topic_type, [])
for det_id in topic_ids[:max_tags]:
det = metadata["detections"][det_id]
summary["topics"].append(det["label"])
return summary
tags = build_tag_summary(metadata)
print(json.dumps(tags, indent=2))
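The resulting summary can then be stored alongside the video for search or cataloging. The output filename and document shape below are illustrative only, not part of the Valossa format:
# Write the tag summary to disk so it can be loaded into a search index
# or content catalog (field names here are illustrative only)
index_doc = {
    "source_metadata": "core_metadata.json",
    "tags": tags,
}

with open("video_tags.json", "w") as f:
    json.dump(index_doc, f, indent=2)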
Related Resources
- Detection Types -- Full detection type reference
- Detection Categories -- Category tags for filtering
- Core Concepts -- Understanding detections and occurrences
- Metadata Reader -- CLI tool for quick tag extraction and summaries