Video Tagging Guide
This guide explains how to extract visual, audio, and color tags from Valossa AI metadata for building video indexes, search systems, and content catalogs.
Before writing custom code, you can quickly list all tags using the Metadata Reader CLI tool:
# List all detections (tags) sorted by relevance
python -m metareader list-detections core_metadata.json
# Get a screentime-based summary of top tags
python -m metareader summary core_metadata.json
# See what's detected second by second
python -m metareader list-detections-by-second core_metadata.json
Overview
Valossa AI produces three main types of content tags:
| Tag Type | Detection Type | Description |
|---|---|---|
| Visual tags | visual.context | Objects, scenes, actions, and visual concepts |
| Audio tags | audio.context | Sounds, music genres, environmental audio |
| Color tags | visual.color | Dominant colors per second |
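All three types are grouped under detection_groupings.by_detection_type, which maps a detection type name to a list of detection IDs. As a quick sanity check, a minimal sketch like the following prints every detection type present in a metadata file along with its detection count:
import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

# by_detection_type maps each detection type name to a list of detection IDs
by_type = metadata["detection_groupings"]["by_detection_type"]

for detection_type, det_ids in sorted(by_type.items()):
    print(f"{detection_type}: {len(det_ids)} detections")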
Extracting Visual Tags
Visual tags (visual.context) are the most numerous and diverse detection type. They cover thousands of concepts: objects, animals, scenes, actions, and more.
import json
with open("core_metadata.json", "r") as f:
metadata = json.load(f)
visual_ids = metadata["detection_groupings"]["by_detection_type"].get("visual.context", [])
# Detections are sorted by relevance (most prominent first)
print(f"Total visual detections: {len(visual_ids)}")
print("\nTop 20 visual tags:")
for det_id in visual_ids[:20]:
detection = metadata["detections"][det_id]
label = detection["label"]
categories = detection.get("categ", {}).get("tags", [])
# Get total screen time from occurrences
total_time = sum(
occ["se"] - occ["ss"]
for occ in detection.get("occs", [])
)
# Get peak confidence
max_confidence = max(
(occ.get("c_max", 0) for occ in detection.get("occs", [])),
default=0
)
print(f" {label} (confidence: {max_confidence:.2f}, screen time: {total_time:.1f}s, categories: {categories})")
Filtering by Category
Use detection categories to filter visual tags by theme:
# Get only food-related visual detections
food_tags = []
for det_id in visual_ids:
    detection = metadata["detections"][det_id]
    if "food_drink" in detection.get("categ", {}).get("tags", []):
        food_tags.append(detection["label"])

print(f"Food tags: {food_tags}")
Using External References
Many visual detections include references to external ontologies:
for det_id in visual_ids[:5]:
detection = metadata["detections"][det_id]
if "ext_refs" in detection:
refs = detection["ext_refs"]
wikidata = refs.get("wikidata", {}).get("id", "N/A")
gkg = refs.get("gkg", {}).get("id", "N/A")
print(f"{detection['label']}: Wikidata={wikidata}, GKG={gkg}")
Extracting Audio Tags
Audio tags (audio.context) detect sounds, music types, and environmental audio events:
audio_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.context", [])
print(f"Total audio detections: {len(audio_ids)}")
print("\nAudio tags:")
for det_id in audio_ids:
detection = metadata["detections"][det_id]
print(f" {detection['label']}")
for occ in detection.get("occs", []):
print(f" {occ['ss']:.1f}s - {occ['se']:.1f}s (confidence: {occ.get('c_max', 'N/A')})")
audio.context detects sounds (music, applause, laughter, etc.), not speech. For speech content, see Speech-to-Text Guide.
Extracting Color Tags
Color data is stored as a single visual.color detection, with per-second color values in the by_second structure:
# Find the visual.color detection ID
color_ids = metadata["detection_groupings"]["by_detection_type"].get("visual.color", [])
if color_ids:
color_det_id = color_ids[0]
for second_idx, second_data in enumerate(metadata["detection_groupings"]["by_second"]):
for item in second_data:
if item["d"] == color_det_id and "a" in item and "rgb" in item["a"]:
colors = item["a"]["rgb"]
dominant = colors[0] if colors else None
if dominant:
print(f"Second {second_idx}: #{dominant['v']} ({dominant['f']*100:.0f}% of frame)")
Each color entry includes:
- v: RGB hex color value (lowercase, e.g., "112c58")
- f: Fraction of the frame covered by this color (0.0 to 1.0)
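To get an overall color profile for the whole video, one rough approach (an illustrative heuristic, not an official Valossa aggregation) is to sum each color's frame fraction across all seconds and rank the totals:
from collections import defaultdict

# Sum each color's per-second frame fraction as a rough "screen coverage" weight
color_weights = defaultdict(float)

if color_ids:
    color_det_id = color_ids[0]
    for second_data in metadata["detection_groupings"]["by_second"]:
        for item in second_data:
            if item["d"] == color_det_id and "rgb" in item.get("a", {}):
                for color in item["a"]["rgb"]:
                    color_weights[color["v"]] += color["f"]

# Top 5 colors by accumulated weight
top_colors = sorted(color_weights.items(), key=lambda kv: kv[1], reverse=True)[:5]
for hex_value, weight in top_colors:
    print(f"#{hex_value}: weight {weight:.1f}")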
Building a Tag Summary
Combine all tag types into a structured summary:
def build_tag_summary(metadata, max_tags=10):
summary = {
"visual": [],
"audio": [],
"topics": []
}
# Visual tags (top N by relevance)
visual_ids = metadata["detection_groupings"]["by_detection_type"].get("visual.context", [])
for det_id in visual_ids[:max_tags]:
det = metadata["detections"][det_id]
summary["visual"].append(det["label"])
# Audio tags
audio_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.context", [])
for det_id in audio_ids[:max_tags]:
det = metadata["detections"][det_id]
summary["audio"].append(det["label"])
# Topic tags
for topic_type in ["topic.general", "topic.iab"]:
topic_ids = metadata["detection_groupings"]["by_detection_type"].get(topic_type, [])
for det_id in topic_ids[:max_tags]:
det = metadata["detections"][det_id]
summary["topics"].append(det["label"])
return summary
tags = build_tag_summary(metadata)
print(json.dumps(tags, indent=2))
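The resulting summary can then be stored alongside the video for search or cataloging. The output filename and document shape below are illustrative only, not part of the Valossa format:
# Write the tag summary to disk so it can be loaded into a search index
# or content catalog (field names here are illustrative only)
index_doc = {
    "source_metadata": "core_metadata.json",
    "tags": tags,
}

with open("video_tags.json", "w") as f:
    json.dump(index_doc, f, indent=2)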
Related Resources
- Detection Types -- Full detection type reference
- Detection Categories -- Category tags for filtering
- Core Concepts -- Understanding detections and occurrences
- Metadata Reader -- CLI tool for quick tag extraction and summaries