Natural Language Scene Descriptions
Valossa AI generates natural language scene descriptions — human-readable text that describes what is visually happening in each scene of a video. Unlike detection labels (which are structured metadata like "car" or "outdoor"), scene descriptions are full sentences such as "A woman walks into a modern office building carrying a laptop bag."
The Metadata Reader CLI tool currently operates on Core metadata only and does not yet support visual_captions. A new list-scene-descriptions mode is planned. For now, use the manual parsing approach shown in this guide.
What Are Scene Descriptions Used For?
| Use Case | Description |
|---|---|
| Video accessibility | Generate alt-text and audio descriptions for visually impaired users |
| Video SEO | Add indexable text to video pages that search engines can crawl |
| Content cataloguing | Auto-generate human-readable summaries for video libraries and DAM systems |
| Video search | Enable natural language search over video archives ("find the scene where someone opens a gift") |
| AI and RAG pipelines | Feed scene descriptions into LLMs or retrieval-augmented generation systems |
| Subtitle enhancement | Combine with speech transcripts for full audio-visual narration |
| Social media | Auto-generate captions and descriptions for video posts |
How It Works
Scene descriptions are delivered as a separate metadata type called visual_captions, downloaded via the job_results endpoint. The visual_captions metadata contains time-coded natural language descriptions covering the entire video.
Video timeline:     [========================================]
Scene descriptions: | "A man walks  | "Close-up of | "Wide shot of |
                    |  through a    |  a document  |  a city       |
                    |  busy street" |  on a desk"  |  skyline"     |
Step 1: Download Scene Description Metadata
Use the type=visual_captions parameter with the job_results endpoint:
curl
curl "https://api-eu.valossa.com/core/1.0/job_results?api_key=YOUR_API_KEY&job_id=JOB_ID&type=visual_captions" \
-o visual_captions.json
Python
import requests
response = requests.get(
    "https://api-eu.valossa.com/core/1.0/job_results",
    params={
        "api_key": "YOUR_API_KEY",
        "job_id": "JOB_ID",
        "type": "visual_captions"
    }
)
captions = response.json()
JavaScript
const response = await fetch(
  "https://api-eu.valossa.com/core/1.0/job_results?" +
    new URLSearchParams({
      api_key: "YOUR_API_KEY",
      job_id: "JOB_ID",
      type: "visual_captions"
    })
);
const captions = await response.json();
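The parsing steps below read the metadata from a local visual_captions.json file, which the curl example already produces with -o. If you download with Python instead, a minimal sketch for persisting the response (continuing from the Python example above, and assuming the endpoint returns standard HTTP status codes) looks like this:
Python
# Surface HTTP errors before writing anything to disk
response.raise_for_status()

# Save the metadata so the parsing steps below can load visual_captions.json
with open("visual_captions.json", "w", encoding="utf-8") as f:
    f.write(response.text)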
Step 2: Parse Scene Descriptions
The visual_captions metadata contains time-coded scene descriptions. Extract them into a chronological timeline:
Python
import json
with open("visual_captions.json", "r") as f:
captions = json.load(f)
# Extract and display scene descriptions chronologically
for caption in captions.get("captions", []):
start = caption.get("ss", 0)
end = caption.get("se", 0)
text = caption.get("text", "")
start_min = int(start // 60)
start_sec = start % 60
end_min = int(end // 60)
end_sec = end % 60
print(f"[{start_min:02d}:{start_sec:05.2f} - {end_min:02d}:{end_sec:05.2f}] {text}")
JavaScript
const fs = require("fs");

const captions = JSON.parse(fs.readFileSync("visual_captions.json", "utf-8"));

for (const caption of captions.captions || []) {
  const start = caption.ss || 0;
  const end = caption.se || 0;
  const text = caption.text || "";

  const fmt = (s) => {
    const m = Math.floor(s / 60).toString().padStart(2, "0");
    const sec = (s % 60).toFixed(2).padStart(5, "0");
    return `${m}:${sec}`;
  };

  console.log(`[${fmt(start)} - ${fmt(end)}] ${text}`);
}
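Once the descriptions are parsed, a simple keyword match over the text already covers the basic "find the scene where..." use case from the table above. The sketch below is a naive substring search over the captions dictionary loaded in the Python example above; the find_scenes helper is just an illustration, and a production setup would typically add full-text indexing or embedding-based retrieval on top.
Python
def find_scenes(captions, query):
    """Return (start, end, text) for captions whose text contains the query."""
    query = query.lower()
    return [
        (c.get("ss", 0), c.get("se", 0), c.get("text", ""))
        for c in captions.get("captions", [])
        if query in c.get("text", "").lower()
    ]

# Example: locate scenes that mention a skyline
for start, end, text in find_scenes(captions, "skyline"):
    print(f"{start:.2f}s - {end:.2f}s: {text}")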
Step 3: Combine with Speech Transcript
For a complete audio-visual understanding of the video, combine scene descriptions with the speech transcript from Core metadata:
import json

# Load both metadata types
with open("visual_captions.json", "r") as f:
    captions_data = json.load(f)

with open("core_metadata.json", "r") as f:
    core = json.load(f)

# Extract scene descriptions
scenes = []
for caption in captions_data.get("captions", []):
    scenes.append({
        "type": "visual",
        "start": caption.get("ss", 0),
        "end": caption.get("se", 0),
        "text": caption.get("text", "")
    })

# Extract speech transcript
speech_ids = core["detection_groupings"]["by_detection_type"].get("audio.speech", [])
for det_id in speech_ids:
    det = core["detections"][det_id]
    if det.get("occs"):
        scenes.append({
            "type": "speech",
            "start": det["occs"][0]["ss"],
            "end": det["occs"][0]["se"],
            "text": det["label"]
        })

# Merge and sort chronologically
scenes.sort(key=lambda x: x["start"])

# Print combined timeline
for s in scenes:
    prefix = "👁" if s["type"] == "visual" else "🗣"
    start_min = int(s["start"] // 60)
    start_sec = s["start"] % 60
    print(f"[{start_min:02d}:{start_sec:05.2f}] {prefix} {s['text']}")
Example output:
[00:00.00] 👁 A news anchor sits at a desk with a city skyline behind them
[00:01.50] 🗣 Good evening and welcome to the six o'clock news
[00:05.20] 👁 Cut to aerial footage of a flooded residential area
[00:06.00] 🗣 Severe flooding has affected thousands of homes across the region
[00:12.40] 👁 Close-up of rescue workers helping residents into a boat
[00:13.10] 🗣 Emergency services have been working around the clock
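The merged timeline is also a convenient input for the AI and RAG use cases listed earlier: serializing it into a single time-coded text block gives a language model both the visual and the spoken context of the video. The snippet below is one possible serialization, continuing from the scenes list built above; the [VISUAL] and [SPEECH] tags are an arbitrary convention, not part of the Valossa output.
# Build one plain-text document from the merged timeline (the scenes list above).
# The tag names are arbitrary; use whatever convention your pipeline expects.
lines = []
for s in scenes:
    tag = "VISUAL" if s["type"] == "visual" else "SPEECH"
    lines.append(f"[{s['start']:.2f}s] [{tag}] {s['text']}")

context_text = "\n".join(lines)

# context_text can now be chunked, embedded, or placed into an LLM prompt
print(context_text[:500])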
Step 4: Export as Structured Data
Export as CSV
import csv

with open("scene_descriptions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["start_seconds", "end_seconds", "description"])
    for caption in captions_data.get("captions", []):
        writer.writerow([
            caption.get("ss", 0),
            caption.get("se", 0),
            caption.get("text", "")
        ])

print("Exported to scene_descriptions.csv")
Export as SRT-style text
def srt_time(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

for i, caption in enumerate(captions_data.get("captions", []), 1):
    start = caption.get("ss", 0)
    end = caption.get("se", 0)
    print(f"{i}")
    print(f"{srt_time(start)} --> {srt_time(end)}")
    print(f"{caption.get('text', '')}")
    print()
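To hand the result to a video player or an accessibility workflow, the same loop can write an actual subtitle file instead of printing to the console. A minimal sketch, reusing srt_time from above; the output file name is arbitrary:
# Write the SRT-formatted scene descriptions to a file (reuses srt_time above)
with open("scene_descriptions.srt", "w", encoding="utf-8") as f:
    for i, caption in enumerate(captions_data.get("captions", []), 1):
        start = caption.get("ss", 0)
        end = caption.get("se", 0)
        f.write(f"{i}\n")
        f.write(f"{srt_time(start)} --> {srt_time(end)}\n")
        f.write(f"{caption.get('text', '')}\n\n")

print("Exported to scene_descriptions.srt")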
Language Support
Scene descriptions are generated based on visual analysis and are produced in English regardless of the media.language setting. The visual analysis itself is language-independent — it describes what is seen in the video frames.
Related Resources
- Metadata Overview — All metadata types including visual_captions
- Speech-to-Text Guide — Extract transcripts to combine with scene descriptions
- Video Tagging Guide — Structured detection labels (complementary to scene descriptions)
- Job Results API — How to download different metadata types
- Metadata Reader — CLI tool for Core metadata exploration (list-scene-descriptions mode planned)