Natural Language Scene Descriptions
Valossa AI generates natural language scene descriptions — human-readable text that describes what is visually happening in each scene of a video. Unlike detection labels (which are structured metadata like "car" or "outdoor"), scene descriptions are full sentences such as "A woman walks into a modern office building carrying a laptop bag."
The Metadata Reader CLI tool currently operates on Core metadata only and does not yet support visual_captions. A new list-scene-descriptions mode is planned. For now, use the manual parsing approach shown in this guide.
What Are Scene Descriptions Used For?
| Use Case | Description |
|---|---|
| Video accessibility | Generate alt-text and audio descriptions for visually impaired users |
| Video SEO | Add indexable text to video pages that search engines can crawl |
| Content cataloguing | Auto-generate human-readable summaries for video libraries and DAM systems |
| Video search | Enable natural language search over video archives ("find the scene where someone opens a gift") |
| AI and RAG pipelines | Feed scene descriptions into LLMs or retrieval-augmented generation systems (see the sketch after this table) |
| Subtitle enhancement | Combine with speech transcripts for full audio-visual narration |
| Social media | Auto-generate captions and descriptions for video posts |
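For the video search and RAG use cases above, the time-coded captions can be flattened into plain-text chunks before indexing. The snippet below is a minimal sketch of that preparation step, assuming a visual_captions.json file downloaded as shown in Step 1; the chunk format is arbitrary and the actual indexing backend (search engine, vector store, LLM prompt) is left out.
import json

# Assumption: visual_captions.json has been downloaded as shown in Step 1 below.
with open("visual_captions.json", "r") as f:
    captions = json.load(f)

# Build one plain-text chunk per scene description, keeping the time range
# as metadata so search results can link back to the right video position.
chunks = []
for item in captions.get("selected_sections", []):
    section = item.get("section", {})
    chunks.append({
        "text": item.get("caption", ""),
        "start_seconds": section.get("s_start", 0),
        "end_seconds": section.get("s_end", 0),
    })

# The chunks can now be embedded, indexed, or inserted into an LLM prompt;
# printing them here simply shows the prepared data.
for chunk in chunks:
    print(chunk)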
How It Works
Scene descriptions are delivered as a separate metadata type called visual_captions, downloaded via the job_results endpoint. The visual_captions metadata contains time-coded natural language descriptions in a selected_sections array.
Video timeline:     [==========================================]
Scene descriptions: | "A man walks   | "Close-up of   | "Wide shot of  |
                    |  through a     |  a document    |  a city        |
                    |  busy street"  |  on a desk"    |  skyline"      |
Step 1: Download Scene Description Metadata
Use the type=visual_captions parameter with the job_results endpoint:
curl
curl "https://api-eu.valossa.com/core/1.0/job_results?api_key=YOUR_API_KEY&job_id=JOB_ID&type=visual_captions" \
-o visual_captions.json
Python
import requests
response = requests.get(
    "https://api-eu.valossa.com/core/1.0/job_results",
    params={
        "api_key": "YOUR_API_KEY",
        "job_id": "JOB_ID",
        "type": "visual_captions"
    }
)
captions = response.json()
JavaScript
const response = await fetch(
  "https://api-eu.valossa.com/core/1.0/job_results?" +
    new URLSearchParams({
      api_key: "YOUR_API_KEY",
      job_id: "JOB_ID",
      type: "visual_captions"
    })
);
const captions = await response.json();
Step 2: Parse Scene Descriptions
The visual_captions payload uses selected_sections, where each item contains a caption string and a nested section object with s_start / s_end time fields:
{
  "selected_sections": [
    {
      "caption": "A close-up view of a car's side, focusing on the area near the rear window and the roof.",
      "section": {
        "s_start": 6.88,
        "s_end": 9.96,
        "shot_index": 3
      }
    }
  ]
}
Extract them into a chronological timeline:
Python
import json

with open("visual_captions.json", "r") as f:
    captions = json.load(f)

# Extract and display scene descriptions chronologically
for item in captions.get("selected_sections", []):
    start = item.get("section", {}).get("s_start", 0)
    end = item.get("section", {}).get("s_end", 0)
    text = item.get("caption", "")
    start_min = int(start // 60)
    start_sec = start % 60
    end_min = int(end // 60)
    end_sec = end % 60
    print(f"[{start_min:02d}:{start_sec:05.2f} - {end_min:02d}:{end_sec:05.2f}] {text}")
JavaScript
const fs = require("fs");

const captions = JSON.parse(fs.readFileSync("visual_captions.json", "utf-8"));

for (const item of captions.selected_sections || []) {
  const start = item.section?.s_start || 0;
  const end = item.section?.s_end || 0;
  const text = item.caption || "";
  const fmt = (s) => {
    const m = Math.floor(s / 60).toString().padStart(2, "0");
    const sec = (s % 60).toFixed(2).padStart(5, "0");
    return `${m}:${sec}`;
  };
  console.log(`[${fmt(start)} - ${fmt(end)}] ${text}`);
}
Step 3: Combine with Speech Transcript
For a complete audio-visual understanding of the video, combine scene descriptions with the speech transcript from Core metadata:
import json

# Load both metadata types
with open("visual_captions.json", "r") as f:
    captions_data = json.load(f)
with open("core_metadata.json", "r") as f:
    core = json.load(f)

# Extract scene descriptions
scenes = []
for item in captions_data.get("selected_sections", []):
    scenes.append({
        "type": "visual",
        "start": item.get("section", {}).get("s_start", 0),
        "end": item.get("section", {}).get("s_end", 0),
        "text": item.get("caption", "")
    })

# Extract speech transcript
speech_ids = core["detection_groupings"]["by_detection_type"].get("audio.speech", [])
for det_id in speech_ids:
    det = core["detections"][det_id]
    if det.get("occs"):
        scenes.append({
            "type": "speech",
            "start": det["occs"][0]["ss"],
            "end": det["occs"][0]["se"],
            "text": det["label"]
        })

# Merge and sort chronologically
scenes.sort(key=lambda x: x["start"])

# Print combined timeline
for s in scenes:
    prefix = "👁" if s["type"] == "visual" else "🗣"
    start_min = int(s["start"] // 60)
    start_sec = s["start"] % 60
    print(f"[{start_min:02d}:{start_sec:05.2f}] {prefix} {s['text']}")
Example output:
[00:00.00] 👁 A news anchor sits at a desk with a city skyline behind them
[00:01.50] 🗣 Good evening and welcome to the six o'clock news
[00:05.20] 👁 Cut to aerial footage of a flooded residential area
[00:06.00] 🗣 Severe flooding has affected thousands of homes across the region
[00:12.40] 👁 Close-up of rescue workers helping residents into a boat
[00:13.10] 🗣 Emergency services have been working around the clock
Step 4: Export as Structured Data
Export as CSV
import csv

# captions_data is the visual_captions payload loaded in Step 3
with open("scene_descriptions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["start_seconds", "end_seconds", "description"])
    for item in captions_data.get("selected_sections", []):
        writer.writerow([
            item.get("section", {}).get("s_start", 0),
            item.get("section", {}).get("s_end", 0),
            item.get("caption", "")
        ])

print("Exported to scene_descriptions.csv")
Export as SRT-style text
def srt_time(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

for i, item in enumerate(captions_data.get("selected_sections", []), 1):
    start = item.get("section", {}).get("s_start", 0)
    end = item.get("section", {}).get("s_end", 0)
    print(f"{i}")
    print(f"{srt_time(start)} --> {srt_time(end)}")
    print(f"{item.get('caption', '')}")
    print()
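To write an actual .srt file instead of printing to stdout, the same loop can stream the blocks to disk. This is a minimal sketch under the same assumptions as above; the file name scene_descriptions.srt is arbitrary.
with open("scene_descriptions.srt", "w", encoding="utf-8") as srt_file:
    for i, item in enumerate(captions_data.get("selected_sections", []), 1):
        start = item.get("section", {}).get("s_start", 0)
        end = item.get("section", {}).get("s_end", 0)
        # Reuses the srt_time() helper defined above
        srt_file.write(f"{i}\n")
        srt_file.write(f"{srt_time(start)} --> {srt_time(end)}\n")
        srt_file.write(f"{item.get('caption', '')}\n\n")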
Time Field Naming Note
Most Core metadata time ranges use ss and se. visual_captions is an intentional exception and uses section.s_start and section.s_end. If you reuse Core-metadata parsing helpers, make sure they account for this different naming convention.
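If you reuse a helper that expects ss / se, one option is a small wrapper that falls back to the visual_captions field names. The function below is a hypothetical sketch (it is not part of the Valossa API or the Metadata Reader), but it illustrates handling both conventions:
def get_time_range(obj):
    # Hypothetical helper: reads ss/se (Core metadata occurrences) or
    # s_start/s_end (visual_captions section objects), whichever is present.
    start = obj.get("ss", obj.get("s_start"))
    end = obj.get("se", obj.get("s_end"))
    return start, end

# Usage with a visual_captions item:
#   start, end = get_time_range(item["section"])
# Usage with a Core metadata occurrence:
#   start, end = get_time_range(detection["occs"][0])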
Language Support
Scene descriptions are generated based on visual analysis and are produced in English regardless of the media.language setting. The visual analysis itself is language-independent — it describes what is seen in the video frames.
Related Resources
- Metadata Overview — All metadata types including visual_captions
- Speech-to-Text Guide — Extract transcripts to combine with scene descriptions
- Video Tagging Guide — Structured detection labels (complementary to scene descriptions)
- Job Results API — How to download different metadata types
- Metadata Reader — CLI tool for Core metadata exploration (list-scene-descriptions mode planned)