
Natural Language Scene Descriptions

Available on Transcribe Pro Vision MAX

Visual scene descriptions are fully included in the Transcribe Pro Vision MAX free trial.

Valossa AI generates natural language scene descriptions — human-readable text that describes what is visually happening in each scene of a video. Unlike detection labels (which are structured metadata like "car" or "outdoor"), scene descriptions are full sentences such as "A woman walks into a modern office building carrying a laptop bag."

Metadata Reader Support

The Metadata Reader CLI tool currently operates on Core metadata only and does not yet support visual_captions. A new list-scene-descriptions mode is planned. For now, use the manual parsing approach shown in this guide.

What Are Scene Descriptions Used For?

Use case               Description
Video accessibility    Generate alt-text and audio descriptions for visually impaired users
Video SEO              Add indexable text to video pages that search engines can crawl
Content cataloguing    Auto-generate human-readable summaries for video libraries and DAM systems
Video search           Enable natural language search over video archives ("find the scene where someone opens a gift")
AI and RAG pipelines   Feed scene descriptions into LLMs or retrieval-augmented generation systems
Subtitle enhancement   Combine with speech transcripts for full audio-visual narration
Social media           Auto-generate captions and descriptions for video posts
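
The video search use case above can be sketched with a few lines of Python: a minimal keyword match over already-extracted scene descriptions. The sample data and the search_scenes helper are illustrative, not part of the Valossa API; real natural language search would typically use embeddings rather than substring matching.

```python
def search_scenes(scenes, query):
    """Return scenes whose caption contains every word of the query."""
    words = query.lower().split()
    return [s for s in scenes if all(w in s[2].lower() for w in words)]

# Illustrative (start, end, caption) tuples, as produced by the parsing
# step later in this guide.
scenes = [
    (0.0, 4.2, "A woman opens a gift box at a kitchen table"),
    (4.2, 9.8, "Wide shot of a city skyline at dusk"),
]

for start, end, caption in search_scenes(scenes, "opens a gift"):
    print(f"{start:.1f}-{end:.1f}s: {caption}")
```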

How It Works

Scene descriptions are delivered as a separate metadata type called visual_captions, downloaded via the job_results endpoint. The visual_captions metadata contains time-coded natural language descriptions in a selected_sections array.

Video timeline:     [==========================================]
Scene descriptions: | "A man walks  | "Close-up of | "Wide shot  |
                    |  through a    |  a document  |  of a city  |
                    |  busy street" |  on a desk"  |  skyline"   |

Step 1: Download Scene Description Metadata

Use the type=visual_captions parameter with the job_results endpoint:

curl

curl "https://api-eu.valossa.com/core/1.0/job_results?api_key=YOUR_API_KEY&job_id=JOB_ID&type=visual_captions" \
  -o visual_captions.json

Python

import requests

response = requests.get(
    "https://api-eu.valossa.com/core/1.0/job_results",
    params={
        "api_key": "YOUR_API_KEY",
        "job_id": "JOB_ID",
        "type": "visual_captions"
    }
)

captions = response.json()

JavaScript

const response = await fetch(
  "https://api-eu.valossa.com/core/1.0/job_results?" +
    new URLSearchParams({
      api_key: "YOUR_API_KEY",
      job_id: "JOB_ID",
      type: "visual_captions"
    })
);

const captions = await response.json();

Step 2: Parse Scene Descriptions

The visual_captions payload uses selected_sections, where each item contains a caption string and a nested section object with s_start / s_end time fields:

{
  "selected_sections": [
    {
      "caption": "A close-up view of a car's side, focusing on the area near the rear window and the roof.",
      "section": {
        "s_start": 6.88,
        "s_end": 9.96,
        "shot_index": 3
      }
    }
  ]
}

Extract them into a chronological timeline:

Python

import json

with open("visual_captions.json", "r") as f:
    captions = json.load(f)

# Extract and display scene descriptions chronologically
for item in captions.get("selected_sections", []):
    start = item.get("section", {}).get("s_start", 0)
    end = item.get("section", {}).get("s_end", 0)
    text = item.get("caption", "")

    start_min = int(start // 60)
    start_sec = start % 60
    end_min = int(end // 60)
    end_sec = end % 60

    print(f"[{start_min:02d}:{start_sec:05.2f} - {end_min:02d}:{end_sec:05.2f}] {text}")

JavaScript

const fs = require("fs");
const captions = JSON.parse(fs.readFileSync("visual_captions.json", "utf-8"));

for (const item of captions.selected_sections || []) {
  const start = item.section?.s_start || 0;
  const end = item.section?.s_end || 0;
  const text = item.caption || "";

  const fmt = (s) => {
    const m = Math.floor(s / 60).toString().padStart(2, "0");
    const sec = (s % 60).toFixed(2).padStart(5, "0");
    return `${m}:${sec}`;
  };

  console.log(`[${fmt(start)} - ${fmt(end)}] ${text}`);
}

Step 3: Combine with Speech Transcript

For a complete audio-visual understanding of the video, combine scene descriptions with the speech transcript from Core metadata:

import json

# Load both metadata types
with open("visual_captions.json", "r") as f:
    captions_data = json.load(f)

with open("core_metadata.json", "r") as f:
    core = json.load(f)

# Extract scene descriptions
scenes = []
for item in captions_data.get("selected_sections", []):
    scenes.append({
        "type": "visual",
        "start": item.get("section", {}).get("s_start", 0),
        "end": item.get("section", {}).get("s_end", 0),
        "text": item.get("caption", "")
    })

# Extract speech transcript
speech_ids = core["detection_groupings"]["by_detection_type"].get("audio.speech", [])
for det_id in speech_ids:
    det = core["detections"][det_id]
    if det.get("occs"):
        scenes.append({
            "type": "speech",
            "start": det["occs"][0]["ss"],
            "end": det["occs"][0]["se"],
            "text": det["label"]
        })

# Merge and sort chronologically
scenes.sort(key=lambda x: x["start"])

# Print combined timeline
for s in scenes:
    prefix = "👁" if s["type"] == "visual" else "🗣"
    start_min = int(s["start"] // 60)
    start_sec = s["start"] % 60
    print(f"[{start_min:02d}:{start_sec:05.2f}] {prefix} {s['text']}")

Example output:

[00:00.00] 👁 A news anchor sits at a desk with a city skyline behind them
[00:01.50] 🗣 Good evening and welcome to the six o'clock news
[00:05.20] 👁 Cut to aerial footage of a flooded residential area
[00:06.00] 🗣 Severe flooding has affected thousands of homes across the region
[00:12.40] 👁 Close-up of rescue workers helping residents into a boat
[00:13.10] 🗣 Emergency services have been working around the clock
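
For the AI and RAG use case, a merged timeline like the one above can be flattened into tagged plain text before it is handed to an LLM. The sketch below is one possible formatting, not a Valossa feature; the sample entries mirror the dictionaries built in the script above.

```python
def to_prompt_text(entries):
    """Render timeline entries as one tagged line of text per event."""
    lines = []
    for e in sorted(entries, key=lambda x: x["start"]):
        tag = "VISUAL" if e["type"] == "visual" else "SPEECH"
        lines.append(f"[{e['start']:.2f}s] {tag}: {e['text']}")
    return "\n".join(lines)

# Illustrative entries in the same shape as the `scenes` list above.
entries = [
    {"type": "speech", "start": 1.5, "text": "Good evening and welcome"},
    {"type": "visual", "start": 0.0, "text": "A news anchor sits at a desk"},
]

print(to_prompt_text(entries))
```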

Step 4: Export as Structured Data

Export as CSV

import csv

with open("scene_descriptions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["start_seconds", "end_seconds", "description"])
    for item in captions_data.get("selected_sections", []):
        writer.writerow([
            item.get("section", {}).get("s_start", 0),
            item.get("section", {}).get("s_end", 0),
            item.get("caption", "")
        ])

print("Exported to scene_descriptions.csv")

Export as SRT-style text

def srt_time(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

for i, item in enumerate(captions_data.get("selected_sections", []), 1):
    start = item.get("section", {}).get("s_start", 0)
    end = item.get("section", {}).get("s_end", 0)

    print(f"{i}")
    print(f"{srt_time(start)} --> {srt_time(end)}")
    print(item.get("caption", ""))
    print()
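
The same SRT-style output can be written to disk instead of stdout. This is a hedged variation, not an official export feature; the write_srt helper and the output file name are illustrative.

```python
def srt_time(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(items, path):
    """Write visual_captions selected_sections items as an SRT-style file."""
    with open(path, "w", encoding="utf-8") as f:
        for i, item in enumerate(items, 1):
            sec = item.get("section", {})
            f.write(f"{i}\n")
            f.write(f"{srt_time(sec.get('s_start', 0))} --> {srt_time(sec.get('s_end', 0))}\n")
            f.write(f"{item.get('caption', '')}\n\n")

# Example usage with an illustrative item:
write_srt(
    [{"caption": "Test scene", "section": {"s_start": 1.5, "s_end": 3.25}}],
    "scene_descriptions.srt",
)
```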

Time Field Naming Note

Most Core metadata time ranges use ss and se. visual_captions is an intentional exception and uses section.s_start and section.s_end. If you reuse Core-metadata parsing helpers, make sure they account for this different naming convention.
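
One way to smooth over the two conventions is a small helper that accepts either shape. This time_range function is a hypothetical sketch (not part of any Valossa SDK), assuming only the field names documented above:

```python
def time_range(obj):
    """Return (start, end) in seconds from either a Core occurrence
    ({"ss": ..., "se": ...}) or a visual_captions item
    ({"section": {"s_start": ..., "s_end": ...}, ...})."""
    section = obj.get("section", obj)
    start = section.get("s_start", obj.get("ss", 0))
    end = section.get("s_end", obj.get("se", 0))
    return start, end
```

With this in place, the same timeline-building loop can consume both metadata types without branching on field names.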

Language Support

Scene descriptions are generated based on visual analysis and are produced in English regardless of the media.language setting. The visual analysis itself is language-independent — it describes what is seen in the video frames.