
Natural Language Scene Descriptions

Available on Transcribe Pro Vision MAX

Visual scene descriptions are included in the Transcribe Pro Vision MAX free trial.

Valossa AI generates natural language scene descriptions — human-readable text that describes what is visually happening in each scene of a video. Unlike detection labels (which are structured metadata like "car" or "outdoor"), scene descriptions are full sentences such as "A woman walks into a modern office building carrying a laptop bag."

Metadata Reader Support

The Metadata Reader CLI tool currently operates on Core metadata only and does not yet support visual_captions. A new list-scene-descriptions mode is planned. For now, use the manual parsing approach shown in this guide.

What Are Scene Descriptions Used For?

Video accessibility: Generate alt-text and audio descriptions for visually impaired users
Video SEO: Add indexable text to video pages that search engines can crawl
Content cataloguing: Auto-generate human-readable summaries for video libraries and DAM systems
Video search: Enable natural language search over video archives ("find the scene where someone opens a gift"); see the sketch after this list
AI and RAG pipelines: Feed scene descriptions into LLMs or retrieval-augmented generation systems
Subtitle enhancement: Combine with speech transcripts for full audio-visual narration
Social media: Auto-generate captions and descriptions for video posts
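
As a minimal sketch of the video search use case, the snippet below runs a case-insensitive keyword match over scene descriptions (parsed as in Step 2). A production system would more likely use embeddings or an LLM; the search_scenes helper and the "gift" query are illustrative, not part of the API:

Python

import json

def search_scenes(captions_path, query):
    """Return (start, end, text) tuples whose description mentions the query."""
    with open(captions_path, "r") as f:
        captions = json.load(f)
    matches = []
    for caption in captions.get("captions", []):
        text = caption.get("text", "")
        if query.lower() in text.lower():
            matches.append((caption.get("ss", 0), caption.get("se", 0), text))
    return matches

# Example: find scenes where someone opens a gift
for start, end, text in search_scenes("visual_captions.json", "gift"):
    print(f"{start:.2f}s - {end:.2f}s: {text}")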

How It Works

Scene descriptions are delivered as a separate metadata type called visual_captions, downloaded via the job_results endpoint. The visual_captions metadata contains time-coded natural language descriptions covering the entire video.

Video timeline:       [============================================]
Scene descriptions:   |"A man walks  |"Close-up of  |"Wide shot of |
                      | through a    | a document   | a city       |
                      | busy street" | on a desk"   | skyline"     |

Step 1: Download Scene Description Metadata

Use the type=visual_captions parameter with the job_results endpoint:

curl

curl "https://api-eu.valossa.com/core/1.0/job_results?api_key=YOUR_API_KEY&job_id=JOB_ID&type=visual_captions" \
-o visual_captions.json

Python

import requests

response = requests.get(
    "https://api-eu.valossa.com/core/1.0/job_results",
    params={
        "api_key": "YOUR_API_KEY",
        "job_id": "JOB_ID",
        "type": "visual_captions"
    }
)

captions = response.json()

JavaScript

const response = await fetch(
  "https://api-eu.valossa.com/core/1.0/job_results?" +
    new URLSearchParams({
      api_key: "YOUR_API_KEY",
      job_id: "JOB_ID",
      type: "visual_captions"
    })
);

const captions = await response.json();
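
The snippets above assume the request succeeds. In practice it is worth checking the HTTP status before parsing; a minimal Python sketch:

Python

response = requests.get(
    "https://api-eu.valossa.com/core/1.0/job_results",
    params={"api_key": "YOUR_API_KEY", "job_id": "JOB_ID", "type": "visual_captions"}
)
response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
captions = response.json()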

Step 2: Parse Scene Descriptions

The visual_captions metadata contains time-coded scene descriptions. Extract them into a chronological timeline:

Python

import json

with open("visual_captions.json", "r") as f:
captions = json.load(f)

# Extract and display scene descriptions chronologically
for caption in captions.get("captions", []):
start = caption.get("ss", 0)
end = caption.get("se", 0)
text = caption.get("text", "")

start_min = int(start // 60)
start_sec = start % 60
end_min = int(end // 60)
end_sec = end % 60

print(f"[{start_min:02d}:{start_sec:05.2f} - {end_min:02d}:{end_sec:05.2f}] {text}")

JavaScript

const fs = require("fs");
const captions = JSON.parse(fs.readFileSync("visual_captions.json", "utf-8"));

// Format seconds as MM:SS.ss
const fmt = (s) => {
  const m = Math.floor(s / 60).toString().padStart(2, "0");
  const sec = (s % 60).toFixed(2).padStart(5, "0");
  return `${m}:${sec}`;
};

for (const caption of captions.captions || []) {
  const start = caption.ss || 0;
  const end = caption.se || 0;
  const text = caption.text || "";

  console.log(`[${fmt(start)} - ${fmt(end)}] ${text}`);
}

Step 3: Combine with Speech Transcript

For a complete audio-visual understanding of the video, combine scene descriptions with the speech transcript from Core metadata:

import json

# Load both metadata types
with open("visual_captions.json", "r") as f:
captions_data = json.load(f)

with open("core_metadata.json", "r") as f:
core = json.load(f)

# Extract scene descriptions
scenes = []
for caption in captions_data.get("captions", []):
scenes.append({
"type": "visual",
"start": caption.get("ss", 0),
"end": caption.get("se", 0),
"text": caption.get("text", "")
})

# Extract speech transcript
speech_ids = core["detection_groupings"]["by_detection_type"].get("audio.speech", [])
for det_id in speech_ids:
det = core["detections"][det_id]
if det.get("occs"):
scenes.append({
"type": "speech",
"start": det["occs"][0]["ss"],
"end": det["occs"][0]["se"],
"text": det["label"]
})

# Merge and sort chronologically
scenes.sort(key=lambda x: x["start"])

# Print combined timeline
for s in scenes:
prefix = "👁" if s["type"] == "visual" else "🗣"
start_min = int(s["start"] // 60)
start_sec = s["start"] % 60
print(f"[{start_min:02d}:{start_sec:05.2f}] {prefix} {s['text']}")

Example output:

[00:00.00] 👁 A news anchor sits at a desk with a city skyline behind them
[00:01.50] 🗣 Good evening and welcome to the six o'clock news
[00:05.20] 👁 Cut to aerial footage of a flooded residential area
[00:06.00] 🗣 Severe flooding has affected thousands of homes across the region
[00:12.40] 👁 Close-up of rescue workers helping residents into a boat
[00:13.10] 🗣 Emergency services have been working around the clock

Step 4: Export as Structured Data

Export as CSV

import csv

with open("scene_descriptions.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(["start_seconds", "end_seconds", "description"])
for caption in captions_data.get("captions", []):
writer.writerow([
caption.get("ss", 0),
caption.get("se", 0),
caption.get("text", "")
])

print("Exported to scene_descriptions.csv")

Export as SRT-style text

def srt_time(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

for i, caption in enumerate(captions_data.get("captions", []), 1):
    start = caption.get("ss", 0)
    end = caption.get("se", 0)

    print(i)
    print(f"{srt_time(start)} --> {srt_time(end)}")
    print(caption.get("text", ""))
    print()
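
To save the result as a file that players can load as a descriptions track, write directly instead of printing. A minimal variant reusing the srt_time helper above (the output filename is illustrative):

with open("scene_descriptions.srt", "w", encoding="utf-8") as f:
    for i, caption in enumerate(captions_data.get("captions", []), 1):
        f.write(f"{i}\n")
        f.write(f"{srt_time(caption.get('ss', 0))} --> {srt_time(caption.get('se', 0))}\n")
        f.write(f"{caption.get('text', '')}\n\n")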

Language Support

Scene descriptions are produced in English regardless of the media.language setting. The visual analysis itself is language-independent: it describes what is seen in the video frames.