Natural Language Scene Descriptions
Valossa AI generates natural language scene descriptions — human-readable text that describes what is visually happening in each scene of a video. Unlike detection labels (which are structured metadata like "car" or "outdoor"), scene descriptions are full sentences such as "A woman walks into a modern office building carrying a laptop bag."
The Metadata Reader CLI tool currently operates on Core metadata only and does not yet support visual_captions. A new list-scene-descriptions mode is planned. For now, use the manual parsing approach shown in this guide.
What Are Scene Descriptions Used For?
| Use Case | Description |
|---|---|
| Video accessibility | Generate alt-text and audio descriptions for visually impaired users |
| Video SEO | Add indexable text to video pages that search engines can crawl |
| Content cataloguing | Auto-generate human-readable summaries for video libraries and DAM systems |
| Video search | Enable natural language search over video archives ("find the scene where someone opens a gift") |
| AI and RAG pipelines | Feed scene descriptions into LLMs or retrieval-augmented generation systems (see the sketch after this table) |
| Subtitle enhancement | Combine with speech transcripts for full audio-visual narration |
| Social media | Auto-generate captions and descriptions for video posts |
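For the video search and RAG use cases above, the time-coded captions can be flattened into plain-text chunks before indexing. The snippet below is a minimal sketch of that preparation step, assuming a visual_captions.json file downloaded as shown in Step 1; the chunk format is arbitrary and the actual indexing backend (search engine, vector store, LLM prompt) is left out.
import json

# Assumption: visual_captions.json has been downloaded as shown in Step 1 below.
with open("visual_captions.json", "r") as f:
    captions = json.load(f)

# Build one plain-text chunk per scene description, keeping the time range
# as metadata so search results can link back to the right video position.
chunks = []
for item in captions.get("selected_sections", []):
    section = item.get("section", {})
    chunks.append({
        "text": item.get("caption", ""),
        "start_seconds": section.get("s_start", 0),
        "end_seconds": section.get("s_end", 0),
    })

# The chunks can now be embedded, indexed, or inserted into an LLM prompt;
# printing them here simply shows the prepared data.
for chunk in chunks:
    print(chunk)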
How It Works
Scene descriptions are delivered as a separate metadata type called visual_captions, downloaded via the job_results endpoint. The visual_captions metadata contains time-coded natural language descriptions in a selected_sections array.
Video timeline:     [==========================================]
Scene descriptions: | "A man walks   | "Close-up of   | "Wide shot of  |
                    |  through a     |  a document    |  a city        |
                    |  busy street"  |  on a desk"    |  skyline"      |
Step 1: Download Scene Description Metadata
Use the type=visual_captions parameter with the job_results endpoint:
curl
curl "https://api-eu.valossa.com/core/1.0/job_results?api_key=YOUR_API_KEY&job_id=JOB_ID&type=visual_captions" \
-o visual_captions.json
Python
import requests
response = requests.get(
    "https://api-eu.valossa.com/core/1.0/job_results",
    params={
        "api_key": "YOUR_API_KEY",
        "job_id": "JOB_ID",
        "type": "visual_captions"
    }
)
captions = response.json()
JavaScript
const response = await fetch(
  "https://api-eu.valossa.com/core/1.0/job_results?" +
    new URLSearchParams({
      api_key: "YOUR_API_KEY",
      job_id: "JOB_ID",
      type: "visual_captions"
    })
);
const captions = await response.json();
Step 2: Parse Scene Descriptions
The visual_captions payload uses selected_sections, where each item contains a caption string and a nested section object with s_start / s_end time fields:
{
  "selected_sections": [
    {
      "caption": "A close-up view of a car's side, focusing on the area near the rear window and the roof.",
      "section": {
        "s_start": 6.88,
        "s_end": 9.96,
        "shot_index": 3
      }
    }
  ]
}
Extract them into a chronological timeline:
Python
import json

with open("visual_captions.json", "r") as f:
    captions = json.load(f)

# Extract and display scene descriptions chronologically
for item in captions.get("selected_sections", []):
    start = item.get("section", {}).get("s_start", 0)
    end = item.get("section", {}).get("s_end", 0)
    text = item.get("caption", "")
    start_min = int(start // 60)
    start_sec = start % 60
    end_min = int(end // 60)
    end_sec = end % 60
    print(f"[{start_min:02d}:{start_sec:05.2f} - {end_min:02d}:{end_sec:05.2f}] {text}")
JavaScript
const fs = require("fs");

const captions = JSON.parse(fs.readFileSync("visual_captions.json", "utf-8"));

for (const item of captions.selected_sections || []) {
  const start = item.section?.s_start || 0;
  const end = item.section?.s_end || 0;
  const text = item.caption || "";
  const fmt = (s) => {
    const m = Math.floor(s / 60).toString().padStart(2, "0");
    const sec = (s % 60).toFixed(2).padStart(5, "0");
    return `${m}:${sec}`;
  };
  console.log(`[${fmt(start)} - ${fmt(end)}] ${text}`);
}
Step 3: Combine with Speech Transcript
For a complete audio-visual understanding of the video, combine scene descriptions with the speech transcript from Core metadata:
import json

# Load both metadata types
with open("visual_captions.json", "r") as f:
    captions_data = json.load(f)
with open("core_metadata.json", "r") as f:
    core = json.load(f)

# Extract scene descriptions
scenes = []
for item in captions_data.get("selected_sections", []):
    scenes.append({
        "type": "visual",
        "start": item.get("section", {}).get("s_start", 0),
        "end": item.get("section", {}).get("s_end", 0),
        "text": item.get("caption", "")
    })

# Extract speech transcript
speech_ids = core["detection_groupings"]["by_detection_type"].get("audio.speech", [])
for det_id in speech_ids:
    det = core["detections"][det_id]
    if det.get("occs"):
        scenes.append({
            "type": "speech",
            "start": det["occs"][0]["ss"],
            "end": det["occs"][0]["se"],
            "text": det["label"]
        })

# Merge and sort chronologically
scenes.sort(key=lambda x: x["start"])

# Print combined timeline
for s in scenes:
    prefix = "👁" if s["type"] == "visual" else "🗣"
    start_min = int(s["start"] // 60)
    start_sec = s["start"] % 60
    print(f"[{start_min:02d}:{start_sec:05.2f}] {prefix} {s['text']}")
Example output:
[00:00.00] 👁 A news anchor sits at a desk with a city skyline behind them
[00:01.50] 🗣 Good evening and welcome to the six o'clock news
[00:05.20] 👁 Cut to aerial footage of a flooded residential area
[00:06.00] 🗣 Severe flooding has affected thousands of homes across the region
[00:12.40] 👁 Close-up of rescue workers helping residents into a boat
[00:13.10] 🗣 Emergency services have been working around the clock
Step 4: Export as Structured Data
Export as CSV
import csv

# captions_data is the visual_captions payload loaded in Step 3
with open("scene_descriptions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["start_seconds", "end_seconds", "description"])
    for item in captions_data.get("selected_sections", []):
        writer.writerow([
            item.get("section", {}).get("s_start", 0),
            item.get("section", {}).get("s_end", 0),
            item.get("caption", "")
        ])

print("Exported to scene_descriptions.csv")
Export as SRT-style text
def srt_time(seconds):
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds % 1) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

for i, item in enumerate(captions_data.get("selected_sections", []), 1):
    start = item.get("section", {}).get("s_start", 0)
    end = item.get("section", {}).get("s_end", 0)
    print(f"{i}")
    print(f"{srt_time(start)} --> {srt_time(end)}")
    print(f"{item.get('caption', '')}")
    print()
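To write an actual .srt file instead of printing to stdout, the same loop can stream the blocks to disk. This is a minimal sketch under the same assumptions as above; the file name scene_descriptions.srt is arbitrary.
with open("scene_descriptions.srt", "w", encoding="utf-8") as srt_file:
    for i, item in enumerate(captions_data.get("selected_sections", []), 1):
        start = item.get("section", {}).get("s_start", 0)
        end = item.get("section", {}).get("s_end", 0)
        # Reuses the srt_time() helper defined above
        srt_file.write(f"{i}\n")
        srt_file.write(f"{srt_time(start)} --> {srt_time(end)}\n")
        srt_file.write(f"{item.get('caption', '')}\n\n")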
Time Field Naming Note
Most Core metadata time ranges use ss and se. visual_captions is an intentional exception and uses section.s_start and section.s_end. If you reuse Core-metadata parsing helpers, make sure they account for this different naming convention.
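If you reuse a helper that expects ss / se, one option is a small wrapper that falls back to the visual_captions field names. The function below is a hypothetical sketch (it is not part of the Valossa API or the Metadata Reader), but it illustrates handling both conventions:
def get_time_range(obj):
    # Hypothetical helper: reads ss/se (Core metadata occurrences) or
    # s_start/s_end (visual_captions section objects), whichever is present.
    start = obj.get("ss", obj.get("s_start"))
    end = obj.get("se", obj.get("s_end"))
    return start, end

# Usage with a visual_captions item:
#   start, end = get_time_range(item["section"])
# Usage with a Core metadata occurrence:
#   start, end = get_time_range(detection["occs"][0])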
Language Support
Scene descriptions are generated based on visual analysis and are produced in English regardless of the media.language setting. The visual analysis itself is language-independent — it describes what is seen in the video frames.
Related Resources
- Metadata Overview — All metadata types including visual_captions
- Speech-to-Text Guide — Extract transcripts to combine with scene descriptions
- Video Tagging Guide — Structured detection labels (complementary to scene descriptions)
- Job Results API — How to download different metadata types
- Metadata Reader — CLI tool for Core metadata exploration (list-scene-descriptions mode planned)