Speech-to-Text Guide

Available on Transcribe Pro Vision MAX

Speech-to-text, SRT/VTT subtitle download, speech keywords, and diarization are fully included in the Transcribe Pro Vision MAX free trial.

This guide walks through extracting speech transcripts, downloading subtitles, and working with speech-derived keywords and named entities from Valossa AI analysis results.

Quick Exploration with Metadata Reader

Before writing custom code, you can quickly extract speech data using the Metadata Reader CLI tool:

# List speech detections
python -m metareader list-detections --type "audio.speech" core_metadata.json

# Export speech as SRT subtitles
python -m metareader list-detections --type "audio.speech" --format srt core_metadata.json

# List speech keywords and named entities
python -m metareader list-detections --type "audio.keyword.*" core_metadata.json

Overview

Valossa AI's speech analysis produces:

Output                 Detection Type                    Description
Transcript segments    audio.speech                      Sentence-level transcript chunks
Word-level transcript  audio.speech_detailed             Individual words with timestamps and speaker IDs
Keywords               audio.keyword.novelty_word        Noteworthy keywords from speech
Person names           audio.keyword.name.person         People mentioned in speech
Location names         audio.keyword.name.location       Places mentioned in speech
Organization names     audio.keyword.name.organization   Organizations mentioned in speech
Compliance flags       audio.keyword.compliance          Profanity and sensitive terms
SRT subtitles          (downloadable file)               Standard subtitle format
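
All of these detections live in the core metadata JSON. For orientation, the field names used throughout this guide look roughly like the sketch below. The shape is illustrative, inferred from the access patterns in the code that follows; the sample values are invented.

# Illustrative shape of the metadata fields this guide reads.
# Field names ("detection_groupings", "detections", "occs", "ss", "se",
# "c", "a") match the access patterns below; values are made up.
metadata_shape = {
    "detection_groupings": {
        "by_detection_type": {
            "audio.speech": ["1", "2"]              # detection IDs per type
        }
    },
    "detections": {
        "1": {
            "label": "Hello and welcome.",          # transcript text or keyword
            "occs": [{"ss": 12.4, "se": 15.1}],     # occurrence start/end in seconds
            "c": 0.93,                              # confidence (word-level detections)
            "a": {"s": {"id": "1"}}                 # speaker attribute (word-level detections)
        }
    }
}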

Step 1: Extract the Transcript

Sentence-Level (audio.speech)

import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

speech_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech", [])

full_transcript = []
for det_id in speech_ids:
    detection = metadata["detections"][det_id]
    text = detection["label"]
    start = detection["occs"][0]["ss"] if detection.get("occs") else 0
    end = detection["occs"][0]["se"] if detection.get("occs") else 0
    full_transcript.append({"text": text, "start": start, "end": end})

# Print chronologically
for segment in sorted(full_transcript, key=lambda x: x["start"]):
    start_min = int(segment["start"] // 60)
    start_sec = segment["start"] % 60
    print(f"[{start_min:02d}:{start_sec:05.2f}] {segment['text']}")

Word-Level with Speaker Diarization (audio.speech_detailed)

detailed_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech_detailed", [])

speakers = {}
for det_id in detailed_ids:
    det = metadata["detections"][det_id]
    word = det["label"]
    confidence = det.get("c", 0)
    speaker_id = det.get("a", {}).get("s", {}).get("id", "unknown")
    start = det["occs"][0]["ss"] if det.get("occs") else 0

    speakers.setdefault(speaker_id, []).append({
        "word": word,
        "start": start,
        "confidence": confidence
    })

for speaker, words in speakers.items():
    text = " ".join(w["word"] for w in sorted(words, key=lambda x: x["start"]))
    avg_conf = sum(w["confidence"] for w in words) / len(words) if words else 0
    print(f"Speaker {speaker} (avg confidence: {avg_conf:.2f}):")
    print(f"  {text}\n")

Step 2: Download SRT Subtitles

import requests

srt_response = requests.get(
    "https://api-eu.valossa.com/core/1.0/job_results",
    params={
        "api_key": "YOUR_API_KEY",
        "job_id": "JOB_ID",
        "type": "speech_to_text_srt"
    }
)
srt_response.raise_for_status()  # fail early on HTTP errors

with open("subtitles.srt", "w") as f:
    f.write(srt_response.text)

print("SRT file saved.")

Or using curl:

curl "https://api-eu.valossa.com/core/1.0/job_results?api_key=YOUR_API_KEY&job_id=JOB_ID&type=speech_to_text_srt" \
-o subtitles.srt
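
The downloaded SRT is plain text, so stripping it back to a bare transcript takes only a few lines. This sketch assumes the standard SRT layout (a cue number line, a timing line containing "-->", then one or more text lines per cue):

# Strip SRT cue numbers and timing lines, keeping only the subtitle text
with open("subtitles.srt", "r") as f:
    lines = [ln.strip() for ln in f]

text_lines = [
    ln for ln in lines
    if ln and not ln.isdigit() and "-->" not in ln
]
print(" ".join(text_lines))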

Step 3: Extract Keywords and Named Entities

keyword_types = {
    "audio.keyword.novelty_word": "Keywords",
    "audio.keyword.name.person": "People",
    "audio.keyword.name.location": "Locations",
    "audio.keyword.name.organization": "Organizations",
    "audio.keyword.name.general": "Entities"
}

for det_type, label in keyword_types.items():
    det_ids = metadata["detection_groupings"]["by_detection_type"].get(det_type, [])
    if det_ids:
        print(f"\n{label}:")
        for det_id in det_ids:
            detection = metadata["detections"][det_id]
            times = []
            for occ in detection.get("occs", []):
                times.append(f"{occ['ss']:.1f}s")
            print(f"  {detection['label']} (at: {', '.join(times)})")

Step 4: Check for Compliance Keywords

compliance_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.keyword.compliance", [])

if compliance_ids:
    print("Compliance-flagged speech keywords:")
    for det_id in compliance_ids:
        detection = metadata["detections"][det_id]
        print(f"  '{detection['label']}'")
        for occ in detection.get("occs", []):
            print(f"    Time: {occ['ss']:.1f}s - {occ['se']:.1f}s")
else:
    print("No compliance-flagged speech keywords found.")

Language Support

Speech-to-text is supported in 23 languages. Keyword extraction and named entity recognition are available in a subset of those languages. See Supported Languages for details.

Specify the language in your new_job request:

{
    "api_key": "YOUR_API_KEY",
    "media": {
        "video": {"url": "https://example.com/video.mp4"},
        "language": "de-DE"
    }
}
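
In Python, submitting that request might look like the sketch below. Note this is an assumption, not a confirmed endpoint: it guesses that new_job sits at /core/1.0/new_job alongside the job_results URL used earlier, so check the API reference for the exact path.

import requests

# Assumed endpoint, by analogy with the job_results URL above
response = requests.post(
    "https://api-eu.valossa.com/core/1.0/new_job",
    json={
        "api_key": "YOUR_API_KEY",
        "media": {
            "video": {"url": "https://example.com/video.mp4"},
            "language": "de-DE"
        }
    }
)
print(response.json())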