Speech-to-Text Guide

Available on Transcribe Pro Vision MAX

Speech-to-text, SRT/VTT subtitle download, speech keywords, and diarization are fully included in the Transcribe Pro Vision MAX free trial.

This guide walks through extracting speech transcripts, downloading subtitles, and working with speech-derived keywords and named entities from Valossa AI analysis results.

Quick Exploration with Metadata Reader

Before writing custom code, you can quickly extract speech data using the Metadata Reader CLI tool:

# List speech detections
python -m metareader list-detections --type "audio.speech" core_metadata.json

# Export speech as SRT subtitles
python -m metareader list-detections --type "audio.speech" --format srt core_metadata.json

# List speech keywords and named entities
python -m metareader list-detections --type "audio.keyword.*" core_metadata.json

Overview

Valossa AI's speech analysis produces:

Output                 Detection Type                    Description
Transcript segments    audio.speech                      Sentence-level transcript chunks
Word-level transcript  audio.speech_detailed             Individual words with timestamps and speaker IDs
Keywords               audio.keyword.novelty_word        Noteworthy keywords from speech
Person names           audio.keyword.name.person         People mentioned in speech
Location names         audio.keyword.name.location       Places mentioned in speech
Organization names     audio.keyword.name.organization   Organizations mentioned in speech
Compliance flags       audio.keyword.compliance          Profanity and sensitive terms
SRT subtitles          (downloadable file)               Standard subtitle format
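
All of these detections live in the core metadata JSON. For orientation, the field names used throughout this guide look roughly like the sketch below. The shape is illustrative, inferred from the access patterns in the code that follows; the sample values are invented.

# Illustrative shape of the metadata fields this guide reads.
# Field names ("detection_groupings", "detections", "occs", "ss", "se",
# "c", "a") match the access patterns below; values are made up.
metadata_shape = {
    "detection_groupings": {
        "by_detection_type": {
            "audio.speech": ["1", "2"]              # detection IDs per type
        }
    },
    "detections": {
        "1": {
            "label": "Hello and welcome.",          # transcript text or keyword
            "occs": [{"ss": 12.4, "se": 15.1}],     # occurrence start/end in seconds
            "c": 0.93,                              # confidence (word-level detections)
            "a": {"s": {"id": "1"}}                 # speaker attribute (word-level detections)
        }
    }
}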

Step 1: Extract the Transcript

Sentence-Level (audio.speech)

import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

speech_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech", [])

full_transcript = []
for det_id in speech_ids:
    detection = metadata["detections"][det_id]
    text = detection["label"]
    start = detection["occs"][0]["ss"] if detection.get("occs") else 0
    end = detection["occs"][0]["se"] if detection.get("occs") else 0
    full_transcript.append({"text": text, "start": start, "end": end})

# Print chronologically
for segment in sorted(full_transcript, key=lambda x: x["start"]):
    start_min = int(segment["start"] // 60)
    start_sec = segment["start"] % 60
    print(f"[{start_min:02d}:{start_sec:05.2f}] {segment['text']}")

Word-Level with Speaker Diarization (audio.speech_detailed)

detailed_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech_detailed", [])

speakers = {}
for det_id in detailed_ids:
    det = metadata["detections"][det_id]
    word = det["label"]
    confidence = det.get("c", 0)
    speaker_id = det.get("a", {}).get("s", {}).get("id", "unknown")
    start = det["occs"][0]["ss"] if det.get("occs") else 0

    speakers.setdefault(speaker_id, []).append({
        "word": word,
        "start": start,
        "confidence": confidence
    })

for speaker, words in speakers.items():
    text = " ".join(w["word"] for w in sorted(words, key=lambda x: x["start"]))
    avg_conf = sum(w["confidence"] for w in words) / len(words) if words else 0
    print(f"Speaker {speaker} (avg confidence: {avg_conf:.2f}):")
    print(f"  {text}\n")

Step 2: Download SRT Subtitles

import requests

srt_response = requests.get(
    "https://api-eu.valossa.com/core/1.0/job_results",
    params={
        "api_key": "YOUR_API_KEY",
        "job_id": "JOB_ID",
        "type": "speech_to_text_srt"
    }
)
srt_response.raise_for_status()  # fail early on HTTP errors

with open("subtitles.srt", "w") as f:
    f.write(srt_response.text)

print("SRT file saved.")

Or using curl:

curl "https://api-eu.valossa.com/core/1.0/job_results?api_key=YOUR_API_KEY&job_id=JOB_ID&type=speech_to_text_srt" \
-o subtitles.srt
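
The downloaded SRT is plain text, so stripping it back to a bare transcript takes only a few lines. This sketch assumes the standard SRT layout (a cue number line, a timing line containing "-->", then one or more text lines per cue):

# Strip SRT cue numbers and timing lines, keeping only the subtitle text
with open("subtitles.srt", "r") as f:
    lines = [ln.strip() for ln in f]

text_lines = [
    ln for ln in lines
    if ln and not ln.isdigit() and "-->" not in ln
]
print(" ".join(text_lines))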

Step 3: Extract Keywords and Named Entities

keyword_types = {
    "audio.keyword.novelty_word": "Keywords",
    "audio.keyword.name.person": "People",
    "audio.keyword.name.location": "Locations",
    "audio.keyword.name.organization": "Organizations",
    "audio.keyword.name.general": "Entities"
}

for det_type, label in keyword_types.items():
    det_ids = metadata["detection_groupings"]["by_detection_type"].get(det_type, [])
    if det_ids:
        print(f"\n{label}:")
        for det_id in det_ids:
            detection = metadata["detections"][det_id]
            times = []
            for occ in detection.get("occs", []):
                times.append(f"{occ['ss']:.1f}s")
            print(f"  {detection['label']} (at: {', '.join(times)})")

Step 4: Check for Compliance Keywords

compliance_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.keyword.compliance", [])

if compliance_ids:
    print("Compliance-flagged speech keywords:")
    for det_id in compliance_ids:
        detection = metadata["detections"][det_id]
        print(f"  '{detection['label']}'")
        for occ in detection.get("occs", []):
            print(f"    Time: {occ['ss']:.1f}s - {occ['se']:.1f}s")
else:
    print("No compliance-flagged speech keywords found.")

Language Support

Speech-to-text is supported in 23 languages. Keyword extraction and named entity recognition are available in a subset of those languages. See Supported Languages for details.

Specify the language in your new_job request:

{
    "api_key": "YOUR_API_KEY",
    "media": {
        "video": {"url": "https://example.com/video.mp4"},
        "language": "de-DE"
    }
}
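
In Python, submitting that request might look like the sketch below. Note this is an assumption, not a confirmed endpoint: it guesses that new_job sits at /core/1.0/new_job alongside the job_results URL used earlier, so check the API reference for the exact path.

import requests

# Assumed endpoint, by analogy with the job_results URL above
response = requests.post(
    "https://api-eu.valossa.com/core/1.0/new_job",
    json={
        "api_key": "YOUR_API_KEY",
        "media": {
            "video": {"url": "https://example.com/video.mp4"},
            "language": "de-DE"
        }
    }
)
print(response.json())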