Speech-to-Text Guide
Available on Transcribe Pro Vision MAX
Speech-to-text, SRT/VTT subtitle download, speech keywords, and diarization are fully included in the Transcribe Pro Vision MAX free trial. Start free — no sales call →
This guide walks through extracting speech transcripts, downloading subtitles, and working with speech-derived keywords and named entities from Valossa AI analysis results.
Quick Exploration with Metadata Reader
Before writing custom code, you can quickly extract speech data using the Metadata Reader CLI tool:
# List speech detections
python -m metareader list-detections --type "audio.speech" core_metadata.json
# Export speech as SRT subtitles
python -m metareader list-detections --type "audio.speech" --format srt core_metadata.json
# List speech keywords and named entities
python -m metareader list-detections --type "audio.keyword.*" core_metadata.json
Overview
Valossa AI's speech analysis produces:
| Output | Detection Type | Description |
|---|---|---|
| Transcript segments | audio.speech | Sentence-level transcript chunks |
| Word-level transcript | audio.speech_detailed | Individual words with timestamps and speaker IDs |
| Keywords | audio.keyword.novelty_word | Noteworthy keywords from speech |
| Person names | audio.keyword.name.person | People mentioned in speech |
| Location names | audio.keyword.name.location | Places mentioned in speech |
| Organization names | audio.keyword.name.organization | Organizations mentioned in speech |
| Compliance flags | audio.keyword.compliance | Profanity and sensitive terms |
| SRT subtitles | (downloadable file) | Standard subtitle format |
Step 1: Extract the Transcript
Sentence-Level (audio.speech)
import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

speech_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech", [])

full_transcript = []
for det_id in speech_ids:
    detection = metadata["detections"][det_id]
    text = detection["label"]
    start = detection["occs"][0]["ss"] if detection.get("occs") else 0
    end = detection["occs"][0]["se"] if detection.get("occs") else 0
    full_transcript.append({"text": text, "start": start, "end": end})

# Print chronologically
for segment in sorted(full_transcript, key=lambda x: x["start"]):
    start_min = int(segment["start"] // 60)
    start_sec = segment["start"] % 60
    print(f"[{start_min:02d}:{start_sec:05.2f}] {segment['text']}")
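Once the segments are collected, they are easy to repurpose, for example joining them into a plain-text transcript or summing total speech time. A minimal sketch building on the `full_transcript` structure above (the sample segment values here are illustrative, not from a real analysis):

```python
def transcript_to_text(segments):
    """Join transcript segments into one plain-text string, in time order."""
    ordered = sorted(segments, key=lambda s: s["start"])
    return " ".join(s["text"] for s in ordered)

def total_speech_seconds(segments):
    """Sum the duration (end - start) of all segments."""
    return sum(s["end"] - s["start"] for s in segments)

# Illustrative segments in the same shape as full_transcript:
segments = [
    {"text": "world.", "start": 2.0, "end": 3.5},
    {"text": "Hello", "start": 0.0, "end": 1.5},
]
print(transcript_to_text(segments))    # Hello world.
print(total_speech_seconds(segments))  # 3.0
```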
Word-Level with Speaker Diarization (audio.speech_detailed)
detailed_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech_detailed", [])

speakers = {}
for det_id in detailed_ids:
    det = metadata["detections"][det_id]
    word = det["label"]
    confidence = det.get("c", 0)
    speaker_id = det.get("a", {}).get("s", {}).get("id", "unknown")
    start = det["occs"][0]["ss"] if det.get("occs") else 0
    speakers.setdefault(speaker_id, []).append({
        "word": word,
        "start": start,
        "confidence": confidence
    })

for speaker, words in speakers.items():
    text = " ".join(w["word"] for w in sorted(words, key=lambda x: x["start"]))
    avg_conf = sum(w["confidence"] for w in words) / len(words) if words else 0
    print(f"Speaker {speaker} (avg confidence: {avg_conf:.2f}):")
    print(f"  {text}\n")
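Grouping words per speaker (as above) loses the back-and-forth order of a conversation. To reconstruct a dialogue-style transcript instead, sort all words globally by start time and begin a new line whenever the speaker changes. A sketch assuming the `speakers` dict shape built above (`to_dialogue` is a helper name introduced here, and the sample words are invented):

```python
def to_dialogue(speakers):
    """Flatten per-speaker word lists into speaker-turn lines, in time order."""
    words = [
        (w["start"], spk, w["word"])
        for spk, ws in speakers.items()
        for w in ws
    ]
    words.sort()  # chronological order across all speakers
    turns, current = [], None
    for _, spk, word in words:
        if spk != current:
            turns.append(f"Speaker {spk}:")
            current = spk
        turns[-1] += " " + word
    return turns

speakers = {
    "1": [{"word": "Hello", "start": 0.0}, {"word": "there", "start": 0.4}],
    "2": [{"word": "Hi", "start": 1.0}],
}
for line in to_dialogue(speakers):
    print(line)
# Speaker 1: Hello there
# Speaker 2: Hi
```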
Step 2: Download SRT Subtitles
import requests

srt_response = requests.get(
    "https://api-eu.valossa.com/core/1.0/job_results",
    params={
        "api_key": "YOUR_API_KEY",
        "job_id": "JOB_ID",
        "type": "speech_to_text_srt"
    }
)
srt_response.raise_for_status()  # fail loudly on an API error

with open("subtitles.srt", "w") as f:
    f.write(srt_response.text)
print("SRT file saved.")
Or using curl:
curl "https://api-eu.valossa.com/core/1.0/job_results?api_key=YOUR_API_KEY&job_id=JOB_ID&type=speech_to_text_srt" \
-o subtitles.srt
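If your player wants WebVTT rather than SRT, a straightforward local conversion often suffices: prepend the `WEBVTT` header and switch the comma decimal separator in timestamps to a period. This naive converter (`srt_to_vtt` is a helper introduced here) assumes a well-formed SRT file; for anything less tidy, use a dedicated subtitle library:

```python
import re

def srt_to_vtt(srt_text):
    """Naive SRT -> WebVTT: add the header, fix timestamp separators."""
    # 00:00:01,500 --> 00:00:03,200  becomes  00:00:01.500 --> 00:00:03.200
    body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text)
    return "WEBVTT\n\n" + body

srt = "1\n00:00:01,500 --> 00:00:03,200\nHello world.\n"
print(srt_to_vtt(srt))
```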
Step 3: Extract Keywords and Named Entities
keyword_types = {
    "audio.keyword.novelty_word": "Keywords",
    "audio.keyword.name.person": "People",
    "audio.keyword.name.location": "Locations",
    "audio.keyword.name.organization": "Organizations",
    "audio.keyword.name.general": "Entities"
}

for det_type, label in keyword_types.items():
    det_ids = metadata["detection_groupings"]["by_detection_type"].get(det_type, [])
    if det_ids:
        print(f"\n{label}:")
        for det_id in det_ids:
            detection = metadata["detections"][det_id]
            times = []
            for occ in detection.get("occs", []):
                times.append(f"{occ['ss']:.1f}s")
            print(f"  {detection['label']} (at: {', '.join(times)})")
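To rank entities by how often they are spoken rather than just list them, count occurrences per label. A sketch using the same metadata layout as above (`keyword_frequencies` is a helper name introduced here, and the sample metadata dict is invented for illustration):

```python
from collections import Counter

def keyword_frequencies(metadata, det_type):
    """Count occurrences of each keyword label of the given detection type."""
    det_ids = metadata["detection_groupings"]["by_detection_type"].get(det_type, [])
    counts = Counter()
    for det_id in det_ids:
        det = metadata["detections"][det_id]
        counts[det["label"]] += len(det.get("occs", []))
    return counts

sample = {
    "detection_groupings": {
        "by_detection_type": {"audio.keyword.name.person": ["1", "2"]}
    },
    "detections": {
        "1": {"label": "Ada Lovelace", "occs": [{"ss": 3.0}, {"ss": 41.2}]},
        "2": {"label": "Alan Turing", "occs": [{"ss": 12.5}]},
    },
}
print(keyword_frequencies(sample, "audio.keyword.name.person").most_common())
# [('Ada Lovelace', 2), ('Alan Turing', 1)]
```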
Step 4: Check for Compliance Keywords
compliance_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.keyword.compliance", [])

if compliance_ids:
    print("Compliance-flagged speech keywords:")
    for det_id in compliance_ids:
        detection = metadata["detections"][det_id]
        print(f"  '{detection['label']}'")
        for occ in detection.get("occs", []):
            print(f"    Time: {occ['ss']:.1f}s - {occ['se']:.1f}s")
else:
    print("No compliance-flagged speech keywords found.")
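A flagged keyword is usually easier for a human reviewer to judge with the surrounding sentence. Since `audio.speech` segments carry their own time ranges, you can look up the segments that overlap a flagged occurrence's window. A sketch under the same metadata assumptions as the code above (`segments_containing` is a helper introduced here; the sample data is invented):

```python
def segments_containing(metadata, start, end):
    """Return the text of audio.speech segments overlapping [start, end]."""
    hits = []
    speech_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech", [])
    for det_id in speech_ids:
        det = metadata["detections"][det_id]
        for occ in det.get("occs", []):
            # Standard interval-overlap test
            if occ["ss"] <= end and occ["se"] >= start:
                hits.append(det["label"])
    return hits

sample = {
    "detection_groupings": {"by_detection_type": {"audio.speech": ["1", "2"]}},
    "detections": {
        "1": {"label": "This part is fine.", "occs": [{"ss": 0.0, "se": 4.0}]},
        "2": {"label": "Flagged sentence here.", "occs": [{"ss": 10.0, "se": 13.0}]},
    },
}
print(segments_containing(sample, 11.0, 12.0))
# ['Flagged sentence here.']
```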
Language Support
Speech-to-text is supported in 23 languages. Keyword extraction and named entity recognition are available in a subset of those languages. See Supported Languages for details.
Specify the language in your new_job request:
{
  "api_key": "YOUR_API_KEY",
  "media": {
    "video": {"url": "https://example.com/video.mp4"},
    "language": "de-DE"
  }
}
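To send that payload from Python, a JSON POST is the natural shape. The sketch below assumes a `new_job` endpoint under the same api-eu base URL used in the download example; check the API reference for the exact request format before relying on it:

```python
import json
import urllib.request

def submit_job(payload, base_url="https://api-eu.valossa.com/core/1.0"):
    """POST a new_job payload and return the parsed JSON response.

    Assumption: endpoint path and JSON body format inferred from this guide,
    not verified against the API reference.
    """
    req = urllib.request.Request(
        f"{base_url}/new_job",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = {
    "api_key": "YOUR_API_KEY",
    "media": {
        "video": {"url": "https://example.com/video.mp4"},
        "language": "de-DE",
    },
}
# job = submit_job(payload)  # response should include the job_id for polling
```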
Related Resources
- Speech & Transcription Reference -- Metadata format details
- Supported Languages -- Language availability
- Content Moderation -- Using speech compliance for moderation
- Metadata Reader -- CLI tool for quick speech extraction and SRT export