Speech and Transcription
Valossa AI generates speech-to-text results from the audio track of analyzed videos. The transcript data is available in two formats: as detections in the Core metadata JSON, and as a downloadable SRT subtitle file.
Speech Detection Types
audio.speech
Contains speech transcript segments, roughly corresponding to typical subtitle groupings (a few words or a sentence per detection).
{
  "t": "audio.speech",
  "label": "we profoundly believe that justice will win despite the looming challenges",
  "a": { "sen": { "val": 0.307 } },
  "occs": [
    { "id": "100", "ss": 12.5, "se": 16.8, "shs": 4, "she": 5 }
  ]
}
The label field contains the transcribed text. If speech sentiment analysis is enabled, the a.sen.val field contains the valence score.
In by_detection_type, audio.speech detections are ordered by time (not by prominence).
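As a quick illustration, the sketch below prints each speech segment with its start time and valence score. This is a minimal sketch, not part of the official API examples; core_metadata.json is a placeholder for your downloaded metadata file, and segments without sentiment data print None.

import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

# Print each speech segment with its sentiment valence (present only when
# speech sentiment analysis is enabled for the job).
for det_id in metadata["detection_groupings"]["by_detection_type"].get("audio.speech", []):
    det = metadata["detections"][det_id]
    valence = det.get("a", {}).get("sen", {}).get("val")  # None if sentiment is unavailable
    start = det["occs"][0]["ss"]
    print(f"{start:8.2f}s  valence={valence}  {det['label']}")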
audio.speech_detailed
Contains individual words with precise timestamps, confidence scores, and speaker diarization.
{
  "t": "audio.speech_detailed",
  "label": "stay",
  "c": 0.59,
  "a": { "s": { "id": "14" } },
  "occs": [
    { "id": "341", "ss": 44.32, "se": 44.4, "shs": 28, "she": 28 }
  ]
}
| Field | Description |
|---|---|
| label | The word or punctuation character |
| c | Confidence score for this word (0.0 to 1.0) |
| a.s.id | Speaker ID for diarization (identifies which speaker said this word) |
| occs | Precise start and end time for this word |
Punctuation characters (periods, commas) have zero duration (ss equals se).
In by_detection_type, audio.speech_detailed detections are ordered chronologically.
audio.speech_detailed may not be available for all subscriptions. If unavailable, use audio.speech for transcript data.
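For example, the following sketch walks the word-level detections in chronological order and prints each word with its timing and confidence, skipping the zero-duration punctuation detections described above. It is a sketch under the field names shown in the sample JSON; core_metadata.json is a placeholder filename.

import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

# Walk word-level detections chronologically; skip zero-duration punctuation.
for det_id in metadata["detection_groupings"]["by_detection_type"].get("audio.speech_detailed", []):
    det = metadata["detections"][det_id]
    occ = det["occs"][0]
    if occ["ss"] == occ["se"]:
        continue  # punctuation character, zero duration
    print(f"{occ['ss']:7.2f}-{occ['se']:7.2f}s  c={det.get('c', 0.0):.2f}  {det['label']}")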
audio.speech_summary
An AI-generated summary of the video's speech content.
audio.speech_summary.keyword
Keywords extracted from the speech summary.
Downloading SRT/VTT Subtitles
Speech-to-text results are also available as a downloadable SRT subtitle file through the API:
curl "https://api-eu.valossa.com/core/1.0/job_results?api_key=YOUR_API_KEY&job_id=JOB_ID&type=speech_to_text_srt" \
  -o subtitles.srt
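The same download works from code. Below is a minimal Python sketch using the third-party requests library; YOUR_API_KEY and JOB_ID are placeholders, exactly as in the curl example.

import requests

# Fetch the SRT file for a finished job and save it to disk.
resp = requests.get(
    "https://api-eu.valossa.com/core/1.0/job_results",
    params={"api_key": "YOUR_API_KEY", "job_id": "JOB_ID", "type": "speech_to_text_srt"},
)
resp.raise_for_status()
with open("subtitles.srt", "wb") as f:
    f.write(resp.content)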
Example SRT Output
1
00:00:01,910 --> 00:00:04,420
hello James and Jolie
2
00:00:04,420 --> 00:00:08,120
you shouldn't go there you know
3
00:00:08,119 --> 00:00:13,639
I had no idea that this could work
please listen to me this is so fabulous
SRT files use Unix newlines (LF only, not CRLF).
The SRT content is generated from the audio.speech detections, so the information is equivalent whether you read the JSON or download the SRT.
SRT and VTT files can also be downloaded from the Valossa Portal results page.
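If you consume the SRT file programmatically, a minimal parser for well-formed output like the example above might look like the sketch below (standard library only; a sketch, not an official client). It relies on the LF-only newlines noted above to split cues on blank lines.

import re

TS = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})")

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(path):
    """Parse an SRT file into (start_seconds, end_seconds, text) tuples."""
    with open(path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")  # cues are separated by blank lines (LF only)
    cues = []
    for block in blocks:
        lines = block.split("\n")
        m = TS.match(lines[1])  # lines[0] is the cue number
        start, end = to_seconds(*m.groups()[:4]), to_seconds(*m.groups()[4:])
        cues.append((start, end, " ".join(lines[2:])))
    return cues

for start, end, text in parse_srt("subtitles.srt"):
    print(f"{start:8.3f} --> {end:8.3f}  {text}")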
Speech Keywords
Keywords extracted from speech are available in several detection types:
| Detection Type | Description |
|---|---|
| audio.keyword.novelty_word | Noteworthy or distinguishing keywords |
| audio.keyword.name.person | Person names mentioned in speech |
| audio.keyword.name.location | Location names |
| audio.keyword.name.organization | Organization names |
| audio.keyword.name.general | Other named entities |
| audio.keyword.compliance | Compliance-flagged words (for example: profanity, substance references) |
If a pre-existing transcript was provided, the equivalent types use the transcript.* prefix instead (e.g., transcript.keyword.novelty_word).
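Keyword detections follow the same JSON structure as other detections, so collecting them is a matter of iterating each type in by_detection_type. A small sketch (core_metadata.json is a placeholder filename):

import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

keyword_types = [
    "audio.keyword.novelty_word",
    "audio.keyword.name.person",
    "audio.keyword.name.location",
    "audio.keyword.name.organization",
    "audio.keyword.name.general",
    "audio.keyword.compliance",
]

# List the keyword labels found for each detection type.
by_type = metadata["detection_groupings"]["by_detection_type"]
for kw_type in keyword_types:
    labels = [metadata["detections"][i]["label"] for i in by_type.get(kw_type, [])]
    if labels:
        print(f"{kw_type}: {', '.join(labels)}")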
Diarization (Speaker Identification)
When available, audio.speech_detailed detections include a speaker ID in the a.s.id field. This allows you to separate speech by speaker.
Diarization is available:
- For all languages with the Automatic Captions and Speech Analysis subscription
- For da-DK, fi-FI, lt-LT, nb-NO, and sv-SE in other subscriptions that include speech recognition
Pre-Existing Transcripts
If you provide a pre-existing SRT transcript with your new_job request:
- Automatic speech-to-text is not performed
- Keywords are extracted from your transcript and appear as transcript.keyword.* detections
- audio.context detection still runs normally
- Transcript keyword timestamps are based on the SRT timestamps you provided
Code Example
Python: Extract Full Transcript
import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

# Get speech detections in order
speech_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech", [])

full_transcript = []
for det_id in speech_ids:
    detection = metadata["detections"][det_id]
    text = detection["label"]
    start = detection["occs"][0]["ss"] if "occs" in detection else 0
    full_transcript.append({"text": text, "start": start})

for segment in sorted(full_transcript, key=lambda x: x["start"]):
    minutes = int(segment["start"] // 60)
    seconds = segment["start"] % 60
    print(f"[{minutes:02d}:{seconds:05.2f}] {segment['text']}")
Python: Group Words by Speaker
# Group words by speaker using audio.speech_detailed
detailed_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech_detailed", [])

speakers = {}
for det_id in detailed_ids:
    det = metadata["detections"][det_id]
    if "a" in det and "s" in det["a"]:
        speaker_id = det["a"]["s"]["id"]
        speakers.setdefault(speaker_id, []).append(det["label"])

for speaker, words in speakers.items():
    print(f"Speaker {speaker}: {' '.join(words)}")
Related Resources
- Speech-to-Text Guide -- Complete workflow for speech analysis
- Sentiment & Emotion -- Speech valence and voice emotion
- Supported Languages -- Language availability for speech features