Speech and Transcription

Valossa AI generates speech-to-text results from the audio track of analyzed videos. The transcript data is available in two forms: as detections in the Core metadata JSON, or as a downloadable SRT subtitle file.

Speech Detection Types

audio.speech

Contains speech transcript segments, roughly corresponding to typical subtitle groupings (a few words or a sentence per detection).

{
  "t": "audio.speech",
  "label": "we profoundly believe that justice will win despite the looming challenges",
  "a": { "sen": { "val": 0.307 } },
  "occs": [
    { "id": "100", "ss": 12.5, "se": 16.8, "shs": 4, "she": 5 }
  ]
}

The label field contains the transcribed text. If speech sentiment analysis is enabled, the a.sen.val field contains the valence score.

In by_detection_type, audio.speech detections are ordered by time (not by prominence).
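For example, the following sketch prints each speech segment with its valence score when sentiment analysis was enabled (assuming the Core metadata JSON has already been loaded into a metadata dict, as in the Code Example section at the end of this page):

# Print each speech segment with its sentiment valence, if present
for det_id in metadata["detection_groupings"]["by_detection_type"].get("audio.speech", []):
    det = metadata["detections"][det_id]
    valence = det.get("a", {}).get("sen", {}).get("val")
    if valence is not None:
        print(f"{valence:+.3f}  {det['label']}")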

audio.speech_detailed

Contains individual words with precise timestamps, confidence scores, and speaker diarization.

{
  "t": "audio.speech_detailed",
  "label": "stay",
  "c": 0.59,
  "a": { "s": { "id": "14" } },
  "occs": [
    { "id": "341", "ss": 44.32, "se": 44.4, "shs": 28, "she": 28 }
  ]
}
Field     Description
label     The word or punctuation character
c         Confidence score for this word (0.0 to 1.0)
a.s.id    Speaker ID for diarization (identifies which speaker said this word)
occs      Precise start and end time for this word

Punctuation characters (periods, commas) have zero duration (ss equals se).

In by_detection_type, audio.speech_detailed detections are ordered chronologically.
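Because punctuation tokens carry no duration, a word-level pass can skip them by comparing ss and se. A minimal sketch (again assuming metadata is the loaded Core metadata dict):

# Collect spoken words only, skipping zero-duration punctuation tokens
words = []
for det_id in metadata["detection_groupings"]["by_detection_type"].get("audio.speech_detailed", []):
    det = metadata["detections"][det_id]
    occ = det["occs"][0]
    if occ["ss"] != occ["se"]:  # punctuation tokens have ss == se
        words.append((occ["ss"], det["label"]))

print(" ".join(label for _, label in words))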

Note: audio.speech_detailed may not be available for all subscriptions. If unavailable, use audio.speech for transcript data.

audio.speech_summary

An AI-generated summary of the video's speech content.

audio.speech_summary.keyword

Keywords extracted from the speech summary.

Downloading SRT/VTT Subtitles

Speech-to-text results are also available as a downloadable SRT subtitle file through the API:

curl "https://api-eu.valossa.com/core/1.0/job_results?api_key=YOUR_API_KEY&job_id=JOB_ID&type=speech_to_text_srt" \
-o subtitles.srt
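The same request can be scripted. A minimal sketch using the requests library (YOUR_API_KEY and JOB_ID are placeholders, exactly as in the curl example):

import requests

# Request the speech-to-text results as an SRT file
resp = requests.get(
    "https://api-eu.valossa.com/core/1.0/job_results",
    params={
        "api_key": "YOUR_API_KEY",
        "job_id": "JOB_ID",
        "type": "speech_to_text_srt",
    },
)
resp.raise_for_status()

with open("subtitles.srt", "wb") as f:
    f.write(resp.content)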

Example SRT Output

1
00:00:01,910 --> 00:00:04,420
hello James and Jolie

2
00:00:04,420 --> 00:00:08,120
you shouldn't go there you know

3
00:00:08,119 --> 00:00:13,639
I had no idea that this could work
please listen to me this is so fabulous

SRT files use Unix newlines (LF only, not CRLF).

The SRT content is generated from the audio.speech detections, so the information is equivalent whether you read the JSON or download the SRT.

SRT and VTT files can also be downloaded from the Valossa Portal results page.
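If you work with the downloaded file rather than the JSON, a small parser is enough to recover the cues. A minimal sketch for the LF-delimited format shown above (parse_srt is an illustrative helper, not part of the Valossa API):

import re

TIMING = re.compile(r"(\d+):(\d+):(\d+),(\d+) --> (\d+):(\d+):(\d+),(\d+)")

def parse_srt(path):
    # Yield (start_seconds, end_seconds, text) for each subtitle cue
    with open(path, "r", encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")
    for block in blocks:
        lines = block.split("\n")
        if len(lines) < 2:
            continue
        m = TIMING.match(lines[1])  # lines[0] is the cue number
        if not m:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
        yield (h1 * 3600 + m1 * 60 + s1 + ms1 / 1000,
               h2 * 3600 + m2 * 60 + s2 + ms2 / 1000,
               "\n".join(lines[2:]))

for start, end, text in parse_srt("subtitles.srt"):
    print(f"{start:7.2f} {end:7.2f}  {text}")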

Speech Keywords

Keywords extracted from speech are available in several detection types:

Detection Type                     Description
audio.keyword.novelty_word         Noteworthy or distinguishing keywords
audio.keyword.name.person          Person names mentioned in speech
audio.keyword.name.location        Location names
audio.keyword.name.organization    Organization names
audio.keyword.name.general         Other named entities
audio.keyword.compliance           Compliance-flagged words (for example: profanity, substance references)

If a pre-existing transcript was provided, the equivalent types use the transcript.* prefix instead (e.g., transcript.keyword.novelty_word).
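Since the types differ only in prefix, client code can cover both cases in one pass. A sketch that collects novelty keywords from whichever prefix is present (assuming metadata is the loaded Core metadata dict):

# A pre-existing transcript yields transcript.* types instead of audio.*
by_type = metadata["detection_groupings"]["by_detection_type"]
novelty_ids = (by_type.get("audio.keyword.novelty_word", [])
               + by_type.get("transcript.keyword.novelty_word", []))
for det_id in novelty_ids:
    print(metadata["detections"][det_id]["label"])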

Diarization (Speaker Identification)

When available, audio.speech_detailed detections include a speaker ID in the a.s.id field. This allows you to separate speech by speaker.

Diarization is available:

  • For all languages with the Automatic Captions and Speech Analysis subscription
  • For da-DK, fi-FI, lt-LT, nb-NO, and sv-SE in other subscriptions that include speech recognition

Pre-Existing Transcripts

If you provide a pre-existing SRT transcript with your new_job request:

  • Automatic speech-to-text is not performed
  • Keywords are extracted from your transcript and appear as transcript.keyword.* detections
  • audio.context detection still runs normally
  • Transcript keyword timestamps are based on the SRT timestamps you provided

Code Example

Python: Extract Full Transcript

import json

# Load the Core metadata JSON produced by the analysis job
with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

# Get speech detections in order (audio.speech is ordered by time)
speech_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech", [])

full_transcript = []
for det_id in speech_ids:
    detection = metadata["detections"][det_id]
    text = detection["label"]
    # "ss" is the occurrence start time in seconds
    start = detection["occs"][0]["ss"] if "occs" in detection else 0
    full_transcript.append({"text": text, "start": start})

# Print the transcript as [MM:SS.ss] lines
for segment in sorted(full_transcript, key=lambda x: x["start"]):
    minutes = int(segment["start"] // 60)
    seconds = segment["start"] % 60
    print(f"[{minutes:02d}:{seconds:05.2f}] {segment['text']}")

Python: Group Words by Speaker

# Group words by speaker using audio.speech_detailed
# (continues from the previous example: metadata is already loaded)
detailed_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech_detailed", [])

speakers = {}
for det_id in detailed_ids:
    det = metadata["detections"][det_id]
    if "a" in det and "s" in det["a"]:
        speaker_id = det["a"]["s"]["id"]
        speakers.setdefault(speaker_id, []).append(det["label"])

for speaker, words in speakers.items():
    print(f"Speaker {speaker}: {' '.join(words)}")