Speech and Transcription

Valossa AI generates speech-to-text results from the audio track of analyzed videos. The transcript data is available in multiple formats: as detections in the Core metadata JSON, or as downloadable subtitle/text result types such as SRT, WebVTT, TTML, and Avid TXT.

Speech Detection Types

audio.speech

Contains speech transcript segments, roughly corresponding to typical subtitle groupings (a few words or a sentence per detection).

{
  "t": "audio.speech",
  "label": "we profoundly believe that justice will win despite the looming challenges",
  "a": { "sen": { "val": 0.307 } },
  "occs": [
    { "id": "100", "ss": 12.5, "se": 16.8, "shs": 4, "she": 5 }
  ]
}

The label field contains the transcribed text. If speech sentiment analysis is enabled, the a.sen.val field contains the segment-level text sentiment score.

Field       Description
a.sen.val   Speech segment sentiment (-1.0 to 1.0), derived from the meaning of the transcript text, not from vocal tone

When present, a.sen.val is typically available for each audio.speech segment. This is separate from audio.voice_emotion, which measures how the voice sounds rather than what the words mean.

In by_detection_type, audio.speech detections are ordered by time (not by prominence).
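As a minimal sketch of reading segment sentiment, the loop below walks audio.speech detections via by_detection_type and pulls a.sen.val when present. The inline metadata dict is a stand-in for a loaded Core metadata JSON, with field names taken from the example above; the detection ID "77" is an arbitrary placeholder.

```python
# Sketch: list speech segments with their sentiment scores, assuming the
# Core metadata layout shown above (detections keyed by ID, plus
# detection_groupings.by_detection_type listing IDs per type).
metadata = {
    "detections": {
        "77": {
            "t": "audio.speech",
            "label": "we profoundly believe that justice will win",
            "a": {"sen": {"val": 0.307}},
            "occs": [{"id": "100", "ss": 12.5, "se": 16.8}],
        }
    },
    "detection_groupings": {"by_detection_type": {"audio.speech": ["77"]}},
}

for det_id in metadata["detection_groupings"]["by_detection_type"]["audio.speech"]:
    det = metadata["detections"][det_id]
    # a.sen.val is only present when speech sentiment analysis is enabled,
    # so fall back to None rather than assuming the key exists
    sentiment = det.get("a", {}).get("sen", {}).get("val")
    start = det["occs"][0]["ss"]
    print(f"{start:7.2f}s  sen={sentiment}  {det['label']}")
```

The chained .get() calls make the code safe for jobs where sentiment analysis was not enabled.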

audio.speech_detailed

Contains individual words with precise timestamps, confidence scores, and speaker diarization.

{
  "t": "audio.speech_detailed",
  "label": "stay",
  "c": 0.59,
  "a": { "s": { "id": "14" } },
  "occs": [
    { "id": "341", "ss": 44.32, "se": 44.4, "shs": 28, "she": 28 }
  ]
}

Field    Description
label    The word or punctuation character
c        Confidence score for this word (0.0 to 1.0)
a.s.id   Speaker ID for diarization (identifies which speaker said this word)
occs     Precise start and end time for this word

Punctuation characters (periods, commas) have zero duration (ss equals se).

In by_detection_type, audio.speech_detailed detections are ordered chronologically.

note

audio.speech_detailed may not be available for all subscriptions. If unavailable, use audio.speech for transcript data.
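Because punctuation tokens have zero duration (ss equals se), they can be recognized and glued to the preceding word when rebuilding readable text from word-level detections. A minimal sketch, using hand-written word dicts in place of real audio.speech_detailed occurrences:

```python
# Sketch: rebuild readable text from word-level detections, attaching
# zero-duration punctuation tokens (ss == se) to the preceding word.
# The words list below is illustrative sample data, not real API output.
words = [
    {"label": "stay", "ss": 44.32, "se": 44.4},
    {"label": "calm", "ss": 44.5, "se": 44.9},
    {"label": ".", "ss": 44.9, "se": 44.9},  # punctuation: zero duration
]

parts = []
for w in words:
    if w["ss"] == w["se"] and parts:
        parts[-1] += w["label"]  # glue punctuation to the previous word
    else:
        parts.append(w["label"])

sentence = " ".join(parts)
print(sentence)  # stay calm.
```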

audio.speech_summary

An AI-generated summary of the video's speech content.

audio.speech_summary.keyword

Keywords extracted from the speech summary.

Downloading Subtitle Formats

Speech-to-text results are also available as dedicated download formats through the API:

curl "https://api-eu.valossa.com/core/1.0/job_results?api_key=YOUR_API_KEY&job_id=JOB_ID&type=speech_to_text_srt" \
-o subtitles.srt

Result type               Format details
speech_to_text_srt        Standard SubRip text. Uses Unix newlines (LF).
speech_to_text_vtt        Standard WebVTT text output.
speech_to_text_ttml       TTML XML text output.
speech_to_text_avid_txt   Avid-compatible text output.
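The same request can be built in Python. The sketch below only constructs the job_results URL from the parameters shown in the curl example; YOUR_API_KEY and JOB_ID remain placeholders you must supply, and the actual download (e.g. with urllib.request.urlopen) is left out.

```python
from urllib.parse import urlencode

# Base endpoint from the curl example above
BASE = "https://api-eu.valossa.com/core/1.0/job_results"

def subtitle_url(api_key, job_id, result_type="speech_to_text_srt"):
    """Build the download URL for one of the speech-to-text result types."""
    query = urlencode({"api_key": api_key, "job_id": job_id, "type": result_type})
    return f"{BASE}?{query}"

url = subtitle_url("YOUR_API_KEY", "JOB_ID", "speech_to_text_vtt")
print(url)
```

urlencode also takes care of escaping, should an API key or job ID contain characters that are not URL-safe.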

Example SRT Output

1
00:00:01,910 --> 00:00:04,420
hello James and Jolie

2
00:00:04,420 --> 00:00:08,120
you shouldn't go there you know

3
00:00:08,119 --> 00:00:13,639
I had no idea that this could work
please listen to me this is so fabulous

SRT files use Unix newlines (LF only, not CRLF).

The SRT content is generated from the audio.speech detections, so the information is equivalent whether you read the JSON or download the SRT.

SRT, VTT, TTML, and Avid TXT files can also be downloaded from the Valossa Portal results page.
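If you prefer to work from the downloaded file rather than the JSON, the SRT format above is simple to parse: blank-line-separated cues, each with an index line, a timecode line, and one or more text lines. A minimal sketch, using the first two cues of the sample output:

```python
# Sketch: parse SRT cue blocks into (start, end, text) records.
# The SRT string mirrors the sample output above; real files come from
# the speech_to_text_srt download and use LF newlines.
SRT = """\
1
00:00:01,910 --> 00:00:04,420
hello James and Jolie

2
00:00:04,420 --> 00:00:08,120
you shouldn't go there you know
"""

def parse_timecode(tc):
    """Convert 'HH:MM:SS,mmm' to seconds as a float."""
    hours, minutes, rest = tc.split(":")
    seconds, millis = rest.split(",")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000

cues = []
for block in SRT.strip().split("\n\n"):
    lines = block.split("\n")
    start, end = lines[1].split(" --> ")
    cues.append({
        "start": parse_timecode(start),
        "end": parse_timecode(end),
        "text": "\n".join(lines[2:]),  # cue text may span multiple lines
    })

print(cues)
```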

Speech Keywords

Keywords extracted from speech are available in several detection types:

Detection type                     Description
audio.keyword.novelty_word         Noteworthy or distinguishing keywords
audio.keyword.name.person          Person names mentioned in speech
audio.keyword.name.location        Location names
audio.keyword.name.organization    Organization names
audio.keyword.name.general         Other named entities
audio.keyword.compliance           Compliance-flagged words (for example: profanity, substance references)

If a pre-existing transcript was provided, the equivalent types use the transcript.* prefix instead (e.g., transcript.keyword.novelty_word).
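Because every keyword type shares the audio.keyword. prefix (or transcript.keyword. for a provided transcript), all of them can be collected in one pass over by_detection_type. The inline metadata dict below is illustrative sample data shaped like the Core metadata JSON; the detection IDs are arbitrary placeholders.

```python
# Sketch: collect speech keywords grouped by detection type, assuming the
# detections / by_detection_type layout used elsewhere in the Core metadata.
metadata = {
    "detections": {
        "12": {"t": "audio.keyword.name.person", "label": "James"},
        "13": {"t": "audio.keyword.novelty_word", "label": "fabulous"},
    },
    "detection_groupings": {
        "by_detection_type": {
            "audio.keyword.name.person": ["12"],
            "audio.keyword.novelty_word": ["13"],
        }
    },
}

# Use "transcript.keyword." instead when a pre-existing transcript was provided
KEYWORD_PREFIX = "audio.keyword."

keywords = {}
for det_type, ids in metadata["detection_groupings"]["by_detection_type"].items():
    if det_type.startswith(KEYWORD_PREFIX):
        keywords[det_type] = [metadata["detections"][i]["label"] for i in ids]

print(keywords)
```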

Diarization (Speaker Identification)

When available, audio.speech_detailed detections include a speaker ID in the a.s.id field. This allows you to separate speech by speaker.

Diarization is available:

  • For all languages with the Automatic Captions and Speech Analysis subscription
  • For da-DK, fi-FI, lt-LT, nb-NO, and sv-SE in other subscriptions that include speech recognition

Pre-Existing Transcripts

If you provide a pre-existing SRT transcript with your new_job request:

  • Automatic speech-to-text is not performed
  • Keywords are extracted from your transcript and appear as transcript.keyword.* detections
  • audio.context detection still runs normally
  • Transcript keyword timestamps are based on the SRT timestamps you provided

Code Examples

Python: Extract Full Transcript

import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

# Get speech detections in chronological order
speech_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech", [])

full_transcript = []
for det_id in speech_ids:
    detection = metadata["detections"][det_id]
    text = detection["label"]
    # Guard against detections with no occurrences
    start = detection["occs"][0]["ss"] if detection.get("occs") else 0
    full_transcript.append({"text": text, "start": start})

for segment in sorted(full_transcript, key=lambda x: x["start"]):
    minutes = int(segment["start"] // 60)
    seconds = segment["start"] % 60
    print(f"[{minutes:02d}:{seconds:05.2f}] {segment['text']}")

Python: Group Words by Speaker

# Group words by speaker using audio.speech_detailed
detailed_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech_detailed", [])

speakers = {}
for det_id in detailed_ids:
    det = metadata["detections"][det_id]
    if "a" in det and "s" in det["a"]:
        speaker_id = det["a"]["s"]["id"]
        speakers.setdefault(speaker_id, []).append(det["label"])

for speaker, words in speakers.items():
    print(f"Speaker {speaker}: {' '.join(words)}")