Speech and Transcription

Valossa AI generates speech-to-text results from the audio track of analyzed videos. The transcript data is available in multiple formats: as detections in the Core metadata JSON, or as downloadable subtitle/text result types such as SRT, WebVTT, TTML, and Avid TXT.

Speech Detection Types

audio.speech

Contains speech transcript segments, roughly corresponding to typical subtitle groupings (a few words or a sentence per detection).

{
  "t": "audio.speech",
  "label": "we profoundly believe that justice will win despite the looming challenges",
  "a": { "sen": { "val": 0.307 } },
  "occs": [
    { "id": "100", "ss": 12.5, "se": 16.8, "shs": 4, "she": 5 }
  ]
}

The label field contains the transcribed text. If speech sentiment analysis is enabled, the a.sen.val field contains the segment-level text sentiment score.

Field       Description
a.sen.val   Speech segment sentiment (-1.0 to 1.0), derived from the meaning of the transcript text, not from vocal tone

When present, a.sen.val is typically available for each audio.speech segment. This is separate from audio.voice_emotion, which measures how the voice sounds rather than what the words mean.

In by_detection_type, audio.speech detections are ordered by time (not by prominence).
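As a minimal sketch of reading segment sentiment, the loop below walks audio.speech detections via by_detection_type and pulls a.sen.val when present. The inline metadata dict is a stand-in for a loaded Core metadata JSON, with field names taken from the example above; the detection ID "77" is an arbitrary placeholder.

```python
# Sketch: list speech segments with their sentiment scores, assuming the
# Core metadata layout shown above (detections keyed by ID, plus
# detection_groupings.by_detection_type listing IDs per type).
metadata = {
    "detections": {
        "77": {
            "t": "audio.speech",
            "label": "we profoundly believe that justice will win",
            "a": {"sen": {"val": 0.307}},
            "occs": [{"id": "100", "ss": 12.5, "se": 16.8}],
        }
    },
    "detection_groupings": {"by_detection_type": {"audio.speech": ["77"]}},
}

for det_id in metadata["detection_groupings"]["by_detection_type"]["audio.speech"]:
    det = metadata["detections"][det_id]
    # a.sen.val is only present when speech sentiment analysis is enabled,
    # so fall back to None rather than assuming the key exists
    sentiment = det.get("a", {}).get("sen", {}).get("val")
    start = det["occs"][0]["ss"]
    print(f"{start:7.2f}s  sen={sentiment}  {det['label']}")
```

The chained .get() calls make the code safe for jobs where sentiment analysis was not enabled.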

audio.speech_detailed

Contains individual words with precise timestamps, confidence scores, and speaker diarization.

{
  "t": "audio.speech_detailed",
  "label": "stay",
  "c": 0.59,
  "a": { "s": { "id": "14" } },
  "occs": [
    { "id": "341", "ss": 44.32, "se": 44.4, "shs": 28, "she": 28 }
  ]
}

Field    Description
label    The word or punctuation character
c        Confidence score for this word (0.0 to 1.0)
a.s.id   Speaker ID for diarization (identifies which speaker said this word)
occs     Precise start and end time for this word

Punctuation characters (periods, commas) have zero duration (ss equals se).

In by_detection_type, audio.speech_detailed detections are ordered chronologically.

note

audio.speech_detailed may not be available for all subscriptions. If unavailable, use audio.speech for transcript data.
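Because punctuation tokens have zero duration (ss equals se), they can be recognized and glued to the preceding word when rebuilding readable text from word-level detections. A minimal sketch, using hand-written word dicts in place of real audio.speech_detailed occurrences:

```python
# Sketch: rebuild readable text from word-level detections, attaching
# zero-duration punctuation tokens (ss == se) to the preceding word.
# The words list below is illustrative sample data, not real API output.
words = [
    {"label": "stay", "ss": 44.32, "se": 44.4},
    {"label": "calm", "ss": 44.5, "se": 44.9},
    {"label": ".", "ss": 44.9, "se": 44.9},  # punctuation: zero duration
]

parts = []
for w in words:
    if w["ss"] == w["se"] and parts:
        parts[-1] += w["label"]  # glue punctuation to the previous word
    else:
        parts.append(w["label"])

sentence = " ".join(parts)
print(sentence)  # stay calm.
```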

audio.speech_summary

An AI-generated summary of the video's speech content.

audio.speech_summary.keyword

Keywords extracted from the speech summary.

Downloading Subtitle Formats

Speech-to-text results are also available as dedicated download formats through the API:

curl "https://api-eu.valossa.com/core/1.0/job_results?api_key=YOUR_API_KEY&job_id=JOB_ID&type=speech_to_text_srt" \
-o subtitles.srt

Result type               Format details
speech_to_text_srt        Standard SubRip text. Uses Unix newlines (LF).
speech_to_text_vtt        Standard WebVTT text output.
speech_to_text_ttml       TTML XML text output.
speech_to_text_avid_txt   Avid-compatible text output.
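The same request can be built in Python. The sketch below only constructs the job_results URL from the parameters shown in the curl example; YOUR_API_KEY and JOB_ID remain placeholders you must supply, and the actual download (e.g. with urllib.request.urlopen) is left out.

```python
from urllib.parse import urlencode

# Base endpoint from the curl example above
BASE = "https://api-eu.valossa.com/core/1.0/job_results"

def subtitle_url(api_key, job_id, result_type="speech_to_text_srt"):
    """Build the download URL for one of the speech-to-text result types."""
    query = urlencode({"api_key": api_key, "job_id": job_id, "type": result_type})
    return f"{BASE}?{query}"

url = subtitle_url("YOUR_API_KEY", "JOB_ID", "speech_to_text_vtt")
print(url)
```

urlencode also takes care of escaping, should an API key or job ID contain characters that are not URL-safe.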

Example SRT Output

1
00:00:01,910 --> 00:00:04,420
hello James and Jolie

2
00:00:04,420 --> 00:00:08,120
you shouldn't go there you know

3
00:00:08,119 --> 00:00:13,639
I had no idea that this could work
please listen to me this is so fabulous

SRT files use Unix newlines (LF only, not CRLF).

The SRT content is generated from the audio.speech detections, so the information is equivalent whether you read the JSON or download the SRT.

SRT, VTT, TTML, and Avid TXT files can also be downloaded from the Valossa Portal results page.
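If you prefer to work from the downloaded file rather than the JSON, the SRT format above is simple to parse: blank-line-separated cues, each with an index line, a timecode line, and one or more text lines. A minimal sketch, using the first two cues of the sample output:

```python
# Sketch: parse SRT cue blocks into (start, end, text) records.
# The SRT string mirrors the sample output above; real files come from
# the speech_to_text_srt download and use LF newlines.
SRT = """\
1
00:00:01,910 --> 00:00:04,420
hello James and Jolie

2
00:00:04,420 --> 00:00:08,120
you shouldn't go there you know
"""

def parse_timecode(tc):
    """Convert 'HH:MM:SS,mmm' to seconds as a float."""
    hours, minutes, rest = tc.split(":")
    seconds, millis = rest.split(",")
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000

cues = []
for block in SRT.strip().split("\n\n"):
    lines = block.split("\n")
    start, end = lines[1].split(" --> ")
    cues.append({
        "start": parse_timecode(start),
        "end": parse_timecode(end),
        "text": "\n".join(lines[2:]),  # cue text may span multiple lines
    })

print(cues)
```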

Speech Keywords

Keywords extracted from speech are available in several detection types:

Detection type                     Description
audio.keyword.novelty_word         Noteworthy or distinguishing keywords
audio.keyword.name.person          Person names mentioned in speech
audio.keyword.name.location        Location names
audio.keyword.name.organization    Organization names
audio.keyword.name.general         Other named entities
audio.keyword.compliance           Compliance-flagged words (for example: profanity, substance references)

If a pre-existing transcript was provided, the equivalent types use the transcript.* prefix instead (e.g., transcript.keyword.novelty_word).
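Because every keyword type shares the audio.keyword. prefix (or transcript.keyword. for a provided transcript), all of them can be collected in one pass over by_detection_type. The inline metadata dict below is illustrative sample data shaped like the Core metadata JSON; the detection IDs are arbitrary placeholders.

```python
# Sketch: collect speech keywords grouped by detection type, assuming the
# detections / by_detection_type layout used elsewhere in the Core metadata.
metadata = {
    "detections": {
        "12": {"t": "audio.keyword.name.person", "label": "James"},
        "13": {"t": "audio.keyword.novelty_word", "label": "fabulous"},
    },
    "detection_groupings": {
        "by_detection_type": {
            "audio.keyword.name.person": ["12"],
            "audio.keyword.novelty_word": ["13"],
        }
    },
}

# Use "transcript.keyword." instead when a pre-existing transcript was provided
KEYWORD_PREFIX = "audio.keyword."

keywords = {}
for det_type, ids in metadata["detection_groupings"]["by_detection_type"].items():
    if det_type.startswith(KEYWORD_PREFIX):
        keywords[det_type] = [metadata["detections"][i]["label"] for i in ids]

print(keywords)
```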

Diarization (Speaker Identification)

When available, audio.speech_detailed detections include a speaker ID in the a.s.id field. This allows you to separate speech by speaker.

Diarization is available:

  • For all languages with the Automatic Captions and Speech Analysis subscription
  • For da-DK, fi-FI, lt-LT, nb-NO, and sv-SE in other subscriptions that include speech recognition

Pre-Existing Transcripts

If you provide a pre-existing SRT transcript with your new_job request:

  • Automatic speech-to-text is not performed
  • Keywords are extracted from your transcript and appear as transcript.keyword.* detections
  • audio.context detection still runs normally
  • Transcript keyword timestamps are based on the SRT timestamps you provided

Code Examples

Python: Extract Full Transcript

import json

with open("core_metadata.json", "r") as f:
    metadata = json.load(f)

# Get speech detections in chronological order
speech_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech", [])

full_transcript = []
for det_id in speech_ids:
    detection = metadata["detections"][det_id]
    text = detection["label"]
    # Guard against detections with no occurrences
    start = detection["occs"][0]["ss"] if detection.get("occs") else 0
    full_transcript.append({"text": text, "start": start})

for segment in sorted(full_transcript, key=lambda x: x["start"]):
    minutes = int(segment["start"] // 60)
    seconds = segment["start"] % 60
    print(f"[{minutes:02d}:{seconds:05.2f}] {segment['text']}")

Python: Group Words by Speaker

# Group words by speaker using audio.speech_detailed
detailed_ids = metadata["detection_groupings"]["by_detection_type"].get("audio.speech_detailed", [])

speakers = {}
for det_id in detailed_ids:
    det = metadata["detections"][det_id]
    if "a" in det and "s" in det["a"]:
        speaker_id = det["a"]["s"]["id"]
        speakers.setdefault(speaker_id, []).append(det["label"])

for speaker, words in speakers.items():
    print(f"Speaker {speaker}: {' '.join(words)}")