Detection Types

Every detection in Valossa Metadata has a t field containing its detection type identifier. Detection types follow a hierarchical naming convention where the prefix indicates the source modality.
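This makes type-based filtering straightforward. Below is a minimal Python sketch; the metadata.json path and the exact location of the detections inside the metadata file are assumptions for illustration, not part of this reference:

import json

# Load a Valossa Metadata file (the path is hypothetical).
with open("metadata.json") as f:
    metadata = json.load(f)

# ASSUMPTION: detections are stored as a dict of detection objects;
# adjust the lookup to match the actual file structure.
detections = list(metadata["detections"].values())

# Select every visual detection by its type prefix.
visual = [d for d in detections if d["t"].startswith("visual.")]

# Or match one exact type.
faces = [d for d in detections if d["t"] == "human.face"]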

Detection Type Reference

Visual Detections (visual.*)

Detected from the visual content of video frames.

| Type | Description | Has occs | Has cid |
| --- | --- | --- | --- |
| visual.context | Broad visual concept detection (objects, scenes, actions, explicit content) | Yes | Yes |
| visual.object.localized | Visual objects with bounding-box location data (currently logos). Requires seconds_objects or frames_objects metadata for coordinates. | Yes | Yes |
| visual.color | Dominant colors per second. Color values are in the by_second structure as RGB hex strings. | Single occurrence | No |
| visual.text_region.full_frame_analysis | OCR text detected from the full video frame | Yes | No |
| visual.text_region.lower_third | OCR text detected from the lower third of frames (the typical subtitle area) | Yes | No |
| visual.text_region.middle_third | OCR text detected from the middle third of frames | Yes | No |
| visual.text_region.upper_third | OCR text detected from the upper third of frames | Yes | No |
| visual.text_region.keyword.compliance | Compliance-flagged keywords from OCR text (e.g., profanity in visual text) | Yes | No |

Audio Detections (audio.*)

Detected from the audio track of the video.

| Type | Description | Has occs | Has cid |
| --- | --- | --- | --- |
| audio.context | Audio event detection (music, applause, laughter, environmental sounds) | Yes | Yes |
| audio.speech | Speech-to-text transcript segments (roughly corresponding to subtitle groupings) | Yes | No |
| audio.speech_detailed | Individual words with precise timestamps, confidence scores, and speaker diarization | Yes | No |
| audio.speech_detailed.stats | Statistics about the detailed speech analysis | No | No |
| audio.speech_summary | AI-generated summary of the speech content | No | No |
| audio.speech_summary.keyword | Keywords extracted from the speech summary | No | No |
| audio.voice_emotion | Voice emotion data (valence and arousal from voice prosodics). Data is in by_second. | Yes | No |
| audio.keyword.compliance | Compliance-flagged words from speech (profanity, substance references, etc.) | Yes | No |
| audio.keyword.novelty_word | Noteworthy or distinguishing keywords from speech | Yes | No |
| audio.keyword.name.person | Person names mentioned in speech | Yes | No |
| audio.keyword.name.location | Location names mentioned in speech | Yes | No |
| audio.keyword.name.organization | Organization names mentioned in speech | Yes | No |
| audio.keyword.name.general | Other named entities mentioned in speech | Yes | No |

Human Detections (human.*)

Human-centered detections, currently face-related only.

| Type | Description | Has occs | Has cid |
| --- | --- | --- | --- |
| human.face | Detected face with optional identity matching, gender, and screen time | Yes | No |
| human.face_group | Group of faces with temporal correlation (likely interacting people) | Yes | No |

Transcript Detections (transcript.*)

Derived from a user-provided pre-existing SRT transcript (not from automatic speech-to-text).

| Type | Description | Has occs | Has cid |
| --- | --- | --- | --- |
| transcript.keyword.compliance | Compliance-flagged words from the provided transcript | Yes | No |
| transcript.keyword.novelty_word | Noteworthy keywords from the provided transcript | Yes | No |
| transcript.keyword.name.person | Person names from the provided transcript | Yes | No |
| transcript.keyword.name.location | Location names from the provided transcript | Yes | No |
| transcript.keyword.name.organization | Organization names from the provided transcript | Yes | No |
| transcript.keyword.name.general | Other named entities from the provided transcript | Yes | No |

Topic Detections (topic.*)

Video-level or section-level topic classifications.

| Type | Description | Has occs | Has cid |
| --- | --- | --- | --- |
| topic.iab | IAB Content Taxonomy categories for the entire video | No | No |
| topic.iab.section | IAB categories for time-based sub-sections of the video (with optional Ad Score) | Yes | No |
| topic.general | Non-IAB topic categories for the entire video | No | No |
| topic.genre | Genre classification of the video | No | No |
| topic.iab.audio | (Deprecated) Audio-based IAB categories | No | No |
| topic.iab.transcript | (Deprecated) Transcript-based IAB categories | No | No |
| topic.iab.visual | (Deprecated) Visual-based IAB categories | No | No |

External Detections (external.*)

Derived from user-provided title and description text (submitted in the new_job request).

| Type | Description | Has occs | Has cid |
| --- | --- | --- | --- |
| external.keyword.novelty_word | Noteworthy keywords from the video title/description | No | No |
| external.keyword.name.person | Person names from the video title/description | No | No |
| external.keyword.name.location | Location names from the video title/description | No | No |
| external.keyword.name.organization | Organization names from the video title/description | No | No |
| external.keyword.name.general | Other named entities from the video title/description | No | No |

Highlight Detections (highlight)

Automatically identified highlight segments of the video.

| Type | Description | Has occs | Has cid |
| --- | --- | --- | --- |
| highlight | Highlight segments with a relevance score. Labels indicate the highlight category (e.g., "action"). | Yes | No |

Each occurrence includes a scr (score) field from 0.0 to 1.0, indicating how significant or interesting the segment is. See Occurrences for details.
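As a sketch, selecting only the strongest highlight segments could look like this (detections as in the earlier snippet; the 0.8 threshold is an arbitrary example value):

def top_highlights(detections, min_score=0.8):
    # Yield (start, end, score) for highlight occurrences scoring at least min_score.
    for d in detections:
        if d["t"] != "highlight":
            continue
        for occ in d.get("occs", []):
            if occ.get("scr", 0.0) >= min_score:
                yield occ["ss"], occ["se"], occ["scr"]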

Explicit Content (explicit_content.*) -- Deprecated

| Type | Description |
| --- | --- |
| explicit_content.audio.offensive | (Deprecated) Offensive audio content. Use audio.keyword.compliance instead. |
| explicit_content.transcript.offensive | (Deprecated) Offensive transcript content. Use transcript.keyword.compliance instead. |

Explicit visual content detections are now part of visual.context and are identified by category tags such as content_compliance, sexual, and violence.

Understanding Novelty Words

The term "novelty word" appears in several detection types (e.g., audio.keyword.novelty_word). A novelty word is a keyword or phrase detected as particularly relevant or distinguishing in the content. This established NLP term distinguishes these content-descriptive keywords from proper names (person, location, organization) which have their own detection types.

JSON Examples

visual.context Detection

{
  "t": "visual.context",
  "label": "hair",
  "cid": "lC4vVLdd5huQ",
  "ext_refs": {
    "wikidata": { "id": "Q28472" },
    "gkg": { "id": "/m/03q69" }
  },
  "categ": { "tags": ["human_features"] },
  "occs": [
    { "id": "267", "ss": 60.227, "se": 66.191, "c_max": 0.804, "c_med": 0.75, "shs": 47, "she": 48 }
  ]
}
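Since each occurrence carries ss/se start and end times in seconds, a detection's total screen time can be summed directly from its occurrences; a small helper as a sketch:

def total_duration(detection):
    # Sum occurrence durations (in seconds) across the whole video.
    return sum(occ["se"] - occ["ss"] for occ in detection.get("occs", []))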

human.face Detection

{
  "t": "human.face",
  "label": "face",
  "a": {
    "gender": { "c": 0.929, "value": "female" },
    "s_visible": 4.4,
    "similar_to": [
      {
        "c": 0.928,
        "name": "Jane Doe",
        "gallery": { "id": "a3ead7b4-8e84-43ac-9e6b-d1727b05f189" },
        "gallery_face": { "id": "f6a728c6-5991-47da-9c17-b5302bfd0aff", "name": "Jane Doe" }
      }
    ]
  },
  "occs": [
    { "id": "123", "ss": 28.333, "se": 33.567, "shs": 8, "she": 9 }
  ]
}
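Identity matching is optional, so the similar_to list may be absent and code should treat it as such. A sketch of picking the most confident identity candidate (the 0.9 threshold is an arbitrary example):

def best_identity(face_detection, min_confidence=0.9):
    # Return the name of the highest-confidence identity match, or None.
    candidates = face_detection.get("a", {}).get("similar_to", [])
    matches = [m for m in candidates if m["c"] >= min_confidence]
    return max(matches, key=lambda m: m["c"])["name"] if matches else None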

audio.context Detection

{
  "t": "audio.context",
  "label": "exciting music",
  "cid": "o7WLKO1GuL5r",
  "ext_refs": {
    "gkg": { "id": "/t/dd00035" }
  },
  "occs": [
    { "id": "8", "ss": 15.0, "se": 49.0, "shs": 14, "she": 29, "c_max": 0.979 }
  ]
}

audio.speech_detailed Detection

{
  "t": "audio.speech_detailed",
  "label": "stay",
  "c": 0.59,
  "a": { "s": { "id": "14" } },
  "occs": [
    { "id": "341", "ss": 44.32, "se": 44.4, "fs": 1064, "fe": 1066, "sdur": 0.08, "shs": 28, "she": 28 }
  ]
}

The a.s.id field contains the speaker ID for diarization purposes. Occurrences include fs/fe (frame start/end) and sdur (segment duration) for frame-accurate word timing.
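Combining the word label, the occurrence timing, and the a.s.id speaker ID makes it possible to rebuild a per-speaker word stream; a sketch, reusing the detections list from the first snippet:

from collections import defaultdict

def words_by_speaker(detections):
    # Collect (start_time, word) pairs per speaker ID, in temporal order.
    speakers = defaultdict(list)
    for d in detections:
        if d["t"] != "audio.speech_detailed":
            continue
        speaker_id = d["a"]["s"]["id"]
        for occ in d.get("occs", []):
            speakers[speaker_id].append((occ["ss"], d["label"]))
    for words in speakers.values():
        words.sort()
    return dict(speakers)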

topic.iab Detection

{
  "t": "topic.iab",
  "label": "Personal Finance",
  "ext_refs": {
    "iab": {
      "labels_hierarchy": ["Personal Finance"],
      "id": "IAB13"
    }
  }
}
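The labels_hierarchy list holds the IAB category path (top-level category first, as the name suggests), so a human-readable path can be produced with a simple join:

def iab_path(detection):
    # e.g. ["Personal Finance"] -> "Personal Finance"; deeper paths join with " > ".
    return " > ".join(detection["ext_refs"]["iab"]["labels_hierarchy"])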

Keyword Detection

{
  "t": "transcript.keyword.name.location",
  "label": "Chillsbury Hills",
  "occs": [
    { "ss": 109.075, "se": 110.975, "id": "460" }
  ]
}