
Frequently Asked Questions

Getting Started

How do I get an API key?

Log in to Valossa Portal, go to My Account → Subscriptions and API Keys. Your API key is listed there. Each subscription has its own API key — if you have multiple subscriptions, use the key that corresponds to the features you need.

Is there a free trial or sandbox environment?

Yes! Transcribe Pro Vision MAX is a self-serve subscription with a free trial — no sales call required. It gives you access to the Valossa API with a broad set of detection types (faces, speech, objects, emotions, moderation, IAB categories, OCR, and more).

Sign up at valossa.com/transcribe-video-to-text-ai → get a Portal account → copy your API key → start calling the API immediately.

The only features not included in the trial are scene-level IAB Ad Score and automatic video clip generation (Autopreview). See the full feature list for details.

Can I get started without contacting sales?

Yes — use Transcribe Pro Vision MAX. It's self-serve with instant API access. Sign up at valossa.com/transcribe-video-to-text-ai.

You only need to contact sales if you need: (1) high-volume pricing, (2) scene-level Ad Score for programmatic advertising, (3) Autopreview video clip generation, or (4) custom enterprise/on-premises deployment.

What features are available on the Transcribe Pro Vision MAX trial?

Almost everything — 34 of the 36 detection types are available, including:

  • Visual objects and scenes (visual.context)
  • Face detection, celebrity recognition, custom face galleries (human.face)
  • Speech-to-text, keywords, named entities (audio.speech, audio.keyword.*)
  • Content moderation — visual, audio, and speech (content_compliance category)
  • Emotion and sentiment — face valence, named emotions, voice emotion
  • IAB content categories (topic.iab, topic.general, topic.genre)
  • OCR visual text recognition (visual.text_region.*)
  • Speech-to-text in 23 languages

Not included: topic.iab.section with Ad Score (AdScout subscription), Autopreview clip generation.

How do I make my first API call?

See the Quickstart Guide — it walks you through submitting a video and reading results with curl, Python, and JavaScript in under 15 minutes.
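
If you want a feel for the flow before opening the guide, here is a minimal sketch in Python (using the requests package) that submits a video by URL. The request body field names (api_key, media, title) and the job_id response field are illustrative assumptions; the Quickstart and API reference define the exact schema.

  import requests

  BASE_URL = "https://api-eu.valossa.com/core/1.0"
  API_KEY = "YOUR_API_KEY"  # from Portal: My Account -> Subscriptions and API Keys

  # Illustrative request body; check the Quickstart for the exact schema.
  payload = {
      "api_key": API_KEY,
      "media": {
          "video": {"url": "https://example.com/video.mp4"},  # direct link to the file
          "title": "My first Valossa job",
      },
  }

  resp = requests.post(f"{BASE_URL}/new_job", json=payload, timeout=30)
  resp.raise_for_status()
  job_id = resp.json().get("job_id")  # response field name assumed
  print("Submitted job:", job_id)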


API & Jobs

What is the base URL for the API?

https://api-eu.valossa.com/core/1.0/

All API calls must use HTTPS.

How long does video analysis take?

Processing time depends on video length, resolution, and the AI features in your subscription. Typical turnaround is faster than real-time for standard content. Use the job_status endpoint to poll for completion — it returns a suggested_wait_s hint for when to poll again.
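
As a rough illustration, a polling loop in Python (requests package) that honors suggested_wait_s might look like this; the api_key and job_id query parameter names and the status values are assumptions, so check the API reference for the exact names.

  import time

  import requests

  BASE_URL = "https://api-eu.valossa.com/core/1.0"

  def wait_for_job(api_key: str, job_id: str) -> dict:
      """Poll job_status until the job finishes, sleeping as the API suggests."""
      while True:
          resp = requests.get(
              f"{BASE_URL}/job_status",
              params={"api_key": api_key, "job_id": job_id},  # parameter names assumed
              timeout=30,
          )
          resp.raise_for_status()
          status = resp.json()
          if status.get("status") in ("finished", "error"):  # status values assumed
              return status
          # Wait for the interval the API suggests before polling again.
          time.sleep(status.get("suggested_wait_s", 60))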

Can I be notified when analysis is complete instead of polling?

Webhook callbacks are supported for some customer configurations. Contact Valossa to enable callback URLs for your account. Polling via job_status is the recommended default approach.

How do I submit a video?

Two methods (a new_job sketch follows the list):

  1. URL download — provide a direct link to a video file in the new_job request. We support HTTP/HTTPS, Google Drive, Dropbox, and (on request) AWS S3.
  2. File upload — use the 3-step upload API, which generates a valossaupload:// URL to reference in your new_job call.
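
With either method the new_job request itself looks the same; only the URL you reference changes. A hedged sketch (the media/video/url field names are illustrative, as in the Quickstart example above):

  # Method 1: direct download URL
  media = {"video": {"url": "https://example.com/video.mp4"}}

  # Method 2: URL returned by the 3-step upload API
  media = {"video": {"url": "valossaupload://your-uploaded-file-reference"}}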

What happens to my video files after analysis?

Valossa stores your job results temporarily. Use job_results to download and store metadata in your own system promptly. Use delete_job to remove a job and its associated assets when no longer needed. Specific data retention periods depend on your subscription agreement.
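
For example, once a job has finished you might archive the core metadata to your own storage and then remove the job from Valossa. This is a sketch in Python (requests package); the parameter names and the HTTP method for delete_job are assumptions, so verify them against the API reference.

  import json

  import requests

  BASE_URL = "https://api-eu.valossa.com/core/1.0"

  def archive_and_delete(api_key: str, job_id: str) -> None:
      # Download the core metadata and store it on your side.
      resp = requests.get(
          f"{BASE_URL}/job_results",
          params={"api_key": api_key, "job_id": job_id},  # parameter names assumed
          timeout=60,
      )
      resp.raise_for_status()
      with open(f"{job_id}_core.json", "w", encoding="utf-8") as f:
          json.dump(resp.json(), f)

      # Remove the job and its assets once the metadata is safely archived.
      requests.get(
          f"{BASE_URL}/delete_job",
          params={"api_key": api_key, "job_id": job_id},  # HTTP method and params assumed
          timeout=30,
      ).raise_for_status()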


Input Formats & Limits

What video formats are supported?

Most common formats work, including MP4 (H.264), MPEG, AVI, FLV, and WebM. MP4/H.264 is the most reliable format.

What are the file size and duration limits?

  • Maximum file size: 7 GB
  • Maximum video duration: 5 hours
  • Maximum vertical resolution: 4096 pixels
  • Maximum transcript (SRT) file size: 5 MB

Can I analyze audio-only files?

The API is primarily designed for video files. Contact Valossa if you have specific audio-only use cases.

Should I provide my own SRT transcript?

Generally no — providing a pre-existing SRT transcript disables automatic speech-to-text and restricts results to transcript-based keyword detections. Only provide a transcript if your content has pre-existing, high-quality subtitles and you explicitly want them used for keyword analysis.


Detection & Metadata

What detection types does Valossa support?

36 detection types across these categories:

  • Visual: visual.context (objects, scenes), visual.color, visual.text_region.* (OCR), visual.object.localized
  • Audio: audio.context (sounds), audio.speech, audio.speech_detailed, audio.voice_emotion
  • Human: human.face, human.face_group
  • Keywords: audio.keyword.name.person, audio.keyword.novelty_word, audio.keyword.compliance
  • Topics: topic.iab, topic.iab.section, topic.general, topic.genre
  • External: external.keyword.* (from title/description you provide)
  • Transcript: transcript.keyword.* (if you provide an SRT file)

See Detection Types for the full reference.

What is the difference between by_second and occurrences?

  • by_second answers: "What is detected at time X?" — indexed by second, shows everything present at that moment.
  • Occurrences (occs) answers: "When does detection Y appear?" — lists all time ranges where a specific thing was detected. Both views are sketched below.
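
A rough sketch of the two access patterns (the grouping paths and the occurrence field names such as start_second/end_second are simplified assumptions; the metadata reference documents the real layout):

  def what_is_at(metadata: dict, second: int) -> list:
      """by_second view: everything detected at a given second (path assumed)."""
      return metadata["by_second"][second]

  def when_does_it_appear(detection: dict) -> list:
      """Occurrences view: all time ranges for one detection (field names assumed)."""
      return [(occ["start_second"], occ["end_second"]) for occ in detection["occs"]]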

What does the confidence score mean?

Confidence ranges from 0.0 to 1.0 and indicates how certain the AI is about a detection. Only detections above 0.5 confidence are included in the metadata. Higher is more certain.

How are detections ordered in by_detection_type?

By relevance/prominence — the most dominant detections appear first. Exception: audio.speech and audio.speech_detailed are ordered chronologically instead.

How do I find a specific concept like "cat"?

Loop over visual.context detections in by_detection_type and match against any of the following (a sketch follows the list):

  • label field (human-readable): "cat"
  • cid field (Valossa Concept ID): 02g28TYU3dMt
  • ext_refs.wikidata.id: Q146
  • ext_refs.gkg.id: /m/01yrx
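
A minimal sketch of that loop in Python, assuming the core metadata JSON has been downloaded via job_results and that by_detection_type exposes visual.context detections with the label, cid and ext_refs fields named above (the exact nesting is documented in the metadata reference):

  import json

  with open("job_core.json", encoding="utf-8") as f:  # core metadata from job_results
      metadata = json.load(f)

  # Grouping path simplified; adjust to the layout in the metadata reference.
  for det in metadata["by_detection_type"].get("visual.context", []):
      wikidata_id = det.get("ext_refs", {}).get("wikidata", {}).get("id")
      gkg_id = det.get("ext_refs", {}).get("gkg", {}).get("id")
      if (
          det.get("label") == "cat"
          or det.get("cid") == "02g28TYU3dMt"
          or wikidata_id == "Q146"
          or gkg_id == "/m/01yrx"
      ):
          print("Found a cat detection:", det.get("label"))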

What metadata types can I download?

  • Core metadata (type=core, default): all detections, groupings, topics
  • Face bounding boxes (type=frames_faces): per-frame face coordinates
  • Object bounding boxes per second (type=seconds_objects): per-second object coordinates
  • Object bounding boxes per frame (type=frames_objects): per-frame object coordinates
  • Speech subtitles (type=speech_to_text_srt): SRT subtitle file
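
For example, to fetch the speech-to-text subtitles instead of the default core metadata, pass type=speech_to_text_srt to job_results. In this Python sketch (requests package) the api_key and job_id parameter names are assumptions; the type values are the ones listed above.

  import requests

  BASE_URL = "https://api-eu.valossa.com/core/1.0"

  resp = requests.get(
      f"{BASE_URL}/job_results",
      params={
          "api_key": "YOUR_API_KEY",  # parameter name assumed
          "job_id": "YOUR_JOB_ID",  # parameter name assumed
          "type": "speech_to_text_srt",  # or frames_faces, seconds_objects, frames_objects
      },
      timeout=60,
  )
  resp.raise_for_status()
  with open("speech.srt", "wb") as f:
      f.write(resp.content)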

Languages

What languages are supported for speech-to-text?

23 languages, including English (en-US), German (de-DE), French (fr-FR), Spanish (es-ES), Finnish (fi-FI), Swedish (sv-SE), Dutch (nl-NL), Italian (it-IT), Portuguese (pt-PT, pt-BR), Ukrainian (uk-UA), Danish (da-DK), Norwegian (nb-NO), and more. See Supported Languages for the full matrix.

Is English required for all features?

No, but English has the broadest feature support (speech keywords, IAB v2.2, content compliance OCR). Most other languages support speech-to-text and IAB categories derived from visual analysis only.


Face Recognition

How does celebrity recognition work?

Valossa maintains a celebrities gallery with thousands of face identities. When a face is detected and matches someone in the gallery, the similar_to field in the human.face detection contains the person's name and confidence.
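
As a sketch of reading recognized identities from the core metadata (the grouping path and field names are simplified assumptions; see the metadata reference for the exact structure):

  def recognized_people(metadata: dict) -> list:
      """Collect (name, confidence) pairs from human.face detections."""
      people = []
      for det in metadata["by_detection_type"].get("human.face", []):  # path assumed
          for match in det.get("similar_to", []):
              people.append((match.get("name"), match.get("confidence")))
      return people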

Can I add my own custom faces?

Yes — use the Face Training API to create custom face galleries and upload training images. Custom faces appear in the same similar_to field as celebrities.

Can I have multiple face galleries?

Yes. You can have one default gallery and multiple named non-default galleries. Specify the gallery UUID in the new_job request to use a non-default gallery.
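
The field for selecting a gallery is documented with the new_job request; purely as an illustration (the face_gallery_id field name below is hypothetical):

  # Hypothetical field name; check the new_job reference for the real one.
  payload = {
      "api_key": "YOUR_API_KEY",
      "media": {"video": {"url": "https://example.com/video.mp4"}},
      "face_gallery_id": "123e4567-e89b-12d3-a456-426614174000",  # your gallery UUID
  }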


Content Moderation

How do I detect inappropriate content?

Check for detections with the content_compliance category tag. This includes visual, speech, and audio-based detections. See the Content Moderation Guide.
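
A sketch of filtering for moderation-related detections in the core metadata; the grouping path and the category-tag field name ("categories") are assumptions, so check the metadata reference for the real layout.

  def compliance_detections(metadata: dict) -> list:
      """Return (detection_type, label) pairs tagged with content_compliance."""
      flagged = []
      for det_type, detections in metadata["by_detection_type"].items():  # path assumed
          for det in detections:
              if "content_compliance" in det.get("categories", []):  # field name assumed
                  flagged.append((det_type, det.get("label")))
      return flagged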

What types of sensitive content are detected?

24 sensitive categories including: explicit_content, violence, act_of_violence, threat_of_violence, gun_weapon, substance_use, injury, and more. See Detection Categories.