video to prompt · vision LLM extractor

Video to Prompt: Extract a Reusable Prompt From Any Reference Video

Drop a clip. Get a structured prompt. Regenerate it across Runway, Veo, Kling and Sora-tier models, all from one $19.99/mo account.

Powered by GPT vision · Claude vision · Gemini. Output ports to Runway Gen-3, Veo 3, Kling 2.0 and Sora-tier models.

TL;DR: Video to prompt uses frontier vision LLMs to read a reference clip and return a structured generative recipe: Subject, Action, Camera, Lighting, Style, Pacing, Audio, Negative. Inside ZeroTwo, the same video to prompt extraction routes through GPT vision, Claude vision, and Gemini, and the resulting prompt runs across 60+ models including Runway, Veo, and Kling from one workspace.

What "video to prompt" actually means

Video to prompt is the inverse of text-to-video. Captioning gives you a sentence; video to prompt gives you a recipe. The output is structured key-value text (the same shape a generative video model expects as input), so you can paste it back into Runway Gen-3, Veo 3, Kling 2.0, or Sora-tier models and recreate the aesthetic with controlled variation.

Why it matters: the global AI video market is forecast to grow from $1.42B in 2024 to roughly $14.5B by 2032 (Grand View Research / Verified Market Research). The teams shipping fastest are not writing prompts from scratch; they are reverse-engineering reference clips and porting the recipe across models.

The 8-field structured template

Copy this schema. It is the canonical output of every ZeroTwo video-to-prompt extraction and the input shape every modern video model accepts (with minor per-model syntax tweaks covered below).

{
  "Subject":  "lone surfer in black wetsuit",
  "Action":   "paddling across glassy swell, head turning right",
  "Camera":   "low aerial drone, slow dolly-forward, 24mm equiv",
  "Lighting": "golden-hour backlight, warm rim on shoulders",
  "Style":    "cinematic, shallow color grade, film grain 6%",
  "Pacing":   "4s clip, single continuous take, 24fps",
  "Audio":    "ambient ocean wash, low wind, no music",
  "Negative": "no text, no logos, no second figure"
}
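
Before an extraction is sent on to a video model, it can help to confirm that all eight fields arrived and are non-empty. The sketch below is one minimal way to do that in Python, assuming the extraction comes back as a JSON object shaped like the schema above. The helper names `parse_recipe` and `to_key_value_text` are hypothetical and are not part of any ZeroTwo API.

```python
import json

# Field names from the schema above, in display order.
FIELDS = ("Subject", "Action", "Camera", "Lighting",
          "Style", "Pacing", "Audio", "Negative")


def parse_recipe(raw: str) -> dict:
    """Parse an extraction and check it carries exactly the eight fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in FIELDS if f not in data]
    extra = [k for k in data if k not in FIELDS]
    if missing or extra:
        raise ValueError(f"missing={missing} unexpected={extra}")
    blank = [f for f in FIELDS
             if not isinstance(data[f], str) or not data[f].strip()]
    if blank:
        raise ValueError(f"blank or non-string fields: {blank}")
    # Return the fields in schema order, whatever order the input used.
    return {f: data[f] for f in FIELDS}


def to_key_value_text(recipe: dict) -> str:
    """Render the recipe as 'Field: value' lines, one per field."""
    return "\n".join(f"{f}: {recipe[f]}" for f in FIELDS)


raw = """{
  "Subject":  "lone surfer in black wetsuit",
  "Action":   "paddling across glassy swell, head turning right",
  "Camera":   "low aerial drone, slow dolly-forward, 24mm equiv",
  "Lighting": "golden-hour backlight, warm rim on shoulders",
  "Style":    "cinematic, shallow color grade, film grain 6%",
  "Pacing":   "4s clip, single continuous take, 24fps",
  "Audio":    "ambient ocean wash, low wind, no music",
  "Negative": "no text, no logos, no second figure"
}"""

recipe = parse_recipe(raw)
print(to_key_value_text(recipe))
```

Rejecting unexpected keys as well as missing ones is a deliberate choice here: it keeps a malformed extraction from silently reaching a paid generation call.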

Three worked examples

Cinematic: 4s drone over coastline

Reference: aerial pull-back over a black-sand beach at sunset, single surfer paddling out, no cuts.

Subject: lone surfer in black wetsuit on longboard
Action: paddling out over glassy swell
Camera: drone, slow pull-back, 35mm equiv, 24fps
Lighting: golden-hour backlight, warm rim light
Style: cinematic teal-orange grade, mild film grain
Pacing: 4 seconds, single continuous take
Audio: ambient surf, soft wind
Negative: no on-screen text, no second person

What each model renders:

  • Runway: Gen-3 reads the camera move strongly; map 'drone slow pull-back' to camera_motion=dolly_back, motion_strength=4.
  • Veo: Veo 3 honors lighting + audio fields natively; pass Audio verbatim and it generates the soundscape.
  • Kling: Kling 2.0 prefers shorter motion verbs; condense to 'paddling, slow drone pull-back, golden hour'.

Social/UGC: handheld iPhone selfie walk

Reference: vertical 9:16 selfie clip walking through Tokyo Shibuya at dusk, neon reflections, talking to camera.

Subject: young woman, casual streetwear, holding phone selfie-style
Action: walking forward, talking to camera, occasional smile
Camera: handheld iPhone selfie, 9:16, slight bob
Lighting: neon ambient, mixed magenta + cyan, dusk
Style: raw UGC, no color grade, mild lens flare
Pacing: 6 seconds, one take
Audio: city ambience, distant traffic, no music
Negative: no cinematic blur, no shallow DOF

What each model renders:

  • Runway: Gen-3 needs explicit aspect_ratio=9:16 and motion_strength=2 to keep the handheld bob subtle.
  • Veo: Veo 3 will add diegetic city sound from the Audio line, which works well for UGC realism.
  • Kling: Kling 2.0 excels at human face fidelity; weight Subject + Action highest in the prompt.

Animated: Studio Ghibli-style forest clip

Reference: hand-drawn animated forest, sunbeams through leaves, small fox padding through ferns, painterly.

Subject: small red fox with white-tipped tail
Action: padding slowly forward through ferns, ears twitching
Camera: static medium shot, subtle parallax on background
Lighting: sun shafts through canopy, dust motes
Style: Studio Ghibli-inspired hand-drawn animation, painterly bg
Pacing: 5 seconds, looping-friendly
Audio: soft forest ambience, distant birdsong
Negative: no humans, no photoreal textures, no text

What each model renders:

  • Runway: Gen-3 Alpha Turbo handles painterly styles with style_strength=7 and a clean Negative line.
  • Veo: Veo 3 will lean photoreal by default; prepend 'illustrated, hand-drawn, 2D' to lock the style.
  • Kling: Kling 2.0 has strong anime priors; set style_reference='ghibli' if available in your account.

How to port the prompt across video models

Every modern video generator consumes the same eight fields, but the syntax differs. This table maps each field to the conventions used by Runway Gen-3, Veo 3, Kling 2.0, and Sora-tier endpoints so the recipe travels cleanly.

| Field | Runway Gen-3 | Veo 3 | Kling 2.0 | Sora-tier |
|---|---|---|---|---|
| Subject | subject token, weight 1.0 | natural language, no special syntax | subject first, weight first 8 tokens | natural language, scene-first |
| Camera | camera_motion + motion_strength (1–10) | natural verbs ('slow dolly in') | camera_movement enum | cinematic verbs in prose |
| Lighting | appended descriptor | first-class field in prompt | ambient descriptors | scene-grounded prose |
| Style | style_strength (1–10) | style reference text | style_reference enum | in-line style cues |
| Pacing | duration param (5/10s) | duration param (8s typ.) | duration enum | duration param |
| Audio | n/a (silent) | native, generated | limited | native, generated |
| Aspect | 16:9 / 9:16 / 1:1 | 16:9 / 9:16 | 16:9 / 9:16 / 1:1 | 16:9 / 9:16 / 1:1 |
| Negative | negative_prompt field | inline 'avoid …' | negative_prompt field | inline 'avoid …' |
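
The Audio and Negative rows of the table can be sketched in code. The Python below is an illustration under stated assumptions, not any vendor's API: the target labels (`runway`, `veo`, `kling`, `sora`), the function names, and the output keys `prompt` and `negative_prompt` are all hypothetical. It places negatives in a separate field for the Runway-style and Kling-style targets and folds them into an inline "Avoid ..." clause for the Veo-style and Sora-tier targets. It passes Audio only to the targets the table lists as generating sound natively, and treats Kling's "limited" audio as omitted.

```python
# Groupings read off the porting table; the lowercase target names are
# hypothetical labels for this sketch, not vendor identifiers.
NEGATIVE_FIELD_TARGETS = {"runway", "kling"}  # table: negative_prompt field
INLINE_AVOID_TARGETS = {"veo", "sora"}        # table: inline 'avoid ...'
AUDIO_TARGETS = {"veo", "sora"}               # table: audio is native, generated

PROMPT_FIELDS = ("Subject", "Action", "Camera", "Lighting", "Style", "Pacing")


def negative_terms(negative: str) -> list:
    """Turn 'no text, no logos' into ['text', 'logos']."""
    terms = []
    for part in negative.split(","):
        term = part.strip()
        if term.lower().startswith("no "):
            term = term[3:].strip()
        if term:
            terms.append(term)
    return terms


def port_recipe(recipe: dict, target: str) -> dict:
    """Build a hypothetical per-target payload from one eight-field recipe."""
    if target not in NEGATIVE_FIELD_TARGETS | INLINE_AVOID_TARGETS:
        raise ValueError(f"unknown target: {target}")
    parts = [recipe[f] for f in PROMPT_FIELDS]
    if target in AUDIO_TARGETS:
        parts.append("audio: " + recipe["Audio"])
    prompt = ". ".join(parts)
    terms = negative_terms(recipe["Negative"])
    if target in INLINE_AVOID_TARGETS:
        # No separate negative field: fold the terms into the prose.
        if terms:
            prompt += ". Avoid " + ", ".join(terms)
        return {"prompt": prompt}
    return {"prompt": prompt, "negative_prompt": ", ".join(terms)}


recipe = {
    "Subject": "lone surfer in black wetsuit",
    "Action": "paddling across glassy swell, head turning right",
    "Camera": "low aerial drone, slow dolly-forward, 24mm equiv",
    "Lighting": "golden-hour backlight, warm rim on shoulders",
    "Style": "cinematic, shallow color grade, film grain 6%",
    "Pacing": "4s clip, single continuous take, 24fps",
    "Audio": "ambient ocean wash, low wind, no music",
    "Negative": "no text, no logos, no second figure",
}

for target in ("runway", "veo"):
    print(target, port_recipe(recipe, target))
```

Numeric controls such as motion_strength and style_strength are left out on purpose. Mapping a prose camera move to a slider value is a judgment call that the table does not specify.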

Stop reverse-engineering by hand.

Upload a clip in ZeroTwo and the vision LLM does the structured extraction in roughly eight seconds, then runs the prompt across Runway, Veo, Kling, and Sora-tier models in the same tab.

Why a structured prompt beats a caption

A caption tells you what is on screen. A structured prompt tells the model how to put it back. The difference is controllability: with eight bounded fields, you can edit Camera without disturbing Subject, swap Style without losing Pacing, and harden Negative to kill artifacts. That is the kind of surgical iteration that flat captions do not enable.
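
The one-axis-at-a-time idea can be made concrete with an immutable record. This is a minimal Python sketch, assuming the recipe is held as a frozen dataclass. The `Recipe` class, the `changed_fields` helper, and the replacement camera value are illustrative choices, not part of the product.

```python
from dataclasses import dataclass, replace, asdict


@dataclass(frozen=True)
class Recipe:
    subject: str
    action: str
    camera: str
    lighting: str
    style: str
    pacing: str
    audio: str
    negative: str


def changed_fields(before: Recipe, after: Recipe) -> list:
    """Names of the fields whose values differ between two recipes."""
    a, b = asdict(before), asdict(after)
    return [name for name in a if a[name] != b[name]]


base = Recipe(
    subject="lone surfer in black wetsuit",
    action="paddling across glassy swell, head turning right",
    camera="low aerial drone, slow dolly-forward, 24mm equiv",
    lighting="golden-hour backlight, warm rim on shoulders",
    style="cinematic, shallow color grade, film grain 6%",
    pacing="4s clip, single continuous take, 24fps",
    audio="ambient ocean wash, low wind, no music",
    negative="no text, no logos, no second figure",
)

# Edit one axis; replace() copies every other field unchanged.
# The new camera value is a made-up example edit.
static_cam = replace(base, camera="static tripod shot, 50mm equiv")
print(changed_fields(base, static_cam))  # ['camera']
```

Because the dataclass is frozen, the base recipe cannot be mutated by accident, so each variant stays comparable against the same starting point.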

"Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world." (OpenAI, "Video generation models as world simulators")

That "world simulator" framing is exactly why structured prompts work: video models infer physics, lighting, and camera optics from your fields and fill in the rest. Underspecify and you get drift; over-specify and you get rigidity. The 8-field schema is the sweet spot.

Vision models doing the extraction

GPT vision

Best at scene composition, camera language, color grading vocabulary.

Claude vision

Strongest at long-form structure and nuanced negative-prompt construction.

Gemini multimodal

Native long-video context window; ideal for clips over 30 seconds.

All three are wired into ZeroTwo. Compare extractions side-by-side, ship the tightest one. Background: OpenAI Sora, Google Veo 3, Runway Gen-3 Alpha, Stable Video Diffusion paper.

Frequently asked questions

What does 'video to prompt' mean?
Video to prompt is the inverse of text-to-video: you upload a reference clip and a vision LLM returns a structured text prompt (Subject, Action, Camera, Lighting, Style, Pacing, Audio, Negative) that you can paste into Runway, Veo, Kling, or any other video generator to recreate the look.
How is video to prompt different from video captioning?
Captioning produces one sentence describing what is on screen. Video to prompt produces a structured, multi-field generative recipe with camera moves, lens choice, lighting, color grade, pacing, and a negative list: the actual inputs a video model needs to recreate the clip.
Which vision models extract the prompt inside ZeroTwo?
ZeroTwo routes extraction through GPT vision, Claude vision, and Gemini multimodal: three frontier vision LLMs in one tab. You can compare extractions side-by-side and pick whichever schema is tightest for your reference clip.
Can I extract a prompt from a YouTube link?
Yes. Paste a YouTube URL inside ZeroTwo's AI video generator and the platform pulls representative frames, runs the multimodal pass, and returns the 8-field prompt, ready to send straight to Runway, Veo, or Kling.
Will the same prompt work in Runway, Veo, and Kling?
The fields are portable; the syntax differs. Runway uses motion_strength and style_strength sliders, Veo accepts natural-language camera verbs and generates audio natively, and Kling prefers short motion verbs. The porting table on this page maps each field across models.
Do I need separate accounts for each video model?
No. ZeroTwo bundles 60+ frontier models, including the major video generators, under one $19.99/mo account, so you extract once and regenerate across models from the same workspace.
How long is the typical extraction?
Vision LLMs return the structured prompt in roughly eight seconds for a clip under 30 seconds. Longer reference videos are sampled at keyframes to keep latency and cost in check.
Can I edit the extracted prompt before generating?
Yes. The output is plain JSON or key-value text, and every field is editable. Tweak the Camera line, swap the Style, harden the Negative, then send it straight to the video model.
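
The answer on extraction time mentions that longer clips are sampled at keyframes. The page does not say how frames are chosen, so the Python sketch below assumes the simplest scheme: uniform sampling at the midpoint of equal time slices. The function name `keyframe_indices` and the default of eight frames are illustrative assumptions.

```python
def keyframe_indices(duration_s: float, fps: float, n: int = 8) -> list:
    """Pick n frame indices at the midpoints of n equal slices of the clip."""
    total = int(duration_s * fps)  # whole frames in the clip
    if total <= 0 or n <= 0:
        raise ValueError("clip must contain frames and n must be positive")
    n = min(n, total)  # never ask for more frames than exist
    return [int((i + 0.5) * total / n) for i in range(n)]


# A 4-second clip at 24fps has 96 frames, so each slice is 12 frames wide.
print(keyframe_indices(4, 24))  # [6, 18, 30, 42, 54, 66, 78, 90]
```

Sampling slice midpoints rather than slice starts avoids always picking frame 0, which in many clips is a fade-in or a black frame. A production pipeline might instead detect scene changes, which this sketch does not attempt.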

Key takeaways

  • Video to prompt returns a recipe, not a caption: eight bounded fields you can edit and ship.
  • Vision LLMs (GPT, Claude, Gemini) do the extraction; ZeroTwo lets you compare all three.
  • The schema ports across Runway Gen-3, Veo 3, Kling 2.0, and Sora-tier with minor syntax tweaks.
  • Structured fields beat captions because they let you iterate one axis at a time.
  • One $19.99/mo ZeroTwo account unlocks 60+ models including every major video generator.

ZeroTwo Multimodal Team

Practitioners shipping multi-model AI video workflows. Published 2026-05-03 · Updated 2026-05-03.

$19.99/mo unlocks 60+ models, including every major video generator.

One account. Vision LLMs for extraction, video gens for output, no per-model billing.