Vision Query Ingest

Vision Query lets you describe what you’re looking for when ingesting footage, and the engine will only keep clips that match your description. Instead of ingesting everything and sorting later, you tell the AI what matters up front.

This uses the same OpenCLIP model that powers the Driver System — your text query is encoded into a 768-dimensional CLIP vector and compared against every clip’s visual embedding using cosine similarity.
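The comparison step can be sketched as plain cosine similarity between two vectors. This is a minimal illustration, not the engine's actual code; the random vectors stand in for a real encoded text query and clip embedding.

```python
import numpy as np

def cosine_similarity(text_vec, clip_vec):
    """Cosine similarity between two embedding vectors: dot product of unit vectors."""
    a = text_vec / np.linalg.norm(text_vec)
    b = clip_vec / np.linalg.norm(clip_vec)
    return float(np.dot(a, b))

rng = np.random.default_rng(0)
query_embedding = rng.normal(size=768)  # stand-in for the encoded text query
clip_embedding = rng.normal(size=768)   # stand-in for a clip's visual embedding

score = cosine_similarity(query_embedding, clip_embedding)  # always in [-1, 1]
```

A score near 1 means the clip's visual content is close to the query in CLIP's embedding space; scores near 0 mean little relation.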

  1. Open the 📥 INGEST dialog from the Video Bin toolbar
  2. Select your footage folder
  3. In the Vision Query section, type a description of the clips you want to keep:
    • "two men fighting"
    • "sunset over water"
    • "close-up of face"
    • "car drifting on a track"
  4. Adjust the Strictness slider (optional)
  5. Click RUN INGEST

The engine will analyze all clips as usual, but at the end of each video’s ingest, it compares every clip’s embedding against your query and drops the ones that don’t match.

The Strictness slider controls how closely clips must match your description:

  Slider Value        Behavior
  0.10 (leftmost)     Very loose — keeps most clips that are even vaguely related
  0.26 (default)      Balanced — keeps clips with clear visual relevance
  0.40 (rightmost)    Very strict — only keeps clips that strongly match

Under the hood, Vision Query uses a two-stage filtering approach:

  1. Adaptive threshold: The engine computes a threshold from the batch's similarity distribution — the mean plus 0.5 × the standard deviation of all clip scores for your query — so the top-matching clips are selected relative to the batch.
  2. Hard floor: Your Strictness slider value acts as an absolute minimum — clips below this score are never included, regardless of the adaptive threshold.

This means the filter adapts to your footage. If many clips match your query well, the adaptive threshold rises and keeps only the best. If few clips match, it still finds the closest ones (as long as they meet the hard floor).
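The two-stage logic above can be sketched in a few lines. This is a simplified illustration of the described behavior, not the engine's source; the function name and signature are hypothetical.

```python
import numpy as np

def filter_clips(scores, strictness=0.26):
    """Keep clip indices whose similarity score passes both filter stages.

    Stage 1 (adaptive): threshold = mean + 0.5 * std of the batch's scores.
    Stage 2 (hard floor): the strictness value is an absolute minimum.
    """
    scores = np.asarray(scores, dtype=float)
    adaptive = scores.mean() + 0.5 * scores.std()
    threshold = max(adaptive, strictness)  # floor wins if the batch scores low
    return [i for i, s in enumerate(scores) if s >= threshold]

# Batch with one strong match: adaptive threshold rises above the floor.
print(filter_clips([0.10, 0.20, 0.30, 0.40]))  # → [3]

# Batch of uniformly weak matches: the hard floor rejects everything.
print(filter_clips([0.05, 0.05, 0.05, 0.05]))  # → []
```

Note how the same strictness value behaves differently depending on the batch: with strong matches present, the adaptive stage does the work; with only weak matches, the hard floor takes over.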

Scenario: You have 10 hours of nature documentary footage and want to build a library of only the ocean shots.

  1. Click 📥 INGEST → select your footage folder
  2. Vision Query: "ocean waves, underwater, beach, coral reef"
  3. Strictness: 0.22 (slightly loose to catch variety)
  4. Run ingest

Result: Instead of 3,000 clips covering forests, mountains, and oceans, your library contains only the ~400 clips that visually match ocean/water content. Ready for editing immediately.

Vision Query and Collections serve different filtering purposes:

  • Vision Query filters at ingest time — clips that don’t match never enter your library
  • Collections filter at render time — all clips are in the library, but renders use only clips from a specific collection

You can combine both: ingest with a vision query to build a focused library, then further organize with collections for different projects.