Driver Creation Guide

A driver is a JSON file that acts as a creative brief for Onset Engine, telling it what visual content to assign to each energy tier of your music.

Without a driver, the engine falls back to raw CLIP similarity and motion scores. With a driver, you get full control over the AI's clip selection: the driver steers the selection process by using CLIP embeddings to find the closest visual match for the current music intensity across predefined energy tiers.

A driver is written in JSON format (specifically schema v3) and is composed of the following key sections:

  • meta: Basic information like name, version, and description.
  • global: Filters applied to the entire video, such as min_rating, exclude_tags, and whether to enforce shot_diversity.
  • tiers: The heart of the driver, representing four distinct energy levels (1_LOW, 2_MED, 3_HIGH, 4_MAX).

Each tier contains:

  • descriptions: A list of natural language text descriptions of the content.
  • subjects / tags: Specific tags you’ve identified in your library.
  • Optional filters like moods, scene_types, and min_rating.

Here is a partial look at how a driver structures its tiers based on drivers/examples/action_movie.json:

{
  "meta": {
    "name": "Action Movie",
    "version": "3.0"
  },
  "tiers": {
    "1_LOW": {
      "descriptions": [
        "character talking in a quiet room",
        "city skyline establishing shot"
      ],
      "moods": ["serene", "neutral"],
      "scene_types": ["wide"]
    },
    "4_MAX": {
      "descriptions": [
        "massive explosion with fire and debris",
        "slow-motion bullet impact"
      ],
      "moods": ["epic", "chaotic"],
      "min_rating": 4
    }
  }
}

The descriptions array in each tier is what powers the AI’s visual matching.

When the engine processes a driver, each natural language phrase (e.g., "city skyline establishing shot") is encoded into a 768-dimensional vector using a CLIP text encoder.

During editing, the engine computes cosine similarity between the description vector and the pre-computed embedding vectors for every clip in your library. It then ranks the clips to find the best match for the current tier’s text.
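The matching step can be illustrated with a minimal sketch. This is not the engine's actual implementation; the function names and the toy 4-dimensional vectors (standing in for real 768-dimensional CLIP embeddings) are hypothetical:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_clips(description_vec: np.ndarray, clip_embeddings: dict) -> list:
    """Rank library clips by similarity to one tier description embedding."""
    scores = {clip: cosine_similarity(description_vec, vec)
              for clip, vec in clip_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy stand-ins for pre-computed clip embeddings.
desc = np.array([1.0, 0.0, 0.0, 0.0])       # "city skyline establishing shot"
library = {
    "skyline.mp4":   np.array([0.9, 0.1, 0.0, 0.0]),  # visually close
    "explosion.mp4": np.array([0.0, 1.0, 0.2, 0.0]),  # unrelated
}
print(rank_clips(desc, library))  # skyline.mp4 ranks first
```

Because both the description vector and the clip vectors live in the same CLIP embedding space, a plain cosine similarity is enough to order the whole library for each tier.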

Beyond simple text matching, tiers can apply filters that penalize or exclude clips that don't match what you want:

  • moods: Filters clips based on the AI’s mood classification (e.g., epic, serene, tense). If a clip’s mood doesn’t match the preferred mood, it receives a 0.50x score penalty.
  • scene_types: Filters based on scene composition (e.g., wide, close-up, aerial). Mismatches get a 0.60x score penalty.
  • min_rating: An integer (0-5) that ensures only high-quality clips are picked for a given tier.

These penalties stack, meaning a clip missing both the preferred mood and scene type is heavily penalized (0.50 * 0.60 = 0.30x score).
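The stacking arithmetic is simple multiplication of the clip's similarity score. A minimal sketch (the constant and function names are illustrative, not the engine's internals):

```python
MOOD_PENALTY = 0.50   # applied when the clip's mood misses the tier's list
SCENE_PENALTY = 0.60  # applied when the scene type misses the tier's list

def apply_penalties(score: float, mood_ok: bool, scene_ok: bool) -> float:
    """Multiply the similarity score by each penalty the clip incurs."""
    if not mood_ok:
        score *= MOOD_PENALTY
    if not scene_ok:
        score *= SCENE_PENALTY
    return score

print(apply_penalties(1.0, mood_ok=False, scene_ok=False))  # 0.30x
```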

If you have already tagged specific subjects or themes in the Onset Engine GUI, you can reference them directly in your driver using the @ syntax in the tags (or legacy subjects) array.

{
  "3_HIGH": {
    "descriptions": ["fast punches", "beam clash"],
    "tags": ["@Goku", "@Vegeta"]
  }
}

These tags do not match as plain text. Instead, they resolve directly to the library tag centroids (an averaged vector of all clips you explicitly tagged).

Note that @Tag references are AND-filtered alongside descriptions. The clip must have a high text description similarity and match the requested tag.
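A tag centroid is just the mean of the embeddings of every clip you tagged. A minimal sketch, with toy 3-dimensional vectors standing in for 768-dimensional clip embeddings (names are illustrative):

```python
import numpy as np

def tag_centroid(tagged_clip_vectors: list) -> np.ndarray:
    """Average the embeddings of all clips carrying a tag into one centroid."""
    return np.mean(np.stack(tagged_clip_vectors), axis=0)

# Embeddings of the two clips the user explicitly tagged "@Goku".
goku_clips = [np.array([1.0, 0.0, 0.0]),
              np.array([0.0, 1.0, 0.0])]
centroid = tag_centroid(goku_clips)
print(centroid)  # [0.5 0.5 0. ]
```

At edit time, a candidate clip's similarity to this centroid acts as the tag-side test of the AND filter, alongside its similarity to the tier's description text.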

Testing with Different Footage: Because drivers rely on CLIP embeddings rather than strict filenames, the exact same JSON driver can be used across completely different sets of footage. For example, a “Wedding Highlights” driver can be run on five different wedding folders, and the engine will dynamically adapt to find the best matching clips in each respective folder.

Sharing Drivers: Since driver files are plain JSON, sharing them is simple. Send the .json file to another Onset Engine user; they place it in their drivers/ folder (or load it directly from the GUI), and it will immediately work with their own footage library.