I run AI analysis on Meta ads at scale. Every time someone saves an ad in SpreshApp, my pipeline analyzes it with Gemini and returns structured creative insights: the hook tactic, messaging angle, format, target persona, and a few other signals that ad strategists actually use.
The first version worked well. It was also costing me around $100-150/month at roughly 1000 analyses per day, and trending upward. I needed to fix it.
How the original pipeline worked
The obvious approach when you want to analyze a video is to upload the whole video. That’s what v1 did. I’d download the video from Meta’s CDN, upload it to Gemini’s Files API, wait for it to process, then call generateContent with the file reference and a system prompt describing the taxonomy I wanted back.
For a typical 15-30 second Meta ad, that means sending somewhere between 5,000 and 15,000 tokens of video data per request. At Gemini Flash pricing, each analysis costs a fraction of a cent. That sounds cheap until you’re running thousands per day, and then it adds up fast.
The math was roughly: 1000 ads/day is about 30,000 analyses/month, at an average of around $0.003-0.005 per analysis, and you’re at $100-150/month before long.
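A quick back-of-envelope check of that arithmetic, taking the roughly 1000 analyses per day and the $100-150/month bill quoted earlier and assuming a 30-day month (the constant and function names are only for illustration):

```python
ANALYSES_PER_DAY = 1000   # volume quoted in the post
DAYS_PER_MONTH = 30       # assumption: a 30-day billing month


def cost_per_analysis(monthly_bill_usd: float) -> float:
    """Implied average cost of one analysis for a given monthly bill."""
    return monthly_bill_usd / (ANALYSES_PER_DAY * DAYS_PER_MONTH)


for bill in (100, 150):
    print(f"${bill}/month -> ${cost_per_analysis(bill):.4f} per analysis")
# $100/month -> $0.0033 per analysis
# $150/month -> $0.0050 per analysis
```

So the per-analysis cost is well under a cent; the bill comes from the volume, not from any single request.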
The insight that changed things
Ad strategists mostly care about two things: what the hook looks like and what the voiceover is arguing. That’s it.
Visual format and hook type are contained almost entirely in the first 3 seconds. Is it a talking head, UGC, animation, a text-heavy slideshow? What’s the opening frame doing? These signals are set up early on purpose. TikTok and Reels punish anything that doesn’t grab immediately, so ads are engineered to communicate format and hook within seconds. After that, the rest of the video is usually more of the same.
The messaging angle, whether that’s fear, aspiration, social proof, whatever, lives in the voiceover and on-screen copy. Not in what the camera is doing at the 15-second mark.
So I was sending Gemini thousands of tokens of mid-video footage that contributed almost nothing to the output. I was paying to watch the whole movie when I needed maybe the first 3 seconds and the script.
The hybrid approach
The new pipeline runs two ffmpeg jobs in parallel before making any LLM call.
The first extracts a short hook clip (input.mp4 here stands in for the downloaded ad):
ffmpeg -ss 0 -t 3 -i input.mp4 -vf fps=2,scale=512:-2 -c:v libx264 -preset veryfast output_hook.mp4
Three seconds, 2 frames per second, scaled to 512px wide. The output is a few hundred KB instead of several MB. That translates to roughly 800 Gemini tokens instead of 5,000-15,000.
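For anyone scripting this step, here is a minimal Python sketch that builds the same ffmpeg invocation and runs it with subprocess. The function names and file paths are placeholders rather than the production code, and -y (overwrite the output file) is an addition for convenience; the trim, filter, and encoder settings are the ones described above.

```python
import subprocess
from pathlib import Path


def hook_clip_command(src: Path, dst: Path, seconds: int = 3,
                      fps: int = 2, width: int = 512) -> list[str]:
    """Build ffmpeg arguments for a short, low-frame-rate, downscaled clip."""
    return [
        "ffmpeg", "-y",                  # -y: overwrite dst if it exists
        "-ss", "0", "-t", str(seconds),  # read only the first `seconds` seconds
        "-i", str(src),
        "-vf", f"fps={fps},scale={width}:-2",  # -2 keeps aspect ratio with an even height
        "-c:v", "libx264", "-preset", "veryfast",
        str(dst),
    ]


def extract_hook_clip(src: Path, dst: Path) -> None:
    """Run ffmpeg and raise CalledProcessError if it exits non-zero."""
    subprocess.run(hook_clip_command(src, dst), check=True, capture_output=True)
```

Keeping the command builder separate from the subprocess call makes the settings easy to unit-test without ffmpeg installed.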
The second job extracts the audio as a 16kHz mono WAV and sends it to Groq’s Whisper endpoint. Groq offers Whisper on a free tier, and it returns word-level timestamps, not just a flat string. So I get something like:
0.32-0.51 Don't
0.51-0.69 go
0.69-1.10 to sleep
...
I format that into a compact transcript block and include it as text in the user message alongside the hook clip. Gemini now sees the visual hook and the full spoken content, but in token terms the video payload is roughly 85-95% smaller (about 800 tokens versus 5,000-15,000).
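A rough sketch of the two ends of this step: the ffmpeg arguments for a 16kHz mono WAV, and a formatter that turns word-level timestamps into the compact block shown above. The HTTP call to the transcription endpoint is left out. The function names are hypothetical, and the shape of each word entry ('word', 'start', 'end' keys) is an assumption about the response, so adjust it to what the API actually returns.

```python
from pathlib import Path


def audio_command(src: Path, dst: Path) -> list[str]:
    """ffmpeg arguments for a 16 kHz mono WAV with the video stream dropped."""
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-vn",                # drop the video stream
        "-ac", "1",           # mono
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "pcm_s16le",  # 16-bit PCM, the usual WAV payload
        str(dst),
    ]


def format_transcript(words: list[dict]) -> str:
    """One 'start-end word' line per entry (times in seconds, two decimals)."""
    return "\n".join(
        f"{w['start']:.2f}-{w['end']:.2f} {w['word'].strip()}" for w in words
    )


sample = [
    {"word": "Don't", "start": 0.32, "end": 0.51},
    {"word": "go", "start": 0.51, "end": 0.69},
    {"word": "to sleep", "start": 0.69, "end": 1.10},
]
print(format_transcript(sample))
# 0.32-0.51 Don't
# 0.51-0.69 go
# 0.69-1.10 to sleep
```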
If the transcript comes back empty or too short to be useful (silent ads, bad audio), I fall back to sending the full video through the old path. Same for image ads, which go straight to multimodal. But for video ads with usable audio, which is most of them, I take the hybrid path.
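The routing described here can be sketched as a small pure function. The names and the five-word cutoff are hypothetical; the post only says "empty or too short to be useful", so treat the threshold as a placeholder to tune.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    HYBRID = "hybrid"          # hook clip + transcript
    FULL_VIDEO = "full_video"  # old path: upload the whole video
    IMAGE = "image"            # image ads go straight to multimodal


MIN_TRANSCRIPT_WORDS = 5  # hypothetical cutoff for a "usable" transcript


def choose_route(media_type: str, transcript: Optional[str]) -> Route:
    """Pick the analysis path for one ad."""
    if media_type == "image":
        return Route.IMAGE
    word_count = len((transcript or "").split())
    if word_count < MIN_TRANSCRIPT_WORDS:
        # Silent ad or bad audio: fall back to the full-video path.
        return Route.FULL_VIDEO
    return Route.HYBRID
```

For example, choose_route("video", "") returns Route.FULL_VIDEO, while a video with a normal voiceover transcript returns Route.HYBRID.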
Results
Cost dropped from $100-150/month to $5-15/month for the same volume. That’s a 10x reduction or better.
The analysis quality held up. The things I care most about (format classification, hook tactic, messaging angle, persona) are all well-represented in 3 seconds of visual plus the full transcript. The model can’t see a visual transition that happens at the 12-second mark, but that transition was never load-bearing for the taxonomy anyway.
The one real trade-off is latency on the preprocessing side. Running ffmpeg and hitting the Groq API adds a few seconds before the Gemini call. I handle this by running both jobs concurrently and running the whole thing asynchronously, so the user gets a “processing” response immediately and polls for completion. That was already the architecture, so the extra prep time was mostly free.
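Running the two preprocessing jobs concurrently could look something like the following asyncio sketch. The job functions here are stand-ins that just sleep; the real ones would wrap the ffmpeg and transcription calls, and all names are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

Job = Callable[[str], Awaitable[str]]


async def preprocess(video_path: str, extract_hook: Job,
                     transcribe: Job) -> Tuple[str, str]:
    """Run both jobs concurrently; return (hook_clip_path, transcript)."""
    hook_path, transcript = await asyncio.gather(
        extract_hook(video_path),
        transcribe(video_path),
    )
    return hook_path, transcript


# Stand-ins for the real jobs, used only to show the call shape.
async def fake_extract_hook(video_path: str) -> str:
    await asyncio.sleep(0.1)  # pretend ffmpeg is running
    return video_path.replace(".mp4", "_hook.mp4")


async def fake_transcribe(video_path: str) -> str:
    await asyncio.sleep(0.1)  # pretend the transcription request is in flight
    return "0.32-0.51 Don't"
```

With this shape, total preprocessing time is roughly the slower of the two jobs rather than their sum. Blocking work such as subprocess.run can be moved off the event loop with asyncio.to_thread, or replaced with asyncio.create_subprocess_exec.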
Most video analysis pipelines I’ve seen over-index on sending full video because it feels more complete. It usually isn’t. Figure out where your actual signal lives, sample that aggressively, and use cheap text APIs for the rest. In my case the cost difference was about an order of magnitude.