Behind the Build: How We Make AI Videos Feel Human

14 May 2026

By Matt Basedow

The gap between "this looks AI-generated" and "this actually looks professional" comes down to a handful of specific decisions. Not magic. Not a single model. Decisions.

When we started building PropertyVideos.ai, the easy version was simple: take a photo, run it through an AI video model, and spit out a clip. Technically, that works. But the results feel wrong. The camera moves too fast, or too robotically. The lighting doesn't match the scene. A person placed into the frame looks pasted in, not present.

None of that is acceptable if the goal is a video that an agent would actually want to put their name on.

So we fixed each problem. Here's how.

The Prompt Problem: Generic In, Generic Out

The first thing we changed was how we talk to the AI.

Most tools send the same motion prompt to every image. That's why you get clips where a kitchen gets the same camera treatment as a master bedroom, and neither looks quite right.

We don't do that. Before every clip is generated, a vision AI analyses the actual image and writes a custom scene description. A kitchen becomes: "a modern open-plan kitchen with white marble countertops and pendant lighting." A pool becomes: "an elegant outdoor pool area surrounded by mature landscaping."

That description then gets injected into our motion prompt template: slow push-in, camera gliding forward, natural depth parallax, warm lighting, premium real estate aesthetic.

Generic prompts produce generic results. Telling the AI exactly what's in each scene changes how it moves through that scene.
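To picture the step, here's a minimal sketch in Python. The describe_scene function stands in for the vision-model call, and the template wording is paraphrased from above, not our exact production prompt:

```python
# Stand-in for the vision-model call that looks at the actual photo
# and writes a one-line scene description.
def describe_scene(image_path: str) -> str:
    ...  # e.g. "a modern open-plan kitchen with white marble countertops"

# Illustrative template: the per-image description is injected ahead of
# the fixed motion language.
MOTION_TEMPLATE = (
    "{scene}. Slow push-in, camera gliding forward, natural depth "
    "parallax, warm lighting, premium real estate aesthetic."
)

def build_motion_prompt(image_path: str) -> str:
    return MOTION_TEMPLATE.format(scene=describe_scene(image_path))
```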

The difference in output quality is significant. The camera feels like it belongs in the room, not just passing through it.

The Actor Problem: People Are Hard

Putting a person into a property photo is one of the hardest things to do well in AI.

The failure mode is obvious once you've seen it. The person looks like a cutout. The lighting on them doesn't match the room. They're facing slightly the wrong way. It looks like a bad photoshoot composite from 2009.

Our pipeline runs in a specific sequence for a reason. Virtual staging happens first, so if we're furnishing an empty room, the furniture is in place before we composite any people. Then actor compositing happens, using a separate image AI model. Only after both of those are baked into the photo does the image get sent to our video generation model.

Why does this matter? Because the video AI is now animating a complete, coherent scene. It's not trying to reconcile a real room with a fake person. Everything is already resolved. The result is motion that treats the person as part of the space, not an overlay on top of it.
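As a sketch of the ordering, with hypothetical stage names rather than our internal API:

```python
# Hypothetical stage functions; the point is the sequence, not the names.
def virtual_stage(photo): ...           # furnish an empty room
def composite_actor(photo, actor): ...  # place the person into the image
def generate_clip(photo): ...           # animate the finished still

def prepare_actor_clip(photo, actor, needs_staging: bool):
    if needs_staging:
        photo = virtual_stage(photo)       # 1. furniture first
    photo = composite_actor(photo, actor)  # 2. then the person
    return generate_clip(photo)            # 3. video model sees one coherent scene
```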

We also use a different video generation model for actor clips: one built specifically to handle human subjects and preserve natural movement. This also means every clip in an actor project renders at the same frame rate, even the non-actor scenes, so the whole video stays consistent.
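In routing terms, that rule looks something like this. The model names and frame rates here are placeholders, not the services or settings we actually use:

```python
STANDARD_MODEL = "standard-video-model"  # placeholder name
ACTOR_MODEL = "human-motion-model"       # placeholder name

def render_settings(scenes: list[dict]):
    # One actor scene anywhere in the project pins every scene, actor or
    # not, to the same frame rate so nothing judders at a cut.
    has_actors = any(s["has_actor"] for s in scenes)
    fps = 24 if has_actors else 30  # illustrative values
    models = [ACTOR_MODEL if s["has_actor"] else STANDARD_MODEL for s in scenes]
    return models, fps
```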

The Voiceover Problem: It Shouldn't Sound Like a Robot Reading

Most AI voices sound fine until they hit a real estate address, a price, or an agent's name. Then they fall apart.

We use a best-in-class voice AI for voiceover generation. Scripts are written by AI from the listing details the agent provides. Premium users can also clone their own voice, which means their listings sound like them, not a generic narrator.

The voiceover gets generated before the final render. When it's included, it triggers an automatic music volume reduction, known in audio production as ducking, so the voiceover sits cleanly in the mix. These are small details. They're the difference between something that sounds assembled and something that sounds produced.
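A minimal version of that ducking step, assuming an ffmpeg-based mix. The -12 dB figure is illustrative, not our production value:

```python
import subprocess

def mix_soundtrack(music: str, voiceover: str | None, out: str) -> None:
    if voiceover is None:
        # No voiceover: the music bed passes through untouched.
        subprocess.run(["ffmpeg", "-y", "-i", music, out], check=True)
        return
    subprocess.run([
        "ffmpeg", "-y", "-i", voiceover, "-i", music,
        "-filter_complex",
        # Duck the music, then mix both tracks for the voiceover's duration.
        "[1:a]volume=-12dB[bed];[0:a][bed]amix=inputs=2:duration=first[mix]",
        "-map", "[mix]", out,
    ], check=True)
```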

The Assembly Problem: Everything Has to Fit Together

A good clip means nothing if the final video looks messy.

The last step in our pipeline is a video composition engine. It takes all the individual clips, the voiceover, the background music, and the agent's branding (logo, colours, photo, contact details) and assembles them into the finished video.

This is also where we handle frame rate consistency. Every clip in a project uses the same frame rate. A mix of frame rates looks wrong on playback, even if the viewer can't explain why.

The branding layer adapts dynamically. If an agent hasn't provided a phone number, that field doesn't show. If there are only two property features, the icon layout adjusts. Nothing sits blank, nothing overflows.
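A sketch of that rule, with made-up field names: each element renders only if the agent actually supplied it, and the layout is computed from whatever survives.

```python
def branding_elements(agent: dict) -> list[tuple[str, str]]:
    candidates = [
        ("logo", agent.get("logo_url")),
        ("photo", agent.get("headshot_url")),
        ("phone", agent.get("phone")),
        ("email", agent.get("email")),
    ]
    # Missing fields drop out entirely; the layout engine then spaces
    # whatever remains, so two items get a wider grid than four.
    return [(label, value) for label, value in candidates if value]
```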

What We're Still Working On

The honest answer is that AI-generated video is still imperfect. Generative models can hallucinate objects that weren't in the photo. Pool water doesn't always ripple the way you'd want. Occasionally, the AI decides there's something interesting around a corner and tries to show you.

We've built error handling, auto-retry logic, and a monitoring system that tracks every job and kicks in if something stalls. But we haven't eliminated the underlying problem, because no one has yet. We're working on it.
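The retry layer itself is unremarkable, something like this, where TransientRenderError is a placeholder for whatever a flaky generation job throws:

```python
import time

class TransientRenderError(Exception):
    """Placeholder for a recoverable generation failure."""

def run_with_retries(job, attempts: int = 3, base_delay: float = 5.0):
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except TransientRenderError:
            if attempt == attempts:
                raise  # out of retries; the monitoring system takes over
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```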

What I can say is that we've closed the gap considerably by being specific about every decision in the pipeline. The right model for the right task. The right prompt for the right image. The right frame rate for the right content.

That's what makes AI videos feel human. Not one model. A pipeline that treats every detail as worth getting right.