RealMaster: Lifting Rendered Scenes into Photorealistic Video

1Tel Aviv University 2Reality Labs, Meta 3Technion
*Work was done while the first author was an intern at Reality Labs, Meta
3D Simulator to Rendering Engine and RealMaster diagram

RealMaster lifts synthetic-looking rendered video into photorealistic video, faithfully re-realizing the original scene.

Abstract

State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism.

We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.

How does it work?


Overview of RealMaster method

Our method consists of two stages:

(1) Synthetic-to-Realistic Data Generation. Given a rendered video sequence, we first edit the first and last frames to serve as photorealistic visual anchors using an off-the-shelf image editing model. These anchors define the target photorealistic look for the full sequence. We then extract edge maps from the input video and use VACE to generate the full video conditioned on the photorealistically edited keyframes and the corresponding edge maps. Edge conditioning anchors generation to the input's structure and motion, allowing VACE to propagate the keyframe appearance while preserving scene layout and dynamics across intermediate frames.
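The structural cue driving this stage can be illustrated with a minimal edge extractor. The page does not name the edge detector used, so the Sobel gradient-magnitude sketch below (with an assumed binarization threshold) is only an illustration of the kind of per-frame edge map that could condition the propagation step, not the actual pipeline:

```python
import numpy as np

def sobel_edge_map(frame: np.ndarray, thresh: float = 0.25) -> np.ndarray:
    """Binary gradient-magnitude edge map of a grayscale frame in [0, 1].

    Illustrative only: the actual edge extractor and threshold used by
    the pipeline are not specified on this page.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
    ky = kx.T
    h, w = frame.shape
    # Pad with edge replication so the output matches the input resolution.
    p = np.pad(frame, 1, mode="edge")
    gx = np.zeros((h, w), dtype=np.float32)
    gy = np.zeros((h, w), dtype=np.float32)
    for i in range(3):
        for j in range(3):
            window = p[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    mag = np.hypot(gx, gy)
    mag /= max(float(mag.max()), 1e-8)  # normalize to [0, 1]
    return (mag > thresh).astype(np.float32)

# Toy frame: a bright square on a dark background; edges fire at its border.
frame = np.zeros((16, 16), dtype=np.float32)
frame[4:12, 4:12] = 1.0
edges = sobel_edge_map(frame)
```

In the actual pipeline such maps would be extracted from every frame of the rendered video and passed to VACE together with the two photorealistically edited anchor frames.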

(2) Model Training. We train a lightweight LoRA adapter that distills our data generation pipeline into a single model for sim-to-real video translation. We adopt an IC-LoRA architecture on top of a pre-trained text-to-video diffusion backbone. During training, we concatenate clean reference tokens from the rendered input video with noisy tokens and optimize the model to denoise toward the corresponding photorealistic target. At inference time, the resulting model avoids several constraints imposed by the pipeline: it does not require access to both the first and last frames of a sequence, it can handle objects appearing mid-sequence through learned priors, and it avoids over-editing artifacts from the image editing model.
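The token-concatenation objective described above can be sketched at the shape level. Everything in this toy example is an assumption: the array sizes, the scalar noise-schedule value `alpha_bar`, and the random linear map standing in for the LoRA-adapted video denoiser are placeholders, not the method's implementation; only the structure (clean reference tokens concatenated with noisy target tokens, with the denoising loss applied to the noisy half) mirrors the description:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy token grid: the real model operates on video latent tokens.
n_tokens, dim = 8, 4
ref_tokens = rng.normal(size=(n_tokens, dim))     # clean rendered-input tokens
target_tokens = rng.normal(size=(n_tokens, dim))  # photorealistic target tokens

# Forward diffusion: noise the target tokens at some timestep.
alpha_bar = 0.5  # assumed noise-schedule value for this illustration
eps = rng.normal(size=target_tokens.shape)
noisy = np.sqrt(alpha_bar) * target_tokens + np.sqrt(1 - alpha_bar) * eps

# IC-LoRA-style conditioning: the denoiser sees the clean reference tokens
# concatenated with the noisy target tokens along the token axis.
model_input = np.concatenate([ref_tokens, noisy], axis=0)  # (2*n_tokens, dim)

# Stand-in denoiser: a small random linear map. The real denoiser is a
# pre-trained video diffusion backbone with LoRA adapters.
W = rng.normal(size=(dim, dim)) * 0.1
pred = model_input @ W

# Only the noisy half of the sequence carries the denoising loss.
eps_pred = pred[n_tokens:]
loss = float(np.mean((eps_pred - eps) ** 2))
```

The reference tokens receive no noise and no loss; they serve purely as in-context conditioning that ties the denoised output to the rendered input's geometry and dynamics.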


Our Results

Below we show representative results from RealMaster. Each video shows the rendered input (left) alongside the photorealistic output produced by our model (right).

[Eight side-by-side videos: rendered input (left) | RealMaster output (right)]

Comparison with Baselines

Input | RealMaster | LucyEdit | Editto | Runway-Aleph

Ablation Study — Data Pipeline Variants

Below we explore variants of our data pipeline. Using multiple anchors leads to severe inconsistency, while depth conditioning loses facial expressions and, in some cases, the facial structure of the characters. Using edge maps yields the best overall results.


Source | Multiple Anchors | Depth | Edges (Ours)

Ablation Study — Model vs Data Pipeline

The trained model mitigates several limitations of the data pipeline. It preserves objects that appear mid-sequence, such as gloves, which the pipeline often fails to maintain, and it maintains the scene's lighting and color palette more consistently across the sequence.


Input | Data Pipeline | RealMaster

Cross-Simulator Generalization

We run our model on videos created by the CARLA simulator. Despite never seeing CARLA data during training, our model enhances the photorealism of these videos while preserving the scene's structure and content.


[Four side-by-side videos: CARLA input (left) | RealMaster output (right)]

Dynamic Weather Effects

By editing the text prompt of RealMaster, we're able to add weather effects like snow and rain to existing rendered scenes while enhancing their photorealism.


"Make it rain" →
"Make it rain" →
"Make it snow" →
"Make it snow" →