Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes

Dana Cohen-Bar, Elad Richardson, Gal Metzer, Raja Giryes, Daniel Cohen-Or
Tel Aviv University

Set-the-Scene takes as input a text prompt and a scene layout and synthesizes a composable NeRF scene, utilizing a Global-Local approach.


Recent breakthroughs in text-guided image generation have led to remarkable progress in the field of 3D synthesis from text. By optimizing neural radiance fields (NeRF) directly from text, recent methods are able to produce impressive results. Yet, these methods offer limited control over each object's placement or appearance, as they represent the scene as a whole. This is a major limitation in scenarios that require refining or manipulating objects in the scene. To remedy this deficit, we propose a novel Global-Local training framework for synthesizing a 3D scene using object proxies. A proxy represents an object's placement in the generated scene and optionally defines its coarse geometry. The key to our approach is to represent each object as an independent NeRF. We alternate between optimizing each NeRF on its own and as part of the full scene. Thus, a complete representation of each object can be learned, while also creating a harmonious scene with matching style and lighting. We show that using proxies allows a wide variety of editing options, such as adjusting the placement of each independent object, removing objects from a scene, or refining an object. Our results show that Set-the-Scene offers a powerful solution for scene synthesis and manipulation, filling a crucial gap in controllable text-to-3D synthesis.


[Figure: scenes generated from the same proxies under different style prompts. Living room proxy: "A Baroque living room", "A futuristic living room", "A Moroccan living room". Bedroom proxy: "A futuristic bedroom", "An Asian bedroom", "A kids bedroom". Dining proxy: "A futuristic dining room", "A Baroque dining room", "A Moroccan dining room".]

Post-Training Editing

[Figure: two examples of editing a generated dining scene after training, each showing the dining proxy, the original generated scene, and a placement-edited version.]

Single Object NeRFs Composing the Scene

[Figure: "A Baroque bedroom" scene decomposed into the single-object NeRFs that compose it: "A Baroque bed", "A Baroque nightstand", "A Baroque wardrobe".]

How does it work?


The scene is represented as a composition of multiple object NeRFs, each with a given position and, optionally, a shape defined by the scene proxies. The models are jointly optimized to "locally" represent their assigned objects and "globally" form part of the larger scene. Both the local and the global optimization propagate gradients into the same models, producing a harmonious scene composed of disentangled objects that can be edited independently.
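The alternation described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: each object NeRF is reduced to a plain parameter vector, the text-guided losses are replaced by quadratic surrogates toward fixed targets, and composing the NeRFs is replaced by a sum. Only the control flow, alternating local and global phases that both update the same per-object parameters, mirrors the Global-Local scheme.

```python
import numpy as np

# Toy stand-in: each "object NeRF" is a parameter vector, and the
# text-guided losses are quadratic surrogates toward fixed targets.
rng = np.random.default_rng(0)
n_objects, dim, lr = 3, 8, 0.1

# Independent parameters per object NeRF.
objects = [rng.normal(size=dim) for _ in range(n_objects)]

# Per-object ("local") targets, standing in for per-object prompts.
local_targets = [rng.normal(size=dim) for _ in range(n_objects)]
# For this toy check, the full-scene ("global") target is consistent with
# the local ones; in the paper, the local and global objectives come from
# different text prompts, and the alternation balances them.
global_target = sum(local_targets)

def local_loss(i):
    return float(np.sum((objects[i] - local_targets[i]) ** 2))

def global_loss():
    scene = sum(objects)  # stand-in for composing the object NeRFs
    return float(np.sum((scene - global_target) ** 2))

for step in range(200):
    if step % 2 == 0:
        # Local phase: each object NeRF is optimized on its own.
        for i in range(n_objects):
            objects[i] -= lr * 2 * (objects[i] - local_targets[i])
    else:
        # Global phase: the loss is taken on the composed scene, and its
        # gradient flows back into the same per-object parameters.
        scene_grad = 2 * (sum(objects) - global_target)
        for i in range(n_objects):
            objects[i] -= (lr / n_objects) * scene_grad
```

Because both phases write gradients into the same parameter sets, the loop drives down every object's local loss and the composed scene's global loss simultaneously, which is the mechanism that yields complete per-object representations inside a coherent scene.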