3D-RE-GEN

3D Reconstruction of Indoor Scenes with a Generative Framework

Overview

Single-image 3D scene reconstruction poses a fundamental challenge for content creators: extracting individual objects from a single photo and converting them into a complete, editable 3D scene with textured assets and a reconstructed background. 3D-RE-GEN addresses this by combining instance segmentation, context-aware generative inpainting, 2D-to-3D asset creation, and constrained optimization to produce physically plausible, production-ready scenes. The result is a coherent reconstruction in which objects maintain proper spatial relationships, lighting consistency, and material fidelity, all extracted from a single input image.

Pipeline

The pipeline begins by identifying and segmenting objects in the input image. To reconstruct partially occluded objects, we use Application-Querying, a visual prompting technique that provides the image editing model with both the full scene context and the target object in a structured UI-style layout. This allows the model to inpaint missing parts with scene-aware details. With isolated, inpainted object images and a clean background plate, we estimate camera parameters and reconstruct 3D point clouds using a geometry transformer trained on diverse scene data. In parallel, each inpainted object is processed by a 2D-to-3D generative model, producing textured 3D assets. Finally, a differentiable renderer optimizes each asset's position and orientation using a novel 4-DoF ground alignment constraint, which locks ground-contacting objects to the floor plane while allowing free movement for suspended objects. This hybrid approach ensures objects are physically plausible, spatially coherent, and ready for direct use in games, simulations, and visual effects.
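
To make the alignment stage concrete, below is a minimal sketch of a 4-DoF ground-aligned pose optimization in PyTorch. It assumes the four degrees of freedom are in-plane translation (x, z), yaw about the vertical axis, and uniform scale, and it substitutes a point-to-point Chamfer loss for the paper's differentiable-rendering objective; `asset_pts` and `target_pts` are hypothetical (N, 3) point clouds with the floor plane at y = 0.

```python
# Sketch of a 4-DoF ground-aligned pose optimization (assumed DoFs:
# x/z translation, yaw about the up axis, uniform scale). A Chamfer loss
# between point clouds stands in for the differentiable-rendering
# objective used in the actual pipeline.
import torch

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric point-to-point Chamfer distance between (N, 3) and (M, 3)."""
    d = torch.cdist(a, b)  # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def align_4dof(asset_pts, target_pts, steps=300, lr=1e-2):
    tx = torch.zeros(1, requires_grad=True)     # translation along x
    tz = torch.zeros(1, requires_grad=True)     # translation along z
    yaw = torch.zeros(1, requires_grad=True)    # rotation about the up (y) axis
    log_s = torch.zeros(1, requires_grad=True)  # log uniform scale (keeps scale > 0)
    opt = torch.optim.Adam([tx, tz, yaw, log_s], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        c, s = torch.cos(yaw), torch.sin(yaw)
        # Rotate about y only, so a point at y = 0 stays on the floor:
        x = c * asset_pts[:, 0] + s * asset_pts[:, 2]
        z = -s * asset_pts[:, 0] + c * asset_pts[:, 2]
        pts = torch.exp(log_s) * torch.stack([x, asset_pts[:, 1], z], dim=1)
        pts = pts + torch.stack([tx, torch.zeros(1), tz], dim=1)
        loss = chamfer(pts, target_pts)
        loss.backward()
        opt.step()
    return tx.item(), tz.item(), yaw.item(), torch.exp(log_s).item()
```

In the full system the loss would come from the differentiable renderer rather than a point-cloud distance, but the constraint structure is the same: no vertical translation and no roll or pitch, so ground contact cannot be violated during optimization.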

Results

Our evaluation demonstrates that 3D-RE-GEN consistently outperforms existing methods across quantitative and qualitative metrics. We measure 3D scene geometry accuracy using Chamfer Distance (0.011 vs. 0.028–0.036 for competitors), F-Score (0.85 vs. 0.65–0.70), and Bounding Box IoU (0.63 vs. 0.44–0.57), indicating that objects are correctly positioned, sized, and aligned in the reconstructed scene. The Hausdorff distance (0.33 vs. 0.55–0.61) shows our method produces fewer outliers and artifacts. Qualitatively, 3D-RE-GEN excels at recovering sharp object boundaries, generating coherent backgrounds, and avoiding the mesh artifacts and object-merging issues common in competing approaches. In a user study with 59 participants, our method achieved 81% preference over alternatives, with "Layout and Composition" cited as the primary reason for choice. Our method generalizes robustly across synthetic, real-world, and even outdoor scenes, domains where existing methods often fail.
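
For reference, the two primary geometry metrics can be computed from sampled point clouds as in the sketch below. This is a minimal illustration; the F-Score threshold `tau` is a free parameter here, since the threshold used in the evaluation is not stated on this page.

```python
# Chamfer Distance and F-Score between sampled point clouds.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_and_fscore(pred: np.ndarray, gt: np.ndarray, tau: float = 0.05):
    """pred: (N, 3) reconstructed points; gt: (M, 3) ground-truth points."""
    d_pred_to_gt, _ = cKDTree(gt).query(pred)   # nearest-GT distance per pred point
    d_gt_to_pred, _ = cKDTree(pred).query(gt)   # nearest-pred distance per GT point

    chamfer = d_pred_to_gt.mean() + d_gt_to_pred.mean()

    precision = (d_pred_to_gt < tau).mean()  # pred points that land near the GT surface
    recall = (d_gt_to_pred < tau).mean()     # GT points covered by the prediction
    fscore = 2 * precision * recall / max(precision + recall, 1e-8)
    return chamfer, fscore
```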

Comparisons

Comparisons across four scenes (001, 002, 004, 011). Each row shows, left to right: DepR, MIDI, Ours, GT.

Full Scenes (Ours + Background)

Our full reconstructions with the background plate shown for context, presented alongside the corresponding ground-truth views.

Quantitative Metrics

Our method outperforms competing approaches MIDI and DepR across all major metrics. Chamfer Distance decreases from 0.028–0.036 to 0.011, indicating more accurate 3D geometry. F-Score improves from 0.65–0.70 to 0.85. Bounding Box IoU reaches 0.63 compared to 0.44–0.57, showing better spatial alignment. The significantly lower Hausdorff distance (0.33 vs. 0.55–0.61) demonstrates fewer outliers and more consistent reconstruction quality.
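
The Hausdorff distance above can be computed with SciPy's directed variant, taken in both directions; a small sketch under the same point-cloud assumptions as before:

```python
# Symmetric Hausdorff distance between two sampled point clouds.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(pred: np.ndarray, gt: np.ndarray) -> float:
    d_forward, _, _ = directed_hausdorff(pred, gt)   # worst pred-to-GT distance
    d_backward, _, _ = directed_hausdorff(gt, pred)  # worst GT-to-pred distance
    return max(d_forward, d_backward)
```

Because the Hausdorff distance reports the single worst-case point rather than an average, it is especially sensitive to the outliers and artifacts discussed above.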

Qualitative Comparison

Visual comparisons across synthetic and real-world scenes reveal key strengths of 3D-RE-GEN. Unlike competitors that produce mesh artifacts, object merging, and floating or misaligned assets, our method consistently achieves clean object boundaries, coherent spatial layouts, and high-fidelity textures. The reconstructed background is sharp and integrates naturally with placed objects, making scenes immediately usable for VFX and game pipelines.

User Study

In a study with 59 participants, 3D-RE-GEN achieved 81% preference over alternative methods. When asked to rate scene quality across metrics like spatial accuracy, texture fidelity, and mesh coherence, participants consistently ranked our approach highest. The top reason cited for preference was "Layout and Composition," confirming that users value spatially coherent, physically plausible scenes above other factors.

Ablations

To validate the importance of our key components, we conducted ablation studies removing either the 4-DoF ground alignment or the Application-Querying inpainting technique.

Impact of 4-DoF Ground Alignment

Without the 4-DoF constraint, objects float or sink into the floor, breaking physical plausibility. Quantitatively, Chamfer Distance increases to 0.030, F-Score drops to 0.68, and IoU falls to 0.51. The 4-DoF constraint is essential for producing scenes ready for simulation and animation.
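
The page does not detail how ground contact is classified; one plausible rule, sketched below with hypothetical names, is to test whether an asset's lowest point lies within a small tolerance of the estimated floor plane, and to apply the 4-DoF constraint only when it does.

```python
# Hypothetical ground-contact test: an asset is treated as ground-contacting
# when its lowest point lies within `tol` of the estimated floor plane.
# This is an illustrative rule, not the classifier used by 3D-RE-GEN.
import numpy as np

def touches_ground(points: np.ndarray, floor_y: float, tol: float = 0.02) -> bool:
    """points: (N, 3) asset point cloud; floor_y: height of the floor plane."""
    return abs(points[:, 1].min() - floor_y) < tol

def snap_to_ground(points: np.ndarray, floor_y: float) -> np.ndarray:
    """Translate the asset vertically so its lowest point rests on the floor."""
    return points + np.array([0.0, floor_y - points[:, 1].min(), 0.0])
```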

Impact of Application-Querying

Removing Application-Querying and replacing it with naive inpainting yields incomplete objects lacking scene awareness. Background quality degrades significantly, and the inpainted assets lose context-aware details. Without A-Q, 2D inpainting metrics (SSIM) drop from 0.54 to 0.27, and perceptual metrics (LPIPS) worsen from 0.34 to 0.66. This validates that structured visual prompting is critical for high-quality, context-aware object recovery.
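
The reported 2D metrics can be reproduced with standard packages; a small sketch using scikit-image for SSIM and the lpips package for LPIPS, assuming float RGB images in [0, 1] with shape (H, W, 3):

```python
# SSIM via scikit-image and LPIPS via the `lpips` package.
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity

lpips_fn = lpips.LPIPS(net='alex')  # perceptual distance; lower is better

def inpainting_metrics(pred: np.ndarray, gt: np.ndarray):
    ssim = structural_similarity(pred, gt, channel_axis=-1, data_range=1.0)

    def to_tensor(img):
        # (H, W, 3) in [0, 1] -> (1, 3, H, W) in [-1, 1], the range LPIPS expects
        return torch.from_numpy(img).float().permute(2, 0, 1)[None] * 2 - 1

    with torch.no_grad():
        lpips_val = lpips_fn(to_tensor(pred), to_tensor(gt)).item()
    return ssim, lpips_val
```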

Citation

@inproceedings{3dregen,
  author    = {Sautter, Tobias and Dihlmann, Jan-Niklas and Lensch, Hendrik P.A.},
  title     = {3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework},
  booktitle = {},
  year      = {2025}
}

More Information

Open Positions

Interested in pursuing a PhD in computer graphics?
