SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
Yukai Shi1,3, Weiyu Li2,4, Zihao Wang4, Hongyang Li3, Xingyu Chen3, Ping Tan2,4, Lei Zhang3.
In this work, we propose SceneMaker, a decoupled 3D scene generation framework. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and in open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation and enhance it by leveraging image datasets together with our collected de-occlusion datasets, covering much more diverse open-set occlusion patterns. We then propose a unified pose estimation model that integrates global and local mechanisms into both self-attention and cross-attention to improve accuracy. In addition, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets will be released.
Our framework consists of scene perception, 3D object generation under occlusion, and pose estimation. We decouple the de-occlusion model from 3D object generation, and construct a unified pose estimation model that incorporates both global and local attention mechanisms.
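To make the global/local design concrete, the following is a minimal PyTorch sketch of one possible pose-estimation transformer block: object tokens first attend globally across all objects, then locally within each object via an attention mask, and finally cross-attend to scene image features. All names (GlobalLocalBlock, local_mask, the token layout) are our own illustrative assumptions, not the released SceneMaker implementation.

```python
import torch
import torch.nn as nn

class GlobalLocalBlock(nn.Module):
    """Sketch of a block integrating global and local mechanisms
    into both self-attention and cross-attention (names assumed)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.global_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, obj_tokens, scene_tokens, local_mask):
        # Global self-attention: every object token sees all others,
        # capturing scene-level relations between objects.
        h = self.norms[0](obj_tokens)
        x = obj_tokens + self.global_self(h, h, h, need_weights=False)[0]
        # Local self-attention: the boolean mask (True = blocked) restricts
        # each token to tokens of the same object, refining per-object cues.
        h = self.norms[1](x)
        x = x + self.local_self(h, h, h, attn_mask=local_mask,
                                need_weights=False)[0]
        # Cross-attention from object tokens to scene image features.
        h = self.norms[2](x)
        x = x + self.cross(h, scene_tokens, scene_tokens, need_weights=False)[0]
        return x + self.mlp(self.norms[3](x))

if __name__ == "__main__":
    block = GlobalLocalBlock(dim=256, heads=8)
    obj = torch.randn(2, 16, 256)     # 2 scenes, 16 object tokens each
    scene = torch.randn(2, 64, 256)   # scene image feature tokens
    ids = torch.arange(16) // 4       # assume 4 tokens per object
    local_mask = ids[:, None] != ids[None, :]
    print(block(obj, scene, local_mask).shape)  # torch.Size([2, 16, 256])
```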
We decouple and develop a robust de-occlusion model by leveraging image datasets as an open-set occlusion prior. Our model achieves higher-quality and more text-controllable results under severe occlusion and open-set conditions.
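Since de-occlusion is decoupled from 3D generation, the overall data flow can be sketched as below: each occluded object crop is first amodally completed in 2D, then passed to an image-to-3D generator, and finally placed by the pose estimation model. Every name here (DeOccluder, generate_scene, the stub layers, the callables) is a hypothetical placeholder showing the decoupled interface, not the authors' actual models.

```python
import torch
import torch.nn as nn

class DeOccluder(nn.Module):
    """Stub de-occlusion network: (RGB crop + occlusion mask) -> completed RGB.
    A real model would be a strong generative image model; this stub only
    illustrates the interface of the decoupled stage."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, rgb, occ_mask):
        # occ_mask marks pixels hidden by other objects; the model fills
        # them in from the visible context.
        return self.net(torch.cat([rgb, occ_mask], dim=1))

def generate_scene(image, instances, deoccluder, image_to_3d, estimate_pose):
    """Decoupled pipeline sketch: de-occlude first, then generate 3D,
    then estimate each object's pose in the scene."""
    objects = []
    for crop, occ_mask in instances:
        completed = deoccluder(crop, occ_mask)  # 2D amodal completion
        mesh = image_to_3d(completed)           # per-object 3D generation
        pose = estimate_pose(mesh, image)       # object placement in scene
        objects.append((mesh, pose))
    return objects
```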