F3D-Gaus: Feed-forward 3D-aware Generation on ImageNet
with Cycle-Consistent Gaussian Splatting

Yuxin Wang1     Qianyi Wu2     Dan Xu1
1Hong Kong University of Science and Technology   2Monash University  

F3D-Gaus tackles the problem of generalizable 3D-aware generation from monocular datasets, e.g., ImageNet[1]. Given a single image as input, F3D-Gaus generates a high-fidelity 3D Gaussian representation, achieving high-quality, multi-view consistent 3D-aware generation at ~60 FPS.

Abstract

This paper tackles the problem of generalizable 3D-aware generation from monocular datasets, e.g., ImageNet[1]. The key challenge of this task is learning a robust 3D-aware representation without multi-view or dynamic data, while ensuring consistent texture and geometry across different viewpoints. Although some baseline methods are capable of 3D-aware generation, the quality of the generated images still lags behind state-of-the-art 2D generation approaches, which excel in producing high-quality, detailed images. To address this severe limitation, we propose a novel feed-forward pipeline based on pixel-aligned Gaussian Splatting, coined as F3D-Gaus, which can produce more realistic and reliable 3D renderings from monocular inputs. In addition, we introduce a self-supervised cycle-consistent constraint to enforce cross-view consistency in the learned 3D representation. This training strategy naturally allows aggregation of multiple aligned Gaussian primitives and significantly alleviates the interpolation limitations inherent in single-view pixel-aligned Gaussian Splatting. Furthermore, we incorporate video model priors to perform geometry-aware refinement, enhancing the generation of fine details in wide-viewpoint scenarios and improving the model’s capability to capture intricate 3D textures. Extensive experiments demonstrate that our approach not only achieves high-quality, multi-view consistent 3D-aware generation from monocular datasets, but also significantly improves training and inference efficiency.

Pipeline

Pipeline of F3D-Gaus. Given a single RGB image $I_0$ and depth map $D_0$, our model produces the pixel-aligned Gaussian Splatting representation $GS_0$ in a single feed-forward pass, which can be used for novel view synthesis. From this 3DGS representation, we render the image $\tilde{I}_1$ and depth map $\tilde{D}_1$ for a novel view, and feed them forward again to obtain the corresponding 3DGS $GS_1$. The two 3DGS representations are then aggregated to render the images used for supervision. This self-supervised training strategy enforces cycle-consistent 3D representation learning across different views, allowing the generalized 3DGS representations to reinforce each other and collaboratively enhance the overall 3D representation capability.
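
The cycle-consistent training step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: model (image + depth → pixel-aligned Gaussians), render (Gaussians + camera → image and depth), and the plain L1 losses are all placeholder assumptions.

import torch
import torch.nn.functional as F

def aggregate(gs_a, gs_b):
    # Pixel-aligned Gaussian primitives from two views can simply be
    # concatenated along the primitive dimension (assumed shape: B x N x ...).
    return {k: torch.cat([gs_a[k], gs_b[k]], dim=1) for k in gs_a}

def cycle_consistent_step(model, render, I0, D0, cam0, cam1):
    GS0 = model(I0, D0)                  # feed-forward Gaussians from the input view
    I1_t, D1_t = render(GS0, cam1)       # rendered image/depth at the novel view
    GS1 = model(I1_t, D1_t)              # re-predict Gaussians from the rendered view
    GS_agg = aggregate(GS0, GS1)         # aggregate the two aligned Gaussian sets
    I0_hat, _ = render(GS_agg, cam0)     # re-render both views for supervision
    I1_hat, _ = render(GS_agg, cam1)
    return F.l1_loss(I0_hat, I0) + F.l1_loss(I1_hat, I1_t.detach())

Aggregating $GS_0$ and $GS_1$ before rendering is what lets the two views' primitives fill in each other's gaps, alleviating the interpolation limitations of single-view pixel-aligned Gaussian Splatting.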

Comparison with the State-of-the-art Method

Qualitative comparison: G3DR[2]+Real-ESRGAN[3] vs. Ours.

Novel View Synthesis on Out-of-domain Samples


Although F3D-Gaus is trained only on ImageNet[1], it generalizes to out-of-domain samples, producing high-quality novel view synthesis results on face and scene images.

Surface Extraction from Single Image via F3D-Gaus + GOF[4]

Input image → 3DGS representation (the output of F3D-Gaus) → Extracted mesh


F3D-Gaus renders depth maps in the same way as GOF[4] for supervision. Therefore, once we obtain the predicted 3DGS from a single image via F3D-Gaus, we can extract a mesh using GOF's mesh extraction method (tetrahedral grid generation combined with binary search). Note that the mesh is derived directly from the 3DGS representation and does not rely on image-based optimization.
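
GOF's full extraction pipeline is not reproduced here; the sketch below only illustrates the binary-search step on a single tetrahedral edge. opacity_fn (a 3D point → scalar value evaluated from the predicted 3DGS field) and the toy level-set threshold are assumptions for illustration, not GOF's actual API.

import numpy as np

def edge_crossing(opacity_fn, p_in, p_out, level=0.5, iters=20):
    # Binary search for the level-set crossing on the segment p_in -> p_out,
    # where p_in evaluates above the level and p_out below it.
    p_in, p_out = np.asarray(p_in, float), np.asarray(p_out, float)
    for _ in range(iters):
        mid = 0.5 * (p_in + p_out)
        if opacity_fn(mid) >= level:
            p_in = mid        # midpoint is still inside the surface
        else:
            p_out = mid       # midpoint is outside the surface
    return 0.5 * (p_in + p_out)

# Toy usage: a unit-sphere "opacity" field; the crossings found on all
# sign-changing tetrahedral edges become mesh vertices.
sphere = lambda p: 1.0 - np.linalg.norm(p)
vertex = edge_crossing(sphere, [0.0, 0.0, 0.0], [2.0, 0.0, 0.0], level=0.0)
# vertex ≈ [1, 0, 0], a point on the level set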

References
[1] Deng J, Dong W, Socher R, et al. ImageNet: A large-scale hierarchical image database. In: CVPR (2009)
[2] Reddy P, Elezi I, Deng J. G3DR: Generative 3D Reconstruction in ImageNet. In: CVPR (2024)
[3] Wang X, Xie L, Dong C, et al. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In: ICCVW (2021)
[4] Yu Z, Sattler T, Geiger A. Gaussian opacity fields: Efficient adaptive surface reconstruction in unbounded scenes. ACM TOG (2024)

BibTeX

@article{wang2025f3dgaus,
    title={F3D-Gaus: Feed-forward 3D-aware Generation on ImageNet with Cycle-Consistent Gaussian Splatting},
    author={Wang, Yuxin and Wu, Qianyi and Xu, Dan},
    journal={arXiv preprint arXiv:2501.06714},
    year={2025}
}