Fuse3D, presented under the title "Generating 3D Assets Controlled by Multi-Image Fusion," addresses a limitation shared by virtually all existing text-to-3D and image-to-3D pipelines: they accept only a single conditioning image as a global input, so creators cannot specify different visual characteristics for different spatial regions of a model in a single generation pass.
Developed at the State Key Laboratory of CAD&CG, Zhejiang University, Fuse3D introduces a principled multi-condition architecture that fuses visual features from multiple independent reference images and assigns them to precisely targeted 3D regions — without requiring any fine-tuning of the underlying generative model. The result is unprecedented local control over geometry, texture, and appearance, all within a single coherent 3D asset produced in under 20 seconds.
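The passage above does not spell out Fuse3D's exact fusion operator, but the core idea, blending per-image visual features into region-targeted conditioning, can be illustrated with a minimal sketch. Everything here is hypothetical (the function name, the mask-based weighting scheme, and the array shapes are illustrative assumptions, not the project's actual implementation):

```python
import numpy as np

def fuse_region_features(features, masks):
    """Hypothetical sketch: blend per-reference feature maps into one
    conditioning map, with each reference image controlling its own region.

    features: list of (H, W, C) arrays, one per reference image.
    masks:    list of (H, W) arrays in [0, 1], one per reference,
              marking the region that image should control.
    Returns an (H, W, C) fused map; overlapping masks are normalized
    so the per-pixel weights sum to 1.
    """
    h, w, c = features[0].shape
    fused = np.zeros((h, w, c))
    weight = np.zeros((h, w, 1))
    for f, m in zip(features, masks):
        m = m[..., None]          # broadcast mask over feature channels
        fused += m * f
        weight += m
    # Guard against division by zero in regions no mask covers
    return fused / np.clip(weight, 1e-8, None)

# Toy demo: left half conditioned on image A, right half on image B
fa = np.full((4, 4, 2), 1.0)      # stand-in features from reference A
fb = np.full((4, 4, 2), 3.0)      # stand-in features from reference B
ma = np.zeros((4, 4)); ma[:, :2] = 1.0
mb = np.zeros((4, 4)); mb[:, 2:] = 1.0
out = fuse_region_features([fa, fb], [ma, mb])
print(out[0, 0, 0], out[0, 3, 0])  # left pixel follows A, right follows B
```

In the real system this fusion would happen in the generative model's feature space rather than on raw image grids, and region assignment would target 3D structures instead of 2D masks; the sketch only conveys the masked-blending principle.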
The framework is built upon TRELLIS, Microsoft's state-of-the-art image-to-3D model, and adopts 3D Gaussian Splatting (3DGS) as its core scene representation — a choice that enables photorealistic rendering at interactive frame rates while remaining fully compatible with downstream editing workflows.