NVIDIA announced a new model called Edify 3D for 3D asset generation.
(I wonder if 3D is going to be the next dimension for AI models as they follow a tentative route of text → image → audio/video → 3D?)
If we think 3D, the first thing that comes to mind from our school math books is the coordinate system with three axes: x, y, and z. This model capitalizes on that idea by working with multiple viewpoints of an object to generate the final result.
The two main components of this model are:
1. Multi-view diffusion model
2. Reconstruction model
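Before looking at each component, here is a minimal sketch of how the two stages could fit together. Every name, shape, and placeholder output below is an illustrative assumption, not NVIDIA's actual API:

```python
# Hypothetical two-stage pipeline; function names, shapes, and the
# placeholder outputs are assumptions for illustration only.
import numpy as np

def multi_view_diffusion(prompt: str, camera_poses: np.ndarray):
    """Stand-in for stage 1: one RGB + one surface-normal image per pose."""
    n_views = len(camera_poses)
    rgb = np.random.rand(n_views, 256, 256, 3)      # placeholder renders
    normals = np.random.rand(n_views, 256, 256, 3)  # placeholder normals
    return rgb, normals

def reconstruct_mesh(rgb: np.ndarray, normals: np.ndarray) -> dict:
    """Stand-in for stage 2: lift the 2D views into 3D geometry."""
    return {"vertices": np.zeros((0, 3)), "faces": np.zeros((0, 3), dtype=int)}

# Example: four viewpoints around the object (azimuth angles, in degrees).
poses = np.array([0.0, 90.0, 180.0, 270.0])
rgb, normals = multi_view_diffusion("a wooden chair", poses)
mesh = reconstruct_mesh(rgb, normals)
```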
🟢 For the multi-view diffusion model, the steps can be summarized as:
- Input: text prompt + camera poses
- A multi-view diffusion model (Edify Image) generates an RGB image of the object for each input viewpoint. It uses cross-attention to attend across the different viewpoints with shared weights, which keeps the generated views consistent with one another (a minimal sketch of this attention pattern follows the list).
- Based on the RGB output and the text prompt, a multi-view ControlNet predicts surface normals for each image; a surface normal defines the orientation of the surface at a given point.
- Next, an upscaling ControlNet takes the previous output along with the low-resolution output of the reconstruction model and upscales it into a high-resolution RGB image.
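To make the cross-view attention idea concrete, here is a minimal PyTorch sketch. It is an illustrative assumption of how such a layer could look, not Edify 3D's actual implementation: the view axis is folded into the token sequence so one attention module, with a single set of weights, attends over tokens from all viewpoints at once.

```python
# Minimal cross-view attention sketch: the same attention weights see
# tokens from ALL viewpoints together, which is what ties the views to
# each other. Tensor shapes are illustrative assumptions.
import torch
import torch.nn as nn

batch, views, tokens, dim = 2, 4, 64, 128
x = torch.randn(batch, views, tokens, dim)  # per-view image tokens

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

# Fold the view axis into the sequence axis so one attention call
# (one set of weights) mixes information across every viewpoint.
x_flat = x.reshape(batch, views * tokens, dim)
out, _ = attn(x_flat, x_flat, x_flat)
out = out.reshape(batch, views, tokens, dim)  # back to per-view layout
```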
🟣 For the reconstruction model:
- Input: RGB and normal images from the multi-view diffusion model's output
- A transformer-based model predicts the geometry, texture, and materials of the 3D shape via a set of latent tokens (a rough sketch follows at the end of this section).
The mesh generated in this step is triangular; it is then processed further into a quadrilateral mesh with simplified geometry.
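As a rough illustration of the latent-token idea above, here is a hypothetical PyTorch sketch, not NVIDIA's implementation: a fixed set of learnable latent tokens cross-attends to the multi-view image tokens, and each updated latent is decoded into toy shape/appearance values. All sizes and the decoding head are assumptions.

```python
# Hypothetical latent-token reconstruction: learnable latents query the
# multi-view image tokens via cross-attention, then a toy head decodes
# each latent into shape/appearance values (e.g. SDF + RGB).
import torch
import torch.nn as nn

class LatentReconstructor(nn.Module):
    def __init__(self, n_latents=256, dim=128, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.decode = nn.Linear(dim, 4)  # toy head: 1 SDF + 3 color values

    def forward(self, image_tokens):  # (batch, n_tokens, dim)
        b = image_tokens.shape[0]
        q = self.latents.unsqueeze(0).expand(b, -1, -1)
        latents, _ = self.cross_attn(q, image_tokens, image_tokens)
        return self.decode(latents)   # (batch, n_latents, 4)

model = LatentReconstructor()
img_tokens = torch.randn(2, 4 * 64, 128)  # tokens from 4 views
fields = model(img_tokens)                # per-latent shape/appearance
```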