Meta recently announced SAM 3D, an extension of its Segment Anything Model family that can reconstruct complex, spatially coherent three-dimensional models from ordinary two-dimensional photographs. The heart of this new architecture lies in two distinct but complementary models: SAM 3D Objects and SAM 3D Body. The first is engineered for the reconstruction of inanimate objects and entire scenes, handling common problems such as occlusions and partial views, while the second specializes in the human figure, estimating pose and body shape with unprecedented precision. Unlike previous attempts in the field, which relied predominantly on synthetic and isolated data, this system aims for a “common sense” understanding of the real physical world, and Meta is making fundamental resources such as inference code and new evaluation benchmarks accessible to the scientific community. According to Meta, all this «has the potential to be used for creative applications in fields such as robotics, interactive media, science and sports medicine».
How SAM 3D Objects and SAM 3D Body work
Delving deeper into the technical operation of SAM 3D Objects, we notice a fundamental paradigm shift compared to traditional approaches. Historically, 3D reconstruction models have been limited by the scarcity of training data: while immense databases exist for text and images, the availability of “ground truth” 3D data is orders of magnitude lower. To overcome this obstacle, rather than relying only on computer-generated synthetic assets (which often fail to reflect the complexity of the real world), Meta developed an innovative data engine.
This system uses a virtuous cycle in which human annotators do not have to create models from scratch, a slow and expensive process, but simply verify and classify the meshes generated by the AI. By “mesh” we mean the network of polygons that defines the geometric structure of a 3D object. Thanks to this method, which combines automatic generation with human supervision, Meta was able to annotate almost one million real images, creating a massive training dataset that allows the software to handle small objects, indirect views and complex backgrounds much better than its predecessors.
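The two ideas in this paragraph, what a mesh is and how the verify-don't-create loop works, can be pictured concretely with a short sketch. Everything here is illustrative: the `Mesh` fields, `data_engine_round`, and the stand-in generator and annotator are hypothetical names, not Meta's actual pipeline or API.

```python
from dataclasses import dataclass

@dataclass
class Mesh:
    """A mesh: points in 3D space plus triangular faces that index into them."""
    vertices: list  # (x, y, z) tuples
    faces: list     # (i, j, k) triples of vertex indices

def data_engine_round(images, generate_mesh, human_verdict):
    """One round of the loop described above: the model proposes a mesh
    for each image, and a human annotator only accepts or rejects it."""
    accepted = []
    for img in images:
        mesh = generate_mesh(img)
        if human_verdict(img, mesh):
            accepted.append((img, mesh))
    return accepted

# Toy demo: a stand-in generator that always emits a single triangle,
# and a stand-in annotator that rejects empty meshes.
def fake_generator(img):
    return Mesh(vertices=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
                faces=[(0, 1, 2)])

def fake_annotator(img, mesh):
    return len(mesh.faces) > 0

kept = data_engine_round(["photo_1.jpg", "photo_2.jpg"],
                         fake_generator, fake_annotator)
print(len(kept))  # → 2
```

The point of the design is that the human's job shrinks from modelling (minutes to hours per object) to a yes/no judgment (seconds), which is what makes annotating on the order of a million images feasible.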
Shifting our attention to SAM 3D Body, we find a solution designed to estimate the human form even in difficult conditions, such as unusual postures or crowded scenes. The distinguishing feature of this model is its use of the Meta Momentum Human Rig (MHR), a new format that structurally separates the skeleton from the shape of the soft tissues, yielding an anatomical rendering that is more faithful to reality.
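The skeleton/soft-tissue separation can be sketched with a toy data structure. This is only an illustration of the idea, assuming a rig split into pose and shape parameters; the field names and structure are invented here and do not reflect MHR's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SkeletonPose:
    joint_rotations: dict  # joint name -> (rx, ry, rz) rotation, radians

@dataclass
class BodyShape:
    shape_coeffs: list     # low-dimensional soft-tissue shape coefficients

@dataclass
class HumanRig:
    pose: SkeletonPose     # articulation: how the skeleton is bent
    shape: BodyShape       # identity: how the body is built

# The payoff of the separation: one shape can be posed many ways, and
# one pose can dress many shapes, without entangling the two.
shape = BodyShape(shape_coeffs=[0.2, -0.1, 0.05])
standing = HumanRig(SkeletonPose({"left_elbow": (0.0, 0.0, 0.0)}), shape)
bent = HumanRig(SkeletonPose({"left_elbow": (1.2, 0.0, 0.0)}), shape)

print(standing.shape is bent.shape)  # → True: same body, different poses
```

Keeping the two parameter sets orthogonal is what lets a model refine pose against image evidence without distorting the person's body shape, and vice versa.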
The training was based on a dataset of approximately 8 million high-quality images and, according to Meta, «the model is trained using prompt-based guidance and multi-step refinement, allowing flexible user interaction and improving 2D alignment with visual evidence in the image».
Meta also introduced the SA-3DAO (SAM 3D Artist Objects) dataset, a benchmark considerably more demanding than current standards, pushing research towards a more realistic and less artificial 3D perception.
Current limitations
As significant as Meta's advances in 3D are, some limitations remain. In object reconstruction, the output resolution is still moderate, so details of more complex structures may be lost or appear distorted. Additionally, SAM 3D Objects processes elements individually and cannot yet reason about physical interactions, such as contact or interpenetration between multiple objects. On the body reconstruction front there is also room for improvement: the model processes each individual separately, ignoring interactions between people or between humans and the environment, and its hand-pose estimation, although improved, does not yet reach the levels of systems specialized exclusively in that anatomical part.