A team of researchers from Microsoft Research Asia, one of the research laboratories of the technology giant founded by Bill Gates, has announced VASA-1, a new artificial intelligence model capable of generating very realistic deepfake videos from a single photo and an audio track. Given the high degree of realism this technology can achieve, its potential uses are almost endless, as are the ethical and moral concerns that this kind of solution usually brings with it. The same fears, in fact, have greeted other AI models in recent months, including OpenAI's Sora and Voice Engine or, even more recently, TikTok's feature for generating a voice from a simple audio clip.
Incredible videos made with VASA-1
One of the clips made with VASA-1 that attracted the most attention shows the Mona Lisa performing a very realistic rap:
On the project's official page we also find videos of people talking, with practically perfect lip sync and a remarkable range of facial expressions consistent with the content of the audio:
How VASA-1 works, Microsoft's new AI that creates deepfake videos
A truly remarkable aspect of VASA-1 is that the model is not only capable of producing lip movements perfectly in sync with the audio track, but can also reproduce a rather wide spectrum of facial expressions, together with very realistic head movements that give a touch of liveliness and authenticity to the subjects “fed” to the algorithm.
But how does VASA-1 generate deepfakes from a simple photo and a single audio file? In the paper released by the team of Microsoft researchers who worked on the project, we read:
Key innovations include a holistic diffusion-based model of facial dynamics and head motion generation that operates in a latent face space, and the development of an expressive and disentangled latent face space built from videos. Through extensive experiments, including the evaluation of a number of new metrics, we demonstrate that our method significantly outperforms previous ones along various dimensions.
Stripping the technical explanation in the abstract down to its essentials: through a series of complex computations, the VASA-1 algorithm merges the “starting” image with the available audio track and, optionally, with a set of parameters describing human expressiveness, so as to give the subject the appropriate expression. This gives the videos generated with VASA-1 the following characteristics (a simplified sketch of the pipeline follows the list).
- Realism and liveliness: the videos generated with VASA-1 do not stay “frozen” with the face completely still; the subjects stand out from the background and move their heads in a rather natural way.
- Control over generation: you can make faces look in specific directions, resize them, and have them convey specific emotions. The model also lets you control a face's appearance, its 3D head pose and its facial expressions independently of one another.
- Ability to create various types of content: Microsoft VASA-1 can also generate artistic content, for example showing subjects singing or speaking in other languages, despite not having been trained for these cases. Training was carried out using thousands of images and a wide variety of facial expressions.
- Real-time efficiency: VASA-1 can generate 512×512-pixel video frames at 45 fps in offline batch processing mode and at up to 40 fps in online streaming mode, which makes the software highly efficient. On average, producing a video took about 2 minutes on an NVIDIA RTX 4090 GPU.
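To make the idea above a little more concrete, here is a minimal, purely illustrative sketch of how a diffusion-based talking-head pipeline of this kind could be structured: an appearance latent extracted from the single photo, per-frame audio features, a denoising loop that produces motion latents conditioned on the audio and on optional control signals (gaze direction, emotion), and a decoder that turns everything into frames. Microsoft has not released VASA-1's code, so every function, dimension and parameter below is a hypothetical stand-in, not the actual implementation.

```python
# Hedged sketch of a diffusion-based talking-head pipeline in the spirit of
# the VASA-1 abstract. All names, shapes and values are simplified stand-ins.
import numpy as np

LATENT_DIM = 64   # hypothetical size of the face latent space
FPS = 25          # hypothetical output frame rate for this toy example

def encode_appearance(image: np.ndarray) -> np.ndarray:
    """Stand-in for an encoder that extracts an identity/appearance latent."""
    return image.reshape(-1)[:LATENT_DIM] / 255.0

def encode_audio(audio: np.ndarray, n_frames: int) -> np.ndarray:
    """Stand-in for an audio encoder producing one feature vector per frame."""
    chunks = np.array_split(audio, n_frames)
    return np.stack([np.full(LATENT_DIM, chunk.mean()) for chunk in chunks])

def sample_motion_latents(audio_feats: np.ndarray, control: np.ndarray,
                          steps: int = 10) -> np.ndarray:
    """Toy 'diffusion' loop: start from noise and iteratively denoise motion
    latents conditioned on audio features and control signals (gaze, emotion)."""
    rng = np.random.default_rng(0)
    x = rng.normal(size=audio_feats.shape)       # pure noise
    for t in range(steps, 0, -1):
        guidance = audio_feats + control         # conditioning signal
        x = x + (guidance - x) * (1.0 / t)       # crude denoising step
    return x

def decode_frames(appearance: np.ndarray, motion_latents: np.ndarray) -> np.ndarray:
    """Stand-in decoder: combines the static appearance latent with per-frame
    motion latents to produce video frames (here just flat gray images)."""
    frames = []
    for m in motion_latents:
        intensity = np.clip((appearance.mean() + m.mean()) * 127 + 128, 0, 255)
        frames.append(np.full((512, 512), intensity, dtype=np.uint8))
    return np.stack(frames)

# --- Usage with dummy inputs -------------------------------------------------
image = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)  # single photo
audio = np.random.randn(16000)                                    # 1 s of audio
n_frames = FPS                                                     # 1 s of video

appearance = encode_appearance(image)
audio_feats = encode_audio(audio, n_frames)
control = np.full(LATENT_DIM, 0.1)   # e.g. "look slightly left" (hypothetical)
motion = sample_motion_latents(audio_feats, control)
video = decode_frames(appearance, motion)
print(video.shape)  # (25, 512, 512)
```

In a real system each stand-in above would be a trained neural network and the denoising loop would follow a proper diffusion sampler; the sketch only mirrors the data flow described in the abstract: one static identity latent, a sequence of audio-conditioned motion latents, and a decoder that combines them frame by frame.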
What can be done with VASA-1
What could the potential uses of VASA-1 be? Such a powerful technology, already at version 1, certainly has potentially endless fields of application. Among the most obvious are uses in the field of video games, allowing the creation of increasingly realistic avatars, but the model could also be used to create virtual avatars for use in social contexts. Not to mention its use in the musical field, where VASA-1 seems to handle itself quite well, despite not having been trained to generate songs (as we have already mentioned above).
Above we showed you some deepfakes generated with VASA-1 starting from images that were themselves generated with AI (so these are not real photographs but images created with StyleGAN2 or DALL·E 3, except for the deepfake featuring the Mona Lisa, of course). Looking at them, you can see how advanced the model is, even if a sufficiently careful eye can spot some small artefacts here and there that betray the artificial nature of the videos.
What are the possible risks of this technology
Since “with great power comes great responsibility”, Microsoft has decided not to make the model public, given the possible risks deriving from improper use, first and foremost the spread of artfully crafted deepfakes, which would be difficult to detect and which could be used to create fake news potentially dangerous for the security of entire economies and countries. Microsoft, in fact, stated bluntly:
We have no intention of releasing an online demo, API, product, additional implementation details, or any related offering until we are confident that the technology will be used responsibly and in compliance with appropriate regulations.
Sora, OpenAI's video generation software (a company in which, for the record, Microsoft is a major investor), was also withheld from the public for essentially the same reasons. It will be interesting to see whether the advent of VASA-1 will contribute to the development of the tools OpenAI has already built and, if so, in what way.