Machines that create lifelike video portraits of an individual from audio content are becoming increasingly prevalent. These talking face generation techniques are most commonly used in applications such as virtual avatars, online conferencing and animated movies.
The prevailing two-stage framework for creating talking faces first predicts an intermediate representation from the input audio and then uses a renderer to synthesize the video portrait from that representation. Whilst this two-stage approach has made significant gains in realism and lip-sync quality, it struggles with the one-to-many mapping problem: the same audio can correspond to many plausible visual outputs, depending on phoneme context, emotion and lighting conditions.
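The two-stage pipeline can be pictured as two composed functions. The sketch below is a minimal illustration, not the authors' implementation: the function names, the toy linear "models" and the choice of a 32-dimensional intermediate representation are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(64, 32))   # stand-in for the audio-to-intermediate model
W2 = rng.normal(size=(32, 16))   # stand-in for the neural renderer

def audio_to_intermediate(audio_feats):
    """Stage 1: map per-frame audio features (T, 64) to an intermediate
    representation, here toy 32-dim coefficients per frame."""
    return np.tanh(audio_feats @ W1)

def render_frames(intermediate):
    """Stage 2: map the intermediate representation to rendered frame
    features (T, 16); a real renderer would output RGB images."""
    return intermediate @ W2

audio = rng.normal(size=(5, 64))                      # 5 frames of audio features
video = render_frames(audio_to_intermediate(audio))   # shape (5, 16)
```

The one-to-many difficulty lives in stage 1: a single `audio` input is compatible with many valid intermediate representations, which is what MemFace's memories are designed to disambiguate.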
To tackle this problem, a new approach called MemFace has been proposed. It augments the pipeline with an implicit memory and an explicit memory that together bridge the gap between the audio and the visual representation: by supplementing the information missing from the audio alone, the memories make the one-to-many mapping easier to resolve. The implicit memory captures high-level semantic information, learning the habitual patterns and attitudes of the target individual, whilst the explicit memory stores pixel-level details such as wrinkles and shadows. The explicit memory is built from 3D face models and their accompanying picture patches: the vertices of these models serve as the keys and the pixel-level information as the values. By querying this memory with the vertices derived from the estimated expression, the neural rendering model retrieves the pixel-level information for each input frame and uses it to render the desired visual appearance.
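The key-value lookup described above can be sketched as attention over memory slots. This is a hypothetical illustration under assumed shapes: `query_explicit_memory`, the toy dimensions, and the use of scaled dot-product attention are assumptions, not MemFace's actual code.

```python
import numpy as np

def query_explicit_memory(query_vertices, memory_keys, memory_values):
    """Retrieve pixel-level features via scaled dot-product attention.

    query_vertices: (d,)   flattened vertex coordinates for one frame
    memory_keys:    (n, d) stored vertex keys from 3D face models
    memory_values:  (n, v) stored pixel-level patch features
    """
    scores = memory_keys @ query_vertices / np.sqrt(len(query_vertices))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over the n memory slots
    return weights @ memory_values    # (v,) blended pixel-level features

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 6))     # 8 memory slots, toy 6-dim vertex keys
values = rng.normal(size=(8, 4))   # toy 4-dim pixel-level features
feat = query_explicit_memory(keys[2], keys, values)   # shape (4,)
```

Because retrieval is a soft weighted blend rather than an exact lookup, a query whose vertices fall between stored entries still returns plausible pixel-level detail.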
In combination, the two memories supply the detail that audio alone cannot, narrowing the gap between the audio and visual representations and yielding more realistic video portraits with an easier one-to-many mapping.
Research is ongoing to further optimize MemFace and improve its ability to create more realistic portraits with better lip-sync quality, emotional expression and head motion. With this in mind, enhanced talking face creation could soon become a reality, enabling the creation of lifelike virtual avatars, animated movies and more.
In conclusion, the new approach of MemFace provides a useful tool to alleviate the one-to-many mapping problem inherent in talking face creation. This could lead to more realistic video portraiture and opens up the potential for a variety of new applications in fields such as virtual reality, animation and online services.