Abstract
Generative AI is making a tremendous impact on many areas of society, with image and text generation among its most well-known applications. While the results produced by popular systems such as DALL-E or ChatGPT are impressive, many open questions remain, both in the underlying methodology and in its applications. This PhD thesis addresses these topics, focusing on the use of Deep Generative AI for modeling human faces and expressions.
While traditional computer graphics methods for generating faces offer explicit and fine-grained control over the generated images, they still fall short of true photorealism. Data-driven Deep Generative Models, on the other hand, such as Generative Adversarial Networks and Denoising Diffusion Models, have seen rapid improvements in image quality in recent years and can now synthesize human face portraits that are almost indistinguishable from real photographs. However, these Deep Generative Models still lack the fine-grained control that traditional computer graphics methods offer.
The main objective of the thesis is to achieve greater control over the images generated by Deep Generative Models. Deep Generative Models learn to compress the high-dimensional manifold of plausible face images into a reduced representation called the latent space. This thesis develops methods for discovering preferred directions or trajectories in the latent space that correspond to semantically interpretable changes to the generated images, for example, changes to pose or facial expression.
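To make this concrete, the sketch below shows the basic editing mechanism: adding a scaled semantic direction to a latent code before decoding it into an image. Everything here is an illustrative assumption, in particular the `generator` stub and the suggestion that the direction could come from PCA or a linear attribute probe; it is a minimal sketch, not the thesis's specific method.

```python
# Minimal sketch of latent-space editing: moving a latent code along a
# direction changes one semantic attribute of the generated image.
import numpy as np

rng = np.random.default_rng(0)

def generator(z: np.ndarray) -> np.ndarray:
    """Placeholder: a real pretrained model (e.g. StyleGAN) would map a
    latent code to an image; here we just return a stand-in array."""
    return np.tanh(z)

z = rng.standard_normal(512)           # latent code of one generated face
d = rng.standard_normal(512)           # semantic direction (e.g. "smile"),
d /= np.linalg.norm(d)                 # found e.g. by PCA or a linear probe

for alpha in (-3.0, 0.0, 3.0):         # walk along the direction
    image = generator(z + alpha * d)   # attribute strength varies with alpha
```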
The first part of the thesis focuses on StyleGAN, a state-of-the-art Generative Adversarial Network architecture that has revolutionized the field of unconditional synthesis of human faces.
First, a multilinear framework is proposed for modeling facial expressions with StyleGAN.
The method factorizes the latent space of a pretrained StyleGAN model into distinct, semantically meaningful subspaces, each controlling a single attribute of the generated images, such as identity, facial expression, or pose angle.
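As a rough illustration of what such a factorization enables, consider a latent code composed from per-attribute coefficients; editing one attribute then leaves the others untouched. The basis matrices below are random stand-ins for the learned multilinear factors, so this is a sketch of the interface rather than the actual model.

```python
# Hypothetical factorized latent code: a mean plus contributions from
# attribute-specific subspaces (identity, expression, pose).
import numpy as np

rng = np.random.default_rng(0)
dim = 512

w_mean = rng.standard_normal(dim)         # mean latent code
U_id   = rng.standard_normal((dim, 32))   # identity subspace basis
U_expr = rng.standard_normal((dim, 16))   # expression subspace basis
U_pose = rng.standard_normal((dim, 3))    # pose subspace basis

def compose(c_id, c_expr, c_pose):
    """Reassemble a latent code from per-attribute coefficients."""
    return w_mean + U_id @ c_id + U_expr @ c_expr + U_pose @ c_pose

c_id, c_expr, c_pose = rng.standard_normal(32), np.zeros(16), np.zeros(3)
w_neutral = compose(c_id, c_expr, c_pose)  # some person, neutral expression
c_expr[0] = 2.0                            # move within the expression subspace
w_smiling = compose(c_id, c_expr, c_pose)  # identity and pose unchanged
```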
Next, this thesis explores how StyleGAN represents 3D structure. We use Non-Rigid Structure-from-Motion to construct a sparse 3D model from 2D images generated with StyleGAN. We then propose a novel method for connecting the 3D model with the latent space of StyleGAN, allowing for explicit control of the 3D geometry of the generated images.
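One simple way such a connection could look, shown here purely as a sketch on synthetic data, is a linear map fit by least squares between the recovered 3D shape parameters and the corresponding latents; a target geometry can then be translated into a latent code. The dimensions and the linearity assumption are illustrative, not the thesis's actual formulation.

```python
# Sketch: fit a linear map from sparse 3D shape parameters (as recovered by
# Non-Rigid Structure-from-Motion) to StyleGAN latents, then use it in the
# other direction: pick a target geometry, obtain a latent that realizes it.
import numpy as np

rng = np.random.default_rng(0)
n, latent_dim, shape_dim = 1000, 512, 20

W = rng.standard_normal((n, latent_dim))   # latents of generated images
S = rng.standard_normal((n, shape_dim))    # their recovered 3D shape params

S_aug = np.hstack([S, np.ones((n, 1))])        # add an intercept column
A, *_ = np.linalg.lstsq(S_aug, W, rcond=None)  # ordinary least squares fit

s_target = rng.standard_normal(shape_dim)       # desired 3D geometry
w_edit = np.concatenate([s_target, [1.0]]) @ A  # latent realizing it (approx.)
```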
Very recently, Denoising Diffusion Models have emerged as a strong competitor to Generative Adversarial Networks, both in terms of the quality and diversity of the generated images. However, the latent space of this class of models is still not well understood. The final part of this thesis proposes novel supervised and fully unsupervised approaches for semantic editing of face images using the semantic latent space of Denoising Diffusion Models.
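To give a feel for the mechanism, the sketch below applies a semantic edit by perturbing the denoiser at every step of a deterministic DDIM-style sampling loop. The `unet` stub, the noise schedule, and the direction `delta_h` are all toy placeholders; in a real model the shift would be applied to the U-Net's bottleneck activations rather than to the noise prediction itself.

```python
# Conceptual sketch of semantic editing during diffusion sampling: a fixed
# direction delta_h perturbs the denoiser at each deterministic DDIM step.
import numpy as np

rng = np.random.default_rng(0)

def unet(x_t, t, delta_h=None):
    """Toy stand-in for a noise-prediction U-Net; the semantic edit is
    approximated by adding delta_h to the prediction."""
    eps = 0.1 * x_t
    if delta_h is not None:
        eps = eps + 0.01 * delta_h
    return eps

x_t = rng.standard_normal((3, 64, 64))      # start from pure Gaussian noise
delta_h = rng.standard_normal((3, 64, 64))  # semantic direction, e.g. "age"
abar = np.linspace(0.02, 1.0, 50)           # toy cumulative noise schedule

for i in range(len(abar) - 1):              # deterministic DDIM-style loop
    a_t, a_prev = abar[i], abar[i + 1]
    eps = unet(x_t, i, delta_h)                               # edited prediction
    x0 = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)      # predicted clean image
    x_t = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps  # DDIM update
```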
| Original language | Danish |
|---|---|
| Number of pages | 166 |
| ISBN (Print) | 978-87-7949-410-7 |
| Publication status | Published - 24 Nov 2023 |
| Series | ITU-DS |
| Number | 214 |
| ISSN | 1602-3536 |