Do the generated images look too realistic? They really are generated; the backgrounds and textures are impressively convincing. Frankly, BigGAN is so good that it is almost showing off! Okay, let's get to the topic.
With the development of generative models such as GANs and VAEs, image generation has advanced by leaps and bounds in the past few years.
The biggest contributor among these is arguably the GAN. Its adversarial idea lets the generator and the discriminator advance each other in a game, so that the generated images become clear and realistic. SAGAN had already reached an Inception Score (IS) of 52 on ImageNet generation; qualitatively, I found SAGAN's generated ImageNet samples already excellent. BigGAN's results, however, left me sighing with admiration. Why can BigGAN achieve such a big breakthrough?
One big reason is exactly what the title says: Large Scale GAN Training for High Fidelity Natural Image Synthesis. "Large scale" means training with a very large batch size, up to 2048 (we usually train with a batch size of around 64), wider convolutional channels, and many more network parameters: at a batch size of 2048, the whole network has nearly 1.6 billion parameters (my trusty GTX 1080 suddenly falls silent).
This is why BigGAN is called BigGAN. I suspect the title is meant not only to describe the huge network but also to hint that the paper would make a big impression on readers; it certainly "scared" me. Of course, such a big improvement cannot be achieved by blindly increasing the batch size and parameter count. It also involves the truncation of the prior distribution z and control of the model's training stability, which we explain below.
Following the original paper, BigGAN's contributions can be summarized as follows: through large-scale GAN training, BigGAN achieves a huge breakthrough in generation quality; a "truncation trick" on the prior distribution z allows fine control over the trade-off between sample fidelity and diversity; and the training problems of large-scale GANs are repeatedly confronted, with techniques introduced to reduce training instability.
How BigGAN Improves Generation
BigGAN builds its model on top of SAGAN; readers unfamiliar with SAGAN can refer to my previous paper interpretation [5]. Like SAGAN, BigGAN uses the hinge loss, BatchNorm, Spectral Norm, and several other techniques. On the basis of SAGAN, BigGAN increases the batch size, applies the "truncation trick", and controls the stability of the model.
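Since the hinge loss is mentioned here, a minimal NumPy sketch of the GAN hinge losses may help (function names and shapes are illustrative, not BigGAN's actual code):

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    """Discriminator hinge loss: penalize real scores below 1
    and fake scores above -1."""
    return np.maximum(0.0, 1.0 - d_real).mean() + np.maximum(0.0, 1.0 + d_fake).mean()

def g_hinge_loss(d_fake):
    """Generator hinge loss: raise the discriminator's score on fakes."""
    return -d_fake.mean()
```

When D already separates real and fake scores by a margin (real above 1, fake below -1), the discriminator loss is zero and only the generator receives pressure.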
Increase in Batch size
The batch size in SAGAN is 256. The authors found that simply increasing the batch size already yields better performance, which the paper verifies experimentally:
When the batch size is increased to 8 times that of SAGAN, IS improves by 46%. The paper speculates that each batch then covers more modes, providing better gradients to both networks. Increasing the batch size also trains a better model in less time, but it reduces training stability; we analyze later how stability is handled.
Increasing the batch size alone is still limited, so the number of channels in each layer is increased as well. Increasing the channels by 50% roughly doubles the parameter count of both models and yields a further 21% improvement in IS, which the paper attributes to the model's capacity growing relative to the complexity of the dataset. Interestingly, the paper found in experiments that blindly increasing the network depth does not bring better results and in fact somewhat degrades generation performance.
Since BigGAN performs class-conditional generation on ImageNet, the class label c has to be injected into the network. Embedding c separately under every BatchNorm layer would add a lot of parameters, so the paper uses a shared embedding instead of a separate embedding per layer; this embedding is linearly projected to the gains and biases of each layer. The idea is borrowed from SNGAN and SAGAN; it reduces computation and memory cost and improves training speed (the number of iterations required to reach a given performance) by 37%.
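A minimal sketch of such a class-conditional BatchNorm, where a shared conditioning vector is linearly projected to per-layer gains and biases (the class name, shapes, and initialization here are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

class ConditionalBatchNorm:
    """BatchNorm whose gain and bias come from a linear projection of a
    shared conditioning vector (e.g. a class embedding concatenated with
    a chunk of z), instead of a separate embedding per layer."""
    def __init__(self, num_features, cond_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.gain_proj = rng.normal(0.0, 0.02, (cond_dim, num_features))
        self.bias_proj = rng.normal(0.0, 0.02, (cond_dim, num_features))

    def __call__(self, x, cond, eps=1e-5):
        # x: (batch, num_features), cond: (batch, cond_dim)
        mean = x.mean(axis=0, keepdims=True)
        var = x.var(axis=0, keepdims=True)
        x_hat = (x - mean) / np.sqrt(var + eps)
        gain = 1.0 + cond @ self.gain_proj   # gains centered at 1
        bias = cond @ self.bias_proj
        return gain * x_hat + bias
```

With a zero conditioning vector this reduces to plain BatchNorm; the conditioning vector shifts the per-channel gains and biases away from that baseline.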
BigGAN also improves how the prior distribution z is used. Common GANs feed z directly into only the first layer of the generator, while BigGAN sends the noise vector z to multiple layers of G instead of just the initial layer. The paper argues that the latent space can then directly influence features at different resolutions and levels of the hierarchy. For BigGAN's conditional generation, z is split into one chunk per resolution block, and each chunk is concatenated with the condition vector c; this provides a moderate performance improvement of about 4% and increases training speed by 18%.
With these ideas in place, let us look at the detailed structure of BigGAN's generator:
As shown in the left figure, the noise vector z is split into multiple chunks, and each chunk is concatenated with the class label c and sent to a layer of the generator. Each residual block of the generator can be expanded into the structure shown on the right: inside the residual block, the chunk of z concatenated with c is sent to the BatchNorm layers, where the embedding is shared and linearly projected to the gains and biases of each layer.
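The hierarchical latent described above can be sketched as follows; `split_latent` is a hypothetical helper and the dimensions are illustrative:

```python
import numpy as np

def split_latent(z, c, num_blocks):
    """Split z into equal chunks, one per generator resolution block,
    and concatenate each chunk with the class embedding c to form the
    conditioning vector for that block's BatchNorm layers."""
    chunks = np.split(z, num_blocks, axis=-1)
    return [np.concatenate([chunk, c], axis=-1) for chunk in chunks]

# e.g. a 120-dim latent split across 6 blocks, with a 128-dim class embedding
conds = split_latent(np.zeros((2, 120)), np.ones((2, 128)), num_blocks=6)
```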
For the prior distribution z, the standard normal N(0, I) or the uniform U[-1, 1] is usually chosen. The paper questions this choice: could other distributions work? After experimentation, and to support the subsequent "truncation", the paper finally chooses z ~ N(0, I).
The so-called "truncation trick" samples from the prior z with a threshold: values that fall outside the range are resampled until they fall within it. The threshold can be chosen based on the generation quality metrics IS and FID.
Experiments show that as the threshold decreases, generation quality keeps improving, but the narrower sampling range pushes generation toward a few modes, so diversity suffers. IS mostly reflects image quality, while FID pays more attention to diversity. The following figure illustrates what truncation means:
As the truncation threshold decreases, sample quality improves but the samples also become more uniform. Choosing the threshold is therefore a trade-off between quality and diversity, made according to the needs of the experiment. Typically, lowering the threshold makes IS rise all the way, while FID first improves and then steadily worsens.
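The truncation trick itself is simple to sketch with rejection resampling (an illustrative version; the released BigGAN code uses an equivalent truncated-normal sampler):

```python
import numpy as np

def truncated_normal(shape, threshold, seed=0):
    """Sample z ~ N(0, I) and resample every value with |z| > threshold
    until all values fall within [-threshold, threshold]."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)
    out_of_range = np.abs(z) > threshold
    while out_of_range.any():
        z[out_of_range] = rng.standard_normal(out_of_range.sum())
        out_of_range = np.abs(z) > threshold
    return z

z = truncated_normal((8, 120), threshold=0.5)  # low threshold: high quality, low diversity
```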
Some larger models are not amenable to truncation and produce saturation artifacts when truncated noise is fed in, as in figure (b) above. To counteract this, the paper makes G amenable to truncation by encouraging it to be smooth, so that the entire space of z maps to good output samples. For this, the paper adopts Orthogonal Regularization [6], which directly enforces the orthogonality condition:
where W is a weight matrix and β a hyperparameter. This regularization is usually too restrictive. To relax the constraint while achieving the smoothness the model needs, the paper found that the best variant removes the diagonal terms from the regularization: it minimizes the pairwise cosine similarity between filters but does not constrain their norms:
where 1 denotes a matrix with all elements set to 1. In Table 1 above, Hier. denotes the hierarchical latent space and Ortho. denotes orthogonal regularization; the results show that orthogonal regularization does improve performance.
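The relaxed regularizer can be written down directly; a NumPy sketch (the default β here is a placeholder for whatever value one settles on):

```python
import numpy as np

def ortho_reg(W, beta=1e-4):
    """Relaxed orthogonal regularization:
    R(W) = beta * || W^T W * (1 - I) ||_F^2,
    penalizing pairwise similarity between filters (the off-diagonal
    entries of the Gram matrix) without constraining their norms."""
    gram = W.T @ W
    off_diag = gram * (1.0 - np.eye(gram.shape[0]))
    return beta * float(np.sum(off_diag ** 2))
```

For an orthogonal W the Gram matrix is diagonal, so the penalty is zero; correlated filters make the off-diagonal entries, and hence the penalty, grow.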
I find the "truncation trick" in BigGAN very similar to the annealing technique in Glow [7]: BigGAN improves generation quality by controlling the sampling range, while Glow guarantees the smoothness of generated images by controlling the annealing coefficient (which likewise controls the sampling range).
Control of model stability
Control of G:
In exploring model stability, the paper monitors a series of weight, gradient, and loss statistics during training, looking for indicators that signal the onset of training collapse. Experiments found that the first three singular values σ0, σ1, and σ2 of each weight matrix are the most useful; they can be computed efficiently with the Arnoldi iteration method [8].
The experiment is shown in figure (a) below. For the top singular value σ0, most layers of G have well-behaved spectral norms, but some layers (typically the first layer of G, which is not a convolution) behave poorly: their spectral norm grows throughout training and explodes at collapse.
To counter training collapse in G, the top singular value σ0 is adjusted to counteract the spectral explosion. First, the paper clamps the top singular value σ0 of each weight, either toward a fixed value or toward a ratio r of the second singular value, that is, toward r·sg(σ1), where sg is the stop-gradient operation. Another method uses a partial singular value decomposition to clamp σ0 directly: given a weight W, its top singular vectors u0 and v0, and σclamp set either to a fixed value or to r·sg(σ1), the weight is constrained to

W = W − max(0, σ0 − σclamp) · u0 v0ᵀ

The whole operation controls the top singular value σ0 of the weights and prevents its sudden explosion.
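Both steps, monitoring σ0 and clamping it, can be sketched as follows. Power iteration stands in for the Arnoldi method, and the clamp uses a full SVD for clarity; both are illustrative simplifications, not the paper's exact procedure:

```python
import numpy as np

def top_singular_value(W, num_iters=20, seed=0):
    """Estimate sigma0 of W by power iteration (a cheap stand-in for
    the Arnoldi iteration used in the paper for monitoring)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    for _ in range(num_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

def clamp_top_singular_value(W, sigma_clamp):
    """If sigma0 exceeds sigma_clamp, shrink the top rank-1 component
    sigma0 * u0 v0^T so the new top singular value equals sigma_clamp."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    if S[0] > sigma_clamp:
        W = W + (sigma_clamp - S[0]) * np.outer(U[:, 0], Vt[0])
    return W
```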
Experiments observed that these weight constraints, with or without spectral normalization, can prevent the gradual growth or explosion of σ0 or σ0/σ1, and in some cases even improve network performance somewhat, but no combination prevents training collapse. The conclusion: with G alone constrained, collapse is unavoidable.
After all these operations, the paper concludes that conditioning G can improve the stability of the model but cannot ensure it, so the paper turns to the control of D.
Control of D:
The starting point is the same as with G: the paper considers the spectra of the D network and looks for additional constraints that would stabilize training. As Figure 3(b) above shows, unlike G, D's spectra are noisy but grow steadily throughout training, and at collapse they jump rather than explode.
The paper hypothesizes that this noise is caused by optimization under adversarial training. If the spectral noise is causally related to instability, a natural countermeasure is a gradient penalty. The paper uses the R1 zero-centered gradient penalty:

R1 = (γ/2) · E_{x∼p_D}[ ‖∇D(x)‖²_F ]
With γ = 10, training becomes stable and the smoothness and boundedness of the spectra in G and D improve, but performance drops severely, with IS falling by 45%. Reducing the penalty partially alleviates this deterioration but makes the spectra progressively worse; even with the penalty strength reduced to 1 (the lowest intensity at which sudden collapse does not occur), IS still drops by 20%.
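Given the per-sample gradients ∇x D(x) on real data (obtained via autograd in a real framework), the R1 penalty itself is a one-liner; a sketch:

```python
import numpy as np

def r1_penalty(grad_real, gamma=10.0):
    """R1 zero-centered gradient penalty:
    (gamma / 2) * E_x[ ||grad_x D(x)||^2 ] over real samples.
    grad_real: (batch, dim) array of gradients of D w.r.t. its input."""
    grad_norm_sq = np.sum(grad_real ** 2, axis=1)
    return 0.5 * gamma * float(grad_norm_sq.mean())
```

The penalty is added to the discriminator loss; it pushes D's gradient toward zero on the real-data manifold, which smooths D at the cost of the performance drop discussed above.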
Repeating the experiment with various regularization ideas, including orthogonal regularization, DropOut, and L2, reveals similar behavior: with a high enough penalty on D, training stability can be achieved, but at a high cost in performance.
In short, strong constraints and penalties on D do achieve stable training, but image generation performance drops, and by quite a lot; the trade-off is painful.
The experiments also found that D's loss approaches zero during training but jumps sharply upward at collapse. One possible explanation is that D is overfitting: it memorizes the training samples instead of learning a meaningful boundary between real and generated images.
To evaluate this guess, the paper evaluates the discriminator on the ImageNet training and validation sets and measures the percentage of samples classified as real or generated. Training accuracy is always above 98%, but validation accuracy sits in the 50-55% range, no better than random guessing (regardless of the regularization strategy). This confirms that D does memorize the training set, which is also consistent with D's role: to distill the training data and provide G with a useful learning signal.
Model stability does not come from G or D alone but from their interaction through the adversarial training process. Symptoms of poor conditioning can be used to track and identify instability, and ensuring reasonable conditioning proves necessary for training, but it is not sufficient to prevent eventual collapse.
Stability can be enforced by strongly constraining D, but doing so incurs a huge cost in performance. With existing techniques, better final performance is achieved by relaxing this conditioning and allowing collapse in the later stages of training, stopping by hand once the model is sufficiently trained to obtain good results.
BigGAN is mainly evaluated on ImageNet: models are evaluated on ImageNet ILSVRC 2012 (the ImageNet dataset everyone uses) at 128×128, 256×256, and 512×512 resolutions. Qualitatively the results are simply convincing; quantitatively, BigGAN crushes the latest SNGAN and SAGAN on both IS and FID.
To further show that the G network is not memorizing the training set, the class label c is interpolated under a fixed z. The experimental results in the figure below show that the entire interpolation is smooth, indicating that G is not remembering the training set but genuinely generating images.
Of course, the model still generates some unreasonable images, but unlike earlier GANs, whose failures were often distorted and translucent, BigGAN's unreasonable images retain a certain degree of texture and recognizability, which indeed marks a good model.
The authors also brute-force trained on their own dataset, roughly 290 million images across about 8,500 categories, and obtained good results similar to those on ImageNet.
A word on the experimental environment: the overall setup follows SAGAN, and training uses Google's TPUs, where one TPU can match the performance of a dozen or more GPUs. The huge number of training parameters is also scary; my computer certainly could not run it.
Another highlight of the paper is its analysis of negative results, sharing the pitfalls the authors hit, which is genuinely generous. A few of them: blindly deepening the network may hinder generation performance; shared class embeddings are troublesome for hyperparameter control, although they may increase training speed; replacing BatchNorm with WeightNorm in G did not achieve good results; besides spectral normalization, adding BatchNorm to D (both class-conditional and unconditional) did not achieve good results; with filter sizes of 5 or 7 instead of 3 in G, D, or both, size 5 may help slightly but at a higher computational cost; changing the dilation of the convolution filters in the 128×128 G and D reduced performance even in small amounts; and using bilinear upsampling in G instead of nearest-neighbor upsampling also reduced performance.
The experiments in this paper, appendices included, are remarkably thorough; clearly a long time went into training and improving the model. DeepMind, as an AI team under Google, shows its strength here. I have deep respect for this paper.
Finally, some of BigGAN's stunning results:
BigGAN achieves a huge leap for GANs on ImageNet and pushes the potential of GANs to a new stage. Can IS and FID be improved further? If so, the samples would be almost indistinguishable from real images. Through large batches, large parameter counts, the "truncation trick", and stability control for large-scale GAN training, BigGAN achieves its feat. At the same time, the enormous amount of computation is also scary; still, as hardware develops, such large-scale AI computation may soon become commonplace, and there is much to look forward to.
[1] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
[2] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
[3] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, 2016.
[4] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Günter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
[6] Andrew Brock, Theodore Lim, J. M. Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. In ICLR, 2017.
[7] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. 2018.
[8] Gene Golub and Henk van der Vorst. Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics, 123:35-65, 2000.
This article is selected and recommended by PaperWeekly, an AI academic community covering research directions such as natural language processing, computer vision, artificial intelligence, machine learning, data mining, and information retrieval.