Abstract

In this paper, we propose an Attentional Concatenation Generative Adversarial Network (ACGAN) aimed at generating 1024 × 1024 high-resolution images. First, we propose a multilevel cascade structure for text-to-image synthesis. During training, we gradually add new layers and, at the same time, use the results and word vectors from the previous layer as inputs to the next layer to generate high-resolution images with photo-realistic details. Second, the deep attentional multimodal similarity model is introduced into the network, and we match word vectors with images in a common semantic space to compute a fine-grained matching loss for training the generator. In this way, we can attend to the fine-grained word-level information in the semantics. Finally, a measure of diversity is added to the discriminator, which enables the generator to obtain more diverse gradient directions and improves the diversity of the generated samples. The experimental results show that the inception scores of the proposed model on the CUB and Oxford-102 datasets reach 4.48 and 4.16, improvements of 2.75% and 6.42% over the Attentional Generative Adversarial Network (AttnGAN). The ACGAN model is more effective for text-to-image generation, and the resulting images are closer to real images.

1. Introduction

In recent years, with the rise of artificial intelligence and deep learning, natural language processing and computer vision have become hot research fields. Text-to-image synthesis, a basic problem spanning both fields, has also attracted the attention of many researchers. Text-to-image synthesis is the generation of a realistic image that matches a given text description, which requires processing the fuzzy and incomplete information in natural language descriptions. It drives the development of multimodal learning and cross-modal generation and shows great potential in applications such as cross-modal information retrieval, photo editing, and computer-aided design.

Since Goodfellow et al. [1] proposed Generative Adversarial Networks (GANs) in 2014, the model has received extensive attention from academia and industry. With the continuous development of GANs, they have been widely used to generate realistic, high-quality images from text descriptions. The commonly used method [25] encodes the entire text description into a global sentence vector, which is fed to the generator as the conditioning variable of the GAN to generate an image. However, due to the large structural differences between text and images, using only word-level attention does not ensure the consistency of global semantics, and it is difficult to generate complex scenes; moreover, fine-grained word information is still not explicitly used when generating images. Therefore, the generated images do not contain enough detail and still differ significantly from real images.

To address this issue, this paper proposes the Attentional Concatenation Generative Adversarial Networks (ACGAN). The network adopts a multilevel cascade structure: the generator and discriminator at each layer are composed of convolution units, new network layers are added layer by layer during training, and the added generators and discriminators handle the details of higher-resolution images. At the same time, the deep attentional multimodal similarity model is introduced into the network to focus on fine-grained word-level information in the semantics. The word vectors are used as inputs to the generator; through the constraints they impose, the details of the generated image are emphasized while its overall shape is preserved, the cross-modal consistency between image and semantics is maintained, and the generation process stays smooth. Finally, a measure of diversity is added to one layer of the discriminator to influence its decision, so that the generator obtains more diverse gradient directions, which increases the diversity and improves the quality of the generated samples.

The contribution of our method is threefold:
(i) A multilevel cascade structure is proposed, which improves the resolution of the generated image and can generate high-resolution images of up to 1024 × 1024.
(ii) An attention mechanism is introduced into the network, which enriches the details of the generated image by attending to fine-grained word-level information in the semantics.
(iii) A measure of diversity is added to the discriminator, which increases the diversity and improves the quality of the generated samples.

Generative image modeling is a fundamental problem in computer vision. There has been remarkable progress in this direction with the emergence of deep learning techniques. Variational Autoencoders (VAEs) [6, 7] aim to maximize the lower bound of the data likelihood. Autoregressive models (e.g., PixelRNN) [8], which use neural networks to model the conditional distribution of the pixel space, have also generated appealing synthetic images. Recently, Generative Adversarial Networks (GANs) have shown promising performance in generating sharper images and video [9–11]. For example, Eghbal-zadeh et al. [12] proposed a Mixture Density Generative Adversarial Network to improve the clarity and quality of generated images. Gecer et al. [13] combined a generative adversarial network with a deep convolutional neural network to reconstruct a 3D facial structure from a single face image. However, training instability makes it hard for GAN models to generate high-resolution images, and much work has been proposed to stabilize training and improve image quality [14–19].

Generating high-resolution images from text descriptions, though very challenging, is important for many practical applications such as art generation and computer-aided design. Lyu et al. [9] learned a joint embedding to establish the relationship between natural language and real images and then trained a GAN to generate 64 × 64 images conditioned on text descriptions. Cao et al. [10] proposed Stacked Generative Adversarial Networks, which decompose the complex problem of generating high-quality images into better-controlled subproblems and generate 256 × 256 high-resolution images.

Recently, attention models have been widely used in computer vision and natural language processing, for example, in object detection [20, 21], video captioning [22], and visual question answering [23, 24]. Xu et al. [25] introduced the attention mechanism into GANs and proposed Attentional Generative Adversarial Networks (AttnGAN), which instructs the generator to focus on different word-level fine-grained information when generating different image subregions. Qiao et al. [26] proposed a global-to-local collaborative attention module that uses word attention and global sentence attention to enhance the consistency between generated images and semantics.

2.1. The Proposed Model

The Attentional Concatenation Generative Adversarial Networks model proposed in this paper consists of two parts: the attentional concatenation generative adversarial networks and the deep attentional multimodal similarity model. As shown in Figure 1, the model is divided into multiple levels, each containing a generator G and a discriminator D. Using the multilevel cascade structure, generators and discriminators are added layer by layer, and new residual network layers are added continuously during training, corresponding to the generation of images from low to high resolution. The Deep Attentional Multimodal Similarity Model provides a common semantic space: the subregions of the image and the word vectors of the sentence are mapped into this space, where the word-level image-text similarity is measured. Instead of adopting a one-step approach, the training process first generates low-resolution images, then continuously increases the resolution, and finally produces high-resolution, high-quality images.
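To make the cascade schedule concrete, the following schematic, framework-agnostic Python sketch illustrates the idea of adding one generator/discriminator level per resolution during training. The resolution list and the build/train functions are illustrative placeholders, not the authors' actual configuration.

```python
# Illustrative sketch of the multilevel cascade training schedule (placeholders only).

RESOLUTIONS = [64, 128, 256, 512, 1024]   # one generator/discriminator pair per level


def build_stage(resolution):
    """Placeholder: build the generator/discriminator pair for one resolution."""
    return {"G": f"G_{resolution}", "D": f"D_{resolution}"}


def train_stages(stages):
    """Placeholder: jointly train all levels added so far; the previous level's
    hidden features and the word vectors feed the newest level."""
    print("training with:", ", ".join(s["G"] for s in stages))


stages = []
for res in RESOLUTIONS:
    stages.append(build_stage(res))       # a new level is added during training
    train_stages(stages)                  # earlier levels remain trainable
```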

2.2. Concatenation Generative Adversarial Networks

The generative network has $k$ generators $(G_0, G_1, \ldots, G_{k-1})$, which take the hidden states $(h_0, h_1, \ldots, h_{k-1})$ as input and generate images $(\hat{x}_0, \hat{x}_1, \ldots, \hat{x}_{k-1})$ of different resolutions.

Specifically,
$$h_0 = F_0\big(z, F^{ca}(\bar{e})\big), \qquad h_i = F_i\big(h_{i-1}, F_i^{attn}(e, h_{i-1})\big), \quad i = 1, \ldots, k-1, \qquad \hat{x}_i = G_i(h_i). \tag{1}$$

Here, $z$ is a noise vector usually sampled from a standard normal distribution, $\bar{e}$ is a global sentence vector, and $e$ is the word vector matrix. $F^{ca}$ represents the Conditioning Augmentation [10] that converts the sentence vector to the conditioning vector. $F_i^{attn}$ is the proposed attention model at the $i$-th stage. The attention model has two inputs: the word features $e$ and the image features $h_{i-1}$ from the previous hidden layer.
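The recursion in equation (1) can be illustrated with a small TensorFlow 2 sketch. The layer sizes, the single-layer stand-ins for $F^{ca}$, $F_0$, $F_1$, $G_0$, and $G_1$, and the zero-valued placeholder for the attention module are assumptions for illustration, not the actual ACGAN architecture.

```python
import tensorflow as tf

z_dim, cond_dim, word_dim, n_words, feat_dim = 100, 128, 256, 18, 32

F_ca = tf.keras.layers.Dense(cond_dim)                  # conditioning augmentation (simplified)
F_0  = tf.keras.layers.Dense(4 * 4 * feat_dim)          # produces the initial hidden state h_0
F_1  = tf.keras.layers.Conv2DTranspose(feat_dim, 4, strides=2, padding="same")
G_0  = tf.keras.layers.Conv2D(3, 3, padding="same")     # image head: x_hat_0 = G_0(h_0)
G_1  = tf.keras.layers.Conv2D(3, 3, padding="same")     # image head: x_hat_1 = G_1(h_1)


def F_attn(e, h):
    """Zero-valued placeholder for the word-attention model (a fuller sketch appears later)."""
    return tf.zeros_like(h)


z     = tf.random.normal([1, z_dim])                    # noise vector z ~ N(0, I)
e_bar = tf.random.normal([1, word_dim])                 # global sentence vector
e     = tf.random.normal([1, n_words, word_dim])        # word feature matrix

h0 = tf.reshape(F_0(tf.concat([z, F_ca(e_bar)], -1)), [1, 4, 4, feat_dim])
h1 = F_1(tf.concat([h0, F_attn(e, h0)], -1))            # h_1 = F_1(h_0, F_attn(e, h_0))
x0, x1 = G_0(h0), G_1(h1)                               # images at increasing resolutions
```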

Training starts with both the generator G and the discriminator D at a low spatial resolution of 64 × 64 pixels. As training advances, we incrementally add layers to G and D, and all existing layers remain trainable throughout the process. When doubling the resolution of the generator G and the discriminator D, we fade in the new layers smoothly. During the transition, the layers that operate on the higher resolution are treated like a residual block whose weight increases linearly from 0 to 1.
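The fade-in can be sketched as follows. This is a minimal TensorFlow 2 illustration, assuming NHWC image tensors and a simple per-step linear schedule; the exact blending point used in ACGAN is not specified here.

```python
import tensorflow as tf


def fade_in(old_rgb, new_rgb, alpha):
    """Blend the upsampled old output with the new block's output using weight alpha."""
    old_up = tf.image.resize(old_rgb, [new_rgb.shape[1], new_rgb.shape[2]], method="nearest")
    return (1.0 - alpha) * old_up + alpha * new_rgb


def alpha_schedule(step, transition_steps):
    """Weight that grows linearly from 0 to 1 over the transition."""
    return tf.minimum(tf.cast(step, tf.float32) / float(transition_steps), 1.0)


old = tf.random.normal([1, 64, 64, 3])       # previous-resolution image
new = tf.random.normal([1, 128, 128, 3])     # output of the newly added block
blended = fade_in(old, new, alpha_schedule(step=500, transition_steps=1000))  # alpha = 0.5
```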

Then we add a new residual layer and transform the word features into the semantic space of the image features. Based on the hidden image features h, a word-context vector is calculated for each subregion of the image.
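To make the word-context computation concrete, the following minimal TensorFlow 2 sketch projects word features into the semantic space of the image features and computes an attention-weighted word-context vector for each image subregion. The dimensions, the single Dense projection, and the plain dot-product scores are illustrative assumptions rather than the exact ACGAN attention module.

```python
import tensorflow as tf


def word_context(e, h, proj):
    """e: [B, T, D_word] word features; h: [B, H, W, C] hidden image features;
    proj: a Dense(C) layer that maps words into the image feature space."""
    B, H, W, C = h.shape
    regions = tf.reshape(h, [B, H * W, C])                 # one query per image subregion
    words   = proj(e)                                      # words projected into image space
    scores  = tf.matmul(regions, words, transpose_b=True)  # region-word similarities
    attn    = tf.nn.softmax(scores, axis=-1)               # attention over words
    context = tf.matmul(attn, words)                       # word-context vector per region
    return tf.reshape(context, [B, H, W, C])


proj = tf.keras.layers.Dense(32)
e = tf.random.normal([1, 18, 256])                         # 18 words, 256-d word features
h = tf.random.normal([1, 64, 64, 32])                      # hidden image features
c = word_context(e, h, proj)                               # [1, 64, 64, 32]
```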

Finally, the image features and the corresponding word-context features are combined to generate the image of the next stage. In order to generate realistic images conditioned on multiple levels (sentence level and word level) of information, the final objective function of the attentional generative network is defined as
$$\mathcal{L} = \mathcal{L}_G + \lambda \mathcal{L}_{DAMSM}, \quad \text{where } \mathcal{L}_G = \sum_{i=0}^{k-1} \mathcal{L}_{G_i}. \tag{2}$$

Here, $\lambda$ is a hyperparameter to balance the two terms of equation (2). The first term is the GAN loss that jointly approximates the conditional and unconditional distributions. At the $i$-th stage of the ACGAN, the adversarial loss for $G_i$ is defined as
$$\mathcal{L}_{G_i} = \underbrace{-\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log D_i(\hat{x}_i)\big]}_{\text{unconditional loss}}\;\underbrace{-\tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log D_i(\hat{x}_i, \bar{e})\big]}_{\text{conditional loss}},$$
where the unconditional loss determines whether the image is real or fake, while the conditional loss determines whether the image and the sentence match.
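A hedged sketch of this objective is given below, assuming the discriminator at each stage exposes two logit heads (unconditional and conditional). The logit names and the value 5.0 used for the weight lam are placeholders; the paper's exact value of $\lambda$ is not restated here.

```python
import tensorflow as tf


def generator_stage_loss(fake_uncond_logits, fake_cond_logits):
    """L_{G_i}: -1/2 E[log D_i(x_hat)] - 1/2 E[log D_i(x_hat, e_bar)], via sigmoid BCE."""
    bce = tf.nn.sigmoid_cross_entropy_with_logits
    uncond = bce(labels=tf.ones_like(fake_uncond_logits), logits=fake_uncond_logits)
    cond   = bce(labels=tf.ones_like(fake_cond_logits),   logits=fake_cond_logits)
    return 0.5 * tf.reduce_mean(uncond) + 0.5 * tf.reduce_mean(cond)


def total_generator_loss(per_stage_losses, damsm_loss, lam=5.0):
    """Equation (2): L = sum_i L_{G_i} + lambda * L_DAMSM (lam is an assumed value)."""
    return tf.add_n(per_stage_losses) + lam * damsm_loss
```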

As shown in Figure 2, for unconditional image generation, the discriminator is trained to distinguish real images from generated images. For conditional image generation, the image and the sentence vector are input to the discriminator to determine whether the image matches the condition, which guides the generator to approximate the conditional image distribution. The discriminator $D_i$ is trained to classify the input as real or fake by minimizing the cross-entropy loss defined by
$$\mathcal{L}_{D_i} = -\tfrac{1}{2}\,\mathbb{E}_{x_i \sim p_{data_i}}\big[\log D_i(x_i)\big] - \tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log\big(1 - D_i(\hat{x}_i)\big)\big] - \tfrac{1}{2}\,\mathbb{E}_{x_i \sim p_{data_i}}\big[\log D_i(x_i, \bar{e})\big] - \tfrac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}\big[\log\big(1 - D_i(\hat{x}_i, \bar{e})\big)\big],$$
where $x_i$ is from the true image distribution $p_{data_i}$ at the $i$-th scale and $\hat{x}_i$ is from the model distribution $p_{G_i}$ at the same scale. The discriminators of the ACGAN are structurally disjoint, so they can be trained in parallel and each of them focuses on a single image scale.
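The corresponding per-scale discriminator loss can be sketched in the same style; the logit names are placeholders for the two heads of $D_i$, and each of the four expectations enters with a factor of 1/2 as in the equation above.

```python
import tensorflow as tf


def discriminator_stage_loss(real_uncond, fake_uncond, real_cond, fake_cond):
    """L_{D_i}: real images labeled 1, generated images labeled 0, for both heads."""
    bce = tf.nn.sigmoid_cross_entropy_with_logits
    real_u = tf.reduce_mean(bce(labels=tf.ones_like(real_uncond),  logits=real_uncond))
    fake_u = tf.reduce_mean(bce(labels=tf.zeros_like(fake_uncond), logits=fake_uncond))
    real_c = tf.reduce_mean(bce(labels=tf.ones_like(real_cond),    logits=real_cond))
    fake_c = tf.reduce_mean(bce(labels=tf.zeros_like(fake_cond),   logits=fake_cond))
    return 0.5 * (real_u + fake_u + real_c + fake_c)
```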

2.3. Deep Attentional Multimodal Similarity Model

The Deep Attentional Multimodal Similarity Model [25] learns two neural networks that map subregions of the image and words of the sentence to a common semantic space, thus measuring the image-text similarity at the word level to compute a fine-grained loss for image generation.

This paper first uses a standard convolutional neural network to transform an image into a set of feature maps, where each feature map location corresponds to a subregion of the image. The dimension of these region features equals the dimension of the word vectors, so the two can be treated as equivalent entities. Next, for each token in the text, attention is applied to the region features and their weighted average is calculated. Finally, the DAMSM is trained to minimize the difference between the attention-weighted region features and the word vectors described above, which yields the word-level matching loss
$$\mathcal{L}_1^w = -\sum_{i=1}^{M} \log P(D_i \mid Q_i),$$
where "w" stands for "word", $Q_i$ denotes an image, $D_i$ denotes its corresponding text description, $M$ is the number of image-text pairs in a batch, and $P(D_i \mid Q_i)$ is the posterior probability that image $Q_i$ is matched with its description $D_i$.

Symmetrically, we also minimize
$$\mathcal{L}_2^w = -\sum_{i=1}^{M} \log P(Q_i \mid D_i),$$
where $P(Q_i \mid D_i)$ is the posterior probability that sentence $D_i$ is matched with its corresponding image $Q_i$.

Finally, following [25], the DAMSM loss is defined as
$$\mathcal{L}_{DAMSM} = \mathcal{L}_1^w + \mathcal{L}_2^w + \mathcal{L}_1^s + \mathcal{L}_2^s,$$
where the sentence-level terms $\mathcal{L}_1^s$ and $\mathcal{L}_2^s$ are defined analogously by replacing the word vectors with the global sentence vector.
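A simplified sketch of the word-level part of this loss follows. It assumes a precomputed batch matrix of image-text matching scores (the attention-weighted region-word similarity itself is not shown) and a smoothing factor gamma3 = 10.0 borrowed from the DAMSM formulation in [25]; both are assumptions for illustration.

```python
import tensorflow as tf


def damsm_word_loss(scores, gamma3=10.0):
    """scores: [M, M] batch matrix, scores[i, j] = matching score between image i and text j."""
    logits = gamma3 * scores
    targets = tf.range(tf.shape(scores)[0])          # the i-th image matches the i-th text
    # L1^w = -sum_i log P(D_i | Q_i): softmax over descriptions for each image (rows)
    l1 = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets, logits=logits))
    # L2^w = -sum_i log P(Q_i | D_i): softmax over images for each description (columns)
    l2 = tf.reduce_sum(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets, logits=tf.transpose(logits)))
    return l1 + l2
```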

Using the attention mechanism, the DAMSM is able to compute the fine-grained text-image matching loss $\mathcal{L}_{DAMSM}$. This loss is applied only to the output of the last generator, because the ultimate goal of this paper is to generate high-resolution images through the last generator. If $\mathcal{L}_{DAMSM}$ were applied to the images generated by all generators $G_0, G_1, \ldots, G_{k-1}$, the computational cost would increase greatly while the performance would not improve.

2.4. Standard Deviation for Measuring Diversity

GANs tend to capture only part of the variation found in the training data. Rather than requiring more training data, this paper greatly simplifies the "minibatch discrimination" approach while also improving the variation: feature statistics are computed not only from a single image but over an entire minibatch, which encourages the batch of generated images and the batch of training images to exhibit similar statistics. This is implemented by adding a minibatch layer toward the end of the discriminator; the layer learns a large tensor that maps the input to a set of statistics. A set of statistics is produced for each example in the minibatch and concatenated to the layer's output, so that the discriminator can use the statistics internally.
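A minimal sketch of such a minibatch statistics layer, in the spirit of the progressive GAN minibatch standard-deviation layer, is given below. Collapsing the statistics into a single averaged value appended as one extra feature map is an assumption about the exact variant used.

```python
import tensorflow as tf


def minibatch_stddev(x, eps=1e-8):
    """x: [B, H, W, C] discriminator feature maps."""
    mean = tf.reduce_mean(x, axis=0, keepdims=True)
    std = tf.sqrt(tf.reduce_mean(tf.square(x - mean), axis=0) + eps)  # per-feature std over the batch
    avg_std = tf.reduce_mean(std)                                     # average into one statistic
    B, H, W, _ = x.shape
    stat_map = tf.ones([B, H, W, 1]) * avg_std                        # replicate over batch and space
    return tf.concat([x, stat_map], axis=-1)                          # append as an extra feature map
```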

3. Experiments and Evaluation

3.1. Experimental Environment and Data

The algorithm is implemented with the deep learning framework TensorFlow [27]. The experimental environment is the Ubuntu 14.04 operating system, and four NVIDIA 1080 Ti graphics processing units (GPUs) are used to accelerate training. All models were trained on the CUB [28] and Oxford [29] datasets. As shown in Table 1, the CUB dataset contains 200 bird species with a total of 11,788 images; in this paper, 8,855 images are used for training and 2,933 images for testing. Since the bird occupies less than half of the image area in 80% of the images in this dataset [28], we preprocess all images before training so that the ratio of the bird region to the image size is greater than 0.75. The Oxford dataset contains 102 flower categories with a total of 8,189 images; 7,034 images are used for training and 1,155 images for testing.
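For illustration, a hypothetical cropping rule of the kind described above might look as follows. The square-crop heuristic, the box format, and the 0.75 target interpretation are assumptions, not the authors' actual preprocessing code.

```python
def crop_to_object(img_w, img_h, box, target_ratio=0.75):
    """box = (x, y, w, h) of the bird; returns a crop (x0, y0, x1, y1).
    The square crop side is chosen so the bird's longer dimension covers
    target_ratio of it, then the crop is clipped to the image bounds."""
    x, y, w, h = box
    side = max(w, h) / target_ratio
    cx, cy = x + w / 2.0, y + h / 2.0
    x0 = max(0, int(cx - side / 2))
    y0 = max(0, int(cy - side / 2))
    x1 = min(img_w, int(cx + side / 2))
    y1 = min(img_h, int(cy + side / 2))
    return x0, y0, x1, y1


print(crop_to_object(500, 375, (120, 90, 200, 160)))  # example bounding box
```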

3.2. Evaluation Metrics

GAN models are usually evaluated qualitatively; that is, the visual fidelity of the generated images is judged by manual inspection. This method is time-consuming, subjective, and can be misleading. Therefore, this paper mainly uses two evaluation criteria to assess the quality and diversity of the generated images.

3.2.1. Inception Score

We choose the numerical assessment approach "inception score" [16] for quantitative evaluation:
$$I = \exp\big(\mathbb{E}_x\, D_{KL}\big(p(y \mid x) \,\|\, p(y)\big)\big),$$
where $x$ denotes one generated sample, $y$ is the label predicted by the Inception model for that sample, $p(y)$ is the marginal distribution, and $p(y \mid x)$ is the conditional distribution. The KL divergence between the conditional and marginal distributions should be large, so that varied, high-quality images are generated. In the experiments, an Inception model was applied to the CUB dataset, and samples generated by each model were evaluated.
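The score can be computed from the Inception model's class probabilities as in the following NumPy sketch; the random probabilities below are only a stand-in for real model outputs.

```python
import numpy as np


def inception_score(p_yx, eps=1e-12):
    """p_yx: [N, num_classes] rows of p(y|x), each summing to 1."""
    p_y = p_yx.mean(axis=0, keepdims=True)                       # marginal p(y)
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))                              # I = exp(E_x KL(p(y|x) || p(y)))


probs = np.random.dirichlet(np.ones(200), size=1000)             # e.g., 200 CUB classes
print(inception_score(probs))
```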

3.2.2. Human Rank

For qualitative assessment, 50 text descriptions were randomly selected from the CUB and Oxford test sets, and each model generated 5 images for each sentence. The five images and the corresponding text description were shown to different people, who ranked the image quality; the average ranking was then calculated to evaluate the quality and diversity of the generated images.

4. Experimental Result

The inception score and human rank results of various models on the CUB and Oxford datasets are compared in Table 2. As can be seen from the table, compared to the AttnGAN model, the inception score of the ACGAN model on the CUB dataset increases by 2.75% (from 4.36 to 4.48). The analysis of the experimental results shows that ACGAN achieves a higher inception score than the other GAN models and, from an intuitive visual standpoint, a lower (better) human rank. This indicates that the quality and diversity of the images generated by the proposed model are enhanced and that they are closer to real images.

Subjective visual comparisons between the three models StackGAN++, AttnGAN, and ACGAN on the CUB dataset are presented in Figure 3. It can be seen that, for some examples, the images generated by StackGAN++ and AttnGAN lose details, their colors are inconsistent with the text descriptions (1st and 2nd rows), and their shapes look strange (2nd and 3rd columns). ACGAN achieves better results than AttnGAN, with more details and colors and shapes consistent with the text; for example, the wings are vivid in the 3rd and 4th rows. Comparing ACGAN with AttnGAN shows that ACGAN helps produce fine-grained images with more details and better semantic consistency; for example, the color of the bird in the 2nd column is corrected to black. Comparing ACGAN (256 × 256) with ACGAN (1024 × 1024) shows that the images generated by ACGAN (1024 × 1024) have higher definition, more vivid colors, and more lifelike details. Generally, the content of the CUB dataset is relatively simple; therefore, it is easier to generate visually realistic and semantically consistent results on CUB. These results confirm that ACGAN outperforms AttnGAN and that the generated images are closer to real images.

Detailed comparisons (beak, wings) of the results of StackGAN++, AttnGAN, and ACGAN on the CUB dataset are presented in Figure 4. It can be seen that, in the images generated by ACGAN, the beak, wings, and feet of the bird are clearer and the edges and details are more realistic; for example, the beak of the bird in the 4th column is more vivid and conforms to the text description. Compared with StackGAN++ and AttnGAN, ACGAN achieves better results.

Subjective visual comparisons between the three models GAN-INT-CLS, StackGAN++, and ACGAN on the Oxford dataset are presented in Figure 5, and detailed comparisons of the petals are presented in Figure 6. It can be seen that, for some examples, the images generated by GAN-INT-CLS and StackGAN++ lose details and their shapes look strange (1st and 2nd rows). Compared with StackGAN++, ACGAN achieves better results with more details and consistent colors and shapes; for example, the overall shape of the flowers is clearer and the details of the petals are more distinct in the 4th row. These results confirm that ACGAN outperforms StackGAN++ and that the generated images are closer to real images.

5. Conclusions

This paper adds an attention mechanism and a multilevel cascade structure to the generative adversarial network: the attention mechanism attends to fine-grained word-level information in the semantics and enriches the details of the generated images, while the cascade structure generates higher-resolution images. Experiments show that, on the same datasets, the images generated by the Attentional Concatenation Generative Adversarial Networks have clearer edge details and local textures, making them closer to real images. Although this method achieves good results in generating images, it is still difficult to model complex real-world scenes, and how to handle this problem needs further study. At the same time, the generated images are similar to the training data and lack diversity. Therefore, we intend to combine zero-shot learning with generative adversarial networks to synthesize images of new categories, which will be the focus of our next work.

Data Availability

The basic data used in this article were downloaded from the Internet. Two public datasets are used: (1) the CUB dataset, which can be downloaded from http://www.vision.caltech.edu/visipedia/CUB-200-2011.html, and (2) the Oxford dataset, which can be downloaded from http://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

Linyan Li and Yu Sun contributed equally to this work.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (61876121, 62002254, 61801323, and 62062003), Natural Science Foundation of the Jiangsu Higher Education Institutions of China (19KJB520054), Research Fund of Suzhou Institute of Trade and Commerce (KY-ZRA1805), Primary Research and Development Plan of Jiangsu Province (BE2017663), Foundation of Key Laboratory in Science and Technology Development Project of Suzhou (SZS201609), and Graduate Research and Innovation Plan of Jiangsu Province (KYCX18_2549).