Text to Image Synthesis via Mask Anchor Points and Aesthetic Assessment

Date of Award


Degree Name

Master of Computer Science (M.C.S.)


Department of Computer Science


Advisor: Tam Nguyen


Text-to-image synthesis is the process of generating an image from input text. It has a variety of applications in art generation, computer-aided design, and photo editing. In this thesis, we propose a new framework that leverages mask anchor points and splits image synthesis into two major steps. In the first step, a mask image is generated from the input text and the mask dataset. In the second step, the mask image is fed into a state-of-the-art mask-to-image generator. The mask image captures the semantic information and the spatial relationships between objects via the anchor points. We develop a user-friendly interface that helps parse the input text into meaningful semantic objects. However, synthesizing an appealing image from text also requires considering image aesthetics criteria. Therefore, we further improve the proposed framework by incorporating aesthetic assessment based on photography composition rules. To this end, we randomly generate a set of mask maps from the input text via the anchor point-based mask map generator, and then compute and rank the aesthetic scores of all generated mask maps according to two composition rules, namely the rule of thirds and the rule of formal balance. In the next stage, we feed a subset of the mask maps, namely those with the highest, lowest, and average aesthetic scores, into the image generator, a state-of-the-art mask-to-image model. The resulting photorealistic images are then re-ranked to obtain the synthesized image with the highest aesthetic score. The framework thus addresses problems found in images generated by state-of-the-art methods, such as unnaturalness, ambiguity, and distortion. It maintains the clarity of the entities' shapes, the details of the entity edges, and a proper layout, regardless of how complex the input text is or how many entities and spatial relations it contains.
Our contribution is converting the input text into an appropriately constructed mask map, or a set of mask maps, via the Mask Map Generator (MG). Furthermore, the aesthetic assessment performed by the Aesthetic Ranking (AR) component is part of our contribution. Experiments on the challenging COCO-Stuff dataset illustrate the superiority of our proposed approach over previous state-of-the-art methods.
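
The aesthetic ranking step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the layout representation (a list of `(x, y, area)` anchor points in normalized coordinates), the function names, the scoring formulas, and the equal weighting of the two rules are all assumptions made for illustration. The sketch scores each candidate layout by how close its anchor points lie to the rule-of-thirds power points and by how well its visual mass is balanced left to right, then picks the highest, lowest, and closest-to-average candidates.

```python
import math
import random

# ASSUMPTION: a layout is a list of (x, y, area) tuples, where (x, y) is an
# entity's anchor point in normalized [0, 1] image coordinates.
POWER_POINTS = [(1/3, 1/3), (2/3, 1/3), (1/3, 2/3), (2/3, 2/3)]
# Largest distance from a point in the unit square to its nearest power
# point; it is reached at the image corners.
MAX_DIST = math.hypot(1/3, 1/3)


def rule_of_thirds_score(layout):
    """Area-weighted closeness of anchor points to the nearest power point."""
    total_area = sum(area for _, _, area in layout)
    score = 0.0
    for x, y, area in layout:
        nearest = min(math.hypot(x - px, y - py) for px, py in POWER_POINTS)
        score += area * (1.0 - nearest / MAX_DIST)
    return score / total_area


def formal_balance_score(layout):
    """1.0 when the horizontal center of mass sits on the vertical midline."""
    total_area = sum(area for _, _, area in layout)
    center_x = sum(x * area for x, _, area in layout) / total_area
    return 1.0 - 2.0 * abs(center_x - 0.5)


def aesthetic_score(layout):
    # ASSUMPTION: the two rules are weighted equally.
    return 0.5 * (rule_of_thirds_score(layout) + formal_balance_score(layout))


def random_layout(n_entities, rng):
    """Hypothetical stand-in for the anchor point-based mask map generator."""
    return [(rng.random(), rng.random(), rng.uniform(0.05, 0.3))
            for _ in range(n_entities)]


def select_candidates(layouts):
    """Return the highest, lowest, and closest-to-mean scored layouts."""
    scored = sorted(((aesthetic_score(l), l) for l in layouts),
                    key=lambda pair: pair[0], reverse=True)
    mean = sum(score for score, _ in scored) / len(scored)
    average = min(scored, key=lambda pair: abs(pair[0] - mean))
    return {"highest": scored[0], "lowest": scored[-1], "average": average}


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = [random_layout(3, rng) for _ in range(20)]
    for name, (score, _) in select_candidates(candidates).items():
        print(f"{name}: {score:.3f}")
```

In a full pipeline, the three selected layouts would be rendered as mask maps and passed to the mask-to-image generator, and the generated images would be scored again to choose the final output.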


Computer Science, Text-to-image, mask dataset, image synthesis, anchor points, image aesthetics

Rights Statement

Copyright 2020, author