Text to Image Synthesis via Mask Anchor Points and Aesthetic Assessment




Samah Saeed A. Baraheem


Presentation: 11:40 a.m.-12:00 p.m., Kennedy Union 211




Text-to-image synthesis is the process of generating an image from input text. It has a variety of applications in art generation, computer-aided design, and photo editing. In this thesis, we propose a new framework that uses mask anchor points and divides image synthesis into two major steps. In the first step, a mask image is generated from the input text and the mask dataset. In the second step, the mask image is fed into a state-of-the-art mask-to-image generator. The mask image captures the semantic information and the spatial relationships between objects via the anchor points. We also develop a user-friendly interface that helps parse the input text into meaningful semantic objects. However, to synthesize an appealing image from text, image aesthetics criteria should also be considered. We therefore improve the proposed framework by incorporating aesthetic assessment based on photography composition rules. To this end, we randomize a set of mask maps from the input text via the anchor point-based mask map generator, and then compute and rank the aesthetic scores of all generated mask maps according to two composition rules: the rule of thirds and the rule of formal balance. In the next stage, we feed a subset of the mask maps, namely those with the highest, lowest, and average aesthetic scores, into the state-of-the-art mask-to-image generator. The resulting photorealistic images are then re-ranked to obtain the synthesized image with the highest aesthetic score. The framework is designed to overcome problems in images generated by state-of-the-art methods, such as unnaturalness, ambiguity, and distortion. It maintains the clarity of entity shapes, the details of entity edges, and a proper layout regardless of how complex the input text is or how many entities and spatial relations it contains. Experiments on the challenging COCO-Stuff dataset illustrate the superiority of our proposed approach over the previous state of the art.
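The scoring and selection stage described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the layout representation (a non-empty list of `(x, y, area)` anchors in normalized coordinates), the two scoring formulas, their equal weighting, and all function names are assumptions made for illustration only. The sketch randomizes candidate layouts, scores each one with a rule-of-thirds term and a left-right balance term, and keeps the highest, lowest, and closest-to-average candidates.

```python
import math
import random

# The four rule-of-thirds intersections in normalized [0, 1] coordinates.
POWER_POINTS = [(1/3, 1/3), (2/3, 1/3), (1/3, 2/3), (2/3, 2/3)]
# Largest possible distance to the nearest intersection (reached at an image corner).
MAX_DIST = math.hypot(1/3, 1/3)


def thirds_score(layout):
    """Mean closeness of each anchor to its nearest thirds intersection, in [0, 1]."""
    total = 0.0
    for x, y, _area in layout:
        nearest = min(math.hypot(x - px, y - py) for px, py in POWER_POINTS)
        total += 1.0 - nearest / MAX_DIST
    return total / len(layout)


def balance_score(layout):
    """1.0 when the area-weighted centroid lies on the vertical center line (assumed proxy for formal balance)."""
    total_area = sum(area for _x, _y, area in layout)
    cx = sum(x * area for x, _y, area in layout) / total_area
    return 1.0 - 2.0 * abs(cx - 0.5)


def aesthetic_score(layout):
    """Equal-weight combination of the two rules (the weighting is an assumption)."""
    return 0.5 * (thirds_score(layout) + balance_score(layout))


def random_layouts(n_objects, n_layouts, seed=0):
    """Randomize candidate layouts: one (x, y, area) anchor per object."""
    rng = random.Random(seed)
    return [
        [(rng.random(), rng.random(), rng.uniform(0.05, 0.3)) for _ in range(n_objects)]
        for _ in range(n_layouts)
    ]


def select_subset(layouts):
    """Rank by score and keep the highest, lowest, and closest-to-mean layouts."""
    ranked = sorted(layouts, key=aesthetic_score, reverse=True)
    scores = [aesthetic_score(layout) for layout in ranked]
    mean = sum(scores) / len(scores)
    avg_idx = min(range(len(ranked)), key=lambda i: abs(scores[i] - mean))
    return {"highest": ranked[0], "lowest": ranked[-1], "average": ranked[avg_idx]}


candidates = random_layouts(n_objects=3, n_layouts=20)
picked = select_subset(candidates)
for name, layout in picked.items():
    print(name, round(aesthetic_score(layout), 3))
```

In the framework described here, the three selected mask maps would then be passed to the mask-to-image generator and the resulting images re-ranked; that stage depends on a trained generator and is not sketched.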

Publication Date


Project Designation

Graduate Research

Primary Advisor

Van Tam Nguyen

Primary Advisor's Department

Computer Science


Stander Symposium project, College of Arts and Sciences

United Nations Sustainable Development Goals

Industry, Innovation, and Infrastructure
