Emergence of Text Semantics in CLIP Image Encoders
Published in UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models (NeurIPS workshop 2024), 2024
Certain self-supervised approaches to training image encoders, such as CLIP, align images with their text captions. However, these approaches have no a priori incentive to associate text appearing inside an image with the semantics of that text. Humans process text visually; our work studies the semantics of text rendered in images. We show that the semantic information captured by image representations is sufficient to decisively classify the sentiment of sentences, is robust to visual attributes such as font, and does not reduce to simple character-frequency associations.
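The probing setup described above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: `fake_clip_embed` is a hypothetical stand-in for a frozen CLIP image encoder applied to rendered sentences (in practice one would render each sentence to an image and encode it with a real CLIP model), and the linear probe here is a plain logistic regression trained by gradient descent on toy, class-separable embeddings.

```python
# Sketch of a linear sentiment probe over image-encoder embeddings.
# Assumption: fake_clip_embed stands in for a frozen CLIP image encoder
# applied to sentences rendered as images; real usage would render text
# with PIL and call the encoder's image-feature method instead.
import numpy as np

rng = np.random.default_rng(0)

def fake_clip_embed(n, dim=64, sentiment=+1):
    """Stand-in for CLIP image features of n rendered sentences.

    Embedding means shift with sentiment so the probe has a signal
    to find; this replaces the actual encoder for illustration only.
    """
    mean = sentiment * 0.5 * np.ones(dim)
    return rng.normal(mean, 1.0, size=(n, dim))

# Toy "dataset": embeddings of positive vs. negative sentences.
X = np.vstack([fake_clip_embed(200, sentiment=+1),
               fake_clip_embed(200, sentiment=-1)])
y = np.array([1] * 200 + [0] * 200)

# Linear probe: logistic regression via batch gradient descent.
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(positive)
    w -= 0.5 * (X.T @ (p - y)) / len(y)      # gradient step on weights
    b -= 0.5 * (p - y).mean()                # gradient step on bias

acc = (((X @ w + b) > 0).astype(int) == y).mean()
print(f"probe training accuracy: {acc:.2f}")
```

On these cleanly separable toy embeddings the probe fits easily; the paper's claim is that real CLIP image embeddings of rendered sentences carry enough semantic signal for an analogous probe to classify sentiment.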
Recommended citation: Sreeram Vennam*, Shashwat Singh*, Anirudh Govil, Ponnurangam Kumaraguru. "Emergence of Text Semantics in CLIP Image Encoders." UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models (NeurIPS 2024 Workshop), 2024.
Download Paper