DCDM: Diffusion-Conditioned-Diffusion Model for Scene Text Image Super-Resolution

Shrey Singh*, Prateek Keserwani, Masakazu Iwamura*, Partha Pratim Roy

Abstract


Severe blurring of scene text images, which destroys critical strokes and textual information, profoundly degrades text readability and recognizability. Scene text image super-resolution, which aims to enhance text resolution and legibility in low-resolution images, is therefore a crucial task. In this paper, we introduce a novel generative model for scene text super-resolution called the diffusion-conditioned-diffusion model (DCDM). The model learns the distribution of high-resolution images via two conditions: 1) the low-resolution image and 2) a character-level text embedding generated by a latent diffusion text model. The latent diffusion text module is specifically designed to generate the character-level text embedding space from the latent space of low-resolution images. Additionally, a character-level CLIP module aligns the high-resolution character-level text embeddings with the low-resolution embeddings, ensuring visual alignment with the semantics of scene text characters. Our experiments on the TextZoom and Real-CE datasets demonstrate the superiority of the proposed method over state-of-the-art methods. The source code and other resources will be available through the project page: https://github.com/shreygithub/DCDM.
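The two-condition setup described above can be sketched as a toy DDPM-style reverse-sampling loop in which the noise predictor sees the noisy high-resolution latent together with both conditions (the low-resolution image latent and the text embedding). This is a minimal illustrative sketch, not the paper's implementation: the dimensions, the linear noise predictor, and the beta schedule below are all assumptions, and the real model would use a trained U-Net and a learned text-diffusion module.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- chosen for illustration, not from the paper.
LATENT_DIM = 16   # high-resolution image latent
COND_DIM = 16     # low-resolution image latent (condition 1)
TEXT_DIM = 8      # character-level text embedding (condition 2)

def eps_predictor(z_t, t, lr_latent, text_emb, W):
    """Toy noise predictor: a single linear map over the concatenation
    [noisy HR latent | LR condition | text condition | timestep].
    Stands in for the conditioned denoising network."""
    x = np.concatenate([z_t, lr_latent, text_emb, [float(t)]])
    return W @ x

def ddpm_step(z_t, t, lr_latent, text_emb, W, betas):
    """One DDPM ancestral sampling step, conditioned on both the
    LR image latent and the (diffusion-generated) text embedding."""
    beta = betas[t]
    alpha = 1.0 - beta
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    eps = eps_predictor(z_t, t, lr_latent, text_emb, W)
    # Posterior mean of the reverse process given the predicted noise.
    mean = (z_t - beta / np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alpha)
    noise = rng.standard_normal(z_t.shape) if t > 0 else 0.0
    return mean + np.sqrt(beta) * noise

# Tiny demo: run the full reverse chain from pure noise.
T = 10
betas = np.linspace(1e-4, 0.2, T)
W = rng.standard_normal((LATENT_DIM, LATENT_DIM + COND_DIM + TEXT_DIM + 1)) * 0.01
lr_latent = rng.standard_normal(COND_DIM)   # would come from an LR encoder
text_emb = rng.standard_normal(TEXT_DIM)    # would come from the text diffusion module

z = rng.standard_normal(LATENT_DIM)
for t in reversed(range(T)):
    z = ddpm_step(z, t, lr_latent, text_emb, W, betas)

print(z.shape)  # (16,)
```

The key structural point the sketch illustrates is that both conditions enter every denoising step, so the sampled high-resolution latent is jointly constrained by the low-resolution appearance and the character-level semantics.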

Related Material


[pdf] [supplementary material] [DOI]