FaceCPT: Toward Cross-Modal Facial Representation Learning with Face-Caption Pre-Training


Md Mahedi Hasan (West Virginia University), Shoaib Meraj Sami (West Virginia University), Nasser Nasrabadi (West Virginia University), Jeremy M. Dawson (West Virginia University)
The 36th British Machine Vision Conference

Abstract

Facial representation learning (FRL) through weakly-supervised pre-training has shown significant promise across various downstream tasks, highlighting its improved generalizability. However, most existing FRL models excel only in face generation and analysis, and few address cross-modal tasks. To fill this gap, we pose the following question: Can we learn a universal facial representation for surveillance-related tasks by pre-training on web-sourced face-caption pairs? These tasks include cross-modal understanding, such as face captioning and text-based face image retrieval, as well as face analysis, such as face recognition, attribute recognition, expression recognition, and age estimation. In this paper, we take a step toward this objective by introducing FaceCPT, a new framework for learning facial representations using Face-Caption Pre-Training. Domain misalignment and information asymmetry between image-text pairs, however, make it difficult for the model to achieve meaningful interaction between the two modalities. To overcome this, we combine contrastive learning, which improves the semantic associations between face-caption pairs, with a semantic attribute-aware loss (SAAL), which encourages the model to focus on key semantic attributes. Experiments show that FaceCPT outperforms existing vision-language pre-training and FRL baselines, achieving state-of-the-art results in task-specific fine-tuning and improved zero-shot transferability across both single-modal and cross-modal tasks, even with low-resolution inputs.
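
For readers unfamiliar with contrastive pre-training on image-text pairs, the general idea can be illustrated with a minimal sketch. This is a generic symmetric contrastive (InfoNCE-style) objective of the kind commonly used in vision-language pre-training, not the formulation from the paper. The function name, the temperature value, and the use of NumPy are all assumptions made for illustration, and SAAL is not sketched because the abstract does not define it.

```python
import numpy as np


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Generic symmetric contrastive loss over a batch of paired embeddings.

    Illustrative only: row i of image_emb is assumed to be paired with
    row i of text_emb. The name and the default temperature are
    hypothetical choices, not values taken from the paper.
    """
    # L2-normalise so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, shape (N, N); matching pairs lie on the diagonal.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    image_to_text = cross_entropy_on_diagonal(logits)
    text_to_image = cross_entropy_on_diagonal(logits.T)
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    faces = rng.normal(size=(8, 16))
    # Correctly paired embeddings should score a lower loss than shuffled ones.
    print(symmetric_contrastive_loss(faces, faces))
    print(symmetric_contrastive_loss(faces, np.roll(faces, 1, axis=0)))
```

In this sketch, the loss is small when each face embedding is most similar to its own caption embedding and grows when the pairing is broken, which is the behaviour a contrastive objective relies on to align the two modalities.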

Citation

@inproceedings{Hasan_2025_BMVC,
author    = {Md Mahedi Hasan and Shoaib Meraj Sami and Nasser Nasrabadi and Jeremy M. Dawson},
title     = {FaceCPT: Toward Cross-Modal Facial Representation Learning with Face-Caption Pre-Training},
booktitle = {36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025},
publisher = {BMVA},
year      = {2025},
url       = {https://bmva-archive.org.uk/bmvc/2025/assets/papers/Paper_898/paper.pdf}
}


Copyright © 2025 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
