Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor

1University of Maryland, 2Apple   *Work done during an internship at Apple

Abstract

Recent advances in multimodal large language models (MLLMs) have enabled image-based question-answering capabilities. However, a key limitation is the use of CLIP as the visual encoder; while it captures coarse global information, it often misses fine-grained details that are relevant to the input query. To address these shortcomings, this work studies whether pre-trained text-to-image diffusion models can serve as instruction-aware visual encoders. Through an analysis of their internal representations, we find that diffusion features are both semantically rich and encode strong image-text alignment. Moreover, we find that text conditioning can be leveraged to focus the model on regions relevant to the input question. We then investigate how to align these features with large language models and uncover a leakage phenomenon, in which the LLM can inadvertently recover information from the original diffusion prompt. We analyze the causes of this leakage and propose a mitigation strategy. Based on these insights, we explore a simple fusion strategy that utilizes both CLIP and conditional diffusion features. We evaluate our approach on both general VQA and specialized MLLM benchmarks, demonstrating the promise of diffusion models for visual understanding, particularly in vision-centric tasks that require spatial and compositional reasoning.


Overview


We present Mustafar, a multimodal understanding framework that leverages Stable Diffusion as a task-aware feature extractor.



What Visual Information Do Diffusion Models Encode?


Diffusion Features Encode Semantic and Structural Information


We first extract and inspect unconditional features using Stable Diffusion v2.1 as our diffusion model. We use PCA for dimensionality reduction to visualize the features, mapping each visualization channel to a principal component (a minimal extraction sketch follows the list below). Our visualizations reveal three key insights about diffusion features:

Feature Diversity: Features from different blocks capture both shared semantics and image-specific details. Additionally, features like out and res-out contain tokens that act as "registers" - shared global descriptors across similar images.

Timestep Behavior: Higher timesteps encode coarse layout, while lower timesteps emphasize fine-grained structure.

Similarity Analysis: To understand how well diffusion features encode visual differences, we plot the average cosine similarity between pairs of similar images. We observe that (1) diffusion features capture intra-pair visual differences better than CLIP; (2) cross-q features show higher pairwise similarity than b0-out and res-out features; and (3) pairwise similarity decreases as timestep increases.
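
To make the extraction pipeline concrete, here is a minimal sketch (assuming the Hugging Face diffusers and scikit-learn APIs) of how unconditional UNet features can be pulled from Stable Diffusion v2.1 with a forward hook and projected onto three principal components for visualization. The hooked block, the timestep, and the example image path are illustrative choices, not the exact extraction points used in the paper.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from sklearn.decomposition import PCA

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1").to(device)

features = {}

def save_output(name):
    def hook(module, inputs, output):
        features[name] = output.detach()   # (B, C, H, W) spatial feature map
    return hook

# Hook the output of the last up-block (an assumed stand-in for the "out" features).
handle = pipe.unet.up_blocks[-1].register_forward_hook(save_output("up_out"))

# Encode an image to latents, noise it at a chosen timestep, and run one UNet
# forward pass with an empty prompt (i.e., unconditional features).
image = Image.open("example.jpg").convert("RGB").resize((512, 512))
pixels = torch.from_numpy(np.array(image)).permute(2, 0, 1)[None].float().to(device)
pixels = pixels / 127.5 - 1.0

with torch.no_grad():
    latents = pipe.vae.encode(pixels).latent_dist.mean * pipe.vae.config.scaling_factor
    t = torch.tensor([200], device=device)                  # illustrative timestep
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), t)

    tok = pipe.tokenizer("", padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt")
    empty_prompt = pipe.text_encoder(tok.input_ids.to(device))[0]
    pipe.unet(noisy, t, encoder_hidden_states=empty_prompt)
handle.remove()

# PCA over spatial locations: every location is a sample, channels are features.
feat = features["up_out"][0]                                # (C, H, W)
C, H, W = feat.shape
tokens = feat.permute(1, 2, 0).reshape(-1, C).cpu().numpy()
rgb = PCA(n_components=3).fit_transform(tokens).reshape(H, W, 3)  # view as an RGB map
```

The same hooked features can also drive the similarity analysis above: pool each image's feature map, then compare pooled vectors across an image pair with cosine similarity.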



Cross-Attention Maps Capture Text-Aligned Visual Semantics


We next visualize cross-attention maps to assess how well diffusion models capture image-text alignment. We show two samples from the COCO Captions dataset and their averaged cross-attention maps. Cross-attention maps at higher timesteps focus more strongly on background elements (e.g., “court”), while maps from lower timesteps provide improved localization of both object and action concepts (e.g., “racquets” and “holding”). Overall, we see robust correspondence, with precise localization of the regions relevant to objects and actions.
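
As a companion sketch, cross-attention maps can be captured by swapping in a diffusers attention processor that also records the softmaxed attention probabilities. The processor below mirrors the standard diffusers attention math; the head-averaging and spatial reshaping at the end are illustrative choices, not the paper's exact aggregation.

```python
import torch
from diffusers import StableDiffusionPipeline

class StoreCrossAttnProcessor:
    """Drop-in attention processor that also records cross-attention maps."""
    def __init__(self, store, name):
        self.store, self.name = store, name

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        is_cross = encoder_hidden_states is not None
        context = encoder_hidden_states if is_cross else hidden_states

        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))

        # Softmaxed attention probabilities: (batch * heads, H*W, n_text_tokens).
        probs = attn.get_attention_scores(query, key, attention_mask)
        if is_cross:
            self.store.setdefault(self.name, []).append(probs.detach().cpu())

        out = attn.batch_to_head_dim(torch.bmm(probs, value))
        out = attn.to_out[1](attn.to_out[0](out))   # output projection + dropout
        return out

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1").to("cuda")

# Install the storing processor on every attention layer in the UNet.
attn_store = {}
pipe.unet.set_attn_processor({
    name: StoreCrossAttnProcessor(attn_store, name)
    for name in pipe.unet.attn_processors
})

# After a UNet forward pass with the caption as the prompt (as in the sketch
# above), average each stored map over heads and reshape it to a spatial grid.
def to_token_maps(probs):
    per_head_avg = probs.mean(dim=0)                 # (H*W, n_text_tokens)
    side = int(per_head_avg.shape[0] ** 0.5)
    return per_head_avg.T.reshape(-1, side, side)    # one (side, side) map per token
```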



Can We Extract Task-Aware Features for Question-Answering?



Yes! We first identify that passing the question as the text prompt enables the model to focus on relevant regions. To mitigate the leakage effect, we perform pre-training with no input text prompt and only use questions during supervised fine-tuning. To harness the benefits of both CLIP and diffusion features, we experiment with two simple fusion strategies: (1) concatenation and (2) cross-attention. Our results show that both fusion strategies match or exceed the performance of the CLIP baseline on general-purpose and vision-centric benchmarks.
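
To illustrate the two fusion variants, here is a minimal PyTorch sketch assuming CLIP patch tokens of shape (B, N_clip, D_clip) and flattened, question-conditioned diffusion features of shape (B, N_diff, D_diff). The projection widths, the single cross-attention block, and the residual connection are illustrative assumptions, not the exact module used in our system.

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Project both streams to the LLM width and concatenate along the token axis."""
    def __init__(self, d_clip, d_diff, d_llm):
        super().__init__()
        self.proj_clip = nn.Linear(d_clip, d_llm)
        self.proj_diff = nn.Linear(d_diff, d_llm)

    def forward(self, clip_tokens, diff_tokens):
        return torch.cat([self.proj_clip(clip_tokens),
                          self.proj_diff(diff_tokens)], dim=1)

class CrossAttnFusion(nn.Module):
    """CLIP tokens attend to diffusion tokens, so the visual token count stays fixed."""
    def __init__(self, d_clip, d_diff, d_llm, n_heads=8):
        super().__init__()
        self.proj_clip = nn.Linear(d_clip, d_llm)
        self.proj_diff = nn.Linear(d_diff, d_llm)
        self.attn = nn.MultiheadAttention(d_llm, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_llm)

    def forward(self, clip_tokens, diff_tokens):
        q = self.proj_clip(clip_tokens)
        kv = self.proj_diff(diff_tokens)
        fused, _ = self.attn(q, kv, kv)
        return self.norm(q + fused)   # residual keeps the CLIP stream intact

# Example shapes (illustrative): 576 CLIP patch tokens (1024-d), 1024 diffusion
# tokens (1280-d), projected into a 4096-d LLM embedding space.
clip_tokens = torch.randn(2, 576, 1024)
diff_tokens = torch.randn(2, 1024, 1280)
fused = CrossAttnFusion(1024, 1280, 4096)(clip_tokens, diff_tokens)  # (2, 576, 4096)
```

The main trade-off captured by the sketch: concatenation grows the number of visual tokens fed to the LLM, whereas cross-attention keeps it fixed at the CLIP token count.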