📹 (ACM MM 2025) HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval (Model Weights)

Zhiwei Chen¹, Yupeng Hu¹✉, Zixu Li¹, Zhiheng Fu¹, Haokun Wen², Weili Guan²

¹School of Software, Shandong University
²School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
✉ Corresponding author

This repository hosts the official pre-trained model weights for HUD, a novel framework tackling both Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR) tasks by explicitly leveraging the disparity in information density between modalities.

📌 Model Information

1. Model Name

HUD (Hierarchical Uncertainty-Aware Disambiguation Network) Checkpoints.

2. Task Type & Applicable Tasks

Task Type: Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR).
Applicable Tasks: Retrieving a target video or image given a reference visual input and a text modifier. HUD targets two weaknesses of this setting: ambiguity about which subject a modification refers to, and insufficient focus on fine-grained semantic details.

3. Project Introduction

HUD is the first framework that explicitly leverages the disparity in information density between video and text. It achieves State-of-the-Art (SOTA) performance through three key modules:

🎯 Holistic Pronoun Disambiguation: Exploits overlapping semantics through holistic cross-modal interaction to indirectly disambiguate pronoun referents.
🔍 Atomistic Uncertainty Modeling: Discerns key detail semantics via uncertainty modeling at the atomistic level, enhancing focus on fine-grained visual details.
⚖️ Holistic-to-Atomistic Alignment: Adaptively aligns the composed query representation with the target media by incorporating a learnable similarity bias.

4. Training Data Source & Hosted Weights

The HUD framework supports both video and image retrieval benchmarks. This repository provides pre-trained checkpoints evaluated on the following datasets:

CVR: WebVid-CoVR dataset.
CIR: FashionIQ and CIRR datasets.

(Note: Download the respective .ckpt files hosted in the "Files and versions" tab of this repository).

🚀 Usage & Basic Inference

These weights are intended to be evaluated with the code in the Hydra-configured HUD GitHub repository.

Step 1: Prepare the Environment

We recommend using Anaconda. Clone the repository and install dependencies:

git clone https://github.com/iLearn-Lab/MM25-HUD
cd MM25-HUD
conda create -n hud python=3.8.10 -y
conda activate hud
conda install pytorch==2.1.0 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
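
After installation, a quick check that the active environment matches the versions named above can save debugging time later. The following is a minimal sketch, not part of the HUD repository: the helper names (`version_matches`, `collect_versions`, `report`) are hypothetical, and the expected CUDA version is taken from the `pytorch-cuda=11.8` pin in the install command.

```python
import platform
from typing import Dict, List, Optional

# Versions named in this model card; edit if you deliberately use others.
EXPECTED = {"python": "3.8.10", "torch": "2.1.0", "cuda": "11.8"}


def version_matches(actual: Optional[str], expected: str) -> bool:
    """True if `actual` equals `expected`, ignoring build tags such as '+cu118'."""
    if actual is None:
        return False
    core = actual.split("+", 1)[0]
    return core == expected or core.startswith(expected + ".")


def collect_versions() -> Dict[str, Optional[str]]:
    """Read the Python version and, if PyTorch is importable, its version info."""
    found: Dict[str, Optional[str]] = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
    }
    try:
        import torch  # imported lazily so the check still runs without PyTorch

        found["torch"] = str(torch.__version__)
        found["cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        pass
    return found


def report(found: Dict[str, Optional[str]]) -> List[str]:
    """Build one human-readable line per expected component."""
    lines = []
    for name, expected in EXPECTED.items():
        actual = found.get(name)
        status = "ok" if version_matches(actual, expected) else "differs"
        lines.append(f"{name}: expected {expected}, found {actual} ({status})")
    return lines


if __name__ == "__main__":
    print("\n".join(report(collect_versions())))
```

A "differs" line is only a hint that results may not reproduce exactly; it does not mean the code will fail to run.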

Step 2: Download Model Weights

Download the checkpoints you need from this Hugging Face repository and place them in a local directory; you will pass the checkpoint path to test.py in Step 3. Ensure your dataset paths are correctly configured in configs/machine/default.yaml.
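
If you prefer to script the download instead of using the web interface, the sketch below shows one possible approach with the `huggingface_hub` library. It is an illustration under assumptions: `REPO_ID` is a hypothetical placeholder that you must replace with this repository's actual id, the helper names are made up for this example, and `huggingface_hub` must be installed separately (for example with `pip install huggingface_hub`).

```python
from pathlib import Path
from typing import List

# Hypothetical placeholder: replace with the actual id of this Hub repository.
REPO_ID = "<namespace>/<repo-name>"


def download_checkpoints(repo_id: str, local_dir: str) -> str:
    """Download only the .ckpt files of a Hub repository into `local_dir`."""
    # Imported lazily so list_checkpoints works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=["*.ckpt"],  # skip everything that is not a checkpoint
        local_dir=local_dir,
    )


def list_checkpoints(local_dir: str) -> List[Path]:
    """Return every .ckpt file under `local_dir` (recursively), sorted by path."""
    return sorted(Path(local_dir).rglob("*.ckpt"))
```

A typical use would be `download_checkpoints(REPO_ID, "checkpoints")` followed by `list_checkpoints("checkpoints")` to confirm which files arrived before moving on to evaluation.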

Step 3: Run Evaluation

To evaluate a trained model, use test.py and specify the target benchmark and checkpoint path via Hydra overrides:

python3 test.py \
    model.ckpt_path=/path/to/your/downloaded_checkpoint.ckpt \
    +test=webvid-covr # or fashioniq / cirr-all
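
To evaluate several checkpoints in one go, the command above can be wrapped in a small driver script. This is a sketch rather than a tool shipped with HUD: the function names and the example checkpoint paths are hypothetical, and it assumes you run it from the root of the cloned MM25-HUD repository so that test.py is found. It uses the current interpreter (`sys.executable`) in place of `python3`, and defaults to a dry run that only prints each command.

```python
import shlex
import subprocess
import sys
from typing import List

# Benchmark names as they appear in the evaluation command above.
BENCHMARKS = ("webvid-covr", "fashioniq", "cirr-all")


def build_eval_command(ckpt_path: str, benchmark: str) -> List[str]:
    """Assemble the test.py call with the two Hydra overrides shown above."""
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark {benchmark!r}; expected one of {BENCHMARKS}")
    return [
        sys.executable,
        "test.py",
        f"model.ckpt_path={ckpt_path}",
        f"+test={benchmark}",
    ]


def run_eval(ckpt_path: str, benchmark: str, dry_run: bool = True) -> int:
    """Print the command; execute it only when dry_run is False."""
    cmd = build_eval_command(ckpt_path, benchmark)
    print(" ".join(shlex.quote(part) for part in cmd))
    if dry_run:
        return 0
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Hypothetical checkpoint paths: substitute the files you downloaded.
    jobs = {
        "webvid-covr": "/path/to/webvid_checkpoint.ckpt",
        "fashioniq": "/path/to/fashioniq_checkpoint.ckpt",
        "cirr-all": "/path/to/cirr_checkpoint.ckpt",
    }
    for benchmark, ckpt in jobs.items():
        run_eval(ckpt, benchmark, dry_run=True)
```

Passing `dry_run=False` launches the actual evaluation; keeping the default first is a cheap way to confirm that each checkpoint is paired with the benchmark you intended.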

⚠️ Limitations & Notes

Configuration: HUD is entirely managed by Hydra and Lightning Fabric. Make sure to override configurations via the CLI or modify the YAML files in the configs/ directory as needed.
Hardware & Environment: The project was developed and tested with Python 3.8.10, PyTorch 2.1.0, and an NVIDIA A40 GPU (48 GB). Substantially different environments may affect reproducibility.

📝⭐️ Citation

If you find our framework, code, or these weights useful in your research, please consider leaving a Star ⭐️ on our GitHub repository and citing our ACM MM 2025 paper:

@inproceedings{HUD, 
  title = {HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval}, 
  author = {Chen, Zhiwei and Hu, Yupeng and Li, Zixu and Fu, Zhiheng and Wen, Haokun and Guan, Weili}, 
  booktitle = {Proceedings of the ACM International Conference on Multimedia}, 
  pages = {6143--6152},
  year = {2025} 
}
