(ACM MM 2025) HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval (Model Weights)
1 School of Software, Shandong University; 2 School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
Corresponding author
This repository hosts the official pre-trained model weights for HUD, a novel framework tackling both Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR) tasks by explicitly leveraging the disparity in information density between modalities.
Model Information
1. Model Name
HUD (Hierarchical Uncertainty-Aware Disambiguation Network) Checkpoints.
2. Task Type & Applicable Tasks
- Task Type: Composed Video Retrieval (CVR) and Composed Image Retrieval (CIR).
- Applicable Tasks: Retrieving a target video or image based on a reference visual input and a text modifier. HUD targets two weaknesses of existing approaches: ambiguity about which subject the modification text refers to, and limited focus on detailed semantics.
3. Project Introduction
HUD is the first framework that explicitly leverages the disparity in information density between video and text. It achieves State-of-the-Art (SOTA) performance through three key modules:
- Holistic Pronoun Disambiguation: Exploits overlapping semantics through holistic cross-modal interaction to indirectly disambiguate pronoun referents.
- Atomistic Uncertainty Modeling: Discerns key detail semantics via uncertainty modeling at the atomistic level, enhancing focus on fine-grained visual details.
- Holistic-to-Atomistic Alignment: Adaptively aligns the composed query representation with the target media by incorporating a learnable similarity bias.
4. Training Data Source & Hosted Weights
The HUD framework supports both video and image retrieval benchmarks. This repository provides pre-trained checkpoints evaluated on the following datasets:
- CVR: WebVid-CoVR dataset.
- CIR: FashionIQ and CIRR datasets.
(Note: Download the respective .ckpt files hosted in the "Files and versions" tab of this repository).
Usage & Basic Inference
These weights are designed to be evaluated using the highly modular, Hydra-configured HUD GitHub repository.
Step 1: Prepare the Environment
We recommend using Anaconda. Clone the repository and install dependencies:
git clone https://github.com/iLearn-Lab/MM25-HUD
cd MM25-HUD
conda create -n hud python=3.8.10 -y
conda activate hud
conda install pytorch==2.1.0 torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
Step 2: Download Model Weights
Download the specific checkpoints from this Hugging Face repository and place them into your local directory. Ensure your dataset paths are correctly configured in configs/machine/default.yaml.
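Before launching an evaluation, it can help to confirm that the files from this step are where you expect them. The following is a minimal sketch, not part of the HUD codebase: the `preflight` helper and its argument names are hypothetical, and the only path taken from this README is configs/machine/default.yaml.

```python
from pathlib import Path
from typing import List


def preflight(repo_root: str, ckpt_path: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper for illustration; it only checks that files exist,
    not that the checkpoint or the YAML contents are valid.
    """
    problems = []

    ckpt = Path(ckpt_path)
    if ckpt.suffix != ".ckpt":
        problems.append(f"expected a .ckpt file, got: {ckpt.name}")
    if not ckpt.is_file():
        problems.append(f"checkpoint not found: {ckpt}")

    # Path named in this README; dataset paths are configured inside it.
    machine_cfg = Path(repo_root) / "configs" / "machine" / "default.yaml"
    if not machine_cfg.is_file():
        problems.append(f"machine config not found: {machine_cfg}")

    return problems
```

Calling `preflight("MM25-HUD", "/path/to/your/downloaded_checkpoint.ckpt")` and printing each returned message gives a quick checklist before moving on to Step 3.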
Step 3: Run Evaluation
To evaluate a trained model, use test.py and specify the target benchmark and checkpoint path via Hydra overrides:
python3 test.py \
model.ckpt_path=/path/to/your/downloaded_checkpoint.ckpt \
+test=webvid-covr # or fashioniq / cirr-all
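If you want to evaluate several checkpoints in a row, the same command can be assembled programmatically. This is a small sketch under stated assumptions: `build_eval_command` and `as_shell` are hypothetical helpers, and the only overrides used are the two shown above (`model.ckpt_path=...` and `+test=...`) with the three preset names this README mentions.

```python
import shlex
from typing import List

# Test presets named in this README; the repository may define others.
BENCHMARKS = ("webvid-covr", "fashioniq", "cirr-all")


def build_eval_command(ckpt_path: str, benchmark: str) -> List[str]:
    """Build the argument list for test.py with Hydra overrides."""
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark {benchmark!r}; expected one of {BENCHMARKS}")
    return [
        "python3",
        "test.py",
        f"model.ckpt_path={ckpt_path}",  # checkpoint override
        f"+test={benchmark}",            # benchmark preset override
    ]


def as_shell(cmd: List[str]) -> str:
    """Render the argument list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in cmd)
```

Passing the list to `subprocess.run(cmd, check=True, cwd="MM25-HUD")` runs the evaluation from the repository root without going through a shell, which avoids quoting problems when the checkpoint path contains spaces.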
Limitations & Notes
- Configuration: HUD is entirely managed by Hydra and Lightning Fabric. Make sure to override configurations via the CLI or modify the YAML files in the configs/ directory as needed.
- Hardware & Environment: The project was specifically developed and tested on Python 3.8.10, PyTorch 2.1.0, and an NVIDIA A40 48G GPU. Using significantly different environment settings may impact reproducibility.
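To see how far your setup is from the tested environment, a quick comparison can be scripted. The sketch below is an illustration only: the function names are hypothetical, the reference versions are the ones listed in the notes above, and a mismatch is reported as a warning rather than treated as an error, since other versions may still work.

```python
import importlib
import sys
from typing import List, Optional, Tuple

# Versions the notes above report as tested.
TESTED_PYTHON = (3, 8, 10)
TESTED_TORCH = "2.1.0"


def installed_torch_version() -> Optional[str]:
    """Return torch.__version__, or None if PyTorch is not importable."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None
    return str(torch.__version__)


def environment_warnings(python_version: Tuple[int, int, int],
                         torch_version: Optional[str]) -> List[str]:
    """Compare the given versions against the tested ones."""
    warnings = []
    if tuple(python_version) != TESTED_PYTHON:
        found = ".".join(str(part) for part in python_version)
        warnings.append(f"Python {found} differs from tested 3.8.10")
    if torch_version is None:
        warnings.append("PyTorch is not installed")
    elif torch_version.split("+")[0] != TESTED_TORCH:
        # Drop local build tags such as "+cu118" before comparing.
        warnings.append(f"PyTorch {torch_version} differs from tested {TESTED_TORCH}")
    return warnings
```

For the current interpreter, call `environment_warnings(tuple(sys.version_info[:3]), installed_torch_version())` and print whatever it returns.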
Citation
If you find our framework, code, or these weights useful in your research, please consider leaving a star on our GitHub repository and citing our ACM MM 2025 paper:
@inproceedings{HUD,
title = {HUD: Hierarchical Uncertainty-Aware Disambiguation Network for Composed Video Retrieval},
author = {Chen, Zhiwei and Hu, Yupeng and Li, Zixu and Fu, Zhiheng and Wen, Haokun and Guan, Weili},
booktitle = {Proceedings of the ACM International Conference on Multimedia},
pages = {6143--6152},
year = {2025}
}