Cross-modal semantic retrieval for hydropower plant surveillance videos based on multi-level denoising
胡晓连, 唐佳庆, 杨志, 周文, 黄坤, 曹亮亮, 莫益军, 凌贺飞, 史宇轩, 李建博
Water Resources and Hydropower Engineering (水利水电技术(中英文)), 2025, Vol. 56, Issue 11: 179-188.
[Objective] To apply cross-modal retrieval to personnel security, facility protection, and instrument status monitoring in hydropower video surveillance systems, a multimodal mapping between text and images is constructed so that surveillance content can be searched flexibly by its semantics from a text description. [Methods] A multi-level denoising multimodal fusion technique is proposed to address two problems in existing cross-modal methods: the slow inference of single-stream models and the lack of modal fusion in dual-stream models. Built on a dual-stream pre-trained model, the technique combines masked language modeling with fine-grained cross-modal semantic alignment, and defines “add noise first, then denoise” tasks at multiple levels of the neural network to promote fine-grained interaction between images and text. [Results] In extensive experiments under different settings, and relative to the R@1 of the fine-tuned CLIP baseline, recall on the Flickr30K dataset improved by 4.1% for image retrieval and 2.7% for text retrieval; on the MS-COCO dataset the corresponding improvements were 4.3% and 3.2%. On surveillance data collected by the authors from hydropower systems, retrieval was tested for conditions such as persons floating in the dam area, equipment operating status, and abnormal meters and instruments, with good results. [Conclusion] The experiments verify the advantage of the multi-level denoising algorithm in cross-modal semantic retrieval tasks and demonstrate its applicability to hydropower plant surveillance video scenarios.
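For readers unfamiliar with the R@1 metric quoted in the results, the following is a minimal sketch of how text-to-image retrieval and Recall@K are typically evaluated in a dual-stream setting. It is not the authors' implementation. It assumes the text and image encoders (for example a fine-tuned CLIP) have already produced embedding vectors; the function names `cosine`, `rank_images`, and `recall_at_k` and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_images(text_vec, image_vecs):
    """Return image indices ordered from most to least similar to the query."""
    scores = [cosine(text_vec, img) for img in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(text_vecs, image_vecs, ground_truth, k=1):
    """Fraction of text queries whose matching image appears in the top k."""
    hits = 0
    for query, target in zip(text_vecs, ground_truth):
        if target in rank_images(query, image_vecs)[:k]:
            hits += 1
    return hits / len(text_vecs)


# Toy data (assumption): three image embeddings and three text queries.
image_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_vecs = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9], [0.6, 0.7, 0.0]]
ground_truth = [0, 2, 0]  # index of the correct image for each query

if __name__ == "__main__":
    print("R@1:", recall_at_k(text_vecs, image_vecs, ground_truth, k=1))
    print("R@2:", recall_at_k(text_vecs, image_vecs, ground_truth, k=2))
```

Because the two encoders run independently in a dual-stream model, the image embeddings can be computed once offline and only the text query needs encoding at search time, which is the inference-speed advantage the abstract contrasts with single-stream models.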
cross-modal retrieval / image-text retrieval / vision-language pre-training / contrastive learning / denoising