Abstract
To address inadequate spatio-temporal feature modeling and poor robustness in complex scenarios in non-contact blood oxygen saturation (SpO2) measurement based on remote photoplethysmography (rPPG), a trend-aware spatio-temporal fusion network (TAST-Net) is proposed. The network adopts a dual-branch fusion architecture that fuses local physiological features extracted by a 3D convolutional neural network (3D CNN) branch with global spatio-temporal dependencies captured by a video vision transformer (ViViT) branch. To make the model more sensitive to signal dynamics, a weighted composite loss function combining mean squared error (MSE) and a Pearson correlation loss is designed. Experiments on two public datasets demonstrate the strong performance of TAST-Net. On the PURE (pulse rate estimation) dataset, it achieves a root mean squared error (RMSE) of 0.53%, a mean absolute error (MAE) of 0.37%, and a Pearson correlation coefficient (R) of 0.96. On the more challenging VIPL-HR (visual information processing and learning-heart rate) dataset, the RMSE, MAE, and R are 0.84%, 0.57%, and 0.82, respectively, and the overall performance exceeds that of the compared methods. These findings indicate that TAST-Net provides an effective solution for accurate and robust SpO2 estimation from facial videos, and they validate the strategy of fusing local and global features in rPPG signal processing.
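2) Background removal: After the aligned ROI is obtained, a second, finer segmentation stage thoroughly separates facial skin from background noise such as hair and collars, using Google's MediaPipe Face Mesh [21]. Face Mesh produces a dense facial mesh of 468 landmarks whose outline closely follows the complete facial boundary from the chin to the hairline. The outer contour points of this mesh are used to build a precise polygonal face mask, which is applied to the ROI to yield the clean face image shown in Fig. 2c.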
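3) Eulerian video magnification: Because the rPPG signal is very weak, the Eulerian video magnification (EVM) algorithm [22] is used to amplify color changes in the video frames. EVM decomposes the video into a Laplacian pyramid and amplifies the signal within a selected frequency band along the time axis, thereby enhancing the subtle skin-color variations caused by changes in blood flow. In this study, the amplification factor is set to 120 and the passband to 0.4 to 4 Hz (24 to 240 beats per minute) to match heart-rate fluctuations. The processed ROI is shown in Fig. 2d.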
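Steps 2) and 3) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work. The contour points are assumed to have been extracted already from the outer ring of the Face Mesh landmarks, and the function names `polygon_mask`, `apply_face_mask`, and `amplify_band` are hypothetical. The magnification step applies only an ideal temporal band-pass filter with the reported gain of 120 and passband of 0.4 to 4 Hz. It omits the Laplacian pyramid decomposition that EVM performs.

```python
import numpy as np

def polygon_mask(height, width, contour):
    """Boolean (height, width) mask, True inside the polygon.

    contour: ordered (N, 2) sequence of (x, y) vertices in pixel coordinates
    (assumed here to come from the outer ring of the face mesh).
    Uses the even-odd (ray casting) rule evaluated at pixel centres.
    """
    contour = np.asarray(contour, dtype=float)
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = xs + 0.5, ys + 0.5
    inside = np.zeros((height, width), dtype=bool)
    x0, y0 = contour[-1]
    for x1, y1 in contour:
        # Edge straddles the horizontal line through the pixel centre?
        straddles = (y0 > py) != (y1 > py)
        # x coordinate where the edge crosses that line (unused if horizontal)
        with np.errstate(divide="ignore", invalid="ignore"):
            x_cross = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
        inside ^= straddles & (px < x_cross)
        x0, y0 = x1, y1
    return inside

def apply_face_mask(roi, contour):
    """Zero every pixel of an (H, W, 3) ROI that lies outside the contour."""
    mask = polygon_mask(roi.shape[0], roi.shape[1], contour)
    return roi * mask[..., None], mask

def amplify_band(frames, fps, low_hz=0.4, high_hz=4.0, alpha=120.0):
    """Simplified temporal magnification of a (T, H, W, C) frame stack.

    Keeps only the temporal frequencies in [low_hz, high_hz] for each pixel,
    scales that band by alpha, and adds it back to the input.
    No spatial pyramid is used in this sketch.
    """
    frames = np.asarray(frames, dtype=np.float64)
    n = frames.shape[0]
    freqs = np.fft.rfftfreq(n, d=1.0 / fps)
    spectrum = np.fft.rfft(frames, axis=0)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    band = np.fft.irfft(spectrum, n=n, axis=0)
    return frames + alpha * band
```

In practice the amplified frames would also need to be clipped back to the valid pixel range before further processing. With a gain of 120, a color variation of 0.01 gray levels in the passband becomes a variation of about 1.2 gray levels.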
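The visualizations in Fig. 3 further confirm the strong performance of TAST-Net. The Bland-Altman analysis shows that on the PURE dataset the mean bias between the TAST-Net estimates and the ground truth is only 0.16%, and the 95% limits of agreement span the narrow interval from -0.84% to 1.16%, indicating good agreement between the estimated and reference values. On the more challenging VIPL-HR dataset the model remains robust, with a mean bias of 0.06% and 95% limits of agreement of -1.58% to 1.71%. The international standard ISO 80601-2-61 [30] for medical pulse oximeters requires the RMSE to be below 3%. The RMSE of TAST-Net on VIPL-HR is 0.84%, far below this clinical threshold. These quantitative results show that the TAST-Net estimates have very small systematic bias and that the vast majority of estimation errors fall within the clinically acceptable range, which statistically supports the accuracy and reliability of the estimates.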
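The Bland-Altman quantities reported above can be computed as in the following sketch. It assumes the conventional definition, in which the 95% limits of agreement are the mean bias plus or minus 1.96 sample standard deviations of the paired differences. The reported PURE limits (0.16% ± 1.00%) are consistent with that definition. The function names `bland_altman` and `rmse` are illustrative only.

```python
import numpy as np

def bland_altman(estimated, reference):
    """Return (bias, lower limit, upper limit) of the 95% limits of agreement."""
    diff = np.asarray(estimated, dtype=float) - np.asarray(reference, dtype=float)
    bias = diff.mean()
    sd = diff.std(ddof=1)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def rmse(estimated, reference):
    """Root mean squared error between estimated and reference SpO2 (in %)."""
    diff = np.asarray(estimated, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Both functions take paired SpO2 values in percent, for example one estimate and one oximeter reading per video clip.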
[1] Laratta C R, Ayas N T, Povitz M, et al. Diagnosis and treatment of obstructive sleep apnea in adults[J]. Canadian Medical Association Journal, 2017, 189(48): 1481-1488.
[2] Watson A R, Wah R, Thamman R. The value of remote monitoring for the COVID-19 pandemic[J]. Telemedicine Journal and e-Health, 2020, 26(9): 1110-1112.
[3] Amoore J N. Pulse oximetry: an equipment management perspective[C]//IEE Colloquium on Pulse Oximetry: A Critical Appraisal. London, 2002: 124-126.
[4] Shimazaki T, Hara S, Okuhata H, et al. Cancellation of motion artifact induced by exercise for PPG-based heart rate sensing[C]//The 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. Chicago, 2014: 3216-3219.
[5] Verkruysse W, Svaasand L O, Nelson J S. Remote plethysmographic imaging using ambient light[J]. Optics Express, 2008, 16(26): 21434-21445.
[6] de Haan G, Jeanne V. Robust pulse rate from chrominance-based rPPG[J]. IEEE Transactions on Biomedical Engineering, 2013, 60(10): 2878-2886.
[7] Poh M Z, McDuff D J, Picard R W. Non-contact, automated cardiac pulse measurements using video imaging and blind source separation[J]. Optics Express, 2010, 18(10): 10762-10774.
[8] Balakrishnan G, Durand F, Guttag J. Detecting pulse from head motions in video[C]//2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, 2013: 3430-3437.
[9] Wang W J, den Brinker A C, Stuijk S, et al. Algorithmic principles of remote PPG[J]. IEEE Transactions on Biomedical Engineering, 2017, 64(7): 1479-1491.
[11] Mathew J, Tian X, Wong C W, et al. Remote blood oxygen estimation from videos using neural networks[J]. IEEE Journal of Biomedical and Health Informatics, 2023, 27(8): 3710-3720.
[12] Yu Z T, Shen Y M, Shi J G, et al. PhysFormer++: facial video-based physiological measurement with slow fast temporal difference transformer[J]. International Journal of Computer Vision, 2023, 131(6): 1307-1330.
[13] Yu Z T, Shen Y M, Shi J G, et al. PhysFormer: facial video-based physiological measurement with temporal difference transformer[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, 2022: 4176-4186.
[14] Du J D, Liu S Q, Zhang B C, et al. Weakly supervised rPPG estimation for respiratory rate estimation[C]//IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). Montreal, 2021: 2391-2397.
[15] Gideon J, Stent S. The way to my heart is through contrastive learning: remote photoplethysmography from unlabelled video[C]//IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, 2022: 3975-3984.
[16] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. Long Beach, CA, 2017: 5998-6008.
[17] Arnab A, Dehghani M, Heigold G, et al. ViViT: a video vision transformer[C]//IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, 2021: 6816-6826.
[18] Stricker R, Müller S, Gross H M. Non-contact video-based pulse rate measurement on a mobile service robot[C]//The 23rd IEEE International Symposium on Robot and Human Interactive Communication. Edinburgh, 2014: 1056-1062.
[19] Niu X S, Han H, Shan S G, et al. VIPL-HR: a multi-modal database for pulse estimation from less-constrained face video[C]//Computer Vision-ACCV 2018. Cham: Springer, 2018: 562-576.
[20] Kazemi V, Sullivan J. One millisecond face alignment with an ensemble of regression trees[C]//IEEE Conference on Computer Vision and Pattern Recognition. Columbus, 2014: 1867-1874.
[21] Lugaresi C, Tang J Q, Nash H, et al. MediaPipe: a framework for building perception pipelines[EB/OL]. (2019-06-12)[2024-11-19].
[22] Wu H Y, Rubinstein M, Shih E, et al. Eulerian video magnification for revealing subtle changes in the world[J]. ACM Transactions on Graphics, 2012, 31(4): 1-8.
[24] He K M, Zhang X Y, Ren S Q, et al. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification[C]//IEEE International Conference on Computer Vision (ICCV). Santiago, 2015: 1026-1034.
[25] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, 2016: 770-778.
[26] Kim S Y, Lim J, Na T, et al. 3DSRnet: video super-resolution using 3D convolutional neural networks[EB/OL]. (2018-12-21)[2024-11-15].
[27] Liu K, Tang J K, Jiang Z, et al. Summit vitals: multi-camera and multi-signal biosensing at high altitudes[C]//IEEE Smart World Congress (SWC). Nadi, 2025: 284-291.
[28] Zhu S W, Liu S H, Jing X J, et al. Innovative approaches in imaging photoplethysmography for remote blood oxygen monitoring[J]. Scientific Reports, 2024, 14: 19144.
[29] Hu M, Wu X, Wang X H, et al. Contactless blood oxygen estimation from face videos: a multi-model fusion method based on deep learning[J]. Biomedical Signal Processing and Control, 2023, 81: 104487.
[30] Respiratory Devices and Related Equipment Used for Patient Care. Medical electrical equipment. Part 2-61: particular requirements for basic safety and essential performance of pulse oximeter equipment: ISO 80601-2-61:2017[S/OL]. (2017-12-15)[2025-03-12].