Synergizing operator fusion and memory optimization for parallel GNN training acceleration
李欣嵘, 韩承磊, 于建志, 胡克坤, 梁建国, 董文博
Journal of Shandong University of Science and Technology (Natural Science), 2026, Vol. 45, Issue (2): 106-116.
To reduce the memory footprint and computational overhead of training graph neural networks (GNNs) on large-scale graphs, sampling-based GNN training has been widely adopted. However, inefficiencies in the sampling and feature transfer stages limit training speed and scalability. This paper divides node-wise sampling-based GNN training into four stages: sampling, feature transfer, forward propagation, and backward propagation. Evaluation shows that sampling and feature transfer are the primary performance bottlenecks. For multi-GPU (graphics processing unit) environments, this study proposes a series of optimizations. In the sampling stage, a toBlockFast operator is designed and fused with the SampleNeighbors operator to improve sampling efficiency. In the feature transfer stage, an optimization based on pinned (page-locked) memory is combined with multi-process and multi-stream parallel transfer so that feature transfer overlaps efficiently with forward computation. Experimental results show that, on a single GPU, the proposed approach speeds up the sampling stage by 1.52 times compared with the existing Deep Graph Library (DGL) method, and that the overall training speedup reaches 1.80 times once the feature transfer stage is also optimized. With 2, 4, and 8 GPUs, the approach achieves speedups of 1.10, 1.16, and 1.12 times, respectively.
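The idea of fusing neighbor sampling with block construction can be illustrated with a minimal CPU-only sketch. This is not the paper's toBlockFast or SampleNeighbors implementation: the function name `sample_block_fused`, the dictionary-based adjacency format, and the toy graph are all hypothetical, chosen only to show how compact local IDs can be assigned during sampling rather than in a separate relabeling pass.

```python
import random


def sample_block_fused(adj, seeds, fanout, rng):
    """Sample up to `fanout` neighbors per seed and relabel them in the same pass.

    adj:    dict mapping a global node ID to a list of neighbor IDs
            (hypothetical toy format, not a DGL data structure)
    seeds:  list of unique destination node IDs
    fanout: maximum number of neighbors kept per seed
    rng:    random.Random instance, passed in for reproducibility

    Returns (src_nodes, edges): src_nodes[i] is the global ID of local node i,
    and edges holds (local_src, local_dst) pairs.
    """
    # Assumption: destination nodes take the first local IDs,
    # so seed number k always has local ID k.
    local_id = {node: i for i, node in enumerate(seeds)}
    src_nodes = list(seeds)
    edges = []
    for dst_local, dst in enumerate(seeds):
        neighbors = adj.get(dst, [])
        if len(neighbors) > fanout:
            neighbors = rng.sample(neighbors, fanout)
        for src in neighbors:
            # Fused step: assign the compact ID while sampling,
            # instead of scanning the sampled edges a second time.
            if src not in local_id:
                local_id[src] = len(src_nodes)
                src_nodes.append(src)
            edges.append((local_id[src], dst_local))
    return src_nodes, edges


# Toy graph: node 10 has three neighbors, node 30 has four.
adj = {10: [20, 30, 40], 20: [10], 30: [10, 20, 40, 50]}
src_nodes, edges = sample_block_fused(
    adj, seeds=[10, 30], fanout=2, rng=random.Random(0)
)
```

In an unfused pipeline the sampled edge list would be materialized with global IDs and then traversed again to build the ID mapping. The sketch avoids that second traversal, which is the general motivation for this kind of fusion. A GPU version would additionally need a parallel-safe mapping structure, which this sketch does not attempt.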
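Innovation and Development Joint Fund of the Shandong Provincial Natural Science Foundation (ZR2023LZH009)
Open Project of the Shandong Provincial Key Laboratory of Advanced Computing (2025LCJSJ0004)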