Synergizing operator fusion and memory optimization for parallel GNN training acceleration

Xinrong LI ¹, Chenglei HAN ², Jianzhi YU ¹,³, Kekun HU ⁴, Jianguo LIANG ⁵, Wenbo DONG ⁶

Journal of Shandong University of Science and Technology (Natural Science), 2026, Vol. 45, Issue 2: 106-116. DOI: 10.16452/j.cnki.sdkjzk.2026.02.010

Mathematics · Computer Science · Systems Science

Abstract

Sampling-based training is widely used to reduce the memory footprint and computational overhead of training graph neural networks (GNNs) on large-scale graphs. However, inefficiencies in the sampling and feature transfer stages limit training speed and scalability. This paper divides node-sampling GNN training into four stages: sampling, feature transfer, forward propagation, and backward propagation. Profiling shows that sampling and feature transfer are the primary performance bottlenecks. A series of optimizations is proposed for multi-GPU (graphics processing unit) environments. In the sampling stage, a toBlockFast operator is designed and fused with the SampleNeighbors operator to improve sampling efficiency. In the feature transfer stage, an optimization based on pinned (page-locked) memory is combined with multi-process, multi-stream parallel transfer so that feature transfer overlaps with forward computation. Experimental results show that, on a single GPU, the proposed approach speeds up the sampling stage by a factor of 1.52 compared with the existing Deep Graph Library (DGL) implementation; after the feature transfer stage is also optimized, the overall training speedup reaches 1.80 times. On 2, 4, and 8 GPUs, the approach achieves speedups of 1.10, 1.16, and 1.12 times, respectively.
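The idea of fusing neighbor sampling with block construction can be illustrated with a minimal pure-Python sketch. This is not the authors' toBlockFast operator and not DGL code: the function names `sample_then_relabel` and `fused_sample_block`, the adjacency-dictionary layout, and the uniform fanout sampling are all assumptions made for illustration. The first function samples edges with global node IDs and then makes a second pass to compact them into local IDs; the second assigns local IDs while sampling, so the intermediate edge list is never materialized.

```python
import random

def sample_then_relabel(adj, seeds, fanout, rng):
    """Two-pass baseline: sample with global IDs, then compact the IDs."""
    edges = []
    for dst in seeds:
        nbrs = adj.get(dst, [])  # adj[dst] lists the in-neighbors of dst
        picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
        edges.extend((src, dst) for src in picked)
    # Second pass over the sampled edges builds the global-to-local map.
    local = {v: i for i, v in enumerate(seeds)}
    for src, _ in edges:
        local.setdefault(src, len(local))
    block = [(local[s], local[d]) for s, d in edges]
    return block, list(local)

def fused_sample_block(adj, seeds, fanout, rng):
    """Fused variant: local IDs are assigned during sampling (one pass)."""
    local = {v: i for i, v in enumerate(seeds)}  # seeds get IDs 0..len-1
    block = []
    for dst in seeds:
        nbrs = adj.get(dst, [])
        picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
        for src in picked:
            block.append((local.setdefault(src, len(local)), local[dst]))
    # Insertion order of the dict gives the local-to-global node list.
    return block, list(local)

# Hypothetical toy graph; seeds are assumed to be unique.
adj = {10: [20, 30, 40], 20: [10, 30], 30: [10, 20, 40], 40: [10, 30]}
block, input_nodes = fused_sample_block(adj, [10, 20], fanout=10,
                                        rng=random.Random(0))
print(block)        # [(1, 0), (2, 0), (3, 0), (0, 1), (2, 1)]
print(input_nodes)  # [10, 20, 30, 40]
```

With the same random seed both functions return identical blocks, which is the property a fused operator has to preserve. Any speedup in a real system would come from avoiding the extra pass and intermediate buffers on the GPU, which this CPU sketch does not measure.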

Key words

graph neural network training / node sampling / operator fusion / feature transfer / performance optimization

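Cite this article

LI Xinrong, HAN Chenglei, YU Jianzhi, HU Kekun, LIANG Jianguo, DONG Wenbo. Synergizing operator fusion and memory optimization for parallel GNN training acceleration[J]. Journal of Shandong University of Science and Technology (Natural Science), 2026, 45(2): 106-116. DOI: 10.16452/j.cnki.sdkjzk.2026.02.010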

