To address performance bottlenecks in cross-architecture binary code representation learning caused by compiler optimizations, architectural variations, and code obfuscation, and to mitigate data scarcity in binary vulnerability detection, the DeepClap framework is proposed. It makes three contributions: a quantized DeepSeek model that generates code explanations; a lightweight residual alignment network that reduces training cost while improving representation fidelity; and a natural language processing (NLP)-bridged vulnerability detection method that links target binary code to source code vulnerability datasets. Experimental results show a 14.8% improvement over the baseline model in area under the receiver operating characteristic curve (AUC) for binary code similarity analysis. In zero-shot evaluation, accuracy improves by 7.1 percentage points, and cross-modal vulnerability retrieval achieves a mean reciprocal rank (MRR) of 0.76 and a recall@1 of 0.73. These results indicate that the framework significantly improves cross-architecture code representation quality and vulnerability detection capability, and that it is particularly effective in data-scarce and zero-shot scenarios.
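CLAP[3] is a breakthrough application of large language models (LLMs) to binary code similarity analysis: it enhances code representations with semantic explanations and markedly improves cross-task transferability. Without any task-specific training, the method is comparable to fully supervised baseline methods[4]. New-generation open-source LLMs such as DeepSeek[5] have lowered the technical barrier and made the generation of high-quality code explanations more efficient, which offers new opportunities to improve CLAP.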
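Despite the notable progress made by CLAP and similar methods, binary code representation learning still faces the following key challenges. First, model transferability is insufficient: in zero-shot scenarios, the accuracy of most methods drops by more than 30% on average. Second, computational resource consumption and efficiency remain prominent problems. Third, binary vulnerability datasets are severely scarce: vendors often patch vulnerabilities silently without disclosing Common Vulnerabilities and Exposures (CVE) identifiers[6], which leaves high-quality datasets in short supply.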
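To address the performance bottlenecks in cross-architecture binary code representation learning and the scarcity of binary vulnerability datasets, this paper proposes the DeepClap framework. It deploys a 32B Q4-quantized DeepSeek model to generate high-quality code explanations, introduces a lightweight residual alignment network for efficient, loosely coupled alignment between binary code and the source language, and performs cross-modal similarity analysis between binary code and source code vulnerability datasets. The framework makes three contributions: a parameter-efficient residual alignment network that lowers training cost and improves representation quality; cross-modal vulnerability detection that alleviates data scarcity; and the use of a quantized large model in place of a complex dataset generation pipeline, which improves system efficiency.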
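As a rough illustration of what a parameter-efficient residual alignment layer can look like, the Python sketch below adds a small bottleneck correction to a frozen embedding and then normalizes the result. The text above does not specify the network's architecture, so the class name `ResidualAligner`, the bottleneck shape, the ReLU activation, and the zero-initialized up-projection are all assumptions made for illustration, not the DeepClap design.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


class ResidualAligner:
    """Hypothetical bottleneck adapter: out = normalize(x + W_up relu(W_down x)).

    Only W_down and W_up would be trained; the encoder that produced x
    is assumed to stay frozen.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Down-projection: small random weights, shape (hidden, dim).
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(hidden)]
        # Up-projection: zeros, shape (dim, hidden), so the layer
        # initially behaves as the identity followed by normalization.
        self.w_up = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, x):
        delta = matvec(self.w_up, relu(matvec(self.w_down, x)))
        return l2_normalize([a + b for a, b in zip(x, delta)])

    def num_params(self):
        hidden, dim = len(self.w_down), len(self.w_up)
        return 2 * dim * hidden


# Usage: align a (made-up) binary-function embedding, then compare it
# with a (made-up) text embedding by cosine similarity.
binary_embedding = [1.0, -2.0, 3.0, 0.5, 0.0, 4.0, -1.0, 2.0]
text_embedding = [0.9, -1.8, 3.1, 0.4, 0.1, 3.9, -1.2, 2.1]
aligner = ResidualAligner(dim=8, hidden=2)
similarity = cosine(aligner(binary_embedding), text_embedding)
```

Because the up-projection starts at zero, the residual branch contributes nothing before training and the aligned embedding equals the normalized input; the trainable weight count is 2 × dim × hidden, which stays small when the bottleneck is narrow.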
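In the data alignment preparation stage, this study uses DWARF (Debugging With Attributed Record Formats) debugging information to extract function information from binary files that contain debug data. DWARF is a widely used debugging information format that precisely records the mapping between binary code and source code. The extraction procedure is as follows.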
[1]
XU X J, LIU C, FENG Q, et al. Neural network-based graph embedding for cross-platform binary code similarity detection[C]∥Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. New York, USA: ACM, 2017:363-376.
[2]
MASSARELLI L, DI LUNA G A, PETRONI F, et al. SAFE: self-attentive function embeddings for binary similarity[M]∥Detection of intrusions and malware, and vulnerability assessment. Cham, Switzerland: Springer, 2019:309-329.
[3]
WANG H, GAO Z Y, ZHANG C, et al. CLAP: learning transferable binary code representations with natural language supervision[C]∥Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, USA: ACM, 2024:503-515.
[4]
WANG H, QU W J, KATZ G, et al. jTrans: jump-aware transformer for binary code similarity detection[C]∥Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, USA: ACM, 2022:1-13.
[5]
DEEPSEEK-AI, GUO D Y, YANG D J, et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning[DB/OL]. (2025-01-22)[2025-02-10].
[6]
LIU B C, MENG G Z, ZOU W, et al. A large-scale empirical study on vulnerability distribution within projects and the lessons learned[C]∥Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. New York, USA: ACM, 2020:1547-1559.
[7]
RAFF E, BARKER J, SYLVESTER J, et al. Malware detection by eating a whole EXE[C]∥Proceedings of the Workshops of the Thirty-Second AAAI Conference on Artificial Intelligence. Washington, USA: AAAI Press, 2018:268-276.
[8]
GUO W B, MU D L, XING X Y, et al. DeepVSA: facilitating value-set analysis with deep learning for postmortem program analysis[C]∥Proceedings of the 28th USENIX Security Symposium. Berkeley, USA: USENIX Association, 2019:1787-1804.
[9]
LIU B C, HUO W, ZHANG C, et al. αDiff: cross-version binary code similarity detection with DNN[C]∥Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. New York, USA: ACM, 2018:667-678.
[10]
LI Y J, GU C J, DULLIEN T, et al. Graph matching networks for learning the similarity of graph structured objects[DB/OL]. (2019-04-29)[2025-02-11].
[11]
DING S H H, FUNG B C M, CHARLAND P. Asm2Vec: boosting static representation robustness for binary clone search against code obfuscation and compiler optimization[C]∥Proceedings of the 2019 IEEE Symposium on Security and Privacy. Piscataway, USA: IEEE, 2019:472-489.
[12]
PEI K X, XUAN Z, YANG J F, et al. Trex: learning execution semantics from micro-traces for binary similarity[DB/OL]. (2020-12-16)[2025-02-11].
[13]
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]∥Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, USA: Curran Associates Inc., 2017:6000-6010.