To address the low accuracy and weak generalization of field semantic inference in binary protocol reverse engineering, an automatic inference method based on a field semantic inference model for the softmax classification model (FSISC) is proposed. First, the collected protocol data are divided into sessions by IP address and port number. Second, three kinds of gated recurrent units (GRUs) extract features from the fields of known and unknown protocols, the field column context, and the multi-sequence row context. Third, the semantic descriptions of known protocol fields are converted into embedding vectors, the cosine similarity between these vectors is calculated, and the k-means++ algorithm clusters the descriptions by semantic similarity. Finally, a softmax classification model maps the extracted features to the aggregated semantic categories, realizing automated semantic inference for unknown protocols. Experimental results show that the proposed method effectively improves generalization to unknown protocols and achieves semantic inference for four protocols. Compared with the automated field semantics inference method for binary protocol reverse engineering (FSIBP), the average accuracy of semantic inference is improved.
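To automate protocol semantic inference, semantic inference is treated as a classification problem. Each semantic is regarded as a separate class, and the design combines deep learning with machine learning. First, deep learning is used to learn, from known protocols, the contextual features of field data and their correlation with field semantics; machine learning is then used to perform semantic inference on unknown protocols. On this basis, a field semantic inference model for the softmax classification model (FSISC) is proposed; its processing flow is shown in Fig. 1. The process has two main stages, training and testing. In the training stage, the field data of known protocols are first preprocessed to standardize data format and dimensions, and a gated recurrent unit (GRU) deep learning model then extracts field features, capturing the temporal features and latent associations in the data. In parallel, similar semantic descriptions are clustered into well-defined categories, which simplifies the classification task and improves accuracy. In the testing stage, the trained model classifies the field data of an unknown protocol into the corresponding semantic categories, thereby inferring the semantics of the unknown protocol.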
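The final classification step can be illustrated with a minimal sketch. It assumes that a field has already been reduced to a fixed-length feature vector (a stand-in for the GRU output) and that each semantic category is scored by a linear layer before softmax. The function names, weights, and feature values below are hypothetical and are not taken from the FSISC implementation.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, biases):
    """Score every semantic category with a linear layer, then apply softmax.

    features: feature vector of one field (assumed stand-in for GRU output)
    weights:  one weight vector per semantic category (hypothetical values)
    biases:   one bias per semantic category
    """
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Toy example: 3 hypothetical semantic categories, 2-dimensional features.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
label, probs = classify([2.0, 0.5], weights, biases)
```

In a trained model the weights would be learned jointly with the feature extractor; here they are fixed by hand only to show how features map to a category index.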
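Semantic inference can, to some extent, be regarded as a classification problem, and the purpose of semantic category aggregation is to define the classes for classification, that is, to assign labels to samples. The fine-grained semantics of known protocols are aggregated so that multiple similar semantics form one semantic category, and these categories are then used for classification. For known protocols, the semantic description of each field is crawled via the protocol dissector Wireshark; these descriptions are expressed in natural language. In this study, a natural language processing (NLP) model converts the descriptive text of network protocol fields into sentence embeddings, i.e., vector representations in a high-dimensional space. A similarity matrix is then constructed, in which the similarity between sentences is quantified by the distance between their embedding vectors. Finally, a clustering algorithm is applied on the basis of the similarity matrix.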
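The similarity-matrix step can be sketched as follows, using the cosine similarity named in the abstract. The sketch assumes the sentence embeddings have already been produced by some NLP model, which the text does not name; the three-dimensional vectors are made-up placeholders rather than real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(embeddings):
    """Pairwise cosine similarity between all embedding vectors."""
    n = len(embeddings)
    return [[cosine_similarity(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

# Placeholder embeddings for three field descriptions (hypothetical values).
embeddings = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as the first vector
    [0.0, 1.0, 0.0],  # orthogonal to the first two
]
sim = similarity_matrix(embeddings)
```

Descriptions whose embeddings point in the same direction receive a similarity near 1 regardless of vector length, which is why cosine similarity is a common choice for comparing sentence embeddings.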
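The algorithm first sets a range of candidate cluster counts k (e.g., 3 to 20) and runs the k-means++ clustering algorithm for each value of k. Next, it computes the sum of the squared errors (SSE) of the within-cluster distances over all inputs and uses it as the clustering quality metric. Finally, it determines the optimal k from the relationship between SSE and k. Each k yields a corresponding SSE, and SSE usually decreases as k increases; beyond a certain point, however, the rate of decrease slows markedly. This point is the "knee", and the corresponding k is taken as the optimal k [10].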
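The selection procedure above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the k-means++ routine is a small hand-written version, the knee is located with a simplified heuristic (largest gap below the chord joining the first and last points of the normalized SSE curve) rather than the Kneedle algorithm of [10], and the 2-D points and the range of k are toy values chosen for the example.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_init(points, k, rng):
    """k-means++ seeding: each new centre is drawn with probability
    proportional to its squared distance from the nearest chosen centre."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:  # every point already coincides with a centre
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def kmeans_sse(points, k, rng, iters=50):
    """Run Lloyd's iterations from a k-means++ start and return the SSE."""
    centers = kmeanspp_init(points, k, rng)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[idx].append(p)
        # Move each centre to its cluster mean; keep it if the cluster is empty.
        new_centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def elbow_k(ks, sses):
    """Simplified knee heuristic: after scaling both axes to [0, 1], return
    the k whose SSE lies furthest below the chord from first to last point."""
    k_span = ks[-1] - ks[0]
    s_span = sses[0] - sses[-1]
    if k_span == 0 or s_span <= 0:
        return ks[0]
    best_k, best_gap = ks[0], -1.0
    for k, s in zip(ks, sses):
        x = (k - ks[0]) / k_span
        y = (s - sses[-1]) / s_span  # 1 at the first k, 0 at the last
        gap = (1.0 - x) - y          # the chord is y = 1 - x
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k

# Toy data: three well-separated groups of four 2-D points each.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
points = [[cx + dx, cy + dy]
          for cx, cy in [(0, 0), (10, 10), (20, 0)]
          for dx, dy in offsets]

rng = random.Random(0)
ks = list(range(1, 7))
# Keep the best of several restarts for each k to reduce sensitivity to seeding.
sses = [min(kmeans_sse(points, k, rng) for _ in range(5)) for k in ks]
best_k = elbow_k(ks, sses)
```

For real field descriptions the points would be the sentence embeddings (or rows derived from the similarity matrix), and the candidate range would follow the text, for example k from 3 to 20.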
BERMUDEZ I, TONGAONKAR A, ILIOFOTOU M, et al. Towards automatic protocol field inference[J]. Computer Communications, 2016, 84: 40-51.
[6]
CHANDLER J, WICK A, FISHER K. BinaryInferno: a semantic-driven approach to field inference for binary message formats[DB/OL]. (2023-02-27)[2024-04-08].
[7]
WANG Q, SUN Z H, WANG Z Q, et al. A practical format and semantic reverse analysis approach for industrial control protocols[J]. Security and Communication Networks, 2021, 2021(1): No.6690988.
[8]
ZHAN M Q, LI Y, LI B, et al. Toward automated field semantics inference for binary protocol reverse engineering[J]. IEEE Transactions on Information Forensics and Security, 2023, 19: 764-776.
[9]
YE Y P, ZHANG Z, WANG F, et al. NetPlier: probabilistic network protocol reverse engineering from message traces[DB/OL]. (2021-02-21)[2024-04-08].
[10]
SATOPAA V, ALBRECHT J, IRWIN D, et al. Finding a “kneedle” in a haystack: detecting knee points in system behavior[C]//Proceedings of the 2011 31st International Conference on Distributed Computing Systems Workshops. Piscataway, USA: IEEE, 2011: 166-171.