Accurate identification of glycation sites is essential to understanding the molecular mechanism of glycation. Because traditional experimental methods are laborious and time-consuming, computational methods for predicting glycation sites are urgently needed. This paper designs a new fuzzy support vector machine (FSVM) algorithm that widens the weight gap between important features and weakly correlated features and also takes the distribution within the samples into account. The algorithm deals effectively with noisy data in the prediction of glycation sites. Combining the proposed algorithm with the Bi-Profile Bayes (BPB) feature extraction method, a new lysine glycation site predictor, FSVM_GlySite, is constructed. Ten-fold cross-validation shows that FSVM_GlySite outperforms several existing glycation site predictors.
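At present, the molecular mechanism of glycation remains largely unknown. Understanding it better requires accurate identification of glycation substrates and their corresponding glycation sites. Large-scale proteomic methods such as mass spectrometry have been used to detect glycation sites [10-11]. However, traditional experimental methods are costly, time-consuming, and labor-intensive, which has considerably slowed progress in related research. Computational methods for protein glycation have therefore attracted growing attention, and many researchers have used machine learning algorithms to predict glycation sites. Johansen et al. [12] used an artificial neural network to build NetGlycate, the first predictor of lysine glycation sites; it obtained a Matthews correlation coefficient of 0.77 and an AUC of 0.58, demonstrating that machine learning is a feasible approach to predicting protein glycation sites. Liu et al. [13] proposed PreGly, a computational method for predicting glycation sites that extracts features from amino acid factors, amino acid occurrence frequencies, and the composition of k-spaced amino acid pairs, selects features with maximum relevance minimum redundancy (mRMR), and obtains its best model at k = 4. Xu et al. [14] developed a predictor named Gly-PseAAC, which uses position-specific amino acid propensity (PSAAP) to extract the information contained in a protein and then applies a support vector machine (SVM) to predict glycation sites; the PSAAP features proved effective for determining whether a lysine residue is glycated. Ju et al. [15] proposed a Bi-Profile Bayes (BPB) feature extraction approach combined with an SVM, and its predictions were better than those of the methods above.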
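However, the classification accuracy of the standard SVM degrades when the data contain outliers or noise. Lin et al. [16] therefore proposed the fuzzy support vector machine (FSVM), which assigns each sample its own membership degree and thereby reduces the influence of outliers and noise on classification accuracy. Building on this idea, researchers have proposed further ways of designing the membership function. Reference [17] combined sample uncertainty with the distance between a sample and its class center to obtain an improved FSVM based on information entropy, which achieves high classification accuracy on imbalanced data sets. Li Cunhe et al. [18] introduced a parameter that adjusts the distance between the separating hyperplane and the samples, so that high classification accuracy is maintained even when the samples are unevenly distributed. Wang et al. [19] proposed an FSVM based on centered kernel alignment. Zuo Yuhao et al. [20] proposed an FSVM with Relief-F feature weighting, which improves classification efficiency by assigning both sample weights and feature weights. Building on reference [15], this paper proposes an FSVM algorithm based on two-step feature weighting. First, the information gain algorithm is used to obtain the feature weights. Next, the feature with the largest information gain is selected and its Spearman correlation coefficient with each remaining feature is computed; for each remaining feature, the largest feature weight is multiplied by that feature's correlation coefficient and the product is added to the feature's original weight, giving the new feature weights. Finally, the resulting feature weights are applied in the distance computation of the membership function and in the construction of the kernel function, and sample affinity is also considered, so that the membership function is further corrected according to the distribution within the samples. Combining this algorithm with BPB feature extraction, this paper proposes FSVM_GlySite, a method for predicting lysine glycation sites, and evaluates it with ten-fold cross-validation. The results show that FSVM_GlySite outperforms several commonly used existing predictors.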
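The first two steps of the feature weighting described above (information gain, then a Spearman-based adjustment) can be illustrated with a minimal sketch. It is not the authors' implementation. It rests on several assumptions: continuous features are discretised into equal-width bins before information gain is computed (the text does not say how this is done), the signed Spearman coefficient is used as written, and all function names (`information_gain`, `spearman`, `two_step_weights`) are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels, bins=2):
    """Information gain of one feature column with respect to the labels.

    Assumption: the column is discretised into equal-width bins first.
    """
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins
    if width == 0:  # constant column: every value falls in the same bin
        width = 1.0
    binned = [min(int((v - lo) / width), bins - 1) for v in column]
    gain = entropy(labels)
    for b in set(binned):
        subset = [y for y, bb in zip(labels, binned) if bb == b]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    # A constant column has no rank variance; treat it as uncorrelated.
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def two_step_weights(columns, labels, bins=2):
    """Return (base weights, adjusted weights, index of the top feature).

    Step 1: base weight of each feature = its information gain.
    Step 2: for every feature j other than the top one,
            adjusted[j] = base[j] + base[top] * spearman(top, j).
    """
    base = [information_gain(col, labels, bins) for col in columns]
    top = max(range(len(base)), key=base.__getitem__)
    adjusted = list(base)
    for j, col in enumerate(columns):
        if j != top:
            adjusted[j] = base[j] + base[top] * spearman(columns[top], col)
    return base, adjusted, top
```

Here `columns` is a list of feature columns (one list of values per feature) and `labels` holds the class of each sample. The adjusted weights would then enter the distance used by the membership function and the kernel function; that part is not sketched here because this section does not give those formulas.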
[1]
Miller A K, Hambly D M, Kerwin B A, et al. Characterization of site-specific glycation during process development of a human therapeutic monoclonal antibody[J]. Journal of Pharmaceutical Sciences, 2011, 100(7): 2543-2550.
[2]
Cho S J, Roman G, Yeboah F, et al. The road to advanced glycation end products: a mechanistic perspective[J]. Current Medicinal Chemistry, 2007, 14(15): 1653-1671.
[3]
Lapolla A, Fedele D, Martano L, et al. Advanced glycation end products: a highly complex set of biologically relevant compounds detected by mass spectrometry[J]. Journal of Mass Spectrometry, 2001, 36(4): 370-378.
[4]
Garlick R L, Mazer J S. The principal site of nonenzymatic glycosylation of human serum albumin in vivo[J]. Journal of Biological Chemistry, 1983, 258(10): 6142-6146.
[5]
Shilton B H, Walton D J. Sites of glycation of human and horse liver alcohol dehydrogenase in vivo[J]. Journal of Biological Chemistry, 1991, 266(9): 5587-5592.
[6]
Schleicher E, Wieland O H. Kinetic analysis of glycation as a tool for assessing the half-life of proteins[J]. Biochimica et Biophysica Acta (BBA) - General Subjects, 1986, 884(1): 199-205.
[7]
Ahmed N, Babaei-Jadidi R, Howell S K, et al. Degradation products of proteins damaged by glycation, oxidation and nitration in clinical type 1 diabetes[J]. Diabetologia, 2005, 48(8): 1590-1603.
[8]
Agalou S, Ahmed N, Babaei-Jadidi R, et al. Profound mishandling of protein glycation degradation products in uremia and dialysis[J]. Journal of the American Society of Nephrology, 2005, 16(5): 1471-1485.
[9]
Ling X, Sakashita N, Takeya M, et al. Immunohistochemical distribution and subcellular localization of three distinct specific molecular structures of advanced glycation end products in human tissues[J]. Laboratory Investigation, 1998, 78(12): 1591-1606.
[10]
Zhang Q B, Ames J M, Smith R D, et al. A perspective on the Maillard reaction and the analysis of protein glycation by mass spectrometry: probing the pathogenesis of chronic disease[J]. Journal of Proteome Research, 2009, 8(2): 754-769.
[11]
Thornalley P J, Rabbani N. Detection of oxidized and glycated proteins in clinical samples using mass spectrometry: a user’s perspective[J]. Biochimica et Biophysica Acta (BBA) - General Subjects, 2014, 1840(2): 818-829.
[12]
Johansen M B, Kiemer L, Brunak S. Analysis and prediction of mammalian protein glycation[J]. Glycobiology, 2006, 16(9): 844-853.
[13]
Liu Y, Gu W X, Zhang W Y, et al. Predict and analyze protein glycation sites with the mRMR and IFS methods[EB/OL]. (2015-04-15)[2022-04-10].
[14]
Xu Y, Li L, Ding J, et al. Gly-PseAAC: identifying protein lysine glycation through sequences[J]. Gene, 2017, 602(4): 1-7.
[15]
Ju Z, Sun J H, Li Y J, et al. Predicting lysine glycation sites using bi-profile Bayes feature extraction[J]. Computational Biology and Chemistry, 2017, 71(2): 98-103.
[16]
Lin C F, Wang S D. Fuzzy support vector machines[J]. IEEE Transactions on Neural Networks, 2002, 13(2): 464-471.
[21]
Liu Z X, Wang Y B, Gao T S, et al. CPLM: a database of protein lysine modifications[J]. Nucleic Acids Research, 2014, 42(1): 531-536.
[22]
Shao J L, Xu D, Tsai S N, et al. Computational identification of protein methylation sites through bi-profile Bayes feature extraction[J]. PLoS One, 2009, 4(3): 4920.
[23]
Ju Z, Gu H. Predicting pupylation sites in prokaryotic proteins using semi-supervised self-training support vector machine algorithm[J]. Analytical Biochemistry, 2016, 507(5): 1-6.
Veropoulos K, Campbell C, Cristianini N. Controlling the sensitivity of support vector machines[J]. Proceedings of the International Joint Conference on Artificial Intelligence, 1999, 12(2): 55-60.