1. School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China
2. Department of Gynecology, Sun Yat-sen University Cancer Center, South China State Key Laboratory of Oncology, Provincial-Ministry Collaborative Innovation Center for Medical Oncology, Guangzhou 510145, China
3. Guangdong Provincial People's Hospital, Guangzhou 510080, China
Objective To evaluate the performance of a multi-constraint representation learning classification model for identifying ovarian cancer from data with missing laboratory indicators. Methods Tabular data with missing laboratory indicators were collected from 393 patients with ovarian cancer and 1951 control patients. A representation learning classification model based on discriminative learning and mutual information, coupled with feature projection significance score consistency and missing location estimation, was used to project the incomplete laboratory indicator features into a latent space and obtain a classifier. Ablation experiments on the proposed constraint terms were conducted to assess their feasibility and validity in terms of accuracy, area under the ROC curve (AUC), sensitivity, and specificity. Cross-validation with the same four metrics was used to compare the discriminative performance of the model with that of other imputation methods for handling the missing data. Results The ablation experiments showed good compatibility among the constraints, and each constraint was robust. In cross-validation, the proposed multi-constraint representation learning classification model achieved an AUC of 0.915, an accuracy of 0.888, a sensitivity of 0.774, and a specificity of 0.910 for identifying ovarian cancer with missing laboratory indicators, and its AUC and sensitivity were superior to those of the other imputation methods. Conclusion The proposed model shows excellent discriminative ability for identifying ovarian cancer with missing laboratory indicators, with better AUC and sensitivity than other missing-data imputation methods.
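First, a matrix is defined to represent the patient data, where m is the number of patients and n is the number of features. To keep the representation learning process from being affected by the magnitude of the data, each feature in the dataset is first normalized. To learn a latent representation of the data from which complete data can be obtained, a latent space is assumed to exist, and the data matrix is projected into this latent space through a projection matrix to learn shared features, where k is the dimension of the shared features in the latent space. At the same time, a reconstruction matrix maps the latent space back to the data matrix; this back-projection can markedly improve the reliability of the data projection in representation learning [19-21]. The projection and reconstruction processes are defined as follows: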
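As a rough illustration of the projection and reconstruction idea only, the following NumPy sketch normalizes each feature, projects the data into a k-dimensional latent space, and reconstructs it back. It is not the authors' formulation: the names `X`, `M`, `P`, `Q` and `H`, the min-max normalization, the zero-filling of missing entries, the masked squared-error loss, and the plain gradient-descent updates are all assumptions made for this example, and none of the additional constraint terms of the proposed model are included.

```python
import numpy as np


def minmax_normalize(X):
    """Scale each feature (column) to [0, 1]; NaN entries stay NaN."""
    lo = np.nanmin(X, axis=0)
    hi = np.nanmax(X, axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (X - lo) / span


def fit_projection(X, k, lr=0.05, n_iter=2000, seed=0):
    """Learn a projection P (n x k) and a reconstruction Q (k x n).

    Hypothetical objective: mean squared reconstruction error of
    (X0 @ P) @ Q against X, counted on observed entries only.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    M = ~np.isnan(X)              # mask: True where a value was observed
    X0 = np.where(M, X, 0.0)      # zero-fill so matrix products are defined
    P = rng.normal(scale=0.1, size=(n, k))
    Q = rng.normal(scale=0.1, size=(k, n))
    losses = []
    for _ in range(n_iter):
        H = X0 @ P                # latent representation, shape (m, k)
        R = (H @ Q - X0) * M      # residual, zeroed at missing positions
        losses.append(float((R ** 2).sum() / M.sum()))
        grad_Q = H.T @ R / m      # gradient of 0.5 * ||R||^2 / m w.r.t. Q
        grad_P = X0.T @ (R @ Q.T) / m  # ... and w.r.t. P
        P -= lr * grad_P
        Q -= lr * grad_Q
    return P, Q, losses
```

With a normalized matrix `Xn` containing NaN at missing positions, `P, Q, losses = fit_projection(Xn, k=3)` returns the two matrices and the loss history; `np.where(np.isnan(Xn), 0.0, Xn) @ P` then gives the latent representation under these assumptions.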
Dochez V, Caillon H, Vaucel E, et al. Biomarkers and algorithms for diagnosis of ovarian cancer: CA125, HE4, RMI and ROMA, a review[J]. J Ovarian Res, 2019, 12(1): 28.
[6] Li JP, Dowdy S, Tipton T, et al. HE4 as a biomarker for ovarian and endometrial cancer management[J]. Expert Rev Mol Diagn, 2009, 9(6): 555-66.
[7] Guo YY, Jiang TJ, Ouyang LL, et al. A novel diagnostic nomogram based on serological and ultrasound findings for preoperative prediction of malignancy in patients with ovarian masses[J]. Gynecol Oncol, 2021, 160(3): 704-12.
[8] Nijman S, Leeuwenberg AM, Beekers I, et al. Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review[J]. J Clin Epidemiol, 2022, 142: 218-29.
[9] Papageorgiou G, Grant SW, Takkenberg JJM, et al. Statistical primer: how to deal with missing data in scientific research?[J]. Interact Cardiovasc Thorac Surg, 2018, 27(2): 153-8.
[10] Hastie T, Mazumder R, Lee JD, et al. Matrix completion and low-rank SVD via fast alternating least squares[J]. J Mach Learn Res, 2015, 16: 3367-402.
[11] van Buuren S, Groothuis-Oudshoorn K. Mice: multivariate imputation by chained equations in R[J]. J Stat Soft, 2011, 45(3): 1-67.
[12] Qu L, Li L, Zhang Y, et al. PPCA-based missing data imputation for traffic flow volume: a systematical approach[J]. IEEE Trans Intell Transp Syst, 2009, 10(3): 512-22.
[13] Crookston NL, Finley AO. yaImpute: An R package for KNN imputation[J]. J Stat Soft, 2008, 23(10): 1-16.
[14] Stekhoven DJ, Bühlmann P. MissForest: non-parametric missing value imputation for mixed-type data[J]. Bioinformatics, 2012, 28(1): 112-8.
[15] Zhang XM, Yan C, Gao C, et al. Predicting missing values in medical data via XGBoost regression[J]. J Healthc Inform Res, 2020, 4(4): 383-94.
[16] Yoon J, Jordon J, Schaar M. GAIN: missing data imputation using generative adversarial nets[EB/OL]. [2018-06-07].
[17] Du TY, Melis L, Wang T. ReMasker: imputing tabular data with masked autoencoding[EB/OL]. [2023-09-25].
[18] Muzellec B, Josse J, Boyer C, et al. Missing data imputation using optimal transport[EB/OL]. [2020-07-01].
[19] Ning ZY, Lin ZH, Xiao Q, et al. Multi-constraint latent representation learning for prognosis analysis using multi-modal data[J]. IEEE Trans Neural Netw Learn Syst, 2023, 34(7): 3737-50.
[20] Ning ZY, Du DH, Tu C, et al. Relation-aware shared representation learning for cancer prognosis analysis with auxiliary clinical variables and incomplete multi-modality data[J]. IEEE Trans Med Imaging, 2022, 41(1): 186-98.
[21] Ning ZY, Xiao Q, Feng QJ, et al. Relation-induced multi-modal shared representation learning for Alzheimer’s disease diagnosis[J]. IEEE Trans Med Imaging, 2021, 40(6): 1632-45.
[22] Liu Y, Hong XP, Tao XY, et al. Model behavior preserving for class-incremental learning[J]. IEEE Trans Neural Netw Learn Syst, 2023, 34(10): 7529-40.
[23] Yoon JS, Zhang Y, Jordan J, et al. VIME: extending the success of self- and semi-supervised learning to tabular domain[C]//Advances in Neural Information Processing Systems 33, 2020.
[24] Gülmezoglu MB, Edizkan R, Ergin S, et al. Use of center of gravity with the common vector approach in isolated word recognition[J]. Expert Syst Appl, 2018, 38(4): 3690-6.
Antal B, Hajdu A. An ensemble-based system for automatic screening of diabetic retinopathy[J]. Knowl Based Syst, 2014, 60: 20-7.
[27] Cabitza F, Campagner A, Ferrari D, et al. Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests[J]. Clin Chem Lab Med, 2020, 59(2): 421-31.
[28] Dickson ER, Grambsch PM, Fleming TR, et al. Prognosis in primary biliary cirrhosis: model for decision making[J]. Hepatology, 1989, 10(1): 1-7.
[29] Golovenkin SE, Bac J, Chervov A, et al. Trajectories, bifurcations, and pseudo-time in large clinical datasets: applications to myocardial infarction and diabetes data[J]. Gigascience, 2020, 9(11): giaa128.
[30] García-Laencina PJ, Sancho-Gómez JL, Figueiras-Vidal AR. Pattern classification with missing data: a review[J]. Neural Comput Appl, 2010, 19(2): 263-82.
[31] Awan SE, Bennamoun M, Sohel F, et al. A reinforcement learning-based approach for imputing missing data[J]. Neural Comput Appl, 2022, 34(12): 9701-16.
[32] Lin WC, Tsai CF. Missing value imputation: a review and analysis of the literature (2006-2017)[J]. Artif Intell Rev, 2020, 53(2): 1487-509.
[33] Ramos-Pérez I, Barbero-Aparicio JA, Canepa-Oneto A, et al. An extensive performance comparison between feature reduction and feature selection preprocessing algorithms on imbalanced wide data[J]. Information, 2024, 15(4): 223.
[34] Nasir IM, Khan MA, Yasmin M, et al. Pearson correlation-based feature selection for document classification using balanced training[J]. Sensors, 2020, 20(23): 6793.
[35] Berisha V, Krantsevich C, Hahn PR, et al. Digital medicine and the curse of dimensionality[J]. NPJ Digit Med, 2021, 4(1): 153.
[36] Pingi ST, Zhang DY, Bashar MA, et al. Joint representation learning with generative adversarial imputation network for improved classification of longitudinal data[J]. Data Sci Eng, 2024, 9(1): 5-25.
[37] Du WJ, Côté D, Liu Y. SAITS: self-attention-based imputation for time series[J]. Expert Syst Appl, 2023, 219: 119619.
[38] Zhang P, Gao WF, Hu JC, et al. Multi-label feature selection based on high-order label correlation assumption[J]. Entropy, 2020, 22(7): 797.
[39] Fan QC, Liu SC, Zhao CJ, et al. An instance- and label-based feature selection method in classification tasks[J]. Information, 2023, 14(10): 532.
[40] He Q, Li X, Nathan Kim DW, et al. Feasibility study of a multi-criteria decision-making based hierarchical model for multi-modality feature and multi-classifier fusion: applications in medical prognosis prediction[J]. Inf Fusion, 2020, 55: 207-19.
[41] Tayarani-Najaran MH. A novel ensemble machine learning and an evolutionary algorithm in modeling the COVID-19 epidemic and optimizing government policies[J]. IEEE Trans Syst Man Cybern Syst, 2022, 52(10): 6362-72.