School of Applied Science, Taiyuan University of Science and Technology, Taiyuan 030024, China
Article history: Received 2023-09-15; Accepted 2024-03-07; Issue Date 2025-10-30
Abstract
The mixed least absolute shrinkage and selection operator (M-Lasso) method can use stochastic prior information while performing variable selection, but the least absolute shrinkage and selection operator (Lasso) on which it is based penalizes every coefficient with the same weight, which may cause some important information to be over-shrunk. To address this issue, this paper proposes the stochastic restricted adaptive Lasso (Ma-Lasso) method. Ma-Lasso assigns different weights to the coefficients and possesses the oracle property; it uses stochastic prior information alongside variable selection, which improves the precision of the estimation. Numerical experiments show that the method attains a smaller mean squared error on sparse models than the competing methods, and it also has advantages in terms of the discovery percentage, the true discovery percentage, and the proportion of times that the true model is selected. Finally, when Ma-Lasso is applied to the quarterly financial-report data and stock price data of Kweichow Moutai, the Bayesian Information Criterion (BIC) value of the model it constructs decreases by about 5% compared with the M-Lasso method, which further verifies its superiority.
Ordinary least squares (OLS) is the most commonly used method for estimating the coefficients of a linear regression model. It estimates the parameters by minimizing the error between the predicted and observed values of the outcome variable. OLS provides unbiased estimates for the data at hand but is prone to overfitting. When the error terms are heteroscedastic or correlated, generalized least squares (GLS) can be used to estimate the parameters; and if the model parameters are subject to linear restrictions, restricted least squares (RLS) can be used instead [1].
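For concreteness, the closed forms of these three estimators can be sketched in a few lines of NumPy. This is an illustration, not the paper's code; the synthetic data, the error covariance Omega, and the restriction R beta = r below are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

# OLS: beta = (X'X)^{-1} X'y, minimizing ||y - X beta||^2
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# GLS: beta = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y, appropriate when the
# errors have covariance Omega (heteroscedastic or correlated); Omega is
# assumed known here purely for illustration.
Omega = np.diag(rng.uniform(0.5, 2.0, size=n))
Oi = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Oi @ X, X.T @ Oi @ y)

# RLS: OLS corrected to satisfy an exact linear restriction R beta = r
# (here beta_1 + beta_2 = -0.5, an assumed example restriction).
R = np.array([[1.0, 1.0, 0.0]])
r = np.array([-0.5])
XtXi = np.linalg.inv(X.T @ X)
beta_rls = beta_ols + XtXi @ R.T @ np.linalg.solve(R @ XtXi @ R.T, r - R @ beta_ols)

print(beta_ols, beta_gls, beta_rls)
```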
However, when multicollinearity exists among the data, the least squares estimator is no longer effective [2-5]. To address this, Hoerl and Kennard proposed ridge regression [6], a biased estimation method capable of analyzing collinear data. It reduces prediction error by shrinking the regression coefficients, thereby mitigating overfitting, but it cannot shrink coefficients exactly to zero and performs no covariate selection unless the tuning parameter tends to infinity. The least absolute shrinkage and selection operator (Lasso) proposed by Tibshirani [7] overcomes precisely this limitation: when the tuning parameter is sufficiently large, some coefficient estimates are forced to zero, achieving variable selection and coefficient estimation simultaneously [8-9]. Lasso is widely applied in bioinformatics and economics [10-12], but for data sets in which the number of features exceeds the number of samples, Lasso can select at most n variables. Zou et al. [13] combined the ridge and Lasso penalties to propose the elastic net, which carries both L1-norm and L2-norm penalty terms and can handle the case where the number of variables exceeds the number of samples. Zou [14] further proposed the adaptive Lasso (A-Lasso) on the basis of Lasso, which introduces data-adaptive weights under the L1-norm penalty and achieves more accurate variable selection and model prediction than Lasso.
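The adaptive Lasso is commonly implemented by reweighting an ordinary Lasso: with weights w_j = 1/|b_init_j|^gamma, solving the weighted problem is equivalent to rescaling column j of X by |b_init_j|^gamma, fitting a plain Lasso, and mapping the coefficients back. Below is a minimal scikit-learn sketch of that two-step trick; the synthetic data, the penalty level alpha, and the exponent gamma are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)
n, p = 120, 8
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

# Plain Lasso: every coefficient is penalized with the same weight.
lasso = Lasso(alpha=0.1).fit(X, y)

# Adaptive Lasso: weights w_j = 1/|b_init_j|^gamma from an initial
# consistent estimate; solve a Lasso on X_j * |b_init_j|^gamma, then
# rescale the solution back to the original parameterization.
gamma = 1.0
b_init = LinearRegression().fit(X, y).coef_
scale = np.abs(b_init) ** gamma
lasso_w = Lasso(alpha=0.1).fit(X * scale, y)
beta_adaptive = lasso_w.coef_ * scale

print(lasso.coef_)
print(beta_adaptive)
```

Small initial coefficients produce large penalties, so noise variables are shrunk to zero more aggressively, while strong signals are penalized lightly; this is what gives the adaptive Lasso its more accurate selection.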
This paper adopts the mean squared error (MSE), the discovery percentage (DP), the true discovery percentage (TDP), and the proportion of times that the true model is selected (PTTS) as evaluation metrics [1].
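The precise formulas follow reference [1]; as a hedged illustration, the sketch below computes the four metrics over simulation replications under the conventional definitions, which are assumptions here: MSE as the average squared estimation error of the coefficients, DP as the share of coefficients selected, TDP as the share of truly nonzero coefficients that are selected, and PTTS as the share of replications recovering exactly the true support.

```python
import numpy as np

def evaluation_metrics(beta_hats, beta_true, tol=1e-8):
    """beta_hats: (R, p) array of estimates over R replications;
    beta_true: (p,) true coefficient vector."""
    beta_hats = np.asarray(beta_hats)
    selected = np.abs(beta_hats) > tol          # estimated support per replication
    true_support = np.abs(beta_true) > 0
    mse = np.mean(np.sum((beta_hats - beta_true) ** 2, axis=1))  # avg ||b_hat - b||^2
    dp = selected.mean()                        # share of coefficients selected
    tdp = selected[:, true_support].mean()      # share of true signals recovered
    ptts = np.mean([np.array_equal(s, true_support) for s in selected])
    return {"MSE": mse, "DP": dp, "TDP": tdp, "PTTS": ptts}

# Example: two replications, true support {0, 2}
b_true = np.array([2.0, 0.0, -1.0, 0.0])
b_hats = np.array([[1.9, 0.0, -0.9, 0.0],
                   [2.1, 0.3, -1.1, 0.0]])
print(evaluation_metrics(b_hats, b_true))
```

Under these readings, a better sparse estimator shows a smaller MSE together with a TDP and PTTS closer to one.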
References
[1] GULER H, GULER E O. Sparsely Restricted Penalized Estimators[J]. Commun Stat Theory Meth, 2021, 50(7): 1656-1670. DOI: 10.1080/03610926.2019.1682164.
[2] DONOHO D L. High-dimensional Data Analysis: The Curses and Blessings of Dimensionality[C]//American Mathematical Society Math Challenges Lecture. Providence, Rhode Island: AMS, 2000: 1-32.
[3] FAN J Q, HAN F, LIU H. Challenges of Big Data Analysis[J]. Natl Sci Rev, 2014, 1(2): 293-314. DOI: 10.1093/nsr/nwt032.
[4] LI X. Sparse Estimation for High-dimensional Data with Applications[D]. Hangzhou: Zhejiang University, 2019.
[5] SU J X. Statistical Analysis of High-dimensional Data Based on Feature Selection[D]. Lanzhou: Lanzhou University, 2018.
[6] HOERL A E, KENNARD R W. Ridge Regression: Biased Estimation for Nonorthogonal Problems[J]. Technometrics, 1970, 12(1): 55-67. DOI: 10.1080/00401706.1970.10488634.
[7] TIBSHIRANI R. Regression Shrinkage and Selection via the Lasso[J]. J R Stat Soc Ser B Methodol, 1996, 58(1): 267-288. DOI: 10.1111/j.2517-6161.1996.tb02080.x.
[8] HU R. Meta Analysis Based on Random Lasso[D]. Beijing: Beijing University of Civil Engineering and Architecture, 2019.
[9] WANG L, SUN J B. Application of Lasso Regression Method in the Selection of Feature Variables[J]. J Jilin Teach Inst Eng Technol, 2021, 37(12): 109-112. DOI: 10.3969/j.issn.1009-9042.2021.12.032.
[10] SU Y T, LYU S Y, XIE W H, et al. A Risk Factor Analysis for Type 2 Diabetes Mellitus Based on LASSO Regression and Random Forest Algorithm[J]. J Environ Hyg, 2023, 13(7): 485-495. DOI: 10.13421/j.cnki.hjwsxzz.2023.07.002.
[11] XING Y. Construction and Application of LASSO Logistic Model for Fat (High) Big Data: Taking Loan Repayment and Diabetes as Examples[D]. Jinan: Shandong University, 2022.
[12] CHE Q Z, WANG J, BAI W G, et al. Study on the State Identification Model of Kidney Yang Deficiency in Osteoporosis Patients Based on LASSO Regression[J]. China J Tradit Chin Med Pharm, 2022, 37(10): 5928-5933.
[13] ZOU H, HASTIE T. Regularization and Variable Selection via the Elastic Net[J]. J R Stat Soc Ser B Stat Methodol, 2005, 67(2): 301-320. DOI: 10.1111/j.1467-9868.2005.00503.x.
[14] ZOU H. The Adaptive Lasso and Its Oracle Properties[J]. J Am Stat Assoc, 2006, 101(476): 1418-1429. DOI: 10.1198/016214506000000735.
[15] THEIL H. On the Use of Incomplete Prior Information in Regression Analysis[J]. J Am Stat Assoc, 1963, 58(302): 401-414. DOI: 10.1080/01621459.1963.10500854.
[16] GULER H, GULER E O. Mixed Lasso Estimator for Stochastic Restricted Regression Models[J]. J Appl Stat, 2021, 48(13-15): 2795-2808. DOI: 10.1080/02664763.2021.1922614.
[17] LIU H W, XU W K. The Recursive Algorithm of Generalized Least Squares Estimation for Linear Model[J]. Nat Sci J Harbin Norm Univ, 2011, 27(3): 29-31. DOI: 10.3969/j.issn.1000-5617.2011.03.009.
[18] CONWAY R N, MITTELHAMMER R C. The Theory of Mixed Estimation in Econometric Modeling[J]. Stud Econ Finance, 1986, 10(1): 79-120. DOI: 10.1108/eb028665.