# Prediction of Diabetes by Model Averaging

Abstract: Diabetes is a major cause of hypertension, dyslipidemia, and cardiovascular and cerebrovascular diseases. In this paper, we use the Pima Indian women diabetes data set and apply a model averaging method based on logistic regression to predict whether a Pima woman will develop diabetes within five years. For model selection, after consulting the literature, three indicators are fixed for modeling: the plasma glucose concentration 2 hours after an oral glucose tolerance test, the 2-hour serum insulin level, and the diabetes pedigree function. The model weights are chosen by the Mallows weight selection criterion. The experimental results show that the prediction error rate of model averaging is lower than that of simple logistic regression.

1. Introduction

2. Model Averaging Based on Logistic Regression

2.1. Logistic Regression

$g\left(z\right)=\frac{1}{1+{\text{e}}^{-z}}$
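As a small illustration of the logistic function above, the following sketch evaluates $g(z)$ in a way that avoids floating-point overflow for large $|z|$. The helper `predict_label` and its 0.5 threshold are assumptions made for illustration only; the paper does not state a classification threshold here.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        # exp(-z) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form exp(z) / (1 + exp(z)).
    e = math.exp(z)
    return e / (1.0 + e)


def predict_label(linear_score, threshold=0.5):
    """Hypothetical helper: map a linear score x^T beta to a 0/1 label."""
    return 1 if sigmoid(linear_score) >= threshold else 0


print(sigmoid(0.0))         # 0.5
print(predict_label(-0.3))  # 0
```

The two branches are algebraically identical; splitting on the sign of $z$ only keeps the argument of `math.exp` non-positive.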

2.2. Model Averaging for Generalized Linear Models

$f\left({y}_{i}|{\theta }_{i},\phi \right)=\mathrm{exp}\left\{\frac{{y}_{i}{\theta }_{i}-b\left({\theta }_{i}\right)}{\phi }+c\left({y}_{i},\phi \right)\right\}$

${\stackrel{^}{\beta }}_{\left(\omega \right)}={\sum }_{s=1}^{S}{\omega }_{s}{\stackrel{^}{\beta }}_{\left(s\right)}$
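The averaged coefficient vector is a weighted sum of the candidate-model estimates. The sketch below assumes that each candidate model uses a subset of the $p$ covariates and that coefficients of excluded covariates are set to zero before averaging; this zero-padding convention, the function names, and the numbers are illustrative assumptions, not values from the paper.

```python
def pad_coefficients(indices, coefs, p):
    """Place a candidate model's coefficients into a length-p vector, zeros elsewhere."""
    full = [0.0] * p
    for j, c in zip(indices, coefs):
        full[j] = c
    return full


def average_coefficients(weights, models, p):
    """Weighted sum over candidate models of their zero-padded coefficient vectors."""
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be non-negative and sum to 1")
    padded = [pad_coefficients(idx, coefs, p) for idx, coefs in models]
    return [sum(w * b[j] for w, b in zip(weights, padded)) for j in range(p)]


# Two hypothetical candidate models over p = 3 covariates.
models = [
    ([0, 1], [1.0, 2.0]),          # model 1 uses covariates 0 and 1
    ([0, 1, 2], [0.5, 1.0, 4.0]),  # model 2 uses all three covariates
]
beta_avg = average_coefficients([0.25, 0.75], models, p=3)
print(beta_avg)  # [0.625, 1.25, 3.0]
```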

Let ${\theta }_{0}$ denote the true value of $\theta$. The model averaging estimator of ${\theta }_{0}$ is:

$\theta \left({\stackrel{^}{\beta }}_{\left(\omega \right)}\right)={\left[{\theta }_{1}\left({\stackrel{^}{\beta }}_{\left(\omega \right)}\right),\cdots ,{\theta }_{n}\left({\stackrel{^}{\beta }}_{\left(\omega \right)}\right)\right]}^{\text{T}}=X{\stackrel{^}{\beta }}_{\left( \omega \right)}$

The Mallows weight selection criterion is:

$G\left(\omega \right)=2{\phi }^{-1}B\left({\stackrel{^}{\beta }}_{\left(\omega \right)}\right)-2{\phi }^{-1}{y}^{\text{T}}\theta \left({\stackrel{^}{\beta }}_{\left(\omega \right)}\right)+{\lambda }_{n}{\omega }^{\text{T}}k$

$\stackrel{^}{\omega }=\mathrm{arg}{\mathrm{min}}_{\omega }G\left(\omega \right)$

2.3. Model Averaging Based on Logistic Regression

$f\left(y;p\right)=\mathrm{exp}\left\{ya-\mathrm{log}\left(1+{\text{e}}^{a}\right)\right\}$
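To connect this density with the exponential-family form of Section 2.2, the terms can be matched one by one. The step below assumes that $a$ is the log-odds, $a=\mathrm{log}\left\{p/\left(1-p\right)\right\}$, which is the standard identification for a Bernoulli response but is an assumption about the notation here:

$f\left(y;p\right)={p}^{y}{\left(1-p\right)}^{1-y}=\mathrm{exp}\left\{y\,\mathrm{log}\frac{p}{1-p}+\mathrm{log}\left(1-p\right)\right\}$

Since $1-p=1/\left(1+{\text{e}}^{a}\right)$, we have $\mathrm{log}\left(1-p\right)=-\mathrm{log}\left(1+{\text{e}}^{a}\right)$, so that

$\theta =a,\quad b\left(\theta \right)=\mathrm{log}\left(1+{\text{e}}^{\theta }\right),\quad \phi =1,\quad c\left(y,\phi \right)=0$

Differentiating gives ${b}^{\prime }\left(a\right)={\text{e}}^{a}/\left(1+{\text{e}}^{a}\right)=g\left(a\right)=p$, which recovers the logistic function of Section 2.1 as the mean of $y$.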

$G\left(\omega \right)=2{\sum }_{i=1}^{n}\mathrm{log}\left(1+\mathrm{exp}\left({x}_{i}^{\text{T}}{\stackrel{^}{\beta }}_{\left(\omega \right)}\right)\right)-2{\sum }_{i=1}^{n}{y}_{i}{x}_{i}^{\text{T}}{\stackrel{^}{\beta }}_{\left(\omega \right)}+2{\omega }^{\text{T}}k$
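A minimal sketch of how such a criterion could be evaluated and minimized is given below. It takes the general criterion of Section 2.2 with $b\left(\theta \right)=\mathrm{log}\left(1+{\text{e}}^{\theta }\right)$, $\phi =1$ and ${\lambda }_{n}=2$, and reads $k$ as the vector of parameter counts of the candidate models; that reading of $k$ is an assumption. The data, the two candidate coefficient vectors, and the grid search over two weights are hypothetical and chosen only to keep the example self-contained; they are not the paper's data or optimizer.

```python
import math


def log1pexp(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def averaged_beta(weights, betas):
    """Weighted sum of (already zero-padded) candidate coefficient vectors."""
    p = len(betas[0])
    return [sum(w * b[j] for w, b in zip(weights, betas)) for j in range(p)]


def mallows_criterion(weights, betas, sizes, X, y, lam=2.0):
    """G(w) = 2*sum_i log(1+exp(x_i'b)) - 2*sum_i y_i*x_i'b + lam * w'k, with phi = 1."""
    beta = averaged_beta(weights, betas)
    fit = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(xj * bj for xj, bj in zip(x_i, beta))
        fit += 2.0 * log1pexp(eta) - 2.0 * y_i * eta
    penalty = lam * sum(w * k for w, k in zip(weights, sizes))
    return fit + penalty


def grid_search_two_models(betas, sizes, X, y, steps=100):
    """Search weights (w1, 1 - w1) on an even grid of the unit interval."""
    best_w, best_g = None, float("inf")
    for m in range(steps + 1):
        w = (m / steps, 1.0 - m / steps)
        g = mallows_criterion(w, betas, sizes, X, y)
        if g < best_g:
            best_w, best_g = w, g
    return best_w, best_g


# Hypothetical design matrix (intercept plus two covariates) and 0/1 responses.
X = [[1.0, 0.2, 1.1], [1.0, -0.5, 0.3], [1.0, 1.5, -0.7],
     [1.0, 0.9, 0.4], [1.0, -1.2, -0.2], [1.0, 0.1, 1.8]]
y = [1, 0, 1, 1, 0, 0]
# Hypothetical fitted coefficients of two candidate models, zero-padded to length 3.
betas = [[-0.5, 1.0, 0.0], [-0.4, 0.8, 0.6]]
sizes = [2, 3]  # number of parameters in each candidate model

best_w, best_g = grid_search_two_models(betas, sizes, X, y)
print(best_w, best_g)
```

With more than two candidate models, the grid search would be replaced by a constrained optimizer over the weight simplex; the criterion function itself is unchanged.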

3. Prediction of Populations at High Risk of Diabetes

3.1. Data Source

3.2. Data Preprocessing

3.3. Modeling Based on Model Averaging

Table 1. List of alternative models

Table 2. Weight ω

3.4. Prediction by Model Averaging

Table 3. Comparison of error rates between model averaging and logistic regression

4. Conclusion

[1] Zhang, X., Yu, D., Zou, G., et al. (2016) Optimal Model Averaging Estimation for Generalized Linear Models and Generalized Linear Mixed-Effects Models. Journal of the American Statistical Association, 111, 1775-1790.
https://doi.org/10.1080/01621459.2015.1115762

[2] Zhang, X. and Zou, G. (2011) Model Averaging Methods and Their Application in Prediction. Statistical Research (统计研究), No. 6, 97-102. (In Chinese)
