
# Improved Box-Cox Transform Based on Horizontal Mirror Algorithm

Abstract: For data with a negatively skewed distribution, this paper proposes an improved Box-Cox transform based on a horizontal mirror algorithm, the mirror Box-Cox transform, and carries out numerical experiments. The experimental results show that the mirror Box-Cox transform performs as well as the traditional Box-Cox transform in general, and that it handles negatively skewed data better than the traditional Box-Cox transform does. A simulated regression model experiment is then carried out. Its results show that the fitting and prediction performance of a regression model built on data transformed by the mirror Box-Cox transform is improved, and exceeds that of a model built on data transformed by the traditional Box-Cox transform.

1. Introduction

1.1. Research Background

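The Box-Cox transform is a parametric generalized power transformation proposed by George Box and David Cox in 1964 [2]. Its main feature is the introduction of a parameter $\lambda$, which is estimated from the data itself and thereby determines the form of transformation to apply [3]. It is commonly used to stabilize variance, reduce the non-normality of data in statistical modeling, and improve the validity of measures of association.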

1.2. Normality Tests and Regression Model Evaluation Metrics

1) Shapiro-Wilk test [4] (W test)

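The W test checks whether data follow a normal distribution. It yields a statistic $W$ that behaves like a correlation coefficient: the closer it is to 1, the better the data fit a normal distribution. The W test also gives a P value; if the P value is greater than 0.05, the hypothesis that the data are normally distributed cannot be rejected. If $W$ is close to 1 but the P value is less than 0.05, we still reject normality. The W statistic is computed as: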

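$W=\frac{\left(\sum_{i=1}^{n} a_i y_{(i)}\right)^{2}}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^{2}}$

where $y_{(i)}$ is the $i$-th order statistic (the $i$-th smallest observation), $\bar{y}$ is the sample mean, and the $a_i$ are the constant coefficients of the test.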
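As a minimal sketch of how the W test decision rule described above can be applied, the following uses `scipy.stats.shapiro`. The choice of SciPy, the synthetic samples, the seed, and the helper name `is_normal_by_w_test` are assumptions made for illustration; the paper does not state which software it used.

```python
import numpy as np
from scipy import stats

def is_normal_by_w_test(sample, alpha=0.05):
    """Run the Shapiro-Wilk test (hypothetical helper, not from the paper).

    Returns (W, p_value, decision), where decision is True when
    normality cannot be rejected at significance level alpha.
    """
    w_stat, p_value = stats.shapiro(sample)
    return float(w_stat), float(p_value), bool(p_value > alpha)

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=10.0, scale=2.0, size=200)
skewed_sample = rng.exponential(scale=2.0, size=200)

w_norm, p_norm, ok_norm = is_normal_by_w_test(normal_sample)
w_skew, p_skew, ok_skew = is_normal_by_w_test(skewed_sample)

print(f"normal sample: W={w_norm:.4f}, p={p_norm:.4f}, not rejected={ok_norm}")
print(f"skewed sample: W={w_skew:.4f}, p={p_skew:.4g}, not rejected={ok_skew}")
```

The decision is based on the P value alone, which matches the rule in the text: a W close to 1 does not override a P value below 0.05.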

2) MAPE [5] (Mean Absolute Percentage Error)

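MAPE is commonly used to describe accuracy. Because it is a percentage, it is easier to interpret than many other statistics. The smaller the MAPE, the more accurate the prediction model. It is defined as: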

$\text{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right|\times 100\%$
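The MAPE formula translates directly into a few lines of NumPy. The sketch below is illustrative only: the function name `mape` and the toy observed and predicted values are assumptions, and the zero check is an added safeguard because the formula divides by $y_i$.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        # The formula divides by each observed value, so zeros are not allowed.
        raise ValueError("MAPE is undefined when any observed value is zero")
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# Toy values for illustration only.
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
example_mape = mape(observed, predicted)
print(f"MAPE = {example_mape:.4f}%")
```

Here the three relative errors are 0.1, 0.1 and 0, so the mean is 0.2/3 and the MAPE is about 6.67%.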

2. Box-Cox Transform

When $y>0$, the Box-Cox transform applies the following transformation to the original data:

$y^{(\lambda)}=\begin{cases}\dfrac{y^{\lambda}-1}{\lambda}, & \lambda\ne 0\\ \log y, & \lambda=0\end{cases}$ (1)

$y^{(\lambda)}=\begin{cases}\dfrac{(y+\beta)^{\lambda}-1}{\lambda}, & \lambda\ne 0\\ \log(y+\beta), & \lambda=0\end{cases}$ (2)

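$\lambda$ is a transformation parameter to be determined. Different values of $\lambda$ give different transformations, so this is a family of transformations. We call (1) the basic formula of the Box-Cox transform and (2) its extended formula.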
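To make formulas (1) and (2) concrete, here is a minimal sketch that implements them by hand and compares the result with `scipy.stats.boxcox`, which estimates $\lambda$ by maximum likelihood. The helper name `box_cox`, the lognormal test data, and the use of SciPy for the estimate are assumptions for illustration, not the procedure reported in the paper.

```python
import numpy as np
from scipy import stats

def box_cox(y, lam, beta=0.0):
    """Extended Box-Cox transform, formula (2); beta=0 reduces to formula (1)."""
    shifted = np.asarray(y, dtype=float) + beta
    if np.any(shifted <= 0):
        # Both the power and the logarithm branch need a positive argument.
        raise ValueError("y + beta must be strictly positive")
    if lam == 0:
        return np.log(shifted)
    return (shifted ** lam - 1.0) / lam

# Synthetic, strictly positive, positively skewed data (illustration only).
rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=0.8, size=500)

# SciPy returns the transformed data and the maximum-likelihood lambda.
transformed, lam_hat = stats.boxcox(y)
manual = box_cox(y, lam_hat)

print(f"estimated lambda: {lam_hat:.4f}")
print(f"skewness before: {stats.skew(y):.4f}, after: {stats.skew(transformed):.4f}")
```

With the estimated $\lambda$ plugged in, the hand-written formula and the library call should agree up to floating-point error.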

3. Mirror Box-Cox Transform

$y^{(\lambda)}=\begin{cases}\dfrac{y^{\lambda}-1}{\lambda}, & \lambda\ne 0\\ \log y, & \lambda=0\end{cases}$

$\left(y^{(\lambda)}\right)''<0 \;\Rightarrow\; \begin{cases}\lambda(\lambda-1)y^{\lambda}<0, & \lambda\ne 0\\ -\dfrac{1}{y^{2}}<0, & \lambda=0\end{cases} \;\Rightarrow\; \lambda\in[0,1)$

For all $y_i\in\mathbb{R}$, the mirror Box-Cox transform applies the following transformation to the original data:

$y^{(\lambda)}=\begin{cases}\alpha\cdot\dfrac{(\alpha\cdot y+\beta)^{\lambda}-1}{\lambda}, & \lambda\ne 0\\ \alpha\cdot\log(\alpha\cdot y+\beta), & \lambda=0\end{cases}$ (3)
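Formula (3) can be sketched in code as follows. Several points here are assumptions, since this section does not define them: $\alpha$ is taken to be either $1$ (no mirroring) or $-1$ (horizontal mirroring), $\beta$ is chosen as a shift that makes $\alpha\cdot y+\beta$ strictly positive, and $\lambda$ is estimated by running `scipy.stats.boxcox` on the mirrored, shifted data. The function name `mirror_box_cox` and the synthetic data are likewise hypothetical.

```python
import numpy as np
from scipy import stats

def mirror_box_cox(y, lam, alpha, beta):
    """Formula (3). Assumes alpha is +1 or -1 (an assumption, see lead-in)."""
    if alpha not in (1, -1):
        raise ValueError("this sketch assumes alpha is +1 or -1")
    inner = alpha * np.asarray(y, dtype=float) + beta
    if np.any(inner <= 0):
        raise ValueError("alpha * y + beta must be strictly positive")
    if lam == 0:
        return alpha * np.log(inner)
    return alpha * (inner ** lam - 1.0) / lam

# Synthetic negatively skewed data (illustration only).
rng = np.random.default_rng(2)
y = 10.0 - rng.lognormal(mean=0.0, sigma=0.6, size=500)

alpha = -1                 # mirror the data horizontally
beta = y.max() + 1.0       # hypothetical shift so that -y + beta >= 1

# Estimate lambda on the mirrored, shifted data (assumed procedure).
_, lam_hat = stats.boxcox(alpha * y + beta)
z = mirror_box_cox(y, lam_hat, alpha, beta)

print(f"estimated lambda: {lam_hat:.4f}")
print(f"skewness before: {stats.skew(y):.4f}, after: {stats.skew(z):.4f}")
```

Under these assumptions, the outer factor $\alpha$ flips the result back after the inner mirroring, so the transformed values keep the same ordering as the original data. With $\alpha=1$ the function reduces to the extended formula (2).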

4. Numerical Experiments

4.1. Symbol Description

Table 1. Symbol description

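4.2. Experimental Results

4.2.1. Graphical Results of the Data Normality Tests

4.2.2. Hypothesis Test Results for Data Normality

5. Regression Model Simulation

5.1. Experimental Analysis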

Figure 1. Histograms and P-P plots of the three types of data

Table 2. Data normality test results and optimal parameter ($\lambda$)

5.2. Experimental Results

Table 3. Data normality test results and model fitting effect evaluation

6. Conclusion

[1] Zhang, Y.L. (2002) Handling Non-Normal Data. China Quality, No. 8, 22-24. (In Chinese)

[2] Box, G. and Cox, D. (1964) An Analysis of Transformations (with Discussion). Journal of the Royal Statistical Society, Series B, 26, 211-252.
https://doi.org/10.1111/j.2517-6161.1964.tb00553.x

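[3] Wang, S.G., Chen, M. and Chen, L.P. (1999) Linear Statistical Models: Linear Regression and Analysis of Variance. Higher Education Press, Beijing, 52-55. (In Chinese)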

[4] Shapiro, S.S. and Wilk, M.B. (1965) An Analysis of Variance Test for Normality (Complete Samples). Biometrika, 52, 591-611.
https://doi.org/10.1093/biomet/52.3-4.591

[5] Hyndman, R.J. and Koehler, A.B. (2006) Another Look at Measures of Forecast Accuracy. International Journal of Forecasting, 22, 679-688.
https://doi.org/10.1016/j.ijforecast.2006.03.001

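[6] Zhong, D.H. and Liu, B. (1993) Research on Parameter Estimation Methods for the Box-Cox Transformation Model. Journal of Systems Engineering, 8, 40-46. (In Chinese)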

[7] Hu, H.C., Fan, X.H., et al. (2006) Generalized Box-Cox Transformation. Journal of Zhoukou Normal University, 23, 17-18. (In Chinese)

[8] Azzalini, A. and Capitanio, A. (1999) Statistical Applications of the Multivariate Skew-Normal Distribution. Journal of the Royal Statistical Society: Series B, 61, 579-602.
https://doi.org/10.1111/1467-9868.00194

[9] Mao, S.S. and Zhou, J.X. (2013) Probability Theory and Mathematical Statistics. China Statistics Press, Beijing, 260-262, 420-422. (In Chinese)

[10] Rigby, R.A. and Stasinopoulos, D.M. (2004) Smooth Centile Curves for Skew and Kurtotic Data Modelled Using the Box-Cox Power Exponential Distribution. Statistics in Medicine, 23, 3053-3076.
https://doi.org/10.1002/sim.1861
