# Research on Clustering Method Based on Relationship Structure of Panel Data

Abstract: This paper studies clustering methods for panel data and proposes a method based on the structural relationship between the influencing variables and the response variable of the panel. Linear relationships, nonlinear relationships, and multi-index measures based on trajectory features and shape features are discussed in turn. Data with the same structural relationship are assigned to the same class and data with different relationship structures to different classes, so that structural relationships and trajectory characteristics are the same or similar within a class and differ markedly between classes.

1. Problem Statement

$y_{it}=\sum_{j=1}^{p}\beta_{i,j}\,y_{i,t-j}+\alpha_{i}+\epsilon_{it},\quad |\beta_{i}|<1$ (1)
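As an illustration of model (1), the coefficients of a single unit $i$ can be estimated by ordinary least squares on its own lagged values. The sketch below is only one possible estimator, not necessarily the one intended here; the function name `fit_ar_ols` and its arguments are hypothetical.

```python
import numpy as np

def fit_ar_ols(y, p):
    """Estimate alpha_i and (beta_{i,1}, ..., beta_{i,p}) of model (1)
    for one unit's series y by ordinary least squares (an assumed estimator)."""
    y = np.asarray(y, dtype=float)
    T = y.size
    if T < 2 * p + 1:
        raise ValueError("series too short for the requested lag order")
    # Row for time t holds [1, y_{t-1}, ..., y_{t-p}], for t = p, ..., T-1.
    lags = [y[p - j:T - j] for j in range(1, p + 1)]
    X = np.column_stack([np.ones(T - p)] + lags)
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef[0], coef[1:]
```

Applying `fit_ar_ols` to each unit in turn gives one coefficient vector per unit, which is the kind of per-unit structural summary that the later sections compare.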

2. Panel Data Clustering Based on Parametric Relationships

2.1. Similarity Measure for Panel Data Based on Linear Relationships

${y}_{it}^{\left(1\right)}=f\left({y}_{it}^{\left(2\right)},{y}_{it}^{\left(3\right)},\cdots ,{y}_{it}^{\left(m\right)}\right)+{\epsilon }_{it}$, (2)

$r_{il}^{(1)}=\frac{\mathrm{cov}\left(y_{it}^{(1)},y_{it}^{(l)}\right)}{\sqrt{\mathrm{var}\left(y_{it}^{(1)}\right)}\sqrt{\mathrm{var}\left(y_{it}^{(l)}\right)}},\quad l=2,\cdots,m$

$\mathrm{cov}\left(y_{it}^{(1)},y_{it}^{(l)}\right)=\frac{1}{T}\sum_{t=1}^{T}\left(y_{it}^{(1)}-\bar{y}_{i}^{(1)}\right)\left(y_{it}^{(l)}-\bar{y}_{i}^{(l)}\right)$,

$\bar{y}_{i}^{(1)}=\frac{1}{T}\sum_{t=1}^{T}y_{it}^{(1)},\quad \bar{y}_{i}^{(l)}=\frac{1}{T}\sum_{t=1}^{T}y_{it}^{(l)}$.

$G_{i}=\left(\left(y_{i}^{(1)}\right)',\left(y_{i}^{(2)}\right)',\cdots,\left(y_{i}^{(m)}\right)'\right),\quad i=1,2,\cdots,N$,

$y_{i}^{(k)}=\left(y_{i1}^{(k)},\cdots,y_{iT}^{(k)}\right),\quad t=1,2,\cdots,T$.

${d}_{ij}=\sqrt{\left({r}_{i}^{\left(1\right)}-{r}_{j}^{\left(1\right)}\right){\left({r}_{i}^{\left(1\right)}-{r}_{j}^{\left(1\right)}\right)}^{\prime }}$, (3)
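The correlation profile and the distance in (3) can be sketched as follows. This is a minimal illustration that assumes the panel is stored as an array of shape `(N, T, m)` with the response $y^{(1)}$ in column 0; the function names and the array layout are assumptions, not part of the original method description.

```python
import numpy as np

def correlation_profile(panel_i):
    """Return r_i = (r_{i2}, ..., r_{im}) for one unit.

    panel_i has shape (T, m); column 0 is assumed to hold the response y^(1).
    Covariance and variance use the 1/T normalisation of the formulas above.
    A constant column gives a zero variance and therefore an undefined ratio.
    """
    panel_i = np.asarray(panel_i, dtype=float)
    centered = panel_i - panel_i.mean(axis=0)
    cov = centered[:, :1].T @ centered[:, 1:] / panel_i.shape[0]  # shape (1, m-1)
    std = centered.std(axis=0)                                    # ddof=0, i.e. 1/T
    return (cov / (std[0] * std[1:])).ravel()

def linear_structure_distance(panel):
    """Return the N x N matrix of Euclidean distances d_ij between profiles, as in (3)."""
    profiles = np.array([correlation_profile(g) for g in panel])
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

Two units whose response moves with the covariates in the same direction and with the same strength get a distance near zero, regardless of the level or scale of their series.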

2.2. Similarity Measure for Nonlinear Structural Relationships

${y}_{it}^{\left(1\right)}=f\left({y}_{it}^{\left(2\right)},{y}_{it}^{\left(3\right)},\cdots ,{y}_{it}^{\left(m\right)}\right)+{\epsilon }_{it}$, (4)

$\frac{\mathrm{d}y_{it}^{(1)}}{\mathrm{d}t}=\frac{\partial y_{it}^{(1)}}{\partial y_{i,t-l}^{(2)}}\frac{\mathrm{d}y_{it}^{(2)}}{\mathrm{d}t}+\frac{\partial y_{it}^{(1)}}{\partial y_{i,t-l}^{(3)}}\frac{\mathrm{d}y_{it}^{(3)}}{\mathrm{d}t}+\cdots+\frac{\partial y_{it}^{(1)}}{\partial y_{i,t-l}^{(m)}}\frac{\mathrm{d}y_{it}^{(m)}}{\mathrm{d}t}$,

$y_{it}^{(1)}=f\left(y_{it}^{(2)},y_{it}^{(3)},\cdots,y_{it}^{(m)}\right)+\epsilon_{it},\quad i=1,2,\cdots,N$,

$y_{jt}^{(1)}=f\left(y_{jt}^{(2)},y_{jt}^{(3)},\cdots,y_{jt}^{(m)}\right)+\epsilon_{jt},\quad j=1,2,\cdots,N$,

$\frac{\partial y_{it}^{(1)}}{\partial y_{i,t-l}^{(k)}}\approx\frac{\partial y_{jt}^{(1)}}{\partial y_{j,t-l}^{(k)}},\quad k=2,3,\cdots,m$. (5)

$\frac{\Delta y_{it}^{(1)}}{\Delta y_{i,t-l}^{(2)}}\approx\frac{\Delta y_{jt}^{(1)}}{\Delta y_{j,t-l}^{(2)}}$. (6)

$\frac{\Delta y_{it}^{(1)}}{\Delta y_{i,t-l}^{(2)}}\approx\frac{\Delta y_{jt}^{(1)}}{\Delta y_{j,t-l}^{(2)}},\ \frac{\Delta y_{it}^{(1)}}{\Delta y_{i,t-l}^{(3)}}\approx\frac{\Delta y_{jt}^{(1)}}{\Delta y_{j,t-l}^{(3)}},\ \cdots,\ \frac{\Delta y_{it}^{(1)}}{\Delta y_{i,t-l}^{(m)}}\approx\frac{\Delta y_{jt}^{(1)}}{\Delta y_{j,t-l}^{(m)}}$. (7)

$d_{ij}=\frac{1}{T}\sum_{t=1}^{T}\sqrt{\left(\frac{\Delta y_{it}^{(1)}}{\Delta y_{i,t-l}^{(2)}}-\frac{\Delta y_{jt}^{(1)}}{\Delta y_{j,t-l}^{(2)}}\right)^{2}+\left(\frac{\Delta y_{it}^{(1)}}{\Delta y_{i,t-l}^{(3)}}-\frac{\Delta y_{jt}^{(1)}}{\Delta y_{j,t-l}^{(3)}}\right)^{2}+\cdots+\left(\frac{\Delta y_{it}^{(1)}}{\Delta y_{i,t-l}^{(m)}}-\frac{\Delta y_{jt}^{(1)}}{\Delta y_{j,t-l}^{(m)}}\right)^{2}}$, (8)
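A possible numerical reading of (8) uses first differences for $\Delta$ and averages over the time points at which every ratio is defined. The sketch below makes several assumptions that the formula leaves open: the response sits in column 0, $\Delta y_t = y_t - y_{t-1}$, the average runs over the $T-1-l$ usable time points rather than $T$, and zero increments in a covariate are rejected instead of being smoothed. The function names are hypothetical.

```python
import numpy as np

def slope_ratios(panel_i, lag=0):
    """Return the ratios Δy^(1)_t / Δy^(k)_{t-lag}, k = 2..m, for one unit.

    panel_i has shape (T, m) with the response in column 0 (assumed layout).
    The result has shape (T - 1 - lag, m - 1).
    """
    d = np.diff(np.asarray(panel_i, dtype=float), axis=0)  # first differences, (T-1, m)
    n = d.shape[0] - lag
    if n <= 0:
        raise ValueError("series too short for the requested lag")
    num = d[lag:, :1]   # Δy^(1)_t
    den = d[:n, 1:]     # Δy^(k)_{t-lag}, aligned row by row with num
    if np.any(den == 0):
        raise ValueError("zero covariate increment: ratio undefined")
    return num / den

def nonlinear_structure_distance(panel_i, panel_j, lag=0):
    """Average over time of the Euclidean distance between ratio vectors, as in (8)."""
    diff = slope_ratios(panel_i, lag) - slope_ratios(panel_j, lag)
    return float(np.sqrt((diff ** 2).sum(axis=1)).mean())
```

Because only increments enter the ratios, two units whose responses differ by a constant level shift are at distance zero.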

2.3. Clustering of Structural Relationship Data

1) Determining the initial cluster points

$D_{(0)}=\left[\begin{array}{ccccc}0& d_{12}& d_{13}& \cdots & d_{1N}\\ 0& 0& d_{23}& \cdots & d_{2N}\\ \vdots& \vdots& \vdots& \ddots & \vdots\\ 0& 0& 0& \cdots & d_{(N-1)N}\\ 0& 0& 0& \cdots & 0\end{array}\right]$,

2) Aggregation rule

$d_{rl}^{2}=\min\left\{d_{ij}^{2},\ i\in G_{i},\ i=1,2,\cdots,K,\ j\notin G_{i},\ j=1,2,\cdots,N-K\right\}$. (9)
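Rule (9) merges on the smallest distance between a member of a group and a non-member, which resembles single-linkage agglomeration. The sketch below implements plain single linkage on the upper-triangular matrix $D_{(0)}$ as one plausible reading; it is not claimed to be the exact procedure of this section, and `single_linkage` and `n_clusters` are hypothetical names.

```python
import numpy as np

def single_linkage(D0, n_clusters):
    """Merge the two closest groups until n_clusters groups remain.

    D0 is an N x N matrix whose strict upper triangle holds the d_ij,
    as in D_(0). Closeness of two groups is the minimum d_ij over pairs
    with one unit in each group (single linkage, an assumed reading of (9)).
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    upper = np.triu(np.asarray(D0, dtype=float), 1)
    D = upper + upper.T                      # symmetric copy for easy lookup
    clusters = [[i] for i in range(D.shape[0])]
    while len(clusters) > n_clusters:
        best_d, best_a, best_b = np.inf, None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best_d:
                    best_d, best_a, best_b = d, a, b
        clusters[best_a] = clusters[best_a] + clusters[best_b]
        del clusters[best_b]
    return [sorted(c) for c in clusters]
```

The double loop is quadratic in the number of groups at every step, which is acceptable for a small number of units; a library routine would be preferable for large $N$.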

3. Relationship Clustering of Panel Data Based on Trajectory Features

3.1. Smoothing of Discrete Data

$\lVert y_{it}-f_{i}(t)\rVert=0$,

$0<\lVert y_{it}-f_{i}(t)\rVert<\epsilon_{it}$,

3.2. Choosing the Basis Functions

$y_{it}=\sum_{k=1}^{K}\alpha_{k}\phi_{k}(t),\quad k=1,2,\cdots,K$

3.3. Estimating the Basis Function Coefficient Vector

$SSE\left(y/\alpha \right)={\left(y-\Phi \alpha \right)}^{\prime }\left(y-\Phi \alpha \right)$, (10)

$\alpha ={\left({\Phi }^{\prime }\Phi \right)}^{-1}{\Phi }^{\prime }y$.
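The least-squares estimate above can be computed without forming the inverse explicitly. The sketch below assumes a monomial basis $\phi_k(t)=t^{k-1}$ purely for illustration, since the choice of basis is left open; `polynomial_basis` and `fit_basis_coefficients` are hypothetical names.

```python
import numpy as np

def polynomial_basis(t, K):
    """Return Phi with columns phi_k(t) = t**(k-1), k = 1..K (an assumed basis)."""
    return np.vander(np.asarray(t, dtype=float), N=K, increasing=True)

def fit_basis_coefficients(y, Phi):
    """Minimise SSE = (y - Phi a)'(y - Phi a) over a.

    np.linalg.lstsq returns the same solution as (Phi'Phi)^{-1} Phi'y
    when Phi'Phi is invertible, and is numerically more stable.
    """
    alpha, *_ = np.linalg.lstsq(Phi, np.asarray(y, dtype=float), rcond=None)
    return alpha
```

Any other basis (Fourier, B-spline) fits the same pattern: only the construction of `Phi` changes, while the estimation step stays the same.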

3.4. Similarity Measure Based on Symbolic Representation

${y}_{it}={y}_{i}\left(t\right)+{u}_{it}=\underset{k=1}{\overset{p}{\sum }}{\varphi }_{ik}{y}_{i}\left(t-k\right)+\underset{l=1}{\overset{q}{\sum }}{\theta }_{il}{\epsilon }_{i}\left(t-l\right)$. (11)

$M\left(f\left(t\right)\right)=\left({m}_{1},{m}_{2},\cdots ,{m}_{V}\right)$,

$m_{v}=\left\{\begin{array}{ll}1, & \text{the } v\text{-th vertex is a local maximum of } f(t) \text{ within its neighborhood}\\ 0, & \text{the } v\text{-th vertex is a local minimum of } f(t) \text{ within its neighborhood}\end{array}\right.,\quad v=1,2,\cdots,V$

The number of vertices of $M\left(f\left(t\right)\right)$ within a given neighborhood is $|M\left(f\left(t\right)\right)|$

$|M\left({f}_{i}\left(t\right)\cap {f}_{j}\left(t\right)\right)|$,

$D\left(f_{i}(t),f_{j}(t)\right)=|M\left(f_{i}(t)\right)|+|M\left(f_{j}(t)\right)|-|M\left(f_{i}(t)\cap f_{j}(t)\right)|$. (12)

Salvatore Ingrassia (2003) proved that this distance satisfies symmetry, positive definiteness, and the triangle inequality, that is:

$D\left({f}_{i}\left(t\right),{f}_{j}\left(t\right)\right)\ge 0$,

$D\left({f}_{i}\left(t\right),{f}_{j}\left(t\right)\right)=D\left({f}_{j}\left(t\right),{f}_{i}\left(t\right)\right)$,

$D\left({f}_{i}\left(t\right),{f}_{j}\left(t\right)\right)\le D\left({f}_{i}\left(t\right),{f}_{k}\left(t\right)\right)+D\left({f}_{k}\left(t\right),{f}_{j}\left(t\right)\right)$.
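The symbolic distance in (12) can be sketched on sampled curves as follows. The code rests on assumptions that the formula does not spell out: a vertex is an interior sample strictly above or below both neighbours, both leading terms are vertex counts, and the common vertices $M(f_i(t)\cap f_j(t))$ are those occurring at the same sample index with the same type. The function names are hypothetical.

```python
import numpy as np

def extrema_signature(values):
    """Return the set of (index, m_v) pairs for interior local extrema.

    m_v = 1 marks a local maximum and m_v = 0 a local minimum, using the
    two adjacent samples as the neighbourhood (an assumed definition).
    """
    v = np.asarray(values, dtype=float)
    signature = set()
    for t in range(1, len(v) - 1):
        if v[t] > v[t - 1] and v[t] > v[t + 1]:
            signature.add((t, 1))
        elif v[t] < v[t - 1] and v[t] < v[t + 1]:
            signature.add((t, 0))
    return signature

def signature_distance(sig_i, sig_j):
    """|M(f_i)| + |M(f_j)| - |common vertices|, following (12)."""
    return len(sig_i) + len(sig_j) - len(sig_i & sig_j)
```

For example, the samples `[0, 1, 0, 1, 0]` have three vertices and `[0, 1, 0, 0, 0]` has one, which they share, so the distance under these assumptions is 3 + 1 - 1 = 3.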

3.5. Clustering of Functional Data Based on Shape Features

$\left\{{y}_{i}^{k}\left(t\right),k=1,2,\cdots ,K\right\}$, (13)

${D}^{r}\left({y}_{i}\left(t\right)\right)=\mathrm{min}\left\{D\left({y}_{i}\left(t\right),{y}_{i}^{k}\left(t\right)\right),k=1,2,\cdots ,K\right\}$,

$D\left(y_{i}(t),y_{i}^{k}(t)\right)=|M\left(y_{i}(t)\right)|+|M\left(y_{i}^{k}(t)\right)|-|M\left(y_{i}(t)\cap y_{i}^{k}(t)\right)|$
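Assigning a curve to the closest of the $K$ reference shapes in (13) then amounts to a minimum over $k$. The sketch below assumes each curve has already been reduced to a set of `(index, type)` vertex pairs, an illustrative encoding that is not prescribed here; `nearest_reference` is a hypothetical name.

```python
def nearest_reference(signature, reference_signatures):
    """Return (k, D^r): the index of the closest reference shape and its distance.

    Each signature is a set of (index, type) vertex pairs (assumed encoding).
    The distance is |M(y_i)| + |M(y_i^k)| - |common vertices|.
    Ties are resolved in favour of the smallest k.
    """
    distances = [
        len(signature) + len(ref) - len(signature & ref)
        for ref in reference_signatures
    ]
    k = min(range(len(distances)), key=distances.__getitem__)
    return k, distances[k]
```

Note that under this distance a reference shape with no vertices is never farther from a curve than any other reference, so reference sets should be chosen with that in mind.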

4. Summary

