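This article was first published on the WeChat official account 【潇杂想】; feel free to follow it and join the discussion. Original link: [Dataset Splitting for Machine Learning Models (Cross-Validation)](https://mp.weixin.qq.com/s/0nIJEUZFQ_eGkfljdo-u8Q "机器学习模型数据集划分问题(交叉验证)")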
------------
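This post is about how to split a dataset when building a machine learning model.
In machine learning, a dataset is usually divided into two parts: a training set and a test set. The training set consists of labeled data; the test set is treated as unlabeled data, since its labels are unknown (or withheld) when the model makes its predictions. The two sets are independent of each other.
Models often overfit during training: they perform well on the training data but worse in actual testing. To catch this early, the training set is split once more into a (smaller) training set and a validation set, so that the model can be evaluated before the real test. Because both the training set and the validation set are labeled, we can judge the quality of the predictions ourselves.
This self-evaluation is usually done with cross-validation. Cross-validation splits the dataset so that the sample data is used as fully as possible when evaluating a model, which matters most when the sample size is small. The data is reused: samples are selected and combined into different training and validation sets, so a sample that is in the training set in one round may be in the validation set in the next. This is what "cross" means.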
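As a minimal sketch of this idea, the snippet below runs 5-fold cross-validation with scikit-learn's `cross_val_score`. The iris dataset and the decision tree classifier are assumptions chosen only for illustration; they do not come from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and model (assumptions, not from the original post).
X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# cv=5: the data is split into 5 folds; each fold serves once as the
# validation set while the remaining 4 folds are used for training.
scores = cross_val_score(model, X, y, cv=5)
print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```

Each of the five scores comes from a model that never saw its validation fold during training, which is what makes the average a fairer estimate than the training accuracy.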
The scikit-learn documentation describes many ways to split a dataset, and at first glance they are easy to confuse.
train_test_split
KFold
GroupKFold
StratifiedKFold
RepeatedKFold
RepeatedStratifiedKFold
ShuffleSplit
GroupShuffleSplit
StratifiedShuffleSplit
......
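> scikit-learn official documentation
After reading and summarizing the documentation, this post breaks these methods down into a few keywords; each splitting method can then be seen as a combination of keywords.
train_test_split: a simple split into a training set and a test set.
KFold: K-fold cross-validation splitting.
Stratified: stratified sampling, so that the class proportions in the training and validation sets match those of the original dataset.
Group: no group appears in both the test set and the training set.
Repeated: repeats KFold n times, producing different splits in each repetition.
Shuffle: random permutation splitting. Each split is drawn independently, so the same sample can land in the test set of several splits; this post loosely calls that "sampling with replacement".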

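What follows mainly distinguishes the concepts behind the functions above. Because the sklearn documentation gives detailed examples for every function, this post does not repeat them all and only gives the examples and notes below.
(1) train_test_split: the simple split function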
```python
from sklearn.model_selection import train_test_split

X = ['a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'b3', 'b4', 'c1', 'c2', 'c3', 'c4']
y = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
# Stratify on y so that the classes stay balanced in both parts.
print(train_test_split(X, test_size=0.33333, shuffle=True, random_state=124, stratify=y))
# Output reported in the original post:
# [['c2', 'a1', 'a3', 'a4', 'b2', 'b3', 'c4', 'b4'], ['b1', 'c1', 'a2', 'c3']]
```
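train_test_split shuffles the original data by default (shuffle=True).
If stratify is set, the data is first shuffled within each class and then sampled.
In other words, if stratify is set, shuffle must be True; otherwise an error is raised.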
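```python
from sklearn.model_selection import train_test_split

X = ['a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'b3', 'b4', 'c1', 'c2', 'c3', 'c4']
y = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
# Raises ValueError: stratify cannot be combined with shuffle=False.
train_test_split(X, test_size=0.5, shuffle=False, random_state=124, stratify=y)
```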
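To make the failure visible without stopping a script, the following sketch catches the error and then repeats the split with shuffle=True. The variable name `error_message` is introduced here for illustration only.

```python
from sklearn.model_selection import train_test_split

X = ['a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'b3', 'b4', 'c1', 'c2', 'c3', 'c4']
y = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

# stratify together with shuffle=False is rejected with a ValueError.
error_message = None
try:
    train_test_split(X, y, test_size=0.5, shuffle=False, stratify=y)
except ValueError as exc:
    error_message = str(exc)
print("ValueError:", error_message)

# With shuffle=True the same stratified split works.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, shuffle=True, random_state=124, stratify=y
)
# Half of each class (2 of 4 samples) ends up in the test set.
print("test labels:", sorted(y_test))
```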

(2) KFold
```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array([['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2'], ['d1', 'd2'], ['e1', 'e2'], ['f1', 'f2']])
y = np.array([1, 1, 2, 2, 2, 2])
```

```python
# random_state only takes effect when shuffle=True; recent scikit-learn
# versions raise an error if it is set together with shuffle=False.
kf = KFold(n_splits=3, shuffle=False)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    print("X_train:\n", X_train, "\nX_test:\n", X_test)
    y_train, y_test = y[train_index], y[test_index]
    print("y_train:\n", y_train, "\ny_test:\n", y_test)
```

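KFold does not shuffle by default.
After the KFold split, the three TEST index sets [0,1], [2,3] and [4,5] do not overlap; they are mutually exclusive and together cover the whole dataset.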
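The partition property can be checked directly. The sketch below uses placeholder data of the same shape (6 samples) and collects the test indices of every fold; the names `test_folds` and `all_test` are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # 6 placeholder samples
kf = KFold(n_splits=3)           # shuffle defaults to False

# Collect the test indices of each fold.
test_folds = [test_index.tolist() for _, test_index in kf.split(X)]
print(test_folds)  # [[0, 1], [2, 3], [4, 5]]

# Every sample index appears in exactly one test fold.
all_test = sorted(i for fold in test_folds for i in fold)
print(all_test == list(range(len(X))))  # True
```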
(3) StratifiedKFold
```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2'], ['d1', 'd2'], ['e1', 'e2'], ['f1', 'f2']])
y = np.array([1, 1, 1, 2, 2, 2])
skf = StratifiedKFold(n_splits=3, shuffle=True)
# split() needs y so that it can stratify by class.
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    print("X_train:\n", X_train, "\nX_test:\n", X_test)
    y_train, y_test = y[train_index], y[test_index]
    print("y_train:\n", y_train, "\ny_test:\n", y_test)
```
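Likewise:
After the StratifiedKFold split, the three TEST index sets (for example [0,5], [1,4] and [2,3] in one run; the exact indices vary because shuffle=True is used without a fixed random_state) do not overlap; they are mutually exclusive and together cover the whole dataset.
Because the sampling is stratified, the two classes appear in a 1:1 ratio in both the training set and the test set of every fold.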
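A small sketch to confirm the class ratio: it counts the labels on each side of every fold. The fixed `random_state=0` and the name `fold_counts` are assumptions added so the run is repeatable; the counts themselves do not depend on the seed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((6, 2))                 # features are irrelevant for the split
y = np.array([1, 1, 1, 2, 2, 2])
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

fold_counts = []
for train_index, test_index in skf.split(X, y):
    # bincount index 0 is unused because the labels are 1 and 2.
    train_counts = np.bincount(y[train_index], minlength=3)[1:].tolist()
    test_counts = np.bincount(y[test_index], minlength=3)[1:].tolist()
    fold_counts.append((train_counts, test_counts))
    print("train class counts:", train_counts, "test class counts:", test_counts)
```

Every fold should report 2 samples per class for training and 1 per class for testing, which is the 1:1 ratio described above.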
(4) RepeatedKFold
Repeats KFold n times, producing different splits in each repetition.
```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2'], ['d1', 'd2'], ['e1', 'e2'], ['f1', 'f2']])
y = np.array([1, 1, 1, 2, 2, 2])
rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=123)
for train_index, test_index in rkf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    # print("X_train:\n", X_train, "\nX_test:\n", X_test)
    y_train, y_test = y[train_index], y[test_index]
    # print("y_train:\n", y_train, "\ny_test:\n", y_test)
```
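With n_splits=2 and n_repeats=2, the 2 repetitions produce 4 splits. The TEST sets of these 4 splits are no longer mutually exclusive: they overlap across repetitions.

With n_splits=3 and n_repeats=2, the 2 repetitions produce 6 splits. Again, the TEST sets of these 6 splits are no longer mutually exclusive and overlap.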
```python
rkf = RepeatedKFold(n_splits=3, n_repeats=2, random_state=123)
```
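The overlap can be quantified with a short sketch: across all splits, each sample should appear in a test set once per repetition. The placeholder data and the names `splits` and `test_counts` are illustrative.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.zeros((6, 2))  # 6 placeholder samples
rkf = RepeatedKFold(n_splits=3, n_repeats=2, random_state=123)
splits = list(rkf.split(X))
print("number of splits:", len(splits))  # n_splits * n_repeats = 6

# Count how often each sample index lands in a test set.
test_counts = np.bincount(
    np.concatenate([test_index for _, test_index in splits]), minlength=len(X)
)
print("times each sample is tested:", test_counts.tolist())  # [2, 2, 2, 2, 2, 2]
```

Within one repetition the folds still partition the data; the overlap only appears when splits from different repetitions are compared.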

(5) RepeatedStratifiedKFold
Similarly, RepeatedStratifiedKFold repeats the StratifiedKFold split n times, with different randomization in each repetition.
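The original post gives no example for this splitter, so here is a minimal sketch that reuses the data from the StratifiedKFold example. The parameter values are assumptions chosen to mirror the earlier examples.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

X = np.array([['a1', 'a2'], ['b1', 'b2'], ['c1', 'c2'], ['d1', 'd2'], ['e1', 'e2'], ['f1', 'f2']])
y = np.array([1, 1, 1, 2, 2, 2])

# 3 stratified folds, repeated 2 times with different shuffles: 6 splits.
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=123)
splits = list(rskf.split(X, y))
for train_index, test_index in splits:
    # Each test fold holds one sample of class 1 and one of class 2.
    print("TRAIN:", train_index, "TEST:", test_index, "y_test:", y[test_index])
```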
(6) GroupKFold
```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([0, 0, 1, 1, 2, 2])
group_kfold = GroupKFold(n_splits=2)
group_kfold.get_n_splits(X, y, groups)
for train_index, test_index in group_kfold.split(X, y, groups):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("X_train:\n", X_train, "\nX_test:\n", X_test)
    print("y_train:\n", y_train, "\ny_test:\n", y_test)
```

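No group appears in both the test set and the training set.
Methods (2) to (6) sample without replacement: within one pass over the K folds, every sample is in the test set exactly once. (The Repeated variants restart this process in each repetition, which is why their test sets overlap across repetitions.)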
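For the group constraint of GroupKFold, a short sketch can confirm that the training and test sides never share a group. The name `shared_groups` is illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([0, 0, 1, 1, 2, 2])

shared_groups = []
for train_index, test_index in GroupKFold(n_splits=2).split(X, y, groups):
    train_groups = set(groups[train_index].tolist())
    test_groups = set(groups[test_index].tolist())
    # The intersection is empty when no group is on both sides.
    shared_groups.append(train_groups & test_groups)
    print("train groups:", sorted(train_groups), "test groups:", sorted(test_groups))
```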
(7) ShuffleSplit
```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [3, 4], [5, 6]])
y = np.array([1, 2, 1, 2, 1, 2])
rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
print(rs.get_n_splits(X))  # number of splits: 5
for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```

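Set test_size or train_size explicitly; if both are left unset, test_size defaults to 0.1.
Because every split is drawn independently (sampling with replacement across splits), the test sets of different splits overlap; the original run even produced the same test set, [5,2], more than once.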
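To see the overlap in numbers, the sketch below counts how many times each sample is drawn into a test set. With 6 samples, 5 splits and 2 test samples per split, some sample must be tested more than once. The name `test_counts` is illustrative.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.zeros((6, 2))  # 6 placeholder samples
rs = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

test_counts = np.zeros(len(X), dtype=int)
for train_index, test_index in rs.split(X):
    # Within a single split, train and test never share a sample.
    test_counts[test_index] += 1
print("times each sample is tested:", test_counts.tolist())
print("total test slots:", int(test_counts.sum()))  # 5 splits * 2 samples = 10
```

Unlike KFold, nothing guarantees that every sample is tested at all; some may be picked several times and others never.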
(8) StratifiedShuffleSplit
```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 0, 1, 1, 1])
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
sss.get_n_splits(X, y)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```

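The class proportions in the training and validation sets match those of the original dataset as closely as the split sizes allow; this is stratified sampling with replacement across splits.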
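The proportions are easier to see on an imbalanced dataset. The sketch below uses hypothetical data with an 8:4 class ratio (an assumption for illustration, not the data from the example above) and counts the classes on each side of every split.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 8 samples of class 0, 4 of class 1 (2:1).
X = np.zeros((12, 2))
y = np.array([0] * 8 + [1] * 4)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

class_counts = []
for train_index, test_index in sss.split(X, y):
    train_counts = np.bincount(y[train_index], minlength=2).tolist()
    test_counts = np.bincount(y[test_index], minlength=2).tolist()
    class_counts.append((train_counts, test_counts))
    print("train class counts:", train_counts, "test class counts:", test_counts)
```

Each split should keep the 2:1 ratio on both sides: 6 and 3 samples for training, 2 and 1 for testing.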
(9) GroupShuffleSplit
```python
from sklearn.model_selection import GroupShuffleSplit

X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001]
y = ["a", "b", "b", "b", "c", "c", "c", "a"]
groups = [1, 1, 2, 2, 3, 3, 4, 4]
gss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
for train, test in gss.split(X, y, groups=groups):
    print("%s %s" % (train, test))
```

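No group appears in both the test set and the training set.
Keeping the groups in the training and test sets mutually exclusive prevents group-specific information from leaking into the test set, which would otherwise make it hard to tell whether the model has learned the other, more important features.
Methods (7) to (9) sample with replacement across splits.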
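For GroupShuffleSplit, the group constraint can be checked the same way as for GroupKFold. In this sketch test_size=0.5 applies to the 4 groups, so 2 whole groups go to the test side in every split; the names `shared_groups` and `test_group_counts` are illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.array([0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 0.001])
y = np.array(["a", "b", "b", "b", "c", "c", "c", "a"])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])
gss = GroupShuffleSplit(n_splits=4, test_size=0.5, random_state=0)

shared_groups = []
test_group_counts = []
for train_index, test_index in gss.split(X, y, groups=groups):
    train_groups = set(groups[train_index].tolist())
    test_groups = set(groups[test_index].tolist())
    shared_groups.append(train_groups & test_groups)
    test_group_counts.append(len(test_groups))
    print("train groups:", sorted(train_groups), "test groups:", sorted(test_groups))
```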
sklearn also documents the following splitters; see the official documentation for further reading.
Leave One Group Out
Leave P Groups Out
Leave One Out (LOO)
Leave P Out (LPO)
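As a rough orientation, the sketch below only asks each of these four splitters how many splits it would generate. The tiny dataset and the group labels are assumptions made up for illustration.

```python
import numpy as np
from sklearn.model_selection import (
    LeaveOneGroupOut,
    LeaveOneOut,
    LeavePGroupsOut,
    LeavePOut,
)

# Hypothetical data: 4 samples that belong to 3 groups.
X = np.arange(8).reshape(4, 2)
y = np.array([0, 1, 0, 1])
groups = np.array([1, 1, 2, 3])

n_loo = LeaveOneOut().get_n_splits(X)                            # one split per sample
n_lpo = LeavePOut(p=2).get_n_splits(X)                           # every pair of samples
n_logo = LeaveOneGroupOut().get_n_splits(X, y, groups)           # one split per group
n_lpgo = LeavePGroupsOut(n_groups=2).get_n_splits(X, y, groups)  # every pair of groups
print(n_loo, n_lpo, n_logo, n_lpgo)  # 4 6 3 3
```

The number of splits grows combinatorially for Leave P Out, so it is usually practical only on small datasets.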
Note:
The example code in this post is adapted from the scikit-learn user guide, Release 0.21.3, with some modifications.
References:
https://scikit-learn.org/
https://numpy.org/