Section I: Brief Introduction on StratifiedKFold
A slight improvement over the standard k-fold cross-validation approach is stratified k-fold cross-validation, which can yield better bias and variance estimates, especially in cases of unequal class proportions. In stratified cross-validation, the class proportions are preserved in each fold to ensure that each fold is representative of the class proportions in the training dataset.
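The property described above can be seen directly in a minimal sketch (not from the book): with an imbalanced toy label array, every test fold produced by StratifiedKFold keeps the original 80/20 class ratio. The array sizes here are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
X = np.zeros((100, 1))
y = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Each 20-sample test fold keeps the 80/20 ratio: 16 of class 0, 4 of class 1
    print("Fold %d: test class counts = %s" % (fold, np.bincount(y[test_idx])))
```

Plain KFold on the same array would give several folds containing no class-1 samples at all, which is exactly the estimation problem stratification avoids.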
From: Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd ed. (Chinese translation). Nanjing: Southeast University Press, 2018.
Section II: Code and Analyses
Code
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Section 1: Load breast cancer data (benign vs. malignant)
breast = datasets.load_breast_cancer()
X = breast.data
y = breast.target
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

# Section 2: Define pipeline model
pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(random_state=1))

# Section 3: Define StratifiedKFold model
print("Original Class Dist: %s\n" % np.bincount(y))
# random_state is omitted: it has no effect when shuffle=False (the default),
# and newer scikit-learn versions raise an error if it is passed anyway.
kfold = StratifiedKFold(n_splits=10).split(X_train, y_train)
scores = []
for k, (train_idx, test_idx) in enumerate(kfold):
    pipe_lr.fit(X_train[train_idx], y_train[train_idx])
    score = pipe_lr.score(X_train[test_idx], y_train[test_idx])
    scores.append(score)
    print("Fold: %2d, Class dist: %s, Acc: %.3f"
          % (k + 1, np.bincount(y_train[train_idx]), score))
print('CV Accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

# Section 4: The easier way, using cross_val_score
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=pipe_lr,
                         X=X_train,
                         y=y_train,
                         cv=10,
                         n_jobs=1)
print("\nCV Accuracy Scores: %s" % scores)
print("CV Accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
Results
Original Class Dist: [212 357]

Fold:  1, Class dist: [153 256], Acc: 0.978
Fold:  2, Class dist: [153 256], Acc: 0.935
Fold:  3, Class dist: [153 256], Acc: 0.957
Fold:  4, Class dist: [153 256], Acc: 0.935
Fold:  5, Class dist: [153 256], Acc: 0.913
Fold:  6, Class dist: [153 257], Acc: 0.956
Fold:  7, Class dist: [153 257], Acc: 0.933
Fold:  8, Class dist: [153 257], Acc: 0.956
Fold:  9, Class dist: [153 257], Acc: 0.933
Fold: 10, Class dist: [153 257], Acc: 0.956
CV Accuracy: 0.945 +/- 0.018

CV Accuracy Scores: [0.97826087 0.93478261 0.95652174 0.93478261 0.91304348
 0.95555556 0.93333333 0.95555556 0.93333333 0.95555556]
CV Accuracy: 0.945 +/- 0.018
Comparing the results above reveals two points. First, StratifiedKFold clearly samples the training and test folds in proportion to the class distribution: each training fold keeps roughly the same 212:357 ratio as the full dataset. Second, the splits produced by cross_val_score are identical to the StratifiedKFold splits, because for a classifier scikit-learn interprets an integer cv as stratified k-fold by default.
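The second point can be checked directly. The sketch below (my own, not from the book) runs cross_val_score twice on the same classifier, once with cv=10 and once with an explicit StratifiedKFold(n_splits=10), and confirms the per-fold scores coincide.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Integer cv on a classifier is treated as stratified k-fold (shuffle=False)...
scores_int = cross_val_score(clf, X, y, cv=10)
# ...so it should match an explicit StratifiedKFold splitter fold for fold.
scores_skf = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10))
print(np.allclose(scores_int, scores_skf))
```

Note this equivalence holds for classification targets; for regression targets an integer cv falls back to plain (unstratified) KFold.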
References:
Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd ed. (Chinese translation). Nanjing: Southeast University Press, 2018.