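Malignant mesothelioma (MM), also known as diffuse malignant pleural mesothelioma, is a highly invasive malignant tumor that originates in the pleura. It is the most common type of primary pleural tumor. Its clinical presentation is related to its invasive behavior: it usually invades the pleural cavity and surrounding structures locally. Without treatment, median survival is 4 to 12 months. The main factors associated with the disease are:

+ asbestos exposure;
+ simian virus 40 (SV40) infection;
+ genetic predisposition;
+ molecular mechanisms;
+ rural living.

To diagnose malignant mesothelioma, we can use clinical monitoring indicators (for example asbestos exposure, white blood cell count, platelet count, and serum lactate dehydrogenase) as features, build a classification model, and predict whether a patient has the disease. Because this case has a fairly large number of features (34), we first reduce the dimensionality of the feature data with principal component analysis (PCA). We then train a decision tree model and compare the performance of the models trained before and after dimensionality reduction.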

1 Data source

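We use the "Mesotheliomas disease data set Data Set" from the UCI repository, which contains 324 samples. Each sample has 34 feature variables and 1 class variable (the diagnosis: healthy or mesothelioma). The variables are listed below: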

Column	Description	Type	Example
age	Age	Int	47
gender	Gender (0 or 1)	Int	1
city	City (0 to 8)	Int	0
asbestos exposure	Asbestos exposure (0 or 1)	Int	1
type of MM	Type of malignant mesothelioma (0, 1, or 2)	Int	0
duration of asbestos exposure	Duration of asbestos exposure	Int	20
diagnosis method	Diagnosis method (0 or 1)	Int	1
keep side	Keep side (0, 1, or 2)	Int	0
cytology	Cytology examination (0 or 1)	Int	1
duration of symptoms	Duration of symptoms	Float	24
dyspnoea	Dyspnoea (0 or 1)	Int	1
ache on chest	Chest pain (0 or 1)	Int	1
weakness	Weakness (0 or 1)	Int	0
habit of cigarette	Smoking habit (0 to 3)	Int	2
performance status	Performance status (0 or 1)	Int	1
white blood	White blood cells	Int	8050
cell count (WBC)	Cell count (white blood cells)	Int	9
hemoglobin (HGB)	Hemoglobin (0 or 1)	Int	1
platelet count (PLT)	Platelet count	Int	274
sedimentation	Sedimentation	Int	60
blood lactic dehydrogenise (LDH)	Serum lactate dehydrogenase	Int	258
...	...	...	...
class of diagnosis	0: Healthy; 1: Mesothelioma	Int	0

2 Data exploration and preprocessing

First, load the data into a DataFrame with the pandas read_csv() function:

import numpy as np
import pandas as pd
df = pd.read_csv("./input/Mesothelioma.csv")
df.head(5)

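# Count missing values. NaN never compares equal to itself, so use isnull()
# rather than comparing against np.nan.
print('Number of missing values in the data:', df.isnull().sum().sum())

Number of missing values in the data: 0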
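
A side note on counting missing values: NaN never compares equal to anything, including itself, so an equality test against np.nan cannot find missing entries. The sketch below uses a small made-up array (not the mesothelioma data) to show the difference between `==` and `np.isnan`.

```python
import numpy as np

# Toy values for illustration only; two entries are missing.
values = np.array([47.0, np.nan, 29.0, np.nan, 39.0])

# Equality against NaN is always False, so this count is 0.
count_by_equality = int((values == np.nan).sum())

# np.isnan flags the missing entries correctly.
count_by_isnan = int(np.isnan(values).sum())

print(count_by_equality, count_by_isnan)
```

For a DataFrame, `df.isnull().sum().sum()` performs the same kind of check across all columns.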

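2.1 Data standardization

PCA must start from a table of variables measured on the same scale. Because the total variance of the variables is apportioned among the eigenvalues, the variables must share the same physical units for the sum of variances to be meaningful (the unit of a variance is the square of the variable's unit), or they must be dimensionless, such as standardized or log-transformed data. We therefore standardize the data before building the model. Common methods include min-max scaling and z-score standardization; in this example we use z-score standardization.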
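
To make the units argument concrete, here is a small sketch on synthetic data (the two features and their scales are invented for illustration). Without standardization the first principal direction is almost entirely the large-unit feature; after z-scoring, both features contribute equally.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 500
# Two correlated synthetic features on very different scales (hypothetical units).
base = rng.normal(size=n)
small = base + 0.5 * rng.normal(size=n)              # spread of about 1
large = 1000.0 * (base + 0.5 * rng.normal(size=n))   # spread of about 1000
data = np.column_stack([small, large])

def first_direction(a):
    """Unit eigenvector of the covariance matrix with the largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(np.cov(a, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

raw_dir = first_direction(data)

# z-score each column, then repeat the analysis.
z = (data - data.mean(axis=0)) / data.std(axis=0)
z_dir = first_direction(z)

print(np.abs(raw_dir))  # dominated by the large-scale feature
print(np.abs(z_dir))    # both features weigh about the same
```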

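First, a note on the scale function in sklearn's preprocessing module: sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True). Depending on its parameters, it standardizes the dataset along either axis.

Parameters:

X: array or matrix
axis: int, default 0; the axis along which the means and standard deviations are computed. If 0, each feature (column) is standardized independently; if 1, each sample (row) is standardized
with_mean: boolean, default True; centers the data to zero mean
with_std: boolean, default True; scales the data to unit variance

We use the default parameters: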

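from sklearn import preprocessing
# Features are every column except the last; the last column is the label.
X = df.iloc[:, :-1]
y = df['class of diagnosis']

# Shuffle the rows (unseeded, so the order differs between runs).
perm = np.random.permutation(len(X))
X = X.iloc[perm]
y = y.iloc[perm]
# z-score each feature column; scale() returns a NumPy array.
X = preprocessing.scale(X)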
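
For reference, the standardization that scale performs with its default parameters can be reproduced with plain NumPy. The helper name zscore and the toy matrix below are assumptions for illustration; the sketch subtracts the mean and divides by the population standard deviation (ddof=0) along the chosen axis.

```python
import numpy as np

def zscore(a, axis=0):
    """Standardize along `axis`: zero mean and unit variance (population std)."""
    a = np.asarray(a, dtype=float)
    mean = a.mean(axis=axis, keepdims=True)
    std = a.std(axis=axis, keepdims=True)
    std[std == 0.0] = 1.0  # keep constant columns/rows at zero instead of dividing by 0
    return (a - mean) / std

# Toy matrix: 4 samples (rows) x 2 features (columns).
toy = np.array([[47.0, 8050.0],
                [55.0, 6200.0],
                [29.0, 9100.0],
                [39.0, 7400.0]])

by_feature = zscore(toy, axis=0)  # each column standardized
by_sample = zscore(toy, axis=1)   # each row standardized

print(by_feature.mean(axis=0), by_feature.std(axis=0))
```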

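3 Dimensionality reduction with PCA

PCA looks for the directions in a high-dimensional space along which the data varies most (largest variance), changes the basis of the space to those directions, and keeps only the most important basis vectors. This reduces the dimensionality while preserving as much of the structure of the data as possible.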
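
The idea can be sketched in a few lines of NumPy: center the data, take the eigenvectors of the covariance matrix as the new basis, and keep the directions with the largest eigenvalues. This is a minimal illustration on random data, not scikit-learn's implementation (which is based on an SVD).

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic data: 200 samples, 5 features, mixed so that features are correlated.
data = rng.normal(size=(200, 5)).dot(rng.normal(size=(5, 5)))

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder to get the largest first.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
reduced = centered.dot(eigvecs[:, :k])  # project onto the top-k directions

# The variance of each projected column equals the corresponding eigenvalue.
print(reduced.shape)
print(reduced.var(axis=0, ddof=1), eigvals[:k])
```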

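3.1 Function prototype and parameters

Here we use the PCA class in sklearn's decomposition module to perform principal component analysis. The prototype is: sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False)
Parameters:

n_components:
- Meaning: the number n of principal components to keep, i.e. the number of features retained.
- Type: int, float, or string; defaults to None, in which case all components are kept. With an int, for example n_components=1, the data is reduced to one dimension. With the string 'mle', n is chosen automatically by maximum likelihood estimation. With a float between 0 and 1, n is chosen so that the cumulative explained variance exceeds that fraction.
copy:
- Meaning: whether to copy the original training data when running the algorithm. If True, the original training data is left unchanged after PCA runs, because the computation is done on a copy; if False, the original training data may be overwritten, because the computation is done in place.
- Type: bool, True or False; defaults to True.
whiten:
- Meaning: whitening, which rescales the projected components so that each has unit variance.
- Type: bool; defaults to False.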
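
To illustrate the whiten option, the following NumPy sketch (synthetic data, not the sklearn code path) projects centered data onto its principal directions and then divides each component by its standard deviation. Without whitening the components have decreasing variances; with it, each has unit variance.

```python
import numpy as np

rng = np.random.RandomState(1)
data = rng.normal(size=(300, 3)) * np.array([5.0, 2.0, 0.5])  # unequal spreads
centered = data - data.mean(axis=0)

# SVD of the centered data: the rows of vt are the principal directions.
u, s, vt = np.linalg.svd(centered, full_matrices=False)
component_std = s / np.sqrt(len(data) - 1)  # std of each projected component

projected = centered.dot(vt.T)         # analogous to whiten=False
whitened = projected / component_std   # analogous to whiten=True

print(projected.var(axis=0, ddof=1))  # decreasing variances
print(whitened.var(axis=0, ddof=1))   # all approximately 1
```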

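3.2 Attributes of the PCA object

components_: the principal axes in feature space, i.e. the directions of maximum variance, one row per retained component.
explained_variance_ratio_: the fraction of the total variance explained by each of the n retained components.
n_components_: the number n of retained components.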

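3.3 Methods of the PCA object

fit(X, y=None): fit() is the common training method in scikit-learn. Every estimator that needs training has a fit() method, and it is the "training" step of the algorithm. Because PCA is an unsupervised learning algorithm, y is None here.
fit(X) trains the PCA model on the data X.

Return value: the object on which fit was called. For example, pca.fit(X) trains the pca object on X and returns it.
fit_transform(X): trains the PCA model on X and returns the reduced data. In newX = pca.fit_transform(X), newX is the reduced data.
inverse_transform(newX): maps reduced data back to the original feature space, X = pca.inverse_transform(newX). When components have been dropped, the result approximates the original data rather than recovering it exactly.
transform(X): converts the data X into reduced data. Once the model is trained, any new input data can be reduced with transform.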
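
How these methods relate can be sketched with a tiny stand-in class. TinyPCA is a hypothetical name, not part of scikit-learn; it only mimics the fit / transform / inverse_transform contract so that the lossy round trip is visible.

```python
import numpy as np

class TinyPCA(object):
    """Hypothetical minimal PCA with a scikit-learn-like interface."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[:self.n_components]
        return self  # fit returns the object itself

    def transform(self, X):
        # New data is centered with the mean learned during fit.
        return (np.asarray(X, dtype=float) - self.mean_).dot(self.components_.T)

    def fit_transform(self, X):
        return self.fit(X).transform(X)

    def inverse_transform(self, X_reduced):
        return X_reduced.dot(self.components_) + self.mean_

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))

full = TinyPCA(4)
exact = full.inverse_transform(full.fit_transform(X))     # nothing dropped

small = TinyPCA(2)
approx = small.inverse_transform(small.fit_transform(X))  # two components dropped

print(np.allclose(exact, X), np.allclose(approx, X))
```

Keeping all components makes the round trip exact, while dropping components makes inverse_transform return only an approximation.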

3.4 Applying PCA

We create a PCA object that keeps 22 components:

from sklearn.decomposition import PCA
pca = PCA(copy=True, n_components=22, whiten=False)
X_new = pca.fit_transform(X)

print('Explained variance ratio of the n retained principal components:')
print(pca.explained_variance_ratio_)
print('Eigenvectors of the top 2 principal components:')
print(pca.components_[0:2])
print('Cumulative explained variance ratio:')
print(sum(pca.explained_variance_ratio_))

Explained variance ratio of the n retained principal components:
[ 0.12553081  0.07154234  0.06331551  0.05455819  0.04755711  0.04629502
  0.04300128  0.03680922  0.03525149  0.03303236  0.03250686  0.03247517
  0.02996476  0.02870988  0.02767058  0.02588683  0.02505337  0.02263919
  0.02165174  0.02126867  0.02044553  0.01958815]
Eigenvectors of the top 2 principal components:
[[-0.07226296  0.04919076 -0.00729629  0.01612169  0.01787884 -0.0736354
  -0.01715455 -0.00796956 -0.08280908 -0.00329767 -0.03747524 -0.05541792
  -0.06134797  0.06570297  0.34735628 -0.04421487  0.05467926 -0.06011151
  -0.20734354 -0.11755069 -0.28734282 -0.05831704  0.10016904  0.02967766
   0.01025695 -0.1270108  -0.30164822 -0.27161576  0.3662886  -0.02511632
  -0.32240568 -0.29888796 -0.31615277 -0.27309709]
 [-0.14549557  0.00154012  0.43650205 -0.53809498 -0.14102176 -0.51684086
   0.09539106 -0.09988499 -0.11205484  0.03726002  0.13523726  0.07508918
  -0.22544025  0.12328454  0.03290527  0.01368637  0.00802218  0.15038872
   0.02987633 -0.05496272  0.12746521  0.02513513  0.07876572  0.02759405
  -0.02883099  0.09695859  0.012005    0.01005032 -0.0271272  -0.14574855
   0.04349126  0.01734347 -0.00912368 -0.04822   ]]
Cumulative explained variance ratio:
0.864754051893

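As shown, after reducing the 34 feature variables to 22 components, the cumulative explained variance ratio is 86.48%, so the new features retain most of the variance of the original data.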
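
The printed ratios can also be used to see how the cumulative contribution grows and how many components a given target would need. The sketch below copies the 22 values from the output above; the 80% target is an arbitrary example, not a threshold used in this case study.

```python
import numpy as np

# Explained variance ratios copied from the PCA output in this section.
ratios = np.array([
    0.12553081, 0.07154234, 0.06331551, 0.05455819, 0.04755711, 0.04629502,
    0.04300128, 0.03680922, 0.03525149, 0.03303236, 0.03250686, 0.03247517,
    0.02996476, 0.02870988, 0.02767058, 0.02588683, 0.02505337, 0.02263919,
    0.02165174, 0.02126867, 0.02044553, 0.01958815])

cumulative = np.cumsum(ratios)

target = 0.80  # arbitrary example threshold
# Index of the first cumulative value >= target, converted to a component count.
needed = int(np.searchsorted(cumulative, target) + 1)

print(round(cumulative[-1], 4))  # total for all 22 components
print(needed)                    # components needed to reach the target
```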

3 Model training

To check how the PCA-reduced data performs in classification, we use the tree.DecisionTreeClassifier class from sklearn. We split the data into 80% for training and 20% for testing.
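
As a rough sketch of what an 80/20 split with a fixed seed does (this is not scikit-learn's implementation, and split_80_20 is a hypothetical helper), one can shuffle the row indices and cut off the test part. With 324 samples this leaves 65 test rows, which matches the support totals in the reports below.

```python
import math
import numpy as np

def split_80_20(features, labels, test_size=0.2, seed=0):
    """Hypothetical helper: shuffle row indices, then cut off the test part."""
    n = len(features)
    n_test = int(math.ceil(test_size * n))
    order = np.random.RandomState(seed).permutation(n)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return features[train_idx], features[test_idx], labels[train_idx], labels[test_idx]

# Placeholder arrays with the same shape as the reduced data (324 x 22).
features = np.arange(324 * 22, dtype=float).reshape(324, 22)
labels = np.arange(324) % 2

X_tr, X_te, y_tr, y_te = split_80_20(features, labels)
print(X_tr.shape, X_te.shape)
```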

from sklearn import tree
# sklearn.cross_validation was deprecated in 0.18 and removed in 0.20;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=.2, random_state=0)

Train the model:

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(clf)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
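
The printed parameters show criterion='gini', meaning candidate splits are scored by Gini impurity. The following sketch, on a handful of invented values, computes the impurity of a node and the weighted impurity after a threshold split. It illustrates the criterion only and is not the library's tree-building code.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(labels.size)
    return 1.0 - float(np.sum(p ** 2))

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity of the two children of `feature <= threshold`."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = float(labels.size)
    return left.size / n * gini(left) + right.size / n * gini(right)

# Invented toy node: one feature and binary labels.
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 0, 1, 1, 1])

print(gini(labels))                          # impurity before splitting
print(split_impurity(feature, labels, 3.0))  # a split that separates the classes
print(split_impurity(feature, labels, 1.0))  # a worse split
```

A tree with splitter='best' picks, at each node, the candidate split with the lowest weighted impurity.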

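4 Model evaluation

First, use predict() to obtain the predictions of the decision tree model trained in the previous section on the test set, then evaluate the model with functions from sklearn.metrics.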

from sklearn import metrics
y_predict = clf.predict(X_test)
print(metrics.classification_report(y_test, y_predict))
print(metrics.confusion_matrix(y_test, y_predict))

             precision    recall  f1-score   support

          0       0.92      0.88      0.90        40
          1       0.81      0.88      0.85        25

avg / total       0.88      0.88      0.88        65

[[35  5]
 [ 3 22]]

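In the confusion matrix above, the diagonal entries are the numbers of correct predictions, and their sum is the total number of correctly classified samples.

We can now use this to compute the model's accuracy on the test set.

print("Accuracy:", metrics.accuracy_score(y_test, y_predict))

The trained model therefore reaches an accuracy of 87.69% on the test samples, which make up 20% of the original dataset.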
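
The figures in the report can be re-derived by hand from the confusion matrix printed above (rows are true classes, columns are predicted classes). The sketch below recomputes accuracy, per-class precision and recall, and the support-weighted average precision.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[35, 5],
               [3, 22]], dtype=float)

accuracy = np.trace(cm) / cm.sum()
precision = np.diag(cm) / cm.sum(axis=0)  # correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / actually in that class
support = cm.sum(axis=1)
weighted_precision = np.sum(precision * support) / support.sum()

print(round(accuracy, 4))
print(np.round(precision, 2), np.round(recall, 2))
print(round(weighted_precision, 2))
```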

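5 Comparing models trained on data before and after dimensionality reduction

To check whether the model trained on the reduced data still performs well relative to a model trained on the original data, we train a model on the data without dimensionality reduction: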

# X is the standardized data with all 34 features (no PCA applied).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
clf2 = tree.DecisionTreeClassifier()
clf2.fit(X_train, y_train)
y_predict = clf2.predict(X_test)
print(metrics.classification_report(y_test, y_predict))
print(metrics.confusion_matrix(y_test, y_predict))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        40
          1       1.00      1.00      1.00        25

avg / total       1.00      1.00      1.00        65

[[40  0]
 [ 0 25]]

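In the confusion matrix above, the diagonal entries are the numbers of correct predictions, and their sum is the total number of correctly classified samples. We can now use this to compute the model's accuracy on the test set.

print("Accuracy:", metrics.accuracy_score(y_test, y_predict))

Accuracy: 1.0

The trained model therefore reaches an accuracy of 100% on the test samples, which make up 20% of the original dataset.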

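Comparing the two, the average precision after dimensionality reduction is 0.88, which is 0.12 lower than before; recall is 0.88, 0.12 lower; the f1-score is 0.88, 0.12 lower; and accuracy is 0.88, 0.12 lower. The model trained on the reduced data therefore gives up about 12 percentage points on every metric for this split, but it still classifies roughly 88% of the test samples correctly and so retains fairly high performance.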

For reference, the first five rows of the dataset as displayed by df.head(5) in section 2:

	age	gender	city	asbestos exposure	duration of asbestos exposure	diagnosis method	keep side	cytology	duration of symptoms	...	pleural lactic dehydrogenise	pleural protein	pleural albumin	pleural glucose	dead or not	pleural effusion	pleural thickness on tomography	pleural level of acidity (pH)	C-reactive protein (CRP)	class of diagnosis
0	47	1	0	1	20	1	0	1	24.0	...	289	0.0	0.00	79	1	0	0	0	34	0
1	55	1	0	1	45	1	0	0	1.0	...	7541	1.6	0.80	6	1	1	1	1	42	0
2	29	1	1	1	23	0	1	0	1.0	...	480	0.0	0.00	90	1	0	0	0	43	1
3	39	1	0	1	10	1	0	0	3.0	...	459	5.0	2.80	45	1	1	0	0	21	0
4	47	1	0	1	10	1	1	1	1.5	...	213	3.6	1.95	53	1	1	0	0	11	0