Businesses use a variety of marketing channels to promote their products and services, such as telemarketing, television advertising, and print media. The goal is to reach potential customers more precisely and effectively, and thereby increase revenue and returns.

The dataset used in this case study describes a telemarketing campaign run by a Portuguese bank to promote term-deposit products. It contains 41188 samples, each with 21 features; the last feature records whether the client subscribed to the product. The remaining 20 features fall roughly into three groups: basic client information, campaign information, and social and economic context. Basic client information includes age, job, marital status, education, housing loan, and personal loan. Campaign information includes the contact type, the number of contacts, and the outcome of the previous campaign. Social and economic context includes the employment variation rate, the consumer price index, and the consumer confidence index.

We apply feature selection methods (two embedded approaches: regularized models and tree-based models) to extract valuable features from the original dataset, and then build classification models with algorithms such as LinearSVC.

1 Data Exploration

The data for this case study comes from a Portuguese bank's term-deposit telemarketing campaign (see the dataset's official page for details). It collects 20 input features covering basic client information, campaign information, and social and economic context, plus a target feature indicating whether the client subscribed to the product.

Below is a description of each feature:

In [4]:
%config InlineBackend.figure_format='retina'
# Load the basic packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the text file that describes the features
df_intro=pd.read_table("./input/bank_introduction.txt",names=["intro"],sep="\n")

# Print line by line for easier reading
for i in range(len(df_intro)):
    print(df_intro["intro"][i])
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone') 
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric) 
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')

Basic client information includes age, job, marital status, education, housing loan, and personal loan; campaign information includes the contact type, the number of contacts, and the outcome of the previous campaign; social and economic context includes the employment variation rate, the consumer price index, and the consumer confidence index.

Read the data:

In [5]:
# Show all columns when displaying data frames
pd.set_option('display.max_columns',None)
df=pd.read_table("./input/bank-additional-full.csv",sep=";")
data=df.copy()
data.head()
Out[5]:
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no no telephone may mon 261 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
1 57 services married high.school unknown no no telephone may mon 149 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
2 37 services married high.school no yes no telephone may mon 226 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
3 40 admin. married basic.6y no no no telephone may mon 151 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
4 56 services married high.school no no yes telephone may mon 307 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no

Examine the basic information of the dataset:

In [6]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usage: 6.6+ MB

Observations:
1) The dataset has shape (41188, 21);
2) The summary shows no null entries, but many object-type features use "unknown" to mark missing values, so missing-value handling is still needed;
3) 11 features are of object type and need to be converted to numeric form.

In [7]:
data.describe()
Out[7]:
age duration campaign pdays previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
count 41188.00000 41188.000000 41188.000000 41188.000000 41188.000000 41188.000000 41188.000000 41188.000000 41188.000000 41188.000000
mean 40.02406 258.285010 2.567593 962.475454 0.172963 0.081886 93.575664 -40.502600 3.621291 5167.035911
std 10.42125 259.279249 2.770014 186.910907 0.494901 1.570960 0.578840 4.628198 1.734447 72.251528
min 17.00000 0.000000 1.000000 0.000000 0.000000 -3.400000 92.201000 -50.800000 0.634000 4963.600000
25% 32.00000 102.000000 1.000000 999.000000 0.000000 -1.800000 93.075000 -42.700000 1.344000 5099.100000
50% 38.00000 180.000000 2.000000 999.000000 0.000000 1.100000 93.749000 -41.800000 4.857000 5191.000000
75% 47.00000 319.000000 3.000000 999.000000 0.000000 1.400000 93.994000 -36.400000 4.961000 5228.100000
max 98.00000 4918.000000 56.000000 999.000000 7.000000 1.400000 94.767000 -26.900000 5.045000 5228.100000

The summary statistics show that the features differ by orders of magnitude, so the data needs to be rescaled.

2 Data Preprocessing

Missing value handling

First find all object-type features and count their missing ("unknown") values.

In [8]:
# Find the object-type features
col=data.columns
object_list=[]
for i in range(len(col)):
    if type(data[col[i]][0])!=np.int64 and type(data[col[i]][0])!=np.float64:
        object_list.append(col[i])

# Count the 'unknown' (missing) values of each object-type feature
missing_list=[]
for item in object_list:
    count=0
    for i in range(len(data)):
        if data[item][i]=="unknown":
            count += 1
    if count!=0:
        missing_list.append(item)
    print "The number of missing data about %s :"  %item, count 
print "The features of object type: "
print object_list
print "The features having missing data:"
print missing_list
The number of missing data about job : 330
The number of missing data about marital : 80
The number of missing data about education : 1731
The number of missing data about default : 8597
The number of missing data about housing : 990
The number of missing data about loan : 990
The number of missing data about contact : 0
The number of missing data about month : 0
The number of missing data about day_of_week : 0
The number of missing data about poutcome : 0
The number of missing data about y : 0
The features of object type: 
['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'y']
The features having missing data:
['job', 'marital', 'education', 'default', 'housing', 'loan']
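
The same scan can be written much more compactly with pandas built-ins. A minimal sketch, equivalent in intent to the loop above (it assumes the same data frame and that missing values are coded as the string "unknown"):

# Select object-typed columns directly and count 'unknown' entries per column
object_cols = data.select_dtypes(include=["object"]).columns.tolist()
unknown_counts = (data[object_cols] == "unknown").sum()
print(unknown_counts)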

Except for default and education, the features have only a small number of missing values, so the affected samples can simply be dropped; for default and education, the missing values are filled with the mode.

In [9]:
# Drop samples with missing values in job, marital, housing and loan
new_data=data[data["job"]!="unknown"][data["marital"]!="unknown"][data["housing"]!="unknown"][data["loan"]!="unknown"]
new_data.index=range(len(new_data))

# Fill the larger numbers of missing values with the mode
new_data["default"][new_data["default"]=="unknown"]=new_data["default"].value_counts().index[0]
new_data["education"][new_data["education"]=="unknown"]=new_data["education"].value_counts().index[0]
/explorer/pyenv/jupyter/lib/python2.7/site-packages/ipykernel/__main__.py:2: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  from ipykernel import kernelapp as app
/explorer/pyenv/jupyter/lib/python2.7/site-packages/ipykernel/__main__.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/explorer/pyenv/jupyter/lib/python2.7/site-packages/ipykernel/__main__.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
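
The chained indexing above is what triggers the warnings shown; a cleaner equivalent (a sketch that produces the same result here) combines the conditions into one boolean mask and uses .loc for the assignments:

# Drop rows where job/marital/housing/loan are 'unknown', then fill default/education with the mode
mask = ((data["job"] != "unknown") & (data["marital"] != "unknown")
        & (data["housing"] != "unknown") & (data["loan"] != "unknown"))
new_data = data[mask].reset_index(drop=True)
for colname in ["default", "education"]:
    mode_value = new_data[colname][new_data[colname] != "unknown"].mode()[0]
    new_data.loc[new_data[colname] == "unknown", colname] = mode_value
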
In [10]:
# Check the distributions after imputation
print " This is the new distribution of default:"
print new_data["default"].value_counts()
print " This is the new distribution of education:"
print new_data["education"].value_counts()
 This is the new distribution of default:
no     39800
yes        3
Name: default, dtype: int64
 This is the new distribution of education:
university.degree      13379
high.school             9244
basic.9y                5856
professional.course     5100
basic.4y                4002
basic.6y                2204
illiterate                18
Name: education, dtype: int64

Numeric encoding

In [11]:
# Represent the object-type features as one-hot (dummy) variables
object_list.remove("y")
for item in object_list:
    dummies=pd.get_dummies(new_data[item],prefix=item)
    new_data=pd.concat([new_data,dummies],axis=1)
    del new_data[item]
    
# Encode the target feature as 0/1
new_data["label"]=0
new_data["label"][new_data["y"]=="yes"]=1
new_data["label"][new_data["y"]=="no"]=0
del new_data["y"] 
/explorer/pyenv/jupyter/lib/python2.7/site-packages/ipykernel/__main__.py:10: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/explorer/pyenv/jupyter/lib/python2.7/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
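
The conditional assignments to label above are also what raise the SettingWithCopyWarning messages; mapping the target in one step avoids them. A sketch of the same step (same result, assuming y still holds the 'yes'/'no' strings):

# Encode the target as 0/1 in a single vectorized step
new_data["label"] = (new_data["y"] == "yes").astype(int)
del new_data["y"]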

Visualization

In [12]:
# Compute the correlation matrix
corr = new_data.corr()

# Mask the upper triangle so that only the lower triangle is drawn
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Draw the correlation heatmap
f, ax = plt.subplots(figsize=(10, 10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.0,
            square=True, xticklabels=2, yticklabels=2,
            linewidths=.3, cbar_kws={"shrink": .5}, ax=ax)
plt.show()
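
The heatmap can also be read off numerically. A small sketch listing the features most correlated (in absolute value) with the target, assuming a pandas version that provides Series.sort_values:

# Features most correlated with the label, strongest first
print(corr["label"].abs().sort_values(ascending=False).head(10))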

The bottom row of the heatmap corresponds to the target feature. Quite a few features are correlated with it, and the darkest cell belongs to duration, so we look at that relationship more closely:

In [13]:
f, ax1 = plt.subplots(1, 1, figsize=(6,4))
sns.boxplot(x="label",y='duration',data=new_data,ax=ax1)
plt.show()

The boxplot confirms that duration has a strong influence on the target, but note the feature description:

duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

In other words, duration is not available before a call is made, so we must drop this feature when building the model.

0-1 (min-max) scaling

In [14]:
from sklearn import preprocessing as prep
x=new_data.copy()
del x["label"]
del x["duration"]
y=new_data["label"]
minmax_scale=prep.MinMaxScaler().fit(x[x.columns])
x[x.columns]=minmax_scale.transform(x[x.columns])
x.head()
Out[14]:
age campaign pdays previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed job_admin. job_blue-collar job_entrepreneur job_housemaid job_management job_retired job_self-employed job_services job_student job_technician job_unemployed marital_divorced marital_married marital_single education_basic.4y education_basic.6y education_basic.9y education_high.school education_illiterate education_professional.course education_university.degree default_no default_yes housing_no housing_yes loan_no loan_yes contact_cellular contact_telephone month_apr month_aug month_dec month_jul month_jun month_mar month_may month_nov month_oct month_sep day_of_week_fri day_of_week_mon day_of_week_thu day_of_week_tue day_of_week_wed poutcome_failure poutcome_nonexistent poutcome_success
0 0.481481 0.0 1.0 0.0 0.9375 0.698753 0.60251 0.957379 0.859735 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.493827 0.0 1.0 0.0 0.9375 0.698753 0.60251 0.957379 0.859735 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0.246914 0.0 1.0 0.0 0.9375 0.698753 0.60251 0.957379 0.859735 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
3 0.283951 0.0 1.0 0.0 0.9375 0.698753 0.60251 0.957379 0.859735 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
4 0.481481 0.0 1.0 0.0 0.9375 0.698753 0.60251 0.957379 0.859735 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
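
MinMaxScaler rescales every column independently to the [0, 1] interval as (value - min) / (max - min). A quick manual check on the age column (a sketch using the x and new_data defined above):

# Recompute the scaled age by hand and compare with the MinMaxScaler output
age_raw = new_data["age"]
age_manual = (age_raw - age_raw.min()) / float(age_raw.max() - age_raw.min())
print(np.allclose(age_manual.values, x["age"].values))  # expected: True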

3 Feature Selection and Model Building

The goal of this case study is to find the key features that drive product subscription and to build a predictive model, so wrapper and embedded feature selection methods are the most appropriate choices. Here we try two embedded methods: regularized models and tree-based models. There are 56 features in total; to reduce their number effectively we use L1 regularization, since in this case predictive performance matters more than stability. For the tree-based approach we use the ExtraTrees model, an ensemble similar to a random forest, again chosen to improve predictive performance; as the split criterion we try both Gini impurity and entropy.

First split the data randomly into training and test sets in a 7:3 ratio:

In [15]:
# Randomly split into training and test sets
from sklearn.cross_validation import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3, random_state=0)

We define the three helper functions we need:
1) evaluate(pred, test_y) evaluates the classification results;
2) find_name(new_feature, df_feature) prints the names of the selected key features;
3) FS_importance(arr_importance, col, N) selects features according to their importance scores.

In [16]:
from sklearn import metrics
from sklearn.metrics import classification_report 

"""
函数evaluate(pred,test_y)用来对分类结果进行评价;
输入:真实的分类、预测的分类结果
输出:分类的准确率、混淆矩阵等
"""
def evaluate(pred,test_y):
    
    # 输出分类的准确率
    print("Accuracy: %.4f"  % (metrics.accuracy_score(test_y,pred)))
    
    # 输出衡量分类效果的各项指标
    print(classification_report(test_y, pred)) 
    
    # 更直观的,我们通过seaborn画出混淆矩阵
    %matplotlib inline
    plt.figure(figsize=(6,4))
    colorMetrics = metrics.confusion_matrix(test_y,pred)
    
    # 坐标y代表test_y,即真实的类别,坐标x代表估计出的类别pred
    sns.heatmap(colorMetrics,annot=True,fmt='d',xticklabels=[0,1],yticklabels=[0,1])
    sns.plt.show()
    
"""
函数find_name(new_feature, df_feature)用来输出关键特征的名称;
输入:原始特征、选择的关键特征;
输出:关键特征的名称
"""
def find_name(new_feature, df_feature):
    
    # 定义列表存储关键特征名称
    feature_name=[]
    col=df_feature.columns
    
    # 寻找关键特征的名称信息
    for i in range(int(new_feature.shape[0])):
        for j in range(df_feature.shape[1]):
            
            # 判别标准为new_feature中的特征向量与df_feature中的特征向量一致
            if np.mean(abs(new_feature[i]-df_feature[col[j]]))==0:
                feature_name.append(col[j])
                print i+1,col[j] 
                break
    return feature_name

"""
函数FS_importance(arr_importance, col, N)用来根据特征重要性选择特征
输入:特征重要性、特征名称集、选择的特征个数
输出:根据特征重要性求出来的前N个特征
"""
def FS_importance(arr_importance, col, N):
        
    # 字典存储    
    dict_order=dict(zip(col, arr_importance))
    
    # 按特征重要性大小排序
    new_feature=sorted(dict_order.iteritems(),key=lambda item:item[1], reverse=True)
    feature_top, score=zip(*new_feature)
    return list(feature_top[:N])
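
A quick sanity check of FS_importance on a toy input (a sketch):

# Importances 0.5 > 0.3 > 0.2, so the top-2 features should be ['b', 'a']
print(FS_importance(np.array([0.3, 0.5, 0.2]), ["a", "b", "c"], 2))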

3.1 Regularized Models

This section has three parts:
1) Feature selection: L1 regularization is used for feature selection, and we try different values of the penalty parameter C; model quality is measured by the mean accuracy of 5-fold cross-validation on the training set, and the best C is chosen from a comparison plot;
2) Model on the feature subset: a model is built on the best subset found in the feature selection step;
3) Model on the full feature set: a model is built on the original full feature set and compared with the model from part 2 to see how feature selection affects performance.

Feature selection

We examine how the choice of penalty parameter affects feature selection, validating each model with 5-fold cross-validation on the training set; the baseline is the mean cross-validated accuracy obtained with the full feature set on the training set.

In [17]:
# Modules for cross-validation and feature selection
from sklearn import cross_validation
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

# tqdm only adds a progress bar; fall back to a plain pass-through if it is not installed
try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable):
        return iterable

# Baseline LinearSVC (default L2 penalty) evaluated on the full feature set
lsvc = LinearSVC(C=1e-5)

# Mean cross-validated accuracy with the full feature set
score=cross_validation.cross_val_score(lsvc, train_x, train_y, cv=5)

# Examine how the penalty parameter affects feature selection
# xx is the set of candidate values of C
acc_subset=[]
xx=np.array(range(5,205,5))*0.01
for i in tqdm(xx):
    
    # LinearSVC with L1 regularization, used for feature selection
    lsvc_l1 = LinearSVC(C=i, penalty="l1", dual=False)
    
    # Mean cross-validated accuracy with the corresponding feature subset
    scores=cross_validation.cross_val_score(lsvc_l1, train_x, train_y, cv=5)
    acc_subset.append(scores.mean())
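
To see how many features survive the L1 selection at a given penalty, one can count the nonzero coefficients of a fitted model. A sketch at a single illustrative value of C:

# Fit one L1-penalized model and count the features with nonzero coefficients
lsvc_tmp = LinearSVC(C=1.0, penalty="l1", dual=False).fit(train_x, train_y)
n_kept = int((np.abs(lsvc_tmp.coef_) > 1e-6).sum())
print("Features kept at C=1.0: %d of %d" % (n_kept, train_x.shape[1]))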

Plot the effect of different penalty parameters on feature selection:

In [ ]:
# yy is the mean accuracy of the feature subset for each value of the parameter
# y0 is the mean accuracy of the full feature set, used as the baseline
yy=acc_subset
y0=[score.mean()]*len(xx)

# Compare the full feature set with the feature subsets
fig=plt.figure(figsize=(18,6))
ax1=fig.add_subplot(1,2,1)
plt.plot(xx,yy,"bo-")
plt.plot(xx,y0,"r-")
ax1.set_xlabel("The value of the parameter C")
ax1.set_ylabel("Mean accuracy on training set by CV")
ax1.set_title("Comparison between feature universal set and subset")
ax1.set_ylim([score.mean()-0.01,max(yy)+0.01])
ax1.set_xlim([min(xx),max(xx)])

# Zoom in to compare the different feature subsets
ax2=fig.add_subplot(1,2,2)
plt.plot(xx,yy,"bo-")
plt.plot(xx,y0,"r-")
ax2.set_xlabel("The value of the parameter C")
ax2.set_ylabel("Mean accuracy on training set by CV")
ax2.set_title("Comparison among different parameters")
ax2.set_ylim([min(yy)-0.001,max(yy)+0.001])
ax2.set_xlim([min(xx),max(xx)])
plt.show()

In both plots the x axis is the penalty parameter and the y axis is the mean cross-validated accuracy. In the left plot, the blue line shows the accuracy of the models built on the feature subsets obtained with different penalty parameters, and the red line is the baseline, i.e. the accuracy of the same model on the full feature set. The blue line lies entirely above the red line, which shows that L1-based feature selection clearly improves the model. To compare the different subsets, the right plot zooms in on the blue line: the model performs best when the penalty parameter is 1.7, so we use L1 regularization with C=1.7 for feature selection.

Model on the feature subset

Build a LinearSVC model on the best subset found above, and print the names of the features selected by L1 regularization.

In [ ]:
# Fit the L1-regularized model
lsvc_l1 = LinearSVC(C=1.7, penalty="l1", dual=False).fit(train_x, train_y)
pred=lsvc_l1.predict(test_x)

# Print the names of the features selected by L1 regularization
model = SelectFromModel(lsvc_l1,prefit=True)
new_train_x = model.transform(train_x).T
FS_result=find_name(new_train_x, train_x)

A detailed evaluation of the model:

In [ ]:
evaluate(pred,test_y)

Comparison: model on the full feature set

Build the same LinearSVC model on the full feature set:

In [ ]:
# Baseline LinearSVC (default L2 penalty) on the full feature set
lsvc = LinearSVC(C=1e-5)
lsvc.fit(train_x,train_y)
pred=lsvc.predict(test_x)

A detailed evaluation of the model:

In [ ]:
evaluate(pred,test_y)

A more direct comparison of the models built on the feature subset and on the full feature set:

In [ ]:
index=["accuracy","precision","recall","f1-score"]
data=np.array([[0.8906,0.9043],[0.79,0.89],[0.89,0.90],[0.84,0.88]])
df=pd.DataFrame(data,columns=["universal set","subset"],index=index)
df.plot(kind="bar",figsize=(6,4))
plt.ylim([0.75,0.95])
plt.show()

The model built on the feature subset performs better on every metric, which shows that feature selection before modeling is well worth doing.

3.2 Tree-Based Models

This section has three parts:
1) Feature selection: both Gini impurity and entropy are tried as the importance criterion, and we compare the effect of keeping different numbers of features; model quality is measured by the mean accuracy of 5-fold cross-validation on the training set, and the best number of features is chosen from a comparison plot;
2) Model on the feature subset: a model is built on the best subset found in the feature selection step;
3) Model on the full feature set: a model is built on the original full feature set and compared with the model from part 2 to see how feature selection affects performance.

Feature selection

Using Gini impurity and then entropy as the selection criterion, we examine how the number of selected features affects the results, validating each model with 5-fold cross-validation on the training set; the baseline is the mean cross-validated accuracy obtained with the full feature set on the training set.

In [ ]:
# ExtraTrees model with Gini impurity as the split criterion
from sklearn.ensemble import ExtraTreesClassifier
clf_gini = ExtraTreesClassifier(criterion='gini')

# Fit on the training set to obtain the feature importances
clf_gini = clf_gini.fit(train_x, train_y)
importance_gini=clf_gini.feature_importances_ 

# Mean cross-validated accuracy with the full feature set
score_gini=cross_validation.cross_val_score(clf_gini, train_x, train_y, cv=5)

# ExtraTrees model with entropy as the split criterion
clf_entropy = ExtraTreesClassifier(criterion='entropy')

# Fit on the training set to obtain the feature importances
clf_entropy = clf_entropy.fit(train_x, train_y)
importance_entropy=clf_entropy.feature_importances_ 

# Mean cross-validated accuracy with the full feature set
score_entropy=cross_validation.cross_val_score(clf_entropy, train_x, train_y, cv=5)
In [ ]:
# Gini impurity as the criterion
# Examine how the number of selected features affects the results
# xx is the set of candidate feature counts
acc_subset_gini=[]
xx=np.array(range(1,56,1))
for i in tqdm(xx):
    
    # ExtraTrees model
    clf_gini = ExtraTreesClassifier(criterion='gini')
    
    # Select features according to their importance
    feature=FS_importance(importance_gini, train_x.columns, i)
    new_train_x=train_x[feature]
    
    # Mean cross-validated accuracy with the feature subset
    scores=cross_validation.cross_val_score(clf_gini, new_train_x, train_y, cv=5)
    acc_subset_gini.append(scores.mean())
In [ ]:
# Entropy as the criterion
# Examine how the number of selected features affects the results
# xx is the set of candidate feature counts
acc_subset_entropy=[]
xx=np.array(range(1,56,1))
for i in tqdm(xx):
    
    # ExtraTrees model
    clf_entropy = ExtraTreesClassifier(criterion='entropy')
    
    # Select features according to their importance
    feature=FS_importance(importance_entropy, train_x.columns, i)
    new_train_x=train_x[feature]
    
    # Mean cross-validated accuracy with the feature subset
    scores=cross_validation.cross_val_score(clf_entropy, new_train_x, train_y, cv=5)
    acc_subset_entropy.append(scores.mean())

Plot the effect of the number of selected features:

In [ ]:
# Gini impurity as the criterion
# yy is the mean accuracy of the feature subset for each feature count
# y0 is the mean accuracy of the full feature set, used as the baseline
yy=acc_subset_gini
y0=[score_gini.mean()]*len(xx)

# Compare the full feature set with the feature subsets
fig=plt.figure(figsize=(18,6))
ax1=fig.add_subplot(1,2,1)
plt.plot(xx,yy,"bo-")
plt.plot(xx,y0,"r-")
ax1.set_xlabel("The number of features")
ax1.set_ylabel("Mean accuracy on training set by CV")
ax1.set_title("Comparison among feature numbers by Gini")
ax1.set_ylim([min(yy)-0.001,max(yy)+0.001])
ax1.set_xlim([min(xx),max(xx)])

# yy is the mean accuracy of the feature subset for each feature count (entropy criterion)
# y0 is the mean accuracy of the full feature set, used as the baseline
yy=acc_subset_entropy
y0=[score_entropy.mean()]*len(xx)

# Compare the different feature subsets (entropy criterion)
ax2=fig.add_subplot(1,2,2)
plt.plot(xx,yy,"bo-")
plt.plot(xx,y0,"r-")
ax2.set_xlabel("The number of features")
ax2.set_ylabel("Mean accuracy on training set by CV")
ax2.set_title("Comparison among feature numbers by Entropy")
ax2.set_ylim([min(yy)-0.001,max(yy)+0.001])
ax2.set_xlim([min(xx),max(xx)])
plt.show()

In both plots the x axis is the number of selected features and the y axis is the mean cross-validated accuracy; the blue line shows the accuracy of the models built on subsets of different sizes, and the red line is the baseline obtained with the full feature set. The left plot uses Gini impurity as the selection criterion and the right plot uses entropy. The number of selected features has a large effect: with some subsets (for example 2, 5, 6 or 55 features) the model beats the baseline, while with others (for example 14, 15 or 16 features) it does noticeably worse. The plots suggest that a subset of 2 features gives the best accuracy, but experiments show that with too few features the accuracy is indeed high while the F1 score becomes very low, so we take the 6-feature subset obtained with the entropy criterion as the candidate optimal subset. This is a reminder that for classification problems accuracy alone is not enough; other metrics matter as well, and the choice of a feature subset should weigh several of them.
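
The point about accuracy versus F1 can be checked directly by rerunning the cross-validation with an F1 scorer. A sketch comparing a 2-feature subset with the 6-feature subset (same training split as above):

# Compare mean cross-validated F1 for the 2-feature and 6-feature entropy subsets
for n in [2, 6]:
    feats = FS_importance(importance_entropy, train_x.columns, n)
    clf_tmp = ExtraTreesClassifier(criterion='entropy')
    f1_scores = cross_validation.cross_val_score(clf_tmp, train_x[feats], train_y, cv=5, scoring="f1")
    print("n_features=%d, mean F1=%.4f" % (n, f1_scores.mean()))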

The 6 most important features are:

In [ ]:
print "The most important 6 features computed by mini:"
print(FS_importance(importance_gini, train_x.columns, 6))
print "The most important 6 features computed by entropy:"
print(FS_importance(importance_entropy, train_x.columns, 6))

Model on the feature subset

Build an ExtraTrees model on the best subset found above, i.e. the features selected with the entropy criterion.

In [ ]:
# ExtraTrees model with entropy as the split criterion
clf_entropy = ExtraTreesClassifier(criterion='entropy')
    
# Select the top 6 features according to their importance
feature=FS_importance(importance_entropy, train_x.columns, 6)
new_train_x=train_x[feature]
new_test_x=test_x[feature]

clf_entropy = clf_entropy.fit(new_train_x, train_y) 
pred=clf_entropy.predict(new_test_x)

A detailed evaluation of the model:

In [ ]:
evaluate(pred,test_y)

Comparison: model on the full feature set

Build the same ExtraTrees model on the full feature set:

In [ ]:
from sklearn.ensemble import ExtraTreesClassifier
clf_entropy = ExtraTreesClassifier(criterion='entropy')
clf_entropy = clf_entropy.fit(train_x, train_y) 
pred=clf_entropy.predict(test_x)

A detailed evaluation of the model:

In [ ]:
evaluate(pred,test_y)

A more direct comparison of the models built on the feature subset and on the full feature set:

In [ ]:
index=["accuracy","precision","recall","f1-score"]
data=np.array([[0.8789,0.8886],[0.86,0.87],[0.88,0.89],[0.87,0.88]])
df=pd.DataFrame(data,columns=["universal set","subset"],index=index)
df.plot(kind="bar",figsize=(6,4))
plt.ylim([0.85,0.91])
plt.show()

The model built on the feature subset again performs better on every metric, which confirms that feature selection before modeling is well worth doing.

4 Feature Understanding and Evaluation

We tried three feature selection criteria in total: L1 regularization, Gini impurity and entropy. All three rank age and campaign among the features with the greatest influence on whether a client subscribes, so we take a closer look at how these two features affect the target.

In [ ]:
# Take a closer look at the relationship between age and label
sns.jointplot(x='age',y='label',data=new_data,kind='reg',x_estimator= np.mean,order=2)
plt.show()

Because label is a discrete variable, its mean is plotted instead. The plot shows a quadratic relationship between age and label: middle-aged clients are the least likely to subscribe, while younger and older clients subscribe at noticeably higher rates. This finding can help the bank target its telemarketing more precisely.
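
To quantify this U-shape, one can bin age and compute the subscription rate per band. A rough sketch on the unscaled new_data (the bin edges are illustrative):

# Subscription rate by age band
age_band = pd.cut(new_data["age"], bins=[16, 25, 35, 45, 55, 65, 100])
print(new_data.groupby(age_band)["label"].mean())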

In [ ]:
# Take a closer look at the relationship between campaign and label
f, ax1 = plt.subplots(1, 1, sharex=True, figsize=(6, 4))
c1, c2 = sns.color_palette('Set1', 2)
sns.kdeplot(new_data['campaign'][new_data["label"]==1], shade=True, color=c1, label='Yes',ax=ax1)
sns.kdeplot(new_data['campaign'][new_data["label"]==0], shade=True, color=c2, label='No', ax=ax1)
plt.show()

The feature campaign is the number of contacts made with the client during this campaign. The red density is the distribution of campaign for clients who subscribed, and the blue density is the distribution for those who did not. The red distribution is shifted towards 0 and concentrated on smaller values, which means clients who are contacted fewer times are more likely to subscribe; for the bank, this suggests that agents should avoid contacting the same client too frequently.
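
The same effect can be summarized as a subscription rate per number of contacts; a short sketch:

# Subscription rate by number of contacts during this campaign (first 10 values)
rate_by_campaign = new_data.groupby("campaign")["label"].mean()
print(rate_by_campaign.head(10))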