Businesses use a variety of marketing channels to promote their products and services, such as telemarketing, television advertising, and print media. The goal is to reach potential customers more precisely and effectively, and thereby increase revenue and returns.

The dataset used in this case study describes a telemarketing campaign run by a Portuguese bank to promote term-deposit products. It contains 41188 samples, each with 21 features; the last feature records whether the client subscribed to the product. The remaining 20 features fall roughly into three groups: basic client information, campaign information, and social and economic context. Basic client information includes age, job, marital status, education, housing loan, and personal loan. Campaign information includes the contact type, the number of contacts, and the outcome of the previous campaign. Social and economic context includes the employment variation rate, the consumer price index, and the consumer confidence index.

We apply feature selection methods (two embedded approaches: regularized models and tree-based models) to extract valuable features from the original dataset, and then build classification models with algorithms such as LinearSVC.

1 Data Exploration

The data for this case study comes from a Portuguese bank's term-deposit telemarketing campaign (see the dataset's official page for details). It collects 20 input features covering basic client information, campaign information, and social and economic context, plus a target feature indicating whether the client subscribed to the product.

Below is a description of each feature:

In [4]:
%config InlineBackend.figure_format='retina'
# Load the basic packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the text file that describes the features
df_intro=pd.read_table("./input/bank_introduction.txt",names=["intro"],sep="\n")

# Print line by line for easier reading
for i in range(len(df_intro)):
    print(df_intro["intro"][i])
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone') 
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric) 
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')

Basic client information includes age, job, marital status, education, housing loan, and personal loan; campaign information includes the contact type, the number of contacts, and the outcome of the previous campaign; social and economic context includes the employment variation rate, the consumer price index, and the consumer confidence index.

Read the data:

In [5]:
# Show all columns when displaying data frames
pd.set_option('display.max_columns',None)
df=pd.read_table("./input/bank-additional-full.csv",sep=";")
data=df.copy()
data.head()
Out[5]:
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no no telephone may mon 261 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
1 57 services married high.school unknown no no telephone may mon 149 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
2 37 services married high.school no yes no telephone may mon 226 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
3 40 admin. married basic.6y no no no telephone may mon 151 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
4 56 services married high.school no no yes telephone may mon 307 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no

Examine the basic information of the dataset:

In [6]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usage: 6.6+ MB

Observations:
1) The dataset has shape (41188, 21);
2) The summary shows no null entries, but many object-type features use "unknown" to mark missing values, so missing-value handling is still needed;
3) 11 features are of object type and need to be converted to numeric form.

In [7]:
data.describe()
Out[7]:
age duration campaign pdays previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
count 41188.00000 41188.000000 41188.000000 41188.000000 41188.000000 41188.000000 41188.000000 41188.000000 41188.000000 41188.000000
mean 40.02406 258.285010 2.567593 962.475454 0.172963 0.081886 93.575664 -40.502600 3.621291 5167.035911
std 10.42125 259.279249 2.770014 186.910907 0.494901 1.570960 0.578840 4.628198 1.734447 72.251528
min 17.00000 0.000000 1.000000 0.000000 0.000000 -3.400000 92.201000 -50.800000 0.634000 4963.600000
25% 32.00000 102.000000 1.000000 999.000000 0.000000 -1.800000 93.075000 -42.700000 1.344000 5099.100000
50% 38.00000 180.000000 2.000000 999.000000 0.000000 1.100000 93.749000 -41.800000 4.857000 5191.000000
75% 47.00000 319.000000 3.000000 999.000000 0.000000 1.400000 93.994000 -36.400000 4.961000 5228.100000
max 98.00000 4918.000000 56.000000 999.000000 7.000000 1.400000 94.767000 -26.900000 5.045000 5228.100000

The summary statistics show that the features differ by orders of magnitude, so the data needs to be rescaled.

2 Data Preprocessing

Missing value handling

First find all object-type features and count their missing ("unknown") values.

In [8]:
# Find the object-type features
col=data.columns
object_list=[]
for i in range(len(col)):
    if type(data[col[i]][0])!=np.int64 and type(data[col[i]][0])!=np.float64:
        object_list.append(col[i])

# Count the 'unknown' (missing) values of each object-type feature
missing_list=[]
for item in object_list:
    count=0
    for i in range(len(data)):
        if data[item][i]=="unknown":
            count += 1
    if count!=0:
        missing_list.append(item)
    print "The number of missing data about %s :"  %item, count 
print "The features of object type: "
print object_list
print "The features having missing data:"
print missing_list
The number of missing data about job : 330
The number of missing data about marital : 80
The number of missing data about education : 1731
The number of missing data about default : 8597
The number of missing data about housing : 990
The number of missing data about loan : 990
The number of missing data about contact : 0
The number of missing data about month : 0
The number of missing data about day_of_week : 0
The number of missing data about poutcome : 0
The number of missing data about y : 0
The features of object type: 
['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'y']
The features having missing data:
['job', 'marital', 'education', 'default', 'housing', 'loan']
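
The same scan can be written much more compactly with pandas built-ins. A minimal sketch, equivalent in intent to the loop above (it assumes the same data frame and that missing values are coded as the string "unknown"):

# Select object-typed columns directly and count 'unknown' entries per column
object_cols = data.select_dtypes(include=["object"]).columns.tolist()
unknown_counts = (data[object_cols] == "unknown").sum()
print(unknown_counts)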

Except for default and education, the features have only a small number of missing values, so the affected samples can simply be dropped; for default and education, the missing values are filled with the mode.

In [9]:
# Drop samples with missing values in job, marital, housing and loan
new_data=data[data["job"]!="unknown"][data["marital"]!="unknown"][data["housing"]!="unknown"][data["loan"]!="unknown"]
new_data.index=range(len(new_data))

# Fill the larger numbers of missing values with the mode
new_data["default"][new_data["default"]=="unknown"]=new_data["default"].value_counts().index[0]
new_data["education"][new_data["education"]=="unknown"]=new_data["education"].value_counts().index[0]
/explorer/pyenv/jupyter/lib/python2.7/site-packages/ipykernel/__main__.py:2: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  from ipykernel import kernelapp as app
/explorer/pyenv/jupyter/lib/python2.7/site-packages/ipykernel/__main__.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/explorer/pyenv/jupyter/lib/python2.7/site-packages/ipykernel/__main__.py:7: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
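
The chained indexing above is what triggers the warnings shown; a cleaner equivalent (a sketch that produces the same result here) combines the conditions into one boolean mask and uses .loc for the assignments:

# Drop rows where job/marital/housing/loan are 'unknown', then fill default/education with the mode
mask = ((data["job"] != "unknown") & (data["marital"] != "unknown")
        & (data["housing"] != "unknown") & (data["loan"] != "unknown"))
new_data = data[mask].reset_index(drop=True)
for colname in ["default", "education"]:
    mode_value = new_data[colname][new_data[colname] != "unknown"].mode()[0]
    new_data.loc[new_data[colname] == "unknown", colname] = mode_value
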
In [10]:
# Check the distributions after imputation
print " This is the new distribution of default:"
print new_data["default"].value_counts()
print " This is the new distribution of education:"
print new_data["education"].value_counts()
 This is the new distribution of default:
no     39800
yes        3
Name: default, dtype: int64
 This is the new distribution of education:
university.degree      13379
high.school             9244
basic.9y                5856
professional.course     5100
basic.4y                4002
basic.6y                2204
illiterate                18
Name: education, dtype: int64

Numeric encoding

In [11]:
# Represent the object-type features as one-hot (dummy) variables
object_list.remove("y")
for item in object_list:
    dummies=pd.get_dummies(new_data[item],prefix=item)
    new_data=pd.concat([new_data,dummies],axis=1)
    del new_data[item]
    
# Encode the target feature as 0/1
new_data["label"]=0
new_data["label"][new_data["y"]=="yes"]=1
new_data["label"][new_data["y"]=="no"]=0
del new_data["y"] 
/explorer/pyenv/jupyter/lib/python2.7/site-packages/ipykernel/__main__.py:10: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/explorer/pyenv/jupyter/lib/python2.7/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
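
The conditional assignments to label above are also what raise the SettingWithCopyWarning messages; mapping the target in one step avoids them. A sketch of the same step (same result, assuming y still holds the 'yes'/'no' strings):

# Encode the target as 0/1 in a single vectorized step
new_data["label"] = (new_data["y"] == "yes").astype(int)
del new_data["y"]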

Visualization

In [12]:
# Compute the correlation matrix
corr = new_data.corr()

# Mask the upper triangle so that only the lower triangle is drawn
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Draw the correlation heatmap
f, ax = plt.subplots(figsize=(10, 10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.0,
            square=True, xticklabels=2, yticklabels=2,
            linewidths=.3, cbar_kws={"shrink": .5}, ax=ax)
plt.show()
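
The heatmap can also be read off numerically. A small sketch listing the features most correlated (in absolute value) with the target, assuming a pandas version that provides Series.sort_values:

# Features most correlated with the label, strongest first
print(corr["label"].abs().sort_values(ascending=False).head(10))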

The bottom row of the heatmap corresponds to the target feature. Quite a few features are correlated with it, and the darkest cell belongs to duration, so we look at that relationship more closely:

In [13]:
f, ax1 = plt.subplots(1, 1, figsize=(6,4))
sns.boxplot(x="label",y='duration',data=new_data,ax=ax1)
plt.show()

The boxplot confirms that duration has a strong influence on the target, but note the feature description:

duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

In other words, duration is not available before a call is made, so we must drop this feature when building the model.

0-1 (min-max) scaling

In [14]:
from sklearn import preprocessing as prep
x=new_data.copy()
del x["label"]
del x["duration"]
y=new_data["label"]
minmax_scale=prep.MinMaxScaler().fit(x[x.columns])
x[x.columns]=minmax_scale.transform(x[x.columns])
x.head()
Out[14]:
age campaign pdays previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed job_admin. job_blue-collar job_entrepreneur job_housemaid job_management job_retired job_self-employed job_services job_student job_technician job_unemployed marital_divorced marital_married marital_single education_basic.4y education_basic.6y education_basic.9y education_high.school education_illiterate education_professional.course education_university.degree default_no default_yes housing_no housing_yes loan_no loan_yes contact_cellular contact_telephone month_apr month_aug month_dec month_jul month_jun month_mar month_may month_nov month_oct month_sep day_of_week_fri day_of_week_mon day_of_week_thu day_of_week_tue day_of_week_wed poutcome_failure poutcome_nonexistent poutcome_success
0 0.481481 0.0 1.0 0.0 0.9375 0.698753 0.60251 0.957379 0.859735 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.493827 0.0 1.0 0.0 0.9375 0.698753 0.60251 0.957379 0.859735 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
2 0.246914 0.0 1.0 0.0 0.9375 0.698753 0.60251 0.957379 0.859735 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
3 0.283951 0.0 1.0 0.0 0.9375 0.698753 0.60251 0.957379 0.859735 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
4 0.481481 0.0 1.0 0.0 0.9375 0.698753 0.60251 0.957379 0.859735 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
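
MinMaxScaler rescales every column independently to the [0, 1] interval as (value - min) / (max - min). A quick manual check on the age column (a sketch using the x and new_data defined above):

# Recompute the scaled age by hand and compare with the MinMaxScaler output
age_raw = new_data["age"]
age_manual = (age_raw - age_raw.min()) / float(age_raw.max() - age_raw.min())
print(np.allclose(age_manual.values, x["age"].values))  # expected: True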

3 Feature Selection and Model Building

The goal of this case study is to find the key features that drive product subscription and to build a predictive model, so wrapper and embedded feature selection methods are the most appropriate choices. Here we try two embedded methods: regularized models and tree-based models. There are 56 features in total; to reduce their number effectively we use L1 regularization, since in this case predictive performance matters more than stability. For the tree-based approach we use the ExtraTrees model, an ensemble similar to a random forest, again chosen to improve predictive performance; as the split criterion we try both Gini impurity and entropy.

First split the data randomly into training and test sets in a 7:3 ratio:

In [15]:
# Randomly split into training and test sets
from sklearn.cross_validation import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3, random_state=0)

We define the three helper functions we need:
1) evaluate(pred, test_y) evaluates the classification results;
2) find_name(new_feature, df_feature) prints the names of the selected key features;
3) FS_importance(arr_importance, col, N) selects features according to their importance scores.

In [16]:
from sklearn import metrics
from sklearn.metrics import classification_report 

"""
函数evaluate(pred,test_y)用来对分类结果进行评价;
输入:真实的分类、预测的分类结果
输出:分类的准确率、混淆矩阵等
"""
def evaluate(pred,test_y):
    
    # 输出分类的准确率
    print("Accuracy: %.4f"  % (metrics.accuracy_score(test_y,pred)))
    
    # 输出衡量分类效果的各项指标
    print(classification_report(test_y, pred)) 
    
    # 更直观的,我们通过seaborn画出混淆矩阵
    %matplotlib inline
    plt.figure(figsize=(6,4))
    colorMetrics = metrics.confusion_matrix(test_y,pred)
    
    # 坐标y代表test_y,即真实的类别,坐标x代表估计出的类别pred
    sns.heatmap(colorMetrics,annot=True,fmt='d',xticklabels=[0,1],yticklabels=[0,1])
    sns.plt.show()
    
"""
函数find_name(new_feature, df_feature)用来输出关键特征的名称;
输入:原始特征、选择的关键特征;
输出:关键特征的名称
"""
def find_name(new_feature, df_feature):
    
    # 定义列表存储关键特征名称
    feature_name=[]
    col=df_feature.columns
    
    # 寻找关键特征的名称信息
    for i in range(int(new_feature.shape[0])):
        for j in range(df_feature.shape[1]):
            
            # 判别标准为new_feature中的特征向量与df_feature中的特征向量一致
            if np.mean(abs(new_feature[i]-df_feature[col[j]]))==0:
                feature_name.append(col[j])
                print i+1,col[j] 
                break
    return feature_name

"""
函数FS_importance(arr_importance, col, N)用来根据特征重要性选择特征
输入:特征重要性、特征名称集、选择的特征个数
输出:根据特征重要性求出来的前N个特征
"""
def FS_importance(arr_importance, col, N):
        
    # 字典存储    
    dict_order=dict(zip(col, arr_importance))
    
    # 按特征重要性大小排序
    new_feature=sorted(dict_order.iteritems(),key=lambda item:item[1], reverse=True)
    feature_top, score=zip(*new_feature)
    return list(feature_top[:N])
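
A quick sanity check of FS_importance on a toy input (a sketch):

# Importances 0.5 > 0.3 > 0.2, so the top-2 features should be ['b', 'a']
print(FS_importance(np.array([0.3, 0.5, 0.2]), ["a", "b", "c"], 2))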

3.1 Regularized Models

This section has three parts:
1) Feature selection: L1 regularization is used for feature selection, and we try different values of the penalty parameter C; model quality is measured by the mean accuracy of 5-fold cross-validation on the training set, and the best C is chosen from a comparison plot;
2) Model on the feature subset: a model is built on the best subset found in the feature selection step;
3) Model on the full feature set: a model is built on the original full feature set and compared with the model from part 2 to see how feature selection affects performance.

Feature selection

We examine how the choice of penalty parameter affects feature selection, validating each model with 5-fold cross-validation on the training set; the baseline is the mean cross-validated accuracy obtained with the full feature set on the training set.

In [17]:
# Modules for cross-validation and feature selection
from sklearn import cross_validation
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

# tqdm only adds a progress bar; fall back to a plain pass-through if it is not installed
try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable):
        return iterable

# Baseline LinearSVC (default L2 penalty) evaluated on the full feature set
lsvc = LinearSVC(C=1e-5)

# Mean cross-validated accuracy with the full feature set
score=cross_validation.cross_val_score(lsvc, train_x, train_y, cv=5)

# Examine how the penalty parameter affects feature selection
# xx is the set of candidate values of C
acc_subset=[]
xx=np.array(range(5,205,5))*0.01
for i in tqdm(xx):
    
    # LinearSVC with L1 regularization, used for feature selection
    lsvc_l1 = LinearSVC(C=i, penalty="l1", dual=False)
    
    # Mean cross-validated accuracy with the corresponding feature subset
    scores=cross_validation.cross_val_score(lsvc_l1, train_x, train_y, cv=5)
    acc_subset.append(scores.mean())
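
To see how many features survive the L1 selection at a given penalty, one can count the nonzero coefficients of a fitted model. A sketch at a single illustrative value of C:

# Fit one L1-penalized model and count the features with nonzero coefficients
lsvc_tmp = LinearSVC(C=1.0, penalty="l1", dual=False).fit(train_x, train_y)
n_kept = int((np.abs(lsvc_tmp.coef_) > 1e-6).sum())
print("Features kept at C=1.0: %d of %d" % (n_kept, train_x.shape[1]))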

Plot the effect of different penalty parameters on feature selection:

In [ ]:
# yy is the mean accuracy of the feature subset for each value of the parameter
# y0 is the mean accuracy of the full feature set, used as the baseline
yy=acc_subset
y0=[score.mean()]*len(xx)

# Compare the full feature set with the feature subsets
fig=plt.figure(figsize=(18,6))
ax1=fig.add_subplot(1,2,1)
plt.plot(xx,yy,"bo-")
plt.plot(xx,y0,"r-")
ax1.set_xlabel("The value of the parameter C")
ax1.set_ylabel("Mean accuracy on training set by CV")
ax1.set_title("Comparison between feature universal set and subset")
ax1.set_ylim([score.mean()-0.01,max(yy)+0.01])
ax1.set_xlim([min(xx),max(xx)])

# Zoom in to compare the different feature subsets
ax2=fig.add_subplot(1,2,2)
plt.plot(xx,yy,"bo-")
plt.plot(xx,y0,"r-")
ax2.set_xlabel("The value of the parameter C")
ax2.set_ylabel("Mean accuracy on training set by CV")
ax2.set_title("Comparison among different parameters")
ax2.set_ylim([min(yy)-0.001,max(yy)+0.001])
ax2.set_xlim([min(xx),max(xx)])
plt.show()

In both plots the x axis is the penalty parameter and the y axis is the mean cross-validated accuracy. In the left plot, the blue line shows the accuracy of the models built on the feature subsets obtained with different penalty parameters, and the red line is the baseline, i.e. the accuracy of the same model on the full feature set. The blue line lies entirely above the red line, which shows that L1-based feature selection clearly improves the model. To compare the different subsets, the right plot zooms in on the blue line: the model performs best when the penalty parameter is 1.7, so we use L1 regularization with C=1.7 for feature selection.

Model on the feature subset

Build a LinearSVC model on the best subset found above, and print the names of the features selected by L1 regularization.

In [ ]:
# Fit the L1-regularized model
lsvc_l1 = LinearSVC(C=1.7, penalty="l1", dual=False).fit(train_x, train_y)
pred=lsvc_l1.predict(test_x)

# Print the names of the features selected by L1 regularization
model = SelectFromModel(lsvc_l1,prefit=True)
new_train_x = model.transform(train_x).T
FS_result=find_name(new_train_x, train_x)

A detailed evaluation of the model:

In [ ]:
evaluate(pred,test_y)

Comparison: model on the full feature set

Build the same LinearSVC model on the full feature set:

In [ ]:
# Baseline LinearSVC (default L2 penalty) on the full feature set
lsvc = LinearSVC(C=1e-5)
lsvc.fit(train_x,train_y)
pred=lsvc.predict(test_x)

A detailed evaluation of the model:

In [ ]:
evaluate(pred,test_y)

A more direct comparison of the models built on the feature subset and on the full feature set:

In [ ]:
index=["accuracy","precision","recall","f1-score"]
data=np.array([[0.8906,0.9043],[0.79,0.89],[0.89,0.90],[0.84,0.88]])
df=pd.DataFrame(data,columns=["universal set","subset"],index=index)
df.plot(kind="bar",figsize=(6,4))
plt.ylim([0.75,0.95])
plt.show()

The model built on the feature subset performs better on every metric, which shows that feature selection before modeling is well worth doing.

3.2 Tree-Based Models

This section has three parts:
1) Feature selection: both Gini impurity and entropy are tried as the importance criterion, and we compare the effect of keeping different numbers of features; model quality is measured by the mean accuracy of 5-fold cross-validation on the training set, and the best number of features is chosen from a comparison plot;
2) Model on the feature subset: a model is built on the best subset found in the feature selection step;
3) Model on the full feature set: a model is built on the original full feature set and compared with the model from part 2 to see how feature selection affects performance.

Feature selection

Using Gini impurity and then entropy as the selection criterion, we examine how the number of selected features affects the results, validating each model with 5-fold cross-validation on the training set; the baseline is the mean cross-validated accuracy obtained with the full feature set on the training set.

In [ ]:
# ExtraTrees model with Gini impurity as the split criterion
from sklearn.ensemble import ExtraTreesClassifier
clf_gini = ExtraTreesClassifier(criterion='gini')

# Fit on the training set to obtain the feature importances
clf_gini = clf_gini.fit(train_x, train_y)
importance_gini=clf_gini.feature_importances_ 

# Mean cross-validated accuracy with the full feature set
score_gini=cross_validation.cross_val_score(clf_gini, train_x, train_y, cv=5)

# ExtraTrees model with entropy as the split criterion
clf_entropy = ExtraTreesClassifier(criterion='entropy')

# Fit on the training set to obtain the feature importances
clf_entropy = clf_entropy.fit(train_x, train_y)
importance_entropy=clf_entropy.feature_importances_ 

# Mean cross-validated accuracy with the full feature set
score_entropy=cross_validation.cross_val_score(clf_entropy, train_x, train_y, cv=5)
In [ ]:
# Gini impurity as the criterion
# Examine how the number of selected features affects the results
# xx is the set of candidate feature counts
acc_subset_gini=[]
xx=np.array(range(1,56,1))
for i in tqdm(xx):
    
    # ExtraTrees model
    clf_gini = ExtraTreesClassifier(criterion='gini')
    
    # Select features according to their importance
    feature=FS_importance(importance_gini, train_x.columns, i)
    new_train_x=train_x[feature]
    
    # Mean cross-validated accuracy with the feature subset
    scores=cross_validation.cross_val_score(clf_gini, new_train_x, train_y, cv=5)
    acc_subset_gini.append(scores.mean())
In [ ]:
# Entropy as the criterion
# Examine how the number of selected features affects the results
# xx is the set of candidate feature counts
acc_subset_entropy=[]
xx=np.array(range(1,56,1))
for i in tqdm(xx):
    
    # ExtraTrees model
    clf_entropy = ExtraTreesClassifier(criterion='entropy')
    
    # Select features according to their importance
    feature=FS_importance(importance_entropy, train_x.columns, i)
    new_train_x=train_x[feature]
    
    # Mean cross-validated accuracy with the feature subset
    scores=cross_validation.cross_val_score(clf_entropy, new_train_x, train_y, cv=5)
    acc_subset_entropy.append(scores.mean())

Plot the effect of the number of selected features:

In [ ]:
# Gini impurity as the criterion
# yy is the mean accuracy of the feature subset for each feature count
# y0 is the mean accuracy of the full feature set, used as the baseline
yy=acc_subset_gini
y0=[score_gini.mean()]*len(xx)

# Compare the full feature set with the feature subsets
fig=plt.figure(figsize=(18,6))
ax1=fig.add_subplot(1,2,1)
plt.plot(xx,yy,"bo-")
plt.plot(xx,y0,"r-")
ax1.set_xlabel("The number of features")
ax1.set_ylabel("Mean accuracy on training set by CV")
ax1.set_title("Comparison among feature numbers by Gini")
ax1.set_ylim([min(yy)-0.001,max(yy)+0.001])
ax1.set_xlim([min(xx),max(xx)])

# yy is the mean accuracy of the feature subset for each feature count (entropy criterion)
# y0 is the mean accuracy of the full feature set, used as the baseline
yy=acc_subset_entropy
y0=[score_entropy.mean()]*len(xx)

# Compare the different feature subsets (entropy criterion)
ax2=fig.add_subplot(1,2,2)
plt.plot(xx,yy,"bo-")
plt.plot(xx,y0,"r-")
ax2.set_xlabel("The number of features")
ax2.set_ylabel("Mean accuracy on training set by CV")
ax2.set_title("Comparison among feature numbers by Entropy")
ax2.set_ylim([min(yy)-0.001,max(yy)+0.001])
ax2.set_xlim([min(xx),max(xx)])
plt.show()

In both plots the x axis is the number of selected features and the y axis is the mean cross-validated accuracy; the blue line shows the accuracy of the models built on subsets of different sizes, and the red line is the baseline obtained with the full feature set. The left plot uses Gini impurity as the selection criterion and the right plot uses entropy. The number of selected features has a large effect: with some subsets (for example 2, 5, 6 or 55 features) the model beats the baseline, while with others (for example 14, 15 or 16 features) it does noticeably worse. The plots suggest that a subset of 2 features gives the best accuracy, but experiments show that with too few features the accuracy is indeed high while the F1 score becomes very low, so we take the 6-feature subset obtained with the entropy criterion as the candidate optimal subset. This is a reminder that for classification problems accuracy alone is not enough; other metrics matter as well, and the choice of a feature subset should weigh several of them.
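
The point about accuracy versus F1 can be checked directly by rerunning the cross-validation with an F1 scorer. A sketch comparing a 2-feature subset with the 6-feature subset (same training split as above):

# Compare mean cross-validated F1 for the 2-feature and 6-feature entropy subsets
for n in [2, 6]:
    feats = FS_importance(importance_entropy, train_x.columns, n)
    clf_tmp = ExtraTreesClassifier(criterion='entropy')
    f1_scores = cross_validation.cross_val_score(clf_tmp, train_x[feats], train_y, cv=5, scoring="f1")
    print("n_features=%d, mean F1=%.4f" % (n, f1_scores.mean()))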

The 6 most important features are:

In [ ]:
print "The most important 6 features computed by mini:"
print(FS_importance(importance_gini, train_x.columns, 6))
print "The most important 6 features computed by entropy:"
print(FS_importance(importance_entropy, train_x.columns, 6))

Model on the feature subset

Build an ExtraTrees model on the best subset found above, i.e. the features selected with the entropy criterion.

In [ ]:
# ExtraTrees model with entropy as the split criterion
clf_entropy = ExtraTreesClassifier(criterion='entropy')
    
# Select the top 6 features according to their importance
feature=FS_importance(importance_entropy, train_x.columns, 6)
new_train_x=train_x[feature]
new_test_x=test_x[feature]

clf_entropy = clf_entropy.fit(new_train_x, train_y) 
pred=clf_entropy.predict(new_test_x)

A detailed evaluation of the model:

In [ ]:
evaluate(pred,test_y)

Comparison: model on the full feature set

Build the same ExtraTrees model on the full feature set:

In [ ]:
from sklearn.ensemble import ExtraTreesClassifier
clf_entropy = ExtraTreesClassifier(criterion='entropy')
clf_entropy = clf_entropy.fit(train_x, train_y) 
pred=clf_entropy.predict(test_x)

A detailed evaluation of the model:

In [ ]:
evaluate(pred,test_y)

A more direct comparison of the models built on the feature subset and on the full feature set:

In [ ]:
index=["accuracy","precision","recall","f1-score"]
data=np.array([[0.8789,0.8886],[0.86,0.87],[0.88,0.89],[0.87,0.88]])
df=pd.DataFrame(data,columns=["universal set","subset"],index=index)
df.plot(kind="bar",figsize=(6,4))
plt.ylim([0.85,0.91])
plt.show()

The model built on the feature subset again performs better on every metric, which confirms that feature selection before modeling is well worth doing.

4 Feature Understanding and Evaluation

We tried three feature selection criteria in total: L1 regularization, Gini impurity and entropy. All three rank age and campaign among the features with the greatest influence on whether a client subscribes, so we take a closer look at how these two features affect the target.

In [ ]:
# Take a closer look at the relationship between age and label
sns.jointplot(x='age',y='label',data=new_data,kind='reg',x_estimator= np.mean,order=2)
plt.show()

Because label is a discrete variable, its mean is plotted instead. The plot shows a quadratic relationship between age and label: middle-aged clients are the least likely to subscribe, while younger and older clients subscribe at noticeably higher rates. This finding can help the bank target its telemarketing more precisely.
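
To quantify this U-shape, one can bin age and compute the subscription rate per band. A rough sketch on the unscaled new_data (the bin edges are illustrative):

# Subscription rate by age band
age_band = pd.cut(new_data["age"], bins=[16, 25, 35, 45, 55, 65, 100])
print(new_data.groupby(age_band)["label"].mean())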

In [ ]:
# Take a closer look at the relationship between campaign and label
f, ax1 = plt.subplots(1, 1, sharex=True, figsize=(6, 4))
c1, c2 = sns.color_palette('Set1', 2)
sns.kdeplot(new_data['campaign'][new_data["label"]==1], shade=True, color=c1, label='Yes',ax=ax1)
sns.kdeplot(new_data['campaign'][new_data["label"]==0], shade=True, color=c2, label='No', ax=ax1)
plt.show()

The feature campaign is the number of contacts made with the client during this campaign. The red density is the distribution of campaign for clients who subscribed, and the blue density is the distribution for those who did not. The red distribution is shifted towards 0 and concentrated on smaller values, which means clients who are contacted fewer times are more likely to subscribe; for the bank, this suggests that agents should avoid contacting the same client too frequently.
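
The same effect can be summarized as a subscription rate per number of contacts; a short sketch:

# Subscription rate by number of contacts during this campaign (first 10 values)
rate_by_campaign = new_data.groupby("campaign")["label"].mean()
print(rate_by_campaign.head(10))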