科比·布莱恩特 (Kobe Bryant) is one of the greatest stars in NBA history. He entered the NBA at 17, won five championship rings, and officially retired on April 14, 2016. Our task is to analyze the data on the shots Kobe made and missed over his 20-year professional career, and to predict whether a given shot goes in based on information such as where and how it was taken.

This case study provides data on every shot Kobe attempted in the NBA from 1996 to 2016, including the shot location, the two teams involved, and whether the game was home or away. Using ensemble methods, we build a classifier that predicts whether a given shot is made.

Predicting whether Kobe's shots go in with AdaBoost

AdaBoost (adaptive boosting) is the most popular and most representative member of the family of ensemble methods. It was proposed by Yoav Freund and Robert Schapire in 1995, and its core idea is to combine many weak learners into a single strong learner. For a classification problem, given a training set, it is much easier to obtain a rough weak classifier than an accurate strong one. AdaBoost therefore starts from weak classifiers and learns repeatedly: in each round it raises the weights of the samples misclassified by the previous weak classifier, gives a larger weight to weak classifiers with smaller classification error, and finally combines all of them by weighted majority vote, thereby boosting a collection of weak classifiers into a strong one.
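
More concretely, for binary labels $y_i \in \{-1,+1\}$, the standard update rules behind this description can be sketched as follows (this is the usual textbook formulation of discrete AdaBoost, not something specific to this case study):

$$e_m = \sum_{i=1}^{N} w_{m,i}\, I\big(G_m(x_i) \neq y_i\big), \qquad \alpha_m = \frac{1}{2}\ln\frac{1-e_m}{e_m},$$

$$w_{m+1,i} = \frac{w_{m,i}\,\exp\big(-\alpha_m\, y_i\, G_m(x_i)\big)}{Z_m}, \qquad G(x) = \mathrm{sign}\!\left(\sum_{m=1}^{M}\alpha_m G_m(x)\right),$$

where $G_m$ is the $m$-th weak classifier, $w_{m,i}$ is the weight of sample $i$ in round $m$, $e_m$ is the weighted error, and $Z_m$ is a normalization factor. A sample misclassified by $G_m$ has its weight multiplied by $e^{\alpha_m} > 1$, which is exactly the "raise the weights of previously misclassified samples" step, and $\alpha_m$ grows as $e_m$ shrinks, which is the "give more accurate weak classifiers a larger say" step.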

AdaBoost's strengths are its low generalization error, its ease of implementation, the fact that it can be applied on top of most classifiers, and that it requires little parameter tuning. Its main weakness is sensitivity to outliers. It can be used with both numeric and nominal data.

The data come from the "Kobe Bryant Shot Selection" competition on Kaggle, the well-known platform for data modeling and data analysis competitions; the competition is well suited to practicing classification, feature engineering, and time-series analysis.

Kobe Bryant is one of the greatest stars in NBA history. He entered the NBA at 17, once scored 81 points in a single game, won five championship rings, and officially retired on April 14, 2016. In this case study we analyze the data on the shots he made and missed over his 20-year career and predict whether a given shot goes in based on information such as where and how it was taken; this is a typical binary classification problem.

We first use sklearn's AdaBoost implementation; later, starting from the theory of the algorithm, we will implement AdaBoost ourselves and compare the predictive performance of ordinary classifiers against AdaBoost.

1 Data source

In this case study we use data on all of Kobe's shots in the NBA from 1996 to 2016. We first import the packages we will need and then look at which variables the data contain:

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing and CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # general plotting
import seaborn as sns # nicer statistical graphics
from sklearn.decomposition import PCA, KernelPCA
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold, RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.grid_search import GridSearchCV
# plot figures inline in the notebook
%matplotlib inline
//anaconda/lib/python3.5/site-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)
In [2]:
kobe = pd.read_csv('./input/data.csv')
kobe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30697 entries, 0 to 30696
Data columns (total 25 columns):
action_type           30697 non-null object
combined_shot_type    30697 non-null object
game_event_id         30697 non-null int64
game_id               30697 non-null int64
lat                   30697 non-null float64
loc_x                 30697 non-null int64
loc_y                 30697 non-null int64
lon                   30697 non-null float64
minutes_remaining     30697 non-null int64
period                30697 non-null int64
playoffs              30697 non-null int64
season                30697 non-null object
seconds_remaining     30697 non-null int64
shot_distance         30697 non-null int64
shot_made_flag        25697 non-null float64
shot_type             30697 non-null object
shot_zone_area        30697 non-null object
shot_zone_basic       30697 non-null object
shot_zone_range       30697 non-null object
team_id               30697 non-null int64
team_name             30697 non-null object
game_date             30697 non-null object
matchup               30697 non-null object
opponent              30697 non-null object
shot_id               30697 non-null int64
dtypes: float64(3), int64(11), object(11)
memory usage: 5.9+ MB

The dataset contains 30,697 samples and 25 variables: 3 floating-point (float) variables, 11 integer (int) variables, and 11 object variables. No variable has missing values except shot_made_flag, which is also the target variable. The meaning of each variable is listed below:

| Variable | Description | Type | Example |
|---|---|---|---|
| action_type | Detailed shot type | object | Floating Jump Shot |
| combined_shot_type | Broad shot type, one of six values: Jump Shot, Layup, Dunk, Tip Shot, Hook Shot, Bank Shot | object | Layup |
| game_event_id | Game event ID | int64 | 318 |
| game_id | Game ID | int64 | 21501228 |
| lat | Latitude of the shot location | float64 | 34.0443 |
| loc_x | Shot coordinate along the short side of the court, $x$ | int64 | 0 |
| loc_y | Shot coordinate along the long side of the court, $y$ | int64 | 7 |
| lon | Longitude of the shot location | float64 | -118.2698 |
| minutes_remaining | Minutes remaining in the period | int64 | 4 |
| period | Period of the game | int64 | 3 |
| playoffs | Playoff flag: 0 = regular season, 1 = playoffs | int64 | 1 |
| season | Season | object | 2005-2006 |
| seconds_remaining | Seconds remaining in the period | int64 | 32 |
| shot_distance | Distance of the shot from the basket | int64 | 24 |
| shot_made_flag | Whether the shot was made: 0 = missed, 1 = made | float64 | 0 |
| shot_type | Shot value: 2PT Field Goal or 3PT Field Goal | object | 2PT Field Goal |
| shot_zone_area | Shot zone (area) | object | Center(C) |
| shot_zone_basic | Shot zone (basic) | object | Mid-Range |
| shot_zone_range | Shot zone (distance range from the basket) | object | 16-24 ft. |
| team_id | Team ID | int64 | 1610612747 |
| team_name | Team name | object | Los Angeles Lakers |
| game_date | Game date | object | 2016-04-13 |
| matchup | Matchup (@ = away, vs. = home) | object | LAL @ SAS |
| opponent | Opponent | object | SAS |
| shot_id | Shot ID | int64 | 2047 |

2 Data exploration and preprocessing

A first look at the dataset

We first use shot_id as the index, then use the head method for a quick look at the dataset (setting the maximum number of displayed columns to unlimited so that all columns are shown):

In [3]:
kobe.set_index('shot_id',inplace=True)    # use shot_id as the index
pd.set_option('display.max_columns',None) # no upper limit on displayed columns
kobe.head(4)
Out[3]:
action_type combined_shot_type game_event_id game_id lat loc_x loc_y lon minutes_remaining period playoffs season seconds_remaining shot_distance shot_made_flag shot_type shot_zone_area shot_zone_basic shot_zone_range team_id team_name game_date matchup opponent
shot_id
1 Jump Shot Jump Shot 10 20000012 33.9723 167 72 -118.1028 10 1 0 2000-01 27 18 NaN 2PT Field Goal Right Side(R) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR
2 Jump Shot Jump Shot 12 20000012 34.0443 -157 0 -118.4268 10 1 0 2000-01 22 15 0.0 2PT Field Goal Left Side(L) Mid-Range 8-16 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR
3 Jump Shot Jump Shot 35 20000012 33.9093 -101 135 -118.3708 7 1 0 2000-01 45 16 1.0 2PT Field Goal Left Side Center(LC) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR
4 Jump Shot Jump Shot 43 20000012 33.8693 138 175 -118.1318 6 1 0 2000-01 52 22 0.0 2PT Field Goal Right Side Center(RC) Mid-Range 16-24 ft. 1610612747 Los Angeles Lakers 2000-10-31 LAL @ POR POR

Browsing with head does not give a good feel for the whole dataset; a better approach is to use numpy's permutation function to pick a few random samples. For readability we transpose the result so that the column names run vertically:

In [4]:
random_sample = kobe.take(np.random.permutation(len(kobe))[:4])
random_sample.T
Out[4]:
shot_id 28594 17122 12961 23212
action_type Layup Shot Driving Layup Shot Jump Shot Jump Shot
combined_shot_type Layup Layup Jump Shot Jump Shot
game_event_id 199 305 4 87
game_id 40800145 21000511 20701082 29600942
lat 34.0443 34.0273 33.8423 33.8473
loc_x 0 -21 92 -142
loc_y 0 17 202 197
lon -118.27 -118.291 -118.178 -118.412
minutes_remaining 2 6 11 1
period 2 3 1 1
playoffs 1 0 0 0
season 2008-09 2010-11 2007-08 1996-97
seconds_remaining 49 43 28 19
shot_distance 0 2 22 24
shot_made_flag 0 1 0 1
shot_type 2PT Field Goal 2PT Field Goal 2PT Field Goal 3PT Field Goal
shot_zone_area Center(C) Center(C) Right Side Center(RC) Left Side Center(LC)
shot_zone_basic Restricted Area Restricted Area Mid-Range Above the Break 3
shot_zone_range Less Than 8 ft. Less Than 8 ft. 16-24 ft. 24+ ft.
team_id 1610612747 1610612747 1610612747 1610612747
team_name Los Angeles Lakers Los Angeles Lakers Los Angeles Lakers Los Angeles Lakers
game_date 2009-04-27 2011-01-04 2008-03-28 1997-03-17
matchup LAL vs. UTA LAL vs. DET LAL vs. MEM LAL @ DEN
opponent UTA DET MEM DEN

Next we explore the data further: we first visualize the shot zones, and then compute descriptive statistics separately for the different variable types:

Visualizing the shot zones

In [42]:
shot_zone = ['shot_zone_area', 'shot_zone_basic', 'shot_zone_range']
for zone in shot_zone:
    sns.lmplot('loc_x','loc_y', 
               data=kobe, 
               hue=zone,
               fit_reg=False,
               palette="Set1",
               size=10,
              )
    plt.xlim([-250,250])
    plt.ylim([0,500])
    plt.xlabel('')
    plt.ylabel('')
    sns.despine(left=True,bottom=True)

Descriptive statistics

To make later analysis easier, we first convert several variables to categorical type:

In [5]:
kobe['game_event_id'] = kobe['game_event_id'].astype('category')
kobe['game_id'] = kobe['game_id'].astype('category')
kobe['period'] = kobe['period'].astype('category')
kobe['playoffs'] = kobe['playoffs'].astype('category')
kobe['season'] = kobe['season'].astype('category')
kobe['shot_made_flag'] = kobe['shot_made_flag'].astype('category')
kobe['shot_type'] = kobe['shot_type'].astype('category')
kobe['team_id'] = kobe['team_id'].astype('category')

The variables after conversion:

In [6]:
kobe.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30697 entries, 1 to 30697
Data columns (total 24 columns):
action_type           30697 non-null object
combined_shot_type    30697 non-null object
game_event_id         30697 non-null category
game_id               30697 non-null category
lat                   30697 non-null float64
loc_x                 30697 non-null int64
loc_y                 30697 non-null int64
lon                   30697 non-null float64
minutes_remaining     30697 non-null int64
period                30697 non-null category
playoffs              30697 non-null category
season                30697 non-null category
seconds_remaining     30697 non-null int64
shot_distance         30697 non-null int64
shot_made_flag        25697 non-null category
shot_type             30697 non-null category
shot_zone_area        30697 non-null object
shot_zone_basic       30697 non-null object
shot_zone_range       30697 non-null object
team_id               30697 non-null category
team_name             30697 non-null object
game_date             30697 non-null object
matchup               30697 non-null object
opponent              30697 non-null object
dtypes: category(8), float64(2), int64(5), object(9)
memory usage: 4.3+ MB
In [7]:
# descriptive statistics for the numeric variables
kobe.describe(include=['number']).T
Out[7]:
count mean std min 25% 50% 75% max
lat 30697.0 33.953192 0.087791 33.2533 33.8843 33.9703 34.0403 34.0883
loc_x 30697.0 7.110499 110.124578 -250.0000 -68.0000 0.0000 95.0000 248.0000
loc_y 30697.0 91.107535 87.791361 -44.0000 4.0000 74.0000 160.0000 791.0000
lon 30697.0 -118.262690 0.110125 -118.5198 -118.3378 -118.2698 -118.1748 -118.0218
minutes_remaining 30697.0 4.885624 3.449897 0.0000 2.0000 5.0000 8.0000 11.0000
seconds_remaining 30697.0 28.365085 17.478949 0.0000 13.0000 28.0000 43.0000 59.0000
shot_distance 30697.0 13.437437 9.374189 0.0000 5.0000 15.0000 21.0000 79.0000

We see that shot_made_flag has 5,000 missing values. This is by design: the competition organizers withheld the labels of 5,000 samples to serve as the test set on which entrants are evaluated. Later we will set these 5,000 samples aside and use the remaining labeled samples for training and validation, so that we can assess how well the model is trained.
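
As a quick sanity check (a sketch, not one of the original notebook cells), the number of withheld labels can be counted directly; it should come out to 5,000 (30,697 − 25,697):

kobe['shot_made_flag'].isnull().sum()   # expected: 5000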

In [8]:
# descriptive statistics for the nominal (object) variables
kobe.describe(include=['object']).T
Out[8]:
count unique top freq
action_type 30697 57 Jump Shot 18880
combined_shot_type 30697 6 Jump Shot 23485
shot_zone_area 30697 6 Center(C) 13455
shot_zone_basic 30697 7 Mid-Range 12625
shot_zone_range 30697 5 Less Than 8 ft. 9398
team_name 30697 1 Los Angeles Lakers 30697
game_date 30697 1559 2016-04-13 50
matchup 30697 74 LAL @ SAS 1020
opponent 30697 33 SAS 1978

These statistics show that Kobe's shots were mostly jump shots, and that he played for only one team in his entire NBA career: the Los Angeles Lakers, the dynasty he helped build.

Next we check whether the dataset is imbalanced and then look for outliers.

Class imbalance analysis and outlier detection

We first check the class balance of the target:

In [9]:
sns.set_style('ticks')       # show tick marks and drop the grid
sns.set_palette('Set1')      # use the Set1 color palette
ax = plt.axes() 
sns.countplot(x='shot_made_flag',
              data=kobe,
              ax=ax) 
ax.set_title('Target class distribution')
Out[9]:
<matplotlib.text.Text at 0x10976dcc0>

The two classes of the target variable are fairly evenly distributed, so there is no class imbalance to deal with.
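
The same balance can be checked numerically (a sketch; output omitted here):

kobe['shot_made_flag'].value_counts()   # counts of missed (0.0) and made (1.0) shots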

Next we use box plots to look for outliers among the numeric variables in the dataset.

We take the numeric variables and, for each one, draw its box plot split by the value of the target variable:

In [43]:
f, axarr = plt.subplots(2, 4, figsize=(20, 15))
sns.set_context("notebook", font_scale=1.8)  # enlarge tick and axis labels for readability
numeric_cols = ['lat', 'lon', 'loc_x', 'loc_y',
                'minutes_remaining', 'seconds_remaining', 'shot_distance']
# strip plot plus box plot of each numeric variable, split by shot_made_flag
for ax, col in zip(axarr.flatten(), numeric_cols):
    sns.stripplot(x='shot_made_flag', y=col, data=kobe, jitter=True, ax=ax)
    sns.boxplot(x='shot_made_flag', y=col, data=kobe, ax=ax)
plt.tight_layout()

The box plots show that lat, loc_y and shot_distance have quite a few outliers. These outliers are not really "abnormal", though: the extreme values of loc_y and shot_distance mostly correspond to rare situations, such as heaving up a shot from far behind the three-point line just before the end of a period. We will remove these outliers before training and discuss how keeping or removing them affects the model.

We draw a pair plot to examine the correlations between these variables:

In [11]:
sns.pairplot(kobe,
             vars = ['lat','lon','loc_x','loc_y','shot_distance'],
             hue='shot_made_flag',
             markers=['o','s'],
             diag_kind='kde',
             diag_kws=dict(shade=True),
             size=4)
//anaconda/lib/python3.5/site-packages/statsmodels/nonparametric/kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j
Out[11]:
<seaborn.axisgrid.PairGrid at 0x109887438>

The pair plot shows that lat is correlated with loc_y and lon with loc_x, and that both lat and loc_y are related to shot_distance, so in the later analysis we will drop one variable from each correlated pair before building the model.
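
To quantify these relationships, one could also inspect the correlation matrix directly (a sketch; it tells the same story as the pair plot):

kobe[['lat', 'lon', 'loc_x', 'loc_y', 'shot_distance']].corr()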

Next we look at the nominal variables:

In [12]:
f, axarr = plt.subplots(8, figsize=(20, 34))   # one subplot per categorical variable below
sns.countplot(x="combined_shot_type", hue="shot_made_flag", data=kobe, ax=axarr[0])
sns.countplot(x="season", hue="shot_made_flag", data=kobe, ax=axarr[1])
sns.countplot(x="period", hue="shot_made_flag", data=kobe, ax=axarr[2])
sns.countplot(x="playoffs", hue="shot_made_flag", data=kobe, ax=axarr[3])
sns.countplot(x="shot_type", hue="shot_made_flag", data=kobe, ax=axarr[4])
sns.countplot(x="shot_zone_area", hue="shot_made_flag", data=kobe, ax=axarr[5])
sns.countplot(x="shot_zone_basic", hue="shot_made_flag", data=kobe, ax=axarr[6])
sns.countplot(x="shot_zone_range", hue="shot_made_flag", data=kobe, ax=axarr[7])
plt.tight_layout()
plt.show()

The first chart, combined_shot_type, confirms that Kobe shot mostly jump shots; the fadeaway jumper was, after all, his signature move. The second chart, season, traces his whole career: after entering the NBA in the 1996 draft he rose quickly, won three straight championships from 2000 to 2002 during the Lakers dynasty years, went through a brief slump between 2002 and 2004, won two more titles in 2009 and 2010, suffered a serious injury in 2013, and retired after three relatively quiet final seasons.

Data cleaning

In [13]:
unknown_mask = kobe['shot_made_flag'].isnull()

We assume that individual shots are independent of one another. We now drop some irrelevant variables; where two predictors are strongly correlated we keep only one of them. We also extract the target variable.

In [14]:
kobe_cl = kobe.copy()   # work on a copy
target = kobe_cl['shot_made_flag'].copy()

# drop some columns
kobe_cl.drop('team_id',axis=1,inplace=True)        # only one value
kobe_cl.drop('lat',axis=1,inplace=True)            # correlated with loc_y
kobe_cl.drop('lon',axis=1,inplace=True)            # correlated with loc_x
kobe_cl.drop('game_id',axis=1,inplace=True)        # clearly unrelated to whether a shot is made
kobe_cl.drop('game_event_id',axis=1,inplace=True)  # clearly unrelated to whether a shot is made
kobe_cl.drop('team_name',axis=1,inplace=True)      # only one value, 'Los Angeles Lakers'
kobe_cl.drop('shot_made_flag',axis=1,inplace=True) # already saved as target

Below we define a function for outlier detection, to be used for removing outliers before model training:

In [15]:
# define an outlier detection function based on the IQR rule
def detect_outliers(series,whis=1.5):
    q75,q25 = np.percentile(series,[75,25])
    iqr = q75 - q25   # interquartile range
    # flag values farther than whis * IQR from the median as outliers
    return ~((series - series.median()).abs() <= (whis * iqr))
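
As a minimal usage sketch (not part of the original run), the function could be applied to a single column such as shot_distance; the names mask and kobe_filtered below are just illustrative:

mask = detect_outliers(kobe_cl['shot_distance'])   # boolean Series marking outlier rows
print(mask.sum())                                  # how many shots are flagged
# kobe_filtered = kobe_cl[~mask]                   # optionally drop them before training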

Data transformation

To make model training easier, we derive some new variables:

In [16]:
# Remaining time
kobe_cl['seconds_from_period_end'] = 60 * kobe_cl['minutes_remaining'] + kobe_cl['seconds_remaining']
kobe_cl['last_5_sec_in_period'] = kobe_cl['seconds_from_period_end'] < 5

kobe_cl.drop('minutes_remaining', axis=1, inplace=True)
kobe_cl.drop('seconds_remaining', axis=1, inplace=True)
kobe_cl.drop('seconds_from_period_end', axis=1, inplace=True)

## Matchup - (away/home)
kobe_cl['home_play'] = kobe_cl['matchup'].str.contains('vs').astype('int')
kobe_cl.drop('matchup', axis=1, inplace=True)

# Game date
kobe_cl['game_date'] = pd.to_datetime(kobe_cl['game_date'])
kobe_cl['game_year'] = kobe_cl['game_date'].dt.year
kobe_cl['game_month'] = kobe_cl['game_date'].dt.month
kobe_cl.drop('game_date', axis=1, inplace=True)

# Loc_x, and loc_y binning
kobe_cl['loc_x'] = pd.cut(kobe_cl['loc_x'], 25)
kobe_cl['loc_y'] = pd.cut(kobe_cl['loc_y'], 25)

# Replace 20 least common action types with value 'Other'
rare_action_types = kobe_cl['action_type'].value_counts().sort_values().index.values[:20]
kobe_cl.loc[kobe_cl['action_type'].isin(rare_action_types), 'action_type'] = 'Other'

Since many of the variables are nominal, we one-hot encode them:

In [17]:
categorial_cols = [
    'action_type', 'combined_shot_type', 'period', 'season', 'shot_type',
    'shot_zone_area', 'shot_zone_basic', 'shot_zone_range', 'game_year',
    'game_month', 'opponent','loc_x', 'loc_y']

for cc in categorial_cols:
    dummies = pd.get_dummies(kobe_cl[cc])
    dummies = dummies.add_prefix("{}#".format(cc))
    kobe_cl.drop(cc, axis=1, inplace=True)
    kobe_cl = kobe_cl.join(dummies)

3 Feature selection

In this part we perform feature selection on the many encoded variables.

sklearn's feature_selection module provides many feature selection methods, such as VarianceThreshold, RFE, SelectKBest and chi2; we can also use a random forest model to rank variable importance.

First we take the samples that have a shot_made_flag value for model training; the samples with missing shot_made_flag form the submission set that we will predict on at the end:

In [18]:
# submission set: samples with missing shot_made_flag
kobe_submit = kobe_cl[unknown_mask]

# labeled samples used for training and validation
X = kobe_cl[~unknown_mask]
Y = target[~unknown_mask]

Variance threshold

We keep the variables whose variance exceeds 0.09 (i.e. threshold * (1 - threshold) with threshold = 0.90) as candidate modeling variables:

In [19]:
threshold = 0.90
vt = VarianceThreshold().fit(X)
# list the feature names that pass the threshold
feat_var_threshold = kobe_cl.columns[vt.variances_ > threshold * (1-threshold)]
list(feat_var_threshold)
Out[19]:
['playoffs',
 'shot_distance',
 'home_play',
 'action_type#Jump Shot',
 'combined_shot_type#Jump Shot',
 'combined_shot_type#Layup',
 'period#1',
 'period#2',
 'period#3',
 'period#4',
 'shot_type#2PT Field Goal',
 'shot_type#3PT Field Goal',
 'shot_zone_area#Center(C)',
 'shot_zone_area#Left Side Center(LC)',
 'shot_zone_area#Left Side(L)',
 'shot_zone_area#Right Side Center(RC)',
 'shot_zone_area#Right Side(R)',
 'shot_zone_basic#Above the Break 3',
 'shot_zone_basic#In The Paint (Non-RA)',
 'shot_zone_basic#Mid-Range',
 'shot_zone_basic#Restricted Area',
 'shot_zone_range#16-24 ft.',
 'shot_zone_range#24+ ft.',
 'shot_zone_range#8-16 ft.',
 'shot_zone_range#Less Than 8 ft.',
 'game_month#1',
 'game_month#2',
 'game_month#3',
 'game_month#4',
 'game_month#11',
 'game_month#12',
 'loc_x#(-10.96, 8.96]',
 'loc_y#(-10.6, 22.8]',
 'loc_y#(22.8, 56.2]',
 'loc_y#(123, 156.4]']

Selecting the most important features with a random forest - Top 20

We can also use a random forest to find the 20 most important variables:

In [20]:
model = RandomForestClassifier()
model.fit(X, Y)

feature_imp = pd.DataFrame(model.feature_importances_, index=X.columns, columns=["importance"])
feat_rf_20 = feature_imp.sort_values("importance", ascending=False).head(20).index
list(feat_rf_20)
Out[20]:
['shot_distance',
 'action_type#Jump Shot',
 'home_play',
 'period#2',
 'period#3',
 'period#1',
 'action_type#Layup Shot',
 'period#4',
 'game_month#1',
 'game_month#2',
 'game_month#3',
 'game_month#4',
 'game_month#12',
 'game_month#11',
 'playoffs',
 'opponent#SAS',
 'opponent#HOU',
 'opponent#POR',
 'shot_zone_basic#Restricted Area',
 'opponent#SAC']

Whether a shot goes in is mainly related to the shot distance, and also strongly related to whether it is a jump shot; Kobe's fadeaway jumper clearly earned its reputation.

Univariate feature selection with the $\chi^2$ test - Top 20

In [21]:
# first scale the features to the [0, 1] range
X_minmax = MinMaxScaler(feature_range=(0,1)).fit_transform(X)
X_scored = SelectKBest(score_func=chi2, k='all').fit(X_minmax, Y)
feature_scoring = pd.DataFrame({
        'feature': X.columns,
        'score': X_scored.scores_
    })
feat_chi2_20 = feature_scoring.sort_values('score', ascending=False).head(20)['feature'].values
list(feat_chi2_20)
Out[21]:
['combined_shot_type#Dunk',
 'action_type#Jump Shot',
 'shot_zone_basic#Restricted Area',
 'loc_x#(-10.96, 8.96]',
 'action_type#Driving Layup Shot',
 'shot_zone_range#Less Than 8 ft.',
 'loc_y#(-10.6, 22.8]',
 'action_type#Slam Dunk Shot',
 'shot_type#3PT Field Goal',
 'action_type#Driving Dunk Shot',
 'shot_zone_area#Center(C)',
 'action_type#Running Jump Shot',
 'shot_zone_range#24+ ft.',
 'shot_zone_basic#Above the Break 3',
 'combined_shot_type#Layup',
 'combined_shot_type#Jump Shot',
 'last_5_sec_in_period',
 'action_type#Jump Bank Shot',
 'action_type#Pullup Jump shot',
 'shot_zone_area#Left Side Center(LC)']

Clearly the importance ranking produced by the $\chi^2$ test differs from the one produced by the random forest. Next we try the RFE method.

Feature selection with RFE - Top 20

RFE stands for Recursive Feature Elimination, another commonly used feature selection method: it repeatedly fits a model and removes the weakest features until the desired number remains.

In [22]:
rfe = RFE(LogisticRegression(), 20)
rfe.fit(X, Y)
feature_rfe_scoring = pd.DataFrame({
        'feature': X.columns,
        'score': rfe.ranking_
    })
feat_rfe_20 = feature_rfe_scoring[feature_rfe_scoring['score'] == 1]['feature'].values
list(feat_rfe_20)
Out[22]:
['action_type#Driving Dunk Shot',
 'action_type#Driving Finger Roll Layup Shot',
 'action_type#Driving Finger Roll Shot',
 'action_type#Driving Slam Dunk Shot',
 'action_type#Dunk Shot',
 'action_type#Fadeaway Bank shot',
 'action_type#Finger Roll Shot',
 'action_type#Hook Shot',
 'action_type#Jump Shot',
 'action_type#Layup Shot',
 'action_type#Running Bank shot',
 'action_type#Running Hook Shot',
 'action_type#Slam Dunk Shot',
 'combined_shot_type#Dunk',
 'combined_shot_type#Tip Shot',
 'shot_zone_area#Back Court(BC)',
 'shot_zone_range#Back Court Shot',
 'loc_y#(290, 323.4]',
 'loc_y#(356.8, 390.2]',
 'loc_y#(390.2, 423.6]']

Let us put the Top-20 lists obtained by the three methods above side by side:

In [23]:
feature_scoring = pd.DataFrame({
        'feat_rf_20': feat_rf_20,
        'feat_chi2_20': feat_chi2_20,
        'feat_rfe_20': feat_rfe_20
    })
feature_scoring
Out[23]:
feat_chi2_20 feat_rf_20 feat_rfe_20
0 combined_shot_type#Dunk shot_distance action_type#Driving Dunk Shot
1 action_type#Jump Shot action_type#Jump Shot action_type#Driving Finger Roll Layup Shot
2 shot_zone_basic#Restricted Area home_play action_type#Driving Finger Roll Shot
3 loc_x#(-10.96, 8.96] period#2 action_type#Driving Slam Dunk Shot
4 action_type#Driving Layup Shot period#3 action_type#Dunk Shot
5 shot_zone_range#Less Than 8 ft. period#1 action_type#Fadeaway Bank shot
6 loc_y#(-10.6, 22.8] action_type#Layup Shot action_type#Finger Roll Shot
7 action_type#Slam Dunk Shot period#4 action_type#Hook Shot
8 shot_type#3PT Field Goal game_month#1 action_type#Jump Shot
9 action_type#Driving Dunk Shot game_month#2 action_type#Layup Shot
10 shot_zone_area#Center(C) game_month#3 action_type#Running Bank shot
11 action_type#Running Jump Shot game_month#4 action_type#Running Hook Shot
12 shot_zone_range#24+ ft. game_month#12 action_type#Slam Dunk Shot
13 shot_zone_basic#Above the Break 3 game_month#11 combined_shot_type#Dunk
14 combined_shot_type#Layup playoffs combined_shot_type#Tip Shot
15 combined_shot_type#Jump Shot opponent#SAS shot_zone_area#Back Court(BC)
16 last_5_sec_in_period opponent#HOU shot_zone_range#Back Court Shot
17 action_type#Jump Bank Shot opponent#POR loc_y#(290, 323.4]
18 action_type#Pullup Jump shot shot_zone_basic#Restricted Area loc_y#(356.8, 390.2]
19 shot_zone_area#Left Side Center(LC) opponent#SAC loc_y#(390.2, 423.6]

We now take the union of the features chosen by all four approaches above (variance threshold, random forest, $\chi^2$ and RFE):

In [24]:
features = np.hstack([
        feat_var_threshold, 
        feat_rf_20,
        feat_chi2_20,
        feat_rfe_20
    ])
features = np.unique(features)
print('Final features set:\n')
for f in features:
    print("-{}".format(f))
Final features set:

-action_type#Driving Dunk Shot
-action_type#Driving Finger Roll Layup Shot
-action_type#Driving Finger Roll Shot
-action_type#Driving Layup Shot
-action_type#Driving Slam Dunk Shot
-action_type#Dunk Shot
-action_type#Fadeaway Bank shot
-action_type#Finger Roll Shot
-action_type#Hook Shot
-action_type#Jump Bank Shot
-action_type#Jump Shot
-action_type#Layup Shot
-action_type#Pullup Jump shot
-action_type#Running Bank shot
-action_type#Running Hook Shot
-action_type#Running Jump Shot
-action_type#Slam Dunk Shot
-combined_shot_type#Dunk
-combined_shot_type#Jump Shot
-combined_shot_type#Layup
-combined_shot_type#Tip Shot
-game_month#1
-game_month#11
-game_month#12
-game_month#2
-game_month#3
-game_month#4
-home_play
-last_5_sec_in_period
-loc_x#(-10.96, 8.96]
-loc_y#(-10.6, 22.8]
-loc_y#(123, 156.4]
-loc_y#(22.8, 56.2]
-loc_y#(290, 323.4]
-loc_y#(356.8, 390.2]
-loc_y#(390.2, 423.6]
-opponent#HOU
-opponent#POR
-opponent#SAC
-opponent#SAS
-period#1
-period#2
-period#3
-period#4
-playoffs
-shot_distance
-shot_type#2PT Field Goal
-shot_type#3PT Field Goal
-shot_zone_area#Back Court(BC)
-shot_zone_area#Center(C)
-shot_zone_area#Left Side Center(LC)
-shot_zone_area#Left Side(L)
-shot_zone_area#Right Side Center(RC)
-shot_zone_area#Right Side(R)
-shot_zone_basic#Above the Break 3
-shot_zone_basic#In The Paint (Non-RA)
-shot_zone_basic#Mid-Range
-shot_zone_basic#Restricted Area
-shot_zone_range#16-24 ft.
-shot_zone_range#24+ ft.
-shot_zone_range#8-16 ft.
-shot_zone_range#Back Court Shot
-shot_zone_range#Less Than 8 ft.

4 Dimensionality reduction with PCA

Using the features selected in part 3, we rebuild the full data frame kobe_cl, the submission set kobe_submit, and the training matrix X:

In [25]:
kobe_cl = kobe_cl.ix[:, features]
kobe_submit = kobe_submit.ix[:, features]
X = X.ix[:, features]

print('Full dataset shape: {}'.format(kobe_cl.shape))
print('Submission set shape: {}'.format(kobe_submit.shape))
print('Training set shape: {}'.format(X.shape))
print('Target shape: {}'.format(Y.shape))
Full dataset shape: (30697, 63)
Submission set shape: (5000, 63)
Training set shape: (25697, 63)
Target shape: (25697,)

With 63 features the dimensionality is still high, so before training we reduce it to 8 components with PCA:

In [26]:
components = 8
pca = PCA(n_components = components).fit(X)

Let us look at the proportion of variance explained by each principal component:

In [27]:
pca_variance_explained_df = pd.DataFrame({
    "component": np.arange(1, components+1),
    "variance_explained": pca.explained_variance_ratio_            
    })
ax = sns.barplot(x='component', 
                 y='variance_explained', 
                 data=pca_variance_explained_df,
                 palette="Set1", )
ax.set_title("PCA - Variance explained")
Out[27]:
<matplotlib.text.Text at 0x11ff0acf8>
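
To see how much of the total variance the 8 components retain, a quick cumulative check could be added (a sketch; output not shown):

print(pca.explained_variance_ratio_.cumsum())   # cumulative explained variance ratio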

We take only the first two principal components, name them x1 and x2, and draw a scatter plot in this two-dimensional feature space:

In [28]:
X_pca = pd.DataFrame(pca.transform(X)[:,:2])
X_pca['target'] = Y.values
X_pca.columns = ["x1", "x2", "target"]
sns.lmplot('x1','x2', 
           data=X_pca, 
           hue="target", 
           fit_reg=False, 
           markers=["o", "x"], 
           palette="Set1", 
           size=6,
          )
Out[28]:
<seaborn.axisgrid.FacetGrid at 0x1228e6550>

The first component x1 clearly matters for the target: in the range $[-70, -20]$ almost all shots are missed, while in $[-20, 20]$ some are made and some are missed. The second component x2 has a much weaker effect on the outcome than x1.

5 Training the AdaBoost model

A first training run

We initialize the parameters that model training will use:

In [29]:
seed = 7
processors=1
num_folds=3           # 3-fold cross-validation
num_instances=len(X)
scoring='roc_auc'     # use AUC as the cross-validation metric
kfold = KFold(n=num_instances, n_folds=num_folds, random_state=seed)

First we train with n_estimators=100 and random_state=7 and look at the result:

In [30]:
model = AdaBoostClassifier(n_estimators=100, random_state=seed)

results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
print("({0:.3f}) +/- ({1:.3f})".format(results.mean(), results.std()))
(0.698) +/- (0.003)

To show AdaBoost's advantage, we compare it against several other commonly used classifiers:

In [31]:
# models: logistic regression, linear discriminant analysis, k-nearest neighbors, decision tree and naive Bayes
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('K-NN', KNeighborsClassifier(n_neighbors=5)))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
# compute the cross-validation score (cross_val_score) for each model
results = []
names = []
for name, model in models:
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring='roc_auc', n_jobs=processors)
    results.append(cv_results)
    names.append(name)
    print("{0}: ({1:.3f}) +/- ({2:.3f})".format(name, cv_results.mean(), cv_results.std()))
LR: (0.695) +/- (0.004)
//anaconda/lib/python3.5/site-packages/sklearn/discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
//anaconda/lib/python3.5/site-packages/sklearn/discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
//anaconda/lib/python3.5/site-packages/sklearn/discriminant_analysis.py:387: UserWarning: Variables are collinear.
  warnings.warn("Variables are collinear.")
LDA: (0.695) +/- (0.004)
K-NN: (0.618) +/- (0.013)
CART: (0.592) +/- (0.005)
NB: (0.660) +/- (0.006)

Among these baselines, logistic regression and linear discriminant analysis perform best, but both are still slightly worse than AdaBoost (0.695 vs. 0.698 AUC).

Parameter tuning: one parameter at a time

Next we look at how different parameter settings affect training:

(1) The n_estimators parameter

In [32]:
n_estimators, scores = list(range(1,100,10)), []
for i in n_estimators:
    model = AdaBoostClassifier(n_estimators = i, learning_rate = 1, random_state = seed)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
    scores.append(cv_results)
cv_results = [i for i in n_estimators for j in range(num_folds)]

We draw box plots with n_estimators on the x-axis and AUC on the y-axis:

In [33]:
ax = plt.axes()
sns.boxplot(x = cv_results, 
            y = np.array(scores).flatten(),
            palette='Set1',
            ax=ax)
ax.set_xlabel('n_estimators')
ax.set_ylabel('AUC')
Out[33]:
<matplotlib.text.Text at 0x11d52f128>

AUC increases as n_estimators grows, but after roughly 30 to 40 estimators the improvement becomes negligible, so the default value of 50 is a reasonable choice.

(2) The learning_rate parameter

Keeping n_estimators at 50 as suggested above, we now tune learning_rate:

In [34]:
learning_rate, scores = list(range(1,10,1)), []
for i in learning_rate:
    model = AdaBoostClassifier(n_estimators = 50, learning_rate = i, random_state = seed)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
    scores.append(cv_results)
cv_results = [i for i in learning_rate for j in range(num_folds)]
In [35]:
ax = plt.axes()
sns.boxplot(x = cv_results, 
            y = np.array(scores).flatten(),
            palette='Set1',
            ax=ax)
ax.set_xlabel('learning_rate')
ax.set_ylabel('AUC')
Out[35]:
<matplotlib.text.Text at 0x109896fd0>

The figure shows that the model performs best when learning_rate is 1.

(3) The random_state parameter

In the same way, we now analyze random_state:

In [36]:
random_state, scores = list(range(1,100,10)), []
for i in random_state:
    model = AdaBoostClassifier(n_estimators = 50, learning_rate = 1, random_state = i)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring, n_jobs=processors)
    scores.append(cv_results)
cv_results = [i for i in random_state for j in range(num_folds)]
In [37]:
ax = plt.axes()
sns.boxplot(x = cv_results, 
            y = np.array(scores).flatten(),
            palette='Set1',
            ax=ax)
ax.set_xlabel('random_state')
ax.set_ylabel('AUC')
Out[37]:
<matplotlib.text.Text at 0x11bf7db38>

random_state appears to have essentially no effect on the model.

Parameter tuning: several parameters jointly

sklearn's GridSearchCV can tune several parameters at once:

In [38]:
ada_grid = GridSearchCV(
    estimator = AdaBoostClassifier(random_state=seed),
    param_grid = {
        'algorithm': ['SAMME', 'SAMME.R'],
        'n_estimators': list(range(1,100,10)),
        'learning_rate': [1e-3, 1e-2, 1e-1, 1]
    }, 
    cv = kfold, 
    scoring = scoring, 
    n_jobs = processors)
ada_grid.fit(X, Y)
print(ada_grid.best_score_)
print(ada_grid.best_params_)
0.697556767125
{'learning_rate': 1, 'n_estimators': 91, 'algorithm': 'SAMME.R'}

The parameters found by GridSearchCV are roughly the same as those found by tuning one parameter at a time.
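
If we wanted to carry the tuned settings into the final prediction step, one option (a sketch, not part of the original run) is to reuse the best estimator that GridSearchCV refits on the full training data:

best_model = ada_grid.best_estimator_    # AdaBoost with the best parameters, refit on X, Y
# best_model.predict_proba(kobe_submit)  # could replace `model` in section 6 below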

6 Model prediction

Finally, we use the trained AdaBoost model to predict on the submission set. In the Kaggle competition these samples have no shot_made_flag value, i.e. the target is withheld, so we cannot judge the model's generalization performance from them. We nevertheless show part of the predictions here:

In [39]:
model.fit(X,Y)
probs = model.predict_proba(kobe_submit)
preds = model.predict(kobe_submit)
submission = pd.DataFrame({
        'shot_id':kobe_submit.index,
        'shot_made_flag':preds,
        'shot_made_flag_probability':probs[:,0]  # probability of class 0 (a miss); use probs[:,1] for the probability of a make
    })
submission.head(5)
Out[39]:
shot_id shot_made_flag shot_made_flag_probability
0 1 0.0 0.503067
1 8 0.0 0.504665
2 17 1.0 0.497304
3 20 1.0 0.494204
4 33 0.0 0.503563