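This case study uses the Lending Club loan dataset to introduce data visualization, focusing on Matplotlib and Seaborn, two commonly used Python visualization tools. It covers basic charts such as scatter plots, pie charts, and bar charts, as well as more advanced figures such as multi-panel plots.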

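Contents

1. Data Loading and Preprocessing
     1.1 Data Import
     1.2 Data Cleaning
2. Matplotlib Plotting
     2.1 Bar Chart
     2.2 Stacked Bar Chart
     2.3 Pie Chart
     2.4 Line Chart
     2.5 Multi-panel Plots
3. Seaborn Plotting
     3.1 Scatter Plot
     3.2 Point Plot
     3.3 Box Plot
     3.4 Multi-panel Plots
     3.5 Violin Plot
     3.6 Histogram
     3.7 Count Plot
     3.8 Heatmap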

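1. Data Loading and Preprocessing

The background of this case study is loan review. Banks can classify personal credit based on an individual's loan status, which helps them guard against financial fraud. The dataset comes from LendingClub's lending data for the fourth quarter of 2018, with about 40% of the loan records randomly removed.

The dataset has 70,000 rows and 128 columns. Because there are so many columns, only a few important ones are described here for reference.

The full meaning of every feature can be found in the DATA DICTIONARY on the source data web page.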

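Column	Description
loan_status	Current status of the loan
grade	Loan grade assigned by Lending Club (LC)
emp_title	Borrower's job title
annual_inc	Self-reported annual income of the borrower
addr_state	U.S. state given by the borrower in the loan application
int_rate	Interest rate on the loan
installment	Monthly payment owed by the borrower if the loan is funded
sub_grade	Loan subgrade assigned by Lending Club (LC)
emp_length	Employment length in years. Possible values are between 0 and 10, where 0 means less than one year and 10 means ten or more years
home_ownership	Home ownership status provided by the borrower during registration or obtained from the credit report. Values: RENT, OWN, MORTGAGE, OTHER
hardship_payoff_balance_amount	Payoff balance amount as of the hardship plan start date
hardship_last_payment_amount	Last payment amount as of the hardship plan start date
disbursement_method	How the borrower receives the loan. Possible values: CASH, DIRECT_PAY
avg_cur_bal	Average current balance of all accounts
loan_amnt	Loan amount applied for by the borrower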

1.1 Data Import

## Import the base libraries
import pandas as pd
import numpy as np
data = pd.read_csv("./input/lendingclub.csv")
data.head(5)

data.shape

(70000, 128)

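As the output of data.head and data.shape shows, there are 128 columns, many of which contain missing values. Before visualizing the data, we first clean it up a little.

1.2 Data Cleaning

Compute the fraction of missing values in each column, and show the 5 columns with the highest fraction together with their values.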

check_null = data.isnull().sum().sort_values(ascending=False)/float(len(data)) 
print(check_null[:5])

settlement_term            1.0
payment_plan_start_date    1.0
member_id                  1.0
url                        1.0
desc                       1.0
dtype: float64

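Drop the columns in which more than 50% of the values are missing.

For the remaining columns that still contain missing values, fill each gap backward, that is, with the next valid value below it.

# Keep only columns with at least 50% non-missing values
thresh_count = int(len(data) * 0.5)
data.dropna(thresh=thresh_count, axis=1, inplace=True)
# Backward fill; equivalent to the older fillna(method="bfill")
data.bfill(inplace=True)
data.isnull().sum().sum()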

0

After filling, we check again for missing values; the result of 0 confirms that the dataset no longer contains any.
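
The cleaning steps can be tried on a small table first. The following is a minimal sketch on a hypothetical three-column DataFrame; the column names and values are made up for illustration and do not come from the Lending Club data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "some_missing": [np.nan, 2.0, np.nan, 4.0],
    "complete": [10, 20, 30, 40],
})

# Fraction of missing values per column, largest first
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

# Keep columns that have at least 50% non-missing values
kept = toy.dropna(thresh=int(len(toy) * 0.5), axis=1)

# Backward fill: each gap takes the next valid value below it
filled = kept.bfill()

print(missing_ratio)
print(filled)
```

A backward fill cannot fill a gap in the last row, because there is no later value to copy, so on other data a remaining-null check like the one above is still worth running.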

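2. Matplotlib Plotting

Matplotlib is a Python plotting library that can produce a wide variety of figures.

# Import the Matplotlib library
import matplotlib.pyplot as plt

2.1 Bar Chart

For a discrete feature, a bar chart clearly shows the number of samples for each value of the feature.

The value_counts function gives the values of the loan_status feature and the number of samples for each, but the result is not very intuitive.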

data['loan_status'].value_counts()

Current               67944
Fully Paid             1583
In Grace Period         241
Late (31-120 days)      159
Late (16-30 days)        63
Charged Off              10
Name: loan_status, dtype: int64

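The bar function draws a bar chart of the loan_status feature, with the values of loan_status_label on the X axis and the values of loan_status_count on the Y axis. The figure function sets up the figure: the figsize parameter sets its width and height, and the dpi parameter sets its resolution.

loan_status_label = []
loan_status_count = []
# Group by a plain column name so that each group name is a string, not a tuple
for name,group in data.groupby('loan_status'):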
    loan_status_label.append(name)
    loan_status_count.append(group['loan_status'].count())
plt.figure(figsize=(10,8), dpi= 80)
plt.bar(loan_status_label,loan_status_count)
plt.show()

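In the figure above the X-axis tick labels overlap, so we rotate them to make them easier to read.

To rotate the labels, call the xticks function in plt and set its rotation parameter to an angle; here we use 15 degrees.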

plt.figure(figsize=(10,8), dpi= 80)
plt.bar(loan_status_label,loan_status_count)
plt.xticks(rotation=15)
plt.show()

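Next, add a title and axis labels to the figure.

The title function sets the title, and the xlabel and ylabel functions label the X and Y axes; fontsize sets the font size.

## Plot
plt.figure(figsize=(10,8), dpi= 80)
plt.bar(loan_status_label,loan_status_count)
## Polish the plot and add information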
plt.xticks(rotation=15)
plt.title("Numbers of Loan Status" ,fontsize=20)
plt.xlabel("Loan Status" ,fontsize = 18)
plt.ylabel("Number" ,fontsize=18)
plt.show()

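The figure shows that Current accounts for the vast majority of loans, followed by Fully Paid, which indicates that most borrowers' loans are in a normal state.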
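
The labels and counts collected with the groupby loop can also be taken straight from value_counts. The sketch below assumes a small made-up Series in place of data['loan_status'] and uses Matplotlib's object-oriented interface; the Agg backend line is only there so that the example also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for data['loan_status']
status = pd.Series(["Current"] * 5 + ["Fully Paid"] * 2 + ["Charged Off"])

# value_counts returns the labels (index) and counts (values), sorted by count
counts = status.value_counts()

fig, ax = plt.subplots(figsize=(10, 8), dpi=80)
bars = ax.bar(counts.index.tolist(), counts.values)
ax.set_title("Numbers of Loan Status", fontsize=20)
ax.set_xlabel("Loan Status", fontsize=18)
ax.set_ylabel("Number", fontsize=18)
ax.tick_params(axis="x", labelrotation=15)

# Bar heights equal the counts, in the same order as the labels
heights = [bar.get_height() for bar in bars]
print(heights)
```

One difference from the groupby version is the bar order: value_counts sorts by count, whereas groupby sorts by label.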

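2.2 Stacked Bar Chart

Building on the basic bar chart, we can further show how the sample counts of one discrete feature are distributed across another feature. For example, we can draw a stacked bar chart of the LC-assigned loan grade (grade) under different loan statuses (loan_status).

Of the 6 values of loan_status, Current and Fully Paid are treated as the normal loan state (True) and the rest as the abnormal loan state (False).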

data['loan_status_1'] = [(x=='Current' or x =='Fully Paid') for x in data['loan_status']]
print (data['loan_status_1'].value_counts())

True     69527
False      473
Name: loan_status_1, dtype: int64

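The groupby function yields the number of samples of each grade under each loan state; the counts are stored in grade_cat_1 (normal) and grade_cat_2 (abnormal).

grade_labels=[]
grade_cat_1 = []
grade_cat_2 = []
# Group by a plain column name so that each group name is a string, not a tuple
for name,group in data.groupby('grade'):
    grade_labels.append(name)
    grade_cat_1.append(group[group['loan_status_1'] == True]['grade'].count())
    grade_cat_2.append(group[group['loan_status_1'] == False]['grade'].count())

To make the proportions easier to read, the bar heights are expressed as fractions of each state's total, with positive and negative signs distinguishing the two states.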

grade_cat_1 = [+x / sum(grade_cat_1) for x in grade_cat_1]
grade_cat_2 = [-x / sum(grade_cat_2) for x in grade_cat_2]

Now draw the stacked bar chart of loan grade (grade) under the two loan states. It is drawn with the bar function: the X axis holds the values of grade_labels and the Y axis the values of grade_cat_1 and grade_cat_2; the facecolor parameter sets the fill color of the bars and edgecolor the border color. The text function prints the exact Y value of each bar, the ylim function sets the range of the Y axis, and the legend function displays the legend, where the first argument is the legend content and the loc parameter is its position.

plt.figure(figsize=(10,8), dpi= 80)
plt.bar(grade_labels, grade_cat_1, facecolor='#9999ff', edgecolor='white')
plt.bar(grade_labels, grade_cat_2, facecolor='#ff9999', edgecolor='white')
# Show the proportion as text on each bar for clarity
for x,y in zip(range(len(grade_labels)),grade_cat_1):
    plt.text(x, y+0.05, '%.2f' % y, ha='center', va= 'bottom')
for x,y in zip(range(len(grade_labels)),grade_cat_2):
    plt.text(x, y-0.05, '%.2f' % y, ha='center', va= 'bottom')

plt.ylim(-0.5,+0.5)
plt.legend(['Normal', 'Abnormal'], loc='lower right', scatterpoints=1)
plt.title("Loan Status at different grades",fontsize = 16)
plt.xlabel("Grades")
plt.ylabel("Loan Status")
plt.show()

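The figure shows that as the LC loan grade (grade) declines, abnormal loan states account for a growing share; in other words, the share of loans that would be approved for lending keeps falling.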
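
The proportions behind this chart can also be computed with pandas.crosstab, which makes explicit what each number is a fraction of. This is a sketch of an alternative, not the approach used above; the toy data and the column name state are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the grade and loan_status_1 columns
toy = pd.DataFrame({
    "grade": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "state": ["normal", "normal", "normal", "normal",
              "abnormal", "normal", "abnormal", "abnormal"],
})

# Share of each grade within a state (each column sums to 1);
# this corresponds to the bar heights computed above
share_within_state = pd.crosstab(toy["grade"], toy["state"], normalize="columns")

# Share of each state within a grade (each row sums to 1);
# this is the abnormal rate per grade
share_within_grade = pd.crosstab(toy["grade"], toy["state"], normalize="index")

print(share_within_state)
print(share_within_grade)
```

The two normalizations answer different questions: the first describes which grades make up each state, while the second describes how risky each grade is.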

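2.3 Pie Chart

A pie chart gives an intuitive view of the share taken by each value. For the grade feature, for example, the pie function can plot the proportion of samples in each LC loan grade (grade). Here the grade_count list stores the number of samples for each grade; the explode parameter sets how far each wedge is offset from the center; the labels parameter gives the text shown outside each wedge, here the grade; the autopct parameter controls the format of the percentages shown inside the pie, for example %1.1f%% shows one digit after the decimal point; and the shadow parameter sets whether the pie casts a shadow.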

grade_count = []
for name,group in data.groupby(['grade']):
    grade_count.append(group['grade'].count())
plt.figure(figsize=(10,8), dpi= 80)
plt.pie(grade_count,explode=None,labels=grade_labels,autopct='%1.1f%%',shadow=True,startangle=50)
plt.axis('equal')
plt.title("Grade")
plt.show()

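The figure shows that most borrowers have a relatively high credit grade: borrowers in grades A and B make up more than 50% of the total.

2.4 Line Chart

A line chart is well suited to showing how one feature changes with another, for example how the average annual income (annual_inc) of borrowers with a given job title (say, emp_title equal to director) varies with employment length (emp_length).

First, because the emp_title values use inconsistent capitalization, we convert them all to lowercase so that job titles are aggregated more accurately.

data['emp_title'] = [str(x).lower() for x in data['emp_title']]

Then, compute the average annual income of directors for each employment length.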

emp_len_list = ['< 1 year','1 year','2 years','3 years','4 years','5 years','6 years','7 years','8 years','9 years','10+ years']
avg_inc_list = []
data_director = data[data['emp_title'] == 'director']
for emp_len in emp_len_list:
    avg_inc_list.append(data_director[data_director['emp_length'] == emp_len]['annual_inc'].mean())

Draw the line chart with the plot function. The X and Y axes show emp_len_list and avg_inc_list respectively, and the linewidth parameter sets the line width.

plt.figure(figsize=(15,8), dpi= 80)
plt.plot(emp_len_list,avg_inc_list,linewidth=1) 
plt.xticks(rotation = 20)
plt.title("Average annual income for different years of employment",fontsize = 20)
plt.xlabel("Emp_length",fontsize = 18)
plt.ylabel("Annual_income",fontsize = 16)
plt.show()

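The figure shows that, within the same job title, average annual income does not strictly increase with employment length. Differences between companies and between individuals also affect income, so the trend may be noisy.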
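
The per-category averages can also be obtained with a single groupby followed by reindex, which puts the categories in a chosen order. The following is a minimal sketch on made-up values; the shortened order list is only an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "emp_length": ["1 year", "< 1 year", "1 year", "10+ years"],
    "annual_inc": [50000.0, 40000.0, 70000.0, 100000.0],
})

# Desired display order (alphabetical sorting would put "10+ years" too early)
order = ["< 1 year", "1 year", "2 years", "10+ years"]

# Mean income per employment length, arranged in the desired order;
# categories with no rows become NaN
avg_inc = toy.groupby("emp_length")["annual_inc"].mean().reindex(order)
print(avg_inc)
```

A category without any records shows up as NaN, which Matplotlib draws as a gap in the line instead of silently connecting its neighbors.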

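2.5 Multi-panel Plots

When a group of plots needs to be placed side by side for comparison, the subplot function is a good fit: it divides the figure into several regions and draws a separate subplot in each. subplot takes the parameters numRows, numCols, and plotNum. The whole plotting area is divided into numRows rows and numCols columns, and the regions are numbered from left to right and top to bottom, with the top-left region numbered 1. The plotNum parameter specifies the region in which the subplot is created. As an example, we draw one bar chart for the 10 job titles (emp_title) with the highest average annual income (annual_inc) and another for the 10 job titles with the lowest.

First build the lists max_emp_list and max_avg_inc, which hold the job titles and their average annual incomes sorted by income in descending order.

emp_list  = []
avg_inc = []
# Group by a plain column name so that each group name is a string, not a tuple
for name,group in data.groupby('emp_title'):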
    emp_list.append(name)
    avg_inc.append(group['annual_inc'].mean())

dict_list = dict(zip(emp_list,avg_inc))
dict_list_1 = sorted(dict_list.items(), key=lambda x: x[1],reverse=True)
max_emp_list = []
max_avg_inc = []
for key,value in dict_list_1:
    max_emp_list.append(key)
    max_avg_inc.append(value)

Draw a bar subplot for the 10 job titles with the highest average annual income and another for the 10 with the lowest.

plt.figure(figsize=(15,12), dpi= 80)
plt.subplot(2,1,1)
plt.xticks(np.arange(len(max_emp_list[:10])), max_emp_list[:10])
plt.bar(np.arange(len(max_emp_list[:10])),max_avg_inc[:10])
plt.title("Top 10",fontsize = 18)
plt.xticks(rotation = 15)
plt.xlabel("Emp_title", fontsize = 16)
plt.ylabel("Annual_income",fontsize = 16)
plt.subplot(2,1,2)
plt.xticks(np.arange(len(max_emp_list[-10:])), max_emp_list[-10:])
plt.bar(np.arange(len(max_emp_list[-10:])),max_avg_inc[-10:])
plt.title("Minimum 10",fontsize = 18)
plt.xticks(rotation = 15)
plt.xlabel("Emp_title", fontsize = 16)
plt.ylabel("Annual_income",fontsize = 16)
plt.show()

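The figure above is cluttered: the axis labels of the upper subplot overlap the lower subplot and are hard to read. The tight_layout function adjusts the spacing automatically so that the subplots fit together.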

plt.figure(figsize=(15,12), dpi= 80)
plt.subplot(2,1,1)
plt.xticks(np.arange(len(max_emp_list[:10])), max_emp_list[:10])
plt.bar(np.arange(len(max_emp_list[:10])),max_avg_inc[:10])
plt.title("Top 10",fontsize = 18)
plt.xticks(rotation = 15)
plt.xlabel("Emp_title", fontsize = 16)
plt.ylabel("Annual_income",fontsize = 16)
plt.subplot(2,1,2)
plt.xticks(np.arange(len(max_emp_list[-10:])), max_emp_list[-10:])
plt.bar(np.arange(len(max_emp_list[-10:])),max_avg_inc[-10:])
plt.title("Minimum 10",fontsize = 18)
plt.xticks(rotation = 15)
plt.xlabel("Emp_title", fontsize = 16)
plt.ylabel("Annual_income",fontsize = 16)
plt.tight_layout()
plt.show()

The figure now makes it clear that annual income varies widely among loan applicants; there are large gaps even within the top 10.
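
The same two-panel layout can be sketched more compactly with plt.subplots, which returns the axes as an array, and with the pandas methods nlargest and nsmallest for the ranking. The data below is hypothetical and shows only the top 2 and bottom 2 titles; the Agg backend line is there so that the example also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; titles and incomes are made up
toy = pd.DataFrame({
    "emp_title": ["ceo", "ceo", "nurse", "teacher", "driver", "driver"],
    "annual_inc": [300000.0, 200000.0, 80000.0, 60000.0, 45000.0, 35000.0],
})

# Mean income per title, then the highest and lowest entries
mean_inc = toy.groupby("emp_title")["annual_inc"].mean()
top = mean_inc.nlargest(2)
bottom = mean_inc.nsmallest(2)

# Two rows, one column; axes[0] is the upper panel, axes[1] the lower one
fig, axes = plt.subplots(2, 1, figsize=(15, 12), dpi=80)
for ax, series, title in zip(axes, [top, bottom], ["Top 2", "Bottom 2"]):
    ax.bar(series.index.tolist(), series.values)
    ax.set_title(title, fontsize=18)
    ax.set_xlabel("Emp_title", fontsize=16)
    ax.set_ylabel("Annual_income", fontsize=16)
    ax.tick_params(axis="x", labelrotation=15)

# Adjust spacing so that the panels do not overlap
fig.tight_layout()
```

Looping over the axes avoids repeating the styling calls for each panel, which is the main reason to prefer this form once there are more than two subplots.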

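3. Seaborn Plotting

Seaborn is a Python data visualization library built on Matplotlib. Compared with Matplotlib, it is more convenient to use and produces more attractive figures.

# Import the Seaborn library
import seaborn as sns

3.1 Scatter Plot

A scatter plot displays two features in a two-dimensional coordinate system and reveals the relationship between them.

Seaborn's stripplot function shows the values of the y feature separately for each category of the x feature, which makes it suitable for categorical data. For example, we can draw a scatter plot of borrowers' average current balance (avg_cur_bal) by state (addr_state). The x parameter sets the grouping field and y the field whose distribution is shown; the jitter parameter spreads out points when many of them overlap; and the hue parameter colors the points by a further category, here the loan status (loan_status).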

plt.figure(figsize=(15,8), dpi= 80)
sns.stripplot(data=data, x='addr_state', y='avg_cur_bal', jitter=True,hue='loan_status')
plt.show()

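The scatter plot shows that, whatever the state, applicants' current balances are concentrated below 100000; applicants in CA (California) and NY (New York) have somewhat higher average balances by comparison.

3.2 Point Plot

A point plot represents an estimate of the central tendency of a numeric variable by the position of a point, and uses error bars to indicate the uncertainty of that estimate. For example, the pointplot function can draw a point plot of average current balance (avg_cur_bal) for the two loan states (loan_status_1).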

plt.figure(figsize=(10,8), dpi= 80)
sns.pointplot(x="loan_status_1",y="avg_cur_bal",data=data)
plt.show()

/explorer/pyenv/jupyter-py36/lib/python3.6/site-packages/scipy/stats/stats.py:1706: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval

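The figure shows that borrowers in the normal loan state have a higher average balance than those in the abnormal state, and that the error bar for the normal state is much shorter, so its mean is pinned down more tightly.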
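
To make the meaning of the point and its error bar concrete, the sketch below computes a mean and a 95% bootstrap confidence interval with NumPy. It assumes that the error bar represents a bootstrap interval around the mean, which is how Seaborn's documentation describes the default; it illustrates the idea on synthetic data and is not Seaborn's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed balances, standing in for avg_cur_bal
balances = rng.exponential(scale=10000.0, size=200)

# Resample with replacement many times and record the mean of each resample
n_boot = 1000
boot_means = np.array([
    rng.choice(balances, size=balances.size, replace=True).mean()
    for _ in range(n_boot)
])

# The point is the sample mean; the error bar spans the middle 95% of resampled means
mean = balances.mean()
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={mean:.0f}, 95% CI=({ci_low:.0f}, {ci_high:.0f})")
```

Under this reading, a group with many more records gets a narrower interval, which is consistent with the much larger normal group having the shorter error bar.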

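3.3 Box Plot

A box plot is a statistical chart that shows how a set of data is spread out. It displays the upper and lower bounds (whiskers), the upper and lower quartiles, the median, and the outliers. For example, the boxplot function can draw a box plot of borrowers' average current balance (avg_cur_bal).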

plt.figure(figsize=(10,8), dpi= 80)
sns.boxplot(y='avg_cur_bal',data=data)
plt.show()

plt.figure(figsize=(10,8), dpi= 80)
sns.boxplot(x='avg_cur_bal',data=data)
plt.show()

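Because borrowers differ in loan status, we can also draw a grouped box plot of average current balance (avg_cur_bal) by loan status (loan_status). The grouping factor is loan_status, and each group is drawn at a different position along the X axis.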

plt.figure(figsize=(10,8), dpi= 80)
sns.boxplot(y="avg_cur_bal", x="loan_status", data=data)
plt.xticks(rotation = 15)
plt.show()

The plot shows that average balances are generally low and that the outliers lie on the high side, so the distribution is right-skewed.
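
The quantities a box plot displays can be computed by hand. The sketch below assumes the common convention that points more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers; the values are made up for illustration.

```python
import numpy as np

# Hypothetical values with one very large observation
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Quartiles and median define the box and its center line
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, outliers)
```

Here the single large value falls above the upper fence while nothing falls below the lower one, the same one-sided pattern that indicates right skew in the balance data.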

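3.4 Multi-panel Plots

The figure above contains several categories and looks crowded. The catplot function can draw a multi-panel plot instead, and its kind parameter selects the type of plot. For example, kind='box' splits the box plot above into one box plot per panel.

# catplot creates its own figure, so a preceding plt.figure call would only add an empty figure;
# use the height and aspect parameters to control the panel size
sns.catplot(y="avg_cur_bal",col = 'loan_status',data=data,kind='box', aspect=.5,legend=False)
plt.show()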

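The multi-panel box plot makes the comparison clearer: the normal loan statuses (Current and Fully Paid) have more outliers, and these lie further toward the high side, which suggests that borrowers in a normal loan state tend to hold larger current balances.

3.5 Violin Plot

A violin plot shows the distribution of the data together with its probability density. It combines features of a box plot and a density plot and is mainly used to show the shape of a distribution. For example, the violinplot function can draw a violin plot of borrowers' average current balance (avg_cur_bal).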

plt.figure(figsize=(10,8), dpi= 80)
sns.violinplot(y="avg_cur_bal",data=data)
plt.show()

Draw a grouped violin plot of average current balance (avg_cur_bal) by loan status (loan_status), with loan_status as the grouping factor.

plt.figure(figsize=(10,8), dpi= 80)
sns.violinplot(x="loan_status",y="avg_cur_bal",data=data)
plt.xticks(rotation = 15)
plt.show()

To see how average current balance (avg_cur_bal) is distributed for each loan grade (grade) under the two loan states (loan_status_1), simply add the hue parameter.

plt.figure(figsize=(20,10), dpi= 80)
sns.violinplot(x="grade",y="avg_cur_bal",hue="loan_status_1",data=data)
plt.show()

Setting split=True merges each pair of grouped violins into one, which makes the contrast clearer.

plt.figure(figsize=(20,10), dpi= 80)
sns.violinplot(x="grade",y="avg_cur_bal",hue="loan_status_1",split=True,data=data)
plt.show()

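3.6 Histogram

A histogram shows how the values of a feature are distributed. For example, the distplot function can draw a histogram of the borrower's monthly payment (installment).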

plt.figure(figsize=(10,8), dpi= 80)
sns.distplot(data['installment'])
plt.show()

The histogram above divides installment into too many bins; the bins parameter sets the number of bins.

plt.figure(figsize=(10,8), dpi= 80)
sns.distplot(data['installment'],bins=10)
plt.show()
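
What a bin count of 10 means can be checked numerically with np.histogram. This is a sketch on synthetic values, not the installment column, and it only illustrates the idea of equal-width bins rather than the internals of distplot.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical monthly payments, standing in for data['installment']
installment = rng.uniform(30.0, 1500.0, size=1000)

# Split the value range into 10 equal-width bins and count the points in each
counts, edges = np.histogram(installment, bins=10)

print(counts)  # number of values per bin
print(edges)   # 11 bin edges, from the minimum to the maximum value
```

Ten bins need eleven edges, and every value falls into exactly one bin, so the counts always add up to the number of records.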

The jointplot function draws a two-dimensional histogram.

# jointplot creates its own figure, so a preceding plt.figure call would only add an empty figure
sns.jointplot(x="installment", y="avg_cur_bal", data=data, kind="hex")
plt.show()

As the figure above shows, a few very large values of average current balance (avg_cur_bal) stretch the axis and make the plot hard to read. We drop those records and plot again.

# Keep only records whose average current balance is at most 80000
data_copy = data[data['avg_cur_bal'] <= 80000].copy()
sns.jointplot(x="installment", y="avg_cur_bal", data=data_copy, kind="hex",color="r")
plt.show()

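A two-dimensional histogram shows the joint distribution of two features in a dataset. In the figure above, the top and right margins show the histograms of the monthly payment (installment) and the average current balance (avg_cur_bal) respectively, and the center shows a hexbin plot of the two variables, where a darker color means more data points fall in that cell.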
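
The relationship between the central panel and the two marginal histograms can be sketched with np.histogram2d. Note that this uses rectangular bins instead of the hexagons drawn by kind="hex", and that the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values standing in for installment and avg_cur_bal
installment = rng.uniform(30.0, 1500.0, size=500)
avg_cur_bal = rng.exponential(scale=15000.0, size=500)

# Count the points in a 5 x 4 grid of rectangular cells
counts2d, x_edges, y_edges = np.histogram2d(installment, avg_cur_bal, bins=(5, 4))

# Summing over one variable gives the marginal histogram of the other
marginal_x = counts2d.sum(axis=1)  # histogram of installment
marginal_y = counts2d.sum(axis=0)  # histogram of avg_cur_bal

print(counts2d)
print(marginal_x, marginal_y)
```

Each cell count plays the role of the color intensity in the hexbin panel, and the row and column sums correspond to the histograms on the margins.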

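3.7 Count Plot

A count plot can be thought of as a histogram for categorical data: it groups the input by category and shows the number of records in each. For example, the countplot function can draw a count plot of borrowers' home ownership status (home_ownership).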

plt.figure(figsize=(10,8), dpi= 80)
sns.countplot(x="home_ownership",data=data) 
plt.show()

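Unlike a bar chart, a count plot cannot take both x and y at the same time. Passing x draws vertical bars, and passing y draws horizontal bars.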

plt.figure(figsize=(10,8), dpi= 80)
sns.countplot(y="home_ownership",data=data) 
plt.show()

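The figure shows that most borrowers either have a mortgage on their home or rent it.

3.8 Heatmap

A heatmap shows the pairwise correlation between many features of a dataset. For example, the corr function computes the correlation matrix of the data, and the heatmap function draws it as a heatmap.

plt.figure(figsize=(14,14),dpi=80)
# Correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(data.select_dtypes(include=['number', 'bool']).corr())
plt.show()

The cmap parameter sets the color map of the heatmap, for example Blues.

plt.figure(figsize=(14,14),dpi=80)
sns.heatmap(data.select_dtypes(include=['number', 'bool']).corr(),cmap="Blues")
plt.show()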

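Whatever the color map, the heatmap shows that the correlation between most features is not very high. To show the size of each correlation more directly, set the annot parameter to True to print the correlation of each pair of features in its cell; the fmt parameter sets the number format, for example fmt='.2f' shows two digits after the decimal point. Because there are so many features, we plot the correlation heatmap only for loan_amnt, installment, avg_cur_bal, and annual_inc.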

plt.figure(figsize=(10,8), dpi= 80)
data_1 = data[['loan_amnt', 'installment','avg_cur_bal','annual_inc']]
sns.heatmap(data_1.corr(),fmt='.2f',annot=True)
plt.show()

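The figure shows that the monthly payment (installment) is strongly positively correlated with the loan amount (loan_amnt), and that the other pairs of features are also positively correlated to some degree.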
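
The matrix that heatmap colors is an ordinary DataFrame and can be inspected directly. The following is a minimal sketch on made-up numbers in which installment is exactly proportional to loan_amnt, so their correlation is 1; the grade column is included only to show that non-numeric columns have to be left out.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0],
    "installment": [35.0, 70.0, 105.0, 140.0],    # exactly proportional to loan_amnt
    "avg_cur_bal": [900.0, 300.0, 1200.0, 500.0],
    "grade": ["A", "B", "C", "D"],                # non-numeric, excluded below
})

# Pearson correlation between every pair of numeric columns
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, so only the cells on one side of the diagonal carry distinct information.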

Output of data.head(5) from Section 1.1 (columns truncated):

	id	member_id	loan_amnt	funded_amnt	funded_amnt_inv	term	int_rate	installment	grade	sub_grade	...	orig_projected_additional_accrued_interest	hardship_payoff_balance_amount	hardship_last_payment_amount	debt_settlement_flag	debt_settlement_flag_date	settlement_status	settlement_date	settlement_amount	settlement_percentage	settlement_term
0	NaN	NaN	2500	2500	2500	36 months	13.56%	84.92	C	C1	...	NaN	NaN	NaN	N	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	30000	30000	30000	60 months	18.94%	777.23	D	D2	...	NaN	NaN	NaN	N	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	5000	5000	5000	36 months	17.97%	180.69	D	D1	...	NaN	NaN	NaN	N	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	4000	4000	4000	36 months	18.94%	146.51	D	D2	...	NaN	NaN	NaN	N	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	30000	30000	30000	60 months	16.14%	731.78	C	C4	...	NaN	NaN	NaN	N	NaN	NaN	NaN	NaN	NaN	NaN