<section>
<div tabindex="-1" id="notebook" class="border-box-sizing">
<div id="notebook-container">
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The core data structures in Pandas are Series and DataFrame. Let's start with the first three rows of the DataFrame object <code>loan_data</code>:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [6]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span class="n">loan_data</span><span class="p">[:</span><span class="mi">3</span><span class="p">]</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="prompt output_prompt">Out[6]:</div>
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>loan_amnt</th>
<th>grade</th>
<th>sub_grade</th>
<th>emp_length</th>
<th>home_ownership</th>
<th>annual_inc</th>
<th>issue_d</th>
<th>loan_status</th>
<th>open_acc</th>
<th>total_pymnt</th>
<th>total_rec_int</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>5000.0</td>
<td>B</td>
<td>B2</td>
<td>10+ years</td>
<td>RENT</td>
<td>24000.0</td>
<td>2017/12/11</td>
<td>Fully Paid</td>
<td>3</td>
<td>5863.155187</td>
<td>863.16</td>
</tr>
<tr>
<th>1</th>
<td>2500.0</td>
<td>C</td>
<td>C4</td>
<td>&lt; 1 year</td>
<td>RENT</td>
<td>30000.0</td>
<td>2017/12/11</td>
<td>Charged Off</td>
<td>3</td>
<td>1014.530000</td>
<td>435.17</td>
</tr>
<tr>
<th>2</th>
<td>12500.0</td>
<td>D</td>
<td>D4</td>
<td>10+ years</td>
<td>RENT</td>
<td>74400.0</td>
<td>2017/11/11</td>
<td>Fully Paid</td>
<td>8</td>
<td>14722.411910</td>
<td>2222.41</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
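<p>A DataFrame is a two-dimensional data structure; you can think of it as an Excel worksheet. A DataFrame has three parts:</p>
<ul>
<li>The leftmost column, made up of the numbers 0, 1, 2, is the row index. When people say "the index" they usually mean the row index. It lets us quickly retrieve any one or more samples from the data.</li>
<li>The feature (field) names along the top, <code>loan_amnt</code>, <code>grade</code>, $\ldots$, <code>total_rec_int</code>, form the column index, that is, the sequence of column names or the table header. It lets us quickly retrieve all values of any one or more features.</li>
<li>Apart from the row index and column index, each value in the middle is the value of one feature for one sample.</li>
</ul>
<p>A DataFrame consists of multiple columns arranged in a fixed order, and each column may have a different data type (numeric, string, boolean, and so on).</p>
<p>Unlike a DataFrame, a Series is one-dimensional: it is a one-dimensional array with an index, and all of its values share a single <code>dtype</code>. Any single row or column of a DataFrame is a Series object. For example, <code>.loc[0]</code> retrieves the row labeled 0, which here is the first sample:</p>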
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [8]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span class="n">sample</span> <span class="o">=</span> <span class="n">loan_data</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="prompt"></div>
<div class="output_subarea output_stream output_stdout output_text">
<pre>loan_amnt               5000
grade                      B
sub_grade                 B2
emp_length         10+ years
home_ownership          RENT
annual_inc             24000
issue_d           2017/12/11
loan_status       Fully Paid
open_acc                   3
total_pymnt          5863.16
total_rec_int         863.16
Name: 0, dtype: object
</pre>
</div>
</div>
</div>
</div>
</div>
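<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The three parts of a DataFrame described earlier (row index, column index, and cell values) can be inspected directly. The sketch below is only an illustration: <code>loan_sample</code> is a hypothetical stand-in built from three columns of the rows shown at the top, not the real <code>loan_data</code> object.</p>

```python
import pandas as pd

# Hypothetical stand-in for loan_data: three columns of the three rows shown earlier.
loan_sample = pd.DataFrame(
    {
        "loan_amnt": [5000.0, 2500.0, 12500.0],
        "grade": ["B", "C", "D"],
        "open_acc": [3, 3, 8],
    }
)

row_labels = list(loan_sample.index)        # row index
column_labels = list(loan_sample.columns)   # column index (header)
cell = loan_sample.loc[2, "grade"]          # one feature of one sample

first_row = loan_sample.loc[0]    # a single row is a Series
grade_col = loan_sample["grade"]  # a single column is a Series

print(row_labels)
print(column_labels)
print(cell)
print(type(first_row).__name__, type(grade_col).__name__)
print(loan_sample.dtypes)         # each column keeps its own dtype
```

<p>A row taken with <code>.loc</code> mixes numbers and strings, so it falls back to the <code>object</code> dtype, while each column keeps the dtype of its own data.</p>
</div>
</div>
</div>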
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
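<p>A Series is displayed with its index labels on the left and its values on the right. This pairing of label and value is always preserved unless you explicitly change the alignment.</p>
<p>A Series taken from a DataFrame carries two attributes, <code>name</code> and <code>dtype</code>; the <code>name</code> attribute is closely tied to other key Pandas features. <code>Name: 0</code> is the row label and shows that this is the first sample. <code>dtype: object</code> means all values are stored as <code>object</code> type, which is used for strings or for mixed data types, as in this row.</p>
<p>Internally, a Series uses two associated arrays and can only represent one-dimensional data. A Series can be viewed as a fixed-length, ordered dictionary, because it maps index values to data values. A Series object has two main attributes: <code>index</code> (the index) and <code>values</code> (the data values).</p>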
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [9]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span class="n">sample</span><span class="o">.</span><span class="n">index</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="prompt output_prompt">Out[9]:</div>
<div class="output_text output_subarea output_execute_result">
<pre>Index(['loan_amnt', 'grade', 'sub_grade', 'emp_length', 'home_ownership',
'annual_inc', 'issue_d', 'loan_status', 'open_acc', 'total_pymnt',
'total_rec_int'],
dtype='object')</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="prompt input_prompt">In [10]:</div>
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span class="n">sample</span><span class="o">.</span><span class="n">values</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="prompt output_prompt">Out[10]:</div>
<div class="output_text output_subarea output_execute_result">
<pre>array([5000.0, 'B', 'B2', '10+ years', 'RENT', 24000.0, '2017/12/11',
'Fully Paid', 3, 5863.155186999999, 863.16], dtype=object)</pre>
</div>
</div>
</div>
</div>
</div>
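<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The "ordered dictionary" view of a Series can be made concrete with a small example. This is a minimal sketch: <code>sample_demo</code> is a hypothetical Series that copies three fields of the first sample, so that the block runs without <code>loan_data</code>.</p>

```python
import pandas as pd

# Hypothetical Series mirroring three fields of the first sample shown above.
sample_demo = pd.Series(
    {"loan_amnt": 5000.0, "grade": "B", "open_acc": 3},
    name=0,  # same role as "Name: 0" in the printed output
)

labels = list(sample_demo.index)     # the index half of the mapping
values = list(sample_demo.values)    # the data half of the mapping

grade = sample_demo["grade"]         # look a value up by label, like a dict key
has_grade = "grade" in sample_demo   # membership is tested against the labels
as_dict = sample_demo.to_dict()      # back to a plain Python dict

print(sample_demo.name, sample_demo.dtype)
print(labels)
print(grade, has_grade)
```

<p>Because the values mix numbers and a string, the Series falls back to the <code>object</code> dtype, as in the output of <code>print(sample)</code> earlier.</p>
</div>
</div>
</div>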
<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
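<p>The index of a Series differs from that of a NumPy array: Series index values can be non-integer, such as strings, whereas a NumPy array can only be indexed by integer position. When arithmetic (<code>+,-,*,/,**</code>) is performed between Series, values with matching index labels are aligned automatically before the operation, even if the two objects differ in length. Labels that appear in only one of the operands produce <code>NaN</code> in the result.</p>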
</div>
</div>
</div>
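<div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt">
</div>
<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Index alignment during arithmetic can be sketched as follows. The Series <code>a</code> and <code>b</code> and their labels are made up for illustration; they have different lengths and list their labels in different orders.</p>

```python
import pandas as pd

# Two hypothetical Series with string labels, different lengths, different order.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["z", "x"])

# Values are matched by label, not by position.
# "y" exists only in a, so its result is NaN.
total = a + b

# Series.add with fill_value treats a label missing from one side as 0.
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```

<p>If a missing label should count as zero rather than produce <code>NaN</code>, the method form with <code>fill_value</code> is one option; which behavior is appropriate depends on the data.</p>
</div>
</div>
</div>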
</div>
</div>
</section>