Series and DataFrame Structure
<section> <div tabindex="-1" id="notebook" class="border-box-sizing"> <div id="notebook-container"> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div> <div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <p>The core data structures in pandas are Series and DataFrame. Let's start with the first three rows of the DataFrame object <code>loan_data</code>:</p> </div> </div> </div> <div class="cell border-box-sizing code_cell rendered"> <div class="input"> <div class="prompt input_prompt">In&nbsp;[6]:</div> <div class="inner_cell"> <div class="input_area"> <div class=" highlight hl-ipython3"><pre><span class="n">loan_data</span><span class="p">[:</span><span class="mi">3</span><span class="p">]</span> </pre></div> </div> </div> </div> <div class="output_wrapper"> <div class="output"> <div class="output_area"> <div class="prompt output_prompt">Out[6]:</div> <div class="output_html rendered_html output_subarea output_execute_result"> <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>loan_amnt</th> <th>grade</th> <th>sub_grade</th> <th>emp_length</th> <th>home_ownership</th> <th>annual_inc</th> 
<th>issue_d</th> <th>loan_status</th> <th>open_acc</th> <th>total_pymnt</th> <th>total_rec_int</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>5000.0</td> <td>B</td> <td>B2</td> <td>10+ years</td> <td>RENT</td> <td>24000.0</td> <td>2017/12/11</td> <td>Fully Paid</td> <td>3</td> <td>5863.155187</td> <td>863.16</td> </tr> <tr> <th>1</th> <td>2500.0</td> <td>C</td> <td>C4</td> <td>&lt; 1 year</td> <td>RENT</td> <td>30000.0</td> <td>2017/12/11</td> <td>Charged Off</td> <td>3</td> <td>1014.530000</td> <td>435.17</td> </tr> <tr> <th>2</th> <td>12500.0</td> <td>D</td> <td>D4</td> <td>10+ years</td> <td>RENT</td> <td>74400.0</td> <td>2017/11/11</td> <td>Fully Paid</td> <td>8</td> <td>14722.411910</td> <td>2222.41</td> </tr> </tbody> </table> </div> </div> </div> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div> <div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <p>A DataFrame is a two-dimensional structure; you can think of it as an Excel worksheet. A DataFrame has three parts:</p> <ul> <li>The leftmost column, made up of the numbers 0, 1, 2, is the row index. When people say "the index", they usually mean the row index. With it, we can quickly retrieve any one or more samples (rows) from the data.</li> <li>The top row, made up of the feature (field) names <code>loan_amnt</code>, <code>grade</code>, $\ldots$, <code>total_rec_int</code>, is the column index, that is, the sequence of column names or the table header. With it, we can quickly retrieve all values of any one or more features.</li> <li>Apart from the row index and the column index, each value in the body is the value of one feature for one sample.</li> </ul> <p>A DataFrame consists of an ordered collection of columns, and each column can have a different data type (numeric, string, boolean, and so on).</p> <p>Unlike a DataFrame, a Series is one-dimensional: it is a one-dimensional array with an index, and all of its values share a single dtype. Any single row or column of a DataFrame is a Series. For example, <code>.loc[0]</code> retrieves the first sample (the row labeled 0):</p> </div> </div> </div> <div class="cell border-box-sizing code_cell rendered"> <div class="input"> <div class="prompt input_prompt">In&nbsp;[8]:</div> <div class="inner_cell"> <div class="input_area"> <div class=" highlight hl-ipython3"><pre><span class="n">sample</span> <span class="o">=</span> <span class="n">loan_data</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="nb">print</span><span 
class="p">(</span><span class="n">sample</span><span class="p">)</span> </pre></div> </div> </div> </div> <div class="output_wrapper"> <div class="output"> <div class="output_area"> <div class="prompt"></div> <div class="output_subarea output_stream output_stdout output_text"> <pre>loan_amnt               5000
grade                      B
sub_grade                 B2
emp_length         10+ years
home_ownership          RENT
annual_inc             24000
issue_d           2017/12/11
loan_status       Fully Paid
open_acc                   3
total_pymnt          5863.16
total_rec_int         863.16
Name: 0, dtype: object
</pre> </div> </div> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div> <div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <p>A Series is displayed with the index on the left and the values on the right. This pairing of each index label with its value is always preserved unless you explicitly change it.</p> <p>A Series taken from a DataFrame carries two attributes, <code>name</code> and <code>dtype</code>; the <code>name</code> attribute is closely tied to other key pandas features. <code>Name: 0</code> is the row label of this sample, which here is the first row, and <code>dtype: object</code> means the values are stored as <code>object</code> type (used for strings or for mixed data types, as in this row).</p> <p>Internally, a Series uses two associated arrays and can only represent one-dimensional data. It can be viewed as a fixed-length, ordered dictionary, because it maps index labels to data values. A Series has two main attributes: <code>index</code> (the index labels) and <code>values</code> (the data values).</p> </div> </div> </div> <div class="cell border-box-sizing code_cell rendered"> <div class="input"> <div class="prompt input_prompt">In&nbsp;[9]:</div> <div class="inner_cell"> <div class="input_area"> <div class=" highlight hl-ipython3"><pre><span class="n">sample</span><span class="o">.</span><span class="n">index</span> </pre></div> </div> </div> </div> <div class="output_wrapper"> <div class="output"> <div class="output_area"> <div class="prompt output_prompt">Out[9]:</div> <div class="output_text output_subarea output_execute_result"> <pre>Index([&#39;loan_amnt&#39;, &#39;grade&#39;, &#39;sub_grade&#39;, &#39;emp_length&#39;, &#39;home_ownership&#39;, &#39;annual_inc&#39;, &#39;issue_d&#39;, &#39;loan_status&#39;, &#39;open_acc&#39;, &#39;total_pymnt&#39;, &#39;total_rec_int&#39;], dtype=&#39;object&#39;)</pre> </div> </div> </div> 
</div> </div> <div class="cell border-box-sizing code_cell rendered"> <div class="input"> <div class="prompt input_prompt">In&nbsp;[10]:</div> <div class="inner_cell"> <div class="input_area"> <div class=" highlight hl-ipython3"><pre><span class="n">sample</span><span class="o">.</span><span class="n">values</span> </pre></div> </div> </div> </div> <div class="output_wrapper"> <div class="output"> <div class="output_area"> <div class="prompt output_prompt">Out[10]:</div> <div class="output_text output_subarea output_execute_result"> <pre>array([5000.0, &#39;B&#39;, &#39;B2&#39;, &#39;10+ years&#39;, &#39;RENT&#39;, 24000.0, &#39;2017/12/11&#39;, &#39;Fully Paid&#39;, 3, 5863.155186999999, 863.16], dtype=object)</pre> </div> </div> </div> </div> </div> <div class="cell border-box-sizing text_cell rendered"><div class="prompt input_prompt"> </div> <div class="inner_cell"> <div class="text_cell_render border-box-sizing rendered_html"> <p>Indexing a Series differs from indexing a NumPy array: Series index labels can be non-integers, such as strings, whereas the elements of a NumPy array are addressed by integer position. When arithmetic (<code>+, -, *, /, **</code>) is applied between two Series, pandas aligns the data on matching index labels before computing, even if the two objects have different lengths. Labels that appear in only one of the two objects yield <code>NaN</code> in the result.</p> </div> </div> </div> </div> </div> </section>
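The points above can be checked on a small example. The sketch below is illustrative only: it builds a hypothetical three-row stand-in for `loan_data` with just three of the columns (the real dataset is not reproduced here), and the Series `a` and `b` are made-up names used to show label alignment.

```python
import pandas as pd

# Hypothetical stand-in for loan_data: three columns, three rows.
loan_data = pd.DataFrame(
    {
        "loan_amnt": [5000.0, 2500.0, 12500.0],
        "grade": ["B", "C", "D"],
        "open_acc": [3, 3, 8],
    }
)

# A row is a Series: its index holds the column names,
# and its name is the row label.
sample = loan_data.loc[0]
print(sample.name)         # 0
print(sample.dtype)        # object, because the row mixes numbers and strings
print(list(sample.index))  # ['loan_amnt', 'grade', 'open_acc']

# A column is also a Series, and it keeps the column's own dtype.
amounts = loan_data["loan_amnt"]
print(amounts.dtype)       # float64

# Arithmetic aligns on index labels, not on position or length.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "x"])
total = a + b
print(total)               # x -> 21.0, y -> 12.0, z -> NaN
```

Label `z` exists only in `a`, so its sum is `NaN`, and the result is upcast to `float64` to hold that missing value.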