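This case study accompanies Chapter 6, Time Series Processing, of the Pandas data analysis course. It uses German energy production and consumption data, which we analyze with the time series methods that Pandas provides.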

Contents

1. Dataset
2. Data analysis
     2.1 Importing the data
     2.2 Selecting data by time index
     2.3 Basic operations on time data
     2.4 Periodicity analysis
     2.5 Rolling windows

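1 Dataset

This case study uses a dataset of German energy production and consumption from 2006 to 2017, published by Open Power System Data (https://open-power-system-data.org/). It covers nationwide electricity consumption, wind power generation, and solar power generation, all in GWh. We analyze this time series with various Pandas methods and use plot for simple visualizations to get a more intuitive view of the data. The fields are described below: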

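Variable	Description
Date	Date
Consumption	Electricity consumption (GWh)
Wind	Wind power generation (GWh)
Solar	Solar power generation (GWh)
Wind+Solar	Sum of wind and solar power generation (GWh)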

2 Data analysis

2.1 Importing the data

import pandas as pd
opsd = pd.read_csv('./input/opsd_germany_daily.csv')

View the first five rows.

opsd.head()

View the last five rows.

opsd.tail()

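The data record Germany's daily electricity consumption and its wind and solar generation from 1 January 2006 to 31 December 2017. The wind and solar values are missing (NaN) for 2006. A missing value here means that the figure was not recorded in this dataset for that day; it should not be read as zero generation. By the end of 2017, combined daily wind and solar generation made up a large share of daily electricity consumption (for example, 741.156 GWh against about 1107.115 GWh on 31 December 2017), which shows how quickly these sources have grown.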

Use dtypes to check the data types.

opsd.dtypes

Date            object
Consumption    float64
Wind           float64
Solar          float64
Wind+Solar     float64
dtype: object

The Date variable has type "object" (strings), so we convert it to datetime values with to_datetime.

opsd['Date'] = pd.to_datetime(opsd['Date'])

opsd.dtypes

Date           datetime64[ns]
Consumption           float64
Wind                  float64
Solar                 float64
Wind+Solar            float64
dtype: object

Then use set_index to make the Date variable the index.

opsd.set_index('Date',inplace=True)

opsd.head()

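We can also do all of this at import time through the parameters of read_csv. Setting index_col to 0 uses the first column of the file as the index, and setting parse_dates to True makes read_csv parse that index as datetime values.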

opsd = pd.read_csv('./input/opsd_germany_daily.csv', index_col=0, parse_dates=True)

opsd.head()
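
The two import routes described above can be compared on a small in-memory sample. This is only a sketch: csv_text is a hypothetical three-row excerpt that mimics the layout of the real file, and io.StringIO stands in for the file path.

```python
import io
import pandas as pd

# Hypothetical sample with the same columns as opsd_germany_daily.csv.
csv_text = """Date,Consumption,Wind,Solar,Wind+Solar
2006-01-01,1069.184,,,
2006-01-02,1380.521,,,
2006-01-03,1442.533,,,
"""

# Route 1: read the file, convert the Date column, then set it as the index.
two_step = pd.read_csv(io.StringIO(csv_text))
two_step["Date"] = pd.to_datetime(two_step["Date"])
two_step = two_step.set_index("Date")

# Route 2: let read_csv parse the first column as a datetime index directly.
one_step = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# Both routes should end up with the same dates in a DatetimeIndex.
print(type(one_step.index).__name__)
print(list(one_step.index) == list(two_step.index))
```

Empty fields in the sample are read as NaN, in the same way as the missing wind and solar values in the real data.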

Check the index at this point.

opsd.index

DatetimeIndex(['2006-01-01', '2006-01-02', '2006-01-03', '2006-01-04',
               '2006-01-05', '2006-01-06', '2006-01-07', '2006-01-08',
               '2006-01-09', '2006-01-10',
               ...
               '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
               '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
               '2017-12-30', '2017-12-31'],
              dtype='datetime64[ns]', name='Date', length=4383, freq=None)

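The index is a DatetimeIndex with dtype datetime64[ns], and its frequency freq is None, because no frequency was specified when the dates were parsed. Since we know that the data are recorded daily, we can set the frequency with asfreq. If any dates are missing from the data, asfreq adds rows for them and fills the new rows with NaN by default.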

opsd = opsd.asfreq('D')

opsd.index

DatetimeIndex(['2006-01-01', '2006-01-02', '2006-01-03', '2006-01-04',
               '2006-01-05', '2006-01-06', '2006-01-07', '2006-01-08',
               '2006-01-09', '2006-01-10',
               ...
               '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
               '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
               '2017-12-30', '2017-12-31'],
              dtype='datetime64[ns]', name='Date', length=4383, freq='D')

The frequency is now set to daily ('D').
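
To see how asfreq treats a gap, here is a minimal sketch on a made-up series in which one day is deliberately left out. The values and the variable names are illustrative only.

```python
import pandas as pd

# Toy series with 2006-01-03 deliberately missing (values are made up).
dates = pd.to_datetime(["2006-01-01", "2006-01-02", "2006-01-04"])
toy = pd.Series([1.0, 2.0, 4.0], index=dates)

# asfreq('D') inserts the missing day and fills it with NaN by default.
daily = toy.asfreq("D")

# Optionally, carry the last known value forward into the new row instead.
filled = toy.asfreq("D", method="ffill")

print(daily)
print(filled)
```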

2.2 Selecting data by time index

With a datetime index, loc makes it easy to select data by date. For example, to look up the data for 10 August 2017:

opsd.loc['2017-08-10']

Consumption    1351.491
Wind            100.274
Solar            71.160
Wind+Solar      171.434
Name: 2017-08-10 00:00:00, dtype: float64

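We can also select a span of time, for example 20 January 2014 to 22 January 2014. As with ordinary label-based slicing with loc, the slice includes both endpoints.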

opsd.loc['2014-01-20':'2014-01-22']

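In both examples above we specified the date down to the day. We can also give only the year and month, which returns all rows for that month. For example, to get the data for January 2017: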

opsd.loc['2017-01']

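truncate is another way to select a time range. The before argument drops all rows before the given date and after drops all rows after it; both boundary dates are kept. For example, to select January 2017 again: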

opsd.truncate(before='2017-01-01',after='2017-01-31')
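
The three ways of selecting a month shown above should return the same rows. The following sketch checks this on a synthetic frame (the column name value and the date range are assumptions made for the example).

```python
import pandas as pd

# Synthetic daily data that starts before and ends after January 2017.
idx = pd.date_range("2016-12-30", "2017-02-02", freq="D")
toy = pd.DataFrame({"value": range(len(idx))}, index=idx)

by_month = toy.loc["2017-01"]                      # year and month only
by_slice = toy.loc["2017-01-01":"2017-01-31"]      # explicit slice, both ends included
by_truncate = toy.truncate(before="2017-01-01", after="2017-01-31")

print(len(by_month))
print(by_month.equals(by_slice), by_month.equals(by_truncate))
```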

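2.3 Basic operations on time data

A DatetimeIndex exposes attributes such as year, month, and weekday that return the year, month, and day of the week of each timestamp.

First, we take the index of the data.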

opsdtime = opsd.index
opsdtime

DatetimeIndex(['2006-01-01', '2006-01-02', '2006-01-03', '2006-01-04',
               '2006-01-05', '2006-01-06', '2006-01-07', '2006-01-08',
               '2006-01-09', '2006-01-10',
               ...
               '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
               '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
               '2017-12-30', '2017-12-31'],
              dtype='datetime64[ns]', name='Date', length=4383, freq='D')

Use year to get the year of each entry.

opsdtime.year

Int64Index([2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006,
            ...
            2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017],
           dtype='int64', name='Date', length=4383)

Use month to get the month.

opsdtime.month

Int64Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
            ...
            12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
           dtype='int64', name='Date', length=4383)

month returns the month number. To get the month name, use month_name().

opsdtime.month_name()

Index(['January', 'January', 'January', 'January', 'January', 'January',
       'January', 'January', 'January', 'January',
       ...
       'December', 'December', 'December', 'December', 'December', 'December',
       'December', 'December', 'December', 'December'],
      dtype='object', name='Date', length=4383)

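Similarly, weekday and day_name() tell us the day of the week. The weekday_name attribute found in older pandas versions was removed in pandas 1.0, so we use day_name() here.

opsdtime.weekday

Int64Index([6, 0, 1, 2, 3, 4, 5, 6, 0, 1,
            ...
            4, 5, 6, 0, 1, 2, 3, 4, 5, 6],
           dtype='int64', name='Date', length=4383)

opsdtime.day_name()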

Index(['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
       'Saturday', 'Sunday', 'Monday', 'Tuesday',
       ...
       'Friday', 'Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday',
       'Thursday', 'Friday', 'Saturday', 'Sunday'],
      dtype='object', name='Date', length=4383)

In the output of weekday, 0 stands for Monday, 1 for Tuesday, and so on up to 6 for Sunday.

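To support the later analysis, we assign each row to a season: March, April, and May are spring; June, July, and August are summer; September, October, and November are autumn; December, January, and February are winter. In the mapping below, the seasons are coded as 1 for winter, 2 for spring, 3 for summer, and 4 for autumn.

First, build a dictionary that maps each month to its season.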

seasons = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1]
month_to_season = dict(zip(range(1,13), seasons))
month_to_season

{1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3, 9: 4, 10: 4, 11: 4, 12: 1}

Use map to convert the month numbers returned by month.

opsdtime.month.map(month_to_season)

Int64Index([1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
            ...
            1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
           dtype='int64', name='Date', length=4383)

Add the result to the data as a new variable, season.

opsd['season'] = opsdtime.month.map(month_to_season)

opsd['season'].head()

Date
2006-01-01    1
2006-01-02    1
2006-01-03    1
2006-01-04    1
2006-01-05    1
Freq: D, Name: season, dtype: int64
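
As a quick check of the month-to-season mapping, the sketch below applies it to the first day of each month in 2017. The season_names dictionary is a hypothetical helper added only to make the codes readable; it is not part of the original analysis.

```python
import pandas as pd

# First day of every month in 2017.
idx = pd.date_range("2017-01-01", periods=12, freq="MS")

# Same mapping as above: 1 = winter, 2 = spring, 3 = summer, 4 = autumn.
seasons = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1]
month_to_season = dict(zip(range(1, 13), seasons))

# Hypothetical helper: readable labels for the numeric season codes.
season_names = {1: "winter", 2: "spring", 3: "summer", 4: "autumn"}

season = pd.Series(idx.month.map(month_to_season), index=idx)
overview = pd.DataFrame({
    "month": list(idx.month_name()),
    "weekday": list(idx.day_name()),
    "season": season.map(season_names),
})
print(overview)
```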

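2.4 Periodicity analysis

2.4.1 Analyzing periodicity with resampling

We use plot to get an overall view of the data, starting with total electricity consumption: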

import matplotlib.pyplot as plt
%matplotlib inline

opsd['Consumption'].plot(figsize=(12,6))

<matplotlib.axes._subplots.AxesSubplot at 0x7fbc523cae48>

Annual electricity consumption appears to follow a regular pattern. Let us look at 2007 in detail.

opsd.loc['2007','Consumption'].plot(figsize=(12,6))

<matplotlib.axes._subplots.AxesSubplot at 0x7fbc5007e668>

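Consumption is highest at the beginning and end of the year. We can use groupby to group by the season variable and compute the mean consumption for each season.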

opsd.groupby('season')['Consumption'].mean()

season
1    1419.220273
2    1313.851258
3    1259.919259
4    1363.514635
Name: Consumption, dtype: float64

opsd.groupby('season')['Consumption'].mean().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7fbc4ff8acc0>

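With seasons coded as 1 for winter, 2 for spring, 3 for summer, and 4 for autumn, winter has the highest mean consumption, followed by autumn and spring, and summer has the lowest. A likely reason is that electric heating and lighting raise demand in the colder, darker months.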

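We can also use groupby to group the data by day of the week and compute the mean consumption of each group. Here we pass a lambda function that returns the weekday of each timestamp in the index. The means also vary within the week, probably because consumption differs between working days and weekends.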

# Timestamp.weekday() is a method: 0 = Monday, ..., 6 = Sunday
opsd.groupby(lambda ts: ts.weekday())['Consumption'].mean()

0    1389.786334
1    1428.277624
2    1433.606541
3    1421.158254
4    1394.624076
5    1200.549839
6    1103.104493
Name: Consumption, dtype: float64

opsd.groupby(lambda ts: ts.weekday())['Consumption'].mean().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7fbc4ff1a5f8>

Consumption drops markedly at the weekend (Saturday and Sunday).
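
Grouping by weekday can also be written by passing an attribute of the index straight to groupby. The sketch below uses made-up load values (100 on working days, 60 at weekends) so that the result is easy to verify by eye; it is an illustration, not the dataset's actual figures.

```python
import pandas as pd

# Two full weeks starting on Monday 2017-01-02.
idx = pd.date_range("2017-01-02", periods=14, freq="D")

# Made-up load: 100 on working days, 60 on Saturday and Sunday.
load = pd.Series([60.0 if d >= 5 else 100.0 for d in idx.weekday], index=idx)

# Group by the weekday number of each index entry (0 = Monday, ..., 6 = Sunday).
by_weekday = load.groupby(load.index.weekday).mean()

# Group by a boolean flag to compare working days with weekends.
is_weekend = load.index.weekday >= 5
weekend_vs_workday = load.groupby(is_weekend).mean()

print(by_weekday)
print(weekend_vs_workday)
```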

Next, we look at solar power generation.

opsd['Solar'].plot(figsize=(12,6))

<matplotlib.axes._subplots.AxesSubplot at 0x7fbc4ff579b0>

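Solar generation also shows clear seasonality. Together with the plot below, we can see that solar output is higher in spring and summer, when sunshine is most abundant, and lower in autumn and winter.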

opsd.groupby('season')['Solar'].mean().plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7fbc54cb9f98>

Wind power generation:

opsd['Wind'].plot(figsize=(12,6))

<matplotlib.axes._subplots.AxesSubplot at 0x7fbc4fe50400>

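Wind generation shows an overall upward trend from year to year. With so many data points, however, its periodicity is hard to see, so we downsample the data with resample. Specifically, we resample by month and compute the monthly mean.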

# 'M' is the month-end alias; pandas 2.2 and later prefer 'ME' and warn on 'M'
wind = opsd['Wind'].resample('M').mean()
wind.plot(figsize=(12,6))

<matplotlib.axes._subplots.AxesSubplot at 0x7fbc4e11ae48>

From 2010 to 2017, wind generation is clearly higher in winter and lower in summer, possibly because winds are stronger and storms more frequent in winter.
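
Downsampling with resample can be illustrated on a short synthetic series. The sketch uses the 'MS' alias (bins labeled by month start), chosen here only because it is spelled the same way across pandas versions; the daily values are simply 0, 1, 2, and so on.

```python
import numpy as np
import pandas as pd

# Ninety days of synthetic daily data: January to March 2017.
idx = pd.date_range("2017-01-01", "2017-03-31", freq="D")
daily = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# One value per month; each bin is labeled with the first day of the month.
monthly_mean = daily.resample("MS").mean()
monthly_count = daily.resample("MS").count()

print(monthly_mean)
print(monthly_count)
```

The result has one row per month instead of one per day, which is what makes slow seasonal patterns easier to see.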

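2.4.2 Analyzing periodicity with differencing

When analyzing periodicity, it is important to remove the trend from the data first. A common way to do this is differencing: computing the difference between consecutive data points (here, the first-order difference). For example, the difference at time t is $\Delta d_t = d_t - d_{t-1}$. The diff method performs this operation.

For example, we compute and plot the differenced solar generation series: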

opsd['Solar'].diff().plot(figsize=(12,6))

<matplotlib.axes._subplots.AxesSubplot at 0x7fbc4e0e3fd0>

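The differenced values fluctuate around 0, with larger swings in summer. Solar output is higher in summer, so its day-to-day changes are also larger in absolute terms. This shows that solar generation is unstable and depends on the weather: a sudden jump on one day is usually followed by a drop the next day, which appears as a negative difference in the plot.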

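We can also compute the differences ourselves by shifting the series with the shift method. shift moves the data forward or backward along the time axis while leaving the index unchanged.

The last five rows of the original data: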

opsd['Solar'].tail()

Date
2017-12-27    16.530
2017-12-28    14.162
2017-12-29    29.854
2017-12-30     7.467
2017-12-31    19.980
Freq: D, Name: Solar, dtype: float64

To compute the differences, we shift the data back by one day, so that each date carries the previous day's value.

opsd['Solar'].shift(1).tail()

Date
2017-12-27    30.923
2017-12-28    16.530
2017-12-29    14.162
2017-12-30    29.854
2017-12-31     7.467
Freq: D, Name: Solar, dtype: float64

Subtracting the shifted series from the original gives the first-order difference of the data.

dif = opsd['Solar']-opsd['Solar'].shift(1)
dif.plot(figsize=(12,6))

<matplotlib.axes._subplots.AxesSubplot at 0x7fbc4fe954a8>
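
The equivalence between diff and the manual subtraction can be verified on a small made-up series with a linear trend of +2 per day plus an alternating wiggle. After differencing, the steadily rising level is gone and the values oscillate around the slope of the trend.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2017-01-01", periods=8, freq="D")

# Made-up data: linear trend (+2 per day) plus an alternating +1/-1 wiggle.
wiggle = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
series = pd.Series(2.0 * np.arange(8) + wiggle, index=idx)

first_diff = series.diff()            # d_t - d_{t-1}
manual = series - series.shift(1)     # the same thing, written out by hand

print(series.tolist())
print(first_diff.tolist())
print(first_diff.equals(manual))
```

The first value of the differenced series is NaN because the first observation has no predecessor.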

We can also set the freq parameter of shift, which moves the index and leaves the data unchanged. Here we shift the index by one day.

opsd['Solar'].shift(1, freq='D').tail()

Date
2017-12-28    16.530
2017-12-29    14.162
2017-12-30    29.854
2017-12-31     7.467
2018-01-01    19.980
Freq: D, Name: Solar, dtype: float64
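
The difference between the two forms of shift is easiest to see side by side. This sketch reuses the last three Solar values shown above in a small standalone series; the variable names are chosen for the example.

```python
import pandas as pd

# The last three Solar values from the output above.
idx = pd.date_range("2017-12-29", periods=3, freq="D")
toy = pd.Series([29.854, 7.467, 19.980], index=idx)

# Without freq: the index stays, the values slide down, the first becomes NaN.
moved_values = toy.shift(1)

# With freq: the values stay, every index label moves one day later.
moved_index = toy.shift(1, freq="D")

print(moved_values)
print(moved_index)
```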

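2.5 Rolling windows

Rolling window operations are another important transformation of time series data. As with downsampling, a rolling window splits the data into time windows and aggregates the data in each window with a function such as mean or median. Unlike downsampling, however, rolling windows overlap and "roll" along at the same frequency as the data, so the transformed series has the same frequency as the original.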

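For example, with a 7-day window centered on each data point (center=True), the window of each point covers the three days before it and the three days after it. The window for 2017-07-06 then runs from 2017-07-03 to 2017-07-09. By default, rolling uses a trailing window instead: the value at each date is computed from that date and the six days before it. The calls below use this default.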
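
A centered 7-day window such as the one described above corresponds to the center=True option of rolling. The sketch below compares it with the default trailing window on synthetic values 0 to 13, so that each rolling mean is easy to work out by hand.

```python
import numpy as np
import pandas as pd

# Fourteen days of synthetic data: 0, 1, ..., 13.
idx = pd.date_range("2017-07-01", periods=14, freq="D")
toy = pd.Series(np.arange(14, dtype=float), index=idx)

# Default: the window covers the current day and the six days before it.
trailing = toy.rolling(7).mean()

# center=True: the window covers three days before and three days after.
centered = toy.rolling(7, center=True).mean()

# 2017-07-06 is position 5; its centered window is 07-03 .. 07-09 (values 2..8).
print(pd.DataFrame({"value": toy, "trailing": trailing, "centered": centered}))
```

With the trailing window the first six results are NaN; with the centered window the first three and the last three are NaN, because those windows are incomplete.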

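Replacing each data point with the aggregate of its window smooths out short-term fluctuations. The larger the window, the more each value reflects the overall level. We compute rolling means of wind generation with windows of 7 (a week), 30 (a month), and 365 (a year) days.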

opsd['Wind'].rolling(7).mean().plot(figsize=(12,6))

<matplotlib.axes._subplots.AxesSubplot at 0x7fbc4fcae400>

opsd['Wind'].rolling(30).mean().plot(figsize=(12,6))

<matplotlib.axes._subplots.AxesSubplot at 0x7fbc4defff98>

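By default, a window returns NaN as soon as it contains a missing value, because min_periods defaults to the window size. Setting min_periods to 360 requires only at least 360 valid observations in each window, which tolerates a small amount of missing data.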

opsd['Wind'].rolling(window=365,min_periods=360).mean().plot(figsize=(12,6))

<matplotlib.axes._subplots.AxesSubplot at 0x7fbc4ddfe080>
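
The effect of min_periods can be sketched on a tiny series with a single missing day. The window of 5 and the threshold of 4 are scaled-down stand-ins for the 365 and 360 used above.

```python
import numpy as np
import pandas as pd

# Ten days of ones with one missing value in the middle.
idx = pd.date_range("2017-01-01", periods=10, freq="D")
toy = pd.Series(np.ones(10), index=idx)
toy.iloc[4] = np.nan

# Default: every window needs 5 valid observations, so one NaN blanks 5 results.
strict = toy.rolling(window=5).mean()

# Relaxed: 4 valid observations per window are enough.
tolerant = toy.rolling(window=5, min_periods=4).mean()

print(pd.DataFrame({"value": toy, "strict": strict, "tolerant": tolerant}))
```

In this toy case the strict version yields a single valid mean (the last day), while the relaxed version yields a value for every day from the fourth onward.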

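As the window grows, most of the random fluctuations and even the periodic fluctuations are smoothed out, which makes the trend much easier to see: wind generation increases year by year.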

	Date	Consumption	Wind	Solar	Wind+Solar
0	2006-01-01	1069.184	NaN	NaN	NaN
1	2006-01-02	1380.521	NaN	NaN	NaN
2	2006-01-03	1442.533	NaN	NaN	NaN
3	2006-01-04	1457.217	NaN	NaN	NaN
4	2006-01-05	1477.131	NaN	NaN	NaN

	Date	Consumption	Wind	Solar	Wind+Solar
4378	2017-12-27	1263.94091	394.507	16.530	411.037
4379	2017-12-28	1299.86398	506.424	14.162	520.586
4380	2017-12-29	1295.08753	584.277	29.854	614.131
4381	2017-12-30	1215.44897	721.247	7.467	728.714
4382	2017-12-31	1107.11488	721.176	19.980	741.156

	Consumption	Wind	Solar	Wind+Solar
Date
2006-01-01	1069.184	NaN	NaN	NaN
2006-01-02	1380.521	NaN	NaN	NaN
2006-01-03	1442.533	NaN	NaN	NaN
2006-01-04	1457.217	NaN	NaN	NaN
2006-01-05	1477.131	NaN	NaN	NaN

	Consumption	Wind	Solar	Wind+Solar
Date
2014-01-20	1590.687	78.647	6.371	85.018
2014-01-21	1624.806	15.643	5.835	21.478
2014-01-22	1625.155	60.259	11.992	72.251

	Consumption	Wind	Solar	Wind+Solar
Date
2017-01-01	1130.413	307.125	35.291	342.416
2017-01-02	1441.052	295.099	12.479	307.578
2017-01-03	1529.990	666.173	9.351	675.524
2017-01-04	1553.083	686.578	12.814	699.392
2017-01-05	1547.238	261.758	20.797	282.555
2017-01-06	1501.795	115.723	33.341	149.064
2017-01-07	1405.145	252.307	8.387	260.694
2017-01-08	1301.011	41.261	4.991	46.252
2017-01-09	1604.348	190.983	7.070	198.053
2017-01-10	1639.046	280.373	13.045	293.418
2017-01-11	1654.809	637.259	7.379	644.638
2017-01-12	1620.597	584.792	17.865	602.657
2017-01-13	1608.895	518.618	14.311	532.929
2017-01-14	1392.736	487.189	16.767	503.956
2017-01-15	1289.904	229.770	16.105	245.875
2017-01-16	1605.465	69.209	17.600	86.809
2017-01-17	1649.104	79.363	22.909	102.272
2017-01-18	1669.395	148.915	22.709	171.624
2017-01-19	1667.477	121.272	38.191	159.463
2017-01-20	1641.737	109.383	39.633	149.016
2017-01-21	1423.020	78.893	45.477	124.370
2017-01-22	1340.341	50.774	47.386	98.160
2017-01-23	1663.492	39.710	30.939	70.649
2017-01-24	1682.002	31.375	10.300	41.675
2017-01-25	1674.171	70.772	17.720	88.492
2017-01-26	1659.527	235.128	56.487	291.615
2017-01-27	1629.164	254.270	68.625	322.895
2017-01-28	1394.033	208.827	65.964	274.791
2017-01-29	1296.170	304.952	53.854	358.806
2017-01-30	1605.356	338.292	18.577	356.869
2017-01-31	1620.860	124.784	12.064	136.848