This case study works with a dataset on the hobbies and lifestyle habits of young people, collected through a questionnaire. An overview of the survey dataset:

  • The survey comprises 1,010 responses to 150 questions
  • The file columns.csv contains detailed descriptions of the 150 questions
  • The dataset contains missing values, i.e. items left blank by respondents
  • The dataset contains both numerical and string-valued variables
  • Numerical values encode the degree of agreement, increasing from 1 to 5

The 150 survey questions fall into several categories: music preferences, movie preferences, hobbies and interests, phobias, health habits, personality traits, views on life, spending habits, and basic personal information.

In this case study, we implement FP-Growth, a classic association-rule mining algorithm, from scratch, and apply it to the survey dataset to uncover hidden association rules.

Algorithm Implementation

FP-Growth in Brief

The FP-Growth algorithm consists of three main parts (a roadmap sketch follows the list):

  • Construct the frequent-pattern tree (FPTree)

  • Extract conditional pattern bases (Conditional Pattern Base) from the FP-tree

  • For each frequent item, build a conditional FP-tree (Conditional FPTree) from its conditional pattern base, and recursively mine the conditional FP-tree until the tree is empty
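
As a roadmap, here is a minimal sketch of how the three parts fit together, using the FPTree class and the mine_tree() function implemented later in this section (all names as defined below):

    # 1. build the FP-tree from the transaction data
    tree = FPTree(data, min_support, 'null', 1, 'fptree')
    tree.build_tree()

    # 2./3. recursively extract conditional pattern bases, build
    # conditional FP-trees and collect the frequent itemsets
    results = []
    mine_tree(tree.get_frequent_items(), tree.get_headertable(),
              min_support, set(), results)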

FP-Growth Data Structures

We first define the two data structures used in the implementation: Node and FPTree.

  • Node

The Node class represents a node in the tree. It stores the node's children (children), its parent (parent), the item the node represents (name), and the number of occurrences of that item counted at this node (count).

  • FPTree

FPTree represents the FP-tree, which stores the dataset in tree form: the root is null, and every other node represents a frequent item of the dataset together with its support information. Because different transactions overlap, their frequent items can share the same tree paths, which compresses the data. For example, the transactions {f, c, a, m} and {f, c, a, b} can share the single prefix path f → c → a.

In [1]:
'''
The Node class contains the following attributes:

name: the frequent item this node represents
count: the number of occurrences counted at this node
link: the next node representing the same frequent item on another path of the FP-tree
parent: the parent of this node
children: the children of this node

and the following methods:

__init__(): constructor
increment(): increase the support of the frequent item this node represents
display(): print the node and the supports of its subtree

'''


class Node:
    
    '''
    Parameters: name: the frequent item this node represents
                count: the support of the frequent item
                parent: the parent node
    '''
    
    def __init__(self, name, count, parent):
        self.name = name
        self.count = count
        self.link = None
        self.parent = parent
        self.children = {}
    '''
    Parameter: num: integer amount by which to increase the item's support
    '''
    def increment(self, num):
        self.count += num
    '''
    Parameter: lens: indentation level (depth) at which to print the node
    '''
    def display(self, lens = 1):
        print '   '*lens, self.name, ' ', self.count
        for child in self.children.values():
            child.display(lens + 1)
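
As a quick sanity check (a toy snippet, not part of the survey analysis), we can wire two nodes together by hand and print them:

    # toy check of the Node class: a root with a single child of count 2
    root = Node('null', 1, None)
    root.children['a'] = Node('a', 2, root)
    root.display()   # prints 'null 1' and, indented below it, 'a 2'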

Next, we implement the FPTree class, which is the core data structure of the FP-Growth algorithm.

In [2]:
'''
The FPTree class contains the following attributes:

flag: whether to build an FP-tree or a conditional FP-tree, depending on the input data
root: the root node of the FP-tree
data: the dataset the FP-tree represents
data_type: the original storage type of the dataset (dict or nested list)
min_support: the minimum support (absolute count)
frequent: the set of frequent 1-itemsets
headerTable: the header table of the FP-tree

and the following methods:

__init__(): constructor
find_frequent_items(): generate the frequent 1-itemsets from the dataset
build_tree(): insert every record of the dataset into the FP-tree, i.e. the tree-building step
get_tree(): return the root node of the tree
get_frequent_items(): return the frequent 1-itemsets
get_headertable(): return the header table
get_data(): return the dataset
show(): print the structure of the FP-tree

'''

class FPTree:
       
    '''
    Parameters: transactions: the dataset
                min_support: the minimum support
                root_value: the item the root node represents
                count: the support of the root node's item
                flag: 'fptree' to build an FP-tree, 'cfptree' for a conditional FP-tree
    '''
    def __init__(self, transactions, min_support, root_value, count, flag):
        
        self.flag = flag
        self.root = Node(root_value, count, None)
        self.data = transactions
        self.data_type = type(self.data)
        self.min_support = min_support
        
        self.frequent = self.find_frequent_items()
        
        # header table: frequent item -> [support, head of the node-link list]
        self.headerTable = {v:[self.frequent[v], None] for v in self.frequent}
      
     
    '''
    Returns: the frequent 1-itemsets, stored as a dict
    '''
    
    def find_frequent_items(self):
    
        from collections import defaultdict
        freq1 = defaultdict(int)

        # count the occurrences of each element across all transactions
        if self.data_type == list and self.flag == 'fptree':
            
            flatten_list = [element for item in self.data for element in item]
            for value in flatten_list:
                freq1[frozenset([value])] += 1
        # conditional FP-tree: the data is a dict of {pattern: support}
        elif self.data_type == dict and self.flag == 'cfptree':
            for item in self.data:
                for element in item:
                    
                    if type(element) == frozenset:
                        freq1[element] += self.data[item]
                    else:
                        freq1[frozenset([element])] += self.data[item]
    
        # filter out the infrequent items
        return {v:freq1[v] for v in freq1 if freq1[v] >= self.min_support}
    
    def build_tree(self):
    
        # order the frequent items by descending support (ties broken by item name)
        sorted_headertable = [v[0] for v in sorted(self.frequent.items(), key=lambda kv: (-kv[1], list(kv[0])[0]))]
        # insert every record of the dataset into the tree, updating the header table
        for record in self.data:
            
            # keep only the frequent items of the record, in header-table order.
            # Records of deeper conditional FP-trees contain frozensets rather
            # than plain items, so both membership forms are tested
            sorted_items = [item for item in sorted_headertable if list(item)[0] in record or item in record]
            #print u'inserting the sorted frequent items into the FP-tree: '
            #print sorted_items
            #print ''
            node = self.root        
            while len(sorted_items) > 0:
            
                first_value = sorted_items[0]
                
                #print u' inserting node ', first_value 
                if  first_value in node.children:
                    #print u'    the current node already has a child ', first_value
                    if self.data_type == list:             
                        node.children[first_value].increment(1)
                    elif self.data_type == dict:
                        node.children[first_value].increment(self.data[record])
    
                else:
                    # create a new child node
                    #print u'    creating new node ', first_value
                    if self.data_type == list:             
                        node.children[first_value] = Node(list(first_value)[0], 1, node)
                    elif self.data_type == dict:
                        node.children[first_value] = Node(first_value, self.data[record], node)
            
                    # update the header table
                    if self.headerTable[first_value][1] == None:
                        #print u'    adding the first node of the linked list for ', first_value
                        self.headerTable[first_value][1] = node.children[first_value]
                        
                    else:
                        #print u'    appending to the end of the linked list'
                        currentNode = self.headerTable[first_value][1]
               
                        # walk the linked list so the new node pointer is appended at its tail
                        while currentNode.link != None:
                            
                            currentNode = currentNode.link
                        currentNode.link = node.children[first_value]
                #print '***********'
                sorted_items.pop(0)
                node = node.children[first_value]
            #print ''

    def get_tree(self):
        return self.root
    def get_frequent_items(self):
        return self.frequent
    def get_headertable(self):
        return self.headerTable
    def get_data(self):
        return self.data
    
    def show(self, depth = 1):
        print ' '*depth, self.root.name, ' ', self.root.count
        for node in self.root.children.values():
            node.display(depth + 1)
        

Building the FP-Tree

We use a small toy dataset to test the tree-building step of the algorithm. Its contents are shown below:

In [3]:
# dataset stored as a nested list
data = [
    ['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
    ['a','b','c','f','l','m','o'],
    ['b','f','h','j','o'],
    ['b','c','k','s','p'],
    ['a','f','c','e','l','p','m','n']]
data
Out[3]:
[['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
 ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
 ['b', 'f', 'h', 'j', 'o'],
 ['b', 'c', 'k', 's', 'p'],
 ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n']]

[Figure: the FP-tree corresponding to this toy dataset]

In this test, we set the minimum support to 3 and use data to create an FPTree instance fptree.

In [4]:
#minimum support
min_support = 3

#create an FPTree instance
fptree = FPTree(data, min_support, 'null', 1, 'fptree')

Calling the get_frequent_items() method shows the frequent 1-itemsets of the dataset. As the result shows, the items that do not meet the minimum support, d, e, g, h, i, j, k, l, n, o, s, have been filtered out.

In [5]:
# frequent 1-itemsets
freq1 = fptree.get_frequent_items()
freq1
Out[5]:
{frozenset({'f'}): 4,
 frozenset({'b'}): 3,
 frozenset({'a'}): 3,
 frozenset({'p'}): 3,
 frozenset({'m'}): 3,
 frozenset({'c'}): 4}
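
As an optional cross-check (a small snippet outside the original flow), the raw item frequencies can be counted directly, confirming that only f and c (support 4) and a, b, m, p (support 3) survive the threshold:

    # cross-check the raw item frequencies with collections.Counter
    from collections import Counter
    print Counter(element for record in data for element in record)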

Next we build the tree. By uncommenting the print statements inside build_tree(), you can follow the construction of the FP-tree in detail.

In [6]:
fptree.build_tree()

After the FP-tree is built, we print its structure, which can be checked against the figure above.

In [7]:
fptree.show(1)
  null   1
       f   1
          b   1
       c   4
          f   3
             a   3
                m   2
                   p   2
                b   1
                   m   1
          b   1
             p   1

Mining the FP-Tree

FP-Growth partitions the built FP-tree into a set of conditional pattern bases, and then recursively builds a conditional FP-tree from each of them. For example, to build the conditional pattern base with suffix p, we first follow p's linked list stored in the header table and trace every path ending in p back to the root (c-b-p and c-f-a-m-p); the parts of these paths excluding p then serve as a new weighted dataset, from which the next FP-tree is built.

Tracing back the paths that end at a given node of the built FP-tree is implemented by the function find_prefix().

In [8]:
'''
Parameter: a Node object
Returns: the conditional pattern base whose suffix is the input node's item
'''

def find_prefix(node):
  
    # conditional pattern base
    cpb = {}
    suffix_list = []

    # follow the linked list until its tail is reached
    while node != None:
        suffix_list.append(node)
        node = node.link
   
    for item in suffix_list:

        path = []
        num = item.count
        # walk up the tree until the root is reached
        while item.parent != None:
            path.append(item.name)
            item = item.parent
      
        # the path without the suffix node itself becomes a conditional pattern,
        # weighted by the support of the suffix node
        cpb[frozenset(path[1:])] = num
        
    return cpb
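
As a quick check on the toy tree built above (a hedged sketch; the printed dict ordering may differ), the conditional pattern base of p should contain c-f-a-m with support 2 and c-b with support 1:

    # conditional pattern base of 'p', read from the header table of fptree
    cpb_p = find_prefix(fptree.get_headertable()[frozenset(['p'])][1])
    print cpb_p
    # expected contents: {frozenset(['f', 'c', 'a', 'm']): 2, frozenset(['c', 'b']): 1}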

Next, we implement the mine_tree() function, which performs the actual mining of the FP-tree.

In [9]:
'''
Parameters: frequent_items: the frequent 1-itemsets of the dataset
            headerTable: the header table of the dataset
            min_support: the minimum support
            frequent: the frequent itemset accumulated so far in the recursion
            item_list: collects the frequent itemsets produced by all recursive calls
            
'''
def mine_tree(frequent_items, headerTable, min_support, frequent, item_list):
    
    # sort the items of the header table by descending support
    candidates = [v[0] for v in sorted(frequent_items.items(), key=lambda kv: (-kv[1], list(kv[0])[0]))]
    #print candidates
    for item in candidates[::-1]:
        
        # start from the item with the lowest support
        #print 'from the node: ', item
        # extend the current frequent itemset with this item
        freq_set = frequent.copy() 
        freq_set.add(item)
        item_list.append(freq_set)
        
        # generate the conditional pattern base
        cpbs = find_prefix(headerTable[item][1])
       
        #print 'its cpbs: ', cpbs
        # build the conditional FP-tree
        cTree = FPTree(cpbs, min_support, 'root', 1, 'cfptree')
        
        cTree.build_tree()
       
        #print '-----headerTable: ', cTree.get_headertable()
        # recurse only while the header table of the conditional tree is non-empty
        if len(cTree.get_headertable()) != 0:
            #print 'conditional tree for: ', freq_set
            #cTree.show(1)    
            mine_tree(cTree.get_frequent_items(), cTree.get_headertable(), min_support, freq_set, item_list)

We now feed the FP-tree instance fptree into mine_tree(), which collects all frequent itemsets into frequent_item.

In [10]:
#collects all frequent itemsets
frequent_item = []
#the frequent itemset accumulated in the current recursion branch
frequent = set()

#root node of the FP-tree
root_node = fptree.get_tree()
#header table of the dataset data
headertable = fptree.get_headertable()
#frequent 1-itemsets of the dataset data
freq_items = fptree.get_frequent_items()

#mine the FP-tree fptree
mine_tree(freq_items, headertable, min_support, frequent, frequent_item)

Printing the contents of frequent_item shows all the frequent itemsets.

In [11]:
frequent_item
Out[11]:
[{frozenset({'p'})},
 {frozenset({'c'}), frozenset({'p'})},
 {frozenset({'m'})},
 {frozenset({'f'}), frozenset({'m'})},
 {frozenset({'f'}), frozenset({'m'}), frozenset({'c'})},
 {frozenset({'f'}), frozenset({'m'}), frozenset({'c'}), frozenset({'a'})},
 {frozenset({'f'}), frozenset({'m'}), frozenset({'a'})},
 {frozenset({'m'}), frozenset({'c'})},
 {frozenset({'m'}), frozenset({'c'}), frozenset({'a'})},
 {frozenset({'m'}), frozenset({'a'})},
 {frozenset({'b'})},
 {frozenset({'a'})},
 {frozenset({'f'}), frozenset({'a'})},
 {frozenset({'f'}), frozenset({'c'}), frozenset({'a'})},
 {frozenset({'c'}), frozenset({'a'})},
 {frozenset({'f'})},
 {frozenset({'f'}), frozenset({'c'})},
 {frozenset({'c'})}]

Rule Mining on the Survey Dataset

Data Preprocessing

In [12]:
# import the third-party libraries used in the preprocessing
from sklearn.preprocessing import Binarizer
import pandas as pd
import numpy as np

We read in the data using the read_csv() function of the Pandas library.

In [13]:
real_data = pd.read_csv('./input/responses.csv')
real_data.head()
Out[13]:
Music Slow songs or fast songs Dance Folk Country Classical music Musical Pop Rock Metal or Hardrock ... Age Height Weight Number of siblings Gender Left - right handed Education Only child Village - town House - block of flats
0 5.0 3.0 2.0 1.0 2.0 2.0 1.0 5.0 5.0 1.0 ... 20.0 163.0 48.0 1.0 female right handed college/bachelor degree no village block of flats
1 4.0 4.0 2.0 1.0 1.0 1.0 2.0 3.0 5.0 4.0 ... 19.0 163.0 58.0 2.0 female right handed college/bachelor degree no city block of flats
2 5.0 5.0 2.0 2.0 3.0 4.0 5.0 3.0 5.0 3.0 ... 20.0 176.0 67.0 2.0 female right handed secondary school no city block of flats
3 5.0 3.0 2.0 1.0 1.0 1.0 1.0 2.0 2.0 1.0 ... 22.0 172.0 59.0 1.0 female right handed college/bachelor degree yes city house/bungalow
4 5.0 3.0 4.0 3.0 2.0 4.0 3.0 5.0 3.0 1.0 ... 20.0 170.0 59.0 1.0 female right handed secondary school no village house/bungalow

5 rows × 150 columns

In [14]:
# size of the dataset
real_data.shape
Out[14]:
(1010, 150)

First, we need to know whether the dataset contains missing values. Here we use the isnull() function to obtain a boolean matrix, then sum it along the rows and along the columns to see how the missing entries are distributed. In Python, summing a boolean array counts each True as 1 and each False as 0.

In [15]:
# example
boo_list = [True, True, False, True]
sum(boo_list)
Out[15]:
3

Using this property, we can obtain the distribution of missing values in the dataset.

In [16]:
# number of missing values per record (row)
row_missing = real_data.isnull().sum(axis = 1)

# boolean indicators for complete / incomplete rows
non_missing_indicator = row_missing == 0
missing_indicator = np.logical_not(non_missing_indicator)

non_missing_index = list(row_missing[non_missing_indicator].index)
In [17]:
print 'number of records without missing values: ', len(non_missing_index)
number of records without missing values:  674

We can also look at the indices of the rows that contain missing values.

In [18]:
missing_index = list(row_missing[missing_indicator].index)
print missing_index
[3, 8, 15, 17, 22, 27, 37, 45, 46, 47, 51, 56, 58, 63, 69, 72, 76, 78, 83, 84, 87, 91, 93, 94, 102, 103, 107, 113, 115, 124, 130, 137, 138, 140, 142, 143, 144, 145, 146, 149, 153, 159, 162, 164, 167, 170, 174, 176, 177, 180, 181, 183, 186, 188, 191, 197, 199, 201, 202, 209, 210, 215, 219, 226, 231, 233, 238, 242, 243, 247, 249, 255, 259, 260, 262, 264, 270, 276, 279, 283, 288, 289, 294, 300, 302, 306, 308, 310, 313, 317, 318, 320, 321, 328, 330, 338, 343, 347, 349, 351, 355, 357, 360, 362, 365, 366, 368, 369, 373, 375, 376, 378, 382, 384, 396, 397, 398, 402, 405, 416, 422, 423, 424, 429, 437, 441, 449, 453, 458, 462, 466, 468, 469, 473, 474, 475, 476, 477, 478, 479, 481, 483, 484, 485, 487, 495, 497, 499, 502, 508, 509, 512, 516, 517, 523, 524, 525, 526, 527, 534, 540, 542, 543, 548, 549, 551, 552, 556, 558, 560, 562, 563, 565, 567, 570, 571, 572, 580, 581, 586, 592, 596, 602, 603, 606, 607, 608, 609, 615, 617, 621, 623, 624, 627, 629, 630, 631, 635, 636, 637, 638, 643, 644, 646, 647, 649, 651, 656, 657, 659, 660, 663, 664, 668, 669, 676, 677, 683, 685, 686, 687, 693, 696, 698, 703, 706, 708, 712, 718, 722, 723, 726, 729, 730, 733, 735, 736, 738, 744, 745, 746, 748, 755, 756, 759, 763, 766, 767, 776, 777, 778, 783, 785, 786, 788, 789, 790, 801, 803, 806, 813, 814, 815, 825, 828, 830, 831, 832, 837, 841, 842, 843, 845, 847, 849, 851, 853, 858, 861, 862, 868, 871, 873, 875, 882, 885, 887, 889, 891, 893, 895, 898, 899, 902, 908, 911, 912, 918, 920, 921, 925, 929, 934, 935, 939, 940, 941, 942, 943, 945, 946, 948, 949, 950, 951, 954, 956, 958, 960, 963, 964, 965, 968, 975, 980, 983, 984, 987, 988, 993, 994, 996, 997, 999, 1002, 1006]
In [19]:
print 'missing ratio: ', format(float(len(missing_index))/real_data.shape[0], '0.1%')
missing ratio:  33.3%
In [20]:
# from the column perspective
col_missing = real_data.isnull().sum()

non_missing_attribute = list(col_missing[col_missing == 0].index)
missing_attribute = list(col_missing[col_missing != 0].index)
print non_missing_attribute
['Snakes', 'Eating to survive', 'Dreams', 'Number of friends', 'Internet usage', 'Spending on gadgets']
In [21]:
print 'attribute missing ratio: ', format(float(real_data.shape[1] - len(non_missing_attribute))/real_data.shape[1], '0.1%')
attribute missing ratio:  96.0%

Having examined the missing values, we next look at the types of the variables in the dataset, which determine how the missing values and the variable values are handled.

In [22]:
# inspect the type of each variable in the data
real_data.dtypes.value_counts()
Out[22]:
float64    134
object      11
int64        5
dtype: int64
In [23]:
attribute_type = dict(real_data.dtypes)
attribute_type
Out[23]:
{'Achievements': dtype('float64'),
 'Action': dtype('float64'),
 'Active sport': dtype('float64'),
 'Adrenaline sports': dtype('float64'),
 'Age': dtype('float64'),
 'Ageing': dtype('float64'),
 'Alcohol': dtype('O'),
 'Alternative': dtype('float64'),
 'Animated': dtype('float64'),
 'Appearence and gestures': dtype('float64'),
 'Art exhibitions': dtype('float64'),
 'Assertiveness': dtype('float64'),
 'Biology': dtype('float64'),
 'Borrowed stuff': dtype('float64'),
 'Branded clothing': dtype('float64'),
 'Cars': dtype('float64'),
 'Celebrities': dtype('float64'),
 'Changing the past': dtype('float64'),
 'Charity': dtype('float64'),
 'Cheating in school': dtype('float64'),
 'Chemistry': dtype('float64'),
 'Children': dtype('float64'),
 'Classical music': dtype('float64'),
 'Comedy': dtype('float64'),
 'Compassion to animals': dtype('float64'),
 'Country': dtype('float64'),
 'Countryside, outdoors': dtype('float64'),
 'Criminal damage': dtype('float64'),
 'Daily events': dtype('float64'),
 'Dance': dtype('float64'),
 'Dancing': dtype('float64'),
 'Dangerous dogs': dtype('float64'),
 'Darkness': dtype('float64'),
 'Decision making': dtype('float64'),
 'Documentary': dtype('float64'),
 'Dreams': dtype('int64'),
 'Eating to survive': dtype('int64'),
 'Economy Management': dtype('float64'),
 'Education': dtype('O'),
 'Elections': dtype('float64'),
 'Empathy': dtype('float64'),
 'Energy levels': dtype('float64'),
 'Entertainment spending': dtype('float64'),
 'Fake': dtype('float64'),
 'Fantasy/Fairy tales': dtype('float64'),
 'Fear of public speaking': dtype('float64'),
 'Final judgement': dtype('float64'),
 'Finances': dtype('float64'),
 'Finding lost valuables': dtype('float64'),
 'Flying': dtype('float64'),
 'Folk': dtype('float64'),
 'Foreign languages': dtype('float64'),
 'Friends versus money': dtype('float64'),
 'Fun with friends': dtype('float64'),
 'Funniness': dtype('float64'),
 'Gardening': dtype('float64'),
 'Gender': dtype('O'),
 'Geography': dtype('float64'),
 'Getting angry': dtype('float64'),
 'Getting up': dtype('float64'),
 'Giving': dtype('float64'),
 'God': dtype('float64'),
 'Happiness in life': dtype('float64'),
 'Health': dtype('float64'),
 'Healthy eating': dtype('float64'),
 'Height': dtype('float64'),
 'Heights': dtype('float64'),
 'Hiphop, Rap': dtype('float64'),
 'History': dtype('float64'),
 'Horror': dtype('float64'),
 'House - block of flats': dtype('O'),
 'Hypochondria': dtype('float64'),
 'Interests or hobbies': dtype('float64'),
 'Internet': dtype('float64'),
 'Internet usage': dtype('O'),
 'Judgment calls': dtype('float64'),
 'Keeping promises': dtype('float64'),
 'Knowing the right people': dtype('float64'),
 'Latino': dtype('float64'),
 'Law': dtype('float64'),
 'Left - right handed': dtype('O'),
 'Life struggles': dtype('float64'),
 'Loneliness': dtype('float64'),
 'Loss of interest': dtype('float64'),
 'Lying': dtype('O'),
 'Mathematics': dtype('float64'),
 'Medicine': dtype('float64'),
 'Metal or Hardrock': dtype('float64'),
 'Mood swings': dtype('float64'),
 'Movies': dtype('float64'),
 'Music': dtype('float64'),
 'Musical': dtype('float64'),
 'Musical instruments': dtype('float64'),
 'New environment': dtype('float64'),
 'Number of friends': dtype('int64'),
 'Number of siblings': dtype('float64'),
 'Only child': dtype('O'),
 'Opera': dtype('float64'),
 'PC': dtype('float64'),
 "Parents' advice": dtype('float64'),
 'Passive sport': dtype('float64'),
 'Personality': dtype('float64'),
 'Pets': dtype('float64'),
 'Physics': dtype('float64'),
 'Politics': dtype('float64'),
 'Pop': dtype('float64'),
 'Prioritising workload': dtype('float64'),
 'Psychology': dtype('float64'),
 'Public speaking': dtype('float64'),
 'Punctuality': dtype('O'),
 'Punk': dtype('float64'),
 'Questionnaires or polls': dtype('float64'),
 'Rats': dtype('float64'),
 'Reading': dtype('float64'),
 'Reggae, Ska': dtype('float64'),
 'Reliability': dtype('float64'),
 'Religion': dtype('float64'),
 'Responding to a serious letter': dtype('float64'),
 'Rock': dtype('float64'),
 'Rock n roll': dtype('float64'),
 'Romantic': dtype('float64'),
 'Sci-fi': dtype('float64'),
 'Science and technology': dtype('float64'),
 'Self-criticism': dtype('float64'),
 'Shopping': dtype('float64'),
 'Shopping centres': dtype('float64'),
 'Slow songs or fast songs': dtype('float64'),
 'Small - big dogs': dtype('float64'),
 'Smoking': dtype('O'),
 'Snakes': dtype('int64'),
 'Socializing': dtype('float64'),
 'Spending on gadgets': dtype('int64'),
 'Spending on healthy eating': dtype('float64'),
 'Spending on looks': dtype('float64'),
 'Spiders': dtype('float64'),
 'Storm': dtype('float64'),
 'Swing, Jazz': dtype('float64'),
 'Techno, Trance': dtype('float64'),
 'Theatre': dtype('float64'),
 'Thinking ahead': dtype('float64'),
 'Thriller': dtype('float64'),
 'Unpopularity': dtype('float64'),
 'Village - town': dtype('O'),
 'Waiting': dtype('float64'),
 'War': dtype('float64'),
 'Weight': dtype('float64'),
 'Western': dtype('float64'),
 'Workaholism': dtype('float64'),
 'Writing': dtype('float64'),
 'Writing notes': dtype('float64')}

In this case study, we drop all rows that contain missing values and keep all variables.

In [24]:
new_data = real_data.dropna(axis = 0)
new_data.shape
Out[24]:
(674, 150)

To let the frequent-pattern algorithm discover potentially associated variables, we preprocess the variable values by discretization and binarization, which makes the subsequent counting straightforward. Below, different types of variables are handled in different ways.

For the float variables, we first need to know their value ranges. Here we look at the maximum value of each variable.

In [25]:
float_attribute = [v for v in attribute_type if attribute_type[v] == 'float64']
float_data = new_data[float_attribute].max()

float_data.value_counts()
Out[25]:
5.0      130
203.0      1
150.0      1
30.0       1
10.0       1
dtype: int64
In [26]:
float_data[float_data != 5.0]
Out[26]:
Number of siblings     10.0
Age                    30.0
Weight                150.0
Height                203.0
dtype: float64

The ordinary float variables (those with maximum value 5.0) can be binarized directly, using a threshold of 3.

In [27]:
#130 ordinary float variables

personal_attribute = ['Height', 'Age', 'Weight', 'Number of siblings']

float_attribute = [v for v in attribute_type if attribute_type[v] == 'float64' and v not in personal_attribute]
print float_attribute
['Economy Management', 'Heights', 'Religion', 'Pets', 'Internet', 'Fear of public speaking', 'Comedy', 'Daily events', 'Darkness', 'Achievements', 'Western', 'Swing, Jazz', 'Interests or hobbies', 'Thriller', 'Unpopularity', 'Getting up', 'Passive sport', 'Punk', 'Final judgement', 'Entertainment spending', 'New environment', 'Assertiveness', 'Fantasy/Fairy tales', 'Borrowed stuff', 'Public speaking', 'Reliability', 'Health', 'Writing notes', 'Dancing', 'Prioritising workload', 'Friends versus money', 'Waiting', 'Reading', 'Biology', 'Shopping', 'Mood swings', 'Theatre', 'Horror', 'PC', 'Musical instruments', 'Country', 'Funniness', 'Socializing', 'Loneliness', 'Metal or Hardrock', 'Loss of interest', 'Knowing the right people', 'Criminal damage', 'Hiphop, Rap', 'Rock', 'Chemistry', 'Spending on healthy eating', 'Dance', 'Celebrities', 'Workaholism', 'Adrenaline sports', 'Cars', 'Alternative', 'Folk', 'Psychology', 'Giving', 'Rock n roll', 'Writing', 'Music', 'Storm', 'Dangerous dogs', 'Medicine', 'Latino', 'Techno, Trance', 'War', 'Personality', 'Life struggles', 'Documentary', 'Appearence and gestures', 'Changing the past', 'God', 'Reggae, Ska', 'Charity', 'Branded clothing', 'Finding lost valuables', 'Opera', 'Countryside, outdoors', 'Elections', "Parents' advice", 'Happiness in life', 'Action', 'Rats', 'History', 'Gardening', 'Keeping promises', 'Spending on looks', 'Fake', 'Children', 'Decision making', 'Getting angry', 'Small - big dogs', 'Cheating in school', 'Active sport', 'Art exhibitions', 'Flying', 'Thinking ahead', 'Responding to a serious letter', 'Mathematics', 'Finances', 'Sci-fi', 'Self-criticism', 'Foreign languages', 'Classical music', 'Hypochondria', 'Judgment calls', 'Fun with friends', 'Law', 'Empathy', 'Romantic', 'Ageing', 'Compassion to animals', 'Movies', 'Science and technology', 'Healthy eating', 'Animated', 'Physics', 'Musical', 'Energy levels', 'Slow songs or fast songs', 'Pop', 'Spiders', 'Questionnaires or polls', 'Politics', 'Shopping centres', 'Geography']
In [28]:
# method 1: element-wise binarization with applymap
binary_float_attr = new_data[float_attribute].applymap(lambda x: 1 if x > 3.0 else 0)
binary_float_attr.head()
Out[28]:
Economy Management Heights Religion Pets Internet Fear of public speaking Comedy Daily events Darkness Achievements ... Physics Musical Energy levels Slow songs or fast songs Pop Spiders Questionnaires or polls Politics Shopping centres Geography
0 1 0 0 1 1 0 1 0 0 1 ... 0 0 1 0 1 0 0 0 1 0
1 1 0 0 1 1 1 1 0 0 0 ... 0 0 0 1 0 0 0 1 1 1
2 1 0 1 1 1 0 1 0 0 0 ... 0 1 1 1 0 0 0 0 1 0
4 0 0 1 0 0 0 1 0 0 0 ... 0 0 1 0 1 0 0 0 0 0
5 0 0 0 0 1 0 1 0 0 0 ... 0 0 1 0 0 0 1 1 0 0

5 rows × 130 columns

In [30]:
# method 2: sklearn's Binarizer
binary_float_attr1 = Binarizer(threshold=3).transform(new_data[float_attribute])
binary_float_attr1
Out[30]:
array([[ 1.,  0.,  0., ...,  0.,  1.,  0.],
       [ 1.,  0.,  0., ...,  1.,  1.,  1.],
       [ 1.,  0.,  1., ...,  0.,  1.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  1.,  0., ...,  0.,  0.,  1.],
       [ 0.,  1.,  0., ...,  0.,  1.,  1.]])

The float variables in personal_attribute first need to be discretized and then one-hot encoded.

In [31]:
new_data[['Age','Height', 'Weight', 'Number of siblings']].describe()
Out[31]:
Age Height Weight Number of siblings
count 674.000000 674.000000 674.000000 674.000000
mean 20.353116 173.419881 66.117211 1.299703
std 2.732763 9.475720 13.900289 0.992887
min 15.000000 152.000000 41.000000 0.000000
25% 19.000000 167.000000 55.000000 1.000000
50% 20.000000 172.000000 63.000000 1.000000
75% 21.000000 180.000000 75.000000 2.000000
max 30.000000 203.000000 150.000000 10.000000
In [32]:
# the Age variable

#threshold
level = 20

binary_age_attr = new_data[['Age']].applymap(lambda x: 1 if x >= level else 0)
binary_age_attr.columns = ['Age_older_than_20']
binary_age_attr.head()
Out[32]:
Age_older_than_20
0 1
1 0
2 1
4 1
5 1
In [33]:
# the Height variable
#three quantile-based bins
binary_height_attr = pd.get_dummies(pd.qcut(new_data['Height'], 3), prefix='Height_')
binary_height_attr.head()
Out[33]:
Height__[152, 168] Height__(168, 178] Height__(178, 203]
0 1.0 0.0 0.0
1 1.0 0.0 0.0
2 0.0 1.0 0.0
4 0.0 1.0 0.0
5 0.0 0.0 1.0
In [34]:
#the Weight variable
#four quantile-based bins
binary_weight_attr = pd.get_dummies(pd.qcut(new_data['Weight'], 4), prefix='Weight_')
binary_weight_attr.head()
Out[34]:
Weight__[41, 55] Weight__(55, 63] Weight__(63, 75] Weight__(75, 150]
0 1.0 0.0 0.0 0.0
1 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0
4 0.0 1.0 0.0 0.0
5 0.0 0.0 0.0 1.0
In [35]:
#the Number of siblings variable
#binarize: more than 3 siblings or not
binary_sibling_attr = new_data[['Number of siblings']].applymap(lambda x: 1 if x > 3.0 else 0)
binary_sibling_attr.columns = ['Number of siblings_larger_than_3']
binary_sibling_attr.head()
Out[35]:
Number of siblings_larger_than_3
0 0
1 0
2 0
4 0
5 0
In [36]:
# string variables
str_attribute = [v for v in attribute_type if attribute_type[v] == 'object']
print str_attribute
['Education', 'Left - right handed', 'Only child', 'Lying', 'Internet usage', 'Village - town', 'Smoking', 'House - block of flats', 'Gender', 'Punctuality', 'Alcohol']

We inspect the values taken by the string variables and then use the get_dummies() function of the Pandas library for one-hot encoding.

In [37]:
# inspect the values of each string variable
value_list = []
for item in str_attribute:
    index = list(new_data[item].value_counts(dropna=False).index)
    value_list.append(index)
    print index
    print ''
['secondary school', 'college/bachelor degree', 'primary school', 'masters degree', 'doctorate degree', 'currently a primary school pupil']

['right handed', 'left handed']

['no', 'yes']

['sometimes', 'only to avoid hurting someone', 'everytime it suits me', 'never']

['few hours a day', 'less than an hour a day', 'most of the day']

['city', 'village']

['tried smoking', 'never smoked', 'current smoker', 'former smoker']

['block of flats', 'house/bungalow']

['female', 'male']

['i am always on time', 'i am often early', 'i am often running late']

['social drinker', 'drink a lot', 'never']

In [38]:
# one-hot encoding
binary_str_attr = pd.get_dummies(new_data[str_attribute])
binary_str_attr.head()
Out[38]:
Education_college/bachelor degree Education_currently a primary school pupil Education_doctorate degree Education_masters degree Education_primary school Education_secondary school Left - right handed_left handed Left - right handed_right handed Only child_no Only child_yes ... House - block of flats_block of flats House - block of flats_house/bungalow Gender_female Gender_male Punctuality_i am always on time Punctuality_i am often early Punctuality_i am often running late Alcohol_drink a lot Alcohol_never Alcohol_social drinker
0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 ... 1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 ... 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 ... 1.0 0.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 ... 0.0 1.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
5 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 ... 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0

5 rows × 33 columns

In [39]:
# integer variables

int_attribute = [v for v in attribute_type if attribute_type[v] == 'int64']
print int_attribute
['Eating to survive', 'Snakes', 'Number of friends', 'Dreams', 'Spending on gadgets']

The integer variables are likewise binarized with a threshold of 3.

In [40]:
binary_int_attr = new_data[int_attribute].applymap(lambda x: 1 if x > 3 else 0)
binary_int_attr.columns = ['Eating to survive_very_much', 
                           'Snakes_very_much',
                          'Number of friends_larger_than_3',
                          'Dreams_very_much',
                          'Spending on gadgets_very_much']
binary_int_attr.head()
Out[40]:
Eating to survive_very_much Snakes_very_much Number of friends_larger_than_3 Dreams_very_much Spending on gadgets_very_much
0 0 1 0 1 0
1 0 0 0 0 1
2 1 0 0 0 1
4 0 0 0 0 0
5 0 0 0 0 1

We use the concat function to stitch the processed variables together.

In [41]:
final_data = pd.concat([binary_float_attr.reset_index(drop=True), 
                        binary_str_attr.reset_index(drop=True), 
                        binary_int_attr.reset_index(drop=True),
                        binary_age_attr.reset_index(drop=True),
                        binary_height_attr.reset_index(drop=True),
                        binary_weight_attr.reset_index(drop=True),
                        binary_sibling_attr.reset_index(drop=True)], axis=1) 
final_data.head()
Out[41]:
Economy Management Heights Religion Pets Internet Fear of public speaking Comedy Daily events Darkness Achievements ... Spending on gadgets_very_much Age_older_than_20 Height__[152, 168] Height__(168, 178] Height__(178, 203] Weight__[41, 55] Weight__(55, 63] Weight__(63, 75] Weight__(75, 150] Number of siblings_larger_than_3
0 1 0 0 1 1 0 1 0 0 1 ... 0 1 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0
1 1 0 0 1 1 1 1 0 0 0 ... 1 0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0
2 1 0 1 1 1 0 1 0 0 0 ... 1 1 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0
3 0 0 1 0 0 0 1 0 0 0 ... 0 1 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0
4 0 0 0 0 1 0 1 0 0 0 ... 1 1 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0

5 rows × 177 columns

In [42]:
final_data.shape
Out[42]:
(674, 177)

After discretization, binarization, and the other preprocessing steps, the original real_data becomes the new dataset final_data: the number of variables grows from 150 to 177, and every value is either 0 or 1.

To feed it to the algorithm, we convert final_data into the same nested-list format as the toy dataset.

In [43]:
input_data = []

#variable names
column_name = final_data.columns.values

# number of rows in the dataset
row = final_data.shape[0]

# for each data record (row)
for index in xrange(row):
    name_list = []
    # collect the names of the variables whose value is 1 in this row
    for item in column_name:
        if final_data.loc[index, item] == 1:
            name_list.append(item)
    # append the name_list of this record to the final result
    input_data.append(name_list)
In [44]:
# show the first 2 records
print input_data[:2]
[['Economy Management', 'Pets', 'Internet', 'Comedy', 'Achievements', 'Unpopularity', 'Final judgement', 'New environment', 'Fantasy/Fairy tales', 'Borrowed stuff', 'Public speaking', 'Reliability', 'Writing notes', 'Shopping', 'Horror', 'Funniness', 'Rock', 'Workaholism', 'Adrenaline sports', 'Psychology', 'Giving', 'Music', 'Personality', 'Appearence and gestures', 'Branded clothing', 'Countryside, outdoors', 'Elections', "Parents' advice", 'Happiness in life', 'Gardening', 'Keeping promises', 'Children', 'Active sport', 'Sci-fi', 'Foreign languages', 'Fun with friends', 'Romantic', 'Compassion to animals', 'Movies', 'Science and technology', 'Healthy eating', 'Animated', 'Energy levels', 'Pop', 'Shopping centres', 'Education_college/bachelor degree', 'Left - right handed_right handed', 'Only child_no', 'Lying_never', 'Internet usage_few hours a day', 'Village - town_village', 'Smoking_never smoked', 'House - block of flats_block of flats', 'Gender_female', 'Punctuality_i am always on time', 'Alcohol_drink a lot', 'Snakes_very_much', 'Dreams_very_much', 'Age_older_than_20', 'Height__[152, 168]', 'Weight__[41, 55]'], ['Economy Management', 'Pets', 'Internet', 'Fear of public speaking', 'Comedy', 'Unpopularity', 'Getting up', 'Punk', 'Entertainment spending', 'New environment', 'Public speaking', 'Reliability', 'Health', 'Writing notes', 'Friends versus money', 'Reading', 'Mood swings', 'PC', 'Socializing', 'Metal or Hardrock', 'Knowing the right people', 'Rock', 'Workaholism', 'Alternative', 'Rock n roll', 'Music', 'Documentary', 'Appearence and gestures', 'Changing the past', 'Finding lost valuables', 'Elections', 'Happiness in life', 'Action', 'Keeping promises', 'Getting angry', 'Small - big dogs', 'Cheating in school', 'Thinking ahead', 'Responding to a serious letter', 'Mathematics', 'Sci-fi', 'Self-criticism', 'Foreign languages', 'Judgment calls', 'Fun with friends', 'Compassion to animals', 'Movies', 'Animated', 'Slow songs or fast songs', 'Politics', 'Shopping centres', 'Geography', 'Education_college/bachelor degree', 'Left - right handed_right handed', 'Only child_no', 'Lying_sometimes', 'Internet usage_few hours a day', 'Village - town_city', 'Smoking_never smoked', 'House - block of flats_block of flats', 'Gender_female', 'Punctuality_i am often early', 'Alcohol_drink a lot', 'Spending on gadgets_very_much', 'Height__[152, 168]', 'Weight__(55, 63]']]
In [45]:
# minimum support
min_support = 400

fptree = FPTree(input_data, min_support, 'null', 1, 'fptree')
In [46]:
# frequent 1-itemsets
frequent1 = fptree.get_frequent_items()

print 'number of frequent 1-itemsets: ', len(frequent1)
print ''
print frequent1
number of frequent 1-itemsets:  26

{frozenset(['Compassion to animals']): 465, frozenset(['Rock']): 431, frozenset(['Village - town_city']): 486, frozenset(['Movies']): 622, frozenset(['Fantasy/Fairy tales']): 411, frozenset(['Judgment calls']): 478, frozenset(['Alcohol_social drinker']): 448, frozenset(['Fun with friends']): 604, frozenset(['Gender_female']): 402, frozenset(['Foreign languages']): 433, frozenset(['Animated']): 428, frozenset(['Music']): 643, frozenset(['Left - right handed_right handed']): 611, frozenset(['House - block of flats_block of flats']): 413, frozenset(['Internet']): 519, frozenset(['Happiness in life']): 437, frozenset(['Only child_no']): 519, frozenset(['Countryside, outdoors']): 400, frozenset(['Internet usage_few hours a day']): 502, frozenset(['Empathy']): 456, frozenset(['Friends versus money']): 406, frozenset(['Reliability']): 453, frozenset(['Borrowed stuff']): 507, frozenset(['Education_secondary school']): 424, frozenset(['Comedy']): 606, frozenset(['Keeping promises']): 506}

Since the dataset has many variables, the print statements in build_tree() and mine_tree() are left commented out, which keeps the run time and the output manageable.

In [47]:
fptree.build_tree()

Once the tree is built, we mine it. Starting from the frequent patterns of length 1, we generate conditional pattern bases; each conditional pattern base then serves as the dataset for a conditional FP-tree, from which the frequent patterns are collected.

In [48]:
frequent_item = []
frequent = set()

root_node = fptree.get_tree()
headertable = fptree.get_headertable()
freq_items = fptree.get_frequent_items()
mine_tree(freq_items, headertable, min_support, frequent, frequent_item)

We print the first 10 elements of frequent_item.

In [51]:
for index in range(10):
    print frequent_item[index]
set([frozenset(['Countryside, outdoors'])])
set([frozenset(['Gender_female'])])
set([frozenset(['Friends versus money'])])
set([frozenset(['Fantasy/Fairy tales'])])
set([frozenset(['House - block of flats_block of flats'])])
set([frozenset(['Education_secondary school'])])
set([frozenset(['Music']), frozenset(['Education_secondary school'])])
set([frozenset(['Animated'])])
set([frozenset(['Animated']), frozenset(['Movies'])])
set([frozenset(['Music']), frozenset(['Animated'])])

The frequent itemsets give a picture of how these young people live. For example, enjoying time with friends and liking music, comedy, and movies are traits that frequently occur together. Many other interesting patterns can be found in the frequent itemsets as well.
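
The frequent itemsets are also the raw material for association rules. As a final hedged sketch (not part of the original notebook), a rule A → B can be scored by its confidence, support(A ∪ B) / support(A); here Music and Movies are picked purely for illustration, since both appear in the frequent 1-itemsets above:

    # derive the confidence of an example rule from the raw transactions
    transactions = [set(record) for record in input_data]

    def support_count(itemset):
        # number of transactions containing every item of `itemset`
        return sum(1 for record in transactions if itemset <= record)

    antecedent = frozenset(['Music'])
    rule_items = frozenset(['Music', 'Movies'])
    confidence = float(support_count(rule_items)) / support_count(antecedent)
    print 'confidence of Music -> Movies: ', format(confidence, '0.1%')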
