In e-commerce, a good recommendation algorithm translates into higher sales. Online stores on Taobao or JD.com, for example, use product recommendation techniques to suggest products to potential customers. Recommendation algorithms are also widely applicable on online video sites, where users' movie ratings can be used to recommend new movies to them.

This case study uses a movie rating dataset consisting of two parts, movies and ratings:

  • The movies part records basic information about each movie
  • The ratings part records users' ratings of the movies

This case study first implements Apriori, the classic algorithm for association rule mining, and validates it on a small dataset. We then run the implementation on the movie rating dataset to find valuable association rules and analyze the relationships between movies of different genres.

Following the association rule algorithm introduced in the textbook, we look for latent association rules. Apriori is a basic algorithm for discovering frequent itemsets. In what follows, we implement the Apriori procedure step by step in Python and use the movie dataset to find interesting association rules.

While developing the algorithm, we use a small toy dataset Data to verify its correctness.

1 Apriori Algorithm Implementation

Following the steps of the Apriori algorithm, we first build the candidate set $C_{1}$ and from it the frequent 1-itemset $L_{1}$.

In [1]:
Data = {'user1':['I1','I2','I5'],
        'user2':['I2','I4'],
        'user3':['I2','I3'],
        'user4':['I1','I2','I4'],
        'user5':['I1','I3'],
        'user6':['I2','I3'],
        'user7':['I1','I3'],
        'user8':['I1','I2','I3','I5'],
        'user9':['I1','I2','I3']
        }

We first transform Data, storing each record as a frozenset in a new dictionary data.

In Python, the set type is mutable and therefore unhashable, whereas frozenset is an immutable, hashable set type. Frequent itemsets stored as frozensets can thus serve as dictionary keys, which is convenient for building the collection of frequent itemsets later (key: frequent itemset, value: support count). For more specialized data types such as OrderedDict and namedtuple, see Python's collections module.
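
A minimal standalone sketch of the difference (not part of the case code):

try:
    {set(['I1']): 1}           # a plain set cannot be a dictionary key
except TypeError as e:
    print(e)                   # unhashable type: 'set'

support = {frozenset(['I1']): 6}   # frozenset is hashable, so this works
print(support[frozenset(['I1'])])  # 6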

In [2]:
data = {v: frozenset(Data[v]) for v in Data}
data
Out[2]:
{'user1': frozenset({'I1', 'I2', 'I5'}),
 'user2': frozenset({'I2', 'I4'}),
 'user3': frozenset({'I2', 'I3'}),
 'user4': frozenset({'I1', 'I2', 'I4'}),
 'user5': frozenset({'I1', 'I3'}),
 'user6': frozenset({'I2', 'I3'}),
 'user7': frozenset({'I1', 'I3'}),
 'user8': frozenset({'I1', 'I2', 'I3', 'I5'}),
 'user9': frozenset({'I1', 'I2', 'I3'})}

1.1 Generating the Frequent 1-Itemsets

Since these frequent itemsets have length 1, we can generate the frequent 1-itemsets directly by counting how often each item occurs in data. Itemsets are stored in a Python dictionary, where the key is the frequent itemset and the value is its support count. The minimum support count is set to 2.

In [3]:
freq_1 = {}
# count how many records (users) contain each individual item
for user in data:
    for item in data[user]:
        if frozenset([item]) in freq_1:
            freq_1[frozenset([item])] += 1
        else:
            freq_1[frozenset([item])] = 1

# keep only the items meeting the minimum support count of 2
freq_1 = {v: freq_1[v] for v in freq_1 if freq_1[v] >= 2}
In [4]:
freq_1
Out[4]:
{frozenset({'I1'}): 6,
 frozenset({'I5'}): 2,
 frozenset({'I2'}): 7,
 frozenset({'I4'}): 2,
 frozenset({'I3'}): 6}

1.2 Generating Frequent k-Itemsets

Starting from $k \geq 2$, we generate the frequent $k$-itemsets from the frequent $(k-1)$-itemsets obtained so far. This is handled by the function getFrequentItemSetWithSupport(), which returns the frequent $k$-itemsets. Concretely, there are three steps:

  • Join step: generate the candidate set $C_{k}$

  • Prune step: filter the candidates using the Apriori property

  • Keep the candidates that meet the minimum support, producing the final frequent $k$-itemsets

In the join step, we join the frequent $(k-1)$-itemsets with themselves to generate the $k$-item candidate set candidate_items.
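
As a standalone sketch, the join step applied to three frequent 1-itemsets looks like this:

items = [frozenset({'I1'}), frozenset({'I2'}), frozenset({'I5'})]
k = 2
# union every distinct pair and keep only the results of size k
candidates = {m.union(n) for m in items for n in items if m != n and len(m.union(n)) == k}
print(candidates)  # the three frozensets {'I1','I2'}, {'I1','I5'}, {'I2','I5'}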

In the prune step, we filter the candidate itemsets using the a priori (downward-closure) property, which reduces the amount of computation. The property states:

All non-empty subsets of a frequent itemset must themselves be frequent
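
A minimal illustration of this pruning test (standalone; the itemset values here are only for demonstration):

from itertools import combinations

# suppose these are the frequent 2-itemsets found so far
frequent_2 = {frozenset({'I1', 'I2'}), frozenset({'I1', 'I5'}), frozenset({'I2', 'I5'})}

# a 3-item candidate survives pruning only if all of its 2-item subsets are frequent
candidate = frozenset({'I1', 'I2', 'I5'})
survives = all(frozenset(sub) in frequent_2 for sub in combinations(candidate, 2))
print(survives)  # True: no infrequent subset, so the candidate is kept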

In [5]:
'''
getFrequentItemSetWithSupport finds the frequent k-itemsets.
Parameters:
frequent_k: the frequent (k-1)-itemsets
min_support: the minimum support count
k: the size of the itemsets to generate
item_list: the transaction records

Returns:
the frequent k-itemsets: {itemset: support, itemset: support}
'''

# for k >= 2
def getFrequentItemSetWithSupport(frequent_k, min_support, k, item_list):

    items = frequent_k.keys()

    current_k = {}

    ### Join step: generate the k-item candidate set
    candidate_items = [m.union(n) for m in items for n in items if m != n and len(m.union(n)) == k]

    ### Prune step: discard candidates that contain an infrequent subset
    final_candidate = set()
    for candidate in candidate_items:

        sub_items = getSubset(candidate, (k - 1))

        # keep the candidate only if every (k-1)-subset is frequent
        if set(items) >= set(sub_items):
            final_candidate.add(candidate)

    # Scan the records in item_list, count each surviving candidate,
    # and keep those meeting the minimum support threshold
    for record in item_list.items():
        for item in final_candidate:

            # the record contains the candidate itemset
            if item.issubset(record[1]):

                if item in current_k:
                    current_k[item] += 1
                else:
                    current_k[item] = 1

    return {v: current_k[v] for v in current_k if current_k[v] >= min_support}

Inside getFrequentItemSetWithSupport we rely on two helper functions, getSubset and getAllSubsets, which return an itemset's $k$-item subsets and all of its non-empty proper subsets, respectively.

For example, for the itemset frozenset({'I1', 'I2', 'I5'}), its 2-item subsets are:

frozenset({'I1', 'I2'})
frozenset({'I1', 'I5'})
frozenset({'I2', 'I5'})

The following code implements this.

In [6]:
import itertools as its

def getSubset(item, k):
    # all k-element combinations of the itemset, frozen so they are hashable
    return [frozenset(sub) for sub in its.combinations(item, k)]
In [7]:
example1 = frozenset({'I1', 'I2', 'I5'})
getSubset(example1, 2)
Out[7]:
[frozenset({'I1', 'I5'}), frozenset({'I1', 'I2'}), frozenset({'I2', 'I5'})]

Likewise, we can use the getAllSubsets() function (defined below) to find all of its non-empty proper subsets:

frozenset({'I1', 'I2'})
frozenset({'I1', 'I5'})
frozenset({'I2', 'I5'})
frozenset({'I1'})
frozenset({'I2'})
frozenset({'I5'})
In [8]:
def getAllSubsets(item):

    subsets = []

    # collect subsets of every size from 1 up to len(item) - 1,
    # i.e. all non-empty proper subsets
    for i in range(len(item) - 1):

        subsets.extend(getSubset(item, i + 1))

    return subsets
In [9]:
getAllSubsets(example1)
Out[9]:
[frozenset({'I1'}),
 frozenset({'I5'}),
 frozenset({'I2'}),
 frozenset({'I1', 'I5'}),
 frozenset({'I1', 'I2'}),
 frozenset({'I2', 'I5'})]

The subsets of an itemset are simply combinations of its elements, so we implement subset generation with the combinations function from itertools.
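
For reference, the underlying call behaves as follows (standalone snippet):

import itertools as its
print(list(its.combinations(['I1', 'I2', 'I5'], 2)))
# [('I1', 'I2'), ('I1', 'I5'), ('I2', 'I5')]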

At this point, we can call getFrequentItemSetWithSupport() repeatedly to obtain the complete collection of frequent k-itemsets, until the stopping condition is reached, i.e., the frequent k-itemset becomes empty.

When the data records are long, checking round after round whether the frequent k-itemset has become empty is expensive; moreover, the frequent itemsets of interest in a concrete task are usually short. We therefore control the value of k directly to obtain frequent k-itemsets up to a specified size.

In [10]:
k = 2
final_itemsets = []
final_itemsets.append(freq_1)
min_support = 2
min_conf = 0.6

frequent_k_minus_1 = freq_1

# we bound k at 3 instead of looping until the frequent k-itemset is empty
#while frequent_k != {}:
while k <= 3:
    print(frequent_k_minus_1)
    print(" ")
    frequent_k = getFrequentItemSetWithSupport(frequent_k_minus_1, min_support, k, data)
    final_itemsets.append(frequent_k)
    frequent_k_minus_1 = frequent_k
    k += 1
{frozenset({'I1'}): 6, frozenset({'I5'}): 2, frozenset({'I2'}): 7, frozenset({'I4'}): 2, frozenset({'I3'}): 6}
 
{frozenset({'I1', 'I2'}): 4, frozenset({'I5', 'I2'}): 2, frozenset({'I1', 'I5'}): 2, frozenset({'I4', 'I2'}): 2, frozenset({'I2', 'I3'}): 4, frozenset({'I1', 'I3'}): 4}
 
In [11]:
final_itemsets
Out[11]:
[{frozenset({'I1'}): 6,
  frozenset({'I5'}): 2,
  frozenset({'I2'}): 7,
  frozenset({'I4'}): 2,
  frozenset({'I3'}): 6},
 {frozenset({'I1', 'I2'}): 4,
  frozenset({'I2', 'I5'}): 2,
  frozenset({'I1', 'I5'}): 2,
  frozenset({'I2', 'I4'}): 2,
  frozenset({'I2', 'I3'}): 4,
  frozenset({'I1', 'I3'}): 4},
 {frozenset({'I1', 'I2', 'I5'}): 2, frozenset({'I1', 'I2', 'I3'}): 2}]

1.3 Generating Association Rules

Once the frequent itemsets are generated, we can directly obtain the strong association rules that satisfy both the minimum support and the minimum confidence. For example, for the frequent itemset {'A','B'}, the confidence of the rule $A \rightarrow B$ is: \begin{equation} \text{conf}(A \rightarrow B) = \dfrac{\text{support}(\{A,B\})}{\text{support}(\{A\})} \end{equation}
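
As a quick worked example on the toy data: $\text{support}(\{I1,I2\}) = 4$ and $\text{support}(\{I1\}) = 6$, so $\text{conf}(I1 \rightarrow I2) = 4/6 \approx 0.667$, which clears the minimum confidence of 0.6; the reverse rule has $\text{conf}(I2 \rightarrow I1) = 4/7 \approx 0.571$ and is discarded.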

Generating the association rules involves the following steps:

  • For each element value in a frequent k-itemset collection, call getAllSubsets() to obtain all non-empty subsets of value
  • For each subset condition, collect the remaining elements of value into conclusion_items
  • Compute confidence according to the confidence formula
  • Append every candidate rule above the threshold (confidence greater than min_conf) to association_rules as [[condition, conclusion_items], confidence]
In [12]:
association_rules = []

for item_set in final_itemsets[1:]:
    
    for value in item_set:
        print('frequent item set: ', value)
        # rules have the form: if condition, then conclusion
        for condition in getAllSubsets(value):
            print('    condition: ', condition)
            conclusion_items = frozenset(x for x in value if x not in condition)
            print('    conclusion: ', conclusion_items)
            
            if len(conclusion_items) > 0:
                    
                confidence = float(final_itemsets[len(value) - 1][value])/final_itemsets[len(condition) - 1][condition]
                print(confidence)
                if confidence > min_conf:
                        
                    association_rules.append([[condition, conclusion_items],
                                           confidence])
                print('')
frequent item set:  frozenset({'I1', 'I2'})
    condition:  frozenset({'I1'})
    conclusion:  frozenset({'I2'})
0.6666666666666666

    condition:  frozenset({'I2'})
    conclusion:  frozenset({'I1'})
0.5714285714285714

frequent item set:  frozenset({'I5', 'I2'})
    condition:  frozenset({'I5'})
    conclusion:  frozenset({'I2'})
1.0

    condition:  frozenset({'I2'})
    conclusion:  frozenset({'I5'})
0.2857142857142857

frequent item set:  frozenset({'I1', 'I5'})
    condition:  frozenset({'I1'})
    conclusion:  frozenset({'I5'})
0.3333333333333333

    condition:  frozenset({'I5'})
    conclusion:  frozenset({'I1'})
1.0

frequent item set:  frozenset({'I4', 'I2'})
    condition:  frozenset({'I4'})
    conclusion:  frozenset({'I2'})
1.0

    condition:  frozenset({'I2'})
    conclusion:  frozenset({'I4'})
0.2857142857142857

frequent item set:  frozenset({'I2', 'I3'})
    condition:  frozenset({'I2'})
    conclusion:  frozenset({'I3'})
0.5714285714285714

    condition:  frozenset({'I3'})
    conclusion:  frozenset({'I2'})
0.6666666666666666

frequent item set:  frozenset({'I1', 'I3'})
    condition:  frozenset({'I1'})
    conclusion:  frozenset({'I3'})
0.6666666666666666

    condition:  frozenset({'I3'})
    conclusion:  frozenset({'I1'})
0.6666666666666666

frequent item set:  frozenset({'I1', 'I5', 'I2'})
    condition:  frozenset({'I1'})
    conclusion:  frozenset({'I5', 'I2'})
0.3333333333333333

    condition:  frozenset({'I5'})
    conclusion:  frozenset({'I1', 'I2'})
1.0

    condition:  frozenset({'I2'})
    conclusion:  frozenset({'I1', 'I5'})
0.2857142857142857

    condition:  frozenset({'I1', 'I5'})
    conclusion:  frozenset({'I2'})
1.0

    condition:  frozenset({'I1', 'I2'})
    conclusion:  frozenset({'I5'})
0.5

    condition:  frozenset({'I5', 'I2'})
    conclusion:  frozenset({'I1'})
1.0

frequent item set:  frozenset({'I1', 'I2', 'I3'})
    condition:  frozenset({'I1'})
    conclusion:  frozenset({'I2', 'I3'})
0.3333333333333333

    condition:  frozenset({'I2'})
    conclusion:  frozenset({'I1', 'I3'})
0.2857142857142857

    condition:  frozenset({'I3'})
    conclusion:  frozenset({'I1', 'I2'})
0.3333333333333333

    condition:  frozenset({'I1', 'I2'})
    conclusion:  frozenset({'I3'})
0.5

    condition:  frozenset({'I1', 'I3'})
    conclusion:  frozenset({'I2'})
0.5

    condition:  frozenset({'I2', 'I3'})
    conclusion:  frozenset({'I1'})
0.5

In [13]:
association_rules
Out[13]:
[[[frozenset({'I1'}), frozenset({'I2'})], 0.6666666666666666],
 [[frozenset({'I5'}), frozenset({'I2'})], 1.0],
 [[frozenset({'I5'}), frozenset({'I1'})], 1.0],
 [[frozenset({'I4'}), frozenset({'I2'})], 1.0],
 [[frozenset({'I3'}), frozenset({'I2'})], 0.6666666666666666],
 [[frozenset({'I1'}), frozenset({'I3'})], 0.6666666666666666],
 [[frozenset({'I3'}), frozenset({'I1'})], 0.6666666666666666],
 [[frozenset({'I5'}), frozenset({'I1', 'I2'})], 1.0],
 [[frozenset({'I1', 'I5'}), frozenset({'I2'})], 1.0],
 [[frozenset({'I2', 'I5'}), frozenset({'I1'})], 1.0]]

2 Association Rule Mining on the Movie Dataset

In the first part of this case study, we implemented the Apriori algorithm and verified it on a small dataset. We now mine association rules on a real dataset.

2.1 Dataset Preprocessing

We load the dataset ratings.csv with the read_csv function from the Pandas library.

In [14]:
import pandas as pd
In [15]:
ratings = pd.read_csv("./input/ratings.csv")
ratings.head()
Out[15]:
userId movieId rating timestamp
0 1 31 2.5 1260759144
1 1 1029 3.0 1260759179
2 1 1061 3.0 1260759182
3 1 1129 2.0 1260759185
4 1 1172 4.0 1260759205

Since we want to recommend movies that users actually liked, we keep only records with a rating greater than 3 and then aggregate the rated movies by user.

In [16]:
ratings_by_user = {}

# keep only records with a rating greater than 3
pos_ratings = ratings[ratings['rating'] > 3]
In [17]:
pos_ratings.head()
Out[17]:
userId movieId rating timestamp
4 1 1172 4.0 1260759205
8 1 1339 3.5 1260759125
12 1 1953 4.0 1260759191
13 1 2105 4.0 1260759139
20 2 10 4.0 835355493

Using the groupby function, we can extract from pos_ratings the IDs of the movies each user has rated. ratings_by_user uses the following storage format:

$$\{userId_1: \{movieId_1, movieId_2\},\ userId_2: \{movieId_1, movieId_2\}, \ldots\}$$
In [18]:
# each userId appears exactly once in the groupby result,
# so we can assign the user's movie set directly
for index, item in pos_ratings.groupby('userId')['movieId']:
    ratings_by_user[index] = frozenset(item.values)
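
Equivalently, the whole mapping can be built with a single dict comprehension; a minimal sketch producing the same ratings_by_user:

ratings_by_user = {uid: frozenset(group.values)
                   for uid, group in pos_ratings.groupby('userId')['movieId']}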

Looking at things from the movies' side, we can count how many positive ratings each movie has received. Here we set the minimum support to $min\_support = 50$, keeping movies rated positively more than 50 times.

In [19]:
# count the number of positive ratings per movie
movie_stats = pos_ratings.groupby('movieId')['userId'].count()
movie_freq_dict = dict(movie_stats[movie_stats > 50])
In [20]:
## print the first 10 entries
movie_freq_list = list(movie_freq_dict.items())
for index in range(10):
    print(movie_freq_list[index])
(1, 182)
(2, 51)
(6, 73)
(10, 65)
(11, 53)
(16, 72)
(17, 62)
(21, 55)
(25, 69)
(32, 147)

2.2 Generating the Frequent 1-Itemsets

We generate the frequent 1-itemsets frequent_1 from movie_freq_dict.

In [21]:
frequent_1 = {frozenset([item]): movie_freq_dict[item] for item in movie_freq_dict} 

2.3 Generating Frequent k-Itemsets

As in Part 1, we again set $k \leq 3$.

In [22]:
k = 2
final_itemsets = []
final_itemsets.append(frequent_1)
min_support = 50
min_conf = 0.6

frequent_k_minus_1 = frequent_1

# as in Part 1, we bound k at 3
#while frequent_k != {}:
while k <= 3:
    print('begin to search ' + str(k) + ' frequent item set')
    print("")
    frequent_k = getFrequentItemSetWithSupport(frequent_k_minus_1, min_support, k, ratings_by_user)
    final_itemsets.append(frequent_k)
    frequent_k_minus_1 = frequent_k
    k += 1
begin to search 2 frequent item set

begin to search 3 frequent item set

final_itemsets stores all frequent itemsets that meet the configured thresholds. The following code inspects the frequent 3-itemset collection.

In [23]:
# inspect the first 10 frequent 3-itemsets
final_items_3_list = list(final_itemsets[2].items())
for index in range(10):
    print(final_items_3_list[index])
(frozenset({50, 527, 47}), 61)
(frozenset({296, 50, 589}), 63)
(frozenset({480, 589, 47}), 58)
(frozenset({296, 50, 527}), 84)
(frozenset({480, 589, 527}), 54)
(frozenset({296, 110, 527}), 65)
(frozenset({527, 110, 47}), 55)
(frozenset({296, 150, 110}), 52)
(frozenset({296, 589, 110}), 65)
(frozenset({296, 47, 527}), 74)

2.4 Generating Association Rules

We now extract association rules from the frequent itemsets, i.e., conclusions of the form "if a user has watched movie A, they will also watch movie B". For each frequent itemset, we generate the candidate rules (condition and conclusion) and compute the corresponding confidence.

In [24]:
association_rules = []
min_conf = 0.5

for item_set in final_itemsets[1:]:
    
    for value in item_set:
        for condition in getAllSubsets(value):
            conclusion_items = frozenset(x for x in value if x not in condition)

            if len(conclusion_items) > 0:

                confidence = float(final_itemsets[len(value) - 1][value])/final_itemsets[len(condition) - 1][condition]
                if confidence > min_conf:
                        
                    association_rules.append([[condition, conclusion_items],
                                           confidence])

Since a large number of association rules is generated, we display only the first 10 here.

In [25]:
association_rules[:10]
Out[25]:
[[[frozenset({590}), frozenset({589})], 0.504],
 [[frozenset({592}), frozenset({296})], 0.7021276595744681],
 [[frozenset({527}), frozenset({296})], 0.602803738317757],
 [[frozenset({47}), frozenset({296})], 0.8098159509202454],
 [[frozenset({500}), frozenset({296})], 0.6206896551724138],
 [[frozenset({590}), frozenset({110})], 0.576],
 [[frozenset({50}), frozenset({47})], 0.5163043478260869],
 [[frozenset({47}), frozenset({50})], 0.5828220858895705],
 [[frozenset({480}), frozenset({589})], 0.5977011494252874],
 [[frozenset({589}), frozenset({480})], 0.5621621621621622]]

2.5 Converting Association Rules

movies.csv stores the movie metadata. Through movieId we can look up each movie's details and replace the movie IDs in the rules with their titles.

In [26]:
movie_names  = pd.read_csv('./input/movies.csv')
movie_names.head()
Out[26]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
In [27]:
movie_names.dtypes
Out[27]:
movieId     int64
title      object
genres     object
dtype: object
In [28]:
movie_rules = []

for rule in association_rules:

    # movie IDs on each side of the rule
    conditions = rule[0][0]
    conclusions = rule[0][1]

    # replace IDs with movie titles
    condition_names = [movie_names[movie_names['movieId'] == int(item)]['title'].values[0] for item in conditions]
    conclusion_names = [movie_names[movie_names['movieId'] == int(item)]['title'].values[0] for item in conclusions]

    movie_rules.append([condition_names, conclusion_names, rule[1]])
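
When the rule list grows large, filtering the DataFrame once per movie ID becomes slow. A common alternative is to build an ID-to-title dictionary once; a sketch (the names id_to_title and movie_rules_fast are ours, not part of the case code):

# build the movieId -> title lookup once; each translation is then O(1)
id_to_title = dict(zip(movie_names['movieId'], movie_names['title']))

movie_rules_fast = [[[id_to_title[i] for i in rule[0][0]],  # condition titles
                     [id_to_title[i] for i in rule[0][1]],  # conclusion titles
                     rule[1]]                               # confidence
                    for rule in association_rules]
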
In [29]:
## print the first 10 rules
movie_rules[:10]
Out[29]:
[[['Dances with Wolves (1990)'], ['Terminator 2: Judgment Day (1991)'], 0.504],
 [['Batman (1989)'], ['Pulp Fiction (1994)'], 0.7021276595744681],
 [["Schindler's List (1993)"], ['Pulp Fiction (1994)'], 0.602803738317757],
 [['Seven (a.k.a. Se7en) (1995)'],
  ['Pulp Fiction (1994)'],
  0.8098159509202454],
 [['Mrs. Doubtfire (1993)'], ['Pulp Fiction (1994)'], 0.6206896551724138],
 [['Dances with Wolves (1990)'], ['Braveheart (1995)'], 0.576],
 [['Usual Suspects, The (1995)'],
  ['Seven (a.k.a. Se7en) (1995)'],
  0.5163043478260869],
 [['Seven (a.k.a. Se7en) (1995)'],
  ['Usual Suspects, The (1995)'],
  0.5828220858895705],
 [['Jurassic Park (1993)'],
  ['Terminator 2: Judgment Day (1991)'],
  0.5977011494252874],
 [['Terminator 2: Judgment Day (1991)'],
  ['Jurassic Park (1993)'],
  0.5621621621621622]]