🌑

💡

Spaghetti.ink

Home Tags About Me Email

Appreciation, Modesty, Persistence

【毕设专栏-5】PLWAP算法及Python实现

频繁序列挖掘算法

序列模式由关联规则发展而来，都属于频繁模式挖掘算法，其核心思想为：“频繁序列的非空子序列一定是频繁的”。

频繁序列挖掘也称频繁序列模式，是频繁模式挖掘中针对序列挖掘的分支。频繁序列模式挖掘分为候选集生成算法与无候选集生成算法两种。典型的有候选集生成算法有Apriori算法与GSP算法。无候选集产生算法主要有PrefixSpan算法与WAP算法。

有候选集生成的频繁序列挖掘算法也称为Apriori-like算法，此类算法以Apriori为基础，主要有连接和剪枝两个操作。但是Apriori-like算法存在两个严重缺陷:

连接操作会生成大量候选集，特别是第一连接操作产生的2阶候选集。若存在104个1阶繁项，那么第一次连接操作将会产生约 $10^4 * (10^4-1) / 2 \simeq 10^7$ 个2阶候选集。
剪枝操作需要频繁扫描数据库，检验候选集是否满足最小支持度。虽然GSP（Generalized Sequential Pattern, 广义序列模式）优化了连接操作，让部分无用候选集在连接阶段就不会生成，进而减少候选集数量。但是依旧无法避免大量候选集的产生，所以Apriori-like效率较低，不具有实用性。

WAP算法介绍

WAP（Web Access Pattern，Web访问模式算法）由Jian Pei与Jiawei Han等人提出。WAP算法首先扫描WASD数据库中所有的1阶频繁项，再在1阶频繁项的基础上，再次扫描WASD数据库并构建WAP树，最后使用条件搜索递归地挖掘频繁序列。其中关键操作为构建WAP树和使用WAP树进行数据挖掘：

构建WAP树
- WAP树中每个节点包含两部分信息：标签与支持度计数。标签用于标记当前节点代表的项，在访问路径挖掘中即当前记录的URL；支持度计数用于统计符合前缀路径的当前节点总数。对于根节点来说比较特殊，他的标签为空且支持度计数为0。
- WAP树将通过遍历除去非频繁序列的序列数据库，依次插入建立节点。如果当前序列路径不存在则创建节点，节点支持度计数为1,否则无需新增节点，在路径经过的节点处支持度计数+1即可。
  针对每一个项建立一个链表，从该项出现直到该项不再出现。每一个链表叫作一个e_i-queue，所有链表的表头集合叫作header table（链头表）。
WAP树频繁模式挖掘
- 如果WAP树只有一条分支，那么返回所有节点的唯一序列组合。
- 初始频繁序列F为空，WAP树中每一个1阶序列都加入频繁序列F。
- 对于WAP树中的每一个项e_i
  - 创建基于e_i的条件树，记录e_i所有前缀序列的支持度计数，如果前缀序列中存在S与S^’，满足S^’⊂S,其中S的支持度计数为c，S^’的支持度计数为c^’，那么S^’的支持度计数为c^’-c。
  - 如果条件频繁项不为空，则递归创建、挖掘条件树。
  - 将条件树中得到的频繁序列，拼接上e_i后加入频繁序列F。

尽管WAP算法只需要扫描两次数据库，且避免了大量候选序列的产生，比Apriori-like算法更加高效。但是它在递归过程中的会生成大量的条件WAP树，而构建这些条件WAP树所需的额外的开销会导致算法性能下降。

PLWAP算法

针对WAP需要递归构建条件WAP树的问题，Ezeife和Lu提出了PLWAP算法。不同于WAP算法，PLWAP（Pre-Order Linked WAP-tree）算法采用先序遍历对每个节点进行二进制位置编码以用于快速确定节点间的父子关系，因此不需要递归地创建条件WAP树。

二进制位置编码（binary position code）
给定一个PLWAP树，每个节点的位置编码（position code）遵循以下规则：
- 根节点的位置编码为null。
- 根节点最左侧节点的位置编码为1。其余所有节点的最左侧子节点的位置编码为在其父节点位置编码的基础上拼接1。非最左侧节点的位置编码则根据其位置排序在其父节点位置编码的基础上拼接1和0。总的来说，对于第n个左侧节点，其位置编码为父节点位置编码拼接2^(n-1)。
  通过二进制位置编码，可以知道节点之间的关系。如果节点α是节点β的父节点，那么β的位置编码以α的位置编码拼接1开头。如，α位置编码为10、β位置编码为101、γ的位置编码为100，那么α为β的父节点且α与γ为兄弟节点。
PLWAP算法步骤
- 扫描访问序列数据库，挑选出大于最小支持度的频繁1阶序列。所有节点的数据结构由标签、支持度计数与编码位置三部分组成，其中根节点的标签与编码位置为空，支持度计数为0。
- 与WAP算法相同，删除序列数据库中的非频繁项，同时生成PLWAP树，其构建过程与WAP相似。不同的是，PLWAP算法的链头表（head table）由先序遍历产生，它确保了链表中当前节点后侧的节点总是子孙节点或兄弟分支节点，这一点对于PLWAP序列挖掘来说至关重要。
- 使用PLWAP挖掘算法以前缀序列的方式递归的挖掘所有频繁序列，并最终输出所有的频繁序列。

PLWAP树构造伪代码

PLWAP构造过程
输入：访问日志数据库(WASD)，频繁1阶序列L_1
输出：链头表 Head Table与PLWAP-tree
    1）设当前节点为 R(label=Null,position code=Null,count=0)
    For each 序列 S∈WASD
        提取序列S中所有频繁子序列FE
    For each event e_i∈FE
        If 当前节点有子节点
            If e_i∈ 子节点
                子节点计数+1
        Else
            将e_i作为新节点插入支持度计数为1
            位置编码为最左侧子节点编码+“0”
        Else
            将e_i作为新节点插入支持度计数为1
                位置编码为最左侧子节点编码+“1”
    2）从根节点开始，对PLWAP树作先序遍历，获得Head table
    3）返回Head table与PLWAP树

PLWAP挖掘过程伪代码

PLWAP挖掘过程
输入：PLWAP-tree T，Head linkage table L
最小支持度，后缀树节点列表（初始值包括根节点）R，频繁序列集合F
输出：频繁序列集合F^'
其他变量：S存储当前节点，C存储e_i 的支持度计数
    1) If R为空，return
    2) For each e_i∈L.找到找到e_i 的后缀树
        令S为e_i-queue的第一个节点
        遍历e_i-queue
            If e_i 为R中任意项的子孙节点且不为S的子孙节点
                令e_i 加入后缀树R得到R^'
                C=C+e_i.count
                令S=e_i
            If C≥最小支持度
                将e_i 拼接上F得到F^'
                递归调用PLWAP-Mine算法,入参为R^' 和F^'

案例

事务（会话）ID	会话访问序列
1	P1→P2→P1→P3
2	P1→P2→P3→P1→P3
3	P2→P1→P2→P1→P3
4	P1→P2→P1→P3→P3

PLWAP-Tree 构建过程

WAP_build

PLWAP-Tree 挖掘过程

WAP_build

代码【转载使用请引用】

#! /usr/bin/env python3
# Author: Fengfan Hu
# Date: 2021.06.26

class Node:
    '''
    Node 类型
    @param url 节点Url - .i.e event in sequential mining.
    @param code 节点编码
    @param count 节点频数
    @param children 子节点列表
    '''
    def __init__(self, url: str, children, code, count: int = 1):
        self.url = url;
        self.count = count;
        self.code = code;
        self.children = children;
        
    def isDescendantOf(self, x):
        return self.code[1][:x.code[0]+1] == x.code[1]+'1';
    
    def isDescendantOfR(self, R):
        for event in R:
            code = event.code;
            if (self.code[1][:code[0]+1] == code[1]+'1' or self.code[1][:code[0]] == code[1]): return True;
        return False;

class PLWAP:
    def __init__(self, sequences):
        self.root = Node('root', {}, None);
        self.linked_head = {};
        self.sequences = sequences;
        self.F = [];
        # Init linked header
        self.initLinkedHeader();
        self.createWAPTree();
        self.getPreOrderedLink();

    # Init linked header
    def initLinkedHeader(self):
        uniqueUrl = set();
        for session in self.sequences:
            uniqueUrl = uniqueUrl.union(set(session));
        for url in uniqueUrl:
            self.linked_head[url] = [];
    
    # WAP-Tree
    def createWAPTree(self):
        for sequence in self.sequences:
            currentNode = self.root;
            for event in sequence:
                if (event not in currentNode.children):
                    # root case
                    if (currentNode.code == None):
                        if (not currentNode.children):
                            # first non-root node
                            code = (1, '1');
                        else:
                            keys = currentNode.children.keys();
                            code_val = currentNode.children[list(keys)[0]].code[1] + len(keys)*'0';
                            code = (len(code_val), code_val);
                    else:
                        if (not currentNode.children):
                            code_val = currentNode.code[1] + '1';
                            code = (len(code_val), code_val);
                        else:
                            keys = currentNode.children.keys();
                            code_val = currentNode.children[list(keys)[0]].code[1] + len(keys)*'0';
                            code = (len(code_val), code_val);
                    newNode = Node(event, {}, code);
                    currentNode.children[event] = newNode;
                else:
                    currentNode.children[event].count += 1;
                    newNode = currentNode.children[event];
                currentNode = newNode;

    # Get Pre-ordered header linkage
    def getPreOrderedLink(self):
        root = self.root;
        linked_head = self.linked_head;
        if root is None:
            return []
        stack = [root, ]
        while stack:
            root = stack.pop()
            if (root.url != 'root'): # root Node need not add in
                linked_head[root.url].append(root);
            stack.extend(list(root.children.values())[::-1])

    def PLWAP_Mine(self, minimum_support, header_linkage = None, frequent_sequence = None, suffix_tree_roots = None):
        # Initialization
        if (not header_linkage):
            header_linkage = self.linked_head;
        if (not frequent_sequence):
            frequent_sequence = [];
        if (not suffix_tree_roots):
            suffix_tree_roots = list(self.root.children.values());

        # 1. If R is empty, return
        if (not suffix_tree_roots): return;
        # 2. For each event, e_i in header_linkage, find the suffix tree of e_i in wap_tree, do
        for event in list(header_linkage.keys()):
            e_i_queue = header_linkage[event];
            # 2a. Save first event in e_i-queue to S 
            S = e_i_queue[0];
            # 2b. Following the e_i-queue
            R_bar = [];
            C = 0;
            for idx, e_i in enumerate(e_i_queue):
                # If event e_i is the descendant of any event in R, and is not descendant of S
                if (e_i.isDescendantOfR(suffix_tree_roots)):
                    if (not e_i.isDescendantOf(S)):
                        # Insert it into suffix-tree-header set R'
                        R_bar.extend(list(e_i.children.values()));
                        # Add count of e_i to C
                        C += e_i.count;
                        # Replace the S with e_i
                        S = e_i;
                else:
                    if (idx+1 < len(e_i_queue)):
                        S = e_i_queue[idx+1];

            # 2c. If C is greater than minimum_support
            if (C >= minimum_support):
                # Append e_i after F to F' and output F'
                frequent_sequence_bar = frequent_sequence[:];
                frequent_sequence_bar.append(event);
                self.F.append({'+'.join(frequent_sequence_bar): C});
                print({'+'.join(frequent_sequence_bar): C});

                # Call Algorithm PLWAP-MINE by passing R_bar and F_bar
                self.PLWAP_Mine(minimum_support, None, frequent_sequence_bar, R_bar);

if __name__ == "__main__":
    testSessions = [];
    testSessions.append(['a', 'b', 'a', 'c']);
    testSessions.append(['a', 'b', 'c', 'a', 'c']);
    testSessions.append(['b', 'a', 'b', 'a', 'c']);
    testSessions.append(['a', 'b', 'a', 'c', 'c']);

    plwap = PLWAP(testSessions);
    plwap.PLWAP_Mine(3);

    print(plwap.F)

本文由 Frank采用署名 4.0 国际 (CC BY 4.0)许可

数据分析, 毕业设计 — 2021年6月26日

本文总阅读量次

数据分析

Next posts Cloud Service Model

Previous posts 【毕设专栏-4】会话识别

毕业设计

Previous posts 【毕设专栏-4】会话识别

View / Make Comments

Made with ❤ and at Hangzhou.

本站总访问量次