Longitudinal Analytics of Web Archive Data

 

Mind the Gap: Large-Scale Frequent Sequence Mining

The paper "Mind the Gap: Large-Scale Frequent Sequence Mining" by Iris Miliaraki, Klaus Berberich, Rainer Gemulla and Spyros Zoupanos has been accepted for presentation at SIGMOD 2013.

Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose PFSM, a scalable algorithm for frequent sequence mining on MapReduce. PFSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. At its heart, PFSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of $w$-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our extensive experimental study in the context of text mining suggests that PFSM is significantly more efficient and scalable than alternative approaches.
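
To make the role of a gap constraint concrete, the following is a minimal single-machine sketch in Python. It is not PFSM and does not use MapReduce or partitioning; it is a naive baseline that enumerates every subsequence of each input sequence and counts how many input sequences contain it. The function names and the parameters `min_support`, `max_gap` and `max_len` are assumptions made for this illustration, as is the reading of a gap as the number of items skipped between two consecutive items of a pattern.

```python
from collections import Counter


def gapped_subsequences(seq, max_gap, max_len):
    """Return the set of subsequences of `seq` with 1..max_len items in which
    at most `max_gap` items are skipped between consecutive chosen items."""
    found = set()

    def extend(prefix, last_pos):
        found.add(prefix)
        if len(prefix) == max_len:
            return
        # The next item may sit at most max_gap + 1 positions after the last one.
        stop = min(len(seq), last_pos + max_gap + 2)
        for nxt in range(last_pos + 1, stop):
            extend(prefix + (seq[nxt],), nxt)

    for start in range(len(seq)):
        extend((seq[start],), start)
    return found


def mine_frequent(database, min_support, max_gap, max_len):
    """Naive baseline: count, for each gap-constrained subsequence, the number
    of input sequences that contain it, and keep those with enough support."""
    counts = Counter()
    for seq in database:
        # A set per sequence, so each input sequence counts at most once.
        counts.update(gapped_subsequences(seq, max_gap, max_len))
    return {s: c for s, c in counts.items() if c >= min_support}


if __name__ == "__main__":
    database = [("a", "b", "c"), ("a", "c"), ("a", "x", "y", "c")]
    print(mine_frequent(database, min_support=2, max_gap=1, max_len=2))
```

In this toy database, tightening `max_gap` from 2 to 1 lowers the support of the pattern ("a", "c") because the third sequence no longer matches it, which is the sense in which gap constraints limit the output. The enumeration grows quickly with sequence length, so the sketch is only meant for small inputs.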

SIGMOD 2013 homepage