Web archives are valuable resources. However, they are characterized by a high degree of redundancy. Not only does this redundancy waste computing resources, but it also degrades the user experience, since users have to sift through and weed out redundant content. Existing methods focus on identifying near-duplicate documents and assume a universal notion of redundancy; they therefore cannot adapt to user-specific requirements such as a preference for more recent or more diversely opinionated content.
In this work, we propose an approach that equips users with fine-grained control over what they consider redundant. Users specify a binary coverage relation between documents that can factor in the documents’ contents as well as their metadata. Our approach then determines a minimum-cardinality cover set of non-redundant documents. We describe how this can be done at scale using MapReduce as a platform for distributed data processing. Our prototype implementation has been deployed on a real-world web archive, and we report experiences from this case study.
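To make the notions of a coverage relation and a cover set concrete, the following is a minimal single-machine sketch in Python. It is not the MapReduce method referred to above. The `Doc` fields, the example `covers` relation (identical text, at least as recent) and the greedy selection strategy are all assumptions chosen for illustration. The greedy strategy is a standard set-cover heuristic; it yields a small cover set but does not guarantee minimum cardinality.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Doc:
    """Hypothetical document record; field names are illustrative."""
    doc_id: str
    text: str
    timestamp: int  # e.g. crawl time in seconds since the epoch


def covers(a: Doc, b: Doc) -> bool:
    """Example user-specified relation (an assumption): a covers b if both
    have identical text and a is at least as recent as b."""
    return a.text == b.text and a.timestamp >= b.timestamp


def greedy_cover(docs: List[Doc],
                 covers_fn: Callable[[Doc, Doc], bool]) -> List[Doc]:
    """Greedily pick documents until every document is either picked or
    covered by a picked one. This is a heuristic, not an exact
    minimum-cardinality solver."""
    # For each document index, the set of indices it covers (itself included).
    covered_by = {
        i: {j for j, b in enumerate(docs) if i == j or covers_fn(a, b)}
        for i, a in enumerate(docs)
    }
    uncovered = set(range(len(docs)))
    chosen: List[Doc] = []
    while uncovered:
        # Pick the document covering the most still-uncovered documents;
        # ties are broken in favour of the smaller index.
        best = max(covered_by,
                   key=lambda i: (len(covered_by[i] & uncovered), -i))
        chosen.append(docs[best])
        uncovered -= covered_by[best]
    return chosen
```

Because the relation is passed in as a function, a user who prefers, say, older captures or documents from particular sources could swap in a different `covers_fn` without changing the selection code. The pairwise comparison here is quadratic in the number of documents, which is one reason a distributed formulation is needed at web-archive scale.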