Longitudinal Analytics of Web Archive Data

 

The classification power of Web features

Miklos Erdelyi, Andras A. Benczur, Balint Daroczy, Andras Garzo, Tamas Kiss and David Siklosi have published a technical report on "The classification power of Web features".

In this paper we give a comprehensive overview of features devised for Web spam detection and investigate how much various feature classes, some requiring very high computational effort, add to the classification accuracy. We collect and handle a large number of features based on recent advances in Web spam filtering, including temporal ones; in particular, we analyze the strength and sensitivity of linkage change. We show that machine learning techniques including ensemble selection, LogitBoost and Random Forest significantly improve accuracy.

Our result is a summary of Web spam filtering best practice, with a listing of various configurations depending on collection size, computational resources and quality needs. To foster research in the area, we make several feature sets and the source code public (https://datamining.sztaki.hu/en/download/web-spam-resources), including the temporal features of eight .uk crawl snapshots that include WEBSPAM-UK2007, as well as the Web Spam Challenge features for the labeled part of ClueWeb09.

Technical Report