Miklos Erdelyi, Andras A. Benczur, Balint Daroczy, Andras Garzo, Tamas Kiss and David Siklosi have published a technical report on "The classification power of Web features".
In this paper we give a comprehensive overview of features devised for Web spam detection and investigate how much various feature classes, some requiring very high computational effort, add to classification accuracy. We collect and evaluate a large number of features based on recent advances in Web spam filtering, including temporal ones; in particular, we analyze the strength and sensitivity of linkage change. We show that machine learning techniques including ensemble selection, LogitBoost and Random Forest significantly improve accuracy.
Our result is a summary of Web spam filtering best practice, listing various configurations depending on collection size, computational resources and quality needs. To foster research in the area, we make several feature sets and source code public (https://datamining.sztaki.hu/en/download/web-spam-resources), including the temporal features of eight .uk crawl snapshots that include WEBSPAM-UK2007 as well as the Web Spam Challenge features for the labeled part of ClueWeb09.
Lev Faerman has published a technical report on "Skewed Key Spaces in Map Reduce".
This paper discusses the effects of non-uniform key spaces (such as those created by processing English text) on load balancing in Hadoop. It demonstrates that a potential problem exists by examining the characteristics of the English language and their effect on reducer load, and then discusses a simple improvement to Hadoop partitioners that improves load balancing.
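The load-imbalance issue is easy to illustrate. The sketch below (a hypothetical simplification, not the paper's actual partitioner) contrasts Hadoop's default hash partitioning with a frequency-aware greedy assignment that sends the heaviest keys to the currently lightest reducer:

```python
from collections import Counter

def hash_partition(key, num_reducers):
    # Hadoop's default HashPartitioner: hash(key) mod #reducers.
    return hash(key) % num_reducers

def balanced_partition(keys, num_reducers):
    # Hypothetical sketch: greedily assign keys (heaviest first) to the
    # currently lightest reducer, using observed key frequencies.
    freq = Counter(keys)
    load = [0] * num_reducers
    assignment = {}
    for key, count in freq.most_common():
        target = load.index(min(load))
        assignment[key] = target
        load[target] += count
    return assignment, load
```

With a Zipf-like word distribution, hash partitioning can put several heavy keys on one reducer, while the greedy scheme keeps the per-reducer loads close to even.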
Aviad Pines has published a technical report on "Analyzing Virtualized Datacenter Hadoop Deployments".
This paper discusses the performance of Hadoop deployments on virtualized data centers such as Amazon EC2 and Elastichosts, both when the Hadoop cluster is located in a single data center and when it is spread across data centers in a cross-datacenter deployment. We analyze the impact of bandwidth between nodes on cluster performance.
The paper "Crowdsourced Entity Markup" by Lili Jiang, Yafang Wang, Johannes Hoffart and Gerhard Weikum has been accepted for the Workshop on Crowdsourcing the Semantic Web (CrowdSem 2013) in conjunction with ISWC 2013.
Entities, such as people, places, products, etc., exist in knowledge bases and linked data, on the one hand, and in web pages, news articles, and social media, on the other. Entity markup, like Named Entity Recognition and Disambiguation (NERD), is the essential means for adding semantic value to unstructured web contents, thereby enabling the linkage between unstructured and structured data and knowledge collections. A major challenge in this endeavor lies in the dynamics of the digital contents about the world, with new entities emerging all the time. In this paper, we propose a crowdsourced framework for NERD, specifically addressing the challenge of emerging entities in social media. Our approach combines NERD techniques with the detection of entity alias names and with co-reference resolution in texts. We propose a linking-game based crowdsourcing system for this combined task, and we report on experimental insights with this approach and on lessons learned.
CrowdSem 2013 homepage
The paper "On the SPOT: Question Answering over Temporally Enhanced Structured Data" by Mohamed Yahya, Klaus Berberich, Maya Ramanath and Gerhard Weikum has been accepted for the Workshop on Time-aware Information Access (TAIA2013) in conjunction with SIGIR 2013.
Natural-language question answering is a convenient way for humans to discover relevant information in structured Web data such as knowledge bases or Linked Open Data sources. This paper focuses on data with a temporal dimension, and discusses the problem of mapping natural-language questions into extended SPARQL queries over RDF-structured data. We specifically address the issue of disambiguating temporal phrases in the question into temporal entities like dates and named events, and temporal predicates. For the situation where the data has only partial coverage of the time dimension but is augmented with textual descriptions of entities and facts, we also discuss how to generate queries that combine structured search with keyword conditions.
TAIA 2013 homepage
The paper "Temporal Diversification of Search Results" by Klaus Berberich and Srikanta Bedathur has been accepted for the Workshop on Time-aware Information Access (TAIA2013) in conjunction with SIGIR 2013.
We investigate the notion of temporal diversity, bringing together two recently active threads of research, namely temporal ranking and diversification of search results. A novel method is developed to determine search results consisting of documents that are relevant to the query and were published at diverse times of interest to the query. Preliminary experiments on twenty years’ worth of newspaper articles from The New York Times demonstrate characteristics of our method and compare it against two baselines.
TAIA 2013 homepage
The paper "On Temporal Wikipedia search by edits and linkage" by Julianna Göbölös-Szabó and András Benczúr has been accepted for the Workshop on Time-aware Information Access (TAIA2013) in conjunction with SIGIR 2013.
We exploit the connectivity structure of edits in Wikipedia to identify recent events that happened at a given time, by identifying bursty changes in linked articles around a specified date. Our key results include algorithms for node relevance ranking in a temporal subgraph and for neighborhood selection based on measurements of structural changes over time in the Wikipedia link graph. We evaluate our algorithms over manually annotated queries with relevant events in September and October 2011; we make the assessment publicly available (https://dms.sztaki.hu/en/download/wimmut-searching-and-navigating-wikipedia). While our methods were tested over clean Wikipedia metadata, we believe they are applicable to general temporal Web collections as well.
TAIA 2013 homepage
The paper "HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text" by Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol and Gerhard Weikum has been accepted for the ACL 2013 demo track.
Recent research has shown progress in achieving high-quality, very fine-grained type classification in hierarchical taxonomies. Within such a multi-level type hierarchy with several hundred types at different levels, many entities naturally belong to multiple types. In order to achieve high precision in type classification, current approaches are either limited to certain domains or require time-consuming multi-stage computations. As a consequence, existing systems are incapable of performing ad-hoc type classification on arbitrary input texts. In this demo, we present a novel Web-based tool that performs domain-independent entity type classification in real time. Due to its efficient implementation and compact feature representation, the system processes text inputs on the fly while achieving precision as high as leading state-of-the-art implementations. Our system offers an online interface where natural-language text can be entered and which returns lexical type labels for entity mentions. Furthermore, the user interface allows users to explore the types assigned to text mentions by visualizing and navigating the type hierarchy.
ACL 2013 homepage
A summary about the 3rd Temporal Web Analytics Workshop (TempWeb 2013) has been published as part of the workshop proceedings.
Time is a key dimension to understand the Web. It is fair to say that it has not yet received all the attention it deserves, and TempWeb is an attempt to help remedy this situation by putting time at the center of its reflection.
Studying time in this context actually covers a large spectrum, from dating methodology to extraction of temporal information and knowledge, from diachronic studies to the design of infrastructural and experimental settings enabling a proper observation of this dimension.
For its third edition, TempWeb includes 9 papers out of a total of 18 papers submitted. The quality of papers has constantly improved, so that we were “forced” to accept every second paper submitted to this third edition. We like to interpret the paper quality and slightly increased submission figures as a clear sign of positive dynamics in the study of time in the scope of the Web and an indication of the relevance of this effort. The workshop proceedings are published by ACM DL as part of the WWW 2013 Companion Publication.
We hope you will find in these papers, the keynote, and the discussion and exchanges of this edition of TempWeb some motivations to look more into this important aspect of the Web.
TempWeb 2013 was jointly organized by Internet Memory Foundation (Paris, France), the Max-Planck-Institut für Informatik (Saarbrücken, Germany) and Yahoo! Research Barcelona (Barcelona, Spain), and supported by the 7th Framework IST programme of the European Union through the focused research project (STREP) on Longitudinal Analytics of Web Archive data (LAWA) under contract no. 258105.
The Proceedings of the 3rd International Temporal Web Analytics Workshop (TempWeb 2013) are online now.
The Proceedings of the 3rd International Temporal Web Analytics Workshop (TempWeb 2013) held in conjunction with the 22nd International World Wide Web Conference (www2013) in Rio de Janeiro, Brazil on May 13, 2013 are online as: WWW Companion Volume. The workshop was co-organized by the LAWA project and chaired by R. Baeza-Yates (Yahoo! Research Barcelona), J. Masanès (Internet Memory Foundation) and M. Spaniol (Max-Planck-Institut für Informatik).
The paper "Knowledge Linking for Online Statistics" by Marc Spaniol, Natalia Prytkova and Gerhard Weikum will be presented at the 59th World Statistics Congress (WSC) in the Special Topic Session (STS) on "The potential of Internet, big data and organic data for official statistics".
The LAWA project investigates large-scale Web (archive) data along the temporal dimension. As a use case, we are studying Knowledge Linking for Online Statistics.
Statistics portals such as eurostat’s “Statistics Explained” (http://epp.eurostat.ec.europa.eu/statistics_explained/index.php/Main_Page) provide a wealth of articles constituting an encyclopedia of European statistics. Together with its statistical glossary, the huge amount of numerical data comes with a well-defined thesaurus. However, this data is not directly at hand when browsing Web content covering the topic. For instance, when reading news articles about the debate on renewable energy across Europe after the earthquake in Japan and the Fukushima accident, one would ideally be able to understand these discussions based on statistical evidence.
We believe that Internet contents, captured in Web archives and reflected and aggregated in the Wikipedia history, can be better understood when linked with online statistics. To this end, we aim at semantically enriching and analyzing Web (archive) data to narrow and ultimately bridge the gap between numerical statistics and textual media like news or online forums. The missing link and key to this goal is the discovery and analysis of entities and events in Web (archive) contents. This way, we can enrich Web pages, e.g. by a browser plug-in, with links to relevant statistics (e.g. eurostat pages). Raising data analytics to the entity-level also enables understanding the impact of societal events and their perception in different cultures and economies.
WSC 2013 homepage
The paper "Mind the Gap: Large-Scale Frequent Sequence Mining" by Iris Miliaraki, Klaus Berberich, Rainer Gemulla and Spyros Zoupanos has been accepted for presentation at SIGMOD 2013.
Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose PFSM, a scalable algorithm for frequent sequence mining on MapReduce. PFSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. At its heart, PFSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of $w$-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our extensive experimental study in the context of text mining suggests that PFSM is significantly more efficient and scalable than alternative approaches.
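To make the notion of a gap constraint concrete, here is a minimal, non-distributed sketch (my own illustration, not PFSM itself): a pattern occurs in a sequence if its elements appear in order with at most `max_gap` items skipped between consecutive matches, and support is the number of database sequences containing such an occurrence.

```python
def occurs_with_gap(pattern, sequence, max_gap):
    # Does `pattern` occur in `sequence` as a subsequence with at most
    # `max_gap` items skipped between consecutive pattern elements?
    def search(p_idx, s_idx):
        if p_idx == len(pattern):
            return True
        # The first element may match anywhere; later ones are gap-limited.
        limit = len(sequence) if p_idx == 0 else min(len(sequence), s_idx + max_gap + 1)
        for j in range(s_idx, limit):
            if sequence[j] == pattern[p_idx] and search(p_idx + 1, j + 1):
                return True
        return False
    return search(0, 0)

def support(pattern, database, max_gap):
    # Naive support count over a sequence database.
    return sum(occurs_with_gap(pattern, seq, max_gap) for seq in database)
```

A frequent sequence miner would report every pattern whose support exceeds a threshold; the paper's contribution is doing this at scale by partitioning the database.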
SIGMOD 2013 homepage
The paper "Cross-lingual web spam classification" by András Garzó, Bálint Daróczy, Tamás Kiss, Dávid Siklósi and András Benczúr has been accepted for the 3rd Joint WICOW/AIRWeb Workshop on Web Quality (WebQuality 2013) in conjunction with WWW 2013.
While English-language training data exists for several Web classification tasks, most notably for Web spam, we face an expensive human labeling procedure if we want to classify a Web domain in a language other than English. We review how existing content- and link-based classification techniques work, how models can be “translated” from English into another language, and how language-dependent and language-independent methods can be combined. Our experiments are conducted on the ClueWeb09 corpus as the English training collection and a large Portuguese crawl of the Portuguese Web Archive. To foster further research, we provide labels and precomputed values of term frequencies, content and link based features for both ClueWeb09 and the Portuguese data.
WICOW/AIRWeb 2013 Homepage
The paper "Predicting Search Engine Switching in WSCD 2013 Challenge" by Qiang Yan, Xingxing Wang, Qiang Xu, Dongying Kong, Danny Bickson, Quan Yuan, and Qing Yang has been accepted for presentation at the Workshop on Web Search Click Data 2013 (WSCD2013) in conjunction with WSDM 2013.
How to accurately predict search engine switching behavior is an important but challenging problem. This paper describes the solution of the GraphLab team, which achieved 4th place in the WSCD 2013 Search Engine Switch Detect contest sponsored by Yandex. There are three core steps in our solution: feature extraction, prediction, and model ensembling. First, we extract features related to result quality, user preference and search behavior sequence patterns from user actions, query logs, and click-stream sequences. Second, models such as Online Bayesian Probit Regression (BPR), Online Bayesian Matrix Factorization (BMF), Support Vector Regression (SVR), Logistic Regression (LR) and Factorization Machines (FM) are exploited based on these features. Finally, we propose a two-step ensemble method to blend our individual models in order to fully exploit the dataset and obtain more accurate results based on the local and public test datasets. Our final solution achieves 0.8439 AUC on the public leaderboard and 0.8432 AUC on the private test set.
The paper "User-Defined Redundancy in Web Archives" by Bibek Paudel, Avishek Anand, and Klaus Berberich has been accepted for the Workshop on Large-Scale and Distributed Systems for Information Retrieval (LSDS-IR) in conjunction with WSDM 2013.
Web archives are valuable resources. However, they are characterized by a high degree of redundancy. Not only does this redundancy waste computing resources, but it also deteriorates users’ experience, since they have to sift through and weed out redundant content. Existing methods focus on identifying near-duplicate documents, assuming a universal notion of redundancy, and can thus not adapt to user-specific requirements such as a preference for more recent or diversely opinionated content.
In this work, we propose an approach that equips users with fine-grained control over what they consider redundant. Users thus specify a binary coverage relation between documents that can factor in documents’ contents as well as their meta data. Our approach then determines a minimum-cardinality cover set of non-redundant documents. We describe how this can be done at scale using MapReduce as a platform for distributed data processing. Our prototype implementation has been deployed on a real-world web archive and we report experiences from this case study.
The paper "Computing n-Gram Statistics in MapReduce" by Klaus Berberich and Srikanta Bedathur has been accepted for EDBT 2013.
Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and minimum collection frequency, can be computed efficiently harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method Suffix-sigma that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete MapReduce implementation, we provide insights on an efficient implementation of the methods. Extensive experiments on The New York Times Annotated Corpus and ClueWeb09 expose the relative benefits and trade-offs of the methods.
EDBT 2013 homepage
The paper "Interval Indexing and Querying on Key-Value Cloud Stores" by George Sfakianakis, Ioannis Patlakas, Nikos Ntarmos and Peter Triantafillou has been accepted for ICDE 2013.
Cloud key-value stores are becoming increasingly more important. Challenging applications, requiring efficient and scalable access to massive data, arise every day. We focus on supporting interval queries (which are prevalent in several data intensive applications, such as temporal querying for temporal analytics), an efficient solution for which is lacking. We contribute a compound interval index structure, comprised of two tiers: (i) the MRSegmentTree (MRST), a key-value representation of the Segment Tree, and (ii) the Endpoints Index (EPI), a column family index that stores information for interval endpoints. In addition to the above, our contributions include: (i) algorithms for efficiently constructing and populating our indices using MapReduce jobs, (ii) techniques for efficient and scalable index maintenance, and (iii) algorithms for processing interval queries. We have implemented all algorithms using HBase and Hadoop, and conducted a detailed performance evaluation. We quantify the costs associated with the construction of the indices, and evaluate our query processing algorithms using queries on real data sets. We compare the performance of our approach to two alternatives: the native support for interval queries provided in HBase, and the execution of such queries using the Hive query execution tool. Our results show a significant speedup, far outperforming the state of the art.
ICDE 2013 homepage
The paper "HYENA: Hierarchical Type Classification for Entity Names" by Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart,
Marc Spaniol and Gerhard Weikum has been accepted for COLING 2012.
Inferring lexical type labels for entity mentions in texts is an important asset for NLP tasks like semantic role labeling and named entity disambiguation (NED). Prior work has focused on flat and relatively small type systems where most entities belong to exactly one type. This paper addresses very fine-grained types organized in a hierarchical taxonomy, with several hundreds of types at different levels. We present HYENA for multi-label hierarchical classification. HYENA exploits gazetteer features and accounts for the joint evidence for types at different levels. Experiments and an extrinsic study on NED demonstrate the practical viability of HYENA.
COLING 2012 homepage
The paper "PRAVDA-live: Interactive Knowledge Harvesting" by Yafang Wang, Maximilian Dylla, Zhaouchun Ren, Marc Spaniol and Gerhard Weikum has been accepted for the CIKM 2012 demo session.
Acquiring high-quality (temporal) facts for knowledge bases is a labor-intensive process. Although there has been recent progress in the area of semi-supervised fact extraction, these approaches still have limitations, including a restricted corpus, a fixed set of relations to be extracted or a lack of assessment capabilities. In this paper we introduce PRAVDA-live, a framework that overcomes these limitations and supports the entire pipeline of interactive knowledge harvesting. To this end, our demo exhibits temporal fact extraction from ad-hoc corpus creation, via relation specification, labeling and assessment all the way to ready-to-use RDF exports.
CIKM 2012 homepage
The paper "LINDA: Distributed Web-of-Data-Scale Entity Matching" by Christoph Böhm, Gerard de Melo, Felix Naumann and Gerhard Weikum has been accepted for CIKM 2012.
Linked Data has emerged as a powerful way of interconnecting structured data on the Web. However, the crosslinkage between Linked Data sources is not as extensive as one would hope for. In this paper, we formalize the task of automatically creating “sameAs” links across data sources in a globally consistent manner. Our algorithm, presented in a multi-core as well as a distributed version, achieves this link generation by accounting for joint evidence of a match. Experiments conrm that our system scales beyond 100 million entities and delivers highly accurate results despite the vast heterogeneity and daunting scale.
CIKM 2012 homepage
The paper "KORE: Keyphrase Overlap Relatedness for Entity Disambiguation" by Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald and Gerhard Weikum has been accepted for CIKM 2012.
Measuring the semantic relatedness between two entities is the basis for numerous tasks in IR, NLP, and Web-based knowledge extraction. This paper focuses on disambiguating names in a Web or text document by jointly mapping all names onto semantically related entities registered in a knowledge base. To this end, we have developed a novel notion of semantic relatedness between two entities represented as sets of weighted (multi-word) keyphrases, with consideration of partially overlapping phrases. This measure improves the quality of prior link-based models, and also eliminates the need for (usually Wikipedia-centric) explicit interlinkage between entities. Thus, our method is more versatile and can cope with long-tail and newly emerging entities that have few or no links associated with them. For efficiency, we have developed approximation techniques based on min-hash sketches and locality-sensitive hashing. Our experiments on semantic relatedness and on named entity disambiguation demonstrate the superiority of our method compared to state-of-the-art baselines.
CIKM 2012 homepage
The paper "Cross-Lingual Data Quality for Knowledge Base Acceleration across Wikipedia Editions" by Julianna Göbölös-Szabó, Natalia Prytkova, Marc Spaniol and Gerhard Weikum has been accepted for QDB 2012.
Knowledge-sharing communities like Wikipedia and knowledge bases like Freebase are expected to capture the latest facts about the real world. However, neither of these can keep pace with the rate at which events happen and new knowledge is reported in news and social media. To narrow this gap, we propose an approach to accelerate the online maintenance of knowledge bases.
Our method, coined LAIKA, is based on link prediction. Wikipedia editions in different languages, Wikinews, and other news media come with extensive but noisy interlinkage at the entity level. We utilize this input for recommending, for a given Wikipedia article or knowledge-base entry, new categories, related entities, and cross-lingual interwiki links. LAIKA constructs a large graph from the available input and uses link-overlap measures and random-walk techniques to generate missing links and rank them for recommendations. Experiments with a very large graph from multilingual Wikipedia editions demonstrate the accuracy of our link predictions.
QDB 2012 homepage
The paper "Index Maintenance for Time-Travel Text Search" by A. Anand, S. Bedathur, K. Berberich and R. Schenkel has been accepted for SIGIR 2012.
Time-travel text search enriches standard text search by temporal predicates, so that users of web archives can easily retrieve document versions that are considered relevant to a given keyword query and existed during a given time interval. Different index structures have been proposed to efficiently support time-travel text search. None of them, however, can easily be updated as the Web evolves and new document versions are added to the web archive.
In this work, we describe a novel index structure that efficiently supports time-travel text search and can be maintained incrementally as new document versions are added to the web archive. Our solution uses a sharded index organization, bounds the number of spuriously read index entries per shard, and can be maintained using small in-memory buffers and append-only operations. We present experiments on two large-scale real-world datasets demonstrating that maintaining our novel index structure is an order of magnitude more efficient than periodically rebuilding one of the existing index structures, while query-processing performance is not adversely affected.
SIGIR 2012 homepage
The paper "Natural Language Questions for the Web of Data" by Mohamed Yahya, Klaus Berberich, Shady Elbassuoni
Maya Ramanath, Volker Tresp and Gerhard Weikum has been accepted for EMNLP 2012.
The Linked Data initiative comprises structured databases in the Semantic-Web data model RDF. Exploring this heterogeneous data by structured query languages is tedious and error-prone even for skilled users. To ease the task, this paper presents a methodology for translating natural language questions into structured SPARQL queries over linked-data sources.
Our method is based on an integer linear program to solve several disambiguation tasks jointly: the segmentation of questions into phrases; the mapping of phrases to semantic entities, classes, and relations; and the construction of SPARQL triple patterns. Our solution harnesses the rich type system provided by knowledge bases in the web of linked data, to constrain our semantic-coherence objective function. We present experiments on both the question translation and the resulting query answering.
EMNLP 2012 homepage
The paper "Coupling Label Propagation and Constraints for Temporal Fact Extraction" by Y. Wang, M. Dylla, M. Spaniol and G. Weikum has been accepted for ACL 2012.
The Web and digitized text sources contain a wealth of information about named entities such as politicians, actors, companies, or cultural landmarks. Extracting this information has enabled the automated construction of large knowledge bases, containing hundred millions of binary relationships or attribute values about these named entities. However, in reality most knowledge is transient, i.e. changes over time, requiring a temporal dimension in fact extraction. In this paper we develop a methodology that interlinks label propagation with constraints for temporal fact extraction. Due to the coupling we gain maximum benefit from both “worlds”. Label propagation “aggressively” gathers fact candidates, while an Integer Linear Program does the “clean-up”. Our method is able to improve on recall while keeping up with precision, which we demonstrate by experiments with biography-style Wikipedia pages and a large corpus of news articles.
ACL 2012 homepage
A summary about the 2nd Temporal Web Analytics Workshop (TempWeb 2012) has been published as part of the workshop proceedings.
Time is a key dimension to understand the web. It is fair to say that it has not received yet all the attention it deserves and TempWeb is an attempt to help remedy this situation by putting time as the center of its reflexion. Studying time in this context actually covers a large spectrum, from dating methodology to extraction of temporal information and knowledge, from diachronic studies to the design of infrastructural and experimental settings enabling a proper observation of this dimension.
For its second edition, TempWeb includes 6 papers out of a total of 17 papers submitted which put its acceptance rate at 35%. The number of papers submitted has almost doubled compared to the first edition, which we like to interpret as a clear sign of positive dynamic and an indication of the relevance of this effort. The workshop proceedings are published in ACM DL (ISBN 978-1-4503-1188-5).
We hope you will find in these papers, the keynotes and the discussion and exchanges of this edition of TempWeb some motivations to look more into this important aspect of Web studies. TempWeb 2012 was jointly organized by Internet Memory Foundation (Paris, France), the Max-Planck-Institut für Informatik (Saarbrücken, Germany) and Yahoo! Research Barcelona (Barcelona, Spain), and supported by the 7th Framework IST programme of the European Union through the focused research project (STREP) on Longitudinal Analytics of Web Archive data (LAWA) under contract no. 258105.
The Proceedings of the 2nd International Temporal Web Analytics Workshop (TempWeb 2012) are online now.
The Proceedings of the 2nd International Temporal Web Analytics Workshop (TempWeb 2012) held in conjunction with the 21st International World Wide Web Conference (www2012) in Lyon, France on April 17, 2012 are online at: ACM DL. The workshop was co-organized by the LAWA project and chaired by R. Baeza-Yates (Yahoo! Research Barcelona), J. Masanès (Internet Memory Foundation) and M. Spaniol (Max-Planck-Institut für Informatik).
The paper "Predicting the Evolution of Taxonomy Restructuring in Collective Web Catalogues" by N. Prytkova, M. Spaniol and G. Weikum has been accepted for WebDB 2012.
Collectively maintained Web catalogues organize links to interesting Web sites into topic hierarchies, based on community input and editorial decisions. These taxonomic systems reflect the interests and diversity of ongoing societal discourses. Catalogues evolve by adding new topics, splitting topics, and other restructuring, in order to capture newly emerging concepts of long-lasting interest. In this paper, we investigate these changes in taxonomies and develop models for predicting such structural changes. Our approach identifies newly emerging latent concepts by analyzing news articles (or social media), by means of a temporal term relatedness graph. We predict the addition of new topics to the catalogue based on statistical measures associated with the identified latent concepts. Experiments with a large news archive corpus demonstrate the high precision of our method, and its suitability for Web-scale application.
WebDB 2012 homepage
The paper "Extraction of Temporal Facts and Events from Wikipedia" by Erdal Kuzey and Gerhard Weikum has been accepted for the second Temporal Web Analytics Workshop (TempWeb 2012) in conjunction with the WWW 2012 conference.
Recently, large-scale knowledge bases have been constructed by automatically extracting relational facts from text. Unfortunately, most of the current knowledge bases focus on static facts and ignore the temporal dimension. However, the vast majority of facts are evolving with time or are valid only during a particular time period. Thus, time is a significant dimension that should be included in knowledge bases.
In this paper, we introduce a complete information extraction framework that harvests temporal facts and events from semi-structured data and free text of Wikipedia articles to create a temporal ontology. First, we extend a temporal data representation model by making it aware of events. Second, we develop an information extraction method which harvests temporal facts and events from Wikipedia infoboxes, categories, lists, and article titles in order to build a temporal knowledge base. Third, we show how the system can use its extracted knowledge for further growing the knowledge base.
We demonstrate the effectiveness of our proposed methods through several experiments. We extracted more than one million temporal facts with precision over 90% for extraction from semi-structured data and almost 70% for extraction from text.
TempWeb 2012 homepage
The paper "Content-Based Trust and Bias Classification via Biclustering" by Dávid Siklósi, Bálint Daróczy and András A. Benczúr has been accepted for the 2nd Joint WICOW/AIRWeb Workshop on Web Quality in conjunction with the WWW 2012 conference.
In this paper we improve trust, bias and factuality classification over Web data on the domain level. Unlike the majority of the literature in this area, which aims at extracting opinions and handling short text on the micro level, we aim to aid a researcher or an archivist in obtaining a large collection that, at a high level, originates from unbiased and trustworthy sources. Our method generates features as Jensen-Shannon distances from centers in a host-term biclustering. On top of the distance features, we apply kernel methods and also combine with baseline text classifiers. We test our method on the ECML/PKDD Discovery Challenge data set DC2010. Our method improves on the best text classification NDCG results by 3–10% for neutrality, bias and trustworthiness. The fact that the ECML/PKDD Discovery Challenge 2010 participants reached an AUC only slightly above 0.5 indicates the hardness of the task.
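To illustrate the kind of distance feature involved (a minimal sketch only; the variable names and data layout are hypothetical, not the paper's implementation), the Jensen-Shannon distance between a host's term distribution and a bicluster center could be computed as:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete
    distributions given as dicts mapping term -> probability."""
    terms = set(p) | set(q)
    # mixture distribution M = (P + Q) / 2
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl(a, b):
        # Kullback-Leibler divergence, summed over a's support
        return sum(pa * math.log2(pa / b[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# e.g. distance of a host's term distribution from a bicluster center
host = {"news": 0.6, "sport": 0.4}
center = {"news": 0.5, "sport": 0.3, "casino": 0.2}
feature = js_divergence(host, center)
```

JS divergence is symmetric and bounded in [0, 1] when taken base 2, which makes it a convenient ingredient for the kernel methods applied on top of the distance features.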
The paper "Big Web Analytics: Toward a Virtual Web Observatory" by Marc Spaniol, András Benczúr, Zsolt Viharos and Gerhard Weikum has been accepted for the ERCIM News special theme No. 89 on "Big Data".
For decades, compute power and storage have become steadily cheaper, while network speeds, although increasing, have not kept up. The result is that data is becoming increasingly local and thus distributed in nature. It has become necessary to move the software and hardware to where the data resides, and not the reverse. The goal of LAWA is to create a Virtual Web Observatory based on the rich centralized Web repository of the European Archive. The observatory will enable Web-scale analysis of data, will facilitate large-scale studies of Internet content and will add a new dimension to the roadmap of Future Internet Research – it’s about time!
ERCIM News (pdf)
The paper "Tracking Entities in Web Archives: The LAWA Project" by Marc Spaniol and Gerhard Weikum has been accepted for the European projects track at the WWW 2012 conference.
Web-preservation organizations like the Internet Archive not only capture the history of born-digital content but also reflect the zeitgeist of different time periods over more than a decade. This longitudinal data is a potential gold mine for researchers such as sociologists, political scientists, media and market analysts, and experts on intellectual property. The LAWA project (Longitudinal Analytics of Web Archive data) is developing an Internet-based experimental testbed for large-scale data analytics on Web archive collections. Its emphasis is on scalable methods for this specific kind of big-data analytics, and on software tools for aggregating, querying, mining, and analyzing Web contents over long epochs. In this paper, we highlight our research on entity-level analytics in Web archive data, which lifts Web analytics from plain text to the entity level by detecting named entities, resolving ambiguous names, extracting temporal facts and visualizing entities over time periods. Our results provide key assets for tracking named entities in the evolving Web, news, and social media.
WWW 2012 homepage
The paper "SZTAKI @ ImageCLEF 2011" by B. Daróczy, R. Pethes, and A. A. Benczúr has been accepted for publication in the Working Notes of the ImageCLEF 2011 Workshop at CLEF 2011 Conference, Amsterdam, The Netherlands, 2011.
We participated in the ImageCLEF 2011 Photo Annotation and Wikipedia Image Retrieval Tasks. Our approach to the ImageCLEF 2011 Photo Annotation is based on a kernel weighting procedure using visual Fisher kernels and a Flickr-tag-based Jensen-Shannon divergence kernel. We trained a Gaussian Mixture Model (GMM) to define a generative model over the feature vectors extracted from the image patches. To represent each image with high-level descriptors we calculated Fisher vectors from different visual features of the images. These features were sampled at various scales and partitions, such as Harris-Laplace detected patches and scale and spatial pyramids. We calculated distance matrices from the descriptors of the training images to combine the different high-level descriptors and the tag-based similarity matrix. With this uniform representation we were able to learn the natural weights for each category over the different types of descriptors. This re-weighting resulted in a 0.01838 MAP increase over the average-kernel results. We used the weighted kernels to learn linear SVM models for each of the 99 concepts independently. For the Wikipedia Image Retrieval Task we used the search engine of the Hungarian Academy of Sciences, which is based on Okapi BM25 ranking, as our information retrieval system. We calculated light Fisher vectors to represent the content of the images and performed nearest-neighbour search on them.
The paper "Web spam classification: a few features worth more" by M. Erdélyi, A. Garzó, and A. A. Benczúr has been published in the proceedings of the joint WICOW/AIRWeb Workshop on Web Quality (WebQuality 2011), Hyderabad, India, March 28, 2011, ACM Press 2011.
In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to the classification accuracy. We find that advances in machine learning, an area that has received less attention in the adversarial IR community, yield more improvement than new features and result in low-cost yet accurate spam filters. Our original contributions are as follows:
• We collect and handle a large number of features based on recent advances in Web spam filtering.
• We show that machine learning techniques including ensemble selection, LogitBoost and Random Forest significantly improve accuracy.
• We conclude that, with appropriate learning techniques, a small and computationally inexpensive feature subset outperforms all results published so far on our data set and can be only slightly improved further by computationally expensive features.
• We test our method on two major publicly available data sets, the Web Spam Challenge 2008 data set WEBSPAM-UK2007 and the ECML/PKDD Discovery Challenge data set DC2010.
Our classifier ensemble reaches an improvement of 5% in AUC over the Web Spam Challenge 2008 best result; more importantly, our improvement is 3.5% based solely on fewer than 100 inexpensive content features and 5% if a small-vocabulary bag-of-words representation is included. For DC2010 we improve over the best achieved NDCG for spam by 7.5%, and by over 5% using only inexpensive content features and a small bag-of-words representation.
The paper "Temporal Analysis for Web Spam Detection: An Overview" by M. Erdélyi and A. A. Benczúr has been published in the proceedings of the 1st Intl. Temporal Web Analytics Workshop (TWAW 2011), Hyderabad, India, March 28, 2011, pp. 17–24.
In this paper we give a comprehensive overview of temporal features devised for Web spam detection providing measurements for different feature sets.
• We make a temporal feature research data set publicly available (cf. http://datamining.ilab.sztaki.hu/?q=en/downloads). The features are based on eight UbiCrawler crawl snapshots of the .uk domain between October 2006 and May 2007 and use the WEBSPAM-UK2007 labels.
• We explore the performance of previously published temporal spam features and in particular the strength and sensitivity of linkage change.
• We propose new temporal link similarity based features and show how to compute them efficiently on large graphs.
Our experiments are conducted over the collection of eight .uk crawl snapshots that include WEBSPAM-UK2007.
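The simplest temporal link feature of this flavor compares a host's out-link set across two snapshots. The sketch below is illustrative only, not the paper's exact feature definition:

```python
def link_change(old_links, new_links):
    """Jaccard similarity of a host's out-link sets in two crawl
    snapshots; a low value signals heavy linkage change, which the
    measurements above probe as a spam signal. (Illustrative only.)"""
    old_links, new_links = set(old_links), set(new_links)
    if not old_links and not new_links:
        return 1.0  # no links in either snapshot: unchanged by convention
    return len(old_links & new_links) / len(old_links | new_links)

# e.g. between an October 2006 and a May 2007 snapshot of one host
sim = link_change({"a.uk", "b.uk"}, {"b.uk", "c.uk"})  # 1/3
```

Because it needs only set intersections and unions over adjacency lists, a feature like this can be computed in one pass over two sorted edge lists, which is what makes such measures tractable on large graphs.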
The paper "Infrastructures and bounds for distributed entity resolution" by Csaba István Sidló, András Garzó, András Molnár, and András A. Benczúr has been published in the proceedings of the 9th Intl. Workshop on Quality in Databases (QDB 2011) in conjunction with VLDB 2011, August 29, Seattle, WA, USA.
Entity resolution (ER), deduplication or record linkage is a computationally hard problem with distributed implementations typically relying on shared memory architectures. We show simple reductions to communication complexity and data streaming lower bounds to illustrate the difficulties with a distributed implementation: If the data records are split among servers, then basically all data must be transferred.
As a key result, we demonstrate that ER can be solved using algorithms with three different distributed computing paradigms:
• Distributed key-value stores;
• Bulk Synchronous Parallel.
We measure our algorithms in the real-world scenario of an insurance customer master data integration procedure. We show how the algorithms can be modified for non-Boolean fuzzy merge functions and similarity indexes.
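The core clustering step that such distributed variants implement can be sketched on a single machine as follows (the record layout and `match` predicate are hypothetical; a real deployment replaces the all-pairs loop with blocking or similarity indexes and distributes the union-find state):

```python
class UnionFind:
    """Disjoint sets with path halving; tracks which records co-refer."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def resolve(records, match):
    """Group record ids whose pairwise `match` predicate holds,
    taking the transitive closure of matches."""
    uf = UnionFind()
    ids = list(records)
    for i, a in enumerate(ids):
        uf.find(a)  # ensure singletons are registered
        for b in ids[i + 1:]:
            if match(records[a], records[b]):
                uf.union(a, b)
    clusters = {}
    for rid in ids:
        clusters.setdefault(uf.find(rid), []).append(rid)
    return list(clusters.values())

# hypothetical customer records matched on a shared attribute
records = {1: {"email": "x"}, 2: {"email": "x"}, 3: {"email": "y"}}
clusters = resolve(records, lambda a, b: a["email"] == b["email"])
```

The lower bounds mentioned above bite exactly here: if the records behind `records` are split across servers, deciding these matches forces essentially all of the data to be communicated.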
The paper "AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables" by M. A. Yosef, J. Hoffart, I. Bordino, M. Spaniol and G. Weikum has been accepted for the VLDB 2011 conference.
We present AIDA, a framework and online tool for entity detection and disambiguation. Given a natural-language text or a Web table, we map mentions of ambiguous names onto canonical entities like people or places, registered in a knowledge base like DBpedia, Freebase, or YAGO. AIDA is a robust framework centred around collective disambiguation exploiting the prominence of entities, similarity between the context of the mention and its candidates, and the coherence among candidate entities for all mentions. We have developed a Web-based online interface for AIDA where different formats of inputs can be processed on the fly, returning proper entities and showing intermediate steps of the disambiguation process.
VLDB 2011 homepage
The paper "Harvesting Facts from Textual Web Sources by Constrained Label Propagation" by Y. Wang, B. Yang, L. Qu, M. Spaniol and G. Weikum has been accepted for the CIKM 2011 conference.
There have been major advances on automatically constructing large knowledge bases by extracting relational facts from Web and text sources. However, the world is dynamic: periodic events such as sports competitions need to be interpreted with their respective timepoints, and facts such as coaching a sports team, holding political or business positions, and even marriages do not hold forever and should be augmented by their respective timespans. This paper addresses the problem of automatically harvesting temporal facts with such extended time-awareness. We employ pattern-based gathering techniques for fact candidates and construct a weighted pattern-candidate graph. Our key contribution is a new kind of label propagation algorithm with a judiciously designed loss function, which iteratively processes the graph to label good temporal facts for a given set of target relations. Our experiments with online news and Wikipedia articles demonstrate the accuracy of this method.
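A plain label-propagation iteration over a weighted pattern-candidate graph could look like the following sketch. This is standard propagation with seed clamping; the paper's algorithm uses a judiciously designed loss function and constraints that are not reproduced here:

```python
def propagate(graph, seeds, iters=20, alpha=0.85):
    """Iterative label propagation on a weighted undirected graph.
    graph: {node: {neighbor: weight}}; seeds: {node: score in [0, 1]}.
    Non-seed nodes absorb a damped, weight-normalized average of their
    neighbors' scores; seed nodes stay clamped to their labels."""
    scores = {node: seeds.get(node, 0.0) for node in graph}
    for _ in range(iters):
        new = {}
        for node, nbrs in graph.items():
            if node in seeds:
                new[node] = seeds[node]  # clamp labeled seeds
                continue
            total = sum(nbrs.values())
            if total == 0:
                new[node] = scores[node]
                continue
            new[node] = alpha * sum(w * scores.get(n, 0.0)
                                    for n, w in nbrs.items()) / total
        scores = new
    return scores

# hypothetical graph: a seeded fact candidate linked via a shared pattern
graph = {"fact1": {"pattern1": 1.0},
         "pattern1": {"fact1": 1.0, "fact2": 1.0},
         "fact2": {"pattern1": 1.0}}
scores = propagate(graph, seeds={"fact1": 1.0})
```

Confidence flows from the seeded candidate through the pattern to the new candidate, so `fact2` ends up with a positive but lower score than the pattern connecting it to the seed.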
CIKM 2011 homepage
The paper "Temporal Index Sharding for Space-Time Efficiency in Archive Search" by A. Anand, S. Bedathur, K. Berberich and R. Schenkel has been accepted for the ACM SIGIR Conference 2011.
Time-travel queries that couple temporal constraints with keyword queries are useful in searching large-scale archives of time-evolving content such as web archives or wikis. Typical approaches for efficient evaluation of these queries involve slicing either the entire collection or individual index lists along the time axis. Neither method is satisfactory, since they sacrifice index compactness for processing efficiency, making the index either too big or too slow.
We present a novel index organization scheme that shards each index list with almost zero increase in index size but still minimizes the cost of reading index entries during query processing. Based on the optimal sharding thus obtained, we develop a practically efficient sharding that takes into account the different costs of random and sequential accesses. Our algorithm merges shards from the optimal solution to allow for a few extra sequential accesses while gaining significantly by reducing the number of random accesses. We empirically establish the effectiveness of our sharding scheme with experiments over the revision history of the English Wikipedia between 2001-2005 (~ 700 GB) and an archive of U.K. governmental web sites (~ 400 GB). Our results demonstrate the feasibility of faster time-travel query processing with no space overhead.
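The access-cost trade-off behind the merging step can be caricatured in a few lines (the constants and the greedy criterion below are hypothetical simplifications; the actual algorithm starts from the optimal sharding and takes query workload into account):

```python
def merge_adjacent_shards(shard_sizes, c_rand, c_seq):
    """Greedy merge pass over time-ordered shard sizes (posting counts).
    Merging two adjacent shards saves one random access for a query
    touching both, at the price of sequentially scanning the extra
    entries; merge when the saved seek outweighs the extra scan.
    (A crude single-pass criterion; the real scheme is query-aware.)"""
    merged = [shard_sizes[0]]
    for size in shard_sizes[1:]:
        if c_rand > c_seq * size:
            merged[-1] += size   # absorb the small shard into its neighbor
        else:
            merged.append(size)  # keep as a separate shard
    return merged

# with a seek costing ~50x one sequential entry read, tiny shards vanish
layout = merge_adjacent_shards([10, 1, 1, 100], c_rand=50.0, c_seq=1.0)
```

The point of the caricature is the asymmetry: on disk-resident indexes, random accesses are so much costlier than sequential reads that trading a few extra sequential entries for fewer seeks pays off.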
A summary about the 1st Temporal Web Analytics Workshop (TWAW) has been published as part of the www2011 companion proceedings.
The paper “The 1st Temporal Web Analytics Workshop (TWAW)” by R. Baeza-Yates (Yahoo! Research Barcelona), J. Masanès (Internet Memory Foundation) and M. Spaniol (Max-Planck-Institut für Informatik) has been published as part of the www2011 companion proceedings. The workshop was co-organized by LAWA.
The objective of the 1st Temporal Web Analytics Workshop (TWAW) is to provide a venue for researchers of all domains (IE/IR, Web mining etc.) in which the temporal dimension opens up an entirely new range of challenges and possibilities. The workshop's ambition is to help shape a community of interest around the research challenges and possibilities resulting from the introduction of the time dimension in Web analysis. The maturity of the Web and the emergence of large-scale repositories of Web material make this very timely, and a growing set of research efforts and services that share this focus (e.g. Recorded Future and Truthy, launched just in the last months) are emerging. A dedicated workshop will, we believe, help foster a rich and cross-domain approach to this new research challenge with a strong focus on the temporal dimension.
The Proceedings of the 1st International Temporal Web Analytics Workshop (TWAW 2011) are online now.
The Proceedings of the 1st International Temporal Web Analytics Workshop (TWAW 2011) held in conjunction with the 20th International World Wide Web Conference (www2011) in Hyderabad, India on March 28, 2011 are online at: CEUR Workshop Proceedings Vol. 707. The workshop was co-organized by the LAWA project and chaired by R. Baeza-Yates (Yahoo! Research Barcelona), J. Masanès (Internet Memory Foundation) and M. Spaniol (Max-Planck-Institut für Informatik).
The paper "Scalable Spatio-temporal Knowledge Harvesting" by Yafang Wang, Bin Yang, Spyros Zoupanos, Marc Spaniol and Gerhard Weikum has been accepted for the WWW 2011 poster track.
Knowledge harvesting enables the automated construction of large knowledge bases. In this work, we made a first attempt to harvest the spatio-temporal knowledge from news archives to construct the trajectories of individual entities for spatio-temporal entity tracking.
Our approach consists of an entity extraction and disambiguation module and a fact generation module, which together produce pertinent trajectory records from textual sources. Evaluation on 20 years of the New York Times corpus showed that our methods are effective and scalable.
The joint vision paper “Longitudinal Analytics on Web Archive Data: It’s About Time!” by the LAWA consortium has been accepted at CIDR 2011.
Organizations like the Internet Archive have been capturing Web contents over decades, building up huge repositories of time-versioned pages. The timestamp annotations and the sheer volume of multi-modal content constitute a gold mine for analysts of all sorts, across different application areas, from political analysts and marketing agencies to academic researchers and product developers. In contrast to traditional data analytics on click logs, the focus is on longitudinal studies over very long horizons. This longitudinal aspect affects and concerns all data and metadata, from the content itself to the indices and the statistical metadata maintained for it. Moreover, advanced analysts prefer to deal with semantically rich entities like people, places, organizations, and ideally relationships such as company acquisitions, instead of, say, Web pages containing such references. For example, tracking and analyzing a politician’s public appearances over a decade is much harder than mining frequently used query words or frequently clicked URLs for the last month. The huge size of Web archives adds to the complexity of this daunting task. This paper discusses key challenges, which we intend to take up, posed by this kind of longitudinal analytics: time-travel indexing and querying, entity detection and tracking along the time axis, algorithms for advanced analyses and knowledge discovery, and scalability and platform issues.