Longitudinal Analytics
of Web Archive Data

 

LAWA 6th Newsletter

LAWA Partners are glad to present the sixth LAWA Newsletter

This edition focuses on Temporal Web Analytics in Action. In addition, we present the latest publications of the LAWA project.

Enjoy reading!

Release of Web Analytics Technology V2

LAWA has released its Web Analytics Technology V2.

Implementations have been driven by the overall aim of developing methods that support typical tasks in temporal Web analytics, such as:

  • fine-grained hierarchical type classification,
  • knowledge linking to online statistics,
  • n-gram statistics,
  • entity exploration and browsing,
  • trend analysis.

This software complements the initially released Web Analytics Technology V1. In combination, the two releases constitute the analytics part of LAWA’s Virtual Web Observatory. Backed by requirements monitoring within LAWA’s target user community, we believe that the developed software will make temporal Web analytics more understandable and explainable. To this end, all modules incorporate state-of-the-art information extraction technologies for Web content analytics. The software is available for download in our Software section.

The classification power of Web features

Miklos Erdelyi, Andras A. Benczur, Balint Daroczy, Andras Garzo, Tamas Kiss and David Siklosi have published a technical report on "The classification power of Web features".

In this paper we give a comprehensive overview of features devised for Web spam detection and investigate how much various feature classes, some requiring very high computational effort, add to the classification accuracy. We collect and handle a large number of features based on recent advances in Web spam filtering, including temporal features; in particular, we analyze the strength and sensitivity of linkage change. We show that machine learning techniques including ensemble selection, LogitBoost and Random Forest significantly improve accuracy.

Our result is a summary of Web spam filtering best practice, with a listing of various configurations depending on collection size, computational resources and quality needs. To foster research in the area, we make several feature sets and source code public (https://datamining.sztaki.hu/en/download/web-spam-resources), including the temporal features of eight .uk crawl snapshots that include WEBSPAM-UK2007, as well as the Web Spam Challenge features for the labeled part of ClueWeb09.

Technical Report

Skewed Key Spaces in Map Reduce

Lev Faerman has published a technical report on "Skewed Key Spaces in Map Reduce".

This paper discusses the effects of non-uniform key spaces (such as those created by processing English text) on load balancing in Hadoop. It demonstrates that a potential problem exists by examining the characteristics of the English language and their effect on reducer load, and then discusses a simple improvement to Hadoop partitioners that improves load balancing.
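
As a rough illustration of the kind of fix the report points toward (not its actual technique), the sketch below pins a few known heavy-hitter keys, e.g. very frequent English words, to partitions of their own, while the long tail keeps Hadoop’s default hash partitioning. The stopword list and the pinning policy are illustrative assumptions.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Hypothetical skew-aware partitioner: dedicated slots for heavy keys,
    // plain hash partitioning for everything else.
    public class SkewAwarePartitioner extends Partitioner<Text, LongWritable> {

        // In practice this list would be derived from a sample of the input.
        private static final List<String> HEAVY_HITTERS =
                Arrays.asList("the", "of", "and", "to", "a", "in");

        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {
            int idx = HEAVY_HITTERS.indexOf(key.toString());
            if (idx >= 0) {
                // Distinct heavy keys land on distinct reducers (as long as
                // enough reducers are configured), so they cannot pile up.
                return idx % numPartitions;
            }
            return (key.toString().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

Such a class would be enabled on a job with job.setPartitionerClass(SkewAwarePartitioner.class).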

Technical Report

Analyzing Virtualized Datacenter Hadoop Deployments

Aviad Pines has published a technical report on "Analyzing Virtualized Datacenter Hadoop Deployments".

This paper discusses the performance of Hadoop deployments in virtualized data centers such as Amazon EC2 and ElasticHosts, both when the Hadoop cluster is located in a single data center and when it spans multiple data centers in a cross-datacenter deployment. We analyze the impact of bandwidth between nodes on cluster performance.

Technical Report

Crowdsourced Entity Markup

The paper "Crowdsourced Entity Markup" by Lili Jiang, Yafang Wang, Johannes Hoffart and Gerhard Weikum has been accepted for the Workshop on Crowdsourcing the Semantic Web (CrowdSem 2013) in conjunction with ISWC 2013.

Entities, such as people, places, and products, exist in knowledge bases and linked data on the one hand, and in web pages, news articles, and social media on the other. Entity markup, like Named Entity Recognition and Disambiguation (NERD), is the essential means for adding semantic value to unstructured web contents, thereby enabling linkage between unstructured and structured data and knowledge collections. A major challenge in this endeavor lies in the dynamics of digital content about the world, with new entities emerging all the time. In this paper, we propose a crowdsourced framework for NERD, specifically addressing the challenge of emerging entities in social media. Our approach combines NERD techniques with the detection of entity alias names and with co-reference resolution in texts. We propose a linking-game based crowdsourcing system for this combined task, and we report on experimental insights with this approach and on lessons learned.

CrowdSem 2013 homepage

On the SPOT: Question Answering over Temporally Enhanced Structured Data

The paper "On the SPOT: Question Answering over Temporally Enhanced Structured Data" by Mohamed Yahya, Klaus Berberich, Maya Ramanath and Gerhard Weikum has been accepted for the Workshop on Time-aware Information Access (TAIA2013) in conjunction with SIGIR 2013.

Natural-language question answering is a convenient way for humans to discover relevant information in structured Web data such as knowledge bases or Linked Open Data sources. This paper focuses on data with a temporal dimension, and discusses the problem of mapping natural-language questions into extended SPARQL queries over RDF-structured data. We specifically address the issue of disambiguating temporal phrases in the question into temporal entities like dates and named events, and temporal predicates. For the situation where the data has only partial coverage of the time dimension but is augmented with textual descriptions of entities and facts, we also discuss how to generate queries that combine structured search with keyword conditions.

TAIA 2013 homepage

Temporal Diversification of Search Results

The paper "Temporal Diversification of Search Results" by Klaus Berberich and Srikanta Bedathur has been accepted for the Workshop on Time-aware Information Access (TAIA2013) in conjunction with SIGIR 2013.

We investigate the notion of temporal diversity, bringing together two recently active threads of research, namely temporal ranking and diversification of search results. A novel method is developed to determine search results consisting of documents that are relevant to the query and were published at diverse times of interest to the query. Preliminary experiments on twenty years’ worth of newspaper articles from The New York Times demonstrate characteristics of our method and compare it against two baselines.
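
The abstract does not spell out the method itself; as a rough illustration of the underlying idea, the following sketch greedily re-ranks candidates in the spirit of MMR, trading relevance against temporal distance from already selected documents. The trade-off weight and the day-based distance are arbitrary assumptions, not the paper’s model.

    import java.util.ArrayList;
    import java.util.List;

    // Greedy temporal diversification sketch: repeatedly pick the document
    // with the best mix of relevance and temporal novelty.
    public class TemporalDiversifier {

        public record Doc(String id, double relevance, long publicationDay) {}

        private static final double LAMBDA = 0.5; // relevance vs. diversity

        public static List<Doc> rerank(List<Doc> candidates, int k) {
            List<Doc> selected = new ArrayList<>();
            List<Doc> pool = new ArrayList<>(candidates);
            while (selected.size() < k && !pool.isEmpty()) {
                Doc best = null;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (Doc d : pool) {
                    double score = LAMBDA * d.relevance()
                            + (1 - LAMBDA) * minDayGap(d, selected);
                    if (score > bestScore) { bestScore = score; best = d; }
                }
                selected.add(best);
                pool.remove(best);
            }
            return selected;
        }

        // Normalized distance (in days, capped at one year) to the closest
        // already-selected document; maximal for the first pick.
        private static double minDayGap(Doc d, List<Doc> selected) {
            double min = 1.0;
            for (Doc s : selected) {
                long days = Math.abs(d.publicationDay() - s.publicationDay());
                min = Math.min(min, Math.min(days, 365) / 365.0);
            }
            return min;
        }
    }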

TAIA 2013 homepage

On Temporal Wikipedia search by edits and linkage

The paper "On Temporal Wikipedia search by edits and linkage" by Julianna Göbölös-Szabó and András Benczúr has been accepted for the Workshop on Time-aware Information Access (TAIA2013) in conjunction with SIGIR 2013.

We exploit the connectivity structure of edits in Wikipedia to identify recent events that happened at a given time by identifying bursty changes in linked articles around a specified date. Our key results include algorithms for node relevance ranking in temporal subgraphs and for neighborhood selection based on measurements of structural changes over time in the Wikipedia link graph. We evaluate our algorithms over manually annotated queries with relevant events in September and October 2011; we make the assessment publicly available (https://dms.sztaki.hu/en/download/wimmut-searching-and-navigating-wikipedia). While our methods were tested over clean Wikipedia metadata, we believe they are applicable to general temporal Web collections as well.

TAIA 2013 homepage

TempWeb 2013 Roundup

The 3rd Temporal Web Analytics Workshop (TempWeb 2013) was successfully staged in Rio de Janeiro, Brazil on May 13, 2013.

On May 13, 2013 the LAWA consortium successfully staged the 3rd Temporal Web Analytics Workshop (TempWeb 2013) in Rio de Janeiro, Brazil. Again, the workshop was organized in conjunction with the international World Wide Web conference. The workshop attracted around 40 participants throughout the entire day.

After a short introduction, the workshop began with an exciting keynote by Omar Alonso (Microsoft Bing, USA) on “Stuff happens continuously: exploring Web contents with temporal information”. The talk covered the entire spectrum of temporal Web analytics, including time in document collections, social data, and exploring the Web using time. The keynote again demonstrated the relevance of the topic and its perfect alignment with the World Wide Web conference.

TempWeb 2013 Keynote

The scientific presentations were then separated into three sessions (papers are available from the WWW Companion volume published by ACM):

Web Archiving
Miguel Costa, Daniel Gomes and Mário J. Silva: “A Survey of Web Archive Search Architectures”
Ahmed Alsum, Michael L. Nelson, Robert Sanderson and Herbert Van de Sompel: “Archival HTTP Redirection Retrieval Policies”
Daniel Gomes, Miguel Costa, David Cruz, João Miranda and Simão Fontes: “Creating a Billion-Scale Searchable Web Archive”

Identifying and leveraging time information
Julia Kiseleva, Hoang Thanh Lam, Mykola Pechenizkiy and Toon Calders: “Predicting temporal hidden contexts in web sessions”
Hany Salaheldeen and Michael Nelson: “Carbon Dating The Web: Estimating the Age of Web Resources”
Omar Alonso and Kyle Shiells: “Timelines as Summaries of Popular Scheduled Events”

TempWeb 2013 Audience

Lucas Miranda, Rodrygo Santos and Alberto Laender: “Characterizing Video Access Patterns in Mainstream Media Portals”
Laura Elisa Celis, Koustuv Dasgupta and Vaibhav Rajan: “Adaptive Crowdsourcing for Temporal Crowds”
Hideo Joho, Adam Jatowt and Roi Blanco: “A Survey of Temporal Web Search Experience”

All talks were of high quality and the discussions were lively. The workshop’s third edition made clear that LAWA covers a hot topic, one worth investigating in conjunction with the World Wide Web conference series. As a next step, we are planning a special issue to be published in a journal. So, stay tuned!

HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text

The paper "HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text" by Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol and Gerhard Weikum has been accepted for the ACL 2013 demo track.

Recent research has shown progress in achieving high-quality, very fine-grained type classification in hierarchical taxonomies. Within such a multi-level type hierarchy with several hundreds of types at different levels, many entities naturally belong to multiple types. In order to achieve high precision in type classification, current approaches are either limited to certain domains or require time-consuming multi-stage computations. As a consequence, existing systems are incapable of performing ad-hoc type classification on arbitrary input texts. In this demo, we present a novel Web-based tool that performs domain-independent entity type classification under real-time conditions. Due to its efficient implementation and compacted feature representation, the system processes text inputs on the fly while achieving precision as high as leading state-of-the-art implementations. Our system offers an online interface where natural-language text can be inserted, returning lexical type labels for entity mentions. Furthermore, the user interface allows users to explore the types assigned to text mentions by visualizing and navigating along the type hierarchy.

ACL 2013 homepage

The 3rd Temporal Web Analytics Workshop (TempWeb 2013)

A summary about the 3rd Temporal Web Analytics Workshop (TempWeb 2013) has been published as part of the workshop proceedings.

Preface
Time is a key dimension to understand the Web. It is fair to say that it has not yet received all the attention it deserves, and TempWeb is an attempt to help remedy this situation by putting time at the center of its reflection.
Studying time in this context actually covers a large spectrum, from dating methodology to extraction of temporal information and knowledge, from diachronic studies to the design of infrastructural and experimental settings enabling a proper observation of this dimension.
For its third edition, TempWeb includes 9 papers out of a total of 18 papers submitted. The quality of papers has constantly improved, so that we have been “forced” to accept every second paper submitted to the third edition. We like to interpret paper quality and slightly increased submission figures as a clear sign of a positive dynamic in the study of time in the scope of the Web and an indication of the relevance of this effort. The workshop proceedings are published by ACM DL as part of the WWW 2013 Companion Publication.
We hope you will find in these papers, the keynote, and the discussions and exchanges of this edition of TempWeb some motivation to look more into this important aspect of the Web.
TempWeb 2013 was jointly organized by Internet Memory Foundation (Paris, France), the Max-Planck-Institut für Informatik (Saarbrücken, Germany) and Yahoo! Research Barcelona (Barcelona, Spain), and supported by the 7th Framework IST programme of the European Union through the focused research project (STREP) on Longitudinal Analytics of Web Archive data (LAWA) under contract no. 258105.

pdf-file

TempWeb 2013 Proceedings

The Proceedings of the 3rd International Temporal Web Analytics Workshop (TempWeb 2013) are online now.

The Proceedings of the 3rd International Temporal Web Analytics Workshop (TempWeb 2013) held in conjunction with the 22nd International World Wide Web Conference (www2013) in Rio de Janeiro, Brazil on May 13, 2013 are online as: WWW Companion Volume. The workshop was co-organized by the LAWA project and chaired by R. Baeza-Yates (Yahoo! Research Barcelona), J. Masanès (Internet Memory Foundation) and M. Spaniol (Max-Planck-Institut für Informatik).

Knowledge Linking for Online Statistics

The paper "Knowledge Linking for Online Statistics" by Marc Spaniol, Natalia Prytkova and Gerhard Weikum will be presented at the 59th World Statistics Congress (WSC) in the Special Topic Session (STS) on "The potential of Internet, big data and organic data for official statistics".

The LAWA project investigates large-scale Web (archive) data along the temporal dimension. As a use case, we are studying Knowledge Linking for Online Statistics.

Statistics portals such as eurostat’s “Statistics Explained” (http://epp.eurostat.ec.europa.eu/statistics_explained/index.php/Main_Page) provide a wealth of articles constituting an encyclopedia of European statistics. Together with its statistical glossary, the huge amount of numerical data comes with a well-defined thesaurus. However, this data is not directly at hand when browsing Web content covering the same topics. For instance, when reading news articles about the debate on renewable energy across Europe after the earthquake in Japan and the Fukushima accident, one would ideally be able to ground these discussions in statistical evidence.

We believe that Internet contents, captured in Web archives and reflected and aggregated in the Wikipedia history, can be better understood when linked with online statistics. To this end, we aim at semantically enriching and analyzing Web (archive) data to narrow and ultimately bridge the gap between numerical statistics and textual media like news or online forums. The missing link and key to this goal is the discovery and analysis of entities and events in Web (archive) contents. This way, we can enrich Web pages, e.g. by a browser plug-in, with links to relevant statistics (e.g. eurostat pages). Raising data analytics to the entity-level also enables understanding the impact of societal events and their perception in different cultures and economies.

WSC 2013 homepage

Mind the Gap: Large-Scale Frequent Sequence Mining

The paper "Mind the Gap: Large-Scale Frequent Sequence Mining" by Iris Miliaraki, Klaus Berberich, Rainer Gemulla and Spyros Zoupanos has been accepted for presentation at SIGMOD 2013.

Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose PFSM, a scalable algorithm for frequent sequence mining on MapReduce. PFSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. At its heart, PFSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of w-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our extensive experimental study in the context of text mining suggests that PFSM is significantly more efficient and scalable than alternative approaches.
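
To make the notion of a gap constraint concrete, here is a minimal sketch of the underlying matching primitive: testing whether a pattern occurs in an input sequence with at most maxGap items skipped between consecutive pattern elements. This is only the basic building block, not the PFSM algorithm itself.

    // Does `pattern` occur in `transaction` as a subsequence with at most
    // `maxGap` items skipped between consecutive matched elements?
    public final class GapMatcher {

        public static boolean occursWithGap(String[] transaction, String[] pattern,
                                            int maxGap) {
            return matchFrom(transaction, pattern, 0, 0, maxGap, true);
        }

        private static boolean matchFrom(String[] t, String[] p, int ti, int pi,
                                         int maxGap, boolean first) {
            if (pi == p.length) return true;     // all pattern items matched
            // The first item may start anywhere; later items must respect the gap.
            int limit = first ? t.length : Math.min(t.length, ti + maxGap + 1);
            for (int i = ti; i < limit; i++) {
                if (t[i].equals(p[pi])
                        && matchFrom(t, p, i + 1, pi + 1, maxGap, false)) {
                    return true;
                }
            }
            return false;
        }
    }

For instance, with maxGap = 1 the pattern [a, b] occurs in [a, x, b], but with maxGap = 0 it does not.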

SIGMOD 2013 homepage

Cross-lingual web spam classification

The paper "Cross-lingual web spam classification" by András Garzó, Bálint Daróczy, Tamás Kiss, Dávid Siklósi and András Benczúr has been accepted for the 3rd Joint WICOW/AIRWeb Workshop on Web Quality (WebQuality 2013) in conjunction with WWW 2013.

While English-language training data exists for several Web classification tasks, most notably for Web spam, we face an expensive human labeling procedure if we want to classify a Web domain in a language other than English. We overview how existing content- and link-based classification techniques work, how models can be “translated” from English into another language, and how language-dependent and language-independent methods combine. Our experiments are conducted on the ClueWeb09 corpus as the English training collection and a large Portuguese crawl from the Portuguese Web Archive. To foster further research, we provide labels and precomputed values of term frequencies and content- and link-based features for both ClueWeb09 and the Portuguese data.

WICOW/AIRWeb 2013 Homepage

LAWA 5th Newsletter

LAWA Partners are glad to present the fifth LAWA Newsletter

This edition focuses on indexing Big Data. In addition, we present the latest publications of the LAWA project.

Enjoy reading!

Predicting Search Engine Switching in WSCD 2013 Challenge

The paper "Predicting Search Engine Switching in WSCD 2013 Challenge" by Qiang Yan, Xingxing Wang, Qiang Xu, Dongying Kong, Danny Bickson, Quan Yuan, and Qing Yang has been accepted for presentation at the Workshop on Web Search Click Data 2013 (WSCD2013) in conjunction with WSDM 2013.

How to accurately predict search-engine switching behavior is an important but challenging problem. This paper describes the solution of the GraphLab team, which achieved 4th place in the WSCD 2013 Search Engine Switch Detect contest sponsored by Yandex. There are three core steps in our solution: feature extraction, prediction, and model ensembling. First, we extract features related to result quality, user preference and search behavior sequence patterns from user actions, query logs, and sequence patterns of click-streams. Second, models like Online Bayesian Probit Regression (BPR), Online Bayesian Matrix Factorization (BMF), Support Vector Regression (SVR), Logistic Regression (LR) and Factorization Machine Model (FM) are exploited based on the previous features. Finally, we propose a two-step ensemble method to blend our individual models in order to fully exploit the dataset and obtain more accurate results based on the local and public test datasets. Our final solution achieves 0.8439 AUC on the public leaderboard and 0.8432 AUC on the private test set.

WSCD2013 homepage

User-Defined Redundancy in Web Archives

The paper "User-Defined Redundancy in Web Archives" by Bibek Paudel, Avishek Anand, and Klaus Berberich has been accepted for the Workshop on Large-Scale and Distributed Systems for Information Retrieval (LSDS-IR) in conjunction with WSDM 2013.

Web archives are valuable resources. However, they are characterized by a high degree of redundancy. Not only does this redundancy waste computing resources, but it also degrades the user experience, since users have to sift through and weed out redundant content. Existing methods focus on identifying near-duplicate documents, assuming a universal notion of redundancy, and thus cannot adapt to user-specific requirements such as a preference for more recent or diversely opinionated content.

In this work, we propose an approach that equips users with fine-grained control over what they consider redundant. Users thus specify a binary coverage relation between documents that can factor in documents’ contents as well as their metadata. Our approach then determines a minimum-cardinality cover set of non-redundant documents. We describe how this can be done at scale using MapReduce as a platform for distributed data processing. Our prototype implementation has been deployed on a real-world web archive and we report experiences from this case study.
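
The selection step can be pictured with a small sequential sketch (the paper does this at scale in MapReduce): given a user-defined coverage relation mapping each document to the set of documents it covers (including itself), greedily pick a small cover set. Greedy set cover is only logarithmically approximate, so this illustrates the idea rather than reproducing the paper’s minimum-cardinality computation.

    import java.util.HashSet;
    import java.util.LinkedHashSet;
    import java.util.Map;
    import java.util.Set;

    // Greedy cover selection over a user-defined coverage relation.
    public class CoverSelector {

        // covers: document id -> ids of documents it covers (itself included).
        public static Set<String> greedyCover(Map<String, Set<String>> covers) {
            Set<String> uncovered = new HashSet<>(covers.keySet());
            Set<String> chosen = new LinkedHashSet<>();
            while (!uncovered.isEmpty()) {
                String best = null;
                int bestGain = -1;
                for (Map.Entry<String, Set<String>> e : covers.entrySet()) {
                    Set<String> gain = new HashSet<>(e.getValue());
                    gain.retainAll(uncovered);          // newly covered documents
                    if (gain.size() > bestGain) {
                        bestGain = gain.size();
                        best = e.getKey();
                    }
                }
                chosen.add(best);
                uncovered.removeAll(covers.get(best));
            }
            return chosen;
        }
    }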

LSDS-IR homepage

Computing n-Gram Statistics in MapReduce

The paper "Computing n-Gram Statistics in MapReduce" by Klaus Berberich and Srikanta Bedathur has been accepted for EDBT 2013.

Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and minimum collection frequency, can be computed efficiently harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method, Suffix-sigma, that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete MapReduce implementation, we provide insights on an efficient implementation of the methods. Extensive experiments on The New York Times Annotated Corpus and ClueWeb09 expose the relative benefits and trade-offs of the methods.
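
The word-counting baseline that the paper extends is easy to sketch in Hadoop: the mapper emits every n-gram up to a maximum length, and the reducer sums counts and applies the minimum collection frequency. This is only the naive starting point discussed in the abstract, not the Suffix-sigma method; the length limit and frequency threshold below are illustrative.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class NGramCount {

        public static class NGramMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final int MAX_N = 3;             // illustrative limit
            private static final LongWritable ONE = new LongWritable(1);

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] tokens = line.toString().toLowerCase().split("\\s+");
                for (int i = 0; i < tokens.length; i++) {
                    StringBuilder gram = new StringBuilder();
                    for (int n = 1; n <= MAX_N && i + n <= tokens.length; n++) {
                        if (n > 1) gram.append(' ');
                        gram.append(tokens[i + n - 1]);
                        ctx.write(new Text(gram.toString()), ONE);  // emit each n-gram
                    }
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            private static final long MIN_FREQ = 5;         // illustrative threshold

            @Override
            protected void reduce(Text gram, Iterable<LongWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable c : counts) sum += c.get();
                if (sum >= MIN_FREQ) ctx.write(gram, new LongWritable(sum));
            }
        }
    }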

EDBT 2013 homepage

Temporal Web Analytics Workshop at WWW2013

LAWA helps staging TempWeb 2013 in conjunction with the World Wide Web conference in Rio de Janeiro, Brazil on May 13, 2013.

***************************************************************
          CALL FOR PAPERS
***************************************************************

3rd Temporal Web Analytics Workshop (TempWeb 2013)
in conjunction with WWW 2013
May 13, 2013, Rio de Janeiro, Brazil
http://www.temporalweb.net/

***************************************************************
      Proceedings published by ACM
***************************************************************

***************************************************************
Keynote by Omar Alonso (Microsoft Bing, USA)
“Stuff happens continuously: exploring Web contents
with temporal information”
***************************************************************

Objectives:
The objective of this workshop is to provide a venue for researchers of all domains (IE/IR, Web mining, etc.) where the temporal dimension opens up an entirely new range of challenges and possibilities. The workshop’s ambition is to help shape a community of interest around the research challenges and possibilities resulting from the introduction of the time dimension into Web analysis.

TempWeb focuses on temporal data analysis along the time dimension for Web data that has been collected over extended time periods. A major challenge in this regard is the sheer size of the data and the difficulty of making sense of it in a useful and meaningful manner for its users. Web-scale data analytics therefore needs infrastructures and extended analytical tools. TempWeb will take place on May 13, 2013 in conjunction with the International World Wide Web Conference in Rio de Janeiro, Brazil.

Workshop topics of TempWeb therefore include, but are not limited to, the following:
- Web scale data analytics
- Temporal Web analytics
- Distributed data analytics
- Web science
- Web dynamics
- Data quality metrics
- Web spam evolution
- Content evolution on the Web
- Systematic exploitation of Web archives
- Large scale data storage
- Large scale data processing
- Time aware Web archiving
- Data aggregation
- Web trends
- Topic mining
- Terminology evolution
- Community detection and evolution

Important Dates:
- Paper submission deadline: February 22, 2013
- Notification of acceptance: March 11, 2013
- Camera-ready copy deadline: April 3, 2013
- Workshop: May 13, 2013

Please submit your paper (up to 8 pages) using the ACM template:
http://www.acm.org/sigs/publications/proceedings-templates
at:
https://www.easychair.org/account/signin.cgi?conf=tempweb2013

Workshop Officials

PC-Chairs and Organizers:
Julien Masanès (Internet Memory Foundation, France and Netherlands)
Marc Spaniol (Max Planck Institute for Informatics, Germany)
Ricardo Baeza-Yates (Yahoo! Research, Spain)

Program Committee:
Eytan Adar (University of Michigan, USA)
Omar Alonso (Microsoft Bing, USA)
Ralitsa Angelova (Google, Switzerland)
Srikanta Bedathur (IIIT-Delhi, India)
Andras A. Benczur (Hungarian Academy of Sciences, Hungary)
Klaus Berberich (Max-Planck-Institut für Informatik, Germany)
Roi Blanco (Yahoo! Research, Spain)
Philipp Cimiano (University of Bielefeld, Germany)
Renata Galante (Universidade Federal do Rio Grande do Sul, Brazil)
Adam Jatowt (Kyoto University, Japan)
Scott Kirkpatrick (Hebrew University Jerusalem, Israel)
Frank McCown (Harding University, USA)
Michael Nelson (Old Dominion University, USA)
Kjetil Norvag (Norwegian University of Science and Technology, Norway)
Nikos Ntarmos (University of Patras, Greece)
Philippe Rigaux (Mignify, France)
Thomas Risse (L3S Research Center, Germany)
Rodrygo Luis Teodoro Santos (University of Glasgow, UK)
Torsten Suel (NYU Polytechnic, USA)
Masashi Toyoda (Tokyo University, Japan)
Gerhard Weikum (Max-Planck-Institut für Informatik, Germany)

Feedback of the 3rd User Workshop - Paris, November 13, 2012

The 3rd LAWA User Workshop: Big-Data Analytics for the Temporal Web was held on November 13, 2012 at the Conservatoire National des Arts et Métiers, CNAM (Paris, France)

The workshop was organized as a one-day event. About 50 researchers attended. Presentations were given by the LAWA project team and the participating guest researchers. Topics included methods, tools, and platforms for big-data analytics, including requirements on and experiences with such technologies.

Keynotes were presented by:

- Ricardo Baeza-Yates (Yahoo! Research, Barcelona): “Time in Web IR”
- Wolfgang Nejdl (L3S Research Center, Hanover): “Web Science, Web Analytics and Web Archives - Humans in the Loop”

Ricardo Baeza-Yates  Wolfgang Nejdl

Guest presentations covered a wide spectrum of big-data analytics, such as:

- Frédéric Plissonneau (Technicolor): “Incremental collection of data from specific Web sites: A Cinema dedicated use-case”
- Robert Fischer (SWR): “The SWR/ARD Webarchive and the goals of ARCOMEM”
- Linnet Taylor (Oxford Internet Institute): “Accessing and Using Big Data to Advance Social Science Knowledge”
- Gaël Dias (University of Caen Basse-Normandie): “Temporal Disambiguation of Timely Implicit Queries”
- Marie Guégan (Technicolor): “Like You Said, “This Movie Rocks!” Extracting Post Quotes for Social Network Analysis”
- Hugo C. Huurdeman (University of Amsterdam): “Introducing the WebART project: Web Archive Retrieval Tools”
- Zeynep Pehlivan (University Pierre and Marie Curie Paris): “Temporal Static Index Pruning”
- Gérard Dupont (CASSIDIAN [an EADS company]): “An overview of the OSINT challenges”

The workshop was highly interactive. Conversations with the participating guests showed the great potential of the topics presented. Apart from the explicit Q&A sessions after each presentation, many lively discussions continued during the breaks and the social event. Moreover, the consortium was able to present the first building blocks for temporal analytics in the Virtual Web Observatory. From the discussions throughout the workshop, entity-driven analytics again emerged as the focal point. It turned out that the next generation of analytics tools should go beyond plain text, helping users trace entities over time.

Interval Indexing and Querying on Key-Value Cloud Stores

The paper "Interval Indexing and Querying on Key-Value Cloud Stores" by George Sfakianakis, Ioannis Patlakas, Nikos Ntarmos and Peter Triantafillou has been accepted for ICDE 2013.

Cloud key-value stores are becoming increasingly important. Challenging applications, requiring efficient and scalable access to massive data, arise every day. We focus on supporting interval queries (which are prevalent in several data-intensive applications, such as temporal querying for temporal analytics), for which an efficient solution has been lacking. We contribute a compound interval index structure comprising two tiers: (i) the MRSegmentTree (MRST), a key-value representation of the segment tree, and (ii) the Endpoints Index (EPI), a column-family index that stores information about interval endpoints. In addition to the above, our contributions include: (i) algorithms for efficiently constructing and populating our indices using MapReduce jobs, (ii) techniques for efficient and scalable index maintenance, and (iii) algorithms for processing interval queries. We have implemented all algorithms using HBase and Hadoop, and conducted a detailed performance evaluation. We quantify the costs associated with the construction of the indices, and evaluate our query processing algorithms using queries on real data sets. We compare the performance of our approach to two alternatives: the native support for interval queries provided in HBase, and the execution of such queries using the Hive query execution tool. Our results show a significant speedup, far outperforming the state of the art.
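
To give a flavor of why a sorted endpoint layout suits a key-value store, the toy sketch below keeps intervals sorted by their start point (standing in for HBase row-key order) and answers a stabbing query by scanning only rows whose start lies at or before the query point. It deliberately ignores the MRST tier and the paper’s pruning, and is not the EPI itself.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    public class EndpointIndexSketch {

        public record Interval(String id, long begin, long end) {}

        // begin point -> intervals starting there (mimics sorted row keys).
        private final TreeMap<Long, List<Interval>> byBegin = new TreeMap<>();

        public void insert(Interval iv) {
            byBegin.computeIfAbsent(iv.begin(), k -> new ArrayList<>()).add(iv);
        }

        // All intervals containing time t: candidates start at or before t,
        // and the end point filters out those that finished too early.
        public List<Interval> stab(long t) {
            List<Interval> result = new ArrayList<>();
            for (List<Interval> bucket : byBegin.headMap(t, true).values()) {
                for (Interval iv : bucket) {
                    if (iv.end() >= t) result.add(iv);
                }
            }
            return result;
        }
    }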

ICDE 2013 homepage

HYENA: Hierarchical Type Classification for Entity Names

The paper "HYENA: Hierarchical Type Classification for Entity Names" by Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol and Gerhard Weikum has been accepted for COLING 2012.

Inferring lexical type labels for entity mentions in texts is an important asset for NLP tasks like semantic role labeling and named entity disambiguation (NED). Prior work has focused on flat and relatively small type systems where most entities belong to exactly one type. This paper addresses very fine-grained types organized in a hierarchical taxonomy, with several hundreds of types at different levels. We present HYENA for multi-label hierarchical classification. HYENA exploits gazetteer features and accounts for the joint evidence for types at different levels. Experiments and an extrinsic study on NED demonstrate the practical viability of HYENA.

COLING 2012 homepage

Big-Data Analytics for the Temporal Web (Paris, November 13, 2012)

International Workshop on Big-Data Analytics for the Temporal Web, Paris, November 13, 2012. Keynotes by: Ricardo Baeza-Yates (Yahoo! Research, Barcelona) and Wolfgang Nejdl (L3S Research Center, Hanover).

The LAWA project organizes a one-day workshop with researchers using (or planning to use) the Web as a corpus for their studies. The focus is on methods, tools, and platforms for big-data analytics, including requirements on and experiences with such technologies. Topics of interest include but are not limited to: Web dynamics, history, and archives; text mining and content classification; temporal/longitudinal studies; scalable methods (e.g., cloud-based map-reduce); large-scale data storage; community detection and evolution.

The workshop will have presentations by participating researchers and big-data users, including the LAWA project team. Emphasis will be on experience-sharing and discussing mutual interests in big-data analytics for the temporal Web. The workshop is free of charge and open to the public, but registration by email to the organizers is compulsory.

Agenda (updated)

Venue
Conservatoire National des Arts et Métiers (CNAM)
2, rue Conté, Paris, 3rd arrondissement.
Room: 37.1.50

Directions: enter the courtyard, find access 37; take the staircase to first floor.

CNAM

LAWA 4th Newsletter

LAWA Partners are glad to present the fourth LAWA Newsletter

This edition focuses on the Virtual Web Observatory (VWO). In addition, selected applications are available for testing.

Enjoy reading!

Click here for a live demo!

PRAVDA-live: Interactive Knowledge Harvesting

The paper "PRAVDA-live: Interactive Knowledge Harvesting" by Yafang Wang, Maximilian Dylla, Zhaouchun Ren, Marc Spaniol and Gerhard Weikum has been accepted for the CIKM 2012 demo session.

Acquiring high-quality (temporal) facts for knowledge bases is a labor-intensive process. Although there has been recent progress in the area of semi-supervised fact extraction, these approaches still have limitations, including a restricted corpus, a fixed set of relations to be extracted or a lack of assessment capabilities. In this paper we introduce PRAVDA-live, a framework that overcomes these limitations and supports the entire pipeline of interactive knowledge harvesting. To this end, our demo exhibits temporal fact extraction from ad-hoc corpus creation, via relation specification, labeling and assessment all the way to ready-to-use RDF exports.

CIKM 2012 homepage

RADAR service updated

A new version of RADAR has been released.

We have updated our webservice to address a bug and to make the API more convenient. Regarding the API changes, please check the included Example Java class and the README in the zip-file provided in the software section. The web service will still work with the old API, but we encourage you to use the new one as it is cleaner. The major new feature is that you can get a ranked list of entities with the PriorOnlyDisambiguationSettings and the LocalDisambiguationSettings.

We also fixed a bug that prevented the graph algorithm from being executed properly, which resulted in the same results as a disambiguation with only prior+local similarity. If you ran any experiments to judge the quality of RADAR, please run them again.

LINDA: Distributed Web-of-Data-Scale Entity Matching

The paper "LINDA: Distributed Web-of-Data-Scale Entity Matching" by Christoph Böhm, Gerard de Melo, Felix Naumann and Gerhard Weikum has been accepted for CIKM 2012.

Linked Data has emerged as a powerful way of interconnecting structured data on the Web. However, the crosslinkage between Linked Data sources is not as extensive as one would hope for. In this paper, we formalize the task of automatically creating “sameAs” links across data sources in a globally consistent manner. Our algorithm, presented in a multi-core as well as a distributed version, achieves this link generation by accounting for joint evidence of a match. Experiments confirm that our system scales beyond 100 million entities and delivers highly accurate results despite the vast heterogeneity and daunting scale.

CIKM 2012 homepage

KORE: Keyphrase Overlap Relatedness for Entity Disambiguation

The paper "KORE: Keyphrase Overlap Relatedness for Entity Disambiguation" by Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald and Gerhard Weikum has been accepted for CIKM 2012.

Measuring the semantic relatedness between two entities is the basis for numerous tasks in IR, NLP, and Web-based knowledge extraction. This paper focuses on disambiguating names in a Web or text document by jointly mapping all names onto semantically related entities registered in a knowledge base. To this end, we have developed a novel notion of semantic relatedness between two entities represented as sets of weighted (multi-word) keyphrases, with consideration of partially overlapping phrases. This measure improves the quality of prior link-based models, and also eliminates the need for (usually Wikipedia-centric) explicit interlinkage between entities. Thus, our method is more versatile and can cope with long-tail and newly emerging entities that have few or no links associated with them. For efficiency, we have developed approximation techniques based on min-hash sketches and locality-sensitive hashing. Our experiments on semantic relatedness and on named entity disambiguation demonstrate the superiority of our method compared to state-of-the-art baselines.
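
Min-hash sketches, which the abstract mentions as the basis of the approximation techniques, can be illustrated with the textbook construction below: each keyphrase set is reduced to a fixed-size signature, and the fraction of agreeing signature components estimates the sets’ Jaccard overlap. The paper’s actual sketches additionally account for phrase weights and partial phrase overlap, which this sketch does not.

    import java.util.Arrays;
    import java.util.Random;
    import java.util.Set;

    public class MinHash {

        private final int[] seedsA;
        private final int[] seedsB;

        public MinHash(int signatureSize, long randomSeed) {
            Random rnd = new Random(randomSeed);
            seedsA = new int[signatureSize];
            seedsB = new int[signatureSize];
            for (int i = 0; i < signatureSize; i++) {
                seedsA[i] = rnd.nextInt() | 1;   // odd multiplier
                seedsB[i] = rnd.nextInt();
            }
        }

        // Component i keeps the minimum of hash function i over all phrases.
        public int[] signature(Set<String> keyphrases) {
            int[] sig = new int[seedsA.length];
            Arrays.fill(sig, Integer.MAX_VALUE);
            for (String phrase : keyphrases) {
                int h = phrase.hashCode();
                for (int i = 0; i < sig.length; i++) {
                    int hv = seedsA[i] * h + seedsB[i];
                    if (hv < sig[i]) sig[i] = hv;
                }
            }
            return sig;
        }

        // Fraction of agreeing components estimates the Jaccard similarity.
        public static double estimateJaccard(int[] s1, int[] s2) {
            int eq = 0;
            for (int i = 0; i < s1.length; i++) if (s1[i] == s2[i]) eq++;
            return (double) eq / s1.length;
        }
    }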

CIKM 2012 homepage

Cross-Lingual Data Quality for Knowledge Base Acceleration across Wikipedia Editions

The paper "Cross-Lingual Data Quality for Knowledge Base Acceleration across Wikipedia Editions" by Julianna Göbölös-Szabó, Natalia Prytkova, Marc Spaniol and Gerhard Weikum has been accepted for QDB 2012.

Knowledge-sharing communities like Wikipedia and knowledge bases like Freebase are expected to capture the latest facts about the real world. However, neither of these can keep pace with the rate at which events happen and new knowledge is reported in news and social media. To narrow this gap, we propose an approach to accelerate the online maintenance of knowledge bases.
Our method, coined LAIKA, is based on link prediction. Wikipedia editions in different languages, Wikinews, and other news media come with extensive but noisy interlinkage at the entity level. We utilize this input for recommending, for a given Wikipedia article or knowledge-base entry, new categories, related entities, and cross-lingual interwiki links. LAIKA constructs a large graph from the available input and uses link-overlap measures and random-walk techniques to generate missing links and rank them for recommendations. Experiments with a very large graph from multilingual Wikipedia editions demonstrate the accuracy of our link predictions.

QDB 2012 homepage

The Virtual Web Observatory is open now!

The Virtual Web Observatory showcases a selection of applications developed in LAWA that support experimentally driven analytics on Internet data. Each application presented here combines the work of several partners; these applications will be integrated into a unified application at the end of the project (August 2013).

Key components of the VWO include:
- DAIM: an analytic platform operating at Web scale
- Wimmut: a temporal Web linkage search-and-browse application
- Radar: entity resolution and disambiguation in text documents

Come in and have a look!

Index Maintenance for Time-Travel Text Search

The paper "Index Maintenance for Time-Travel Text Search" by A. Anand, S. Bedathur, K. Berberich and R. Schenkel has been accepted for SIGIR 2012.

Time-travel text search enriches standard text search by temporal predicates, so that users of web archives can easily retrieve document versions that are considered relevant to a given keyword query and existed during a given time interval. Different index structures have been proposed to efficiently support time-travel text search. None of them, however, can easily be updated as the Web evolves and new document versions are added to the web archive.
In this work, we describe a novel index structure that efficiently supports time-travel text search and can be maintained incrementally as new document versions are added to the web archive. Our solution uses a sharded index organization, bounds the number of spuriously read index entries per shard, and can be maintained using small in-memory buffers and append-only operations. We present experiments on two large-scale real-world datasets demonstrating that maintaining our novel index structure is an order of magnitude more efficient than periodically rebuilding one of the existing index structures, while query-processing performance is not adversely affected.
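
The core idea of time-travel text search is easy to picture: each posting carries the validity interval of one document version, and a query keeps only postings whose interval overlaps the query interval. The minimal in-memory sketch below shows just this filtering step; it deliberately omits the sharded organization, the bound on spuriously read entries, and the append-only maintenance that the paper contributes.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TimeTravelIndexSketch {

        public record Posting(String docId, long validFrom, long validTo, double score) {}

        private final Map<String, List<Posting>> postings = new HashMap<>();

        public void add(String term, Posting p) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(p);
        }

        // Versions matching `term` that existed during [qBegin, qEnd].
        public List<Posting> query(String term, long qBegin, long qEnd) {
            List<Posting> result = new ArrayList<>();
            for (Posting p : postings.getOrDefault(term, List.of())) {
                if (p.validFrom() <= qEnd && p.validTo() >= qBegin) {
                    result.add(p);   // validity interval overlaps query interval
                }
            }
            return result;
        }
    }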

SIGIR 2012 homepage

Natural Language Questions for the Web of Data

The paper "Natural Language Questions for the Web of Data" by Mohamed Yahya, Klaus Berberich, Shady Elbassuoni Maya Ramanath, Volker Tresp and Gerhard Weikum has been accepted for EMNLP 2012.

The Linked Data initiative comprises structured databases in the Semantic-Web data model RDF. Exploring this heterogeneous data by structured query languages is tedious and error-prone even for skilled users. To ease the task, this paper presents a methodology for translating natural language questions into structured SPARQL queries over linked-data sources.

Our method is based on an integer linear program to solve several disambiguation tasks jointly: the segmentation of questions into phrases; the mapping of phrases to semantic entities, classes, and relations; and the construction of SPARQL triple patterns. Our solution harnesses the rich type system provided by knowledge bases in the web of linked data, to constrain our semantic-coherence objective function. We present experiments on both the question translation and the resulting query answering.

EMNLP 2012 homepage

WAC Summer Workshop

Ongoing research in LAWA will be presented at the WAC (Web Archive Cooperative) Summer Workshop June 29 - July 1, 2012 at Stanford University in Palo Alto, CA, USA.

The Web Archive Cooperative (WAC) organizes a Summer Workshop on “Challenges in Providing Access to the World’s Web Archives” from June 29 - July 1, 2012 at Stanford University in Palo Alto, CA, USA. The LAWA project is proud to contribute to this outstanding event by giving a presentation on its ongoing research.

WAC 2012 homepage

International Internet Preservation Consortium 2012: Leveraging Web archives Research

In the context of the LAWA project, IMF will give a presentation on "Leveraging Web archives Research" at the International Internet Preservation Consortium 2012 General Assembly: April 30 - May 4, 2012 - The Library of Congress - Washington, DC

IIPC General Assembly
Leveraging Web archives Research
Tuesday May 1, 2012 General Assembly

Internet research requires the ability to store and analyse large portions of the Web as a foundational block for most content-centric studies.
For this, a combination of Web archives and a distributed infrastructure supporting extended analytical tools is necessary. With such an infrastructure, large-scale measurements, topological information and trends at Internet scale can be brought to the scrutiny of researchers and information professionals.
Internet Memory has developed a new infrastructure with the ambition to reach “Web scale” in terms of Web document acquisition (billions of resources crawled per week) and computable data storage (petabytes of data). This platform, partly supported by several EU projects, among which LAWA (Longitudinal Analytics of Web Archive data), includes:

- A new crawler, entirely implemented in Erlang to support the retrieval of billions of pages in days. Thanks to its innovative frontier and seen-URL structure, it sustains throughput for weeks while enabling Web-scale exploration.

- A new Web archive repository for content and metadata based on HBase. It offers a perfect storage layer for Web archives, as it is functionally isomorphic to WARC but abstracts away much of the underlying data management (replication, index creation, etc.) while exposing analytics-friendly APIs.

- Filters and extractors to distil relevant information and create processing chains in a distributed execution environment.

This presentation will offer an overview of this platform and discuss the next steps of its development.

HBaseCon 2012: Mignify, A Big Data Refinery Built on HBase

In the framework of the LAWA project, IMF will present the progress of the design and development of a Big Data platform at HBaseCon 2012: May 22, 2012 in San Francisco

HBaseCon 2012
Tuesday, May 22, 2012, 2:20pm – 3:00pm, InterContinental San Francisco Hotel
Presented by Stanislav Barton

Mignify is a platform for collecting, storing and analyzing Big Data harvested from the Web. It aims at providing easy access to focused and structured information extracted from Web data flows.

It consists of a distributed crawler, a resource-oriented storage based on HDFS and HBase, and an extraction framework that produces filtered, enriched, and aggregated data from large document collections, including the temporal aspect.
The whole system is deployed on an innovative hardware architecture comprising a large number of small (low-consumption) nodes. This talk will cover the decisions made during the design and development of the platform, from both a technical and a functional perspective. It will introduce the cloud infrastructure, the ETL-like ingestion of the crawler output into HBase/HDFS, and the triggering mechanism of analytics based on a declarative filter/extraction specification. The design choices will be illustrated with a pilot application targeting daily Web monitoring in the context of a national domain.

This platform is partly supported by several EU projects among which LAWA (Longitudinal Analytics of Web Archive data).

Coupling Label Propagation and Constraints for Temporal Fact Extraction

The paper "Coupling Label Propagation and Constraints for Temporal Fact Extraction" by Y. Wang, M. Dylla, M. Spaniol and G. Weikum has been accepted for ACL 2012.

The Web and digitized text sources contain a wealth of information about named entities such as politicians, actors, companies, or cultural landmarks. Extracting this information has enabled the automated construction of large knowledge bases, containing hundreds of millions of binary relationships or attribute values about these named entities. However, in reality most knowledge is transient, i.e. changes over time, requiring a temporal dimension in fact extraction. In this paper we develop a methodology that interlinks label propagation with constraints for temporal fact extraction. Due to the coupling we gain maximum benefit from both “worlds”: label propagation “aggressively” gathers fact candidates, while an Integer Linear Program does the “clean-up”. Our method is able to improve recall while keeping up precision, which we demonstrate by experiments with biography-style Wikipedia pages and a large corpus of news articles.

ACL 2012 homepage

TempWeb 2012 Roundup

The 2nd Temporal Web Analytics Workshop (TempWeb 2012) was successfully staged in Lyon, France on April 17, 2012.

On April 17, 2012 members of the LAWA consortium successfully staged the 2nd Temporal Web Analytics Workshop (TempWeb 2012) in Lyon, France. Again, the workshop was organized in conjunction with the international World Wide Web conference. The workshop attracted even more participants than its previous edition, with a peak of about 50 guests.

After a short introduction, the workshop began with an exciting keynote by Staffan Truvé, CTO of Recorded Future, on “Recorded Future: unlocking the predictive power of the web.” The talk touched on almost all of the research topics that were discussed in detail throughout the workshop. Moreover, it became clear that the workshop is aligned with a hot topic.

The scientific presentations were then separated into two sessions (papers are available from the workshop proceedings published by ACM):

Web Dynamics
Geerajit Rattanaritnont, Masashi Toyoda and Masaru Kitsuregawa: “Analyzing Patterns of Information Cascades based on Users’ Influence and Posting Behaviors”
Masahiro Inoue and Keishi Tajima: “Noise Robust Detection of the Emergence and Spread of Topics on the Web”
Margarita Karkali, Vassilis Plachouras, Costas Stefanatos and Michalis Vazirgiannis: “Keeping Keywords Fresh: A BM25 Variation for Personalized Keyword Extraction”

Speaker at TempWeb 2012

Identifying and leveraging time information
Erdal Kuzey and Gerhard Weikum: “Extraction of Temporal Facts and Events from Wikipedia”
Jannik Strötgen, Omar Alonso and Michael Gertz: “Identification of Top Relevant Temporal Expressions in Documents”
Ricardo Campos, Gaël Dias, Alípio Jorge and Célia Nunes: “Enriching Temporal Query Understanding through Date Identification: How to Tag Implicit Temporal Queries?”

Participants of TempWeb 2012

Again, all talks were of high quality and the discussions were lively. From the panel at the end of the workshop it became clear that the audience is keen on reference data sets provided by LAWA and wants to see a third edition of the workshop to be organized in conjunction with the next World Wide Web conference. So, stay tuned!

The 2nd Temporal Web Analytics Workshop (TempWeb 2012)

A summary about the 2nd Temporal Web Analytics Workshop (TempWeb 2012) has been published as part of the workshop proceedings.

Preface
Time is a key dimension to understand the Web. It is fair to say that it has not yet received all the attention it deserves, and TempWeb is an attempt to help remedy this situation by putting time at the center of its reflection. Studying time in this context actually covers a large spectrum, from dating methodology to extraction of temporal information and knowledge, from diachronic studies to the design of infrastructural and experimental settings enabling a proper observation of this dimension.
For its second edition, TempWeb includes 6 papers out of a total of 17 submitted, which puts its acceptance rate at 35%. The number of papers submitted has almost doubled compared to the first edition, which we like to interpret as a clear sign of a positive dynamic and an indication of the relevance of this effort. The workshop proceedings are published in the ACM DL (ISBN 978-1-4503-1188-5).
We hope you will find in these papers, the keynote, and the discussions and exchanges of this edition of TempWeb some motivation to look more into this important aspect of Web studies. TempWeb 2012 was jointly organized by Internet Memory Foundation (Paris, France), the Max-Planck-Institut für Informatik (Saarbrücken, Germany) and Yahoo! Research Barcelona (Barcelona, Spain), and supported by the 7th Framework IST programme of the European Union through the focused research project (STREP) on Longitudinal Analytics of Web Archive data (LAWA) under contract no. 258105.

pdf-file

TempWeb 2012 Proceedings

The Proceedings of the 2nd International Temporal Web Analytics Workshop (TempWeb 2012) are online now.

The Proceedings of the 2nd International Temporal Web Analytics Workshop (TempWeb 2012) held in conjunction with the 21st International World Wide Web Conference (www2012) in Lyon, France on April 17, 2012 are online at: ACM DL. The workshop was co-organized by the LAWA project and chaired by R. Baeza-Yates (Yahoo! Research Barcelona), J. Masanès (Internet Memory Foundation) and M. Spaniol (Max-Planck-Institut für Informatik).

Predicting the Evolution of Taxonomy Restructuring in Collective Web Catalogues

The paper "Predicting the Evolution of Taxonomy Restructuring in Collective Web Catalogues" by N. Prytkova, M. Spaniol and G. Weikum has been accepted for WebDB 2012.

Collectively maintained Web catalogues organize links to interesting Web sites into topic hierarchies, based on community input and editorial decisions. These taxonomic systems reflect the interests and diversity of ongoing societal discourses. Catalogues evolve by adding new topics, splitting topics, and other restructuring, in order to capture newly emerging concepts of long-lasting interest. In this paper, we investigate these changes in taxonomies and develop models for predicting such structural changes. Our approach identifies newly emerging latent concepts by analyzing news articles (or social media), by means of a temporal term relatedness graph. We predict the addition of new topics to the catalogue based on statistical measures associated with the identified latent concepts. Experiments with a large news archive corpus demonstrate the high precision of our method, and its suitability for Web-scale application.

WebDB 2012 homepage

Release of Web Analytics Technology V1

LAWA has released its Web Analytics Technology V1.

Implementations of LAWA’s Web Analytics Technology V1 have been driven by the overall aim of developing methods that support typical tasks in temporal Web analytics, such as:

  • processing of large-scale data sets for integration into the reference collection,
  • large-scale entity disambiguation,
  • creation of temporal indices,
  • Web classification.

This software will, throughout the course of the project, become a set of building blocks of LAWA’s Virtual Web Observatory. Backed by requirements monitoring within LAWA’s target user community, we believe that the developed software will make temporal Web analytics more understandable and explainable. To this end, all modules incorporate state-of-the-art information extraction technologies for Web content analytics. The software is available for download in our Software section.

Extraction of Temporal Facts and Events from Wikipedia

The paper "Extraction of Temporal Facts and Events from Wikipedia" by Erdal Kuzey and Gerhard Weikum has been accepted for the second Temporal Web Analytics Workshop (TempWeb 2012) in conjunction with the WWW 2012 conference.

Recently, large-scale knowledge bases have been constructed by automatically extracting relational facts from text. Unfortunately, most of the current knowledge bases focus on static facts and ignore the temporal dimension. However, the vast majority of facts are evolving with time or are valid only during a particular time period. Thus, time is a significant dimension that should be included in knowledge bases.
In this paper, we introduce a complete information extraction framework that harvests temporal facts and events from semi-structured data and free text of Wikipedia articles to create a temporal ontology. First, we extend a temporal data representation model by making it aware of events. Second, we develop an information extraction method which harvests temporal facts and events from Wikipedia infoboxes, categories, lists, and article titles in order to build a temporal knowledge base. Third, we show how the system can use its extracted knowledge for further growing the knowledge base.
We demonstrate the effectiveness of our proposed methods through several experiments. We extracted more than one million temporal facts with precision over 90% for extraction from semi-structured data and almost 70% for extraction from text.

TempWeb 2012 homepage

Content-Based Trust and Bias Classification via Biclustering

The paper "Content-Based Trust and Bias Classification via Biclustering" by Dávid Siklósi, Bálint Daróczy and András A. Benczúr has been accepted for the 2nd Joint WICOW/AIRWeb Workshop on Web Quality in conjunction with the WWW 2012 conference.

In this paper we improve trust, bias and factuality classification over Web data on the domain level. Unlike the majority of literature in this area, which aims at extracting opinions and handling short text on the micro level, we aim to aid a researcher or an archivist in obtaining a large collection that, on the high level, originates from unbiased and trustworthy sources. Our method generates features as Jensen-Shannon distances from centers in a host-term biclustering. On top of the distance features, we apply kernel methods and also combine with baseline text classifiers. We test our method on the ECML/PKDD Discovery Challenge data set DC2010. Our method improves over the best achieved text-classification NDCG results by 3–10% for neutrality, bias and trustworthiness. The fact that the ECML/PKDD Discovery Challenge 2010 participants reached an AUC only slightly above 0.5 indicates the hardness of the task.
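
The Jensen-Shannon distance features mentioned above follow the standard definition; for reference, a small sketch of the computation between two term distributions (assumed to be probability vectors of equal length) looks as follows. How the bicluster centers are obtained is, of course, the actual contribution of the paper and is not shown here.

    public final class JensenShannon {

        public static double divergence(double[] p, double[] q) {
            double jsd = 0.0;
            for (int i = 0; i < p.length; i++) {
                double m = 0.5 * (p[i] + q[i]);     // midpoint distribution
                jsd += 0.5 * (klTerm(p[i], m) + klTerm(q[i], m));
            }
            return jsd;   // in [0, ln 2] with the natural logarithm
        }

        // One summand of the Kullback-Leibler divergence, with 0*log(0/m) = 0.
        private static double klTerm(double x, double m) {
            return (x == 0.0) ? 0.0 : x * Math.log(x / m);
        }
    }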

WICOW/AIRWeb homepage

Big Web Analytics: Toward a Virtual Web Observatory

The paper "Big Web Analytics: Toward a Virtual Web Observatory" by Marc Spaniol, András Benczúr, Zsolt Viharos and Gerhard Weikum has been accepted for the ERCIM News special theme No. 89 on "Big Data".

For decades, compute power and storage have become steadily cheaper, while network speeds, although increasing, have not kept up. The result is that data is becoming increasingly local and thus distributed in nature. It has become necessary to move the software and hardware to where the data resides, and not the reverse. The goal of LAWA is to create a Virtual Web Observatory based on the rich centralized Web repository of the European Archive. The observatory will enable Web-scale analysis of data, will facilitate large-scale studies of Internet content and will add a new dimension to the roadmap of Future Internet Research – it’s about time!

ERCIM News
ERCIM News (pdf)

LAWA 3rd Newsletter

LAWA Partners are glad to present the third LAWA Newsletter

This edition focuses on the data subset citation method and the integration of the Wikipedia history into the LAWA reference collection. In addition, research areas present updates on their ongoing research.

Enjoy reading!

LAWA Presentation at the Future Internet (FI) week

The LAWA project will be presented in the FIRE session at the Future Internet (FI) Week in Aalborg, Denmark, on May 9, 2012.

A FIRE thematic workshop will be held on May 9, 2012 in Aalborg, Denmark. This event is part of the Future Internet Week 2012. LAWA is proud to give a presentation in the FIRE Workshop on Measurement and Measurement Tools.

Presentation Abstract
The LAWA project develops methods and tools for temporal Web analytics. The focus of development is on semantic and structural analytics of time-versioned textual Web contents. In particular, we are developing methods that enable entity detection and tracking along the time axis as well as temporal studies of large (Web) graphs. To this end, we also prepare a reference data set and will provide analytics services.

Tracking Entities in Web Archives: The LAWA Project

The paper "Tracking Entities in Web Archives: The LAWA Project" by Marc Spaniol and Gerhard Weikum has been accepted for the European projects track at the WWW 2012 conference.

Web-preservation organizations like the Internet Archive not only capture the history of born-digital content but also reflect the zeitgeist of different time periods over more than a decade. This longitudinal data is a potential gold mine for researchers like sociologists, political scientists, media and market analysts, or experts on intellectual property. The LAWA project (Longitudinal Analytics of Web Archive data) is developing an Internet-based experimental testbed for large-scale data analytics on Web archive collections. Its emphasis is on scalable methods for this specific kind of big-data analytics, and software tools for aggregating, querying, mining, and analyzing Web contents over long epochs. In this paper, we highlight our research on entity-level analytics in Web archive data, which lifts Web analytics from plain text to the entity level by detecting named entities, resolving ambiguous names, extracting temporal facts and visualizing entities over time periods. Our results provide key assets for tracking named entities in the evolving Web, news, and social media.

WWW 2012 homepage

Temporal Web Analytics Workshop at WWW2012

LAWA is helping to stage TempWeb 2012 in conjunction with the World Wide Web conference in Lyon, France on April 17, 2012.

***************************************************************
                CALL FOR PAPERS
***************************************************************

***************************************************************
          Proceedings published by ACM
***************************************************************

2nd Temporal Web Analytics Workshop (TempWeb 2012)
in conjunction with WWW 2012
April 17, 2012, Lyon, France
http://www.temporalweb.net/

Objectives:
The objective of this workshop is to provide a venue for researchers of all domains (IE/IR, Web mining, etc.) for whom the temporal dimension opens up an entirely new range of challenges and possibilities. The workshop’s ambition is to help shape a community of interest around the research challenges and possibilities resulting from the introduction of the time dimension in Web analysis.

TempWeb focuses on temporal data analysis along the time dimension for Web data that has been collected over extended time periods. A major challenge in this regard is the sheer size of such data and the difficulty of making sense of it in a useful and meaningful manner for its users; Web-scale data analytics therefore needs dedicated infrastructures and extended analytical tools. TempWeb will take place April 17, 2012 in conjunction with the International World Wide Web Conference in Lyon, France.

Workshop topics of TempWeb therefore include, but are not limited to, the following:
• Web scale data analytics
• Temporal Web analytics
• Distributed data analytics
• Web science
• Web dynamics
• Data quality metrics
• Web spam evolution
• Content evolution on the Web
• Systematic exploitation of Web archives
• Large scale data storage
• Large scale data processing
• Time aware Web archiving
• Data aggregation
• Web trends
• Topic mining
• Terminology evolution
• Community detection and evolution

Important Dates:
• Paper submission deadline: February 17, 2012
• Notification of acceptance: March 5, 2012
• Camera ready copy deadline: March 16, 2012
• Workshop: April 17, 2012

Please submit your paper (up to 8 pages) using the ACM template:
http://www.acm.org/sigs/publications/proceedings-templates
via EasyChair at:
https://www.easychair.org/account/signin.cgi?conf=tempweb2012

Workshop Officials:
PC-Chairs and Organizers:
Ricardo Baeza-Yates (Yahoo! Research, Spain)
Julien Masanès (Internet Memory Foundation, France and Netherlands)
Marc Spaniol (Max Planck Institute for Informatics, Germany)

Program Committee:
Eytan Adar (University of Michigan, USA)
Omar Alonso (Microsoft Bing, USA)
Srikanta Bedathur (IIIT-Delhi, India)
Andras Benczur (Hungarian Academy of Sciences, Hungary)
Klaus Berberich (Max Planck Institute for Informatics, Germany)
Roi Blanco (Yahoo! Research, Spain)
Adam Jatowt (Kyoto University, Japan)
Scott Kirkpatrick (Hebrew University Jerusalem, Israel)
Ravi Kumar (Yahoo! Research, USA)
Christian König (Microsoft Research, USA)
Frank McCown (Harding University, USA)
Michael Nelson (Old Dominion University, USA)
Nikos Ntarmos (University of Patras, Greece)
Kjetil Norvag (Norwegian University of Science and Technology, Norway)
Philippe Rigaux (Internet Memory Foundation, France and Netherlands)
Thomas Risse (L3S Research Center, Germany)
Pierre Senellart (Télécom ParisTech, France)
Torsten Suel (NYU Polytechnic, USA)
Masashi Toyoda (Tokyo University, Japan)
Peter Triantafillou (University of Patras, Greece)
Michalis Vazirgiannis (Athens University of Economics and Business & École Polytechnique)
Gerhard Weikum (Max Planck Institute for Informatics, Germany)

Feedback of the 2nd User Workshop - Paris, November 15, 2011

The 2nd LAWA User Workshop, Big-Data Analytics for the Temporal Web, was held on November 15, 2011 at Télécom ParisTech (Paris, France)

The workshop was organized as a one-day event and attracted more than 30 researchers. Presentations were given by the LAWA project team and participating guest researchers. Topics included methods, tools, and platforms for big-data analytics, including requirements on and experiences with such technologies.

Keynotes were presented by:

- Pierre Senellart (Télécom ParisTech, Paris): PARIS: Probabilistic Alignment of Relations, Instances, and Schema.
- Roi Blanco (Yahoo! Research, Barcelona): Searching over the past, present and future.

Guest presentations covered a wide spectrum of big-data analytics topics, such as:

- Erica Yang (Science and Technology Facilities Council, U.K.): “Poking Through Research Lifecycles: Towards Leveraging Web Archives for Sustaining Digital Research Assets”
- Pei Li (DISCo - University of Milano-Bicocca): “Linking Temporal Records”
- Helen Hockx-Yu (British Library, London): “Analytical Access to the UK Web Archive”
- Zeynep Pehlivan (University Pierre and Marie Curie, LIP6, Paris): “A survey: ranking Web data with Temporal Dimension”
- Michalis Vazirgiannis (Athens University of Economics and Business & École Polytechnique, Athens): “Evaluation of communities with degeneracy”
- Vicenc Torra (IIIA-CSIC, Bellaterra, Catalonia, Spain): “Privacy-enhancing technologies in the web: Semantic-based tools for search logs”
- Klara Stokes (U. Rovira i Virgili): “P2P UPIR: Privacy protection for users of web-based search engines”

The presentations were followed by an interactive Q&A session. During the discussion, the underlying ontology was introduced, and methods for accessing its (meta-)data were discussed. The attendees expressed their interest in using the analytics and data provided by the LAWA project. As an outcome of the workshop, LAWA is now working on a publicly available selected data set of its reference collection.

LAWA 2nd Newsletter

LAWA Partners are glad to present the second LAWA Newsletter

This edition focuses on the LAWA reference collection and the testbed. In addition, each research area presents updates on their ongoing research.

Enjoy reading!

Big-Data Analytics for the Temporal Web (Paris, November 15, 2011)

International Workshop on Big-Data Analytics for the Temporal Web, Paris, November 15, 2011. Keynotes by: Roi Blanco (Yahoo! Research) "Searching over the past, present and future" and Pierre Senellart (Télécom ParisTech and Webdam Project) "PARIS: Probabilistic Alignment of Relations, Instances, and Schema"

The LAWA project organizes a one-day workshop with researchers using (or planning to use) the Web as a corpus for their studies. The focus is on methods, tools, and platforms for big-data analytics, including requirements on and experiences with such technologies. Topics of interest include but are not limited to: Web dynamics, history, and archives; text mining and content classification; temporal/longitudinal studies; scalable methods (e.g., cloud-based map-reduce); large-scale data storage; community detection and evolution.

The workshop will have presentations by participating researchers and big-data users, including the LAWA project team. Emphasis will be on experience-sharing and discussing mutual interests in big-data analytics for the temporal Web. The workshop is free of charge and open to the public, but registration is compulsory by email (the contact address is displayed on the LAWA project website).

SZTAKI @ ImageCLEF 2011

The paper "SZTAKI @ ImageCLEF 2011" by B. Daróczy, R. Pethes, and A. A. Benczúr has been accepted for publication in the Working Notes of the ImageCLEF 2011 Workshop at CLEF 2011 Conference, Amsterdam, The Netherlands, 2011.

We participated in the ImageCLEF 2011 Photo Annotation and Wikipedia Image Retrieval Tasks. Our approach to the ImageCLEF 2011 Photo Annotation task is based on a kernel weighting procedure using visual Fisher kernels and a Flickr-tag-based Jensen-Shannon divergence kernel. We trained a Gaussian Mixture Model (GMM) to define a generative model over the feature vectors extracted from the image patches. To represent each image with high-level descriptors, we calculated Fisher vectors from different visual features of the images. These features were sampled at various scales and partitions, such as Harris-Laplace detected patches and scale and spatial pyramids. We calculated distance matrices from the descriptors of the training images to combine the different high-level descriptors with the tag-based similarity matrix. With this uniform representation we could learn the natural weights for each category over the different types of descriptors. This re-weighting resulted in a 0.01838 MAP increase over the average-kernel results. We used the weighted kernels to learn linear SVM models for each of the 99 concepts independently. For the Wikipedia Image Retrieval Task we used the search engine of the Hungarian Academy of Sciences, based on Okapi BM25 ranking, as our information retrieval system. We calculated light Fisher vectors to represent the content of the images and performed nearest-neighbour search on them.
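As a rough illustration of per-category kernel weighting (a sketch under simplifying assumptions; the kernel matrices and weights are placeholders, not the Fisher and tag kernels from the paper):

import numpy as np
from sklearn.svm import SVC

def combine_kernels(kernels, weights):
    """Weighted sum of precomputed (n x n) kernel matrices, one weight per kernel."""
    return sum(w * K for w, K in zip(weights, kernels))

def train_concept_svm(kernels, weights, y):
    """Train an SVM for one concept on the weighted kernel combination.
    kernels: e.g. [visual Fisher kernel, tag-based similarity kernel]; y: binary labels."""
    K = combine_kernels(kernels, weights)
    clf = SVC(kernel="precomputed")
    clf.fit(K, y)
    return clf

In this setup the per-category weights can be tuned on held-out data, one concept at a time, which mirrors the idea of learning "natural weights" per category.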

AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables

The paper "AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables" by M. A. Yosef, J. Hoffart, I. Bordino, M. Spaniol and G. Weikum has been accepted for the VLDB 2011 conference.

We present AIDA, a framework and online tool for entity detection and disambiguation. Given a natural-language text or a Web table, we map mentions of ambiguous names onto canonical entities like people or places, registered in a knowledge base like DBpedia, Freebase, or YAGO. AIDA is a robust framework centred around collective disambiguation exploiting the prominence of entities, similarity between the context of the mention and its candidates, and the coherence among candidate entities for all mentions. We have developed a Web-based online interface for AIDA where different formats of inputs can be processed on the fly, returning proper entities and showing intermediate steps of the disambiguation process.
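The interplay of the three signals can be pictured with a toy scoring function (illustrative only; the weights and function names are made up, and AIDA's actual collective disambiguation is a graph algorithm, not this per-mention maximum):

def score_candidate(prominence: float,
                    context_similarity: float,
                    coherence: float,
                    weights=(0.3, 0.4, 0.3)) -> float:
    """Toy linear combination of the three signals (weights are made up)."""
    w_p, w_s, w_c = weights
    return w_p * prominence + w_s * context_similarity + w_c * coherence

def disambiguate(mention_candidates: dict) -> dict:
    """Pick, per mention, the candidate entity with the highest toy score.
    mention_candidates maps mention -> {entity: (prominence, similarity, coherence)}."""
    return {
        mention: max(cands, key=lambda e: score_candidate(*cands[e]))
        for mention, cands in mention_candidates.items()
    }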

VLDB 2011 homepage

Harvesting Facts from Textual Web Sources by Constrained Label Propagation

The paper "Harvesting Facts from Textual Web Sources by Constrained Label Propagation" by Y. Wang, B. Yang, L. Qu, M. Spaniol and G. Weikum has been accepted for the CIKM 2011 conference.

There have been major advances on automatically constructing large knowledge bases by extracting relational facts from Web and text sources. However, the world is dynamic: periodic events such as sports competitions need to be interpreted with their respective timepoints, and facts such as coaching a sports team, holding political or business positions, and even marriages do not hold forever and should be augmented by their respective timespans. This paper addresses the problem of automatically harvesting temporal facts with such extended time-awareness. We employ pattern-based gathering techniques for fact candidates and construct a weighted pattern-candidate graph. Our key contribution is a new kind of label propagation algorithm with a judiciously designed loss function, which iteratively processes the graph to label good temporal facts for a given set of target relations. Our experiments with online news and Wikipedia articles demonstrate the accuracy of this method.
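A bare-bones version of label propagation over such a weighted graph might look as follows (a simplified sketch: plain propagation with clamped seeds, without the paper's constraints or custom loss function):

import numpy as np

def propagate_labels(W: np.ndarray, labels: np.ndarray, seed_mask: np.ndarray,
                     alpha: float = 0.85, iters: int = 50) -> np.ndarray:
    """Iteratively spread seed label scores over a weighted pattern-candidate graph.
    W: (n x n) symmetric edge-weight matrix; labels: (n x k) initial label scores;
    seed_mask: boolean (n,) marking trusted seed nodes that are clamped each round."""
    deg = W.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    P = W / deg                            # row-normalized transition matrix
    F = labels.astype(float).copy()
    for _ in range(iters):
        F = alpha * (P @ F) + (1 - alpha) * labels
        F[seed_mask] = labels[seed_mask]   # clamp the seeds
    return F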

CIKM 2011 homepage

Temporal Index Sharding for Space-Time Efficiency in Archive Search

The paper "Temporal Index Sharding for Space-Time Efficiency in Archive Search" by A. Anand, S. Bedathur, K. Berberich and R. Schenkel has been accepted for the ACM SIGIR Conference 2011.

Time-travel queries that couple temporal constraints with keyword queries are useful in searching large-scale archives of time-evolving content such as web archives or wikis. Typical approaches for efficient evaluation of these queries involve slicing either the entire collection or individual index lists along the time axis. Neither method is satisfactory, since both sacrifice compactness of the index for processing efficiency, making the index either too big or query processing too slow.
We present a novel index organization scheme that shards each index list with almost zero increase in index size but still minimizes the cost of reading index entries during query processing. Based on the optimal sharding thus obtained, we develop a practically efficient sharding that takes into account the different costs of random and sequential accesses. Our algorithm merges shards from the optimal solution to allow for a few extra sequential accesses while gaining significantly by reducing the number of random accesses. We empirically establish the effectiveness of our sharding scheme with experiments over the revision history of the English Wikipedia between 2001 and 2005 (~700 GB) and an archive of U.K. governmental web sites (~400 GB). Our results demonstrate the feasibility of faster time-travel query processing with no space overhead.
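To make the idea concrete, here is a toy model of time-sharded posting lists (a simplified sketch with integer timestamps, not the paper's sharding algorithm): postings carry validity intervals, shards partition the time axis, and a time-travel query reads only the shard covering its timestamp.

from bisect import bisect_right

# A posting valid during [begin, end): represented as (doc_id, begin, end, score).

def build_shards(postings, boundaries):
    """Partition a term's postings into time shards at the given cut points;
    a posting spanning several shards is replicated into each, which is the
    space/time trade-off that the paper's sharding minimizes."""
    shards = [[] for _ in range(len(boundaries) + 1)]
    for doc_id, begin, end, score in postings:
        lo = bisect_right(boundaries, begin)
        hi = bisect_right(boundaries, end - 1)
        for s in range(lo, hi + 1):
            shards[s].append((doc_id, begin, end, score))
    return shards

def time_travel_lookup(shards, boundaries, t):
    """Answer a query 'as of time t' by scanning only the shard covering t."""
    s = bisect_right(boundaries, t)
    return [p for p in shards[s] if p[1] <= t < p[2]]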

Full Paper

LAWA Presentation at the FIRE Research Workshop

The LAWA project will be presented at the FIRE Research Workshop in Budapest (Hungary) on May 16, 2011.

The FIRE Research Workshop will be held on May 16, 2011 in Budapest (Hungary). This event is part of the Future Internet Week 2011. LAWA is proud to give a presentation in the Future Internet, Living Labs and Web analytics session.

Presentation Abstract
Organizations like the Internet Archive have been capturing Web contents over decades. This time-versioned content is a gold mine for analysts focusing on longitudinal studies. An application example is tracking and analyzing a politician’s public appearances over a decade. The LAWA project develops methods and tools for time-travel indexing and querying, entity detection and tracking along the time axis, and advanced analyses and knowledge discovery. For scalability, we pursue Hadoop-based distributed computations. We also prepare reference data and will provide analytics services. We will offer a user workshop in late 2011 to disseminate these opportunities and explore interesting use cases.

LAWA 1st Newsletter

LAWA Partners are glad to present the first LAWA Newsletter

This edition introduces the LAWA research areas by presenting their goals and first research steps undertaken.
Enjoy reading!

Scalable Spatio-temporal Knowledge Harvesting

The paper "Scalable Spatio-temporal Knowledge Harvesting" by Yafang Wang, Bin Yang, Spyros Zoupanos, Marc Spaniol and Gerhard Weikum has been accepted for the WWW 2011 poster track.

Knowledge harvesting enables the automated construction of large knowledge bases. In this work, we make a first attempt to harvest spatio-temporal knowledge from news archives in order to construct trajectories of individual entities for spatio-temporal entity tracking.
Our approach consists of an entity extraction and disambiguation module and a fact generation module, which together produce pertinent trajectory records from textual sources. An evaluation on 20 years of the New York Times corpus showed that our methods are effective and scalable.
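The notion of a trajectory record can be illustrated with a tiny data model (an illustrative sketch; the field names are made up, not the paper's schema):

from dataclasses import dataclass

@dataclass(frozen=True)
class TrajectoryRecord:
    """One spatio-temporal observation of an entity, as extracted from text."""
    entity: str    # canonical entity name after disambiguation
    location: str  # extracted and disambiguated place name
    date: str      # ISO date of the reported event
    source: str    # provenance, e.g. a news article identifier

def build_trajectory(records):
    """Order one entity's records by time to obtain its trajectory."""
    return sorted(records, key=lambda r: r.date)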

Temporal Web Analytics Workshop at WWW2011

LAWA helped put together the first workshop on this emerging topic at WWW2011.

We are very pleased to see that the temporal dimension of Web analysis is gaining momentum. An example? The first Temporal Web Analytics Workshop (Twitter: #temporalweb) will be held in conjunction with the World Wide Web Conference (WWW2011) in Hyderabad, India. Time to prepare great papers (submission deadline: January 31, 2011).

1st LAWA user workshop

The 1st LAWA User Workshop will be held on November 30, 2010 at CNAM (Conservatoire national des arts et métiers), Paris, France.

The LAWA project on Longitudinal Analytics of Web Archive data will build an Internet-based experimental testbed for large-scale data analytics. Its focus is on developing a sustainable infrastructure, scalable methods, and easily usable software tools for aggregating, querying, and analyzing heterogeneous data at Internet scale. Particular emphasis will be given to longitudinal data analysis along the time dimension for Web data that has been crawled over extended time periods.

By means of this workshop we want to present LAWA to a scientific audience and discuss mutual interests in Web archive analytics. In addition, we want to establish links to related projects. The preliminary agenda is available at http://www.lawa-project.eu/events/user-workshop-2010.pdf.

Registration for the event is free of charge but compulsory and has to be made online by sending an email to the organizers (the contact address is displayed on the LAWA project website).
Due to the limited capacity of the venue, it is recommended to register as soon as possible. Applications will be handled on a first come, first served basis.


Longitudinal Analytics on Web Archive Data: It’s About Time!

The joint vision paper “Longitudinal Analytics on Web Archive Data: It’s About Time!” by the LAWA consortium has been accepted at CIDR 2011.

ABSTRACT

Organizations like the Internet Archive have been capturing Web contents over decades, building up huge repositories of time-versioned pages. The timestamp annotations and the sheer volume of multi-modal content constitute a gold mine for analysts of all sorts, across different application areas, from political analysts and marketing agencies to academic researchers and product developers. In contrast to traditional data analytics on click logs, the focus is on longitudinal studies over very long horizons. This longitudinal aspect affects and concerns all data and metadata, from the content itself to the indices and the statistical metadata maintained for it. Moreover, advanced analysts prefer to deal with semantically rich entities like people, places, organizations, and ideally relationships such as company acquisitions, instead of, say, Web pages containing such references. For example, tracking and analyzing a politician’s public appearances over a decade is much harder than mining frequently used query words or frequently clicked URLs for the last month. The huge size of Web archives adds to the complexity of this daunting task. This paper discusses key challenges, which we intend to take up, posed by this kind of longitudinal analytics: time-travel indexing and querying, entity detection and tracking along the time axis, algorithms for advanced analyses and knowledge discovery, and scalability and platform issues.

LAWA Kick-off Meeting

The kick-off meeting took place on September 6–7, 2010 at the Max Planck Institute for Informatics in Saarbrücken, Germany.

The meeting was highly interactive and organized in presentation and discussion sessions. On the first day, organizational issues were handled and all partners presented their ideas for research within LAWA. From the presentations, collaboration ideas emerged along the following three lines:
• Data
• Infrastructure/Architecture
• Analytics/Interfaces
On the second day, the previously identified aspects were turned into technical tasks and next steps were defined. Along the three lines of research, mutual collaboration topics emerged, such as efficient indexing and data distribution as well as methods for aggregated querying and graph computations. As a result, the partners have started working on a joint white paper to be submitted to one of the upcoming relevant conferences.

Figure: Participants of the LAWA kick-off meeting in Saarbrücken, September 6–7, 2010