Longitudinal Analytics
of Web Archive Data


LAWA 6th Newsletter

LAWA Partners are glad to present the sixth LAWA Newsletter

This edition focuses on Temporal Web Analytics in Action. In addition, we present the latest publications of the LAWA project.

Enjoy reading!

Crowdsourced Entity Markup

The paper "Crowdsourced Entity Markup" by Lili Jiang, Yafang Wang, Johannes Hoffart and Gerhard Weikum has been accepted for the Workshop on Crowdsourcing the Semantic Web (CrowdSem 2013) in conjunction with ISWC 2013.

Entities, such as people, places, and products, exist in knowledge bases and linked data on the one hand, and in web pages, news articles, and social media on the other. Entity markup, like Named Entity Recognition and Disambiguation (NERD), is the essential means for adding semantic value to unstructured web contents, thereby enabling the linkage between unstructured and structured data and knowledge collections. A major challenge in this endeavor lies in the dynamics of the digital contents about the world, with new entities emerging all the time. In this paper, we propose a crowdsourced framework for NERD, specifically addressing the challenge of emerging entities in social media. Our approach combines NERD techniques with the detection of entity alias names and with co-reference resolution in texts. We propose a linking-game based crowdsourcing system for this combined task, and we report on experimental insights with this approach and on lessons learned.

CrowdSem 2013 homepage

On the SPOT: Question Answering over Temporally Enhanced Structured Data

The paper "On the SPOT: Question Answering over Temporally Enhanced Structured Data" by Mohamed Yahya, Klaus Berberich, Maya Ramanath and Gerhard Weikum has been accepted for the Workshop on Time-aware Information Access (TAIA2013) in conjunction with SIGIR 2013.

Natural-language question answering is a convenient way for humans to discover relevant information in structured Web data such as knowledge bases or Linked Open Data sources. This paper focuses on data with a temporal dimension, and discusses the problem of mapping natural-language questions into extended SPARQL queries over RDF-structured data. We specifically address the issue of disambiguating temporal phrases in the question into temporal entities like dates and named events, and temporal predicates. For the situation where the data has only partial coverage of the time dimension but is augmented with textual descriptions of entities and facts, we also discuss how to generate queries that combine structured search with keyword conditions.
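
As a purely illustrative sketch of the temporal-disambiguation step (not the authors' actual system), the following Python snippet maps a temporal phrase onto a date interval and compiles it into a SPARQL filter; the tiny phrase lexicon and the ex:validFrom/ex:validUntil predicates are hypothetical stand-ins.

```python
# Minimal sketch (not the paper's system): disambiguate a temporal phrase
# into a date range and compile it into a SPARQL FILTER. The predicate
# names and the tiny phrase lexicon below are hypothetical.
from datetime import date

# Toy lexicon mapping temporal phrases to (begin, end) intervals.
TEMPORAL_LEXICON = {
    "the 1990s": (date(1990, 1, 1), date(1999, 12, 31)),
    "world war ii": (date(1939, 9, 1), date(1945, 9, 2)),
}

def temporal_sparql(fact_var: str, phrase: str) -> str:
    """Build a SPARQL snippet restricting a fact's validity interval."""
    begin, end = TEMPORAL_LEXICON[phrase.lower()]
    return f"""
    {fact_var} ex:validFrom ?from ; ex:validUntil ?until .
    FILTER (?from <= "{end.isoformat()}"^^xsd:date &&
            ?until >= "{begin.isoformat()}"^^xsd:date)
    """

print(temporal_sparql("?fact", "the 1990s"))
```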

TAIA 2013 homepage

Temporal Diversification of Search Results

The paper "Temporal Diversification of Search Results" by Klaus Berberich and Srikanta Bedathur has been accepted for the Workshop on Time-aware Information Access (TAIA2013) in conjunction with SIGIR 2013.

We investigate the notion of temporal diversity, bringing together two recently active threads of research, namely temporal ranking and diversification of search results. A novel method is developed to determine search results consisting of documents that are relevant to the query and were published at diverse times of interest to the query. Preliminary experiments on twenty years’ worth of newspaper articles from The New York Times demonstrate characteristics of our method and compare it against two baselines.
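
A greedy, MMR-style selection conveys the general idea of trading off relevance against temporal diversity. The Python sketch below is a hedged simplification under assumed scoring weights, not the method evaluated in the paper.

```python
# Hedged sketch of temporal diversification (not the paper's exact model):
# greedily pick documents that trade off query relevance against distance
# in publication time from already-selected results, MMR-style.
from datetime import date

def temporal_diversify(docs, k, lam=0.5):
    """docs: list of (doc_id, relevance, pub_date); returns k doc_ids."""
    selected = []
    pool = list(docs)
    while pool and len(selected) < k:
        def score(d):
            _, rel, pub = d
            if not selected:
                return rel
            # Diversity = days to the nearest already-selected document.
            nearest = min(abs((pub - s[2]).days) for s in selected)
            return lam * rel + (1 - lam) * nearest / 365.0
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
    return [d[0] for d in selected]

docs = [("d1", 0.9, date(2001, 5, 1)), ("d2", 0.8, date(2001, 6, 1)),
        ("d3", 0.7, date(1995, 2, 1))]
print(temporal_diversify(docs, 2))  # ['d1', 'd3'] -- temporally spread
```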

TAIA 2013 homepage

On Temporal Wikipedia search by edits and linkage

The paper "On Temporal Wikipedia search by edits and linkage" by Julianna Göbölös-Szabó and András Benczúr has been accepted for the Workshop on Time-aware Information Access (TAIA2013) in conjunction with SIGIR 2013.

We exploit the connectivity structure of edits in Wikipedia to identify recent events that happened at a given time, by identifying bursty changes in linked articles around a specified date. Our key results include algorithms for node relevance ranking in temporal subgraphs and for neighborhood selection, based on measurements of structural changes over time in the Wikipedia link graph. We evaluate our algorithms over manually annotated queries with relevant events in September and October 2011; we make the assessment publicly available (https://dms.sztaki.hu/en/download/wimmut-searching-and-navigating-wikipedia). While our methods were tested over clean Wikipedia metadata, we believe they are applicable to general temporal Web collections as well.
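
For intuition about the burstiness signal, here is a minimal Python sketch (not the paper's algorithm) that flags days whose edit count deviates strongly from a trailing window:

```python
# Illustrative sketch, not the paper's method: flag bursty days in an
# article's edit-count series by comparing each day against the mean and
# standard deviation of a trailing window.
from statistics import mean, stdev

def bursty_days(edit_counts, window=7, threshold=3.0):
    """edit_counts: list of daily edit counts; returns indices of bursts."""
    bursts = []
    for i in range(window, len(edit_counts)):
        hist = edit_counts[i - window:i]
        mu, sigma = mean(hist), stdev(hist) or 1.0
        if (edit_counts[i] - mu) / sigma > threshold:
            bursts.append(i)
    return bursts

counts = [2, 3, 1, 2, 4, 2, 3, 2, 40, 3]
print(bursty_days(counts))  # [8] -- the spike of 40 edits
```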

TAIA 2013 homepage

TempWeb 2013 Roundup

The 3rd Temporal Web Analytics Workshop (TempWeb 2013) was successfully staged in Rio de Janeiro, Brazil on May 13, 2013.

On May 13, 2013 the LAWA consortium successfully staged the 3rd Temporal Web Analytics Workshop (TempWeb 2013) in Rio de Janeiro, Brazil. Again, the workshop was organized in conjunction with the international World Wide Web conference. The workshop attracted around 40 participants throughout the entire day.

After a short introduction, the workshop began with an exciting keynote by Omar Alonso (Microsoft Bing, USA) on “Stuff happens continuously: exploring Web contents with temporal information”. The talk covered the entire spectrum of temporal Web analytics, including time in document collections, social data, and exploring the Web through time. The keynote once again demonstrated the relevance of the topic and its excellent fit with the World Wide Web conference.

TempWeb 2013 Keynote

The scientific presentations were then organized into three sessions (papers are available in the WWW companion volume published by ACM):

Web Archiving
Miguel Costa, Daniel Gomes and Mário J. Silva: “A Survey of Web Archive Search Architectures”
Ahmed Alsum, Michael L. Nelson, Robert Sanderson and Herbert Van de Sompel: “Archival HTTP Redirection Retrieval Policies”
Daniel Gomes, Miguel Costa, David Cruz, João Miranda and Simão Fontes: “Creating a Billion-Scale Searchable Web Archive”

Identifying and leveraging time information
Julia Kiseleva, Hoang Thanh Lam, Mykola Pechenizkiy and Toon Calders: “Predicting temporal hidden contexts in web sessions”
Hany Salaheldeen and Michael Nelson: “Carbon Dating The Web: Estimating the Age of Web Resources”
Omar Alonso and Kyle Shiells: “Timelines as Summaries of Popular Scheduled Events”

TempWeb 2013 Audience

Lucas Miranda, Rodrygo Santos and Alberto Laender: “Characterizing Video Access Patterns in Mainstream Media Portals”
Laura Elisa Celis, Koustuv Dasgupta and Vaibhav Rajan: “Adaptive Crowdsourcing for Temporal Crowds”
Hideo Joho, Adam Jatowt and Roi Blanco: “A Survey of Temporal Web Search Experience”

All talks were of high quality and the discussions were lively. The workshop’s third edition made clear that LAWA addresses a hot topic that is worth investigating in conjunction with the World Wide Web conference series. As a next step, we are planning a special issue to be published in a journal. So, stay tuned!

HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text

The paper "HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text" by Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol and Gerhard Weikum has been accepted for the ACL 2013 demo track.

Recent research has shown progress in achieving high-quality, very fine-grained type classification in hierarchical taxonomies. Within such a multi-level type hierarchy with several hundreds of types at different levels, many entities naturally belong to multiple types. In order to achieve high precision in type classification, current approaches are either limited to certain domains or require time-consuming multi-stage computations. As a consequence, existing systems are incapable of performing ad-hoc type classification on arbitrary input texts. In this demo, we present a novel Web-based tool that is able to perform domain-independent entity type classification under real-time conditions. Due to its efficient implementation and compacted feature representation, the system is able to process text inputs on-the-fly while achieving precision as high as leading state-of-the-art implementations. Our system offers an online interface where natural-language text can be inserted, and it returns lexical type labels for entity mentions. Furthermore, the user interface allows users to explore the types assigned to text mentions by visualizing and navigating along the type hierarchy.

ACL 2013 homepage

Knowledge Linking for Online Statistics

The paper "Knowledge Linking for Online Statistics" by Marc Spaniol, Natalia Prytkova and Gerhard Weikum will be presented at the 59th World Statistics Congress (WSC) in the Special Topic Session (STS) on "The potential of Internet, big data and organic data for official statistics".

The LAWA project investigates large-scale Web (archive) data along the temporal dimension. As a use case, we are studying Knowledge Linking for Online Statistics.

Statistics portals such as eurostat’s “Statistics Explained” (http://epp.eurostat.ec.europa.eu/statistics_explained/index.php/Main_Page) provide a wealth of articles constituting an encyclopedia of European statistics. Together with its statistical glossary, the huge amount of numerical data comes with a well-defined thesaurus. However, this data is not directly at hand when browsing Web data covering the topic. For instance, when reading news articles about the debate on renewable energy across Europe after the earthquake in Japan and the Fukushima accident, one would ideally be able to understand these discussions based on statistical evidence.

We believe that Internet contents, captured in Web archives and reflected and aggregated in the Wikipedia history, can be better understood when linked with online statistics. To this end, we aim at semantically enriching and analyzing Web (archive) data to narrow and ultimately bridge the gap between numerical statistics and textual media like news or online forums. The missing link and key to this goal is the discovery and analysis of entities and events in Web (archive) contents. This way, we can enrich Web pages, e.g. by a browser plug-in, with links to relevant statistics (e.g. eurostat pages). Raising data analytics to the entity-level also enables understanding the impact of societal events and their perception in different cultures and economies.

WSC 2013 homepage

Mind the Gap: Large-Scale Frequent Sequence Mining

The paper "Mind the Gap: Large-Scale Frequent Sequence Mining" by Iris Miliaraki, Klaus Berberich, Rainer Gemulla and Spyros Zoupanos has been accepted for presentation at SIGMOD 2013.

Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose PFSM, a scalable algorithm for frequent sequence mining on MapReduce. PFSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. At its heart, PFSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of w-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our extensive experimental study in the context of text mining suggests that PFSM is significantly more efficient and scalable than alternative approaches.
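
For intuition about gap constraints, the toy Python sketch below enumerates and counts gap-constrained subsequences directly; PFSM's actual contribution, the w-equivalency-based partitioning that makes this scale on MapReduce, is deliberately left out.

```python
# Hedged sketch of gap-constrained frequent sequence counting (a much
# simplified stand-in for PFSM): enumerate subsequences of bounded length
# whose consecutive items are at most `gap` positions apart, then keep
# those meeting the minimum support.
from collections import Counter
from itertools import combinations

def gapped_subsequences(seq, max_len, gap):
    """Yield subsequences where consecutive picks are <= gap+1 apart."""
    n = len(seq)
    for length in range(2, max_len + 1):
        for idx in combinations(range(n), length):
            if all(j - i <= gap + 1 for i, j in zip(idx, idx[1:])):
                yield tuple(seq[i] for i in idx)

def frequent_sequences(db, max_len=3, gap=1, min_support=2):
    counts = Counter()
    for seq in db:
        # Count each pattern once per input sequence (document frequency).
        counts.update(set(gapped_subsequences(seq, max_len, gap)))
    return {p: c for p, c in counts.items() if c >= min_support}

db = [["a", "b", "c", "a"], ["a", "x", "b"], ["b", "a", "c"]]
print(frequent_sequences(db))  # ('a','b') is frequent despite the gap
```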

SIGMOD 2013 homepage

LAWA 5th Newsletter

LAWA Partners are glad to present the fifth LAWA Newsletter

This edition focuses on indexing Big Data. In addition, we present the latest publications of the LAWA project.

Enjoy reading!

User-Defined Redundancy in Web Archives

The paper "User-Defined Redundancy in Web Archives" by Bibek Paudel, Avishek Anand, and Klaus Berberich has been accepted for the Workshop on Large-Scale and Distributed Systems for Information Retrieval (LSDS-IR) in conjunction with WSDM 2013.

Web archives are valuable resources. However, they are characterized by a high degree of redundancy. Not only does this redundancy waste computing resources, but it also degrades the user experience, since users have to sift through and weed out redundant content. Existing methods focus on identifying near-duplicate documents, assume a universal notion of redundancy, and thus cannot adapt to user-specific requirements such as a preference for more recent or diversely opinionated content.

In this work, we propose an approach that equips users with fine-grained control over what they consider redundant. To this end, users specify a binary coverage relation between documents that can factor in the documents’ contents as well as their metadata. Our approach then determines a minimum-cardinality cover set of non-redundant documents. We describe how this can be done at scale using MapReduce as a platform for distributed data processing. Our prototype implementation has been deployed on a real-world web archive, and we report experiences from this case study.
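
The cover-set idea can be illustrated with a standard greedy set-cover sketch; the paper computes this at scale with MapReduce, and the covers predicate below is just one hypothetical user-defined coverage relation.

```python
# Minimal greedy sketch of the cover-set idea (the paper computes this at
# scale with MapReduce; the objective and tie-breaking are simplified).
# `covers(a, b)` is a user-supplied predicate saying document `a` makes
# document `b` redundant.

def greedy_cover(docs, covers):
    """Return a small selection of docs such that every doc is covered."""
    uncovered = set(docs)
    chosen = []
    while uncovered:
        # Pick the document covering the most still-uncovered documents.
        best = max(docs, key=lambda a: sum(1 for b in uncovered
                                           if a == b or covers(a, b)))
        chosen.append(best)
        uncovered -= {b for b in uncovered if best == b or covers(best, b)}
    return chosen

# Example relation: a newer snapshot covers older versions of a page.
snapshots = [("p1", 2010), ("p1", 2012), ("p2", 2011)]
covers = lambda a, b: a[0] == b[0] and a[1] >= b[1]
print(greedy_cover(snapshots, covers))  # [('p1', 2012), ('p2', 2011)]
```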

LSDS-IR homepage

Computing n-Gram Statistics in MapReduce

The paper "Computing n-Gram Statistics in MapReduce" by Klaus Berberich and Srikanta Bedathur has been accepted for EDBT 2013.

Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and minimum collection frequency, can be computed efficiently harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method Suffix-sigma that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete MapReduce implementation, we provide insights on an efficient implementation of the methods. Extensive experiments on The New York Times Annotated Corpus and ClueWeb09 expose the relative benefits and trade-offs of the methods.
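
As a baseline illustration, the Python sketch below implements the simple word-count-style extension for n-grams, with the map and reduce roles noted in comments; the paper's Suffix-sigma method computes the same statistics far more efficiently.

```python
# Sketch of the simplest method the paper starts from (the word-count
# extension), not of Suffix-sigma itself: emit every n-gram up to a
# maximum length, then aggregate and apply a minimum collection
# frequency. In MapReduce, `emit_ngrams` plays the map side and the
# Counter plays the reduce side.
from collections import Counter

def emit_ngrams(tokens, max_len):
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def ngram_statistics(documents, max_len=3, min_freq=2):
    counts = Counter()
    for doc in documents:
        counts.update(emit_ngrams(doc.split(), max_len))
    return {ng: c for ng, c in counts.items() if c >= min_freq}

docs = ["a rose is a rose", "is a rose"]
print(ngram_statistics(docs))
# e.g. ('a', 'rose') -> 3, ('is', 'a', 'rose') -> 2
```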

EDBT 2013 homepage

Temporal Web Analytics Workshop at WWW2013

LAWA helps staging TempWeb 2013 in conjunction with the World Wide Web conference in Rio de Janeiro, Brazil on May 13, 2013.

***************************************************************
          CALL FOR PAPERS
***************************************************************

3rd Temporal Web Analytics Workshop (TempWeb 2013)
in conjunction with WWW 2013
May 13, 2013, Rio de Janeiro, Brazil
http://www.temporalweb.net/

***************************************************************
      Proceedings published by ACM
***************************************************************

***************************************************************
Keynote by Omar Alonso (Microsoft Bing, USA)
“Stuff happens continuously: exploring Web contents
with temporal information”
***************************************************************

Objectives:
The objective of this workshop is to provide a venue for researchers of all domains (IE/IR, Web mining, etc.) for whom the temporal dimension opens up an entirely new range of challenges and possibilities. The workshop’s ambition is to help shape a community of interest around the research challenges and possibilities resulting from the introduction of the time dimension in Web analysis.

TempWeb focuses on temporal data analysis along the time dimension for Web data that has been collected over extended time periods. A major challenge in this regard is the sheer size of such data and the ability to make sense of it in a useful and meaningful manner for its users. Web-scale data analytics therefore needs to develop infrastructures and extended analytical tools to make sense of this data. TempWeb will take place May 13, 2013 in conjunction with the International World Wide Web Conference in Rio de Janeiro, Brazil.

Workshop topics of TempWeb therefore include, but are not limited to, the following:
- Web scale data analytics
- Temporal Web analytics
- Distributed data analytics
- Web science
- Web dynamics
- Data quality metrics
- Web spam evolution
- Content evolution on the Web
- Systematic exploitation of Web archives
- Large scale data storage
- Large scale data processing
- Time aware Web archiving
- Data aggregation
- Web trends
- Topic mining
- Terminology evolution
- Community detection and evolution

Important Dates:
- Paper submission deadline: February 22, 2013
- Notification of acceptance: March 11, 2013
- Camera-ready copy deadline: April 3, 2013
- Workshop: May 13, 2013

Please post your submission (up to 8 pages) using the ACM template:
http://www.acm.org/sigs/publications/proceedings-templates
at:
https://www.easychair.org/account/signin.cgi?conf=tempweb2013

Workshop Officials

PC-Chairs and Organizers:
Julien Masanès (Internet Memory Foundation, France and Netherlands)
Marc Spaniol (Max Planck Institute for Informatics, Germany)
Ricardo Baeza-Yates (Yahoo! Research, Spain)

Program Committee:
Eytan Adar (University of Michigan, USA)
Omar Alonso (Microsoft Bing, USA)
Ralitsa Angelova (Google, Switzerland)
Srikanta Bedathur (IIIT-Delhi, India)
András A. Benczúr (Hungarian Academy of Sciences, Hungary)
Klaus Berberich (Max-Planck-Institut für Informatik, Germany)
Roi Blanco (Yahoo! Research, Spain)
Philipp Cimiano (University of Bielefeld, Germany)
Renata Galante (Universidade Federal do Rio Grande do Sul, Brazil)
Adam Jatowt (Kyoto University, Japan)
Scott Kirkpatrick (Hebrew University Jerusalem, Israel)
Frank McCown (Harding University, USA)
Michael Nelson (Old Dominion University, USA)
Kjetil Norvag (Norwegian University of Science and Technology, Norway)
Nikos Ntarmos (University of Patras, Greece)
Philippe Rigaux (Mignify, France)
Thomas Risse (L3S Research Center, Germany)
Rodrygo Luis Teodoro Santos (University of Glasgow, UK)
Torsten Suel (NYU Polytechnic, USA)
Masashi Toyoda (University of Tokyo, Japan)
Gerhard Weikum (Max-Planck-Institut für Informatik, Germany)

Feedback of the 3rd User Workshop - Paris, November 13, 2012

The 3rd LAWA User Workshop, “Big-Data Analytics for the Temporal Web”, was held on November 13, 2012, at the Conservatoire National des Arts et Métiers, CNAM (Paris, France).

The workshop was organized as a one-day event. About 50 researchers attended. Presentations were given by the LAWA project team and the participating guest researchers. Topics included methods, tools, and platforms for big-data analytics, including requirements on and experiences with such technologies.

Specific keynotes were presented by:

- Ricardo Baeza-Yates (Yahoo! Research, Barcelona): “Time in Web IR”
- Wolfgang Nejdl (L3S Research Center, Hanover): “Web Science, Web Analytics and Web Archives - Humans in the Loop”

Ricardo Baeza-Yates  Wolfgang Nejdl

Guest presentations covered a wide spectrum on big-data analytics, such as:

- Frédéric Plissonneau (Technicolor): “Incremental collection of data from specific Web sites: A Cinema dedicated use-case”
- Robert Fischer (SWR): “The SWR/ARD Webarchive and the goals of ARCOMEM”
- Linnet Taylor (Oxford Internet Institute): “Accessing and Using Big Data to Advance Social Science Knowledge”
- Gaël Dias (University of Caen Basse-Normandie): “Temporal Disambiguation of Timely Implicit Queries”
- Marie Guégan (Technicolor): “Like You Said, “This Movie Rocks!” Extracting Post Quotes for Social Network Analysis”
- Hugo C. Huurdeman (University of Amsterdam): “Introducing the WebART project: Web Archive Retrieval Tools”
- Zeynep Pehlivan (University Pierre and Marie Curie Paris): “Temporal Static Index Pruning”
- Gérard Dupont (CASSIDIAN [an EADS company]): “An overview of the OSINT challenges”

The workshop was highly interactive. Discussions with the participating guests showed the great potential of the topics presented. Apart from the explicit Q&A sessions after each presentation, many lively discussions continued during the breaks and the social event. Moreover, the consortium was able to present the first building blocks for temporal analytics in the Virtual Web Observatory. Throughout the workshop’s discussions, entity-driven analytics again emerged as the focal point. It turned out that the next generation of analytics tools should go beyond plain text and help users trace entities over time.

HYENA: Hierarchical Type Classification for Entity Names

The paper "HYENA: Hierarchical Type Classification for Entity Names" by Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol and Gerhard Weikum has been accepted for COLING 2012.

Inferring lexical type labels for entity mentions in texts is an important asset for NLP tasks like semantic role labeling and named entity disambiguation (NED). Prior work has focused on flat and relatively small type systems where most entities belong to exactly one type. This paper addresses very fine-grained types organized in a hierarchical taxonomy, with several hundreds of types at different levels. We present HYENA for multi-label hierarchical classification. HYENA exploits gazetteer features and accounts for the joint evidence for types at different levels. Experiments and an extrinsic study on NED demonstrate the practical viability of HYENA.
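
A toy sketch of top-down, multi-label classification over a type hierarchy is given below; HYENA itself uses trained classifiers with rich gazetteer features, whereas the keyword scorers here are made-up placeholders.

```python
# Hedged sketch of hierarchical multi-label typing (HYENA uses trained
# classifiers with gazetteer features; here each type gets a toy keyword
# scorer). A child type is only tested if its parent type fired, which
# mirrors top-down classification in a taxonomy.

TAXONOMY = {"person": ["politician", "athlete"], "politician": [], "athlete": []}
KEYWORDS = {
    "person": {"he", "she", "born"},
    "politician": {"elected", "senator", "party"},
    "athlete": {"match", "league", "champion"},
}

def classify(mention_context, roots=("person",), threshold=1):
    tokens = set(mention_context.lower().split())
    labels = []
    frontier = list(roots)
    while frontier:
        t = frontier.pop()
        if len(tokens & KEYWORDS[t]) >= threshold:
            labels.append(t)
            frontier.extend(TAXONOMY[t])  # descend only below fired types
    return labels

print(classify("she was elected senator for the party"))
# ['person', 'politician']
```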

COLING 2012 homepage

Big-Data Analytics for the Temporal Web (Paris, November 13, 2012)

International Workshop on Big-Data Analytics for the Temporal Web, Paris, November 13, 2012. Keynotes by: Ricardo Baeza-Yates (Yahoo! Research, Barcelona) and Wolfgang Nejdl (L3S Research Center, Hanover).

The LAWA project organizes a one-day workshop with researchers using (or planning to use) the Web as a corpus for their studies. The focus is on methods, tools, and platforms for big-data analytics, including requirements on and experiences with such technologies. Topics of interest include, but are not limited to: Web dynamics, history, and archives; text mining and content classification; temporal/longitudinal studies; scalable methods (e.g., cloud-based MapReduce); large-scale data storage; and community detection and evolution.

The workshop will have presentations by participating researchers and big-data users, including the LAWA project team. Emphasis will be on experience-sharing and discussing mutual interests in big-data analytics for the temporal Web. The workshop is free of charge and open to the public, but registration is compulsory by sending an email to the organizers.

Agenda (updated)

Venue
Conservatoire National des Arts et Métiers (CNAM)
2, rue Conté, Paris, 3rd arrondissement.
Room: 37.1.50

Directions: enter the courtyard, find access 37; take the staircase to first floor.

CNAM

LAWA 4th Newsletter

LAWA Partners are glad to present the fourth LAWA Newsletter

This edition focuses on the Virtual Web Observatory (VWO). In addition, selected applications are available for testing.

Enjoy reading!

Click here for a live demo!

PRAVDA-live: Interactive Knowledge Harvesting

The paper "PRAVDA-live: Interactive Knowledge Harvesting" by Yafang Wang, Maximilian Dylla, Zhaouchun Ren, Marc Spaniol and Gerhard Weikum has been accepted for the CIKM 2012 demo session.

Acquiring high-quality (temporal) facts for knowledge bases is a labor-intensive process. Although there has been recent progress in the area of semi-supervised fact extraction, these approaches still have limitations, including a restricted corpus, a fixed set of relations to be extracted, or a lack of assessment capabilities. In this paper we introduce PRAVDA-live, a framework that overcomes these limitations and supports the entire pipeline of interactive knowledge harvesting. To this end, our demo exhibits temporal fact extraction from ad-hoc corpus creation, via relation specification, labeling, and assessment, all the way to ready-to-use RDF exports.

CIKM 2012 homepage

LINDA: Distributed Web-of-Data-Scale Entity Matching

The paper "LINDA: Distributed Web-of-Data-Scale Entity Matching" by Christoph Böhm, Gerard de Melo, Felix Naumann and Gerhard Weikum has been accepted for CIKM 2012.

Linked Data has emerged as a powerful way of interconnecting structured data on the Web. However, the cross-linkage between Linked Data sources is not as extensive as one would hope for. In this paper, we formalize the task of automatically creating “sameAs” links across data sources in a globally consistent manner. Our algorithm, presented in a multi-core as well as a distributed version, achieves this link generation by accounting for joint evidence of a match. Experiments confirm that our system scales beyond 100 million entities and delivers highly accurate results despite the vast heterogeneity and daunting scale.

CIKM 2012 homepage

KORE: Keyphrase Overlap Relatedness for Entity Disambiguation

The paper "KORE: Keyphrase Overlap Relatedness for Entity Disambiguation" by Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald and Gerhard Weikum has been accepted for CIKM 2012.

Measuring the semantic relatedness between two entities is the basis for numerous tasks in IR, NLP, and Web-based knowledge extraction. This paper focuses on disambiguating names in a Web or text document by jointly mapping all names onto semantically related entities registered in a knowledge base. To this end, we have developed a novel notion of semantic relatedness between two entities represented as sets of weighted (multi-word) keyphrases, with consideration of partially overlapping phrases. This measure improves the quality of prior link-based models, and also eliminates the need for (usually Wikipedia-centric) explicit interlinkage between entities. Thus, our method is more versatile and can cope with long-tail and newly emerging entities that have few or no links associated with them. For efficiency, we have developed approximation techniques based on min-hash sketches and locality-sensitive hashing. Our experiments on semantic relatedness and on named entity disambiguation demonstrate the superiority of our method compared to state-of-the-art baselines.
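
The min-hash approximation idea can be shown in isolation: signatures over keyphrase sets whose agreement rate estimates Jaccard similarity. KORE's actual measure additionally weights keyphrases and accounts for partial phrase overlap, which this sketch omits.

```python
# Sketch of the min-hash idea used for efficiency (KORE's actual measure
# weights partially overlapping multi-word keyphrases; plain Jaccard over
# keyphrase sets stands in for it here).
import hashlib

def minhash_signature(keyphrases, num_hashes=64):
    sig = []
    for seed in range(num_hashes):
        # Per-seed hash; the minimum over the set gives one signature slot.
        sig.append(min(
            int(hashlib.md5(f"{seed}:{kp}".encode()).hexdigest(), 16)
            for kp in keyphrases))
    return sig

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

e1 = {"apple inc", "steve jobs", "iphone", "cupertino"}
e2 = {"apple inc", "steve jobs", "macintosh", "next"}
print(estimated_jaccard(minhash_signature(e1), minhash_signature(e2)))
# approximates |e1 & e2| / |e1 | e2| = 2/6
```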

CIKM 2012 homepage

Natural Language Questions for the Web of Data

The paper "Natural Language Questions for the Web of Data" by Mohamed Yahya, Klaus Berberich, Shady Elbassuoni Maya Ramanath, Volker Tresp and Gerhard Weikum has been accepted for EMNLP 2012.

The Linked Data initiative comprises structured databases in the Semantic-Web data model RDF. Exploring this heterogeneous data by structured query languages is tedious and error-prone even for skilled users. To ease the task, this paper presents a methodology for translating natural language questions into structured SPARQL queries over linked-data sources.

Our method is based on an integer linear program to solve several disambiguation tasks jointly: the segmentation of questions into phrases; the mapping of phrases to semantic entities, classes, and relations; and the construction of SPARQL triple patterns. Our solution harnesses the rich type system provided by knowledge bases in the web of linked data, to constrain our semantic-coherence objective function. We present experiments on both the question translation and the resulting query answering.

EMNLP 2012 homepage

WAC Summer Workshop

Ongoing research in LAWA will be presented at the WAC (Web Archive Cooperative) Summer Workshop June 29 - July 1, 2012 at Stanford University in Palo Alto, CA, USA.

The Web Archive Cooperative (WAC) organizes a Summer Workshop on “Challenges in Providing Access to the World’s Web Archives” from June 29 - July 1, 2012 at Stanford University in Palo Alto, CA, USA. The LAWA project is proud to contribute to this outstanding event by giving a presentation on its ongoing research.

WAC 2012 homepage

Coupling Label Propagation and Constraints for Temporal Fact Extraction

The paper "Coupling Label Propagation and Constraints for Temporal Fact Extraction" by Y. Wang, M. Dylla, M. Spaniol and G. Weikum has been accepted for ACL 2012.

The Web and digitized text sources contain a wealth of information about named entities such as politicians, actors, companies, or cultural landmarks. Extracting this information has enabled the automated construction of large knowledge bases, containing hundreds of millions of binary relationships or attribute values about these named entities. However, in reality most knowledge is transient, i.e. changes over time, requiring a temporal dimension in fact extraction. In this paper we develop a methodology that interlinks label propagation with constraints for temporal fact extraction. Due to the coupling we gain maximum benefit from both “worlds”. Label propagation “aggressively” gathers fact candidates, while an Integer Linear Program does the “clean-up”. Our method is able to improve on recall while keeping up with precision, which we demonstrate by experiments with biography-style Wikipedia pages and a large corpus of news articles.
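
A minimal sketch of the label-propagation half is given below; the ILP-based "clean-up" stage, the other half of the paper's coupling, is omitted, and the graph weights and seeds are made up.

```python
# Minimal label-propagation sketch (the "aggressive gathering" half; the
# paper's constraint-based ILP clean-up and its custom loss function are
# omitted). Seed nodes carry known labels; scores spread over weighted
# edges between fact candidates.

def propagate(edges, seeds, iterations=10):
    """edges: {node: [(neighbor, weight)]}; seeds: {node: label}."""
    scores = {n: {} for n in edges}  # scores[node] = {label: strength}
    for n, lab in seeds.items():
        scores[n][lab] = 1.0
    for _ in range(iterations):
        new = {n: dict(scores[n]) for n in edges}
        for n, nbrs in edges.items():
            for m, w in nbrs:
                for lab, s in scores[m].items():
                    new[n][lab] = new[n].get(lab, 0.0) + w * s
        # Keep seed labels clamped, normalize the rest.
        for n in new:
            if n in seeds:
                new[n] = {seeds[n]: 1.0}
            else:
                z = sum(new[n].values()) or 1.0
                new[n] = {lab: s / z for lab, s in new[n].items()}
        scores = new
    return scores

edges = {"f1": [("f2", 1.0)], "f2": [("f1", 1.0), ("f3", 0.5)],
         "f3": [("f2", 0.5)]}
print(propagate(edges, seeds={"f1": "worksFor"}))
```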

ACL 2012 homepage

TempWeb 2012 Roundup

The 2nd Temporal Web Analytics Workshop (TempWeb 2012) was successfully staged in Lyon, France on April 17, 2012.

On April 17, 2012 members of the LAWA consortium successfully staged the 2nd Temporal Web Analytics Workshop (TempWeb 2012) in Lyon, France. Again, the workshop was organized in conjunction with the international World Wide Web conference. The workshop attracted even more participants than its previous edition, with a peak of about 50 guests.

After a short introduction, the workshop began with an exciting keynote by Staffan Truvé, CTO of Recorded Future, on “Recorded Future: unlocking the predictive power of the web.” The talk touched on almost all of the research topics that were discussed in detail throughout the entire workshop. Moreover, it confirmed that the workshop is aligned with a hot topic.

The scientific presentations were then organized into two sessions (papers are available in the workshop proceedings published by ACM):

Web Dynamics
Geerajit Rattanaritnont, Masashi Toyoda and Masaru Kitsuregawa: “Analyzing Patterns of Information Cascades based on Users’ Influence and Posting Behaviors”
Masahiro Inoue and Keishi Tajima: “Noise Robust Detection of the Emergence and Spread of Topics on the Web”
Margarita Karkali, Vassilis Plachouras, Costas Stefanatos and Michalis Vazirgiannis: “Keeping Keywords Fresh: A BM25 Variation for Personalized Keyword Extraction”

Speaker at TempWeb 2012

Identifying and leveraging time information
Erdal Kuzey and Gerhard Weikum: “Extraction of Temporal Facts and Events from Wikipedia”
Jannik Strötgen, Omar Alonso and Michael Gertz: “Identification of Top Relevant Temporal Expressions in Documents”
Ricardo Campos, Gaël Dias, Alípio Jorge and Célia Nunes: “Enriching Temporal Query Understanding through Date Identification: How to Tag Implicit Temporal Queries?”

Participants of TempWeb 2012

Again, all talks were of high quality and the discussions were lively. From the panel at the end of the workshop it became clear that the audience is keen on reference data sets provided by LAWA and wants to see a third edition of the workshop to be organized in conjunction with the next World Wide Web conference. So, stay tuned!

Predicting the Evolution of Taxonomy Restructuring in Collective Web Catalogues

The paper "Predicting the Evolution of Taxonomy Restructuring in Collective Web Catalogues" by N. Prytkova, M. Spaniol and G. Weikum has been accepted for WebDB 2012.

Collectively maintained Web catalogues organize links to interesting Web sites into topic hierarchies, based on community input and editorial decisions. These taxonomic systems reflect the interests and diversity of ongoing societal discourses. Catalogues evolve by adding new topics, splitting topics, and other restructuring, in order to capture newly emerging concepts of long-lasting interest. In this paper, we investigate these changes in taxonomies and develop models for predicting such structural changes. Our approach identifies newly emerging latent concepts by analyzing news articles (or social media), by means of a temporal term relatedness graph. We predict the addition of new topics to the catalogue based on statistical measures associated with the identified latent concepts. Experiments with a large news archive corpus demonstrate the high precision of our method, and its suitability for Web-scale application.

WebDB 2012 homepage

Release of Web Analytics Technology V1

LAWA has released its Web Analytics Technology V1.

Implementations of LAWA’s Web Analytics Technology V1 have been driven by the overall aim of developing methods that support typical tasks in temporal Web analytics, such as:
• processing of large scale data sets for integration into the reference collection,
• large scale entity disambiguation,
• creation of temporal indices,
• Web classification.
This software will, throughout the course of the project, become building blocks of LAWA’s Virtual Web Observatory. Backed by requirements monitoring within LAWA’s target user community, we believe that the developed software will make temporal Web analytics easier to understand and explain. To this end, all modules incorporate state-of-the-art information extraction technologies for Web content analytics. The software is available for download in our Software section.

Extraction of Temporal Facts and Events from Wikipedia

The paper "Extraction of Temporal Facts and Events from Wikipedia" by Erdal Kuzey and Gerhard Weikum has been accepted for the second Temporal Web Analytics Workshop (TempWeb 2012) in conjunction with the WWW 2012 conference.

Recently, large-scale knowledge bases have been constructed by automatically extracting relational facts from text. Unfortunately, most of the current knowledge bases focus on static facts and ignore the temporal dimension. However, the vast majority of facts are evolving with time or are valid only during a particular time period. Thus, time is a significant dimension that should be included in knowledge bases.
In this paper, we introduce a complete information extraction framework that harvests temporal facts and events from semi-structured data and free text of Wikipedia articles to create a temporal ontology. First, we extend a temporal data representation model by making it aware of events. Second, we develop an information extraction method which harvests temporal facts and events from Wikipedia infoboxes, categories, lists, and article titles in order to build a temporal knowledge base. Third, we show how the system can use its extracted knowledge for further growing the knowledge base.
We demonstrate the effectiveness of our proposed methods through several experiments. We extracted more than one million temporal facts with precision over 90% for extraction from semi-structured data and almost 70% for extraction from text.
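
To give a flavor of the infobox-harvesting step, here is a toy Python sketch that pulls one temporal fact out of simplified infobox markup; real infoboxes vary widely in attribute names and date formats, which the actual system has to normalize.

```python
# Toy sketch of harvesting a temporal fact from Wikipedia infobox markup
# (the actual system also handles categories, lists, titles, and free
# text; attribute names vary across infobox templates).
import re

INFOBOX = """
| name   = Jacques Chirac
| office = President of France
| term_start = 17 May 1995
| term_end   = 16 May 2007
"""

def extract_term(infobox):
    def grab(field):
        m = re.search(rf"\|\s*{field}\s*=\s*(.+)", infobox)
        return m.group(1).strip() if m else None
    return {
        "entity": grab("name"),
        "relation": "holdsOffice",
        "value": grab("office"),
        "valid_from": grab("term_start"),
        "valid_until": grab("term_end"),
    }

print(extract_term(INFOBOX))
```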

TempWeb 2012 homepage

Big Web Analytics: Toward a Virtual Web Observatory

The paper "Big Web Analytics: Toward a Virtual Web Observatory" by Marc Spaniol, András Benczúr, Zsolt Viharos and Gerhard Weikum has been accepted for the ERCIM News special theme No. 89 on "Big Data".

For decades, compute power and storage have become steadily cheaper, while network speeds, although increasing, have not kept up. The result is that data is becoming increasingly local and thus distributed in nature. It has become necessary to move the software and hardware to where the data resides, and not the reverse.  The goal of LAWA is to create a Virtual Web Observatory based on the rich centralized Web repository of the European Archive. The observatory will enable Web-scale analysis of data, will facilitate large-scale studies of Internet content and will add a new dimension to the roadmap of Future Internet Research – it’s about time!

ERCIM News
ERCIM News (pdf)

LAWA 3rd Newsletter

LAWA Partners are glad to present the third LAWA Newsletter

This edition focuses on the data subset citation method and the integration of the Wikipedia history into the LAWA reference collection. In addition, research areas present updates on their ongoing research.

Enjoy reading!

LAWA Presentation at the Future Internet (FI) week

The LAWA project will be presented in the FIRE session at the Future Internet (FI) week in Aalborg, Denmark, May 9, 2012

A FIRE thematic workshop will be held on May 9, 2012 in Aalborg, Denmark. This event is part of the Future Internet Week 2012. LAWA is proud to give a presentation in the FIRE Workshop on Measurement and Measurement Tools.

Presentation Abstract
The LAWA project develops methods and tools for temporal Web analytics. The focus of development is on semantic and structural analytics of time-versioned textual Web contents. In particular, we are developing methods that enable entity detection and tracking along the time axis as well as temporal studies of large (Web) graphs. To this end, we are also preparing a reference data set and will provide analytics services.

Tracking Entities in Web Archives: The LAWA Project

The paper "Tracking Entities in Web Archives: The LAWA Project" by Marc Spaniol and Gerhard Weikum has been accepted for the European projects track at the WWW 2012 conference.

Web-preservation organizations like the Internet Archive not only capture the history of born-digital content but also reflect the zeitgeist of different time periods over more than a decade. This longitudinal data is a potential gold mine for researchers like sociologists, political scientists, media and market analysts, or experts on intellectual property. The LAWA project (Longitudinal Analytics of Web Archive data) is developing an Internet-based experimental testbed for large-scale data analytics on Web archive collections. Its emphasis is on scalable methods for this specific kind of big-data analytics, and on software tools for aggregating, querying, mining, and analyzing Web contents over long epochs. In this paper, we highlight our research on entity-level analytics in Web archive data, which lifts Web analytics from plain text to the entity level by detecting named entities, resolving ambiguous names, extracting temporal facts, and visualizing entities over time periods. Our results provide key assets for tracking named entities in the evolving Web, news, and social media.

WWW 2012 homepage

Big-Data Analytics for the Temporal Web (Paris, November 15, 2011)

International Workshop on Big-Data Analytics for the Temporal Web, Paris, November 15, 2011. Keynotes by: Roi Blanco (Yahoo! Research) "Searching over the past, present and future" and Pierre Senellart (Télécom ParisTech and Webdam Project) "PARIS: Probabilistic Alignment of Relations, Instances, and Schema"

The LAWA project organizes a one-day workshop with researchers using (or planning to use) the Web as a corpus for their studies. The focus is on methods, tools, and platforms for big-data analytics, including requirements on and experiences with such technologies. Topics of interest include, but are not limited to: Web dynamics, history, and archives; text mining and content classification; temporal/longitudinal studies; scalable methods (e.g., cloud-based MapReduce); large-scale data storage; and community detection and evolution.

The workshop will have presentations by participating researchers and big-data users, including the LAWA project team. Emphasis will be on experience-sharing and discussing mutual interests in big-data analytics for the temporal Web. The workshop is free of charge and open to the public, but registration is compulsory by sending an email to the organizers.

AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables

The paper "AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables" by M. A. Yosef, J. Hoffart, I. Bordino, M. Spaniol and G. Weikum has been accepted for the VLDB 2011 conference.

We present AIDA, a framework and online tool for entity detection and disambiguation. Given a natural-language text or a Web table, we map mentions of ambiguous names onto canonical entities like people or places, registered in a knowledge base like DBpedia, Freebase, or YAGO. AIDA is a robust framework centred around collective disambiguation exploiting the prominence of entities, similarity between the context of the mention and its candidates, and the coherence among candidate entities for all mentions. We have developed a Web-based online interface for AIDA where different formats of inputs can be processed on the fly, returning proper entities and showing intermediate steps of the disambiguation process.
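
The three ingredients (entity prominence, context similarity, candidate coherence) can be caricatured in a few lines of Python. Note that AIDA optimizes these signals jointly over a mention-entity graph, whereas this hypothetical sketch scores each candidate independently.

```python
# Back-of-the-envelope sketch of the three AIDA ingredients (prominence
# prior, mention-context similarity, entity-entity coherence); the real
# system solves a joint optimization over all mentions rather than
# scoring candidates independently as done here.

def disambiguate(mention_context, candidates, other_entities,
                 weights=(0.3, 0.4, 0.3)):
    """candidates: {entity: (prior, keyword_set, related_set)}."""
    ctx = set(mention_context.lower().split())
    w_prior, w_sim, w_coh = weights
    def score(ent):
        prior, keywords, related = candidates[ent]
        sim = len(ctx & keywords) / (len(keywords) or 1)
        coh = len(related & other_entities) / (len(other_entities) or 1)
        return w_prior * prior + w_sim * sim + w_coh * coh
    return max(candidates, key=score)

candidates = {
    "Paris_(France)": (0.9, {"france", "seine"}, {"France"}),
    "Paris_(Texas)":  (0.1, {"texas", "lamar"}, {"Texas"}),
}
print(disambiguate("a city on the seine", candidates, {"France"}))
# Paris_(France)
```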

VLDB 2011 homepage

Harvesting Facts from Textual Web Sources by Constrained Label Propagation

The paper "Harvesting Facts from Textual Web Sources by Constrained Label Propagation" by Y. Wang, B. Yang, L. Qu, M. Spaniol and G. Weikum has been accepted for the CIKM 2011 conference.

There have been major advances on automatically constructing large knowledge bases by extracting relational facts from Web and text sources. However, the world is dynamic: periodic events such as sports competitions need to be interpreted with their respective timepoints, and facts such as coaching a sports team, holding political or business positions, and even marriages do not hold forever and should be augmented by their respective timespans. This paper addresses the problem of automatically harvesting temporal facts with such extended time-awareness. We employ pattern-based gathering techniques for fact candidates and construct a weighted pattern-candidate graph. Our key contribution is a new kind of label propagation algorithm with a judiciously designed loss function, which iteratively processes the graph to label good temporal facts for a given set of target relations. Our experiments with online news and Wikipedia articles demonstrate the accuracy of this method.

CIKM 2011 homepage

LAWA 1st Newsletter

LAWA Partners are glad to present the first LAWA Newsletter

This edition introduces the LAWA research areas by presenting their goals and first research steps undertaken.
Enjoy reading!

Scalable Spatio-temporal Knowledge Harvesting

The paper "Scalable Spatio-temporal Knowledge Harvesting" by Yafang Wang, Bin Yang, Spyros Zoupanos, Marc Spaniol and Gerhard Weikum has been accepted for the WWW 2011 poster track.

Knowledge harvesting enables the automated construction of large knowledge bases. In this work, we made a first attempt to harvest spatio-temporal knowledge from news archives in order to construct the trajectories of individual entities for spatio-temporal entity tracking.
Our approach consists of an entity extraction and disambiguation module and a fact generation module, which together produce pertinent trajectory records from textual sources. The evaluation on twenty years’ worth of New York Times articles showed that our methods are effective and scalable.

1st LAWA user workshop

The 1st LAWA User Workshop will be held on November 30, 2010 at CNAM (Conservatoire National des Arts et Métiers), Paris, France.

The LAWA project on Longitudinal Analytics of Web Archive data will build an Internet-based experimental testbed for large-scale data analytics. Its focus is on developing a sustainable infrastructure, scalable methods, and easily usable software tools for aggregating, querying, and analyzing heterogeneous data at Internet scale. Particular emphasis will be given to longitudinal data analysis along the time dimension for Web data that has been crawled over extended time periods.

By means of this workshop we want to present LAWA to a scientific audience and discuss mutual interests in Web archive analytics. In addition, we want to establish links to related projects. The preliminary agenda is available at: (http://www.lawa-project.eu/events/user-workshop-2010.pdf).

Registration for the event is free of charge but compulsory, and has to be made online by sending an email to the organizers.
Due to the limited capacity of the venue, it is recommended to register as soon as possible. Applications will be handled on a first come, first served basis.


Longitudinal Analytics on Web Archive Data: It’s About Time!

The joint vision paper “Longitudinal Analytics on Web Archive Data: It’s About Time!” by the LAWA consortium has been accepted at CIDR 2011

ABSTRACT

Organizations like the Internet Archive have been capturing Web contents over decades, building up huge repositories of time-versioned pages. The timestamp annotations and the sheer volume of multi-modal content constitute a gold mine for analysts of all sorts, across different application areas, from political analysts and marketing agencies to academic researchers and product developers. In contrast to traditional data analytics on click logs, the focus is on longitudinal studies over very long horizons. This longitudinal aspect affects and concerns all data and metadata, from the content itself to the indices and the statistical metadata maintained for it. Moreover, advanced analysts prefer to deal with semantically rich entities like people, places, organizations, and ideally relationships such as company acquisitions, instead of, say, Web pages containing such references. For example, tracking and analyzing a politician’s public appearances over a decade is much harder than mining frequently used query words or frequently clicked URLs for the last month. The huge size of Web archives adds to the complexity of this daunting task. This paper discusses key challenges, which we intend to take up, posed by this kind of longitudinal analytics: time-travel indexing and querying, entity detection and tracking along the time axis, algorithms for advanced analyses and knowledge discovery, and scalability and platform issues.