Entities, such as people, places, products, etc., exist in knowledge bases and linked data, on one hand, and in web pages, news articles, and social media, on the other hand. Entity markup, like Named Entities Recognition and Disambiguation (NERD), is the essential means for adding semantic value to unstructured web contents and this way enabling the linkage between unstructured and structured data and knowledge collections. A major challenge in this endeavor lies in the dynamics of the digital contents about the world, with new entities emerging all the time. In this paper, we propose a crowdsourced framework for NERD, specifically addressing the challenge of emerging entities in social media. Our approach combines NERD techniques with the detection of entity alias names and with co-reference resolution in texts. We propose a linking-game based crowdsourcing system for this combined task, and we report on experimental insights with this approach and on lessons learned.
Natural-language question answering is a convenient way for humans to discover relevant information in structured Web data such as knowledge bases or Linked Open Data sources. This paper focuses on data with a temporal dimension, and discusses the problem of mapping natural-language questions into extended SPARQL queries over RDF-structured data. We specifically address the issue of disambiguating temporal phrases in the question into temporal entities like dates and named events, and temporal predicates. For the situation where the data has only partial coverage of the time dimension but is augmented with textual descriptions of entities and facts, we also discuss how to generate queries that combine structured search with keyword conditions.
We investigate the notion of temporal diversity, bringing together two recently active threads of research, namely temporal ranking and diversification of search results. A novel method is developed to determine search results consisting of documents that are relevant to the query and were published at diverse times of interest to the query. Preliminary experiments on twenty years’ worth of newspaper articles from The New York Times demonstrate characteristics of our method and compare it against two baselines.
We exploit the connectivity structure of edits in Wikipedia to identify recent events that happened at a given time via identifying bursty changes in linked articles around a specified date. Our key results include algorithms for node relevance ranking in temporal subgraph and neighborhood selection based on measurements for structural changes in time over the Wikipedia link graph. We measure our algorithms over manually annotated queries with relevant events in September and October 2011; we make the assessment publicly available (https://dms.sztaki.hu/en/download/wimmut-searching-and-navigating-wikipedia). While our methods were tested over clean Wikipedia metadata, we believe the methods are applicable to general temporal Web collections as well.
Recent research has shown progress in achieving high-quality, very fine-grained type classification in hierarchical taxonomies. Within such a multi-level type hierarchy with several hundreds of types at different levels, many entities naturally belong to multiple types. In order to achieve high-precision in type classification, current approaches are either limited to certain domains or require time consuming multistage computations. As a consequence, existing systems are incapable of performing ad-hoc type classification on arbitrary input texts. In this demo, we present a novel Web-based tool that is able to perform domain independent entity type classification under real time conditions. Due to its efficient implementation and compacted feature representation, the system is able to process text inputs on-the-fly by achieving equally high precision as leading state-of-the-art implementations. Our system offers an online interface where natural-language text can be inserted, which returns lexical type labels for entity mentions. Further the user interface allows users to explore the types assigned to text mentions by visualizing and navigating along the type-hierarchy.
The LAWA project investigates large-scale Web (archive) data along the temporal dimension. As a use case, we are studying Knowledge Linking for Online Statistics.
Statistic portals such as eurostat’s “Statistics Explained” (http://epp.eurostat.ec.europa.eu/statistics_explained/index.php/Main_Page) provide a wealth of articles constituting an encyclopedia of European statistics. Together with its statistical glossary, the huge amount of numerical data comes with a well-defined thesaurus. However, this data is not directly at hands, when browsing Web data covering the topic. For instance, when reading news articles about the debate on renewable energy across Europe after the earthquake in Japan and the Fukushima accident, one would ideally be able to understand these discussions based on statistical evidence.
We believe that Internet contents, captured in Web archives and reflected and aggregated in the Wikipedia history, can be better understood when linked with online statistics. To this end, we aim at semantically enriching and analyzing Web (archive) data to narrow and ultimately bridge the gap between numerical statistics and textual media like news or online forums. The missing link and key to this goal is the discovery and analysis of entities and events in Web (archive) contents. This way, we can enrich Web pages, e.g. by a browser plug-in, with links to relevant statistics (e.g. eurostat pages). Raising data analytics to the entity-level also enables understanding the impact of societal events and their perception in different cultures and economies.
Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are suffciently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose PFSM, a scalable algorithm for frequent sequence mining on MapReduce. PFSM can handle so-called “gap constraints’‘, which can be used to limit the output to a controlled set of frequent sequences. At its heart, PFSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of $w$-equivalency, which is a generalization of the notion of a “projected database’’ used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our extensive experimental study in the context of text mining suggests that PFSM is significantly more efficient and scalable than alternative approaches.
Web archives are valuable resources. However, they are characterized by a high degree of redundancy. Not only does this redundancy waste computing resources, but it also deteriorates users’ experience, since they have to sift through and weed out redundant content. Existing methods focus on identifying near-duplicate documents, assuming a universal notion of redundancy, and can thus not adapt to user-specific requirements such as a preference for more recent or diversely opinionated content.
In this work, we propose an approach that equips users with fine-grained control over what they consider redundant. Users thus specify a binary coverage relation between documents that can factor in documents’ contents as well as their meta data. Our approach then determines a minimum-cardinality cover set of non-redundant documents. We describe how this can be done at scale using MapReduce as a platform for distributed data processing. Our prototype implementation has been deployed on a real-world web archive and we report experiences from this case study.
Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. In this work, we study how n-gram statistics, optionally restricted by a maximum n-gram length and minimum collection frequency, can be computed efficiently harnessing MapReduce for distributed data processing. We describe different algorithms, ranging from an extension of word counting, via methods based on the Apriori principle, to a novel method Suffix-sigma that relies on sorting and aggregating suffixes. We examine possible extensions of our method to support the notions of maximality/closedness and to perform aggregations beyond occurrence counting. Assuming Hadoop as a concrete MapReduce implementation, we provide insights on an efficient implementation of the methods. Extensive experiments on The New York Times Annotated Corpus and ClueWeb09 expose the relative benefits and trade-offs of the methods.
Inferring lexical type labels for entity mentions in texts is an important asset for NLP tasks like semantic role labeling and named entity disambiguation (NED). Prior work has focused on flat and relatively small type systems where most entities belong to exactly one type. This paper addresses very fine-grained types organized in a hierarchical taxonomy, with several hundreds of types at different levels. We present HYENA for multi-label hierarchical classification. HYENA exploits gazetteer features and accounts for the joint evidence for types at different levels. Experiments and an extrinsic study on NED demonstrate the practical viability of HYENA.
Acquiring high-quality (temporal) facts for knowledge bases is a labor-intensive process. Although there has been recent progress in the area of semi-supervised fact extraction, these approaches still have limitations, including a restricted corpus, a fixed set of relations to be extracted or a lack of assessment capabilities. In this paper we introduce PRAVDA-live, a framework that overcomes these limitations and supports the entire pipeline of interactive knowledge harvesting. To this end, our demo exhibits temporal fact extraction from ad-hoc corpus creation, via relation specification, labeling and assessment all the way to ready-to-use RDF exports.
Linked Data has emerged as a powerful way of interconnecting structured data on the Web. However, the crosslinkage between Linked Data sources is not as extensive as one would hope for. In this paper, we formalize the task of automatically creating “sameAs” links across data sources in a globally consistent manner. Our algorithm, presented in a multi-core as well as a distributed version, achieves this link generation by accounting for joint evidence of a match. Experiments conrm that our system scales beyond 100 million entities and delivers highly accurate results despite the vast heterogeneity and daunting scale.
Measuring the semantic relatedness between two entities is the basis for numerous tasks in IR, NLP, and Web-based knowledge extraction. This paper focuses on disambiguating names in a Web or text document by jointly mapping all names onto semantically related entities registered in a knowledge base. To this end, we have developed a novel notion of semantic relatedness between two entities represented as sets of weighted (multi-word) keyphrases, with consideration of partially overlapping phrases. This measure improves the quality of prior link-based models, and also eliminates the need for (usually Wikipedia-centric) explicit interlinkage between entities. Thus, our method is more versatile and can cope with long-tail and newly emerging entities that have few or no links associated with them. For efficiency, we have developed approximation techniques based on min-hash sketches and locality-sensitive hashing. Our experiments on semantic relatedness and on named entity disambiguation demonstrate the superiority of our method compared to state-of-the-art baselines.
Knowledge-sharing communities like Wikipedia and knowledge bases like Freebase are expected to capture the latest facts about the real world. However, neither of these can keep pace with the rate at which events happen and new knowledge is reported in news and social media. To narrow this gap, we propose an approach to accelerate the online maintenance of knowledge bases.
Our method, coined LAIKA, is based on link prediction. Wikipedia editions in different languages, Wikinews, and other news media come with extensive but noisy interlinkage at the entity level. We utilize this input for recommending, for a given Wikipedia article or knowledge-base entry, new categories, related entities, and cross-lingual interwiki links. LAIKA constructs a large graph from the available input and uses link-overlap measures and random-walk techniques to generate missing links and rank them for recommendations. Experiments with a very large graph from multilingual Wikipedia editions demonstrate the accuracy of our link predictions.
The Linked Data initiative comprises structured databases in the Semantic-Web data model RDF. Exploring this heterogeneous data by structured query languages is tedious and error-prone even for skilled users. To ease the task, this paper presents a methodology for translating natural language questions into structured SPARQL queries over linked-data sources.
Our method is based on an integer linear program to solve several disambiguation tasks jointly: the segmentation of questions into phrases; the mapping of phrases to semantic entities, classes, and relations; and the construction of SPARQL triple patterns. Our solution harnesses the rich type system provided by knowledge bases in the web of linked data, to constrain our semantic-coherence objective function. We present experiments on both the question translation and the resulting query answering.
The Web and digitized text sources contain a wealth of information about named entities such as politicians, actors, companies, or cultural landmarks. Extracting this information has enabled the automated construction of large knowledge bases, containing hundred millions of binary relationships or attribute values about these named entities. However, in reality most knowledge is transient, i.e. changes over time, requiring a temporal dimension in fact extraction. In this paper we develop a methodology that interlinks label propagation with constraints for temporal fact extraction. Due to the coupling we gain maximum benefit from both “worlds”. Label propagation “aggressively” gathers fact candidates, while an Integer Linear Program does the “clean-up”. Our method is able to improve on recall while keeping up with precision, which we demonstrate by experiments with biography-style Wikipedia pages and a large corpus of news articles.
Collectively maintained Web catalogues organize links to interesting Web sites into topic hierarchies, based on community input and editorial decisions. These taxonomic systems reflect the interests and diversity of ongoing societal discourses. Catalogues evolve by adding new topics, splitting topics, and other restructuring, in order to capture newly emerging concepts of long-lasting interest. In this paper, we investigate these changes in taxonomies and develop models for predicting such structural changes. Our approach identifies newly emerging latent concepts by analyzing news articles (or social media), by means of a temporal term relatedness graph. We predict the addition of new topics to the catalogue based on statistical measures associated with the identified latent concepts. Experiments with a large news archive corpus demonstrate the high precision of our method, and its suitability for Web-scale application.
Recently, large-scale knowledge bases have been constructed by automatically extracting relational facts from text. Unfortunately, most of the current knowledge bases focus on static facts and ignore the temporal dimension. However, the vast majority of facts are evolving with time or are valid only during a particular time period. Thus, time is a significant dimension that should be included in knowledge bases.
In this paper, we introduce a complete information extraction framework that harvests temporal facts and events from semi-structured data and free text of Wikipedia articles to create a temporal ontology. First, we extend a temporal data representation model by making it aware of events. Second, we develop an information extraction method which harvests temporal facts and events from Wikipedia infoboxes, categories, lists, and article titles in order to build a temporal knowledge base. Third, we show how the system can use its extracted knowledge for further growing the knowledge base.
We demonstrate the effectiveness of our proposed methods through several experiments. We extracted more than one million temporal facts with precision over 90% for extraction from semi-structured data and almost 70% for extraction from text.
For decades, compute power and storage have become steadily cheaper, while network speeds, although increasing, have not kept up. The result is that data is becoming increasingly local and thus distributed in nature. It has become necessary to move the software and hardware to where the data resides, and not the reverse. The goal of LAWA is to create a Virtual Web Observatory based on the rich centralized Web repository of the European Archive. The observatory will enable Web-scale analysis of data, will facilitate large-scale studies of Internet content and will add a new dimension to the roadmap of Future Internet Research – it’s about time!
Web-preservation organization like the Internet Archive not only capture the history of born-digital content but also reflect the zeitgeist of different time periods over more than a decade. This longitudinal data is a potential gold mine for researchers like sociologists, politologists, media and market analysts, or experts on intellectual property. The LAWA project (Longitudinal Analytics of Web Archive data) is developing an Internet-based experimental testbed for large-scale data analytics on Web archive collections. Its emphasis is on scalable methods for this specific kind of big-data analytics, and software tools for aggregating, querying, mining, and analyzing Web contents over long epochs. In this paper, we highlight our research on entity-level analytics in Web archive data, which lifts Web analytics from plain text to the entity-level by detecting named entities, resolving ambiguous names, extracting temporal facts and visualizing entities over time periods. Our results provide key assets for tracking named entities in the evolving Web, news, and social media.
Entity resolution (ER), deduplication or record linkage is a computationally hard problem with distributed implementations typically relying on shared memory architectures. We show simple reductions to communication complexity and data streaming lower bounds to illustrate the difficulties with a distributed implementation: If the data records are split among servers, then basically all data must be transferred.
As a key result, we demonstrate that ER can be solved using algorithms with three different distributed computing paradigms:
• Distributed key-value stores;
• Bulk Synchronous Parallel.
We measure our algorithms in the real-world scenario of an insurance customer master data integration procedure. We show how the algorithms can be modified for non-Boolean fuzzy merge functions and similarity indexes.
We present AIDA, a framework and online tool for entity detection and disambiguation. Given a natural-language text or a Web table, we map mentions of ambiguous names onto canonical entities like people or places, registered in a knowledge base like DBpedia, Freebase, or YAGO. AIDA is a robust framework centred around collective disambiguation exploiting the prominence of entities, similarity between the context of the mention and its candidates, and the coherence among candidate entities for all mentions. We have developed a Web-based online interface for AIDA where different formats of inputs can be processed on the fly, returning proper entities and showing intermediate steps of the disambiguation process.
There have been major advances on automatically constructing large knowledge bases by extracting relational facts from Web and text sources. However, the world is dynamic: periodic events such as sports competitions need to be interpreted with their respective timepoints, and facts such as coaching a sports team, holding political or business positions, and even marriages do not hold forever and should be augmented by their respective timespans. This paper addresses the problem of automatically harvesting temporal facts with such extended time-awareness. We employ pattern-based gathering techniques for fact candidates and construct a weighted pattern-candidate graph. Our key contribution is a new kind of label propagation algorithm with a judiciously designed loss function, which iteratively processes the graph to label good temporal facts for a given set of target relations. Our experiments with online news and Wikipedia articles demonstrate the accuracy of this method.
Knowledge harvesting enables the automated construction of large knowledge bases. In this work, we made a first attempt to harvest the spatio-temporal knowledge from news archives to construct the trajectories of individual entities for spatio-temporal entity tracking.
Our approach consists of an entity extraction and disambiguation module and a fact generation module which produce pertinent trajectory records from textual sources. The evaluation on the 20 years’ New York Times corpus showed that our methods are effective and scalable.
Organizations like the Internet Archive have been capturing Web contents over decades, building up huge repositories of time-versioned pages. The timestamp annotations and the sheer volume of multi-modal content constitutes a gold mine for analysts of all sorts, across different application areas, from political analysts and marketing agencies to academic researchers and product developers. In contrast to traditional data analytics on click logs, the focus is on longitudinal stud- ies over very long horizons. This longitudinal aspect affects and concerns all data and metadata, from the content itself, to the indices and the statistical metadata maintained for it. Moreover, advanced analysts prefer to deal with semantically rich entities like people, places, organizations, and ideally relationships such as company acquisitions, instead of, say, Web pages containing such references. For example, tracking and analyzing a politician’s public appearances over a decade is much harder than mining frequently used query words or frequently clicked URLs for the last month. The huge size of Web archives adds to the complexity of this daunting task. This paper discusses key challenges, that we intend to take up, which are posed by this kind of longitudinal analytics: time-travel indexing and querying, entity detection and track- ing along the time axis, algorithms for advanced analyses and knowledge discovery, and scalability and platform issues.