Longitudinal Analytics
of Web Archive Data

 

Wide-Area Hadoop (WAH) Technology V2

Rank (top-k) Join Indexing and Querying for HBase

We contribute a set of indices and accompanying query-processing algorithms that tackle the efficient processing of rank (i.e., top-k) join queries over HBase. We assume that each indexed tuple has a score attribute (or a predefined scoring function) and, in line with related research, that the aggregate scoring function of the join is monotonic. We provide both MapReduce-based and coordinator-based solutions, which achieve orders-of-magnitude better performance than Pig, Hive, and Impala in terms of both query processing time and network bandwidth consumption.
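For readers unfamiliar with rank joins, the following is a minimal, self-contained sketch of the underlying idea (in the style of hash rank-join), not the contributed indexes or the MapReduce/coordinator algorithms; class and method names are illustrative, and summation stands in for an arbitrary monotonic aggregate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Minimal sketch of the rank-join idea; not the released code. */
public class RankJoinSketch {

    static final class Tuple {
        final String key;      // join attribute
        final double score;    // score attribute (or value of a predefined score function)
        Tuple(String key, double score) { this.key = key; this.score = score; }
    }

    static final class JoinResult {
        final Tuple left, right;
        final double score;    // monotonic aggregate: here simply the sum of the two scores
        JoinResult(Tuple left, Tuple right) {
            this.left = left; this.right = right; this.score = left.score + right.score;
        }
    }

    /** Both inputs must be sorted by score in descending order. */
    static List<JoinResult> topK(List<Tuple> left, List<Tuple> right, int k) {
        Map<String, List<Tuple>> seenLeft = new HashMap<>();
        Map<String, List<Tuple>> seenRight = new HashMap<>();
        PriorityQueue<JoinResult> candidates =
                new PriorityQueue<>((a, b) -> Double.compare(b.score, a.score)); // max-heap
        List<JoinResult> output = new ArrayList<>();

        double topL = left.isEmpty() ? 0 : left.get(0).score;
        double topR = right.isEmpty() ? 0 : right.get(0).score;
        double lastL = topL, lastR = topR;
        int i = 0, j = 0;

        while (output.size() < k && (i < left.size() || j < right.size())) {
            boolean pullLeft = i < left.size() && (j >= right.size() || i <= j);
            if (pullLeft) {
                Tuple t = left.get(i++);
                lastL = t.score;
                seenLeft.computeIfAbsent(t.key, x -> new ArrayList<>()).add(t);
                for (Tuple r : seenRight.getOrDefault(t.key, Collections.<Tuple>emptyList()))
                    candidates.add(new JoinResult(t, r));
            } else {
                Tuple t = right.get(j++);
                lastR = t.score;
                seenRight.computeIfAbsent(t.key, x -> new ArrayList<>()).add(t);
                for (Tuple l : seenLeft.getOrDefault(t.key, Collections.<Tuple>emptyList()))
                    candidates.add(new JoinResult(l, t));
            }
            // Monotonicity bounds the score of any join result not yet seen.
            double threshold = Math.max(topL + lastR, lastL + topR);
            while (output.size() < k && !candidates.isEmpty()
                    && candidates.peek().score >= threshold) {
                output.add(candidates.poll());
            }
        }
        while (output.size() < k && !candidates.isEmpty()) {
            output.add(candidates.poll());  // inputs exhausted: remaining candidates are final
        }
        return output;
    }
}
```

The early-termination test is what makes the sorted inputs and the monotonic aggregate pay off: buffered results are reported as soon as no unseen combination can score higher.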

Readme

Jar files

Web Analytics Technology V2

AIDA (Accurate onlIne DisAmbiguation of entities)

AIDA is a framework and online tool for entity detection and disambiguation. Given a natural-language text, it maps mentions of ambiguous names onto canonical entities (e.g., individual people or places) registered in the YAGO2 knowledge base. This knowledge is useful for multiple tasks, for example:

  • Building an entity index. This enables one kind of semantic search: retrieving all documents in which a given entity is mentioned.
  • Extracting knowledge about the entities, for example relations between entities mentioned in the text.

YAGO2 entities are in one-to-one correspondence with Wikipedia pages, so each disambiguated entity also denotes a Wikipedia URL.

Note that AIDA does not annotate common words (like song, musician, idea, ...). Also, AIDA does not identify mentions that have no candidate entity in the repository. Once a name is present in the dictionary of candidates for surface strings, AIDA will map it to the best-scoring candidate, even if the correct entity is not in the repository.
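To make the last remark concrete, the sketch below imitates the dictionary-based candidate lookup in a few lines of plain Java; it only illustrates the behaviour described above and is not AIDA's actual API, which additionally takes context into account when ranking candidates. The dictionary contents are toy examples.

```java
import java.util.Map;
import java.util.Optional;

/** Illustration only (not AIDA's API): once a surface string has candidates in the dictionary,
 *  the best-scoring candidate is returned, even if the truly correct entity is missing from the
 *  repository; names without any candidates are left unannotated. */
public class MentionMappingSketch {

    // Toy candidate dictionary: surface string -> (entity, score), e.g. a mention prior.
    static final Map<String, Map<String, Double>> DICTIONARY = Map.of(
            "Page", Map.of("Jimmy_Page", 0.6, "Larry_Page", 0.4));

    /** Returns the best candidate entity for a mention, or empty if the name is unknown. */
    static Optional<String> disambiguate(String mention) {
        Map<String, Double> candidates = DICTIONARY.get(mention);
        if (candidates == null) return Optional.empty();   // no candidate entity in the repository
        return candidates.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        System.out.println(disambiguate("Page"));   // Optional[Jimmy_Page]
        System.out.println(disambiguate("idea"));   // Optional.empty: no candidates, not annotated
    }
}
```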

Download AIDA

n-Gram Statistics
Statistics about n-grams (i.e., sequences of contiguous words or other tokens in text documents or other string data) are an important building block in information retrieval and natural language processing. We provide four methods (NGSuffixSigma, NGNaive, NGAprioriScan, NGAprioriIndex) to compute n-gram statistics, optionally restricted by a maximum n-gram length and minimum collection frequency, by efficiently harnessing MapReduce for distributed data processing.

We address the problem of efficiently computing n-gram statistics on MapReduce platforms. We allow for a restriction of the n-gram statistics to be computed by a maximum length sigma and a minimum collection frequency tau. Only n-grams consisting of up to sigma words and occurring at least tau times in the document collection are thus considered. While this can be seen as a special case of frequent sequence mining, our experiments show that MapReduce adaptations of Apriori-based methods do not perform well – in particular when long and/or less frequent n-grams are of interest. In this light, we develop our novel method Suffix-sigma that is based on ideas from string processing. Our method makes thoughtful use of MapReduce’s grouping and sorting functionality. It keeps the number of records that have to be sorted by MapReduce low and exploits their order to achieve a compact main-memory footprint, when determining collection frequencies of all n-grams considered.
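For reference, the sketch below shows a naive MapReduce job in the spirit of NGNaive: the mapper emits every n-gram of up to sigma words and the reducer sums occurrences, dropping n-grams below the minimum collection frequency tau. It only illustrates the problem statement (the Suffix-sigma method in the release avoids materializing all these records); the values of sigma and tau are arbitrary example values, and one document per input line is assumed.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Naive n-gram counting baseline (illustration only, not the Suffix-sigma method). */
public class NaiveNGramJob {

    private static final int SIGMA = 5;   // maximum n-gram length (example value)
    private static final long TAU = 10;   // minimum collection frequency (example value)

    /** Emits every n-gram of 1..SIGMA words for each input line. */
    public static class NGramMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text ngram = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] words = value.toString().trim().split("\\s+");
            for (int start = 0; start < words.length; start++) {
                StringBuilder sb = new StringBuilder();
                for (int len = 1; len <= SIGMA && start + len <= words.length; len++) {
                    if (len > 1) sb.append(' ');
                    sb.append(words[start + len - 1]);
                    ngram.set(sb.toString());
                    context.write(ngram, ONE);
                }
            }
        }
    }

    /** Sums occurrences per n-gram and keeps only those with collection frequency >= TAU. */
    public static class NGramReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text ngram, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            if (sum >= TAU) context.write(ngram, new LongWritable(sum));
        }
    }
}
```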

Download n-Gram Statistics

LILIANA (LIve LInking for online statistic ANAlytics)
Statistics portals such as eurostat's "Statistics Explained" provide a wealth of articles constituting an encyclopedia of European statistics. Together with its statistical glossary, this huge amount of numerical data comes with a well-defined thesaurus. However, the data is not directly at hand when browsing Web content covering related topics. For instance, when reading news articles about the debate on renewable energy across Europe after the earthquake in Japan and the Fukushima accident, one would ideally be able to ground these discussions in statistical evidence. To this end, we aim at semantically enriching and analyzing Web (archive) data to narrow and ultimately bridge the gap between numerical statistics and textual media such as news or online forums. The missing link and key to this goal is the discovery and analysis of entities and events in Web (archive) contents. This way, we can enrich Web pages, e.g. via a browser plug-in, with links to relevant statistics (e.g. eurostat pages). Raising data analytics to the entity level also enables understanding the impact of societal events and their perception across different cultures and economies.

LILIANA has been released as a browser plug-in. The plug-in lets users select text on a Web page; the selection is sent to our disambiguation and link-recommendation server, and the outcome of live linking is the highest-ranked statistical article.

Download the LILIANA Browser Plug-in

HYENA (Hierarchical tYpe classification for Entity NAmes)
Entity types provide contextual information about the semantics of Web (archive) contents. For instance, one might want to understand whether or not politicians and sportspersons co-occur more frequently in the media during election campaigns. Revealing such mutual dependencies among entities requires efficient and accurate entity typing. To this end, HYENA allows end-users to enter natural-language text and obtain semantic type labels for the entity mentions it contains. In addition, the user interface allows interactive exploration of the assigned types by visualizing and navigating along the type hierarchy.

HYENA has been released as a browser plug-in. The plug-in allows users to select text in Web contents and directs the user to the visualization and navigation interface.

Download the HYENA Browser Plug-in

Yammut
The YAGO entity browser (Yammut) connects full-text search with entity visualization and exploration. In the LAWA project we have enriched Web documents by recognizing and resolving the entities appearing in their content. Given the automatically recognized entities, users may want to learn more about them: search for other documents mentioning the same entity, see the relations among all entities mentioned in a document, or explore closely related entities. Yammut supports all of these tasks.

Download the Yammut.war

Temporal Trend Analyzer
The Temporal Trend Analyzer tool provides real-time term co-occurrence counts over the results of ad-hoc queries. Given a topic characterized by a set of keywords and a time interval, we find an evolving set of words (a tag cloud) that have the highest increase in frequency within the topic.
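As a rough illustration of the computation (the released tool works in real time over query results and its internals differ), one can rank terms by their increase in frequency between a baseline window and the window of interest; all data in the example is made up.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustration of the trend idea: rank terms by frequency increase between two windows. */
public class TrendSketch {

    /** Returns the k terms whose frequency grew the most from the previous to the current window. */
    static List<String> risingTerms(Map<String, Long> previousWindow,
                                    Map<String, Long> currentWindow, int k) {
        return currentWindow.entrySet().stream()
                .sorted(Comparator.comparingLong((Map.Entry<String, Long> e) ->
                        e.getValue() - previousWindow.getOrDefault(e.getKey(), 0L)).reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> before = Map.of("energy", 40L, "nuclear", 10L, "solar", 5L);
        Map<String, Long> after  = Map.of("energy", 45L, "nuclear", 60L, "solar", 30L);
        System.out.println(risingTerms(before, after, 2));   // [nuclear, solar]
    }
}
```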

Download the Temporal Trend Analyzer jar-file

 

Source release: HBase Indexes for Interval Queries

We have released the source code for the HBase Indexes for Interval Queries.

The MRSegmentTree and Endpoints Indices focus on providing support for the efficient processing of analytics and time-travel queries over Web archives. Each Web page has several incarnations at different time points, with any two subsequent such points defining an interval. The interval query types supported by the indexes created in this software include containment, stabbing, and intersection queries.
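For reference, the three query types can be stated as simple predicates over a version interval; the sketch below only spells out their semantics and is not the MRSegmentTree or Endpoints index code, which answer such queries without scanning all intervals.

```java
/** Illustration of the supported interval query semantics on [begin, end) version intervals. */
public class IntervalQueries {

    static final class Interval {
        final long begin, end;            // two subsequent capture time points of a page
        Interval(long begin, long end) { this.begin = begin; this.end = end; }
    }

    /** Stabbing query: does the interval contain the time point t? */
    static boolean stabs(Interval i, long t) {
        return i.begin <= t && t < i.end;
    }

    /** Containment query: is the interval fully contained in the query range [qBegin, qEnd)? */
    static boolean containedIn(Interval i, long qBegin, long qEnd) {
        return qBegin <= i.begin && i.end <= qEnd;
    }

    /** Intersection query: does the interval overlap the query range [qBegin, qEnd)? */
    static boolean intersects(Interval i, long qBegin, long qEnd) {
        return i.begin < qEnd && qBegin < i.end;
    }
}
```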

IntervalIndices.zip

Source code

Research Driven Crawling and Storage Technology V2


A new service is open for LAWA users to execute crawls and retrieve the results as WARC files containing the collected Web resources. Access to this service is restricted to registered LAWA users and is subject to limitations regarding the size and scope of crawls.

Storage technology V2

We extended the LAWA storage model to multiple data centers and implemented a dedicated replication protocol that coordinates the asynchronous copying of collections between remote data centers (DCs). The protocol is described in Deliverable 2.4, and we provide the code as an example Hadoop implementation.

Download the code zipped (8 kb)

RADAR service updated

A new version of RADAR has been released.

We have updated our web service to fix a bug and to make the API more convenient. Regarding the API changes, please check the included Example Java class and the README in the zip file provided in the software section. The web service will still work with the old API, but we encourage you to use the new one as it is cleaner. The major new feature is that you can get a ranked list of entities with the PriorOnlyDisambiguationSettings and the LocalDisambiguationSettings.

We also fixed a bug that kept the graph algorithm from being executed properly, which caused it to return the same results as a disambiguation using only the prior and local similarity. If you ran any experiments to judge the quality of RADAR, please rerun them.

Wide-Area Hadoop (WAH) Technology V1

This release of the Wide-Area Hadoop (WAH) Technology includes improvements, tools, and tool integrations on top of the Cloudera CDH3 Hadoop distribution. Each feature has its own document detailing it, written by the author(s) of that feature. There are four main features in this release:

Hadoop Network Monitoring Tool (Hadoop Kelvin)

This tool monitors network flows between cluster machines. The collected data is accessible via a network server. The tool also logs all transmissions between the machines in the Hadoop cluster (it is integrated with Hadoop itself and thus only tracks Hadoop data traffic) into a log file, which can be more convenient for basic use (e.g. analyzing a job's network traffic in an offline, post-mortem manner) as it requires no coding. A basic visualizer application is included, which can be dissected to understand how to access the data via the programming interface.

Hadoop_Kelvin_Overview.pdf

Hadoop Block Allocation Management
This tool enables the modification of the mechanism by which a Hadoop cluster allocates blocks of data to machines. This can be used to vary the distribution of data between the cluster machines in order to explore the effect of block placement on job performance. The mechanism is integrated with the core Hadoop code and is activated with a simple configuration file.

Custom_Block_Distribution_On_HDFS.pdf

Jift
A small tool which takes advantage of Cloudera's Thrift plugins for Hadoop to access cluster statistics in a simple, remote fashion (Java + Thrift = Jift). A client is implemented which enables the retrieval of statistics on jobs and on HDFS itself. A sample use of the client is also included: the program we have been using to determine the distribution of load among the reducers on the cluster, and which we plan to use to analyze the effects of different key-to-reducer allocation strategies.

Jift_Overview.pdf

Hadoop Munin integration
This is a set of instructions for integrating Hadoop with Munin to enable simple monitoring of cluster health and performance. Munin itself is not included in this release and must be installed separately.

Munin_Integration.pdf


hadoop-lawa-cdh3-dev0.zip

The release tarball includes a copy of our test cluster configuration to make setup easier on the user end, as things can get a bit tricky. The important parts, such as the Munin-required configuration and the additional Thrift plugins, are already configured. Aside from this additional configuration, the cluster itself needs to be set up in the same fashion as any CDH3 cluster.
For questions, help, or bug reports, please contact us.

 

Research Driven Crawling and Storage Technology V1

Extraction Framework V1

The extraction framework provides a general mechanism to run extractors on Web collections. It consists of three libraries:
- Core Library (IMHBaseCore.jar): implements Collection and Resource classes, abstract classes to specify the behavior of Filters, Extractors and Aggregators, and utility classes to create and run extraction jobs.
- Extractor Library (IMExtraction.jar): extractors supplied by Internet Memory
- Utilities (IMHBaseSamples.jar): a set of utility programs to access collections and views, and to apply views to collections.

All the jar files can be obtained from the following repository.

In addition, new extractors can be integrated into the framework. An extractor is a software component that transforms the information of a resource into a new set of features. The basic interface for an Extractor is defined by the Extractor class in the IMHBaseCore jar.
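As a rough illustration of this contract (the actual Extractor abstract class and its method signatures are defined in IMHBaseCore.jar and will differ; the types below are placeholders), an extractor derives new features from a resource:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Placeholder sketch of an extractor; the real base class lives in IMHBaseCore.jar. */
public class LinkCountExtractorSketch {

    /** Minimal stand-in for the framework's Resource abstraction. */
    interface Resource {
        String url();
        byte[] content();
    }

    /** Transforms a resource into a new set of features, here a single link count. */
    public Map<String, String> extract(Resource resource) {
        String html = new String(resource.content(), StandardCharsets.UTF_8);
        int links = 0;
        for (int idx = html.indexOf("<a "); idx != -1; idx = html.indexOf("<a ", idx + 3)) {
            links++;
        }
        Map<String, String> features = new HashMap<>();
        features.put("outgoing_links", Integer.toString(links));
        return features;
    }
}
```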

Web Analytics Technology V1

Wikipedia History Creator

The history of a Wikipedia page is represented as a set of page revisions. The goal of the Wikipedia History Creator software is to convert data from very large XML history files and import it into an HBase table representation. Storing the Wikipedia history in the form of HBase tables and processing it with the MapReduce framework serves two main purposes:
  • Efficient reconstruction of the Wikipedia state at a specified time point in the past (monthly granularity), as sketched after this list.
  • Efficient answering of temporal queries.
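One possible way to serve the first purpose (illustrative only; the actual table schema is defined by the software in the zip file) is to store each revision as a timestamped HBase cell version and ask for the latest version at or before the time point of interest. The table, family, and qualifier names below are examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/** Sketch: fetch the latest revision of a page at or before a given time point, assuming
 *  revisions are stored as timestamped cell versions (schema names are examples only). */
public class WikipediaStateSketch {

    public static byte[] revisionAt(String pageTitle, long timePoint) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "wikipedia_history");           // example table name
        try {
            Get get = new Get(Bytes.toBytes(pageTitle));
            get.addColumn(Bytes.toBytes("rev"), Bytes.toBytes("text")); // example family/qualifier
            get.setTimeRange(0L, timePoint + 1);                        // only revisions up to timePoint
            get.setMaxVersions(1);                                      // latest such revision
            Result result = table.get(get);
            return result.getValue(Bytes.toBytes("rev"), Bytes.toBytes("text"));
        } finally {
            table.close();
        }
    }
}
```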

WikipediaHistory.zip


RADAR (RApid Disambiguation of ARchives)

RADAR is a system for performing large-scale named entity disambiguation on an entire corpus of documents. It has been specifically designed for heavy-duty usage on Web archives. The RADAR system supports mapping mentions of named entities (e.g. persons, locations, organizations, ...) onto canonical entities registered in a knowledge base (e.g. DBpedia, YAGO).

RADAR.zip


HBase Indexes for Interval Queries

The MRSegmentTree and Endpoints Indices focus on providing support for the efficient processing of analytics and time-travel queries over Web archives. Each Web page has several incarnations at different time points, with any two subsequent such points defining an interval. The interval query types supported by the indexes created in this software include containment, stabbing, and intersection queries.

IntervalIndices.zip


Temporal Web Classification

Imagine a fictional Internet archive that prefers certain types of content, such as news-editorial and educational sites, over commercial sites. It may or may not want to completely exclude Web spam, i.e., deliberate attempts at inflating the search-engine rank positions of target pages. The goal of the Temporal Web Classification software is to train a model on a collection of manually labelled Web hosts, possibly also including their historical versions. The software, as part of the Web classification framework, can be tested as a web service; its WSDL is available at the following URL:

http://monster.ilab.sztaki.hu:8891/webspamservice/webspamservice?wsdl

The service returns a spam prediction when the getPrediction operation is called with a host name as its parameter. The operation returns the prediction (a double value between 0 and 1), or -1 if the given host is not in the database. Currently, the English-language hosts of ClueWeb09 are supported.
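Assuming a JAX-WS client stub generated from the WSDL above (e.g. with wsimport), a call could look like the sketch below; the generated class and method names are guesses and will differ from the actual artifacts, and only the getPrediction operation and its semantics are taken from the description above.

```java
/** Hypothetical client sketch: WebspamService and WebspamPort stand in for the classes that
 *  wsimport would generate from the WSDL; their real names will differ. */
public class SpamCheckSketch {
    public static void main(String[] args) {
        WebspamService service = new WebspamService();              // generated service class (assumed)
        WebspamPort port = service.getWebspamPort();                // generated port accessor (assumed)
        double prediction = port.getPrediction("example.com");     // operation described above
        if (prediction < 0) {
            System.out.println("Host not in the database");
        } else {
            System.out.printf("Spam likelihood: %.3f%n", prediction);
        }
    }
}
```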

Wide Area Hadoop

Kelvin Example Configuration
hadoopstatsconfig.tar

Kelvin In Action
KIA.war

Version 0.21.dev0
hadoop-0.21.dev0.tar

Software section opened

This is the software section

Welcome to the new LAWA website section: software.