To support innovative Future Internet applications, we need a deep understanding of Internet content characteristics (size, distribution, form, structure, evolution, dynamics). The LAWA project on Longitudinal Analytics of Web Archive data will build an Internet-based experimental testbed for large-scale data analytics. Its focus is on developing a sustainable infrastructure, scalable methods, and easy-to-use software tools for aggregating, querying, and analyzing heterogeneous data at Internet scale. Particular emphasis will be given to longitudinal analysis, along the time dimension, of Web data that has been crawled over extended time periods.
LAWA will federate distributed FIRE facilities with the rich Web repository of the European Archive to create a Virtual Web Observatory, and will use Web data analytics as a use case to validate our design. The outcome of our work will enable Internet-scale analysis of data and bring the content aspect of the Internet onto the roadmap of Future Internet Research. In four work packages we will extend the open-source Hadoop software with novel methods for wide-area data access, distributed storage and indexing, scalable data aggregation and data analysis along the time dimension, and automatic classification of Web content.
Target Users and Benefits
LAWA adds value to the FIRE community by offering access to very large datasets, together with advanced methods and open-source tools for intelligent analysis. This enables research on the Future Internet that addresses the challenge of content explosion. A Virtual Web Observatory will be created to support data-intensive experimentation with Web content analytics. A demonstrator is also planned that will allow citizens at large to interactively browse, search, and explore born-digital content along the time dimension.