The main objective in this area is to ensure that Web data is provisioned on the LAWA infrastructure at the right scale and in an optimal structure for further processing. This encompasses the improvement of the current EA infrastructure to reach the «Web scale» limit (billions of resources crawled per week), as well as the development of an on-demand crawling service through which research groups will be able to build focused collections in an iterative manner. A further goal is to address the issue of optimal data storage for stream processing and to ensure that data subsets can be referenced or cited.
The main tasks will be:
- Web-scale Crawling
- Research-driven Crawling Service
- Data Storage Optimization for Processing
- Data Subset Citation
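To make the research-driven crawling idea concrete, the sketch below shows one common way a focused collection can be built iteratively: a frontier of URLs is expanded only from pages whose relevance score meets a threshold that a research group could tune between iterations. This is a minimal illustration, not the service's actual design; the link graph and relevance scores are hypothetical stand-ins for live crawl data.

```python
from collections import deque

# Toy link graph standing in for the live Web (hypothetical data).
LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "d": ["f"],
}

# Hypothetical per-page relevance scores a research group might supply.
RELEVANCE = {"seed": 1.0, "a": 0.9, "b": 0.2, "c": 0.8,
             "d": 0.7, "e": 0.1, "f": 0.6}

def focused_crawl(seeds, threshold=0.5, budget=10):
    """Breadth-first crawl that collects a page, and follows its outlinks,
    only when its relevance score meets the threshold."""
    frontier = deque(seeds)
    seen = set(seeds)
    collected = []
    while frontier and len(collected) < budget:
        url = frontier.popleft()
        if RELEVANCE.get(url, 0.0) < threshold:
            continue  # off-topic page: keep it out of the collection
        collected.append(url)
        for out in LINKS.get(url, []):
            if out not in seen:
                seen.add(out)
                frontier.append(out)
    return collected

print(focused_crawl(["seed"]))  # → ['seed', 'a', 'c', 'd', 'f']
```

Between iterations, a group would inspect the resulting collection, adjust the scoring or threshold, and re-run the crawl, which is the iterative refinement loop the service is meant to support.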