Making data lakes more accessible and analysable

The project “Sustainable Data Lakes for Extreme-Scale Analytics” has been funded by the H2020 program. George Fletcher, Nikolay Yakovets and Odysseas Papapetrou of the Database group (M&CS) will supervise two PhD students.

Data lakes are raw data ecosystems in which large amounts of diverse data are retained and coexist. They facilitate self-service analytics for flexible, fast, ad hoc decision making. SmartDataLake aims to enable extreme-scale analytics over sustainable big data lakes. It provides an adaptive, scalable and elastic data lake management system that offers: (a) data virtualization for abstracting and optimizing access and queries over heterogeneous data, (b) data synopses for approximate query answering and analytics to enable interactive response times, and (c) automated placement of data in different storage tiers, based on data characteristics and access patterns, to reduce costs.
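To give a feel for what a data synopsis for approximate query answering can look like, here is a minimal sketch in Python. It uses reservoir sampling, a standard textbook technique chosen purely for illustration; it is not taken from the project, and the function names and the toy data are assumptions.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                sample[j] = item
    return sample


def approximate_mean(sample):
    """Answer an average query from the synopsis instead of the full data."""
    return sum(sample) / len(sample)


# Hypothetical data: one million numeric values standing in for a large table column.
values = range(1_000_000)
synopsis = reservoir_sample(values, k=1_000)
estimate = approximate_mean(synopsis)
print(f"approximate mean from {len(synopsis)} sampled values: {estimate:.1f}")
```

The query touches only the 1,000 sampled values rather than the full data, which is the kind of trade-off between accuracy and response time that synopses are meant to offer.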

The data lake’s contents are modelled and organised as a heterogeneous information network, containing multiple types of entities and relations. The project will provide efficient and scalable algorithms for (a) similarity search and exploration for discovering relevant information, (b) entity resolution and ranking for identifying and selecting important and representative entities across sources, (c) link prediction and clustering for unveiling hidden associations and patterns among entities, and (d) change detection and incremental update of analysis results to enable faster analysis of new data. Interactive and scalable visual analytics will include and empower the data scientist in the knowledge extraction loop. The results of the project will be evaluated in real-world use cases from the business intelligence domain, including scenarios for portfolio recommendation, production planning and pricing, and investment decision making.
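As a rough illustration of link prediction over a network with typed entities, the sketch below scores pairs of same-type nodes by the Jaccard similarity of their neighbourhoods. This is a simple, well-known baseline used here only as an example, not the project's algorithm; the toy network, the entity types and all function names are assumptions.

```python
from itertools import combinations

# Hypothetical toy network: each node is a (type, name) pair, edges are undirected.
edges = [
    (("company", "A"), ("investor", "X")),
    (("company", "A"), ("investor", "Y")),
    (("company", "B"), ("investor", "X")),
    (("company", "B"), ("investor", "Y")),
    (("company", "C"), ("investor", "Z")),
]


def build_adjacency(edges):
    """Map every node to the set of its neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def jaccard(adj, u, v):
    """Shared neighbours divided by all neighbours of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def rank_candidate_links(adj, node_type):
    """Score every unlinked pair of nodes of one type, best candidates first."""
    nodes = sorted(n for n in adj if n[0] == node_type)
    scored = [
        (jaccard(adj, u, v), u, v)
        for u, v in combinations(nodes, 2)
        if v not in adj[u]
    ]
    return sorted(scored, reverse=True)


adj = build_adjacency(edges)
ranking = rank_candidate_links(adj, "company")
for score, u, v in ranking:
    print(f"{u[1]} - {v[1]}: {score:.2f}")
```

In this toy network, companies A and B share both of their investors, so that pair receives the highest score and would be the first candidate for a hidden association.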