Data management and pre-processing often consume most of a data scientist's time. The data architecture and the configuration of data pipelines significantly influence the efficiency of this work. The emerging "Lakehouse" architecture combines the features of a Data Lake and a Data Warehouse, eliminating the need to manage a two-tier system. It allows raw, structured, and semi-structured data to be stored and processed on a unified platform, offering higher performance and decoupling computing from storage. The capabilities of this architecture are explored within Trase.earth, a leading initiative in commodity supply chain transparency that focuses on agricultural products driving deforestation. This thesis demonstrates that the Lakehouse architecture can simplify intricate data pipelines while enabling new functionalities. It also shows that this transition can be made backwards-compatible, rely on open standards, and reduce costs. The enhancements analyzed include data ingestion from heterogeneous sources, data discoverability, metadata management, data sharing, and pipeline management with integrated data quality expectations. As an additional case study, graph data mining techniques are applied to the beef supply chain in the state of Pará, Brazil, using a dataset of sanitary records for animal transportation. Several methods for deriving and analyzing paths of indirect sourcing are employed, making it possible to identify and characterize the most frequently traveled routes, trade communities, and node centrality. The code for this thesis is available at https://github.com/nmartinbekier/ds_de_thesis
Keywords
Lakehouse architecture, Efficiency improvements, Data pipeline simplification
Institute(s)
Uppsala University
Year
2023
Abstract
Author(s)
Nicolás Martín