Up-to-date metadata and data lineage information are crucial for understanding and interpreting data in a Hadoop data warehouse. At the same time, Hadoop data warehouse projects sink or swim with their ability to continuously add new data sources and views as business requirements evolve.
Conventional ETL workflow schedulers and metadata management approaches, however, quickly become millstones around a project's neck: metadata is maintained separately from the ETL code and drifts out of date as the warehouse changes.
The talk proposes the integrated specification of data structure, data dependencies, and computation logic as a way to keep ETL development productive and metadata current.
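To make the idea concrete, here is a minimal sketch of what such an integrated view specification could look like. This is hypothetical Python, not the actual Schedoscope DSL (which is a Scala-internal DSL with its own names); the view, fields, and query are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewSpec:
    """A single spec combining all three aspects of a warehouse view:
    data structure (fields), data dependencies (depends_on), and
    computation logic (transform)."""
    name: str
    fields: dict        # column name -> type: the data structure
    depends_on: tuple   # upstream view names: the dependencies / lineage
    transform: str      # the computation logic, here a HiveQL string


# A hypothetical clickstream aggregation view, specified in one place:
visits = ViewSpec(
    name="processed.visits",
    fields={"session_id": "string", "shop_code": "string", "hits": "int"},
    depends_on=("stage.clickstream",),
    transform=(
        "SELECT session_id, shop_code, COUNT(*) AS hits "
        "FROM stage.clickstream GROUP BY session_id, shop_code"
    ),
)
```

Because structure, lineage, and logic live in one artifact, they cannot drift apart: whoever changes the query also changes the schema and dependency declarations right next to it.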
Based on such specifications, a scheduler can automatically detect changes to data structures and computation logic and perform the recomputations needed to bring affected views up to date.
Also, the very same specifications explicitly "program" rich metadata, avoiding the drift between documentation and deployed data that plagues manually maintained metadata repositories.
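The change-detection idea can be sketched in a few lines: fingerprint each spec, and recompute a view whenever its recorded fingerprint no longer matches the current one. This is a toy illustration, not Schedoscope's actual implementation; the field names and queries are invented.

```python
import hashlib


def spec_checksum(fields: dict, transform: str) -> str:
    """Fingerprint a view's structure and logic together;
    any change to either yields a different checksum."""
    blob = repr(sorted(fields.items())) + transform
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


# Checksum recorded when the view was last materialized:
recorded = spec_checksum({"hits": "int"}, "SELECT COUNT(*) FROM clicks")

# Checksum of the current spec, after an engineer widened the column type:
current = spec_checksum({"hits": "bigint"}, "SELECT COUNT(*) FROM clicks")

# A schema or logic change is detected purely from the specs,
# so the scheduler knows this view must be recomputed:
needs_recompute = recorded != current
```

The same spec that drives scheduling can be walked to emit metadata (column lists, lineage graphs, transformation text) with no separate documentation step.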
The talk illustrates this approach with Schedoscope, a scheduler developed at Otto Group based on integrated view specification, and Metascope, a collaborative metadata exploration tool built on top of Schedoscope. Schedoscope and Metascope drive Otto Group BI's data platform, which processes clickstream, product, and CRM data from 120 online shops with a yearly revenue north of 5bn Euros. Schedoscope has enabled Otto Group BI's small team of data engineers to continuously release new data sources and views for more than 2 years now; with Metascope, Otto Group's analysts and data scientists have access to always up-to-date metadata and documentation.
Schedoscope and Metascope are available as open source at http://schedoscope.org.