Name: Disdat: Bundle Data Management for Machine Learning Pipelines
Start: 2019-05-20T17:00:00-0700
End: 2019-05-20T17:20:00-0700

Modern machine learning pipelines can produce hundreds of data artifacts (such as features, models, and predictions) throughout their lifecycle. During that time, data scientists need to reproduce errors, update features, re-train on specific data, validate and inspect outputs, and share models and predictions. Doing so requires the ability to publish, discover, and version those artifacts.

This work introduces Disdat, a system to simplify ML pipelines by addressing these data management challenges. Disdat is built on two core data abstractions: bundles and contexts. A bundle is a versioned, typed, immutable collection of data. A context is a sharable set of bundles that can exist on local and cloud storage environments. Disdat provides a bundle management API that we use to extend an existing workflow system to produce and consume bundles. This bundle-based approach to data management has simplified both authoring and deployment of our ML pipelines.
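
To make the two abstractions concrete, here is a minimal Python sketch of the idea. It is not the Disdat API: the class names `Bundle` and `Context`, the methods `add`, `latest`, and `history`, and the content-hash `uuid` are all hypothetical names chosen for illustration. It models only immutability and versioning, and leaves out bundle typing and the local and cloud storage that the abstract mentions.

```python
import hashlib
import json
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class Bundle:
    """Hypothetical stand-in for a bundle: a named, immutable collection of data."""
    name: str
    data: Mapping[str, object]

    def __post_init__(self):
        # Copy the payload and expose it through a read-only view so that
        # neither the caller's dict nor this bundle can be mutated later.
        object.__setattr__(self, "data", MappingProxyType(dict(self.data)))

    @property
    def uuid(self) -> str:
        # Assumption for this sketch: identify a version by a hash of its contents.
        payload = json.dumps(
            {"name": self.name, "data": dict(self.data)},
            sort_keys=True,
            default=str,
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


class Context:
    """Hypothetical stand-in for a context: a set of bundles grouped by name."""

    def __init__(self, name: str):
        self.name = name
        self._versions = {}  # bundle name -> list of Bundle, oldest first

    def add(self, bundle: Bundle) -> Bundle:
        # Adding a bundle never overwrites; it appends a new version.
        self._versions.setdefault(bundle.name, []).append(bundle)
        return bundle

    def latest(self, name: str) -> Bundle:
        return self._versions[name][-1]

    def history(self, name: str) -> tuple:
        return tuple(self._versions[name])


# Two versions of the same named bundle coexist in one context.
ctx = Context("examples")
v1 = ctx.add(Bundle("features", {"rows": 100}))
v2 = ctx.add(Bundle("features", {"rows": 120}))
```

Because every `add` appends rather than overwrites, an earlier version such as `v1` stays retrievable through `history`, which is the property that makes re-running or auditing a past pipeline step possible.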

Speakers

Paper Presentation

OpML '19: 2019 USENIX Conference on Operational Machine Learning

Sean Rowan

Jonathan Lunt

Theodore M. Wong

Ken Yocum
