Monday, May 20 • 5:00pm - 5:20pm
Disdat: Bundle Data Management for Machine Learning Pipelines

Modern machine learning pipelines can produce hundreds of data artifacts (such as features, models, and predictions) throughout their lifecycle. During that time, data scientists need to reproduce errors, update features, re-train on specific data, validate and inspect outputs, and share models and predictions. Doing so requires the ability to publish, discover, and version those artifacts.

This work introduces Disdat, a system to simplify ML pipelines by addressing these data management challenges. Disdat is built on two core data abstractions: bundles and contexts. A bundle is a versioned, typed, immutable collection of data. A context is a sharable set of bundles that can exist on local and cloud storage environments. Disdat provides a bundle management API that we use to extend an existing workflow system to produce and consume bundles. This bundle-based approach to data management has simplified both authoring and deployment of our ML pipelines.
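The two abstractions can be illustrated with a minimal Python sketch. This is not Disdat's API: the class names `Bundle` and `Context`, their methods, and the choice to derive a version string from a content hash are all assumptions made for illustration, and the sketch covers only versioning and immutability, not typing or cloud storage.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Mapping
import hashlib
import json


@dataclass(frozen=True)
class Bundle:
    """Hypothetical bundle: a named, versioned, immutable collection of data."""
    name: str
    data: Mapping[str, object]
    version: str = field(init=False)

    def __post_init__(self):
        # Copy the payload into a read-only view so it cannot change later.
        frozen = MappingProxyType(dict(self.data))
        object.__setattr__(self, "data", frozen)
        # Assumption: the version is a short hash of the JSON-encoded content.
        encoded = json.dumps(dict(frozen), sort_keys=True).encode("utf-8")
        object.__setattr__(self, "version", hashlib.sha256(encoded).hexdigest()[:12])


class Context:
    """Hypothetical context: a named set of bundles, kept in insertion order."""

    def __init__(self, name):
        self.name = name
        self._bundles = {}  # bundle name -> list of Bundle versions, oldest first

    def add(self, bundle):
        # Adding never overwrites; each call appends a new version.
        self._bundles.setdefault(bundle.name, []).append(bundle)
        return bundle

    def latest(self, name):
        return self._bundles[name][-1]

    def history(self, name):
        return tuple(self._bundles[name])


ctx = Context("demo")
v1 = ctx.add(Bundle("features", {"rows": 100}))
v2 = ctx.add(Bundle("features", {"rows": 120}))
```

Here a pipeline step that re-computes `features` adds a second bundle rather than mutating the first, so the earlier version stays available through `ctx.history("features")` for reproducing a past run.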


Sean Rowan

Intuit, Inc.

Jonathan Lunt

Intuit, Inc.

Theodore M. Wong

23andMe, Inc.

Ken Yocum

Intuit, Inc.

Monday May 20, 2019 5:00pm - 5:20pm PDT
Lawrence/San Tomas/Lafayette Rooms
