Monday, May 20 • 4:20pm - 4:40pm
tensorflow-tracing: A Performance Tuning Framework for Production

Sign up or log in to save this to your schedule and see who's attending!

The growing popularity of Deep Neural Networks (DNN) within the mainstream \cite{gartnerhype} has had a rapid transformative effect on clusters and data centers.

DNN training jobs are becoming one of the largest tenants within clusters, and often take hours to weeks to complete; and even a slight performance improvement can save substantial runtime costs. Despite this fact, the DNN specific performance tuning tools are yet to keep up with the needs of the new changes in production environments.

On one hand, the existing application-agnostic resource-level tools such as top, Nvidia Nsight (for GPU utilization), IPM (for MPI network monitoring) are too limited to predict or explain the behavior and performance of a job accurately. In DNN applications, there exists a complex relationship among resources. Even though measuring coarse metrics such as bandwidth, latency, and GPU/CPU utilization can draw an overall picture of cluster performance, these metrics are not easily translatable to application-level metrics and do not provide actionable insights on how to handle performance bottlenecks.

On the other hand, the short list of application-aware tools, such as MLModelScope \cite{dakkak2018mlmodelscope}, TensorBoard \cite{tensorboard}, and \texttt{tf.RunOptions} \cite{tensorflow-trace}, while able to provide actionable insights, are mainly designed for the need of application developers and are not intended for production use. Such tools require substantial modification to applications, and early planning as to what, when and how data should be collected.

In this article, we introduce \texttt{tensorflow-tracing}~to fill the gap between these two classes of performance tuning tools. To achieve this goal, \texttt{tensorflow-tracing}~addresses the following technical challenges:

\begin{itemize}[noitemsep,topsep=0pt,leftmargin=*] \item Collecting the application-level runtime metrics, such as the timing of each operation or the iteration time, needs explicitly expressed in the training job source code. To makes it possible to trace ML jobs without requiring any application modification, \texttt{tensorflow-tracing}~ \textit{monkeypatches} the \texttt{tensorflow} library at the system level. \item Collecting some metrics is expensive and have a significant overhead on the runtime. \texttt{tensorflow-tracing}~treats metrics differently; it collects low-overhead metrics automatically, while expensive ones are collected on demand through an admin interface. \item There is no easy way to exchange runtime metrics among users and admins --- \texttt{tensorflow-tracing}~facilities this through a portable file format and supporting tools to explore these metrics offline. \end{itemize}

The \texttt{tensorflow-tracing}~is publicly available under \texttt{Apache-2.0} license\footnote{\url{https://github.com/xldrx/tensorflow-tracer}}. It supports native TensorFlow \cite{tensorflow}, Horovod \cite{horovod}, and IBM PowerAI \cite{powerai} applications.


Sayed Hadi Hashemi

University of Illinois at Urbana-Champaign

Benjamin Rabe

University of Illinois at Urbana-Champaign

Kuan-Yen Chou

University of Illinois at Urbana-Champaign

Simeng Liu

University of Illinois at Urbana-Champaign

Volodymyr Kindratenko

University of Illinois at Urbana-Champaign

Roy H Campbell

University of Illinois at Urbana-Champaign

Monday May 20, 2019 4:20pm - 4:40pm
Lawrence/San Tomas/Lafayette Rooms

Attendees (4)