On-the-fly Learning in Large Distributed Simulation Workflows
Michael Tynes, University of Chicago
Approximate models used to accelerate scientific applications are traditionally treated as immutable: they are designed up front and then kept fixed while in use. While machine learning methods have made it possible to build good approximations of expensive calculations, the benefits of the mutability of machine learning models have yet to be fully realized. Here, we explore two potential benefits of mutable approximations: (1) mitigating the risk of approximation errors through adaptive auditing and retraining, and (2) reducing the resources invested in the up-front training phase when creating the surrogate.
Our experiments with dynamic surrogates are built around a system that maintains a high-performance computing workload alternating between propagating trajectories generated by members of a molecular dynamics simulation ensemble and auditing them against higher-fidelity calculations. The system uses audit results both to decide whether to restart trajectories from a checkpoint and to continually update the surrogate. The surrogate improves over its lifespan, reducing the need for computationally expensive audits. Audit intensity is managed by a control system that minimizes computational cost while ensuring that all trajectories remain within a target accuracy bound.
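The abstract does not specify the control policy, so the following is only a minimal sketch of the audit-retrain-restart loop it describes, assuming a toy one-dimensional trajectory. The names (`surrogate_step`, `high_fidelity_step`, `ERROR_BOUND`) and the additive-increase/multiplicative-decrease audit interval are hypothetical illustrations, not the authors' implementation.

```python
import random

ERROR_BOUND = 0.05            # hypothetical per-audit accuracy target
MIN_INTERVAL, MAX_INTERVAL = 1, 64   # bounds on steps between audits

def surrogate_step(state, model):
    """Advance one step with the cheap (and initially biased) surrogate."""
    return state + model["bias"] + random.gauss(0, 0.01)

def high_fidelity_step(state):
    """Expensive reference calculation, invoked only when auditing."""
    return state + random.gauss(0, 0.01)

def retrain(model, error):
    """Stand-in for retraining: nudge the surrogate toward the reference."""
    model["bias"] -= 0.5 * error
    return model

def run_trajectory(steps, model):
    state = checkpoint = 0.0
    interval = MIN_INTERVAL   # audit every step until the surrogate earns trust
    since_audit = 0
    for _ in range(steps):
        proposed = surrogate_step(state, model)
        since_audit += 1
        if since_audit >= interval:
            reference = high_fidelity_step(state)
            error = proposed - reference
            model = retrain(model, error)        # surrogate updated on every audit
            if abs(error) > ERROR_BOUND:
                state = checkpoint               # restart from the last trusted point
                interval = max(MIN_INTERVAL, interval // 2)  # audit more often
            else:
                state = checkpoint = reference   # advance the trusted checkpoint
                interval = min(MAX_INTERVAL, interval * 2)   # surrogate earns slack
            since_audit = 0
        else:
            state = proposed
    return state

if __name__ == "__main__":
    random.seed(0)
    print(run_trajectory(steps=1000, model={"bias": 0.05}))
```

In this sketch the audit interval grows while audits pass and shrinks when one fails, so audit cost falls as the surrogate improves; the real system would replace the toy stepper and retraining rule with the MD ensemble and higher-fidelity calculations described above.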
Authors: Michael Tynes, Logan Ward, Kyle Chard, Ian Foster