Rapid growth in architectural diversity has increased the size and complexity of application configuration spaces, demanding scalable and accurate methods for modeling and tuning large-scale applications. Communication-avoiding algorithms in particular expose a growing number of tunable parameters that trade off competing costs in synchronization, interprocessor communication, computational work, and memory footprint. We first introduce a novel class of tensor-based surrogate models for modeling and tuning high-dimensional execution-time data.
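To make the idea concrete, the sketch below illustrates one way a tensor-based surrogate can be built: the configuration space is discretized into a grid, the (mostly unobserved) execution times form a sparse tensor, and a low-rank CP factorization fit to the observed entries predicts the rest. This is a toy illustration on synthetic data, not the thesis's actual model or optimizer; the rank, grid shape, and plain gradient-descent fit are all assumptions made for brevity.

```python
import numpy as np

# Hypothetical 3-parameter configuration space (e.g., block size, process
# grid dimension, recursion depth), each parameter discretized to 8 values.
rng = np.random.default_rng(0)
shape, rank = (8, 8, 8), 3

# Synthetic low-rank "ground truth" standing in for measured log-times.
truth = sum(np.einsum('i,j,k->ijk',
                      rng.standard_normal(shape[0]),
                      rng.standard_normal(shape[1]),
                      rng.standard_normal(shape[2]))
            for _ in range(rank))
mask = rng.random(shape) < 0.2          # benchmark only ~20% of configurations
observed = np.where(mask, truth, 0.0)

# Masked CP decomposition: minimize squared error over observed entries only,
# by gradient descent on the factor matrices A, B, C.
A = rng.standard_normal((shape[0], rank)) * 0.1
B = rng.standard_normal((shape[1], rank)) * 0.1
C = rng.standard_normal((shape[2], rank)) * 0.1
lr = 0.01
for _ in range(5000):
    pred = np.einsum('ir,jr,kr->ijk', A, B, C)
    resid = mask * (pred - observed)     # error only where we have data
    A -= lr * np.einsum('ijk,jr,kr->ir', resid, B, C)
    B -= lr * np.einsum('ijk,ir,kr->jr', resid, A, C)
    C -= lr * np.einsum('ijk,ir,jr->kr', resid, A, B)

# The fitted factors give predictions for every unbenchmarked configuration.
pred = np.einsum('ir,jr,kr->ijk', A, B, C)
rel_err = np.linalg.norm((pred - truth)[~mask]) / np.linalg.norm(truth[~mask])
print(f"relative error on unobserved configurations: {rel_err:.3f}")
```

The key property exploited here is that the surrogate's storage and training cost scale with the sum, rather than the product, of the per-parameter grid sizes, which is what makes high-dimensional configuration spaces tractable.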
We evaluate these models against state-of-the-art regression methods and demonstrate significant improvements in accuracy for a given optimization time and evaluation latency across a suite of numerical kernels.

We next introduce a framework for approximate autotuning that achieves a desired level of confidence in each algorithm configuration's performance by constructing confidence intervals on the execution times of its individual kernels. Once a kernel's performance is deemed sufficiently predictable for a set of inputs, subsequent invocations are skipped and replaced with a predictive model of execution time. We encapsulate this framework in a new profiling tool, Critter, which automates kernel-execution decisions and propagates statistical profiles along critical paths of execution. Using state-of-the-art distributed-memory implementations of Cholesky and QR factorization, we demonstrate execution speedups of up to 7x with 98% prediction accuracy.
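A minimal sketch of the skip-or-execute decision follows. It assumes a normal approximation to the sample mean (hence the 1.96 critical value) and a relative half-width tolerance; Critter's actual decision logic, per-input profiling, and critical-path propagation are more involved, and the class and function names here are hypothetical.

```python
import math
import time
from statistics import mean, stdev

class KernelProfile:
    """Per-kernel record of sample timings; decides when the kernel's
    runtime is predictable enough to replace execution with a model."""

    def __init__(self, rel_tol=0.05, min_samples=5):
        self.samples = []
        self.rel_tol = rel_tol          # target half-width / mean
        self.min_samples = min_samples  # guard against tiny samples

    def record(self, seconds):
        self.samples.append(seconds)

    def is_predictable(self):
        # Predictable once the ~95% confidence interval on the mean
        # execution time is within rel_tol of the mean itself.
        if len(self.samples) < self.min_samples:
            return False
        m = mean(self.samples)
        half_width = 1.96 * stdev(self.samples) / math.sqrt(len(self.samples))
        return half_width <= self.rel_tol * m

    def predicted_time(self):
        return mean(self.samples)

def timed_or_predicted(profile, kernel, *args):
    """Run `kernel` and record its time until its confidence interval is
    tight enough; afterwards skip execution and return the model estimate."""
    if profile.is_predictable():
        return profile.predicted_time()  # avoid the real invocation
    t0 = time.perf_counter()
    kernel(*args)
    elapsed = time.perf_counter() - t0
    profile.record(elapsed)
    return elapsed
```

In an autotuning loop, wrapping each kernel invocation this way means the cost of evaluating a candidate configuration shrinks as its constituent kernels become statistically predictable, which is the source of the reported speedups.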