WorkQ: A Dynamic Producer/Consumer Processing Model Applied to PGAS Computations
David Ozog, University of Oregon
The ubiquity of distributed memory in massively parallel computer systems suggests the PGAS (Partitioned Global Address Space) model as an effective programming paradigm for enabling productive and efficient computational science. A common processing methodology seen in PGAS-based applications, such as the Tensor Contraction Engine (TCE) in NWChem, involves a one-process-per-core mapping in which each process in the system iterates through the following work-processing cycle: 1) determine a work-item dynamically; 2) get data via one-sided operations on remote blocks; 3) perform computation on the data locally; 4) put (or accumulate) resultant data into an appropriate remote location; and 5) repeat the cycle. In this paper we propose a new run-time processing model, called WorkQ, in which some number of on-node “producer” processes primarily do communication (steps 1, 2, 4, and 5) and the other “consumer” processes do computation (step 3), yet processes can switch roles dynamically for the sake of performance. Load balance, synchronization, and overlap of communication and computation is facilitated by a highly tunable node-wise FIFO message queue protocol. Our WorkQ library implementation also enables an MPI+X hybrid programming model where the X is comprised of SysV message queues and the user’s choice of SysV, POSIX, and MPI shared memory. We develop a simplified software mini-application which mimics the performance behavior of the TCE at arbitrary scale and show that the WorkQ engine outperforms the original model by about a factor of two. We also show performance improvement in the TCE Coupled Cluster module of NWChem across all possible tile sizes.
Abstract Author(s): David Ozog, Allen Malony, Jeff R. Hammond, Pavan Balaji