Ports by: Peter A. Dinda (Fx) and David R. O'Hallaron (Archimedes)
In general, we found PVM was an easy environment to port to, but at the cost of performance. PVM was considerably slower than the native communication system on each of the machines we looked at (DEC Alphas with Ethernet, FDDI, and HiPPI, Intel Paragon, Cray T3D). Much of this slowdown is probably due to the extra copying needed to provide PVM's programmer-friendly semantics, which we, as compiler writers, do not need. Although PVM goes a long way toward making parallel programs portable, we found it was necessary to make minor (Paragon) to major (T3D) modifications to run PVM programs on MPPs.
The details of running PVM programs are hard to hide from users. Although our toolchain hides the details of compiling and linking for PVM, once an executable is produced, the user is left to deal with hostfiles, daemons, and other details of execution - issues that are nonexistent under the operating systems of MPPs.
The Fx language is a variant of High Performance Fortran (HPF) which integrates task parallelism into the overall data parallel HPF framework. Data parallelism is expressed by Fortran 90 array assignment statements and parallel loops over distributed arrays. Task parallelism allows the programmer to instantiate several data parallel routines at a time and specify how data flows between them. For example, a two dimensional FFT can be decomposed into a parallel loop over the rows followed by a parallel loop over the columns. With task parallelism, both loops can operate at the same time, forming a pipeline. Fx has been used to build or parallelize a number of real applications, including Air Quality Modeling, Stereo Vision, Synthetic Aperture RADAR, Earthquake Ground Motion Modeling, Magnetic Resonance Imaging, and Narrowband Tracking RADAR.

The Fx compiler translates an Fx program into an SPMD Fortran 77 program that calls on the Fx run-time system to perform communication. The F77 source is compiled using the native Fortran compiler and linked with the run-time system and PVM libraries. Porting to PVM involved minor changes to the compiler, mostly to support program startup and shutdown, and writing a PVM-based run-time system.
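The row/column decomposition mentioned above relies on the separability of the 2-D FFT. The following NumPy sketch (not Fx code; the function name is ours) shows the two stages that Fx would run as a pipeline, executed sequentially here and checked against a direct 2-D FFT:

```python
import numpy as np

def fft2_by_stages(a):
    """2-D FFT as two separable stages: rows first, then columns.

    In Fx, stage 1 and stage 2 would be distinct data parallel tasks
    connected in a pipeline; here they simply run back to back.
    """
    rows = np.fft.fft(a, axis=1)      # stage 1: loop over the rows
    return np.fft.fft(rows, axis=0)   # stage 2: loop over the columns

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
# The staged result matches a direct 2-D FFT.
assert np.allclose(fft2_by_stages(x), np.fft.fft2(x))
```

Because the two stages touch independent data (stage 2 of one input can overlap stage 1 of the next), they form a natural two-stage pipeline under task parallelism.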
On a workstation cluster, a parallel Fx program exhibits the behavior a user would expect from a sequential program. When the executable is run, it spawns the necessary number of copies of itself using PVM. It also spawns a monitor program which gracefully shuts down the application should a problem arise in any task. All I/O (except for parallel file I/O) is performed by the process that was spawned by the user - thus the user can use the Fx program like any other Unix program. For example, the user could include it in a pipeline. On an MPP, program startup varies from machine to machine. For example, on the Paragon, the user must run a "wrapper" program which spawns all the copies of the Fx program.
PVM lets Fx target workstation clusters which, for some applications, prove significantly better than MPPs. For example, the chemical reaction component of the Air Quality Modeling application runs as fast on four DEC Alphas as on 32 nodes of the Intel Paragon:
Portability is another concern we have with PVM. Although PVM programs are highly portable among workstations, each MPP's implementation of PVM seems to be different, requiring considerable special-casing to achieve portability. For us, the T3D's implementation required the most changes, while the Paragon's required the fewest.
We expose starting the PVM daemon to the user because we want a non-default executable path for each host - something we cannot configure using pvm_addhosts(). Further, in practice, starting daemons on different machines in a network environment as complex as CMU's can be quite painful due to differing security mechanisms and machines equipped with multiple network adaptors. Finally, since we use only task-to-task communication (RouteDirect) and don't need dynamic virtual machines, the daemon seems superfluous.
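For reference, the per-host executable path can be supplied when the daemon is started from a hostfile, via the ep= option of the PVM 3 hostfile format (a sketch from memory of that format; the hostnames and paths below are hypothetical):

```
# PVM 3 hostfile sketch: one line per host, with per-host options.
# ep= sets the search path for spawned executables on that host.
alpha1.cs.cmu.edu   ep=/usr/fx/bin/alpha
alpha2.cs.cmu.edu   ep=/usr/fx/bin/alpha
paragon.cs.cmu.edu  ep=/usr/fx/bin/paragon
```

This is precisely the kind of per-host configuration that pvm_addhosts() does not let a program set up on its own, which is why daemon startup ends up exposed to the user.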
Because each PVM user establishes his own virtual machine and different virtual machines can contain the same computer, application performance can vary considerably with few clues as to why. Although it is possible to run PVM jobs under a queueing system such as DQS, job queueing on a single, shared virtual machine seems like a natural extension to PVM and would eliminate the need for users to deal with more than one tool. Such an extension would also make it easier to hide PVM from our users by centralizing PVM configuration under our direct control.