Design for MPI_Abort/Exit Status

Karen Baltzer
April 12, 2001

 
 


This document describes the methodology I am proposing to enable MPI jobs to return an exit status via
MPI_Abort and to display an error message when one of the child processes is terminated by
receipt of a signal.


 

Why this feature is needed

Currently, if an MPI process is terminated by a signal such as SIGBUS or SIGSEGV,
no reason for the failure is returned to the user (see PV 785289).

Also, MPI_Abort currently does not return the error code to the run environment.
This enhancement has been requested by Walter Spector and Ken Taylor for FNMOC
(see PVs 799785 and 802953).  FNMOC has some operational jobs that need to
query the exit status and then take appropriate action.

This feature will be implemented for MPI on both Irix and Linux.

Methodology for MPI returning an exit status

The MPI daemon will now extract the status information when a child process
dies by calling waitpid(mpi_sgi_pid(i), &mpi_sgi_exit_stat, WNOHANG).
This status will then be sent to the mpirun process:

        if (unexpdeath){
                MPI_SGI_printf("MPI: MPI_COMM_WORLD rank %d has terminated without calling MPI_Finalize()\n",
                        mpi_sgi_base_grank[mpi_sgi_my_hrank]+unexpdeath-1);

                /* Tell mpirun that an exit status follows, then send the raw wait status */
                label = MPI_CMD_EXITSTAT;
                MPI_SGI_ctrl_send(&label, sizeof(int));
                MPI_SGI_ctrl_send(&mpi_sgi_exit_stat, sizeof(int));
                /* ... daemon cleanup follows ... */
        }

The MPI daemon will then terminate itself and all its child processes.

The mpirun process will get the exit status and execute the following function:

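static int
get_exitstat(xmpi_arg_t *arg, void *dummy)
{
        int len;

        /* Receive the raw wait status sent by the MPI daemon */
        len = sizeof(int);
        xmpi_net_recv(arg->recv_fd, &MPI_exitstat, len);
        if (WIFEXITED(MPI_exitstat)) {
             /* Child exited (e.g. via MPI_Abort): propagate its exit code */
             int estat = WEXITSTATUS(MPI_exitstat);
             exit(estat);
        } else if (WIFSIGNALED(MPI_exitstat)) {
             /* Child was killed by a signal: report it and exit nonzero */
             int sig_num = WTERMSIG(MPI_exitstat);
             xmpi_all_error(0,"Received signal %1d\n", sig_num);
             exit(1);
        }

        return 0;
}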

A message will be displayed if a child process was terminated by receipt of
a signal:
$ mpirun -np 2 ./mtest1
 groupsize= 2

lib-4051 : UNRECOVERABLE library error
  The file must not exist prior to OPEN if STATUS is 'NEW'.

Encountered during an OPEN of unit 1
Fortran unit 1 is not connected
IOT Trap
MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
MPI: Received signal 6
 
 

If a child process terminates via MPI_Abort, the exit status will be returned to the calling
environment.  Any comm argument to MPI_Abort will be treated as if the comm were
MPI_COMM_WORLD.  This is standard-compliant (see page 197 of the MPI standard).

$ cat mpibug.f
        program mpibug
        use mpi
        implicit none

        integer:: ierr

        call mpi_init (ierr)
        call mpi_abort (MPI_COMM_WORLD, 42, ierr)
        end
$ mpirun -np 2 ./mpibug
MPI: MPI_COMM_WORLD rank 1 has terminated without calling MPI_Finalize()

$ echo $?
42