[MPI error help] mpirun on a Linux server

Hi all,

I am looking for suggestions on an error message I get on a Linux server when I run a program with Open MPI. Specifically, I entered the following two commands in the terminal:

make mpi 
mpirun -np 2 ./filename

Both commands run fine in the terminal on my own MacBook. However, when I switch to a Linux server (provided by the university), the first one runs fine, but the second one prints the following error messages, although the program still seems to run.

Any suggestion on understanding the message or fixing the issue would be highly appreciated!

SERVER:rank0.FILENAME: Failed to get bond0 (unit 0) cpu set
SERVER:rank0.FILENAME: Failed to get bond0 (unit 0) cpu set
SERVER:rank0: PSM3 can't open nic unit: 0 (err=23)
SERVER:rank0: PSM3 can't open nic unit: 0 (err=23)
SERVER:rank0.FILENAME: Failed to get bond0 (unit 0) cpu set
SERVER:rank0: PSM3 can't open nic unit: 0 (err=23)
SERVER:rank1.FILENAME: Failed to get bond0 (unit 0) cpu set
SERVER:rank1: PSM3 can't open nic unit: 0 (err=23)
SERVER:rank1.FILENAME: Failed to get bond0 (unit 0) cpu set
SERVER:rank1: PSM3 can't open nic unit: 0 (err=23)
SERVER:rank1.FILENAME: Failed to get bond0 (unit 0) cpu set
SERVER:rank1: PSM3 can't open nic unit: 0 (err=23)
SERVER:rank0.FILENAME: Failed to get bond0 (unit 0) cpu set
SERVER:rank0: PSM3 can't open nic unit: 0 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: SERVER
  Location: mtl_ofi_component.c:513
  Error: Invalid argument (22)
--------------------------------------------------------------------------
SERVER:rank1.FILENAME: Failed to get bond0 (unit 0) cpu set
SERVER:rank1: PSM3 can't open nic unit: 0 (err=23)
 Hello. I am processor            0  out of            2
 Hello. I am processor            1  out of            2
[SERVER:1022598] 1 more process has sent help message help-mtl-ofi.txt / OFI call fail
[SERVER:1022598] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

[Sorry, I don’t know how to block-quote all this output, so I decided to use code formatting for the error message.]

Thanks, and I look forward to hearing from you!

Best,
Long

Some more information on what make mpi is supposed to do and how your program filename is compiled would be helpful. Is this an Intel system you are running on?
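One way to collect that information is sketched below. The function name describe_mpi_setup is made up for illustration, and the wrapper names mpicc and mpifort are assumptions about what the server provides; the Makefile itself is not shown in this thread, so the script only reports what it finds.

```shell
# Hypothetical helper (not from the thread): gather the details asked for above.
describe_mpi_setup() {
    # Dry run: print the commands `make mpi` would execute without running them.
    if [ -f Makefile ] || [ -f makefile ]; then
        make -n mpi || echo "make: could not resolve target 'mpi'"
    else
        echo "make: no Makefile in $(pwd)"
    fi

    # Which launcher and compiler wrappers come first on PATH?
    # (mpicc and mpifort are assumed names; adjust to what the Makefile uses.)
    for tool in mpirun mpicc mpifort; do
        command -v "$tool" || echo "$tool: not found on PATH"
    done

    # Version of the MPI launcher, if one is available.
    if command -v mpirun >/dev/null 2>&1; then
        mpirun --version || true
    fi

    # CPU vendor, to answer the "Intel system?" question (lscpu is Linux-only).
    if command -v lscpu >/dev/null 2>&1; then
        lscpu | grep -i 'vendor' || true
    fi
}

describe_mpi_setup
```

If the wrappers turn out to be Open MPI's, mpicc --showme should additionally print the underlying compiler command line, which helps when comparing the MacBook build with the server build.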

Btw, the output does contain something which looks like the executable ran nonetheless:

 Hello. I am processor            0  out of            2
 Hello. I am processor            1  out of            2

Some browsing brought up this issue, which shows the same error message as yours. Maybe you could rerun your program with FI_LOG_LEVEL=debug mpirun -np 2 ./filename and check the status of PSM3. In that issue, the problem was resolved by setting PSM3_MULTI_EP=1. In general, I don’t think ordinary users should be expected to have this level of knowledge about network adapters. Something must have been misconfigured earlier.
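The two reruns mentioned above could be scripted roughly as follows. This is only a sketch: run_case and the log file names are invented for illustration, and whether PSM3_MULTI_EP=1 helps on this particular server is not confirmed anywhere in this thread.

```shell
# Hypothetical helper: run the program once with extra environment variables
# and keep the combined stdout/stderr in a per-case log file.
run_case() {
    label=$1
    shift
    log="mpi_${label}.log"
    if command -v mpirun >/dev/null 2>&1 && [ -x ./filename ]; then
        status=0
        # `env VAR=value cmd` sets the variables for this one command only.
        env "$@" mpirun -np 2 ./filename >"$log" 2>&1 || status=$?
        echo "$label: exit status $status, output in $log"
    else
        echo "$label: skipped (mpirun or ./filename not found)"
    fi
}

# 1) Verbose libfabric logging, to see what the PSM3 provider reports.
run_case debug FI_LOG_LEVEL=debug

# 2) The setting that resolved the linked issue.
run_case multi_ep PSM3_MULTI_EP=1
```

Comparing mpi_debug.log with mpi_multi_ep.log (for example with grep PSM3 on each) should show whether the "can't open nic unit" lines disappear in the second run.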

Maybe you can get support through your university’s support ticket system?