Will OpenMP use all the cores when MPI is also enabled?

Dear all,

A naive question. Say there is a code, and the PC has 10 CPU cores, so the MPI ranks go from rank 0 to rank 9.
For example, a code structure like the one below:

step 1: Use MPI. All the cores are doing their jobs. Then rank 0 collects the results from all the cores.
For example, each core solves some ODEs, then sends its results to rank 0.

step 2: Use OpenMP. Rank 0 then continues to process the collected results.
For example, rank 0 needs to deal with a big loop based on the collected results, and this part may be better suited to OpenMP.

Now the question is:
if step 2 uses OpenMP, will OpenMP still use all 10 cores?
I mean, since step 2 is done by the rank 0 core from MPI, if I use OpenMP, can OpenMP see all 10 cores from MPI?

Thanks much in advance!

You should read through the thread "Question about hyper-threading", as it may answer some of your questions.

You should always be specific about whether you’re talking about physical or logical cores when talking about parallelization models.

MPI and OpenMP don’t “see” cores, at least at the programmer level; they launch processes and threads. You can easily launch 17 MPI tasks, each with 33 OpenMP threads, on a CPU with a single physical core if you want. The runtimes trust that you’re doing the right thing, and it’s up to you to set the thread affinity properly.
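As a rough illustration of that point, here is a minimal sketch (the program name is made up for this example) that prints what the OpenMP runtime reports. It uses only standard `omp_lib` routines. The thread count follows `OMP_NUM_THREADS`, not the number of cores, so running it with `OMP_NUM_THREADS=33` on a small machine should still report 33 threads in the region, assuming the runtime grants the request.

```fortran
program show_omp_view
    use omp_lib
    implicit none

    ! Number of processors the OpenMP runtime considers available to this process
    print *, 'processors reported by OpenMP: ', omp_get_num_procs()

    ! Upper bound on the team size of the next parallel region (follows OMP_NUM_THREADS)
    print *, 'max threads for a parallel region: ', omp_get_max_threads()

    !$omp parallel
    !$omp single
    ! Inside a parallel region, omp_get_num_threads() gives the actual team size
    print *, 'threads in this region: ', omp_get_num_threads()
    !$omp end single
    !$omp end parallel
end program show_omp_view
```

Where those threads actually run is a separate matter: the standard OpenMP environment variables `OMP_PROC_BIND` and `OMP_PLACES` control thread placement, and an MPI launcher may additionally restrict each rank to a subset of the cores.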

I guess I can say I am familiar with MPI. I know the difference between CPUs, threads and physical cores. LOL. A supercomputer cluster has nodes, and each node is a motherboard. Each node can have, say, two CPUs. For Intel CPUs, each CPU usually has N physical cores and therefore 2N logical cores if hyper-threading is enabled.

To avoid confusion and keep things simple: by 10 cores I mean just 1 CPU, with hyper-threading OFF. So it is just 1 CPU with 10 physical cores. From MPI’s point of view there are 10 ranks, again from rank 0 to rank 9.

Now I guess my question is really: MPI and OpenMP are relatively independent, right?
I mean, from MPI’s view, there are 10 cores. From OpenMP’s view, there are also 10 cores.
In step 2, from MPI’s view, only the rank 0 core is doing the job.
However, from OpenMP’s view, this MPI rank 0 core can still access all 10 cores, yes?

Maybe it helps to introduce the concept of a process. A process has its own address space, whereas the threads that run within a process share the same address space. MPI allows different processes to communicate with each other with messages. OpenMP allows the threads to work together through their shared address space. Some cores can run multiple hardware threads at once, usually a small number like two or four.
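To make the shared-address-space point concrete, here is a minimal OpenMP-only sketch (the program and variable names are invented for this example). All threads write into the same array `owner`, and after the parallel loop the main thread can read every entry directly, with no messages involved. With MPI processes, the same exchange would need explicit sends and receives.

```fortran
program shared_memory_sketch
    use omp_lib
    implicit none
    integer, parameter :: n = 8
    integer :: i, owner(n)

    owner = -1

    ! "owner" is shared by all threads; the loop index i is private to each thread
    !$omp parallel do
    do i = 1, n
        owner(i) = omp_get_thread_num()   ! each thread records which iterations it ran
    end do
    !$omp end parallel do

    ! Back on the main thread: every entry written by the other threads is visible here
    print *, 'iteration owners: ', owner
end program shared_memory_sketch
```

Which thread ends up owning which iteration depends on the schedule and the number of threads, so the printed values will vary from run to run and from machine to machine.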

So if you have two CPUs on a node, those two CPUs can run separate processes, which communicate with each other through MPI. The two CPUs on a node might also be able to share some memory, or they may have a local disk or SSD that they share, but it would be in different address spaces, so it would look more like I/O or a network communication than like memory references. On one of the CPUs you can run separate processes (which communicate with MPI), and within each process the threads can communicate through shared memory with each other. Each CPU can have multiple cores, and sometimes each core can run multiple threads. When MPI wants to exchange information, it needs to look to see if the processes are on the same CPU, on separate CPUs on the same node, or on separate CPUs on different nodes. It will exchange messages in different ways in all of those cases.

I think thread schedulers can move threads from one core to another during execution. In a timesharing environment, processes can also be transferred between cores on a CPU or even between different CPUs if the hardware and software support that. The process and thread schedulers try to keep all the hardware busy; the programmer usually doesn’t have direct control of that. If the programmer over-allocates processes and threads, then the scheduler wastes time swapping things in and out and among the cores, and the computational efficiency decreases. If the programmer under-allocates processes and threads, then hardware sits idle and computational efficiency decreases. So the goal is to find the right mixture.
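One way to look for that mixture is simply to time the same loop with different thread counts. The sketch below is an illustration under assumptions, not a benchmark: the loop body, the problem size `n`, and the choice to go up to twice the reported processor count (to show over-allocation) are all arbitrary.

```fortran
program thread_scaling_sketch
    use omp_lib
    implicit none
    integer, parameter :: n = 20000000   ! arbitrary problem size for illustration
    integer :: nthreads, i
    double precision :: t0, t1, s

    nthreads = 1
    ! Double the thread count each pass, going past the processor count on purpose
    do while (nthreads <= 2 * omp_get_num_procs())
        call omp_set_num_threads(nthreads)
        s = 0.0d0
        t0 = omp_get_wtime()
        !$omp parallel do reduction(+:s)
        do i = 1, n
            s = s + sin(dble(i))
        end do
        !$omp end parallel do
        t1 = omp_get_wtime()
        print '(a,i4,a,f8.3,a,es12.4)', 'threads=', nthreads, &
              '  time=', t1 - t0, ' s  sum=', s
        nthreads = nthreads * 2
    end do
end program thread_scaling_sketch
```

Built with something like `gfortran -fopenmp`, the timings would typically improve up to roughly the number of physical cores and then flatten or get worse, but the exact shape depends on the machine and on whatever else is running.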


You can control the number of MPI ranks and the number of OpenMP threads.

For example take this program:

program test_mpi
    use mpi
    use omp_lib
    implicit none
    integer :: ierr, my_rank, num_ranks, i

    ! First part of the program uses MPI
    call mpi_init(ierr)
    call mpi_comm_rank(MPI_COMM_WORLD, my_rank, ierr)
    call mpi_comm_size(MPI_COMM_WORLD, num_ranks, ierr)

    print*, 'Hello from image ', my_rank , ' of ', num_ranks

    call mpi_finalize(ierr)

    ! End of MPI

    ! Second part of the program uses openmp

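    if (my_rank == 0) then ! Without this, the loop runs for all MPI ranks -- despite mpi_finalize above

        ! Split the loop iterations across the OpenMP threads of this process
        !$omp parallel do
        do i = 1, 10
            print*, 'Hello from omp thread ', omp_get_thread_num(), ' of ', omp_get_num_threads()
        end do
        !$omp end parallel do

    end if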

    ! End of openmp

end program

We build it and specify the number of ranks and threads like this:

mpif90 -fopenmp test_mpi.f90 -o test_mpi

OMP_NUM_THREADS=4 mpiexec -np 2 ./test_mpi

And this is reflected in the output:

 Hello from image            0  of            2
 Hello from image            1  of            2
 Hello from omp thread            1  of            4
 Hello from omp thread            3  of            4
 Hello from omp thread            0  of            4
 Hello from omp thread            0  of            4
 Hello from omp thread            0  of            4
 Hello from omp thread            1  of            4
 Hello from omp thread            1  of            4
 Hello from omp thread            2  of            4
 Hello from omp thread            2  of            4
 Hello from omp thread            3  of            4


Awesome @gareth , thank you so much! That completely, clearly and perfectly solves the problem!