Locally setting number of threads in OpenMP

I’m looking for a way to locally set the number of OpenMP threads for a routine. There are a couple of constraints which do not allow usage of the OMP_NUM_THREADS environment variable:

  • used via a foreign function interface
    • caller cannot know whether runtime will be threaded
    • OpenMP pragmas are not generally available to the caller
  • multiple concurrent instances must not interfere
    • set_omp_num_threads in global scope will not work
  • number of threads stored in an opaque handle
    • pointer to a non-bind(c) Fortran object
    • setter function is available
    • deferred usage of thread number when using handle in another procedure

One the C side the simplified setup might look like this:

typedef struct context* context;
extern context new_context(void);
extern void context_set_num_threads(context, int);
extern void actual_calculation(context);
extern void delete_context(context*);

int
main() {
  context ctx = new_context();
  context_set_num_threads(ctx, 2);

  actual_calculation(ctx);

  delete_context(&ctx);
  return 0;
}

One the Fortran side a minimal example could look like

module bind_c
  use, intrinsic :: iso_c_binding
  implicit none

  ! Wrapper to create an opaque handle for configuration
  type :: vp_context
    integer, allocatable :: nthreads
  end type vp_context

contains

  ! Allocate new context object
  function new_context() result(vctx) bind(c)
    type(vp_context), pointer :: ctx
    type(c_ptr) :: vctx

    allocate(ctx)
    vctx = c_loc(ctx)
  end function new_context

  ! Set the number of threads in a context object,
  ! non-positive values will clear the property from the object
  subroutine context_set_num_threads(vctx, nthreads) bind(c)
    type(c_ptr), value :: vctx
    type(vp_context), pointer :: ctx
    integer(c_int), value :: nthreads

    if (.not.c_associated(vctx)) return
    call c_f_pointer(vctx, ctx)

    if (allocated(ctx%nthreads)) deallocate(ctx)
    if (nthreads > 0) ctx%nthreads = nthreads
  end subroutine context_set_num_threads

  ! Entry point for performing calculation for a given configuration,
  ! usually takes several objects but simplified here
  subroutine actual_calculation(vctx) bind(c)
    type(c_ptr), value :: vctx
    type(vp_context), pointer :: ctx
    integer :: nthreads

    if (.not.c_associated(vctx)) return
    call c_f_pointer(vctx, ctx)

    nthreads = 0
    if (allocated(ctx%nthreads)) nthreads = ctx%nthreads

    !$omp teams num_teams(1) thread_limit(nthreads)
    ! call some_library_function(...)
    ! contains several !$omp parallel do regions
    !$omp end teams
  end subroutine actual_calculation

  ! Free allocation of a context handle,
  ! also resets the handle to not leave a dangling pointer
  subroutine delete_context(vctx) bind(c)
    type(c_ptr), intent(inout) :: vctx
    type(vp_context), pointer :: ctx

    if (.not.c_associated(vctx)) return
    call c_f_pointer(vctx, ctx)
    deallocate(ctx)
    vctx = c_null_ptr
  end subroutine delete_context
end module bind_c

Most promising so far is the OpenMP 4.0 teams construct and in some preliminary tests it seems to work as expected. However, for a team of threads the OpenMP specification seems to recommend the usage of distribute rather than parallel do, which would require some adaptation and testing in the library first (which is a feasible option).

Is there a better way to locally set the number of threads for a certain scope in OpenMP?

https://www.openmp.org/spec-html/5.0/openmpse50.html

It is a bit convoluted, but I generally find a combination of the three methods works. I like the idea you describe with teams, but personally I need a more concrete example of a use case; but I generally run in a homogeneous cluster; with very similiar tasks assigned to each node. Are you working with a problem where very different tasks are assigned to the threads?

I’m looking at a scenario where the consumer side of my library wants to run several instances in the same process with different configurations, like different numbers of threads depending for example on the problem size. In the worst case scenario I have no control over the host process, e.g. could be a service processing different requests delivered by a socket, adjusting environment variables is therefore not an option.


The project is on GitHub if seeing the actual code helps (the project is fpm compatible, but the routines in question are not used in the fpm version), the above example is a simplified version capturing the essential parts. Here are the key routines in the original