I have an “embarrassingly parallel” program which has been parallelized with OpenMP. The program spawns a bunch of threads to do independent computations, then writes the results to a file. An approximate reproduction is:
program
implicit none
integer :: i, n
integer :: outfile
real(kind=8) :: datavar
! ... set-up code ...
!$omp parallel do
do i = 1, n
call expensive_computation_with_index(i, datavar, n, ...)
!$omp critical
write(outfile, *) "i = ", i, " result = ", datavar
!$omp end critical
end do
!$omp end parallel do
! ... clean-up code ...
end program
After each thread finishes its work, I want the thread to wait until all previous indices have finished writing. That is, I want the output file to look like
i = 1 result = ...
i = 2 result = ...
i = 3 result = ...
where each thread is essentially processed in order of the loop. I tried to put in a !$omp barrier clause before the write(), but that doesn’t work with !$omp parallel do. What’s the course of action here, that each thread synchronizes the file writing?
However, this tends to be less efficient. A better method is to store the results in an array and then output the array sequentially after the parallel loop is complete.
If expensive_computation_with_index() does indeed a lot of work, then the solution with ORDERED is ok, and the performance hit has chances to be very minor.
I asked Google Gemini about the “low-level way” of doing this, and let’s just say it is complicated:
! :warning:
! For illustration purposes only, use ORDERED construct if possible instead
! :warning:
integer :: i, n, next_serial, tid, nthr
integer :: outfile
real(kind=8) :: datavar
! ... set-up code ...
! Initialize the "ticket" counter
next_serial = 1
! For Serial execution
me = 0
nthr = 1
!$omp parallel private(i, tid) shared(next_serial, n, nthr)
!$ me = omp_get_thread_num()
!$ nthr = omp_get_num_threads()
! Thread-strided loop
do i = me+1, n, nthr
! 1. Parallel: Do the heavy lifting here
call expensive_computation_with_index(i)
! 2. Wait: Spin until it is this index's turn
! We use flush to ensure we are reading the latest value from memory
do
!$omp flush(next_serial)
if (next_serial == i) exit
end do
! 3. Serial: Write the output
write(*, *) "i = ", i, " written by thread ", tid
! 4. Update: Hand off the "turn" to the next index
! We can simply increment because only one thread is allowed
! past the spinlock at a time for this specific 'i'
!$omp atomic update
next_serial = next_serial + 1
! 5. Broadcast: Ensure other threads see the new value immediately
!$omp flush(next_serial)
end do
!$omp end parallel
Why not save the results to real(kind=8), allocatable :: datavar_list(:)
This will eliminate any inefficient delays to expensive_computation_with_index
If the results are more extensive, data_list can be an allocated derived type, as a shared data structure.
At the end of the !$OMP region, this can be quickly written to file or further analysed.
In order with what other people have said, you might be better off doing the computation, storing the data and then dumping everything serially. If your data is gigantic you might want to look into doing parallel I/O using HDF5 or something similar.
I think the best thing to do here would be the ordered construct. datavar isn’t actually a scalar (or even just one variable), and it would be too much work to figure out a new data format (later parts of the analysis pipeline depend on the current format), and since each call to the expensive subroutine is expected to roughly take the same amount of time, the ordered write is probably a relatively small hit to be worried too much about. Also the number of threads is smaller than the number of loop runs, and we’re trying to get each thread to output as soon as it’s done, with the caveat of being in order. Thanks, everyone for all your help!