What rationale can be given for why call cpu_time measures approximately half as much time for OpenMP as for do concurrent?
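Worth noting when interpreting such numbers: cpu_time reports processor time, which on many implementations is aggregated across all of a process's threads, while wall-clock time needs system_clock (or omp_get_wtime). A minimal sketch measuring both around the same loop:
program timing_demo
  implicit none
  integer(8) :: c0, c1, crate
  real :: t0, t1
  integer :: i
  real, allocatable :: x(:)

  allocate(x(10000000))
  call cpu_time(t0)
  call system_clock(c0, crate)
  !$OMP PARALLEL DO
  do i = 1, size(x)
     x(i) = sqrt(real(i))
  end do
  call system_clock(c1)
  call cpu_time(t1)
  ! If cpu_time aggregates across threads, the first number can exceed
  ! the wall time by roughly the number of active threads.
  print *, 'cpu_time : ', t1 - t0, ' s'
  print *, 'wall time: ', real(c1 - c0) / real(crate), ' s'
end program timing_demo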
I agree, but it can quickly get quite complicated for a compiler to decide which order is the best one. Admittedly, the case posted here looks simple, and the compiler should be able to figure out that the loop on i should be the inner one, as i is always used to index the first dimension of the arrays. But what I mean is that the developer should not expect a compiler to always make the right choice on its own.
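For concreteness, a minimal sketch of that point (a, b, and c are hypothetical conformable 2-D arrays): with Fortran's column-major storage, having i index the first dimension means the i loop should be innermost for stride-1 access.
! Cache-friendly: the inner loop walks contiguous memory.
do j = 1, m
   do i = 1, n
      c(i,j) = a(i,j) + b(i,j)
   end do
end do
! The reversed nesting strides by n elements per iteration and is
! typically much slower for large arrays.
do i = 1, n
   do j = 1, m
      c(i,j) = a(i,j) + b(i,j)
   end do
end do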
Note that on a GPU, how the loops are ordered generally does not matter, as there's no cache memory.
For a compiler that already supports OpenMP, translating a do concurrent loop into an OpenMP one looks quite simple. And with a single index I expect similar performance:
do concurrent (i=1:n)
...
end do
becomes
!$OMP PARALLEL DO
do i = 1, n
...
end do
This is no different with multiple indices, but as I said in my previous post, the loop ordering can be an issue:
do concurrent (k=1:p,j=1:m,i=1:n)
...
end do
becomes
!$OMP PARALLEL DO COLLAPSE(3)
do k = 1, p
do j = 1, m
do i = 1, n
...
end do
end do
end do
With this point of view, a do concurrent inside an OpenMP region (or vice versa) would just be nested OpenMP parallelism.
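A hedged sketch of that reading (nblocks, blocksize, and work are hypothetical names), assuming a compiler that maps do concurrent onto OpenMP threads; whether a second level of threads is actually spawned depends on the compiler and its nested-parallelism settings:
!$OMP PARALLEL DO
do b = 1, nblocks
   do concurrent (i = 1:blocksize)   ! index i is local to the construct
      work(i, b) = real(i * b)
   end do
end do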
But clearly, OpenMP offers much more control than do concurrent when it comes to fine-tuning the performance.
A code that works on all platforms is one thing, and a code that works optimally on all platforms is another thing.
Since this is documented, it becomes a feature and not a bug.
This raises a few related questions. First, can the programmer expect all compilers to make this choice, or is this index-order convention specific to one compiler (ifx in this case)? Do any compilers for which the index order matters make a different choice (e.g., ordering from innermost to outermost)? If the former is true, then the programmer can write portable code that performs well (hopefully optimally) across a variety of compilers. If the latter, then a Fortran programmer might need to benchmark each compiler individually and rely, for example, on conditional compilation within a preprocessor to achieve good performance.
If a programmer writes multiple nested DO CONCURRENT loops, is this semantically equivalent within the Fortran standard to writing a single DO CONCURRENT with multiple indices? I think the answer is yes, but I'm uncertain. If the answer is yes, then do current compilers perform exactly the same optimizations and produce the same code in these two cases? Or can the programmer use multiple nested loops in a portable way to hint/suggest the optimal loop order to a compiler? To give an example, suppose the first index in an array reference is known to the programmer to have a short range. Can the programmer still use nested DO CONCURRENT to get the compiler to use another index as the inner loop, in the same way that normal DO loops would be nested for this case?
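To make that concrete, a hedged sketch of the two forms (a, b, and c are hypothetical conformable 3-D arrays); whether compilers treat them identically is exactly the open question:
! Form 1: one construct with multiple indices; the iteration order is
! entirely up to the compiler.
do concurrent (k=1:p, j=1:m, i=1:n)
   c(i,j,k) = a(i,j,k) + b(i,j,k)
end do

! Form 2: nested constructs; the written nesting suggests an ordering
! (here keeping the short first index i out of the innermost position),
! though each construct's own iterations remain unordered.
do concurrent (k=1:p, i=1:n)
   do concurrent (j=1:m)
      c(i,j,k) = a(i,j,k) + b(i,j,k)
   end do
end do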
I think of these as different levels of parallelism, perhaps with some overlap of capabilities, but still distinct in some sense. I think of OpenMP as shared-memory parallelism based on threads, or some other way to specify independent execution streams within the program (I guess hyperthreading would be included). I think of DO CONCURRENT as a way to tell the compiler that the loop indices can be executed in any order (particularly when that might not be obvious to the compiler from just the statements), and thus the compiler can freely choose the optimal index order for stride calculations, cache optimization, and efficient use of multiple functional units (fp adders, fp multipliers, fused multiply-add instructions, SSE instructions, GPU instructions, etc.). DO CONCURRENT also does a few other things, like documenting the program features to human readers, and preventing at compile time calls to impure functions or subroutines. If the hardware does not support shared-memory parallelism, then OpenMP would not be very useful, but DO CONCURRENT would still be quite useful. There is some overlap of functionality too; for example, both constructs can be used to produce GPU-aware code, and if the hardware does support shared-memory parallelism, then both constructs can produce hyperthreaded or multithreaded programs. It is this last case that was described above by @PierU.
I'm not sure about this interpretation. From what I've understood from the sparse documentation of different compilers, do concurrent would be a kind of "yet-to-be language-intrinsic replacement" for a subset of what standards such as OpenMP or OpenACC can offer.
So in principle it should indeed enable removing some (but not all) directives. For instance, with nvfortran, one can deactivate automatic memory management and then mix OpenACC data-movement directives with do concurrent in order to offload to the GPU, keeping fine-grained control over the data while delegating the processing-logic arrangement to the compiler. I think one can also do the same with omp target, but I haven't tested that.
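A hedged sketch of what that mix might look like (flag spellings vary across nvfortran versions; -stdpar=gpu enables do concurrent offload, and something like -gpu=nomanaged or -gpu=mem:separate disables automatic memory management):
subroutine saxpy_dc(n, a, x, y)
  ! Compile with something like: nvfortran -acc -stdpar=gpu -gpu=nomanaged
  integer, intent(in) :: n
  real, intent(in)    :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i
  !$acc data copyin(x) copy(y)   ! explicit, fine-grained data movement
  do concurrent (i = 1:n)        ! processing arrangement left to the compiler
     y(i) = y(i) + a * x(i)
  end do
  !$acc end data
end subroutine saxpy_dc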
Another problem that I see lurking is that each compiler is baking different things in. Where can the developer draw the line as to which kind of shared parallelism will effectively be deployed at each loop level? That's not clear to me. The directives have the benefit of being explicit about it.
Hyper-threading is an affair of the OS, AFAIK, and it can actually be detrimental for HPC applications. I'm aware of the possibility of pinning threads or processes to physical cores, but with hyper-threading activated I think it can sabotage that control. We deactivate it and recommend that our users do so, as it does more harm than good.
You should understand that OpenMP uses POSIX threads, and for this to be effective you have to give each thread enough work to compensate for the overhead of issuing a task to a pthread.
I really wish there were another hardware mechanism for issuing threads; I have done VLSI design in the past and did just that for an embedded CPU.
Years ago Apple invented GCD, which reduced the thread dispatch time, and I used it effectively in the OpenGL framework to implement a separate thread for the server side.
Maybe someday Intel will come up with some sort of whiz-bang solution for this… but until that time you need to remember, when using concurrency through threads, to give each thread enough work to overcome this overhead.
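A minimal sketch of the point (timings and the crossover size are machine-dependent): the same parallel loop at a small and a large trip count, where the small one can easily be dominated by fork/join overhead.
program overhead_demo
  use omp_lib, only: omp_get_wtime
  implicit none
  integer, parameter :: small = 1000, large = 10000000
  real(8), allocatable :: x(:)
  real(8) :: t
  integer :: i

  allocate(x(large))

  t = omp_get_wtime()
  !$OMP PARALLEL DO
  do i = 1, small
     x(i) = sqrt(real(i, 8))
  end do
  ! For a tiny trip count, thread startup/dispatch can exceed the work.
  print *, 'small n:', omp_get_wtime() - t, 's'

  t = omp_get_wtime()
  !$OMP PARALLEL DO
  do i = 1, large
     x(i) = sqrt(real(i, 8))
  end do
  ! For a large trip count, the per-thread work amortizes the overhead.
  print *, 'large n:', omp_get_wtime() - t, 's'
end program overhead_demo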
You mean like Golang's goroutines, but for Fortran? (i.e., "foroutines")
Out of curiosity: given the fact that pthreads are used everywhere (and therefore presumably optimized already), is a high overhead really expected? Or is the overhead actually coming from OpenMP?
It isn't pthreads or any application-level issue.
The real overhead comes with assigning a lightweight thread to a process: a CPU has to assume the process context like any other process, which involves a call to the OS kernel, and that is never fast. ChatGPT says 10-100 microseconds, which is an eternity for a CPU in a tight loop with little to do for each thread; with thousands of threads slamming the kernel scheduler it gets even worse. With today's CPUs you have to assign thread resources through the kernel for reasons of security, and there just is no other way to do it.
Ideally you would like a queue of available threads to be available to the client at the application level and do the assignment there. That's the advantage a GPU has over a CPU: threads come from a thread pool and are assigned as needed by hardware to maximize the use of the GPU resources. We could talk about NVIDIA GPUs and warps, but let's keep it simple.
@Walt,
Thanks for these comments, as I have struggled to understand why initiating a !$OMP region (on Windows) using ifort or gfortran takes so long.
My experience is 5-20 microseconds, which is a prohibitive overhead for most DO CONCURRENT (or FORALL) loops I have written. Why isn't there a better way?
One aspect of this performance I have not been able to definitively measure is the overhead when subsequent !$OMP regions are used. Could it be that after the extra threads are established, multiple DO CONCURRENT loops become more effective?
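One way to test that (a hedged sketch; most OpenMP runtimes keep their thread pool alive between regions, so only the first region should pay the thread-creation cost):
program region_reuse
  use omp_lib, only: omp_get_wtime
  implicit none
  integer :: i, rep
  real(8) :: t, s

  s = 0.0d0
  do rep = 1, 5
     t = omp_get_wtime()
     !$OMP PARALLEL DO REDUCTION(+:s)
     do i = 1, 1000
        s = s + real(i, 8)
     end do
     ! On most runtimes, region 1 includes thread creation; regions 2-5
     ! should show only the smaller fork/join cost of reusing the pool.
     print *, 'region', rep, ':', omp_get_wtime() - t, 's'
  end do
  print *, s
end program region_reuse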
My impression is that DO CONCURRENT offers a simpler alternative for GPU off-loading, as I have struggled to understand the complexity of how this would be enabled via OpenMP. A non-standardised interface between different GPU hardware is another difficulty!
One of the things I have worked on in the past is maintaining thread affinity: with regular workloads that keep a thread on a particular task, you maintain cache coherency and keep thread affinity. Thread affinity is the ability of a thread to run on the same CPU each time it is enabled for a process.
So larger loads on fewer threads can perform better than more threads with smaller loads.
This keeps you away from the kernel scheduler and from losing your CPU time to the kernel or another process. Remember there are only so many threads available on any CPU system, and over-scheduling threads will just force your system to jump from thread to thread and lose thread affinity.
This is actually a positive thing for GPUs: when a thread is de-scheduled on a GPU, another thread is started in hardware. Things like memory accesses, which can take forever, will de-schedule a GPU thread, so having lots of threads will hide all the memory-access overhead.
The other main issue for performance in large systems with multiple physical CPUs is memory locality. But that's another issue; see NUMA for more information.
Doesn't GCD use pthreads under the hood?
At least in the past (although I wouldn't be surprised if it's still true today), the Intel OpenMP implementation did better than GCC on barrier overhead. Details can be found here: Georg Hager's Blog | Intel vs. GCC for the OpenMP vector triad: Barrier shootout!. I'd guess a similar overhead applies to forking. One can still reach good performance with both implementations; it depends on how fine-grained your multi-threading workload is.
Recently I read an announcement about a minimalistic C++ fork-join style thread library: Fork Union: Beyond OpenMP in C++ and Rust? | Ash's Blog
A good place to learn more about what goes into an OpenMP runtime library is the book by Klemm & Cownie, High Performance Parallel Runtimes: Design and Implementation, 2021, De Gruyter.
From ChatGPT…
GCD is built on top of pthreads, but it manages threads for you in a thread-pool model, so you don't create, destroy, or manage them directly.
How GCD Uses pthreads
- GCD maintains a global thread pool of worker threads (backed by pthreads).
- When you submit a block to a queue (e.g., dispatch_async), GCD:
  - places the block in a queue,
  - wakes up an existing thread from its pool (if one is available), or
  - creates a new pthread if all existing ones are busy and concurrency limits allow.
- Threads are reused to avoid the overhead of frequent pthread_create()/pthread_exit().
In my experience it just worked better than using pthreads and all the notification and locking mechanisms around them… personally, I didn't care how it worked!
I worked closely with the kernel team at Apple on a number of bugs related to this, so I got deep into the scheduler and learned a lot more than the casual user would with pthreads, since Apple wanted to push GCD over pthreads.
I think Apple has implemented several thread layers over the years. I think Apple originally used the BSD kernel in its A/UX operating system on Motorola CPUs in the 1980s. I think this included both the POSIX and the BSD thread libraries at that time. Then, a decade later, Apple adopted the Mach kernel for Mac OS X, but they still supported the earlier thread libraries. This was in the late 1990s and early 2000s. The important feature was that thread-level parallelism was built into the Mach kernel. Apple used this at the time to support multi-CPU systems using single-core PowerPC CPUs. In contrast, POSIX and BSD threads operate at the user level within an application, not at the kernel level. Then Apple switched to Intel CPUs (2005), and multicore CPUs, and more recently (2020) to multicore ARM CPUs. The GCD thread library came during this multicore period (2009 or so). I think the current macOS is still based on Mach, not the Linux, BSD, or System V kernel. I do not know the details of any of these libraries, or how their compatibility layers have evolved since the 1980s. Maybe others can add this information to this discussion.
Support for parallelizing do concurrent with flang hit the llvm-project main branch in early May, so it missed making it into LLVM 20 by a few weeks. If one builds the main branch of llvm-project from source, however, then flang can automatically parallelize do concurrent on CPUs. Coming up on June 13, at the Computational Aspects of Deep Learning Workshop at the International Supercomputing Conference, I'll present a paper showing the results of automatically parallelizing inference calculations on deep neural networks using do concurrent. The performance is roughly the same as OpenMP, with the benefit that offloading the same calculations to a GPU should be possible without any changes to the source code if the compiler supports automatic offload. LLVM flang doesn't offer automatic offload just yet, but Intel ifx and the Cray compiler do.
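For anyone who wants to try it, a minimal sketch (the flag spelling is taken from the llvm-project main branch at the time of writing and may change; check flang --help):
program dc_example
  ! Build llvm-project main from source, then compile with something like:
  !   flang -O2 -fdo-concurrent-to-openmp=host dc_example.f90
  implicit none
  integer :: i
  real :: y(1000000)
  do concurrent (i = 1:size(y))   ! iterations declared independent
     y(i) = sqrt(real(i))
  end do
  print *, y(1), y(size(y))
end program dc_example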
In OpenMP you don't create/destroy/manage the threads directly either. And AFAIK, smart OpenMP implementations also maintain a thread pool, without destroying the threads at the end of each parallel region; i.e., once created, a thread generally remains alive until the end of the process.
And if I remember correctly, the main advantage of GCD on macOS/iOS comes from its integration into the kernel (in contrast, the ports to other OSes, such as libdispatch on Linux or BSD, are less efficient because they are not integrated into those kernels). For instance, the number of active threads for a given process can be dynamically adjusted depending on the load of the machine, which makes sense for task-based parallelism (which GCD is).
In the end I'm not sure that OpenMP and GCD can be directly compared; they are distinct approaches for distinct applications.
BSD is a layer on top of OS X; at this point OS X doesn't resemble Mach much… it has been compiled into a monolithic kernel, and a lot of the messaging mechanisms that Mach was built upon are integrated directly into the monolithic kernel. It was one of the first fully symmetric multiprocessing operating systems, ages ago.
It's kind of a diversion from the topic, so I will stop there.
macOS, formerly OS X, is technically an XNU kernel, a BSD system layer, plus a proprietary GUI and proprietary frameworks. It is derived from NeXTSTEP, which had a Mach kernel instead.
All Unix systems have been multitasking from the beginning, so I don't think that macOS or any of its previous flavours was pioneering on this point.