Do the Fortran compilers I use generate programs that use E-cores?
There is so much discussion about the number of E-cores and P-cores in new processor marketing, but does the Fortran compiler generate executables that can use either E-cores or P-cores?
Does -march=native cope with this?
Surely if I am targeting AVX instructions, I am limited to only P-cores, so what do E-cores do for my OpenMP computation?
I have not seen much discussion of E-cores being used with Fortran.
I believe -march=native will generate code that works on both P cores and E cores. If the P cores support AVX512 and the E cores only support AVX2, I don’t know whether the compiler will restrict itself to AVX2 for maximum compatibility, or emit AVX512 code that the E cores (which only support AVX2) then somehow have to run. Either way, I expect the code to work on both P cores and E cores.
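One way to stop guessing, at least on Linux, is to compare the instruction set flags that each logical CPU reports. Below is a minimal Python sketch, assuming the usual /proc/cpuinfo layout with "processor" and "flags" lines. The sample text is made up for illustration and is not taken from any real chip; the function names are mine.

```python
def flag_sets(cpuinfo_text):
    """Map each logical CPU number to the set of ISA flags it reports."""
    result, current = {}, None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "processor":
            current = int(value)
        elif key == "flags" and current is not None:
            result[current] = set(value.split())
    return result


def common_flags(cpuinfo_text):
    """Flags reported by every logical CPU (safe for code that may migrate)."""
    sets = list(flag_sets(cpuinfo_text).values())
    return set.intersection(*sets) if sets else set()


# HYPOTHETICAL sample: CPU 0 reports avx512f, CPU 1 does not.
sample = """processor : 0
flags : fpu sse2 avx avx2 avx512f
processor : 1
flags : fpu sse2 avx avx2
"""
shared = common_flags(sample)
print(sorted(shared))
```

On a real Linux machine you would pass open("/proc/cpuinfo").read() instead of the sample. If every logical CPU reports the same flags, then code built for those flags should at least run on whichever core the OS picks.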
I have not used OpenMP. But for MPI, for example, if my laptop has 4 P cores and 8 E cores and I run mpiexec -n 6, it will use something like 4 P cores and 2 E cores, and the speed of my application will then be bottlenecked by the slower E cores.
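To put rough numbers on that bottleneck, here is a small Python sketch. It assumes the work is split evenly across ranks and that every rank has to finish before the job is done. The relative E core speed of 0.5 is a made-up figure for illustration, not a measurement.

```python
def wall_time(total_work, core_speeds):
    """Time for an evenly split job: each rank gets the same share,
    so the job ends when the slowest core finishes its share."""
    share = total_work / len(core_speeds)
    return max(share / speed for speed in core_speeds)


P_SPEED = 1.0  # relative speed of a P core (reference)
E_SPEED = 0.5  # ASSUMED relative speed of an E core (hypothetical)

four_p = wall_time(12.0, [P_SPEED] * 4)
mixed = wall_time(12.0, [P_SPEED] * 4 + [E_SPEED] * 2)
print(four_p, mixed)
```

Under these assumptions the 6 rank run on 4 P + 2 E cores takes longer than the 4 rank run on P cores alone, because the P cores sit idle waiting for the E cores. With dynamic load balancing the picture would be different.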
To avoid trouble with E cores, I always disable them. E cores do not have any use for high performance computing. I see some HPC clusters switching to AMD’s EPYC, which does not use E cores. For those HPC clusters using some of the new Intel chips, I doubt they enable E cores.
I really don’t understand why Intel adds so many E cores to its new CPUs. Perhaps it is because they are stuck with the 7nm or 10nm process, so their P cores cannot compete with AMD’s 5nm or 3nm chips, and Intel decided to add more low performance E cores to make its CPUs look not too bad in terms of core count. Overall, I think E cores are a very stupid idea. If there are background programs that do not need very fast cores, Intel could just significantly lower the frequency and power of some P cores to run them. I think Intel should stop making CPUs with this stupid P core + E core design and just make CPUs with only P cores or only E cores.
I suspect this is more of an OS question than one about Fortran compilers per se.
My machine has this CPU with 8 P cores (2 threads each) and 8 E cores (1 thread each).
MPI seems to detect this:
> mpirun --map-by core --display-map -np 1 true
Data for JOB [36700,1] offset 0 Total slots allocated 16
======================== JOB MAP ========================
Data for node: gareth-D3 Num slots: 16 Max slots: 0 Num procs: 1
Process OMPI jobid: [36700,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]]:
[BB/../../../../../../.././././././././.]
=============================================================
In the diagram at the bottom, each of the first 8 cores has two entries (BB or ..) and each of the last 8 has one (.), so I guess these are the hardware threads.
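That reading of the diagram can be checked mechanically. Here is a minimal Python sketch, assuming a single-socket map where cores are separated by "/" and each character is one hardware thread (as far as I know, B marks a thread the process is bound to and . one it is not). The string is rebuilt from the pattern described above rather than pasted, and the function name is mine.

```python
from collections import Counter


def threads_per_core(binding_map):
    """Number of hardware-thread slots per core in a single-socket
    binding map such as "[BB/../.]"."""
    cores = binding_map.strip().strip("[]").split("/")
    return [len(core) for core in cores]


# Same pattern as the map above: 8 two-thread cores, then 8 one-thread cores.
binding = "[" + "/".join(["BB"] + [".."] * 7 + ["."] * 8) + "]"
counts = threads_per_core(binding)
summary = Counter(counts)
print(len(counts), dict(summary))
```

This only counts characters, so it says nothing about which cores are P or E beyond the two-thread versus one-thread pattern.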
I can run Fortran (or other) code on some or all of these via MPI in the usual way. But using the E-cores with the P-cores can lead to horrible load imbalance. If you have a workload that doesn’t require all processes to progress at roughly the same rate, then I guess it could be of some use.
It appears that we need to “believe” rather than know what is happening with AVX and E-cores!
Apparently there is going to be yet another instruction set from Intel, AVX10.
For compiler developers, it must be a mess as to what AVX instructions are utilised.
While not using E-cores may be an option, my understanding is there are P-cores, E-cores and GPU cores, all with different instruction sets.
I am finding it difficult to understand what the compiler may do when -O3 -march=native -fopenmp is used.
Considering this together with the Windows 24H2 update, which further differentiates between Intel and AMD, and with what OMP_PLACES is claimed to do, I am very uncertain what performance can be expected from the hardware alternatives available.
Very likely (but this is just a guess), the compiler will generate binary code that is compatible with both the P and the E cores. But this is not specific to Fortran anyway.
I’m not sure what should be done to use only the P-cores with binary code that doesn’t work on the E-cores (so compiled with the appropriate options)… Is the OS able to schedule such code on the P-cores only, on its own?
Or maybe the E-cores have a hardware translation layer, e.g. an AVX512 instruction is internally translated to E-core instructions? After all, I think there has been such a hardware translation layer in x86 CPUs for a long time: while the x86 instruction set is CISC, the inner architecture of Intel x86 CPUs has evolved into something more similar to RISC, and some instructions must be translated. So, it’s not new.
A solution to use only the P-cores would be to set the OMP_PROC_BIND and OMP_PLACES variables accordingly.
But obviously, the CPUs that mix P and E cores are not the most appropriate for HPC…
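For the OMP_PROC_BIND / OMP_PLACES approach, here is a minimal Python sketch that builds an explicit place list with one place per P-core. The CPU numbering is an assumption: I am supposing that logical CPUs 0-15 are the hardware threads of the 8 P cores in sibling pairs (0,1), (2,3), and so on, with the E cores numbered after them. Check the real numbering on your machine (for example with lscpu --extended on Linux) before using anything like this.

```python
def omp_places(cpu_ids):
    """Format logical CPU ids as an explicit OMP_PLACES list,
    one place per id, e.g. "{0},{2}"."""
    return ",".join("{%d}" % cpu for cpu in cpu_ids)


# ASSUMPTION: logical CPUs 0-15 belong to the 8 P cores, two per core,
# so taking every second id gives one hardware thread per P core.
p_core_first_threads = range(0, 16, 2)
places = omp_places(p_core_first_threads)

print("export OMP_PLACES='%s'" % places)
print("export OMP_PROC_BIND=close")
print("export OMP_NUM_THREADS=%d" % len(p_core_first_threads))
```

The printed lines are meant to be pasted into a shell before launching the OpenMP program. Whether the runtime on a given OS honours them is something I would still verify, for example by setting OMP_DISPLAY_ENV=true.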
AFAIK, the instruction set of the E-cores is just a subset of the x86 instruction set. GPU cores are completely different.
Because of P- and E-cores, it is not straightforward for me to understand the results of various benchmarks on the net… For example, I’ve come across the following articles about the M4 CPU, which expect that the M4 will be the fastest on the market:
And according to the Geekbench site, the single- and multi-core scores for the M4 Max are indeed greater than those for the Ryzen 9 9950X and Core i9-14900K.
But I am not sure to what extent the above figures are relevant for purely numerical calculations (as opposed to web browsing, file compression, gaming, etc.). Scores for specific benchmarks are also shown on each page, and I guess “Ray tracer” and similar will be more relevant for numerical calculations. (There the scores are 37428, 65128, and 47412, respectively, which changes the order; that may be natural if “Ray tracer” mainly uses P-cores.) But the M4 seems to have very fast memory, so the results may also depend on how much data has to come from memory during the computation (so probably on the type of calculation). If available, I would like to see more benchmarks oriented towards numerical calculations…
(Apart from the benchmarks, the small size of the M4 Mac mini is very appealing: the weight is just 600-700 g!)