Optimizing Makefile for VASP.6.4.2 Compilation with Intel OneAPI 2023.2.0 on AMD EPYC

Hello Fortran Community,

I am currently configuring the compilation settings for VASP 6.4.2 using Intel OneAPI 2023.2.0 on a system with dual AMD EPYC 9554 processors and 24 × 32 GB of DDR5-4800 memory. My aim is to make full use of this hardware and get the best possible performance out of VASP.

Here is the makefile.include I am using for the compilation:

### makefile.include starts ###

# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxIFC\" \
              -DMPI -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -Dvasp6 \
              -Duse_bse_te \
              -Dtbdyn \
              -Dfock_dblbuf

CPP         = fpp -f_com=no -free -w0  $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)

FC          = mpiifort
FCL         = mpiifort

FREE        = -free -names lowercase

FFLAGS      = -assume byterecl -w

OFLAG       = -O3
OFLAG_IN    = $(OFLAG)
DEBUG       = -O0
OBJECTS     = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o

# For what used to be vasp.5.lib
CPP_LIB     = $(CPP)
FC_LIB      = $(FC)
CC_LIB      = icc
CFLAGS_LIB  = -O
FFLAGS_LIB  = -O1
FREE_LIB    = $(FREE)

OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS    = icpc
LLIBS       = -lstdc++

##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##

# When compiling on the target machine itself, change this to the
# relevant target when cross-compiling for another architecture
#VASP_TARGET_CPU ?= -xHOST
# For 4th generation EPYC:
#https://www.vasp.at/forum/viewtopic.php?p=25947#p25947
#https://www.amd.com/content/dam/amd/en/documents/developer/compiler-options-quick-ref-guide-amd-epyc-9xx4-series-processors.pdf
VASP_TARGET_CPU ?= -axCORE-AVX512
FFLAGS     += $(VASP_TARGET_CPU)

# Intel MKL (FFTW, BLAS, LAPACK, and scaLAPACK)
# (Note: for Intel Parallel Studio's MKL use -mkl instead of -qmkl)
FCL        += -qmkl=sequential
MKLROOT    ?= /path/to/your/mkl/installation
LLIBS      += -L$(MKLROOT)/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
INCS        =-I$(MKLROOT)/include/fftw

# HDF5-support (optional but strongly recommended)
CPP_OPTIONS+= -DVASP_HDF5
HDF5_ROOT  ?= /path/to/your/hdf5/installation
LLIBS      += -L$(HDF5_ROOT)/lib -lhdf5_fortran
INCS       += -I$(HDF5_ROOT)/include

# For the VASP-2-Wannier90 interface (optional)
CPP_OPTIONS    += -DVASP2WANNIER90
WANNIER90_ROOT ?= /path/to/your/wannier90/installation
LLIBS          += -L$(WANNIER90_ROOT)/lib -lwannier

#Re: Clarification Request on Configuring DFT-D4 Support in VASP.
#https://www.vasp.at/forum/viewtopic.php?p=25763#p25763
CPP_OPTIONS += -DDFTD4
DFTD4_ROOT  ?= /path/to/your/dftd4/installation
# version 3.6.0 built and installed with meson
#LLIBS       += $(shell pkg-config --with-path=$(DFTD4_ROOT)/lib64/pkgconfig --libs dftd4)
#INCS        += $(shell pkg-config --with-path=$(DFTD4_ROOT)/lib64/pkgconfig --cflags dftd4)
# version 3.6.0 and loaded dftd4 module, i.e. PKG_CONFIG_PATH environment variable set correctly
LLIBS       += $(shell pkg-config --libs dftd4)
INCS        += $(shell pkg-config --cflags dftd4)

### makefile.include ends ###

My specific questions are:

  1. Are there any additional compiler flags or settings I should consider specifically for AMD EPYC 9554 processors to optimize VASP performance?

  2. I am using -axCORE-AVX512 to target the AVX512 instruction set. Is this the optimal choice for AMD EPYC, or should I consider other CPU-targeted options?

  3. With regard to linking against libraries such as MKL and HDF5, are there recommended practices when working with AMD architectures and Intel OneAPI compilers?

If you have experience with similar configurations or have performance-tuning tips for VASP, could you please share your insights? Thank you for your assistance; I look forward to your suggestions.

Below are the related discussions:

https://www.vasp.at/forum/viewtopic.php?t=19453
https://www.vasp.at/forum/viewtopic.php?p=26152#p26152
https://www.vasp.at/forum/viewtopic.php?p=26159#p26159

Best regards,
Zhao

-axCORE-AVX512 should be used for Intel processors.
For AMD processors, you can use -march=core-avx2
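Before choosing between -axCORE-AVX512 and -march=core-avx2, it can help to confirm what the node actually reports. The Python sketch below assumes a Linux system that exposes /proc/cpuinfo and reads the vendor string and the avx2/avx512f feature flags. suggest_target is a hypothetical helper that only encodes the advice in this thread: -xHOST (the commented-out default in the makefile.include) for Intel when building on the target machine, -axCORE-AVX512 for AMD parts that report AVX-512 (as the AMD 9xx4 guide is said to suggest), and -march=core-avx2 otherwise. It is not an official recommendation from Intel, AMD, or the VASP developers.

```python
import os


def cpu_summary(cpuinfo_text: str) -> dict:
    """Return vendor and SIMD features from /proc/cpuinfo-style text.

    Only the first 'vendor_id' and the first non-empty 'flags' line are used,
    since all cores of a homogeneous node report the same values.
    """
    vendor = None
    flags: set = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPUs
        key = key.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return {
        "vendor": vendor,
        "avx2": "avx2" in flags,
        "avx512f": "avx512f" in flags,
    }


def suggest_target(summary: dict) -> str:
    """Hypothetical mapping of this thread's advice onto VASP_TARGET_CPU."""
    if summary["vendor"] == "GenuineIntel":
        return "-xHOST"            # makefile.include default on the build machine
    if summary["avx512f"]:
        return "-axCORE-AVX512"    # AMD part reporting AVX-512 (e.g. EPYC 9xx4)
    if summary["avx2"]:
        return "-march=core-avx2"  # AMD part without AVX-512
    return ""                      # no AVX2: leave the target unset


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as fh:
        info = cpu_summary(fh.read())
    print(info, "->", suggest_target(info))
```

Whatever the script reports, the flag choice should still be confirmed by timing a representative VASP run with each build.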

You may also try different values of -DCACHE_SIZE; the current value of 4000 may not be optimal in your case.

Alternatively, you may contact HPC centers with AMD processors and ask for the parameters used in their makefiles. Such compilation parameters are presumably not covered by the VASP license and can be shared without concern.


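The -axCORE-AVX512 flag is suggested by AMD's official guidance for EPYC 9xx4 series processors (the compiler-options quick reference guide linked in the makefile.include comments above).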


-DMPI_BLOCK=8000 -DCACHE_SIZE=4000

Are there any rules of thumb for setting these values, such as how they relate to CPU cache size, number of cores, and total memory size?

See here for the related discussion.

I once used -xCORE-AVX512 to compile another code with the Intel compiler on an AMD system. At run time there were warning messages about an incorrect vectorization scheme; the program did not crash, but it may have run inefficiently.

I am not aware of any rule of thumb for setting CACHE_SIZE in the VASP makefile, except that a higher value may suit processors with a larger cache. Many processors have the same amount of L3 cache per core, 1.5 MB, but some Sapphire Rapids parts, such as the Intel Xeon Platinum 8470, have 2 MB per core. It is worth running a few quick benchmarks with CACHE_SIZE set to 4000, 8000, 16000, or 32000.
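One way to organize such a benchmark is to generate one makefile.include per CACHE_SIZE value, build each variant, run the same test input, and compare wall times. The Python sketch below assumes the makefile.include shown earlier sits in the current directory and that the timing summary at the end of OUTCAR contains a line of the form "Elapsed time (sec): <seconds>". set_cache_size, elapsed_seconds, and the makefile.include.cache<N> file names are illustrative choices, not part of VASP.

```python
import re
from pathlib import Path

CACHE_SIZES = (4000, 8000, 16000, 32000)  # values suggested in this thread


def set_cache_size(makefile_text: str, value: int) -> str:
    """Return makefile.include text with -DCACHE_SIZE=<n> replaced by value."""
    new_text, count = re.subn(r"-DCACHE_SIZE=\d+", f"-DCACHE_SIZE={value}", makefile_text)
    if count == 0:
        raise ValueError("no -DCACHE_SIZE=<n> found in makefile text")
    return new_text


# Assumed OUTCAR format: "Elapsed time (sec):   1234.567"
ELAPSED = re.compile(r"Elapsed time \(sec\):\s*([0-9.]+)")


def elapsed_seconds(outcar_text: str) -> float:
    """Return the last elapsed wall time found in OUTCAR text."""
    matches = ELAPSED.findall(outcar_text)
    if not matches:
        raise ValueError("no 'Elapsed time (sec):' line found")
    return float(matches[-1])


if __name__ == "__main__":
    src = Path("makefile.include")
    if src.exists():
        text = src.read_text()
        for n in CACHE_SIZES:
            out = Path(f"makefile.include.cache{n}")
            out.write_text(set_cache_size(text, n))
            print("wrote", out)
```

Each generated file would then be copied over makefile.include and built from a clean tree (for example make veryclean followed by make std), and elapsed_seconds applied to the OUTCAR of each run. Using the same input, node, and MPI layout for every variant keeps the comparison meaningful.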

Did you also test the AMD EPYC 9xx4 series processors at that time? Please note that the official guidance mentioned above only applies to AMD EPYC 9xx4 series processors.

  1. Are all these values in the unit of KB?
  2. Why do they have to be multiples of 4000?

No, I have not tried it on AMD EPYC 9xx4; that system was AMD Milan. The EPYC 9xx4 series may behave differently.

CACHE_SIZE is in units of 16 bytes. I suggested values that I have seen others recommend on their sites or use in their Makefiles. I do not know whether they need to be multiples of 4000, and I have not seen anyone use a value larger than 32000.
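Taking the 16-byte unit stated above at face value (16 bytes is also the size of one double-precision complex number, which is presumably why that unit is used), the suggested values convert to bytes as in the short Python sketch below. It only does the arithmetic and reads nothing from VASP: 4000 corresponds to 62.5 KiB and 32000 to 500 KiB, so the values are not in KB, and even the largest one stays below the 1.5 MB of L3 per core mentioned earlier.

```python
BYTES_PER_UNIT = 16  # per the reply above: CACHE_SIZE counts 16-byte units


def cache_size_bytes(cache_size: int) -> int:
    """Convert a -DCACHE_SIZE value to bytes."""
    return cache_size * BYTES_PER_UNIT


def cache_size_kib(cache_size: int) -> float:
    """Convert a -DCACHE_SIZE value to KiB (1 KiB = 1024 bytes)."""
    return cache_size_bytes(cache_size) / 1024


for n in (4000, 8000, 16000, 32000):
    print(f"CACHE_SIZE={n}: {cache_size_bytes(n)} bytes = {cache_size_kib(n)} KiB")
```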