Performance drop when using Intel oneAPI with the standard-semantics option

Interoperability of logical(C_BOOL) with bool in C requires setting the -fpscomp logicals option for the Intel oneAPI Fortran compiler. Since this is included in -standard-semantics, I set that.
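For context, the kind of Fortran/C interface involved looks roughly like the sketch below (the C routine c_is_valid is made up for illustration and would need a matching C implementation to link against). As I understand it, without -fpscomp logicals the compiler may use a value other than 0/1 to represent logicals, which is what breaks interoperability with a C bool.

   ! Minimal sketch (hypothetical C routine, not my actual code):
   ! Fortran calling   bool c_is_valid(bool verbose);   via logical(C_BOOL)
   program bool_interop
      use, intrinsic :: iso_c_binding, only: c_bool
      implicit none
      interface
         function c_is_valid(verbose) bind(c, name="c_is_valid") result(ok)
            import :: c_bool
            logical(c_bool), value :: verbose
            logical(c_bool)        :: ok
         end function c_is_valid
      end interface
      print *, 'c_is_valid returned ', c_is_valid(.true._c_bool)
   end program bool_interop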

Unfortunately, -standard-semantics increases the runtime in my case from 90 s to 2000 s, i.e. by more than a factor of 20. I find this enormous, especially since -standard-semantics was my default for ifort.
Has anyone encountered something similar and could perhaps suggest flags (e.g. for selectively turning off features implied by -standard-semantics) to cure this?

Are you sure that it is the -standard-semantics flag that is causing this? I have seen a bizarre performance drop when I turned on -O3 optimisation. That, however, turned out to be down to the processor on my machine, which has both efficiency and performance cores. See https://community.intel.com/t5/Intel-Fortran-Compiler/Unexpected-timing-result-with-optimisation-O3/m-p/1704011/emcs_t/S2h8ZW1haWx8dG9waWNfc3Vic2NyaXB0aW9ufE1EN1FYR0w1SUdIN0hYfDE3MDQwMTF8U1VCU0NSSVBUSU9OU3xoSw#M176341.

It could be that your run is accidentally being scheduled on the wrong type of core.

I indeed have -O3 -fp-model strict -xHost -align array64byte, but that works ok (ifx is a little bit slower than ifort, but not much). But -standard-semantics is the trigger.

Hm, I have no experience with that, I am afraid. Seems odd that such a seemingly simple thing should cause this.

More information: I’m using oneAPI 2025.1.1 in a Docker container and the CPU is a pretty old Intel(R) Xeon(R) CPU X5670 @ 2.93GHz. /proc/cpuinfo contains

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 44
model name	: Intel(R) Xeon(R) CPU           X5670  @ 2.93GHz
stepping	: 2
microcode	: 0x1f
cpu MHz		: 3294.091
cache size	: 12288 KB
physical id	: 0
siblings	: 6
core id		: 0
cpu cores	: 6
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid dtherm ida arat vnmi flush_l1d
vmx flags	: vnmi preemption_timer invvpid ept_x_only ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_unknown
bogomips	: 5852.65
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

-standard-semantics enables a LOT of options that have their own switches. You can read the list in the documentation for standard-semantics. I suggest trying the individual options and seeing which makes the biggest difference.
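For instance (just a sketch using the original poster's base flags, with main.f90 standing in for the real sources; the exact set of sub-options depends on the compiler version, so check the documentation), keep everything else fixed and add the constituent options one at a time:

   ifx -O3 -fp-model strict -xHost -align array64byte -fpscomp logicals main.f90
   ifx -O3 -fp-model strict -xHost -align array64byte -fpscomp logicals -assume realloc_lhs main.f90
   ifx -O3 -fp-model strict -xHost -align array64byte -fpscomp logicals -assume realloc_lhs,byterecl main.f90

and time a representative run for each build to see where the slowdown appears.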

I do see that Intel has reworked how the various “standard” options work - it used to be that -stand controlled diagnostics only, but now it changes semantics too.

You should focus on the options beginning with ieee

For us, performance regressed when migrating from Parallel Studio XE 2019 to oneAPI 2021 with standard semantics, and adding “/assume:noieee_compares” (on Windows; “-assume noieee_compares” on Linux) restored it.
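On Linux with the original poster's flags, that would look something like this (a sketch, not tested on this particular code):

   ifx -O3 -fp-model strict -xHost -align array64byte -standard-semantics -assume noieee_compares main.f90

i.e. keep -standard-semantics and override just that one sub-option; as far as I know, the later option on the command line takes precedence.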

I encountered something similar to this and it was related to how gradual underflows were treated. With certain input data, my code generated a large number of denormalized numbers, and if gradual underflow was enabled it took a lot of time to process them.
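To illustrate (a toy example, not my actual code): values just below tiny() fall into the gradual-underflow range, and each operation on such denormals can be far slower than on normal numbers unless results are flushed to zero (controlled by options such as -ftz and the -fp-model setting).

   ! Toy illustration: generating denormalized (subnormal) numbers by halving
   ! below tiny(); operating on such values can be very slow on hardware that
   ! handles gradual underflow in microcode, unless results are flushed to zero.
   program denormal_demo
      use, intrinsic :: iso_fortran_env, only: real64
      implicit none
      real(real64) :: x
      integer :: i
      x = tiny(1.0_real64)          ! smallest normal double-precision value
      do i = 1, 20
         x = x * 0.5_real64         ! each result is now denormal (or zero if FTZ is on)
      end do
      print *, 'result after repeated halving below tiny():', x
   end program denormal_demo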

As noted by @sblionel, the documentation describes which lower-level switches are enabled when a higher-level switch that conglomerates them is invoked, and it is always useful to read that. But ifort/ifx also has switches that expose what the driver is doing: particularly when using many options that may or may not overlap, a switch like -dryrun shows the actual options being passed on, which can be really useful when trying to figure out issues like this. For better or worse, at least for ifx the intrinsic COMPILER_OPTIONS() only shows what you actually entered on the command line, which does not help this particular use case.
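For what it is worth, a minimal way to see the difference between the two: something like

   ifx -dryrun -standard-semantics main.f90

(main.f90 standing in for the real source) prints the options actually being passed along to the compiler components, while this small program only echoes the command line as typed:

   ! Small sketch: COMPILER_VERSION()/COMPILER_OPTIONS() (ISO_FORTRAN_ENV, F2008)
   ! report the compiler build and the options as entered on the command line,
   ! not the lower-level switches that -standard-semantics expands to.
   program show_opts
      use, intrinsic :: iso_fortran_env, only: compiler_version, compiler_options
      implicit none
      print '(a)', 'Compiler: '//compiler_version()
      print '(a)', 'Options : '//compiler_options()
   end program show_opts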