Just a quick note that the only Cray compiler I have access to is the one on ARCHER2, where we have to use their compiler wrapper ftn. That’s not something I can control (but it might be the cause of the issue)…
I’ve only been able to reproduce this with Cray, so I’m leaning towards it being some kind of compiler bug. At the same time, I’m very wary that undefined behaviour can do weird things, and you’re very much at the mercy of whether a compiler exploits it or not.
I built test-drive itself as shown in ARCHER2-Cray-build_test_drive.txt (4.5 KB), and all building and running was done as
username@ln03:~/test> ftn -Itest-drive-inst/include/test-drive/Cray-11.0.4/ -Ltest-drive-inst/lib64/ -g -O0 -c test_weird.f90 -ltest-drive
username@ln03:~/test> ftn -Itest-drive-inst/include/test-drive/Cray-11.0.4/ -Ltest-drive-inst/lib64/ -g -O0 main.f90 test_weird.o -ltest-drive
username@ln03:~/test> valgrind ./a.out
using
username@ln03:~/test> ftn --version
Cray Fortran : Version 11.0.4
username@ln03:~/test> cc --version
Cray clang version 11.0.4 (bc9473a12d1f2f43cde01f962a11240263bd8908)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/cray/pe/cce/11.0.4/cce-clang/x86_64/share/../bin
username@ln03:~/test> valgrind --version
valgrind-3.16.0.RC2
Here is test_weird.f90 (3.8 KB) and main.f90 (882 Bytes).
The exact behaviour is very dependent on which tests are enabled. If the first test (the one that explicitly sets should_fail to .false. and is currently commented out) is uncommented, and its equivalent without .false. is commented out, everything works as expected, but valgrind reports issues in test-drive:
==240373== Memcheck, a memory error detector
==240373== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==240373== Using Valgrind-3.16.0.RC2 and LibVEX; rerun with -h for copyright info
==240373== Command: ./a.out
==240373==
# Testing: weird
Starting int-to-char::zero ... (1/7)
==240373== Conditional jump or move depends on uninitialised value(s)
==240373== at 0x4078F0: make_output$testdrive_ (testdrive.F90:438)
==240373== by 0x4075D0: run_unittest$testdrive_ (testdrive.F90:396)
==240373== by 0x40706A: run_testsuite$testdrive_ (testdrive.F90:336)
==240373== by 0x401E22: main (main.f90:20)
==240373==
==240373== Conditional jump or move depends on uninitialised value(s)
==240373== at 0x407D80: make_output$testdrive_ (testdrive.F90:444)
==240373== by 0x4075D0: run_unittest$testdrive_ (testdrive.F90:396)
==240373== by 0x40706A: run_testsuite$testdrive_ (testdrive.F90:336)
==240373== by 0x401E22: main (main.f90:20)
==240373==
==240373== Conditional jump or move depends on uninitialised value(s)
==240373== at 0x4082D3: make_output$testdrive_ (testdrive.F90:458)
==240373== by 0x4075D0: run_unittest$testdrive_ (testdrive.F90:396)
==240373== by 0x40706A: run_testsuite$testdrive_ (testdrive.F90:336)
==240373== by 0x401E22: main (main.f90:20)
==240373==
... int-to-char::zero [PASSED]
Starting int-to-char::one-digit ... (2/7)
... int-to-char::one-digit [PASSED]
Starting int-to-char::one-digit-negative ... (3/7)
... int-to-char::one-digit-negative [PASSED]
Starting int-to-char::two-digits ... (4/7)
... int-to-char::two-digits [PASSED]
Starting int-to-char::two-digits-negative ... (5/7)
... int-to-char::two-digits-negative [PASSED]
Starting int-to-char::ten-digits ... (6/7)
... int-to-char::ten-digits [PASSED]
Starting int-to-char::ten-digits-negative ... (7/7)
... int-to-char::ten-digits-negative [PASSED]
==240373==
==240373== HEAP SUMMARY:
==240373== in use at exit: 31,117 bytes in 5 blocks
==240373== total heap usage: 110 allocs, 105 frees, 386,284 bytes allocated
==240373==
==240373== LEAK SUMMARY:
==240373== definitely lost: 0 bytes in 0 blocks
==240373== indirectly lost: 0 bytes in 0 blocks
==240373== possibly lost: 0 bytes in 0 blocks
==240373== still reachable: 31,117 bytes in 5 blocks
==240373== suppressed: 0 bytes in 0 blocks
==240373== Rerun with --leak-check=full to see details of leaked memory
==240373==
==240373== Use --track-origins=yes to see where uninitialised values come from
==240373== For lists of detected and suppressed errors, rerun with: -s
==240373== ERROR SUMMARY: 21 errors from 3 contexts (suppressed: 0 from 0)
If only two tests are uncommented, everything (ignoring valgrind) works fine as well:
==111181== Memcheck, a memory error detector
==111181== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==111181== Using Valgrind-3.16.0.RC2 and LibVEX; rerun with -h for copyright info
==111181== Command: ./a.out
==111181==
# Testing: weird
Starting int-to-char::ten-digits ... (1/2)
==111181== Conditional jump or move depends on uninitialised value(s)
==111181== at 0x406760: make_output$testdrive_ (testdrive.F90:438)
==111181== by 0x406440: run_unittest$testdrive_ (testdrive.F90:396)
==111181== by 0x405EDA: run_testsuite$testdrive_ (testdrive.F90:336)
==111181== by 0x401E22: main (main.f90:20)
==111181==
==111181== Conditional jump or move depends on uninitialised value(s)
==111181== at 0x406BF0: make_output$testdrive_ (testdrive.F90:444)
==111181== by 0x406440: run_unittest$testdrive_ (testdrive.F90:396)
==111181== by 0x405EDA: run_testsuite$testdrive_ (testdrive.F90:336)
==111181== by 0x401E22: main (main.f90:20)
==111181==
==111181== Conditional jump or move depends on uninitialised value(s)
==111181== at 0x407143: make_output$testdrive_ (testdrive.F90:458)
==111181== by 0x406440: run_unittest$testdrive_ (testdrive.F90:396)
==111181== by 0x405EDA: run_testsuite$testdrive_ (testdrive.F90:336)
==111181== by 0x401E22: main (main.f90:20)
==111181==
... int-to-char::ten-digits [PASSED]
Starting int-to-char::ten-digits-negative ... (2/2)
... int-to-char::ten-digits-negative [PASSED]
==111181==
==111181== HEAP SUMMARY:
==111181== in use at exit: 31,117 bytes in 5 blocks
==111181== total heap usage: 60 allocs, 55 frees, 382,849 bytes allocated
==111181==
==111181== LEAK SUMMARY:
==111181== definitely lost: 0 bytes in 0 blocks
==111181== indirectly lost: 0 bytes in 0 blocks
==111181== possibly lost: 0 bytes in 0 blocks
==111181== still reachable: 31,117 bytes in 5 blocks
==111181== suppressed: 0 bytes in 0 blocks
==111181== Rerun with --leak-check=full to see details of leaked memory
==111181==
==111181== Use --track-origins=yes to see where uninitialised values come from
==111181== For lists of detected and suppressed errors, rerun with: -s
==111181== ERROR SUMMARY: 6 errors from 3 contexts (suppressed: 0 from 0)
With exactly 5 tests (ARCHER2-Cray-no_explicit-5_tests.txt (32.1 KB)) there are no segfaults, but all the tests fail. Again, though, re-enabling the test that sets should_fail fixes things (except for the valgrind issues):
==135853== Memcheck, a memory error detector
==135853== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==135853== Using Valgrind-3.16.0.RC2 and LibVEX; rerun with -h for copyright info
==135853== Command: ./a.out
==135853==
# Testing: weird
Starting int-to-char::zero ... (1/6)
==135853== Conditional jump or move depends on uninitialised value(s)
==135853== at 0x407550: make_output$testdrive_ (testdrive.F90:438)
==135853== by 0x407230: run_unittest$testdrive_ (testdrive.F90:396)
==135853== by 0x406CCA: run_testsuite$testdrive_ (testdrive.F90:336)
==135853== by 0x401E22: main (main.f90:20)
==135853==
==135853== Conditional jump or move depends on uninitialised value(s)
==135853== at 0x4079E0: make_output$testdrive_ (testdrive.F90:444)
==135853== by 0x407230: run_unittest$testdrive_ (testdrive.F90:396)
==135853== by 0x406CCA: run_testsuite$testdrive_ (testdrive.F90:336)
==135853== by 0x401E22: main (main.f90:20)
==135853==
==135853== Conditional jump or move depends on uninitialised value(s)
==135853== at 0x407F33: make_output$testdrive_ (testdrive.F90:458)
==135853== by 0x407230: run_unittest$testdrive_ (testdrive.F90:396)
==135853== by 0x406CCA: run_testsuite$testdrive_ (testdrive.F90:336)
==135853== by 0x401E22: main (main.f90:20)
==135853==
... int-to-char::zero [PASSED]
Starting int-to-char::one-digit-negative ... (2/6)
... int-to-char::one-digit-negative [PASSED]
Starting int-to-char::two-digits ... (3/6)
... int-to-char::two-digits [PASSED]
Starting int-to-char::two-digits-negative ... (4/6)
... int-to-char::two-digits-negative [PASSED]
Starting int-to-char::ten-digits ... (5/6)
... int-to-char::ten-digits [PASSED]
Starting int-to-char::ten-digits-negative ... (6/6)
... int-to-char::ten-digits-negative [PASSED]
==135853==
==135853== HEAP SUMMARY:
==135853== in use at exit: 31,117 bytes in 5 blocks
==135853== total heap usage: 100 allocs, 95 frees, 385,617 bytes allocated
==135853==
==135853== LEAK SUMMARY:
==135853== definitely lost: 0 bytes in 0 blocks
==135853== indirectly lost: 0 bytes in 0 blocks
==135853== possibly lost: 0 bytes in 0 blocks
==135853== still reachable: 31,117 bytes in 5 blocks
==135853== suppressed: 0 bytes in 0 blocks
==135853== Rerun with --leak-check=full to see details of leaked memory
==135853==
==135853== Use --track-origins=yes to see where uninitialised values come from
==135853== For lists of detected and suppressed errors, rerun with: -s
==135853== ERROR SUMMARY: 18 errors from 3 contexts (suppressed: 0 from 0)
Running just 3 tests spews mojibake and parts of my environment variables to the terminal (ARCHER2-Cray-no_explicit-3_tests.txt (94.7 KB)), which makes me think Cray is doing something weird with C-style null termination. But if that were the case, I would expect allocating an extra character to fix things, and it doesn’t.
Anything more than 5 tests segfaults (ARCHER2-Cray-no_explicit-6_tests.txt (56.3 KB)) as does including both of the zero tests (ARCHER2-Cray-no_explicit-both_zero_tests.txt (3.6 KB)).
The valgrind errors point to lines 438, 444, and 458 of testdrive.F90 (all inside make_output), which makes me think that this is to do with the way Cray compiles optional dummy arguments, but I’m not sure how to refactor test-drive to avoid that, so I can’t check it.
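For what it’s worth, here is a minimal sketch of the kind of refactor I have in mind. It is not test-drive’s actual code: the module optional_sketch, the function describe and the local label_ are all made up for illustration. The idea is to resolve the optional dummy argument into a local variable straight away, guarded by present(), so that nothing further down ever touches the optional itself.

```fortran
module optional_sketch
  ! Hypothetical module, not part of test-drive.
  implicit none
  private
  public :: describe

contains

  !> Build a status line; `label` is optional and is only read after present().
  function describe(name, label) result(output)
    character(len=*), intent(in) :: name
    character(len=*), intent(in), optional :: label
    character(len=:), allocatable :: output
    character(len=:), allocatable :: label_  ! local copy holding the default

    ! Resolve the optional argument once, up front.
    if (present(label)) then
      label_ = label
    else
      label_ = "PASSED"
    end if

    ! From here on only the local copy is used.
    output = name // " [" // label_ // "]"
  end function describe

end module optional_sketch

program demo
  use optional_sketch, only : describe
  implicit none

  print '(a)', describe("int-to-char::zero")
  print '(a)', describe("int-to-char::zero", label="FAILED")
end program demo
```

If something shaped like this still trips valgrind when built with ftn, that would point more strongly at the compiler than at test-drive.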