Formatted hexadecimal output in lower case

I’ve played with this method too. Some compilers are able to (auto-)vectorize it quite well. For instance if you write the loop like this:

    integer(1), parameter :: A = iachar('A'), Z = iachar('Z')
    integer(1), parameter :: diff_case = iachar('A', 1) - iachar('a', 1)

    integer(1) :: r
    logical(1) :: m

    integer :: i, l

    l = len_trim(str)

    do i = 1, l
        r = iachar(str(i:i),1)
        m = (r >= A) .and. (r <= Z)
        if (m) r = r - diff_case
        str(i:i) = achar(r)
    end do

and compile with flags -O3 -march=skylake gives the bulk loop:

.L5:
        vmovdqu ymm0, YMMWORD PTR [rax]
        add     rax, 32
        vpaddb  ymm1, ymm0, ymm5
        vpaddb  ymm2, ymm0, ymm3
        vpsubusb        ymm1, ymm1, ymm4
        vpcmpeqb        ymm1, ymm1, ymm6
        vpblendvb       ymm0, ymm0, ymm2, ymm1
        vmovdqu YMMWORD PTR [rax-32], ymm0
        cmp     rcx, rax
        jne     .L5

It is using the YMM registers (AVX) to process 32 characters per loop iteration.

1 Like

That is good to know. I notice that you use integer(1) and logical(1) to hold the intermediate results. Is that necessary to generate the AVX code, or would default integer and logical also work there?

It depends on the compiler flags used. With gfortran 14.2 and -O3 -march=skylake you get AVX with both integer(1) and integer(4), but what matters is how many characters get processed per AVX instruction.

I’ve included a program you can test with options like:

  • <-O3|-O2> -march=<...> [-mprefer-vector-width=128|256|512] [-DINT8]
Click arrow for TEST program
module rand_tools

implicit none

contains

subroutine rand_int_array(i,n,m)
    integer, intent(inout) :: i(:)
    integer, intent(in) :: n, m
    real, allocatable :: u(:)
    allocate(u(size(i)))
    call random_number(u)
    i = n + FLOOR((m+1-n)*u)
end subroutine

integer function rand_int(n,m)
    integer, intent(in) :: n, m

    real :: u
    call random_number(u)
    rand_int = n + FLOOR((m+1-n)*u)

end function

subroutine init_random_seed()
integer :: i, n, clock
integer, dimension(:), allocatable :: seed

call random_seed(size = n)
allocate(seed(n))

call system_clock(count=clock)

seed = clock + 37 * (/ (i - 1, i = 1, n) /)
call random_seed(put = seed)

deallocate(seed)
end subroutine

function random_characters(nchar) result(stream)
    integer, intent(in) :: nchar
    character(len=:), allocatable :: stream
    integer :: i
    integer, allocatable :: r(:)

    !   33-47   !"#$%&'()*+,-./
    !   48-57   0123456789
    !   58-64   :;<=>?@
    !   65-70   ABCDEF
    !   71-90   GHIJKLMNOPQRSTUVWXYZ
    !   91-96   [\]^_`
    !  97-102   abcdef
    ! 103-122   ghijklmnopqrstuvwxyz
    ! 123-126   {|}~

    allocate(r(nchar))
    call rand_int_array(r,33,126)

    allocate(character(nchar) :: stream)
    do i = 1, nchar
        stream(i:i) = achar(r(i))
    end do

end function

end module


subroutine tolower(str)
    implicit none
    character(len=*), intent(inout) :: str

#if INT8
    integer, parameter :: ik = 1
#else
    integer, parameter :: ik = 4
#endif

    integer(ik), parameter :: A = iachar('A'), &
                              Z = iachar('Z')
    integer(ik), parameter :: diff_case = &
        iachar('A', 1) - iachar('a', 1)
    integer(ik) :: r
    logical(ik) :: m

    integer :: i, l

    l = len_trim(str)
    do i = 1, l
        r = iachar(str(i:i),1)
        m = (r >= A) .and. (r <= Z)
        if (m) r = r - diff_case
        str(i:i) = achar(r)
    end do

end subroutine


program main
use rand_tools, only: random_characters
implicit none

external :: tolower
integer :: i, sz
character(len=:), allocatable :: chars
real(kind(1.0d0)) :: t1, t2

#if INT8
print *, "INTEGER(1)"
#else
print *, "INTEGER(4)"
#endif

print '(4A10)', "size", "elapsed (s)", "rate (gb/s)", "lastchar"
do i = 3, 9
    sz = 10**i
    chars = random_characters(nchar=sz)

    call cpu_time(t1)
    call tolower(chars)
    call cpu_time(t2)

    print '(I10,ES10.3,ES10.3,A10)', sz, t2-t1, sz/(t2-t1)/1.0d9, chars(sz:sz)
end do

end program

Edit: the program was printing MB/s instead of GB/s as shown in the table header. This is now amended.

1 Like

That is good to know too. Is there any reason to avoid INT8 declarations of those intermediates? I can’t think of one offhand. If not, then I might change my module accordingly in order to maximize the AVX vector lengths.

I also can’t think of one. If you are using achar/iachar the representation fits in -127:127, and you only care about the upper alphabetic characters.

Unfortunately the compilers widely differ in the success with which they auto-vectorize these character processing loops. I’ve also tried applying !$omp simd private(r,m) on the loop.

With ifort -O3 -xHOST -fpp [-DINT8] [-qopenmp] I got the following results:

INTEGER(4) + (no optimization, -O0)
    10000000  1.5309E-01  6.5319E-02           u
   100000000  1.5096E+00  6.6241E-02           +
  1000000000  1.4733E+01  6.7874E-02           *

INTEGER(4)
        size elapsed (s) rate (gb/s)    lastchar
    10000000  5.3570E-02  1.8667E-01           u
   100000000  5.3215E-01  1.8792E-01           +
  1000000000  5.1987E+00  1.9236E-01           *

INTEGER(4) + OpenMP SIMD
        size elapsed (s) rate (gb/s)    lastchar
    10000000  2.5510E-03  3.9200E+00           u
   100000000  2.8938E-02  3.4557E+00           +
  1000000000  2.7165E-01  3.6812E+00           *

INTEGER(1)
        size elapsed (s) rate (gb/s)    lastchar
    10000000  5.1328E-02  1.9483E-01           u
   100000000  5.1985E-01  1.9236E-01           +
  1000000000  4.8631E+00  2.0563E-01           *

INTEGER(1) + OpenMP SIMD
        size elapsed (s) rate (gb/s)    lastchar
    10000000  2.2200E-03  4.5045E+00           u
   100000000  2.1006E-02  4.7605E+00           +
  1000000000  2.2825E-01  4.3812E+00           *

The INTEGER(1) + OpenMP SIMD version is the fastest. The differences are huge 0.19 GB/s versus 4.8 GB/s.

1 Like

Well, CDC’s Fortran compilers didn’t directly support 6/12. They were DISPLAY (6-bit) code only. So you’d have to roll a custom routine. A routine to convert a string to upper case would look at the preceding character, and if a caret (^), nuke the caret and slide the rest of the string down a character. OTOH, if you were a PLATO author writing in the TUTOR language, they assumed lower case was the default. So a caret would have to be inserted to make it upper case. Then the reverse for converting strings to lower case.

Depending on context, the 60-bit CDC systems also supported 8/12 ASCII (5 characters/word), packed 8-bit characters (7.5 characters/word), and the same in EBCDIC. Utilities like FORM, FCOPY, and the 8-Bit Subroutines were provided to help make sense of the mess.

Fun times.