I’ve played with this method too. Some compilers are able to (auto-)vectorize it quite well. For instance if you write the loop like this:
integer(1), parameter :: A = iachar('A'), Z = iachar('Z')
integer(1), parameter :: diff_case = iachar('A', 1) - iachar('a', 1)
integer(1) :: r
logical(1) :: m
integer :: i, l
l = len_trim(str)
do i = 1, l
r = iachar(str(i:i),1)
m = (r >= A) .and. (r <= Z)
if (m) r = r - diff_case
str(i:i) = achar(r)
end do
and compile with flags -O3 -march=skylake
gives the bulk loop:
.L5:
vmovdqu ymm0, YMMWORD PTR [rax]
add rax, 32
vpaddb ymm1, ymm0, ymm5
vpaddb ymm2, ymm0, ymm3
vpsubusb ymm1, ymm1, ymm4
vpcmpeqb ymm1, ymm1, ymm6
vpblendvb ymm0, ymm0, ymm2, ymm1
vmovdqu YMMWORD PTR [rax-32], ymm0
cmp rcx, rax
jne .L5
It is using the YMM registers (AVX) to process 32 characters per loop iteration.
1 Like
That is good to know. I notice that you use integer(1)
and logical(1)
to hold the intermediate results. Is that necessary to generate the AVX code, or would default integer and logical also work there?
It depends on the compiler flags used. With gfortran 14.2 and -O3 -march=skylake
you get AVX with both integer(1)
and integer(4)
, but what matters is how many characters get processed per AVX instruction.
I’ve included a program you can test with options like:
<-O3|-O2> -march=<...> [-mprefer-vector-width=128|256|512] [-DINT8]
Click arrow for TEST program
module rand_tools
implicit none
contains
subroutine rand_int_array(i,n,m)
integer, intent(inout) :: i(:)
integer, intent(in) :: n, m
real, allocatable :: u(:)
allocate(u(size(i)))
call random_number(u)
i = n + FLOOR((m+1-n)*u)
end subroutine
integer function rand_int(n,m)
integer, intent(in) :: n, m
real :: u
call random_number(u)
rand_int = n + FLOOR((m+1-n)*u)
end function
subroutine init_random_seed()
integer :: i, n, clock
integer, dimension(:), allocatable :: seed
call random_seed(size = n)
allocate(seed(n))
call system_clock(count=clock)
seed = clock + 37 * (/ (i - 1, i = 1, n) /)
call random_seed(put = seed)
deallocate(seed)
end subroutine
function random_characters(nchar) result(stream)
integer, intent(in) :: nchar
character(len=:), allocatable :: stream
integer :: i
integer, allocatable :: r(:)
! 33-47 !"#$%&'()*+,-./
! 48-57 0123456789
! 58-64 :;<=>?@
! 65-70 ABCDEF
! 71-90 GHIJKLMNOPQRSTUVWXYZ
! 91-96 [\]^_`
! 97-102 abcdef
! 103-122 ghijklmnopqrstuvwxyz
! 123-126 {|}~
allocate(r(nchar))
call rand_int_array(r,33,126)
allocate(character(nchar) :: stream)
do i = 1, nchar
stream(i:i) = achar(r(i))
end do
end function
end module
subroutine tolower(str)
implicit none
character(len=*), intent(inout) :: str
#if INT8
integer, parameter :: ik = 1
#else
integer, parameter :: ik = 4
#endif
integer(ik), parameter :: A = iachar('A'), &
Z = iachar('Z')
integer(ik), parameter :: diff_case = &
iachar('A', 1) - iachar('a', 1)
integer(ik) :: r
logical(ik) :: m
integer :: i, l
l = len_trim(str)
do i = 1, l
r = iachar(str(i:i),1)
m = (r >= A) .and. (r <= Z)
if (m) r = r - diff_case
str(i:i) = achar(r)
end do
end subroutine
program main
use rand_tools, only: random_characters
implicit none
external :: tolower
integer :: i, sz
character(len=:), allocatable :: chars
real(kind(1.0d0)) :: t1, t2
#if INT8
print *, "INTEGER(1)"
#else
print *, "INTEGER(4)"
#endif
print '(4A10)', "size", "elapsed (s)", "rate (gb/s)", "lastchar"
do i = 3, 9
sz = 10**i
chars = random_characters(nchar=sz)
call cpu_time(t1)
call tolower(chars)
call cpu_time(t2)
print '(I10,ES10.3,ES10.3,A10)', sz, t2-t1, sz/(t2-t1)/1.0d9, chars(sz:sz)
end do
end program
Edit: the program was printing MB/s instead of GB/s as shown in the table header. This is now amended.
1 Like
That is good to know too. Is there any reason to avoid INT8 declarations of those intermediates? I can’t think of one offhand. If not, then I might change my module accordingly in order to maximize the AVX vector lengths.
I also can’t think of one. If you are using achar
/iachar
the representation fits in -127:127, and you only care about the upper alphabetic characters.
Unfortunately the compilers widely differ in the success with which they auto-vectorize these character processing loops. I’ve also tried applying !$omp simd private(r,m)
on the loop.
With ifort -O3 -xHOST -fpp [-DINT8] [-qopenmp]
I got the following results:
INTEGER(4) + (no optimization, -O0)
10000000 1.5309E-01 6.5319E-02 u
100000000 1.5096E+00 6.6241E-02 +
1000000000 1.4733E+01 6.7874E-02 *
INTEGER(4)
size elapsed (s) rate (gb/s) lastchar
10000000 5.3570E-02 1.8667E-01 u
100000000 5.3215E-01 1.8792E-01 +
1000000000 5.1987E+00 1.9236E-01 *
INTEGER(4) + OpenMP SIMD
size elapsed (s) rate (gb/s) lastchar
10000000 2.5510E-03 3.9200E+00 u
100000000 2.8938E-02 3.4557E+00 +
1000000000 2.7165E-01 3.6812E+00 *
INTEGER(1)
size elapsed (s) rate (gb/s) lastchar
10000000 5.1328E-02 1.9483E-01 u
100000000 5.1985E-01 1.9236E-01 +
1000000000 4.8631E+00 2.0563E-01 *
INTEGER(1) + OpenMP SIMD
size elapsed (s) rate (gb/s) lastchar
10000000 2.2200E-03 4.5045E+00 u
100000000 2.1006E-02 4.7605E+00 +
1000000000 2.2825E-01 4.3812E+00 *
The INTEGER(1) + OpenMP SIMD version is the fastest. The differences are huge 0.19 GB/s versus 4.8 GB/s.
1 Like
Well, CDC’s Fortran compilers didn’t directly support 6/12. They were DISPLAY (6-bit) code only. So you’d have to roll a custom routine. A routine to convert a string to upper case would look at the preceding character, and if a caret (^), nuke the caret and slide the rest of the string down a character. OTOH, if you were a PLATO author writing in the TUTOR language, they assumed lower case was the default. So a caret would have to be inserted to make it upper case. Then the reverse for converting strings to lower case.
Depending on context, the 60-bit CDC systems also supported 8/12 ASCII (5 characters/word), packed 8-bit characters (7.5 characters/word), and the same in EBCDIC. Utilities like FORM, FCOPY, and the 8-Bit Subroutines were provided to help make sense of the mess.
Fun times.