Ok, I took out all of the mod tests, and just left the index range tests (including jmin, kmin, lmin, and mmin), and here are those times (Apple M2, gfortran -O3).
i^5 = 61917364224
133 110 84 27 144
real 0m0.010s
user 0m0.008s
sys 0m0.002s
Here is that code.
program euler
! 27^5 + 84^5 + 110^5 + 133^5 = 144^5
use, intrinsic :: iso_fortran_env, only: ik => int64
implicit none
integer, parameter :: nmax = 6208
integer, parameter :: findmax = 1
integer :: i,j, jmin, k, kmin, l, lmin, m, mmin, nfound
integer(ik) :: i5, sj, sk, sl, sm, diff
integer(ik) :: n5(nmax)
character(*), parameter :: cfmt = '(*(g0,1x))'
nfound = 0
outer: do i = 1, nmax
i5 = int(i,ik)**5
n5(i) = i5
jmin = index_min( i5/4 )
do j = jmin, i
sj = n5(j)
kmin = index_min( (i5-sj)/3 )
do k = kmin, j
sk = sj + n5(k)
diff = i5 - sk
if ( diff < 0 ) exit
lmin = index_min( diff/2 )
do l = lmin, k
sl = sk + n5(l)
diff = i5 - sl
if ( diff < 0 ) exit
if ( (sl + n5(l)) < i5 ) cycle ! no solution in m=1:l.
mmin = index_min( diff )
do m = mmin, l
sm = sl + n5(m)
diff = i5 - sm
if ( diff > 0 ) then
cycle
elseif ( diff < 0 ) then
exit
else ! sm == i5
nfound = nfound + 1
print cfmt, "i^5 = ", i5
print cfmt, j, k, l, m, i
if ( nfound >= findmax ) then
exit outer
else
exit
endif
endif
enddo
enddo
enddo
enddo
enddo outer
contains
pure function index_min( v ) result(r)
! return a value r such that r**5 <= v.
! it need not be the largest value, but the closer to v the better.
implicit none
integer :: r
integer(ik), intent(in) :: v
r = max( 1, int( real(v)**0.2 ) )
return
end function index_min
end program euler
Even these times are now below 10 ms user time, although the elapsed times are usually a little over. If I set findmax to 3, here are the timings.
i^5 = 61917364224
133 110 84 27 144
i^5 = 1981355655168
266 220 168 54 288
i^5 = 15045919506432
399 330 252 81 432
real 0m0.117s
user 0m0.115s
sys 0m0.002s
I will probably switch timing results to findmax=3 in the future because the findmax=1 values are so small now.
The next thing to try is to put back some of the mod tests to see how effective they really are now.
EDIT: Here is an update on the saga of optimizing the Euler search algorithm. I managed to generalize the modulo testing for the l and k loops, so now I have a general version that allows an arbitrary number of modulo tests. I combined that with the index_min() function I posted before, and this is the fastest version I have so far. The findmax=1 times (Apple M2) are
$ gfortran -O3 --warn-all euler10.F90
$ time a.out
i^5 = 61917364224
133 110 84 27 144
real 0m0.008s
user 0m0.005s
sys 0m0.002s
These times are now pretty consistently in the single-digit millisecond range. The findmax=3 times are
i^5 = 61917364224
133 110 84 27 144
i^5 = 1981355655168
266 220 168 54 288
i^5 = 15045919506432
399 330 252 81 432
real 0m0.067s
user 0m0.064s
sys 0m0.002s
These timings are almost 2x faster than the previous version. Here is this version of the code.
module allow_mod
implicit none
private
public :: allow_t, fill_allow, print_allow
integer, parameter :: lk = selected_logical_kind(1) ! smallest logical value.
type level_t
integer :: nval ! number of allowed values.
integer, allocatable :: val(:) ! (1:nval) allowed values.
integer, allocatable :: map(:) ! (0:N-1) map from mod(m,N) to contiguous index, ix.
end type level_t
type allow_t
integer :: N ! modulus value.
integer :: i5modNx = -1 ! mapped mod(i**5,N) values
type(level_t) :: i ! target i info.
type(level_t), allocatable :: l(:) ! (1:ix)
type(level_t), allocatable :: k(:) ! (1:ix)
end type allow_t
character(*), parameter :: cfmt = '(*(g0,1x))'
contains
subroutine fill_allow( N, allow )
! allocate and fill in the allow derived type based on the modulus N.
! the goal is to limit the k and l loop indices when possible. this is
! achieved by using modulo arithmetic to limit the possibilities of
! matching j**5 + k**5 + l**5 + m**5 with the target i**5.
implicit none
integer, intent(in) :: N ! modulus value.
type(allow_t), intent(inout) :: allow
integer :: i, k, i5x, lx, nval
integer :: i5mod(1:N), val(1:N) ! temp work arrays.
logical(lk) :: mask(0:N-1), kmask(0:N-1)
allow%N = N
! compute mask(:) for the possible mod(i**5,N) values.
mask(:) = .false.
do i = 1, N
k = mod(i*i,N) ! i**2
k = mod(k*k,N) ! i**4
k = mod(i*k,N) ! i**5
i5mod(i) = k
mask(k) = .true.
enddo
allocate( allow%i%map(0:N-1) )
call set_val_map( mask, allow%i%map, val, nval )
allow%i%nval = nval
allow%i%val = val(1:nval)
! compute the kmask(:) array.
! the .true. entries are the only sk values that can be generated.
kmask(:) = .false.
do i5x = 1, allow%i%nval
call update_mask( N, allow%i%val(i5x), allow%i%val, kmask )
enddo
! allocate and fill in the level l and level k info.
allocate( allow%l(1:nval), allow%k(1:nval) )
do i5x = 1, allow%i%nval
! fill in the level l info for this i5x.
mask(:) = .false.
call update_mask( N, allow%i%val(i5x), allow%i%val, mask )
allocate( allow%l(i5x)%map(0:N-1) )
call set_val_map( mask, allow%l(i5x)%map, val, nval )
allow%l(i5x)%nval = nval
allow%l(i5x)%val = val(1:nval)
! fill in the level k info for this i5x.
mask(:) = .false.
do lx = 1, allow%l(i5x)%nval
call update_mask( N, allow%l(i5x)%val(lx), allow%i%val(1:allow%i%nval), mask )
enddo
mask = mask .and. kmask ! mask out any unreachable values.
allocate( allow%k(i5x)%map(0:N-1) )
call set_val_map( mask, allow%k(i5x)%map, val, nval )
allow%k(i5x)%nval = nval
allow%k(i5x)%val = val(1:nval)
enddo
end subroutine fill_allow
subroutine print_allow( nlist, allow )
! print the contents of allow with annotations.
implicit none
integer, intent(in) :: nlist ! output unit
type(allow_t), intent(in) :: allow
character(*), parameter :: fmtg = '(*(g0.3,1x))'
integer :: i5x
write(nlist,fmtg)
write(nlist,fmtg) 'Allowed index values for modulus N=', allow%N
write(nlist,fmtg)
write(nlist,fmtg) 'Loop level m, nval=', allow%i%nval, &
& 'nonzero ratio=', real(allow%i%nval)/allow%N
write(nlist,fmtg) 'Allowed sm values, val(:)=', allow%i%val
write(nlist,fmtg)
write(nlist,fmtg) 'Loop level l info:'
do i5x = 1, allow%i%nval
write(nlist,fmtg) 'For i5x=', allow%i%val(i5x), 'nval=', allow%l(i5x)%nval, &
& 'nonzero ratio=', real(allow%l(i5x)%nval)/allow%N
write(nlist,fmtg) 'Allowed sl values, val(:)=', allow%l(i5x)%val
enddo
write(nlist,fmtg)
write(nlist,fmtg) 'Loop level k info:'
do i5x = 1, allow%i%nval
write(nlist,fmtg) 'For i5x=', allow%i%val(i5x), 'nval=', allow%k(i5x)%nval, &
& 'nonzero ratio=', real(allow%k(i5x)%nval)/allow%N
write(nlist,fmtg) 'Allowed sk values, val(:)=', allow%k(i5x)%val
enddo
return
end subroutine print_allow
subroutine set_val_map( mask, map, val, nval )
! search mask(0:) for .true. entries, set map(0:) and save the indices in val(:).
! the arrays map(0:) and val(:) are inverse mappings.
implicit none
logical(lk), intent(in) :: mask(0:) ! (0:N-1).
integer, intent(out) :: map(0:) ! (0:N-1)
integer, intent(out) :: val(:) ! allow(1:nval) are the allowed values.
integer, intent(out) :: nval ! number of allowed values.
integer :: i
map(:) = 0
nval = 0
do i = 0, ubound(mask,dim=1)
if ( mask(i) ) then
nval = nval + 1
val(nval) = i
map(i) = nval
endif
enddo
return
end subroutine set_val_map
subroutine update_mask( N, m, allow, mask )
! update mask(:) using m and allow(:).
implicit none
integer, intent(in) :: N ! the current modulus.
integer, intent(in) :: m ! the target value.
integer, intent(in) :: allow(:) ! input allowed values.
logical(lk), intent(inout) :: mask(0:) ! updated allowed values.
integer :: i, k
do i = 1, size(allow)
k = modulo( m - allow(i), N )
mask(k) = .true.
enddo
return
end subroutine update_mask
end module allow_mod
program euler
! 27^5 + 84^5 + 110^5 + 133^5 = 144^5
use, intrinsic :: iso_fortran_env, only: ik => int64
use allow_mod, only: allow_t, fill_allow, print_allow
implicit none
integer, parameter :: nmax = 6208
integer, parameter :: findmax = 1
integer :: i,j, jmin, k, kmin, l, lmin, m, mmin, nfound
integer(ik) :: i5, sj, sk, sl, sm, diff
integer(ik) :: n5(nmax)
character(*), parameter :: cfmt = '(*(g0,1x))'
type(allow_t), allocatable :: allow(:)
call set_allow( [275, 31, 11], allow ) ! some good values are 11, 31, 100, 275.
nfound = 0
outer: do i = 1, nmax
i5 = int(i,ik)**5
n5(i) = i5
call set_i5modN( i5, allow )
jmin = index_min( i5/4 )
do j = jmin, i
sj = n5(j)
kmin = index_min( (i5-sj)/3 )
do k = kmin, j
sk = sj + n5(k)
diff = i5 - sk
if ( diff < 0 ) exit
if ( mod_test_k( sk, allow ) ) cycle
lmin = index_min( diff/2 )
do l = lmin, k
sl = sk + n5(l)
diff = i5 - sl ! for a solution m**5 == diff is required.
if ( diff < 0 ) exit
if ( (sl + n5(l)) < i5 ) cycle ! no solution in m=1:l.
if ( mod_test_l( sl, allow ) ) cycle
mmin = index_min( diff )
do m = mmin, l
sm = sl + n5(m)
diff = i5 - sm
if ( diff > 0 ) then
cycle
elseif ( diff < 0 ) then
exit
else ! sm == i5
nfound = nfound + 1
print cfmt, "i^5 = ", i5
print cfmt, j, k, l, m, i
if ( nfound >= findmax ) then
exit outer
else
exit ! exit m loop and continue searching.
endif
endif
enddo
enddo
enddo
enddo
enddo outer
contains
pure function index_min( v ) result( r )
! return a value r such that r**5 <= v.
! it need not be the largest value, but the closer to v the better.
implicit none
integer :: r
integer(ik), intent(in) :: v
r = int( real(v)**0.2 )
if ( int(r,ik)**5 > v ) r = r - 1 ! allow for possible floating point error.
r = max( 1, r )
return
end function index_min
subroutine set_allow( modvals, allow )
use, intrinsic :: iso_fortran_env, only: nlist => output_unit
implicit none
integer, intent(in) :: modvals(:)
type(allow_t), intent(out), allocatable :: allow(:)
integer :: i
allocate( allow(size(modvals)) )
do i = 1, size(modvals)
call fill_allow( modvals(i), allow(i) )
if ( .false. ) call print_allow( nlist, allow(i) )
enddo
return
end subroutine set_allow
subroutine set_i5modN( i5, allow )
! set all of the mapped i5modN values.
implicit none
integer(ik), intent(in) :: i5
type(allow_t), intent(inout) :: allow(:)
integer :: i, N
do i = 1, size(allow)
N = allow(i)%N
allow(i)%i5modNx = allow(i)%i%map( mod(i5,N) )
enddo
return
end subroutine set_i5modN
pure logical function mod_test_k( sk, allow )
! perform modulo tests on the k loop index.
implicit none
integer(ik), intent(in) :: sk
type(allow_t), intent(in) :: allow(:)
integer :: i, i5modNx, N
! loop in order.
mod_test_k = .false.
do i = 1, size(allow)
N = allow(i)%N
i5modNx = allow(i)%i5modNx
mod_test_k = allow(i)%k(i5modNx)%map(int(mod(sk,N))) .eq. 0
if ( mod_test_k ) exit ! exit ASAP.
enddo
return
end function mod_test_k
pure logical function mod_test_l( sl, allow )
! perform modulo tests on the l loop index.
implicit none
integer(ik), intent(in) :: sl
type(allow_t), intent(in) :: allow(:)
integer :: i, i5modNx, N
! loop in order.
mod_test_l = .false.
do i = 1, size(allow)
N = allow(i)%N
i5modNx = allow(i)%i5modNx
mod_test_l = allow(i)%l(i5modNx)%map(int(mod(sl,N))) .eq. 0
if ( mod_test_l ) exit ! exit ASAP.
enddo
return
end function mod_test_l
end program euler
This is much longer than the original code with no loop tests of any kind, but it now runs something like 1000x faster than that simple version.
The line
call set_allow( [275, 31, 11], allow )
now determines which modulo tests are done. I’m still trying to understand which combinations are the most useful. The above timings were done with those values. If I add a fourth test with a value of 100, the code slows down a little, showing that the index value test overhead starts to cost more than just executing the redundant loops. Also moving back to one or two tests, slows down the code a little. But these are just empirical measurements, I don’t fully understand yet how those tests interact with each other.
Just to give credit where credit is due, it was @themos who suggested the hyper-triangular loop index restrictions, @JohnCampbell who suggested the minimum loop value idea, and It was @mecej4 who suggested the use of modulo relations. This version combines all of those ideas to get these remarkable timings.
I should also mention, as a fortran language feature, that this seems like an ideal situation to use parameterized derived types. This version of the code uses allocatable arrays. I might experiment a little with PDTs, but I don’t know how many compilers could actually compile such a version.
EDIT2: I made two changes to the last version of the code in this post. First, I changed the index_min() function to account for floating point error. This is explained in detail in a subsequent post. The other change is that I posted the original code with a typo. A 275 value was typed in as 175, and I didn’t notice it until later even after I discussed that value in the text. This does have a small affect on some of the timings. For example, with
call set_allow( [275, 31], allow )
I get these Apple M2 timings for the first soluton:
i^5 = 61917364224
133 110 84 27 144
real 0m0.008s
user 0m0.005s
sys 0m0.002s
and these timings for three solutions:
i^5 = 61917364224
133 110 84 27 144
i^5 = 1981355655168
266 220 168 54 288
i^5 = 15045919506432
399 330 252 81 432
real 0m0.061s
user 0m0.059s
sys 0m0.002s
These are slightly better than the previous times with that version of the code where I did modulo tests with three values.