Faster string to double

See: A New JSON Library - #20 by jacobwilliams

I’m looking at the speed of string to double conversion using the built in read(str,fmt=*) and can we do any better. See test case below. Does anybody have any ideas about a faster way?

program main

use iso_fortran_env, only: wp => real64, ip => int64

implicit none 

integer,parameter :: n = 1000000 !! number of values 

integer :: i 
character(len=30),dimension(:),allocatable :: strings
real(wp) :: rval
integer :: ierr
integer(ip) :: start, finish, count_rate

! create a list of values to parse
allocate(strings(n))
do i = 1, n
    call RANDOM_NUMBER(rval)
    write(strings(i), '(E30.16)') rval
end do

call system_clock(start, count_rate)
do i = 1, n
    read(strings(i),fmt=*,iostat=ierr) rval
end do
call system_clock(finish)

write(*,'(A30,1X,F7.4,1X,A)') 'time * : ', (finish-start)/real(count_rate,wp), ' seconds'

end program main

On my laptop (M1 Mac with Gfortran 11.1.0) I get:

 time * :   0.7653  seconds

How can we beat this? I know there are some C++ libraries out there (e.g., fast_double_parser). Can we use only Fortran to get something as fast?

6 Likes

I haven’t studied the algorithm from Daniel Lemire in detail, but I think all the bit shifting operations needed are there. It’s just difficult to port because you have to convert all unsigned integer sequences into signed ones.

I vaguely recall @zoziha making a post about this issue somewhere in the stdlib repo, but wasn’t able to find it right now.

1 Like

Try to use double atof(const char *nptr) from libc. I once wrote a toy parser for some custom format with nested blocks containing floats. Having the same parser in C (and using it from Fortran) gave me a significantly large speedup with respect of just reading the numbers with formatted I/O in Fortran.

5 Likes

I suggest C library function strtod, you should get around 2.5X improvement with this. That is about as good as it can get if one is trying to have a (relatively decent and robust) portable set of code.

This is as per I wrote in this comment, “This may include perhaps replacing … sections … with optimized C”

Click for code

String to Double

   use iso_c_binding, only: c_double, c_char, c_ptr, c_null_ptr, c_long
   interface
      function strtod( str, endptr ) result(d) bind(C, name="strtod" )
      ! <stdlib.h> :: double strtod(const char *str, char **endptr)
         import :: c_char, c_ptr, c_double
         ! Argument list
         character(kind=c_char,len=1), intent(in) :: str(*)
         type(c_ptr), intent(inout) :: endptr
         ! function result
         real(c_double) :: d 
      end function 
   end interface
   
   integer,parameter :: n = 1000000 !! number of values 
   
   integer :: i 
   character(len=30),dimension(:),allocatable :: strings
   real(c_double) :: rval, r_check
   integer :: ierr
   integer(c_long) :: start, finish, count_rate
   type(c_ptr) :: endptr
   
   ! create a list of values to parse
   allocate( strings(n) )
   do i = 1, n
      call RANDOM_NUMBER(rval)
      write(strings(i), '(E30.16)') rval
      if ( i == 1 ) r_check = rval 
   end do
   
   blk_io: block
      print *, "Block 1: formatted read toward string to double"
      read(strings(1),fmt=*,iostat=ierr) rval
      if ( abs(rval-r_check) > epsilon(r_check) ) then
         print *, "Warning: mismatch during formatted read."
         exit blk_io
      end if 

      call system_clock(start, count_rate)
      do i = 1, n
         read(strings(i),fmt=*,iostat=ierr) rval
      end do
      call system_clock(finish)
      write(*,'(A30,1X,F7.4,1X,A)') 'time * : ', (finish-start)/real(count_rate,c_double), ' seconds'
   end block blk_io
   print *
   blk_c: block
      print *, "Block 2: C strtod"
      endptr = c_null_ptr
      rval = strtod( strings(1), endptr )
      if ( abs(rval-r_check) > epsilon(r_check) ) then
         print *, "Warning: mismatch during strtod."
         exit blk_c
      end if 
      call system_clock(start, count_rate)
      do i = 1, n
         rval = strtod( strings(i), endptr )
      end do
      call system_clock(finish)
      write(*,'(A30,1X,F7.4,1X,A)') 'time * : ', (finish-start)/real(count_rate,c_double), ' seconds'
   end block blk_c

end

Block 1: formatted read toward string to double
time * : 0.4660 seconds

Block 2: C strtod
time * : 0.1800 seconds

4 Likes

I would suggest you first try several routines written in C using iso_c_binding, in particular:

  • atof
  • strtod
  • fast_double_parser

And see if you can get decent performance with your JSON parser. I am hoping you can.

If you can, then we should discus how to port them to pure Fortran.

P.S. Fortran absolutely needs the most efficient reading and writing of floating point numbers. That’s part of scientific computing to save arrays into human readable files. I know that Fortran was not designed for efficient string processing, but in modern era, JSON is one of the formats that is used a lot for scientific computing, and so having a parser that is at least as fast as Python’s is a must in my opinion.

6 Likes

Thanks! I will definitely try it!

Yes agreed. There are many things I would never suggest be written in Fortran and calling C is fine. But reading data from a file is something almost every Fortran program in existence has to do, so we shouldn’t have to call to C to do that efficiently. If we do, it means something about Fortran needs to be fixed.

2 Likes

Aside: for Fortran, string to double is not enough. We also need it to work for single, quad, and any other precision the compiler in question supports.

1 Like

My opinion is this is where base C or ASSEMBLER comes in and that’s why I wrote as such in the other thread with JSON library. Fortran stdlib can try to include highly optimized code in C or ASSEMBLER - which might learn from C stdlib or C++ for other cases like double - and serve as consumable procedures for Fortran stdlib users toward such needs.

From what I have observed, this was not an uncommon workflow during the days of legacy FORTRAN; sections of programs that were highly important for peak performance were assembled and linked with other FORTRAN code.

1 Like

Thanks for your program. For me it runs fine on Windows with Intel Fortran and on WSL 2 with gfortran, giving results

 Block 1: formatted read toward string to double
                     time * :   0.7804  seconds

 Block 2: C strtod
                     time * :   0.1153  seconds

but on Windows setting n to 10^4 (vs. 10^6 in the original code) strtod is very slow, with output

 Block 1: formatted read toward string to double
                     time * :   0.0310  seconds

 Block 2: C strtod
                     time * :   0.5000  seconds

using GNU Fortran (GCC) 12.0.0 20211024 from equation.com with GNU Fortran (GCC) 10.2.0 from w64devkit giving similar results.

Really curious what the baseline speed is (ie. the best you get from the C libraries) as I am wondering whether to continue resurrecting some code from long ago, as my minimal cleanup got some speedups and interesting results but has some issues.

So I added another loop to your program that calls an old function that I dusted off just enough to fire up. The encouraging
part was I got a speed-up with all three compilers I tried (top time is original code):

(
exec 2>&1
set -x
gfortran x1.f90
./a.out
./a.out
ifort x1.f90
./a.out
./a.out
nvfortran x1.f90
./a.out
./a.out
) >> $0
exit
+ gfortran x1.f90
+ ./a.out
                     time * :   1.1303  seconds
                     time * :   0.2492  seconds
+ ./a.out
                     time * :   1.1178  seconds
                     time * :   0.2483  seconds
      2318 -0.55511151231257827E-15 0.79741218980722206 0.79741218980722262 0.79741218980722262           
    498194 0.55511151231257827E-15 0.87089963264350001 0.87089963264349946 0.87089963264349946           
    810741 -0.55511151231257827E-15 0.58151616177296606 0.58151616177296661 0.58151616177296661           
+ ifort x1.f90
+ ./a.out
                     time * :   0.7095  seconds
                     time * :   0.1049  seconds
    870474 -.5551115123125783E-15 .6825123899679394 .6825123899679399 .6825123899679399             
+ ./a.out
                     time * :   0.6774  seconds
                     time * :   0.1067  seconds
    870474 -.5551115123125783E-15 .6825123899679394 .6825123899679399 .6825123899679399             
+ nvfortran x1.f90
+ ./a.out
                     time * :   0.3050  seconds
                     time * :   0.1965  seconds
+ ./a.out
                     time * :   0.3135  seconds
                     time * :   0.1951  seconds

The funny numbers are because I stored the results of both calculations and did a delta between them, and listed values with >2*epsilon(0.0d0) delta.

The code was so old it was written for Holleriths and needs further work if anyone finds that appealing enough to continue, but it showed a few problems and a few reasons the defaults might be so slow.

With all three compilers when I compared the string values to the float values they were dead-on; with mine I got a reasonable value but not dead on. So some of the speed difference may be the cost of accuracy.

I very odd thing is if I replace the asterisk format with “(g0)” in ifort and nvfortran (gfortran 10 does not support that) in the original it was twice as fast. I found that interesting that it was that large a difference, although I tried it because I had a hunch it might be a little faster.

Also note that the speeds from the various compilers vary significantly, showing there is room for improvement there, which if pursued would generate fimprovements for everyone not just JSON users.

So forgive the dusty deck, and note I don’t think this method will easily lend itself to getting dead-on values. Change the 2epsilon to 1epsilon to see the majority of the values are not “perfect” but satisfy most requirements; but even if not pursued here it might be useful for some readers with a need for speed:

module M_ascii
! W-A-R-N-I-N-G
! W-A-R-N-I-N-G : Dusty deck that needs work. For illustration only
private
public ator

contains

logical function ator(astr,val)
use iso_fortran_env, only: wp => real64, ip => int64
implicit none
!
!     Convert ASCII-text to DP and return .TRUE. if OK
!
character(len=*),intent(in) :: astr
integer :: nc ! NUMBER OF CHARS
real(kind=wp) dfac, exp
real(kind=wp) :: val, v, ifac
integer :: i, mode, sexp, sval, a
integer,parameter :: uech=ichar('E'),   lech=ichar('e')
integer,parameter :: udee=ichar('D'),   ldee=ichar('d')
integer,parameter :: plsign=ichar('+'), mnsign=ichar('-'), decim=ichar('.')
integer,parameter :: u0=ichar('0'),     u9=ichar('9'),     aspace=ichar(' ')
      ator = .false.
      val = 0.0d0
      exp = 0.0d0
      sval = 1
      sexp = 1
      ifac = 10
      dfac = 1.0d0
!
      nc=len_trim(astr)
      mode = 0
      do i = 1, nc
         a=ichar(astr(i:i))
         ! value part of the number
         if(mode.eq.0)then
            select case(a)
            case(u0:u9)
               v   = real(a-u0,kind=wp)
               if (ifac.ne.1) then
                  val = val*ifac
               else
                  v   = v / dfac
                  dfac = dfac * 10.0d0
               endif
               val = val + v
            case(plsign)
               sval = 1
            case(mnsign)
               sval = -1
            case(decim)
               ifac = 1
               dfac = 10.0d0
            case(uech,lech,udee,ldee)
               mode = 1
            case(aspace)
            case default
               val = 0.0d0
               !return
            end select
         else
            select case(a)
            ! exponent
            case(u0:u9)
               exp = exp*10 + real(a-u0,kind=wp)
            case(plsign)
               sexp = 1
            case(mnsign)
               sexp = -1
            case default
               val = 0.0d0
               !return
            case(aspace)
            end select
         endif
      enddo
      val = val * sval
      if (exp.ne.0.0d0) val = val * (10.0d0**(exp*sexp))
      ator = .true.
end function ator

end module m_ascii

program main
use M_ascii, only : ator
use iso_fortran_env, only: wp => real64, ip => int64
implicit none
integer,parameter :: n = 1000000 !! number of values
real(wp)          :: rval_ator(n), rval_read(n), rval(n)
integer           :: ierr, i
integer(kind=ip)  :: start, finish, count_rate
logical           :: lerr
character(len=30),allocatable :: strings(:)

! create a list of values to parse
allocate(strings(n))
do i = 1, n
    call random_number(rval(i))
    write(strings(i), '(g0)') rval(i)
enddo

! use internal read
call system_clock(start, count_rate)
do i = 1, n
    read(strings(i),fmt=*,iostat=ierr) rval_read(i)
enddo
call system_clock(finish)
write(*,'(a30,1x,f7.4,1x,a)') 'time * : ', (finish-start)/real(count_rate,wp), ' seconds'

! use ator()
call system_clock(start)
do i = 1, n
    lerr=ator(strings(i),rval_ator(i))
enddo
call system_clock(finish)
write(*,'(a30,1x,f7.4,1x,a)') 'time * : ', (finish-start)/real(count_rate,wp), ' seconds'

! compare results
do i = 1, n
   if(abs(rval_ator(i)-rval_read(i)).gt.2*epsilon(rval_ator(i))) then
      write(*,'(i10,1x,*(g0,1x))')i,rval_ator(i)-rval_read(i),rval_ator(i),rval_read(i),strings(i)
   endif
enddo

!do i = 1, n
!   if(rval(i)-rval_read(i).ne.0.0d0) then
!      write(*,'(i10,1x,*(g0,1x))')i,rval_ator(i)-rval_read(i),rval_ator(i),rval_read(i),strings(i)
!   endif
!enddo

!do i = 1, n
!   if(rval(i)-rval_ator(i).ne.0.0d0) then
!      write(*,'(i10,1x,*(g0,1x))')i,rval_ator(i)-rval_read(i),rval_ator(i),rval_read(i),strings(i)
!   endif
!enddo

end program main
2 Likes

What is this trick for? I’d guess it will be optimized-out anyway.

Which place do you mention? Writing the strings?

Not a trick, a relic that got left in from a version where I did the pseudo random number in one line instead of the loop. I deleted it.

In the internal read used to convert the string to a number. Not all compilers support “g0” on input.

The main purpose was to show a specialized routine for converting the string to a number would resolve the speed issue, as a profile of the run showed a large amount of the time is spent in the internal read. The internal read is expensive as it is using a very large and complicated routine (READ) that has many many options for a very basic task. If the standard adopted something simple like if the format is ‘(g0)’ with no other options to read using a simple optimized routine, or even did the same with a * format when only a single scalar intrinsic is being read, or if stdlib contains optimized functions for that purpose this conversion would be as fast as in any other language without adding any significant new syntax to Fortran, potentially all done in Fortran itself. I think stdlib is already doing that; but if not this method could be improved by breaking the numeric value into A.BeC and processing A,B, and C as INTEGER values and then simply merging those into a double precision value, and the conversion to an INTEGER is very precise and uses already available procedures is one possibility for an improvement; but there are several less clear methods that involve a lot of bit shifting and and-ing and or-ing that can do this conversion in Fortran very quickly, just a lot less intuitively.

I did a quick (and very dirty) test, trying to convert a number using Fortran vector operations.
My dummy has many flaws, e.g. it can only process numbers in one fixed format and the result is not precise (e.g. 0.9064336162189542E+00 => 0.90643362990412046).
But if someone can help to fix these issues, we might get a very good speed.
I got on my Core i7 ThinkPad, build with fpm run --profile release:

init values    :   0.9222  seconds
formatted read :   0.7652  seconds
my dummy       :   0.0274  seconds

This is 27-28 times faster, so there is much room left for error fixing while keeping competetive with C strtod which can be about 6-7 times faster than formatted read.

My code (inserted into the code from @jacobwilliams above):

Click
function str2real(s) result(r)
character(*), target, intent(in) :: s
real(wp) :: r

integer,  parameter :: N = 32
real(wp), parameter :: expo(N) = [1e8, 1e7, 1e6, 1e5, 1e4, 1e3, 1e2, 1e1, 1e0, &
                                  1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, &
                                  1e-9, 1e-10, 1e-11, 1e-12, 1e-13, 1e-14, 1e-15, 1e-16, &
                                  1e-17, 1e-18, 0., 0., 0., 0.]
character(N) :: equ_s
integer*1    :: equ_i(N)
integer*1    :: mask(N)

equivalence(equ_i, equ_s)

equ_s = s
equ_i = equ_i - 48

where (0 <= equ_i .and. equ_i <= 9)
    mask = 1
elsewhere
    mask = 0
end where

equ_i = equ_i * mask
r = sum(equ_i * expo) * 10**(equ_i(29)*10+equ_i(30))
end function str2real

Things that have to be fixed:

  • Correct result (maybe change the order of sum to start with)
  • Variable formats, e.g. position of period and exponent
  • Code clean up

I hope someone has an idea how we could fix these issues. Array operations could be the key to success. Feel free to make suggestions or just copy my code and try it on your own.

2 Likes

Ultrajson str to double conversion is here: ultrajson/double-conversion.cc at 7fb005adf296c6d64e114ad8dbd6a4751855fd52 · ultrajson/ultrajson · GitHub, seems quite complicated.

str to int conversion seems much easier: ultrajson/ultrajsondec.c at d36b56c7b6f90fe14f543720d34aac930f77b1fa · ultrajson/ultrajson · GitHub

Here is the grammar that JSON accepts for floating poing numbers: JSON, see the “number” part. That is not complicated to parse. So I think the parsing itself can be quick. It’s the conversion to double that might be complicated because of ensuring no accuracy is lost. But that’s a separate problem.

The function does not convert strings to reals correctly for some cases I tried. The program

program xstr_real
implicit none
integer, parameter :: wp = kind(1.0d0)
print*,str2real("0.12"),str2real("1.2"),str2real("12.0")
contains
function str2real(s) result(r)
character(*), target, intent(in) :: s
real(wp) :: r

integer,  parameter :: N = 32
real(wp), parameter :: expo(N) = [1e8, 1e7, 1e6, 1e5, 1e4, 1e3, 1e2, 1e1, 1e0, &
                                  1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, &
                                  1e-9, 1e-10, 1e-11, 1e-12, 1e-13, 1e-14, 1e-15, 1e-16, &
                                  1e-17, 1e-18, 0., 0., 0., 0.]
character(N) :: equ_s
integer*1    :: equ_i(N)
integer*1    :: mask(N)

equivalence(equ_i, equ_s)

equ_s = s
equ_i = equ_i - 48

where (0 <= equ_i .and. equ_i <= 9)
    mask = 1
elsewhere
    mask = 0
end where

equ_i = equ_i * mask
r = sum(equ_i * expo) * 10**(equ_i(29)*10+equ_i(30))
end function str2real
end program xstr_real

gives output
1200000.0000000000 102000000.00000000 120000000.00000000

These are cases I didn’t implement yet. The detection of the period and the exponent has to be done, their positions are currently fixed.

Edit: I forgot optimisation, therefore my dummy is much faster, see comment below

After implementing the detection of period and exponent, the runtime unfortunately dropped down drastically. The exponent itself is still ignored, would only require an integer parser, though.
Time is now 0.1576 seconds compared to 0.7314 seconds. Still better than formatted read, but also still not precise.

The Code
function str2real(s) result(r)
character(*), target, intent(in) :: s
real(wp) :: r

integer*1, parameter :: period = -2
integer*1, parameter :: char_e = 21
integer,   parameter :: N = 32
real(wp),  parameter :: expo(*) = [1e15,  1e14,  1e13,  1e12,  1e11,  1e10,  1e9,   1e8,   &
                                   1e7,   1e6,   1e5,   1e4,   1e3,   1e2,   1e1,   1e0,   &
                                   1e0,                                                    &
                                   1e-1,  1e-2,  1e-3,  1e-4,  1e-5,  1e-6,  1e-7,  1e-8,  &
                                   1e-9,  1e-10, 1e-11, 1e-12, 1e-13, 1e-14, 1e-15, 1e-16, &
                                   1e-17, 1e-18, 1e-19, 1e-20, 1e-21, 1e-22, 1e-23, 1e-24]
character(N) :: equ_s
integer*1    :: equ_i(N)
integer*1    :: mask(N)
integer      :: period_loc
integer      :: exponent_loc
integer      :: mask_from
integer      :: mask_till


equivalence(equ_i, equ_s)

equ_s = s
equ_i = equ_i - 48

period_loc   = findloc(equ_i, period, 1)
exponent_loc = findloc(equ_i, char_e, 1)

where (0 <= equ_i .and. equ_i <= 9)
    mask = 1
elsewhere
    mask = 0
end where

mask_from = 18 - period_loc
mask_till = mask_from + exponent_loc - 2

r = sum(equ_i(:exponent_loc - 1) * expo(mask_from:mask_till) * mask(:exponent_loc - 1))
end function str2real

Good news, I forgot to pass --profile release… Correct times are:

formatted read :   0.7430  seconds
my dummy       :   0.0344  seconds

My dummy is still 21 times faster.