Faster string to double

I’m not sure what the consensus is on this issue, but I would point out that it is consistent with E format output where the exponent width is too small for the letter, sign, and number. In that case the letter is dropped, and the sign and exponent value are written, so no information is lost. If the exponent field is still too small, then the entire field is filled with *, losing the information. If it is possible to convert those numbers, it would be symmetrical with the previous write that produced them.

For the issue of storing intermediate values in integers, that always means that those values are limited by the integer range. If the string of digits is converted and stored as floating point numbers, then the only digits that are lost are those that would be lost anyway. Consider storing, say, a 30-digit value into eventually a 32-bit real. There is likely no integer kind large enough to store that 30-digit value. But if it is converted directly into the 32-bit real, then there would be no problem with integer overflow. Of course, each step of that conversion would require a floating point multiply and add, so it would be slower than the corresponding integer conversion.

2 Likes

Good point, not sure either …

For this, the strategy I found to work quite nicely was to read the fractional part of the number with an integer(wp) that should enable the precision required for the end real. Currently only implemented for real(8), so let’s say we want to read the following value:

s = '0.123456789123456789123456789123456789'

a formatted read will throw

real(8) :: x
...
read(s,*) x 
write(*,*) x !< 0.12345678912345678 the value is truncated at the 17th decimal place

by iterating over the string (and making sure to finish the loop at the limit of the integer(8) :: int_wp) I would get first and int=0 and int_wp = 1234567891234567891 and then
r = int + int_wp * 1d-19 = 0.12345678912345678.
The idea was to use the fractional base multiplication and conversion just once at the end.

To enable for larger real types indeed we should extend the fractional base for this idea to work, but I guess it should in principle ( ? )

If I read correctly (Agner’s tables), in modern processors, FMA is faster than int -> double conversion. The former has latency of 4 or 5 (reciprocal throughput of 0.5), whereas the latter has 6 (reciprocal of 1 or 2). That’s for scalar conversion (not packed/SIMD).
Interesting.

Had to double check. Using gfortran 12.2.0 that string gets parsed to 1.0e-3. No warnings.

You can use one integer only. For "123.456e7" you might have int = 123456 and a counter for the digits after the dot. The parsed number then is 123456 * 10**(7 - k), where k = 3 in this case. No idea which way is faster/more accurate, haven’t tested myself.

1 Like

This program shows what list-directed i/o does with various strings.

program efmt
   real :: x
   character(:), allocatable :: str
   str = '1.e-3'  ! decimal, with e
   read(str,*) x
   write(*,*) x
   str = '1.-3'   ! decimal, no e
   read(str,*) x
   write(*,*) x
   str = '1-3'    ! no decimal, no e
   read(str,*) x
   write(*,*) x
   str = '1234567890123456789012345678901234567890-9'    ! long string, no decimal, no e
   read(str,*) x
   write(*,*) x
end program efmt

With NAG and gfortran, the output is something like:

$ nagfor efmt.f90 && a.out
NAG Fortran Compiler Release 7.1(Hanzomon) Build 7114
[NAG Fortran Compiler normal termination]
   1.0000000E-03
   1.0000000E-03
   1.0000000E-03
   1.2345679E+30

One might expect a string-to-real function to do similar conversions.

Thanks for the reference!!

In light of the evidences I’ll just be glad that inadvertelly got the same output

I took your advice and rearranged the core code, it is compact and cleaner, but didn’t gain any performance (except for some flukes in which running times would drop to 0.02 or 0.016, but mostly around 0.032)

Adding a check on the maximum admissible lenght of the integer(wp) for handling long strings did drop significantly the perfs (would increase to [0.044;0.047]s). Have to rethink the loops…

I gave some thoughts to the last discussions regarding long strings and tried a small change to the algorithm (you can check out the full version in the github):

elemental subroutine str2real_unroll(s,r,p,stat)
...
! read whole and fractional number in a single integer
pP = p
int_wp = 0
do i = p, min(19+p-1,len(s)) !< stop at the max precision available for the integer(8)
     val = iachar(s(i:i))-digit_0
     if( val >= 0 .and. val <= 9 ) then
           int_wp = int_wp*10 + val
     else if( val == period ) then
           pP = i
     else
           exit
     end if
end do
p = i

With this modification, the following test will pass:

call check("0.123456789123456789123456789123456789")

as I’ll get the same read as formated read : “0.123456789123457”

The test suggested by @RonShepard won’t pass:

call check("1234567890123456789012345678901234567890-9")

as I’ll get : “1.23456789012346” , while intel or gfortran in windows or linux will give “1.2345679E+30”

Any ideas here? (maybe the second part of the algorithm can be tweaked without loosing generality and performance?)

The good thing is that so far I managed to stay within the same performance rate.

After the loop

do i = p, min(19+p-1,len(s))

you go into an if that expects part of the exponential section, either e, E, + or -. Which in the case of

"1234567890123456789012345678901234567890-9"

does not happen. You may need to keep reading characters (digits) and discarding them if you have read enough for the set precision, until you find e, E, + or - (deal with exponent) or the end of the string (be done).

Also must check for ‘d’ and ‘D’.

I made a few updates to the str2real function here: GitHub - jalvesz/Fortran-String-to-Num

All the tests pass even for long strings, and all four [‘e’,‘E’,‘d’,‘D’] are used to check for the exponents.

Was wondering, anyone interested in double checking? And I was thinking that this could be part of the stdlib, any opinions?

1 Like

I was reading this thread this morning before the last post and thinking that it would be nice to have this in the stdlib since we have to_string for the reverse. Encouraging to see development continue. Not sure about the difference in the algos posted by @Carltoffel and @hkvzjal

3 Likes

You can checkout both implementations in the file str2num_m.f90. When I picked up the project I started from @Carltoffel’s implementation then moved to the “rollout” one as I found out it was not only faster but I managed to streamline several numbers in a single string.

2 Likes

Opened a PR on stdlib here Proposal for a reference string to number conversion facility in stdlib by jalvesz · Pull Request #743 · fortran-lang/stdlib · GitHub

Implementation: https://github.com/jalvesz/stdlib/blob/master/src/stdlib_str2num.f90
test: https://github.com/jalvesz/stdlib/blob/master/test/string/test_string_to_number.f90
example: https://github.com/jalvesz/stdlib/blob/master/example/strings/example_str2num.f90

Used the occasion to duplicate for str2float and str2double

The complete interface can be called from the subroutine str2num which covers all available types:

    !> easy to use function interfaces
    public :: str2int,    str2int_p
    public :: str2float,  str2float_p
    public :: str2double, str2double_p
    !> generic subroutine interface
    public :: str2num

Any suggestions/contributions?

3 Likes

I would also suggest a real128 version also.

If only we could make this look like the intrinsic functions, int and real.

3 Likes

I did try but the problem is that the compilers can not disambiguate the l.h.s of the assignment for the function interface, that’s why I ended up with a subroutine interface and the different functions. If anyone has an idea how to manage it would indeed be great.

One approach might be to add a MOLD= argument. That is the way that, for example, TRANSFER() and other intrinsic functions work.

One might want this function to work with any supported real kind. This gets back to the difficulty of writing generic functions in a portable way. Adding a MOLD= argument takes care of the user interface part of the problem, but there is still the usual problem of writing code that works with every kind in the REAL_KINDS(:) array.

@RonShepard @everythingfunctional something like this Compiler Explorer ?

...
public :: str2num, str2num_p
...
interface str2num
        module procedure str2int
        module procedure str2float
        module procedure str2double
end interface

interface str2num_p
        module procedure str2int_p
        module procedure str2float_p
        module procedure str2double_p
end interface
...
elemental function str2int(s,mold) result(v)
        ! -- In/out Variables
        character(*), intent(in) :: s !> input string
        integer, intent(in) :: mold !> dummy argument to disambiguate at compile time the generic interface
        integer :: v !> output value
...
function str2double_p(s,mold,stat) result(r)
        ! -- In/out Variables
        character(len=:), pointer :: s !> input string
        real(dp), intent(in) :: mold !> dummy argument to disambiguate at compile time the generic interface
        real(dp) :: r    !> Output real value
        integer(1),intent(inout), optional :: stat
! etc

No, the way you did it is the way to do it with current Fortran. I mostly made the comment to introduce the potential proposal for the standard.

Ok, I see, I guess that for a proposal for the standard, the first step would be that the compilers handle the l.h.s of an assignment as a kind of 0th argument, such that it can then unambiguously chose which function to use when creating generic interfaces.

Any comments on the current proposal for stdlib in the mean time? Different names for str2_kind_ or str2num with an extra arg to disambiguate ?

1 Like