FizzBuzz in Fortran much slower than in C

Inspired by High throughput Fizz Buzz I coded the problem in Fortran

program main
implicit none
integer :: i
logical :: div_3,div_5
do i=1,1000000000-1
   div_3 = mod(i,3) == 0
   div_5 = mod(i,5) == 0
   if (div_3 .and. div_5) then
      write (*,"('FizzBuzz')")
   else if (div_3) then
      write (*,"('Fizz')") 
   else if (div_5) then
      write (*,"('Buzz')")
   else
      write (*,"(i0)") i
   end if
end do
end program main

and compiled with gfortran -O3 xfizzbuzz.f90 on WSL 2, followed by ./a.out | pv > /dev/null. I get output

7.33GiB 0:11:25 [11.0MiB/s]

while in C

#include <stdio.h>

int main() {
	for (int i = 1; i < 1000000000; i++) {
		if ((i % 3 == 0) && (i % 5 == 0)) {
			printf("FizzBuzz\n");
		} else if (i % 3 == 0) {
			printf("Fizz\n");
		} else if (i % 5 == 0) {
			printf("Buzz\n");
		} else {
			printf("%d\n", i);
		}
	}
}

with gcc -O3 I get
7.33GiB 0:01:17 [97.5MiB/s] , which is about 9 times faster. I assume faster I/O in C is the cause. Can anyone produce a faster Fortran FizzBuzz?

4 Likes

I tried to buffer the output a little bit:

program main
implicit none
integer, parameter :: n = 51
integer :: i, j, k
logical :: div_3,div_5
character(10*n) :: buffer
do i=1,(1000000000/n)-1
   do j=0,n-1
       k = n*i + j
       div_3 = mod(k,3) == 0
       div_5 = mod(k,5) == 0
       if (div_3 .and. div_5) then
          write (buffer,"('FizzBuzz')")
       else if (div_3) then
          write (buffer,"('Fizz')")
       else if (div_5) then
          write (buffer,"('Buzz')")
       else
          write (buffer,"(i0)") i
       end if
   end do
   write (*,"(g0)") buffer
end do
end program main

But it didn’t improve that much, maybe 20 % or 30 %
I randomly tried different buffer sizes, but it didn’t change much.

Edit: The output isn’t correct anymore. But I’m too tired to fix it. Maybe someone else has more luck with this approach.

I have not measured the timing exactly but I get the impression that writing to a file instead of to the terminal might be quicker, although still slow.

I am around 7 MiB/s on a Kubuntu machine with an Intel(R) Pentium(R) CPU G3220 @ 3.00GHz.

With write (*,*) i , I am around 10 MiB/s.

A strange thing is that write (*, *) "i" is around 5 MiB/s, although we could think writing a “i” on screen should be faster than converting an integer to characters…

And it’s not better with ifort (even a little slower than GFortran).

How about writing it all internally and outputting the whole at the end,

implicit none
integer :: i, istart, istop
integer, parameter :: MAXNUM = 10**8
character(1), parameter :: NLC = new_line("a")
character(MAXNUM*10) :: buffer
istart = 1
do i = 1, MAXNUM
    if (mod(i,3) == 0 .and. mod(i,5) == 0) then
        istop = istart + 8
        buffer(istart:istop) = "FizzBuzz"//NLC
    else if (mod(i,3) == 0) then
        istop = istart + 4
        buffer(istart:istop) = "Fizz"//NLC
    else if (mod(i,5) == 0) then
        istop = istart + 4
        buffer(istart:istop) = "Buzz"//NLC
    else
        istop = istart + int(log(real(i)+epsilon(1.))/log(10.)) + 1
        write (buffer(istart:istop-1),"(i0)") i
        buffer(istop:istop) = NLC
    end if
    istart = istop + 1
end do
write (*, "(a)", advance = "no") buffer(1:istop)
end

This is ~6X faster than the original Fortran code,

ifort -O3 original.f90
./a.out | pv > /dev/null
700MiB 0:01:38 [6.64MiB/s]
ifort -O3 modified.f90
./a.out | pv > /dev/null
700MiB 0:00:17 [41.2MiB/s]
1 Like

While not interested in the highest throughput, at code-golf.io you can attempt to build the Fizz-Buzz program with the least number of bytes/characters. Here was my solution with 133 characters:

character(8)c
do i=1,100
c=''
if(mod(i,3)<1)c='Fizz'
if(mod(i,5)<1)c=trim(c)//'Buzz'
if(c=='')write(c,'(I0)')i
print'(A)',c
enddo
end

The current record-holders manage the same in 64 bytes! (In case you find the solution do not post it here to keep the game fair.)

Interestingly, the current shortest C solution has 72 bytes, and the current best assembly solution requires 64 bytes.

1 Like

Well, you said to make it fast, not intelligible. I would not recommend
anyone asked in an interview to use the things I did here :wink:
I threw the kitchen sink at it, probably most of the nasty things were
not necessary. If I get some time I will go back and figure out which
ones were not needed if no one else does.

But it did produce some decent numbers, faster than the C baseline of 66 seconds – 14 to 20 seconds on a laptop; although the I/O alone would be longer than that if the output was not going to /dev/null.

Everything should be standard Fortran accept the trick of opening stdout
as /dev/stdout, but unfortunately there is no standard way I can think
of to open a pre-attached stdout as a stream, and you said to write the
messages to the screen. There are other ways to do that, but it is platform specific and probably would not work on MSWindows.

ifort -O3 itoa.f90
$ ./a.out|pv >/dev/null
7.33GiB 0:00:17 [ 437MiB/s] [                                   <=>                                                                ]
$ gfortran -O3 itoa.f90
./a.out|pv >/dev/null
7.33GiB 0:00:14 [ 500MiB/s] [                             <=>                                                                      ]
$ nvfortran -O3 itoa.f90
./a.out|pv >/dev/null
7.33GiB 0:00:20 [ 362MiB/s] [                                         <=>     
gcc -O3 fizzbuzz.c
$ ./a.out|pv >/dev/null
7.33GiB 0:01:06 [ 113MiB/s] [                                                                 <=>     
program testit
use iso_fortran_env, only: byte => int8
implicit none
integer :: i,j,k,m,ios,lun
logical :: wrote
integer,parameter :: sz= 1000000000-1, maxl=2**16 ! make sure keep above > 21
integer(kind=byte) :: buffer(maxl)
integer(kind=byte),parameter :: fizz(4)=[ichar('F'),ichar('i'),ichar('z'),ichar('z')]
integer(kind=byte),parameter :: buzz(4)=[ichar('B'),ichar('u'),ichar('z'),ichar('z')]
integer(kind=byte),parameter :: nl=ichar(new_line('a'))
integer(kind=byte),parameter :: blank=ichar(new_line(' '))
   open( newunit = lun, access = 'stream', action = 'write', file = '/dev/stdout', form = 'unformatted', iostat = ios )
   j=0
   k=0
   m=1
   do i=1,sz
      j=j+1
      k=k+1
      wrote=.false.
      if(j.eq.3)then
         buffer(m:m+3)=fizz
         wrote=.true.
         m=m+4
         j=0
      endif
      if(k.eq.5)then
         buffer(m:m+3)=buzz
         wrote=.true.
         k=0
         m=m+4
      endif
      if(.not.wrote)then
         call i2a(i)
      endif
      buffer(m:m)=nl
      m=m+1
      if(m.ge.maxl-11)then
         write(LUN)buffer(:m)
         m=1
      endif
   enddo
   if(m.ne.1)then
      write(LUN)buffer(:m-1)
   endif
contains
subroutine i2a(number) 
integer,intent(in)   :: number
integer   :: i, n, s(20)
integer,parameter :: zero=ichar('0')
   n=number   
   i = 0
   do while (n  > 0)
      i=i+1
      s(i) = mod(n,10) + zero
      n=n/10
   enddo
   buffer(m:m+i)=s(i:1:-1)
   m=m+i
end subroutine
end program testit
1 Like

This is about of the same duration, and is much less obfuscated .

program main
implicit none
integer,parameter :: sz= 1000000000-1, zero=ichar('0'), maxl= 2**13 ! make sure keep above > 21
character,parameter :: nl*1=(new_line('a')), fizz*4='Fizz', buzz*4='Buzz', fizzbuzz*8='FizzBuzz'
character(len=maxl) :: buffer
logical :: div_3, div_5
integer :: ii, nn, ss(20), jj, i, m
   m=1
   do i = 1, sz
      div_3 = mod(i,3) == 0
      div_5 = mod(i,5) == 0
      if (div_3 .and. div_5) then
         buffer(m:m+7)=fizzbuzz
         m = m + 8
      elseif ( div_3 ) then
         buffer( m:m+3 ) = fizz
         m=m+4
      elseif ( div_5 ) then
         buffer( m:m+3 ) = buzz
         m=m+4
      else
         nn=i   
         ii = 0
         do while (nn > 0)
            ii = ii + 1
            ss(ii) = mod(nn,10) + zero
            nn = nn / 10
         enddo
         do jj = ii, 1, -1
            buffer(m:m) = achar(ss(jj))
            m = m + 1
         enddo
      endif
      buffer(m:m) = nl
      m = m + 1
      if( m >= maxl - 11 )then
         write(*,'(a)',advance='no')buffer(:m)
         m = 1
      endif
   end do
   if( m /= 1 ) write(*,'(a)',advance='no')buffer(:m-1)
end program main
1 Like

c_new_line can be used here instead.

2 Likes

A request: will it be possible for someone to take the defined IO with a derived type case shown below and try it with stream output on a Linux system, while the “FizzBuzz” values are setup using coarrays? Intel API can be an option for this. And compare it with the Fortran sequential IO with an intrinsic type shown by @Beliavsky in the original post?

   use num_string_m, only : num_string_t
   character(len=*), parameter :: FIZZBUZZ = "FizzBuzz"
   character(len=*), parameter :: FIZZ = "Fizz"
   character(len=*), parameter :: BUZZ = "Buzz"
   integer, parameter :: N = 1000000000
   type(num_string_t), allocatable :: s(:)[:]
   logical :: div_3, div_5
   integer :: i, x, j, lun, istat
   x = N / num_images()
   allocate( s(x)[*] )
   sync all
   do concurrent ( i = 1:x  )
      j = (this_image()-1)*x + i
      div_3 = mod( j, 3 ) == 0
      div_5 = mod( j, 5 ) == 0
      if ( div_3 .and. div_5 ) then
         s(i) = FIZZBUZZ
      else if ( div_3 ) then
         s(i) = FIZZ
      else if ( div_5 ) then
         s(i) = BUZZ
      else
         s(i) = j
      end if
   end do
   sync all
   if ( this_image() == 1 ) then
      open( newunit=lun, access="stream", form="formatted", status="scratch", iostat=istat )
      if ( istat == 0 ) then
         do i = 1, num_images()
            write( lun, fmt="(*(DT))", advance="no" ) s(:)[i]
         end do
      end if
   end if
end

The code toward the derived type:

Click for code

Module for numstring_t

module num_string_m
   type :: num_string_t
      private
      character(len=digits(0)) :: s
      integer :: lens = 0
   contains
      private
      procedure, pass(this) :: assign_s
      procedure, pass(this) :: assign_n
      procedure, pass(dtv) :: write_s
      generic, public :: assignment(=) => assign_s, assign_n
      generic, public :: write(formatted) => write_s
   end type
contains
   elemental subroutine assign_s( this, s )
      class(num_string_t), intent(inout) :: this
      character(len=*), intent(in)       :: s
      this%s = s
      this%lens = len_trim(this%s) 
   end subroutine
   elemental subroutine assign_n( this, num )
      class(num_string_t), intent(inout) :: this
      integer, intent(in)                :: num
      integer :: i, n, rem
      n = num
      this%s = ""
      this%lens = 0
      to_s: do i = digits(0), 1, -1
         rem = mod( n, 10 )
         this%s(i:i) = achar(iachar("0") + rem)
         this%lens = this%lens + 1
         n = n/10
         if ( n == 0 ) exit to_s
      end do to_s
      this%s = adjustl(this%s)
   end subroutine
   subroutine write_s( dtv, lun, iotype, vlist, istat, imsg )
      ! Argument list
      class(num_string_t), intent(in)  :: dtv
      integer, intent(in)              :: lun
      character(len=*), intent(in)     :: iotype
      integer, intent(in)              :: vlist(:)
      integer, intent(out)             :: istat
      character (len=*), intent(inout) :: imsg
      ! local variable
      character(len=20) :: pfmt
      istat = 0
      select case ( iotype )
         case ( "LISTDIRECTED" )
            ! No special consideration
            write(lun, fmt=*, iostat=istat, iomsg=imsg) dtv%s(:dtv%lens) // new_line("")
         case ( "DT" )
            ! vlist(1) is to be used as the field widths of the
            ! component of the derived type variable. First set up the format to
            ! be used for output.
            if ( size(vlist) > 0 ) then
               write(pfmt,"(*(g0))" ) "(1x,g", vlist(1), ")"
            else
               pfmt = "(1x,g0)"
            end if
            write(lun, fmt=pfmt, advance="no", iostat=istat, iomsg=imsg) dtv%s(:dtv%lens) // new_line("")
         case ( "NAMELIST" )
            ! Not supported
            istat = 1
            imsg = "Namelist option is not yet supported."
            return
      end select
      return
   end subroutine
end module

It will be interesting to see how much adversely the performance might be affected when complicated options using modern Fortran are in play.

Also, the determination of the “FizzBuzz” values themselves has gotta be an embarrassingly parallel problem for a processor, so the question is whether a Fortran processor fed with coarrays and aided by DO CONCURRENT gets close achieving it following which it is just a matter of IO of large data.

Isn’t the order of print important in this benchmark? I do not fully get the point of this benchmark, it seems to be more of a competition within the C language where it says “…assembly is also allowed…”. If one can really use one language within another, then Kargl’s code is perfectly legitimate. Everything becomes assembly at the end.

I also disagree with the choice of Fortran much slower than in C as the title of the post here (as well as other past and future similar posts). Such a narrow-scope title does not reveal all circumstances surrounding the benchmark in question but it does negatively impact the public’s view of Fortran, those who are not experts to understand the details of the benchmark or those who are not patient enough to read through all responses to the post. They read the title, get the impression that Fortran is slow, and move on to spread that potentially-wrong impression to others in the same way gossips spread.

2 Likes

Did you try it? With a couple of changes to “wrap one’s arms” around the code and help understand what the program might do

..
integer, parameter :: N = 24
..
   if ( this_image() == 1 ) then
      lun = 6
      if ( istat == 0 ) then
         do i = 1, num_images()
            write( lun, fmt="(*(DT))", advance="no" ) s(:)[i]

The program should generate the following output: isn’t that what is the expected output in order for the first 24 values? If not, what’s wrong?

Click for console output

Program with derived type IO

C:\Temp>ifort /standard-semantics /O3 /Qcoarray:shared /Qcoarray-num-images=8 c.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.3.0 Build 20210609_000000
Copyright (C) 1985-2021 Intel Corporation. All rights reserved.

Microsoft (R) Incremental Linker Version 14.29.30038.1
Copyright (C) Microsoft Corporation. All rights reserved.

-out:c.exe
-subsystem:console
c.obj

C:\Temp>c.exe
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19
Buzz
Fizz
22
23
Fizz

C:\Temp>

The code I posted

  1. attempts to determines all the FizzBuzz values in parallel,
  2. tries to cut down the number of IO transfers from the problem size (N) to num_images()
  3. then does stream IO to a processor-dependent location toward a “scratch file” (original post has sequential IO piped to /dev/null)

I am wondering if there is any advantage to this relative to the conventional Fortran approach that has long been procedural and to perform IO transfers as one proceeds.

I am always interested in learning better how different approaches based on more complicated data structures and concurrent computing and so forth in the modern Fortran toolset compare for certain tasks with the conventional approach of procedural code with intrinsic (primitive) data types that has long been practiced starting with FORTRAN I and which will continue as long as computing remains viable. The FizzBuzz Fortran code by @Beliavsky in the original post is a classic illustration of the latter. Hence my request. Besides, I am not quite setup at the moment to do a proper comparison with that FizzBuzz Fortran program by @Beliavsky . I am particularly interested in the results on a many core Linux platform.

Separately, if you look at my comment, you will notice I stay away from comparing with C. That is no interest to me.

1 Like

I might add that write (and read) even with output into a string like in “write(str,fmt) my_number” internally requires a global lock to update resources like unit numbers. At least in gfortran. So using write for string operations within an openmp block with many threads might lead to contention. I guess other compilers handle it similarly. (In fact I even had a baffling segfault caused by simple string operations due to a missing lock in the gfortran runtime library. Now fixed.)

Because the write runtime code is that slow for various compilers, even with buffering, we have our own buffer classes to only invoke write for larger chunks of data. It does not feel right for a language if you have to code your own buffering to get decent write performance.

1 Like

I agree. It doesn’t feel right to me either. As a user I would like the Fortran compilers to deliver very good performance out of the box. There might be some reason why that is not possible in general, but in that case, I think at least having a compiler option to use a faster implementation of IO would help. That way it’s off by default, but one could turn it on, such as in the example above.

5 Likes

iforts write is not that different performance-wise. It is still pretty unexpected that a global lock or any other thread interaction is required for something like “write(str,’(es12.5)’) x” (not sure about ifort, though) However, a do-it-all write statement looks a little bit like a design short-coming in the language itself.

1 Like

The FizzBuzz benchmark is being discussed at Hacker News.

2 Likes