Unexpected difference between time measurements

I run a simple program to solve a system of linear algebraic equations with a sparse matrix using Sparspak90.

The simulation is run in WSL2 with Ubuntu-22.04 on a Windows 10 machine. The code is compiled with gfortran, with the default implementations of LAPACK and BLAS.

The weird thing is the time measurement. I get

StubbyKExample$ rm *.exe; ./compile_SimpleExample.sh ; time ./SimpleExample.exe            
       63070       63070     4223032                                                       
 in FindOrder                                                                              
 Calling MMD                                                                               
 before factor                                                                             
   7.98282814       217.167267                                                             
 after factor                                                                              
 ---------------------------- Timing Information ----------------------------              
                Time for Ordering                           0.078                          
                Time for symbolic factorization             0.090                          
                Time for matrix input                       0.285                          
                Time for factorization                     20.921                          
                Time for forward/back solve                 0.023                          
 ----------------------------------------------------------------------------              
                                                                                           
real    3m45.192s                                                                          
user    3m36.989s                                                                          
sys     0m0.417s       

So the total time of the run seems to agree with the reported time obtained with cpu_time ( 217.167267 - 7.98282814)

   call cpu_time(T1)
        call LUFactor(s%n, s%nsuper, s%xsuper, s%snode, &
                     s%xlindx, s%lindx, s%xlnz, s%lnz, s%xunz, s%unz, &
                     s%ipiv, s%errflag)
   call cpu_time(T2)
   print *, T1, T2

However, this does not agree with the statistics printed from the measurements with GetTime (20.921)

        call GetTime(s%factorTime)
        call Factor(s%slvr)
        call GetTime(s%factorTime, s%factorTime)

Now, I am tempted to believe the measurement of 20.9 seconds, because the CHOLMOD UMFPACK sparse solver (called from Julia) deals with the task in the same environment, the same method similar methods (supernodal LU vs multi-frontal LU), in about 20.5 seconds. But then, what is the code doing for the remaining three minutes?

My final datapoint is a time measurement of the Julia rewrite of the Fortran code. That takes ~27 seconds for the factorization. Again, same environment.

Edit: Originally, the comparison was with chol (hence CHOLMOD), but in order not to compare apples with orangutans I switched to lu (UMFPACK), forgetting to edit the post. Sorry about the mix up.

If I understand correctly, you are comparing times for two different fortran codes doing matrix factorization. The first one is apparently a LU factorization where as the second faster one is modified Cholesky factorization . Cholesky factorization is going to be faster than LU. You are measuring two different subroutines so it is not surprising that their run times are different.

edit: you did not say the what method the Factor() subroutine implements. In any case LUFactor and Factor seem to be different subroutines and so both the timings reported could be correct.

Sorry about the mix up, I posted a correction.

Factor simply calls LUFactor, and no other work is really done in addition to this one call. Hence these are of equal cost.

The comparison is really between a Fortran code and a Julia translation of that code and a ANSI C code of the UMFPACK. So, no, I am not comparing two Fortran codes.

I am still a little lost. Factor is calling LUFactor, but the timings shown indicate LUFactor takes 210s whereas Factor takes 20s? Is that right?

It would be worth checking if this call is doing something weird due to aliasing

Yes, there are three minutes of time unaccounted for.

Using cpu_time gives consistent timing:

 before factor                                                                                                                                                       
   8.15433502       217.850983                                                                                                                                       
 in SparseFactor    8.15433311       217.851074                                                                                                                      
 after factor   

But that means the Fortran solver takes an order of magnitude longer than Julia.

It seems to be a bug related to GetTime then. The outer subroutine obviously cannot be faster than the inner subroutine. I would trust the time obtained from intrinsic cpu_time.

As for the difference compared with Julia, it could be any number of things. Without looking at both the codes, it is hard to make a useful remark. The timings should be in the same ball park, unless something has gone seriously wrong.

I have done some profiling of the Julia code: most of the time is spent in BLAS Level 3 routines. Mostly dgemm. So if there is a difference between the Fortran and Julia versions, it must be somewhere in the BLAS.

Julia uses OpenBLAS (by default) I believe. It is easy to compile from source and link with your fortran program.
It replaces both LAPACK and BLAS.

That is an interesting proposition. I use
gfortran -O3 ``cat src`` -o SimpleExample.exe /usr/lib/x86_64-linux-gnu/libopenblas.a

The results are intriguing:

$ rm *.exe; ./compile_SimpleExample.sh ; time ./SimpleExample.exe                                                                                                                                                                
       63070       63070     4223032                                                                                                                                 
 in FindOrder                                                                                                                                                        
 Calling MMD                                                                                                                                                         
 before factor                                                                                                                                                       
   6.54717493       318.723877                                                                                                                                       
 in SparseFactor    6.54717207       318.724640                                                                                                                      
 after factor                                                                                                                                                        
 ---------------------------- Timing Information ----------------------------                                                                                        
                Time for Ordering                           0.038                                                                                                    
                Time for symbolic factorization             0.069                                                                                                    
                Time for matrix input                       0.102                                                                                                    
                Time for factorization                      3.905                                                                                                    
                Time for forward/back solve                 0.042                                                                                                    
 ----------------------------------------------------------------------------                                                                                        
                                                                                                                                                                     
real    0m46.345s                                                                                                                                                    
user    2m43.273s                                                                                                                                                    
sys     2m38.839s          

The wall-clock time measurement is roughly 55 seconds. The reading of the matrix took 6 seconds, which leaves 49 seconds for the ordering, symbolic, and numerical factorization. So far so good: that is roughly 44 seconds for the numerical factorization, or 13 seconds slower than the Julia rewrite of the code.

Then the “real” time is perhaps spent calculating (omitting the reading of the matrix from the file)?
I believe the “user” time could then be the total spent computing with multiple threads?
But what about the “sys”? So much time spent in the kernel?

But the real puzzler is the time returned by cpu_time ( 6.54717493 318.723877)!?

And the time reported for the factorization is now 3.9 seconds!? I cannot find any reference
for the function GetTime! Where does this come from? And what does it report?

I can confirm that Julia uses OpenBlas by default. That said, if you have an Intel or AMD cpu, you will probably see better performance from MKL which in Julia can be used simply with using MKL (and installing the package)

1 Like

The real value is the wall time from beginning to end of execution. It does not omit anything. usr and sys can be higher due to multithreading as you have noted.

The value reported by cpu_time is not interpretable. You cannot rely on cpu_time() returning 0 at the top of the program. We should be looking at the relative times T2-T1 always. So, we cannot say that reading the matrix took 6 seconds from one call to cpu_time. (edit: important point below by @JohnCampbell that i missed. cpu_time can be greater than wall time)

GetTime should be in your source code. It is not an intrinsic function.

CPU_TIME reports the processor time for all threads being used ( it looks like multiple threads are being used, where CPU_TIME >> elapsed time) This is reported as a REAL number.
SYSTEM_CLOCK can be used to calculate the “wall clock time” and may be a better comparison to the JULIA timing, although you would have to confirm if this is a wall / elapsed time.
SYSTEM_CLOCK uses “tick” and “rate” as integers. INTEGER(8) may provide a better / more accurate answer, although this should not be an issue if measuring 50 seconds.

2 Likes

FYI, Here is a function that wraps SYSTEM_CLOCK

  Function getSysTime() Result(time_now)

    USE ISO_FORTRAN_ENV, ONLY: DP=>REAL64, INT64

! Provides a wrapper function around Fortran intrinsic SYSTEM_CLOCK subroutine
! SYSTEM_CLOCK returns current number of clock_ticks and clock_rate in
! ticks/second. Time is clock_ticks/clock_rate

    Implicit NONE

    Real(DP) :: time_now

    Integer(INT64) :: counts
    Real(DP)       :: count_rate
    Integer(INT64) :: count_max

    Call SYSTEM_CLOCK(COUNT=counts, COUNT_RATE=count_rate,            &
                      COUNT_MAX=count_max)

    time_now = REAL(counts,DP)/count_rate

  End Function getSysTime


1 Like

Sorry, been blind(*). Found GetTime:

    subroutine GetTime( delta, previous)
!****************************************************************************
        implicit none
        real (double) :: delta
        real (double), optional, intent(in) :: previous
        integer :: time

        call system_clock(time)

        if ( present(previous) ) then
            delta = time/10000.0 - previous
        else
            delta = time/10000.0
        endif

    end subroutine GetTime

But system_clock docs leave the task of unravelling all the hw dependencies to the user. So now I have to sort out that.

(*) I use Sublime Text, and I had the .gitignore filter on. :frowning:

Thank you all, this is all very helpful.

This use of SYSTEM_CLOCK is not robust and would probably only work on the processor where it was developed.
It assumes rate = 10000, which is only valid for the processor being used and time as integer*4

I would recommend an adaptation of the post by @rwmsu

  Function getSysTime() Result(time_now)

    USE ISO_FORTRAN_ENV, ONLY: DP=>REAL64, INT64

! Provides a wrapper function around Fortran intrinsic SYSTEM_CLOCK subroutine
! SYSTEM_CLOCK returns current number of clock_ticks and clock_rate in
! ticks/second. Time is clock_ticks/clock_rate

    Implicit NONE

    Real(DP) :: time_now

    Integer(INT64) :: counts
    Integer(INT64) :: count_rate

    Call SYSTEM_CLOCK (COUNT=counts, COUNT_RATE=count_rate)

    time_now = REAL(counts,DP) / REAL(count_rate,DP)

  End Function getSysTime
1 Like

Now the timings make sense. With openblas:

 Started =========================                                                                                                                                   
       63070       63070     4223032                                                                                                                                 
 Done reading =========================                                                                                                                              
 in SparseFactor    6.43736601       320.629852                                                                                                                      
 ---------------------------- Timing Information ----------------------------                                                                                        
                Time for Ordering                           0.350                                                                                                    
                Time for symbolic factorization             0.658                                                                                                    
                Time for matrix input                       0.969                                                                                                    
                Time for factorization                     39.308                                                                                                    
                Time for forward/back solve                 0.197                                                                                                    
 ----------------------------------------------------------------------------                                                                                        
                                                                                                                                                                     
real    0m46.363s                                                                                                                                                    
user    2m46.857s                                                                                                                                                    
sys     2m35.386s  

And with the system blas/lapack:

 Started =========================                                                                                                                                   
       63070       63070     4223032                                                                                                                                 
 Done reading =========================                                                                                                                              
 in SparseFactor    6.48510504       349.577545                                                                                                                      
 ---------------------------- Timing Information ----------------------------                                                                                        
                Time for Ordering                           0.396                                                                                                    
                Time for symbolic factorization             0.660                                                                                                    
                Time for matrix input                       0.968                                                                                                    
                Time for factorization                     42.900                                                                                                    
                Time for forward/back solve                 0.264                                                                                                    
 ----------------------------------------------------------------------------                                                                                        
                                                                                                                                                                     
real    0m50.018s                                                                                                                                                    
user    2m57.461s                                                                                                                                                    
sys     2m54.265s 

So, not a huge difference.

I have no idea why in the OP the factorization time was reported as 20.921. Compensating for the
wrong GetTime with a factor of ten, this would correspond to 209.21 seconds of factorization!

Now we are down to roughly 40 seconds for the numerical factorization. So the Julia rewrite is approximately 5 seconds faster. And, UMFPACK is approximately 16 seconds faster.

Interesting. I did a small test on my system (Win10) factorizing a 10000x10000 random dense matrix (generated using DLATMR) with the latest OpenBLAS, explicitly setting the number of threads.
The timed portion (using @rwmsu’s function) looks like

  call cpu_time(ct1)
  st1=getSysTime()
  call dgetrf( M, N, AA, LDA, IPIVOT, INFO )
  CALL cpu_time(ct2)
  st2= getSysTime()
  print'(A,e12.4)'," CPU_TIME     : ",ct2-ct1
  print'(A,e12.4)'," SYSTEM_CLOCK : ",st2-st1
$>set OMP_NUM_THREADS=1
$>dmatgen.exe
 CPU_TIME     :   0.1189E+02
 SYSTEM_CLOCK :   0.1188E+02
$>set OMP_NUM_THREADS=8
$>dmatgen.exe
 CPU_TIME     :   0.2525E+02
 SYSTEM_CLOCK :   0.3762E+01

which is consistent.

I further compared to python(scipy) and julia. Python version took 5.4s, where as the Julia version took about 16.4s.

Python Version
import numpy as np
import struct
import scipy.linalg as la
import time
with open("A.dat", 'rb') as f:
    data=f.read()

N,= struct.unpack('i', data[:4])
A= np.frombuffer(data[4:4+8*N*N], dtype=np.float64)
A= A.reshape((N,N))
B= np.frombuffer(data[8*N*N+4:], dtype=np.float64)
print(f"{N=}")
# print(A)
# print(B)

t1= time.time()
lu,piv= la.lu_factor(A)
t2=time.time()-t1
print(f"{t2=}")
x= la.lu_solve((lu,piv), B)
print(x[:10])
print("|B-AX|==0 ? :",np.allclose(A@x,B))
Julia v1.7.3
using LinearAlgebra, Statistics;
print("BLAS thread count: ",BLAS.get_num_threads(),'\n') # is 8
io=open("A.dat","r")
N=read(io,Int32)
A=Array{Float64}(undef,N,N)
B=Array{Float64}(undef,N,1)
read!(io,A)
read!(io,B)
print(N,"\n")

t0=time()
L,U,P=lu(A)
t1=time()

X=U\(L\B[P])

print(t1-t0, '\n')
print(X[1:10], '\n')
print(sqrt(mean((B-A*X).^2)))

Not sure where that leaves us haha.
edited at add: My installation of Julia has some problem so the julia times are not representative.

edit 2: After fixing my Julia install, Julia takes about 4.3s

1 Like