Undetected format error

Well, there is still a problem. The optimization pass just moved it around.

Since you are using gfortran: you mentioned that compiling with -Wall -pedantic produces ‘several 1000 messages’. Reviewing those is a place to start…

Also try compiling with the -fcheck=all option to get array bounds checking and such during run-time.

Finally, running under a tool like valgrind is often enlightening.

The -fPIC option is needed when building shared libraries.

As I have written it was nice to have some advice but I am not able to handle several compilers.
I tried some -Wall checks but evidently those messages are not always correct either; it
flagged some variables as unused that were in fact used.

I remember I used valgrind some years ago but it was complicated and I do not think I can handle it now.
I have compiled with -fcheck=all and that reports no problems.

If someone is interested in finding bugs in the fortran -O2 option he/she is welcome to download my source code
and just run examples/macros/step1.OCM twice to see some nice diagram before it crashes.

Right now I have a severe problem restoring the original version of OC without all the debug info I added.

If you are still using fixed-form source and/or not using implicit none, ‘unused variables’ might indicate typos.

If you have valgrind installed on your system, instead of typing:
./a.out
simply use:
valgrind ./a.out
Though you might want to redirect the valgrind output to a separate log file:
valgrind ./a.out 2>valout

In that case, since you mentioned that the issue is present with both -O1 and -O2, then it’s probably caused by -O1 (since -O2 implements all of -O1).

You can check the GCC documentation for the specific flags you can tweak (so you can apply them all explicitly and play with them until your code breaks).

You can turn off the warning for unused arguments, if that helps: -Wall -Wno-unused-dummy-argument.

I have tried -Wuninitialized on the files concerned but that gave nothing. I assume that if I have allocated a matrix or array then there is no check if all elements have been assigned a value. I believe that is my problem.
I remember I used valgrind to get rid of memory leaks some 10 years ago but I have no such problems now. I am not sure valgrind can help if the problem is that after several 1000 successful minimizations suddenly one fails, probably because a record in an allocated variable has some value which has not been updated.
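
For what it is worth, here is a minimal sketch of one way to catch elements that were never updated: fill the array with NaN right after allocating it, so a stale element stands out when it is read. As far as I know, gfortran's -finit-real option does not cover allocatable arrays, which is why the fill is done by hand here. The program, the array name and the sizes are made up for illustration and are not taken from OC.

```fortran
program nan_fill_demo
  ! Hypothetical example: none of these names come from the OC sources.
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, ieee_is_nan
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), allocatable :: work(:)
  integer :: i

  allocate(work(5))
  ! Poison every element so that a forgotten assignment is visible later.
  work = ieee_value(1.0_dp, ieee_quiet_nan)

  ! Simulate an update loop that misses the last element.
  do i = 1, 4
     work(i) = real(i, dp)
  end do

  ! Report every element that still holds the poison value.
  do i = 1, size(work)
     if (ieee_is_nan(work(i))) write(*,'(a,i0,a)') 'work(', i, ') was never assigned'
  end do
end program nan_fill_demo
```

The same check can be placed just before a minimization starts, so that a failing case points at the record that was not refreshed rather than at the convergence test.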

I too have found it very effective to convert old F77 “memory dumps to file” into TYPE structures whose components are allocatable integer and real arrays (but I have not used variable length character strings).
You can then store the data in memory in large indexed data structures or write them to an indexed file structure. When writing these data records to file, I would recommend using multiple standard conforming writes, one for each kind of array, rather than equivalenced memory dumps as in F77. This avoids compiler-specific problems (e.g. gfortran real*10 arrays).
The old F77 equivalence approaches for mapping memory can also be corrupted by -O2 or -O3 compiler efficiencies, so it is best to eliminate these old approaches.

Stream access files with 8-byte integer addresses can store a surprisingly large amount of data, as can multiple-record TYPE structures held in memory, now that 128+ GBytes is available!
Keep the data indexing clear and simple and standard conforming.
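
A minimal sketch of that approach, assuming a record with one integer and one real array: the type name record_t, its components and the file name records.bin are all invented for the example. Each array is written with its own standard conforming WRITE, preceded by the sizes needed to read it back.

```fortran
program record_io_demo
  ! Hypothetical record layout; not taken from any existing code.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  type :: record_t
     integer,  allocatable :: idata(:)
     real(dp), allocatable :: rdata(:)
  end type record_t
  type(record_t) :: a, b
  integer :: u, ni, nr

  ! Allocation on assignment (Fortran 2003).
  a%idata = [1, 2, 3]
  a%rdata = [1.5_dp, 2.5_dp]

  ! Write the sizes first, then one WRITE per kind of array.
  open(newunit=u, file='records.bin', access='stream', form='unformatted', &
       status='replace')
  write(u) size(a%idata), size(a%rdata)
  write(u) a%idata
  write(u) a%rdata
  close(u)

  ! Read back in the same order, allocating from the stored sizes.
  open(newunit=u, file='records.bin', access='stream', form='unformatted', &
       status='old')
  read(u) ni, nr
  allocate(b%idata(ni), b%rdata(nr))
  read(u) b%idata
  read(u) b%rdata
  close(u)

  ! Both comparisons should be true if the round trip worked.
  print *, all(a%idata == b%idata), all(a%rdata == b%rdata)
end program record_io_demo
```

Because no EQUIVALENCE is involved, the compiler never has to guess how integers and reals overlap in memory, and the file layout is fixed by the order of the WRITE statements alone.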

Thanks for the encouragement to use old-fashioned methods. My view of computers is probably very primitive and I consider all this new stuff of records and pointers to be adding superstructures on top of the very basic sequential memory. The big difference is the size of easily accessible memory, no need to mount a tape or change disk to access stuff done a month or year ago. The basic part of AI is fast access to a huge memory.

In fact, the Fortran language definition entirely abstracts away the memory organisation. Your program relies on a particular mapping of Fortran entities to memory, and consequently became dependent on a particular set of compilers that implement the abstract entities in the same way. In particular, I see in metlib4.F90 an EQUIVALENCE between integer and character variables. That is an error in Fortran, but compilers frequently just went ahead and produced code for it. This was a “convenience”, which someone must now pay for. Fortran is like a steel rod, maybe with spikes on it. The “convenience” extensions turn it into string and string has a tendency to end up in tangles and knots. The reward for turning string back into steel is that the program will become useful for decades, and, crucially, when there comes a time when a particular feature is removed from the language and a rewrite is needed in terms of new features, it will be a process that has been studied, is straightforward to implement, and is proven to work.
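
As an illustration only, one standard conforming way to keep character data in an integer workspace without EQUIVALENCE is the TRANSFER intrinsic. The sketch below is not how metlib does it; the variable names and the 8-character string are assumptions made for the example.

```fortran
program transfer_demo
  ! Hypothetical names; shows TRANSFER as a replacement for an
  ! integer/character EQUIVALENCE.
  implicit none
  character(len=8) :: name, back
  integer, allocatable :: iws(:)

  name = 'LIQUID'

  ! Copy the bit pattern of the string into as many default integers
  ! as are needed to hold it; [0] is only a mold giving type and rank.
  iws = transfer(name, [0])

  ! Recover the string from the integer workspace.
  back = transfer(iws, back)

  print *, 'integers used: ', size(iws)
  print *, 'round trip ok: ', back == name
end program transfer_demo
```

The copy is explicit, so the optimizer sees a normal assignment instead of two differently typed names for the same storage.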

The metlib library is the old F77 code I wrote around 1980. It includes a lot of F77 specials and was written on a NORD-10 with 64 kB physical memory, although code and data had separate 64 kB address spaces. The routines with the EQUIVALENCE statement were used for storing data in the integer workspace, which was needed because there were no TYPE definitions in F77. I used metlib when writing OC because I had to get a user interface up and running very quickly and in metlib there is a lot of code handling user input.

I remember we had a special problem because on the NORD-10 a double precision real was 48 bits which saved a lot of memory. We used a computer dependent variable NWPR for the Number of Words per Real to make the code run on other hardware.

One avenue to explore is to factor your computation into a purely Fortran part that gets its input from a prepared formatted (human-readable) file and writes its output similarly, and a part that helps the user prepare such a file and possibly helps in interpreting the output graphically/visually. Some Fortran compilers give you very powerful analysis tools only if the entire program is visible to the compiler.

I have not been able to reproduce your failure mode so far. Perhaps if you generated a log of the commands you entered it would be useful as a reproducer for others. As mentioned above, I think some of the debug compiler flags and valgrind output are a valuable way to go. Some of the warnings can be deceptive, as I can see you skip the conditions the compiler is warning about in some of these loops, but I did not determine that for all of them. The valgrind(1) output looks promising. At its simplest, just run

script output.log
valgrind $EXECUTABLE
# enter input until failure
exit

and look in the output file. There are more sophisticated ways to run valgrind, but if you just look for lines in the output file that start with “==” you can see traces for several warnings about uninitialized variables that could well be the root of your problem.

Some array-bound warnings that should be checked to confirm they are OK:

././src/minimizer/matsmin.F90:820:34: Warning: Array reference at (1) out of bounds (0 < 1) in loop
  806 |           do icc=1,mostcon
  820 |                       mostconph(1,icc-1)=nvf
././src/minimizer/matsmin.F90:821:34: Warning: Array reference at (1) out of bounds (0 < 1) in loop 
  806 |           do icc=1,mostcon
  821 |                       mostconph(2,icc-1)=iph
././src/models/gtp3B.FINC:7313:42: Warning: Array reference at (1) out of bounds (10 > 9) in loop 
 7311 |          do kp=1,ncol
 7312 |             if(colvar(dcom,kp)%column.eq.0) then
 7313 |                if(kp.lt.ncol) colvar(dcom,kp+1)%column=0
././src/models/gtp3B.FINC:6486:40: Warning: Array reference at (1) out of bounds (8 > 4) 
 6482 |          do ls=1,8
 6486 |             intlinks(1,incperm)=prmint4(ls,lq)
././src/models/gtp3B.FINC:6500:47: Warning: Array reference at (1) out of bounds (8 > 4) in loop 
 6497 |          do ls=1,8
 6500 |                   call findconst(lokph,prmint4(ls,lq),jord(2,1),cix)
././src/models/gtp3B.FINC:6503:46: Warning: Array reference at (1) out of bounds (8 > 4) in loop
 6497 |          do ls=1,8
 6503 |                   intlinks(1,incperm)=prmint4(ls,lq)
././src/numlib/oclablas.F90:14455:28: Warning: Array reference at (1) out of bounds (0 < 1) in loop
14450 |       DO 70 I = 0, SPM1
14455 |             SUBMAT = IWORK( I ) + 1
././src/numlib/oclablas.F90:14456:43: Warning: Array reference at (1) out of bounds (0 < 1) in loop 
14450 |       DO 70 I = 0, SPM1
14456 |             MATSIZ = IWORK( I+1 ) - IWORK( I )
././src/numlib/oclablas.F90:14498:31: Warning: Array reference at (1) out of bounds (0 < 1) in loop 
14491 |          DO 90 I = 0, SPM2, 2
14498 |                SUBMAT = IWORK( I ) + 1
././src/numlib/oclablas.F90:14499:46: Warning: Array reference at (1) out of bounds (0 < 1) in loop 
14491 |          DO 90 I = 0, SPM2, 2
14499 |                MATSIZ = IWORK( I+2 ) - IWORK( I )

Note that the installation PDF says to install ochelp.txt, but it looks like it means ochelp.htm.

It looks harmless, but why is lph being set in this routine? It looks like a no-op but perhaps some variable with scope outside the procedure is supposed to be set?

src/models/gtp3EY.FINC
@@ -3282,10 +3282,11 @@
     bigloop: do while(.true.)
        lpp=lpp+1
        if(lpp.gt.lenph) then
-! this is lenght of provide phase and we have match up to this position, accept
+! this is length of provide phase and we have match up to this position, accept
 ! normally a phase name ends with a space but with allocated characters ...
 !          write(*,*)'Is "',phasename,'" same as "',selph(lp)%phasename,'"?'
-          lph=lp; goto 1000
+          !APPEARS TO BE NOOP IF LPH NOT SOMETHING GLOBAL! lph=lp;
+	  goto 1000
        endif
        chp=phasename(lpp:lpp)
        lpx=lpx+1

I would not give short shrift to valgrind and the compiler warnings. They look like they would be fruitful to wade through given the description of the error you are experiencing. How reproducible is it? Does it occur at the same place every time given the same input?

Wow, thanks a lot. I will have a look at the loops you indicate.

The error occurs during the STEP calculation if you run the macro file

examples/macros/step1.OCM

twice, but may occur at different places and sometimes one has to run it three times. The program does not crash, it just reports a convergence error for a calculation that worked the first time.
You can skip all the plots and just terminate the macro with a “set inter” after the first step command. You need to have the database file steel1.TDB in the same directory.

The routines in the oclablas.F90 file were extracted from LAPACK and BLAS some 10 years ago, so maybe there are later updates. Originally I used some homemade routines for inverting matrices but switching cut the CPU time by more than 25%. I will check if I understand the code.

The code in ges5EY.F90 is not involved in this calculation. It is part of the new code I am working on to read XML files for the databases, since the current TDB database format cannot handle some new features that will be added. And lph is indeed a global variable.

Bosse

Dear Urban

I would like to thank you very much for the getkey routine which is used in OC to handle user input on unix.

I have looked at all the valgrind warnings, and those in oclablas are no problem as there is an IF
statement inside the loop which ensures the index is >0.

In the matsmin.F90 file there is also an IF statement which should ensure icc-1 is >0.
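
The guarded pattern can be sketched as below. This is a made-up stand-in, not the code from matsmin.F90 or oclablas.F90. I believe the “in loop” messages come from gfortran's -Wdo-subscript option, which is documented as warning even when the compiler cannot prove that the statement is actually executed, so a warning of this kind does not by itself mean an out-of-bounds access happens.

```fortran
program guarded_index_demo
  ! Hypothetical loop: the subscript icc-1 would be 0 on the first
  ! pass, but the IF keeps that pass from touching the array.
  implicit none
  integer, parameter :: n = 4
  integer :: a(n), icc

  a = 0
  do icc = 1, n
     ! Only executed for icc = 2..n, so icc-1 stays within 1..n-1.
     if (icc > 1) a(icc-1) = icc
  end do

  print *, a
end program guarded_index_demo
```

Running such a loop with -fcheck=bounds is a cheap way to confirm the guard really covers every pass, since the run-time check only fires on accesses that are actually made.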

For the gtp3B.F90 file the code involved I no longer understand. It is a complex case where I have used shape/reshape to create a matrix which is irregular. I did understand this shape/reshape once but now I have completely forgotten how it works. Anyway this code is particular for phases with order/disorder transformations and those are not involved in the STEP1 macro.

Bosse

Thanks for the clarification on how to reproduce the problem, and taking the time to double-check those bounds checks so I know those are not smacking memory. They would have been a red herring to me. I will try it again soon per your instructions and with ifx as well as gfortran. I turned on the gfortran -fimplicit-none flag and got an error about lph not being defined. I will check I did not break something, as I made a copy to play with that I converted to use fpm, but it appears to be running OK. For lack of anything better I had made a script to run all the macros with a NEW between them and changed NEW to not prompt to make that easier but other than that made no intentional changes.

I see you made an interesting command history editor using getkey. It might be interesting as a package all by itself under an open license. That getkey procedure has an odd history. It was first needed as part of a graphics driver for a Tektronix 4010 raster terminal driver, which continued being used with the xterm built-in 4010 emulator in circumstances where simple plots were made and X11 was not possible over the available networks; and then later for reading raw keys in ANSI terminals, so it is nice to know someone put it to good use in current applications. I wish something similar had become standard with the introduction of streams to Fortran; but standard READ does not even read streams from stdin, precluding the use of Fortran for binary filters without calling C, so it did not even get close on that one.

Found a few very minor uses of extensions that might someday cause an issue similar to the one in the original post; I will post them to the github page. Given the vintage of some of the code it appears to be very clean, so nothing obvious has turned up so far. Just a probably non-standard continuation of a string and such. Very few warnings from the compilers for the size and history of the code so far.