MNIST: problem finding the data file

I have downloaded the program from GitHub: modern-fortran/neural-fortran, a parallel neural net microframework.

I now have everything but MNIST running in VS 2019. I can program in Linux, but I choose not to; there are a number of interesting problems with Linux and some of our experimental equipment.

MNIST runs, but it is looking for a file in a folder called files, which, from a thorough search, does not appear in the GitHub repository. I can download a copy of the MNIST file from other places, but I am not sure that the others are correct.

Would it be possible to get the file put back into GitHub, or to obtain a copy of the original one designed for this program?

This problem has been mentioned on the Intel Fortran Forum, and Fortran Fan suggested I ask here.

Thanks
John

See the related issue:

You can also open an issue in the neural-fortran GitHub repository to contact the authors directly (cc @milancurcic).

@macneacail ,

Welcome to this Discourse. I think you have come to the right place to follow up on your interest in that machine learning example whose lead author has already been alerted in the previous post.

For any other readers, here is the thread by OP at the Intel Fortran forum:

Hi John, thank you for trying neural-fortran and thanks @ivanpribec and @FortranFan for helping out.

The MNIST data files used to be included with the code until recently, and are now downloaded automatically by load_mnist if they’re not present in the working directory. As you have already found, this depends on curl, which is not present on Windows.

As a workaround, download mnist.tar.gz directly and unpack it in the directory in which you intend to run the program that calls load_mnist. This will bypass the subprocess call to curl.
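
Roughly, the usage then looks like this (a minimal sketch; the module name and argument list here are not guaranteed to match the current sources, so check the MNIST example program in the repository for the exact interface):

    ! Minimal sketch -- the module name and argument names below are
    ! assumptions; the MNIST example program in the repository is the
    ! authoritative reference.
    program mnist_load_sketch
      use nf_datasets_mnist, only: load_mnist
      implicit none
      real, allocatable :: training_images(:,:), training_labels(:)
      real, allocatable :: validation_images(:,:), validation_labels(:)
      ! With mnist.tar.gz already unpacked in the working directory,
      ! load_mnist reads the files directly and never shells out to curl.
      call load_mnist(training_images, training_labels, &
                      validation_images, validation_labels)
      print *, 'Training samples:  ', size(training_labels)
      print *, 'Validation samples:', size(validation_labels)
    end program mnist_load_sketch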

Please let me know how it goes and if any other questions come up. I’m especially interested in how the Windows user experience could be improved, and what missing features you would like to see in neural-fortran.

I just added a note to the README with the above instruction on downloading MNIST directly:

I have the program running in Windows. As I want to play with the program, I have removed the coarrays; I will have to learn how to upload the program to GitHub.

[image: smpl001]

It is very easy to produce a bitmap in Fortran directly, but harder to do a JPG. Any chance we could upload BMP files?

This may or may not be the first image in the training set. It is rough; I still have to map the full scale into the colour range. At the moment blue is < 0.5 and red is above that.
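
In case anyone wants to do the same, the image dump boils down to something like the sketch below. It writes a plain-text PGM rather than a true BMP, since that takes only a few lines of Fortran, and it assumes the sample is a 28x28 array of reals scaled 0 to 1 (both assumptions on my part):

    ! Rough sketch: dump one sample to a plain-text PGM file, which most
    ! image viewers open just like a BMP.  Assumes a 28x28 array of reals
    ! scaled 0..1; adjust if your copy is flattened (784 values) or
    ! scaled 0..255.
    subroutine write_pgm(filename, image)
      implicit none
      character(*), intent(in) :: filename
      real, intent(in) :: image(28,28)
      integer :: u, i, j
      open(newunit=u, file=filename, status='replace', action='write')
      write(u,'(a)') 'P2'              ! plain (ASCII) graymap header
      write(u,'(i0,1x,i0)') 28, 28     ! width, height
      write(u,'(i0)') 255              ! maximum gray value
      do j = 1, 28                     ! reverse this loop if the digit comes out upside down
        write(u,'(14i4)') (nint(255.0*image(i,j)), i = 1, 28)
      end do
      close(u)
    end subroutine write_pgm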

I want to understand why you get a 95% acceptance rate, and to do that I need the images.

Do you have a picture of your training set? I just want to make sure I get them the right way round.

I am looking to do vehicles and people and bikes etc on roads.

I was looking at a program in C#. Using small European cars, three of which will fit on a 9-metre bridge, the results said they were 50% cats.

JMN

I think the "cats" result comes from the very small wheels, the low clearance, and the shape compared to the US cars used in the training sets.

My picture is upside down. It is always good to check; I found the start of the sample drawn on the web. Now I understand your method; it is an easy plot.

Did you put the dat files together or are they from someone else?

[image: Screenshot 2022-05-17 175125]

If I recall correctly, and I can’t find the code now, I

  • Re-formatted the data to be easily readable with direct access;
  • Split the dataset across files into 50K samples for training, 10K for validation (the set you use while tuning the network), and 10K for testing (the set you use after tuning the network).

The canonical source has them in the IDX format and organized in 60K and 10K sets for training and testing, respectively.

Obviously, in your experiments, you can choose any sample sizes for training/validation/testing that you want.
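
In practice that means a sample can be read straight out of a file by its record number, roughly as in the sketch below; the file name and the record layout (one 28x28 real image per record) are guesses for illustration, not a description of the actual files:

    ! Sketch only: read the n-th image from one of the direct-access .dat
    ! files.  The file name and record layout here are assumptions.
    program read_one_sample
      implicit none
      integer, parameter :: npixels = 28*28
      real :: image(npixels)
      integer :: u, n, recl_img
      n = 1                               ! which sample to read
      inquire(iolength=recl_img) image    ! record length in file storage units
      open(newunit=u, file='mnist_training_images.dat', access='direct', &
           form='unformatted', recl=recl_img, status='old', action='read')
      read(u, rec=n) image
      close(u)
      print '(a,i0,a,f6.3)', 'sample ', n, ', max pixel = ', maxval(image)
    end program read_one_sample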

I used your data sets. I have just used a simple Fortran bitmap maker to look at the images. Today, if I am lucky, I will probably get all 50,000 and 10,000 onto a few large bitmaps.

My underlying problem relates to determining the deflection of a bridge under moving loads; I want to be able, at high speed, to determine the type of vehicle that crossed a bridge. We match the data to accelerometer data that is super sensitive, so we can measure the frequency response of the Pont du Gard to a bus load on the bridge, or bicycles passing over a normal bridge.

The fundamental question relates to cost recovery on use of highways with electric cars.

The people I work with want to use Python; I think that Fortran will give me the speed I need. So think of a German Autobahn: at 150 km/hr one has an interesting challenge determining the vehicles in real time. You have a NUC computer and you have, say, 0.25 seconds of computer time to match the vehicle. (You can monitor a bridge and determine a good traffic count, peak acceleration and deflection with an RPi B in real time. You do, however, have real problems with Linux and some experimental devices; there is a fundamental flaw in Linux, which is why, after trying Linux for years and having expert Linux people say it is solvable, and knowing it is not, I use Windows.)

There are three good road research groups in the world, and these people regard this as a long-term pipe dream. I think it is reasonably doable, and I prefer to do it in Fortran; hence your software. If you want to be on the bleeding edge, you surf in Portugal and die; same in Fortran.

So at the moment I am trying to understand what you do, how you do it, and how to adapt it to photos of vehicles rather than handwritten digits.

The computer systems that will monitor driverless vehicles for the road authorities will be challenging; they raise some real ethical issues that most traffic engineers do not understand.

I hope this makes sense. Sorry for the ramble, but I want you to understand I am looking at a real world use - soon.

John

It is better when you can see the data. This is the first 5,000 digits; some of them are almost unrecognizable. There are actually legal experts who read unreadable writing.

The sampling data used for the validation has some interesting features: the data appears to have some runs of duplicate numbers, and also some samples from one user who did some funny sequences.

See the sample; was that intended? There also appear to be some samples done with an HB pencil, as most are, and some done with a non-HB pencil; was that also intended?

I haven’t noticed that before, and I haven’t looked at the samples at the level you’ve shown in this thread. 🙂

Here’s some background on the MNIST database, including the benchmarks:

I understand; I just want to be sure before I assume anything. For the level of detail, blame my Chem 1 tutor at the Australian National University in 1975 and Heber Sugo at Newcastle Uni.

To paraphrase an old saying: no one trusts a theoretical boffin except the boffin, and everyone trusts experimental data except the experimenter.

To quote a recent exchange, “Is it stable?”

Me: We did a double integration at 2000 time steps per second; it is stable to about 10 nm for short durations. Is that stable enough?

Person: Is a nm like a foot?

As Winnie the Pooh said, all honey is not created equal.

I have looked at the statistics for the 50,000-sample data set; it is heavy on the number 1 and light on the number 7, to the point of being questionable statistically.

It will be interesting to look at the results. If you are going to mix up two numbers, 1 and 7 seem like a likely pair.

Getting there.

The statistics of the 10,000-sample validation set are reversed; 1 and 7 are both well represented, over-represented really.

If they are drawn from the same data, this is really strange.
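
The tally itself is nothing fancy; it boils down to something like the following sketch, assuming the labels come back as an array of reals holding the digit values 0 to 9 (adjust if your copy is one-hot encoded):

    ! Sketch: count how often each digit appears in a label array.  Assumes
    ! labels(:) holds the digit values 0..9 as reals; if the labels are
    ! one-hot encoded, convert each column with maxloc first.
    subroutine digit_histogram(labels)
      implicit none
      real, intent(in) :: labels(:)
      integer :: counts(0:9), i, d
      counts = 0
      do i = 1, size(labels)
        d = nint(labels(i))
        if (d >= 0 .and. d <= 9) counts(d) = counts(d) + 1
      end do
      do d = 0, 9
        print '(a,i1,a,i6)', 'digit ', d, ': ', counts(d)
      end do
    end subroutine digit_histogram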

module subroutine train(self, input_data, output_data, batch_size, &
                        epochs, optimizer)
      use Base
      class(network), intent(in out) :: self
      real, intent(in) :: input_data(:,:)
      real, intent(in) :: output_data(:,:)
      integer, intent(in) :: batch_size
      integer, intent(in) :: epochs
      type(sgd), intent(in) :: optimizer

      real :: pos
      integer :: dataset_size
      integer :: batch_start, batch_end
      integer :: i, j, n, k
      integer :: istart, iend, indices(2)

      dataset_size = size(output_data, dim=2)

      write(*,100)
  100 format("               Starting the training")

      ! Added: re-seed the generator so each run draws a different sequence
      call random_seed()
For this to work in Intel Fortran you need to add a call to random_seed to the code; otherwise you just repeat the same numbers on every run. random_number appears to start from a fixed default seed if you never call random_seed.

Calling RANDOM_SEED with no arguments is an implementation-specific way of getting nonrepeatable sequences - the standard does not specify this behavior, and various compilers handle it differently. To address this, Fortran 2018 added RANDOM_INIT. This takes two arguments, both LOGICAL type. The first is REPEATABLE - you give this as .TRUE. if you want the same sequence each time, .FALSE. if you want different sequences. The second argument is IMAGE_DISTINCT - this controls whether coarray images share a single sequence or each have their own sequence. Both arguments are required. RANDOM_INIT(.FALSE.,.FALSE.) is what you want here. (Intel Fortran supports this.)
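
In code, that is simply:

    ! Fortran 2018: ask for a different random sequence on each run.
    ! (Supported by Intel Fortran; check your compiler's documentation
    ! for other vendors.)
    program seed_demo
      implicit none
      real :: r(3)
      call random_init(repeatable=.false., image_distinct=.false.)
      call random_number(r)
      print *, r    ! differs from run to run
    end program seed_demo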

You’re correct, but it’s inconsequential here. At least with my ifort (2021.5.0), subsequent calls to random_number return different values, which is needed for proper selection of mini-batches.

That different invocations of the program are deterministic should be fine if the model and data are well defined, i.e. if the training is not prone to getting trapped in local minima. Nevertheless, I’ll open an issue to use random_init so that the behavior is consistent between compilers, as suggested by @sblionel (thanks!).

@milancurcic it’s not "subsequent calls to random_number" that are the issue, but whether on each run of the program do you get the same or a different sequence? Intel Fortran will give you a fixed sequence on each run unless you take steps to “randomize” the seed.