How much memory is my logical array using?

It depends very much on what your algorithm is doing, so in short I would advise you to run your own benchmarks, varying the number of bits from a small count up to one that is no longer feasible with both methods (the results don't need to have physical meaning while you are doing this).

If your algorithm relies on or exploits batch operations over the bits (XOR, FLIP/NOT, AND, ...), I don't see why it would be any less efficient. In fact it should be faster, since there are fewer array elements for your CPU to process at once (meaning more bits processed per step).
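For instance (a rough sketch of my own, with the packing into default 32-bit integers and all names made up), a single elemental intrinsic call touches 32 spins per array element:

program batch_ops_demo
   use iso_fortran_env, only: int32
   implicit none
   integer(int32) :: spins(1000), mask(1000)
   spins = 0
   mask  = -1                       ! all bits set
   spins = not(spins)               ! FLIP/NOT: 32 spins per element in one go
   spins = ieor(spins, mask)        ! XOR against a mask, again 32 spins at a time
   spins = iand(spins, mask)        ! AND, same idea
   print *, popcnt(spins(1))        ! how many "up" spins in the first word
end program batch_ops_demo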

Otherwise, the bit-extraction time will only matter if a function-call overhead is attached to each bit, because the number of spins stays the same.
I don't see why it would be any slower, since extraction is a cheap operation, and neighboring bits would be cache hits (in case you are doing spin "frustration" or "relaxation" methods).

But again, I suggest you benchmark if you’re still in doubt.

PS: Also be careful about deallocations you may be forgetting. Even though Fortran deallocates some arrays automatically, since the code is crashing I assume there is some memory that should be freed manually but is not, and the OS has to kill the process. It is worth looking into tools like valgrind, or using flags like "-fsanitize=address" with GCC.

3 Likes

I tend to agree that the bitwise approach is worth a try; I actually do not foresee that much need for unpacking and repacking with spin systems, and many things can be done with built-ins. This fairly standard reference by Sandvik gives a lot of practical guidance on dealing with spin systems by representing whole states as single integers. It already has snippets in Fortran 90 (in addition to pseudocode), so you'll have an easy time following the examples.

1 Like

I actually relied on that review some years ago for a PhD course assignment and implemented everything in MATLAB (which would have been a performance nightmare with arrays of logicals). I was very satisfied with the approach.

2 Likes

Several people in this thread have pointed out how to handle spin systems.
At least for nuclear physics, you may check the chapter of the book below:
Computational Nuclear Physics 1 | SpringerLink
Chapter 9, pages 171 to 187.

OP's perseverance in waiting several weeks on a PC for the results is appreciated. However, in real-world high-performance code and good publications in nuclear physics and condensed matter physics, as far as I know, spin-isospin systems are always handled in a binary representation, using bit operations to naturally mimic the behavior of the spin-isospin operators. If a logical array were the way to go, people would already use it. Also, I just realized OP's code does not use MPI, so race conditions are not a problem here. OP may consider using MPI to really parallelize the code; that would allow it to run easily on a much larger system and handle more difficult problems. Other people have given OP numerous suggestions and help, so be grateful.

Yes, for sure, statistical averaging over realizations is an embarrassingly parallel task, so it would clearly benefit from many more cores. You probably don't even need MPI; you could just send a lot of batch jobs to an HPC scheduler. But the original poster described his own hardware constraints and may well not have access to HPC resources, so that's it, I guess…


I don't want to get too deep into the diplomacy issues about tone, but maybe I can kindly suggest not taking it too hard and trying to "fly above" it. I also found that comment a little harsh, but the same could apply to the "lol" and ":rofl:" directed at OP's problems; such unnecessary additions can feel a bit mocking in the context of an answer on a forum. Communicating intentions on the web is hard and misunderstandings happen all the time, so we should all try to be understanding in that respect.

3 Likes

Update: I am running the code with “LOGICAL(KIND=1) :: VARIABLE_NAME”. I can already see that it takes about a quarter of the memory. That should make it faster. I will know for sure in about twenty four hours.

1 Like

In regard to parallelization, thanks for the suggestion. I’m not an expert but I did study the subject a bit, and I can say that my code cannot be parallelized. That is because of the physics. The best that can be achieved is to run several realizations on different cores.

Anyway, reading the whole thread more carefully, your case seems to be exactly the one in which a bitwise representation is necessary. Your problem is hard because of system size, which, if you encode the spins as an array of logicals, maps directly onto a memory problem.

You did not provide an explicit description of the operations you need to do with your spins, but most standard research work can be done without extracting and inserting the spins, so there is no overhead. You can of course do what you prefer with the advice given here, but I want to point out to other people reading the thread that the bitwise representation of spin models is the de facto standard in the field, as you can verify by inspecting the research reference linked above.

Maybe your specific problem needs special treatment, which I cannot judge with the given information, but I feel I need to state clearly that the consensus in condensed matter physics research, for the generic class of problems at hand, is to work with single integers as representations of extensive spin chains / nets / lattices.

Think about it, if you wish.

1 Like

Yes, sure, that is a trivial way to parallelize; you are already doing it manually on your machine (running many instances in different terminals). On large multicore systems intended for high-performance computing there are automated ways to do that, scaling to hundreds of parallel realizations, but if you do not have access to that kind of resource, what you are doing is the best you can get.

1 Like

Test result: LOGICAL(KIND=1) sped up the program by about 15%.

rba asked for some details about my spin computations. I pull three neighboring spins (a triplet) out of the array, I perform a test on them, and if the test is positive, I change their values and put them back in the array. I repeat this operation on randomly selected triplets. That is probably all you want to know, but for a bit more detail, the outcome of the test actually depends on the recent history of these tests. I can't imagine anyone wants to get into that here, though.

I will study up on the bit method. To be honest, after looking at the bitsets module suggested above, I am a bit intimidated. This appears to be much more complex than using an array of logicals. Nonetheless, I will work at it.

2 Likes

There is a way of coding a 2-valued 2d Ising Monte Carlo simulation in only one bit per spin, which I stumbled upon in 1985 and implemented on a 2MHz 6502 8-bit processor so that 8 independent simulations could be run in parallel, but I forget the details and the assembly source code is lost. Maybe you will rediscover it!

1 Like

Hello rgba, I have just started studying a bitwise approach. My apologies if I'm a bit slow on this. I have a basic initial question as I get started, which I hope you will consider.

When I look at the documentation for Fortran bit operations, they all operate on the bits of an integer. Given the large size of my spin array, I'm assuming I would store the spins in the bits of an integer array. Let's assume the integers are 32 bits. If I want to extract the q = 183464th spin (bit), I would need to determine which integer in the array contains the desired spin (bit), and then which bit within that integer it is. The point is, I would need to divide q by 32 and then find the remainder, which will be a bit computationally demanding. Is there a shortcut I'm missing? Thanks for any assistance.

Division by 32 and taking the remainder can both be done with bit operations: the division is i>>5 (since 32 = 2**5) and the modulo is i&31.

2 Likes

In Fortran, that would be shiftr(i,5) and iand(i,31). The intrinsic functions btest(), ibclr(), and ibset(), among others, are the functions to test, clear, and set individual bits of an integer. These should all compile to the same machine instructions, regardless of the language.
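Putting those together, a minimal sketch (my own, untested; the array size and the 0-based spin numbering are assumptions, the spin index is the one from your question) of reading and writing the q-th spin in a packed array could look like:

program spin_bit_demo
   use iso_fortran_env, only: int32
   implicit none
   integer, parameter :: nspins = 200000
   integer(int32) :: words((nspins + 31) / 32)
   integer :: q, w, b
   logical :: up

   words = 0
   q = 183464                       ! spin index, counted from 0
   w = shiftr(q, 5) + 1             ! which word: q / 32, plus 1 for the 1-based array
   b = iand(q, 31)                  ! which bit within that word: mod(q, 32)

   up = btest(words(w), b)          ! read the spin
   words(w) = ibset(words(w), b)    ! set it "up"
   words(w) = ibclr(words(w), b)    ! set it "down"
   print *, 'spin', q, 'was up?', up
end program spin_bit_demo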

However, as noted above by several posters, many operations would not be performed on individual bits but rather on the entire data structure. Intrinsics like iand(), ieor(), and so on work on both scalars and arrays.

You need not use 32-bit integers; you can use 8-bit integers too, by defining and using the appropriate kind parameter where necessary. An advantage of this is that if you need special operations beyond the standard bit operators provided by Fortran, you can define them through a lookup table that is 256 elements long. That is, instead of looping over the 8 bits, you do a single table lookup for the whole integer at once. That gives you more options regarding the implementation details.
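As a hypothetical illustration of the table idea (names made up; here the table merely counts the "up" bits in each byte, but any per-byte rule could be tabulated the same way):

program byte_table_demo
   use iso_fortran_env, only: int8
   implicit none
   integer       :: table(0:255), i, j, total
   integer(int8) :: words(1000)

   do i = 0, 255                    ! build the 256-entry table once
      table(i) = 0
      do j = 0, 7
         if (btest(i, j)) table(i) = table(i) + 1
      end do
   end do

   words = 0_int8                   ! ... fill with packed spins in the real code ...

   total = 0
   do i = 1, size(words)            ! one table lookup per byte instead of 8 bit tests
      total = total + table(iand(int(words(i)), 255))
   end do
   print *, 'up spins:', total
end program byte_table_demo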

4 Likes

To expound on this a bit: integer division by a power of 2 can be implemented by a right shift (and multiplication by 2 by a left shift), and modulo by a power of 2 can be implemented by AND-ing with (n-1). There are all sorts of fun ways to compute things using bitwise operations listed here: Bit Twiddling Hacks
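For instance, a quick self-contained check of those identities (my own, valid for nonnegative i and a power of two 2**k):

program pow2_demo
   implicit none
   integer :: i, k
   i = 183464
   k = 5                                        ! 2**5 = 32
   print *, i / 2**k,     shiftr(i, k)          ! both print 5733
   print *, mod(i, 2**k), iand(i, 2**k - 1)     ! both print 8
end program pow2_demo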

3 Likes

Some compilers already perform such bit-twiddling trickery for you where possible. This talk by Matt Godbolt shows an example with multiplication, starting near time point 18:38:

3 Likes

It will be very interesting to see what performance changes you get.

In the meantime, on your 16GB 8-core machine the logical arrays at one byte per value should take up around 12.4GB, but you have not mentioned what the size of the running program is, or whether its memory footprint has been increasing while running. Are you still running your tests with KIND=1?

If you can now run 8 instances instead of 6, with a 15% wall-clock reduction on top of that, that seems like a satisfying change. Do you have 8 real cores, or is that counting hyperthreading?

The bit mode still seems worth pursuing, but in theory you should now be able to run 8 instances simultaneously, provided the memory footprint does not change and the machine does not overheat. In that case you would no longer be memory-bound but processor-bound on your current platform, so a smaller memory footprint will only be a gain on that particular platform if it produces a wall-clock improvement.

Any further data from the 1-byte logical version? Long term, the bit methods would allow running larger problems, might produce a significant speed-up, and so on. I am definitely not saying not to pursue it. (It got me playing with creating a one-bit C structure; my compiler supports that, though others end up using a byte or more for the values anyway, and you are not guaranteed contiguous memory and so on. I also wrote some C support routines to access it via the ISO_C_BINDING interface; not sure it is going to work, but it is interesting. I tried manipulating it directly with Fortran pointers real quick and could not figure out a way to do it; I think that is probably not possible in standard Fortran, but I am still thinking about it.)

Not sure how many C compilers support stuff like

struct {unsigned int running:1;} onebit[1000];

as actual 1-bit values anyway, but I know it is not all of them. So I would not change your current direction, but the overall changes just from going to KIND=1 are interesting to hear about. It is an oddity of Fortran that recent standards require the default REAL, INTEGER, and LOGICAL to be the same storage size. I can think of good reasons why that is nice for REAL and INTEGER, but I am not sure why LOGICAL got thrown in there too.

So, just in case, you might also want to profile (with gfortran, gprof(1) should work) and see where you are spending time and whether there is anything there that can be optimized. With wall-clock times exceeding a week I would say it is at least worth looking, especially if you are already familiar with profiling.

I am really wondering what the memory footprint is with KIND=1, whether it increases significantly, and whether that was the reason you were locking up after a week. If you have the time, please post the number once you are running KIND=1 instances.

I am also curious whether you have tried other compilers. My general experience is that compilers like nvfortran can fly with floats but seem to have some real problems with LOGICAL, and that ifort/ifx is noticeably better with LOGICAL than gfortran. If you have them available, I find trying multiple compilers is almost always worth it, for a variety of reasons.

2 Likes

Thanks urbanjost for your interest. Yes, I have been running six jobs since Wednesday, three with kind=1 and three with the default, and the kind=1 jobs are about 15% faster. I have installed an activity monitor and a temperature monitor, and everything looks right. The only curiosity is that the activity monitor says I have sixteen cores? Also, the jobs bounce around between cores like spaghetti on the activity monitor, which doesn't seem efficient to me.

From here on out, my results are going to come in slow, I’m afraid. After these jobs are finished, I guess I’ll take a risk and run eight jobs (or maybe I should start with seven). Fingers crossed! I will keep you informed, but don’t expect updates for a couple weeks.

In the meantime, I am slowly studying up on the bit method. It will make the code much more dense, so I’m not enjoying this prospect. I’m going to start with a bit method approach to initializing a huge array of spins with random ups and downs. I’ll show it here soon to anyone who is still interested for comment.

Thanks to everyone for their help.

ps. The code is concise: 352 lines for everything, including some empty spacing lines. The kind=1 program uses 20 MB and the default uses 80 MB. I believe these numbers have not changed since I started. I have tried only gfortran.

The activity monitor does show a sort of contradiction. The processes screen says each job is taking 6% of the CPU, which I guess means total CPU. The resources screen shows several of the cores at 100% usage, but those numbers are constantly jumping around, as I already mentioned.

There are two things that would make it easier to program these kinds of applications. The one I mentioned before is an intrinsic bit data type in Fortran. The other would be for the Fortran RANDOM_NUMBER() intrinsic to take both real and integer arguments; the integer version would simply return random bits. The former can be mimicked pretty well with a derived type along with the necessary operators to set, clear, copy, etc. individual bits and substrings of bits. The latter can be done with any of the open-source or public-domain random number generators; random bit strings, stored in integers, are already used as the first step in computing random real numbers.
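For example, a simple (and certainly not the fastest) way to fill a packed spin array with random ups and downs using only the intrinsic RANDOM_NUMBER(), with made-up sizes and names, would be something like:

program random_spins
   use iso_fortran_env, only: int32
   implicit none
   integer, parameter :: nspins = 200000
   integer(int32) :: words((nspins + 31) / 32)
   integer :: w, b
   real :: r

   words = 0
   do w = 1, size(words)
      do b = 0, 31
         call random_number(r)
         if (r < 0.5) words(w) = ibset(words(w), b)   ! spin up with probability 1/2
      end do
   end do
   print *, 'first word of packed spins:', words(1)
end program random_spins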

1 Like

High context switching is a little surprising unless you are running other work on the machine. Very simplistically, what process demand does "uptime" show? Is it significantly above the number of cores you have?

Often, machines are configured to work well with interactive GUI use rather than for HPC work, so there are several things you can do to reduce it; but I suspect you have a number of other processes running (perhaps even the performance monitors you have installed), or you should not be seeing that.

Do you know how to measure the misses and what rates would be harmful? It sounds like you are familiar with context switching.

So if it is high, and your process demand is high, look for things you have running that you do not need, and close or kill those unneeded processes.

Everything after that has about fifty conditionals in front of it. So it depends, but…

Look at locking your program to a specific core with the taskset command (assuming the programs are not parallel). To lock something to a specific processor, even if the program has already started, get the PID of the process and enter

taskset --cpu-list 1 --pid $PID

to lock the process to CPU 1 (the numbering starts at 0).

Depending on a lot of things, you may want to keep your processes off processor 0; in the other scenario you want all your processes to use a CPU list that includes 0 and one other exclusively listed core…

As root, these are probably the most important kernel parameters; you can display them with "sysctl -a" and set them with commands like the following (just example values):

sysctl -w kernel.sched_latency_ns=18000000
sysctl -w kernel.sched_min_granularity_ns=2250000
sysctl -w kernel.sched_wakeup_granularity_ns=3000000

It is probably better to point to some references rather than try to write something on system tuning here; I will see if I can find something on the web. If anyone else has some good pointers, we might all like to see them.

Normally, with straightforward single-processor Fortran codes on a relatively idle machine you should not be seeing a lot of switching, so if it really is at harmful levels (and you are not on the verge of overheating) you can turn off powersave modes (if they are on) and lock your cores at high-performance rates as well.

I need to know a little more about which Linux you have, as this is one of those things that differs between Linux versions (for example, do you have the "powersave" command?). What does "cat /etc/release" show, if you do not mind?

I need to see a few numbers and some output. The simplest thing would be to run top and turn on the per-core display (top -1).

1 Like