The only “first touch” that I can see outside the parallel region is sim_ind_a(1,:) = init_a, and it is a very limited one. I doubt that moving it into the parallel region would make a significant difference.
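For reference, here is a minimal sketch of what moving that initialization into the parallel region could look like. The array sizes, the scalar `init_a`, and the update rule are placeholders I made up for illustration, not the posted code; the point is only that the first write to each column is done by the thread that later works on it.

```fortran
program first_touch_sketch
   implicit none
   ! Hypothetical sizes, chosen only to keep the example small.
   integer, parameter :: n_t = 100, n_ind = 100000
   real, allocatable :: sim_ind_a(:,:)
   real :: init_a
   integer :: i, t

   init_a = 1.0
   ! Allocation by itself typically does not touch the pages yet.
   allocate(sim_ind_a(n_t, n_ind))

   !$omp parallel do private(t) schedule(static)
   do i = 1, n_ind
      ! First write to column i, done by the thread that processes it.
      sim_ind_a(1, i) = init_a
      do t = 2, n_t
         ! Placeholder update rule standing in for the real simulation step.
         sim_ind_a(t, i) = 0.9 * sim_ind_a(t-1, i)
      end do
   end do
   !$omp end parallel do

   print *, sim_ind_a(n_t, n_ind)
end program first_touch_sketch
```

With a static schedule each thread always gets the same block of columns, so on a NUMA machine the pages it first writes are the ones it keeps using.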
However, this makes me think that most of the code may be memory bound, as it has to write a very large amount of data (an estimated 12 GB with the parameters as posted). If so, with most of the code memory bound and the calls to RANDOM_NUMBER() serialized by a mutex, there is no chance of getting a significant speed-up with multithreading.
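One way to check the RANDOM_NUMBER() part on a given compiler is a small timing test like the sketch below. It is not taken from the posted code: the number of draws is arbitrary, and it assumes the compiler's runtime allows RANDOM_NUMBER to be called from several threads at once (worth checking in the compiler documentation). Run it with OMP_NUM_THREADS=1 and then with more threads and compare the timings.

```fortran
program rng_scaling_sketch
   use iso_fortran_env, only: int64, real64
   implicit none
   ! Hypothetical number of draws, large enough to give a measurable time.
   integer, parameter :: n = 20000000
   integer(int64) :: c0, c1, rate
   integer :: i
   real(real64) :: r, total

   total = 0.0_real64
   call system_clock(c0, rate)

   !$omp parallel do private(r) reduction(+:total)
   do i = 1, n
      ! Assumption: the runtime makes concurrent calls safe.
      call random_number(r)
      total = total + r
   end do
   !$omp end parallel do

   call system_clock(c1)

   print *, 'mean draw:', total / real(n, real64)
   print *, 'seconds  :', real(c1 - c0, real64) / real(rate, real64)
end program rng_scaling_sketch
```

If the wall time does not drop (or even grows) as threads are added, the random number calls are a serial bottleneck on that setup, independently of the memory traffic.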