Going from 100x100 to 2000x2000 is 400 times bigger so the increase in run time should not be unexpected.
The use of Nvidia off-loading with DO CONCURRENT looks to be a very interesting opportunity.
It appears to provide the basics of shared/private array nomination, which looks good.
It appears to be an “easy access” to large thread counts on GPU’s.
I would expect that the same issues that apply to OpenMP would apply here, especially memory transfer bandwidth limitations.
Please update us on how you go with this approach and the pc configuration you are using.
Also, you should check your test example, as floating point underflow and use of sin/cos/atan with unusual values can distort the run times, which confuses the test interpretation.