fortbench: a benchmark for agentic coding of Fortran

I was playing with the new Qwen 3.5 family of models and, inspired by SWE-Bench, benchmarked them on lazy-fortran/fortbench, a real-world Fortran coding benchmark for agent CLIs (https://github.com/lazy-fortran/fortbench). Qwen is still a bit worse than Claude and GPT, but it can solve more than 50% of my tasks. I would be curious about input or feedback on how to expand it.

PS: Changed my user from @ert to @krystophny to be consistent with GitHub.


Are you running Qwen 3.5 locally? I tried it with qwen-code, but it wasn't able to fix a simple Fortran problem. As a chat model, though, it works really well, probably the best local model I have tried.

Yes! I am using opencode with the Qwens, not qwen-code. I am now also trying to wire it up to codex as a local model. For this, llama.cpp needed some modifications because of unknown tool names, but it runs now. How good it is, I cannot tell yet: I have only run benchmarks, no practical work, and it seems that for benchmarks the larger Qwens (27B, 35B-A3B, 122B-A10B) are barely usable.
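For anyone trying a similar local setup, a minimal sketch of serving a model with llama.cpp for an agent CLI (model filename, port, and flags are illustrative, check `llama-server --help` on your build): llama-server exposes an OpenAI-compatible `/v1` endpoint that clients like opencode can be pointed at, and tool calling generally needs the model's own chat template enabled.

```shell
# Illustrative launch of llama.cpp's server (paths and flags are assumptions,
# verify against your llama.cpp version)
llama-server \
  -m Qwen3.5-35B-A3B-Q8_0.gguf \  # hypothetical quantized model file
  --port 8080 \                   # clients then use http://localhost:8080/v1
  --jinja                         # apply the model's chat template (tool calls)
```

The agent CLI is then configured with `http://localhost:8080/v1` as an OpenAI-compatible base URL and any placeholder API key.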

Yes, I used Qwen3.5-35B-A3B-8bit, and once it runs it reaches about 70 tokens/s on my laptop, so it is very usable. But qwen-code would reload the whole conversation over and over: it took 5-10 minutes to process the prompt, then quickly generated a response in a few seconds, then reloaded again for the next request. Since any task needs, say, 20 requests, it was unusable in practice. Given that the conversation simply continues, I would think the prompt does not need to be reprocessed from scratch. I am sure this will get figured out in the coming years. As a chat model, it is very good.
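The reload problem described above is essentially a missing prompt (KV) cache: if the client resends the full conversation each turn and the server re-evaluates it from scratch, total prefill work grows quadratically with the number of turns, whereas a reused cache only processes the newly appended tokens. A back-of-the-envelope sketch (token counts are made up for illustration):

```python
def prefill_tokens(turn_lengths, cache=False):
    """Total prompt tokens the server must (re)process over a conversation.

    turn_lengths: tokens appended per turn (user message + prior reply).
    cache=False models re-evaluating the whole context every turn;
    cache=True models a reused KV cache that only sees the new tokens.
    """
    total = 0
    context = 0
    for n in turn_lengths:
        context += n
        total += n if cache else context
    return total

# One big initial prompt, then 19 short follow-up turns:
turns = [8000] + [500] * 19
print(prefill_tokens(turns, cache=False))  # 255000 tokens re-processed
print(prefill_tokens(turns, cache=True))   # 17500 tokens, ~15x less work
```

At a few hundred prefill tokens per second on a laptop, that factor is exactly the difference between a 5-10 minute wait per request and a near-instant response.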