Sorry, I haven’t actually used FKB. I just took a look at what it could do and decided to use the Intel oneAPI AI toolkit implementation instead, which I think is threaded (at least on Intel processors). It ran fast enough to handle all but the biggest cases in a reasonable length of time. Fortunately, I was able to get a small allocation on a cluster with several V100s, so I ran the largest cases there.