A New JSON Library

Awesome, yes, we might need to write our own string-to-real converter.

Another benchmark. For the 6.5 MB file big.json that isn’t just real numbers (e.g., it has a lot of string data):

Fortran:

```
rojff        : 1.5498  seconds
fson         : 0.9193  seconds
json_fortran : 0.2063  seconds
```

Python:

```
rapidjson    : 0.045112584 seconds
json         : 0.033147166 seconds
ujson        : 0.021337875 seconds
```

Thanks to all of you for taking a look and running some benchmarks. For some reason I thought I was a bit closer performance-wise. Guess we’ve got some work to do.

As for ideas about where the bottlenecks might be: from what I’ve heard, and to some extent experienced myself, the Fortran runtime library’s code for reading/writing numeric data is… shall we say, not the fastest. And since canada.json is mostly numeric data, I suspect that’s where a lot of the time is going.

Another thought: my file_cursor_t is reading the file one character at a time. Perhaps implementing some sort of buffering would elicit some improvement?
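
Something like the following sketch is what I have in mind (the names are made up, not rojff’s actual internals): read the file in large chunks with stream access, and hand characters out from the in-memory buffer:

```fortran
module buffered_cursor_m
    implicit none
    private
    public :: buffered_cursor_t

    integer, parameter :: BUFFER_SIZE = 65536

    ! Hypothetical stand-in for a file cursor: refills a large buffer
    ! with one stream read instead of touching the file per character.
    type :: buffered_cursor_t
        integer :: unit = -1
        integer :: file_size = 0   ! total bytes, from inquire
        integer :: bytes_read = 0
        character(len=BUFFER_SIZE) :: buffer
        integer :: position = 1    ! next character to hand out
        integer :: filled = 0      ! valid characters currently buffered
    contains
        procedure :: open_file
        procedure :: finished
        procedure :: next_character
    end type
contains
    subroutine open_file(self, filename)
        class(buffered_cursor_t), intent(inout) :: self
        character(len=*), intent(in) :: filename

        inquire (file=filename, size=self%file_size)
        open (newunit=self%unit, file=filename, access="stream", &
                form="unformatted", action="read", status="old")
    end subroutine

    logical function finished(self)
        class(buffered_cursor_t), intent(in) :: self

        finished = self%bytes_read >= self%file_size &
                .and. self%position > self%filled
    end function

    ! Callers should check finished() before asking for a character.
    function next_character(self) result(c)
        class(buffered_cursor_t), intent(inout) :: self
        character(len=1) :: c

        if (self%position > self%filled) then
            ! refill: one big read instead of one read per character
            self%filled = min(BUFFER_SIZE, self%file_size - self%bytes_read)
            read (self%unit) self%buffer(1:self%filled)
            self%bytes_read = self%bytes_read + self%filled
            self%position = 1
        end if
        c = self%buffer(self%position:self%position)
        self%position = self%position + 1
    end function
end module
```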

As for the overhead due to error handling, I’d think branch prediction on modern processors ought to alleviate a lot of that. I’d be curious whether anybody knows of a way to confirm or deny that, though.

I’m happy to take contributions if anybody would be interested. I’d be interested to hear thoughts on the API as well.

It’s very interesting that the built-in json parser in Python can beat rapidjson. My experience has been that rapidjson is one of the fastest.

I think we should experiment with writing a JSON parser that assumes valid JSON and just parses it as quickly as it can. I wouldn’t even worry about representing it at first (nor error handling), just parse it, and perhaps just count how many {} pairs there are. And see if we can get competitive. Then we can add error handling and representing it in Fortran.
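
For example, a toy version of that experiment might look like this sketch; it only tracks string literals and escapes well enough to avoid miscounting braces, and does nothing else:

```fortran
! Toy scanner: count {} pairs in a JSON string as fast as possible,
! assuming the input is valid. Nothing is validated or represented.
program count_braces
    implicit none
    character(len=*), parameter :: json = &
            '{"a": [{"b": "{not a brace}"}, {}]}'
    integer :: i, depth, pairs
    logical :: in_string

    depth = 0
    pairs = 0
    in_string = .false.
    i = 1
    do while (i <= len(json))
        select case (json(i:i))
        case ('"')
            in_string = .not. in_string
        case ('\')
            ! inside a string, skip the escaped character
            if (in_string) i = i + 1
        case ('{')
            if (.not. in_string) depth = depth + 1
        case ('}')
            if (.not. in_string) then
                depth = depth - 1
                pairs = pairs + 1
            end if
        end select
        i = i + 1
    end do

    print *, "object pairs:", pairs, " balanced:", depth == 0
end program
```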

Yep, take a look at JSON-Fortran. There is some stuff in there to make the file read go faster (e.g., using STREAM, and also reading it in chunks rather than one character at a time). But, I have to ask: why do you not just use JSON-Fortran? :slight_smile:

I didn’t look real hard at it, but my first impressions were that the API didn’t seem that friendly, and the documentation/tutorial wasn’t that illuminating. I tried reading through the source code a bit, because I was curious how you implemented the parser, but had a hard time finding my way around. I never did find where the actual logic for the parser started. So I’ll admit that to some extent my library was born out of Not Invented Here, but I was more interested in the usability aspect than performance, at least initially.

And I will say that rojff is fast enough to be usable, if not necessarily the fastest.

Why not provide Fortran “bindings” for UltraJSON aka ujson and “call it a day”?!

After all, “UltraJSON is an ultra fast JSON encoder and decoder written in pure C.” That it has Python bindings is beside the point.

The Python interface, it appears at least, returns and accepts native Python data types (i.e., dict, list, str, float, bool). What types should be accepted and returned in Fortran? We don’t have an intrinsic dictionary type, and we can’t put different types in an array. Parsing JSON data really fast is great, but once I’ve parsed it, I need to be able to do something useful with it, and that shouldn’t require jumping through hoops or circumventing the type system.
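
For context, here is the kind of representation I have in mind, sketched with made-up names (this is not any library’s actual API): an abstract value type, an extension per JSON kind, and a wrapper element type so arrays and objects can hold heterogeneous values:

```fortran
module json_value_m
    implicit none

    ! abstract parent for every kind of JSON value
    type, abstract :: json_value_t
    end type

    type, extends(json_value_t) :: json_null_t
    end type

    type, extends(json_value_t) :: json_bool_t
        logical :: value
    end type

    type, extends(json_value_t) :: json_number_t
        double precision :: value
    end type

    type, extends(json_value_t) :: json_string_t
        character(len=:), allocatable :: value
    end type

    ! wrapper so arrays/objects can hold mixed value types
    type :: json_element_t
        class(json_value_t), allocatable :: value
    end type

    type, extends(json_value_t) :: json_array_t
        type(json_element_t), allocatable :: elements(:)
    end type

    ! parallel keys/values stand in for the dictionary Fortran lacks
    type, extends(json_value_t) :: json_object_t
        character(len=:), allocatable :: keys(:)
        type(json_element_t), allocatable :: values(:)
    end type
end module
```

Consumers would then use select type to recover the concrete types, which stays within the type system, though it’s admittedly more ceremony than a Python dict.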

Also, Python has a garbage collector. Presumably C is allocating some memory; how and when do you deallocate that on the Fortran side? Hopefully you’re not making it easy for the user to forget to do that, or requiring them to do it manually at all. That’s how memory leaks happen.
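
One way to avoid that would be a handle type with a FINAL procedure, so the C memory is released automatically when the Fortran object goes out of scope. A sketch, with a purely hypothetical ujson_free binding:

```fortran
! Sketch of RAII-style cleanup for C-allocated memory. The
! ujson_free binding below is hypothetical, for illustration only.
module c_json_handle_m
    use, intrinsic :: iso_c_binding, only: c_ptr, c_null_ptr, c_associated
    implicit none

    type :: c_json_handle_t
        type(c_ptr) :: raw = c_null_ptr
    contains
        final :: free_handle  ! runs when the object ceases to exist
    end type

    interface
        subroutine ujson_free(doc) bind(c, name="ujson_free")  ! hypothetical
            import :: c_ptr
            type(c_ptr), value :: doc
        end subroutine
    end interface
contains
    subroutine free_handle(self)
        type(c_json_handle_t), intent(inout) :: self

        if (c_associated(self%raw)) then
            call ujson_free(self%raw)
            self%raw = c_null_ptr
        end if
    end subroutine
end module
```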

You skipped the first part:

Fast is better than slow
Slow is better than unmaintainable

So the main recommended JSON library in Fortran must be fast, which is better than if it were slow, which would be better than unmaintainable.

Surely it isn’t that bad! :slight_smile: The parser starts in json_parse_file. It’s a recursive parser, mostly inherited from fson with some updates by me to make it faster. The underlying structure is a linked list of pointers.

But, yes, the documentation could probably be improved. The code is well commented, and it generates nice FORD docs: json_file – JSON-Fortran. (I think the docs expose both public and private methods, so that might be part of the problem.) Some more tutorials on how to do certain things would probably also be beneficial. I’m also happy to accept contributions. JSON-Fortran is definitely production code, and we use it not only for reading JSON files, but also for creating and manipulating the data in memory and writing files, as well as for data exchange among tools (e.g., Python and Fortran).

I’ll look at ujson. My experiments with strtod are very promising… so stay tuned…
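
Roughly, the experiment is binding C’s strtod through iso_c_binding; a minimal sketch (the wrapper below is illustrative, not the actual rojff code):

```fortran
module strtod_m
    use, intrinsic :: iso_c_binding, only: &
            c_char, c_double, c_ptr, c_null_char, c_null_ptr
    implicit none

    interface
        ! double strtod(const char *nptr, char **endptr);
        function strtod(nptr, endptr) bind(c, name="strtod") result(val)
            import :: c_char, c_double, c_ptr
            character(kind=c_char), intent(in) :: nptr(*)
            type(c_ptr), intent(inout) :: endptr
            real(c_double) :: val
        end function
    end interface
contains
    ! Illustrative wrapper: hand a null-terminated copy of the
    ! Fortran string to C's strtod and return the parsed value.
    function string_to_real(str) result(val)
        character(len=*), intent(in) :: str
        real(c_double) :: val
        type(c_ptr) :: endptr

        endptr = c_null_ptr
        val = strtod(str // c_null_char, endptr)
    end function
end module
```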

This is true. I just follow

Make it work
Make it right
Make it fast

I’m in the middle of the “Make it fast” part. And I wasn’t that far off of fast :stuck_out_tongue:

I’m sure it’s not that bad. But I partly wanted the excuse to go through the fun exercise of writing my own anyway, so I didn’t try as hard as I could have. I’d be interested to see some example usage.

There is one at the link @jacobwilliams posted: json_file – JSON-Fortran

Ok, I see. I have some questions and potential critiques of the API, if @jacobwilliams is interested.

Sure, send them as a GitHub issue!

And feel free to steal anything you want from JSON-Fortran. I think the stream/chunk thing is going to help a lot (that’s what I remember)… probably there is even a better way than what I did (maybe some kind of fancy asynchronous IO thing?)

Speaking of “stealing”… We don’t need to write Python bindings for ujson; we could write bindings directly for the C code of ujson, or maybe just “steal” some ideas from this file.
I’m not very comfortable with licensing questions. Someone should check ujson's license first, but I think we are allowed to do that.

I added a buffer to the file_cursor_t in rojff. @jacobwilliams, could you update the rojff timing in your table?

I have now also tried asynchronous reads, and opening the file with STREAM access. Trying asynchronous reads actually slowed things down, so maybe I just wasn’t doing it correctly. STREAM access did not appear to have any appreciable impact on performance.
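
For anyone curious, the asynchronous attempt followed roughly this general pattern (a simplified sketch, not my exact code; error handling and the short final read are omitted):

```fortran
program async_read_sketch
    implicit none
    integer :: unit, request_id, ios
    character(len=65536) :: chunk

    open (newunit=unit, file="big.json", access="stream", &
            form="unformatted", asynchronous="yes", action="read")

    ! start the transfer without blocking; id= returns a handle
    read (unit, asynchronous="yes", id=request_id, iostat=ios) chunk

    ! ... in principle, parse the previous chunk here while this one loads ...

    ! block until the transfer identified by request_id completes
    wait (unit, id=request_id)

    close (unit)
end program
```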