A New JSON Library

Announcing a new JSON library. Return of JSON (just in time for Halloween too) :stuck_out_tongue_winking_eye:

My old library (jsonff) wasn’t cutting it with large data sets. So borrowing from its test suite and inspired by its API, I introduce a new library built from the ground up for high performance. It gives pretty decent error messages for invalid JSON too. Give it a try and let me know what you think.

8 Likes

Nice. How does it compare performance wise with other JSON libraries? Say the default one in Python.

I haven’t done thorough performance testing yet, but it can parse and output the large (2.2 MB) canada.json file (from this json benchmark repo) in less than 2 seconds on my i5 laptop. So it seems on par with some C and C++ libraries.

Can you quickly try it using Python on the same laptop?

On my laptop (M1 Mac, Gfortran 11.1.0), load times for canada.json are:

Parser                               Parse time (sec)
JSON-Fortran json_file%load()        0.174 :sunglasses:
rojff                                0.670
Python 3.8.5 json.load()             0.04
Python ujson 4.0.1 ujson.load()      0.0220

Consistent with my previous testing. The Python parser is very fast. The JSON-Fortran parser is mostly inherited from the earlier fson library.

Updates: Updated with ujson and rojff (fpm build --profile release).
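
For anyone who wants to reproduce the Python rows, a minimal timing sketch might look like the following. It is not the benchmark code behind the numbers above: the helper names `time_load` and `make_sample`, the best-of-N repeat strategy, and the small synthetic file are all assumptions for illustration. Point the path at canada.json to get a comparable figure.

```python
import json
import os
import tempfile
import time


def time_load(path, loader=json.load, repeats=5):
    """Return the best wall-clock time (seconds) over `repeats` parses of `path`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "r", encoding="utf-8") as f:
            loader(f)  # file read + parse, discarding the result
        best = min(best, time.perf_counter() - start)
    return best


def make_sample(path, n=20000):
    """Write a synthetic file of coordinate pairs (hypothetical stand-in for canada.json)."""
    coords = [[-65.0 + i * 1e-4, 43.0 + i * 1e-4] for i in range(n)]
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"coordinates": coords}, f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.json")
        make_sample(sample)
        print(f"json.load: {time_load(sample):.4f} s")
```

If ujson is installed, passing `loader=ujson.load` should time it the same way, since it is meant as a drop-in replacement.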

3 Likes

Worth also comparing with Python’s ujson (can be used as a drop-in replacement for standard lib json). On my computer ujson.load() is 2.3 times faster than json.load() (parsing canada.json).

I’ll put my test code in a repo somewhere. I’ve actually been meaning to benchmark the different libraries. I want to try and speed up JSON-Fortran. I have made changes over the years to get it faster than the one I inherited, but I was thinking of redesigning it a bit… we should be able to write something that gets closer to the Python speed…

1 Like

Thanks @jacobwilliams! So rojff is unfortunately 16.75x slower than Python’s default parser. JSON-Fortran is 4.35x slower. That’s better.

As a user, I just want a library in pure Fortran (fpm installable) that is comparable to Python’s default JSON library. So it doesn’t have to be as fast as the fastest C++ libraries. But it needs to be competitive at least with Python. I would say 50% slower at most, so 0.06 seconds on the above benchmark would be ok I think.
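
The arithmetic behind those ratios and the 0.06 target can be checked with a few lines. This is only a sketch using the parse times from the table earlier in the thread; the names `times`, `slowdown`, and `target` are made up for illustration.

```python
# Parse times in seconds, copied from the table earlier in the thread.
times = {
    "JSON-Fortran": 0.174,
    "rojff": 0.670,
    "Python json": 0.04,
    "Python ujson": 0.0220,
}

baseline = times["Python json"]


def slowdown(name, reference=baseline):
    """How many times slower `name` is than the reference time."""
    return times[name] / reference


# "50% slower at most" relative to Python's default parser.
target = 1.5 * baseline

print(f"rojff:        {slowdown('rojff'):.2f}x slower than json.load()")
print(f"JSON-Fortran: {slowdown('JSON-Fortran'):.2f}x slower than json.load()")
print(f"target:       {target:.2f} s")
```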

Then we can say that with LFortran in a Jupyter notebook we have an equivalent experience. If we are 4x or 16x slower, then people will think “what is the point of using Fortran if you can’t even match Python, which we all know is slow?” (Yes, I know well that Python is actually quite fast with these libraries, but I also know that we are capable of matching the speed, one way or another.)

3 Likes

Yep, agree.

FYI: my benchmark code is now here: GitHub - jacobwilliams/json-fortran-benchmarks: Benchmarks for JSON Fortran parsers

3 Likes

Are there major differences in the libraries in terms of error handling? I could imagine that error handling generates huge overhead because you have to check more and ideally you have to track the position while parsing (like the curser_t type in rojff).

1 Like

JSON-Fortran does have pretty comprehensive error checking. If there’s a parsing error, the caller can retrieve what the error was, what line and character it occurred on, etc.

1 Like

Interesting! It would be nice if someone knowledgeable about both the Python libraries and JSON-Fortran could complete a thorough investigation and provide a summary of the root cause(s) of the nearly 8X slowness compared to Python ujson.

I would really like to be proven wrong, but my hypothesis is that there are 3 reasons for the slowness of such libraries developed in pure Fortran:

  1. The Fortran language standard itself needs significant improvements to enable a vital aspect of scientific and technical computing: pre- and post-processing of data, of which there are now massive amounts. The core number-crunching is important, but processing all the input and program data to get to the number-crunching stage and, once crunched, processing the results again for all the stakeholders is paramount. The utility in question here, a JSON library for Fortran, is but one part of this. However, circa 2021-22, it’s rather difficult to build a performant library in Fortran compared to the alternatives. The language itself needs to offer a set of facilities to enable such library authoring; I’ve listed my suggestions here. Add the computer science concepts of move semantics and rule of 7 to the list, for this is relevant to how libraries such as JSON-Fortran and rojff tend to be architected.
  2. Fortran compilers need to really up the game on optimization though it’s a very difficult battle. The other paradigms, especially C++, Python, and Julia, attract the sharpest minds and have tons and tons of them to optimize and optimize their language processors. Fortran needs a lot of catching up here.
  3. Library authors themselves will need to put in tons more effort to further optimize their libraries and eke out every ounce of performance, if they are intent on remaining competitive with other alternatives. This may include perhaps replacing critical sections of their “pure Fortran” code with optimized C and/or assembler pieces; this may apply to Fortran stdlib as well.

Or I may be completely off, and maybe it is just one or two low-hanging fruits in the Fortran code for these JSON libraries that affect the performance, and once those fruits are grabbed and the code improved, the Fortran equivalent becomes similarly fast. As I wrote above, I wouldn’t mind at all if I am wrong on this, even as my experience thus far has informed me otherwise.

2 Likes

@jacobwilliams, please see this. I was just about to suggest the same to you regarding 64-bit integers (and 64-bit reals) in your timing measurements with system_clock when @tomohirodegawa made that post in the other thread.

It may not change the gist of the benchmarks you have reported thus far all that much, but it would be good to ensure your timing “instrumentation” has no issues. Note that your current use of a default integer and real with system_clock will leave a nagging doubt with some folks; it will for me.

1 Like

Done! (note: I need to clean up the whole thing…this really is just something I slapped together in 15 minutes). :slight_smile:

1 Like

:+1:

I found the gprof(1) output from a gfortran(1) build interesting:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 87.51      0.07     0.07  2473280     0.00     0.00  __json_value_module_MOD_json_value_reverse
 12.50      0.08     0.01   167178     0.00     0.00  __json_value_module_MOD_json_value_add_member
  0.00      0.08     0.00  2473280     0.00     0.00  __json_value_module_MOD_pop_char
  0.00      0.08     0.00   222252     0.00     0.00  __json_value_module_MOD_push_char
  0.00      0.08     0.00   167179     0.00     0.00  __json_value_module_MOD_parse_value
  0.00      0.08     0.00   167178     0.00     0.00  __json_value_module_MOD_json_info
  0.00      0.08     0.00   111126     0.00     0.00  __json_value_module_MOD_parse_number
  0.00      0.08     0.00   111080     0.00     0.00  __json_string_utilities_MOD_string_to_real
  0.00      0.08     0.00   111080     0.00     0.00  __json_value_module_MOD_string_to_dble
  0.00      0.08     0.00   111080     0.00     0.00  __json_value_module_MOD_to_real
  0.00      0.08     0.00    56045     0.00     0.00  __json_value_module_MOD_to_array
  0.00      0.08     0.00       46     0.00     0.00  __json_string_utilities_MOD_string_to_integer
  0.00      0.08     0.00       46     0.00     0.00  __json_value_module_MOD_string_to_int
  0.00      0.08     0.00       46     0.00     0.00  __json_value_module_MOD_to_integer
  0.00      0.08     0.00       12     0.00     0.00  __json_string_utilities_MOD_unescape_string
  0.00      0.08     0.00       12     0.00     0.00  __json_value_module_MOD_parse_string
  0.00      0.08     0.00        4     0.00     0.00  __json_value_module_MOD_to_object
  0.00      0.08     0.00        4     0.00     0.00  __json_value_module_MOD_to_string
  0.00      0.08     0.00        2     0.00    40.00  __json_value_module_MOD_parse_array
  0.00      0.08     0.00        2     0.00     0.00  __json_value_module_MOD_parse_object
  0.00      0.08     0.00        1     0.00     0.00  __json_file_module_MOD_json_file_failed
  0.00      0.08     0.00        1     0.00    80.01  __json_file_module_MOD_json_file_load
  0.00      0.08     0.00        1     0.00     0.00  __json_value_module_MOD_json_clear_exceptions
  0.00      0.08     0.00        1     0.00     0.00  __json_value_module_MOD_json_failed
  0.00      0.08     0.00        1     0.00     0.00  __json_value_module_MOD_json_initialize
  0.00      0.08     0.00        1     0.00     0.00  __json_value_module_MOD_json_parse_end
  0.00      0.08     0.00        1     0.00    80.01  __json_value_module_MOD_json_parse_file
  0.00      0.08     0.00        1     0.00     0.00  __json_value_module_MOD_json_prepare_parser

Wait, what code did you run that generated this? json_value_reverse shouldn’t be called at all for just parsing a file.

I was running various app codes for a quick view of where time was spent and got called off onto something else. I was also seeing a bug that seems to have crept into the latest version of fpm (which I just rebuilt) and that shows up with your code: if I use “fpm run” I just see “app app app app”. So I’m back, and I see you probably wanted me to run something like

MYBUILD='--profile release --flag -p'
fpm build $MYBUILD
fpm run json_fortran_test $MYBUILD
gprof $(fpm run json_fortran_test $MYBUILD --runner) >gprof.out
(more||less) <gprof.out
exit

which I still think is more on target now that I have taken a bit of time to look at the new version. Not sure what platform you have or if you use gprof(1), which is a bit of an art as well as a bit of science but if not, give that a try. Will do that a bit more rigorously if you find the results useful.

I started an fpm plug-in that I had not finished that I might use this code to polish off:

NAME
  fpm-time(1) - call fpm(1) with gprof(1) to generate a flat timing profile
SYNOPSIS
  fpm-time [subcommand] [--targets] targets
DESCRIPTION
  Run the fpm(1) command with the gfortran(1) compiler and compiler flags
  required to build instrumented programs which will generate gprof(1)
  output files. Run the program and then run a basic gprof(1) command
  on each output.

  IMPORTANT: ONE target program should be selected if multiple targets exist.

  NOTE: 2021-03-21

     This is a prototype plug-in for fpm(1), which is currently in alpha
     release. It may require changes at any time as a result.

OPTIONS
   subcommand  fpm(1) subcommand used to run a program (test,run). If
               no options are specified the default is "test".
               The name "example" will be converted to "run --example"
               internally.
   --targets   which targets to run. The default is "*". ONE target should
               be tested
   --flag      ADDITIONAL flags to add to the compile
   --repeat,R  number of times to execute the program. Typically, this helps
               reduce the effects of I/O buffering and other factors that can
               skew results. Defaults to one execution.
   --help      display this help and exit
   --version   output version information and exit

EXAMPLE
   # in the parent directory of the fpm(1) project
   # (where "fpm.toml" resides).

    fpm-time
    fpm-time run demo1 demo2

SEE ALSO
    gprof(1), gcov(1)

I started that in March. Maybe time to finish it :blush:

If I finish it and your default test is in the test directory, you just run

fpm time

and get a profile run of your test. I started the same for gcov(1) too, and I also want to extend it to valgrind(1) and other tools supplied with compilers.

2 Likes

Ah interesting. Yes, I can duplicate this. Thanks!

Something is definitely wrong in the Gprof results. The reverse routine isn’t called for parsing. When I just comment it out completely and rerun Gprof, then it says some other uncalled routine is at the top. So, it is getting confused somehow… Is it a bug?

I noticed that this canada.json file is mostly real numbers. It seems most of the time is spent converting the strings to reals. I haven’t checked jsonff, but in JSON-Fortran, I’m just using:

read(str,fmt=*,iostat=ierr) rval

I notice when I just replace this with

rval = 0.0_RK
ierr = 0

Then the parse time goes down to about 0.05 seconds. So clearly, there is room for improvement here. Is there a faster string to real parser out there for Fortran? Hmmm… maybe I’ll make a new post about this so as not to hijack this thread any more.
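
A rough back-of-the-envelope reading of that experiment, as a sketch only: subtracting the stubbed-out time from the full parse time estimates how much of the load goes into string-to-real conversion. The per-conversion figure additionally assumes the 111080 string_to_real calls reported in the gprof output above, which came from a different machine and build, so treat it as an order-of-magnitude estimate. All variable names are made up for illustration.

```python
full_parse = 0.174      # seconds, JSON-Fortran json_file%load() from the table
stubbed_parse = 0.05    # seconds, "about 0.05" with the read replaced by a constant
n_conversions = 111080  # string_to_real calls in the gprof flat profile (other machine)

# Time attributable to the list-directed read of each real number.
conversion_time = full_parse - stubbed_parse
fraction = conversion_time / full_parse
per_call_us = conversion_time / n_conversions * 1e6

print(f"conversion time: {conversion_time:.3f} s ({fraction:.0%} of the parse)")
print(f"per conversion:  {per_call_us:.2f} microseconds")
```

On these numbers, roughly 70% of the parse time would be going into the conversions, which is consistent with the "room for improvement" observation.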

2 Likes