Description

I’m writing for my PhD a software which does not a lot of things, but needs to do it fast. I added a lot of logging, for debugging and demonstration purposes, and – of course – I’m using Marshal to dump the various logging events.

To my very surprise, it introduced a very big performance hit. Moreover, the log files were very big. I was, at that time, logging /each/ log event separately, like this:
  Marshal.dump(ev_name, event_log)
  Marshal.dump(ev_args, event_log)

Recently, I asked myself what was the size and performance hit of Marshal.dump itself: would not it be better to dump a few events at a time instead of separately. Fortunately for me, my software is running through an event loop, so it is easy to log events one cycle at a time. Here is the surprise: it is not only much faster, but it results in much smaller logs …

First, a small benchmark

You can download here a small benchmark which measures the various times and sizes for different sets of ruby objects, each being written either separately or as a big array. The results are below. The result is IMO without ambiguity: avoid marshalling small objects separately.

  count size size/obj write read
separated 1000 94608 23 0.037 0.028
array 1000 69631 17 0.012 0.012
separated 15000 1424311 23 0.573 0.443
array 15000 1049334 17 0.256 0.307
separated 25000 2374141 23 0.923 0.734
array 25000 1749165 17 0.475 0.399
separated 35000 3324068 23 1.302 1.016
array 35000 2449092 17 0.717 0.578
separated 45000 4273752 23 1.665 1.334
array 45000 3148776 17 0.901 0.912

plotted, this gives

Real-world results

The results for my own application point in the same direction: avoid marshaling objects one by one. If you can, do it in sets. The gain in marshalling/unmarshalling times is around 6, same for size. The results are below.

  cycles events read_time time/cycle time/ev log_size bytes/cycle bytes/evt
old 178 4611 1.879002 10.6 0.4ms 5414392 30417 1174
new 142 4055 0.386898 2.7ms 0.1ms 786554 5539 193