Kamis, 23 September 2010

Comparison of Perl serialization modules

A while ago I needed a Perl data serializer with some requirements (supports circular references and Regexp objects out of the box, consistent/canonical output due output will be hashed). Here's my rundown of currently available data serialization Perl modules. A few notes: the labels fast/slow is relative to each other and are not the result of extensive benchmarking.

Data::Dumper. The grand-daddy of Perl serialization module. Produces Perl code with adjustable indentation level (default is lots of indentation, so output is verbose). Slow. Available in core since the early days of Perl 5 (5.005 to be exact). To unserialize, we need to do eval(), which might not be good for security. Usually the first choice for many Perl programmers when it comes to serialization and arguably the most popular module for that purpose.

Storable. Fast. Produces compact, portable binary output. Also available in core distribution. Does not support Regexp objects out of the box (though adding support for that requires only a few lines). Binary format used to change several times in the past without backward compatibility in the newer version of the module, giving people major PITA. Supposedly stabilized now.

YAML::XS. Fast. Verbose YAML output (currently doesn't seem to have option to output inline YAML). My personal experience in the past is sometimes this module behaved weirdly and died with a cryptic error, but I guess currently it's pretty stable.

There are other YAML implementations like YAML::Syck (also pretty speedy) and the old Pure-Perl YAML.pm and partial implementation YAML::Tiny. The last two might not be a good choice for general serialization needs.

Data::Dump. Very slow. Produces nicely indented Perl output. The strength of this module is in pretty output and flexibility in customizing the formatting process. Based on Data::Dump I've hacked two other specialized modules: Data::Dump::PHP for producing PHP code, and Data::Dump::Partial to produce compact and partial Perl output for logging.

XML::Dumper. Produces *very* verbose (as is the case with all XML) XML output. Slow. Aside from the XML format, I don't think there's a reason why you should choose this over the others.

JSON::XS. Fast, outputs pretty compact but still readable code, but does not support circular references or Regexp objects.

JSYNC. Slow, outputs JSON and in addition supports circular references but not yet Regexp objects.

FreezeThaw. Slow, produces compact output but not as compact as Storable. Does not support Regexp objects out of the box.

Apart from these there are many other choices too, but I personally don't think any of them is interesting enough to be a favorite. For example, last time I checked PHP::Serialization (and all the other PHP-related modules) does not support circular references. There's also, for example, Data::Pond: cute concept but of little practical use as it is even more limited than JSON format.

There are also numerous alternatives to Data::Dumper/Data::Dump, producing Perl or Perl-like code or indented formatted output, but they are either: not unserializable back to data structures (so, they are more of a formatting module instead of serialization module) or focus on pretty printing instead of speed. In general I think most Data::Dumper-like modules are slow when it comes to serializing data.

In conclusion, choice is good but I have not found my perfect general serialization module yet. My two favorites are Storable and YAML::XS. If JSYNC is faster and supports Regexp, or if YAML::XS or YAML::Syck can output inline/compact YAML, that would be as near to perfect as I would like it.

Hope this comparison is useful. Corrections and additions welcome.

4 komentar:

  1. It's always felt a bit awkward, the way that Data::Dumper seems to live two lives; one for serialising Perl data structures so they can be eval()'ed in again after, and the other for serialising Perl data structures for humans to read as pretty-printed output. These to me feel like very different problems, to be attacked in different ways.

    BalasHapus
  2. @Max: Thanks for the pointer! I've glanced the module name MessagePack before a few times in CPAN Recent Feeds, but have never thought to actually look it up. Trying it out right now ...

    BalasHapus
  3. Bummer, Data::MessagePack doesn't seem to allow objects nor circular references. Looks interesting though.

    BalasHapus
  4. Perhaps you could do a new post that offers a 'decision tree' for how to select a serialization module.

    BalasHapus

Catatan: Hanya anggota dari blog ini yang dapat mengirim komentar.