Saxon thrashes Altova…

Posted in development on September 22nd, 2008 by admin

This Sunday I was busy trying to optimize the data load process. In fact, I ended up completely rewriting the stylesheets. Along the way I had a chance to compare the performance of the two XSLT processors I use: Altova XSLT and Saxon-B. The results were unexpected.

What follows is not a real benchmark. I simply took the average execution time over three test runs for some of the stylesheets I use.

Stylesheet                          Saxon-B [1] (compiled)  Saxon-B [2]  Saxon-SA  AltovaXSLT
Stylesheet #1 (input file: 45 MB)   11.218 [3]              11.156       11.484    108.031
Stylesheet #2 (input file: 1.5 MB)  3.296                   3.171        4.484     153.671
Stylesheet #3 (input file: 70 MB)   N/A                     77.453       N/A       ERR_OOM

[1] – A compiled stylesheet was used instead of raw XSLT. These results lead to an interesting conclusion: the popular assumption that a product that compiles to bytecode will necessarily be faster than an interpreter is WRONG. I will try to cover this topic in upcoming posts.

[2] – Settings for all Saxon runs are as follows: -l:off -dtd:off -tree:tiny

[3] – All results are in seconds
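The measurement described above (three runs, averaged) can be reproduced with a small harness. The following Python sketch is only an illustration, not the script actually used for the table: it assumes Saxon-B 9.x is available as `saxon9.jar`, and the helper names and file names are hypothetical. The `-l:off -dtd:off -tree:tiny` switches come from footnote [2]; `-s:`, `-xsl:` and `-o:` are the Saxon 9 command-line options for source, stylesheet and output.

```python
import time


def build_saxon_command(source, stylesheet, output, jar="saxon9.jar"):
    """Build a Saxon-B command line with the settings from footnote [2].

    The jar path and the file names are assumptions for illustration.
    """
    return [
        "java", "-cp", jar, "net.sf.saxon.Transform",
        "-l:off", "-dtd:off", "-tree:tiny",
        f"-s:{source}", f"-xsl:{stylesheet}", f"-o:{output}",
    ]


def average_seconds(run, repeats=3):
    """Call run() `repeats` times and return the mean wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

To time a real transformation, pass something like `lambda: subprocess.run(cmd, check=True)` as `run`, where `cmd` is the list returned by `build_saxon_command`. Note that timing the whole process this way includes JVM startup, which matters for the smaller inputs.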

Well, in some cases Saxon, which is pure Java, is up to 48 times faster than Altova, which is pure C++. Moreover, Altova consumes enormous amounts of memory and fails to process relatively small files (approximately 45 – 70 MB) on a 32-bit machine, while Saxon uses around 300 MB regardless of the input file size.
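For the record, the "48 times" figure is simply the ratio of the two average run times for stylesheet #2. A quick Python check, using the Saxon-B (not precompiled) and AltovaXSLT numbers from the table; stylesheet #3 is left out because Altova ran out of memory there:

```python
# Average run times in seconds, copied from the table above
# (Saxon-B without precompilation vs. AltovaXSLT).
timings = {
    "Stylesheet #1": {"saxon_b": 11.156, "altova": 108.031},
    "Stylesheet #2": {"saxon_b": 3.171, "altova": 153.671},
}


def speedup(saxon_seconds, altova_seconds):
    """Return how many times faster Saxon-B was than AltovaXSLT."""
    return altova_seconds / saxon_seconds


ratios = {name: speedup(t["saxon_b"], t["altova"]) for name, t in timings.items()}
for name, ratio in ratios.items():
    print(f"{name}: Saxon-B is {ratio:.1f}x faster")
# Stylesheet #1: Saxon-B is 9.7x faster
# Stylesheet #2: Saxon-B is 48.5x faster
```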

So right now the dictionary is processed in 77 seconds and loaded into the DB in less than 5 seconds. Not bad I think…


Got data?

Posted in development on September 16th, 2008 by admin

Yesterday I finished the first version of a fancy XSLT 2.0 stylesheet that transforms JMDict into insert statements; today I tried to run it on a 70 MB file. The results are as follows:

  • AltovaXML – allocated 1.5 GB of RAM and died after 30 minutes.
  • MSXSL – crashed after 15 minutes.
  • Saxon SA – died immediately with a class cast exception.
  • Saxon-B – gobbled 300 MB of RAM and successfully finished processing in 218 minutes.

After 15 minutes of work on the template, Saxon processed it in about an hour. Still not good. The reason: an enormous number of cross references, each of which requires iterating over the whole document to find the exact match.
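To illustrate why the cross references are so expensive, and one common way around it, here is a minimal Python sketch. It is not the stylesheet itself, and the entry structure and names are hypothetical. Resolving each reference by scanning all entries costs time proportional to the document size per lookup, whereas building an index once (in XSLT this is the job of `xsl:key` and the `key()` function) turns each lookup into a constant-time dictionary access.

```python
# Hypothetical, heavily simplified dictionary entries: each has an id,
# a headword, and cross references to other headwords.
entries = [
    {"id": 1, "headword": "alpha", "xrefs": ["gamma"]},
    {"id": 2, "headword": "beta", "xrefs": ["alpha"]},
    {"id": 3, "headword": "gamma", "xrefs": ["alpha", "beta"]},
]


def resolve_by_scan(entries):
    """Resolve each cross reference by scanning all entries: O(n) per lookup."""
    resolved = {}
    for entry in entries:
        resolved[entry["id"]] = [
            next(e["id"] for e in entries if e["headword"] == ref)
            for ref in entry["xrefs"]
        ]
    return resolved


def resolve_by_index(entries):
    """Build a headword -> id index once, then resolve each reference in O(1)."""
    index = {e["headword"]: e["id"] for e in entries}
    return {e["id"]: [index[ref] for ref in e["xrefs"]] for e in entries}


assert resolve_by_scan(entries) == resolve_by_index(entries)
print(resolve_by_index(entries))
# {1: [3], 2: [1], 3: [1, 2]}
```

Both functions return the same result; the difference only shows up at scale, where the scanning version does work proportional to the number of entries times the number of references.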

By the time I got there, I had abandoned the idea of loading the data using inserts. We’ll use bulk load instead, so another optimization round is expected tomorrow. However, out of curiosity I tried to import the ‘insert-like’ data. It took approximately 6 hours :)
