Thinking about how to learn (part 1)

Posted in japanese on September 25th, 2008 by admin

Kanji… List of kanji… For many people these are synonyms. It’s quite natural for many learners of Japanese to think of kanji as a long list of characters that should be indexed, graded and memorized. You will find lots of pre-cooked lists and will most likely fall into the trap.

Flash card programs, paper flash cards, books like Heisig’s ‘Remembering the Kanji’, JLPT-based lists, and the 常用漢字 (jōyō kanji, the official “regular-use” list) on top of it all.

In my opinion, the worst problem with those pre-cooked lists is that beginners try to use them directly in their studies. One sees the 常用漢字 list and thinks, ‘This list has grades and is arranged according to frequency of usage. So if I make flash cards and memorize 5 kanji a day, I will be able to learn them all in 13 months.’ Others go a little bit farther: they take into account additional factors such as the time required to revisit already learned symbols, or summer vacation. Even then, this way of thinking leads only to frustration once you actually start trying to nail these characters down.
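The arithmetic behind that 13-month estimate is easy to reproduce. A minimal sketch, assuming a 30-day month (my assumption, not something the list prescribes) and ignoring review time entirely:

```python
import math

KANJI_TOTAL = 1945    # size of the 常用漢字 list quoted in this post
PER_DAY = 5           # new kanji per day, as in the beginner's plan
DAYS_PER_MONTH = 30   # assumption: a rough average month


def days_needed(total: int, per_day: int) -> int:
    """Days required to see every character once, with no reviews at all."""
    return math.ceil(total / per_day)


days = days_needed(KANJI_TOTAL, PER_DAY)
months = days / DAYS_PER_MONTH
print(f"{days} days, about {months:.0f} months")
```

The number looks manageable on paper, which is exactly why the plan is so tempting.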

What many people don’t understand is that you have to be a real genius to memorize all 1945 characters with absolutely no context. Even if you somehow managed to remember all the kanji from the list, you should be aware that they are not real words, and you would have no idea how to turn these pictograms into meaningful language primitives (I mean words, of course). You would have no idea how to read them or how to use them.

Even if you put aside the fact that you cannot really use those ready-made kanji lists, how useful are these kanji inventories? Let’s take 常用漢字 as an example. Why am I supposed to learn 「亜」 but not the kanji for the word “who” (誰)? Have you ever seen, even once, 「アジア」 and 「アメリカ」 written as 「亜細亜」 and 「亜米利加」? Moreover, this kanji comes FIRST in the list. Why can’t I find 「枕」 (pillow) among these 1945 characters? Don’t you use a pillow every single day of your life? But you definitely should know that 「斤」 means 1.32 lb.

So what’s the bottom line for this post? Throw away all your flash cards? No, I’m not advocating throwing your stuff out. I’m simply saying that we should always think of a list not as a goal but as an aid.


Saxon thrashes Altova…

Posted in development on September 22nd, 2008 by admin

This Sunday I was busy trying to optimize the data load process. In fact, I ended up completely rewriting the stylesheets. Along the way I had a chance to compare the performance of the 2 XSLT processors I use: Altova XSLT and Saxon-B. The results were unexpected.

What you can find below is not a real benchmark. I simply took the average execution time over 3 test runs for some of the stylesheets I use.
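For anyone who wants to repeat this kind of rough measurement, here is a minimal sketch of averaging wall-clock time over a few runs. It is not the script used for the numbers in this post; the function name `average_runtime` is made up for illustration, and the command you pass in would be whatever invocation your XSLT processor needs.

```python
import statistics
import subprocess
import sys
import time


def average_runtime(command: list[str], runs: int = 3) -> float:
    """Run `command` `runs` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True makes a failed run raise instead of silently skewing the average.
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# Placeholder command (starts Python and exits); replace with the real processor call.
placeholder_command = [sys.executable, "-c", "pass"]
print(f"{average_runtime(placeholder_command):.3f} s")
```

Three runs are enough to smooth out an obvious outlier, but not enough to call the result a benchmark.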

                                    Saxon-B [1]   Saxon-B [2]   Saxon-SA   AltovaXSLT
                                    (compiled)
Stylesheet #1 (input file: 45 MB)   11.218 [3]    11.156        11.484     108.031
Stylesheet #2 (input file: 1.5 MB)  3.296         3.171         4.484      153.671
Stylesheet #3 (input file: 70 MB)   N/A           77.453        N/A        ERR_OOM

[1] – A compiled stylesheet was used instead of raw XSLT. These results lead to an interesting conclusion: the popular assumption that a product that compiles to bytecode will necessarily be faster than an interpreter is WRONG. I will try to cover this topic in a future post.

[2] – Settings for all Saxon runs were as follows: -l:off -dtd:off -tree:tiny

[3] – All results are in seconds

Well, in some cases Saxon, which is pure Java, is up to 48 times faster than its pure C++ rival (Stylesheet #2: 3.171 s versus 153.671 s). Moreover, Altova consumes enormous amounts of memory and fails to process relatively small files (approximately 45 to 70 MB) on a 32-bit machine, while Saxon uses around 300 MB regardless of the input file size.
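The ratios can be rechecked straight from the table. A small sketch using the uncompiled Saxon-B column and the AltovaXSLT column; Stylesheet #3 is left out because Altova ran out of memory there, so there is nothing to divide:

```python
# Timings in seconds, copied from the table (uncompiled Saxon-B vs. AltovaXSLT).
timings = {
    "Stylesheet #1": {"saxon_b": 11.156, "altova": 108.031},
    "Stylesheet #2": {"saxon_b": 3.171, "altova": 153.671},
}


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the quicker run is than the slower one."""
    return slow_seconds / fast_seconds


for name, t in timings.items():
    ratio = speedup(t["altova"], t["saxon_b"])
    print(f"{name}: Saxon-B is {ratio:.1f}x faster than AltovaXSLT")
```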

So right now the dictionary is processed in 77 seconds and loaded into the DB in less than 5 seconds. Not bad, I think…


Got data?

Posted in development on September 16th, 2008 by admin

Yesterday I finished the first version of a fancy XSLT 2.0 stylesheet that transforms JMDict into insert statements; today I tried to run it on a 70 MB file. The results were as follows:

  • AltovaXML – allocated 1.5 GB of RAM and died after 30 minutes.
  • MSXSL – crashed after 15 minutes.
  • Saxon SA – died immediately with a class cast exception.
  • Saxon-B – gobbled 300 MB of RAM and successfully finished processing in 218 minutes.

After 15 minutes of tuning, Saxon processed the template in about an hour. Still not good. The reason: an enormous number of cross-references, each of which requires iterating over the whole document to find an exact match.
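To show why those cross-references hurt, here is a minimal Python sketch of the two approaches. The entry layout (`id`, `headword`, `xrefs`) is hypothetical and much simpler than the real dictionary file, and this is an illustration of the general idea rather than what the stylesheet does. In XSLT the same idea is usually expressed with `xsl:key` and the `key()` function.

```python
from collections import defaultdict

# Hypothetical, simplified entries: cross-references name other entries by headword.
entries = [
    {"id": 1, "headword": "犬", "xrefs": ["猫"]},
    {"id": 2, "headword": "猫", "xrefs": ["犬"]},
    {"id": 3, "headword": "枕", "xrefs": []},
]


def resolve_by_scan(entries):
    """Slow variant: every cross-reference walks the whole list of entries."""
    resolved = {}
    for entry in entries:
        resolved[entry["id"]] = [
            other["id"]
            for target in entry["xrefs"]
            for other in entries
            if other["headword"] == target
        ]
    return resolved


def resolve_by_index(entries):
    """Faster variant: build a headword -> ids index once, then do direct lookups."""
    index = defaultdict(list)
    for entry in entries:
        index[entry["headword"]].append(entry["id"])
    return {
        entry["id"]: [i for target in entry["xrefs"] for i in index.get(target, [])]
        for entry in entries
    }


print(resolve_by_scan(entries) == resolve_by_index(entries))
```

The scan does work proportional to the number of entries for every single reference, while the index pays that cost once up front and then answers each lookup directly.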

By the time I got there, I had buried the idea of loading the data using inserts. We’ll use bulk load instead, so another optimization round is expected tomorrow. However, out of curiosity I tried to import the ‘insert-like’ data. It took approximately 6 hours :)


And the war began…

Posted in kanjibox on September 8th, 2008 by admin

Several years have passed since I first got involved with Japanese. Our relationship ebbed and flowed like the tide because of permanent crunch time on my projects, business trips and thousands of other things that normally fill your life the way sand fills the spaces between rocks and pebbles. However, I constantly found myself with the textbook at the most inappropriate times. I simply couldn’t resist…

Quite recently, the idea of how I should proceed in order to master the language crystallized in my head. It led me to register this domain and to a couple of sleepless nights spent working on the database structure.

The site is empty at the moment. But you’ll see the results soon…