Porter2 and regexp kung-fu

December 23rd, 2008

To improve search quality I needed a stemming algorithm. Porter2 seemed to be the best choice. However, I realized that the only reference implementation that exists is written in Snowball.

Now I’ll be throwing stones at Snowball. I really cannot understand the people who handcrafted this language. Its unreadability can be compared to Perl's, but its syntax and expressive possibilities are really limited.

Can you tell me for sure what this piece of code is doing?

[substring] among (
    'eed' 'eedly'
        (R1 <-'ee')
    'ed' 'edly' 'ing' 'ingly'
        (
            test gopast v delete
            test substring among(
                'at' 'bl' 'iz'
                    (<+ 'e')
                'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
                    ([next] delete)
                '' (atmark p1 test shortv <+ 'e')
            )
        )
)
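
For the record, this fragment appears to be step 1b of Porter2: strip an "ed"/"ing"-style suffix, then patch up the stem. Below is a rough regexp-based sketch of that one step, written in Python only to keep it short and not taken from any PHP code. The function names (r1_start, is_short, step_1b) are made up for illustration, and the sketch assumes lowercase input and skips the y/Y marking and the special-case R1 prefixes of the full algorithm.

```python
import re

DOUBLES = ("bb", "dd", "ff", "gg", "mm", "nn", "pp", "rr", "tt")


def r1_start(word):
    """Index where R1 begins: just past the first vowel + non-vowel pair."""
    m = re.search(r"[aeiouy][^aeiouy]", word)
    return m.end() if m else len(word)


def is_short(word):
    """True if the word ends in a short syllable and R1 is empty."""
    short_syllable = (
        re.search(r"[^aeiouy][aeiouy][^aeiouywxY]$", word)
        or re.fullmatch(r"[aeiouy][^aeiouy]", word)
    )
    return bool(short_syllable) and r1_start(word) == len(word)


def step_1b(word):
    """Hypothetical sketch of Porter2 step 1b (simplified, see assumptions)."""
    # The leftmost match of an end-anchored alternation is the longest suffix.
    m = re.search(r"(eedly|eed|ingly|edly|ing|ed)$", word)
    if not m:
        return word
    suffix, stem = m.group(1), word[:m.start()]

    if suffix in ("eed", "eedly"):
        # (R1 <-'ee'): replace only if the suffix lies inside R1.
        return stem + "ee" if m.start() >= r1_start(word) else word

    # test gopast v delete: the part before the suffix must contain a vowel.
    if not re.search(r"[aeiouy]", stem):
        return word
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"      # (<+ 'e')
    if stem.endswith(DOUBLES):
        return stem[:-1]       # ([next] delete)
    if is_short(stem):
        return stem + "e"      # (atmark p1 test shortv <+ 'e')
    return stem


for w in ("agreed", "feed", "hopping", "hoping", "sing"):
    print(w, "->", step_1b(w))
```

Note that "feed" and "sing" come back unchanged: in the first the suffix is not inside R1, and in the second nothing vowel-like is left before "ing".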

So I finally sat down and implemented a PHP5 version of this algorithm.

