Porter2 and regexp kung-fu

To improve search quality I needed stemming algorithm. Porter2 seemed to be the best choice. However I realized that the only reference implementation exists is written on Snowball.

Now I’ll be throwing stones to Snowball. I really cannot get people who handcrafted this language. Its unreadability can be compared to perl, but the syntax and expression possibilities are really limited.

Can you tell me for sure what this piece of code is doing?


[substring] among (
'eed' 'eedly'
(R1 <-'ee')
'ed' 'edly' 'ing' 'ingly'
(
test gopast v delete
test substring among(
'at' 'bl' 'iz'
(<+ 'e')
'bb' 'dd' 'ff' 'gg' 'mm' 'nn' 'pp' 'rr' 'tt'
([next] delete)
'' (atmark p1 test shortv <+ 'e'
)
)
)

So I finally sat and implemented PHP5 version of this algorithm.

This version relies heavily on regular expressions. It takes 0.045 seconds to process one stem. And I know that the speed can be greatly improved. But this is a good starting point.

<?php
/**
 * Copyright (C) 2008 - 2009 Kanjibox
 *
 * @version 1.0
 * @see     http://www.kanjibox.com
 * @author  Kaibun
 */

require_once 'Zend/Cache.php';

/**
 * This class is a PHP5 implementation of Porter2 algorithm.
 * @see http://snowball.tartarus.org/algorithms/english/stemmer.html
 */
class KB_EnglishStemmer {
    /**
     * Cache for mass stemmings. Do not use for mass stemmings
     * of *different* words. Use for stemming words from documents
     * for example, where the same word may crop up multiple times.
     * @var Zend_Cache
     */
    private static $cache;


    private $r1 = '';
    private $r2 = '';
    private $found_y = false;

    /**
     * Define a short syllable in a word as either
     * (a) a vowel followed by a non-vowel other than w, x or Y and preceded by a non-vowel, or
     * (b) a vowel at the beginning of the word followed by a non-vowel.
     * So rap, trap, entrap end with a short syllable, and ow, on, at are classed as short syllables.
     * But uproot, bestow, disturb do not end with a short syllable.
     *
     */
    private static $short_syllable_regexp = '/([^aeiouy][aeiouy][^aeiouywxY]$)|(^[aeiouy][^aeiouy]$)/';


    /**
     * Stems a word.
     *
     * @param  string $word  Word to stem.
     * @param  bool   $cache Whether to use caching.
     * @return string        Stemmed word.
     */
    public function stem($word, $cache = false) {

        $word = ltrim( $word, "\r\n" );
        $word = rtrim( $word, "\r\n" );

        if (strlen($word) <= 2) {
            return $word;
        }

        if ( $cache && $result = get_cached_word( $word ) ) {
            return $result;
        }
        $exception1 = array(
            'skis' => 'ski',
            'skies' => 'sky',
            'dying' => 'die',
            'lying' =>  'lie',
            'tying' =>  'tie',
            /* special -LY cases */
            'idly' => 'idl',
            'gently' => 'gentl',
            'ugly' => 'ugli',
            'early' => 'earli',
            'only' => 'onli',
            'singly' => 'singl',
            /* invariant forms: */
            'sky' => 'sky',
            'news' => 'news',
            'howe' => 'howe',
            'atlas' => 'atlas',
            'cosmos' => 'cosmos',
            'bias' => 'bias',
            'andes' => 'andes'
        );

        $stem = '';
        foreach( $exception1 as $k => $v ) {
            if( $word == $k ) {
                $stem = $v;
                break;
            }
        }
        if( $stem == '') {
            $stem = $this->prelude ($word);
            $stem = $this->step0   ($stem);
            $stem = $this->step1a  ($stem);

            $exception2 = array(
                'inning', 'outing', 'canning', 'herring', 'earring',
                'proceed', 'exceed', 'succeed'
            );

            $found = false;
            foreach( $exception2 as $v ) {
                if( $stem == $v ) {
                    $found = true;
                    break;
                }
            }

            if( $found === false ) {
                $stem = $this->step1b  ($stem);
                $stem = $this->step1c  ($stem);
                $stem = $this->step2   ($stem);
                $stem = $this->step3   ($stem);
                $stem = $this->step4   ($stem);
                $stem = $this->step5   ($stem);
            }

            $stem = $this->postlude($stem);

        }

        if ( $cache ) {
            cache_word( $word, $stem );
        }

        return $stem;
    }

    /**
     * Returns cached stem or false.
     * @param  string  $word Word to find.
     * @return mixed|false Stem of the word.
     */
    private function get_cached_word( $word ) {
        ensure_cache_is_created();
        return self::$cache->load( $word );
    }

    /**
     * Stores word to cache.
     */
    private function cache_word( $word, $stem ) {
        ensure_cache_is_created();
        return self::$cache->save( $stem, $word );
    }

    /**
     * Ensures that cache is created before use.
     */
    private function ensure_cache_is_created() {
        if( ! self::$cache instanceof Zend_Cache ) {
            define ( 'ROOT_DIR', dirname( dirname( dirname ( __FILE__ ) ) ) );

            $frontendOptions = array(
                'cache_id_prefix' => __CLASS__,
                   'lifetime' => 3600,
                   'automatic_serialization' => false
            );

            $backendOptions = array(
                'cache_dir' => ROOT_DIR . '/data/tmp/cache',
                'file_locking' => false
            );

            self::$cache = Zend_Cache::factory('Core', 'File', $frontendOptions, $backendOptions );
        }
        return $this;
    }

    private function prelude($word) {
        // remove initial ' if present
        $word = ltrim( $word, "'" );

        $this->found_y = false;

        // set initial y, or y after a vowel, to Y, and then establish the regions R1 and R2
        $regex_vowel = '/^y|(?<=[aeiouy])y/';
        if( preg_match($regex_vowel, $word  ) ) {
            $word = preg_replace($regex_vowel, 'Y', $word );
            $this->found_y = true;
        }

        $this->update_r1r2($word);
        return $word;
    }

    private function postlude($word) {
        // revert Y back to y
        if( $this->found_y ) {
            $word = str_replace('Y', 'y', $word );
            $this->found_y = false;
        }
        return $word;
    }

    private function update_r1r2($word) {
        $r_regexp = '/(?<=[aeiouy][^aeiouy]).+/';
        $ex_regexp = '/^(gener|commun|arsen)(.*)/';

        // R1 is the region after the first non-vowel following a vowel, or the
        // end of the word if there is no such non-vowel. (This definition may be modified for certain
        // exceptional words — see below.)
        // R2 is the region after the first non-vowel following a vowel in R1, or the end of the word
        // if there is no such non-vowel. (See note on R1 and R2.)

        $ex_matches = array();
        $r1_matches = array();
        $r2_matches = array();

        $this->r1 = '';
        $this->r2 = '';

        if( preg_match($ex_regexp, $word, $ex_matches) ) {
            $this->r1 = $ex_matches[2];

        } else if( preg_match($r_regexp, $word, $r1_matches) ) {
            assert( ! empty($r1_matches) && count($r1_matches) == 1 );
            $this->r1 = $r1_matches[0];
        }

        if( $this->r1 != '' ) {
            if( preg_match($r_regexp, $this->r1, $r2_matches ) ) {
                assert( ! empty($r2_matches) && count($r2_matches) == 1 );
                $this->r2 = $r2_matches[0];
            }
        }
    }

    private function step0($word) {
        // search for the longest among the suffixes, ' 's 's' and remove if found
        $suffixes = array("/'s'$/" => "", "/'s$/" => "", "/'$/" => "" );
        $word = preg_replace( array_keys($suffixes), array_values($suffixes), $word );
        $this->update_r1r2($word);
        return $word;
    }

    private function step1a($word) {
        // sses replace by ss
        $count = 0;
        $word = preg_replace( '/sses$/', 'ss', $word, 1, $count );
        if( $count == 0 ) {
            // ied  ies replace by i if preceded by more than one letter, otherwise
            // by ie (so ties -> tie, cries -> cri)
            $replace = 'ie';
            if( strlen($word) > 4 ) {
                $replace = 'i';
            }
            $word = preg_replace( '/(ied|ies)$/', $replace, $word, 1, $count );
        }

        // us ss do nothing
        if( $count == 0 )    {
            $count = preg_match('/(us|ss)$/', $word);
        }

         if( $count == 0 ) {
            // s delete if the preceding word part contains a vowel not immediately before the s
            // (so gas and this retain the s, gaps lose it)
            if( substr($word, -1) == 's' ) {
                $word = preg_replace( '/(.*[aeiouy].*.)s$/', '$1', $word );
            }
        }

        $this->update_r1r2( $word );

        return $word;
    }

    private function step1b($word) {
        // Search for the longest among the following suffixes, and perform the action indicated.
        // eed   eedly replace by ee if in R1
        // ed   edly   ing   ingly delete if the preceding word part contains a vowel,
        // and after the deletion:
        // if the word ends at, bl or iz add e (so luxuriat -> luxuriate), or
        // if the word ends with a double remove the last letter (so hopp -> hop), or
        // if the word is short, add e (so hop -> hope)
        $eed_regexp = '/(eed|eedly)$/';
        $m1 = array();
        if( preg_match($eed_regexp, $word, $m1 ) ) {
            if( count($m1) > 0 && strlen($m1[0]) <= strlen($this->r1) ) {
                $word = preg_replace( $eed_regexp, 'ee', $word );
            }
        } else {
            $count = 0;
            $word = preg_replace( '/^(.*(?:[aeiouy]).*)(?:ed|edly|ing|ingly)$/', '$1', $word, 1, $count );
            if( $count > 0 ) {
                $last2 = substr($word, -2);
                if( $last2 == 'at' || $last2 == 'bl' || $last2 == 'iz' ) {
                    $word .= 'e';
                }
                $count = 0;
                $doubles = array( '/bb$/', '/dd$/', '/ff$/', '/gg$/', '/mm$/', '/nn$/', '/pp$/', '/rr$/', '/tt$/' );
                foreach( $doubles as $dd ) {
                    if( preg_match( $dd, $word ) ) {
                        $word = substr($word, 0, -1);
                        $count = 1;
                        break;
                    }
                }
                if( $count == 0 ) {
                    $this->update_r1r2($word);
                    // A word is called short if it ends in a short syllable, and if R1 is null.
                    // So bed, shed and shred are short words, bead, embed, beds are not short words
                    if( $this->r1 == '' && preg_match( self::$short_syllable_regexp, $word ) ) {
                        $word .= 'e';
                    }
                }
            }
        }
        $this->update_r1r2($word);
        return $word;
    }

    private function step1c($word) {
        // replace suffix y or Y by i if preceded by a non-vowel which is not the first letter
        // of the word (so cry -> cri, by -> by, say -> say)
        $word = preg_replace( '/(.+[^aeiouy])(?:y|Y)$/', '$1i', $word );

        $this->update_r1r2($word);
        return $word;
    }

    private function step2($word) {
        // Search for the longest among the following suffixes, and,
        // if found and in R1, perform the action indicated.
        // tional:                   replace by tion
        // enci:                       replace by ence
        // anci:                       replace by ance
        // abli:                       replace by able
        // entli:                   replace by ent
        // izer   ization:           replace by ize
        // ational   ation   ator:  replace by ate
        // alism   aliti   alli:       replace by al
        // fulness:                   replace by ful
        // ousli   ousness:           replace by ous
        // iveness   iviti:          replace by ive
        // biliti   bli:               replace by ble
        // ogi:                       replace by og if preceded by l
        // fulli:                    replace by ful
        // lessli:                   replace by less
        // li:                       delete if preceded by a valid li-ending
        $suffixes = array (
            '/tional$/' => '$1tion',
            '/enci$/' => '$1ence',
            '/anci$/' => '$1ance',
            '/abli$/'=> '$1able',
            '/entli$/'=> '$1ent',
            '/(izer|ization)$/'   => '$2ize',
            '/(ational|ation|ator)$/' => '$2ate',
            '/(alism|aliti|alli)$/'  => '$2al',
            '/fulness$/' => '$1ful',
            '/(ousli|ousness)$/' => '$2ous',
            '/(iveness|iviti)$/' => '$2ive',
            '/(biliti|bli)$/'  => '$2ble',
            '/(?<=l)ogi$/' => '$1og',
            '/fulli$/' => '$1ful',
            '/lessli$/' => '$1less',
            '/(?<=[cdeghkmnrt])li$/' => '$1'
        );

        $i   = 0;
        $idx = '';
        $len = 0;
        foreach ( $suffixes as $ss => $vv) {
            $m1 = array();
            if( preg_match($ss, $word, $m1) /*&& count($m1) && strlen($m1[0]) <= strlen($this->r1)*/ ) {
                if( strlen($m1[0]) > $len ) {
                    $len = strlen($m1[0]);
                    $idx = $ss;
                }
            }
            ++$i;
        }
        if( $len > 0 && preg_match($idx, $word, $m1) && count($m1) && strlen($m1[0]) <= strlen($this->r1) ) {
            $word = preg_replace( $idx, $suffixes[$idx], $word);
        }

        $this->update_r1r2($word);
        return $word;
    }

    private function step3($word) {

        // Search for the longest among the following suffixes, and, if found and in R1,
        // perform the action indicated.
        // tional:                   replace by tion
        // ational:                   replace by ate
        // alize:                   replace by al
        // icate   iciti   ical:    replace by ic
        // ful   ness:               delete
        // ative:                   delete if in R2

$dbg = 'abjectness' == $word;

        $suffixes = array (
            '/tional$/' => '$1tion',
            '/ational$/' => '$1ate',
            '/alize$/' => '$1al',
            '/(icate|iciti|ical)$/' => '$2ic',
            '/(ful|ness)$/' =>  '$2'
        );

        $i   = 0;
        $idx = '';
        $len = 0;
        foreach( $suffixes as $ss => $vv ) {
            $m1 = array();
            if( preg_match($ss, $word, $m1) && count($m1) && strlen($m1[0]) <= strlen($this->r1) ) {
            if( strlen($m1[0]) > $len ) {
                    $len = strlen($m1[0]);
                    $idx = $ss;
                }
            }
        }
if($dbg ) { KB_Debug::dump($idx, 'idx ');}
        $m1 = array();
        if( $len > 0 && preg_match($idx, $word, $m1) && count($m1) && strlen($m1[0]) <= strlen($this->r1) ) {
            $word = preg_replace( $idx, $suffixes[$idx], $word);
if($dbg ) { KB_Debug::dump($word, 'word ');}
        } else {
            $ative_regexp = '/ative$/';
            $m2 = array();
            if( preg_match( $ative_regexp, $word, $m2 ) && count($m2) > 0 && strlen($m2[0]) <=strlen($this->r2) ) {
                $word = preg_replace( $ative_regexp, '$1', $word);
            }
        }

        $this->update_r1r2($word);
        return $word;
    }

    private function step4($word) {
        // Search for the longest among the following suffixes, and, if found and in R2,
        // perform the action indicated.
        // al   ance   ence   er   ic   able   ible   ant   ement   ment   ent   ism   ate   iti   ous   ive   ize delete
        // ion delete if preceded by s or t
        $expr1 = '/(al|ance|ence|er|ic|able|ible|ant|ement|ment|ent|ism|ate|iti|ous|ive|ize)$/';
        $matches1 = array();

        if( preg_match($expr1, $word, $matches1) && count($matches1) > 0 && strlen($matches1[0]) <= strlen($this->r2) ) {
            $word = preg_replace( $expr1, '$2', $word);
        }
        $expr2 = '/(.*[s|t])(?:ion)$/';
        if( preg_match('/ion$/', $this->r2) ) {
            $word = preg_replace( $expr2, '$1', $word);
        }

        $this->update_r1r2($word);
        return $word;
    }

    private function step5($word) {
        // Search for the the following suffixes, and, if found, perform the action indicated.
        // e delete if in R2, or in R1 and not preceded by a short syllable
        // l delete if in R2 and preceded by l

        $e_regexp = '/e$/';
        // 'e' (R2 or (R1 not shortv) delete)
        $m1 = array();
        $m2 = array();
        $count = 0;
        if( ( preg_match($e_regexp, $word, $m1) && count($m1) > 0 && strlen($m1[0]) <= strlen($this->r1)
            && ! preg_match( self::$short_syllable_regexp, substr($word, 0, -1) ) )
            || ( preg_match($e_regexp, $word, $m2) && count($m2) > 0 && strlen($m2[0]) <= strlen($this->r2) ) ) {
            $word = preg_replace($e_regexp, '$1', $word);
            $count = 1;
        }
        $this->update_r1r2($word);
        $l_regexp = '/(?<=l)l$/';
        $m3 = array();
        if( $count == 0 && preg_match($l_regexp, $word, $m3) && count($m3) >0 && strlen($m3[0]) <= strlen($this->r2) ) {
            $word = preg_replace($l_regexp, '$1', $word);
        }

        $this->update_r1r2($word);
        return $word;
    }
}
?>
Tags: ,

4 Responses to “Porter2 and regexp kung-fu”

  1. Timur Alhimenkov Says:

    Wow! Thank you very much!
    I always wanted to write in my blog something like that. Can I take part of your post to my blog?
    Of course, I will add backlink?

    Regards, Timur I.

  2. Saurooon Says:

    Greatings,
    I have already seen it somethere

    Thank you
    Saurooon

  3. admin Says:

    Well, if you have seen it, that means that somebody simply took my post and used it either by claiming himself an author or just referencing my post… Well, that’s expected when you publish the source code :)

  4. admin Says:

    Sure. Just keep the reference to the original author :) All improvements are highly appreciated.

    P.S. And sorry for delaying the answer.

Leave a Reply