Approximate/fuzzy string search in PHP

This PHP class, approximate-search.phps, provides non-exact text search (often called fuzzy search or approximate matching).

It allows you to specify a Levenshtein edit distance treshold, i.e. an error limit for a match. For example, a search for kamari with a treshold of 1 errors would match kamari, kammari, kaNari and kamar but not kaNar.

The code is optimized for repeated searching of the same string, e.g. walking through rows of a database.

Usage example

$search = new Approximate_Search( $patt, $max_err );
if ( $search->too_short_err )
    $error = "Unable to search - use longer pattern " .
             "or reduce error tolerance.";

while( $text = /* get some more text */)
{
    $matches = $search->search( $text );
    while( list($i,) = each($matches))
      print "Match that ends at $i.\n";
}

BUG FIX! 2006-08-27

The previous version had a little bug that in some (quite rare) cases prevented the code from finding a match.

Algorithm

The program basically consist of two phases and some optimizations. Without going into details, here's a short overview of the idea:

Suppose we have an n character long search pattern with k errors allowed.

  1. Pruning / filtering. First partition the search pattern into k+2 parts. By the pigeonhole principle (of combinatorics), at least 2 of these parts must be uncorrupted in each occurrence of the pattern. We search the full text for these parts first and discard immediately those portions of the text that don't contain at least 2 of the patterns sufficiently near each other and in the correct order.
  2. Fine search. For the remaining candidates, do a finer search using an (n+1) x (k+1) grid shaped nondeterministic state machine (NFA) whose horizontal edges represent character matches, vertical edges additions and diagonal edges deletions. This one's a bit difficult to explain without some drawings, so I'd suggest you take a look at, for example, Baeza-Yates & Navarro: "Faster Approximate String Matching". It describes a very low level optimization method that I'm not using (as it would probably be slower in PHP) but it also explains the basic version quite well.

License

This is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License (LGPL) as published by the Free Software Foundation; either version 2.1, or (at your option) any later version.