String comparison in PHP

PHP ships with several native mechanisms for comparing arbitrary string values. Let's take a closer look at two that quantify "sameness."

There are multiple ways to compare strings in PHP. The simplest way is to check whether two strings are equal, using either loose equality (==) or strict equality (===). In either case, PHP gives you a Boolean answer as to whether the two strings are the same, although loose equality compares two numeric strings by their numeric value rather than character by character.
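As a quick illustration of how the two operators can disagree, here is a minimal sketch. The variable names are placeholders chosen for this example.

```php
<?php
$a = '1e3';
$b = '1000';

// Loose equality: both operands are numeric strings, so PHP compares them as numbers.
var_dump($a == $b);  // bool(true)

// Strict equality: the strings must be identical, character for character.
var_dump($a === $b); // bool(false)

// For ordinary, non-numeric text the two operators agree.
var_dump('stronger' == 'stranger');   // bool(false)
var_dump('stronger' === 'stronger');  // bool(true)
```

When the inputs might look like numbers, strict equality is usually the safer default.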

But sometimes you want to gauge how similar two strings actually are. In that case, we have two different options for determining the “sameness” of arbitrary strings.

Levenshtein distance

The Levenshtein distance between two strings represents the minimum number of characters that need to be changed in one string to produce the other. For example, given the strings 'stronger' and 'stranger', it’s apparent at a glance that they differ by a single character. PHP can tell you this programmatically:

echo levenshtein('stronger', 'stranger'); // 1

In cases where the two strings differ in length, the Levenshtein distance also counts the characters that need to be added or deleted to convert between them. It’s a great way to quickly check simple strings for typographical or clerical errors. Unfortunately, it also has its limitations.

For example, both 'stronger' and 'strangers' have a Levenshtein distance of 1 when compared to 'stranger', even though the differences are of quite different kinds: one swaps a letter, while the other adds one.
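One way to tell those two cases apart is to pass custom costs to levenshtein(), which accepts optional insertion, replacement, and deletion costs after the two strings. The sketch below is only an illustration, and the weights (1, 2, 1) are arbitrary values picked for this example rather than recommended settings.

```php
<?php
// Default costs: a substitution and an insertion each count as 1.
echo levenshtein('stranger', 'stronger'), PHP_EOL;  // 1 (replace "a" with "o")
echo levenshtein('stranger', 'strangers'), PHP_EOL; // 1 (append "s")

// Illustrative weights: insertion = 1, replacement = 2, deletion = 1.
// A swapped letter now costs more than an added one.
echo levenshtein('stranger', 'stronger', 1, 2, 1), PHP_EOL;  // 2
echo levenshtein('stranger', 'strangers', 1, 2, 1), PHP_EOL; // 1
```

With these weights, the two comparisons no longer collapse to the same score.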

Hamming distance

Another helpful quantifier of string sameness is the Hamming distance. Similar to a Levenshtein distance, a Hamming calculation identifies the number of changes required to make one string equal to another. The biggest differences are that a Hamming distance only works on strings of equal length and, at least in PHP, operates on the bits of a number rather than on generic text.

If you already have raw binary, PHP’s gmp_hamdist() function (part of the GMP extension, which must be enabled) can calculate a Hamming distance for you automatically:

echo gmp_hamdist(0b1001010011, 0b1011111100); // 6
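To see where that 6 comes from, the same count can be reproduced with plain integer operators. This is a rough sketch for illustration, not how GMP computes the result internally, and the variable names are made up for the example.

```php
<?php
$a = 0b1001010011;
$b = 0b1011111100;

// XOR leaves a 1 bit wherever the two inputs disagree.
$diff = $a ^ $b;

// Print both inputs and their XOR as zero-padded, 10-digit binary.
printf("%010b\n%010b\n%010b\n", $a, $b, $diff);

// Counting the 1 bits in the XOR result gives the Hamming distance.
echo substr_count(decbin($diff), '1'), PHP_EOL; // 6
```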

If we want to operate on arbitrary strings, we need to wrap this function in one of our own that converts each character to its numeric byte value. Then we can compare the results of a Levenshtein calculation with that of a Hamming one:

function hamming_distance(string $first, string $second): int
{
  // strlen() counts bytes, so multibyte characters are compared byte by byte.
  $length = strlen($first);

  if (strlen($second) !== $length) {
    throw new LengthException('Input strings must be of equal length!');
  }

  // Sum the number of differing bits at each byte position.
  $result = 0;
  for ($i = 0; $i < $length; $i++) {
    $result += gmp_hamdist(ord($first[$i]), ord($second[$i]));
  }

  return $result;
}

echo hamming_distance('stronger', 'stranger'); // 3
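To put the two measures side by side, the sketch below prints both for the same pair of strings. It assumes the GMP extension may not be available, so it swaps in a hypothetical helper, hamming_distance_bytes(), that counts differing bits with PHP's string XOR operator instead of gmp_hamdist(). The helper name and approach are illustrative only.

```php
<?php
// Hypothetical GMP-free helper: counts differing bits between equal-length strings.
function hamming_distance_bytes(string $first, string $second): int
{
  if (strlen($first) !== strlen($second)) {
    throw new LengthException('Input strings must be of equal length!');
  }

  $bits = 0;
  // XOR on two strings works byte by byte; 1 bits mark the differences.
  foreach (str_split($first ^ $second) as $byte) {
    $bits += substr_count(decbin(ord($byte)), '1');
  }

  return $bits;
}

echo levenshtein('stronger', 'stranger'), PHP_EOL;            // 1
echo hamming_distance_bytes('stronger', 'stranger'), PHP_EOL; // 3
```

The single changed letter counts as one edit for Levenshtein, while the Hamming figure reflects the three bits that differ between "o" and "a".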

Either technique is valid, and each serves well in different circumstances. In which situations might you need a way to quantify the “sameness” of two strings?
