Jul 16

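Here’s a code snippet fresh from my PHP n-gram search class. The $str argument
expects a string, $size is the length of the desired n-gram, and $clean lets us
opt out of some “clean-up”: non-alphanumeric characters are stripped from the string,
the string is lowercased, and duplicate n-grams are removed. Substrings at the end of
the string that are shorter than $size are kept too, as long as they are longer than
one character. The method both returns the array of n-grams and stores it in the
class property $arrNgrams.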

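<?php
  // Returns the character n-grams of $str and stores them in $this->arrNgrams.
  public function get_ngrams($str, $size = 5, $clean = true)
  {
	  if ($clean)
	  {
		  // Strip non-alphanumeric characters and lowercase what is left.
		  $str = strtolower(preg_replace("/[^A-Za-z0-9]/", '', $str));
	  }
	  // Initialise up front so empty or one-character input yields an empty array.
	  $arrNgrams = array();
	  $length = strlen($str);
	  for ($i = 0; $i < $length; $i++)
	  {
		  $potential_ngram = substr($str, $i, $size);
		  // Trailing substrings shorter than $size are kept; single characters are not.
		  if (strlen($potential_ngram) > 1)
		  {
			  $arrNgrams[] = $potential_ngram;
		  }
	  }
	  if ($clean)
	  {
		  // Drop duplicate n-grams (array_unique keeps the original keys).
		  $arrNgrams = array_unique($arrNgrams);
	  }
	  $this->arrNgrams = $arrNgrams;
	  return $arrNgrams;
  }
?>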
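
To see what the method above produces, here is a small standalone sketch of the same sliding-window logic. The function name ngrams_sketch is made up for this example and is not part of the class; it only mirrors the steps so the output can be checked outside the class.

```php
<?php
// Standalone sketch of the same logic (hypothetical helper, not the class method).
function ngrams_sketch($str, $size = 5, $clean = true)
{
    if ($clean) {
        // Same clean-up step: keep only letters and digits, then lowercase.
        $str = strtolower(preg_replace('/[^A-Za-z0-9]/', '', $str));
    }
    $ngrams = array();
    for ($i = 0; $i < strlen($str); $i++) {
        $gram = substr($str, $i, $size);
        // Keep anything longer than one character, including short trailing pieces.
        if (strlen($gram) > 1) {
            $ngrams[] = $gram;
        }
    }
    return $clean ? array_unique($ngrams) : $ngrams;
}

print_r(ngrams_sketch('Hello, World!'));
// hello, ellow, llowo, lowor, oworl, world, orld, rld, ld

print_r(ngrams_sketch('banana', 2));        // ba, an, na
print_r(ngrams_sketch('banana', 2, false)); // ba, an, na, an, na
```

Note the trailing "orld", "rld" and "ld" in the first call: they are shorter than $size but still pass the length check. If only full-length n-grams are wanted, comparing the substring length against $size instead would be one way to do it.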
