12
Oct
2011
10 Comments

Find similar products in MySQL using Levenshtein Distance

Many times on a product page you will want to show the shopper some similar products they might also be interested in. I struggled with this for a long time before I discovered this solution, which I think works very well. The Levenstein distance is a measure of how different 2 comparing strings are.

I found a great implementation for a Levenstein Distance function in MySQL by Jason Rust. I had to make one small adjustment to get MySQL to allow me to create it, which was simply changing the delimiter from ; to something else since the function definition contains semicolons in it.

So the updated code would be:

DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; 
    DECLARE s1_char CHAR; 
    -- max strlen=255 
    DECLARE cv0, cv1 VARBINARY(256); 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len DO 
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN  
            SET cost = 0; ELSE SET cost = 1; 
          END IF; 
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost; 
          IF c > c_temp THEN SET c = c_temp; END IF; 
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1; 
            IF c > c_temp THEN  
              SET c = c_temp;  
            END IF; 
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1; 
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF; 
    RETURN c; 
  END$$
And the helper function:
DELIMITER $$
CREATE FUNCTION levenshtein_ratio( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, max_len INT; 
    SET s1_len = LENGTH(s1), s2_len = LENGTH(s2); 
    IF s1_len > s2_len THEN  
      SET max_len = s1_len;  
    ELSE  
      SET max_len = s2_len;  
    END IF; 
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); 
  END$$

Once you have those defined you can use it to find related products by their vendor style number. Most manufacturers use a common scheme for their style numbers so the closest matching style numbers are usually the most similar products. With that in mind, you could now find simiilar products with a query like this:

SELECT *, levenshtein(style_number, "style-number-to-match") as related_score
  FROM products
  ORDER BY related_score
  LIMIT 2

This will give you the 2 most matching products (the first one being itself). Also note, this function is very time consuming so to speed it up I limited it to only products with the same brand. You also want to exclude the same product from returning in the result set.

SELECT *, levenshtein(style_number, "style-number-to-match") as related_score
  FROM products
  WHERE brand = "my-brand"
  AND style_number != "style-number-to-match"
  ORDER BY related_score
  LIMIT 2

Even still, if you have many products this seems too slow to use in production. So what I did was use ajax to find the related products after the page finished loading so it does not make the entire page slow. A small price to pay to get very accurate related products.

10 Comments

924d6cb86b5ca47c982519db3d388a05?s=40
malko
Wed Feb 22 16:41:36 2012
You just made my day !!!!
I would never gessed this was a delimiter problem that prevent me to create the function. Thank you again for sharing.
3926821eca5581a66daa4beeecac50a2?s=40
Laura
Sun Aug 19 22:42:19 2012
I am trying 

SELECT *, levenshtein("nombre", "pepiTo")  FROM proyectos as related_score

But the one entry in table proyecto with nombre= pepito is like 7th on the list, plus showing many other irrelevant results even if I limit the search. Can you help me?
96a4db75c09753b7d40e23e90a7ca784?s=40
Sun Aug 19 22:43:40 2012
Hi Laura, it looks like you are selecting the levenshtein distance, but not specifying that you want the results with lowest score (closest match) first. I think you got confused and specified the table as related_score instead of the column. Try changing your query to this:

SELECT *, levenshtein(nombre, “pepiTo”) AS related_score 
FROM proyectos 
ORDER BY related_score

If you want to only show the highest matching wihout all the irrelavent results try adding:

SELECT *, levenshtein(nombre, “pepiTo”) AS related_score 
FROM proyectos 
WHERE related_score < 2
ORDER by related_score

Also MySQL is case insensitive so I think (have not tested) pepiTo and pepito would have a distance/score of zero.

Hope that helps.

P.S. I just noticed one other thing, you are wrapping the column name (the first parameter in the function) with quotes. Don't do that, it should be: levenshtein(nombre, “pepiTo”)
3926821eca5581a66daa4beeecac50a2?s=40
Laura
Sun Aug 19 22:44:25 2012
It does work. The thing is nombre is equal to "pepeito juarez". Can this be combined with a LIKE statement or something
3926821eca5581a66daa4beeecac50a2?s=40
Laura
Sun Aug 19 22:53:48 2012
Hi, thank you for your answer. I hadn´t seen it when I wrote the next post
I got it working like this
SELECT nombre FROM proyectos WHERE levenshtein(nombre, "pepito juarez") < 2;
I would like it to work if even one of the words in the column matches (or part of a word), as a like statement.
Do you know if this is possible?
Thanks for you reply
Laura
96a4db75c09753b7d40e23e90a7ca784?s=40
Sun Aug 19 22:54:29 2012
What you are asking for is a lot more complex. You see, the distance from pepeito to pepito is 1 (adding the single e char before the i)

But the distance from pepeito to pepito juarez is 8 (the extra e in the first word is 1, the space, and the the 6 chars in the second word.

I do not know of any algorithm that will solve this for you. My best advice is store the first name and last name in separate columns and you won't have this issue.
Fe6098700651187cdfe9a0c19abb40c8?s=40
Sat Jan 18 02:49:22 2014
Sorry for the late reply but was looking for similar when I came across this blog.

Depending on your database version and storage engine you can use the FULLTEXT index and MATCH AGAINST to match phrases.

SELECT *, levenshtein_ratio(nombre, “pepiTo”) AS relevancy 
FROM proyectos 
HAVING relevancy > 40 OR MATCH( nombre) AGAINST( 'pepiTo' )
ORDER BY relevancy
See: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

There is also SOUNDS LIKE but is better for use with the english language.

But I agree with JD, it would be far more efficient to use separate columns for first and last name.
You could optionally also try something like:
SELECT *, levenshtein(nombre, “pepiTo”) AS related_score,
SUBSTRING_INDEX(SUBSTRING_INDEX(nombre, ' ', 1), ' ', -1) AS first_name,
	IF(LENGTH(nombre) - LENGTH(REPLACE(nombre, ' ', ''))>1, SUBSTRING_INDEX(SUBSTRING_INDEX(nombre, ' ', 2), ' ', -1), NULL) AS middle_name,
	SUBSTRING_INDEX(SUBSTRING_INDEX(nombre, ' ', 3), ' ', -1) AS last_name  
FROM proyectos
HAVING related_score < 2 OR levenshtein(first_name, “pepiTo”) < 2 OR levenshtein(middle_name, “pepiTo”) < 2 OR levenshtein(last_name, “pepiTo”) < 2
125694255d51a3ec3f0404a32c8397d9?s=40
eliphone
Thu Jan 9 09:25:19 2014
hi,when the strings contain chinese character,it seems not work.How can I do for it.Can you help me?
55e7f448215da57564a39904a1d7395d?s=40
Amit
Fri Sep 5 08:38:12 2014
I want to implement the levenshtein distance function into 'search tab' of my localhost site. I have created the same function in mysql but I think I might have to customize this function to include the parameters like 'cost_to_insert','cost_to_delete','cost_to_replace' for better accuracy.Can you help me?
3d5182190355712cc33d36d0bed97148?s=40
Wed Dec 17 04:30:16 2014
Thank you very much

Leave A Comment