Christopher
Stoll

fuzziac.js : Approximate String Matching in JavaScript

Fuzziac.js is JavaScript class for on-line approximate string matching. It was originally intended for use with auto-complete, such as that provided by jQuery UI. This is started as a research project for an undergraduate algorithms class which I took way back in 2011. We had to complete a term project on an algorithm, and Dr. Duan made dynamic programming look cool, so I though I would look for an application of dynamic programming outside the realm of DNA sequencing.

Where I was working at the time I had been updating web-based intranet applications to use RESTful services and shared JavaScript libraries; one of the most popular updates I made was to add auto-complete (thank you jQuery UI) to all the person name fields across the various applications. I worked for a multi-national and culturally diverse company, so spelling people’s names right was a big problem (e.g. Is it Lucy Lu or Lucy Liu? How do you spell Lakshmi Yaragudipati? Jake Bruder’s real name is Joachim Bruder; who knew? etc., etc.). Previous attempts to provide normalized person name data relied upon entering employee IDs, but no one knows anyone else’s employee ID, so they would always have to look it up. Or, if there was no penalty for entering the wrong employee ID, people would just enter their own, which made the results of data analysis faulty. I knew that in order for the data to be correct it would have to be easy for users to enter it correctly, and this problem seemed to be a good fit for a dynamic programming approach.

I assumed that I wasn’t the first person to think that dynamic programming was a good way to match names, but perhaps I was the first to apply it to web forms in this way. I searched through published research papers and found a paper by Top, Dowla, and Gansemer called A Dynamic Programming Algorithm for Name Matching. I later found out that what I was trying to do was known as approximate string matching, and there was a note on Wikipedia which mentions that Sellers actually proposed using dynamic programming for approximate string matching in 1980.

The approach of Top, Dowla, and Gansemer was aimed at off-line use, so I was concerned that it would not be fast enough on moderately sized data sets to be a realistic option. It turns out that the comparison runs in O(n * m) time and the backtracking runs in O(n + m) time, which is not too bad. We will be comparing names, which tend to be short; if we assume that all of our names will be less than 32 characters (or whatever the size limit on the name field is), then we can take the upper bounds of the comparison time as a constant. I must admit that some of my peers did not like that I took what looks to be almost O(n^2) (n * m = n^2; if n == m) and just turn it into O(1). From my perspective the time it takes to make one comparison is bound by the time it takes to compare two strings which are the maximum length allowed in the name field; that time is constant depending upon which machine the comparison is run on. It’s no longer about the curve, it’s about how long it takes to compare the two longest possible strings. Perhaps I should work out a more rigorous mathematical proof though.

The real problem now is that we need to run that comparison for each name in our list of names, k number of times. So, now we have O(k), and a linear asymptote is great, as long as our constant comparison time is not too large. So, I modified the code in ways to reduce the constants as much as possible, though there is more work to be done. In practice I was able to work with lists of over 2,500 names, except for on old version of Internet Explorer which had absolutely horrendous JavaScript performance. That was enough to contain all of the people in the region which I supported while still leaving room for growth. Almost all of the applications I supported were scoped so that any particular person would only see the names of people at their same location, which further reduced the size of the problem set.

I previously wrote a post about my dynamic programming approach to name matching, but not many people read it, perhaps because I called it JavaScript Dynamic Programming Example. I am guessing that people searching the web for algorithms are looking for a specific solution to solve the problem at hand, and few people needed to see a dynamic programming algorithm implemented in JavaScript. Also, I did not provide a turn-key example. So, I wrote this post with that in mind and created a GitHub repo for the code. This solution solved a real problem for me and I would like to share it with others (hint: if it is hard for users of your web app to find the right username, and the data set is not too large, then consider taking this approach).

If you would like more technical information, such as how to use this script, then go check out the fuzziac.js GitHub page.

Published: 2014-01-24
JavaScriptDynamic ProgrammingApproximate string matching