I recently needed to process a bunch of addresses that our proprietary geocoding software was having a problem with. I won’t say the name of the geocoder, but it rhymes with BarfGIS. Geocoding is a process where street addresses are translated into latitude and longitude coordinates. If you’re lucky that is. If they’re not translated you can go through all 30,000 addresses line by line and figure out where they are on a map, but there’s too much Scotch involved in that and the inevitable liver transplant.
Not wanting to box my liver, I decided to use modern 21st-century techniques to match these 30,000 strings to locations on a map, which may well have taken more time than eyeballing them and using a map.
One method of doing so might be a Levenshtein distance algorithm. So we want to compare the positive address locations with the malformed zombie address strings, whose parents abandoned them and who had to sell themselves on the streets, using Levenshtein distance scoring. Levenshtein distance counts the minimum number of single-character edits (insertions, deletions, and substitutions) needed to turn one string into the other, filling in a dynamic-programming table one cell per pair of string prefixes until the distance falls out of the far corner. Do that for every pair of strings in two large lists and the iteration goes on until the heat death of the universe, or so it seems.
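For the curious, here is a minimal sketch of that dynamic-programming calculation in Python. The real libraries are C-backed and much faster; the sample addresses are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev holds distances between the empty prefix of a and every prefix of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("100 Main St", "100 Mian Street"))
```

Each cell only needs the row above it, so the sketch keeps just two rows instead of the whole table. That keeps memory at O(len(b)), which, given how the rest of this story goes, matters.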
Part of this odyssey involved a significant amount of time on stackoverflow.com being berated by a 35 year old bearded hipster who works at Amazon as a “Business Intelligence Engineer,” and spends a lot of time playing a fantasy card game called Dominion. I’m not making this up, I don’t have that much creative ability. Unfortunately, spending twenty-two years in the Marine Corps flying KC-130s, commanding M-1A1 tank platoons, and generally doing things that most people have only seen in the movies did not adequately prepare me for interactions with the likes of someone who’s spent their life arguing about why they should buy copper in Dominion.
My first language choice to tackle this was R, since I prefer the RStudio interface for development: you can run chunks of code in a session and work out the kinks as you go. You can run code interactively from the Python prompt too, but it’s not as smooth. RStudio really shines here.
Most of the examples I went through for the R parallel library recommended not using all of your machine’s cores for parallel processing, leaving one open for the operating system’s background processes, which is probably good advice. But it’s wasted, because the library is as rickety as a Survivor canoe built by a philosophy major.
The parallel library in R requires you to make a separate environment for your ‘cluster,’ a processing area where you have to load your variables, libraries, and functions in separately, then tell the cluster to start, and wait for it to process your data.
Unfortunately, parallel R bombed out with more memory errors than a Hollywood executive with wandering hands. This was on a Win7 64-bit machine. I can’t speak for other platforms; I’ve heard people have had good luck using this on a Linux system. I’m amazed at anyone who can even keep a Linux system up and running after the first driver update. Anyway, I don’t have that option in this environment, the client insists that I keep their data on their Windows machines for security reasons. That should be the funniest line in this post.
I originally blamed the memory errors on the library I was using to calculate the Levenshtein distance, so I used another one and the results were identical. The size of the data sets was approximately 35,000 records for positive matches and 30,000 records for negative matches, which I consider fairly small by industrial analysis standards. Google handles hundreds of millions of address search strings every day, most of which are spelled worse than the writing on a bathroom stall in a gas station or a wuffle house (sic). The Levenshtein distance between a gas station bathroom and a wuffle house is zero, by the way.
The memory consumption error in R by the parallel library calculating Levenshtein distances between strings was so egregious that on several occasions it took down the entire operating system and I had to reboot the machine. Two of those times I had to get the computer some cookies and a warm glass of milk so it would let me log in again. To make matters worse, when I saw the memory consumption arc increasing and I knew it would blow up, I tried to stop the R session, but that still didn’t work. Stopping the computation didn’t seem to help, stopping the cluster and setting it to NULL was also ineffectual. Threatening it was no help at all.
So then I moved on to Python. Before the Python fans start doing the happy dance, let me say it also had a problem, which again involved a memory consumption leak. Or maybe I should say the multiprocessing library had a leak; Python of course should remain blameless in the memory massacre that I witnessed. Trillions of bytes, wandering around homeless, not knowing their address, unable to get garbage collection. Or as they call it in New York, Tuesday.
I was able to find a solution to the problem in Python by explicitly calling close() and join() on the processing pool, which is supposed to be done automatically, but apparently is not. There are similar mechanisms to explicitly stop the clusters in R, but these didn’t fix the memory problem. Even with the multiprocessing library used in Python, the distance calculation still took over eleven hours to complete. Eleven hours. Can you imagine if you had to wait eleven hours for a Google map after typing in an address that was spelled wrong? Spelling bees would be a lot more competitive, I’ll tell you that.
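For the record, the fix looked roughly like this. The worker function here is a stand-in; the real one computed Levenshtein distances against the reference list.

```python
from multiprocessing import Pool

def work(n):
    # Stand-in for the real per-address distance calculation.
    return n * n

if __name__ == "__main__":
    pool = Pool(processes=4)
    try:
        results = pool.map(work, range(10))
    finally:
        # In theory the pool cleans up after itself when it is garbage
        # collected; in practice, releasing the workers explicitly is
        # what stopped the memory from piling up.
        pool.close()  # no more tasks will be submitted
        pool.join()   # block until all worker processes have exited
    print(results)
```

close() tells the pool no further work is coming; join() then waits for the worker processes to actually die, rather than leaving them (and their memory) hanging around.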
[R] parallel processing: unusable for industrial applications without extensive force-feeding, and should only be used by the high scorers on Stack Overflow and Dominion players.
Python multiprocessing: kludgy and in need of slight force-feeding, but usable for mere mortals who are more comfortable with real weapons, and have girlfriends or wives.