The Analyst's Dilemma, R vs. Python, or How I Stopped Worrying and Hated Them Both

August 3, 2017
jenholm
Data Science Blog, Random Thoughts Blog

I recently worked on two analytics projects, each of which depended on a single library that was crucial to finishing the job. I like to call that putting all your eggs in one basket, and it usually lands you in trouble. Project one required reaching into SurveyMonkey.com’s REST API to get our HR survey data and then analyzing it. The project sponsor told me that R had a library capable of doing this, which was great because I could then extract the data and analyze it in the same language.

Unfortunately, defeat was snatched from the jaws of victory when I installed RMonkey and found that SurveyMonkey.com had changed its API, as companies are wont to do, without making the changes backward compatible with the old API. OK, this isn’t RMonkey’s fault, it’s SurveyMonkey’s fault. But this isn’t an isolated incident, and it reflects a habit across the R developer sphere of:

Building a cool R library
Putting it out there
Fuhgetaboutit

So we have all these R libraries out there on GitHub that no longer work. Like the island of unwanted toys, or former celebrities on Dancing with the Stars, they hover in the source repositories like middle-aged men at Vegas MTV pool parties: no longer wanted, dysfunctional, and kinda creepy.

Even worse, all of SurveyMonkey’s documentation is in Python, which makes it tougher still to write my own R library for the API. Eventually I ended up using the Python surveymonty package to access the API. After extracting the survey data, I analyzed the results in R, which did have some great (and functioning) libraries for sifting out questionable responses using the Mahalanobis distance as a basis for multivariate outlier detection.
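For readers who have not met the Mahalanobis distance, here is a minimal sketch of the idea in plain Python. It is not the R code used on the project. The function names, the two-variable restriction, the made-up scores, and the 97.5% cutoff are all assumptions chosen for illustration. The distance measures how far a response sits from the mean after accounting for how the variables move together, so a response that breaks the usual correlation stands out even when each answer looks ordinary on its own.

```python
import math


def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse written out.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances


# Assumed cutoff: the 97.5% quantile of a chi-square with 2 degrees of freedom,
# which has the closed form -2 * ln(1 - p). The square root puts it on the
# same scale as the distances.
DEFAULT_THRESHOLD = math.sqrt(-2 * math.log(1 - 0.975))


def flag_outliers(points, threshold=DEFAULT_THRESHOLD):
    """Return the points whose Mahalanobis distance exceeds the threshold."""
    return [p for p, d in zip(points, mahalanobis_2d(points)) if d > threshold]


# Hypothetical survey scores: two questions that normally track each other,
# plus one response (2, 9) that breaks the pattern.
responses = [
    (1, 1.5), (2, 1.5), (3, 3.5), (4, 3.5), (5, 5.5),
    (6, 5.5), (7, 7.5), (8, 7.5), (9, 9.5), (10, 9.5),
    (2, 9),
]
flagged = flag_outliers(responses)
print(flagged)
```

Neither 2 nor 9 is an unusual answer by itself in this toy data. It is the combination that lands far from the rest, which is what makes the method multivariate. With real survey data and more than two questions you would reach for a matrix library rather than a hand-written 2x2 inverse.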

Fast forward several months, and I found the need for a discrete event simulation library in R. There seemed to be a good one called simmer, based on the simpy library in Python.

Uh-oh, I thought.

Once again, after spending a day or so working the examples in simmer, I found the library to be buggy, so much so that the documentation’s examples didn’t even run without throwing exceptions. Apparently the author(s) of the library had made some changes without updating the documentation, making the entire shebang invalid. Back I went to Python and the simpy library, which ran without incident and in accordance with the supplied examples.

There seems to be a recurring theme here, or maybe several. The first: don’t use projects that have been kicked out of the CRAN repository for bad maintenance; that one is my fault. The second is the passing along of bad information about libraries that might work for your project. The third is a lack of policing on the internet, or maybe I should say a lack of rating the projects you can use from the internet. Perhaps we need a better method of rating projects, or in some cases any method at all, and of passing that information to others, particularly on GitHub. While we are thankful that builders took the time to build their projects and put them out there, we don’t want the house they built to fall on our heads.

Joel Spolsky has described how they test software at Microsoft, and the hoops they jump through to try to make sure the system works with as much legacy software as humanly possible. While we don’t have the assets that Microsoft does for testing R projects, we do have the assets to at least rate the libraries. This is implemented in SourceForge, but not on GitHub or CRAN. I guess I could write an app to do that, but then who would maintain it? Now, as we say when we use regex, we have two problems.

Enholm Heuristics
