Monday, November 24, 2014

How to import your iTunes library into R

If there's anything that 23andMe, Last.fm, Strava, or any of those countless Facebook apps have shown us, it's that we love analyzing our own data and discovering new things about ourselves. A great source of data is your iTunes library. If you're anything like me, you listen to music constantly: at home, at work, or on the go. With iPods (and iPhones) having been popular for over a decade, iTunes could have data on a significant portion of your life. Why not poke around in it?

iTunes stores its library data in two separate files: iTunes Library.itl and iTunes Library.xml. According to Apple, the .itl file is the database iTunes uses for itself, while the .xml file is intended for use by external applications. The XML file is an Apple property list (plist), a standard format easily readable by both humans and computers; XML itself is used all over the web for things like RSS feeds and web services.
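To give you a sense of what's inside, each track in the file is a little dictionary of key-value pairs, nested under a big "Tracks" dictionary. It looks roughly like this (keys abbreviated, values made up):

<key>Tracks</key>
<dict>
    <key>1234</key>
    <dict>
        <key>Name</key><string>Some Song</string>
        <key>Artist</key><string>Some Artist</string>
        <key>Play Count</key><integer>42</integer>
    </dict>
</dict>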

With the use of two packages, XML and the ever-so-useful plyr, importing your iTunes library data into an R data frame is a dead-easy process.

Setup:

  1. Make a copy of your iTunes Library.xml or iTunes Music Library.xml file. (Safety first!)
  2. Install XML and plyr from CRAN.
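For step 2, one command at the R console does it:

install.packages(c("XML", "plyr"))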

Instructions:

Just follow along with these four easy steps:

  • ituneslib <- readKeyValueDB("iTunes Music Library.xml")
    This command reads your iTunes library into R, treating it as a .plist (property list) file rather than generic XML. Remember to change "iTunes Music Library.xml" to whatever your file is named. Don't panic if it seems like the program has frozen! For my library (3500 songs on a 2007 iMac), this operation took about a minute and a half.
  • tracksxml <- ituneslib$Tracks
    This command grabs the "Tracks" section of the library and moves it into a separate variable for convenience. tracksxml is currently a list of lists, with each inner list further wrapped inside a one-item list, so now we have to restructure this into a saner data format.
  • tracksxml1 <- lapply(tracksxml, data.frame)
    This command transforms the nested lists into data frames so that plyr can do its magic. (That is, the lists-inside-lists-inside-a-list are now data frames-inside-lists-inside-a-list). This command took me 25 seconds.
  • songs <- ldply(tracksxml1)
    And now plyr steps in and makes everything into a nice neat data frame!
After you're done, feel free to remove the tracksxml lists with rm(tracksxml, tracksxml1); they're no longer needed once you have your data frame. There is likely a faster way to do all of this, but this way is by far the simplest to type out and debug. For easy copy-pasting, here are the commands again:

library(XML)
library(plyr)
ituneslib <- readKeyValueDB("iTunes Music Library.xml")
tracksxml <- ituneslib$Tracks
tracksxml1 <- lapply(tracksxml, data.frame)
songs <- ldply(tracksxml1)

Or you can do it all in one line:
songs <- ldply(lapply(readKeyValueDB("iTunes Music Library.xml")$Tracks, data.frame))
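Once it's loaded, take a quick peek to make sure everything came through. (The column names below are from my library; data.frame() turns keys like "Play Count" into Play.Count, so yours may differ slightly:)

head(songs[, c("Name", "Artist", "Album")])
str(songs$Play.Count)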

Have fun and happy analyzing!

Thursday, November 20, 2014

Trials and tribulations building RStudio Server on a Mac

I spent all of yesterday trying to get RStudio Server to build on my iMac. In my opinion, it's the best IDE for R, and being able to run it on another computer remotely is icing on the cake. My Samsung Chromebook with crouton really doesn't have the "oomph" to... well... do anything meaningful. But building has been anything but a trivial process, and I want to document some of the pitfalls I ran into here... this isn't a well-documented process.

I've been relying on the instructions in this blog post by Ian Gow, which provides easy-to-follow commands (great when you're too lazy to type out build instructions by hand) and which supposedly worked for him. With my setup (MacPorts R 3.1.1 and the latest RStudio source from GitHub), I still ran into some problems.

First, there's a hidden dependency on Java, which I installed with the OS X installer. You need it for apache-ant, and it doesn't come up until late in the build process. When I was building from scratch on my Chromebook (RStudio does not supply ARM binaries), I ran into this problem, so I assumed it would be an issue here too.

Second, R was not being recognized when running cmake. I started running into errors like this:

Rogers-iMac:build roger$ cmake -DRSTUDIO_TARGET=Server -DCMAKE_BUILD_TYPE=Release ..
-- Mac OS X version: 10.10
-- Boost version: 1.50.0
-- Found R: /opt/local/lib/libR.dylib/Resources
CMake Error at src/cpp/CMakeLists.txt:218 (message):
  Minimum R version (2.11.1) not found.


-- Configuring incomplete, errors occurred!
See also "/Users/roger/projects/rstudio/build/CMakeFiles/CMakeOutput.log".
See also "/Users/roger/projects/rstudio/build/CMakeFiles/CMakeError.log".

Initially I tried taking the easy way out by passing -DRSTUDIO_VERIFY_R_VERSION=0 to cmake (suggested on the RStudio forums here), but then I ran into the same problem the poster did: Rinternals.h could not be found.

Rogers-iMac:build roger$ cmake -DRSTUDIO_TARGET=Server -DCMAKE_BUILD_TYPE=Release -DRSTUDIO_VERIFY_R_VERSION=0 ..
-- The C compiler identification is AppleClang 6.0.0.6000054
-- The CXX compiler identification is AppleClang 6.0.0.6000054
[...]
[ 44%] Built target rstudio-monitor
Scanning dependencies of target rstudio-r
[ 44%] Building CXX object src/cpp/r/CMakeFiles/rstudio-r.dir/RErrorCategory.cpp.o
[ 44%] Building CXX object src/cpp/r/CMakeFiles/rstudio-r.dir/RExec.cpp.o
In file included from /Users/roger/projects/rstudio/src/cpp/r/RExec.cpp:17:
In file included from /Users/roger/projects/rstudio/src/cpp/r/include/r/RExec.hpp:30:
In file included from /Users/roger/projects/rstudio/src/cpp/r/include/r/RSexp.hpp:33:
/Users/roger/projects/rstudio/src/cpp/r/include/r/RInternal.hpp:43:10: fatal error: 'Rinternals.h' file not found
#include <Rinternals.h>
         ^
1 error generated.
make[2]: *** [src/cpp/r/CMakeFiles/rstudio-r.dir/RExec.cpp.o] Error 1
make[1]: *** [src/cpp/r/CMakeFiles/rstudio-r.dir/all] Error 2
make: *** [all] Error 2

I wasn't sure how to continue here, but I found an old post on the RStudio forums with a nice clue: cmake sometimes has trouble finding the R executable, especially if it's in an uncommon location. MacPorts generally puts its binaries in /opt/local/bin, but RStudio was only searching /usr/bin and /usr/local/bin. So all I had to do was point cmake to the location of R (letting it complete successfully) with export RSTUDIO_WHICH_R=/opt/local/bin/R right before running the command, and afterwards fix a few lines in build/CMakeCache.txt. (I also have this as a question/answer on StackOverflow.)
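So, for easy copy-pasting, the working sequence (both commands from above):

export RSTUDIO_WHICH_R=/opt/local/bin/R
cmake -DRSTUDIO_TARGET=Server -DCMAKE_BUILD_TYPE=Release ..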

cmake ran successfully, and afterwards I changed these lines in CMakeCache.txt:

//R doc directory
LIBR_DOC_DIR:PATH=/opt/local/Library/Frameworks/R.framework/Resources/doc

//R executable
LIBR_EXECUTABLE:PATH=/opt/local/bin/R

//R home directory
LIBR_HOME:PATH=/opt/local/Library/Frameworks/R.framework/Resources

//R include directory
//(the old value was /opt/local/lib/libR.dylib/Resources/include ... think I found our problem!)
LIBR_INCLUDE_DIRS:PATH=/opt/local/Library/Frameworks/R.framework/Versions/3.1/Resources/include/

//Path to a library.
LIBR_LIBRARIES:FILEPATH=/opt/local/lib/libR.dylib

The final result: Success!



Afterwards, I followed the developers' suggestion and put RStudio behind a proxy for better security. I chose nginx, which is rapidly growing in popularity in part because of its usefulness as a reverse proxy (look how easy that configuration is!). This will also be useful if I ever want to host other, unrelated stuff on this computer: I can easily tuck RStudio behind a directory and have it work just like normal.
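I won't reproduce my whole config here, but the reverse-proxy part of an nginx server block looks something like this (a minimal sketch, not my exact file, assuming RStudio Server is listening on its default port, 8787):

# minimal sketch, not my exact config; assumes RStudio Server's default port
server {
    listen 80;
    location / {
        proxy_pass http://localhost:8787;
    }
}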

In any case, it's time for me to play around with this a bit. Until next time!

Friday, October 10, 2014

Gun Control and Gun Violence

The United States has heard repeated calls for more gun control legislation in the wake of the Sandy Hook Elementary School shooting. Every day, it seems, there's a new mass shooting, with dire implications for the state of our country. But mass shootings are isolated events, almost tailor-made to attract disproportionate media attention; day-to-day assaults, kidnappings, and murders affect far more people.

Liberals claim that gun control makes communities safer by making guns harder to obtain and use illegally. Conservatives counter that gun control makes communities more dangerous by eliminating a key method of self-defense for law-abiding citizens. Each side has its share of talking points: liberals point to the high rates of gun-related deaths in the United States compared to other developed countries (a point used famously in Michael Moore's Bowling for Columbine), while conservatives point to stories of self-defense by would-be victims of robbery, home invasion, domestic abuse, and other serious crimes. Although I fall on the liberal side of the spectrum, I would much rather take up a position supported by empirical evidence. So I asked the question: do stricter gun laws affect crime?

To test this, I decided to run a simple regression of violent crime rates against the relative restrictiveness of gun laws. I'm using the FBI's Uniform Crime Report statistics for 2012 and comparing them with FreeExistence.org's Gun Rights Index. Both datasets come with major caveats: FreeExistence.org has a strongly libertarian viewpoint and clearly favors laxer gun laws, while the FBI cautions strongly against using UCR data to rank crime across areas, due to differing reporting standards among police agencies. Despite FE's obvious political slant, I don't see a reason to doubt their data; it's unclear whether it would be more in their favor to magnify differences and paint states like New York as overly restrictive, or to blur them so as to obscure any supposed effect.
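The regression itself is only a few lines of R. Here's a sketch of the idea; the file and column names are hypothetical stand-ins, not the actual datasets:

# hypothetical file/column names; one row per state in each table
guns  <- read.csv("gun_rights_index.csv")  # FreeExistence.org Gun Rights Index
crime <- read.csv("ucr_2012.csv")          # FBI UCR violent crime rates, 2012
df    <- merge(guns, crime, by = "state")
fit   <- lm(violent.crime.rate ~ gun.rights.index, data = df)
summary(fit)                               # reports the r-squared and p-value
plot(df$gun.rights.index, df$violent.crime.rate)
abline(fit)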

My first regression ended up with a slight negative correlation, suggesting that more restrictive gun laws go hand in hand with more violence. But when I plotted the results, I found that I had fallen into a classic demographic pitfall... counting DC as a state!

[Plot: violent crime rate vs. gun rights index. Outlier much? (Look in the top left.)]

If you remove DC, the correlation breaks down completely:

[Plot: the same regression with DC excluded.]
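For the record, dropping DC is a one-line change (again with my hypothetical names):

fit2 <- lm(violent.crime.rate ~ gun.rights.index,
           data = subset(df, state != "District of Columbia"))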
The lack of a correlation likely means that there are a host of other factors involved (understandable, since I only looked at a single metric of gun freedom). As another quick test, I tried the same regression on just the homicide rate, although it doesn't look much better:

[Plot: homicide rate vs. gun rights index.]

For those who haven't taken a statistics class, or need a refresher: r² is an effect size (essentially, what percentage of the variation in gun violence is explained by gun control), and p is the probability of seeing a correlation at least this strong by pure chance, if there were really no relationship. So there appears to be no connection between gun violence and gun laws in the bottom two graphs, and only a slight connection in the first (which would probably go away if I controlled for poverty level, population density, or something similar).

In any case, there are still a number of promising ways to go on this theme: looking at gun ownership rates, like this analysis from the Violence Policy Center; using a different metric of gun violence, like gun-related deaths (from the CDC) or individual arrest records (through the UCR's successor, the NIBRS); or controlling for other possible factors, like economic inequality or education levels. For now, I'm unable to draw conclusions either way.

Sunday, July 20, 2014

Help Democratize Democracy!

A few days ago, I found a really cool project on Twitter called OpenElections, which is trying to create a master dataset of every certified election result in the US. It's received a good deal of critical acclaim, including a grant from the Knight Foundation (named for the founders of one of the biggest newspaper chains in the country). So far the project has gathered records from 39 states dating back at least a decade, and it is now looking to convert them into a computer-readable format for use in data analysis.


[Image: turning this... into this!]

[Photo: Scenic Corinth, MS, whose election results I transcribed on Friday. Photo from Corinth Main St. Alliance.]
As someone interested in statistics and politics, I saw huge potential in this project. So when I saw a chance tweet calling for volunteers, I went to their GitHub page and started submitting pull requests. It's really simple! This weekend I started working on the 2008 presidential election, although I've still got a few counties to go. I can only imagine how useful this dataset could be, and it makes me proud to live in a country where an effort like this is even possible.

So go help democratize democracy, and help the open data movement! It's not very hard to transcribe a few documents, and your effort will aid a very promising project!

Sunday, July 13, 2014

Quantifying Land Constraints in Boom Cities

[Photo: Condo construction in Brickell, Miami. Photo from southbeachcars on flickr.]
A few weeks ago, Stephen Smith (who runs Market Urbanism) was comparing the fates of Miami and Vancouver, two cities that have experienced massive housing construction booms. Both cities have grown tremendously... and grown upward. This comes in the face of major land constraints: the Everglades for Miami, and the Coast Mountains for Vancouver. But how much do these barriers actually impact development?