Monday, November 24, 2014

How to import your iTunes library into R

If there's anything that 23andMe, Last.fm, Strava, or any of those countless Facebook apps have shown us, it's that we love analyzing our own data and discovering new things about ourselves. A great source of data is your iTunes library. If you're anything like me, you listen to music constantly: at home, at work, or on the go. With iPods (and iPhones) having been popular for over a decade, iTunes could potentially have data on a significant portion of your life. Why not poke around in it?

iTunes stores its library data in two separate files: iTunes Library.itl and iTunes Library.xml. According to Apple, the .itl file is the database iTunes uses for itself, while the .xml file is intended for use with external applications. XML is a standard format that is easily readable by both humans and computers, and it is used all over the web for things like RSS feeds and web services.

With the use of two packages, XML and the ever-so-useful plyr, importing your iTunes library data into an R data frame is a dead-easy process.

Setup:

  1. Make a copy of your iTunes Library.xml or iTunes Music Library.xml file. (Safety first!)
  2. Install XML and plyr from CRAN.

Instructions:

Just follow along with these four easy steps:

  • ituneslib <- readKeyValueDB("iTunes Music Library.xml")
    This command reads your iTunes library into R as a property list (.plist), the key-value layout the file actually uses, rather than as generic XML. Remember to change "iTunes Music Library.xml" to whatever your file is named. Don't panic if it seems like the program's frozen! For my library (3500 songs on a 2007 iMac), this operation took about a minute and a half.
  • tracksxml <- ituneslib$Tracks
    This command grabs the "Tracks" section of the library and moves it into a separate variable for convenience. At this point, tracksxml is a list of lists, with each list further wrapped inside a 1-item list, so now we have to restructure it into a saner data format.
  • tracksxml1 <- lapply(tracksxml, data.frame)
    This command transforms the nested lists into data frames so that plyr can do its magic. (That is, the lists-inside-lists-inside-a-list are now data frames-inside-lists-inside-a-list). This command took me 25 seconds.
  • songs <- ldply(tracksxml1)
    And now plyr steps in and makes everything into a nice neat data frame!
After you're done, feel free to remove the tracksxml lists; they're no longer necessary once you have your data frame. There is likely a faster way to do all of this, but this way is by far the simplest to type out and debug. For easy copy-pasting, here are the commands again:

library(XML)   # provides readKeyValueDB()
library(plyr)  # provides ldply()
ituneslib <- readKeyValueDB("iTunes Music Library.xml")
tracksxml <- ituneslib$Tracks
tracksxml1 <- lapply(tracksxml, data.frame)
songs <- ldply(tracksxml1)

Or, once XML and plyr are loaded, you can do it all in one line:
songs <- ldply(lapply(readKeyValueDB("iTunes Music Library.xml")$Tracks, data.frame))
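
If you want to see what the last two steps are doing without waiting on a real library, here is a minimal sketch on a made-up two-track list. The track data and the fill_rbind helper are hypothetical. The helper is a base-R stand-in for the behavior the steps rely on from ldply (stacking the per-track data frames and padding missing columns with NA), not plyr's actual implementation.

```r
# Toy stand-in for ituneslib$Tracks: a named list of per-track lists.
# The keys mimic iTunes plist keys; the values are made up.
tracksxml <- list(
  "101" = list("Track ID" = 101L, Name = "Song A", Artist = "Band X",
               "Play Count" = 12L),
  # Hypothetical track that is missing the "Play Count" key.
  "102" = list("Track ID" = 102L, Name = "Song B", Artist = "Band Y")
)

# Same idea as step 3: one single-row data frame per track.
# data.frame() rewrites names such as "Play Count" to "Play.Count".
tracksxml1 <- lapply(tracksxml, data.frame, stringsAsFactors = FALSE)

# Hypothetical base-R stand-in for ldply(): stack the rows and fill
# any column a track lacks with NA.
fill_rbind <- function(dfs) {
  cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(df) {
    for (col in setdiff(cols, names(df))) {
      df[[col]] <- NA
    }
    df[cols]
  })
  do.call(rbind, padded)
}

songs <- fill_rbind(tracksxml1)
print(songs)
```

Once the data is in this shape, ordinary data-frame tools apply. For example, aggregate(Play.Count ~ Artist, data = songs, FUN = sum) totals plays per artist, assuming your library exposes a "Play Count" key.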

Have fun and happy analyzing!

Thursday, November 20, 2014

Trials and tribulations building RStudio Server on a Mac

I spent all of yesterday trying to get RStudio Server to build on my iMac. In my opinion, it's the best IDE for R, and being able to run it on another computer remotely is icing on the cake. My Samsung Chromebook with crouton really doesn't have the "oomph" to... well... do anything meaningful. But building has been anything but trivial, and since this isn't a well-documented process, I want to document some of the pitfalls I ran into.

I've been relying on instructions from this blog post by Ian Gow, which provides easy-to-follow commands (great when you're too lazy to remember to type out build instructions into gcc by hand), and has supposedly worked for him. With my setup (which uses MacPorts R 3.1.1 and the latest RStudio version on GitHub), I still ran into some problems.

First, there's a hidden dependency on Java, which I installed with the OS X installer. You need this to work with apache-ant, and it doesn't come up until later in the build process. I ran into this problem when building from scratch on my Chromebook (RStudio does not supply ARM binaries), so I assumed it would be an issue here too.

Second, R is not recognized by RStudio when running cmake. I started running into these errors:

Rogers-iMac:build roger$ cmake -DRSTUDIO_TARGET=Server -DCMAKE_BUILD_TYPE=Release ..
-- Mac OS X version: 10.10
-- Boost version: 1.50.0
-- Found R: /opt/local/lib/libR.dylib/Resources
CMake Error at src/cpp/CMakeLists.txt:218 (message):
  Minimum R version (2.11.1) not found.


-- Configuring incomplete, errors occurred!
See also "/Users/roger/projects/rstudio/build/CMakeFiles/CMakeOutput.log".
See also "/Users/roger/projects/rstudio/build/CMakeFiles/CMakeError.log".

I initially tried taking the easy way out by passing -DRSTUDIO_VERIFY_R_VERSION=0 to cmake (as suggested on the RStudio forums here), but then ran into the same problem the poster did: the build could not find Rinternals.h.

Rogers-iMac:build roger$ cmake -DRSTUDIO_TARGET=Server -DCMAKE_BUILD_TYPE=Release -DRSTUDIO_VERIFY_R_VERSION=0 ..
-- The C compiler identification is AppleClang 6.0.0.6000054
-- The CXX compiler identification is AppleClang 6.0.0.6000054
[...]
[ 44%] Built target rstudio-monitor
Scanning dependencies of target rstudio-r
[ 44%] Building CXX object src/cpp/r/CMakeFiles/rstudio-r.dir/RErrorCategory.cpp.o
[ 44%] Building CXX object src/cpp/r/CMakeFiles/rstudio-r.dir/RExec.cpp.o
In file included from /Users/roger/projects/rstudio/src/cpp/r/RExec.cpp:17:
In file included from /Users/roger/projects/rstudio/src/cpp/r/include/r/RExec.hpp:30:
In file included from /Users/roger/projects/rstudio/src/cpp/r/include/r/RSexp.hpp:33:
/Users/roger/projects/rstudio/src/cpp/r/include/r/RInternal.hpp:43:10: fatal error: 'Rinternals.h' file not found
#include <Rinternals.h>
         ^
1 error generated.
make[2]: *** [src/cpp/r/CMakeFiles/rstudio-r.dir/RExec.cpp.o] Error 1
make[1]: *** [src/cpp/r/CMakeFiles/rstudio-r.dir/all] Error 2
make: *** [all] Error 2

I wasn't sure how to continue here, but I found an old post on the RStudio forums with a nice clue: cmake can have problems finding the R executable, especially if it's in an uncommon location. MacPorts generally puts its binaries in /opt/local/bin, but RStudio was only searching /usr/bin and /usr/local/bin. So all I had to do was point cmake at the location of R by running export RSTUDIO_WHICH_R=/opt/local/bin/R right before the cmake command (letting it complete successfully), and afterwards fix a few lines in build/CMakeCache.txt. (I also have this as a question/answer on Stack Overflow.)

cmake ran successfully, and afterwards I changed these lines in CMakeCache.txt:

//R doc directory
LIBR_DOC_DIR:PATH=/opt/local/Library/Frameworks/R.framework/Resources/doc

//R executable
LIBR_EXECUTABLE:PATH=/opt/local/bin/R

//R home directory
LIBR_HOME:PATH=/opt/local/Library/Frameworks/R.framework/Resources

//R include directory
//Think I found our problem! Was: LIBR_INCLUDE_DIRS:PATH=/opt/local/lib/libR.dylib/Resources/include
LIBR_INCLUDE_DIRS:PATH=/opt/local/Library/Frameworks/R.framework/Versions/3.1/Resources/include/

//Path to a library.
LIBR_LIBRARIES:FILEPATH=/opt/local/lib/libR.dylib
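
If you'd rather not guess at these paths, R can report several of them itself. This is a small sketch and not part of the fix I actually applied: it uses base R's R.home() to print candidate values in the NAME:TYPE=value form that CMakeCache.txt uses. I'm assuming you run it with the same R you want RStudio to build against (for me, /opt/local/bin/R), and you should check that the directories exist before pasting anything in.

```r
# Ask the running R where its pieces live, then format them as
# CMakeCache.txt entries. Only the three directory entries are covered;
# LIBR_EXECUTABLE and LIBR_LIBRARIES still have to be set by hand.
cache_lines <- c(
  paste0("LIBR_HOME:PATH=", R.home()),
  paste0("LIBR_DOC_DIR:PATH=", R.home("doc")),
  paste0("LIBR_INCLUDE_DIRS:PATH=", R.home("include"))
)
writeLines(cache_lines)
```

The include directory is the one that matters for the Rinternals.h error, so that's the line to compare against your cache first.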

The final result: Success!



Afterwards, I followed the suggestions of the developers and put RStudio behind a proxy for better security. I chose nginx, which is rapidly growing in popularity, in part because of its usefulness as a reverse proxy (look how easy that configuration is!). This will also be useful if I want to host other unrelated stuff on my computer: I can easily put RStudio behind a directory and have it work just like normal.

In any case, it's time for me to play around with this a bit. Until next time!