Tuesday, February 3, 2015

How Margarine is Tearing New England Families Apart

(Spoiler: It's not. Despite a very strong correlation.)
Source: Spurious Correlations at tylervigen.com

When I was in my Econometrics class at college, my professor drilled into me the "Ten Commandments of Applied Econometrics", from an influential paper of the same name by Peter Kennedy. These rules apply as much to econometrics as they do to any statistical modelling exercise:
  1. Thou shalt use common sense in economic theory
    The "Common Sense" that Kennedy (and by extension, Professor Khemraj) talks about is truly basic methodology. Things like mixing up stock and flow variables (eg confusing wealth/assets (stock) with income (flow)), or comparing a per capita figure with a total.
  2. Thou shalt ask the right questions
    If you walk into an analysis blindfolded, the results might not look too good when you turn the lights back on. Make sure you truly understand the question that you're asking.
  3. Thou shalt know the context - do not perform ignorant statistical analysis
    You should always know how your data came to be and how that might influence your results. If you're looking at vacation days across the developed world, remember that Brits count paid holidays in vacation time while Americans do not. If you're analyzing vote totals for municipal candidates across the US, remember that candidates in New York and Minnesota can stand for multiple parties. What was the wording on the survey that you're basing your analysis off of? How did each of your record labels classify their artists' genres?
  4. Thou shalt inspect the data
    These days it's incredibly easy to type summary(lm(a ~ b, data = dataset)) into an R prompt, see a significant result, copy the coefficient, and declare victory. It's just as easy to take a closer look at your data. At the very least, plot it: humans are visual creatures, and it's much easier to spot something absurd when you visualize it properly (see the short R sketch after this list).
  5. Thou shalt not worship complexity
    The simpler your model is, the easier it will be to tell whether you have a bad or misspecified variable, the less demanding it will be on your data, and the more likely it is that you'll be able to replicate your findings in the future.
  6. Thou shalt look long and hard at thine results
    When you find something, look at your results and make sure they are sane: that everything points the right way (i.e., your coefficients have the expected signs), that the right things are significant, and that the overall conclusion holds together. You don't want to publish a report only to discover that you transposed two of your variables!
  7. Thou shalt beware of the cost of data mining
    With a firehose of data at our fingertips, it's tempting to plug 'n' chug, blindly regressing things against each other and seeing what fits. This rarely ends well: congratulations, you've discovered a random correlation that happens to fit only your particular dataset. Results like these are often bunk and evaporate as soon as more data becomes available.
  8. Thou shalt be willing to compromise
    Unfortunately for economists, statisticians, and data analysts everywhere, we live in the real world, not a neatly defined model. Your data will not be perfect, and it is your responsibility to work with what you have, not to cross your arms and hold out for that perfect, special dataset. As in real life, there is no Mr. or Ms. Right. It is up to you to work with the conditions you have, understand their implications, and deliver as good a result as you can.
  9. Thou shalt not confuse significance with substance
    Just because a result is statistically significant does not mean it actually matters. With enough data points, you'd be surprised how much magically becomes significant. It's just as important to look at the coefficients and effect sizes to judge whether the relationship is substantively meaningful.
  10. Thou shalt confess in the presence of sensitivity
    One of Nate Silver of 538's favorite hobbies is picking apart overfitted political models. If your model relies on a small leap of faith in your variable specification, be responsible and disclose it. Otherwise you could end up with egg on your face.
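As a concrete illustration of commandment 4, here's a minimal R sketch using the built-in anscombe dataset: four x/y pairs engineered so that the regression output is nearly identical, even though a single plot shows they are wildly different.

# Anscombe's quartet: four x/y pairs that share (almost exactly) the same
# means, variances, correlation, and fitted regression line.
data(anscombe)

# Fit the same simple regression to each pair.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})
sapply(fits, function(f) round(coef(f), 2))              # near-identical coefficients
sapply(fits, function(f) round(summary(f)$r.squared, 2)) # near-identical R^2

# But a quick plot makes the differences impossible to miss.
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(fits[[i]])
}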
With those in mind, let's go back to the connection between margarine consumption and divorce rates in Maine. This is from Spurious Correlations, an excellent demonstration of the dangers of blindly trusting your preferred statistic over common sense. The website pulls a number of data feeds from public- and private-sector sources over a ten-year period ('00-'09) and picks out the strongest correlations between them. Thus you dig up alarming "conclusions" about Nicolas Cage's film appearances, oil imports from Norway, or American sour cream consumption. Of course, all of these connections are absurd, merely the biggest coincidences in a suitably large collection of time series.
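To see how easily pure noise produces this kind of "result", here's a small simulation in the same spirit (not the site's actual methodology, just an R illustration): generate a pile of unrelated random walks over a ten-year window and pick out the strongest pairwise correlation.

set.seed(42)

# 50 completely unrelated random walks, each ten "years" long
series <- replicate(50, cumsum(rnorm(10)))

# All pairwise correlations, ignoring each series' correlation with itself
cors <- cor(series)
diag(cors) <- 0

# The single best coincidence in the pile
best <- which(abs(cors) == max(abs(cors)), arr.ind = TRUE)[1, ]
max(abs(cors))                                          # typically well above 0.95
cor.test(series[, best[1]], series[, best[2]])$p.value  # "highly significant" noise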

In conclusion, be responsible when using statistics, or your faulty analysis of American marriages could leave you a sworn enemy of the American margarine industry.

Friday, January 30, 2015

Diversity and Inequality

Last weekend, a post on Reddit's linguistics subforum showing a map of worldwide language diversity was a big hit. The map used a metric called Greenberg's Linguistic Diversity Index (LDI), the probability that two randomly chosen inhabitants of a country have different mother tongues. Largely homogeneous states like South Korea and Haiti have very low scores (0.003 and 0.000, respectively), while places like Tanzania and Papua New Guinea, where every village might speak a different language, have LDIs of 0.95 or higher.

Source: Reddit User Whiplashoo21
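For reference, Greenberg's index is simply one minus the probability that two random residents share a mother tongue: if language i is spoken by a share p_i of the population, LDI = 1 - sum(p_i^2). A quick sketch with made-up language shares:

# Greenberg's Linguistic Diversity Index: 1 - sum(p_i^2),
# where p_i is the share of the population speaking language i
ldi <- function(shares) 1 - sum(shares^2)

ldi(c(0.999, 0.001))    # near-homogeneous country: ~0.002
ldi(rep(1/200, 200))    # 200 equally common languages: 0.995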

In the ensuing discussion, one user was interested in seeing how linguistic diversity compares with development. As you can see on the map above, many of the most linguistically diverse countries are in impoverished sub-Saharan Africa. In fact, this is a popular topic in political science and economics: does cultural diversity make a country better off, or does it leave a state susceptible to Balkanization and ethnic conflict?

To test this out, I started by looking at exactly what the commenter was asking about: LDI against inequality-adjusted HDI. For those who don't know, the Human Development Index is an attempt at a more holistic measure of development, combining three basic indicators (life expectancy, educational attainment, and gross national income per capita) into a single number. Since 2010, the UNDP has also published a second index, adjusted for inequality. Most states provide the UN with enough data to compute both indices, although there are a number of notable exceptions.

Source: Wikipedia image, Data from UNDP.


For my data, I used UNESCO's 2009 report on linguistic diversity for the LDI and the UNDP's 2014 figures for the HDI. This is a slightly different source than the reddit post used, but there are no substantial differences between the two sets of LDIs.

IHDI = -0.308LDI + 0.691, R² = 0.246***, p < 0.00001


Unfortunately, diversity does not appear to be a positive at first glance. As the graph shows, there's a clear, though modest, negative correlation: the model estimates that linguistic diversity accounts for about 25% of the variation in IHDI scores. That isn't a huge effect, but it's an interesting one. Of course, it's a cardinal error in statistics to equate correlation with causation, and in this case there are two things to look out for. First, it's very likely that linguistic diversity isn't exogenous; it doesn't arise on its own. Second, there's very likely some third variable acting on both a country's diversity and its development. To explore that, I grouped countries by continent.
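Here's a rough sketch of how a fit like this might be reproduced in R, assuming a hypothetical data frame countries with columns ldi, ihdi, and continent (the column names are mine, not from either dataset):

# One row per country: ldi and ihdi on 0-1 scales, continent as a factor
fit <- lm(ihdi ~ ldi, data = countries)
summary(fit)    # slope, R^2, and p-value like those reported above

# Coloring points by continent exposes the likely confounder
conts <- factor(countries$continent)
plot(countries$ldi, countries$ihdi, col = as.integer(conts),
     xlab = "Linguistic Diversity Index", ylab = "Inequality-adjusted HDI")
abline(fit)
legend("topright", legend = levels(conts), col = seq_along(levels(conts)), pch = 1)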


Notice how much more diverse and how much poorer Africa is than the rest of the world. Both are, at least in part, a product of colonialism; the diversity traces back to the Scramble for Africa, which prioritized natural resources and landmarks over pre-existing ethnic groups when colonial borders were drawn.

Looking at broader cultural diversity produces similar results. Here I used a measure from a paper by Erkan Gören at the University of Oldenburg, who constructed a new index that takes into account religious, ethnic, and linguistic differences, and then adjusts for how similar the languages actually are. His cultural diversity map (made for a blog post by Pew Research) looks broadly similar to the original linguistic diversity map.


And, as you might expect, regressing against his figures produces much the same trendline.

IHDI = -0.404GI + 0.697, R² = 0.282***, p < 0.00001

Going back to the Scramble for Africa, I decided to try something new and standardized IHDI scores within each continent. Since countries on the same continent share a lot of history (much of the Americas consists of monolingual ex-colonies with only small indigenous populations remaining, Africa's boundaries were not drawn along ethnic lines, and Asia's ethnic lines are more or less well-drawn), maybe much of this relationship is just a product of colonial history.
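A sketch of one way to do that adjustment, reusing the hypothetical countries data frame from above: standardize IHDI within each continent, then regress the z-scores on LDI.

# z-score IHDI within each continent, so every country is compared
# against its continental neighbors rather than against the whole world
countries$ihdi_z <- ave(countries$ihdi, countries$continent,
                        FUN = function(x) (x - mean(x)) / sd(x))

fit_z <- lm(ihdi_z ~ ldi, data = countries)
summary(fit_z)    # the relationship weakens considerably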

IHDI_z = -0.529LDI + 0.242, R² = 0.028, p = 0.0508

And sure enough, the relationship breaks down.

One more thing I noticed is that more linguistically diverse countries are also more unequal (measuring inequality as the percentage loss going from HDI to inequality-adjusted HDI).

% Loss = 15.614 LDI + 14.063, R² = 0.200***, p < 0.00001

Then again, that's just because poor countries tend to be more unequal anyway.

% Loss = -58.027HDI + 60.576, R² = 0.7607***, p < 0.00001


Until next time.

Monday, November 24, 2014

How to import your iTunes library into R

If there's anything that 23andme, last.fm, Strava, or any of those countless Facebook apps have shown us, it's that we love analyzing our own data and discovering new things about ourselves. A great source of data is your iTunes library. If you're anything like me, you listen to music constantly: at home, at work, and on the go. With iPods (and iPhones) having been popular for over a decade, iTunes could potentially have data on a significant portion of your life. Why not poke around in it?

iTunes stores its library data in two separate files: iTunes Library.itl and iTunes Library.xml. According to Apple, the .itl file is the database iTunes uses internally, while the .xml file is intended for use by external applications. The XML file (an Apple property list) is in a standard format easily readable by both humans and computers, and XML itself is used all over the web for things like RSS feeds and web services.

With the use of two packages, XML and the ever-so-useful plyr, importing your iTunes library data into an R data frame is a dead-easy process.

Setup:

  1. Make a copy of your iTunes Library.xml or iTunes Music Library.xml file. (Safety first!)
  2. Install XML and plyr from CRAN.

Instructions:

Just follow along with these four easy steps:

  • ituneslib <- readKeyValueDB("iTunes Music Library.xml")
    This command loads your iTunes library into R, parsing it as a .plist (property list) file rather than as generic XML; readKeyValueDB comes from the XML package. Remember to change "iTunes Music Library.xml" to whatever your file is named. Don't panic if it seems like the program has frozen! For my library (3500 songs on a 2007 iMac), this operation took about a minute and a half.
  • tracksxml <- ituneslib$Tracks
    This command grabs the "Tracks" section of the library and moves it into a separate variable for convenience. Tracksxml is currently a list of lists, with each list further wrapped inside a 1-item list, so now we have to restructure this into a more sane data format.
  • tracksxml1 <- lapply(tracksxml, data.frame)
    This command transforms the nested lists into data frames so that plyr can do its magic. (That is, the lists-inside-lists-inside-a-list are now data frames-inside-lists-inside-a-list). This command took me 25 seconds.
  • songs <- ldply(tracksxml1)
    And now plyr steps in and makes everything into a nice neat data frame!
After you're done, feel free to remove the tracksxml lists; they're no longer necessary once you have your data frame. There is likely a faster way to do all of this, but this way is by far the simplest to type out and debug. For easy copy-pasting, here are the commands again:

library(XML)
library(plyr)

ituneslib <- readKeyValueDB("iTunes Music Library.xml")
tracksxml <- ituneslib$Tracks
tracksxml1 <- lapply(tracksxml, data.frame)
songs <- ldply(tracksxml1)

Or you can do it all in one line:
tracks <- ldply(lapply(readKeyValueDB("iTunes Music Library.xml")$Tracks, data.frame))
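Once the data frame exists, the usual tools apply. For example (column names like Artist and Total.Time are my guesses at how iTunes's XML keys survive data.frame's name munging, so check names(tracks) on your own library first):

# Peek at the first few columns that came through the import
str(tracks, list.len = 10)

# Ten most common artists in the library
head(sort(table(tracks$Artist), decreasing = TRUE), 10)

# Total length of the library in hours (Total.Time is stored in milliseconds)
sum(as.numeric(as.character(tracks$Total.Time)), na.rm = TRUE) / 3600000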

Have fun and happy analyzing!

Thursday, November 20, 2014

Trials and tribulations building RStudio Server on a Mac

I've been trying to get RStudio Server to build on my iMac all of yesterday. In my opinion, it's the best IDE for R, and being able to run it on another computer remotely is icing on the cake. My Samsung Chromebook with crouton really doesn't have the "oomph" to... well... do anything meaningful. But the build has been anything but trivial, and I want to document some of the pitfalls I ran into, since this isn't a well-documented process.

I've been relying on instructions from this blog post by Ian Gow, which provides easy-to-follow commands (great when you're too lazy to type out the build steps by hand) and which apparently worked for him. With my setup (MacPorts R 3.1.1 and the latest RStudio source from GitHub), I still ran into some problems.

First, there's a hidden dependency on Java, which I installed with the OS X installer. You need it for apache-ant, and the dependency doesn't surface until later in the build process. When I was building from scratch on my Chromebook (RStudio does not supply ARM binaries), I ran into this problem, so I assumed it would be an issue here too.

Second, R is not recognized by RStudio when running cmake. I started running into these errors:

Rogers-iMac:build roger$ cmake -DRSTUDIO_TARGET=Server -DCMAKE_BUILD_TYPE=Release ..
-- Mac OS X version: 10.10
-- Boost version: 1.50.0
-- Found R: /opt/local/lib/libR.dylib/Resources
CMake Error at src/cpp/CMakeLists.txt:218 (message):
  Minimum R version (2.11.1) not found.


-- Configuring incomplete, errors occurred!
See also "/Users/roger/projects/rstudio/build/CMakeFiles/CMakeOutput.log".
See also "/Users/roger/projects/rstudio/build/CMakeFiles/CMakeError.log".

I initially tried to take the easy way out by passing -DRSTUDIO_VERIFY_R_VERSION=0 to cmake instead (suggested on the RStudio forums here), but then ran into the same problem the poster did: Rinternals.h could not be found.

Rogers-iMac:build roger$ cmake -DRSTUDIO_TARGET=Server -DCMAKE_BUILD_TYPE=Release -DRSTUDIO_VERIFY_R_VERSION=0 ..
-- The C compiler identification is AppleClang 6.0.0.6000054
-- The CXX compiler identification is AppleClang 6.0.0.6000054
[...]
[ 44%] Built target rstudio-monitor
Scanning dependencies of target rstudio-r
[ 44%] Building CXX object src/cpp/r/CMakeFiles/rstudio-r.dir/RErrorCategory.cpp.o
[ 44%] Building CXX object src/cpp/r/CMakeFiles/rstudio-r.dir/RExec.cpp.o
In file included from /Users/roger/projects/rstudio/src/cpp/r/RExec.cpp:17:
In file included from /Users/roger/projects/rstudio/src/cpp/r/include/r/RExec.hpp:30:
In file included from /Users/roger/projects/rstudio/src/cpp/r/include/r/RSexp.hpp:33:
/Users/roger/projects/rstudio/src/cpp/r/include/r/RInternal.hpp:43:10: fatal error: 'Rinternals.h' file not found
#include <Rinternals.h>
         ^
1 error generated.
make[2]: *** [src/cpp/r/CMakeFiles/rstudio-r.dir/RExec.cpp.o] Error 1
make[1]: *** [src/cpp/r/CMakeFiles/rstudio-r.dir/all] Error 2
make: *** [all] Error 2

I wasn't sure how to continue here, but an old post on the RStudio forums offered a nice clue: cmake can have trouble finding the R executable, especially if it's in an uncommon location. MacPorts generally puts its binaries in /opt/local/bin, but RStudio was only searching /usr/bin and /usr/local/bin. So all I had to do was run export RSTUDIO_WHICH_R=/opt/local/bin/R right before the cmake command to point it at the right R (letting it complete successfully), and afterwards fix a few lines in build/CMakeCache.txt. (I also have this written up as a question and answer on StackOverflow.)

cmake ran successfully, and afterwards I changed these lines in CMakeCache.txt:

//R doc directory
LIBR_DOC_DIR:PATH=/opt/local/Library/Frameworks/R.framework/Resources/doc

//R executable
LIBR_EXECUTABLE:PATH=/opt/local/bin/R

//R home directory
LIBR_HOME:PATH=/opt/local/Library/Frameworks/R.framework/Resources

//R include directory (think I found our problem! the original value was
//LIBR_INCLUDE_DIRS:PATH=/opt/local/lib/libR.dylib/Resources/include)
LIBR_INCLUDE_DIRS:PATH=/opt/local/Library/Frameworks/R.framework/Versions/3.1/Resources/include/

//Path to a library.
LIBR_LIBRARIES:FILEPATH=/opt/local/lib/libR.dylib

The final result: Success!



Afterwards, I followed the developers' suggestion and put RStudio behind a proxy for better security. I chose nginx, which is rapidly growing in popularity, in part because of its usefulness as a reverse proxy (look how easy that configuration is!). This will also be useful if I want to host other, unrelated things on this computer: I can easily serve RStudio from a subdirectory and have it work just like normal.

In any case, it's time for me to play around with this a bit. Until next time!

Friday, October 10, 2014

Gun Control and Gun Violence

The United States has heard repeated calls for more gun control legislation in the wake of the Sandy Hook Elementary School shooting. Every day, it seems, there's a new mass shooting, with dire implications for the state of our country. But mass shootings are isolated events almost tailor-made to attract disproportionate media attention; the day-to-day assaults, kidnappings, and murders affect far more people.

Liberals claim that gun control makes places safer by making guns harder to obtain and use illegally. Conservatives counter that gun control makes communities more dangerous by eliminating a key means of self-defense for law-abiding citizens. Each side has its share of talking points: liberals point to the high rates of gun-related deaths in the United States compared to other developed countries (a point used famously in Michael Moore's Bowling for Columbine), while conservatives point to stories of self-defense by would-be victims of robbery, home invasion, domestic abuse, and other serious crimes. Although I fall on the liberal side of the spectrum, I would much rather take up a position supported by empirical evidence. So I asked the question: do stricter gun laws affect crime?

To test this, I decided to run a simple regression of violent crime rates against the relative restrictiveness of gun laws. I am using the FBI's Uniform Crime Report statistics for 2012 and comparing them with FreeExistence.org's Gun Rights Index. Both datasets come with major caveats. FreeExistence.org has a strongly libertarian viewpoint and clearly favors laxer gun laws, while the FBI cautions strongly against using UCR data to rank areas by crime, due to differing reporting standards across police agencies. Despite FE's obvious political slant, I don't see a reason to doubt their data; it's unclear whether it would serve them better to magnify differences and paint states like New York as overly restrictive, or to blur them and obscure any supposed effect.

My first regression ended up with a slight negative correlation, suggesting that more restrictive gun laws lead to more violence. But when I plotted the results, I found that I had fallen into a classic demographic pitfall... counting DC as a state!

Outlier much? (Look in the top left)
If you remove DC, the correlation breaks down completely:

The lack of a correlation likely means that there are a host of other factors involved (understandable, since I only looked at a single metric of gun freedom). As another quick test, I tried the regression on just the homicide rate, although it doesn't look much better.

For those who haven't taken a statistics class, or need a refresher: the r² is an effect size (roughly, the share of the variation in violent crime that the model explains), and the p-value is the probability of seeing a correlation at least this strong if there were actually no relationship. So there appears to be no connection between gun violence and gun laws in the bottom two graphs, and only a slight connection in the first (which would probably go away if I controlled for poverty, population density, or something similar).
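For the curious, here's a rough sketch of the with-and-without-DC comparison, assuming a hypothetical data frame states with columns state, violent_crime_rate, and gun_rights_index (names mine, built from the UCR and FreeExistence figures):

# Full regression, DC included
fit_all <- lm(violent_crime_rate ~ gun_rights_index, data = states)

# Same regression with the outlier dropped
fit_no_dc <- lm(violent_crime_rate ~ gun_rights_index,
                data = subset(states, state != "District of Columbia"))

# Pull out the two numbers discussed above: r^2 and the slope's p-value
report <- function(fit) {
  s <- summary(fit)
  c(r.squared = s$r.squared,
    p.value = coef(s)["gun_rights_index", "Pr(>|t|)"])
}
report(fit_all)     # a slight correlation, driven largely by DC
report(fit_no_dc)   # the relationship all but disappears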

In any case, there are still a number of ways to go on this theme: looking at gun ownership rates, like this analysis from the Violence Policy Center; using a different metric of gun violence, like gun-related deaths (from the CDC) or individual arrest records (through the UCR's successor, the NIBRS); or controlling for other possible factors, like economic inequality or education levels. For now, I'm unable to draw conclusions either way.

Sunday, July 20, 2014

Help Democratize Democracy!

A few days ago, I found a really cool project on Twitter called OpenElections, which is trying to create a master dataset of every certified election result in the US. It has gotten a good deal of recognition, including a grant from the Knight Foundation (named for the founders of one of the biggest newspaper chains in the country). So far the project has collected records from 39 states dating back at least a decade, and is now looking to convert them into a machine-readable format for use in data analysis.


turning this... into this!

Scenic Corinth, MS, whose election
results I transcribed on Friday.
Photo from Corinth Main St. Alliance
As someone interested in statistics and politics, I saw huge potential for this project. So when I saw a chance tweet calling for volunteers, I got onto their GitHub page and started submitting pull requests. It's really simple! This weekend I started working on the 2008 presidential election, although I've still got a few counties to go. I can only imagine how useful this dataset could be, and it makes me feel proud to live in a country where an effort like this is even possible.

So go help democratize democracy, and help the open data movement! It's not very hard to transcribe a few documents, and your effort will aid a very promising project!

Sunday, July 13, 2014

Quantifying Land Constraints in Boom Cities

Condo construction in Brickell, Miami.
Photo from southbeachcars on flickr
A few weeks ago, Stephen Smith (who runs Market Urbanism) was comparing the fates of Miami and Vancouver, two cities that have experienced massive housing construction booms. Both cities have grown tremendously... and grown upward. This comes in the face of major land constraints: the Everglades for Miami, and the North Shore Mountains for Vancouver. But how much do these barriers actually impact development?