Thursday, May 28, 2015

Fool Millions Into Eating Chocolate With This One Weird Trick!

Yesterday, John Bohannon posted an article on io9, "I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How.", that went absolutely viral. Bohannon ran an exposé on the pervasiveness of bad nutritional science and bad nutritional science reporting by creating his own bogus study. It's a fascinating read showing how the system breaks down if you can get through a few initial barriers. But I wanted to talk specifically about p-hacking: what John & Co. did to get under p < 0.05 and out the door:




  • Use a tiny sample size

The bogus study used three groups: a control group, a group on a low-carb diet, and a low-carb group that also ate bitter chocolate. But the whole study started with only 16 people (one of whom dropped out), split across those three groups, so one or two outlier subjects could create a "significant" trend out of thin air. It's the reason John says that "almost no one takes studies with fewer than 30 subjects seriously anymore."
  • Keep rolling the dice
The study also measured a lot of information from these 15 subjects: weight, blood pressure, circulation, sleep quality, and more, for a total of 18 possible metrics. The significance threshold they used was p < 0.05, meaning that a trend would be declared "significant" if there was less than a 5% chance of a result that extreme arising by chance alone. Sounds good, right?
Except if you start studying multiple variables, each of them has its own 5% chance of being a false positive. With 18 measurements, there's roughly a 60% chance that a study with this threshold will return at least one false positive (see the quick calculation after this list). This is called data dredging, and it is an excellent way to create connections that only fit your study group. (I actually talked about this two posts ago!) This alone can compromise an honest study. But if you're trying to make a bogus nutritional study that says something, not caring what it is...
The more tickets you buy, the more likely you are to win. We didn’t know exactly what would pan out—the headline could have been that chocolate improves sleep or lowers blood pressure—but we knew our chances of getting at least one “statistically significant” result were pretty good.

  • Hope nobody notices:
A reputable journal would have rejected this shoddily constructed study out of hand. But unfortunately (or fortunately, if you're John Bohannon), there are a lot of disreputable journals out there. Some basic fact-checking at the countless websites, newspapers, and magazines that published this study would have shown that the scientific journal was a sham, that the institute was nothing more than a website, or that "Dr. Johannes Bohannon" didn't even have a relevant PhD.
How was this so easy? Simple. John & Co. made a very innocuous-looking press release, complete with a few stock photos and a concise yet sufficiently scientific-sounding summary of key points. By choosing bitter chocolate as their "secret ingredient", they picked something that sounded plausible. In the words of John Bohannon's collaborator Gunter Frank, “Bitter chocolate tastes bad, therefore it must be good for you. It’s like a religion.”
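To make the "keep rolling the dice" point concrete, here's the back-of-the-envelope version of that calculation in R. It assumes the 18 measurements are independent tests, which is a simplification, but it shows how quickly the odds stack up:

# Chance that at least one of 18 independent tests comes up
# "significant" at p < 0.05 purely by chance
n_tests <- 18
alpha <- 0.05
1 - (1 - alpha)^n_tests
# [1] 0.6027857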

By corrupting a study in just the right ways, Bohannon was able to find a statistically significant result. And by sidestepping a key part of the review process, his team's sham work found traction in a popular, yet poorly-scrutinized area of scientific reporting: diet science.

Tuesday, March 10, 2015

Gun Control and Gun Violence, Part 2

After my first go-around looking at the connection between gun control and gun violence, I decided to revisit the question with a more detailed dataset. Before, I was using the FBI's Uniform Crime Report statistics, which cover eight major crimes across almost every police jurisdiction in the United States. This time, I looked at the National Incident-Based Reporting System, which is much more comprehensive: it documents every incident reported by participating jurisdictions, including time, place, crime, weapon used, characteristics of the suspect and victim, and much more. A problem with the UCR is that the data does not include the weapon used; I couldn't tell whether a criminal in Florida had robbed their victim with a gun or just a banana.

Unlike with the much simpler UCR dataset, I had quite a few difficulties getting the NIBRS files to do what I wanted. As you might expect, a comprehensive crime dataset for the United States is big. Really big. While I was able to do my first analysis on a puny ARM-powered Chromebook, this time I had to use my desktop just to open the file: a tab-delimited ASCII file 6 gigabytes large. I normally use R for quantitative analysis these days, but I had to load an open-source clone of SPSS to properly read the file and convert it. This isn't even "Big Data" territory, and I still started running into performance issues. I started using dplyr for its performance benefits, but a query on the entire database would still take about 10-15 minutes to run, even on a solid-state drive. This is where you learn the importance of testing on a subset of your data, because any typo you make stacks up quickly!
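For what it's worth, here's a minimal sketch of the workflow I mean; the file name and column names are hypothetical stand-ins, since the real NIBRS extract files are named and coded differently:

library(dplyr)

# Read only the first 100,000 rows while developing, so typos fail fast
nibrs_sample <- read.delim("nibrs_incidents.tsv", nrows = 100000,
                           stringsAsFactors = FALSE)

# Prototype the query on the sample...
gun_share <- nibrs_sample %>%
  filter(offense_type == "Aggravated Assault") %>%
  group_by(state) %>%
  summarise(pct_gun = mean(weapon == "Firearm"))

# ...and only rerun it against the full 6 GB file once it works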

More discouraging was the fact that NIBRS is not universal. Now, UCR is a voluntary system, but still covers 98% of all Americans. The NIBRS only covers 30% (mostly broken up by state), and doesn't include crimes from the seven biggest states. Look at the coverage map below:

(data from a JRSA report)
Fortunately, there is data available on the proportion of crime covered in and out of the database (the numbers above reflect the percent of each state's crime covered by NIBRS), so it is possible to normalize the data somewhat. Still, the lack of data for many parts of the country may make a definitive analysis difficult.

Even with the more detailed dataset, I wasn't able to find any connection between gun laws and crime. Even when controlling for things like overall crime rates (is a higher percentage of crime gun-related violent crime?) or looking for a disproportionate effect on victims of color, I saw no impact.



Where I did see a big difference was (oddly) with population size. The bigger a state's population, the more often its violent crime tended to involve a gun. The trend was twice as strong for overall population as for urban population alone. Population density had no impact.


To get data of good enough quality, I cut out any state that did not report at least half of its total crime. As you can see in the map, that leaves only a smattering of state agencies, and even fewer cities. The strong correlation between population and the gun share of violent crime may be an artifact of two of the largest reporting states (Ohio and Michigan) being home to a number of poor Rust Belt cities, while many of the smaller reporting states lie outside the traditionally poor Deep South.
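For reference, the filtering and regression step boils down to something like this in R; the data frame and column names are hypothetical stand-ins for the state-level aggregates I built:

# states_agg: one row per state, with
#   pct_crime_covered - share of the state's crime reported through NIBRS
#   pct_violent_gun   - share of violent crime involving a firearm
#   population        - state population
usable <- subset(states_agg, pct_crime_covered >= 0.5)
summary(lm(pct_violent_gun ~ population, data = usable))  # check the population coefficient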

I'd be interested in seeing the impact as more police agencies sign on to NIBRS and open their case data to the public. A larger source of crime data would revolutionize criminology and sociology, and make it easier to understand trends like this. In the meantime, I'm going to have to say the jury's still firmly out on gun control.

Tuesday, February 3, 2015

How Margarine is Tearing New England Families Apart

(Spoiler: it's not, despite a very strong correlation.)
Source: Spurious Correlations at tylervigen.com

When I was in my Econometrics class at college, my professor drilled into me the "Ten Commandments of Applied Econometrics", from an influential paper of the same name by Peter Kennedy. These rules apply as much to econometrics as they do to any statistical modelling exercise:
  1. Thou shalt use common sense in economic theory
    The "Common Sense" that Kennedy (and by extension, Professor Khemraj) talks about is truly basic methodology. Things like mixing up stock and flow variables (eg confusing wealth/assets (stock) with income (flow)), or comparing a per capita figure with a total.
  2. Thou shalt ask the right questions
    If you walk into an analysis blindfolded, the results might not look too good when you turn the lights back on. Make sure you truly understand the question that you're asking.
  3. Thou shalt know the context - do not perform ignorant statistical analysis
    You should always know how your data came to be and how that might influence your results. If you're looking at vacation days across the developed world, remember that Brits count paid holidays in vacation time while Americans do not. If you're analyzing vote totals for municipal candidates across the US, remember that candidates in New York and Minnesota can stand for multiple parties. What was the wording on the survey that you're basing your analysis off of? How did each of your record labels classify their artists' genres?
  4. Thou shalt inspect the data
    These days it's incredibly easy to just type summary(lm(a ~ b, data = dataset)) into an R prompt, see a significant result, copy the coefficient, and declare victory. It's also incredibly easy to take a closer look at your data. At the very least, plot out what your data looks like; humans are visual creatures, and it's much easier to spot something absurd if you visualize it properly. (See the short example after this list.)
  5. Thou shalt not worship complexity
    The simpler your model is, the easier it will be to tell if you have a bad or misspecified variable, the less demanding it will be on your data, and the more likely it is that you will be able to replicate your findings in the future.
  6. Thou shalt look long and hard at thine results
    When you find something, look at your results and make sure they are sane. Check that everything is going in the right direction (i.e. your coefficients have the expected signs), that the right things are significant, and that the overall conclusion makes sense. You don't want to produce a report only to find out that you transposed two of your variables!
  7. Thou shalt beware of the cost of data mining
    With a firehose of data at our fingertips, it's tempting to just plug 'n' chug, blindly regressing things against each other and seeing what fits. This rarely ends well. Congratulations, you've discovered a random correlation that happens to fit only your dataset. Such results are often bunk and evaporate as soon as more data becomes available.
  8. Thou shalt be willing to compromise
    Unfortunately for economists, statisticians, and data analysts everywhere, we live in the real world, not a neatly-defined model. Your data will not be perfect, and it is your responsibility to work with what you have, not to cross your arms and hold out for that "perfect special dataset". Like in real life, there is no Mr./Mrs. Right. It is up to you to work with the conditions you have, understand the implications, and deliver as good a result as you can.
  9. Thou shalt not confuse significance with substance
    Just because a result is significant does not mean it actually means anything. With enough data points, you'd be surprised how much can magically become statistically significant. It's just as important to look at coefficients and effect sizes to understand whether the relationship is actually meaningful.
  10. Thou shalt confess in the presence of sensitivity
    One of FiveThirtyEight's Nate Silver's favorite hobbies is tearing apart overfitted political models. If your model relies on a small leap of faith in your variable specification, be responsible and disclose it. Otherwise you could end up with egg on your face.
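As a quick illustration of commandments 4 and 9, here's a short example using R's built-in anscombe dataset: four small datasets famously constructed to have nearly identical regression statistics while looking wildly different when plotted.

# All four x/y pairs give roughly the same fit: intercept ~3.0, slope ~0.5
sapply(1:4, function(i) {
  round(coef(lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])), 2)
})

# ...but plotting them tells four completely different stories
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]]))
}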
With those in mind, let's go back to the connection between margarine consumption and divorce rates in Maine. This comes from "Spurious Correlations", an excellent demonstration of the dangers of blindly trusting your preferred statistic over common sense. The website pulls a number of data feeds from public and private-sector sources covering a ten-year period ('00-'09) and picks out the strongest correlations between them. Thus you dig up alarming "conclusions" about Nicolas Cage's film appearances, oil imports from Norway, or American sour cream consumption. Of course, all of these connections are absurd, merely the biggest coincidences to be found in a suitably large collection of time series.

In conclusion, be responsible when using statistics, or you could end up a sworn enemy to the American margarine industry from your faulty analysis on American marriages.

Friday, January 30, 2015

Diversity and Inequality

Last weekend, a post on Reddit's linguistics subforum showing a map of worldwide language diversity was a big hit. This map used a metric called Greenberg's Linguistic Diversity Index, which is the percent chance that two random inhabitants of a given country have two different mother tongues. States like largely-homogeneous South Korea and Haiti have low scores (0.003 and 0.000, respectively), while places like Tanzania and Papua New Guinea, where every village might speak a different language, have LDIs of 0.95 or higher.

Source: Reddit User Whiplashoo21

In the ensuing discussion, one user was interested in seeing how linguistic diversity compared with development. As you can see on the map above, many of the most linguistically diverse countries are in impoverished sub-Saharan Africa. In fact, this is a popular topic in political science and economics, studying whether cultural diversity makes a country better off, or whether it leaves a state susceptible to Balkanization and ethnic conflict.

To test this out, I started by looking at exactly what the commenter was asking about; LDI against inequality-adjusted HDI. For those who don't know, the Human Development Index is an attempt at a more holistic measure of development, which looks at three basic indicators (life expectancy, educational attainment, and per capita GDP) to come up with a single number. Since 2010, the UNDP has also published a second index, adjusted for inequality. Most states provide the UN with enough data to compute both indices, although there are a number of notable exceptions.

Source: Wikipedia image, Data from UNDP.


For my data, I used UNESCO's 2009 report on linguistic diversity for the LDI, and the UNDP's 2014 figures for HDI. This data is slightly different from the Reddit post's source, but there aren't substantial differences between the two LDIs.

IHDI = -0.308LDI + 0.691, R² = 0.246***, p < 0.00001


Unfortunately, diversity does not appear to be a positive at first glance. As the graph shows, there's a highly significant, but modest, negative correlation: this model estimates that linguistic diversity accounts for about 25% of the variation in IHDI scores. While that isn't a very large impact, it is an interesting effect to see. Of course, it's a cardinal error in statistics to equate correlation with causation, and in this case there are two things to watch out for. First, it's very likely that linguistic diversity isn't exogenous; it doesn't arise by itself. Second, there's very likely some third variable acting on both a country's diversity and its development. To show this better, I grouped countries by continent.
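For anyone who wants to reproduce this, the core of the analysis is a one-line model fit. The column names below are my own hypothetical ones for the merged UNESCO/UNDP table:

# countries: a hypothetical data frame with one row per country,
# holding the merged ldi (0-1) and ihdi (0-1) values
fit <- lm(ihdi ~ ldi, data = countries)
summary(fit)                      # slope, R-squared, p-value
plot(countries$ldi, countries$ihdi)
abline(fit)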


Notice how much more diverse and poorer Africa is than the rest of the world. Both are in part a product of colonialism; the diversity, for one, stems from the Scramble for Africa, in which colonial borders were drawn around natural resources and landmarks rather than pre-existing ethnic groups.

Looking at broader cultural diversity comes up with similar results. I used a measure from a paper by Erkan Gören at the University of Oldenburg. Gören came up with a new index that takes into account religious, ethnic, and linguistic differences, and then adjusts them for how similar the languages actually are. This cultural diversity map (made for a blog post by Pew Research) looks broadly similar to the original linguistic diversity map.


And indeed, his figures produce a similar trendline.

IHDI = -0.404GI + 0.697, R² = 0.282***, p < 0.00001

Going back to the Scramble for Africa, I decided to try something new and adjusted (z-scored) the IHDI scores by continent. Since countries on the same continent share a lot of history (much of the Americas consists of monolingual ex-colonies with only a small indigenous population remaining, Africa's boundaries were not drawn along ethnic lines, Asia's ethnic lines are more or less well-drawn), maybe much of this relationship is just a product of colonial history.
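Here's a rough sketch of that adjustment, assuming the same hypothetical countries data frame plus a continent column; the idea is to express each country's IHDI as a z-score relative to its own continent before rerunning the regression:

library(dplyr)

countries <- countries %>%
  group_by(continent) %>%
  mutate(ihdi_z = (ihdi - mean(ihdi, na.rm = TRUE)) / sd(ihdi, na.rm = TRUE)) %>%
  ungroup()

summary(lm(ihdi_z ~ ldi, data = countries))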

IHDI_z = -0.529LDI + 0.242, R² = 0.028, p = 0.0508

And sure enough, the relationship breaks down.

One more thing I noticed is that more linguistically-diverse countries are more unequal.

% Loss = 15.614 LDI + 14.063, R² = 0.200***, p < 0.00001

Then again, that's just because poor countries tend to be more unequal anyway.

% Loss = -58.027HDI + 60.576, R² = 0.7607***, p < 0.00001


Until next time.

Monday, November 24, 2014

How to import your iTunes library into R

If there's anything that 23andme, last.fm, Strava, or any of those countless Facebook apps have shown us, it's that we love analyzing our own data and discovering new things about ourselves. A great source of data is your iTunes library. If you're anything like me, you listen to music constantly: at home, at work, or on the go. With iPods (and iPhones) having been popular for over a decade, iTunes could potentially have data on a significant portion of your life. Why not poke around it?

iTunes stores its library data in two separate files: iTunes Library.itl and iTunes Library.xml. According to Apple, the .itl file is the database iTunes uses for itself, while the .xml file is intended for use with external applications. The XML file is in a standard format easily readable by both humans and computers, and it's used all over the web for things like RSS feeds and web services.

With the use of two packages, XML and the ever-so-useful plyr, importing your iTunes library data into an R data frame is a dead-easy process.

Setup:

  1. Make a copy of your iTunes Library.xml or iTunes Music Library.xml file. (Safety first!)
  2. Install XML and plyr from CRAN.

Instructions:

Just follow along with these four easy steps:

  • ituneslib <- readKeyValueDB("iTunes Music Library.xml")
    This command parses your iTunes library, which is actually a property list (.plist) rather than generic XML, into an R list. Remember to change "iTunes Music Library.xml" to whatever your file is named. Don't panic if it seems like the program has frozen! For my library (3500 songs on a 2007 iMac), this operation took about a minute and a half.
  • tracksxml <- ituneslib$Tracks
    This command grabs the "Tracks" section of the library and moves it into a separate variable for convenience. tracksxml is currently a list of lists, with each track's list further wrapped inside a 1-item list, so now we have to restructure this into a saner data format.
  • tracksxml1 <- lapply(tracksxml, data.frame)
    This command transforms the nested lists into data frames so that plyr can do its magic. (That is, the lists-inside-lists-inside-a-list are now data frames-inside-lists-inside-a-list). This command took me 25 seconds.
  • songs <- ldply(tracksxml1)
    And now plyr steps in and makes everything into a nice neat data frame!
After you're done, feel free to remove the tracksxml lists; they're no longer necessary once you have your data frame. There is likely a faster way to do all of this, but this way is by far the simplest to type out and debug. For easy copy-pasting, here are the commands again:

ituneslib <- readKeyValueDB("iTunes Music Library.xml")
tracksxml <- ituneslib$Tracks
tracksxml1 <- lapply(tracksxml, data.frame)
songs <- ldply(tracksxml1)

Or you can do it all in one line:
tracks <- ldply(lapply(readKeyValueDB("iTunes Music Library.xml")$Tracks, data.frame))
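Once you have the data frame, the fun begins. As one quick (and hedged) example, the snippet below tallies your most-played artists; it assumes your library includes the standard Artist and Play Count fields, which R will typically rename to Artist and Play.Count when building the data frame:

library(plyr)

# Top ten artists by total play count
top_artists <- ddply(songs, "Artist", summarise,
                     plays = sum(Play.Count, na.rm = TRUE))
head(top_artists[order(-top_artists$plays), ], 10)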

Have fun and happy analyzing!

Thursday, November 20, 2014

Trials and tribulations building RStudio Server on a Mac

I've been trying to get RStudio Server to build on my iMac all of yesterday. In my opinion, it's the best IDE for R, and being able to run it on another computer remotely is icing on the cake. My Samsung Chromebook with crouton really doesn't have the "oomph" to... well... do anything meaningful. But building has been anything but trivial, and I want to document some pitfalls I've run into here, since this isn't a well-documented process.

I've been relying on the instructions in this blog post by Ian Gow, which provides easy-to-follow commands (great when you're too lazy to type out build commands by hand) and has apparently worked for him. With my setup (which uses MacPorts R 3.1.1 and the latest RStudio source from GitHub), I've still run into some problems.

First, there's a hidden dependency on Java, which I installed with the OS X installer. You need it for apache-ant, and the problem doesn't come up until later in the build process. I ran into this when building from scratch on my Chromebook (RStudio does not supply ARM binaries), so I assumed it would be an issue here too.

Second, R is not recognized by RStudio when running cmake. I started running into these errors:

Rogers-iMac:build roger$ cmake -DRSTUDIO_TARGET=Server -DCMAKE_BUILD_TYPE=Release ..
-- Mac OS X version: 10.10
-- Boost version: 1.50.0
-- Found R: /opt/local/lib/libR.dylib/Resources
CMake Error at src/cpp/CMakeLists.txt:218 (message):
  Minimum R version (2.11.1) not found.


-- Configuring incomplete, errors occurred!
See also "/Users/roger/projects/rstudio/build/CMakeFiles/CMakeOutput.log".
See also "/Users/roger/projects/rstudio/build/CMakeFiles/CMakeError.log".

I initially tried taking the easy way out by passing -DRSTUDIO_VERIFY_R_VERSION=0 to cmake instead (suggested on the RStudio forums here), but then ran into the same problem that the poster did: the build couldn't find Rinternals.h.

Rogers-iMac:build roger$ cmake -DRSTUDIO_TARGET=Server -DCMAKE_BUILD_T[330/3072]
e -DRSTUDIO_VERIFY_R_VERSION=0 ..
-- The C compiler identification is AppleClang 6.0.0.6000054
-- The CXX compiler identification is AppleClang 6.0.0.6000054
[...]
[ 44%] Built target rstudio-monitor
Scanning dependencies of target rstudio-r
[ 44%] Building CXX object src/cpp/r/CMakeFiles/rstudio-r.dir/RErrorCategory.cpp
.o
[ 44%] Building CXX object src/cpp/r/CMakeFiles/rstudio-r.dir/RExec.cpp.o
In file included from /Users/roger/projects/rstudio/src/cpp/r/RExec.cpp:17:
In file included from /Users/roger/projects/rstudio/src/cpp/r/include/r/RExec.hp
p:30:
In file included from /Users/roger/projects/rstudio/src/cpp/r/include/r/RSexp.hp
p:33:
/Users/roger/projects/rstudio/src/cpp/r/include/r/RInternal.hpp:43:10: fatal err
or: 'Rinternals.h' file not found
#include <Rinternals.h>
         ^
1 error generated.
make[2]: *** [src/cpp/r/CMakeFiles/rstudio-r.dir/RExec.cpp.o] Error 1
make[1]: *** [src/cpp/r/CMakeFiles/rstudio-r.dir/all] Error 2
make: *** [all] Error 2

I wasn't sure how to continue here, but I found an old post on the RStudio forums with a nice clue. Sometimes cmake has problems finding the R executable, especially if it's in an uncommon location. MacPorts generally puts its binaries in /opt/local/bin, but RStudio was only searching /usr/bin and /usr/local/bin. So all I had to do was point cmake to the location of R (letting it complete successfully) with export RSTUDIO_WHICH_R=/opt/local/bin/R right before running the command, and afterwards fix a few lines in build/CMakeCache.txt. (I also have this written up as a question/answer on StackOverflow.)
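Putting it together, the workaround looks roughly like this (the paths assume a MacPorts install of R and the build directory from above; adjust them for your setup):

export RSTUDIO_WHICH_R=/opt/local/bin/R
cd ~/projects/rstudio/build
cmake -DRSTUDIO_TARGET=Server -DCMAKE_BUILD_TYPE=Release ..
# then fix up build/CMakeCache.txt as shown below before running make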

cmake ran successfully, and afterwards I changed these lines in CMakeCache.txt:

//R doc directory
LIBR_DOC_DIR:PATH=/opt/local/Library/Frameworks/R.framework/Resources/doc

//R executable
LIBR_EXECUTABLE:PATH=/opt/local/bin/R

//R home directory
LIBR_HOME:PATH=/opt/local/Library/Frameworks/R.framework/Resources

//R include directory (think I found our problem! It originally pointed to /opt/local/lib/libR.dylib/Resources/include)
LIBR_INCLUDE_DIRS:PATH=/opt/local/Library/Frameworks/R.framework/Versions/3.1/Resources/include/

//Path to a library.
LIBR_LIBRARIES:FILEPATH=/opt/local/lib/libR.dylib

The final result: Success!



Afterwards, I followed the developers' suggestion and put RStudio behind a proxy for better security. I chose nginx, which is rapidly growing in popularity, in part because of its usefulness as a reverse proxy (look how easy that configuration is!). This will also be useful if I want to host other, unrelated stuff on my computer: I can easily put RStudio behind a directory and have it work just like normal.

In any case, it's time for me to play around with this a bit. Until next time!

Friday, October 10, 2014

Gun Control and Gun Violence

The United States has heard repeated calls for more gun control legislation in the wake of the Sandy Hook Elementary School shooting. Every day it seems there's a new mass shooting, with dire implications for the state of our country. But these mass shootings are isolated events that have almost been tailor-made to provoke disproportionate media attention. The day-to-day assaults, kidnappings, and murders affect a lot more people.

Liberals claim that gun control makes places safer by making guns harder to obtain and use illegally. Conservatives counter that gun control makes communities more dangerous by eliminating a key method of self-defense for law-abiding citizens. Each side has its share of talking points. Liberals point to the high rates of gun-related deaths in the United States compared to other developed countries (a point used famously in Michael Moore's Bowling for Columbine), and conservatives point to stories of self-defense by would-be victims of robberies, home invasions, domestic abuse, or other serious crimes. Although I fall on the liberal side of the spectrum, I would much rather take up a position supported by empirical evidence. So I asked the question: do stricter gun laws affect crime?

To test this, I decided to do a simple regression of violent crime rates against the relative restrictiveness of gun laws. I am using the FBI's Uniform Crime Report statistics for 2012, and comparing it with FreeExistence.org's Gun Rights Index. Both datasets come with a few major caveats. FreeExistence.org has a strongly libertarian viewpoint, and clearly shows some favoritism towards laxer gun laws, while the FBI cautions strongly against using UCR data in order to rank crime in areas, due to different reporting standards by different police agencies. Despite FE's obvious political slant, I don't see a reason to doubt their data; it's unclear whether it would be more in their favor to magnify differences and paint states like New York as overly restrictive, or rather to blur them so as to obscure any supposed effect.

My first regression ended up with a slight negative correlation, suggesting that more restrictive gun laws led to more violence. But when I plotted the results, I found that I had fallen into a classic demographic pitfall... counting DC as a state!

Outlier much? (Look in the top left)
If you remove DC, the correlation breaks down completely:

The lack of a correlation likely means that there are a host of other factors involved (understandable, since I only looked at a single metric of gun freedom). As another quick test, I tried the regression on just the homicide rate, although it doesn't look much better.
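If you're curious about the mechanics, the whole exercise boils down to a few lines of R; the data frame and column names are hypothetical stand-ins for the merged UCR/FreeExistence table:

# states: one row per state (plus DC), with
#   violent_rate     - violent crimes per 100,000 people (UCR, 2012)
#   gun_rights_index - FreeExistence.org's gun rights score
#   name             - state name
summary(lm(violent_rate ~ gun_rights_index, data = states))

# Rerun without the District of Columbia outlier
no_dc <- subset(states, name != "District of Columbia")
summary(lm(violent_rate ~ gun_rights_index, data = no_dc))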

For those who haven't taken a statistics class, or need a refresher: the r² is a measure of effect size (roughly, the share of the variation in gun violence that the model attributes to gun control), and the p-value is the probability of seeing a correlation at least this strong by pure chance if there were no real relationship. So there appears to be no connection between gun violence and gun laws in the bottom two graphs, and only a slight connection in the first (which would probably go away if I controlled for poverty level, population density, or something similar).

In any case, there are still a number of ways to go on this theme: looking at gun ownership rates, like this analysis from the Violence Policy Center; using a different metric of gun violence, like gun-related deaths (from the CDC) or individual arrest records (through the UCR's successor, the NIBRS); or controlling for other possible factors, like economic inequality or education levels. For now, I'm unable to draw conclusions either way.