Monday, August 24, 2015

Hard Times and Hard Drives

Ever had a hard drive crash and burn? Backblaze, an online backup company, sees a little over six a day. That's a small fraction of the roughly 35,000 drives they have up and running right now, but it means they are constantly logging
performance and diagnostic metrics for their hard disk arrays. Conveniently, they have also decided to release this data to the public as a set of CSV files, along with a script to import them into an SQL database.
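
For anyone who wants to poke at the raw files directly, here's a minimal sketch of how you could tally annualized failure rates per model with pandas, assuming the standard daily-snapshot columns (date, serial_number, model, capacity_bytes, failure) and an example data/2015/ folder:

```python
import glob
import pandas as pd

# Load the daily snapshot CSVs (the data/2015/ path is just an example).
frames = [
    pd.read_csv(path, usecols=["date", "model", "capacity_bytes", "failure"])
    for path in glob.glob("data/2015/*.csv")
]
drives = pd.concat(frames, ignore_index=True)

# Each row is one drive-day; 'failure' is 1 on the day a drive dies.
stats = drives.groupby("model").agg(
    drive_days=("failure", "size"),
    failures=("failure", "sum"),
    capacity_tb=("capacity_bytes", lambda b: b.max() / 1e12),
)

# Annualized failure rate: failures per drive-year of operation.
stats["annual_failure_rate"] = stats["failures"] / (stats["drive_days"] / 365.0)
print(stats.sort_values("annual_failure_rate", ascending=False).head(10))
```

Dividing failures by drive-days rather than by drive count keeps models with very different deployment histories comparable.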

Taking a quick look at the data, I noticed something really odd about a few of the drives: their failure rates were off the charts! These weren't just oddballs in the rotation, either. A line of 3 TB Seagate Barracuda drives had failure rates of over 40 percent, and these were the third most used drives in Backblaze's server farm!


Taking a closer look at the data, I started to notice a pattern emerge: the worst drives tended to have capacities that were multiples of 1.5 TB. Here's a different perspective on the failure rates:

For an ordinary person, a hard drive crash every 5 years would be absolutely catastrophic. But for those 3 TB Barracuda drives, Backblaze would lose half of them to drive failures every year. And while Seagate (model number starting with ST) drives seemed to be performing the worst, it looked like anything with a size of 1.5 or 3 terabytes was suspect. In fact, the 4 TB Seagate drives seemed to be only a little worse than normal, while a 3 TB Western Digital drive had a 10% failure rate!

It was at this point that I remembered a distant rumor that I heard on the internet once - that 3-platter drives were less reliable than 2-platter ones (a hard drive stores its data on a stack of spinning magnetic "platters"). Sure enough, the ST3000DM001 appears to have three platters, and the ST31500341AS, #2 on the list, has four. But surely that alone wouldn't be enough to cause the massive attrition rates we've seen on those bad hard drives...

It turns out the answer has a lot to do with the weather in Southeast Asia and an industry-wide scramble for hard drives. In a blog post from this April, Backblaze noted the severe reliability problems with this line of hard drives, but explained that they had no choice: 2011 was a difficult year for storage companies everywhere, as catastrophic flooding in Thailand led to a severe worldwide hard drive shortage. Backblaze responded by buying whatever hard drives they could find and hoping that their RAID setup could provide enough redundancy to cover for whatever problems those drives had. Evidently, their ability to compensate for failures was put to the test.

What does this mean for the rest of us? As for me, I'm reassured that any hard drive I buy has a good chance of lasting at least a few years. Some drives are clearly better than others (and as a data-driven person, I'm reassured to see empirical evidence of it). And while 2-platter drives are more expensive per gigabyte, the case for their reliability is easy to see.

That is, if we're not all using solid state drives in two years anyway.
Until next time.

Friday, June 5, 2015

#NFPGuesses: A Recap and a year's summary

Swing and a miss! But this time the surprise was in the opposite direction - my guess was 80,000 jobs too low.

Here's what the distribution of guesses looked like for this month's go-around:


One thing I've noticed is the remarkable consistency of the average guess. I've plotted out all of the rounds of #NFPGuesses from February to today, and notice how the average has come out at around 225,000 each time.

May


April


March


February



As you can see, the average of guesses has been remarkably consistent, even when the actual numbers are not:


It looks like the median Twitter armchair economist is consistent, and consistently optimistic. In an average month, the average Twitter user guesses +222K, which is higher than the average actual jobs figure of 205,400 over the same stretch.
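
The underlying computation is tiny; here's a sketch of it in pandas, using made-up placeholder numbers rather than my actual scraped guesses and the real BLS prints:

```python
import pandas as pd

# Placeholder numbers only, to show the shape of the comparison; the real
# inputs are my scraped #NFPGuesses tweets and the monthly BLS releases.
guesses = pd.DataFrame({
    "month":   ["Feb", "Feb", "Mar", "Apr", "May", "May"],
    "guess_k": [235,   228,   240,   224,   218,   226],
})
actual_k = pd.Series({"Feb": 240, "Mar": 120, "Apr": 220, "May": 280},
                     name="actual_k")

summary = guesses.groupby("month")["guess_k"].mean().rename("avg_guess_k").to_frame()
summary = summary.join(actual_k)                       # line up the actual prints
summary["bias_k"] = summary["avg_guess_k"] - summary["actual_k"]
print(summary)
print("Average bias (thousands of jobs):", round(summary["bias_k"].mean(), 1))
```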

There are a few reasons why this might be the case:

  • The mid-winter dip in the actual numbers coincides with severe weather in the Northeast, especially New England, and it comes right after a huge employment boom in autumn. Given a steady recovery and surprisingly disastrous conditions in one region, it is entirely possible that tweeters would overestimate job growth.
  • Other job data appears to be strong, which means that the NFP numbers might be underwhelming relative to expectations. JOLTS data (the Job Openings and Labor Turnover Survey), for example, has been quite encouraging over the period I looked at: companies are searching for employees harder than they have at any point in the past seven years.

  • Finally, this may simply be a manifestation of optimism bias, the tendency to expect good events more often than they actually happen and bad ones less often. Finance, for example, has a natural incentive toward optimism, since good economic conditions prevail most of the time while crises and recessions come in fits and starts.
I want to keep my blog fresh, but I might have another post or two with insights I've gathered from looking at this particular trend. So stay tuned!

Until next time.


Thursday, June 4, 2015

#NFPGuesses Preview

Tomorrow will be the first Friday of the month, when the Bureau of Labor Statistics releases its employment numbers (the Non-Farm Payroll figures). For Twitter-addicted finance and economics types, this spawns a monthly ritual, #NFPGuesses, where people publicly post their guesses for the month and see who can "nail the number".

For the past four months, I've been collecting every #NFPGuesses tweet in anticipation of analyzing them. Over the past two weeks, I've learned an entirely new statistical package (Pandas for Python) and built a very simple, yet elegant and reusable tool to quickly extract and clean my tweets.
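
The heart of that tool is just a regular expression plus a couple of pandas filters. Here's a rough sketch of the idea (the field names and the regex are simplified stand-ins, not my production code):

```python
import re
import pandas as pd

# Pull the first number out of each #NFPGuesses tweet and normalize
# "225k" / "225,000" / "+225" style guesses to thousands of jobs.
GUESS_RE = re.compile(r"[-+]?\d[\d,\.]*\s*k?", re.IGNORECASE)

def extract_guess(text):
    match = GUESS_RE.search(text)
    if not match:
        return None
    raw = match.group().lower().replace(",", "").replace("+", "").strip()
    if raw.endswith("k"):
        return float(raw[:-1])                  # already in thousands
    value = float(raw)
    return value / 1000 if value > 10_000 else value

tweets = pd.DataFrame({
    "user_followers": [2500, 40, 900],
    "text": ["My #NFPGuesses call: +225k",
             "#NFPGuesses one billion jobs!",
             "230,000 #NFPGuesses"],
})

cleaned = tweets[tweets["user_followers"] >= 100].copy()   # drop tiny accounts
cleaned["guess_k"] = cleaned["text"].apply(extract_guess)
print(cleaned.dropna(subset=["guess_k"]))
```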

But let's get right into it. Here's a visualization of everyone's guesses from last month. I put myself in red:


Surprisingly, there are fewer extreme optimists and pessimists than I expected. That's partly because I filtered out accounts with fewer than 100 followers, and most of the remaining outliers are either jokes or mismatched entries (human data entry would be more reliable, but Python is faster!).

Twitter seems to be in line with industry expectations, but when there's a miss, everybody misses. Last month, both the industry consensus and the Twitter consensus came in well above the actual number: March was an awful hiring month!


It looks like Twitter is as good a forecaster as Wall Street. But this shouldn't surprise anyone, because macroeconomic forecasting is hard, NFP data is noisy (look at the revisions), and #NFPGuesses is "obscure" enough that only people who are already somewhat economically inclined will participate.

Stay tuned, because I'm going to do a lot more with this data... people should be turning in their guesses for tomorrow's numbers right now, and I've practically done all the heavy lifting already!

And oh, right...
See you soon!

Thursday, May 28, 2015

Fool Millions Into Eating Chocolate With This One Weird Trick!

Yesterday, John Bohannon posted an article on io9, "I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How.", that went absolutely viral. Bohannon ran an exposé on the pervasiveness of bad nutritional science (and bad nutritional science reporting) by creating his own bogus study. It's a fascinating read on how the system breaks down once you get past a few initial barriers. But I wanted to talk specifically about p-hacking - what John & Co. did to get under 0.05 and out the door:




  • Use a tiny sample size

The bogus study used three groups - a control group, a group on a low-carb diet, and a low-carb group that also ate bitter chocolate. But the whole study started with only 16 people (one of whom dropped out), split across those three groups, so just one or two outlier subjects could create a "significant" trend out of thin air. It's the reason why John says that "almost no one takes studies with fewer than 30 subjects seriously anymore."
  • Keep rolling the dice
The study also measured a lot of information from these 15 subjects - weight, blood pressure, circulation, sleep quality, and other readings, for a total of 18 possible metrics. The significance threshold they used was p < 0.05, meaning a trend would be declared "significant" if there was less than a 5% chance of seeing it by luck alone. Sounds good, right?
Except that if you study multiple variables, each of them carries its own 5% chance of being a false positive. With 18 metrics, there's roughly a 60% chance (1 - 0.95^18 ≈ 0.60) that at least one of them comes up "significant" by chance; see the little simulation after this list. This is called data dredging, and it's an excellent way to create connections that only hold in your study group. (I actually talked about this two posts ago!) This alone can compromise an honest study. But if you're trying to make a bogus nutritional study that says something, not caring what it is...
The more tickets you buy, the more likely you are to win. We didn’t know exactly what would pan out—the headline could have been that chocolate improves sleep or lowers blood pressure—but we knew our chances of getting at least one “statistically significant” result were pretty good.

  • Hope nobody notices:
A reputable journal would have rejected this shoddily-constructed study out of hand. But unfortunately (or fortunately, if you're John Bohannon), there are a lot of disreputable journals out there. Some basic fact checking by the countless websites, newspapers, and magazines that ran the story would have shown that the scientific journal was a sham, that the institute was nothing more than a website, or that "Dr. Johannes Bohannon" didn't even have a relevant PhD.
How was this so easy? Simple. John & Co. put together a very innocuous-looking press release, complete with a few stock photos and a concise, yet sufficiently scientific-sounding summary of key points. By choosing bitter chocolate as their "secret ingredient", they picked something that sounded plausible. In the words of John Bohannon's collaborator Gunter Frank, “Bitter chocolate tastes bad, therefore it must be good for you. It’s like a religion.”
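
To make the "keep rolling the dice" point concrete, here's a small simulation. The group size and metric count are stand-ins for Bohannon's setup, and the metrics are treated as independent (real body measurements aren't), so treat it as an illustration rather than a reconstruction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_metrics, n_per_group = 10_000, 18, 5   # stand-in sizes

false_positive_studies = 0
for _ in range(n_studies):
    # Two groups drawn from the SAME distribution, so any "significant"
    # difference on any metric is a false positive by construction.
    group_a = rng.normal(size=(n_metrics, n_per_group))
    group_b = rng.normal(size=(n_metrics, n_per_group))
    p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue
    if (p_values < 0.05).any():
        false_positive_studies += 1

print(f"At least one 'significant' result in "
      f"{false_positive_studies / n_studies:.0%} of simulated studies")
# Analytically, with 18 independent tests: 1 - 0.95**18 ≈ 0.60
```

Even with zero real effect baked in, the majority of these simulated "studies" find something publishable.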

By corrupting a study in just the right ways, Bohannon was able to find a statistically significant result. And by sidestepping a key part of the review process, his team's sham work found traction in a popular, yet poorly-scrutinized area of scientific reporting: diet science.

Tuesday, March 10, 2015

Gun Control and Gun Violence, Part 2

After my first go-around looking at the connection between gun control and gun violence, I decided to revisit the question with a more detailed dataset. Before, I was using the FBI's Uniform Crime Report statistics, which cover eight major crimes across almost every police jurisdiction in the United States. This time, I looked at the National Incident-Based Reporting System, which is much more comprehensive: it documents every incident reported by participating jurisdictions, including time, place, offense, weapon used, characteristics of the suspect and victim, and much more. A problem with the UCR is that its data does not include the weapon used - I couldn't tell whether a criminal in Florida had robbed their victim with a gun or just a banana.

Unlike the much simpler UCR dataset, the NIBRS files gave me quite a bit of trouble. As you might expect, a comprehensive incident-level crime dataset for the United States is big. Really big. My first analysis ran fine on a puny ARM-powered chromebook; this time I needed my desktop just to open the file, a tab-delimited ASCII extract about 6 gigabytes in size. I normally use R for quantitative analysis these days, but I had to load an open-source clone of SPSS just to read the file and convert it. This isn't even "Big Data" territory, and I was already running into performance issues. I switched to dplyr for its performance benefits, but a query over the entire database still took about 10-15 minutes to run, even on a solid state drive. This is where you learn the importance of testing on a small subset of your data first, because every typo stacks up quickly!
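
My actual workflow used R and dplyr, but the same "don't load everything at once" idea looks roughly like this in Python/pandas. The file name and column names below are placeholders, not the real NIBRS codebook:

```python
import pandas as pd

# Stream the 6 GB tab-delimited extract in chunks instead of loading it whole,
# tallying incidents by state and weapon code as we go.
chunks = pd.read_csv(
    "nibrs_incident_extract.tsv",
    sep="\t",
    usecols=["state", "weapon_code"],
    chunksize=1_000_000,
)

counts = None
for chunk in chunks:
    part = chunk.groupby(["state", "weapon_code"]).size()
    counts = part if counts is None else counts.add(part, fill_value=0)

print(counts.sort_values(ascending=False).head(20))
```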

More discouraging was the fact that NIBRS is not universal. The UCR is a voluntary system too, but it still covers 98% of all Americans. NIBRS covers only about 30% of the population (participation varies mostly state by state), and it includes no crimes from the seven biggest states. Look at the coverage map below:

(data from a JRSA report)
Fortunately, there is data available on the proportion of crime in and out of the database (the numbers above reflect the percent of crime covered in NIBRS for each state), so it is possible to normalize this data somewhat, but the lack of data for many parts of the country may make a definitive analysis difficult.
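
The adjustment itself is nothing fancy: scale what NIBRS sees by the share of crime it covers in each state. A toy version with invented numbers:

```python
import pandas as pd

# Hypothetical per-state figures: gun-involved violent incidents reported
# to NIBRS, and the share of each state's total crime that NIBRS covers.
states = pd.DataFrame({
    "state": ["OH", "MI", "VT"],
    "nibrs_gun_incidents": [12000, 9500, 150],
    "nibrs_coverage": [0.62, 0.55, 0.90],
})

# Scale observed counts up by the coverage fraction to approximate the
# state-wide total. This crude adjustment assumes covered and uncovered
# jurisdictions look alike, which is exactly the assumption that might fail.
states["estimated_total"] = states["nibrs_gun_incidents"] / states["nibrs_coverage"]
print(states)
```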

Even with a more detailed dataset, I wasn't able to find any connection between gun laws and crime. Even when I reframed the question, looking at the share of violent crime that involves a gun, or at whether victims of color are disproportionately affected, I saw no impact.



Where I did see a big difference was (oddly) with population size. The bigger a state's population, the more often its violent crime tended to involve a gun. The trend was twice as strong for overall population as for urban population alone. Population density had no impact.


To keep the data quality acceptable, I cut out any state that did not report at least half of its total crime. As you can see in the map, that leaves only a smattering of state agencies, and even fewer cities. The strong correlation between population and gun-crime ratios may be an artifact of two of the largest remaining states (Ohio and Michigan) being home to a number of poor rust belt cities, while many of the smaller states in the sample lie outside the traditionally poor Deep South.

I'd be interested in seeing the impact as more police agencies sign on to NIBRS and open their case data to the public. A larger source of crime data would revolutionize criminology and sociology, and make it easier to understand trends like this. In the meantime, I'm going to have to say the jury's still firmly out on gun control.

Tuesday, February 3, 2015

How Margarine is Tearing New England Families Apart

(Spoiler: it's not, despite a very strong correlation.)
Source: Spurious Correlations at tylervigen.com

When I was in my econometrics class in college, my professor drilled into me the "Ten Commandments of Applied Econometrics", taken from an influential paper of the same name by Peter Kennedy. These rules apply as much to econometrics as they do to any statistical modelling exercise:
  1. Thou shalt use common sense and economic theory
    The "common sense" that Kennedy (and by extension, Professor Khemraj) has in mind is truly basic methodology: not mixing up stock and flow variables (e.g. confusing wealth/assets, a stock, with income, a flow), or comparing a per capita figure with a total.
  2. Thou shalt ask the right questions
    If you walk into an analysis blindfolded, the results might not look too good when you turn the lights back on. Make sure you truly understand the question that you're asking.
  3. Thou shalt know the context - do not perform ignorant statistical analysis
    You should always know how your data came to be and how that might influence your results. If you're looking at vacation days across the developed world, remember that Brits count paid holidays in vacation time while Americans do not. If you're analyzing vote totals for municipal candidates across the US, remember that candidates in New York and Minnesota can stand for multiple parties. What was the wording on the survey that you're basing your analysis off of? How did each of your record labels classify their artists' genres?
  4. Thou shalt inspect the data
    These days it's incredibly easy to just type summary(lm(a ~ b, data = dataset)) into an R prompt, see a significant result, copy the coefficient, and declare victory. It's also incredibly easy to take a closer look at your data. At the very least, plot it: humans are visual creatures, and it's much easier to spot something absurd if you visualize it properly.
  5. Thou shalt not worship complexity
    The simpler your model is, the easier it will be to tell if you have a bad or misspecified variable, the less demanding it will be on your data, and the more likely you are to be able to replicate your findings in the future.
  6. Thou shalt look long and hard at thine results
    When you find something, look at your results and make sure that they make sense. Check that everything is going the right way (i.e. your coefficients are positive or negative where they should be), that the right things are significant, and that the overall conclusion is sane. You don't want to publish a report only to find out that you transposed two of your variables!
  7. Thou shalt beware of the cost of data mining
    With a firehose of data at our fingertips, it's tempting to just plug 'n' chug, blindly regressing things against each other and seeing what fits. This rarely ends well. Congratulations, you've just discovered a random correlation that happens to fit only your dataset. Such results are often bunk and evaporate as soon as more data becomes available.
  8. Thou shalt be willing to compromise
    Unfortunately for economists, statisticians, and data analysts everywhere, we live in the real world, not in a neatly-defined model. Your data will not be perfect, and it is your responsibility to work with what you have, not to cross your arms and hold out for that one perfect dataset. As in real life, there is no Mr. or Mrs. Right. It is up to you to work with the conditions you have, understand the implications, and deliver as good a result as you can.
  9. Thou shalt not confuse significance with substance
    Just because a result is significant does not mean it actually matters. With enough data points, you'd be surprised how much can magically become statistically significant. It's just as important to look at coefficients and effect sizes to judge whether the relationship is worth caring about.
  10. Thou shalt confess in the presence of sensitivity
    Nate Silver of 538 makes a hobby of tearing apart overfitted political models. If your model relies on a small leap of faith in your variable specification, be responsible and disclose it. Otherwise you could end up with egg on your face.
With those in mind, let's go back to the connection between margarine consumption and divorce rates in Maine. It comes from "Spurious Correlations", an excellent demonstration of the dangers of blindly trusting your favorite statistic over common sense. The website pulls a number of data feeds from public and private-sector sources covering a ten-year period ('00-'09) and picks out the strongest correlations between them. That's how you dig up alarming "conclusions" about Nicolas Cage's film appearances, oil imports from Norway, or American sour cream consumption. Of course, all of these connections are absurd, merely the biggest coincidences to be found in a suitably large pile of time series.
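
You don't even need real data feeds to see why this works. Here's a toy version of the trick in Python: generate a pile of unrelated random series over the same ten-year window, then go hunting for the strongest correlation among them.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# 1,000 unrelated random walks over a ten-year window ('00-'09), standing in
# for the assorted public and private data feeds the site scrapes.
years = pd.Index(range(2000, 2010), name="year")
series = pd.DataFrame(rng.normal(size=(10, 1000)).cumsum(axis=0), index=years)

# Hunt for the most strongly correlated pair of completely unrelated series.
corr = series.corr().abs().to_numpy()
np.fill_diagonal(corr, 0.0)                       # ignore self-correlations
i, j = np.unravel_index(corr.argmax(), corr.shape)
print(f"Best 'finding': series {i} vs series {j}, |r| = {corr[i, j]:.3f}")
```

With 1,000 series and only ten observations each, the "best" pair is practically guaranteed to have a correlation above 0.9, even though every series is noise by construction.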

In conclusion, be responsible when using statistics, or your faulty analysis of American marriages could make you a sworn enemy of the American margarine industry.

Friday, January 30, 2015

Diversity and Inequality

Last weekend, a post on Reddit's linguistics subforum showing a map of worldwide language diversity was a big hit. The map used a metric called Greenberg's Linguistic Diversity Index (LDI), the probability that two randomly chosen inhabitants of a given country have different mother tongues. Largely homogeneous states like South Korea and Haiti have scores near zero (0.003 and 0.000, respectively), while places like Tanzania and Papua New Guinea, where every village might speak a different language, have LDIs of 0.95 or higher.

Source: Reddit User Whiplashoo21
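
For reference, Greenberg's index is just one minus the sum of squared language shares, which makes it easy to compute when you know each language's share of speakers. A quick illustration with made-up shares:

```python
def linguistic_diversity_index(shares):
    """Greenberg's LDI: the probability that two randomly chosen people
    have different mother tongues, given each language's share of speakers."""
    assert abs(sum(shares) - 1.0) < 1e-9, "shares should sum to 1"
    return 1.0 - sum(p * p for p in shares)

# Illustrative shares only (not official figures): a near-monolingual
# country versus one split evenly across fifty small languages.
print(linguistic_diversity_index([0.998, 0.002]))   # ~0.004
print(linguistic_diversity_index([0.02] * 50))      # 0.98
```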

In the ensuing discussion, one user was interested in seeing how linguistic diversity compares with development. As you can see on the map above, many of the most linguistically diverse countries are in impoverished sub-Saharan Africa. This is in fact a popular topic in political science and economics: does cultural diversity make a country better off, or does it leave a state susceptible to Balkanization and ethnic conflict?

To test this out, I started by looking at exactly what the commenter was asking about: LDI against inequality-adjusted HDI. For those who don't know, the Human Development Index is an attempt at a more holistic measure of development, combining three basic indicators (life expectancy, educational attainment, and per capita income) into a single number. Since 2010, the UNDP has also published a second version of the index, adjusted for inequality. Most states provide the UN with enough data to compute both indices, although there are a number of notable exceptions.

Source: Wikipedia image, Data from UNDP.


For my data, I used UNESCO's 2009 report on linguistic diversity for the LDI, and the UNDP's 2014 figures for the HDI. This is a slightly different source from the one in the reddit post, but the two sets of LDI figures don't differ substantially.
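
The fit below is an ordinary least squares line of IHDI on LDI. Here's a minimal sketch of that computation, assuming a merged country-level file (the file and column names are placeholders):

```python
import pandas as pd
from scipy import stats

# Hypothetical merged file, one row per country, with UNESCO's LDI and the
# UNDP's inequality-adjusted HDI.
df = pd.read_csv("ldi_ihdi.csv").dropna(subset=["ldi", "ihdi"])

fit = stats.linregress(df["ldi"], df["ihdi"])
print(f"IHDI = {fit.slope:.3f}*LDI + {fit.intercept:.3f}, "
      f"R² = {fit.rvalue ** 2:.3f}, p = {fit.pvalue:.2g}")
```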

IHDI = -0.308LDI + 0.691, R² = 0.246***, p < 0.00001


Unfortunately, diversity does not appear to be a positive at first glance. As the graph shows, there is a highly significant, but modest, negative correlation: the model attributes about 25% of the variation in IHDI scores to linguistic diversity. That isn't a huge effect, but it is an interesting one to see. Of course, it's a cardinal error in statistics to equate correlation with causation, and in this case there are two things to watch out for. First, linguistic diversity is very likely not exogenous; it doesn't just happen by itself. Second, there's very likely some third variable acting on both a country's diversity and its development. To show this more clearly, I grouped countries by continent.


Notice how much more diverse and poorer Africa is than the rest of the world. Both are in part a legacy of colonialism; the diversity, in particular, is a product of the Scramble for Africa, which drew colonial borders around natural resources and landmarks rather than pre-existing ethnic groups.

Looking at broader cultural diversity produces similar results. Here I used a measure from a paper by Erkan Gören at the University of Oldenburg, who constructed an index that accounts for religious, ethnic, and linguistic differences and then adjusts for how similar the languages actually are. His cultural diversity map (made for a blog post by Pew Research) looks broadly similar to the original linguistic diversity map.


And his figures produce a similar trendline.

IHDI = -0.404GI + 0.697, R² = 0.282***, p < 0.00001

Going back to the Scramble for Africa, I decided to try something new and adjusted the HDI scores by continent. Countries on the same continent share a lot of history (much of the Americas consists of largely monolingual ex-colonies with small remaining indigenous populations, Africa's borders were not drawn along ethnic lines, Asia's borders follow ethnic lines more or less well), so maybe much of this relationship is just a product of colonial history.
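
One way to make that adjustment, consistent with the IHDI_z label in the fit below, is to z-score each country's IHDI within its continent before refitting; here's a sketch using the same placeholder file as before:

```python
import pandas as pd
from scipy import stats

# Same placeholder file as above, plus a continent column.
df = pd.read_csv("ldi_ihdi.csv").dropna(subset=["ldi", "ihdi", "continent"])

# Standardize IHDI within each continent, then regress the z-scores on LDI.
df["ihdi_z"] = df.groupby("continent")["ihdi"].transform(
    lambda x: (x - x.mean()) / x.std()
)
fit = stats.linregress(df["ldi"], df["ihdi_z"])
print(f"IHDI_z = {fit.slope:.3f}*LDI + {fit.intercept:.3f}, p = {fit.pvalue:.3f}")
```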

IHDI_z = -0.529LDI + 0.242, R² = 0.028, p = 0.0508

And sure enough, the relationship breaks down.

One more thing I noticed is that more linguistically-diverse countries are more unequal.

% Loss = 15.614 LDI + 14.063, R² = 0.200***, p < 0.00001

Then again, that's just because poor countries tend to be more unequal anyway.

% Loss = -58.027HDI + 60.576, R² = 0.7607***, p < 0.00001


Until next time.