Thursday, May 10, 2012

your data is always wrong and other things I wish I knew a long time ago

Data analysis is tough work. I've been doing it for eight years or so, and I still have an infinite amount to learn and master.

However, there are a few things I have picked up, and some of them are simple enough for the 22-year-old Chris Perry to understand, so I'd like to formally request that whenever any of my descendants gets around to inventing time travel,1 they pass these on to my former self.

The first item you should tell young Chris Perry, of course, is to INVEST EVERY DIME YOU HAVE in Apple stock while it's sub-$50/share, just like your roommate John IS TELLING YOU TO DO EVERY NIGHT YOU IDIOT, but I'll spare you the rest of the non-data-related items for today.

Your data is always wrong


Your data is always wrong.

It's wrong for all sorts of reasons: the data collection mechanism is broken, the servers went down, there are strange confounding effects due to your pseudo-random selection, ghosts are somehow causing weird corner cases every millionth row, solar flares, you can name them all, but the real reason is this: just like mom said, life isn't fair.

Data is not perfect. It never has been. It never will be. Your stats teacher couldn't tell you that because she wanted to perpetuate the myth of some Santa Clausian dataset which exists without blemish and travels the world giving gifts to Analysts on every Pi Day Eve, but the truth is the world is just imperfect, and there are always imperfections in your data.


Once you accept this, you'll be a much better analyst. The data you are looking at, right this second, is wrong. Most data isn't so terribly wrong that you need to go running around outside screaming that the Bayesians are coming, because it usually falls within the margin of error,2 so you're usually okay, but you must always keep this in mind.

Thou shalt not give bad results

This is the only commandment in all of data analysis. You can never ever transgress the most holy commandment and make a mistake that causes you to deliver bad results.

This does not contradict my first point, because there's a big difference between calling Indiana for McCain and telling me that 84% of the U.S. population is Jewish. I did the former, and I blame the margin of error, and I saw the latter happen and it wasn't pretty.

If you quote numbers to your boss, you are held to a high standard. If you retract them later after you discover an error, or worse, someone else does, you look like an idiot. You must be right.
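One cheap way to catch the "84% of the population" class of mistake before your boss does is a plausibility check on every headline number. Here is a minimal sketch in Python; the function name `check_plausible` and the bounds are made up for illustration, and you would pick bounds from what you already know about the world.

```python
def check_plausible(name, value, low, high):
    """Return value unchanged, or raise if it falls outside [low, high]."""
    if not (low <= value <= high):
        raise ValueError(
            f"{name}={value!r} is outside the plausible range [{low}, {high}]"
        )
    return value

# Hypothetical headline number: a population share we expect to be small.
share = check_plausible("population_share", 0.02, 0.0, 0.10)
```

A value like 0.84 would raise here instead of ending up on a slide. This will not catch numbers that are wrong but believable, which is what the next couple of sections are for.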

Know your numbers

Speaking of, be sure and know your numbers. Before you present to anyone, have the answers to a few obvious questions they might ask about those numbers on the tip of your tongue. To use a completely hypothetical example, if you were to list out segments of your user base that return at high rates, you might want to, and again, this is completely hypothetical, have stats prepared on what those people do when they return, you nincompoop.

Have relevant figures ready and in your head when presenting. It will help you not look stupid. Trust me.

Always double check

Even if you just spot-check a result or two by hand, you really need to verify your results.

If you were a software developer, you would have the privilege of writing code that can, at absolute bare minimum, be released and tested on users out in the wild, and by their screams and angry hacker news comments and your burning servers, you'd be able to tell that something is broken.

Not so for data analysis. If your code is buggy, you are screwed beyond belief. With satanically horrific frequency, it is not at all obvious that there are problems with your results.

If you're not verifying your results using some independent method, then you are just relying on your own gut check to make sure the numbers are right, and let me tell you, relying on your so-called "gut check" just means you're going to get your gut checked by a physician someday soon because when the CEO calls you out for screwing something up, you will learn the true meaning of ulcerating.
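To make "some independent method" concrete, here is a small sketch in Python. The data and the function names `return_rates` and `spot_check` are invented for illustration. The idea is just that the number is computed twice, by two pieces of code that share as little logic as possible, and then compared.

```python
from collections import defaultdict

# Hypothetical rows: (user_segment, returned_within_week)
rows = [
    ("mobile", True), ("mobile", False), ("desktop", True),
    ("mobile", True), ("desktop", False), ("desktop", False),
]

def return_rates(rows):
    """Main analysis: return rate per segment, computed in one pass."""
    totals = defaultdict(int)
    returned = defaultdict(int)
    for segment, came_back in rows:
        totals[segment] += 1
        if came_back:
            returned[segment] += 1
    return {seg: returned[seg] / totals[seg] for seg in totals}

def spot_check(rows, segment):
    """Independent check: recompute one segment with a plain filter."""
    matching = [came_back for seg, came_back in rows if seg == segment]
    return sum(matching) / len(matching)

rates = return_rates(rows)
for segment in rates:
    # The two methods should agree up to floating point noise.
    assert abs(rates[segment] - spot_check(rows, segment)) < 1e-9, segment
```

On real data the spot check would more likely be a hand count on a few rows or a query against a different source, but the shape is the same.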

You can ignore some stuff


It's routine and common to see numbers that are close-ish to each other, declare it's close enough for government work,3 and continue on. And that usually works.

But, and going back to your gut here, sometimes you'll see something just a little bit off, and you'll have this momentary little nagging thought that'll tell you to investigate that further.

If you ignore this, you will almost certainly regret it. Write it down, and check it out later when you've fallen out of your groove and you need something else to spend time on before you get stuck going back and ensuring compliance with your logging spec.

You will always perform a task another time

At least. Always. Without fail. No wait, let me hear you tell me that this analysis is special and you're only doing it this one time, so you're just going to whip up some cheap code and you won't save it, or maybe, horror of horrors, you just plan on doing something by hand in Excel,4 and let me tell you that it is a Grand Law of the Universe that either your boss, your co-worker, or you will want you to do it again in the future, or an analysis almost exactly like unto it.

You are one hundred percent guaranteed to do it again. Therefore, script it. You must script it. If it's not easily repeatable by script, you have failed.
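For the skeptics, here is roughly how little it takes. This is a sketch in Python, not a prescription: the script name, the `returned` column, and the CSV layout are all assumptions made up for the example.

```python
# weekly_return_rate.py (hypothetical name)
# Prints the fraction of users who returned, from a CSV export.
import argparse
import csv

def return_rate(csv_lines):
    """Fraction of rows whose `returned` column is "1".

    csv_lines is any iterable of CSV text lines, header first.
    """
    rows = list(csv.DictReader(csv_lines))
    if not rows:
        raise ValueError("no rows in input")
    returned_count = sum(1 for row in rows if row["returned"] == "1")
    return returned_count / len(rows)

def main(argv=None):
    """Command-line entry point: takes the path to the export."""
    parser = argparse.ArgumentParser(description="Return rate from a CSV export.")
    parser.add_argument("path", help="CSV export with a `returned` column")
    args = parser.parse_args(argv)
    with open(args.path, newline="") as handle:
        print(f"{return_rate(handle):.3f}")

# Quick demonstration on in-memory lines instead of a file.
example_rate = return_rate(["user,returned", "a,1", "b,0", "c,1", "d,1"])
```

Call `main()` from the usual `if __name__ == "__main__":` guard and next month's rerun is one command with a new path, not an afternoon of remembering what you clicked.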

I once had a data export project for a product worth literally millions of dollars that depended on one guy who fiddled with some exporter by hand, thinking he would only have to do it once. We did it over a dozen times. It wasn't until the dozenth that I realized he was doing it by hand, and I suddenly understood why he wanted to stick me with a rusty shiv every time I came down telling him it blew up again.

Document your scripts

While you're scripting, please, for the love of everything happy and kind on this earth, please document your scripts. You learned the comment character in your language of choice. Use it. You'll thank yourself in a year.

Also, spend an extra ten seconds and think of a descriptive name for your script. Make it really easy to find.

Lastly, naming your variables bob, jim, foo1, foo2, etc., makes for a very sad you eons later when you're trying to decipher what went on.5 You are guaranteed to forget everything about the script you are now writing within a week. I promise.

Always start small

I know it always seems like your code is bug free, and you can run the analysis over the entire dataset, and it doesn't matter if you have to wait a minute or two for results, because, hey, even if there is a bug in your code, there will only be one, and you'll only need to re-run it once, and the Easter Bunny is real and someday a politician will voluntarily balance the budget.

No. I normally hesitate to contradict people in such strong terms, but you are an idiot. Your code has errors, and you're going to need a few cycles in order to iron it all out, and you'll save yourself a lot of wasted time if you just run everything on a small subset, then, once you're sure everything works, run it on the entire dataset.

Even if this is only a difference of 30 seconds, you will still save yourself loads of time, because do you know what happens to your brain when you wait for 30 seconds in order to change something? It shuts off completely and starts singing the theme song to SpongeBob SquarePants. You are taxing your faculties trying to keep everything in your head, so keep the feedback loop as short as possible.
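One way to build the small run into the script itself is an optional sample size. A minimal sketch in Python, where `analyze` is a hypothetical stand-in for whatever the real analysis does:

```python
import itertools

def analyze(records):
    """Hypothetical stand-in for the real analysis: mean of the values."""
    values = [float(r) for r in records]
    return sum(values) / len(values)

def run(records, sample_size=None):
    """Run on the first sample_size records, or on everything if None."""
    if sample_size is not None:
        records = itertools.islice(records, sample_size)
    return analyze(records)

# Pretend dataset: the numbers 1..10000 stored as strings, as if read from a file.
full_dataset = [str(n) for n in range(1, 10_001)]

quick = run(full_dataset, sample_size=100)  # fast loop while debugging
full = run(full_dataset)                    # only once the code works
```

Note that the first 100 rows are not a random sample, so the quick number is for shaking out bugs, not for quoting to anyone.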

There are two steps to performing analysis

Step 1: Spec out what you are going to do.
Step 2: Do it.

If you try and merge those steps, and just set off down your road less traveled with your silly hopes and dreams, you're going to run into that tree down the path, and instead of busting out your chainsaw and knocking that sucker out of the way, you're going to alter your course a little bit to the left because, hey, it's still roughly the same direction and it'll kind of get you to the same place, and maybe you'll mumble something about the law of large numbers on your way.


Spec out exactly what you are going to do as a separate and distinct step. This will force you to have the discipline necessary to tackle the roadblocks that fall in your path, instead of sissying out and performing a crummier analysis for it.

You need a data buddy

You need someone to bounce ideas off of, to help you think through big problems, and, in general, be a sounding board. Find someone smart who isn't afraid to tell you you're wrong. You'll thank her or him, and me, later when you produce, as Moe said, "the best damn [analysis] in town!"6

1. And you're taking your sweet time you insouciant child.
2. This term was invented by statisticians to force other disciplines to give us a break whenever our numbers are slightly off.
3. This is my favorite sissying-out phrase, beating out appeals to the margin of error.
4. Though if you are doing "data analysis" by hand in Excel, you are living deep in sin and must needs repent.
5. The phrasing of this sentence is basically stolen from my good friend Chris who was kind enough to look over a draft of this. Thanks. And thanks Jamie for looking it over too. Oh, and Britt, thanks for helping so much on this during finals. You're the best.
6. Due to copyright restrictions, my favorite Simpsons moment ever has been removed from YouTube. But you should watch "Homer vs. the Eighteenth Amendment" sometime.

1 comment:

Bruce said...

I totally relate with these, especially the last few:

- You will always perform a task another time
- Document your scripts
- Always start small
- There are two steps to performing analysis

Just last week I violated "start small", and ended up wasting tons of time on slow iterations of a script that I was *sure* was going to work on the next iteration... for about 15 iterations. Too bad I didn't read this post first. :)