Big data doesn't need to be so big

Lots of people are talking about big data these days. There's plenty of discussion about how to build apps for "web scale", and an emphasis on real-time apps that collect comprehensive data.

This article is me playing devil's advocate.

I think the big data hype is hurting our ability to think critically about our problems. Obviously this isn't the case everywhere. Many engineers are excellent and brilliant and do the right thing. But many others simply brute-force their problems into submission.

These days it's possible to cheaply collect and analyze billions of data points on commodity resources. On the surface that seems like a good thing, but we should keep in mind that many of the problems we solve today were solvable yesterday too. We went to the moon on a guidance computer with a roughly 2MHz processor and about 4KB of RAM.

Think about what a feat that was! Think about the problem solving skills that took!

If we could do that, why can't we get by on a million data points instead of a billion? Have we already forgotten statistics? When was the last time you actually asked yourself if you need big data? Unless recording every single data point is your core value proposition, I'm sure you can get by on a random sample.

I'd be willing to bet that any data source that can provide a billion data points will still give you statistically reliable estimates if you randomly sample only 1% or 0.1% of it. That still leaves you with ten million or one million data points to work with.
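To put a rough number on that bet, here is a minimal sketch of the textbook margin-of-error formula for an estimated proportion. It assumes a simple random sample, a 95% confidence level (z = 1.96), and the worst case of a 50/50 split. The function name `margin_of_error` is just an illustration, not something from a particular library.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for an estimated proportion.

    Assumes a simple random sample and uses the normal approximation.
    z=1.96 corresponds to a 95% confidence level; proportion=0.5 is
    the worst case (it gives the widest interval).
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Full data set, a 1% sample, and a 0.1% sample of a billion points.
for n in (1_000_000_000, 10_000_000, 1_000_000):
    print(f"n = {n:>13,}: +/- {margin_of_error(n):.4%}")
```

Under those assumptions, even the 0.1% sample lands within about a tenth of a percentage point. Whether that is good enough depends on the question you're asking, and rare events or small subgroups will need more data than this simple case suggests.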

Big data doesn't need to be so big. Even Google Analytics uses sampling for datasets larger than 100,000.

My suggestion to you, then, is to really evaluate your options before you spin up 50 new EC2 instances. If you're a startup and funds are slim, or if your data population is very dense, then brush up on your statistics and figure out if you can get away with sampled or aggregate data. Please don't think that you can avoid doing smarter work just because servers are getting cheap.
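If you want to try sampling before committing to more servers, one common approach is reservoir sampling, which keeps a fixed-size uniform random sample from a stream without storing the whole stream. The sketch below is one way to do it (the classic "Algorithm R"), not the only one. The names `reservoir_sample` and `events` are made up for illustration, and the data is synthetic.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Uses Algorithm R: memory use is O(k) no matter how long the
    stream is, so the full data set never has to be stored.
    """
    if rng is None:
        rng = random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Pick a slot in [0, i]; keep the new item only if the
            # slot falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# Hypothetical stand-in for a real event stream.
events = range(100_000)
sample = reservoir_sample(events, 1_000, random.Random(42))

full_mean = sum(events) / len(events)
sample_mean = sum(sample) / len(sample)
print(f"full mean:   {full_mean:.1f}")
print(f"sample mean: {sample_mean:.1f} (from {len(sample):,} of {len(events):,} points)")
```

Comparing a statistic on the sample against the same statistic on a full data set you already have is a cheap way to check whether sampling is good enough for your case.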

Like I said, this is a devil's advocate post. It's meant as food for thought. If you really do need all that data, then ignore this. I won't defend this article to the death. But if you've never thought about this sort of thing, it's time to start. Don't get caught up in the hype of big data, and just do what's best for the project!