Ask DT: Why learn "big data" frameworks when you're just a regular hacker?
7 points by tfturing 3853 days ago | 9 comments
That is, why learn frameworks like Hadoop, Spark, Storm, etc. when you don't have copious amounts of AWS credits? Of course it'll look good to an employer, but that seems a somewhat shallow reason if what you really want is to use machine learning to predict things or do science. Is it possible to casually use these frameworks and produce something cool and useful without spending money on cluster time?


3 points by robdoherty2 3853 days ago | link

It doesn't hurt to get acquainted with these tools and get a feel for how they work. You also don't need to set up a massive cluster on AWS to play with them; it's entirely possible to run jobs locally (check the tutorials for most of these projects).

Keep in mind that there's no substitute for working on, say, a 1000-node Hadoop cluster and getting a feel for how such resources have to be shared across a large organization, but learning the basic paradigm of the available tools is still helpful.
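
For example (just a sketch; it assumes you have Spark and pyspark installed locally, and the input file name is made up), a word count in Spark's local mode runs entirely on your laptop:

    # word_count_local.py -- toy sketch; run with: spark-submit --master "local[*]" word_count_local.py
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "WordCountLocal")       # all local cores, no cluster required
    lines = sc.textFile("sample.txt")                     # any small text file you have lying around
    counts = (lines.flatMap(lambda line: line.split())    # split lines into words
                   .map(lambda word: (word, 1))           # emit (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b))      # sum the counts per word
    print(counts.take(10))                                # peek at a few results
    sc.stop()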

-----

1 point by tfturing 3852 days ago | link

But is it of any benefit to someone who isn't planning to work at a tech company and doesn't have disposable income, but who wants to use "big data" frameworks to draw conclusions from large datasets?

-----

3 points by skadamat 3853 days ago | link

I would say be familiar with them, but honestly you'll only really use things like Hadoop or Spark when working at companies with large amounts of data. Usually, employers will just expect you to learn whatever tools are necessary on the job. But definitely still make it obvious that you CAN code and aren't just an academic statistician who only uses R!

-----

2 points by larrydag 3852 days ago | link

I'm a Data Scientist first and a programmer second. I have not found a use-case for these frameworks yet in my work or side projects. I've found that I can do most any Data Science task with R and a relational database (PostgreSQL, MySQL, etc.). What are the typical use cases of these frameworks especially for Data Science?

-----

1 point by jcbozonier 3848 days ago | link

I got by on just coding scripts to process files for quite a while. If you're careful, you can put off learning this stuff until you have quite a bit (read: terabytes) of data. If you're sloppy, you might need it after tens of GB.

The biggest use case is that these frameworks allow a certain amount of "sloppiness" and less pre-planning. I just know that all of this text is getting dumped to S3 and that I can find a way to sift through it all later using Hadoop-ish tools, then pour it into Redshift when I've got a specific view I want to query ad hoc.

It's not that you can't do some of this in other ways; it's just that, for me at least, this approach keeps things nimble. That's all.
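
Very roughly, that flow looks something like this (a sketch only -- the bucket, prefix, and filter are made up, and it assumes your S3 credentials are already configured for Spark/Hadoop):

    # sift_s3_dump.py -- toy sketch of "dump everything to S3, sift through it later"
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "SiftS3Dump")
    raw = sc.textFile("s3n://my-dump-bucket/logs/2014/*")       # raw text dumped with no schema up front
    hits = raw.filter(lambda line: "purchase" in line)          # sift out only what I care about today
    hits.saveAsTextFile("s3n://my-dump-bucket/extracts/purchases/")
    sc.stop()

From there, a Redshift COPY pulls the extract into a table I can query ad hoc.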

-----

1 point by roycoding 3852 days ago | link

I'll echo this. I've been consulting as a data scientist for 1.5+ years, and almost none of the work I've done has required large-scale frameworks, even when clients assumed they were necessary.

That being said, I think it's important to understand the basics and have some grasp of these large frameworks so that you'll be more likely to know when they're appropriate.

-----

1 point by ironchef 3852 days ago | link

In my opinion, the need typically arises once you start getting into either high-volume datasets (200+ GB) or high-velocity datasets (realtime or near-realtime); that's when you often have to resort to one of these frameworks. Higher variety doesn't seem to require them offhand (although they can make things easier), and changes in veracity don't seem to necessitate them either.

-----

1 point by adamlaiacano 3851 days ago | link

The short answer, and the one you seem to be hoping for, is that you don't have to learn these things if you don't need to use them. You also don't need to learn frameworks like Rails or Tornado, or languages like C or Fortran, even though they are all essential everyday tools for a minority of the people who work in "big data."

If the data you're working with fits in memory, you're far better off sticking to Python/R/Julia/MATLAB/Stata/whatever. Your code will run much faster (Hadoop is built for I/O-bound workloads rather than CPU-bound ones), your environment is far easier to set up (especially if you aren't familiar with the JVM), and there are WAY more libraries for doing machine learning.
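
For instance (a toy sketch, nothing more -- random data standing in for whatever fits in memory), an iterative model fit with scikit-learn is a couple of lines and runs in seconds:

    # in-memory fit, no cluster involved (toy data)
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(100000, 10)             # ~8 MB of floats -- comfortably in memory
    model = KMeans(n_clusters=5).fit(X)        # iterative algorithm, fast when the data is local
    print(model.cluster_centers_[:2])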

THAT SAID, if you ever plan on scaling your work, you're going to have to get into the Hadoop world. Scalding has become my go-to for normal data munging/manipulation and some simple classification work, even on my local machine. I've found its "split/apply/combine" syntax far more intuitive than plyr's or pandas', and it's nice to know that I can submit the exact same code to a 100-node cluster if I have to. However, if I want to run any iterative algorithm like SVM or even k-means, I know it's going to be extremely slow, because Hadoop does not handle iteration well.
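
For anyone who hasn't seen the pattern, split/apply/combine in pandas looks roughly like this (toy data, made-up column names); the Scalding version has the same shape but, as noted above, can be shipped to a cluster unchanged:

    # split/apply/combine sketch in pandas (made-up columns)
    import pandas as pd

    df = pd.DataFrame({"user":  ["a", "a", "b", "b", "b"],
                       "spend": [10.0, 5.0, 3.0, 7.0, 2.0]})
    per_user = df.groupby("user")["spend"].agg(["sum", "mean"])   # split by user, apply, combine
    print(per_user)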

-----

1 point by tfturing 3851 days ago | link

I guess I "want" that answer to be true in the same sense that Al Gore "wants" global warming to be true. I'm starting to feel that "big data" frameworks get more attention than they deserve, and that the emphasis clashes with the "Data Science for the Masses" mantra. My biggest fear is that people spend time on them instead of, or think they matter more than, gaining an adequate background in computer programming and statistics. I'm also not sure the topic warrants its own course on Udacity. Even if nobody uses your CS 101 website or search engine, you feel a sense of accomplishment when you complete it: you set out to build something, you built it, and it didn't cost anything beyond the computer you already owned. I doubt you get that feeling of accomplishment from typing some MapReduce code you'll likely never use to its full potential.

-----



