Geek Evolution – a home Hadoop cluster

Evolution of the Geek

Over the years, the definition of a geek has evolved. I guess it started with a pretty high bar (think Wozniak in a garage with wire-wrapped motherboards in the ’70s), and then dipped for a while.  Does it mean running Hadoop at home now?

Build your own PCs, add a network, DNS…

For a while it simply meant you had a PC at home (probably early to mid ’80s). It then moved back up-scale to building your own PCs from components: case, motherboard, CPU, heat sink, drives, memory, etc. It moved along to the requirement of having a couple of PCs at home that shared an Internet connection. Eventually you needed a few servers for file and print sharing, and maybe a database or web server or two… Need a little internal DNS for any of that?

I have generally felt I was reasonably eligible for at least honorary geek status.  At 15 years old, I wrote my first software on an IBM mini-computer back in the mid-’70s, had a PC in the early to mid-’80s (and two EECS degrees), built my own desktops and servers from components in the mid-’90s, added a server cabinet and network in the early-’00s, etc.  Not sure if the fact that I have a Cisco PIX and know how to configure it from the command line counts for anything.

Home Hadoop Cluster

Using a few hours over the last three-day weekend, I was able to bring up a Hadoop cluster on three CentOS nodes in my basement cabinet. It is heading toward a six-node cluster. The “single-node cluster” was working in about 10-15 minutes. I have always scratched my head at the concept of a “single-node” cluster. Seems like an oxymoron to me.

Single-node “cluster” up and running, so this is easy (I thought)… The hard part was getting the distributed version working. It is always some simple thing that hangs you up. In this case, it was the fact that CentOS puts the machine’s hostname on the loopback line (127.0.0.1) of the /etc/hosts file. The hostname therefore resolved to 127.0.0.1, so the NameNode and JobTracker bound their listening sockets to the loopback address. It worked fine in a single-node configuration, as the DataNodes and TaskTrackers were on the same machine and were also connecting through the loopback address.

Thank goodness for the invention of the search engine.  This handy little post saved me a lot of time debugging the issue:
http://markmail.org/message/i64a3vfs2v542boq#query:+page:1+mid:rvcbv7oc4g2tzto7+state:results

After tailing the logs on the DataNodes, I could see they could not connect to the NameNode. Linux netstat showed that the NameNode was bound to the loopback address. I just was not thinking clearly enough to see that it was not also bound to the static IP address of the NameNode host. Splitting the loopback entry and the static IP address into two lines in the /etc/hosts file, with the hostname on the static IP line, did the trick. I thought the days of editing /etc/hosts were long over with the use of DNS.
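
For anyone who hits the same thing, here is a rough sketch of the before and after. The hostname (namenode01), the static IP (192.168.1.10) and the exact layout of the two hosts files are made up for illustration; they are not copied from my machines. The small Python helper, also hypothetical, just reports whether a hostname appears on a loopback line of a hosts file.

```python
import ipaddress

# Assumed layout before the fix: hostname shares the loopback line.
# "namenode01" and "192.168.1.10" are illustrative values only.
BEFORE = """\
127.0.0.1   namenode01 localhost.localdomain localhost
"""

# Assumed layout after the fix: loopback and static IP on separate lines.
AFTER = """\
127.0.0.1      localhost.localdomain localhost
192.168.1.10   namenode01
"""


def loopback_names(hosts_text):
    """Return every name that the hosts text maps to a loopback address."""
    names = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line:
            continue
        address, *aliases = line.split()
        try:
            if ipaddress.ip_address(address).is_loopback:
                names.update(aliases)
        except ValueError:
            continue  # skip lines that do not start with a valid IP address
    return names


def hostname_on_loopback(hosts_text, hostname):
    """True if the hostname would resolve to loopback via this hosts text."""
    return hostname in loopback_names(hosts_text)


print(hostname_on_loopback(BEFORE, "namenode01"))  # True
print(hostname_on_loopback(AFTER, "namenode01"))   # False
```

If the check comes back True for the NameNode's hostname, services that bind by hostname will most likely end up listening only on 127.0.0.1, which is what the other nodes cannot reach.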

The bar used to measure a geek

I guess that in 2010 the bar for being a home computer geek is running distributed processing from a rack in your basement. Now on to a little MapReduce, Pig and Hive work this weekend.

Ted Cahall