Setting Up Disco - a JVM-free Alternative to Hadoop

Posted in MapReduce, code, distributed, erlang, python

The Disco Project is a re-implementation of MapReduce that skips the JVM and the Hadoop ecosystem in favor of a hybrid of Python and Erlang. This intrigues me because it brings Python-based tools like scikit-learn and NumPy to a MapReduce framework.

Disco also has its own merits that allow it to stand toe-to-toe with its much more widely known and slightly older brother, Hadoop. All the standard facilities are here: a distributed file system (DDFS) and a task tracker of sorts, but without the verbosity of the JVM and with a much more streamlined setup process.

I hit a couple of snags following the guide in the documentation, which is why I decided to write this post. This walkthrough largely echoes the guide, but it is targeted at getting a local development install up and running. It is also for OS X systems only, so we’ll be using Homebrew for package installs.

First, grab some dependencies.

brew install python erlang lighttpd varnish

Make sure you have a local SSH server running as well. Thankfully, OS X comes with one. Open System Preferences, go to Sharing, and check that the Remote Login service is enabled.

Verify that your SSH key is configured correctly by running ssh localhost and making sure you aren’t asked for a password.

If that succeeds without prompting for a password, you can proceed. Setting up SSH keys can get pretty tricky and there are already a lot of guides out there for that, so I’ll skip over it. If you need help, try here.
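If you’d rather script that check than eyeball it, here is a minimal sketch in Python. The function names are my own, not part of Disco. It leans on OpenSSH’s BatchMode option, which makes ssh fail right away instead of stopping to ask for a password, and it assumes the ssh client is on your PATH.

```python
import subprocess

def ssh_check_command(host="localhost"):
    # BatchMode=yes tells ssh to fail rather than prompt for a password.
    # ConnectTimeout keeps the check from hanging if nothing is listening.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]

def passwordless_ssh_works(host="localhost"):
    # Hypothetical helper: True only if ssh logs in and runs `true`
    # without needing any interactive input.
    try:
        result = subprocess.run(ssh_check_command(host), capture_output=True)
    except FileNotFoundError:
        # No ssh client found on PATH.
        return False
    return result.returncode == 0
```

Calling passwordless_ssh_works() from a Python prompt should return True once your key is set up. A False only tells you something is off (service disabled, key not authorized, and so on), not which part.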

Add the following to your ~/.bash_profile, then run source ~/.bash_profile or restart your shell.

export DISCO_HOME=/Users/$USER/disco
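A forgotten source is an easy way to end up with an empty DISCO_HOME, so here is a small, hedged sanity check you could run from Python in the same shell. The helper name is made up for this post. It only looks at the environment variable and does not check that the directory exists, since nothing has been cloned yet at this point.

```python
import os

def disco_home_problem(env=os.environ):
    # Hypothetical helper: returns a message describing what looks wrong
    # with DISCO_HOME, or None if the value looks usable.
    value = env.get("DISCO_HOME")
    if not value:
        return "DISCO_HOME is not set; did you source your profile?"
    if not os.path.isabs(value):
        return "DISCO_HOME should be an absolute path, got %r" % value
    return None
```

Printing disco_home_problem() should show None when the export has taken effect.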

Here are the commands to download and build Disco in a way that works well for development and local testing with Homebrew’s packages. This installs the master branch, which is considered stable for production. The newer develop branch does work and has a lot of improvements, but there’s a larger chance of hitting a bug.

cd ~/
git clone git:// $DISCO_HOME
cd $DISCO_HOME
git checkout master # Skip this step to live on the edge
cd lib && python install && cd ..

If you followed all the steps correctly, disco nodaemon should start Disco without error.

In another shell, running python $DISCO_HOME/examples/util/ will confirm your setup by running the classic of all MapReduce jobs, the Word Count.
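If you’re new to MapReduce, here is roughly what a word count computes, written as a plain-Python sketch. It deliberately does not use Disco’s API, and it runs in a single process. The function names are my own and are only meant to show the shape of the map and reduce steps.

```python
from collections import defaultdict

def map_words(line):
    # Map step: emit a (word, 1) pair for every word in a line of input.
    for word in line.split():
        yield word, 1

def reduce_counts(pairs):
    # Reduce step: sum the counts emitted for each distinct word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def word_count(lines):
    # Run the map step over every line, then feed all pairs to the reduce step.
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(pairs)
```

For example, word_count(["a b a", "b c"]) gives {"a": 2, "b": 2, "c": 1}. A real framework spreads the map calls across workers and groups the pairs by key before reducing, which is the part Disco takes care of for you.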

That’s it, you’re all set up. Happy hacking on Disco!