References
Below is a list of suggested papers for students to present. It is only a suggestion; you are free to propose your own paper, just run it by me first.
- An Algorithm for the Principal Component Analysis of Large Data Sets
- Semi-Supervised Classification of Network Data Using Very Few Labels
- Fast, Accurate Detection of 100,000 Object Classes on a Single Machine
- Event Detection via Communication Pattern Analysis
- Distributed PCA and k-Means Clustering
- Factorbird: A Parameter Server Approach to Distributed Matrix Factorization
- Generalized Low Rank Models
- Tracking Climate Change Opinions from Twitter Data
- Minerva: A Scalable and Highly Efficient Training Platform for Deep Learning
- NIFTY: A System for Large Scale Information Flow Tracking and Clustering
- Distributed Approximate Spectral Clustering for Large-Scale Datasets
- Style in the Long Tail: Discovering Unique Interests with Latent Variable Models in Large Scale Social E-commerce
- Large-Scale High-Precision Topic Modeling on Twitter
- Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale
- Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
Documentation
git Documentation
Your destination for learning all things git; if you're running into problems, go here first. Note that this is different from Bitbucket, which builds a sort of "social programming network" around git.

Guide for Hadoop hacking
Writing distributed programs is extremely difficult. This CMU wiki page highlights strategies for writing and (especially) debugging your distributed Hadoop programs. Note the progression: start local (including unit tests), then move to the GSRC cluster, and only then test your program on AWS.

Apache Spark
Fast, in-memory distributed computation. Supports Scala, Python, and Java, and runs on multiple cluster managers, including Hadoop YARN.

Apache Hadoop
Open-source implementation of the MapReduce distributed computing paradigm. Designed to be fault-tolerant, to scale to thousands of machines, and to run on clusters of heterogeneous hardware.

Stanford's MMDS Course
This course borrows a healthy amount of content from Stanford's course on mining massive datasets. All lecture materials and PDF copies of the book are available at their website.

Carnegie Mellon's big data course
This course also borrows some content from Carnegie Mellon's 10-605 course, Machine Learning with Large Datasets. All lecture materials are available at their website.
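To make the MapReduce paradigm that Hadoop implements concrete, here is a minimal single-machine sketch of the classic word-count job in plain Python (no Hadoop required). The function names and the shuffle step are my own illustration of what the framework does for you, not Hadoop's actual API.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Mapper: emit an intermediate (word, 1) pair for every word.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group intermediate pairs by key, as the framework
    # would do between the map and reduce phases.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [count for _, count in group]

def reduce_phase(grouped):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In a real Hadoop job the mapper and reducer run as separate tasks across many machines, and the shuffle moves data over the network; the logic of each phase, however, is exactly this simple.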