References
Below is a list of suggested papers for students to present. It is only a suggestion; you are free to propose your own paper, just run it by me first.
- An Algorithm for the Principal Component Analysis of Large Data Sets
- Semi-Supervised Classification of Network Data Using Very Few Labels
- Fast, Accurate Detection of 100,000 Object Classes on a Single Machine
- Event Detection via Communication Pattern Analysis
- Distributed PCA and k-Means Clustering
- Factorbird: A Parameter Server Approach to Distributed Matrix Factorization
- Generalized Low Rank Models
- Tracking Climate Change Opinions from Twitter Data
- Minerva: A Scalable and Highly Efficient Training Platform for Deep Learning
- NIFTY: A System for Large Scale Information Flow Tracking and Clustering
- Distributed Approximate Spectral Clustering for Large-Scale Datasets
- Style in the Long Tail: Discovering Unique Interests with Latent Variable Models in Large Scale Social E-commerce
- Large-Scale High-Precision Topic Modeling on Twitter
- Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale
- Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud
Documentation
git Documentation
Your destination for learning all things git; if you're running into problems, go here first. Note that this is different from Bitbucket, which builds a sort of "social programming network" around git.

Guide for Hadoop hacking
Writing distributed programs is extremely difficult. This CMU wiki page highlights strategies for writing and (especially) debugging your distributed Hadoop programs. Note the progression: start local (including unit tests), then move to the GSRC cluster, and only then test your program on AWS.

Apache Spark
Fast, in-memory distributed computation. Supports Scala, Python, and Java, and runs on multiple cluster managers, including Hadoop YARN.

Apache Hadoop
Open-source implementation of the MapReduce distributed computing paradigm. Designed to be fault-tolerant, to scale to thousands of machines, and to run on clusters of heterogeneous hardware.

Stanford's MMDS Course
This course borrows a healthy amount of content from Stanford's course on mining massive datasets. All lecture materials and PDF copies of the book are available at their website.

Carnegie Mellon's big data course
This course also borrows some content from Carnegie Mellon's 10-605 course, Machine Learning with Large Datasets. All lecture materials are available at their website.
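To make the MapReduce paradigm that Hadoop implements concrete, here is a minimal single-machine sketch of the classic word-count job in plain Python (no Hadoop required). The function names and the shuffle step are my own illustration of what the framework does for you, not Hadoop's actual API.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # Mapper: emit an intermediate (word, 1) pair for every word.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group intermediate pairs by key, as the framework
    # would do between the map and reduce phases.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, [count for _, count in group]

def reduce_phase(grouped):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

In a real Hadoop job the mapper and reducer run as separate tasks across many machines, and the shuffle moves data over the network; the logic of each phase, however, is exactly this simple.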