Programming assignments

Assignment 4: LDA on Spark

Assignment 3: SVD on Spark

Assignment 2: K-Means on Hadoop

Assignment 1: Naive Bayes on Hadoop

Student presentations

Each student will deliver one presentation over the course of the semester (sign-up and schedule are here). This presentation should cover a recent publication related to big data, data mining, and/or scalable machine learning (here are some suggestions). The presentation should include a slide deck and run 40-50 minutes, leaving time afterward for a discussion of the merits and methods of the paper.

Given the open-ended discussion following the presentation, I strongly encourage everyone to be familiar with the paper (as in, read it ahead of time).

The general framework for a presentation should be as follows:

Use of Git and BitBucket

This course places an emphasis on iterative software development. To that end, we will be using git for version control (available for all operating systems) and bitbucket for assignment and project submission (free accounts are available).

Your assignments will be graded based on the content in your bitbucket repositories, so if you're having problems installing or configuring git or bitbucket, please ask for help. If you have never used a version control system before (e.g., CVS, Subversion, Mercurial, or Perforce), please ask for help.