CSCI 6900 - Spring 2015

Announcements

[May 7, 2015]

Final projects have been graded, and all final grades have been posted on eLC and Athena. I thoroughly enjoyed reading the project papers. Thank you for a great semester!

[Mar 19, 2015]

Assignment 4 is now live on the assignments page, a whole day early!

[Mar 5, 2015]

Assignment 3 is now live on the assignments page!

Course overview

Distributed computing and the paradigm of "big data" have garnered significant attention in recent years. As the costs of capturing and storing information have plummeted, analytical bottlenecks have shifted from data acquisition and curation to downstream analysis. However, this shift has created its own set of problems, the most pertinent of which is that large datasets are computationally expensive to process. Algorithms that efficiently process data that fit in memory may become prohibitively expensive to use on larger datasets. Consequently, it can be difficult to gain an intuition for the underlying data and to troubleshoot issues.

This course has three primary goals. First, it is intended to provide the student with an appreciation for the issues involved in deploying classic machine learning algorithms--classification, clustering, and dimensionality reduction--to work on datasets that do not fit in main memory. Second, it is intended to provide a working knowledge of and experience with some of the current distributed frameworks and their various philosophies (e.g., Hadoop, Mahout, Spark, Storm). Third, the course is intended to reinforce software engineering best practices by providing students with hands-on opportunities to implement solutions using real-world datasets.

Prerequisites

None required; experience in software engineering (CSCI 4050/6050) is highly recommended, and some knowledge of basic statistics and machine learning (e.g., CSCI 8950) is a bonus, though these topics will be covered in class. At minimum, you should have a good programming background (Java, C, C++, Perl, Matlab, etc.). If you're unsure, make an honest assessment of yourself with this prerequisite check.

AWS Support

This course is supported in part by an AWS in Education Grant award.

CSCI 6900: Mining Massive Datasets
