CSci 3360
Data Science I

John Miller
Fall 2018


Textbook

Main Text (ISL): An Introduction to Statistical Learning, 2013.
Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

Advanced Reference (ESL): The Elements of Statistical Learning, 2nd Ed., 2009.
Trevor Hastie, Robert Tibshirani and Jerome Friedman


Guides

ScalaTion (DSS): Introduction to Data Science Using ScalaTion, 2018.
John A. Miller

R: Linear Regression Using R: An Introduction to Data Modeling, 2016.
David J. Lilja

Spark: Machine Learning Library (MLlib) Guide, 2018.
Apache Spark


Class Time

Day Plan Room
Monday 2:30-3:20 517 Forestry Resources 4
Tuesday 2:00-3:15 453 Chemistry
Thursday 2:00-3:15 453 Chemistry


Course Description

A rigorous overview of methods for text mining, image processing, and scientific computing. Core concepts in supervised and unsupervised analytics, dimensionality reduction, and data visualization will be explored in depth.


Course Topics

  1. ISL Ch. 1: Introduction
  2. ISL Ch. 2: Statistical Learning
  3. ISL Ch. 3: Linear Regression
  4. ISL Ch. 4: Classification
  5. ISL Ch. 5: Resampling Methods
  6. ISL Ch. 6: Linear Model Selection and Regularization
  7. ISL Ch. 8: Tree-Based Methods
  8. ISL Ch. 9, ESL Ch. 4, 12: Support Vector Machines
  9. ESL Ch. 11: Neural Networks


Grading

15% Exam I: topics = {see below} 9/25
5% Quiz: topics = {see below} 10/11
20% Exam II: topics = {see below} 11/15
25% Final Exam 12/11
30% Programs (groups of 3) [software: Java SE 8, Scala 2.12.4, ScalaTion 1.5, R 3.5.x, Apache Spark 2.3.x, SBT 1.1.x]
5% Homework/Tool Talks [presentation]
Exam I: closed notes and book; bring calculator; 1 page info sheet allowed.
Review Date: Sep 20, 2018
Exam Date: Sep 25, 2018
5 Questions:
(1) Mathematical Preliminaries, DSS 2.1-2.3,
(2) Statistical Learning, ISL 2.1-2.2,
(3) Simple Regression, DSS 4.4, ISL 3.1,
(4) Regression, DSS 4.5, ISL 3.2
(5) Regression Issues, DSS 4.5, ISL 3.3.

Quiz: closed notes and book; bring calculator; 2 page info sheet allowed.
Review Date: Oct 9, 2018
Exam Date: Oct 11, 2018
3 Questions:
(1) Simple Regression, DSS 4.4, ISL 3.1
(2) Multiple Regression, DSS 4.5, ISL 3.2
(3) Naive Bayes, DSS 5.6.

Exam II: closed notes and book; bring calculator; 2 page info sheet allowed.
Review Date: Nov 13, 2018
Exam Date: Nov 15, 2018
5 Questions:
(1) Bayesian Classifiers, DSS 5.1-5.7, ISL 2.2.3,
(2) Decision Trees, DSS 5.10, ISL 8.1,
(3) Logistic Regression, DSS 6.1-6.4, ISL 4.1-4.3,
(4) K-NN Classifier, DSS 6.7, ISL 2.2.3, 3.5,
(5) Perceptrons, DSS 9.2, ESL 11.

Final Exam: closed notes and book; bring calculator; 3 page info sheet allowed.
Review Date: Monday, Dec 3, 2018
Exam Date: Tuesday, Dec 11, 2018: 3:30 - 6:30 pm
6 Questions:
(1) Statistics and Machine Learning (one page essay)
(2) Multiple Regression, DSS 4.5, ISL 3.2
(3) Naive Bayes, DSS 5.6,
(4) K-NN Classifier, DSS 6.7, ISL 2.2.3, 3.5,
(5) Cross-Validation, DSS 4.5, 5.1, ISL 5.1,
(6) Perceptrons/Neural Networks, DSS 9.2-9.6, ESL 11.


HW: Homework (individual)

Requirement: Present 1 (individual, subject to change).
No. Text Chapters/Sections Questions Due
1. DSS Ch. 2.1.15 (1) 3; (2) 4; (3) 5; (4) 6 8/23
2. DSS Ch. 2.2.9 (5) 2; (6) 3 8/30
3. ISL Ch. 2 (7) 2; (8) 9 9/6
4. ISL Ch. 3 (9) 5; (10) 8 9/13
5. ISL Ch. 3 (11) 9; (12) 14 .
6. ISL Ch. 4 (13) 6; (14) 11 .
7. ISL Ch. 5 (15) 3; (16) 5 .
8. ISL Ch. 6 (17) 4; (18) 8 .
9. ISL Ch. 8 (19) 3; (20) 8 .
10. ISL Ch. 9 (21) 3; (22) 7 .


TT: Tool Talks (group)

Requirement: Present 1 (group, subject to change).
No. Topic Description Talk Due
1. Scala Scala, SBT, ScalaTion . 8/21
2. R R Language . 8/28
3. Spark Apache Spark . 9/4
4. MLlib Apache Spark MLlib . 9/11
5. TensorFlow TensorFlow Machine Learning Library . 9/18
6. Keras The Python Deep Learning Library . 10/4
7. Kaggle Kaggle is the place to do data science projects (source for final project) . .
8. Weka Waikato Environment for Knowledge Analysis (Weka) . .
9. Watson Watson Analytics . .
10. Google Cloud AI Google Cloud AI: Fast, large scale, and easy-to-use AI services. . .


Projects (subject to change)

TA::

Submit projects to the TA by sending a zip file containing all files to the TA's email with the subject line "[3360] Group # Project #". One submission per group is sufficient. Turn in your fully commented source code files, an SBT build.sbt file and a ReadMe file. The ReadMe file must contain instructions for compiling and running the program as well as a detailed explanation of who coded which parts of the program. Ten percent of the grade is determined by how well documented the project is (all interfaces, classes, fields, constructors, methods and parameters must be documented). Another ten percent of the grade is determined by how well the effort is divided among the group members.
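
For orientation, a minimal build.sbt might look like the sketch below. It is only an illustration under stated assumptions: the project name and version are hypothetical placeholders, the Scala version is the one listed under Grading, and the ScalaTion setup is not shown because this syllabus does not specify how it is to be included. The SBT version itself is normally set separately via sbt.version in project/build.properties.

```scala
// build.sbt: minimal sketch, not the required starter configuration

name := "group-project"      // hypothetical project name (placeholder)

version := "0.1.0"           // hypothetical version (placeholder)

scalaVersion := "2.12.4"     // Scala version listed in the Grading section
```

With a file like this in the project root, `sbt compile` and `sbt run` are the usual commands to document in the ReadMe.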

No. Description Starter Code (must be used) Comment Due
1. Regression Problem in ScalaTion, R and Spark MLlib TBA Multiple Linear Regression using the TBA dataset 9/21
2. Classification Problem in ScalaTion and R TBA Compare Several Classification Algorithms .
3. Neural Networks in ScalaTion and Keras TBA A Regression Problem and a Classification Problem. .
4. Term Project: Data Science Application TBA A two-page proposal giving a detailed description of the application you propose to develop must be submitted with project 3. Project includes data collection, data analytics, interpretation and recommendations for a real-world project. May use approved tools. The term project including a demo will be presented during the last week of class. Worth twice the points of regular projects. .

Term Project Proposal Presentations: November 8, 2018

  1. Data Science Problem to Address
  2. Data Science Process to Follow, e.g., see Process
  3. Data Sources
  4. Tools/Software to be Used

Term Project Final Presentations: November 26, 27, 29, 2018

  1. Data Science Problem Addressed
  2. Data Science Process Followed, e.g., see Process
  3. Data Sources
  4. Tools/Software Used
  5. Modeling Techniques Used (at least 3)
  6. Results - Tables and Graphs
  7. Interpretation of Results
  8. Recommendations

Points: One third for presentation and two thirds for submission. Submit presentation slides including the above items (in pdf) and source code by midnight, November 30, 2018.


Policies