CSci 6360
Data Science II

John Miller
Spring 2023


Textbook

Textbook (ScalaTion): Introduction to Computational Data Science Using ScalaTion, 2020.
John A. Miller

Introduction to Data Science Using ScalaTion: Lesson Plans, 2020.
John A. Miller

Textbook (ISL): An Introduction to Statistical Learning, 2013.
Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

Textbook (ESL): The Elements of Statistical Learning, 2nd Ed., 2009.
Trevor Hastie, Robert Tibshirani and Jerome Friedman

Textbook (MML): Mathematics for Machine Learning, 2020.
Marc Peter Deisenroth, A Aldo Faisal and Cheng Soon Ong


Guides

R: Linear Regression Using R: An Introduction to Data Modeling, 2016.
David J. Lilja

Spark: Machine Learning Library (MLlib) Guide, 2018.
Apache Spark


Class Time

Day | Time | Plan | Room
Tuesday | 12:45-2:00 | Lecture + TT | Miller Plant Science (1061) 2102
Wednesday | 12:40-1:30 | Lecture | Boyd (1023) 208
Thursday | 12:45-2:00 | Lecture + HW | Miller Plant Science (1061) 2102


Course Description

An introduction to advanced analytics techniques in data science, including random forests, semi-supervised learning, spectral analytics, randomized algorithms, and just-in-time compilers. Distributed and out-of-core processing.


Course Topics

ScalaTion Chapters:

ISL Chapters:

ESL Chapters:


Grading

20% Exam I 2/?
20% Exam II 4/?
25% Final Exam 5/?
30% Programs (groups of 3-4 students) see Software
5% Homework/Tool Talks [presentation]
Software:

Exam I
Review Date: Feb ?, 2023
Exam Date: Feb ?, 2023
5 Questions:

Exam II
Review Date: Apr ?, 2023
Exam Date: Apr ?, 2023
5 Questions:

Final Exam
Exam Date: May ?, 2023
6 Questions:


HW: Homework (pair)

Requirement: Present 1 (pair of students).

See eLC.


TT: Tool Talks (group)

Requirement: Present 1 (group).

See the TA's Updated List.

No. | Topic | Description | Talk | Due
1. | R | R Language | . | .
2. | Spark | Apache Spark | . | .
3. | MLlib | Apache Spark MLlib | . | .
4. | TensorFlow | TensorFlow Machine Learning Library | . | .
5. | Keras | The Python Deep Learning Library | . | .
6. | Weka | Waikato Environment for Knowledge Analysis (Weka) | . | .
7. | Watson, Cognos Analytics | Watson Analytics/Cognos Analytics | . | .
8. | Parallel Matrix Operations | GPU/TPU/FPGA | . | .


Projects

TA:

Submit projects to the TA following their instructions (one submission per group). Turn in your fully commented source code files, an SBT build.sbt file and a ReadMe file. The ReadMe file must contain instructions for compiling and running the program as well as a detailed explanation of who coded which parts of the program. Five percent of the grade is determined by how well the project is documented. Another five percent of the grade is determined by how well the effort is divided among the group members.

See eLC for the three regular projects and more details on the term project. Project #1 on Ch. 6, Project #2 on Ch. 10, Project #3 TBD.

No. | Description | Starter Code (must be used) | Comment | Due
4. | Term Project: Data Science Application | . | A two-page proposal giving a detailed description of the application you propose to develop must be submitted with Project 2. Project includes data collection, data analytics, interpretation and recommendations for a real-world project. May use ScalaTion, R, Scikit Learn, Keras, Spark or other approved software. The term project including a 25-minute presentation and demo will be presented during the last week of class. Must address the ten points/questions listed below. Worth twice the points of regular projects. | .

Ten Questions:

  1. problem statement (focus/purpose of study)
  2. describe datasets (at least 2)
  3. data preprocessing (techniques applied)
  4. visual examination of data
  5. modeling techniques used (at least 3)
  6. why they were chosen
  7. quality of fit for model, parameters
  8. feature selection
  9. interpretation of results
  10. recommendations of study

Optional Project:

  1. Elastic Net Regression: f_opt = sse + lambda ((1-alpha) b.normSq + alpha b.norm1). When alpha = 1 it becomes Lasso Regression and when alpha = 0 it becomes Ridge Regression. Between 0 and 1, it is a hybrid. Need to finish other factorization for Ridge Regression and implement Elastic Net Regression.
  2. Implement Gaussian TAN Bayes (TANBayesR) that supports continuous variables/features.
  3. Implement Hidden Markov Models (HMM).
  4. Implement Partial Least Squares (PLS).
  5. Select a modeling technique and get approval.
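
As a rough illustration of the Elastic Net objective in item 1, the sketch below evaluates f_opt = sse + lambda ((1-alpha) b.normSq + alpha b.norm1) for a given parameter vector b. It is written in plain Python rather than ScalaTion, and the function name elastic_net_objective and the tiny dataset are hypothetical, chosen only for this example. It assumes the penalty applies to every entry of b (including any intercept), as the formula above is written; it only evaluates the objective and does not perform the minimization.

```python
def elastic_net_objective(x, y, b, lam, alpha):
    """Evaluate sse + lam * ((1 - alpha) * ||b||_2^2 + alpha * ||b||_1).

    x     : list of rows (each row a list of feature values)
    y     : list of response values
    b     : list of parameters (all entries penalized, an assumption)
    lam   : shrinkage weight (lambda in the formula)
    alpha : mixing weight in [0, 1]; 1 gives the Lasso penalty, 0 gives Ridge
    """
    # Predicted responses: dot product of each row with b
    y_hat = [sum(xij * bj for xij, bj in zip(row, b)) for row in x]
    # Sum of squared errors
    sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    # Squared 2-norm and 1-norm of the parameter vector
    norm_sq = sum(bj * bj for bj in b)
    norm1 = sum(abs(bj) for bj in b)
    return sse + lam * ((1 - alpha) * norm_sq + alpha * norm1)


# Hypothetical data: an intercept column of ones plus one feature
x = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 2.0, 3.0]
b = [1.0, -2.0]

lasso_value = elastic_net_objective(x, y, b, lam=0.5, alpha=1.0)   # pure 1-norm penalty
ridge_value = elastic_net_objective(x, y, b, lam=0.5, alpha=0.0)   # pure squared 2-norm penalty
hybrid_value = elastic_net_objective(x, y, b, lam=0.5, alpha=0.5)  # equal mix of both
```

With alpha strictly between 0 and 1 the value falls between the Lasso and Ridge values for the same b and lambda, since the penalty is a weighted average of the two norms.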


Policies