CSCI 3360 Data Science I

Course information

Welcome to CSCI 3360 Data Science! Data science is a rapidly growing field that combines traditional statistics, machine learning, data mining, and programming. It has been attracting a great deal of attention from both academia and industry.

Instructor: Jaewoo Lee
Email: jaewoo.lee@uga.edu
Office: BOYD 620
Office hours:

Tue. 3 - 4pm
Wed. 11:10 am - 12:10 pm

TA: Yang Shi (yang.atrue@uga.edu)

Office hours: 10 - 11 am on Mondays and Wednesdays

Course Description

This course is designed as an introductory study of the theory and practice of data science. Data science is about learning from data to extract insight and knowledge. This course introduces the computational and statistical tools used in data analysis to answer questions from data. Specifically, we will investigate tools and methods for

data collection, data munging, cleaning
data exploration, hypothesis testing
statistical modeling
making inference on data (regression, classification, and clustering)
data visualization, and communication/interpretation of results.

Prerequisite

Students are expected to have a working knowledge of Python 2.7 (or 3.5+). All programming assignments must be completed in Python unless otherwise specified. Some elementary knowledge of statistics, linear algebra, and probability theory is expected; these fundamentals will be covered as they are needed.

Textbooks

The main textbook for this course is

“An Introduction to Statistical Learning with Applications in R”
by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
Springer (PDF is available online!)

Here’s a heads-up: no single book can cover all the topics of data science because of its interdisciplinary nature. Hence, we will not closely follow the structure of our main textbook. However, students are expected to read the relevant chapters of the book as the course proceeds.

Supplementary books

Introduction to Machine Learning with Python by Andreas C. Müller & Sarah Guido
The Elements of Statistical Learning by Trevor Hastie et al. (PDF)
Pattern Recognition and Machine Learning by Christopher M. Bishop
Convex Optimization by Boyd and Vandenberghe (PDF)

Evaluation

Read this section carefully, as it defines how your grade is determined. The following rules will be strictly enforced.

Grading criteria

Evaluation will consist of five individual homework assignments, two exams (midterm and final), a team project, and pop quizzes. Each submitted item (for example, homework, report, or presentation) will be graded without a curve.

Item	Portion	Description
Homework	50%	5 individual assignments involving problem solving, discussion, and Python programming
Exams	25%	Midterm (10%) and final (15%). The final exam is comprehensive.
Team project	20%	A semester-long data science project: interim progress presentation (5%); final report and presentation (15%)
Quiz	5%	Quizzes are comprehensive

Grading scale

The grade is based on the total score, a weighted sum of all graded items. It is computed using the following equation:

$\begin{aligned} \text{total score} &= \frac{\sum_{i=1}^5 \text{HW}_i}{500}*50\% + \frac{\text{midterm} + \text{final}}{\text{200}}*25\% \\ &\qquad + \frac{\text{interim PR}}{100}*5\% + \frac{\text{final PR + report}}{200}*15\% \\ &\qquad + \frac{\sum_{i=1}^k \text{quiz}_i}{100k}*5\%\,. \end{aligned}$

Note that the above equation assumes the maximum score for each graded item is 100, and this may differ from the actual assigned maximum.
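As a rough illustration, the equation above can be written as a small Python function. This is only a sketch: the function and parameter names are hypothetical, every item is assumed to be scored out of 100 (as the note above says, actual maximums may differ), and treating a semester with no quizzes as a zero quiz contribution is an assumption, since the equation is undefined when k = 0.

```python
def total_score(homeworks, midterm, final, interim_pr, final_pr_report, quizzes):
    """Weighted course total in percent, following the syllabus equation.

    Assumes every graded item has a maximum score of 100.
    `final_pr_report` is the sum of the final presentation and the
    final report scores (so it is out of 200).
    """
    if len(homeworks) != 5:
        raise ValueError("expected exactly 5 homework scores")

    hw_part = sum(homeworks) / 500 * 50          # homework: 50%
    exam_part = (midterm + final) / 200 * 25     # exams pooled as in the equation: 25%
    interim_part = interim_pr / 100 * 5          # interim presentation: 5%
    final_part = final_pr_report / 200 * 15      # final presentation + report: 15%

    # Assumption: with no quizzes given, the quiz term contributes nothing.
    if quizzes:
        quiz_part = sum(quizzes) / (100 * len(quizzes)) * 5
    else:
        quiz_part = 0.0

    return hw_part + exam_part + interim_part + final_part + quiz_part


# Hypothetical scores, for illustration only.
example = total_score(
    homeworks=[90, 85, 100, 95, 80],
    midterm=80,
    final=90,
    interim_pr=100,
    final_pr_report=180,
    quizzes=[100, 50],
)
print(round(example, 2))
```

Note that the sketch pools the midterm and final exactly as the equation does, rather than weighting them separately.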

Absolute grading system
Percentage	95%	90%	85%	80%	75%	70%	65%	60%	<60%
Grade	A	A-	B+	B	B-	C+	C	C-	D

Note that the instructor may slightly adjust the percentages to account for the score distribution.
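The lookup from a total score to a letter grade might be sketched as follows. The names are hypothetical, and the sketch assumes that each percentage in the table is the minimum total score required for that grade; the cutoffs are passed as a parameter because the instructor may adjust them.

```python
# Cutoffs copied from the grading table, highest first.
# Assumption: each percentage is the minimum total score for that grade.
GRADE_CUTOFFS = [
    (95, "A"), (90, "A-"), (85, "B+"), (80, "B"),
    (75, "B-"), (70, "C+"), (65, "C"), (60, "C-"),
]


def letter_grade(total, cutoffs=GRADE_CUTOFFS):
    """Return the letter grade for a total score given in percent."""
    for minimum, grade in cutoffs:
        if total >= minimum:
            return grade
    return "D"  # anything below the lowest cutoff (<60%)


print(letter_grade(88.5))
```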

Late Submission

All assignments are expected to be completed and submitted to eLC by the due date. Normally, assignments are due by 11:59 pm on Fridays. Any assignment submitted after 12:01 am on the day following the due date will be considered late.
Late submissions will be penalized by deducting 10% of the assignment's total marks for each day past the due time.
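The late penalty can be illustrated with a short sketch. The syllabus does not say how partial days are counted, so this example assumes that every started 24-hour period after the due time counts as one full day, that submissions up to 12:01 am are still on time, and that a score cannot go below zero. All names and dates here are hypothetical.

```python
import math
from datetime import datetime, timedelta

# From an 11:59 pm due time to the 12:01 am cutoff on the following day.
GRACE = timedelta(minutes=2)


def late_penalty_fraction(due, submitted, per_day=0.10):
    """Fraction of the assignment's total marks deducted for lateness."""
    delay = submitted - due
    if delay <= GRACE:
        return 0.0
    # Assumption: any started 24-hour period counts as a full day late.
    days_late = math.ceil(delay / timedelta(days=1))
    return min(1.0, days_late * per_day)


def penalized_score(raw, max_marks, due, submitted):
    """Raw score minus the late deduction, never below zero."""
    deduction = late_penalty_fraction(due, submitted) * max_marks
    return max(0.0, raw - deduction)


# Hypothetical due date and submission time, for illustration only.
due = datetime(2017, 1, 27, 23, 59)
submitted = datetime(2017, 1, 29, 9, 0)  # about 33 hours late: counted as 2 days
print(penalized_score(90, 100, due, submitted))
```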

Academic Honesty

All students enrolled in this course are assumed to abide by UGA’s academic honesty policy and procedures. Please refer to UGA’s A CULTURE OF HONESTY. All documents linked at that URL are part of this syllabus.

For every individual assignment, students are welcome to discuss the problems and share ideas at a high level. This means that you should not share anything concrete, such as write-ups or code fragments. The submitted item must be your own work. For example, you can discuss how to solve a homework problem and share an idea, but you have to write your own answer. An egregious violation of these academic honesty codes will result in an F for the course.

Tentative Schedule

Fundamentals

Week	Topic	Note
Jan. 4	Course overview	HW0 OUT
Jan. 9	Warming up: review of probability, linear algebra, and Python
Jan. 16	Data collection: web scraping and pandas (pandas tutorial)
Jan. 23	Data visualization: matplotlib, seaborn (matplotlib tutorial)	HW1 OUT

Statistical models

Week	Topic	Note
Jan. 30	Regression (ISLR Ch. 3.1-3.3): linear regression; ridge regression and lasso (ISLR Ch. 6.2); overfitting & regularization (ISLR Ch. 5.1)
Feb. 6	Classification: logistic regression (ISLR Ch. 4.3); naive Bayes; SVM (ISLR Ch. 9)	HW2 OUT
Feb. 13	More on classification: k-NN; decision trees: CART, ID3 (ISLR Ch. 8.1); random forest (ISLR Ch. 8.2)
Feb. 20	Feature engineering: feature selection (ISLR Ch. 6.5); regularization; kernels	HW3 OUT
Feb. 27	Dimensionality reduction (ISLR Ch. 10.1-10.2): SVD; PCA	Midterm
Mar. 6	Digression (BV Ch. 9.1-9.3): convex optimization; stochastic gradient descent	HW4 OUT
Mar. 20	Unsupervised learning: data clustering (PRML Ch. 9); k-means; Gaussian mixture model (GMM)
Mar. 27	Ensemble methods: boosting and bagging (ISLR Ch. 8.2); gradient boosting

Case studies

Week	Topic	Note
Apr. 3	Recommender systems: collaborative filtering; latent factor model	HW5 OUT
Apr. 10	Sentiment analysis: bag-of-words model; TF-IDF; working with the NLTK library
Apr. 17	Introduction to deep learning: multi-layer perceptrons; convolutional neural networks; generative models
Apr. 24	Team project presentation	Final exam