CSCI 3360 Data Science
1 Course info.
Welcome to CSCI 3360 Data Science! Data science is a rapidly growing field that combines traditional statistics, machine learning, data mining, and programming. It has been attracting a great deal of attention from both academia and industry. Data scientist has also been named the most promising job in the United States^{1}.
 Instructor : Jaewoo Lee
 Email : jaewoo.lee@uga.edu
 Office : BOYD 620
 Office hours : Wed. 12:10 pm - 1:10 pm, Thurs. 3 pm - 4 pm
 TA : Yang Shi (Mon. 9:00 am - 10 am, Tue. 2:15 pm - 3:15 pm), BOYD 307
2 Course description
This course is designed as an introductory study of the theory and practice of data science. Data science is about learning from data to extract insight and knowledge. This course introduces the computational and statistical tools used in data analysis to answer questions from data. Specifically, we will investigate tools and methods for
 data collection, data munging, cleaning
 data exploration, hypothesis testing
 statistical modeling
 making inference on data (regression, classification, and clustering)
 data visualization, and communication/interpretation of results.
2.1 Prerequisite
Students are expected to have a working knowledge of Python 2.7. All programming assignments must be completed using Python unless specified otherwise. Some elementary knowledge of statistics, linear algebra, and probability theory is expected but not required. Those fundamentals will be provided as they are needed.
3 Recommended textbooks
Here are some books that I recommend:
 Introduction to Machine Learning with Python by Andreas C. Müller & Sarah Guido
 The Elements of Statistical Learning by Trevor Hastie et al. (PDF)
 Pattern Recognition and Machine Learning by Christopher M. Bishop
 Convex Optimization by Boyd and Vandenberghe (PDF)
Although the first book may be listed as our main textbook, and all of these are great books, we will not follow the structure of any of them; our main focus will be on practical tools and techniques used in data science.
4 Evaluation criteria
4.1 Grading proportion
Item  Portion  Description
Homework  40%  4 individual assignments involving problem solving, discussion, and programming
Exams  35%  midterm (15%) and final (20%)
Team project  25%  implementation of a data analysis program: interim progress presentation (10%), final report and presentation (15%)
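As a rough illustration of how the portions above combine, here is a minimal Python sketch. The component names, the `weighted_score` helper, and the example scores are hypothetical. It assumes each component is already on a 0 to 100 scale (for homework, an average over the four assignments) and it ignores any curving.

```python
# Weights copied from the grading proportions; the key names are illustrative only.
WEIGHTS = {
    "homework": 0.40,
    "midterm": 0.15,
    "final": 0.20,
    "project_interim": 0.10,
    "project_final": 0.15,
}


def weighted_score(scores):
    """Combine component scores (each 0-100) into one course percentage."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError("missing components: %s" % sorted(missing))
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up scores, for illustration only.
example = {
    "homework": 90,
    "midterm": 70,
    "final": 80,
    "project_interim": 85,
    "project_final": 95,
}
print(round(weighted_score(example), 2))
```

The exam weights (15% + 20%) and the project weights (10% + 15%) add up to the 35% and 25% listed for those items, so all five weights sum to 100%.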
4.2 Grading scale^{2}
Each submitted item (for example, a homework assignment, report, or presentation) will be graded on a 100-point scale, and the numeric score may then be curved to obtain a more reasonable grade distribution. In other words, your rank matters more than the raw score on a graded item.
Grade  Percentage
A  [82, 100]
A-  [80, 82)
B+  [75, 80)
B  [60, 75)
B-  [50, 60)
C+  [40, 50)
C  [25, 40)
C-  [15, 25)
D  [0, 15)
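The half-open intervals in this scale can be read as a simple lookup, sketched below in Python. This is only an illustration, not the official grading procedure. The `letter_grade` helper is hypothetical, the rows are read in order as A, A-, B+, B, B-, C+, C, C-, D, a percentage of exactly 100 is treated as an A, and the input is assumed to be the final percentage after any curving.

```python
from bisect import bisect_right

# Lower bound of each interval in ascending order, with the matching letter.
LOWER_BOUNDS = [0, 15, 25, 40, 50, 60, 75, 80, 82]
LETTERS = ["D", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]


def letter_grade(percentage):
    """Map a final percentage in [0, 100] to a letter grade."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    # bisect_right counts the lower bounds that are <= percentage,
    # so the interval containing the score sits at that count minus one.
    return LETTERS[bisect_right(LOWER_BOUNDS, percentage) - 1]


for score in (14.9, 15, 59.5, 60, 81.9, 82):
    print(score, letter_grade(score))
```

Because each interval includes its lower bound and excludes its upper bound, a score of exactly 82 maps to an A while 81.9 maps to an A-.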
5 Academic Honesty
All students enrolled in this course are expected to abide by UGA's academic honesty policy and procedures. Please refer to UGA's "A Culture of Honesty".
For every individual assignment, students are welcome to discuss the problems and share ideas (at a high level), but the submitted item must be your own work. For example, you can discuss how to solve a homework problem and share an idea, but you have to write your own answer.
 Type your homework (\(\LaTeX\) is recommended).
 Do not write (and submit) something you don't understand or can't explain.
 Do not provide or make available your answer to others (whether or not they are enrolled in the course).
 If you can't meet the submission deadline due to illness, first inform the instructor by email and attach a doctor's note.
6 Tentative Schedule
Week  Topic  Note
Jan. 5  Course overview
I. Foundations
Jan. 10  Probability theory review: random variable, conditional probability, Bayes' theorem; Python: working with data - pandas, numpy
Jan. 17  Sampling and distributions; Visualizing data: matplotlib  HW1 OUT
Jan. 24  Introduction to statistical inference, MLE
II. Statistics
Jan. 31  Hypothesis testing, asymptotics, p-values; Python: chi-square test for independence; Type I, Type II errors
Feb. 7  Resampling technique (bootstrap), confidence interval  HW2 OUT
III. Optimization
Feb. 14  Fundamentals of convex optimization: convex set, convex function
Feb. 21  Multivariate calculus  HW3 OUT
Feb. 28  Gradient descent, stochastic gradient descent  Midterm exam
IV. Machine Learning
Mar. 6  Spring break week (no class)
Mar. 14  Statistical learning theory: notation and setup; Linear regression: least squares, overfitting, generalization
Mar. 21  Regression in high-dimensional space (ridge regression), regularization  HW4 OUT
Mar. 28  Linear classification: logistic regression
Apr. 4  Naive Bayes; Lazy learning: k-nearest neighbor algorithm
Apr. 11  Data clustering, curse of dimensionality, K-means algorithm; GMM and EM algorithm
Apr. 18  Linear algebra review: SVD; Dimensionality reduction (PCA)
Apr. 25  Team project presentation  Final exam (TBD)
Footnotes:
^{1} http://www.forbes.com/sites/kathryndill/2016/01/20/themostpromisingjobsof2016/#63032d5442f2
^{2} Subject to change.