# CSCI 3360 Data Science

## 1 Course info.

Welcome to CSCI 3360 Data Science! Data science is a rapidly growing field that combines traditional statistics, machine learning, data mining, and programming. It has been attracting a great deal of attention from both academia and industry, and data scientist has been named the most promising job in the United States [1].

• Instructor : Jaewoo Lee
• Email : jaewoo.lee@uga.edu
• Office : BOYD 620
• Office hours : Wed. 12:10 pm - 1:10 pm, Thurs. 3 pm - 4 pm
• TA : Yang Shi (Mon. 9:00 am - 10:00 am, Tue. 2:15 pm - 3:15 pm), BOYD 307

## 2 Course description

This course is designed as an introductory study of the theory and practice of data science. Data science is about learning from data to extract insight and knowledge. This course introduces the computational and statistical tools used in data analysis to answer questions from data. Specifically, we will investigate tools and methods for

• data collection, data munging, cleaning
• data exploration, hypothesis testing
• statistical modeling
• making inference on data (regression, classification, and clustering)
• data visualization, and communication/interpretation of results.

### 2.1 Prerequisite

Students are expected to have a working knowledge of Python 2.7. All programming assignments must be completed in Python unless specified otherwise. Some elementary knowledge of statistics, linear algebra, and probability theory is expected but not required; these fundamentals will be introduced as they are needed.

### 2.2 Learning objectives

• Using Python, collect data from the web and process the raw data into a form usable by data analysis algorithms.
• Summarize and visualize the data using statistical tools to quickly explore different aspects of complex data.
• Design a statistical experiment to test a hypothesis on data.
• Choose the most suitable statistical model for the given analysis task.
• Apply statistical and computational methods (e.g., machine learning) to make predictions based on data.
• Implement (or modify) an analysis algorithm using Python packages.
• Communicate with non-data science experts about analysis results, using effective statistics and visualizations.

## 3 Recommended textbooks

Here are some books that I recommend:

• Introduction to Machine Learning with Python by Andreas C. Müller & Sarah Guido
• The Elements of Statistical Learning by Trevor Hastie et al. (PDF available online)
• Pattern Recognition and Machine Learning by Christopher M. Bishop
• Convex Optimization by Boyd and Vandenberghe (PDF available online)

These are all great books, and the first one may be listed as our main textbook. However, we will not follow the structure of any book listed above, because our main focus is on practical tools and techniques used in data science.

## 4 Evaluation criteria

| Component | Portion | Description |
|---|---|---|
| Homework | 40% | 4 individual assignments involving problem solving, discussion, and programming |
| Exams | 35% | midterm (15%) and final (20%) |
| Team project | 25% | implementation of a data analysis program: interim progress presentation (10%), final report and presentation (15%) |

Each submitted item (for example, homework, report, or presentation) will be graded on a 100-point scale, and the numeric score may then be curved to get a more reasonable grade distribution. In other words, your rank is a more important metric than your raw score on a graded item.

| Grade | Score range |
|---|---|
| A | [82, 100] |
| A- | [80, 82) |
| B+ | [75, 80) |
| B | [60, 75) |
| B- | [50, 60) |
| C+ | [40, 50) |
| C | [25, 40) |
| C- | [15, 25) |
| D | [0, 15) |
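
To make the arithmetic concrete, here is a minimal Python sketch of how the component weights and the letter-grade ranges above could combine. It rests on several assumptions that the syllabus does not state: that the letter scale is applied to the weighted total after any curving, that the homework score is a single aggregate over the four assignments, and that a total of exactly 100 counts as an A. The dictionary keys and function names are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Percentage weights taken from the evaluation criteria (they sum to 100).
# The key names are hypothetical labels for this sketch.
WEIGHTS = {
    "homework": 40,
    "midterm": 15,
    "final_exam": 20,
    "project_interim": 10,
    "project_final": 15,
}

# Lower bound of each letter-grade range, in ascending order,
# paired with the letter awarded at or above that bound.
CUTOFFS = [0, 15, 25, 40, 50, 60, 75, 80, 82]
LETTERS = ["D", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]


def weighted_total(scores):
    """Combine per-component scores (each on a 0-100 scale) into one total."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(total):
    """Look up the letter whose range contains `total`.

    Assumption: a total of exactly 100 is treated as an A.
    """
    if not 0 <= total <= 100:
        raise ValueError("total must be between 0 and 100")
    # bisect_right counts the cutoffs that are <= total, so ranges are
    # closed on the left and open on the right, as in the grade table.
    return LETTERS[bisect_right(CUTOFFS, total) - 1]


# Made-up scores for illustration only.
example = {
    "homework": 90,
    "midterm": 70,
    "final_exam": 80,
    "project_interim": 85,
    "project_final": 75,
}
total = weighted_total(example)
print(total, letter_grade(total))
```

Because scores may be curved, the actual grade can differ from what this lookup produces; treat it as a rough guide rather than the official calculation.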

All students enrolled in this course are expected to abide by UGA's academic honesty policy and procedures. Please refer to UGA's A CULTURE OF HONESTY.

For every individual assignment, students are welcome to discuss the problems and share ideas (at a high level), but the submitted work must be your own. For example, you can discuss how to solve a homework problem and share an idea, but you have to write your own answer.

• Type your homework ($$\LaTeX$$ is recommended).
• Do not write (and submit) something you don't understand or can't explain.
• Do not provide or make available your answer to others (whether or not they are enrolled in the course).
• If you can't meet the submission deadline due to illness, first inform the instructor by email and attach a doctor's note.

## 6 Tentative Schedule

| Week | Topic | Note |
|---|---|---|
| Jan. 5 | Course overview | |
| | **I. Foundations** | |
| Jan. 10 | Probability theory review: random variables, conditional probability, Bayes' theorem; Python: working with data (pandas, NumPy) | |
| Jan. 17 | Sampling and distributions; Visualizing data: matplotlib | HW1 OUT |
| Jan. 24 | Introduction to statistical inference, MLE | |
| | **II. Statistics** | |
| Jan. 31 | Hypothesis testing, asymptotics, p-values; Python: chi-square test for independence; Type I, Type II errors | |
| Feb. 7 | Resampling technique (bootstrap), confidence interval | HW2 OUT |
| | **III. Optimization** | |
| Feb. 14 | Fundamentals of convex optimization: convex set, convex function | |
| Feb. 21 | Multivariate calculus | HW3 OUT |
| | **IV. Machine Learning** | |
| Mar. 6 | Spring break week (no class) | |
| Mar. 14 | Statistical learning theory: notation and setup; Linear regression: least squares, overfitting, generalization | |
| Mar. 21 | Regression in high-dimensional space (ridge regression), regularization | HW4 OUT |
| Mar. 28 | Linear classification: logistic regression | |
| Apr. 4 | Naive Bayes; Lazy learning: k-nearest neighbor algorithm | |
| Apr. 11 | Data clustering, curse of dimensionality, K-means algorithm; GMM and EM algorithm | |
| Apr. 18 | Linear algebra review: SVD; Dimensionality reduction (PCA) | |
| Apr. 25 | Team project presentation | Final exam (TBD) |