Welcome to CSCI 3360 Data Science! Data science is a rapidly growing field that combines traditional statistics, machine learning, data mining, and programming. It has been attracting a great deal of attention from both academia and industry.
This course is designed as an introductory study of the theory and practice of data science. Data science is about learning from data to extract insight and knowledge. This course introduces the computational and statistical tools used in data analysis to answer questions from data. Specifically, we will investigate tools and methods for
Students are expected to have a working knowledge of Python 2.7 (or 3.5+). All programming assignments must be completed in Python unless otherwise specified. Some elementary knowledge of statistics, linear algebra, and probability theory is expected. Those fundamentals will be reviewed as they are needed.
The main textbook for this course is
“An Introduction to Statistical Learning with Applications in R”
by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
Springer (PDF is available online!)
Here’s a heads-up for you: no single book can cover all the topics of data science, due to its interdisciplinary nature. Hence, we will not closely follow the structure of our main textbook. However, students are expected to read the relevant chapters of the book as the course proceeds.
Supplementary books
Read this section carefully, as it defines how your grade is determined. The following rules will be strictly enforced.
Evaluation consists of five individual homework assignments, two exams (midterm and final), a team project, and pop quizzes. Each submitted item (for example, homework, report, or presentation) will be graded without a curve.
Item  Portion  Description 

Homework  50%  5 individual assignments involving problem solving, discussion, and Python programming
Exams  25%  Midterm (10%) and final (15%). The final exam is comprehensive.
Team project  20%  A semester-long data science project: interim progress presentation (5%), final report and presentation (15%)
Quiz  5%  Quizzes are comprehensive 
The grade is based on the total score, a weighted sum of all graded items. It is computed using the following equation:
\[ \begin{aligned} \text{total score} &= \frac{\sum_{i=1}^5 \text{HW}_i}{500} \times 50\% + \frac{\text{midterm} + \text{final}}{200} \times 25\% \\ &\qquad + \frac{\text{interim PR}}{100} \times 5\% + \frac{\text{final PR} + \text{report}}{200} \times 15\% \\ &\qquad + \frac{\sum_{i=1}^k \text{quiz}_i}{100k} \times 5\%\,. \end{aligned} \]
Note that the above equation assumes the maximum score for each graded item is 100, and this may differ from the actual assigned maximum.
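As a rough illustration of the equation above, the following Python sketch computes the total score as a percentage. The function name `total_score` and its argument names are hypothetical. It assumes every item is graded out of 100, that the final presentation and the final report are scored separately (as the 200-point denominator suggests), and that the quiz term contributes nothing if no quiz has been given.

```python
def total_score(hw, midterm, final, interim_pr, final_pr, report, quizzes):
    """Weighted total in percent (0-100), following the syllabus equation.

    Assumption: every item is scored out of 100.
    hw      -- list of the 5 homework scores
    quizzes -- list of the k quiz scores (k may be 0)
    """
    if len(hw) != 5:
        raise ValueError("the equation assumes exactly 5 homework scores")

    hw_part = sum(hw) / 500 * 50
    exam_part = (midterm + final) / 200 * 25
    interim_part = interim_pr / 100 * 5
    final_project_part = (final_pr + report) / 200 * 15
    # Assumption: with no quizzes given, the quiz term is treated as 0.
    quiz_part = sum(quizzes) / (100 * len(quizzes)) * 5 if quizzes else 0.0

    return hw_part + exam_part + interim_part + final_project_part + quiz_part


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(total_score([90, 85, 100, 80, 95], 88, 92, 90, 85, 95, [100, 80]))
```

If an item has a different maximum, rescale its score to a 100-point basis before calling the function.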
Percentage  95%  90%  85%  80%  75%  70%  65%  60%  <60% 

Grade  A  A-  B+  B  B-  C+  C  C-  D
Note that the instructor may slightly adjust the percentage cutoffs to account for the score distribution.
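The grade table can be read as a simple lookup. The sketch below is only an illustration: it assumes each percentage is the lower bound for the grade beneath it and that the grade ladder runs A, A-, B+, B, B-, C+, C, C-, D. The names `CUTOFFS` and `letter_grade` are hypothetical, and the actual cutoffs may be adjusted by the instructor.

```python
# Assumed lower bounds (in percent) for each letter grade, highest first.
CUTOFFS = [
    (95, "A"), (90, "A-"), (85, "B+"), (80, "B"),
    (75, "B-"), (70, "C+"), (65, "C"), (60, "C-"),
]


def letter_grade(total_percent):
    """Map a total score in percent to a letter grade under the assumed cutoffs."""
    for cutoff, grade in CUTOFFS:
        if total_percent >= cutoff:
            return grade
    return "D"  # anything below 60%


if __name__ == "__main__":
    for score in (96.5, 89.9, 72.0, 41.0):
        print(score, letter_grade(score))
```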
All assignments are expected to be completed and submitted to eLC by the due date. Normally, assignments are due by 11:59 pm on Fridays. Any assignment submitted after 12:01 am on the day following the due date will be considered late.
Late submissions are penalized by deducting 10% of the assignment's total marks for each day past the due time.
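To make the late policy concrete, here is a minimal Python sketch. It rests on assumptions the policy does not spell out: a partial day counts as a full day, days are counted from the 11:59 pm due time, and the score never drops below zero. The names `late_penalty_fraction` and `penalized_score` are hypothetical, and the dates in the example are made up.

```python
import math
from datetime import datetime, timedelta

GRACE = timedelta(minutes=2)   # due 11:59 pm; late only after 12:01 am
PENALTY_PER_DAY = 0.10         # 10% of the assignment's total marks per day


def late_penalty_fraction(due, submitted):
    """Fraction of the total marks deducted for a submission."""
    if submitted <= due + GRACE:
        return 0.0
    # Assumption: any started day past the due time counts as a full day.
    days_late = math.ceil((submitted - due) / timedelta(days=1))
    return min(1.0, PENALTY_PER_DAY * days_late)


def penalized_score(raw_score, max_score, due, submitted):
    """Score after the late deduction, never below zero (assumption)."""
    deduction = late_penalty_fraction(due, submitted) * max_score
    return max(0.0, raw_score - deduction)


if __name__ == "__main__":
    due = datetime(2017, 1, 13, 23, 59)        # a hypothetical Friday deadline
    submitted = datetime(2017, 1, 15, 12, 0)   # about a day and a half late
    print(penalized_score(90, 100, due, submitted))
```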
All students enrolled in this course are expected to abide by UGA’s academic honesty policy and procedures. Please refer to UGA’s A CULTURE OF HONESTY. All documents linked at that URL are part of this syllabus.
For every individual assignment, students are welcome to discuss the problems and share ideas at a high level. This means that you should not share anything concrete, such as write-ups or code fragments. The submitted item must be your own work. For example, you can discuss how to solve a homework problem and share an idea, but you have to write your own answer. An egregious violation of these academic honesty rules will result in an F for the course.
Week  Topic  Note

Jan. 4  Course overview  HW0 OUT
Jan. 9  Warming up: review
Jan. 16  Data collection: web scraping and pandas (pandas tutorial)
Jan. 23  Data visualization: matplotlib, seaborn (matplotlib tutorial)  HW1 OUT
Jan. 30  Regression (ISLR Ch. 3.1-3.3)
Feb. 6  Classification  HW2 OUT
Feb. 13  More on classification
Feb. 20  Feature engineering  HW3 OUT
Feb. 27  Dimensionality reduction (ISLR Ch. 10.1-10.2)  Midterm
Mar. 6  Digression (BV Ch. 9.1-9.3)  HW4 OUT
Mar. 20  Unsupervised learning: data clustering (PRML Ch. 9)
Mar. 27  Ensemble methods
Apr. 3  Recommender systems  HW5 OUT
Apr. 10  Sentiment analysis
Apr. 17  Introduction to deep learning
Apr. 24  Team project presentation  Final exam