CSCI 3360 Data Science I

Course information

Welcome to CSCI 3360 Data Science! Data science is a rapidly growing field that combines traditional statistics, machine learning, data mining, and programming. It has been attracting a great deal of attention from both academia and industry.

Course Description

This course is designed as an introductory study of the theory and practice of data science. Data science is about learning from data to extract insight and knowledge. This course introduces the computational and statistical tools used in data analysis to answer questions from data. To be specific, we will investigate tools and methods for

Prerequisite

Students are expected to have a working knowledge of Python 2.7 (or 3.5+). All programming assignments must be completed in Python unless specified otherwise. Some elementary knowledge of statistics, linear algebra, and probability theory is expected. These fundamentals will be covered as they are needed.

Textbooks

The main textbook for this course is

“An Introduction to Statistical Learning with Applications in R”
by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
Springer (PDF is available online!)

Here’s a heads-up: no single book can cover all the topics of data science because of its interdisciplinary nature. Hence, we will not closely follow the structure of our main textbook. However, students are expected to read the relevant chapters of the book as the course proceeds.

Supplementary books

Evaluation

Read this section carefully, as it defines how your grade is determined. The following rules will be strictly enforced.

Grading criteria

Evaluation will consist of five individual homework assignments, two exams (midterm and final), a team project, and pop quizzes. Each submitted item (for example, homework, report, or presentation) will be graded without a curve.

Item | Portion | Description
Homework | 50% | 5 individual assignments involving problem solving, discussion, and Python programming
Exams | 25% | Midterm (10%) and final (15%). The final exam is comprehensive.
Team project | 20% | A semester-long data science project: interim progress presentation (5%), final report and presentation (15%)
Quiz | 5% | Quizzes are comprehensive

Grading scale

Your grade is based on the total score, a weighted sum of all graded items. It is computed using the following equation:

\[ \begin{aligned} \text{total score} &= \frac{\sum_{i=1}^5 \text{HW}_i}{500} \times 50\% + \frac{\text{midterm} + \text{final}}{200} \times 25\% \\ &\qquad + \frac{\text{interim PR}}{100} \times 5\% + \frac{\text{final PR} + \text{report}}{200} \times 15\% \\ &\qquad + \frac{\sum_{i=1}^k \text{quiz}_i}{100k} \times 5\%\,. \end{aligned} \]

Note that the above equation assumes the maximum score for each graded item is 100, and this may differ from the actual assigned maximum.
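If you want to track your standing during the semester, the equation can be sketched in Python 3 as follows. This is only an illustration, not an official grade calculator: the function name `total_score` and its argument names are made up for this example, every item is assumed to be scored out of 100 (as the equation assumes), and treating "no quizzes yet" as a zero quiz contribution is an assumption of the sketch.

```python
def total_score(homeworks, midterm, final, interim_pr, final_pr, report, quizzes):
    """Return the weighted total in percent, following the equation term by term.

    Every argument is assumed to be a score out of 100 (hypothetical inputs).
    """
    if len(homeworks) != 5:
        raise ValueError("the equation sums exactly 5 homework scores")

    hw_part = sum(homeworks) / 500 * 50            # homework: 50%
    exam_part = (midterm + final) / 200 * 25       # exams: 25%
    interim_part = interim_pr / 100 * 5            # interim presentation: 5%
    final_part = (final_pr + report) / 200 * 15    # final presentation + report: 15%

    # Assumption: with k = 0 quizzes the quiz term is counted as 0.
    k = len(quizzes)
    quiz_part = sum(quizzes) / (100 * k) * 5 if k > 0 else 0.0

    return hw_part + exam_part + interim_part + final_part + quiz_part


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    print(total_score([90, 85, 100, 95, 80], 88, 92, 90, 85, 95, [100, 70]))
```

If an item has a different maximum score, rescale it to 100 before passing it in.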

Absolute grading system
Percentage 95% 90% 85% 80% 75% 70% 65% 60% <60%
Grade A A- B+ B B- C+ C C- D

Note that the instructor may slightly adjust the percentages to account for the score distribution.
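As a rough illustration, the grading scale can be read as a lookup from a total score to a letter. The sketch below assumes each listed percentage is an inclusive lower bound (so exactly 95% earns an A) and ignores any adjustment the instructor may make; the names `GRADE_CUTOFFS` and `letter_grade` are hypothetical.

```python
# (minimum percentage, letter) pairs copied from the grading scale table.
# Assumption: each percentage is an inclusive lower bound.
GRADE_CUTOFFS = [
    (95, "A"), (90, "A-"), (85, "B+"), (80, "B"),
    (75, "B-"), (70, "C+"), (65, "C"), (60, "C-"),
]


def letter_grade(total):
    """Map a total score in percent to a letter grade, before any adjustment."""
    for cutoff, letter in GRADE_CUTOFFS:
        if total >= cutoff:
            return letter
    return "D"  # anything below 60%


if __name__ == "__main__":
    for score in (96.5, 90, 72.3, 59.9):
        print(score, letter_grade(score))
```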

Late Submission

All assignments are expected to be completed and submitted to eLC by the due date. Normally, assignments are due by 11:59 pm on Fridays. Any assignment submitted after 12:01 am on the day following the due date will be considered late.
Late submissions will be penalized by deducting 10% of the total marks for the assignment for each day beyond the due time.
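To make the late rule concrete, here is a minimal Python 3 sketch. It rests on assumptions that the policy itself does not spell out: every started 24-hour period after the due time counts as a full day, the deduction is capped at 100% of the marks, and a score never drops below zero. The function names and the dates used are purely illustrative.

```python
import math
from datetime import datetime, timedelta


def late_penalty_fraction(due, submitted):
    """Fraction of the total marks deducted for a submission (0.0 to 1.0)."""
    # Not late until after 12:01 am on the day following the due date.
    grace_end = (due + timedelta(days=1)).replace(
        hour=0, minute=1, second=0, microsecond=0
    )
    if submitted <= grace_end:
        return 0.0
    # Assumption: each started day beyond the due time counts as a full day.
    days_late = math.ceil((submitted - due) / timedelta(days=1))
    # Assumption: the deduction is capped at 100% of the marks.
    return min(1.0, days_late / 10)


def penalized_score(raw_score, max_marks, due, submitted):
    """Deduct 10% of max_marks per late day; never go below zero (assumption)."""
    deduction = max_marks * late_penalty_fraction(due, submitted)
    return max(0.0, raw_score - deduction)


if __name__ == "__main__":
    due = datetime(2024, 1, 12, 23, 59)        # a hypothetical Friday deadline
    submitted = datetime(2024, 1, 13, 10, 0)   # Saturday morning
    print(penalized_score(90, 100, due, submitted))
```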

Academic Honesty

All students enrolled in this course are assumed to abide by UGA’s academic honesty policy and procedures. Please refer to UGA’s A CULTURE OF HONESTY. All documents linked from that page are part of this syllabus.

For every individual assignment, students are welcome to discuss the problems and share ideas at a high level. This means that you should not share anything concrete, such as write-ups or code fragments. The submitted item must be your own work. For example, you can discuss how to solve a homework problem and share an idea, but you have to write your own answer. An egregious violation of these academic honesty codes will result in an F for the course.

Tentative Schedule

Fundamentals

Week Topic Note
Jan. 4 Course overview
HW0 OUT
Jan. 9 Warming up: review
  • probability, linear algebra
  • Python
Jan. 16 Data collection: web scraping and pandas
pandas tutorial
Jan. 23 Data visualization: matplotlib, seaborn
matplotlib tutorial
HW1 OUT

Statistical models

Week Topic Note
Jan. 30 Regression (ISLR Ch. 3.1-3.3)
  • linear regression
  • ridge regression and lasso (ISLR Ch. 6.2)
  • overfitting & regularization (ISLR Ch. 5.1)
Feb. 6 Classification
  • logistic regression (ISLR Ch. 4.3)
  • Naive Bayes
  • SVM (ISLR Ch. 9)
HW2 OUT
Feb. 13 More on classification
  • k-NN
  • decision tree: CART, ID3 (ISLR Ch. 8.1)
  • random forest (ISLR Ch. 8.2)
Feb. 20 Feature engineering
  • feature selection (ISLR Ch. 6.5)
  • regularization
  • kernels
HW3 OUT
Feb. 27 Dimensionality reduction (ISLR Ch. 10.1-10.2)
  • SVD
  • PCA
Midterm
Mar. 6 Digression (BV Ch. 9.1-9.3)
  • convex optimization
  • stochastic gradient descent
HW4 OUT
Mar. 20 Unsupervised learning: data clustering (PRML Ch. 9)
  • k-means
  • Gaussian Mixture Model (GMM)
Mar. 27 Ensemble methods
  • Boosting and bagging (ISLR Ch. 8.2)
  • Gradient boosting

Case studies

Week Topic Note
Apr. 3 Recommender System
  • collaborative filtering
  • latent factor model
HW5 OUT
Apr. 10 Sentiment Analysis
  • Bag-of-words model
  • TF-IDF
  • working with the NLTK library
Apr. 17 Introduction to deep learning
  • Multi-Layer Perceptrons
  • Convolutional neural network
  • Generative models
Apr. 24 Team project presentation
Final exam