Advanced Topics in Data Intensive Computing

Fall 2018

General Course Information:

Instructor: Lakshmish Ramaswamy (laks[AT]cs[dot]uga[dot]edu, 706-542-2737)

Time and Venue: Wednesdays - 11:15 AM to 12:05 PM (Hardman Hall 102); Tuesdays & Thursdays - 11:00 AM to 12:15 PM (Hardman Hall 102)

Office Hours: Tuesdays & Thursdays - 12:15 PM to 1:15 PM (tentative).

Course Description:

Modern computing applications require the storage, management and processing of petabytes of data. The data is not only extremely diverse, ranging from unstructured text and relational tables to complex graphs, but also dynamic. This course focuses on developing scalable architectures, algorithms and techniques for supporting a variety of data-intensive applications. Students will develop a deep understanding of the issues involved in storing and querying large volumes of structured, unstructured and dynamic data. Students will also gain hands-on experience in developing data-intensive systems and applications by working with frameworks such as Hadoop, Spark and TensorFlow.

Grading Policy (Tentative)

Class participation (including paper presentations and leading discussions) - 30%
Projects - 15%
Final Project Proposal - 5%
Final Project - 30%
Project Presentation - 10%
Final Project Report - 10%

Course Materials (Tentative -- Will be modified during the course of the semester)

A. Introduction to Data Intensive Computing

C. Lynch, "Big Data: How do Your Data Grow?", Nature 455, 28-29, September 2008.
A. Jacobs, "The Pathologies of Big Data", Communications of the ACM Vol. 52, No. 8, August 2009.
F. Frankel and R. Reid, "Big Data: Distilling Meaning from Data", Nature 455, 30, September 2008.
R. E. Bryant, R. H. Katz and E. D. Lazowska, "Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science and Society".
T. Hey, S. Tansley and K. Tolle (Eds.), "The Fourth Paradigm: Data-Intensive Scientific Discovery".
C. Eaton, D. deRoos, T. Deutsch, G. Lapis and P. Zikopoulos, "Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data".
M. Chen, S. Mao and Y. Liu, "Big Data: A Survey", Mobile Networks and Applications, Vol. 19, Issue 2, April 2014.

B. Background Materials

Tanenbaum and van Steen, "Distributed Systems: Principles and Paradigms", Second Edition, Prentice Hall Inc.
Kifer, Bernstein and Lewis, "Database Systems: An Application-Oriented Approach", Second Edition, Addison Wesley Inc.
Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge University Press.
S. Chaudhuri and U. Dayal, "An Overview of Data Warehousing and OLAP Technology", ACM SIGMOD Record, 1997.
I. Stoica, R. Morris, D. Liben-Nowell, D. Karger, M. Kaashoek, F. Dabek and H. Balakrishnan, "Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications", IEEE/ACM Transactions on Networking, 11(1), 2003.
L. Lamport, "Paxos Made Simple", 2001.
D. Mazieres "Paxos Made Practical".
S. Ghemawat et al. "The Google File System", SOSP 2003.

C. Map-Reduce ++

J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004 -- (Lakshmish)
D. DeWitt and M. Stonebraker, "MapReduce: A Major Step Backwards", The Database Column January 2008 (Lakshmish)
M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz and I. Stoica, "Improving MapReduce Performance in Heterogeneous Environments", OSDI 2008
H.-C. Yang, A. Dasdan, R.-L. Hsiao and D. Stott Parker, "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters", SIGMOD 2007
R. Chaiken et al., "SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets", VLDB 2008.
A. Abouzeid et al., "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads", VLDB 2009.
A. Pavlo et al., "A Comparison of Approaches to Large-Scale Data Analysis", SIGMOD 2009
V. K. Vavilapalli et al., "Apache Hadoop YARN: Yet Another Resource Negotiator", ACM SoCC 2013.
D. Peng and F. Dabek, "Large-Scale Incremental Processing Using Distributed Transactions and Notifications", OSDI 2010.
B. Li et al. "A Platform for Scalable One-Pass Analytics using MapReduce", SIGMOD 2011.
M. Zaharia et al. "Spark: Cluster Computing with Working Sets", HotCloud 2010.
R. S. Xin et al. "Shark: SQL and Rich Analytics at Scale", SIGMOD 2013.
B. Saha et al., "Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications", SIGMOD 2015.
M. Stonebraker, "Hadoop at a Crossroads?", Communications of the ACM Blog, August 2014 (Lakshmish)

D. Storage, Indexing and Retrieval:

F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data", OSDI 2006.
G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-Value Store", SOSP 2007.
A. Silberstein et al. "Efficient Bulk Insertion into a Distributed Ordered Table", SIGMOD 2008.
D. Ford et al. "Availability in Globally Distributed Storage Systems", OSDI 2010.
J. Stribling et al. "Flexible, Wide-Area Storage for Distributed Systems with WheelFS".
J. Rao et al. "Using Paxos to Build a Scalable, Consistent and Highly Available Data Store", VLDB 2011.
B. Atikoglu et al. "Workload Analysis of a Large-Scale Key-Value Store", SIGMETRICS 2012.
H. Lim et al., "SILT: A Memory-Efficient High-Performance Key-Value Store", SOSP 2011.
R. Ramakrishnan et al. "Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics", SIGMOD 2017.

E. Big Graphs, RDF Data etc.:

G. Malewicz et al., "Pregel: A System for Large-Scale Graph Processing", SIGMOD 2010.
B. Shao, H. Wang and Y. Li "The Trinity Graph Engine"
Z. Khayyat, et al. "Mizan: Optimizing Graph Mining Algorithms in Large Parallel Systems", Tech. report, KAUST, 2012
Y. Tian et al. "From 'Think Like a Vertex' to 'Think Like a Graph'", PVLDB Vol. 7, No. 3, 2014.
L. Wang et al., "How to Partition a Billion-Node Graph", ICDE 2014
A. Prat-Perez et al. "High Quality, Scalable and Parallel Community Detection for Large Real Graphs", WWW 2014.
J. E. Gonzalez et al. "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs", OSDI 2012.

F. Distributed Machine Learning:

Y. Low et al. "Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud", VLDB 2012.
T. Chilimbi et al. "Project Adam: Building an Efficient and Scalable Deep Learning Training System", OSDI 2014.
M. Li et al. "Scaling Distributed Machine Learning with the Parameter Server", OSDI 2014.
M. Abadi et al. "TensorFlow: A System for Large-Scale Machine Learning", OSDI 2016.

G. Big Data Applications:

J. Lin and D. Ryaboy, "Scaling Big Data Mining Infrastructure: The Twitter Experience", SIGKDD Explorations, 14(2).
Z. Chen and B. Liu, "Mining Topics in Documents: Standing on the Shoulders of Big Data", SIGKDD 2014.
D. Xie et al. "Simba: Efficient In-Memory Spatial Analytics", SIGMOD 2016.
M. Ota et al. "A Scalable Approach for Data-Driven Taxi Ride-Sharing Simulation", Big Data Conference 2015.
Z. Zhang et al., "Scientific Computing Meets Big Data Technology: An Astronomy Use Case", Big Data Conference 2015.
S. Goswami et al. "Lazer: Distributed Memory-Efficient Assembly of Large-Scale Genomes", Big Data Conference 2016.
H. Huang et al. "Android Malware Development on Public Malware Scanning Platforms: A Large-Scale Data-Driven Study", Big Data Conference 2016.
Y. Xu et al. "When Remote Sensing Data Meets Ubiquitous Urban Data: Fine-Grained Air Quality Inference", Big Data Conference 2016.

Programming Projects

Programming Project 1 -- Due date: October 5, 2018.

Presentation Slides

Will be available on ELC.

Miscellaneous Materials

How to read an engineering research paper? (by W. G. Griswold, Dept. of CSE, UCSD)
How to give a good research talk? (Jones, Hughes and Launchbury, Univ. of Glasgow)
Advice on research and writing (CS Dept. CMU)
On being a scientist (National Academy Press, Washington DC.)