Advanced Topics in Data Intensive Computing

Fall 2016

General Course Information:

Instructor: Lakshmish Ramaswamy (laks[AT]cs[dot]uga[dot]edu, 706-542-2737)
 
Time and Venue: Mondays - 11:15 AM to 12:05 PM (Pharmacy South 301); Tuesdays & Thursdays - 11:00 AM to 12:15 PM (Physics 254)
 
Office Hours: Tuesdays & Thursdays - 12:15 PM to 1:15 PM (tentative).

Course Description:

Modern computing applications require storage, management and processing of petabytes of data. The data is not only extremely diverse, ranging from unstructured text and relational tables to complex graphs, but also dynamic. This course focuses on developing scalable architectures, algorithms and techniques for supporting various data intensive applications. Students will develop a deep understanding of the issues involved in storing and querying large amounts of structured, unstructured and dynamic data. Students will also obtain hands-on experience in developing data intensive systems and applications by working with frameworks such as Hadoop, HBase and GPS.

Grading Policy (Tentative)

Course Materials (Tentative -- Will be modified during the course of the semester)

Introduction to Data Intensive Computing
  1. C. Lynch, "Big Data: How do Your Data Grow?", Nature 455, 28-29, September 2008.
  2. A. Jacobs, "The Pathologies of Big Data", Communications of the ACM Vol. 52, No. 8, August 2009.
  3. F. Frankel and R. Reid, "Big Data: Distilling Meaning from Data", Nature 455, 30, September 2008.
  4. R. E. Bryant, R. H. Katz and E. D. Lazowska, "Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science and Society"
  5. T. Hey, S. Tansley and K. Tolle (Eds), "The Fourth Paradigm: Data-Intensive Scientific Discovery"
  6. C. Eaton, D. deRoos, T. Deutsch, G. Lapis and P. Zikopoulos, "Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data".
Background Materials

  1. Tanenbaum and van Steen, "Distributed Systems: Principles and Paradigms", Second Edition, Prentice Hall Inc.
  2. Kifer, Bernstein and Lewis, "Database Systems: An Application Oriented Approach", Second Edition, Addison Wesley Inc.
  3. Manning, Raghavan and Schütze, "Introduction to Information Retrieval", Cambridge.
  4. S. Chaudhuri and U. Dayal, "An Overview of Data Warehousing and OLAP Technology", ACM SIGMOD Record, 1997.
  5. I. Stoica, R. Morris, D. Liben-Nowell, D. Karger, M. Kaashoek, F. Dabek and H. Balakrishnan, Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications, IEEE/ACM Transactions on Networking, 11(1), 2003.
  6. L. Lamport, "Paxos Made Simple", 2001.
  7. D. Mazieres "Paxos Made Practical".
  8. S. Ghemawat et al. "The Google File System", SOSP 2003.

Map-Reduce ++
  1. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004 -- (Lakshmish)
  2. D. DeWitt and M. Stonebraker, "MapReduce: A Major Step Backwards", The Database Column January 2008 (Lakshmish)
  3. M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz and I. Stoica, "Improving MapReduce Performance in Heterogeneous Environments", OSDI 2008 (Priyanka).
  4. H.-C. Yang, A. Dasdan, R.-L. Hsiao and D. Stott Parker, "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters", SIGMOD 2007 (Likhita -- 09/20)
  5. R. Chaiken et al., "SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets", VLDB 2008 (09/20).
  6. C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing", SIGMOD 2008 (Anuja -- 09/22).
  7. A. Abouzeid et al. "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads", VLDB 2009 (Ayda -- 09/22).
  8. A. Pavlo et al., "A Comparison of Approaches to Large-Scale Data Analysis", SIGMOD 2009 (Sujeeth -- 09/22)
  9. D. Peng and F. Dabek, "Large-Scale Incremental Processing Using Distributed Transactions and Notifications", OSDI 2010 (Santosh -- 09/26)
  10. B. Li et al. "A Platform for Scalable One-Pass Analytics using MapReduce", SIGMOD 2011 (09/27).
  11. M. Zaharia et al. "Spark: Cluster Computing with Working Sets", HotCloud 2010 (Lakshmish -- 09/27).
  12. R. S. Xin et al. "Shark: SQL and Rich Analytics at Scale", SIGMOD 2013 (Saeid -- 10/06).
  13. X. Meng et al. "MLlib: Machine Learning in Apache Spark", JMLR 2016 (Lakshmish -- 10/06)
Storage, Indexing and Retrieval:
  1. F. Chang, et al., "BigTable: A Distributed Storage System for Structured Data", OSDI 2006 (Mohammad Hossein -- 10/13).
  2. G. DeCandia, et al. "Dynamo: Amazon's Highly Available Key-Value Store", SOSP 2007 (Pranjay -- 10/10).
  3. A. Silberstein et al. "Efficient Bulk Insertion into a Distributed Ordered Table", SIGMOD 2008 (Sakshi -- 10/13)
  4. D. Ford et al. "Availability in Globally Distributed Storage Systems", OSDI 2010 (Omar -- 10/17).
  5. J. Stribling et al. "Flexible, Wide-Area Storage for Distributed Systems with WheelFS" (Supriya -- 10/17)
  6. M. Stonebraker, "Hadoop at a Crossroads?", Communications of the ACM Blog, August 2014 (Lakshmish -- 10/18)
Big Graphs, RDF Data etc.:
  1. G. Malewicz et al., "Pregel: A System for Large-Scale Graph Processing", SIGMOD 2010 (Lakshmish -- 10/18).
  2. B. Shao, H. Wang and Y. Li "The Trinity Graph Engine" (Himanshu -- 10/18)
  3. Z. Khayyat, et al. "Mizan: Optimizing Graph Mining Algorithms in Large Parallel Systems", Tech. report, KAUST, 2012 (An Chen -- 10/27).
  4. Y. Tian et al. "From 'Think Like a Vertex' to 'Think Like a Graph'", PVLDB Vol. 7, No. 3, 2014 (Divya Spandana -- 10/27)
  5. A. Fard et al. "A Distributed Vertex-Centric Approach for Pattern Matching in Massive Graphs", BigData Conference 2013 (Usman -- 10/31).
  6. A. Fard et al. "Effective Caching Techniques for Accelerating Pattern Matching Queries", BigData Conference 2014 (Amitabh -- 11/01)
  7. J. Gao et al. "Continuous Pattern Detection over Billion-Edge Graph Using Distributed Framework", ICDE 2014 (Akanksha -- 11/01)
  8. L. Wang et al., "How to Partition a Billion-Node Graph", ICDE 2014 (Vishnu -- 11/03).
  9. A. Prat-Perez et al. "High Quality, Scalable and Parallel Community Detection for Large Real Graphs", WWW 2014.
Web and Social Big Data:
  1. J. Lin and D. Ryaboy, "Scaling Big Data Mining Infrastructure: The Twitter Experience", SIGKDD Explorations, 14(2).
  2. Z. Chen and B. Liu, "Mining Topics in Documents: Standing on the Shoulders of Big Data", SIGKDD 2014 (Zheliang -- 11/07)
  3. S. Jiang et al. "Learning Query and Document Relevance from a web-scale Click Graph", SIGIR 2016 (Anumita/Sonali -- 11/08).
  4. I. Guy et al. "Islands in the Stream: A Study of Item Recommendation within an Enterprise Social Stream", SIGIR 2015 (Jin -- 11/08).
Spatio-Temporal Data:
  1. D. Xie et al. "Simba: Efficient In-Memory Spatial Analytics", SIGMOD 2016 (Nitin -- 11/10).

Programming Projects

Programming Project 1 (Due Date: October 16, 2016)

Presentation Slides

Will be available on ELC.

Miscellaneous Materials