Advanced Topics in Data Intensive Computing

Fall 2014

General Course Information:

Instructor: Lakshmish Ramaswamy (laks[AT]cs[dot]uga[dot]edu, 706-542-2737)
 
Time and Venue: Mondays - 2:30 PM to 3:20 PM (Boyd 306); Tuesdays & Thursdays - 2:00 PM to 3:15 PM (Aderhold 409)
 
Office Hours: Tuesdays - 3:15 PM to 4:15 PM; Thursdays - 1:00 PM to 2:00 PM

Course Description:

Modern computing applications require storage, management and processing of petabytes of data. The data is not only extremely diverse, ranging from unstructured text and relational tables to complex graphs, but also dynamic. This course focuses on developing scalable architectures, algorithms and techniques for supporting various data intensive applications. Students will develop a deep understanding of the issues involved in storing and querying large amounts of structured, unstructured and dynamic data. Students will also obtain hands-on experience in developing data intensive systems and applications by working with frameworks such as Hadoop, HBase and GPS.

Grading Policy (Tentative)

Course Materials (Tentative -- Will be modified during the course of the semester)

Background Materials
  1. Distributed Systems: Principles and Paradigms, by Tanenbaum & van Steen (Second edition, Publisher: Prentice Hall, Inc.)
  2. Database Systems: An Application-Oriented Approach, by Kifer, Bernstein and Lewis (Second edition, Addison Wesley)
  3. Introduction to Information Retrieval, by Manning, Raghavan and Schütze (Cambridge)
Introduction to Data Intensive Computing
  1. C. Lynch, "Big Data: How do Your Data Grow?", Nature 455, 28-29, September 2008.
  2. A. Jacobs, "The Pathologies of Big Data", Communications of the ACM Vol. 52, No. 8, August 2009.
  3. F. Frankel and R. Reid, "Big Data: Distilling Meaning from Data", Nature 455, 30, September 2008.
  4. R. E. Bryant, R. H. Katz and E. D. Lazowska, "Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science and Society".
  5. T. Hey, S. Tansley and K. Tolle (Eds.), "The Fourth Paradigm: Data-Intensive Scientific Discovery".
  6. C. Eaton, D. deRoos, T. Deutsch, G. Lapis and P. Zikopoulos, "Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data".
Map-Reduce ++
  1. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004 -- (Lakshmish)
  2. Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, J. Currey, "DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language", OSDI 2008 (Awani Joshi -- 09/15).
  3. M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz and I. Stoica, "Improving MapReduce Performance in Heterogeneous Environments", OSDI 2008 (Guannan -- 09/16).
  4. H.-C. Yang, A. Dasdan, R.-L. Hsiao and D. Stott Parker, "Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters", SIGMOD 2007 (Ugur -- 09/17).
  5. R. Chaiken et al., "SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets", VLDB 2008 (Reshma -- 09/23).
  6. C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing", SIGMOD 2008 (Bita -- 09/25).
  7. A. Abouzied et al., "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads", VLDB 2009 (Sara -- 09/29).
  8. A. Pavlo et al., "A Comparison of Approaches to Large-Scale Data Analysis", SIGMOD 2009 (Sayali -- 09/30).
  9. D. Peng and F. Dabek, "Large-Scale Incremental Processing Using Distributed Transactions and Notifications", OSDI 2010 (Khushboo -- 10/02).
  10. M. Eltabakh et al. "CoHadoop: Flexible Data Placement and its Exploitation in Hadoop", VLDB 2011 (Seyedamin).
  11. B. Li et al. "A Platform for Scalable One-Pass Analytics using MapReduce", SIGMOD 2011 (Rick Price).
Storage, Indexing and Retrieval:
  1. F. Chang et al., "Bigtable: A Distributed Storage System for Structured Data", OSDI 2006 (Lakshmish -- 10/13).
  2. G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-Value Store", SOSP 2007 (Neda -- 10/20).
  3. A. Silberstein et al. "Efficient Bulk Insertion into a Distributed Ordered Table", SIGMOD 2008 (Arash -- 10/21).
  4. D. Ford et al. "Availability in Globally Distributed Storage Systems", OSDI 2010 (Delaram Yazdansepas -- 10/23).
  5. J. Stribling et al., "Flexible, Wide-Area Storage for Distributed Systems with WheelFS", NSDI 2009 (10/27).
Graphs, RDF Data etc.:
  1. G. Malewicz et al., "Pregel: A System for Large-Scale Graph Processing", SIGMOD 2010 (Lakshmish -- 10/28).
  2. B. Shao, H. Wang and Y. Li "The Trinity Graph Engine" (Sreekanth -- 10/30)
  3. S. Salihoglu and J. Widom, "GPS: A Graph Processing System", Technical report, CS Dept., Stanford University, 2012 (Alireza -- 11/03).
  4. Z. Khayyat, et al. "Mizan: Optimizing Graph Mining Algorithms in Large Parallel Systems", Tech. report, KAUST, 2012 (Delaram Rahbarnia -- 11/10).
  5. A. Fard et al. "A Distributed Vertex-Centric Approach for Pattern Matching in Massive Graphs", BigData Conference 2013 (Aravind -- 11/04).
  6. A. Fard et al., "Effective Caching Techniques for Accelerating Pattern Matching Queries", BigData Conference 2014 (Sahar -- 11/06).
  7. Y. Tian et al. "From Think Like a Vertex to Think Like a Graph", VLDB 2013 (Michael -- 11/11).
  8. J. Gao et al. "Continuous Pattern Detection over Billion-Edge Graph Using Distributed Framework", ICDE 2014.
  9. L. Wang et al., "How to Partition a Billion-Node Graph", ICDE 2014 (Yu -- 11/11).
Data Fusion, Data Mining and Information Retrieval
  1. L. Dong et al., "Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion", SIGKDD 2014 (Navid -- 11/13).
  2. Z. Chen and B. Liu, "Mining Topics in Documents: Standing on the Shoulders of Big Data", SIGKDD 2014 (Hao -- 11/13).
  3. K. El-Arini et al. "Representing Documents Through Their Readers", SIGKDD 2013 (Manish -- 11/17).
  4. Li, Liu and Islam "Keyword-based Correlated Network Computation over Large Social Media", ICDE 2014 (Collin -- 11/18).
  5. Gupta et al., "Top-K Interesting Subgraph Discovery in Information Networks", ICDE 2014 (Pan -- 11/18).
  6. K-W Chang et al. "Efficient Pattern-based Time Series Classification on GPU", ICDM 2012 (Shu -- 11/20).
  7. N. Djuric and S. Vucetic "Efficient Visualization of Large-Scale Data Tables through Reordering and Entropy Minimization", ICDM 2013 (Ruichen -- 11/20).
  8. N. Katariya, A. Iyer and S. Sarawagi "Active Evaluation of Classifiers on Large Datasets", ICDM 2012 (Wei -- 12/01).
  9. A. Narang et al. "High Performance Offline and Online Distributed Collaborative Filtering", ICDM 2012 (12/02).

Programming Projects

Project 1 -- Due Date: October 10, 2014

Presentation Slides

Available on ELC.

Miscellaneous Materials