Supporting Open Science in Big Data Frameworks and Data Science Education

by

Michael E. Cotterell

(Under the Direction of John A. Miller)

Abstract

As the prevalence of data grows throughout the Big Data era, so does a need to provide and improve tools for the education and application of data-driven analytics and scientific investigation. The main contributions of this research can be summarized as follows: i) We provide an overview of the open source ScalaTion project, a big data framework that supports big data analytics, simulation modeling, and functional data analysis. ii) We outline some of the Functional Data support in ScalaTion, including a performance comparison for the evaluation of B-spline basis functions that shows that our method is faster than some other popular libraries. iii) To demonstrate how to provide lightweight big data framework integration in open notebooks, we present the open source ScalaTion Kernel project, a custom Jupyter kernel that enables ScalaTion support in Jupyter notebooks. iv) To demonstrate research using ScalaTion, we outline and evaluate a tight clustering algorithm, written using ScalaTion, for the functional data analysis of time course omics data. v) To promote reproducibility in open science, we present the Applied Open Data Science (AODS) project, a collection of customized web applications for the hosting and sharing of open notebooks with ScalaTion support. This project also includes shareable, executable, and modifiable example notebooks that utilize ScalaTion to demonstrate various data science topics as well as detailed documentation on how to easily reproduce the environment in which the notebooks are hosted. Specifically, we propose and demonstrate, via readily accessible examples, methods to facilitate openness and reproducibility (both of results and infrastructure) in data science investigations using a big data framework.

Index words: open science; open notebooks; big data frameworks; data science; data science education.