Ontology-based Semantics vs Meta-learning for

Predictive Big Data Analytics

by

Mustafa Veysi Nural

(Under the Direction of John A. Miller)

Abstract

Predictive analytics is taking on an increasingly important role in the big data era. Issues related to the choice of modeling technique, estimation procedure (or algorithm), and efficient execution can present significant challenges. For example, selecting appropriate and most predictive models (i.e., the models that best satisfy the chosen performance criterion, such as lowest error) for big data analytics often requires careful investigation and considerable expertise, which might not always be readily available. In this thesis, we propose two alternative methods to assist data analysts and data scientists in selecting appropriate modeling techniques and building specific models, and to provide the rationale for the techniques and models selected. The first approach uses ontology-based semantics to assist in selecting the most predictive model for a given dataset. To formally describe the modeling techniques, models, and results, we developed the Analytics Ontology, which supports inferencing for semi-automated model selection. The ScalaTion framework, which currently supports over sixty modeling techniques for big data analytics, is used as a testbed for evaluating the use of semantic technology. In the second approach, we present a meta-learning system for selecting the most predictive regression algorithm in a predictive big data analytics setting. The meta-learning system uses meta-features that characterize aspects of the dataset to select the most predictive modeling techniques for that dataset. We show that our meta-learning system provides promising performance in predicting the top-performing modeling techniques for a given dataset. In addition to evaluating the system against existing baseline approaches, we also compare the meta-learning approach with the ontology-assisted suggestion engine.
Finally, we present a detailed performance analysis of the regression algorithms that we have implemented in ScalaTion, namely Lasso and Ridge Regression, and show that they provide robust performance compared to R in terms of both training time and error.

Index words: predictive big data analytics; automated modeling; meta-learning; ontology-based semantics; machine learning.