2017 IEEE International Congress on Big Data
Parallelization, Performance, and Efficient Cross-Validation
Hao Peng, Zhe Jin, and John A. Miller
Department of Computer Science
University of Georgia
Athens, GA, USA
Abstract
Bayesian Network algorithms are widely applied
in the fields of bioinformatics, document classification, big data,
and marketing informatics. In this paper, several Bayesian
Network algorithms are evaluated, including Naive Bayes,
Tree Augmented Naive Bayes, k-BAN, and k-BAN with Order
Swapping. The algorithms are implemented in Scala and
compared with the implementations in the bnlearn library for R and in Weka. Several
datasets with varying numbers of attributes and instances are
used to test the accuracy and efficiency of the implementations
of the algorithms provided by the three packages. When handling
huge datasets, issues involving accuracy, efficiency, and
serial vs. parallel execution become more critical and should
be addressed. We implemented several parallel algorithms as
well as an efficient method for performing cross-validation, resulting
in significant speedups.
Keywords: Big data; Analytics; Data mining; Classification;
Parallel programming; Bayesian networks
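
The abstract does not say how the efficient cross-validation is carried out. One plausible approach for count-based classifiers such as Naive Bayes, shown here purely as an illustrative assumption and not as the authors' method, is to build the frequency tables for the full dataset once, build them once per fold, and obtain each training model by subtraction instead of recounting the training instances. All names below (`CountingCV`, `Counts`, `count`, `classify`, `crossValidate`) are hypothetical, and the sketch assumes discrete attributes encoded as small non-negative integers with Laplace smoothing.

```scala
// Minimal sketch under stated assumptions: a count-based Naive Bayes for
// discrete data, with k-fold cross-validation that derives each training
// model by subtracting the held-out fold's counts from the full-data counts.
object CountingCV {

  // classCount(c): instances of class c
  // featCount(j)(v)(c): instances of class c whose attribute j has value v
  final case class Counts(classCount: Array[Int],
                          featCount: Array[Array[Array[Int]]]) {
    // Element-wise subtraction of another set of frequency tables.
    def minus(o: Counts): Counts = Counts(
      classCount.zip(o.classCount).map { case (a, b) => a - b },
      featCount.zip(o.featCount).map { case (fa, fb) =>
        fa.zip(fb).map { case (va, vb) =>
          va.zip(vb).map { case (a, b) => a - b }
        }
      })
  }

  // One pass over the given instances to fill the frequency tables.
  // card(j) is the number of distinct values of attribute j; k is the number of classes.
  def count(x: Seq[Array[Int]], y: Seq[Int], card: Array[Int], k: Int): Counts = {
    val cc = Array.fill(k)(0)
    val fc = card.map(v => Array.fill(v, k)(0))
    for ((row, c) <- x.zip(y)) {
      cc(c) += 1
      for (j <- row.indices) fc(j)(row(j))(c) += 1
    }
    Counts(cc, fc)
  }

  // Naive Bayes decision rule in log space with Laplace (add-one) smoothing.
  def classify(m: Counts, card: Array[Int], row: Array[Int]): Int = {
    val k = m.classCount.length
    val n = m.classCount.sum
    val scores = (0 until k).map { c =>
      val prior = math.log((m.classCount(c) + 1.0) / (n + k))
      val likelihood = row.indices.map { j =>
        math.log((m.featCount(j)(row(j))(c) + 1.0) / (m.classCount(c) + card(j)))
      }.sum
      prior + likelihood
    }
    scores.indexOf(scores.max)
  }

  // Returns the cross-validated accuracy. Instance i is assigned to fold i % folds.
  def crossValidate(x: IndexedSeq[Array[Int]], y: IndexedSeq[Int],
                    card: Array[Int], k: Int, folds: Int): Double = {
    val foldIdx = x.indices.groupBy(_ % folds)
    val total = count(x, y, card, k)                 // full-data counts, built once
    val correct = foldIdx.map { case (_, idx) =>
      val held = count(idx.map(x), idx.map(y), card, k)
      val train = total.minus(held)                  // training counts without recounting
      idx.count(i => classify(train, card, x(i)) == y(i))
    }.sum
    correct.toDouble / x.length
  }
}
```

With this scheme every instance is counted twice in total (once for the full tables, once for its own fold), independent of the number of folds. The per-fold work is also independent across folds, so it is a natural candidate for parallel execution.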