Akram Farhadi

My research interest is in data science, which draws on techniques and theories from computer science, mathematics, and statistics. My main focus is data classification in healthcare.

Data science:

Data science uses scientific methods, processes, algorithms, and systems to extract knowledge from data in both structured and unstructured forms. It unifies statistics, computer science, and machine learning in order to understand data and extract knowledge from it. It is broader than data analysis: it covers data analytics and also uses data mining and machine learning to make predictions. I apply data mining to existing raw data to look for emerging patterns that can help shape decision-making processes in healthcare.

Healthcare:

The healthcare environment is generally regarded as a “data rich” yet “knowledge poor” area. A considerable amount of data is available within healthcare systems. However, there is a lack of effective analysis tools to discover hidden relationships and trends in that data. Medical data analysis can yield valuable knowledge, e.g. by exploring hidden patterns in disease-related medical records to predict the likelihood of subjects getting a disease. Another factor motivating the use of data science in healthcare is that the extracted knowledge is useful to all parties involved in healthcare.

Data classification:

Classification is a data mining function that assigns items in a collection to target classes. The goal is to accurately classify a new observation based on a training set of observations whose class membership is known. My research aims to distinguish patients with breast cancer using multiple machine learning algorithms such as Logistic Regression, Gaussian Naive Bayes, K-Nearest Neighbors, Multilayer Perceptron, Random Forest, and Gradient Boosting.
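To make the idea of classifying a new observation from a labeled training set concrete, here is a minimal sketch of a k-nearest-neighbors classifier written with the Python standard library only. The toy data, the two features, the class labels, and the function name knn_predict are all hypothetical and chosen for illustration; this is not the dataset or implementation used in my research.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the class of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training observation, paired with its label
    distances = [(math.dist(row, x), label) for row, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy training set: two features per observation, known class labels
train_X = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

# Classify two new observations whose class membership is unknown
prediction_a = knn_predict(train_X, train_y, (1.0, 1.0))
prediction_b = knn_predict(train_X, train_y, (3.0, 3.0))
print(prediction_a, prediction_b)
```

In practice a library implementation with proper feature scaling and cross-validation would be preferable; the sketch only shows the train-then-predict pattern shared by the algorithms listed above.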

Handling missing values:

Electronic Health Records (EHRs) are commonly used for research or management purposes. Missing values in EHRs have many causes, including physician or nurse error, operator error, and the merging of records from multiple databases.

Handling missing data is an essential part of healthcare data analysis. Instead of discarding records with missing data, missing values can be imputed using methods such as the mean, median, or mode of the observed values, or MissForest.
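The simple statistical imputations mentioned above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name impute, the use of None to mark a missing value, and the sample column are assumptions made for the example. MissForest, which imputes iteratively with random forests, is not covered by this sketch.

```python
import statistics

def impute(values, strategy="mean"):
    """Return a copy of values with None entries replaced by a column statistic."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical numeric column with two missing entries
ages = [40, None, 60, 70, None, 60]

mean_imputed = impute(ages, "mean")
median_imputed = impute(ages, "median")
mode_imputed = impute(ages, "mode")
print(mean_imputed, median_imputed, mode_imputed)
```

The mean is sensitive to outliers, the median is more robust for skewed numeric columns, and the mode is the usual choice for categorical columns.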

Handling imbalanced classes:

Medical data commonly has an imbalanced class distribution, where one class is represented by a large number of instances while the other(s) account for only a small fraction of the data. Imbalanced datasets often result in suboptimal classifier performance. Techniques such as SMOTE, undersampling, and oversampling address the problem of imbalanced classes. I compare them and investigate further in order to improve the performance measures of classifiers.
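As a rough illustration of resampling, the sketch below implements plain random oversampling and random undersampling with the Python standard library. It does not implement SMOTE, which creates synthetic minority samples by interpolating between neighbors. The function names, the fixed seed, and the 8-to-2 toy dataset are assumptions made for the example.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [row for row, lab in zip(X, y) if lab == label]
        for row in rng.sample(rows, target):
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out

# Hypothetical imbalanced dataset: 8 instances of class 0, 2 instances of class 1
X = [(i, i + 1) for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over), Counter(y_under))
```

Resampling should be applied to the training split only, so that the test set keeps the original class distribution and the reported performance is not inflated.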

As a part of my research I have focused on answering the following questions:

Given the characteristics of a dataset, which classifier is best suited to it?
Which classifiers can handle missing values, and which imputation method is best for our dataset?
Which resampling method works better to address the problem of imbalanced classes?
How can longitudinal time-series data in healthcare be classified?

My previous work was on forecasting high-frequency time series. I used regression models and deep learning to predict the price of Bitcoin a month ahead.