Cardiovascular Disease Prediction from Electrocardiogram by using Machine Learning Method: A Snapshot from the Subjects of the Malaysian Cohort [post]

2020 unpublished
Cardiovascular disease (CVD) is the leading cause of death worldwide. In 2017, CVD contributed to 13,503 deaths in Malaysia. The current approaches for CVD prediction are usually invasive and costly. Machine learning (ML) techniques allow an accurate prediction by utilizing the complex interactions among relevant risk factors. Results: This work presents a case-control study involving 60 participants from The Malaysian Cohort, which is a prospective population-based project. Five parameters, namely, the R-R interval and root mean square of successive differences extracted from electrocardiogram (ECG), systolic and diastolic blood pressures, and total cholesterol level, were statistically significant in predicting CVD. Six ML algorithms, namely, linear discriminant analysis, linear and quadratic support vector machines, decision tree, k-nearest neighbor, and artificial neural network (ANN), were evaluated to determine the most accurate classifier in predicting CVD risk. ANN, which achieved 90% specificity, 90% sensitivity, and 90% accuracy, demonstrated the highest prediction performance among the six algorithms. Conclusions: In summary, by utilizing ML techniques, ECG data can serve as a useful predictor of CVD among the Malaysian multiethnic population.

Background

Cardiovascular disease (CVD) involves the heart and blood vessels and can lead to premature mortality [1]. CVD includes coronary heart disease (CHD), cerebrovascular disease, rheumatic heart disease, and other heart conditions. Approximately 17.9 million people die annually from CVD, which accounts for 31% of the total deaths worldwide [2]. In Malaysia, the incidence of ischemic heart disease has substantially increased by 54% within 10 years, and the disease remained the principal cause of death in 2017 [3]. CVD risk factors, namely, diabetes mellitus (DM), hyperlipidemia, obesity, hypertension, age, gender, smoking, and inactive lifestyle, are important predictors of CVD risk [4-5]. The Malaysian Cohort (TMC) project, which was initiated in 2006 to address the rising trends in non-communicable diseases (NCD), is a large prospective study involving 106,527 multiethnic participants [6]. More than 2000 parameters, including lipid profile, fasting blood glucose (FBG), body composition, blood ...

... relations of the selected input to the corresponding groups [45]. The decision tree (DT) classifier constructs a tree from the training data by using five selected features.
The tree provides the rules to classify case and control data, and these rules were used to determine the group of the test data. Designing the tree is important to increase the classification performance [46]. In this study, Gini's diversity index was used as the split criterion, with the maximum number of splits set to 100.

The k-nearest neighbor (kNN) algorithm identifies similarities among training inputs in groups or classes. New inputs are classified by measuring the minimum distance between the test and training data. The training points closest to a test input are called its neighbors [47]. The Euclidean distance with 10 neighbors was applied in this study to assign each test input to the corresponding case or control group.

ANN is a computational model that emulates the human brain and is an outstanding method for predicting the relationship between the input and target values [48]. ANN has been widely used in cardiology applications for pattern recognition and classification tasks [18]. A feed-forward ANN can accurately classify ECG signals when the number of hidden layers, the number of hidden neurons, the learning algorithm, and the transfer function are optimized [49]. A two-layer feed-forward backpropagation network with five input neurons and one output neuron was used in this study. The network was trained with 10 different values of initial weights and biases (random "seed" of 1-10), 30 different numbers of hidden neurons (1-30 hidden neurons), and 2 different training algorithms: Levenberg-Marquardt ("trainlm") and gradient descent with an adaptive learning rate ("traingda"). The log-sigmoid transfer function was used in both layers to scale the output from 0 to 1.

Each trained model of every classifier was tested with 20 sets of test data (distinct from the training data) to examine the performance in terms of specificity, sensitivity, and accuracy. Specificity refers to the ability of the trained model to categorize healthy subjects into the control group.
Sensitivity refers to the ability of the trained model to categorize subjects at risk of CVD into the case group. Accuracy is the average of specificity and sensitivity and represents the overall performance of the model. Among the six classifiers, the model that classified the test data into the respective groups with the highest performance was selected as the best model for CVD risk prediction.
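
The evaluation and selection steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's implementation: the label encoding (1 = case, 0 = control), the function names `evaluate` and `select_best`, the model names, and the 20 test labels are all hypothetical. The grid merely enumerates the ANN settings listed in the text (10 seeds, 1-30 hidden neurons, 2 training algorithms) and uses "trainlm" and "traingda" as plain string labels. Accuracy is computed as the average of specificity and sensitivity, as defined above; this equals the fraction of correctly classified subjects when the test set contains equal numbers of cases and controls.

```python
from itertools import product


def evaluate(y_true, y_pred):
    """Return specificity, sensitivity, and accuracy for binary labels.

    Assumed encoding (not from the source): 1 = case, 0 = control.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls called case
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases called control
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("test data must contain both cases and controls")
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    # Accuracy as defined in the text: mean of specificity and sensitivity.
    accuracy = (specificity + sensitivity) / 2
    return {
        "specificity": specificity,
        "sensitivity": sensitivity,
        "accuracy": accuracy,
    }


def select_best(results):
    """Return the name of the model with the highest accuracy."""
    return max(results, key=lambda name: results[name]["accuracy"])


# ANN settings listed in the text: seeds 1-10, 1-30 hidden neurons,
# and two training algorithms (used here only as labels).
ann_grid = list(product(range(1, 11), range(1, 31), ("trainlm", "traingda")))

# Hypothetical test set of 20 subjects: 10 cases followed by 10 controls.
y_true = [1] * 10 + [0] * 10
y_pred_a = [1] * 9 + [0] * 1 + [0] * 9 + [1] * 1  # one error per group
y_pred_b = [1] * 7 + [0] * 3 + [0] * 9 + [1] * 1  # three missed cases

results = {
    "model_a": evaluate(y_true, y_pred_a),
    "model_b": evaluate(y_true, y_pred_b),
}
best = select_best(results)
print(len(ann_grid), "ANN configurations;", "best model:", best, results[best])
```

In this sketch every ANN configuration would be trained and scored in the same way, and `select_best` would then be applied across all trained models of all six classifiers.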
doi:10.21203/rs.2.22561/v1 fatcat:z4tjqbusufeqpozzjuwoxxnztq