Electrocardiogram signals classification using random forest method for web-based smart healthcare

ABSTRACT


INTRODUCTION
Indonesia currently faces complex and diverse health problems, ranging from dengue fever to the COVID-19 respiratory disease, which is still ongoing. In addition to these diseases, coronary heart disease (CHD), or acute coronary syndrome, remains the number one cause of death in Indonesia and is called the silent killer. The death rate due to CHD in Indonesia reaches 26%. To reduce this high mortality rate, CHD can be detected early with an electrocardiogram/electrocardiograph (ECG) examination or an echocardiograph examination. The ECG examination uses hardware built in previous studies to record ECG signals [1]. The tool is also equipped with software that saves ECG signal data to a database. ECG research is important because the ECG can detect cardiovascular disease [2]. Cardiovascular disease is caused by impaired function of the heart and blood vessels, and it includes arrhythmic diseases. Arrhythmias are disorders of the heart rhythm. There are many types of arrhythmias, and their rhythm patterns allow them to be identified and classified by type. An arrhythmia can indicate a problem with the heart, and it can be recognized by classifying the heart rate patterns in the patient's ECG recording. Certain changes in heart rate patterns can be a sign of a more serious illness. Many studies have addressed the classification of these patterns, but determining the best features for recognizing and classifying heart rate patterns is still a problem. Many types of feature extraction are used in ECG research, such as wavelet transform coefficient features [3], [4], frequency-based features [5], and Hermite polynomials [6]. Most of them use the time-domain and frequency-domain representations of the ECG signal as features.
One of the methods used to identify and classify heart rate patterns is the R-R interval (RRI) method, which extracts the interval between each R peak of the ECG signal and the next R peak. The RRI method has 5 features, namely the R peak, the local interval average, the RRI of 3 peaks and their ratios, and 10 RRI ratios. In previous research, heart signal classification using a support vector machine (SVM) reached an accuracy of 89% [7].
In this study, feature extraction uses the RRI calculation to increase the accuracy of heart rate classification, and the classification method is random forest. Signals are labeled normal or abnormal, where abnormal means that the heart rhythm is abnormal. The classifiers commonly used in ECG research are neural network (NN), K-nearest neighbor (KNN) [8], SVM [7], [9], fuzzy clustering neural network (FCNN) [10], random forest, Naive Bayes [9], [11], AdaBoost [9], convolution neural network (CNN) [4], and artificial neural networks (ANN) [9], [12]. Random forest is a combination of several decision trees. Bootstrap sampling is applied to select the samples for each tree in the forest. Two-thirds of the selected data are used to train the tree, and classification is performed with the remaining data. The advantages of the random forest method are that it is easy to use, it prevents over-fitting, and the resulting set of decision trees can be saved for use on other datasets. It can also cope with noise and missing values, and it can handle large amounts of data [13].
Early ECG detection can be one of the facilities of a web-based smart health care application. Smart health care can be used for critically ill patients who need constant monitoring, especially when no doctor is available in the area where the patient lives. It can be used for early detection of disease, so that doctors can be notified immediately. Smart health care has various facilities, such as heart monitoring, ECG monitoring, and blood pressure monitoring. In this study, ECG monitoring is connected to an internet of things (IoT) system, which is connected to a Raspberry Pi 3 B+ microcontroller and then to a cloud message queuing telemetry transport (MQTT) broker. Data from the IoT system are then entered into the database server. The IoT system is an ECG hardware device with an AD8232 heart rate sensor, equipped with software made in previous research [1]. The discussion of ECG signal classification using the random forest method for web-based smart health care is divided into several sections: section 2 discusses the detection of ECG signals, section 3 discusses the random forest classification method, section 4 discusses ECG signal classification using random forest, section 5 discusses the design of the smart health care web application for ECG monitoring, section 6 presents the experimental results and discussion, and section 7 is the conclusion.

ELECTROCARDIOGRAM (ECG) SIGNAL CLASSIFICATION WITH RANDOM FOREST

Arrhythmia
Arrhythmias are disorders of the heart rhythm. The heart rhythm in people with arrhythmias is usually too fast, too slow, or irregular. There are many types of arrhythmias depending on the rhythm of the heartbeat, namely atrial fibrillation (the heart beats faster and irregularly), atrioventricular block (the heart beats slower), supraventricular tachycardia (the heart beats too fast), ventricular extrasystole (extra beats occur outside the normal heart rhythm), and ventricular fibrillation (the heart only quivers) [14]. Arrhythmias occur when the electrical impulses that regulate the heartbeat do not work normally. Causes of these conditions include consumption of cold or allergy medicine, hypertension, diabetes, electrolyte disorders, thyroid disorders, heart valve disorders, and heart attacks.

Electrocardiogram (ECG)
The ECG system measures the electrical activity of the heart and is commonly used in medicine to detect heart disease. It can detect abnormalities in the heartbeat by analyzing the electrical signal of each heartbeat and the combination of impulse waveforms created by the various specialized tissues of the heart. Abnormalities in the heartbeat are known as arrhythmias.
The ECG signal has 5 main components [1], namely the P wave, QRS wave, T wave, PR interval, and ST segment. In a normal heart, the P wave is <0.3 mV high and <=0.12 seconds wide, the QRS wave is 0.06-0.12 seconds wide, the T wave is positive in all leads, the PR interval is 0.12-0.20 seconds wide, and the ST segment is measured from the end of the QRS wave to the beginning of the T wave. The PQRST signal is shown in Figure 1 [15], [16].
An ECG can identify the normal heart rhythm, which is known as rhythmic sinus rhythm or regular rhythm. In a regular rhythm, each P wave is followed by a QRS complex. Slight changes in cycle length are still considered normal sinus rhythm. When the difference between the longest and shortest cycles exceeds 0.12 seconds, the change in rhythm is called sinus arrhythmia. The following heart rhythms and rhythm abnormalities can be seen in an ECG recording [17], [18]:
a) Rhythmic sinus rhythm
- Regular rhythm with a frequency of 60-100 beats per minute and regular R-to-R intervals.
- Normal P wave morphology, with each P wave followed by a QRS complex.
- Positively deflected P wave in lead II.
- Negatively deflected P wave and QRS complex in lead aVR.
b) Sinus arrhythmia
- Meets the sinus rhythm criteria, but is slightly irregular.
- This is a normal physiological finding, often seen in young healthy individuals.
- This phenomenon occurs due to the influence of respiration.
c) Atrial fibrillation (AF)
- AF is characterized by the absence of P waves and an irregularly irregular rhythm.
- The P wave morphology is fibrillatory.
d) Ventricular tachycardia (VT)
- There are >3 ventricular beats with a frequency of 100-250 beats per minute.
- Wide QRS complex (QRS duration >0.12 seconds).
- Occasionally a P wave is seen, but it has no association with the QRS complex.
e) Ventricular fibrillation (VF)
- The waves appear irregular, with varying morphology and amplitude.
- No P wave, QRS complex, or T wave is visible.
f) Supraventricular tachycardia (SVT)
- Regular tachycardia (frequency 140-280 beats per minute).
- Narrow QRS complex (QRS complex duration <0.12 seconds).
- The P wave is not visible.
g) Sinus tachycardia
- Sinus rhythm with a heart rate of more than 100 beats/minute in adults, more than 120 beats/minute in children, and more than 150 beats/minute in infants. The frequency is calculated using the R-R interval feature.
h) Sinus bradycardia
- Sinus rhythm with a heart rate of less than 60 beats/minute.
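The rate thresholds for sinus tachycardia and sinus bradycardia listed above can be illustrated with a minimal Python sketch. The function names and the conversion of a single R-R interval to beats per minute are assumptions made for illustration and are not taken from the study's software. The sketch checks the rate only; it does not verify that the rhythm is actually a sinus rhythm.

```python
def heart_rate_bpm(rr_interval_s):
    """Convert one R-R interval (seconds) to beats per minute."""
    if rr_interval_s <= 0:
        raise ValueError("R-R interval must be positive")
    return 60.0 / rr_interval_s


def classify_sinus_rate(bpm, tachycardia_limit=100.0, bradycardia_limit=60.0):
    """Label a heart rate using the rate limits quoted in the text.

    The default upper limit (100 beats/minute) is the adult value; the text
    gives 120 for children and 150 for infants.
    """
    if bpm > tachycardia_limit:
        return "sinus tachycardia"
    if bpm < bradycardia_limit:
        return "sinus bradycardia"
    return "normal rate"
```

For a child, the same function could be called with `tachycardia_limit=120`, following the limits given in item g).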

Methods

Electrocardiogram/electrocardiograph (ECG) signal calculation
The RR interval calculation takes five features consisting of peak amplitude, local average, RRI, R-R ratio interval (RRIR), and 10 RRIR [7].
a) Peak amplitude is the value of each R-peak.
b) The local average is the average of ten intervals between R-peaks. Select an R-peak, then count five R-peak intervals behind it and five R-peak intervals in front of it. The calculation of the local average is shown in (1), where LA is the local average, x is the R-peak, and i is the R-peak position.
c) The RRI calculation is divided into eight values, namely pre-RR, pre-RR2, pre-RR3, pre-RR4, post-RR, post-RR2, post-RR3, and post-RR4. Pre-RR is the interval between the selected R-peak and the previous R-peak. Pre-RR2 is the interval between the selected R-peak and the two previous R-peaks, and so on. Post-RR is the interval between the selected R-peak and the next R-peak. Post-RR2 is the interval between the selected R-peak and the next two R-peaks, and so on. Figure 2 shows an illustration of the RRI.
d) The RRIR is shown in (2).
e) The 10 RRIR is the ratio between the pre-RR and the interval from the R-peak to the next ten R-peaks, shown in (3).
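The feature definitions above can be sketched in Python as follows. Equations (1) to (3) are not reproduced in this text, so the sketch rests on stated assumptions: pre-RRk and post-RRk are measured from the selected R-peak to the k-th previous or next R-peak, the RRIR is taken as pre-RR divided by post-RR, and the 10 RRIR is taken as pre-RR divided by the time from the selected R-peak to the tenth next R-peak. The function and key names are hypothetical.

```python
def rr_features(r_times, r_amps, i):
    """Sketch of the five feature groups for the R-peak at index i.

    r_times: R-peak times in seconds, ascending.
    r_amps:  R-peak amplitudes, same length as r_times.
    Needs at least five peaks before index i and ten peaks after it.
    """
    if i < 5 or i + 10 >= len(r_times):
        raise IndexError("need 5 peaks before and 10 peaks after index i")

    feats = {"peak_amplitude": r_amps[i]}

    # Local average: mean of the ten R-R intervals around peak i (five behind,
    # five ahead). Summing consecutive intervals telescopes to one difference.
    feats["local_average"] = (r_times[i + 5] - r_times[i - 5]) / 10.0

    # Assumed reading of pre-RRk / post-RRk: distance to the k-th neighbour.
    for k in range(1, 5):
        suffix = "" if k == 1 else str(k)
        feats["pre_rr" + suffix] = r_times[i] - r_times[i - k]
        feats["post_rr" + suffix] = r_times[i + k] - r_times[i]

    # Assumed ratio definitions (the original equations are not available).
    feats["rrir"] = feats["pre_rr"] / feats["post_rr"]
    feats["rrir10"] = feats["pre_rr"] / (r_times[i + 10] - r_times[i])
    return feats
```

With perfectly regular beats every 0.8 seconds, the local average and pre-RR both come out near 0.8 seconds, which gives a quick sanity check for an implementation.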
The standard values used in this study are shown in Table 1. The portable ECG recording device from previous research [1] currently extracts only the PT interval and Bpm values from the signal. In this study, the RR interval and RR local extraction features are added. These give the classifier additional attributes, with the aim of improving the classification results. The PT interval is the interval from point P to point T. Bpm is the patient's heart rate per minute. The formulas for the PT interval and Bpm [11] are shown in (4) and (5).
In (4), pt is the length from the starting point to the ending point of a PT interval, and npt is the number of pt values.
In (5), nR is the number of R points and 60 is the number of seconds in 1 minute.
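A minimal Python sketch of these two quantities is given here. Equations (4) and (5) are not reproduced in this text, so both forms are assumptions: the PT interval is taken as the mean of the measured pt lengths, and Bpm is taken as the number of R points scaled from the recording duration to 60 seconds. The `duration_s` parameter and the function names are hypothetical.

```python
def mean_pt_interval(pt_lengths):
    """Assumed form of (4): sum of the pt lengths divided by their count."""
    if not pt_lengths:
        raise ValueError("at least one pt value is required")
    return sum(pt_lengths) / len(pt_lengths)


def beats_per_minute(n_r, duration_s):
    """Assumed form of (5): R points counted over duration_s, scaled to 60 s."""
    if duration_s <= 0:
        raise ValueError("recording duration must be positive")
    return n_r * 60.0 / duration_s
```

For example, 150 R points in a 2-minute (120-second) recording would correspond to 75 beats per minute under this assumed form.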

Random forest
Random forest is a classification algorithm with a good level of accuracy. It is an ensemble method that uses several decision trees as classifiers. The class produced by the classification is the class predicted by the majority of the decision trees in the random forest. The random forest algorithm is as follows [22]-[24]:
a) K data are selected from the training set.
b) A decision tree is built from the K selected data. The number of trees N (the collection of decision trees) to build is chosen, and steps a) and b) are then repeated, for example 200, 300, or 500 times.
c) Each of the N trees predicts the group of a new data point. The new data point is then assigned to the group with the highest probability across all N trees.
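The three steps can be sketched in plain Python. This is a deliberately simplified illustration, not the model used in the study: each "tree" is a one-split stump on a randomly chosen feature, whereas a real random forest grows deeper trees. All function names are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, rng):
    """Fit a one-split tree on one randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
        if not left or not right:
            continue
        left_label, left_hits = Counter(left).most_common(1)[0]
        right_label, right_hits = Counter(right).most_common(1)[0]
        score = left_hits + right_hits  # training points classified correctly
        if best is None or score > best[0]:
            best = (score, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature, best[1], best[2], best[3])


def train_forest(rows, labels, n_trees=200, seed=0):
    """Steps a) and b): draw a bootstrap sample and fit a tree, n_trees times."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(train_stump([rows[j] for j in idx],
                                  [labels[j] for j in idx], rng))
    return forest


def predict(forest, row):
    """Step c): every tree votes and the most common class wins."""
    votes = Counter()
    for feature, threshold, left_label, right_label in forest:
        votes[left_label if row[feature] <= threshold else right_label] += 1
    return votes.most_common(1)[0][0]
```

A forest would be trained with `train_forest(rows, labels)` on feature rows such as `[bpm, rr_interval]` and queried with `predict(forest, new_row)`. Drawing each sample with replacement is what makes step a) a bootstrap sample.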
The decision tree consists of a root node, internal nodes, and leaf nodes. Nodes are determined by taking attributes and data randomly according to the applicable provisions. The root node is the node located at the top. An internal node is a branching node, because it has only one input and at least two outputs. A leaf node, or terminal node, is a final node, because it has only one input and no output. Building the decision tree begins by calculating the entropy value, which determines the level of attribute impurity, and the information gain value. The entropy value is calculated with (6), and the information gain value with (7).
In (6), Y is the case set and p(c|Y) is the proportion of cases in Y that belong to class c.
In (7), Values(a) is the set of all possible values of attribute a in the case set, Yv is the subset of Y with value v for attribute a, and Ya is the set of all values that correspond to a. The attribute chosen as a node (root or internal node) is the one with the highest information gain among the existing attributes. The gain ratio is obtained by dividing the information gain by the split information. The split information is given in (8).
In (8), Split Information (S, A) is the estimated entropy of the input variable S, which has c classes, and |Si|/|S| is the probability of class i in the attribute.
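Since equations (6) to (8) are not reproduced in this text, the sketch below uses the standard textbook forms of entropy, information gain, split information, and gain ratio, which are assumed here to match the paper's definitions. In particular, each subset is weighted by its share of the whole case set. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(Y) = -sum over classes c of p(c|Y) * log2 p(c|Y)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def partition(values, labels):
    """Group the class labels by the attribute value of each case."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return groups


def information_gain(values, labels):
    """Gain(Y, a) = Entropy(Y) - sum over v of |Yv|/|Y| * Entropy(Yv)."""
    total = len(labels)
    remainder = sum(len(group) / total * entropy(group)
                    for group in partition(values, labels).values())
    return entropy(labels) - remainder


def split_information(values):
    """SplitInfo(S, A) = -sum over i of |Si|/|S| * log2(|Si|/|S|)."""
    return entropy(values)  # same formula, applied to the attribute values


def gain_ratio(values, labels):
    """Information gain divided by split information (0 if no split occurs)."""
    split = split_information(values)
    return information_gain(values, labels) / split if split > 0 else 0.0
```

An attribute that separates normal from abnormal cases perfectly reaches the full entropy of the case set as its gain, while an attribute unrelated to the labels has a gain of zero.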
The dataset used to train the model is the MIT-BIH dataset, which contains ECG recordings. Respondents in the MIT-BIH dataset were male patients aged between 40 and 70 years [25]. The process of classifying ECG signals consists of data acquisition, data preprocessing, modeling, and evaluation. The process is depicted in Figure 3, with the following explanation.
For this study, 2-minute recordings of the MIT-BIH signals were taken, and 100 data were extracted into the RR interval, PT interval, local RR, and bpm features through the ECG device [1] and stored in a CSV file as training data and testing data. Normal and abnormal labels were then assigned based on Table 1, which lists the standard feature values of the ECG dataset. The result of data acquisition is in Table 2. The determination is given in Table 3, and the results of preprocessing are in Table 4.
The model is stored in pickle format so that it can be placed on the back-end server and called when classification is performed, as shown in Figure 4. Figure 4(a) shows model storage in pickle format. Figure 4(b) shows the model call on the server.
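Storing and reloading a model in pickle format can be sketched with Python's standard `pickle` module. The code in Figure 4 is not reproduced here, so the file name, the helper names, and the dictionary standing in for the trained random forest are all hypothetical.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize the model object to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a model back from a pickle file, e.g. on the back-end server."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical stand-in for a trained classifier; a real forest object
# would be saved and loaded in the same way.
model = {"algorithm": "random forest", "labels": ["normal", "abnormal"],
         "n_trees": 200}

path = os.path.join(tempfile.mkdtemp(), "ecg_model.pkl")
save_model(model, path)
restored = load_model(path)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.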

SMART HEALTHCARE WEB DESIGN ON THE ELECTROCARDIOGRAM (ECG) MONITORING FEATURE
In this study, the smart health care web application uses only the ECG monitoring feature. In the implementation, the web application receives ECG data from the ECG device and then performs classification, so that the web displays the ECG signal and the classification result, namely whether the heart condition of the patient/respondent is normal or abnormal. Figure 5 shows the smart healthcare web design. The web architecture is shown in Figure 5(a). On the smart healthcare web, users can view patient data, view classification results, view ECG signal results, and edit signal data. The use case is shown in Figure 5(b). The processes of the smart healthcare web system are shown in the flowchart in Figure 6.

System implementation
The smart health care web application is implemented using the Flask web application framework with the Python programming language. Flask has libraries that can be used to build the web. Figure 7 shows the smart health care web. The ECG recording data page, which lists the patients who have recorded a heart signal with the ECG device, is shown in Figure 7(a). The Signal Edit page is used to add patient data, namely age, gender, address, and recording date, because only the name is entered when recording a heart signal with the ECG device begins. Figure 7(b) shows the Signal Edit page. The prediction ECG signal page, shown in Figure 8, displays the classification result, that is, whether the patient's heart condition is normal or abnormal, together with the recorded ECG signal and the PT, Bpm, RR interval, and RR local interval values.

System testing
System testing consists of two parts. The ECG signal classification with random forest is tested using a confusion matrix, which yields the accuracy, precision, recall, and F1 scores. The smart health care application is tested with the user acceptance test (UAT) method, which tests all the functions of the web.

Electrocardiogram (ECG) signal classification test
The ECG signal classification test uses 23 ECG signal data as testing data. The classification test using the confusion matrix is presented in table form. From the confusion matrix, the accuracy, precision, recall, error, and F1 score can be calculated. The classification of ECG signals with the random forest algorithm has 96% accuracy, 100% precision, 94% recall, and an F1 score of 0.97. On the same testing data set, the random forest accuracy of 96% is higher than that of another algorithm, Naïve Bayes [11], which has an accuracy of 75%.
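The way these scores follow from a confusion matrix can be sketched in Python. The paper's confusion matrix is not reproduced in this text, so the counts in the example (16 true positives, 0 false positives, 1 false negative, 6 true negatives, 23 in total) are hypothetical values chosen only because they are consistent with the reported scores. The function name is also hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, error, precision, recall, and F1 from four counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "error": 1.0 - accuracy,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 23 test signals, not the paper's actual table.
metrics = classification_metrics(tp=16, fp=0, fn=1, tn=6)
```

With these assumed counts, the accuracy is 22/23, the precision is 16/16, and the recall is 16/17.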

Web test
Web testing is carried out using the UAT method: the ECG tool and the smart healthcare web are presented to general practitioners, who try all the existing functions. The results show that ECG signal data recording, ECG signal data management, and ECG signal classification all function properly. The results of the UAT are shown in Table 6. Based on the UAT results, all functions of the smart health care web integrated with the ECG tool work well, resulting in a web-based heart rhythm monitoring application that patients can later use independently.

CONCLUSION
The mortality rate for CHD in Indonesia is very high, around 26%, so monitoring heart rhythm for early detection of CHD is necessary. The ECG tool that was built, connected directly to the smart health care web application, can perform early detection of heart abnormalities through an ECG examination. The results of the ECG examination are displayed on the web, together with whether the patient's heart rhythm is detected as normal or abnormal. The classification method used is random forest, an ensemble method that uses several decision trees as classifiers. The accuracy of the random forest method is 96% with an F1 score of 0.97, which is considerably higher than the 75% accuracy of classification using the Naïve Bayes method. The smart health care web application was tested using the UAT, which showed that the ECG signal data recording, ECG signal data management, and ECG signal classification functions run well.