Disease prediction in big data healthcare using extended convolutional neural network techniques

Received May 3, 2019; Revised Feb 2, 2020; Accepted Mar 14, 2020

Diabetes Mellitus is one of the fastest-growing fatal diseases in the world. It leads to complications that include heart disease, stroke, nerve disease, and kidney damage, so medical professionals want a reliable prediction system to diagnose diabetes. To predict diabetes at an earlier stage, different machine learning techniques are useful for examining data from different sources and summarizing valuable knowledge. Mining diabetes data efficiently is therefore a crucial concern. In this project, a medical dataset is used to predict diabetes. R-Studio and PySpark were employed as statistical computing tools for diagnosing diabetes. The PIMA Indian diabetes dataset, acquired from the UCI repository, is used for the analysis. The dataset was studied and analyzed to build an effective model that predicts and diagnoses diabetes disease earlier.


INTRODUCTION
As we know, the growth in technology helps computers produce huge amounts of data. Additionally, advancements and innovations in medical database management systems generate large volumes of medical data. The healthcare industry contains very large and sensitive data, which needs to be treated very carefully in order to benefit from it. Diabetes Mellitus is a set of associated diseases in which the human body is unable to control the quantity of sugar in the blood. It results in high blood sugar levels, either because the body does not produce sufficient insulin or because cells do not react to the insulin produced. The focus is to develop prediction models using certain machine learning algorithms. Machine Learning is an application of artificial intelligence that helps a computer learn on its own. The two classifications of ML are supervised and unsupervised. Supervised learning algorithms use past experience to make predictions on new or unseen data, while unsupervised algorithms draw inferences from unlabeled datasets. The machine learning algorithms used are:

Supervised learning techniques: Classification
Classification is the procedure of predicting the unknown class label of new data using previously known labeled data. The following is a popular classification algorithm: Random Forest.

R studio
RStudio is an Integrated Development Environment (IDE) for the R programming language, founded by J.J. Allaire. RStudio runs R through an interpreter at the command line and is used for statistical computing and graphics. RStudio has many built-in packages, so it can manipulate huge datasets for analysis.

LITERATURE REVIEW
The usage of big data for predicting diabetes has been investigated in many studies.

PROPOSED SYSTEM
We propose a classification model with boosted accuracy to predict diabetic patients. In this model, we employ different machine learning techniques, including classification, regression, and clustering. The major focus is to increase accuracy by applying a resampling technique to the well-renowned benchmark PIMA diabetes dataset, acquired from the UCI machine learning repository, which has eight attributes and one class label. The proposed framework is shown in Figure 1. The description of each phase follows.

Data selection
Data selection is a process in which the most relevant data is selected from a specific domain to derive values that are informative and facilitate learning. The PIMA diabetes dataset has eight attributes that are used to predict diabetes at an earlier stage. This dataset was obtained from the UCI repository.

Data pre-processing
Data pre-processing is a machine learning technique that transforms raw data into a suitable format. It includes data cleaning, data integration, data transformation, and data discretization.
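As an illustration of the data-cleaning step, the following minimal pure-Python sketch (an assumption for illustration, not the authors' actual pipeline) replaces physiologically impossible zero values in a numeric column with that column's median, a common repair for PIMA attributes such as glucose; the toy rows are hypothetical.

```python
from statistics import median

def impute_zeros(rows, cols):
    """Replace zero entries in the given columns with the column median,
    computed over the non-zero values only."""
    for c in cols:
        non_zero = [r[c] for r in rows if r[c] != 0]
        med = median(non_zero)
        for r in rows:
            if r[c] == 0:
                r[c] = med
    return rows

# Toy rows: [glucose, bmi]; a zero glucose reading is treated as missing.
data = [[148, 33.6], [0, 26.6], [183, 23.3], [89, 28.1]]
cleaned = impute_zeros(data, cols=[0])
```

After cleaning, the zero glucose value is replaced by the median of the remaining readings.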

Feature extraction through principal component analysis
Feature extraction is applied to the dataset to determine the most suitable set of attributes that can help achieve better classification. The set of attributes suggested by PCA is termed the feature vector. Feature reduction, or dimensionality reduction, benefits us by reducing computation and space complexity.
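To make the idea concrete, the sketch below (a pure-Python illustration under the stated assumptions, not the paper's implementation) extracts the top principal component by power iteration on the sample covariance matrix; the toy 2-D points are hypothetical.

```python
import math

def first_principal_component(data, iters=200):
    """Top principal component via power iteration on the covariance matrix."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / (n - 1)
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Points lying nearly on the line y = x: the leading direction is ~ (0.707, 0.707).
pc = first_principal_component([[1, 1], [2, 2], [3, 3.1], [4, 4]])
```

Projecting each centered row onto this unit vector gives the one-dimensional reduced feature.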

Resampling Filter
The supervised Resample filter is applied to the pre-processed dataset. Resampling is a series of methods used to reconstruct sample data sets, including training sets and validation sets. In this study, the bootstrapping resampling technique is applied to enhance accuracy.
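Bootstrapping draws a sample of the same size as the original data, with replacement. A minimal sketch (illustrative only; the study's actual filter settings are not specified here):

```python
import random

def bootstrap_sample(data, seed=None):
    """Draw a bootstrap sample: n points sampled with replacement."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(len(data))]

# A resampled training set the same size as the original; some rows repeat,
# and the left-out rows can serve as a validation set.
original = list(range(10))
sample = bootstrap_sample(original, seed=42)
```

Repeating this procedure many times yields multiple training sets, which stabilizes accuracy estimates.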

Support vector machine (SVM)
SVM is a supervised learning algorithm. The strategy is used to perform regression, classification, and outlier detection of data. SVM groups the data based on a hyperplane. The hyperplane is used to completely separate the two classes in the best way, and the maximum-margin hyperplane should be picked as the best separator. The two types of SVM classifiers used are the linear classifier and the non-linear classifier.
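The sketch below is a simplified pure-Python linear SVM trained by sub-gradient descent on the hinge loss (an illustrative stand-in for a library SVM, with hypothetical toy data and hand-picked hyperparameters):

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM via sub-gradient descent on the hinge loss.
    Labels y must be -1 or +1; lam is the regularisation strength."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # inside the margin: hinge-loss sub-gradient step
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:            # outside the margin: only the regulariser acts
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Classify by which side of the hyperplane the point falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy linearly separable data.
X = [[2, 2], [3, 3], [-2, -2], [-3, -1]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

A non-linear classifier would replace the inner products with a kernel function.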

Decision tree
The decision tree algorithm produces a classification or regression model from training data in the form of a tree structure; it uses previous data to classify or predict the class or target variable of future/new data with the help of decision rules. Decision trees can be used for both numerical and categorical data. The root node at each level is the starting point, holding the best splitting attribute at that position, on which a test is performed. The outcome of the test creates branches. A leaf node acts as the final class label or target variable used to classify or predict the new data. Classification rules are drawn from root to leaf.
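The core of tree building is choosing the best splitting attribute and threshold. The sketch below shows one such split (a decision stump) selected by weighted Gini impurity; a full tree would apply this recursively. The data and the "glucose exceeds 120" rule are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(X, y, feature):
    """Best threshold on one numeric feature by weighted Gini impurity."""
    best = (None, float("inf"))
    for t in sorted({row[feature] for row in X}):
        left = [yi for row, yi in zip(X, y) if row[feature] <= t]
        right = [yi for row, yi in zip(X, y) if row[feature] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: label is 1 when the (hypothetical) glucose value exceeds 120.
X = [[90], [100], [130], [150]]
y = [0, 0, 1, 1]
threshold, impurity = best_split(X, y, feature=0)
```

Here the split at 100 separates the classes perfectly, so the weighted impurity is zero.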

Naïve bayes
Naïve Bayes is an algorithm that performs classification tasks in the field of ML. It can perform classification very well on a dataset even if it has huge numbers of records, for both multi-class and binary classification problems. Naïve Bayes is mainly applied to text analysis and natural language processing. It works based on conditional probability, via Bayes' theorem, and can be represented as (1):

P(c|x) = P(x|c) P(c) / P(x) (1)

Here, P(c|x) is the posterior probability of class c given attributes x, P(x|c) is the likelihood, P(c) is the class prior, and P(x) is the evidence.
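For continuous attributes, a common variant models each feature with a per-class Gaussian. A minimal pure-Python sketch (illustrative only; the single "glucose" feature and its values are hypothetical):

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def fit_gaussian_nb(X, y):
    """Estimate per-class priors and per-feature Gaussian parameters."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for c, rows in by_class.items():
        prior = len(rows) / len(X)
        stats = [(mean(col), stdev(col)) for col in zip(*rows)]
        model[c] = (prior, stats)
    return model

def predict_nb(model, x):
    """argmax over classes of log P(c) + sum_j log P(x_j | c)."""
    def log_pdf(v, mu, sd):
        return -0.5 * math.log(2 * math.pi * sd * sd) - (v - mu) ** 2 / (2 * sd * sd)
    best_c, best_score = None, -math.inf
    for c, (prior, stats) in model.items():
        score = math.log(prior) + sum(
            log_pdf(v, mu, sd) for v, (mu, sd) in zip(x, stats))
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Toy feature: one (hypothetical) glucose reading per patient.
X = [[85], [90], [95], [150], [160], [170]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
```

Log-probabilities are used so products of many small likelihoods do not underflow.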

K-nearest neighbors
k-Nearest Neighbors (k-NN) is a supervised classifier. To predict the target label of a test data point, k-NN computes the distances between the new test point and the training data points, then takes a vote among the class labels of the k nearest neighbors. The value of k is normally chosen between 1 and 10.
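The prediction step described above can be sketched in a few lines of pure Python (Euclidean distance and majority vote; the toy points are hypothetical):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = sorted((math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y))
    votes = Counter(yi for _, yi in dists[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters.
train_X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
train_y = [0, 0, 0, 1, 1, 1]
label = knn_predict(train_X, train_y, [2, 2], k=3)
```

Since k-NN stores all training data and defers computation to prediction time, it is often called a lazy learner.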

Regression

Simple linear regression
Simple Linear Regression explains the relationship between one independent variable and a dependent variable in order to predict the values of the dependent variable. The simple linear regression model is represented as (2):

y = b0 + b1x (2)

Here, x (independent variable) and y (dependent variable) are the two factors involved in simple linear regression analysis, b0 is the y-intercept, and b1 is the slope.
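The least-squares estimates of b0 and b1 have a closed form, sketched below in pure Python (the toy points, which lie exactly on y = 2x + 1, are hypothetical):

```python
def simple_linear_regression(xs, ys):
    """Least-squares estimates of intercept b0 and slope b1 in y = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx   # intercept follows from the means
    return b0, b1

b0, b1 = simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
```

On exactly linear data the fit recovers the generating intercept and slope.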

Multiple linear regressions
It explains the relationship between two or more independent variables and a dependent variable to predict the values of the dependent variable. The dependent variable has continuous values, while the independent variables have discrete or continuous values. The multiple linear regression model is represented as (3):

y = p0 + p1x1 + p2x2 + … + pnxn (3)

Here, x1, x2, …, xn (independent variables) and y (dependent variable) are the factors involved in multiple linear regression analysis, p0 is the y-intercept, and p1, p2, …, pn are the slopes.

Logistic regression
Logistic Regression is a predictive analysis used when the dependent variable is categorical. It explains the relationship between one dependent variable and one or more independent variables. The various types of logistic regression are:
- Binary Logistic Regression: the categorical response has only two possible outcomes.
- Multinomial Logistic Regression: three or more outcomes without ordering.
- Ordinal Logistic Regression: three or more outcomes with ordering.
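Binary logistic regression can be fit by gradient descent on the log-loss, as in this pure-Python sketch (illustrative assumptions: a single feature, hand-picked learning rate and epoch count, toy data):

```python
import math

def train_logistic(X, y, lr=0.5, epochs=500):
    """Binary logistic regression via per-example gradient descent; y in {0, 1}."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Sigmoid of the linear score gives P(y = 1 | x).
            p = 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            err = p - yi   # gradient of the log-loss w.r.t. the score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def prob(w, b, x):
    """Predicted probability that x belongs to class 1."""
    return 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))

# Toy 1-D data: class switches from 0 to 1 between x = 2 and x = 3.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
```

Thresholding the probability at 0.5 yields the binary class prediction.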

Polynomial regression
Polynomial regression is a form of regression analysis that explains the relationship between the independent variable and the dependent variable as an nth-degree polynomial. It fits a non-linear relationship between the value of the independent variable and the conditional mean of the dependent variable. It is represented as (4):

y = a + b * x ^ n (4)

Here, y is the dependent variable, x is the independent variable, and n is the degree. Polynomial regression fits the data well when the data lies below and above a straight regression line. It minimizes the cost function and provides an optimal result for the regression.

Linear discriminant analysis
Linear Discriminant Analysis is the process of using various data items and applying functions to that set to analyze classes of objects or items separately. Image recognition and predictive analytics use Linear Discriminant Analysis.

K-means clustering
K-means is an unsupervised machine learning algorithm used to solve clustering problems by partitioning the dataset into k clusters (groups of similar objects), where k, the number of clusters, is assumed before partitioning the dataset.
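A minimal pure-Python sketch of the standard alternating procedure (Lloyd's algorithm): assign each point to its nearest centroid, then recompute centroids. The toy points and seed are hypothetical.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # initialise from random data points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to its cluster's mean.
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated toy groups of three points each.
points = [[1, 1], [1, 2], [2, 1], [9, 9], [9, 10], [10, 9]]
centroids, clusters = kmeans(points, k=2)
```

On well-separated data the algorithm converges to the two natural groups regardless of which points seed the centroids.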

Hierarchical clustering
The type of clustering algorithm which is used to build a hierarchy of clusters is called hierarchical clustering. The two types of Hierarchical Clustering are:

Agglomerative clustering
It groups objects into clusters based on their similarity. The final result is a tree representation of the objects called a dendrogram.

Divisive analysis
This is a top-down approach in which all observations begin in one cluster, and splits are performed recursively as one moves down the hierarchy. A hierarchical clustering is often represented as a dendrogram. Each cluster is represented by its centroid, and distances between clusters are calculated using a linkage criterion.
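For the agglomerative (bottom-up) direction, the sketch below repeatedly merges the two closest clusters under single linkage until the desired number remains (a pure-Python illustration with hypothetical toy points, not the study's implementation):

```python
import math

def single_linkage(points, k):
    """Agglomerative clustering: merge the two closest clusters
    (single linkage = minimum pairwise distance) until k remain."""
    clusters = [[p] for p in points]   # start with one cluster per point
    while len(clusters) > k:
        best = (None, None, math.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] += clusters[j]     # merge the closest pair
        del clusters[j]
    return clusters

points = [[1, 1], [1, 2], [2, 1], [9, 9], [9, 10]]
clusters = single_linkage(points, k=2)
```

Recording each merge and its distance yields exactly the dendrogram described above.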

RESULTS AND ANALYSIS
The Indian diabetes dataset named PIMA was used for the analysis in this study. It consists of eight independent attributes and one dependent class attribute. The study was implemented in the R programming language using RStudio. Machine learning algorithms for classification (Decision Tree, Naïve Bayes, k-NN, and Random Forest), regression (linear, multiple, logistic, LDA), and clustering (k-means, hierarchical agglomerative) were used to predict diabetes disease in its early stages, as shown in Table 1. Model performance was measured using accuracy, as shown in Figure 2; for example, Naïve Bayes reached 86% and hierarchical agglomerative clustering 74%.
Figure 2. Comparison of accuracy of various algorithms

CONCLUSION AND FUTURE WORK
Deep learning and data mining play an important role in various fields such as Artificial Intelligence (AI), Machine Learning (ML), database systems, and more. The core objective is to enhance the accuracy of the predictive model. On the PIMA dataset, almost all algorithms achieve increased accuracy, with SVM and linear regression leading over the others. In the future, advanced deep learning techniques will be used to further increase the accuracy of the algorithms.