This project uses Python to carry out a visual and descriptive analysis of the relationship between various medical parameters and diabetes. For the inferential analysis it uses scikit-learn: the data are standardized, a logistic regression model predicts the test set, and the model is evaluated with the confusion matrix and accuracy.
Main conclusions: Of the 768 people in the data set, 268 have diabetes and 500 do not, a prevalence of 34.90%. Mean glucose concentration, diastolic blood pressure, skinfold thickness, serum insulin, body mass index, and diabetes pedigree function are all higher in the diabetic group than in the non-diabetic group. Patients are generally between 27 and 47 years old and have had between 1 and 8 pregnancies. The parameters most strongly related to diabetes are glucose, insulin, BMI, and skin_thick. With the logistic regression model, 124 of the 154 Pima Indian women in the test set were predicted correctly, an accuracy of 80.5%.
One. Introduction to the data set
The data set comes from the National Institute of Diabetes and Digestive and Kidney Diseases in the United States. Its purpose is to predict whether a patient has diabetes from existing diagnostic measurements. The data set has limitations, however: in particular, all patients in it are Pima Indian women aged 21 or older.
Two. Project analysis
1 Ask the question
Can the existing data be used to accurately predict whether a person has diabetes?
2 Understanding the data
Take a quick look at the contents of the table. First, rename the columns so that they are easier to understand and use.
The column names of the renamed table are as follows:
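The renaming step can be sketched as follows. This is a minimal illustration, not the author's code: the original headers are assumed to match those found in common public copies of the CSV, and the short names `pregnancies`, `diabetes_pedigree`, and `age` are hypothetical (only `glucose`, `blood_pressure`, `skin_thick`, `insulin`, `BMI`, and `outcome` appear in this write-up).

```python
import pandas as pd

# Headers as found in many public copies of the CSV (assumed; the article does not list them).
raw_columns = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]
# glucose, blood_pressure, skin_thick, insulin, BMI and outcome appear in the text;
# pregnancies, diabetes_pedigree and age are hypothetical short names.
new_columns = [
    "pregnancies", "glucose", "blood_pressure", "skin_thick", "insulin",
    "BMI", "diabetes_pedigree", "age", "outcome",
]

# Two made-up rows stand in for the real file, which would be loaded with pd.read_csv.
df = pd.DataFrame(
    [[1, 100, 70, 20, 80, 25.0, 0.3, 30, 0],
     [2, 150, 80, 30, 0, 33.0, 0.6, 45, 1]],
    columns=raw_columns,
)
df = df.rename(columns=dict(zip(raw_columns, new_columns)))
print(df.columns.tolist())
```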
2.1 View data overview, data type
The minimum values of glucose, blood_pressure, skin_thick, insulin, and BMI should not be 0. A 0 in these columns most likely means that no value was entered, so these columns contain missing data. Processing plan: 1. Convert the 0 values in these columns to NaN; 2. Compute the mean of each column separately for each outcome group; 3. Fill the missing values with the corresponding group mean.
The data types are all numeric types.
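A check along these lines would surface the suspicious zeros. It is a sketch on made-up rows rather than the real table, and the variable names are illustrative.

```python
import pandas as pd

# Made-up rows for illustration only; the real table has 768 rows.
df = pd.DataFrame({
    "glucose": [148, 0, 183, 89],
    "blood_pressure": [72, 66, 0, 66],
    "skin_thick": [35, 29, 0, 23],
    "insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "outcome": [1, 0, 1, 0],
})
suspect_cols = ["glucose", "blood_pressure", "skin_thick", "insulin", "BMI"]

print(df.dtypes)                                # every column should be numeric
col_min = df[suspect_cols].min()                # a minimum of 0 is physiologically implausible
zero_counts = (df[suspect_cols] == 0).sum()     # how many zeros each column hides
print(zero_counts)
```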
3 Data cleaning
3.1 Data preprocessing
Replace the 0 values of glucose, blood_pressure, skin_thick, insulin, and BMI with NaN values.
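One way to do this replacement with pandas is sketched below on made-up rows. The `outcome` column is deliberately left out, because a 0 there is a real label and not a missing value.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "glucose": [148, 0, 183, 89],
    "blood_pressure": [72, 66, 0, 66],
    "skin_thick": [35, 29, 0, 23],
    "insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "outcome": [1, 0, 1, 0],
})
cols_with_fake_zeros = ["glucose", "blood_pressure", "skin_thick", "insulin", "BMI"]

# Turn the placeholder zeros into NaN, one column at a time.
for col in cols_with_fake_zeros:
    df[col] = df[col].replace(0, np.nan)

print(df.isna().sum())
```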
Define a custom function that computes the mean of each column for each outcome group.
Fill the missing glucose values.
Fill the missing blood_pressure values.
Fill the missing skin_thick values.
Fill the missing insulin values.
Fill the missing BMI values.
After filling, check each column for missing values again; the data set now has none.
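The filling steps above could look roughly like this. The helper `fill_with_group_mean` is a hypothetical name, not the author's function, and the rows are made up. It fills each missing value with the mean of the same column within the same outcome group.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; NaN marks values that were 0 in the raw file.
df = pd.DataFrame({
    "glucose": [148.0, np.nan, 183.0, 89.0, 100.0, np.nan],
    "BMI": [33.6, 26.6, np.nan, 28.1, np.nan, 30.0],
    "outcome": [1, 0, 1, 0, 0, 1],
})

def fill_with_group_mean(frame, column, group_col="outcome"):
    """Return `column` with NaN replaced by the mean of its outcome group."""
    group_means = frame.groupby(group_col)[column].transform("mean")
    return frame[column].fillna(group_means)

for col in ["glucose", "BMI"]:
    df[col] = fill_with_group_mean(df, col)

print(df.isna().sum())   # expect no missing values left
```

Filling by outcome group keeps the diabetic and non-diabetic distributions apart, at the cost of using the label during preprocessing, which can make the later evaluation look slightly better than it would on unlabeled data.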
3.2 Feature Selection
Feature selection uses the correlation coefficient method. Every parameter is positively correlated with the outcome, and the correlations are relatively high, so all parameters are kept for prediction. The three parameters most strongly correlated with the outcome are glucose, insulin, and BMI, with correlation coefficients of 0.50, 0.41, and 0.32, respectively.
It is also worth noting that two pairs of features are strongly positively correlated with each other: glucose with insulin, and skin_thick with BMI.
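A correlation ranking of this kind can be computed with `DataFrame.corr`. The sketch below uses made-up rows, so its coefficients are not the ones reported above; it only shows the mechanics.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "glucose": [85, 90, 95, 100, 140, 150, 160, 170],
    "BMI": [22.0, 30.0, 25.0, 33.0, 28.0, 36.0, 31.0, 35.0],
    "age": [30, 50, 25, 45, 35, 28, 52, 40],
    "outcome": [0, 0, 0, 0, 1, 1, 1, 1],
})

corr = df.corr()   # Pearson correlation between every pair of columns
# Rank the features by their correlation with the outcome, strongest first.
ranking = corr["outcome"].drop("outcome").sort_values(ascending=False)
print(ranking)
```

The same `corr` matrix also holds the feature-to-feature correlations, for example `corr.loc["glucose", "BMI"]`.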
Check the descriptive statistics of the whole data set before standardization.
View the descriptive statistics of the non-diabetic group.
View the descriptive statistics of the diabetic group.
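These three views can be produced with `describe`, once on the whole table and once per outcome group. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "glucose": [85, 90, 95, 100, 140, 150, 160, 170],
    "BMI": [22.0, 30.0, 25.0, 33.0, 28.0, 36.0, 31.0, 35.0],
    "outcome": [0, 0, 0, 0, 1, 1, 1, 1],
})

overall = df.describe()                          # whole data set
healthy = df[df["outcome"] == 0].describe()      # non-diabetic group
diabetic = df[df["outcome"] == 1].describe()     # diabetic group

print(healthy.loc["mean"])
print(diabetic.loc["mean"])
```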
3.3 Data Standardization
To make the features comparable, the data need to be standardized. The following uses scikit-learn's StandardScaler for this. The standardized data are as follows.
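A minimal sketch of the standardization step, on a made-up two-column matrix: `StandardScaler` subtracts each column's mean and divides by its standard deviation, so every column ends up with mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix (columns could be glucose and BMI).
X = np.array([
    [85.0, 22.0],
    [90.0, 30.0],
    [140.0, 28.0],
    [170.0, 36.0],
])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)   # learn mean and std per column, then rescale

print(X_std.mean(axis=0))   # approximately 0 for every column
print(X_std.std(axis=0))    # approximately 1 for every column
```

This write-up standardizes before splitting the data. A common alternative is to fit the scaler on the training set only and then apply it to the test set, so that no test-set statistics reach the model.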
4 Build a model
First check the number of diabetic patients in the data set. Of the 768 people, 268 have diabetes and 500 do not, a prevalence of 34.90%.
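The prevalence figure follows directly from the two counts, as this short check shows:

```python
sick, healthy = 268, 500          # counts reported in the text
total = sick + healthy            # 768 people in the data set
prevalence = sick / total

print(f"{prevalence:.2%}")        # prints 34.90%
```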
4.1 Building a prediction model
Split the data into an 80% training set and a 20% test set. The training set has 614 rows, which are used to train the model. The test set has 154 rows, which are used to evaluate the model's predictions.
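The split sizes can be reproduced with scikit-learn's `train_test_split`. In this sketch the feature matrix is a placeholder of zeros with the right shape, and `random_state=0` is an assumed seed, not one taken from the original project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the real table: 768 rows, 8 features.
X = np.zeros((768, 8))
y = np.array([1] * 268 + [0] * 500)   # 268 diabetic, 500 non-diabetic

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0   # random_state is an assumed seed
)
print(len(X_train), len(X_test))   # 614 154
```

Passing `stratify=y` would keep the proportion of diabetic cases the same in both sets. The write-up does not say whether this was done.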
4.2 Training model
Train the model with the logistic regression algorithm.
1. Confusion matrix
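Training and prediction with scikit-learn's `LogisticRegression` can be sketched as below. The data are synthetic (two standardized-looking features drawn from shifted normal distributions) and the default hyperparameters are an assumption, since the write-up does not state which settings were used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the standardized features: class 1 is shifted upward.
rng = np.random.default_rng(0)
X0 = rng.normal(-1.0, 1.0, size=(100, 2))   # non-diabetic-like samples
X1 = rng.normal(1.0, 1.0, size=(100, 2))    # diabetic-like samples
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

model = LogisticRegression()            # default settings (assumed)
model.fit(X, y)

pred = model.predict(X)                 # hard 0/1 predictions
proba = model.predict_proba(X)[:, 1]    # estimated probability of class 1
print(model.score(X, y))                # accuracy on the data it was fit on
```

In the actual project the model would be fit on the training split and `predict` would be called on the held-out test split.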
As the confusion matrix shows, among the 154 Pima Indian women in the test set, 95 non-diabetic women were correctly identified as not having diabetes, 29 diabetic patients were correctly identified as having diabetes, 12 diabetic patients were wrongly identified as not having diabetes, and 18 non-diabetic women were wrongly identified as having diabetes.
Three. Conclusion
Conclusion of descriptive analysis: Of the 768 people, 268 have diabetes and 500 do not, a prevalence of 34.90%. Mean glucose concentration, diastolic blood pressure, skinfold thickness, serum insulin, body mass index, and diabetes pedigree function are all higher in the diabetic group than in the non-diabetic group. Patients are generally between 27 and 47 years old and have had between 1 and 8 pregnancies.
Conclusion of inferential analysis: The parameters most strongly related to diabetes are glucose, insulin, BMI, and skin_thick. Of the 154 Pima Indian women in the test set, 124 were predicted correctly, an accuracy of 80.5%.
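The 124-of-154 figure can be re-derived from the four confusion-matrix counts reported in this write-up (95, 18, 12, 29). The sketch below rebuilds label vectors with exactly those counts, since the real predictions are not available, and passes them to scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

tn, fp, fn, tp = 95, 18, 12, 29   # counts reported in the text

# Rebuild label vectors that reproduce those counts (order does not matter).
y_true = np.array([0] * tn + [0] * fp + [1] * fn + [1] * tp)
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

cm = confusion_matrix(y_true, y_pred)   # rows: actual 0/1, columns: predicted 0/1
accuracy = accuracy_score(y_true, y_pred)
recall = tp / (tp + fn)                 # share of diabetic patients that were caught
precision = tp / (tp + fp)              # share of positive predictions that were right

print(cm)
print(round(accuracy, 4))               # 0.8052
```

Accuracy alone hides the fact that 12 of the 41 diabetic patients in the test set were missed, which is why recall and precision are computed alongside it.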
Four. Follow-up improvements
The accuracy of this prediction is 80.52%. To improve it, the following measures can be taken:
1. Data collection: add data for men and for people under the age of 21, and increase the number of people in each age group;
2. Feature engineering: improve the model's inputs, for example by extracting the effect of combinations of parameters on the outcome.
Edited on 2019-08-21