This article describes the process I went through using diabetes readmissions data to build a model that predicts readmissions.
To be clear, this is not an effort to come up with a practical model but more of an ML learning exercise. The goal was to analyze and review the data, explore the quantitative and categorical values, build different models, and compare their performance.
The work for this article was published in a Deepnote Notebook which everyone can review. The Notebook is still a work in progress, so I expect to make improvements to it and this article if needed.
This work is not unique. The dataset is available publically, has been around for a while, and is often used to teach machine learning. You can find similar Medium articles and work shared on Kaggle.
This project focuses on diabetes readmissions and analyzes the dataset called “Diabetes 130 US hospitals for years 1999–2008” available from the University of California Irvine.
The dataset represents 10 years (1999–2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria:
- It is an inpatient encounter (a hospital admission).
- It is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis.
- The length of stay was at least 1 day and at most 14 days.
- Laboratory tests were performed during the encounter.
- Medications were administered during the encounter.
The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnoses, number of medications, diabetic medications, and number of outpatient, inpatient, and emergency visits in the year before the hospitalization.
The first 20 rows of data are pasted below:
Credited: the author
The data was submitted on behalf of the Center for Clinical and Translational Research, Virginia Commonwealth University, a recipient of NIH CTSA grant UL1 TR00058, and a recipient of the CERNER data. John Clore (firstname.lastname@example.org), Krzysztof J. Cios (email@example.com), Jon DeShazo (firstname.lastname@example.org), and Beata Strack (email@example.com). This data is a de-identified abstract of the Health Facts database (Cerner Corporation, Kansas City, MO).
Cerner Corporation makes an Electronic Health Record system used in many hospitals and outpatient clinics. The data was extracted and compiled from the Cerner EHR systems of many hospitals. The initial work was done to facilitate research by Dr. Clore and was funded by the NIH; Dr. Clore’s goal was specifically to study the impact of HbA1c results on readmissions (Strack, 2014). The paper by Strack was also used as a reference to better understand the data and the various diagnostic codes, and it guided the pre-processing necessary for good model implementations.
We have a few goals with this analysis:
- The high-level goal is, of course, to set up a model that could predict if a patient will be readmitted within 30 days.
- To accomplish this, we will perform various steps to prepare the data.
- We will also build multiple models of different types and compare their performance.
The data preprocessing was handled in a few steps.
First, based on my reading of the paper and a simple glance at the data (head output pasted above), I elected to drop the following columns:
- The weight included very few values.
- The medical_specialty was challenging to decode without more information.
- Same for the payer_code, which was also unlikely to impact my models.
- The encounter_id is a unique value for this set.
- The patient_nbr (a patient code) was not required.
If I redid this work or expanded on it, I would include more exploration around the patient code, which could help ensure data quality: checking that the same code never appears with a different gender or race, for example, would be good quality control. It was not used in this project.
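The column drops can be sketched in pandas. This is only a sketch; the column names follow the dataset's published schema, but the exact call in my Notebook may differ:

```python
import pandas as pd

# Columns dropped: sparse (weight), hard to decode (medical_specialty,
# payer_code), or pure identifiers (encounter_id, patient_nbr).
DROP_COLS = ["weight", "medical_specialty", "payer_code",
             "encounter_id", "patient_nbr"]

def drop_unused_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the dataframe without the unused columns."""
    return df.drop(columns=[c for c in DROP_COLS if c in df.columns])
```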
Next, I tested a new module called pandas_profiling, which turned out to be a very effective way to review a large amount of data quickly. The tool produces a widget that gives a great overview of the data. The output of the first pass of this profiling tool can be seen here: profile1.html.
Using the profiling results, I completed the preprocessing of the data.
Categorical Data (discrete variables)
The gender, male or female, is classic categorical data. I normalized this feature by creating a single “isFemale” binary feature. I also removed the one patient of unknown gender. We can then drop the gender feature.
With race, I simply created categorical binary features (“dummies”) for each race.
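Both transformations can be sketched as follows (assuming the dataset's gender and race column names; “Unknown/Invalid” is the dataset's code for unknown gender):

```python
import pandas as pd

def encode_gender_race(df: pd.DataFrame) -> pd.DataFrame:
    """Binarize gender into isFemale and one-hot encode race."""
    # Drop the rows with unknown gender (one patient in this dataset)
    df = df[df["gender"] != "Unknown/Invalid"].copy()
    df["isFemale"] = (df["gender"] == "Female").astype(int)
    df = df.drop(columns=["gender"])
    # One binary "dummy" column per race value
    return pd.get_dummies(df, columns=["race"], prefix="race")
```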
The diagnosis features required a bit more work. There are three independent diagnostics (diag_1, diag_2, and diag_3), and each can take over 700 different values. For this purpose, I simplified each diagnostic into a few relevant categories.
For numeric icd9 codes, I followed information from the paper (Strack, 2014) going with the data:
“The following abbreviations are used for particular icd9 codes: “circulatory” for icd9: 390–459, 785, “digestive” for icd9: 520–579, 787, “genitourinary” for icd9: 580–629, 788, “diabetes” for icd9: 250.xx, “injury” for icd9: 800–999, “musculoskeletal” for icd9: 710–739, “neoplasms” for icd9: 140–239, “respiratory’’ for icd9: 460–519, 786, and “other” for otherwise.”
The above gave me the blueprint needed to map the “relevant” codes and leave most of the rest as “other.” A few of the diagnostic codes start with E or V; not knowing how those should be mapped, I left them as “other.” Once converted to actual “textual” diagnoses, I converted those into binary categorical variables that are easy to use in my model:
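A mapping function along these lines is a reasonable sketch: the grouping follows the ranges quoted above, and E/V codes fall through to “other”:

```python
def icd9_category(code: str) -> str:
    """Map a raw icd9 code string to a coarse diagnosis group (Strack, 2014)."""
    if code.startswith(("E", "V")):        # unmapped supplementary codes
        return "other"
    try:
        value = float(code)
    except ValueError:
        return "other"
    if 250 <= value < 251:                 # 250.xx
        return "diabetes"
    if 390 <= value <= 459 or value == 785:
        return "circulatory"
    if 460 <= value <= 519 or value == 786:
        return "respiratory"
    if 520 <= value <= 579 or value == 787:
        return "digestive"
    if 580 <= value <= 629 or value == 788:
        return "genitourinary"
    if 140 <= value <= 239:
        return "neoplasms"
    if 710 <= value <= 739:
        return "musculoskeletal"
    if 800 <= value <= 999:
        return "injury"
    return "other"
```

Applying this function to diag_1, diag_2, and diag_3 yields the textual categories that are then dummy-encoded.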
The drug columns, numerous and hard to follow, also made this more challenging. To start, I removed all the drugs that were rarely present, like examide, citoglipton, and metformin-rosiglitazone. Next, for the more common drugs, I created a binary variable where NO is FALSE and either “Down,” “Steady” or “Up” means the drug was “present,” so TRUE.
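Sketched with pandas; the rare-drug list here shows only the three examples named above, while the full list came from the profiling pass:

```python
import pandas as pd

# Illustrative subset of rarely-present drug columns to remove
RARE_DRUGS = ["examide", "citoglipton", "metformin-rosiglitazone"]

def binarize_drugs(df: pd.DataFrame, drug_cols) -> pd.DataFrame:
    """Drop rare drug columns; map No -> False, Down/Steady/Up -> True."""
    df = df.drop(columns=[c for c in RARE_DRUGS if c in df.columns])
    for col in drug_cols:
        if col in df.columns:
            df[col] = df[col] != "No"     # any administration counts as present
    return df
```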
A1C and GluSerum tests
According to the paper (Strack, 2014), an A1C result >8 would be considered abnormal, as would >300 for the GluSerum tests. I decided to combine these two to reduce the number of categorical variables. I replaced the coded results with “Abnorm” and created binary dummies from these. I dropped the resulting *_None binary, leaving A1Cresult_Norm, A1Cresult_Abnorm and glu_serum_Norm, glu_serum_Abnorm. I felt this would make it easy to model.
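This can be sketched as below, assuming the dataset's raw column names (A1Cresult and max_glu_serum) and assuming the intermediate codes (>7, >200) are also treated as abnormal:

```python
import pandas as pd

def encode_tests(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse test results to Norm/Abnorm, dummy-encode, drop *_None."""
    mapping = {">7": "Abnorm", ">8": "Abnorm",
               ">200": "Abnorm", ">300": "Abnorm", "Norm": "Norm"}
    df = df.copy()
    for col in ("A1Cresult", "max_glu_serum"):
        df[col] = df[col].map(mapping).fillna("None")
    df = pd.get_dummies(df, columns=["A1Cresult", "max_glu_serum"])
    # Keep only the Norm/Abnorm indicators
    return df.drop(columns=[c for c in df.columns if c.endswith("_None")])
```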
Change and diabetesMed
These two features simply need to be converted:
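A minimal sketch of that conversion, assuming the dataset's “Ch”/“No” and “Yes”/“No” codings:

```python
import pandas as pd

def encode_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Convert change ('Ch'/'No') and diabetesMed ('Yes'/'No') to 0/1."""
    df = df.copy()
    df["change"] = (df["change"] == "Ch").astype(int)
    df["diabetesMed"] = (df["diabetesMed"] == "Yes").astype(int)
    return df
```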
After finalizing my review, I elected to also remove a few more columns that I was not convinced would yield useful information for this model.
I could not easily find a meaningful mapping for these IDs to give the information meaning. To avoid compromising the model, I dropped the columns.
Readmissions (our target or outcome)
The outcome, or the feature we are looking to predict, is readmission. The current dataset includes a variable readmitted, which has a few possible values:
- NO for never readmitted
- >30 if readmitted after 30 days
- <30 if readmitted within 30 days
I created a single dummy binary variable, “readmitted,” which is TRUE for patients who were readmitted within 30 days, the outcome we are looking to predict.
So we are looking to model a discrete outcome.
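The target construction can be sketched as:

```python
import pandas as pd

def encode_outcome(df: pd.DataFrame) -> pd.DataFrame:
    """Binary target: True only for readmission within 30 days ('<30')."""
    df = df.copy()
    df["readmitted"] = df["readmitted"] == "<30"
    return df
```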
Quantitative Data (continuous variables)
Age was presented in ranges. I converted these ranges back to a numerical value using the decade:
- 10–20 == 10
- 20–30 == 20
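A sketch of the conversion, assuming the raw ranges are coded as strings like '[10-20)' as in the published dataset:

```python
import pandas as pd

def age_to_decade(df: pd.DataFrame) -> pd.DataFrame:
    """Convert age ranges like '[10-20)' to their lower bound (10)."""
    df = df.copy()
    # Grab the first run of digits in each range string
    df["age"] = df["age"].str.extract(r"(\d+)", expand=False).astype(int)
    return df
```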
The other numerical variables did not require special processing.
I re-ran the profiling tool after all the changes to my imported dataset to validate the results and review the work. The results of this second profiling can be seen here profile2.html.
The first and most obvious visualization was to take a quick look at my numerical values, see the ranges and the spreads:
Credited: the author
Next, I plotted my outcome variable to see if the outcome was in balance or not.
Credited: the author
The outcome is not quite balanced, with a very high percentage of cases not requiring readmission within 30 days. My final chart looked more closely at the “time_in_hospital” variable as it related to readmissions. I plotted the time_in_hospital value using a Seaborn KDE plot. A kernel density estimate (KDE) plot is a method for visualizing the distribution of observations in a dataset, analogous to a histogram; KDE represents the data using a continuous probability density curve in one or more dimensions.
Credited: the author
It is interesting to note that as the number of days in hospital increases, the readmission rate decreases. From this graph, it seems that patients admitted for 2 to 4 days are more likely to be readmitted. I considered exploring more of the continuous variables using a similar technique, but I was pressed for time.
We used a K-means clustering algorithm to evaluate cluster size with our data. The plan was to explore the resulting clusters with hierarchical clustering. The K-means results clearly indicate no improvement past 4 clusters, so this became our target for clustering approaches.
Credited: the author
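The elbow evaluation can be sketched with scikit-learn by computing the inertia (within-cluster sum of squares) for each k and looking for where the curve flattens:

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_inertias(X, k_max=8, seed=0):
    """Fit KMeans for k = 1..k_max and return the inertia at each k;
    the 'elbow' where the gains flatten suggests the cluster count."""
    return [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
            for k in range(1, k_max + 1)]
```

Plotting these values against k produced the chart above; in this project the curve flattened at k=4.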
For the k=4 K-means model, I studied and plotted the centroids:
Credited: the author
Model training and prediction
Our outcome is discrete (readmitted YES or NO), so I will be comparing three different types of models, four models in total:
- Logistic regression (logit)
- I re-did the logistic regression after removing some variables, iteratively dropping the feature with the highest p-value until all remaining p-values were under 0.05. (newlogit)
- Classification trees (fullClassTree)
- Finally, an artificial neural network: a Keras ANN (annmodel)
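The p-value-driven reduction behind newlogit can be sketched with an (almost) unregularized scikit-learn fit and Wald tests computed from the Fisher information; this is an illustrative sketch, not the Notebook's exact code:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

def logit_p_values(X, y):
    """Wald p-values for each feature of a near-unregularized logistic fit."""
    model = LogisticRegression(C=1e9, max_iter=5000).fit(X, y)
    p = model.predict_proba(X)[:, 1]
    X1 = np.hstack([np.ones((len(X), 1)), X])             # prepend intercept
    cov = np.linalg.inv(X1.T @ (X1 * (p * (1 - p))[:, None]))
    coefs = np.concatenate([model.intercept_, model.coef_.ravel()])
    z = coefs / np.sqrt(np.diag(cov))
    return 2 * stats.norm.sf(np.abs(z))[1:]               # skip the intercept

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the feature with the highest p-value until all are below alpha."""
    names, keep = list(names), list(range(X.shape[1]))
    while keep:
        pvals = logit_p_values(X[:, keep], y)
        worst = int(np.argmax(pvals))
        if pvals[worst] < alpha:
            break
        keep.pop(worst)
    return [names[i] for i in keep]
```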
We had to choose models that handle discrete results, since our predicted outcome is a Y/N question (the readmission of patients based on variables contained in their health records). In machine learning, there is the “No Free Lunch” theorem: in a nutshell, no one algorithm works best for every problem, and this is especially relevant for supervised learning (i.e., predictive modeling).
To start, we could discount models primarily focused on continuous output (numerical or quantitative predictions). We ignored:
- Multiple linear regression
- Regression Trees
The models we selected all offer an array of different features, and each performs best in certain circumstances.
Classification models are those designed for discrete outcomes. Logistic regression is the classification alternative to linear regression. The predictions are mapped to 0 and 1 using a logistic function.
- Strengths: Outputs have a nice probabilistic interpretation, and the algorithm can be regularized to avoid overfitting. Logistic models can be updated easily with new data using stochastic gradient descent.
- Weaknesses: Logistic regression tends to underperform when there are multiple or non-linear decision boundaries. They are not flexible enough to naturally capture more complex relationships.
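The logistic function itself is a one-liner; it squashes the linear combination of features into a probability:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```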
Classification trees are the classification alternative to regression trees. They are based on “decision trees.”
- Strengths: As with regression, classification tree ensembles also perform very well in practice. They are robust to outliers, scalable, and able to naturally model non-linear decision boundaries thanks to their hierarchical structure.
- Weaknesses: Unconstrained, individual trees are prone to overfitting, but this can be alleviated by ensemble methods.
Deep learning is often used for continuous output but can also be easily adapted to classification problems. We chose a Keras-ANN model, in large part based on past experience.
- Strengths: Deep learning performs very well when classifying audio, text, and image data, which is not the case here; our dataset was simpler.
- Weaknesses: As with regression, deep neural networks require very large amounts of data to train, so they are not treated as general-purpose algorithms.
After building and running my models, I used the confusion matrix and the accuracy to compare the results.
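Both metrics come straight from scikit-learn; a minimal sketch:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate(y_true, y_pred):
    """Confusion matrix (rows = actual, columns = predicted) and accuracy."""
    return confusion_matrix(y_true, y_pred), accuracy_score(y_true, y_pred)
```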
Confusion matrix and accuracy for the full Logistic Regression (logit):

Confusion Matrix (Accuracy 0.8844)
        Prediction
Actual      0     1
     0  34627    75
     1   4457    62

Confusion matrix and accuracy for the reduced Logistic Regression (newlogit):

Confusion Matrix (Accuracy 0.8846)
        Prediction
Actual      0     1
     0  34631    71
     1   4457    62

Confusion matrix and accuracy for the Classification Tree (fullClassTree):

Confusion Matrix (Accuracy 0.8226)
        Prediction
Actual      0     1
     0  31789  2913
     1   4045   474

Accuracy for the ANN model (annmodel): 0.8914
The gains chart for the first three models confirmed the performance of the full and the reduced Logistic Regressions was the same. It also confirmed that the performance of the Classification Tree was inferior:
Credited: the author
ROC and AUC
Credited: the author
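The AUC values behind the chart can be computed directly from each model's predicted probabilities; a sketch with scikit-learn:

```python
from sklearn.metrics import roc_auc_score, roc_curve

def model_roc(y_true, y_scores):
    """ROC curve points plus the area under the curve (0.5 = chance)."""
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    return fpr, tpr, roc_auc_score(y_true, y_scores)
```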
A few conclusions can be made:
- The lowest-performing model was the Classification Tree. I wish I had more time to revisit why, as it was not what I was expecting. But my Logistic Regression and Keras-ANN models quickly outperformed the Classification Tree, so I spent less time on it.
- Removing the high p-value variables kept the results consistent (if not slightly improved); given the savings in computing resources, the reduced model would be preferable. I was expecting the results to improve by removing those features, but in the end, getting the same high accuracy with fewer features still has value.
- The Keras-ANN model was more computationally intensive but yielded the highest accuracy. Interestingly, its AUC was lower than the Classification Tree's at low epochs (<10) and surpassed the Logistic Regression once epochs exceeded 70. My final results used epochs=100.
The work by Strack, for which the dataset was created, concluded that obtaining HbA1c results for a patient could be considered a useful predictor of readmission. This seems supported by this experiment, since the A1C_Abnorm feature remained in the regression with a p-value lower than 0.05. They also used a logistic regression model with a p-value breakpoint to select relevant features. Unlike this project, they used pairwise comparisons with and without A1C results to focus their study on the impact of this feature.
“First, we fitted a logistic model with all variables but HbA1c. We refer to this model as the core model. Second, we added HbA1c to the core model. Third, we added pairwise interactions to the core model (without HbA1c) and kept only the significant ones. Finally, we added pairwise interactions with HbA1c, leaving only the significant ones in the final model.”
In the end, I would choose the Logistic Regression (linear classification) for this task. It was very quick and easy to set up, reran easily, and gave consistent results with a better AUC than the other approaches.
I recently discovered Evidently AI, which makes tools to analyze and monitor machine learning models. I would like to redo some of this work using some of their tools and compare the results.
I was hoping it would make the review more consistent and keep the code cleaner. As evidenced by my code, I struggled with the confusion matrix for the ANN model, for example.
I was also hoping it would facilitate modifying models, varying parameters, and comparing results.
Strack, B., DeShazo, J. P., Gennings, C., Olmo, J. L., Ventura, S., Cios, K. J., & Clore, J. N. (2014). Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. BioMed Research International, 2014, 781670.
“Modern Machine Learning Algorithms: Strengths and Weaknesses.” EliteDataScience, 9 June 2020, elitedatascience.com/machine-learning-algorithms.
Shmueli, Galit, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel. Data Mining for Business Analytics. Wiley. Kindle Edition.