Using R and clinical heuristics to explore the Heritage Health Prize: what do we gain?

The recent opening of the Heritage Health Prize both represents a milestone and raises a cautionary flag. On the one hand, crowdsourced analytics prizes have never tackled anything so noble (not to discount predicting movie ratings), but on the other hand, are we just looking for nails because we all have hammers?

There is a great introduction to importing and preparing the data set here. What next?

If you were just planning to grind the data set straight through your Weka engine, or simply run an ensemble of 100,000 decision trees (am I allowed to say random forest in my blog?) through your Beowulf cluster, you can stop reading here. If, however, you wonder if an understanding of pathophysiology, epidemiology, and clinical medicine might yield some insight into your approach for analytics in this competition, read on.


An epidemiologist often wants to know: what is the burden of disease in a population? It would be interesting to see the prevalence of each condition in the data set. We can't measure prevalence directly, but the number of claims submitted for each condition can serve as a rough proxy.

Editor’s Note: I’m not posting any of the actual plots, since I’m not sure how that plays with the rules of the competition. All the code below will get you to the plots described, but you’ll need to get the data for yourself. If some legal person tells me it’s okay, I’m happy to post the plots later.

# load ggplot2 (used for the plots below) and the claims file
library(ggplot2)
Claims_Y1 <- read.csv('Claims_Y1.csv', header=TRUE)
# make a histogram assessing the burden of disease in the community
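
Here is a minimal sketch of that plot, not a definitive version. It assumes the claims table has a condition column called PrimaryConditionGroup (check the name in your own copy of the file), and it uses a tiny made-up data frame in place of the real claims so that it runs on its own. The plot is only built if ggplot2 is installed.

```r
# Toy stand-in for Claims_Y1. The column name PrimaryConditionGroup and the
# condition labels are assumptions for illustration, not real competition data.
claims <- data.frame(
  MemberID = c(1, 1, 2, 3, 3, 3, 4),
  PrimaryConditionGroup = c("CHF", "CHF", "Asthma", "GI bleed",
                            "CHF", "Asthma", "CHF"),
  stringsAsFactors = FALSE
)

# Count claims per condition: our proxy for the burden of disease.
condition.counts <- as.data.frame(
  table(condition = claims$PrimaryConditionGroup),
  stringsAsFactors = FALSE
)

# Most common condition first.
condition.counts <- condition.counts[order(-condition.counts$Freq), ]

# Horizontal bar chart, so long condition names stay readable.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(condition.counts) +
    geom_bar(aes(x = reorder(condition, Freq), y = Freq),
             stat = "identity", fill = "midnightblue") +
    coord_flip() +
    labs(x = "Condition", y = "Number of claims")
  # print(p) draws the chart
}
```

To use it on the real file, swap the toy data frame for Claims_Y1 and keep the rest unchanged.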

Note: If you are experiencing trouble with coord_flip() in ggplot2, there appears to be a known bug in plyr 1.5, and it will be corrected in the next release. The error looks like: Error in inherits(unit, "unit") : object 'ymax' not found

Now let’s see which of these are most commonly associated with a hospitalization.

# make histogram looking at hospitalization proportion for each condition
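
One way to get at this, as a rough sketch: flag each claim that has a length of stay, then take the mean of that flag within each condition. The column name PrimaryConditionGroup and the toy values below are assumptions for illustration. The LengthOfStay test is the same one used elsewhere in this post.

```r
# Toy stand-in for Claims_Y1; labels and values are made up.
claims <- data.frame(
  PrimaryConditionGroup = c("CHF", "CHF", "CHF", "Asthma", "Asthma",
                            "GI bleed", "GI bleed", "GI bleed"),
  LengthOfStay = c("2 days", "", "1 day", "", "", "3 days", "", ""),
  stringsAsFactors = FALSE
)

# A claim counts as a hospitalization if it has any length of stay recorded.
claims$hospitalized <- claims$LengthOfStay != ""

# Proportion of claims with a hospitalization, per condition.
hosp.prop <- aggregate(hospitalized ~ PrimaryConditionGroup,
                       data = claims, FUN = mean)

# Highest proportion first.
hosp.prop <- hosp.prop[order(-hosp.prop$hospitalized), ]
hosp.prop
```

Note that this is a proportion of claims, not of patients: a condition that generates many outpatient claims per patient will look less "hospital-prone" than it really is. The resulting data frame can be passed to ggplot with geom_bar(stat="identity") and coord_flip() for plotting.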

Now, looking at only those claims that had an inpatient stay, what does the distribution of health conditions look like?

# histogram of claims for inpatient stay only
inpatient <- subset(Claims_Y1,LengthOfStay!="")

Recent research has also shown that the majority of costs are incurred for a very small number of patients. The concept of “high utilizers” is explored well in the recent New Yorker article by Atul Gawande, who highlights important work being done in Camden to identify what he calls “hot spotters.” Does the distribution of utilization in this data set match what has been shown elsewhere? We don’t have cost data, but we can look at the number of claims per patient. Predicting hospitalizations for “hot spotters” is easy.

# find distribution of hospital utilization by claim
# find distribution of number of hospitalizations (not length) by patient 
# is there a cleaner way to do this?
Claims_Y1$hospitalized <- Claims_Y1$LengthOfStay!=""
# cross-tabulate members against the hospitalized flag; yields columns Var1, Var2, Freq
num.hosp <- as.data.frame(table(Claims_Y1$MemberID,Claims_Y1$hospitalized))
num.hosp <- subset(num.hosp, Var2==TRUE)
ggplot(num.hosp) + geom_histogram(aes(x=Freq), fill='midnightblue')
# a little easier to see hot spotters with a rescaled y-axis (can't use log-scale for zero values)
ggplot(num.hosp) + geom_histogram(aes(x=Freq), fill='midnightblue', binwidth=0.5) + scale_y_sqrt()
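
The histogram shows the shape of the distribution, but the "hot spotter" claim is really about concentration: what share of all hospitalizations comes from the top few percent of patients? Here is a small sketch of that calculation. The helper name top.share and the toy counts are my own inventions for illustration. With the real data you would pass in num.hosp$Freq.

```r
# Made-up hospitalization counts per member (stand-in for num.hosp$Freq).
freq <- c(0, 0, 0, 0, 0, 0, 1, 1, 2, 16)

# Share of all hospitalizations accounted for by the top `top` fraction
# of members, ranked by their number of hospitalizations.
# (top.share is a hypothetical helper, not part of any package.)
top.share <- function(freq, top = 0.1) {
  ranked <- sort(freq, decreasing = TRUE)
  n.top <- max(1, round(length(ranked) * top))
  sum(ranked[seq_len(n.top)]) / sum(ranked)
}

top.share(freq, top = 0.1)  # top 10% of members: 16 of 20 hospitalizations = 0.8
```

If a few percent of members account for most hospital claims here too, the data set matches the pattern described in Camden, at least by this crude measure.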

Note: Thanks to @wiknin for pointing out the missing line in the code!

Once we know which pathologies are most associated with hospitalization, we can think about the identifiers of those pathologies. The predictors of hospitalization for congestive heart failure (CHF), asthma, and a gastrointestinal bleed will all be very different. Hopefully this is where medical heuristics can inform an analytic approach. For a few of these conditions, medical researchers have already decided which are the most important variables to consider. For coronary heart disease, the ten-year probability of an event is estimated by the Framingham risk score, and for stroke in patients with atrial fibrillation, the annual risk is estimated by the CHADS2 score. Even more straightforward, a woman who has a positive urine test for the beta subunit of human chorionic gonadotropin (beta-hCG) is likely to be hospitalized within the next 36 weeks to deliver a baby. If you are lucky enough to have a quantitative level on the beta-hCG, you could even estimate how far along she is. Much of the value of this clinical research will depend on the types of laboratory and other data made available through Kaggle in the coming months.
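
To make the idea of a clinical heuristic concrete, here is CHADS2 written out as a function. The score gives one point each for congestive heart failure, hypertension, age 75 or older, and diabetes, and two points for a prior stroke or TIA. The argument names below are hypothetical: nothing in the claims file I have described carries these fields, so treat this as a sketch of what a heuristic-derived feature could look like if such data becomes available.

```r
# CHADS2 score for patients with atrial fibrillation.
# All argument names are hypothetical; the inputs are logical flags plus age.
chads2 <- function(chf, hypertension, age, diabetes, prior.stroke) {
  chf +                # C: congestive heart failure, 1 point
    hypertension +     # H: hypertension, 1 point
    (age >= 75) +      # A: age 75 or older, 1 point
    diabetes +         # D: diabetes mellitus, 1 point
    2 * prior.stroke   # S2: prior stroke or TIA, 2 points
}

# Two made-up patients: an 80-year-old with CHF, hypertension, and a prior
# stroke, and a 60-year-old with none of the risk factors.
chads2(chf = c(TRUE, FALSE),
       hypertension = c(TRUE, FALSE),
       age = c(80, 60),
       diabetes = c(FALSE, FALSE),
       prior.stroke = c(TRUE, FALSE))
# scores: 5 and 0
```

The point is not that this particular score predicts hospitalization in the competition data, only that decades of clinical research have already told us which handful of variables matter for some conditions.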

For those interested in learning more about clinical heuristics, the United States National Library of Medicine runs PubMed, an incredible free service for searching through the medical literature. I recommend learning the MeSH database, which lets you build very specific search parameters for a desired topic. Many of the papers are open-access (i.e., free), and anyone affiliated with a university will be able to access subscriptions to major medical journals.

Does your team have a consulting physician? Do you really need one?

With all the promised predictions, what do we gain in the end? In biostatistics a distinction is made between statistical and clinical significance. Garlic, for example, was shown in one trial to lower blood pressure by a statistically significant 1 mmHg. This has no clinical significance whatsoever. What is the clinical significance of improved predictive analytics for hospitalization? Dubious at best. Most physicians already know which of their patients will be back in the hospital within a few months, weeks, or even days. The “frequent flyers” in emergency rooms arrive as no surprise to the staff who routinely care for them. We already know that improved discharge education, medication reconciliation, telephone and home visit follow-ups, increased proximity of preventive services to patients with transportation barriers, access to a primary care physician, diligent case workers, community health promoters, strong support networks, and linguistically and culturally appropriate services all go a long way toward reducing hospitalizations. The challenge for the next round of health care reformers is to build institutions, networks, and incentives that support people’s health, rather than just predicting who will have disease.

I hope the prize will stimulate a discussion around its own usefulness in improving the lives of human beings. If not, $3 million could also fund a lot of nurses, physicians, community health workers, social workers, health education programs…
