Tuesday, March 18, 2014

EMR Data access: The New Zealand National Minimum Dataset

Before I continue I want to acknowledge that, for good reason, it is difficult to get access to electronic medical records data.  That makes it very tough for statisticians who are interested in working on data analysis problems in medicine.  For those of you playing the home game, I will try to use at least moderately accessible data wherever possible, though there will be times when that is impossible.  Until something more accessible comes along, I plan to use the New Zealand National Minimum Dataset.  This dataset comes with a number of challenges, but at least it is accessible (for a fee).  The version I have contains data from patient encounters at New Zealand hospitals occurring in the years 2006 through 2012.

Positive features of this dataset:
  • You can get it.
  • New Zealand is an island, so we have the full account of admissions for most of the population.
  • The data set is moderately large - 3.55 million visits from 1.3 million patients.
Negative features of this dataset:
  • We do not have information about patient visits that occur outside of the hospital setting.
  • Lab values, medications, physicians' notes and many other interesting features of medical records are not available.
  • The data have been filtered.
  • We only know when someone has died if they died in the hospital.
There are only a few other options for analysis of medical records if you are among the majority who do not have access at your job.  There are periodic contests in which small medical records data sets are published along with a very specific task.  These include a Kaggle contest in which contestants are challenged to identify patients with diabetes, and annual contests published by i2b2.  This year's i2b2 challenge is split into two tracks, one to de-identify electronic health data and another to identify risk factors of heart disease over time.  A final data set that is worth mentioning even though it is not officially available is from the British National Health Service (NHS).  They are planning to make available all electronic health data from NHS patients.  If you are interested, keep an eye on the care.data website for details and access as they emerge.

I am lucky to have access to some medical records in my job and have therefore not made a careful study of the publicly available EHR datasets.  Readers are encouraged to post information about other publicly available record sets (even if they require a fee for access) in the comments section.

New Zealand National Minimum Dataset.
There are some features of the New Zealand data set that are important for understanding what can be done.  All ages recorded in the dataset are between 18 and 65.  This will impose some limits on our ability to look at diseases that are prevalent in children or in old age.  In addition, the gender ratios follow an unusual pattern (Figure 1).  There are vastly more women than men in the dataset between the ages of 18 and 40.  This produces some odd results when looking for relationships between diseases, as we will see in the next article.
Figure 1. Age / gender distribution.  There are vastly more visits by women between the ages of 18 and 40 than by men.

Imperfections.  This dataset has been very carefully cleaned.  For example, there is not even one missing age value and there are only 9 patient visits associated with unknown gender.  This is quite unusual in medical records and suggests that records with missing data are simply not being reported.  Keep in mind that we are working with a lot of data; if you had to manually enter the gender of 1.3 million people a total of 3.5 million times, how many missing values would there be?  How many mistakes?  To get an idea of how accurate these data are, we can look for internal discrepancies. 

Consider, for example, that a patient’s gender is entered every time they are admitted to the hospital.  We can therefore look for instances in which the gender of a patient changes from one visit to the next.  Of the 400,000 patients with two or more visits, there are 400 patients who have more than one gender on record.  Admittedly, this would be accurate for patients who have undergone a sex change, but among the 400 are 114 whose gender has changed 2 or more times.  Surely that level of indecision about one’s gender is exceedingly rare!  There are also 2 men in the record who have been admitted to the hospital to deliver a baby.
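The gender check described above can be sketched in a few lines of Python.  This is a minimal illustration, not the code used for the analysis; the tuple layout (patient id, visit date as an ISO string, gender code) and the function name are assumptions made for the example.

```python
from collections import defaultdict

def gender_discrepancies(visits):
    """Count gender changes between consecutive visits for each patient.

    visits: iterable of (patient_id, visit_date, gender) tuples, where
    visit_date is an ISO 'YYYY-MM-DD' string (hypothetical layout).
    Returns {patient_id: number_of_changes} for patients with >= 1 change.
    """
    by_patient = defaultdict(list)
    for patient_id, visit_date, gender in visits:
        by_patient[patient_id].append((visit_date, gender))

    changes = {}
    for patient_id, records in by_patient.items():
        records.sort()  # ISO date strings sort chronologically
        genders = [g for _, g in records]
        n_changes = sum(1 for a, b in zip(genders, genders[1:]) if a != b)
        if n_changes > 0:
            changes[patient_id] = n_changes
    return changes

# Toy data: p1 is consistent, p2 changes once, p3 changes twice.
visits = [
    ("p1", "2006-01-03", "F"), ("p1", "2007-05-10", "F"),
    ("p2", "2008-02-01", "M"), ("p2", "2009-03-15", "F"),
    ("p3", "2006-07-07", "F"), ("p3", "2006-09-01", "M"), ("p3", "2010-01-20", "F"),
]
result = gender_discrepancies(visits)
flipped_twice_or_more = [p for p, n in result.items() if n >= 2]
```

Patients with a single change might be legitimate; those with two or more changes are the ones most likely to reflect data entry errors.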

In addition to gender, we can look for discrepancies in age.  The record contains separate entries for age at discharge (in years) and date of discharge.  These two entries can be used to estimate a patient’s birth date to within 1 year.  There are 2,152 patients whose estimated birth dates lead to impossibilities – the lowest and highest estimates vary by more than one year.  There are 109 patients with estimated birth dates separated by more than 10 years!
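One way to run this kind of age check is sketched below.  It assumes that the recorded age is the number of completed years at discharge, so each visit places the birth date in a one-year window; a patient is flagged when no single birth date fits every visit.  The function names and the input layout are hypothetical.

```python
from datetime import date

def years_before(d, years):
    """Return the date `years` years before d (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)

def birth_window(discharge_date, age_years):
    """Birth dates consistent with one visit: (earliest exclusive, latest inclusive)."""
    return (years_before(discharge_date, age_years + 1),
            years_before(discharge_date, age_years))

def is_consistent(visits):
    """visits: list of (discharge_date, age_years) for one patient.

    True if at least one birth date is compatible with every visit.
    """
    windows = [birth_window(d, a) for d, a in visits]
    earliest = max(lo for lo, _ in windows)
    latest = min(hi for _, hi in windows)
    return earliest < latest

# Ages two years apart on visits two years apart: consistent.
ok = is_consistent([(date(2008, 6, 1), 40), (date(2010, 6, 1), 42)])
# Ages five years apart on visits two years apart: impossible.
bad = is_consistent([(date(2008, 6, 1), 40), (date(2010, 6, 1), 45)])
```

The size of the gap between the windows gives a rough measure of how far apart the implied birth dates are.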

I am pointing out flaws in the record not to disparage this particular record set – this level of accuracy is on par with other record sets that I have looked at.  The main point is that, no matter what is done with electronic medical records, it must be done with the understanding that there are errors in the data.  An algorithm or model that depends absolutely on any particular feature of the data is guaranteed to make mistakes.  How critical those mistakes are will depend heavily on the application.

Figure 2. Readmissions versus days of use.  There are generally two different ways for a patient to become “expensive” from the point of view of a hospital system: they are often readmitted within 30 days of discharge, or they have diseases that demand very long hospital stays.
Expensive patients. Based on a previous article, we know that hospitals are very keenly interested in avoiding early readmission to the hospital.  Can we identify those patients who have the highest rates of hospital admission?  One of the shocking features of healthcare is the amount of resources that can be utilized by the sickest patients.  The largest number of independent visits recorded for any single patient in the record set is 1175.  Since we are looking at seven years of data, this implies that there are individual patients who are readmitted to some hospital almost every other day!  
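The "almost every other day" figure is simple arithmetic, shown here as a quick check.  It assumes the observation window covers the full calendar years 2006 through 2012.

```python
from datetime import date

# Assumed observation window: 1 Jan 2006 through 31 Dec 2012.
window_days = (date(2013, 1, 1) - date(2006, 1, 1)).days
max_visits = 1175  # largest visit count for a single patient, from the text

days_per_visit = window_days / max_visits
print(f"{window_days} days / {max_visits} visits = {days_per_visit:.2f} days per visit")
```

That works out to roughly one admission every 2.2 days.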

There is an often-cited “80/20 rule” (attached to Pareto distributions) which states that in many real world situations 80% of the resources are controlled by 20% of the population.  This rule has been specifically applied to healthcare expenses in political debates and there is evidence that it is accurate.  However, in the New Zealand dataset we find that the 20% of patients who had the most hospital admissions accounted for only around 53% of the total admissions.  This is perhaps due to the limited number of days available in the calendar.  Another possibility is that admission to the hospital is not a good proxy for healthcare expenses.
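The share attributed to the top 20% of patients can be computed as in the sketch below.  The function name and the toy numbers are made up for illustration; they are not taken from the New Zealand data.

```python
def top_share(usage, fraction=0.2):
    """Share of total usage attributable to the top `fraction` of patients.

    usage: dict mapping patient id to a count (admissions, days, cost, ...).
    """
    counts = sorted(usage.values(), reverse=True)
    n_top = max(1, round(len(counts) * fraction))  # at least one patient
    return sum(counts[:n_top]) / sum(counts)

# Ten hypothetical patients; the top two account for 15 of 24 admissions.
admissions = {"a": 12, "b": 3, "c": 2, "d": 1, "e": 1,
              "f": 1, "g": 1, "h": 1, "i": 1, "j": 1}
share = top_share(admissions)
print(f"Top 20% of patients account for {share:.1%} of admissions")
```

The same function works for any per-patient measure of utilization, which makes it easy to compare different proxies for expense.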

Even though it is somewhat at odds with the Centers for Medicare and Medicaid Services rule about 30 day readmissions, it seems likely that total time spent in the hospital is a better proxy for total expenditure.  We have both start and end dates for all admissions in the data set, so we can compute the total amount of time spent in the hospital for each patient.  You can see along the bottom of Figure 2 that there is an obvious group of patients who have very few admissions, but who spend extraordinary amounts of time in the hospital.  Going back to the 80/20 rule, we find that 71% of the resources (hospital days) are utilized by the top 20% of patients.  This is still not quite the level reported in American media, but it is within range.  It is possible that the average cost per day of patients who are in the hospital is higher for patients who are there more;  this might explain the discrepancy. It even seems likely since those patients are probably sicker.  It is also possible that some feature of New Zealand’s medical system leads to more egalitarian utilization patterns.
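Computing total time in the hospital from start and end dates might look like the following.  This is a sketch under stated assumptions: the input layout is hypothetical, a same-day stay is counted as one day, and overlapping admissions for the same patient are not merged.

```python
from collections import defaultdict
from datetime import date

def total_days(admissions):
    """Sum hospital days per patient.

    admissions: iterable of (patient_id, start_date, end_date) tuples.
    """
    days = defaultdict(int)
    for patient_id, start, end in admissions:
        # Assumption: a same-day stay counts as one day of hospital use.
        days[patient_id] += max(1, (end - start).days)
    return dict(days)

# p1 has two short stays; p2 has one very long stay.
admissions = [
    ("p1", date(2006, 1, 1), date(2006, 1, 11)),
    ("p1", date(2007, 3, 1), date(2007, 3, 1)),
    ("p2", date(2010, 5, 1), date(2010, 12, 1)),
]
days_by_patient = total_days(admissions)
```

Ranking patients by these totals, rather than by admission counts, is what brings out the group with few admissions but very long stays.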

Based on what we’ve seen, we can identify patients who have been expensive in the interval from 2006 through 2012.  There are a host of questions that come out of this analysis, but very few answers.  Are those same patients going to be expensive in 2013?  What are the features of those patients?  Are they associated with particular diseases?  Can we tell which patients are on a path to spend a lot of time in the hospital?  If so, are there preventive medicine options that can be implemented to head that off?

With the new rules regarding preexisting conditions, insurance companies are no longer allowed to use previous expenses to set premiums.  However, some hospitals are taking on part of the expense of caring for patients.  What will they do to decrease the cost of these patients?  Ideally they will seek out more efficient and successful ways to care for them.  Hopefully the availability of data and the importance of the question will lead to solutions that work for everyone involved.  

Thursday, March 6, 2014

Disruptive Transformation for Hospital Systems and a Couple of Places where Statisticians can Help

For many years hospitals and physicians have been paid by insurance companies for each procedure they perform.  This may seem reasonable since every procedure performed, from discussing a patient's disease to performing the most complicated surgery, requires resources.  However, it creates a perverse set of financial incentives for physicians and hospital systems.  Sicker patients lead to more procedures which lead to more revenue.  The financial incentive is to make patients sicker!

Physicians and hospital administrators recognize the wrongness of this incentive structure.  Only criminally anti-social individuals would actively pursue "upselling" as a legitimate means of increasing hospital revenue.  Therefore, in order to obscure and minimize the effect of this financial incentive, physicians are shielded from the costs of the procedures they perform and hospital administrations typically do not monitor the health of their patient population.

There is a movement in healthcare to impose financial incentives for healthcare providers to make patients healthier.  Recent changes in (1) the rules by which the Centers for Medicare and Medicaid Services (CMS) must operate and (2) federal law regarding the implementation of electronic health records, are beginning to make this change a reality. 

As of 2012 CMS can work with hospitals or groups of physicians to create "Accountable Care Organizations" (ACOs).  Under this payment structure, care providers are given a fixed fee for each patient for whom they are responsible; if they can save money in the care of that patient they get to pocket the savings.  This is similar to the fee structure of Health Maintenance Organizations that were reviled by patients in the 1980s because they were financially incentivized to minimize patient interactions.

There are some key differences which, if taken advantage of, can lead to a different outcome for ACOs.  First, with ACOs and other "risk bearing" healthcare organizations there are penalties when patients do not do well.  This leads to a question into which statistics can offer insight: since everybody is different, what does it mean for a patient to be doing well?  Second, the opportunities for communication between hospital systems and patients have vastly improved since the 1980s.  Try searching for “frustrated with hospital” on Twitter and you will readily see that communication from the patient to the hospital is already very robust.  Third, due to the Affordable Care Act, there is a growing percentage of the population who are directly responsible for choosing their own health insurance.  It is natural to demand the most expensive insurance from one’s employer if a choice of health insurance provider is not part of the hiring process.  However, when an individual is deciding between plans with vastly different prices, choosing a plan that encourages maximizing the number of procedures no longer seems like the obvious choice – it shouldn’t have been anyway.

In addition to changing fee structures, as of 2012 healthcare organizations are required to maintain electronic medical records.  The original intention of this law was to encourage the free exchange of health information between providers in order to minimize duplication of effort; if a patient has an x-ray at one hospital, it should not be repeated the next day if they show up at a different hospital.  In practice, this objective has not yet been realized because every hospital has its own EMR and those systems are not interoperable – even if they were purchased from the same vendor.  However, what has been created is a vast trove of data about the health of individual patients.  The potential of this huge amount of data to affect all aspects of healthcare cannot be overstated.  In particular, risk bearing hospital systems now have both the financial incentives and the necessary data to track the health of their patient population and be proactive in the treatment of disease.  Successful physician-statistician collaborations are needed to turn this data into information that hospital systems can act upon.

Driving uptake of IT and Statistics in Healthcare

The strongest and earliest driver encouraging hospitals to implement electronic medical records originates from CMS in the form of a couple of different penalties.  The “meaningful use” requirement is being implemented in three increasingly strict phases (phase 1, phase 2 and phase 3).  Those hospitals that are deemed not to be utilizing their electronic medical records in a meaningful way will be penalized, beginning in 2015, with a 1% decrease in CMS payouts.  The penalty increases by 1% yearly up to a total of 5% for consistent failure to achieve meaningful use – tens of millions of dollars for hospitals with large Medicare and Medicaid populations.  Most of the meaningful use definitions require solutions that are straightforward even if they are technically complicated to implement.  As of early 2014, I am not aware of any “meaningful use” applications that involve statistical solutions, though I can imagine improvements to the current versions that might.  Here I discuss a statistical approach to identifying homogeneous groups of patients (one of the elements of meaningful use in phase 2).  Whether improvements like these are financially viable will depend heavily on the way that financial incentives are structured for hospitals.

The second penalty (again a 1% incrementally increasing penalty), and the one that has led the industry to seek out statistical solutions, is a reduction in payment for hospitals with too many patients who are readmitted within 30 days of discharge.  There has been an explosion of statistical models attempting to predict early readmission – there is an open access survey in JAMA for those who want greater detail.  To my knowledge, all of the models attempting to accomplish this are logistic regressions which, in their final form, rely on a clearly defined set of data (independent variables) with which to make predictions.  Implementing these models in practice is challenging because electronic records do not follow fixed standards, patient populations vary significantly between hospitals, and every hospital record system is plagued by missing and incorrectly coded data. Finally, it is not always clear how far back into a record one must go.  A typical statistical approach to modeling the time varying state of a patient is to assume that all the relevant information for predicting the future is available in the present (see hidden Markov model and memorylessness).  However, if two patients come to the hospital with skin infections, and one was diagnosed years earlier with diabetes, the severity of their infection and their chances of returning within 30 days are very different.
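Before any readmission model can be fit, each discharge has to be labeled as followed by an early readmission or not.  A minimal sketch of that labeling step is below.  It uses a plain "any admission within 30 days" rule, which is a simplification; the actual CMS measure has exclusions that are not modeled here, and the function name and input layout are hypothetical.

```python
from datetime import date

def flag_readmissions(stays, window_days=30):
    """Label each stay by whether another admission follows shortly after it.

    stays: list of (admit_date, discharge_date) tuples for one patient.
    Returns one boolean per stay, in chronological order.
    """
    ordered = sorted(stays)
    flags = []
    for i, (_, discharge) in enumerate(ordered):
        if i + 1 < len(ordered):
            next_admit = ordered[i + 1][0]
            gap = (next_admit - discharge).days
            flags.append(0 <= gap <= window_days)
        else:
            flags.append(False)  # no later stay on record
    return flags

# Second admission is 15 days after the first discharge; third is months later.
stays = [
    (date(2012, 1, 1), date(2012, 1, 5)),
    (date(2012, 1, 20), date(2012, 1, 22)),
    (date(2012, 6, 1), date(2012, 6, 3)),
]
labels = flag_readmissions(stays)
```

These labels would be the dependent variable in a logistic regression; the hard part, as noted above, is assembling reliable independent variables from the record.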

I have described some of the disruptive changes that hospitals are undergoing as a result of changing incentives and the availability of electronic health records.  The availability of this data will disrupt healthcare delivery at all levels: insurance companies, contract research organizations, pharma, regulators and consumers are all seeing (and will continue to see) disruption.


Perhaps the most exciting thing for statisticians is the availability of a vast array of statistical challenges in the healthcare industry that are financially viable, able to tolerate uncertainty and just downright fun to work on.  We will likely never get to a point where computers can be trusted to make medical decisions for patients, but a tricorder reminiscent of Star Trek might be just around the corner, and even a 1% increase in efficiency for the half-trillion dollar drug discovery industry would be tremendously valuable.