Before I continue I want to acknowledge that, for good
reason, it is difficult to get access to electronic medical records data. That makes it very tough for statisticians
who are interested in working on data analysis problems in medicine. For those of you playing the home game, I will
try to use at least moderately accessible data wherever possible, though there
will be times when that is impossible. Until
something more accessible comes along, I plan to use the New Zealand National
Minimum Dataset. This dataset comes
with a number of challenges, but at least it is accessible (for a fee). The version I have contains data from patient
encounters at New Zealand hospitals occurring in the years 2006 through 2012.
Positive features of this dataset:
- You can get it.
- New Zealand is an island nation, so we have the full account of admissions for most of the population.
- The data set is moderately large - 3.55 million visits from 1.3 million patients.
Negative features of this dataset:
- We do not have information about patient visits that occur outside of the hospital setting.
- Lab values, medications, physicians' notes and many other interesting features of medical records are not available.
- The data have been filtered.
- We only know when someone has died if they died in the hospital.
I am lucky to have access to some medical records in my job
and have therefore not made a careful study of the publicly available EHR
datasets. Readers are encouraged to post
information about other publicly available record sets (even if they require
a fee for access) in the comments section.
New Zealand National Minimum Dataset.
There are some features of the New Zealand
data set that are important for understanding what can be done. All ages recorded in the dataset are between
18 and 65. This will impose some limits
on our ability to look at diseases that are prevalent in children or in old
age. In addition, the gender ratios
follow an unusual pattern (Figure 1).
There are vastly more women than men in the dataset between the ages of
18 and 40. This produces some odd
results when looking for relationships between diseases, as we will see in the
next article.
Figure 1. Age / gender distribution. There are vastly more visits by women between
the ages of 18 and 40 than by men.
Imperfections. This dataset has been very carefully
cleaned. For example, there is not even
one missing age value and there are only 9 patient visits associated with
unknown gender. This is quite unusual in
medical records and suggests that
records with missing data are simply not being reported. Keep in mind that we are working with a lot
of data; if you had to manually enter the gender of 1.3 million people a total
of 3.5 million times, how many missing values would there be? How many mistakes? To get an idea of how accurate these data are,
we can look for internal discrepancies.
Consider, for example, that a patient’s gender is entered
every time they are admitted to the hospital.
We can therefore look for instances in which the recorded gender of a patient
changes from one visit to the next. Of the 400,000 patients with two or more visits,
400 have more than one gender recorded. Admittedly,
this would be accurate for patients who have undergone a sex change, but among the 400
are 114 whose recorded gender has changed 2 or more times. Surely that many genuine
reversals are exceedingly rare! There are also 2 men in the record who have been
admitted to the hospital to deliver a baby.
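For readers who want to try this kind of consistency check themselves, here is a minimal sketch of how it might be done. The record layout (patient id, admission date, gender) and the toy values are my own illustrative assumptions, not the actual format of the National Minimum Dataset.

```python
from collections import defaultdict

# Hypothetical toy records: (patient_id, admission_date, gender).
# Field layout and values are illustrative assumptions, not the real schema.
visits = [
    ("p1", "2006-01-03", "F"),
    ("p1", "2007-05-10", "F"),
    ("p2", "2008-02-01", "M"),
    ("p2", "2008-09-15", "F"),
    ("p2", "2010-04-20", "M"),
    ("p3", "2011-11-11", "M"),
]

def gender_change_counts(records):
    """Return {patient_id: number of visit-to-visit changes in recorded gender}."""
    by_patient = defaultdict(list)
    for patient_id, admitted, gender in records:
        by_patient[patient_id].append((admitted, gender))
    changes = {}
    for patient_id, rows in by_patient.items():
        rows.sort()  # ISO-formatted dates sort chronologically as strings
        genders = [g for _, g in rows]
        # Count adjacent pairs of visits whose recorded gender differs.
        changes[patient_id] = sum(a != b for a, b in zip(genders, genders[1:]))
    return changes

changes = gender_change_counts(visits)
inconsistent = [p for p, n in changes.items() if n >= 1]  # any change at all
repeated = [p for p, n in changes.items() if n >= 2]      # changed 2 or more times
print(inconsistent, repeated)
```

Counting changes between consecutive visits, rather than just counting distinct values, is what separates a single plausible change from a record that flips back and forth.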
In addition to gender, we can look for discrepancies in
age. The record contains separate
entries for age at discharge (in years) and date of discharge. These two entries can be used to estimate a
patient’s birth date to within 1 year.
There are 2,152 patients whose estimated birth dates lead to impossibilities
– the lowest and highest estimates vary by more than one year. There are 109 patients with estimated birth dates
separated by more than 10 years!
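The reasoning behind the one-year threshold can be sketched as follows. If age is recorded in whole years, then the discharge date minus the age gives the latest possible birth date, and the true birth date lies within the year before it. For a consistent patient, all of these estimates must fall within a one-year window. The code below is a rough sketch under that assumption; the function names and toy visits are hypothetical, and dates are converted to fractional years for simplicity.

```python
from datetime import date

def latest_birth_year(discharge, age_years):
    """Latest possible birth date, as a fractional year, given age in whole years."""
    start = date(discharge.year, 1, 1)
    end = date(discharge.year + 1, 1, 1)
    fraction = (discharge - start).days / (end - start).days
    return discharge.year + fraction - age_years

def birth_estimate_spread(patient_visits):
    """Difference in years between the highest and lowest birth date estimates."""
    estimates = [latest_birth_year(d, a) for d, a in patient_visits]
    return max(estimates) - min(estimates)

# Hypothetical toy visits: (discharge_date, age_at_discharge).
consistent = [(date(2006, 3, 1), 40), (date(2009, 9, 1), 43)]
impossible = [(date(2007, 1, 15), 30), (date(2008, 1, 15), 45)]

for label, patient_visits in [("consistent", consistent), ("impossible", impossible)]:
    spread = birth_estimate_spread(patient_visits)
    # A spread above one year cannot happen if every age and date is correct.
    print(label, round(spread, 2), spread > 1.0)
```

A spread of more than one year flags the patient as having at least one wrong age or date, though it does not say which entry is the wrong one.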
I am pointing out flaws in the record not to disparage this
particular record set – this level of accuracy is on par with other record sets
that I have looked at. The main point is
that, no matter what is done with electronic medical records, it must be done
with the understanding that there are errors in the data. An algorithm or model that depends absolutely
on any particular feature of the data is guaranteed to make mistakes. How critical those mistakes are will depend
heavily on the application.
Expensive patients. Based on a previous article, we know that hospitals are very keenly interested in
avoiding early readmission to the hospital.
Can we identify those patients who have the highest rates of hospital admission?
One of the shocking features of
healthcare is the amount of resources that can be utilized by the sickest
patients. The largest number of
independent visits recorded for any single patient in the record set is
1,175. Since we are looking at seven
years of data (roughly 2,557 days), this works out to one admission about every 2.2 days:
at least one patient is admitted to some hospital almost every other day!
There is an often cited “80/20 rule” (attached
to Pareto distributions) which states that in many real world situations 80% of
the resources are controlled by 20% of the population. This rule has been specifically applied to
healthcare expenses in political debates and there is evidence
that it is accurate. However, in the
New Zealand dataset we find that the 20% of patients who had the most hospital admissions
accounted for only around 53% of all admissions.
This is perhaps because there are only so many days in the calendar, which caps the number of admissions any one patient can accumulate. Another possibility is that admission to the
hospital is not a good proxy for healthcare expenses.
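For those playing the home game, the 80/20 check is easy to reproduce: rank patients by their number of admissions and ask what fraction of the total belongs to the top 20%. Here is a minimal sketch; the function name and the toy admission counts are made up for illustration and are not taken from the dataset.

```python
import math

def top_share(values, fraction=0.2):
    """Share of the total held by the top `fraction` of entries, ranked by value."""
    ordered = sorted(values, reverse=True)
    k = math.ceil(len(ordered) * fraction)  # size of the top group, rounded up
    return sum(ordered[:k]) / sum(ordered)

# Hypothetical admission counts for ten patients (illustrative values only).
admissions_per_patient = [1, 1, 1, 1, 2, 2, 3, 4, 10, 25]

share = top_share(admissions_per_patient)
print(f"Top 20% of patients account for {share:.0%} of admissions")
```

The same function works for any per-patient quantity, so the choice of what to rank (admissions, days, or cost) is what determines the answer.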
Even though it is somewhat at odds with the Centers for
Medicare and Medicaid Services rule about 30 day readmissions, it seems likely
that total time spent in the hospital is a better proxy for total
expenditure. We have both start and end
dates for all admissions in the data set, so we can compute the total amount of
time spent in the hospital for each patient.
You can see along the bottom of Figure 2 that there is an obvious group
of patients who have very few admissions, but who spend extraordinary amounts
of time in the hospital. Going back to
the 80/20 rule, we find that 71% of the resources (hospital days) are utilized by
the top 20% of patients. This is still not quite the level
reported in American media, but it is within range. It is possible that the average cost per day
is higher for patients who spend more time in the hospital, which might explain the discrepancy.
That even seems likely, since those patients are probably sicker. It is also possible that some feature of New
Zealand’s medical system leads to more egalitarian utilization patterns.
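Computing total time in the hospital only requires the start and end date of each admission. The sketch below sums days per patient and flags patients with few admissions but very long stays, like the group along the bottom of Figure 2. The record layout, the toy data, and the thresholds (at most 2 admissions, at least 180 days) are assumptions I picked for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy admissions: (patient_id, start_date, end_date).
admissions = [
    ("p1", date(2006, 1, 1), date(2006, 1, 3)),
    ("p1", date(2006, 2, 10), date(2006, 2, 11)),
    ("p1", date(2007, 6, 5), date(2007, 6, 6)),
    ("p2", date(2008, 3, 1), date(2009, 3, 1)),
    ("p3", date(2010, 7, 4), date(2010, 7, 4)),
]

def summarize(records):
    """Return {patient_id: (number_of_admissions, total_days_in_hospital)}."""
    counts = defaultdict(int)
    days = defaultdict(int)
    for patient_id, start, end in records:
        counts[patient_id] += 1
        # Assumption: length of stay is end - start, so a same-day visit counts as 0 days.
        days[patient_id] += (end - start).days
    return {p: (counts[p], days[p]) for p in counts}

summary = summarize(admissions)

# Arbitrary illustrative thresholds: few admissions, but a long total stay.
long_stay_few_visits = [p for p, (n, d) in summary.items() if n <= 2 and d >= 180]
print(summary, long_stay_few_visits)
```

Whether a same-day visit should count as zero days or one is a judgment call, and it will shift the totals for patients with many short admissions.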
Based
on what we’ve seen, we can identify patients who have been expensive in the
interval from 2006 through 2012. There
are a host of questions that come out of this analysis, but very few
answers. Are those same patients going
to be expensive in 2013? What are the
features of those patients? Are they
associated with particular diseases? Can
we tell which patients are on a path to spend a lot of time in the hospital? If so, are there preventive medicine options
that can be implemented to head that off?
With the new
rules regarding preexisting conditions, insurance companies are no longer
allowed to use previous expenses to set premiums. However, some hospitals are taking on part of
the expense of caring for patients. What
will they do to decrease the cost of these patients? Ideally they will seek out more
efficient and successful ways to care for them. Hopefully the availability of data and the importance
of the question will lead to solutions that work for everyone involved.