Another good article talking about the quality and reproducibility issues of AI in healthcare.
d41586-023-00023-2.pdf (293.9 KB)
I struggle with these types of articles. The problem of overfitting, and the need for at least separate training and testing data sets, is laboriously described during training. The problem of folks feeding random images or data into the model has been hilariously described. No one wants to be that person. All standard stats and ML/AI school stuff.
So why is health so dumbfounded at rediscovering the wheel and finding that it needs to be round, as if this were a magnificent discovery? I am sure they mean well, but this seems not so much a travelling circus as a mobile advertisement for the need for a training programme for Health Informaticians and Clinical Data Scientists, leading to registration.
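For anyone who wants the textbook point in a few lines, here is a minimal sketch (purely illustrative: synthetic data and scikit-learn assumed, nothing from the article) of why a held-out test set is the basic defence against overfitting - the training-set score looks wonderful, the held-out score tells the real story:

```python
# Minimal sketch: why a held-out test set matters (synthetic data, illustrative only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))                          # 50 mostly-noise features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)    # only feature 0 carries signal

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Training AUROC is near-perfect; the held-out AUROC is much more modest.
print("train AUROC:", roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]))
print("test  AUROC:", roc_auc_score(y_test,  model.predict_proba(X_test)[:, 1]))
```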
@greig This is off-topic, but it’s good that you mention Health Informaticians. I have heard that some of the existing data, digital and telehealth roles around the country are only temporary and due to finish this year. Isn’t that a huge backward step?
@i.hunter, pass, as I now work for Te Aka Whai Ora, I am not in those loops any more. It certainly would be a step backwards for clinical informatics and clinical informaticians if permanent roles did not follow the transitional framework. Equally, knowing the belief systems of some in that space I would have to say that I would be more disappointed than surprised if they were not.
Probably because there are swathes of people who have vested interests in selling AI as a new and better wheel than the one we already have. The septic shock modelling is confusing - we have reasonable statistical models for predicting the likelihood of death from sepsis. When the ML models have a larger area under the ROC curve than the statistical models (in large numbers, and completely independent of the sample(s) used to develop the model), they should be used in their place. However, putting them into clinical use at the current time, without good peer-reviewed and published evaluations demonstrating superiority, seems worrying.
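To make the comparison concrete, here is a hedged sketch (synthetic data via scikit-learn, not the sepsis data under discussion) of the kind of head-to-head evaluation I mean: a conventional logistic model and an ML model scored by AUROC on the same data held out from model development:

```python
# Sketch: compare a statistical model with an ML model on an independent test set
# via AUROC (synthetic data standing in for an external validation sample).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20, n_informative=5, random_state=1)
X_dev, X_ext, y_dev, y_ext = train_test_split(X, y, test_size=0.4, random_state=1)

logit = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
gbm = GradientBoostingClassifier(random_state=1).fit(X_dev, y_dev)

for name, m in [("logistic (statistical)", logit), ("gradient boosting (ML)", gbm)]:
    auc = roc_auc_score(y_ext, m.predict_proba(X_ext)[:, 1])
    print(f"{name}: external AUROC = {auc:.3f}")
```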
There are plenty of articles on ML models in sepsis against clinical acumen, and also ML model against ML model. Unfortunately, these remind me of big pharma trials where the drug is tested against placebo when the control arm should actually be an accepted standard treatment.
Similarly, the closed box of ML modelling is a real worry. The current practice of developing better statistical models and publishing the covariates, so that others can apply the model to their own data set without subscribing to (expensive) systems to take advantage of the modelling/triggers etc., is more open and fairer than these closed proprietary systems.
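As an illustration of why published covariates matter, a published logistic model can be applied to local data in a few lines - the variable names and coefficients below are entirely made up for the example, not taken from any real sepsis model:

```python
# Sketch: applying a published statistical model locally, no proprietary system needed.
# All coefficients and variable names are invented, purely for illustration.
import math

published_model = {
    "intercept": -4.2,
    "age_years": 0.03,
    "lactate_mmol_l": 0.35,
    "systolic_bp_mmhg": -0.01,
}

def predicted_risk(patient: dict) -> float:
    """Logistic model: risk = 1 / (1 + exp(-(intercept + sum(beta_i * x_i))))."""
    lp = published_model["intercept"] + sum(
        coef * patient[name] for name, coef in published_model.items() if name != "intercept"
    )
    return 1.0 / (1.0 + math.exp(-lp))

# Apply to a (hypothetical) local patient record.
print(predicted_risk({"age_years": 72, "lactate_mmol_l": 4.1, "systolic_bp_mmhg": 95}))
```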
@brentm, absolutely. The points you make are the exact ones I would make on a specific issue like septicemia. I have argued that we should treat a new ML/AI algorithm as we would a new drug. If it’s not OK for Dr Utterly Mad to cook up a new drug in his garden shed and start administering it to patients because he thinks it’s a good idea, why would it be any better for him (and it’s usually a him) to do the same with a new ML/AI tool that some automated AI writing tool has generated, because he thinks it’s a good idea?
It’s not that validation can’t be done, or that analyses of the variables can’t be done. We just need to do them, including providing technical support to enthusiasts, and then have the work independently reviewed. Luckily the latter is starting to get underway, so that is a plus, although it remains advisory, which may have its own issues depending on the skill level of the report’s recipient. Pharmac at least has qualified pharmacists, and health has…? I simply don’t know.
Absolutely a great idea.
Expertise and experience in the modelling is still needed in order to interpret the “algorithm labels”. For example, they quote an (out of sample) AUROC (area under the receiver operating characteristic curve) of 0.87 for survival to discharge, which is certainly up there with the better mortality prediction models (although I would be surprised if it were significantly better, if we could see the 95% confidence intervals, which ideally would also be provided).
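For what it’s worth, if the out-of-sample predictions were available, a 95% confidence interval around that AUROC is cheap to produce. A hedged sketch with synthetic placeholder predictions (not the article’s data) using a simple bootstrap:

```python
# Sketch: bootstrap 95% CI around an out-of-sample AUROC, assuming access to the
# held-out labels and predicted probabilities (synthetic placeholders here).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 2000
y_true = rng.integers(0, 2, size=n)
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=n), 0, 1)  # synthetic scores

aucs = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)              # resample with replacement
    if len(np.unique(y_true[idx])) < 2:           # need both classes to compute AUROC
        continue
    aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUROC = {roc_auc_score(y_true, y_prob):.3f} (95% CI {lo:.3f}-{hi:.3f})")
```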
Understanding mortality prediction involves more than just measures of discrimination and fit, however. Diagnosing progression to death is much easier than predicting eventual death at the point of first contact - that is to say, the physiology, including vital signs and lab results, deteriorates markedly as time progresses in those who die, compared to those who recover for any reason (good treatment or good luck). So the longer the lead time before making the prediction, the better a model will perform on these measures. Many models predict death based only on variables available in the first hour after arrival. These are inherently less discriminatory (slightly lower AUROC, ~0.80-0.85) but are far more useful from both a clinical intervention and a quality benchmarking point of view.
The article uses data from as long as the patient is in the Emergency Department (ED). This time can be very variable and is very specific to individual hospitals, systems of care and regions. It is also worth asking whether time spent in the ED is actually a confounder - that is, patients more likely to die within a few hours may be kept in the emergency department until death (and thus their data are not included), or moved to hospital wards more quickly. Similarly, a hospital under intense stress may be more likely to deliver inferior care, which also influences time spent in the ED.
There is also no mention of what was done with data from patients who had documentation about limiting active medical interventions - there are worse fates than death for some. Even more complex is the fact that limitations are not dichotomous, all-or-nothing questions. Nonetheless, measures of discrimination will improve if these patients are included alongside data about their limitations - however, from both a clinical and benchmarking point of view, it is usually not very useful to include them.
On reflection, it has struck me writing this that while labelling might be useful, we actually need a process where expert reviewers have a look at the “algorithm label”, feed back improvements and deficits in the information contained in the “label” to the authors, and finally have an expert editorial available alongside the “label”. Come to think of it, I have heard of similar processes before…
Or treat them like medicines, but “yes”. The onus should be on the team to provide the accuracy, validity and local applicability data.