Tools that can rapidly assimilate and assess a large amount of content about a patient and present the most relevant information back to clinicians are the holy grail of health IT. Generative AI and Large Language Models are a fantastic step towards this goal. But…they are just a step…
There was a bit of a flurry of excitement recently when some researchers took the latest GPT-4V vision model and threw some radiology images at it for interpretation. https://arxiv.org/pdf/2309.17421.pdf
It responded with what at first glance looked to be amazing accuracy - except it was dead wrong on one example and appears to have simply made stuff up on another. And that's the elephant in the room: hallucinations.
There are lots of breathless descriptions of how LLMs can “pass the USMLE” with 80% accuracy. Sounds amazing. But it’s an interesting thought experiment to consider what that 80% means if you treat the LLM as a real clinician.
Let’s think about Radiology again and look at errors in this domain. It’s no secret that Radiologists miss things (approximately 1% of the time in fact!), but these are typically errors of perception - good search methods and practice can reduce the rate of these errors, and AI is actually super good at helping with this! There are validated image analysis AI models in production with FDA clearance that, for example, help radiologists identify lung nodules. Oh to have these deployed across NZ…
Errors of interpretation are far more concerning though, i.e. a Radiologist sees something but misdiagnoses it through a lack of experience or knowledge - in the paper above this is the Jones fracture diagnosis. The Generative AI “sees” the fracture, but confidently misdiagnoses the actual fracture type #awkward… (see section 4.1, page 33)
The subsequent image analysis on the same page is where things get super-concerning - the hallucination. The AI describes a nodule in the right upper lobe - yet the right upper lobe almost certainly isn’t even visible in the supplied CT slice (let alone a nodule) #superawkward
- Errors where you miss seeing/perceiving some information = not great, but hopefully fixable through improved review processes and experience.
- Errors where you perceive something but attribute it incorrectly = more concerning and needs active addressing.
- Errors where you just make stuff up? That’s malpractice.
Generative AI/LLMs are super exciting, super useful (in some verticals) and super enticing in medicine. But I think they should be evaluated and assessed in the same manner as a new therapeutic: Phase 1, Phase 2 and Phase 3 trials, with a clear understanding of the risks, benefits and potential adverse side effects of their use.
Humans are hugely susceptible to trusting technology, particularly if it appears to be slick and well presented. Even the concept of an LLM that absorbs notes and summarises them should (in my opinion) raise massive red flags around the potential for end-users to implicitly trust the output because “a computer did it”.