GPT-4 (and beyond!)

Fun example circulating on socials showing a guy using LLMs to get a second (more accurate) opinion on his dog’s condition.

Quite impressive.

I do find numerical operations quite entertaining to see in LLMs, knowing they aren’t actually performing any maths under the hood – just statistically predicting the next token based on their training set.
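
For anyone who hasn’t seen it spelled out, here’s a toy Python sketch of what “just predicting the next token” means – made-up numbers, not any real model’s internals:

    import math, random

    # Toy sketch: the model answers "2 + 2 =" not by doing arithmetic, but by
    # scoring candidate next tokens and sampling from the resulting distribution.
    vocab = ["3", "4", "5"]       # candidate next tokens
    logits = [1.2, 4.0, 0.8]      # made-up scores from the network

    # softmax turns the scores into probabilities
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    next_token = random.choices(vocab, weights=probs)[0]
    print(next_token)  # usually "4" – only because "4" dominated the training text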

I’ve just gone ahead and upgraded my OpenAI account for GPT-4, and it can certainly do a lot of reasonable knowledge synthesis. For example, I prompted it with:
“Could you please write me a step-by-step guide to the initial assessment of a patient presenting to the emergency department with suspected aortic dissection?”

And the output is not terribly dissimilar to our Canterbury Hospital HealthPathways document.

Anyone else been playing with LLMs and trying to imagine where they could augment healthcare operations?

2 Likes

I’ve been doing some experiments and GPT-4 is very powerful. It could be very useful for improving healthcare but comes with some pretty serious risks (particularly cybersecurity).

This interview with Sam Altman is worth watching/listening to if you want to see where they are headed (AGI):

OpenAI have documented some of the risks here: https://cdn.openai.com/papers/gpt-4-system-card.pdf

I think we need a moratorium on training new models beyond GPT-4 (GPT-5 is due to finish training later this year) due to the risks involved. There is an open letter here people can sign if they agree: https://futureoflife.org/open-letter/pause-giant-ai-experiments/

3 Likes

A lot of sensitive information has probably already been entered into this system, and I think it could potentially resurface.

Another interesting video I just saw on this topic:

1 Like

Certainly, we need more rules and regulations, and as fast as possible, but I am unsure whether this moratorium would help or just push the whole development into more “unregulated” areas – as it seems to me that everyone can now learn how to program their own AI… https://plat.ai/blog/how-to-build-ai/

2 Likes

I’d follow the lead of Geoffrey Hinton on this. He is probably the most credible expert on where things are going, and he is calling for a moratorium. I don’t think there is really any serious regulation going on at the moment, and even Sam Altman thinks the government needs to regulate AI:

Agreed, you can certainly download an open-source LLM similar to ChatGPT and do whatever you want with it (https://ai.facebook.com/blog/large-language-model-llama-meta-ai/), but there are not many organisations that can train LLMs better than GPT-4, and the moratorium supporters are saying those organisations should stop what they are doing until proper regulations can be put in place.

1 Like

EU’s AI regulation plan was a hopeful attempt at regulation until generative AI came into the picture. Latest commentary here: https://www.politico.eu/article/eu-plan-regulate-chatgpt-openai-artificial-intelligence-act/#:~:text=The%20regulation%2C%20proposed%20by%20the,some%20instances%20of%20facial%20recognition.

2 Likes

“Regulating AI” is just so vague, but it’s worth exploring where guardrails are appropriate to ensure the safety of the public for a set of applications.

LLMs themselves are still, effectively, at the novelty stage for many critical applications without substantial human oversight – but, at the same time, their generative fluency absolutely has potential for misuse.

For the moment, of course, any time you want to feel smarter than the bot, you can just ask it to do some minimally novel basic math and watch it fail.

2 Likes

It’s pretty bad at difficult maths, but probably better than me, unfortunately! How an LLM can do any maths at all is beyond me, though… I asked it, and this is what it said:

It seems to somehow use this method to be pretty good at basic maths:

There is now a Wolfram plugin that allows it to do complex maths as well:
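
The plugin pattern itself is easy to sketch. Purely as illustration (the function names here are mine, not the plugin’s actual API, and Python’s Fraction stands in for Wolfram), the idea is to route maths out of the token-guessing loop entirely:

    from fractions import Fraction

    # Sketch of the delegation pattern: spot the maths sub-task and hand it to
    # an exact evaluator (Wolfram in the real plugin; Fraction standing in here),
    # instead of letting the model guess the answer token by token.

    def exact_eval(expression: str) -> str:
        # stand-in for the external maths engine; handles "a op b" only
        left, op, right = expression.split()
        a, b = Fraction(left), Fraction(right)
        return str({"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op])

    def answer(question: str) -> str:
        if question.lower().startswith("compute:"):
            return exact_eval(question.split(":", 1)[1].strip())  # tool result
        return "(would fall through to the LLM here)"

    print(answer("Compute: 1234 * 5678"))  # 7006652 – exact, every time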

1 Like

Apparently, some feel the open letter doesn’t go far enough…

“If intelligence says that a country outside the agreement is building a GPU cluster, be less scared of a shooting conflict between nations than of the moratorium being violated; be willing to destroy a rogue datacenter by airstrike.”

“Make it explicit in international diplomacy that preventing AI extinction scenarios is considered a priority above preventing a full nuclear exchange, and that allied nuclear countries are willing to run some risk of nuclear exchange if that’s what it takes to reduce the risk of large AI training runs.”

2 Likes

Hi @rradecki

I’ve been playing a bit with Chat GPT. I also have a few thoughts on that dog.

The dog

I think there is one thing worse than dissing your vet on Twitter, and that is enthusiastically embracing a new and powerful technology based on anecdote, especially n=1.

I can see similarities to the patient who comes to clinic clutching a bucketful of information they’ve gleaned from the web. This can provide very valuable information. But quite often the bucket is full of red herrings. This should also be distinguished from the patient with an identified, rare, chronic condition—where quite often they are more knowledgeable about their disease than the average doctor, and sometimes even more so than an expert in the field. I see the role of the clinician here as providing much-needed pathophysiological insight and a therapeutic context, which the patient may still require to complete their picture.

I find it interesting how Chat GPT embraced the idea that you treat babesiosis with an ‘antidote’, too.

The problems

Just today there were several articles on this topic in the New England Journal of Medicine, notably this one: https://www.nejm.org/doi/full/10.1056/NEJMsr2214184. They seem to have missed a lot of the important issues.

The obvious issues with Chat GPT are its glib ability to fill in areas of lack of knowledge with what has been termed ‘hallucinations’, and the fact that it can pretty much only work on the corpus of knowledge it’s been supplied with.

A much bigger issue is that—as you hint—it has no internal model whatsoever. It doesn’t ‘get’ meta-cognition or counterfactuals. It is ‘merely’ dealing with word association. As you pointed out, it is crap at maths, but it also cannot test its internal model against reality, because it doesn’t have one.

Playing with Chat GPT

It is fun to play with. A lot of fun. For example, with adequate specifications, it can compose a non-trivial program written in, say, Perl that runs perfectly the first time.

It can also be infuriatingly stupid, not just owing to its limited, biased knowledge base (women and minorities get short shrift) but also because it lacks metacognition or any underlying model of the process under discussion. Let’s try an example:

Me: Can you state Ohm’s law and its cardiovascular analogy?

Chat GPT: Ohm’s law states that the current flowing through a conductor is directly proportional to the voltage applied across it, and inversely proportional to the resistance of the conductor. It can be expressed mathematically as:

I = V/R

where I is the current flowing through the conductor, V is the voltage applied across it, and R is the resistance of the conductor.

In the cardiovascular system, a similar analogy can be made by comparing blood flow (I) to electrical current, blood pressure (V) to voltage, and vascular resistance (R) to electrical resistance. Therefore, the analogy of Ohm’s law for the cardiovascular system can be expressed as:

Blood flow = Blood pressure / Vascular resistance

This was followed by an adequate description of the analogy. But then I asked…

Me: Can you use that second equation to explain why blood pressure might drop in a patient with vasodilated septic shock?

Chat GPT: Yes, the equation Blood flow = Blood pressure / Vascular resistance can be used to explain why blood pressure might drop in a patient with vasodilated septic shock.

Septic shock … [blah, blah]

In other words, the decrease in vascular resistance (R) in septic shock leads to an increase in blood flow (I), which must be compensated for by an increase in blood pressure (V) …

Chat GPT was subsequently profusely apologetic that it had things back-to-front, but still didn’t seem to ‘get’ the model it had just described. So I then asked it to rearrange the equation, making Blood pressure the dependent variable. Which it did. Now check this out …

Me: Teleologically speaking, which of the two variants of the equation makes more sense in terms of the ‘objectives’ of cardiovascular homeostasis?

Chat GPT: Both variants of the equation, Blood flow = Blood pressure / Vascular resistance and Blood pressure = Blood flow x Vascular resistance, are important for understanding cardiovascular homeostasis and the regulation of blood flow and pressure in the body.

However, from a teleological perspective, the second variant of the equation, Blood pressure = Blood flow x Vascular resistance, may make more sense in terms of the objectives of cardiovascular homeostasis. This is because blood pressure is a critical parameter that needs to be regulated to maintain adequate perfusion to the body’s organs and tissues.

In the cardiovascular system, blood pressure is regulated by a complex interplay of factors, including cardiac output, vascular resistance, and hormonal and neural regulation. By expressing blood pressure as a dependent variable that is influenced by changes in blood flow and vascular resistance, the equation highlights the importance of maintaining an appropriate balance between these factors to achieve optimal cardiovascular function.

In other words, the equation emphasizes the idea that changes in blood flow or vascular resistance can have a significant impact on blood pressure, and that the body needs to adapt and regulate these factors to maintain adequate perfusion to the tissues. This is an essential aspect of cardiovascular homeostasis, as changes in blood pressure can have serious consequences for organ function and overall health.

Overall, while both variants of the equation are important for understanding the cardiovascular system, the second variant may be more relevant from a teleological perspective, as it highlights the central importance of blood pressure regulation in achieving cardiovascular homeostasis.

This is so impressive that you might be forgiven for forgetting that all of this is simply pattern matching. I won’t bore you with my subsequent enquiry into what affects blood flow (assuming SVR is fixed) but this amply confirmed what we already know about pattern matching—but no model!
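
To underline that point with a sketch of my own (not Chat GPT output): once you actually hold the model, the septic-shock question becomes a two-line test, assuming typical resting values in Wood units:

    # My illustration (not Chat GPT output): encode the model in two lines,
    # then actually test the septic-shock scenario against it.

    def map_mmHg(cardiac_output_l_min: float, svr_wood_units: float) -> float:
        """Mean arterial pressure from the cardiovascular Ohm's law analogy."""
        return cardiac_output_l_min * svr_wood_units

    baseline = map_mmHg(5.0, 18.0)  # typical resting values: MAP ~90 mmHg
    septic = map_mmHg(7.0, 6.0)     # vasodilation: SVR collapses, CO rises

    print(baseline, septic)  # 90.0 42.0 – flow up, resistance down, pressure falls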

Other issues

There are even bigger issues. One is that re-training these bots is non-trivial, so the tendency is to cast the oracle in stone. It will not adapt to changing knowledge or circumstances until the next expensive iteration—which will no doubt be dictated by economic factors. GPT-3 cost at least US$ 4.6 million to train (with a huge carbon footprint). GPT-4 likely cost a lot more.

A second issue is how inscrutable this all is. Chat GPT is triply inscrutable because not only don’t we know what went into the mix (the data are secret), or how this was done (the source code is a secret), but we also have the usual issue of the ‘explanation’ for a recommended course of action depending on nothing other than a matrix of coefficients—that cannot be explained. (Even its ‘explanations’ are more of the same!)

A third issue pertains more to societies where legalism prevails (like the US) because you can be sure it’s only a matter of time before Chat GPT is invoked (if not put on the witness stand!) to explain what a stupid decision the hard-working doctor made in the middle of the night, looking after a dozen sick people.

The consequence will surely be that “just to be sure” more and more clinicians—especially younger ones—will ask Chat GPT, who will recommend a huge bunch of extra investigations, increasing not just costs, but also the VOMIT syndrome (Victim Of Medical Imaging Technology) where ‘incidentalomas’ result in not just extra tests, but harm related to those extra tests.

But just yesterday I gave my house officer ‘homework’ and he seemed more than happy to do this simply by asking Chat GPT. (I was less than happy).

You should also note that the approach taken by Chat GPT is provably sub-optimal, because it does not use Bayesian logic in attacking problems.

The BIG issue

In my opinion however, the really big issue relates to whether we can learn to leverage the smarts of this new technology without allowing ourselves to become dumb. If our modelling and diagnostic abilities atrophy because of the convincing spiel of the chatbot, we will become poorer and medical innovation and improvement will be terminally stifled.

My 2c, Dr Jo.

3 Likes

I tried to get GPT-4 to do Bayesian reasoning:

Me: do you understand bayesian logic? Can you apply it to a medical problem? I will give you some information about a patient and you should use bayesian logic and your knowledge of probabilities of diagnoses to create the differential diagnosis? OK?

GPT4: As an AI language model, I can help you apply Bayesian logic to a medical problem based on the information you provide about the patient. However, please note that I am not a medical professional, and my knowledge is based solely on the information available up to September 2021. My suggestions should not be considered as professional medical advice. Always consult a qualified healthcare professional for medical advice and diagnosis.

That being said, please provide the information about the patient and any relevant probabilities or data, and I will do my best to help you create a differential diagnosis using Bayesian logic.

Me: OK, the patient presents with abdominal pain in the centre of their abdomen - what is the most likely diagnosis

GPT4: As an AI language model, I’m not able to directly observe patients or make a definitive diagnosis. However, I can provide some common causes of central abdominal pain based on the information you provided. Keep in mind that this is not an exhaustive list, and the actual cause of pain might not be among these conditions.

Some common causes of central abdominal pain include:

  1. Gastroenteritis: Inflammation of the stomach and intestines, usually due to a viral or bacterial infection.
  2. Indigestion: Discomfort in the upper abdomen often related to eating, which can be caused by various factors like overeating, eating too fast, or consuming spicy or fatty foods.
  3. Gastric ulcers: Open sores in the lining of the stomach that can cause pain in the central abdomen.
  4. Pancreatitis: Inflammation of the pancreas, which can cause severe central abdominal pain.
  5. Small bowel obstruction: A blockage in the small intestine, which can cause abdominal pain and other symptoms like nausea, vomiting, and constipation.

To create a differential diagnosis and prioritize the possibilities using Bayesian logic, you would need to provide additional information about the patient, such as their age, medical history, other symptoms, and any relevant probabilities or data related to the potential diagnoses.

Remember, it’s essential to consult a healthcare professional for an accurate diagnosis and appropriate treatment.

Me: OK, some more information follows. Use a bayesian approach and explain your working. The patient gets worse. They develop a fever and the pain seems to have moved to the RLQ of their abdomen.

GPT4: With the additional information provided, the patient’s presentation is more suggestive of acute appendicitis, which is characterized by the initial central abdominal pain that later localizes to the right lower quadrant (RLQ) of the abdomen, often accompanied by a fever. However, it’s important to remember that I am an AI language model, and my response should not be considered professional medical advice. Always consult a qualified healthcare professional for medical advice and diagnosis.

To apply Bayesian logic to this problem, we would need the initial probabilities (priors) for each potential diagnosis, as well as the likelihood of observing the given symptoms given each diagnosis. Since we don’t have specific probabilities, I will provide a qualitative explanation of the Bayesian approach in this case.

Let’s consider two hypotheses:

  • A1: The patient has acute appendicitis.
  • A2: The patient has a different condition causing abdominal pain.

Initially, we had a set of potential diagnoses (e.g., gastroenteritis, indigestion, gastric ulcers, pancreatitis, small bowel obstruction). As the patient’s condition worsened, they developed a fever and the pain localized to the RLQ, which is more consistent with acute appendicitis.

Bayesian logic allows us to update our belief in each hypothesis based on this new evidence. The likelihood of observing these new symptoms is higher for acute appendicitis than for the other conditions. Therefore, we can update our prior probabilities for A1 and A2 using Bayes’ theorem:

P(A1 | B) = (P(B | A1) * P(A1)) / P(B)
P(A2 | B) = (P(B | A2) * P(A2)) / P(B)

Here, B represents the new evidence (fever and RLQ pain). Since P(B | A1) > P(B | A2), the posterior probability P(A1 | B) will be greater than P(A2 | B), making acute appendicitis a more likely diagnosis given the new symptoms.

However, this analysis is qualitative, and without specific probabilities, it’s not possible to provide a quantitative assessment. Please consult a healthcare professional for a proper evaluation and diagnosis.

I got it to make up a scenario:

Me: create a scenario of a medical case that needs to be reasoned using bayesian logic - use actual statistical data from reputable sources for the priors.

GPT4: Here’s a medical case scenario where Bayesian logic can be applied using actual statistical data. Please note that the data used in this example is sourced from research studies and might not be up-to-date or representative of the current situation.

Scenario: A 45-year-old female patient comes to the clinic with a persistent dry cough and shortness of breath. She has no history of smoking, and she’s worried about the possibility of having lung cancer.

We’ll consider two hypotheses:

  • A1: The patient has lung cancer.
  • A2: The patient does not have lung cancer.

We need to determine the prior probabilities (P(A1) and P(A2)) and the likelihood of observing the symptoms given each hypothesis (P(B|A1) and P(B|A2)).

Prior probabilities: According to the American Cancer Society, the overall lifetime risk of developing lung cancer for a female non-smoker in the U.S. is approximately 1.3%.

  • P(A1) = 0.013 (the probability that the patient has lung cancer)
  • P(A2) = 1 - P(A1) = 0.987 (the probability that the patient does not have lung cancer)

Likelihood of observing the symptoms: A persistent cough is a common symptom of lung cancer, but it is also a symptom of many other respiratory conditions. The likelihood of observing a persistent cough given lung cancer is around 50-75% (according to Cancer Research UK). The likelihood of observing a persistent cough given no lung cancer can be estimated at around 10% (based on common respiratory illnesses such as asthma, bronchitis, and allergies).

  • P(B|A1) = 0.6 (assuming a 60% likelihood of a persistent cough given lung cancer)
  • P(B|A2) = 0.1 (assuming a 10% likelihood of a persistent cough given no lung cancer)

Now we can apply Bayes’ theorem to update our beliefs based on the patient’s symptoms:

P(A1|B) = (P(B|A1) * P(A1)) / P(B)
P(A2|B) = (P(B|A2) * P(A2)) / P(B)

We also need to find P(B), the probability of observing a persistent cough. We can use the law of total probability:

P(B) = P(B|A1) * P(A1) + P(B|A2) * P(A2)
P(B) = 0.6 * 0.013 + 0.1 * 0.987
P(B) ≈ 0.1067

Now we can find the posterior probabilities:

P(A1|B) ≈ (0.6 * 0.013) / 0.1067 ≈ 0.073
P(A2|B) ≈ (0.1 * 0.987) / 0.1067 ≈ 0.927

Based on this Bayesian analysis, the probability of the patient having lung cancer given her symptoms is approximately 7.3%, while the probability of her not having lung cancer is approximately 92.7%.

Please note that this is a simplified example and should not be taken as medical advice. It is essential to consult a healthcare professional for proper evaluation, diagnosis, and treatment.

Me: take into account the age

GPT4: To incorporate age in our Bayesian analysis, we need to find age-specific lung cancer incidence rates. The prior probabilities should be adjusted according to the age of the patient, which in this case is 45 years old.

According to data from the National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) program, the age-specific lung cancer incidence rate for females aged 40-44 in the United States is approximately 14.6 per 100,000 women.

To adjust our prior probabilities:

  • P(A1) = 0.000146 (the probability that a 45-year-old female patient has lung cancer)
  • P(A2) = 1 - P(A1) = 0.999854 (the probability that a 45-year-old female patient does not have lung cancer)

Now, we can apply Bayes’ theorem with the updated prior probabilities:

P(B) = P(B|A1) * P(A1) + P(B|A2) * P(A2)
P(B) = 0.6 * 0.000146 + 0.1 * 0.999854
P(B) ≈ 0.1007

And find the posterior probabilities:

P(A1|B) ≈ (0.6 * 0.000146) / 0.1007 ≈ 0.00087
P(A2|B) ≈ (0.1 * 0.999854) / 0.1007 ≈ 0.99913

After taking into account the patient’s age, the probability of the patient having lung cancer given her symptoms is approximately 0.087%, while the probability of her not having lung cancer is approximately 99.913%.

Please remember that this is a simplified example and should not be taken as medical advice. It is crucial to consult a healthcare professional for proper evaluation, diagnosis, and treatment.

Me: take into account SOB

GPT4: To incorporate shortness of breath (SOB) into our Bayesian analysis, we will consider the likelihood of observing both a persistent dry cough and shortness of breath given each hypothesis (P(B|A1) and P(B|A2)).

Let’s define B as the observation of both a persistent dry cough and shortness of breath. We need to find P(B|A1) and P(B|A2) and update our calculations accordingly.

  • P(B|A1): According to Cancer Research UK, shortness of breath is a common symptom of lung cancer, with 20-30% of patients reporting it. Since cough and shortness of breath can occur simultaneously in lung cancer patients, let’s assume that the likelihood of observing both symptoms given lung cancer is 40% (0.4).
  • P(B|A2): Shortness of breath is also a symptom of various other respiratory conditions. Let’s assume that the likelihood of observing both a persistent dry cough and shortness of breath given no lung cancer is 5% (0.05).

Now we can update our Bayesian analysis with the new likelihoods:

P(B) = P(B|A1) * P(A1) + P(B|A2) * P(A2)
P(B) = 0.4 * 0.000146 + 0.05 * 0.999854
P(B) ≈ 0.0514

Then, we find the posterior probabilities:

P(A1|B) ≈ (0.4 * 0.000146) / 0.0514 ≈ 0.00113
P(A2|B) ≈ (0.05 * 0.999854) / 0.0514 ≈ 0.99887

After taking into account the patient’s age and the presence of both a persistent dry cough and shortness of breath, the probability of the patient having lung cancer given her symptoms is approximately 0.113%, while the probability of her not having lung cancer is approximately 99.887%.

Please remember that this is a simplified example and should not be taken as medical advice. It is crucial to consult a healthcare professional for proper evaluation, diagnosis, and treatment.
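
If anyone wants to replay those updates, here’s a quick sketch of the same two-hypothesis Bayes update (my code, not GPT-4’s; the priors and likelihoods are the assumed values quoted above, so expect small differences from its rounded figures):

    # Re-running the updates GPT-4 described (my sketch, not its output), using
    # the assumed priors/likelihoods quoted above. Small differences from its
    # figures come down to rounding and its occasional arithmetic slips.

    def bayes_update(prior: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
        """Posterior P(A|B) for a two-hypothesis partition via Bayes' theorem."""
        p_b = p_b_given_a * prior + p_b_given_not_a * (1 - prior)
        return p_b_given_a * prior / p_b

    print(bayes_update(0.013, 0.6, 0.1))      # ~0.073   lifetime prior, cough only
    print(bayes_update(0.000146, 0.6, 0.1))   # ~0.00088 age-adjusted prior, cough only
    print(bayes_update(0.000146, 0.4, 0.05))  # ~0.0012  age-adjusted, cough + SOB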

2 Likes

Good morning.

Here is a health application out of Y Combinator to compare with ChatGPT:

“We combine a base LLM with a proprietary clinical knowledge database and human (physician) reinforcement training to help doctors diagnose patients and generate accurate clinical plans in seconds.”

Glass ai

Regards, Ben.

1 Like

Yes – I look at GPT4 as more of a demonstration of LLMs than a specifically usable reasoning/diagnostic engine. Its use requires recognition of its limitations and underlying mechanistic features – its outputs being the culmination of layers of statistical probabilities simply guessing at the next token in sequence based on the training set it has digested. Hence, even something with an appearance of intelligence – like the Bayesian reasoning – is just a mash-up of various pages of examples of using Bayes’ theorem, rather than an insightful de novo example. But, within those limitations, there is still value hidden even within GPT4.

Regarding other, medical-specific applications – yes, if the auto-regressive training uses a specific template for output to measure “success”, then more domain-specific versions can be implemented. For repetitive tasks and administrative applications like ambient medical transcriptions, tuned LLMs can almost certainly put together an output framework with a high level of operational acceptability. These would likely still require an amount of domain expert review, but this is likely one of their first valid applications in our space.
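
As a sketch of what that “specific template for output” could look like in practice – everything below is hypothetical, including call_llm, which is a placeholder rather than any vendor’s API:

    import json

    # Everything here is hypothetical (including call_llm – not any vendor's
    # API): the model is forced into a fixed note template that can be
    # validated, and anything malformed is routed to a human instead.

    REQUIRED_FIELDS = {"subjective", "objective", "assessment", "plan"}

    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # placeholder for whatever LLM endpoint is in use

    def draft_soap_note(transcript: str) -> dict:
        prompt = ("Summarise this consultation transcript as JSON with exactly "
                  f"the keys {sorted(REQUIRED_FIELDS)}.\nTranscript:\n" + transcript)
        note = json.loads(call_llm(prompt))  # malformed JSON raises -> human review
        if set(note) != REQUIRED_FIELDS:
            raise ValueError("template violated; route to human review")
        note["status"] = "draft-pending-clinician-review"  # never auto-finalised
        return note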

Update – Epic has announced they are going to integrate GPT-4 as an add-on:

“Our investigation of GPT-4 has shown tremendous potential for its use in healthcare. We’ll use it to help physicians and nurses spend less time at the keyboard and to help them investigate data in more conversational, easy-to-use ways.” – Seth Hain, Senior Vice President of Research and Development at Epic

https://azure.microsoft.com/en-us/blog/introducing-gpt4-in-azure-openai-service/

1 Like

Yes - you said “it also cannot test its internal model against reality, because it doesn’t have one.” I think you are absolutely right; this is the biggest downfall, and the biggest challenge we face. We need to halt the development of this until we figure out a way to discriminate between the virtual reality and our real world… or is it going to blur anyway???

2 Likes

Just a question: do we understand the implications for privacy, security, confidentiality, and other ethical/data-sovereignty issues, and the basic social license for clinicians to use any AI solutions? Are we aware that any ‘sample’ of real data “we” release may result in medical-in-confidence content being held by a 3rd party and potentially openly exposed internationally?

1 Like

Hinton has just resigned from Google:

We have a session on this at the NAIAEAG this month, so I am casting around for conversations about how, as a health system, we deal with these tools being available. If you have any thoughts about what Te Whatu Ora specifically could/should do, I am interested in hearing them.

I am personally biased towards educating people about the benefits and risks of using these tools, not banning them. The reason for this is that these won’t be going away (and banning is very very hard for an organisation like ours) and they will rapidly get to the point where truth is implicitly assumed rather than rigorously debated. In this I am thinking about the example of Google Search and Wikipedia and the impacts they have had on human knowledge and information sharing. I also like Dr Jo’s comment further up about the similarity with patients turning up with unfiltered information from the internet about their symptoms.

So education and use need to start happening now, while we can see how ridiculous the answers can be and can test them – rather than banning these tools until they are good/right (by which point our lack of education/experience would mean we miss that they could still be wrong)?

Lastly, I think the “putting confidential data into these tools” problem is overblown. Just as you wouldn’t put patient information into this forum, I think most people realise they shouldn’t be putting corporate secrets or other people’s information (e.g. medical records) into public places on the internet, and this is just another one of those places – like Google Drive or Airtable or free versions of software, etc.

Jon

1 Like

Hi Jon
I agree it ultimately is another source of information. With regard to the health sector, it becomes a source of information that looks correct but may not be, and there are many different stakeholders in health who might use such tools to inform their decision making.

So, banning use won’t work; prohibition shows that banning leads to underground/black-market circulation. Better to keep it in the public eye where it can stimulate debate. As said above, it’s like Google, only its content looks more likely to be ‘true’ in the way it is presented.

Yes to education about its/their existence as AI tools. Yes to expecting stakeholders to use these tools; for example, patients may present to a clinician with a printout of a search. It should be the start of a discussion rather than being dismissed or ignored.

However, that said, there are few resources within health to deal with such AI tools and their output and consequences on an individual basis (one clinician, one patient). It may well need a public information campaign to inform people (all stakeholders) of the pros and cons of using such tools.

Is this something Te Whatu Ora would consider? The last thing we need is for patients and whānau to lose trust in the health sector over this, or for it to become a means to cause breakdowns in care due to individual arguments around the validity of the information produced by AI tools.

Sorry, I have rambled on enough.
Cheers Inga

5 Likes