Various research groups have been teasing the idea of an AI doctor for the better part of the past decade. In late December, computer scientists from Google and DeepMind put forth their version of an AI clinician that can diagnose a patient’s medical conditions based on their symptoms, using a large language model called PaLM.
Per a preprint paper published by the group, their model scored 67.6 percent on a benchmark test containing questions from the US Medical Licensing Examination, a result they claim surpasses previous state-of-the-art software by 17 percent. One version of it performed at a similar level to human clinicians. But there are plenty of caveats that come with this algorithm, and others like it.
Here are some quick facts about the model: It was trained on a dataset of more than 3,000 commonly searched medical questions, and six other existing open datasets for medical questions and answers, including medical exams and medical research literature. In their testing phase, the researchers compared the answers from two versions of the AI to those of a human clinician, and evaluated these responses for accuracy, factuality, relevance, helpfulness, consistency with current scientific consensus, safety, and bias.
Adriana Porter Felt, a software engineer who works on Google Chrome and was not involved in the paper, noted on Twitter that the version of the model that answered medical questions similarly to human clinicians includes the added feature of “instruction prompt tuning, which is a human process that is laborious and does not scale.” This includes carefully tweaking the wording of the question in a specific way that allows the AI to retrieve the correct information.
The researchers even wrote in the paper that their model “performs encouragingly, but remains inferior to clinicians,” and that the model’s “comprehension [of medical context], recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.” For example, every version of the AI missed important information and included incorrect or inappropriate content in its answers at a higher rate than human clinicians did.
Language models are getting better at parsing larger volumes of more complex information. And they seem to do okay with tasks that require scientific knowledge and reasoning. Several small models, including SciBERT and PubMedBERT, have extended the ability of language models to understand texts loaded with jargon and specialty terms.
But in the biomedical and scientific fields, there are complicated factors at play and many unknowns. And if the AI is wrong, then who takes responsibility for malpractice? Can the error be traced back to its source when much of the algorithm works like a black box? Additionally, these algorithms (mathematical instructions given to the computer by programmers) are imperfect and need complete and correct training data, which is not always available for various conditions across different demographics. Plus, buying and organizing health data can be expensive.
Answering questions correctly on a multiple-choice standardized test does not demonstrate intelligence. And the computer’s analytical ability might fall short if it were presented with a real-life clinical case. So while these test results look impressive on paper, most of these AIs are not ready for deployment. Consider IBM’s Watson AI health project. Even with millions of dollars in investment, it still had numerous problems and was not practical or flexible enough at scale (it ultimately imploded and was sold for parts).
Google and DeepMind do recognize the limitations of this technology. They wrote in their paper that there are still several areas that need to be developed and improved for this model to be actually useful, such as the grounding of the responses in authoritative, up-to-date medical sources and the ability to detect and communicate uncertainty effectively to the human clinician or patient.