Michael Barbella, Managing Editor | 03.12.24
With the growing popularity of large language model (LLM) chatbots—the type of artificial intelligence (AI) behind ChatGPT, Google Bard, and BingAI—it is important to determine the accuracy of the musculoskeletal health information they provide. The emerging consensus is that these chatbots are not yet reliable sources of such information.
Three new studies analyzed the validity of information chatbots gave to patients about certain orthopedic procedures, assessing the accuracy with which the chatbots present research advancements and clinical decision making. While the studies found that certain chatbots provide concise summaries across a wide spectrum of orthopedic conditions, each demonstrated limited accuracy depending on the category. Researchers agree that orthopedic surgeons remain the most reliable source of information. The findings will help those in the field understand the efficacy of these AI tools, whether their use by patients or non-specialist colleagues could introduce bias or misconceptions, and how future enhancements could make chatbots a valuable tool for patients and physicians.
Potential Misinformation, Dangers Associated With Chatbots
Led by Branden Sosa, a fourth-year medical student at Weill Cornell Medicine, this study assessed the ability of the OpenAI ChatGPT 4.0, Google Bard, and BingAI chatbots to accurately explain basic orthopedic concepts, integrate clinical information, and address patient queries. Each chatbot was prompted to answer 45 orthopedic-related questions spanning several categories—bone physiology, referring physician, and patient query—and its responses were then assessed for accuracy. Two independent, blinded reviewers scored the responses on a scale of 0-4 for accuracy, completeness, and usability. Responses were analyzed for strengths and limitations within categories and across chatbots. The research team found the following trends:
- When prompted with orthopedic questions, OpenAI ChatGPT, Google Bard, and BingAI provided correct answers that covered the most critical salient points in 76.7%, 33%, and 16.7% of queries, respectively.
- When providing clinical management suggestions, all chatbots displayed significant limitations, deviating from the standard of care and omitting critical steps in the workup, such as ordering antibiotics before cultures or neglecting to include key diagnostic studies.
- When asked less complex patient queries, ChatGPT and Google Bard provided mostly accurate responses but often failed to elicit critical medical history pertinent to fully address the query.
- A careful analysis of citations provided by chatbots revealed an oversampling of a small number of references and 10 faulty links that were nonfunctional or led to incorrect articles.
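For readers curious how review scores of this kind translate into the per-chatbot percentages cited above, the short sketch below tallies the share of responses that both blinded reviewers rated fully accurate. The scores, the scoring threshold, and the overall approach are illustrative assumptions only, not the study's actual data or analysis.

```python
# Illustrative sketch only: hypothetical 0-4 reviewer scores, not study data.
# Counts the share of questions per chatbot that both blinded reviewers rated
# fully accurate (score of 4), mirroring the kind of percentage reported above.
from collections import defaultdict

# (chatbot, question_id, reviewer_1_score, reviewer_2_score) -- hypothetical.
scores = [
    ("ChatGPT", 1, 4, 4), ("ChatGPT", 2, 4, 3), ("ChatGPT", 3, 4, 4),
    ("Bard",    1, 3, 4), ("Bard",    2, 2, 2), ("Bard",    3, 4, 4),
    ("BingAI",  1, 2, 3), ("BingAI",  2, 4, 4), ("BingAI",  3, 1, 2),
]

totals = defaultdict(int)
fully_correct = defaultdict(int)
for chatbot, _question, r1, r2 in scores:
    totals[chatbot] += 1
    if r1 == 4 and r2 == 4:  # both reviewers rated the response fully accurate
        fully_correct[chatbot] += 1

for chatbot in totals:
    pct = 100 * fully_correct[chatbot] / totals[chatbot]
    print(f"{chatbot}: {pct:.1f}% of responses rated fully accurate by both reviewers")
```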
Researchers led by Jenna A. Bernstein, M.D., an orthopedic surgeon at Connecticut Orthopaedics, sought to investigate how accurately ChatGPT 4.0 answered patient questions by developing a list of 80 commonly asked patient questions about knee and hip replacements. Each question was submitted to ChatGPT twice: first as written, and then with a prompt instructing ChatGPT to answer the patient question “as an orthopedic surgeon.” Two surgeons on the team independently evaluated the accuracy of each set of answers and rated them on a scale of one to four. Agreement between the two surgeons’ evaluations was assessed with Cohen’s kappa, and the association between the question prompt and response accuracy was assessed with the Wilcoxon signed-rank test. The findings are as follows:
- When the quality of the ChatGPT responses was assessed, 26% (21 of 80 responses) had an average score of three (partially accurate but incomplete) or less when the questions were asked without a prompt, and 8% (six of 80 responses) had an average score of less than three when preceded by the prompt. As such, the researchers concluded that ChatGPT is not yet an adequate resource for answering patient questions and that further work to develop an accurate orthopedic-focused chatbot is needed.
- ChatGPT performed substantially better when appropriately prompted to answer patient questions “as an orthopedic surgeon,” achieving 92% accuracy.
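The two statistical tests named above are standard and straightforward to run; the sketch below shows how they might be applied, using scikit-learn for Cohen's kappa (inter-rater agreement) and SciPy for the Wilcoxon signed-rank test (paired comparison of unprompted versus prompted responses). The ratings and variable names are hypothetical examples, not the study's data.

```python
# Illustrative sketch only: hypothetical 1-4 accuracy ratings for a handful of
# questions, not the study's data. Cohen's kappa measures agreement between the
# two surgeon raters; the Wilcoxon signed-rank test compares paired ratings of
# the unprompted vs. "as an orthopedic surgeon" responses.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import wilcoxon

# Hypothetical ratings (1 = inaccurate ... 4 = fully accurate) for 8 questions.
surgeon_a = [4, 3, 4, 2, 3, 4, 4, 3]   # rater 1, unprompted responses
surgeon_b = [4, 3, 3, 2, 3, 4, 4, 4]   # rater 2, unprompted responses

# Inter-rater agreement on the same set of answers.
kappa = cohen_kappa_score(surgeon_a, surgeon_b)

# Mean rating per question (averaged across raters) for each prompt condition.
unprompted = [4.0, 3.0, 3.5, 2.0, 3.0, 4.0, 4.0, 3.5]
prompted   = [4.0, 3.5, 4.0, 3.0, 3.5, 4.0, 4.0, 4.0]  # "as an orthopedic surgeon"

# Paired, non-parametric comparison of the two prompt conditions.
stat, p_value = wilcoxon(unprompted, prompted)

print(f"Cohen's kappa (inter-rater agreement): {kappa:.2f}")
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p_value:.3f}")
```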
Researchers at the Hospital for Special Surgery in New York City, led by Kyle Kunze, M.D., assessed the propensity of ChatGPT 4.0 to provide accurate medical information about the Latarjet procedure for patients with anterior shoulder instability. The overall goal of the study was to understand whether the chatbot could serve as a clinical adjunct, helping both patients and providers by providing accurate medical information.
To answer this question, the team first conducted a Google search using the query “Latarjet” to extract the top ten frequently asked questions (FAQs) and associated sources concerning the procedure. They then asked ChatGPT to perform the same search for FAQs to identify the questions and sources provided by the chatbot. Highlights of the findings included:
- ChatGPT demonstrated the ability to provide a broad range of clinically relevant questions and answers, and it derived its information from academic sources 100% of the time. In contrast, Google drew on academic resources for only a small percentage of its results, alongside information from surgeons’ personal websites and larger medical practices.
- The most common question category for both ChatGPT and Google was technical details (40%); however, ChatGPT also presented information concerning risks/complications (30%), recovery timeline (20%), and evaluation of surgery (10%).
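To make the querying step concrete, the sketch below shows one plausible way the Latarjet FAQ request could be posed to the model through the OpenAI Python client. The prompt wording, model name, and handling of the output are assumptions for illustration and do not reflect the researchers' actual protocol.

```python
# Illustrative sketch only: one plausible way to ask the chatbot for Latarjet
# FAQs and their sources, loosely mirroring the study's querying step. The
# prompt text and model name are assumptions; an OPENAI_API_KEY environment
# variable is expected by the client.
from openai import OpenAI

client = OpenAI()

prompt = (
    "List the ten most frequently asked patient questions about the Latarjet "
    "procedure for anterior shoulder instability. For each question, give a "
    "concise answer and cite the source of the information."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

# The returned text would then be reviewed manually, as the researchers did,
# to categorize each question and check whether its cited source is academic.
print(response.choices[0].message.content)
```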