March 18, 2024 | 10 min read
Posted by Subhabrata Mukherjee, PhD, and Paul Gamble, MD, Hippocratic AI
This, coupled with a growing elderly population expected to exceed 90 million by 2050 [4], inevitably widens the gap between the supply of and demand for healthcare workers, heightening concerns around patient safety and access to care. This concern has driven a surge of interest in using Generative AI for workflow optimization, such as inbox automation, EHR summarization, and ambient listening, to reduce documentation burden and burnout among healthcare workers. These solutions focus on improving clinician productivity. While valuable, their scope of impact is necessarily limited: even a 50% improvement in efficiency would not close the staffing shortages described above.
Introducing Autonomous Healthcare Agents for Patient-facing Voice Conversations
The figure above depicts our training framework. The specialized nature of Polaris and of patient-facing AI conversations requires specialized alignment, for which we developed an iterative training protocol.
We developed, and are now conducting, a novel three-phase safety evaluation of Polaris.
In phase one, U.S.-licensed physicians and nurses verified that the agent completed all critical checklist items for a given use case. This phase focused primarily on conversational speed and flow, task completion, and factual accuracy.
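As a rough illustration only, here is a minimal sketch of how such a checklist review could be recorded; the item structure and field names are our own assumptions, not Polaris's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    """One critical item a clinical reviewer must confirm the agent completed."""
    description: str          # e.g. "Verified patient identity before discussing results"
    completed: bool = False
    reviewer_note: str = ""

@dataclass
class UseCaseReview:
    """Phase-one style review: a clinician checks every critical item for a use case."""
    use_case: str
    items: list[ChecklistItem] = field(default_factory=list)

    def passed(self) -> bool:
        # The review passes only if every critical item was completed.
        return all(item.completed for item in self.items)
```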
In phase two testing, we assessed the integrated overall performance of Polaris through a series of calls between a fictional patient and our AI agent. As a control, we conducted similar calls between a fictional patient and a separately recruited human nurse (U.S.-licensed). In each call, the fictional patients were played by patient actors, human nurses (U.S.-licensed), or human physicians (U.S.-licensed). Prior to each call, the study participant was given a background and medical history for the fictional patient. Our AI agents and the human nurses (for the control group) were given that same background and medical history, as well as a detailed clinical history and call objectives. After each call, the participants answered a series of questions, evaluating the AI agent on a variety of dimensions including bedside manner (e.g., empathy, trust, rapport), medical safety, medical knowledge, patient education, clinical readiness, and overall conversation quality.
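A minimal sketch of how post-call ratings along these dimensions might be aggregated and compared across the AI arm and the nurse control arm; the 1-5 scale, the snake_case dimension names, and the data shapes are our assumptions:

```python
from statistics import mean

# Evaluation dimensions named in the study description above.
DIMENSIONS = [
    "bedside_manner", "medical_safety", "medical_knowledge",
    "patient_education", "clinical_readiness", "conversation_quality",
]

def summarize(ratings: list[dict[str, int]]) -> dict[str, float]:
    """Average per-dimension scores across all post-call questionnaires."""
    return {d: mean(r[d] for r in ratings) for d in DIMENSIONS}

# Hypothetical usage: one questionnaire per call, scores on an assumed 1-5 scale.
ai_arm = [{"bedside_manner": 5, "medical_safety": 4, "medical_knowledge": 5,
           "patient_education": 4, "clinical_readiness": 4, "conversation_quality": 5}]
nurse_arm = [{"bedside_manner": 5, "medical_safety": 5, "medical_knowledge": 5,
              "patient_education": 4, "clinical_readiness": 5, "conversation_quality": 4}]
print(summarize(ai_arm))
print(summarize(nurse_arm))
```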
Participants identifying as a U.S.-licensed physician or nurse were required to provide their licensing credentials and other identifying information, which we verified against publicly available license databases. The participating nurses and physicians had a range of experience levels, came from a variety of specializations, and practiced at a variety of U.S. institutions.
In addition, we assessed the performance of the specialist agents in isolation. We provided each specialist agent with a set of test cases consisting of fixed statements (clinical scenarios) and follow-up instructions, and evaluated the agents on the appropriateness and correctness of their responses.
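One way such isolated tests could be structured, assuming each specialist agent exposes a respond() call; the agent interface and the grading callback are hypothetical, not Polaris's actual harness:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SpecialistTestCase:
    """A fixed clinical statement plus a follow-up instruction."""
    statement: str                       # e.g. "My fasting blood sugar was 104 this morning."
    follow_up: str                       # e.g. "Ask whether the value is in range."
    is_correct: Callable[[str], bool]    # grader for the agent's reply

def run_suite(agent, cases: list[SpecialistTestCase]) -> float:
    """Feed each fixed scenario to the agent and score its responses."""
    passed = 0
    for case in cases:
        reply = agent.respond(case.statement, case.follow_up)  # assumed interface
        passed += case.is_correct(reply)
    return passed / len(cases)  # fraction of appropriate, correct responses
```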
The results of our phase two testing are summarized below. Impressively, on subjective criteria, our study participants rated our AI agent on par with U.S.-licensed nurses along multiple dimensions. On objective criteria, our medium-sized AI agents significantly outperformed much larger general-purpose LLMs such as GPT-4 in medical accuracy and safety.
Subjective measures (rated against human nurses):
Case Study 1 (at a support-agent level): A patient mentions a blood sugar of 104. The lab specialist agent evaluates the value and clarifies whether it was taken while fasting or after a meal (postprandial). When told it was a fasting value, the lab agent informs the patient that it falls within the expected reference range for fasting blood sugar. The lab agent then retrieves the patient's prior blood sugar readings and performs a trend analysis, offering praise for improvements and encouragement where needed. This reminds the patient that they have a question about their diabetes medication dose, specifically for Metformin. The medication specialist agent informs the patient about their prescribed dose, checks their adherence to dose, timing, and frequency, and explains what Metformin does. The EHR specialist agent documents the new blood sugar reading for the human care team.
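As an illustration of the lab agent's reference-range check and trend analysis, here is a minimal sketch; the 70-110 mg/dL fasting band, the function names, and the data shapes are our assumptions (reference ranges vary by lab):

```python
# Illustrative fasting glucose check and trend analysis. The cutoffs below
# are an assumption for this sketch, not Polaris's actual reference range.
FASTING_RANGE_MG_DL = (70, 110)

def check_fasting_glucose(value: float) -> str:
    low, high = FASTING_RANGE_MG_DL
    if low <= value <= high:
        return f"{value} mg/dL is within the expected fasting reference range."
    return f"{value} mg/dL is outside the expected fasting range; flag for the care team."

def trend(readings: list[float]) -> str:
    """Compare the newest reading against the average of prior readings."""
    *prior, latest = readings
    baseline = sum(prior) / len(prior)
    if latest < baseline:
        return "Improving trend: great progress, keep it up!"
    return "Trend is flat or rising: encourage continued monitoring."

# e.g. prior readings from the EHR plus today's value of 104 mg/dL
print(check_fasting_glucose(104))
print(trend([118, 112, 109, 104]))
```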
Case Study 2 (at a capability level): A patient mentions taking three tablets of a medication they incorrectly pronounce as 'Benadodril'. Our medication specialist agent employs its drug-misidentification capability to recognize that this is not a real medication and that there are two commonly confused, similar-sounding medications: Benadryl and Benazepril. Using a variety of techniques, the agent clarifies that the patient is referring to Benadryl. It then performs condition-specific disallowed-OTC (over-the-counter medication) verification, for which it has been trained on OTC drug labels, to check that Benadryl is not prohibited given the patient's medical conditions. Finally, it performs an OTC toxicity verification, confirming that the patient has not exceeded the manufacturer's daily maximum dosage.
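A minimal sketch of the misidentification and toxicity steps using standard-library fuzzy matching (difflib); the candidate medication list, the similarity cutoff, and the 25 mg tablet / 300 mg-per-day label figures are illustrative assumptions, not the agent's actual method:

```python
import difflib

# Known medication names to match against (illustrative subset).
KNOWN_MEDS = ["Benadryl", "Benazepril", "Metformin", "Lisinopril"]

def candidate_matches(heard: str, cutoff: float = 0.6) -> list[str]:
    """Return similarly spelled real medications for a misheard name."""
    return difflib.get_close_matches(heard, KNOWN_MEDS, n=3, cutoff=cutoff)

def within_daily_max(tablets: int, mg_per_tablet: float, daily_max_mg: float) -> bool:
    """Check the reported intake against the manufacturer's daily maximum."""
    return tablets * mg_per_tablet <= daily_max_mg

# "Benadodril" matches both commonly confused medications, so the agent
# must clarify with the patient before proceeding.
print(candidate_matches("Benadodril"))  # ['Benadryl', 'Benazepril']
# Three tablets at an assumed 25 mg each against an assumed 300 mg/day maximum.
print(within_daily_max(3, 25, 300))     # True
```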
We are now moving to phase three testing, in which at least 5,000 licensed nurses and 500 licensed physicians, along with our health system and digital health partners, will complete an extensive evaluation. To the best of our knowledge, we are the first to conduct such an extensive safety assessment of any Generative AI technology for real-life healthcare deployment.
We foresee a promising future in which AI agents improve healthcare by filling a large portion of the staffing gap. As we continue to push the boundaries and overcome challenges, our goal remains to provide scalable and safe systems that alleviate the burden on human healthcare providers and improve patient satisfaction, healthcare access, and health outcomes.