How OpenEvidence Is Building a Better Chatbot for Medical Researchers

Early-stage venture capital fund Xfund focuses on identifying and building relationships with high-potential founders who “think laterally”—beyond the boundaries of disciplines and industry categories. The firm moves into emerging markets in collaboration with its founder-partners and has a reputation for spotting and cultivating new entrepreneurial talent. For many of the startups in its portfolio, Xfund is the first or among the first investors.

This is notably the case in Xfund’s longtime collaboration with polymath founder Daniel Nadler. Xfund was an early investor in Nadler’s first company, Kensho Technologies, and his latest, OpenEvidence. “It’s an honor to partner with Daniel on this latest company,” beamed Patrick S. Chung, Managing General Partner of Xfund. “Daniel represents for us the ideal liberal arts founder—the kind of person that Xfund was built to back.”

Daniel Nadler – poet, artist, innovator 

One of Nadler’s career tracks is as a poet. He earned an NPR Best Book of the Year award for his 2016 debut collection Lacunae: 100 Imagined Ancient Love Poems. As a visual artist, he uses neural networks of his own design to morph his photographs into expressive, digitally transfigured images. He also creates sculptures using traditional bronze-casting and three-dimensional digital sculpting techniques in a mixed-media approach unique to his creative process.  

As an entrepreneur, Nadler founded Kensho in 2013, while still immersed in his doctoral program at Harvard University. Five years later, S&P Global acquired Kensho for $550 million, a figure Forbes described at the time as the largest acquisition price paid for an AI company to date. Nadler completed his PhD in 2016, with his doctoral work centered on statistical and econometric innovations in modeling low-probability, high-impact events. Today, Kensho integrates its machine learning capabilities into S&P Global’s vast universe of data to enhance the company’s real-time business intelligence.

Expanding access to newer, better data 

Nadler developed OpenEvidence to fill in the blank spaces in the map of ChatGPT’s data universe. While businesses across many sectors have rushed to develop applications using OpenAI’s GPT-4, the technology’s limitations have become as visible as its successes. For instance, until fall 2023, the quality of information that ChatGPT could provide was time-limited, in that the dataset on which it was trained wasn’t updated beyond September 2021.

This is one of the key limitations of large language models (LLMs) like GPT-4 for complex fields like medicine and law, where accuracy is a life-or-death issue. Physicians, lawyers, journalists, financial professionals, and researchers in many other fields can’t rely on LLMs to provide actionable professional insights without constant, costly retraining and refreshing of the models’ datasets with up-to-the-nanosecond information.

OpenEvidence’s “open book” 

This is the problem Daniel Nadler set himself to solve when he assembled a team of fellow PhDs and a supercomputer, working with more than $30 million in capital funding. Nadler and his team built OpenEvidence, which he has described as adding “open book” capabilities to traditional “closed book” LLMs that have not undergone updating and retraining on new data.  

Nadler’s journey toward founding OpenEvidence began with his observation that, during the first years of the COVID-19 pandemic, medical professionals were overwhelmed by a ballooning body of research on every aspect of the disease. The overload is chronic for researchers studying other medical conditions as well, who operate in a world where new papers are published at a rate of two per minute. As on the trading floor, so in the research lab and the consultation room: professionals struggle to hear and calibrate the signals amid the noise.

Nadler’s solution was to create the OpenEvidence chatbot to integrate a “real-time firehose” of new research documents into existing LLMs. Simply put, this involves opening up access to new data for the AI before it begins working on the answer to a prompt. Researchers call this method retrieval-augmented generation, or RAG.
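The core of the RAG pattern can be sketched in a few lines: retrieve the most relevant documents first, then hand them to the model alongside the question. The toy corpus, keyword-overlap scorer, and prompt template below are invented for illustration; OpenEvidence’s production pipeline is proprietary and far more sophisticated.

```python
# Minimal RAG sketch: retrieve relevant documents, then build an
# evidence-grounded prompt for the LLM. All data here is invented.
from collections import Counter

# Toy corpus standing in for a store of recent research abstracts.
CORPUS = {
    "doc1": "paxlovid reduces hospitalization risk in high-risk covid patients",
    "doc2": "statins lower ldl cholesterol and cardiovascular event rates",
    "doc3": "ba.5 covid variant shows partial immune escape from prior infection",
}

def score(query: str, text: str) -> int:
    """Toy relevance score: count of query terms that appear in the text."""
    terms = Counter(query.lower().split())
    words = text.lower().split()
    return sum(min(terms[w], words.count(w)) for w in terms)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the ids of the k most relevant documents."""
    ranked = sorted(CORPUS, key=lambda d: score(query, CORPUS[d]), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Prepend retrieved evidence to the question before calling the LLM."""
    evidence = "\n".join(f"[{d}] {CORPUS[d]}" for d in retrieve(query))
    return (
        "Answer using only the sources below, citing them by id.\n"
        f"{evidence}\n\nQuestion: {query}"
    )

print(build_prompt("what is known about the ba.5 covid variant"))
```

Because the evidence is fetched at question time, the model can answer about material published long after its training cutoff, and each answer carries the document ids a reader can check.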

The result: Asking the OpenEvidence chatbot a clinical question surfaces a wide range of new studies and citations published since any particular LLM’s last update. Each citation is paired with clearly identified passages from the original text, allowing researchers to immediately cross-check answers against the AI’s sources.

Getting to the root of the LLM problem 

Other major problems that have become apparent with LLM-powered chatbots are their capacity for producing incomplete, decontextualized, or outright wrong answers, distorting accurate information, and “hallucinating.” This last phenomenon involves an LLM “making up” a completely fake piece of data or citing nonexistent sources when it encounters a question outside the dataset it was trained on.

The root of all these problems is that the quality of an LLM’s answers is only as good as the internet data that serves as its foundation, and the internet’s flawed sources begin with flawed human beings.

Answering the needs of medical professionals 

That’s why OpenEvidence is doing foundational work to ground its LLM in ways that specifically support the medical field, where researchers prize accuracy and currency above all else.  

By the fall of 2023, some 10,000 clinicians had created OpenEvidence accounts, drawn by the chatbot’s performance in the Mayo Clinic Platform Accelerate program. OpenEvidence is now challenging UpToDate, the largest database of its kind, which has already built a global user base of 2 million healthcare professionals.

However, UpToDate is human-written and human-edited, whereas OpenEvidence takes full advantage of increasingly sophisticated and interactive AI capabilities. Rather than input general queries, users can pose more specific questions based on their patient’s case. One user of both systems told Forbes that OpenEvidence provides superior time savings over its larger rival, producing answers in seconds, rather than the minutes needed by UpToDate.  

OpenEvidence can scan tens of thousands of peer-reviewed journals in its 35-million-article dataset, sparing its users the chore of sifting through massive amounts of text. It delivers not only source citations along with its responses, but also the full text of the relevant sections of its citations. Its turnaround time for making new material accessible is around 24 hours.

OpenEvidence also follows the standard scholarly convention of ranking frequently cited journals higher in its search results than those cited less often, which improves the quality of its results.
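A citation-weighted re-ranking of this kind could look like the following sketch. The journal citation counts and the log-weighting scheme are assumptions made for illustration, not OpenEvidence’s actual formula.

```python
# Sketch: re-rank search hits so articles from frequently cited journals
# rise in the results. Citation counts below are invented placeholders.
import math

CITATIONS_PER_JOURNAL = {"NEJM": 400_000, "Lancet": 350_000, "Obscure J": 1_200}

def rank_results(results):
    """results: list of (article_id, relevance, journal) tuples.
    Weight raw relevance by the log of the journal's citation count,
    so a high-impact venue can outrank a slightly more relevant hit
    from a rarely cited one."""
    def weight(item):
        _, relevance, journal = item
        return relevance * math.log1p(CITATIONS_PER_JOURNAL.get(journal, 0))
    return sorted(results, key=weight, reverse=True)

hits = [("a1", 0.8, "Obscure J"), ("a2", 0.7, "NEJM")]
# The NEJM article ranks first despite its slightly lower raw relevance.
print(rank_results(hits))
```

The log damping keeps a journal’s sheer size from completely swamping topical relevance, one common way to balance the two signals.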

Pushing LLMs to new heights 

In a remarkable achievement, as Nadler describes it, OpenEvidence “has become the first AI in history to score above 90% on the United States Medical Licensing Examination (USMLE). Previously, AIs such as ChatGPT and Google’s Med-PaLM 2 have reported scores of 59% and 86%, respectively… A widely cited study published in the BMJ in 2016 estimated that medical errors were the third leading cause of death in the United States, after heart disease and cancer. At that scale, any system that could augment a physician and reduce medical errors on an absolute basis by even 5-10% would be extraordinarily impactful to the lives of tens of thousands of patients in the United States alone.”

Chung says that “results like this bring into stark relief the fact that ChatGPT is simply not at the accuracy threshold for it to be usable in a clinical setting.”

Nadler continues: “OpenEvidence is a retrieval-augmented language model system (like a language models-meets-a-medical search engine), so it is not pulling answers from just its weights (which risks hallucination) like ChatGPT does; rather it goes out and finds the relevant clinical trials and guidelines and reads them, and answers on that basis, which also makes it current (ChatGPT keeps telling you it knows nothing about the BA.5 COVID variant since its knowledge is cutoff in September 2021). Also, critically, OpenEvidence provides citations and sources, so physicians can trace and trust the answers.” 

LLMs are revolutionary in their ability to mimic human intelligence and increase productivity in almost every industry. It’s clear, however, that researchers must address serious issues with accuracy and limitations on training—especially if medicine, the law, and other “accuracy-critical” domains are to benefit from LLMs. That’s why OpenEvidence’s work is so exciting: it represents a foundational step toward LLMs reaching their full potential.