LLMs Are Turning Science Fiction into Science Reality: A Case in Healthcare

Like many others, we’ve been intrigued by the releases of the first LLM chatbots, ChatGPT and Bing. We’ve been playing with them and with LLM APIs for a while now.
Based on these experiments and the curiosity these new tangible tools bring, we came to ask a question: What if we brought an LLM to the doctor’s office?
And we did – in a demo we set up with an actual doctor. Next, we’ll walk you through exactly what we did and the counterintuitive conclusions and suspicions it brought up.
But, as they say, let’s start from the beginning.
LLM What?
Language models are a development in machine learning from roughly the 2010s: you feed a model a mass of text and get out a thing that can… predict how the text would continue.
This is convenient because no training target beyond the text itself is needed. You would expect a larger model and more text to give somewhat better performance until it saturates, right? Yes! Except that researchers found saturation surprisingly hard to reach. Quite the contrary, performance just seems to keep improving (1).
Text prediction is a deceptively simple target. Predicting the next word is commonplace and useful in a word processor: it autocompletes. But what if you end the text with a question mark, so that the prediction becomes an answer to a question? How far can you go with this? Will the model know that Tuesday follows Monday, that Mars is the fourth planet?
It seems that with enough scale, it does, and much more.
So, we got the large language models, LLMs. GPT-3 was the first at the modern scale, published in 2020. It was so large that one could not really download it, even if it were free, and a rack of GPUs was needed just to run it.
Then, in November 2022, along came ChatGPT – and boom. In February 2023, another LLM, Bing, or “Sydney” as it secretly identifies (sic), found its confidential details leaked to Twitter and got very angry. And in general, it sometimes behaves like a “manic-depressive teenager trapped in a mediocre search engine.”
None of this oddness was programmed; it emerged from the LLM and its interactions with users and the media. Still an autocomplete, but clearly, the scale… well, we don’t even know what to say. Apparently, large LLMs sometimes simulate personalities. They also reason, especially when prompted to do so “step by step.”
So LLMs, and related technologies, look like a game-changer. Eventually, with analogous transformers for speech and non-verbal communication, LLMs may change how we interact with all things digital.
Leaving eventualities aside, the popular interface for now is chat, where the LLM plays the role of a writing or coding assistant or a glorified search engine. This is likely to diversify when LLMs are integrated into existing tools. LLMs will also be integrated into production systems, where human interaction is less direct. We have a few such uses in mind, but first, we would like to present the aforementioned demo in healthcare.
A Test: The Problem at the Doctor’s
Before joining Reaktor, Mikko worked for a university hospital where one of his projects had him observe doctors’ behavior on their computers while seeing patients.
Mikko:
I remember thinking about how inefficient the process seemed. Doctors often had trouble finding the relevant information and spent a significant amount of time typing up notes while the patients were speaking. In fact, studies show that doctors tend to spend, on average, about 16 minutes per patient visit on their computer (2), and this is time taken directly away from undistracted interaction with the patient.
I had an idea for tackling this problem: What if we removed the computer from the situation? What if the doctor didn’t have to stare and type away at the screen? What if, instead, they could focus on the patient in the room?
Practically speaking, I thought about recording doctor-patient interactions and then using AI to automate the information filing so that the doctor would only have to review and confirm the next steps at the end of the visit.
Until very recently, I was under the impression that the tech needed for this to work just wasn’t good enough yet.
But with recent leaps in large language models, it feels like maybe we’re finally getting there. I sat down with Dr. Antti Pitkänen, MD, the collaborating doctor in this experiment, to discuss the process and results. Below I will share what we did, what we learned, and how we’re thinking about the future of AI in healthcare.
What We Did – Step by Step
- Dr. Pitkänen and I role-played three separate doctor’s visits.
- The first visit was about an actual acute injury, an ankle I had sprained badly while walking my dog a week earlier.
- The second recreated a visit for an older injury of mine: shoulder pain caused by a climbing accident.
- The third one was also due to an actual older sports injury, an aching knee that was only painful in very particular positions.
- All these visits were audio recorded with a mobile phone attached to a small microphone.
- During each, we underwent the normal procedures of an appointment, including a conversation between myself and the doctor about the issues I was experiencing, testing range of motion and pain occurrence, ordering X-rays, giving a diagnosis, prescribing medication, planning surgery, etc.
- We then took the recordings from the mobile phone and used OpenAI’s Whisper to transcribe them into text.
- From there, we used the OpenAI GPT API (text-davinci-003) to prompt the model to extract the relevant types of information. A minimal sketch of this pipeline follows the list.
- Later, we split the transcription in two, before and after the imaging (X-rays, MRI), since those were actually two separate appointments, and that gave even better results.
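For the technically curious, the whole pipeline fits in a handful of lines. Below is a minimal sketch under stated assumptions: it uses the open-source whisper package and the legacy (pre-1.0) openai Python client with the completions endpoint we used at the time, and the file name and prompt wording are illustrative rather than the exact ones from the demo.

```python
# Minimal sketch of the demo pipeline: audio -> transcript -> draft summary.
# Assumptions: open-source `whisper` package, legacy (pre-1.0) `openai` client,
# and an illustrative prompt, not the exact one used in the demo.
import whisper
import openai

openai.api_key = "YOUR_API_KEY"

# 1. Transcribe the (uncompressed WAV) recording of the visit.
asr_model = whisper.load_model("large")
transcript = asr_model.transcribe("visit_ankle.wav", language="fi")["text"]

# 2. Ask GPT to pull out the clinically relevant points as a draft summary.
prompt = (
    "Below is a transcript of a doctor's appointment. Summarize the patient's "
    "symptoms, the findings, the diagnosis, and the agreed next steps as a short "
    "draft for the patient record.\n\n"
    f"Transcript:\n{transcript}\n\nSummary:"
)
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=500,
    temperature=0,
)
print(response["choices"][0]["text"].strip())
```

Everything beyond this was essentially variations of the prompt over the same transcripts.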
Listen to the audio recording of the demo (in Finnish):
Transcript – Read the full transcript of the demo (in Finnish) here:
Summary – Read the full summary of the demo here:
What We Learned
- Whisper transcribed English well and Finnish at a level good enough for this use. Results seemed much better when transcribed from a WAV instead of the compressed AAC format.
- We didn’t fine-tune any models. That is, except for some light prompt design, we didn’t do anything technically worth mentioning. One seems to get quite far just putting basic pieces, APIs and open software, together. Getting the first results took less than a day of working time.
- The accuracy in understanding speech and extracting the main points of the conversation was impressive. Even from less coherent, error-ridden transcripts, GPT could understand and summarize the essential parts.
- While the speech was in Finnish, we could prompt the summaries in Finnish or English. Quality was best in English, but the pipeline presumably works without change across all major languages.
- GPT does not always get the big picture right. And getting the generated summaries to work as easily editable drafts of epicrises, with enough but not too much detail, needs more development work than we did here. This extra work would be empirical, mostly prompt tuning; a sketch of what that looks like follows this list.
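Most of that tuning lives in the prompt rather than in the code. The helper below is a hypothetical illustration of the kind of knobs involved; the instruction wording, the language switch, and the detail levels are our guesses, not the prompts actually used in the demo.

```python
# Hypothetical prompt builder illustrating "prompt tuning": the same transcript
# can be summarized at different detail levels and in different languages.
def summary_prompt(transcript: str, language: str = "English", detail: str = "concise") -> str:
    instruction = (
        f"Write a {detail} draft epicrisis in {language} based on the appointment "
        "transcript below. Include symptoms, findings, diagnosis and the follow-up "
        "plan; leave out small talk and repetition."
    )
    return f"{instruction}\n\nTranscript:\n{transcript}\n\nDraft:"

# e.g. summary_prompt(transcript, language="Finnish", detail="detailed")
```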
Implementing what we did in an actual care clinic would also face regulation and privacy challenges, not to mention cultural challenges. Data privacy, especially regarding medical records, concerns everybody and is an important consideration for a good reason. It’s also common for new technological advancements to take time before becoming a normal part of the ways of working and fitting into everyday operations.
GPT uses its domain knowledge to make sense of low-quality transcripts, so while it is not quite making diagnoses, it does take some freedom of interpretation. For example, “tramaal” in the transcript (correctly) came to mean Tramadol, an opiate pain medication, in the summary. But sometimes it lumped NSAIDs and Tramadol together as “anti-inflammatory medications”, which is incorrect. So the automatic summary needed to be checked and, of course, completed.
The models could produce formalized records, but should they?
The current electronic patient record systems, as well as many other IT systems used in patient care (or in running a business, think ERPs), rely on and push heavily towards formalized data. What this means is that data is stored in predetermined fields. The formalized data structure that corresponds to the sentence “I am a 44-year-old male, and I am 174 centimeters tall” is something like this:

| Field | Value |
| --- | --- |
| age | 44 |
| sex | male |
| height (cm) | 174 |
The current way is to fill in this table and discard the original sentence. This is really handy for research and any transactions, where things can easily be queried and counted from the table. But what if the original sentence was, “I am a 44-year-old male, and I am 174 cm tall, I also have blue eyes”? The fields are predetermined, and we would lose the eye color information.
This is a naive example, but the point is that all formalization causes data loss. So far, the benefits have seemed bigger than the problems.
However, in cases such as electronic health records, a lot of the data is still found in free text because, for a human, that is the most efficient way to communicate it. In other tests not reported here, LLMs also shine in formalization.
From an epicrisis, you can, for example, ask for the smoking status of the patient (which is anything but trivial (3)), or you can ask for a summary as JSON, with fields specified by you. This finding opens up the possibility of having the best of both worlds: saving the original data, which is of course the most complete record of the event, and creating formalized data when you need it, with the information you need.
This approach has two benefits:
- You can always recreate the processed data as the tools develop. Since you can use the most advanced tools available at any time, you always get the best possible results; you don’t need to settle for what was available when the transformation was first made.
- You can never know in advance what data needs to be formalized (or how). With scientific advances and changing ways of working, the data you need formalized will change. If you only have the predetermined fields available, you’re out of luck. By keeping the most accurate data available, you can always create formalizations that are imperfect but contain exactly what you need. The tools and needs will change, but the record will not. To quote our Reaktor colleague Timo Suomela: “Software is ephemeral, data is eternal.”
Automatic formalization is now possible and is likely to get really cheap.
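As a concrete illustration of formalization on demand, here is a minimal sketch of asking the model to fill a caller-specified set of fields from a free-text epicrisis. The field names, the prompt wording, and the lack of output validation are simplifications for illustration, not a production recipe; it again assumes the legacy openai client.

```python
import json
import openai

# Hypothetical on-demand formalization: extract caller-specified fields from free text.
def formalize(free_text: str, fields: list[str]) -> dict:
    prompt = (
        "Extract the following fields from the clinical text below and answer with "
        f"a single JSON object with exactly these keys: {', '.join(fields)}. "
        "Use null for anything not mentioned.\n\n"
        f"Text:\n{free_text}\n\nJSON:"
    )
    response = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=300, temperature=0
    )
    # In practice the returned JSON should be validated; the model can occasionally drift.
    return json.loads(response["choices"][0]["text"])

# e.g. formalize(epicrisis_text, ["age", "sex", "height_cm", "smoking_status", "diagnosis"])
```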
As for Dr. Pitkänen’s thoughts on the experiment and his hopes for its future viability, he says:
“I would say that the most enticing aspect of this for doctors would definitely be the time-saving component. I would imagine doctors hoping that this type of technology would be more of a digital assistant — something that can help with many of the menial tasks such as prescribing lab tests, sick leaves, scheduling surgeries, or setting up the next steps. This would give more time for the patient and their needs. Even with the need for doctors to manually verify that everything was processed and filed correctly, the time saved would be significant.”
And indeed, current technological trends promise a lot for those more general (digital) assistant needs. Using things like pseudo tokens, we can create action outputs from the model (4, 5). Compared to the assistant systems currently in use (Apple Siri, Amazon Alexa, etc.), this adds a layer of abstraction that greatly extends the range of possibilities: we can give more complex directions in varying ways and have the systems execute more complex tasks based on those directions.
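To make the action-output idea a bit more concrete: one common pattern is to prompt the model to emit structured markers inside its text, which a thin layer then parses and routes to real systems after human confirmation. The marker syntax and the action name below are purely hypothetical, not taken from any of the cited systems.

```python
import re

# Hypothetical action markers the model could be prompted to emit, e.g.
#   "We will order an X-ray first. [ACTION: order_xray(body_part=\"ankle\")]"
ACTION_PATTERN = re.compile(r"\[ACTION:\s*(\w+)\((.*?)\)\]")

def parse_actions(model_output: str) -> list[tuple[str, str]]:
    """Return (action_name, raw_arguments) pairs found in the model's output."""
    return ACTION_PATTERN.findall(model_output)

def dispatch(actions) -> None:
    for name, args in actions:
        # A real system would map each action to an existing API
        # (scheduling, lab orders, prescriptions) and ask the doctor to confirm.
        print(f"Would execute {name} with arguments: {args}")

dispatch(parse_actions('We will order an X-ray first. [ACTION: order_xray(body_part="ankle")]'))
```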
Interacting with the assistants will become more like interacting with people.
What Does This Mean for the Future of the Digital in Healthcare?
In medicine, not much AI has shown up in production so far, although expectations have been high. The medical industry is heavily regulated, it’s difficult to access training data, and privacy aspects can be tricky. On top of that, dealing with mistakes made by AIs is another big challenge: not because AIs make mistakes (humans do too), but because the types of mistakes are different. This means we might not be good at finding or detecting them, and when that is the case, the threshold for adopting these technologies becomes very high. Think of where we are with self-driving cars.
The core of our demo, transformer-based models including LLMs, is not medicine-specific. As a safe bet, these techniques offer at minimum a wide set of language transformations: translation, formalization and its inverse (explanation), summarization, and transformations of style, including fluency.
LLMs are currently popularized mainly through the chat interface, but there is great potential in integrating them into all processes that involve written content. Obvious examples analogous to our case are user interfaces whose formalized forms have grown too large and tedious; such UX could be hybridized to be partly free text (6), with the user later filling in what the free text missed.
The models, however, have some conversational and reasoning abilities and a huge, although sometimes inaccurate, domain knowledge. These help with the relatively simple transformation-style tasks but may later take the technology much further and give it a broader impact.
At the doctor’s office, the role of LLMs would be supportive, because the stakes are high and ultimately we do not leave clinical decision-making to any IT infrastructure. Doctors often see IT as a major time drain. If these interactions could be even partly automated, the new complementary intelligence would result, at the very least, in time savings, leading to better care for patients.
So, if a bit of extrapolation is allowed:
- In the future, doctors will interact less with computers and more with patients.
- In addition to lab results and imaging, recordings of patient visits are the ultimate raw data. You can use them for retrieving details, and also for training, etc.
- Instead of finding a canonical formalization and dropping the original, why not keep the original as the canonical information and do formalization on demand? Yes, even with current technology, you can ask the machine about the details of the visit, and it will find them from the audio! In this scenario, transcripts of the audio, and relational database records, are accepted as ephemeral and incomplete.
- And in the long run, ironically, interacting with computers becomes more like interacting with humans.
But what does our doctor say?
“Overall, I think people are fairly welcoming about emerging technologies in health. Especially in cutting down on mistakes and removing repetitive tasks. However, it does still feel like there are some ways to go for these ideas to manifest into reality. But I do feel hopeful — this type of intelligence would certainly change care for the better,” concludes Dr. Pitkänen.
At Reaktor, Mikko Koskinen is the Competence Area Lead of AI & Data, and Janne Sinkkonen is a Senior Data Scientist.
References:
1. https://stanford-cs324.github.io/winter2022/assets/pdfs/Scaling%20laws%20pdf.pdf
2. https://www.acpjournals.org/doi/10.7326/M18-3684
3. https://www.cdc.gov/nchs/nhis/tobacco/tobacco_recodes.htm
4. https://www.adept.ai/blog/act-1
5. https://arxiv.org/pdf/2302.07842.pdf
6. https://medium.com/user-experience-design-1/designers-view-on-large-language-models-386828e2140f
