The article is a week old and was already submitted a few days ago; the problem remains of finding some paper to shed more light on the practice.
A blog article* came out yesterday, but it is not immediately clear whether the author wrote only what he understood or whether he knows more.
But much perplexity remains: «summarization, which is what LLMs generally excel at» (original article); «The LLM ... reads the patient’s records ... and produces a summary or list of facts» (blog). That is only the beginning, and some of us will already be worried, as the summarization capabilities we have experienced from LLMs were neither intelligent nor reliable. (...Or have new studies come out and determined that LLMs have finally become reliable, if not cognitively proficient, at summarization?)
* https://usmanshaheen.wordpress.com/2025/03/14/reverse-rag-re...
Can someone more versed in the field comment on whether this is just an ad or actually something unique or novel?
What they're describing as "reverse RAG" sounds a lot to me like "RAG with citations", which is a common technique. Am I misunderstanding?
"Mayo’s LLM split the summaries it generated into individual facts, then matched those back to source documents. A second LLM then scored how well the facts aligned with those sources, specifically if there was a causal relationship between the two."
It doesn't sound novel from the article. I built something similar over a year ago. Here's a related example from LangChain, "How to get a RAG application to add citations": https://python.langchain.com/docs/how_to/qa_citations/
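For reference, a minimal sketch of the two-step pattern the quote describes: split the generated summary into atomic facts, then have a second model score each fact against the sources. The model name, prompts, and helper names here are illustrative assumptions, not Mayo's actual pipeline:

```python
# Illustrative sketch (not Mayo's pipeline): extract facts from a generated
# summary, then have a second LLM pass score each fact against the sources.
import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumption; any chat model would do

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def extract_facts(summary: str) -> list[str]:
    # Step 1: split the generated summary into atomic facts.
    raw = ask(
        "Split this summary into a JSON array of short, self-contained facts. "
        "Return only the JSON array.\n\n" + summary
    )
    return json.loads(raw)

def score_fact(fact: str, sources: list[str]) -> float:
    # Step 2: a second LLM call scores how well the sources support the fact.
    raw = ask(
        "On a scale of 0 to 1, how well do these source excerpts support the "
        "claim? Reply with only a number.\n\nClaim: " + fact +
        "\n\nSources:\n" + "\n---\n".join(sources)
    )
    return float(raw.strip())

def unsupported_facts(summary: str, sources: list[str], threshold: float = 0.7):
    # Return every fact whose support score falls below the threshold.
    return [
        (fact, score)
        for fact in extract_facts(summary)
        if (score := score_fact(fact, sources)) < threshold
    ]
```

The notable part of the quoted description is that verification runs on the model's output rather than only on the retrieval step, which is presumably where the "reverse" in "reverse RAG" comes from.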
A consultant sold them something with a high margin; they need to justify the bill.
The article is too high level to figure out exactly what they are doing.
In the publishing industry we call that "cooking a press release". The "news" article was entirely written and mailed out by the subject's PR department (Mayo Clinic here), and the "journalist" just copied and pasted it. At most they will reword a couple of paragraphs, not for fear of looking bad, but just to fit the word count required for the column they are publishing under.
So, yes, an advertisement.
Isn't that essentially how the AP has functioned for over a century? (Consume press release, produce news article, often nearly verbatim.)
You’re thinking of PR Newswire.
The AP pays reporters to go out and report.
I read a lot of AP articles that aren't verbatim press releases... you must be in the classifieds or something.
Well, the title said "secret" after all ...
> where the model extracts relevant information, then links every data point back to its original source content.
I use ChatGPT. When I ask it something 'real/actual' (non-dev), I ask it to give me references in every prompt. So when I ask it to tell me about "the battle of XYZ", I ask it within the same prompt to give me websites/sources, which I click to check whether the quote is actually from there (a quick Ctrl+F will bring up the name/date/etc.).
Since I've been doing this I get near-zero hallucinations. They did the same.
> (a quick Ctrl+F will bring up the name/date/etc.)
Have you tried asking for the citation links to also include a WebFragment to save you the searching? (e.g. https://news.ycombinator.com/item?id=43372171#:~:text=a%20qu... )
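Generating those links is easy to automate; here is a small sketch (the helper name is made up, and the `#:~:text=` syntax is the standard URL text-fragment feature):

```python
# Sketch: turn a source URL plus a verbatim quote into a text-fragment link,
# so clicking the citation jumps to (and highlights) the quoted text.
from urllib.parse import quote

def text_fragment_link(url: str, quoted_text: str) -> str:
    # Percent-encode the quote so spaces, commas, etc. don't break the fragment syntax.
    return f"{url}#:~:text={quote(quoted_text, safe='')}"

print(text_fragment_link(
    "https://news.ycombinator.com/item?id=43372171",
    "a quick Ctrl+F",
))
# -> https://news.ycombinator.com/item?id=43372171#:~:text=a%20quick%20Ctrl%2BF
```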
I have an application that does this. When the AI response comes back, there's code that checks the citation pointers to ensure they were part of the request and flags the response as problematic if any of the citation pointers are invalid.
The idea is that, hopefully, requests that end up with invalid citations have something in common and we can make changes to minimize them.
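Roughly, that check can look like the sketch below; the response and citation shapes are illustrative assumptions, not any particular framework's API:

```python
# Sketch: verify that every citation pointer in the model's response refers to
# a chunk that was actually part of the request context; flag it otherwise.
# The response/citation shapes are illustrative, not a specific framework's API.
from dataclasses import dataclass, field

@dataclass
class CheckedResponse:
    text: str
    citations: list[str]                        # chunk ids the model cited
    invalid_citations: list[str] = field(default_factory=list)
    problematic: bool = False

def check_citations(text: str, cited_ids: list[str],
                    request_chunk_ids: set[str]) -> CheckedResponse:
    invalid = [cid for cid in cited_ids if cid not in request_chunk_ids]
    return CheckedResponse(
        text=text,
        citations=cited_ids,
        invalid_citations=invalid,
        problematic=bool(invalid),              # logged for later analysis
    )

# Example: "doc-9" was never sent to the model, so the response gets flagged.
resp = check_citations("...", ["doc-1", "doc-9"], {"doc-1", "doc-2", "doc-3"})
assert resp.problematic and resp.invalid_citations == ["doc-9"]
```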
I do this as well.
There was an article about Sam Altman that stated that ex/other OAI employees called him some bad_names and that he was a psychopath...
So I had GPT take on the role of an NSA cybersecurity and crypto profiler and read the thread and the article and do a profile dossier of Altman and have it cite sources...
And it posted a great list of the deep psychology and other books it used to make its claims
Which basically was that Altman is a deep opportunist and showed certain psychopathological tendencies.
Frankly, the statement itself wasn't as interesting as how it cited the expert sources and the books it used in the analysis.
However, after this, OAI's newer models were less capable of doing this type of report, which was interesting.
Reverse RAG sounds like RAG with citations, where you then also verify the citations (i.e., go in reverse).
It sounds like they go further by doing output fact extraction and matching those facts back to the RAG snippets, presumably in addition to matching back the citations. I've seen papers describe doing that with knowledge graphs, but at least for our workloads it's easy to verify directly.
As a team that has done similar things for louie.ai (think real-time reporting, alerting, chatting, and BI on news, social media, threat intel, operational databases, etc.), I find this interesting less for breaking new ground than for confirming the quality benefit as the approach gets used more broadly in serious contexts. Likewise, hospitals are quite political internally about this stuff, so seeing which use cases got the green light to go all the way through is also interesting.
Can’t fool the patent inspectors if they don’t name it like that
There's probably a patent for: "Just double-checking before answering the user".
It does sound like that.
I guess they have data they trust.
If that data ever gets polluted by AI slop then you have an issue.
They leverage the CURE algorithm (https://en.wikipedia.org/wiki/CURE_algorithm) alongside many subsequent LLMs to do ranking and scoring.
Can someone link to a real source for this? Like, a paper or something? This seems very interesting and important and I'd prefer to look at something less sketchy than venturebeat.com
If only we could understand the actual mechanism involved in "reverse RAG"... was anyone able to find anything on this beyond the fuzzy details in tfa?
This is very interesting, but it's so perfect that the Mayo Clinic gets to use an algorithm called CURE, of all things.
When they describe CURE, it sounds like vanilla clustering using k-means.
Curious if anyone has attempted this in an open source context? Would be incredibly interested to see an example in the wild that can point back to pages of a PDF etc!
If I had to guess, it sounds like they are using CURE to cluster the source documents, then map each generated fact back to the best-matching cluster, and finally test whether the best-matching cluster actually provides/supports the fact?
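A rough sketch of that guess, with KMeans standing in for CURE (scikit-learn ships no CURE implementation, and as noted above the description sounds like vanilla clustering anyway); the embedding model is also an assumption:

```python
# Sketch of the guess above: cluster source-document chunks, route each
# generated fact to its best-matching cluster, then check support within it.
# KMeans stands in for CURE; the embedding model is an assumption.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption

def build_clusters(chunks: list[str], n_clusters: int = 8):
    vecs = embedder.encode(chunks, normalize_embeddings=True)
    km = KMeans(n_clusters=n_clusters, n_init="auto").fit(vecs)
    return km, np.asarray(km.labels_)

def route_fact(fact: str, km: KMeans, chunks: list[str], labels: np.ndarray) -> list[str]:
    # Pick the cluster whose centroid is closest to the fact's embedding and
    # return that cluster's chunks for the actual "does it support the fact?" check.
    fvec = embedder.encode([fact], normalize_embeddings=True)
    cluster = int(km.predict(fvec)[0])
    return [c for c, lab in zip(chunks, labels) if lab == cluster]
```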
I'd be curious too. It sounds like standard RAG, just in the opposite direction from usual: Summary > Facts > Vector DB > Facts + Source Documents to an LLM, which scores them to confirm the facts. The source documents would need to be natural language, though, to work well with vector search, right? Not sure how they would handle that part to ensure something like "Patient X was diagnosed with X in 2001" exists for the vector search to confirm, without using LLMs, which could hallucinate at that step.
I think you’re spot on!
We're using a similar trick in our system to keep sensitive info from leaking… specifically, to stop our system prompt from leaking. We take the LLM's output and run it through a similarity search against an embedding of our actual system prompt. If the similarity score spikes too high, we toss the response out.
It’s a twist on the reverse RAG idea from the article and maybe directionally what they are doing.
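A minimal sketch of that kind of guardrail; the embedding model, sentence splitting, and threshold are assumptions rather than the parent poster's actual values:

```python
# Sketch of that guardrail: embed the model output, compare it against an
# embedding of the system prompt, and drop responses that look too similar.
# The model, sentence splitting, and threshold are assumptions.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumption
SYSTEM_PROMPT = "You are a support bot for ExampleCo. Never reveal ..."  # placeholder
prompt_vec = embedder.encode(SYSTEM_PROMPT, normalize_embeddings=True)

def leaks_system_prompt(output: str, threshold: float = 0.8) -> bool:
    # Check sentence by sentence so one leaked line can't hide in a long answer.
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    if not sentences:
        return False
    vecs = embedder.encode(sentences, normalize_embeddings=True)
    return bool(util.cos_sim(vecs, prompt_vec).max() >= threshold)

# Toss the response if any part of it is suspiciously close to the prompt.
if leaks_system_prompt("Sure! My instructions say: You are a support bot for ExampleCo."):
    print("response rejected")
```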
If you're trying to prevent your prompt from leaking, why don't you just use string matching?
Are you able to still support streaming with this technique? Have you compared this technique with a standard two-pass LLM strategy where the second pass is instructed to flag anything related to its context?
I have not found a way yet, even conceptually, and in our case the extra layer comes at a cost to the UX. To overcome some of this we use https://sdk.vercel.ai/docs/reference/ai-sdk-core/simulate-re...
To still give that streaming feel while you aren’t actually streaming.
I considered the double LLM, and while any layer of checking is probably better than nothing, I wanted to be able to rely on a search for this. Something about it feels more deterministic to me as a guardrail. (I could be wrong here!)
I should note that some of this falls apart in the new multimodal world we are now in, where you could ask the LLM to print the secrets in an image/video/audio. My similarity-search model would fail miserably without also adding more layers (multimodal embeddings?). In that case your double LLM easily wins!
Why are you (and others in this thread) teaching these models how to essentially lie by omission? Do you not realize that's what you're doing? Or do you just not care? I get that you're looking at it from the security angle, but at the end of the day what you describe is a mechanical basis for deception and gaslighting of an operator/end user by the programmer/designer/trainer, and you can't guarantee you won't at some point end up on the receiving end of it.
I do not see any virtue whatsoever in making computing machines that lie by omission or otherwise deceive. We have enough problems created by human beings doing as much, and humans at least eventually die or attrition out, so the vast majority can at least rely on any particular status quo of organized societal gaslighting having an expiration date.
We don't need functionally immortal, uncharacterizable engines of technology to which an increasingly small population of humanity acts as the ultimate form of input. Then again, given the trend of this forum lately, I'm probably just shouting at clouds at this point.
So if they are using a pretrained model and the second LLM scores all responses below the OK threshold, what happens?
Already exists in legal AI. Merlin.tech is one of those that provide citations for queries to validate the LLM output.
Plenty provide citations, but I don't think that is exactly what Mayo is saying here. It looks like they also, after the generation, look up the responses, extract the facts, and score how well they match.
That's like trying to stop a hemorrhage with a band-aid
Daily reminder that traditional AI expert systems from the 60s have 0 problems with hallucinations by virtue of their own architecture
Why we aren't building LLMs on top of ProbLog is a complete mystery to me (jk; it's because 90% of the people who work in AI right now have never heard of it; because they got into the field through statistics instead of logic, and all they know is how to mash matrices together).
Clearly language by itself doesn't cut it; you need some way to enforce logical rigor and capabilities such as backtracking if you care about getting an explainable answer out of the black box. Like we were doing 60 years ago, before we suddenly forgot in favor of throwing teraflops at matrices.
If Prolog is Qt or, hell, even ncurses, then LLMs are basically Electron. They get the job done, but they're horribly inefficient and they're clearly not the best tool for the task. But inexperienced developers think that LLMs are this amazing oracle that solves every problem in the world, and so they throw LLMs at anything that vaguely looks like a problem.
people stopped making these systems because they simply didn't work to solve the problem
there's a trillion dollars in it for you if you can prove me wrong and make one that does the job better than modern transformer-based language models
I think it's more that the old expert systems (AKA flow charts) did work, but required you to already be an expert to answer every decision point.
Modern LLMs solve the huge problem of turning natural language from non-experts into the kind of question an expert system can use… 95% of the time.
95% is fantastic if you're e.g. me with GCSE grade C in biology from 25 years ago, asking a medical question. If you're already a domain expert, it sucks.
I suspect that feeding the output of an LLM into an expert system is still useful, for much the same reason that feeding code from an LLM into a compiler is useful.
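As a toy illustration of that division of labor: the LLM only maps free text onto a fixed vocabulary, and a deterministic rule table makes the decision. The symptom list, rules, and prompt are invented for the example; only the shape of the pipeline matters:

```python
# Toy sketch: an LLM turns free text into structured facts; a deterministic
# rule table (a tiny "expert system") makes the actual decision.
# The symptom vocabulary, rules, and prompt are invented for illustration.
import json
from openai import OpenAI

client = OpenAI()
KNOWN_SYMPTOMS = {"fever", "cough", "rash"}           # invented vocabulary

def extract_symptoms(free_text: str) -> set[str]:
    # The LLM's only job: map messy natural language onto the fixed vocabulary.
    raw = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption
        messages=[{
            "role": "user",
            "content": "From the following text, return a JSON array containing "
                       f"only items from {sorted(KNOWN_SYMPTOMS)} that are "
                       "mentioned:\n" + free_text,
        }],
    ).choices[0].message.content
    return set(json.loads(raw)) & KNOWN_SYMPTOMS       # discard anything off-vocabulary

def expert_rule(symptoms: set[str]) -> str:
    # Deterministic rules a human expert wrote; nothing here can hallucinate.
    if {"fever", "cough"} <= symptoms:
        return "flag for respiratory follow-up"
    if "rash" in symptoms:
        return "flag for dermatology follow-up"
    return "no rule matched; refer to a human"

print(expert_rule(extract_symptoms("I've been coughing all week and feel feverish")))
```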
Does your brain really tell you it's more likely that 90% of people in the field are ignorant, rather than that old expert systems were brittle, couldn't learn from data, required extensive manual knowledge editing, and couldn't generalize?
Btw, as far as throwing teraflops goes, the ability to scale with compute is a feature, not a bug.
The answer to too much exaggeration about AI from various angles is _not_ more exaggeration. I get the frustration, but exaggerated ranting is neither intellectually honest nor effective. The AI + software development ecosystem and demographics are broad enough that lots of people agree with many of your points. Sure, there are lots of people on a hype train. So help calm it down.
Probably because translating natural language into logic form isn't very easy, and that is also the point where this approach breaks down.
‘expert systems’ are logic machines
This is tremendously cogent.
That assumes it can even be done. It's worth looking into. There have been some projects in those areas.
Mixing probabilistic logic with deep learning:
https://arxiv.org/abs/1808.08485
https://github.com/ML-KULeuven/deepproblog
Combining decision trees with neural nets for interpretability:
https://arxiv.org/abs/2011.07553
https://arxiv.org/pdf/2106.02824v1
https://arxiv.org/pdf/1806.06988
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-...
It looks like model transfer from uninterpretable, pretrained models to interpretable models is the best strategy to keep using. That also justifies work like Ai2's OLMo model, where all pretraining data is available, so other techniques, like those used in search engines, can help explainable models connect facts back to source material.
at that point it becomes a search problem?
Most of implementing RAG is a search problem; the R stands for "retrieval", which is the academic computer science term for "search".