Chengshuo Dai

Exploring the intersection of knowledge graphs and LLMs for biomedical mechanism retrieval, and how DAG constraints reduce semantic drift.

Background: Why Standard RAG Falls Short in Biomedicine

When I first started reading about RAG (Retrieval-Augmented Generation), it seemed like a clean solution to LLM hallucinations — just retrieve relevant documents and let the model reason over them. But the more I dug into biomedical applications, the more I realized that vanilla RAG has a structural weakness: it retrieves text chunks, not relationships.

In biomedicine, the relationships are the whole point. Knowing that "metformin treats type 2 diabetes" is less useful than knowing why — the causal chain from drug → target gene → pathway → disease mechanism. A bag of retrieved paragraphs can't naturally express that. This is where GraphRAG comes in.

GraphRAG, introduced by Microsoft Research in 2024, extends the RAG paradigm by incorporating graph-structured data — specifically knowledge graphs (KGs) — into the retrieval process. Unlike baseline RAG systems that rely on vector search to retrieve semantically similar text, GraphRAG leverages the relational structure of graphs to retrieve and process information based on domain-specific queries. In a biomedical context, this means nodes can be genes, drugs, diseases, and phenotypes, while edges encode relationships like "inhibits", "causes", or "upregulates".
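To make the node-and-edge picture concrete, here is a minimal sketch of a typed knowledge graph in Python. The `Triple` class, the `neighbors` helper, and the three example edges are illustrative assumptions of mine, not the schema of any particular GraphRAG system.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One typed, directed edge: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str


# Illustrative edges only; a real KG would load these from curated sources.
TRIPLES = [
    Triple("metformin", "activates", "AMPK"),
    Triple("AMPK", "inhibits", "hepatic gluconeogenesis"),
    Triple("metformin", "treats", "type 2 diabetes"),
]

# Index outgoing edges by subject so a query can follow relationships by type.
outgoing = defaultdict(list)
for t in TRIPLES:
    outgoing[t.subject].append(t)


def neighbors(node, predicate=None):
    """Return objects reachable from `node`, optionally restricted to one edge type."""
    return [t.obj for t in outgoing[node] if predicate is None or t.predicate == predicate]


print(neighbors("metformin"))               # every outgoing edge
print(neighbors("metformin", "activates"))  # only the 'activates' relationship
```

The point of the `predicate` filter is that a query can ask for a specific kind of relationship, which a similarity search over text chunks has no direct way to express.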

The Core Problem: Semantic Drift in LLM Reasoning

Before getting into GraphRAG specifics, it's worth naming the enemy clearly: semantic drift.

When LLMs are asked to reason about complex multi-hop questions — like "what is the mechanism by which Drug X affects Disease Y through Pathway Z?" — they tend to drift. The model starts confidently, but as it chains inferences, small errors compound. The final answer might be grammatically fluent and even partially correct, but the intermediate reasoning steps have quietly departed from ground truth. This is a form of hallucination that's particularly dangerous in high-stakes domains.

The underlying generative method of these models, which predict tokens sequentially from statistical patterns learned on massive text corpora, makes them susceptible to hallucinations: outputs that are syntactically fluent yet factually incorrect. In biomedicine, such inaccuracies pose significant risks, since even minor errors can misdirect research efforts, delay critical therapeutic discoveries, or compromise patient safety.

Domain-specific pretraining and fine-tuning (BioBERT, PubMedBERT, etc.) help but don't fully solve this: they embed knowledge implicitly in model parameters, with no way to verify its provenance or update it dynamically.

Enter Knowledge Graphs + GraphRAG

The key insight behind GraphRAG in biomedical settings is that biological knowledge has inherent structure. Drug mechanisms, gene-disease associations, and metabolic pathways aren't arbitrary text — they're directed, typed, multi-relational graphs. So why not represent them as such?

A 2026 paper published in GigaScience (Joy & Su, Scripps Research) put this idea into practice with BTE-RAG — a framework integrating LLMs with BioThings Explorer (BTE), an API federation of over 60 authoritative biomedical knowledge sources. BTE-RAG dynamically executes targeted, query-focused graph traversals to retrieve concise, mechanistically pertinent evidence, formulates this evidence into declarative context statements, and augments model prompts accordingly.
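The "evidence into declarative statements, then into the prompt" step can be sketched in a few lines. This is not BTE-RAG's code and it does not call BioThings Explorer; the function names, the prompt wording, and the two evidence triples are hypothetical and only show the shape of the idea.

```python
def triples_to_statements(triples):
    """Render (subject, predicate, object) triples as one declarative sentence each."""
    return [f"{s} {p.replace('_', ' ')} {o}." for s, p, o in triples]


def build_prompt(question, triples):
    """Prepend retrieved graph evidence to the question as plain-text context."""
    context = "\n".join(f"- {line}" for line in triples_to_statements(triples))
    return (
        "Use the evidence below when it is relevant.\n"
        f"Evidence:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical evidence, standing in for the result of a graph traversal.
evidence = [
    ("imatinib", "inhibits", "BCR-ABL"),
    ("BCR-ABL", "associated_with", "chronic myeloid leukemia"),
]

prompt = build_prompt("Which gene does imatinib act on in CML?", evidence)
print(prompt)
```

Keeping the statements short and one-per-edge is a design choice: the model sees each relationship as a separate, checkable claim instead of a paragraph it has to parse.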

The results were striking. On gene-centric mechanistic questions, BTE-RAG increased accuracy from 51% to 75.8% for GPT-4o mini and from 69.8% to 78.6% for GPT-4o. In metabolite-focused questions, the proportion of responses with high cosine similarity scores rose by over 77% for GPT-4o. The takeaway: grounding LLM reasoning in structured mechanistic graphs substantially improves accuracy on these biomedical benchmarks, which is what you would expect if it reduces hallucination and drift.
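It helps to separate absolute from relative gains when reading these numbers. The snippet below only does arithmetic on the two accuracy pairs quoted above; the `gain` helper is mine.

```python
def gain(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100 * absolute / before
    return round(absolute, 1), round(relative, 1)


# Accuracy (%) without and with BTE-RAG, as quoted in the text above.
results = {"GPT-4o mini": (51.0, 75.8), "GPT-4o": (69.8, 78.6)}

for model, (before, after) in results.items():
    points, percent = gain(before, after)
    print(f"{model}: +{points} points ({percent}% relative)")
```

The smaller model gains about 24.8 points and the larger one about 8.8, so on this benchmark the weaker baseline benefits more from the retrieved graph evidence.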

What makes this different from just "adding more context"? The graph structure enforces typed relationships — it doesn't just say "metformin and AMPK are related," it says "metformin activates AMPK." That specificity is what prevents the model from confabulating plausible-sounding but wrong causal stories.

Why DAG Constraints Matter: Taming the Reasoning Path

Now for the part I find most interesting: the role of Directed Acyclic Graphs (DAGs) in controlling how reasoning unfolds.

In any knowledge graph, there's a risk that retrieval becomes circular or loops through redundant nodes, generating bloated context that confuses the LLM rather than helping it. DAGs eliminate this by enforcing one constraint: no cycles. Every edge is directed, no path leads back to its starting node, and traversal can follow a topological order. For mechanistic reasoning this is a natural fit, because a single causal chain is usually modeled as acyclic. A drug causes a gene expression change, which leads to a pathway activation, which results in a disease outcome. Within that chain, the arrow only goes one way.
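Checking the "no cycles" constraint is cheap. Here is a small sketch using Python's standard-library `graphlib` (Python 3.9+); the `is_acyclic` helper and the node names are my own, chosen to mirror the drug-to-disease chain described above.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(predecessors):
    """Return True if the graph (node -> set of upstream nodes) has no cycle."""
    try:
        TopologicalSorter(predecessors).prepare()
        return True
    except CycleError:
        return False


# Hypothetical mechanism chain: each key lists the nodes that must come before it.
chain = {
    "gene expression change": {"drug"},
    "pathway activation": {"gene expression change"},
    "disease outcome": {"pathway activation"},
}

# The same chain with a feedback edge from the outcome back to gene expression.
with_feedback = {**chain, "gene expression change": {"drug", "disease outcome"}}

print(is_acyclic(chain))          # True
print(is_acyclic(with_feedback))  # False
```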

LogicRAG, a 2025 arXiv preprint accepted at AAAI 2026, builds explicitly on this intuition. It begins by decomposing the input query into a set of subproblems and constructing a directed acyclic graph (DAG) to model the logical dependencies among them. To support coherent multi-step reasoning, LogicRAG then linearizes the graph using topological sort, so that subproblems can be addressed in a logically consistent order.

The practical effect of DAG linearization is significant. By processing subproblems in topological order, the LLM never has to reason "backwards" against its own prior outputs. Each step is grounded in the outputs of its upstream nodes. This acts as a structural safeguard against semantic drift: the order of reasoning is fixed by the graph before the LLM generates a token of the answer.
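The linearize-then-solve idea can be sketched as follows. This is not LogicRAG's implementation: the decomposition is written by hand rather than produced by an LLM, `solve_in_order` and `stub_answer` are hypothetical names, and the stub stands in for a real retrieval plus generation call.

```python
from graphlib import TopologicalSorter

# Hypothetical decomposition: subproblem -> subproblems it depends on.
dependencies = {
    "What does Drug X bind to?": set(),
    "Which pathway does that target regulate?": {"What does Drug X bind to?"},
    "How does that pathway affect Disease Y?": {"Which pathway does that target regulate?"},
}


def solve_in_order(dependencies, answer_fn):
    """Answer subproblems in topological order, passing upstream answers forward."""
    answers = {}
    for question in TopologicalSorter(dependencies).static_order():
        upstream = {dep: answers[dep] for dep in dependencies.get(question, ())}
        answers[question] = answer_fn(question, upstream)
    return answers


def stub_answer(question, upstream):
    """Stand-in for a retrieval + LLM call; it only reports how much context it saw."""
    return f"answered with {len(upstream)} upstream answer(s)"


for question, answer in solve_in_order(dependencies, stub_answer).items():
    print(f"{question} -> {answer}")
```

Because `static_order()` always yields a node after its predecessors, every lookup in `answers[dep]` is guaranteed to succeed, which is the code-level version of "never reasoning backwards".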

LogicRAG applies graph pruning to reduce redundant retrieval and uses context pruning to filter irrelevant context, significantly reducing the overall token cost. This is a nice bonus — not only does the DAG structure improve accuracy, it makes the system more efficient by cutting down the noise the model has to wade through.

Ontology-Grounded Graphs: Taking It Further

Another angle I came across was ontology-grounded knowledge graphs: instead of building a KG from scratch or relying on heuristic extraction, you anchor the graph to a formal biomedical ontology or terminology (like SNOMED CT or UMLS). This provides semantic typing guarantees at the schema level, not just at the instance level.

A 2026 paper in the Journal of Biomedical Informatics (Ali et al.) applied this to clinical question answering. Their GraphRAG framework significantly outperformed baseline models: while ChatGPT-4 achieved 37% accuracy and DeepSeek-R1 achieved 52%, the ontology-grounded approach achieved 98% accuracy, with hallucination rates reduced from approximately 63% in ChatGPT-4 to just 1.7%.

98% accuracy on clinical QA is a remarkable number. The key mechanism here is that the ontology constrains which node-to-node relationships are even permitted in the graph. If your ontology says "Drug → targets → Gene" is a valid edge type, but "Drug → causes → Doctor" is not, then retrieval is automatically filtered before it reaches the LLM. You're not just checking the model's output — you're constraining the input space.
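A schema-level filter of this kind might look like the sketch below. The allowed edge patterns, the node typing table, and the `is_valid` helper are all hypothetical; a real system would derive them from the ontology itself, and this is not the method from the paper.

```python
# Hypothetical schema: allowed (subject type, predicate, object type) patterns.
ALLOWED_EDGE_TYPES = {
    ("Drug", "targets", "Gene"),
    ("Gene", "participates_in", "Pathway"),
    ("Pathway", "implicated_in", "Disease"),
}

# Hypothetical entity typing; a real system would take this from the ontology.
NODE_TYPES = {
    "imatinib": "Drug",
    "BCR-ABL": "Gene",
    "Dr. Smith": "Doctor",
}


def is_valid(triple):
    """Check a (subject, predicate, object) triple against the schema."""
    subject, predicate, obj = triple
    pattern = (NODE_TYPES.get(subject), predicate, NODE_TYPES.get(obj))
    return pattern in ALLOWED_EDGE_TYPES


candidates = [
    ("imatinib", "targets", "BCR-ABL"),
    ("imatinib", "causes", "Dr. Smith"),   # type pattern not in the schema
    ("imatinib", "targets", "unknown_x"),  # untyped node, so rejected
]

kept = [t for t in candidates if is_valid(t)]
print(kept)
```

Rejecting untyped nodes outright is a conservative choice; a looser pipeline could route them to a review queue instead.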

This connects back to the DAG idea: both approaches work by reducing the degrees of freedom in the reasoning process. Fewer spurious paths through the graph = fewer opportunities for semantic drift.

A Concrete Example: Drug Mechanism Retrieval

To make this more concrete, imagine querying: "How does imatinib treat chronic myeloid leukemia (CML)?"

Standard LLM approach: The model draws on training data to describe BCR-ABL, tyrosine kinase inhibition, and Gleevec's approval history. This is probably mostly correct, but the specific mechanistic chain (imatinib → BCR-ABL kinase domain binding → inhibition of ATP binding → blocked phosphorylation → reduced cancer cell proliferation → apoptosis) may be imprecise or missing links.

GraphRAG approach: The system traverses the knowledge graph from imatinib → [inhibits] → BCR-ABL → [has] → tyrosine kinase activity → [drives] → cell proliferation, retrieves those typed triples as structured context, formats them into declarative statements, and conditions the LLM's generation on that explicit evidence. Each hop in the chain is backed by an explicit edge in the graph rather than by the model's recall.

With DAG constraints: The traversal order is topologically sorted. The model reasons about drug-target binding before reasoning about downstream pathway effects — never conflating cause and consequence.
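Here is a toy version of that traversal. The edge table is hand-written and the edge labels are my own illustrative choices, not records from a real knowledge source; `find_paths` is a plain depth-first search, used here only to show that the hops come back in cause-before-consequence order.

```python
# Illustrative typed edges for the imatinib example: node -> [(predicate, next node)].
EDGES = {
    "imatinib": [("inhibits", "BCR-ABL")],
    "BCR-ABL": [("has", "tyrosine kinase activity")],
    "tyrosine kinase activity": [("drives", "cell proliferation")],
}


def find_paths(start, goal, path=()):
    """Depth-first search returning every typed path from `start` to `goal`."""
    if start == goal:
        return [list(path)]
    paths = []
    for predicate, nxt in EDGES.get(start, []):
        # Skip nodes already expanded on this path so a cyclic graph can't loop forever.
        if any(nxt == s for s, _, _ in path):
            continue
        paths.extend(find_paths(nxt, goal, path + ((start, predicate, nxt),)))
    return paths


for path in find_paths("imatinib", "cell proliferation"):
    for subject, predicate, obj in path:
        print(f"{subject} {predicate} {obj}.")
```

Each printed line is one hop, so the drug-target statement always precedes the downstream pathway statements in the context handed to the model.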

The BTE-RAG paper operationalized exactly this kind of mechanistic retrieval at scale, building benchmarks from DrugMechDB that encode these multi-hop drug→gene→disease pathways, and showing that graph-grounded LLMs consistently outperform parameter-only ones on this task.

Current Limitations and Open Questions

This area is moving fast, but it's not without real problems:

Graph construction quality: Most pipelines still rely on LLMs or classical NER/RE systems to extract entities and relationships from text. GPT-4.0 demonstrated strong performance in extracting biomedical relationships from semi-structured data, achieving F1-scores above 0.881, but even at that level, noisy extraction compounds over a large KG. Garbage in, garbage out — a KG full of low-confidence edges undermines the whole structure.

Dynamic knowledge: Biomedical knowledge updates rapidly. A KG built from three-year-old papers may carry outdated mechanism annotations. BTE-RAG partially addresses this via API federation (live queries to source databases rather than a static snapshot), but this buys freshness at the cost of higher latency.

DAG vs. cyclic reality: Biological systems do contain feedback loops — signaling pathways with negative feedback, regulatory circuits, etc. Forcing DAG structure may lose some of that complexity. There's probably a tradeoff between reasoning tractability and biological fidelity that hasn't been fully worked out yet.

Evaluation: Benchmarks like DrugMechDB and GeneTuring are valuable, but evaluating mechanistic reasoning quality (not just final answer correctness) remains hard. A model can get the right answer for the wrong reasons.
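On the evaluation point, one simple way to expose "right answer, wrong reasons" is to score the final answer and the reasoning path separately. The sketch below is hypothetical and is not how DrugMechDB or GeneTuring are scored; the `score` function, the metric names, and the toy paths are assumptions for illustration.

```python
def score(predicted_path, predicted_answer, gold_path, gold_answer):
    """Score the final answer and the reasoning path separately."""
    gold_edges = set(gold_path)
    predicted_edges = set(predicted_path)
    overlap = len(gold_edges & predicted_edges)
    return {
        "answer_correct": predicted_answer == gold_answer,
        # Fraction of predicted hops that appear in the gold mechanism.
        "path_precision": overlap / len(predicted_edges) if predicted_edges else 0.0,
        # Fraction of gold hops the model recovered.
        "path_recall": overlap / len(gold_edges) if gold_edges else 0.0,
    }


# Toy case: the model reaches the right disease through the wrong gene.
gold = [
    ("drug", "inhibits", "gene A"),
    ("gene A", "regulates", "pathway P"),
    ("pathway P", "implicated_in", "disease"),
]
predicted = [
    ("drug", "inhibits", "gene B"),
    ("gene B", "regulates", "pathway P"),
    ("pathway P", "implicated_in", "disease"),
]

result = score(predicted, "disease", gold, "disease")
print(result)
```

In this toy case the answer is marked correct while only one of three hops matches the gold mechanism, which is exactly the gap that answer-only metrics hide.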

Summary

GraphRAG in biomedicine addresses something fundamental: the mismatch between how biological knowledge is organized (as a typed, directed, multi-relational graph) and how standard RAG retrieves it (as flat text chunks). By grounding LLM reasoning in explicit graph traversals — especially with DAG constraints that enforce topological ordering — we get more faithful, interpretable, and hallucination-resistant mechanistic reasoning.

The results from papers like BTE-RAG (GigaScience 2026), the ontology-grounded GraphRAG (JBI 2026), and LogicRAG (AAAI 2026) are consistent: on the benchmarks they report, structure-aware retrieval outperforms the baselines it is compared against, often by significant margins. The direction is clear even if the engineering details are still being worked out.

For me, the most compelling takeaway is the DAG insight: you don't just want the LLM to have the right information — you want the retrieval architecture to guide the reasoning order. Semantic drift is hard to patch in the output; it's easier to prevent in the input.

References

Joy, J. & Su, A.I. (2026). Federated knowledge retrieval elevates large language model performance on biomedical benchmarks (BTE-RAG). GigaScience, 15, giag007. https://doi.org/10.1093/gigascience/giag007
Ali, M., Taha, Z., & Morsey, M.M. (2026). Ontology-grounded knowledge graphs for mitigating hallucinations in LLMs for clinical question answering. Journal of Biomedical Informatics, 175, 104993. https://doi.org/10.1016/j.jbi.2026.104993
You Don't Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures (LogicRAG). (2026). AAAI 2026. arXiv:2508.06105.
IBM Think. (2026, February). What is GraphRAG? https://www.ibm.com/think/topics/graphrag