AI Compliance Architecture: Three Decisions That Outweigh Your Framework Choice

Dr. Oliver Gausmann · April 13, 2026 · 9 min read

Why is framework choice the wrong first question?

Most framework comparisons rate speed, developer experience and scalability. None of them rate whether the framework is fit for regulated environments. Those are two different evaluations with two different outcomes.

After Part 1 of this series, one question kept coming back: "Which framework should we use?" It is a fair question, but it skips three decisions that outweigh any framework choice. This article works through those decisions first. The right AI compliance architecture with the wrong framework still works. The wrong architecture, even with the best framework, does not. According to a t3n report, 70% of regulated companies rebuild their AI agent stack entirely within the first 90 days because the first architecture does not hold up in production¹.

Where does an AI compliance architecture need determinism, and where can it work probabilistically?

Three months ago, I was providing audit support at a mid-sized insurer with around 450 employees. The company has been running a RAG system for about a year that helps claims handlers answer questions about policy terms. The internal audit's test query: What reporting deadline applies to water damage under the 2022 terms? The vector database returned a 93% match but cited a paragraph from the 2018 terms, which belong to a different product generation with different deadlines. The auditor had their finding.

The reason is instructive. A probabilistic system answered a deterministic question. Semantic similarity tells you that two texts cover the same topic. An auditor asks something different. The auditor wants to know: is this requirement fulfilled, yes or no, and where is the evidence?

I hear "we need to build multi-agent systems" and "we need RAG" in every other conversation. My follow-up question: for what, exactly? Compliance demands determinism. Half the industry is building probabilistic systems because multi-agent and RAG are the terms that play well on conference stages right now. The architecture decides. The framework choice comes after. A recent survey of 517 German mid-market companies confirms the pattern: 40% have deployed AI, but only 21% have an AI strategy. Tools get introduced before the architecture questions are answered².

The critical distinction here is not binary. A compliance status has four states. Compliant means the control is documented, implemented, and evidenced. Partial means the control exists, but evidence is missing or stale. Gap means no matching control exists. Not applicable means the control does not apply to this company profile. A system that measures only similarity cannot distinguish these four states. It says "94% match." An architecturally sound system says "partial: policy present, evidence expired 14 months ago."
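The four states can be derived from explicit rules rather than from a similarity score. The following is a minimal sketch, assuming a hypothetical `ControlRecord` with three fields and a 365-day evidence validity window; both are illustrative assumptions, not part of any framework.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Status(Enum):
    COMPLIANT = "compliant"
    PARTIAL = "partial"
    GAP = "gap"
    NOT_APPLICABLE = "not applicable"


@dataclass
class ControlRecord:
    # Hypothetical record; field names are illustrative assumptions.
    applicable: bool               # does the control apply to this company profile?
    control_present: bool          # is a matching control documented and implemented?
    evidence_date: Optional[date]  # date of the latest evidence, if any


def classify(record: ControlRecord, today: date,
             max_age: timedelta = timedelta(days=365)) -> Status:
    """Map a control record to exactly one of the four states."""
    if not record.applicable:
        return Status.NOT_APPLICABLE
    if not record.control_present:
        return Status.GAP
    if record.evidence_date is None or today - record.evidence_date > max_age:
        return Status.PARTIAL      # control exists, evidence missing or stale
    return Status.COMPLIANT


# Evidence from 14 months ago no longer counts as current.
stale = ControlRecord(applicable=True, control_present=True,
                      evidence_date=date(2025, 2, 13))
print(classify(stale, today=date(2026, 4, 13)).value)  # partial
```

The order of the checks matters: applicability is decided first, so a control that does not apply can never show up as a gap.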

This principle holds across domains. In compliance, "MFA is recommended" and "MFA is mandatory" mean different things. In quality management, "measurement was recorded" and "measurement is within tolerance" mean different things. In financial controls, "invoice exists" and "invoice is approved and posted" mean different things. In every case, a document that mentions a topic does not prove that an obligation is met.

Where determinism is required vs. where AI can work probabilistically
Task	Mode	Technical implementation
Which regulation applies?	Deterministic	Graph query based on company profile
Which controls are mandatory?	Deterministic	Graph traversal through regulatory structure
Which evidence is missing?	Deterministic	Database check with timestamp
How well does a policy cover a control?	Probabilistic with guardrails	Semantic verification with structured output
How should a policy recommendation be phrased?	Probabilistic with guardrails	LLM generation grounded in control references
How should 40 gaps be prioritised by business relevance?	Probabilistic with guardrails	LLM ranking with risk scoring

The deterministic rows tolerate no "it depends." The probabilistic rows add genuine value through AI but need guardrails. Making this split before writing a single line of code saves you the architecture rebuild after the first failed audit.
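One way to read "probabilistic with guardrails" is that the model may propose a classification, but code decides whether the proposal is admissible. A minimal sketch, assuming the model is asked to return JSON with the hypothetical fields `status`, `control_id` and `justification`; no real LLM API is called here.

```python
import json

ALLOWED_STATUSES = {"compliant", "partial", "gap", "not applicable"}


def parse_model_verdict(raw: str, known_control_ids: set[str]) -> dict:
    """Accept a model answer only if it fits the predefined schema.

    Raises ValueError instead of passing free-form text downstream.
    """
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc
    if not isinstance(verdict, dict):
        raise ValueError("model output must be a JSON object")

    status = verdict.get("status")
    control_id = verdict.get("control_id")
    justification = verdict.get("justification")

    if not isinstance(status, str) or status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if not isinstance(control_id, str) or control_id not in known_control_ids:
        raise ValueError("verdict references a control that is not in the graph")
    if not isinstance(justification, str) or not justification.strip():
        raise ValueError("verdict has no justification")
    return verdict
```

A free-text answer such as "94% match" fails at the first check, and a verdict that cites a control the graph does not contain fails at the third. The model contributes judgement, while the set of possible outcomes stays fixed.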

When is a knowledge graph the better choice over vector search?

The second decision concerns the layers of your AI compliance architecture. Most RAG implementations have one layer: vector search. For compliance, that is insufficient.

A compliance system that survives an audit needs three layers. The knowledge graph represents regulatory structure and decides what is applicable. Vector search finds semantically relevant documents within the applicable scope. The LLM explains why a result matters and produces human-readable reports. Gartner calls knowledge graphs a "Critical Enabler" for generative AI in regulated environments³. The reason: knowledge graphs deliver the structured truth that LLMs cannot produce on their own.

The sequence of these three layers is a deliberate architecture decision. The graph filters first. Vector search refines second. The LLM explains last. The LLM is never the source of truth. In compliance, that is a governance requirement.

The accuracy gap is measurable. GraphRAG systems reach 90.6% accuracy on exact-match compliance benchmarks. Standard RAG manages 65.6%⁴. That is 25 percentage points an auditor will notice.

Here is what that looks like in practice. A company profile contains structured data: industry (financial services), jurisdiction (EU), data types (personal and financial), company size (200 employees). This input drives the graph query. Result: DORA, GDPR and ISO 27001 are applicable. HIPAA is not, because the company does not process US health data. Only after this filter does vector search begin, and only within the applicable regulations.
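The applicability filter described above can be sketched without any graph database. In this sketch a plain dictionary stands in for the knowledge graph, and the rule conditions are deliberately simplified assumptions for illustration, not a statement of when these regulations legally apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompanyProfile:
    industry: str
    jurisdiction: str
    data_types: frozenset[str]
    employees: int


# Toy stand-in for the knowledge graph: each regulation node carries an
# applicability rule. The conditions are simplified assumptions.
APPLICABILITY_RULES = {
    "DORA": lambda p: p.jurisdiction == "EU" and p.industry == "financial services",
    "GDPR": lambda p: p.jurisdiction == "EU" and "personal" in p.data_types,
    "ISO 27001": lambda p: True,  # voluntary standard, assumed in scope here
    "HIPAA": lambda p: "us_health" in p.data_types,
}


def applicable_regulations(profile: CompanyProfile) -> list[str]:
    """Deterministic filter: the same profile always yields the same list."""
    return sorted(name for name, rule in APPLICABILITY_RULES.items() if rule(profile))


profile = CompanyProfile(
    industry="financial services",
    jurisdiction="EU",
    data_types=frozenset({"personal", "financial"}),
    employees=200,
)
print(applicable_regulations(profile))  # ['DORA', 'GDPR', 'ISO 27001']
```

In a real system the rules would live as nodes and edges in the graph and be queried there; the point of the sketch is only that no similarity score is involved in this step.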

Without this first step, vector search scans the entire regulatory database. It finds „similar" paragraphs from regulations that do not even apply to the company. With this step, it searches only where it is actually relevant. In our experience, this cuts false positives by more than half (estimate).

The entire workflow fits into eight lines:

INPUT:  Company profile (industry, jurisdiction, size, data types)
STEP 1: Graph query → filter applicable regulations           [deterministic]
STEP 2: Graph query → load associated controls                [deterministic]
STEP 3: Vector search → match internal policies to controls   [probabilistic]
STEP 4: Semantic verification → classify control status       [probabilistic]
        (compliant | partial | gap | not applicable)
STEP 5: Risk assessment → prioritise gaps                     [probabilistic]
OUTPUT: Structured compliance report with evidence chain

AI compliance review workflow

The tags show it clearly: decision 1 determines which technology goes where. Steps 1 and 2 are graph queries with no LLM involvement. From step 3 onwards, AI contributes, but with structured outputs and predefined classification categories.
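As a rough illustration, the fixed step order can be written as an ordinary function in which the probabilistic steps are passed in as callables. The lookup table, the control names and the two stubs are hypothetical placeholders; step 5 (risk prioritisation) is left out to keep the sketch short.

```python
from typing import Callable, Optional

# Step 2 lookup; a plain dictionary stands in for the graph traversal.
CONTROLS_BY_REGULATION = {
    "DORA": ["access-control"],
    "GDPR": ["encryption-at-rest"],
}


def review(
    applicable: list[str],                           # result of step 1
    find_policy: Callable[[str], Optional[str]],     # step 3
    classify: Callable[[str, Optional[str]], str],   # step 4
) -> list[dict]:
    """Run the steps in a fixed order and keep the evidence chain per control."""
    report = []
    for regulation in applicable:
        for control in CONTROLS_BY_REGULATION.get(regulation, []):  # [deterministic]
            policy = find_policy(control)                           # [probabilistic]
            status = classify(control, policy)                      # [probabilistic]
            report.append({"regulation": regulation, "control": control,
                           "policy": policy, "status": status})
    return report


# Stubs replace vector search and semantic verification for this example.
policies = {"access-control": "POL-007 Access Management"}
report = review(
    ["DORA", "GDPR"],
    find_policy=policies.get,
    classify=lambda control, policy: "partial" if policy else "gap",
)
for row in report:
    print(row["regulation"], row["control"], row["status"])
```

Because the control list comes from a lookup and not from the model, the report always contains one row per mandatory control, whatever the probabilistic steps return.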

When does a compliance system need agents, and when is a pipeline enough?

The third decision is the one most often answered wrong.

A pipeline follows a fixed sequence of steps. Inputs and outputs are structured. Results are reproducible. Ask the same question twice, get the same answer. For most compliance tasks, that is exactly right.

An agent makes runtime decisions. It chooses which step to execute next, which data sources to combine, when to stop. That is powerful, and for certain tasks it is necessary. For many compliance tasks, it is unnecessarily risky. A Google DeepMind/MIT study from December 2025 quantifies it: multi-agent systems cost many times more per solved task than single agents, while producing worse results⁵.

"Which regulations apply to this company?" is a deterministic graph query. Not an agent problem. "Analyse this 200-page policy against 42 controls, identify gaps, prioritise by business risk, and suggest policy amendments" is a multi-step problem where intermediate results shape the next step. That is where agents earn their place.

Even then, the right approach is orchestrated workflows with defined steps and human approval gates. Not autonomous loops where an agent decides everything at runtime. Frameworks like LangGraph provide durable StateGraphs with checkpointing and human-in-the-loop gates⁶. A compliance agent needs checkpoints after each step, human review gates before high-risk decisions, and an audit trail documenting why each decision was made.
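The three requirements (checkpoint after each step, human gate before high-risk steps, audit trail) can be illustrated in plain Python. This sketch deliberately does not use LangGraph or reproduce its API; the `Workflow` class and the step names are hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Callable

Step = tuple[str, Callable[[dict], dict], bool]  # (name, function, needs_approval)


@dataclass
class Workflow:
    steps: list[Step]
    audit_trail: list[dict] = field(default_factory=list)

    def run(self, state: dict, approve: Callable[[str, dict], bool]) -> dict:
        for name, fn, needs_approval in self.steps:
            # Human review gate: high-risk steps run only after approval.
            if needs_approval and not approve(name, state):
                self.audit_trail.append({"step": name, "event": "rejected by reviewer"})
                break
            state = fn(state)
            # Checkpoint: a serialised copy of the state after every step.
            self.audit_trail.append({
                "step": name,
                "event": "completed",
                "checkpoint": json.dumps(state, sort_keys=True),
            })
        return state


wf = Workflow(steps=[
    ("identify_gaps", lambda s: {**s, "gaps": ["MFA policy missing"]}, False),
    ("amend_policy", lambda s: {**s, "amended": True}, True),  # high-risk step
])
final = wf.run({"policy": "POL-007"}, approve=lambda step, state: False)
print([entry["event"] for entry in wf.audit_trail])
```

With a reviewer who declines, the run stops before `amend_policy`, the state from the last checkpoint is returned unchanged, and the rejection itself is recorded in the trail.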

The difference between a marketing chatbot and a compliance system comes down to one question: is every intermediate step traceable and auditable? An autonomous agent that decides everything on its own is a risk no compliance officer will sign off on. According to industry analysis, only 16% of all enterprise deployments qualify as actual agents (systems that plan, execute, observe and adapt). The rest are, in the authors' words, "glorified chatbots with an API call"¹.

How does an AI compliance architecture verify evidence in six steps?

The architecture decisions from the previous sections define the frame. Evidence verification shows how that frame works in practice. In an audit-proof AI compliance architecture, a compliance record moves through six clearly defined steps. The bridge across frameworks is the "Common Control" layer: it translates between ISO 27001 A.9.4.2, SOC 2 CC6.1 and DORA Art. 9, all of which require access control in different wording. Without this abstraction layer, you have to match every framework individually against every policy. With it, you match once and get coverage across frameworks.
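The "match once, cover many" idea behind the Common Control layer can be sketched as a simple mapping. The framework references are the ones named above; the common control ID `CC-ACCESS-CONTROL` and the policy ID are hypothetical.

```python
# Hypothetical mapping: one common control, several framework references.
COMMON_CONTROLS = {
    "CC-ACCESS-CONTROL": {
        "ISO 27001": "A.9.4.2",
        "SOC 2": "CC6.1",
        "DORA": "Art. 9",
    },
}


def coverage(policy_matches: dict[str, str]) -> dict[str, list[str]]:
    """Expand policy-to-common-control matches into per-framework coverage.

    policy_matches maps a policy ID to the common control it satisfies.
    """
    covered: dict[str, list[str]] = {}
    for policy_id, common_control in policy_matches.items():
        for framework, reference in COMMON_CONTROLS[common_control].items():
            covered.setdefault(framework, []).append(f"{reference} via {policy_id}")
    return covered


result = coverage({"POL-007": "CC-ACCESS-CONTROL"})
for framework, references in result.items():
    print(framework, references)
```

A single match between one policy and one common control yields an entry for each of the three frameworks, which is the saving the abstraction layer is meant to provide.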

1. Upload the evidence. The document is loaded into the system and assigned a unique ID for the audit trail.
2. Extract relevant information. Either from structured fields or via OCR, depending on document type. Metadata such as issue date, issuer and validity period is captured.
3. Validate completeness and format. Missing signature? Date in the future? File unreadable? Anything structurally broken gets filtered out here.
4. Map to control via graph. The system maps the evidence through the knowledge graph to a specific control: which regulation, which requirement, which control point does it support?
5. Check freshness. Is the evidence still valid? A 2023 pen-test certificate does not prove 2026 systems are secure. Expired evidence shifts status from compliant to partial.
6. Classify the compliance status. The system classifies the evidence as compliant, partial, gap or not applicable, and writes the result with justification to the audit trail.

Each of these six steps produces an audit trail entry. Freshness is a hard condition here, not an optional filter. Automated freshness checks are the place where AI systems make the biggest operational difference, because no compliance team with 200 pieces of evidence can keep up manually.
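The six steps can be sketched as one function that appends an audit trail entry per step. Several things here are assumptions of the sketch: the `Evidence` fields, the extra outcome "rejected" for evidence filtered out in step 3, and the choice to treat evidence without a matching control as a gap. "Not applicable" is not produced here because applicability is assumed to be decided upstream.

```python
import uuid
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Evidence:
    # Hypothetical fields; a real system would fill them from forms or OCR.
    doc_type: str
    issued: Optional[date]
    valid_until: Optional[date]
    signed: bool


def verify_evidence(ev: Evidence, today: date,
                    control_for: dict[str, str]) -> tuple[str, list[str]]:
    """Run the six steps in order; return (status, audit trail)."""
    trail = [f"1 upload: id={uuid.uuid4()}"]
    trail.append(f"2 extract: issued={ev.issued}, valid_until={ev.valid_until}")

    if not ev.signed or ev.issued is None or ev.issued > today:
        trail.append("3 validate: rejected (signature or date check failed)")
        return "rejected", trail
    trail.append("3 validate: passed")

    control = control_for.get(ev.doc_type)
    trail.append(f"4 map: control={control}")
    if control is None:
        # Assumption of this sketch: no matching control counts as a gap.
        trail.append("6 classify: gap (no matching control)")
        return "gap", trail

    fresh = ev.valid_until is not None and ev.valid_until >= today
    trail.append(f"5 freshness: {'valid' if fresh else 'expired'}")

    status = "compliant" if fresh else "partial"  # expired evidence downgrades
    trail.append(f"6 classify: {status}")
    return status, trail


pen_test = Evidence("pen-test report", date(2023, 5, 2), date(2024, 5, 2), signed=True)
status, trail = verify_evidence(pen_test, date(2026, 4, 13),
                                {"pen-test report": "CC-VULN-MGMT"})
print(status)  # partial
```

The freshness check is a plain date comparison, which is why it can run automatically across every stored document on a schedule.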

Data sovereignty is non-negotiable for mid-market companies. Regulatory texts (DORA, ISO 27001, EU AI Act) are public and can be processed in a cloud environment. Internal policies, evidence and risk assessments are confidential. A mid-sized company with 200 employees does not send its internal audit documents into a US cloud. An AI compliance architecture must support hybrid deployment: public data in the cloud, confidential data on-premise or in an EU data centre.

Where the real complexity sits

The biggest surprise in our project was the initial knowledge graph build. It took more effort than models, vector search and orchestration combined.

Every regulatory document has to be processed by an LLM to extract entities (regulations, requirements, controls) and their relationships. That costs money. Realistic range: EUR 40 to 80 per document (estimate). For a starter set of 70 documents, that is EUR 2,800 to 5,600 before the first query runs. That number appears in no tutorial I have read.

Second surprise: ontology quality assurance. Auto-extracted relationships are not always correct. An LLM interprets a recommendation as a mandate, or links a control to the wrong requirement. Manual review is mandatory, and not a one-off task. Regulations change, new frameworks arrive, existing relationships need updating. It is an ongoing process that requires dedicated capacity.

The third insight surprised us. We now express regulatory requirements as testable statements. "GIVEN protocol is TCP AND port is 22, WHEN evidence is evaluated, THEN action must be DROP." It sounds like software testing because it is. The lawyers on the team write these tests themselves now. No joke. Expressing compliance requirements as formal test cases forces a level of precision natural language alone does not deliver.
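The GIVEN/WHEN/THEN statement quoted above translates almost word for word into a test. A minimal sketch, assuming firewall rule evidence arrives as a dictionary with the hypothetical keys `protocol`, `port` and `action`:

```python
def satisfies_ssh_requirement(rule: dict) -> bool:
    """GIVEN protocol is TCP AND port is 22, THEN action must be DROP.

    Returns True when the rule meets the requirement or is out of scope.
    """
    in_scope = rule["protocol"] == "TCP" and rule["port"] == 22  # GIVEN
    if not in_scope:
        return True
    return rule["action"] == "DROP"                              # THEN


def test_ssh_must_be_dropped():
    # WHEN evidence is evaluated
    assert satisfies_ssh_requirement({"protocol": "TCP", "port": 22, "action": "DROP"})
    assert not satisfies_ssh_requirement({"protocol": "TCP", "port": 22, "action": "ACCEPT"})
    # Out of scope: the requirement says nothing about port 443.
    assert satisfies_ssh_requirement({"protocol": "TCP", "port": 443, "action": "ACCEPT"})


test_ssh_must_be_dropped()
```

Writing the out-of-scope case down is where the precision comes from: the test forces a decision on what the requirement does not cover.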

Our Take

The framework question is the wrong first question.

Starting with "should we use LangGraph or CrewAI" skips the three decisions that structure this article. The right first question is: which parts of my system must produce deterministic results?

Once you have answered that, you know where a knowledge graph belongs, where vector search is enough, and where an LLM adds genuine value. The framework choice follows almost automatically.

What we have learned at Ethenios is an organisational insight. Architecture decisions outweigh tooling decisions. A modular, well-designed AI compliance architecture works with LangGraph, with CrewAI, or with a self-built orchestrator. The wrong architecture fails with any framework. Well, actually, it fails with the first auditor who says "show me the evidence chain."

For the mid-market, this is good news. Recent data from Germany shows that 36% of mid-sized companies with 50+ employees now use AI, compared with 6% in the 2016 to 2018 period⁷. Most are early enough in the journey to resolve architecture questions before production. Mid-market companies have real advantages over large corporations here: faster decisions, less legacy, fewer parallel initiatives blocking each other.

In Part 3, we show what framework evaluation looks like once these three decisions are already made. The result differs significantly from the usual comparison tables, because we rate by criteria missing from most comparisons: auditability, checkpoint capability and traceability of every intermediate result.

If you want to put your AI compliance architecture to the test, we start with a 30-minute readiness check. We identify which of your compliance processes require deterministic results, and where AI has the biggest impact.

Sources

¹t3n, KI-Agenten scheitern an Architekturfehlern, April 2026: only 16% of enterprise deployments are genuine agents; 70% of regulated companies rebuild their stack within the first 90 days

²HKA/KARL study, October 2025: AI use in the German mid-market (n=517, 20-500 employees): 40% use AI, only 21% have an AI strategy

³Gartner, Knowledge Graphs as Critical Enabler for Generative AI, 2026: knowledge graphs as a critical enabler in regulated environments

⁴Han et al., GraphRAG Accuracy Study, 2024: GraphRAG reaches 90.6% accuracy on exact answer matches vs. 65.6% for standard RAG

⁵Google DeepMind/MIT, Multi-Agent Cost Study, December 2025: multi-agent systems cost many times more per task than single agents, with worse results

⁶LangChain, LangGraph Documentation, 2026: durable StateGraphs with checkpointing and human-in-the-loop gates

⁷KfW-Mittelstandspanel, press release, February 2026: 20% of mid-sized companies use AI, 36% of those with 50+ employees, 53% of companies conducting R&D