Most framework comparisons rate speed, developer experience and scalability. None of them rate whether the framework is fit for regulated environments. Those are two different evaluations with two different outcomes.
After Part 1 of this series, one question kept coming back: "Which framework should we use?" It is a fair question, and it skips three decisions that outweigh any framework choice. This article works through those decisions first. The right AI compliance architecture with the wrong framework still works. The wrong architecture, even with the best framework, does not. 70% of regulated companies rebuild their AI agent stack entirely within the first 90 days because the first architecture does not hold up in production [1].
Where does an AI compliance architecture need determinism, and where can it work probabilistically?
Three months ago: audit support at a mid-sized insurer with around 450 employees. The company has been running a RAG system for about a year to help claims handlers answer questions about policy terms. The internal audit's test query: "What reporting deadline applies to water damage under the 2022 terms?" The vector database returns a 93% match but cites a paragraph from the 2018 terms, which belong to a different product generation with different deadlines. The auditor had their finding.
The reason is instructive. A probabilistic system answered a deterministic question. Semantic similarity tells you that two texts cover the same topic. An auditor asks something different. The auditor wants to know: is this requirement fulfilled, yes or no, and where is the evidence?
I hear "we need to build multi-agent systems" and "we need RAG" in every other conversation. My follow-up question: for what, exactly? Compliance demands determinism. Half the industry is building probabilistic systems because multi-agent and RAG are the terms that work on conference stages right now. The architecture decides. The framework choice comes after. A recent survey of 517 German mid-market companies confirms the pattern: 40% have deployed AI, but only 21% have an AI strategy. Tools get introduced before the architecture questions are answered [2].
The critical distinction here is not binary. A compliance status has four states. Compliant means the control is documented, implemented, and evidenced. Partial means the control exists, but evidence is missing or stale. Gap means no matching control exists. Not applicable means the control does not apply to this company profile. A system that measures only similarity cannot distinguish these four states. It says "94% match." An architecturally sound system says "partial: policy present, evidence expired 14 months ago."
This principle holds across domains. In compliance, "MFA is recommended" and "MFA is mandatory" mean different things. In quality management, "measurement was recorded" and "measurement is within tolerance" mean different things. In financial controls, "invoice exists" and "invoice is approved and posted" mean different things. In every case, a document that mentions a topic does not prove that an obligation is met.
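The four states can be written down as an explicit decision rule instead of a similarity score. The following is a minimal sketch, assuming a hypothetical `ControlRecord` with three fields; the field names and the order of the checks are illustrative assumptions, not the schema of any particular system.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Status(Enum):
    COMPLIANT = "compliant"
    PARTIAL = "partial"
    GAP = "gap"
    NOT_APPLICABLE = "not applicable"


@dataclass
class ControlRecord:
    # Hypothetical fields, chosen for illustration only.
    applicable: bool                      # does the control apply to this company profile?
    implemented: bool                     # is a matching control documented and implemented?
    evidence_valid_until: Optional[date]  # None means no evidence on file


def classify(record: ControlRecord, today: date) -> Status:
    """Return exactly one of the four states; no similarity score involved."""
    if not record.applicable:
        return Status.NOT_APPLICABLE
    if not record.implemented:
        return Status.GAP
    # The control exists, but evidence is missing or stale.
    if record.evidence_valid_until is None or record.evidence_valid_until < today:
        return Status.PARTIAL
    return Status.COMPLIANT
```

Because every branch is an explicit condition, the same record always yields the same status, and the reason for that status can be read off the branch that fired.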
Where determinism is required vs. where AI can work probabilistically
| Task | Mode | Technical implementation |
| --- | --- | --- |
| Which regulation applies? | Deterministic | Graph query based on company profile |
| Which controls are mandatory? | Deterministic | Graph traversal through regulatory structure |
| Which evidence is missing? | Deterministic | Database check with timestamp |
| How well does a policy cover a control? | Probabilistic with guardrails | Semantic verification with structured output |
| How should a policy recommendation be phrased? | Probabilistic with guardrails | LLM generation grounded in control references |
| How should 40 gaps be prioritised by business relevance? | Probabilistic with guardrails | LLM ranking with risk scoring |
The deterministic tasks tolerate no "it depends." The probabilistic tasks are where AI adds genuine value, but they need guardrails. Making this split before writing a single line of code saves you the architecture rebuild after the first failed audit.
When is a knowledge graph the better choice over vector search?
The second decision concerns the layers of your AI compliance architecture. Most RAG implementations have one layer: vector search. For compliance, that is insufficient.
A compliance system that survives an audit needs three layers. The knowledge graph represents regulatory structure and decides what is applicable. Vector search finds semantically relevant documents within the applicable scope. The LLM explains why a result matters and produces human-readable reports. Gartner calls knowledge graphs a "Critical Enabler" for generative AI in regulated environments [3]. The reason: knowledge graphs deliver the structured truth that LLMs cannot produce on their own.
The sequence of these three layers is a deliberate architecture decision. The graph filters first. Vector search refines second. The LLM explains last. The LLM is never the source of truth. In compliance, that is a governance requirement.
The accuracy gap is measurable. GraphRAG systems reach 90.6% accuracy on exact-match compliance benchmarks. Standard RAG manages 65.6% [4]. That is 25 percentage points an auditor will notice.
Here is what that looks like in practice. A company profile contains structured data: industry (financial services), jurisdiction (EU), data types (personal and financial), company size (200 employees). This input drives the graph query. Result: DORA, GDPR and ISO 27001 are applicable. HIPAA is not, because the company does not process US health data. Only after this filter does vector search begin, and only within the applicable regulations.
Without this first step, vector search scans the entire regulatory database. It finds „similar" paragraphs from regulations that do not even apply to the company. With this step, it searches only where it is actually relevant. In our experience, this cuts false positives by more than half (estimate).
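The filter-then-search order can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `APPLICABILITY` rules stand in for a real knowledge graph query, treating ISO 27001 as always in scope is a simplification, and the word-overlap score is a toy substitute for embedding-based vector search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompanyProfile:
    industry: str
    jurisdiction: str
    data_types: frozenset
    employees: int


# Hypothetical applicability rules standing in for a knowledge graph query.
APPLICABILITY = {
    "DORA": lambda p: p.jurisdiction == "EU" and p.industry == "financial services",
    "GDPR": lambda p: p.jurisdiction == "EU" and "personal" in p.data_types,
    "ISO 27001": lambda p: True,  # assumption: treated as always in scope here
    "HIPAA": lambda p: "us_health" in p.data_types,
}


def applicable_regulations(profile):
    """Deterministic step: the same profile always yields the same scope."""
    return sorted(name for name, rule in APPLICABILITY.items() if rule(profile))


def scoped_search(query, paragraphs, scope):
    """Probabilistic step: rank paragraphs, but only inside the applicable scope.

    `paragraphs` is a list of (regulation, text) pairs. The score is a simple
    word-overlap ratio, used here in place of a real vector similarity.
    """
    query_words = set(query.lower().split())
    hits = []
    for regulation, text in paragraphs:
        if regulation not in scope:
            continue  # never surface text from a regulation that does not apply
        words = set(text.lower().split())
        score = len(query_words & words) / len(query_words | words)
        if score > 0:
            hits.append((score, regulation, text))
    return sorted(hits, reverse=True)
```

With the profile from the example (financial services, EU, personal and financial data, 200 employees), `applicable_regulations` returns DORA, GDPR and ISO 27001, and a HIPAA paragraph can no longer appear in the search results however similar its wording is.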
The entire workflow fits into eight lines:
INPUT: Company profile (industry, jurisdiction, size, data types)
STEP 1: Graph query → filter applicable regulations [deterministic]
STEP 2: Graph query → load associated controls [deterministic]
STEP 3: Vector search → match internal policies to controls [probabilistic]
STEP 4: Semantic verification → classify control status [probabilistic]
        (compliant | partial | gap | not applicable)
STEP 5: Risk assessment → prioritise gaps [probabilistic]
OUTPUT: Structured compliance report with evidence chain
AI compliance review workflow
The tags show it clearly: decision 1 determines which technology goes where. Steps 1 and 2 are graph queries with no LLM involvement. From step 3 onwards, AI contributes, but with structured outputs and predefined classification categories.
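One way to enforce "structured outputs and predefined classification categories" is to validate every model response before it enters the report. The sketch below assumes the model is asked to return JSON with the hypothetical keys `control_id`, `status` and `justification`; the key names and the error handling are illustrative, not a prescribed format.

```python
import json

# The four predefined categories from the workflow above.
ALLOWED_STATUSES = {"compliant", "partial", "gap", "not applicable"}


def parse_classification(raw: str) -> dict:
    """Accept a model response only if it is a JSON object with an allowed status."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("response is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    status = data.get("status")
    if not isinstance(status, str) or status not in ALLOWED_STATUSES:
        raise ValueError(f"status {status!r} is not one of the predefined categories")
    if not data.get("control_id") or not data.get("justification"):
        raise ValueError("control_id and justification are required")

    # Return only the expected fields so nothing unvalidated leaks downstream.
    return {
        "control_id": data["control_id"],
        "status": status,
        "justification": data["justification"],
    }
```

A response such as `{"status": "94% match"}` is rejected rather than passed on, so a similarity score cannot end up in the report where a status belongs.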
When does a compliance system need agents, and when is a pipeline enough?
The third decision is the one most often answered wrong.
A pipeline follows a fixed sequence of steps. Inputs and outputs are structured. Results are reproducible. Ask the same question twice, get the same answer. For most compliance tasks, that is exactly right.
An agent makes runtime decisions. It chooses which step to execute next, which data sources to combine, when to stop. That is powerful, and for certain tasks it is necessary. For many compliance tasks, it is unnecessarily risky. A Google DeepMind/MIT study from December 2025 quantifies it: multi-agent systems cost many times more per solved task than single agents, while producing worse results [5].
"Which regulations apply to this company?" is a deterministic graph query. Not an agent problem. "Analyse this 200-page policy against 42 controls, identify gaps, prioritise by business risk, and suggest policy amendments" is a multi-step problem where intermediate results shape the next step. That is where agents earn their place.
Even then, the right approach is orchestrated workflows with defined steps and human approval gates. Not autonomous loops where an agent decides everything at runtime. Frameworks like LangGraph provide durable StateGraphs with checkpointing and human-in-the-loop gates [6]. A compliance agent needs checkpoints after each step, human review gates before high-risk decisions, and an audit trail documenting why each decision was made.
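The three requirements (checkpoints, review gates, audit trail) can be sketched without any framework. The code below is plain Python and deliberately does not use LangGraph's API; `Run`, `Step`, `execute` and the `approve` callback are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], dict]  # takes the current state, returns the new state
    rationale: str = ""             # why this step exists; copied into the audit trail
    high_risk: bool = False         # high-risk steps need human approval first


@dataclass
class Run:
    state: dict
    checkpoints: list = field(default_factory=list)
    audit_trail: list = field(default_factory=list)


def execute(run: Run, steps: list, approve: Callable[[str, dict], bool]) -> Run:
    """Run the steps in their fixed order; stop at the first denied approval."""
    for step in steps:
        if step.high_risk and not approve(step.name, run.state):
            run.audit_trail.append(
                {"step": step.name, "outcome": "halted: approval denied",
                 "rationale": step.rationale}
            )
            break
        run.state = step.action(dict(run.state))
        run.checkpoints.append((step.name, dict(run.state)))  # snapshot after each step
        run.audit_trail.append(
            {"step": step.name, "outcome": "completed", "rationale": step.rationale}
        )
    return run
```

The order of steps is fixed in advance, which is the design choice that separates an orchestrated workflow from an autonomous loop. In a real deployment the `approve` callback would block until a reviewer responds, and the checkpoints would be persisted rather than held in memory.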
The difference between a marketing chatbot and a compliance system comes down to one question: is every intermediate step traceable and auditable? An autonomous agent that decides everything on its own is a risk no compliance officer will sign off on. According to industry analysis, only 16% of all enterprise deployments qualify as actual agents (systems that plan, execute, observe and adapt). The rest are, in the authors' words, "glorified chatbots with an API call" [1].
How does an AI compliance architecture verify evidence in six steps?
The architecture decisions from the previous sections define the frame. Evidence verification shows how that frame works in practice. In an audit-proof AI compliance architecture, a compliance record moves through six clearly defined steps. The bridge across frameworks is the "Common Control" layer: it translates between ISO 27001 A.9.4.2, SOC 2 CC6.1 and DORA Art. 9, all of which require access control in different wording. Without this abstraction layer, you have to match every framework individually against every policy. With it, you match once and get coverage across frameworks.
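The "match once, cover many" idea behind the Common Control layer can be shown with a small lookup. This is a sketch under stated assumptions: the identifier `access_control` and the dictionary layout are hypothetical; only the three framework references come from the text above.

```python
# One common control, expressed in three frameworks with different wording.
COMMON_CONTROLS = {
    "access_control": {          # hypothetical identifier for the common control
        "ISO 27001": "A.9.4.2",
        "SOC 2": "CC6.1",
        "DORA": "Art. 9",
    },
}


def coverage(policy_matches: dict) -> dict:
    """Expand one result per common control into one result per framework reference.

    `policy_matches` maps a common control identifier to True or False, i.e.
    whether an internal policy was matched against that control.
    """
    result = {}
    for control, matched in policy_matches.items():
        for framework, reference in COMMON_CONTROLS[control].items():
            result[(framework, reference)] = matched
    return result
```

A single match against `access_control` therefore yields three coverage entries, one per framework, instead of three separate matching runs.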
1. Upload the evidence. The document is loaded into the system and assigned a unique ID for the audit trail.
2. Extract relevant information. Either from structured fields or via OCR, depending on document type. Metadata such as issue date, issuer and validity period is captured.
3. Validate completeness and format. Missing signature? Date in the future? File unreadable? Anything structurally broken gets filtered out here.
4. Map to control via graph. The system maps the evidence through the knowledge graph to a specific control: which regulation, which requirement, which control point does it support?
5. Freshness check. Is the evidence still valid? A 2023 pen-test certificate does not prove 2026 systems are secure. Expired evidence shifts status from compliant to partial.
6. Compliance status classification. The system classifies the evidence as compliant, partial, gap or not applicable, and writes the result with justification to the audit trail.
Each of these six steps produces an audit trail entry. Freshness is a hard condition here, not an optional filter. Automated freshness checks are the place where AI systems make the biggest operational difference, because no compliance team with 200 pieces of evidence can keep up manually.
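The steps above can be sketched as a single function that appends an audit trail entry at every stage. This is a simplified illustration, not a full implementation: extraction (step 2) and the graph mapping (step 4) are assumed to have happened already, so the hypothetical `Evidence` record arrives with its metadata and `control_id` filled in.

```python
import uuid
from dataclasses import dataclass
from datetime import date


@dataclass
class Evidence:
    # Hypothetical record; the fields are assumed to come from extraction and graph mapping.
    control_id: str
    issued: date
    valid_until: date
    signed: bool


def verify(evidence: Evidence, today: date) -> dict:
    evidence_id = str(uuid.uuid4())  # unique ID for the audit trail
    trail = [f"{evidence_id}: uploaded"]

    # Structural validation: anything broken is filtered out here.
    problems = []
    if not evidence.signed:
        problems.append("missing signature")
    if evidence.issued > today:
        problems.append("issue date in the future")
    if problems:
        trail.append("rejected: " + ", ".join(problems))
        return {"id": evidence_id, "status": "rejected", "trail": trail}
    trail.append("validated")
    trail.append(f"mapped to control {evidence.control_id}")

    # Freshness is a hard condition: expired evidence cannot yield "compliant".
    fresh = evidence.valid_until >= today
    trail.append("fresh" if fresh else f"expired on {evidence.valid_until.isoformat()}")

    status = "compliant" if fresh else "partial"
    trail.append(f"classified as {status}")
    return {"id": evidence_id, "status": status, "trail": trail}
```

Run nightly over all stored evidence with the current date, a check like this flags every record whose validity has lapsed, which is the part that does not scale manually.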
Data sovereignty is a non-negotiable for mid-market. Regulatory texts (DORA, ISO 27001, EU AI Act) are public and can be processed in a cloud environment. Internal policies, evidence and risk assessments are confidential. A mid-sized company with 200 employees does not send its internal audit documents into a US cloud. An AI compliance architecture must support hybrid deployment: public data in the cloud, confidential data on-premise or in an EU data centre.
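The hybrid split can be expressed as a routing rule. In this minimal sketch the document kinds and the `Target` names are hypothetical labels; routing unknown kinds to the restricted environment is an added assumption (fail closed), not something the text prescribes.

```python
from enum import Enum


class Target(Enum):
    CLOUD = "cloud"
    RESTRICTED = "on-premise or EU data centre"


# Only public regulatory texts may be processed in a cloud environment.
PUBLIC_KINDS = {"regulatory_text"}


def deployment_target(kind: str) -> Target:
    """Route a document by kind; anything not known to be public stays restricted."""
    if kind in PUBLIC_KINDS:
        return Target.CLOUD
    # Internal policies, evidence, risk assessments and unknown kinds land here.
    return Target.RESTRICTED
```

Listing the public kinds explicitly and restricting everything else means a newly introduced document type cannot leave the controlled environment by accident.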
Where the real complexity sits
The biggest surprise in our project was the initial knowledge graph build. More effort than models, vector search and orchestration combined.
Every regulatory document has to be processed by an LLM to extract entities (regulations, requirements, controls) and their relationships. That costs money. Realistic range: EUR 40 to 80 per document (estimate). For a starter set of 70 documents, that comes to roughly EUR 2,800 to 5,600 before the first query runs. That number appears in no tutorial I have read.
Second surprise: ontology quality assurance. Auto-extracted relationships are not always correct. An LLM interprets a recommendation as a mandate, or links a control to the wrong requirement. Manual review is mandatory, and not a one-off task. Regulations change, new frameworks arrive, existing relationships need updating. It is an ongoing process that requires dedicated capacity.
The third insight surprised us. We now express regulatory requirements as testable statements. "GIVEN protocol is TCP AND port is 22, WHEN evidence is evaluated, THEN action must be DROP." It sounds like software testing because it is. The lawyers on the team write these tests themselves now. No joke. Expressing compliance requirements as formal test cases forces a level of precision natural language alone does not deliver.
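The GIVEN/WHEN/THEN statement quoted above translates almost word for word into an executable check. The following sketch assumes the evidence is a list of firewall rules; `FirewallRule` and `violations` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirewallRule:
    protocol: str
    port: int
    action: str


def violations(rules):
    """GIVEN protocol is TCP AND port is 22, WHEN evidence is evaluated,
    THEN action must be DROP. Returns every rule that breaks the requirement."""
    return [
        rule for rule in rules
        if rule.protocol == "TCP" and rule.port == 22 and rule.action != "DROP"
    ]


def test_ssh_must_be_dropped():
    evidence = [
        FirewallRule("TCP", 22, "DROP"),
        FirewallRule("TCP", 443, "ACCEPT"),
    ]
    assert violations(evidence) == []
```

The requirement either holds for the submitted evidence or it returns the offending rules, so the result is a yes or no with the evidence attached.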
Our Take
The framework question is the wrong first question.
Starting with "should we use LangGraph or CrewAI" skips the three decisions that structure this article. The right first question is: which parts of my system must produce deterministic results?
Once you have answered that, you know where a knowledge graph belongs, where vector search is enough, and where an LLM adds genuine value. The framework choice follows almost automatically.
What we have learned at Ethenios is an organisational insight. Architecture decisions outweigh tooling decisions. A modular, well-designed AI compliance architecture works with LangGraph, with CrewAI, or with a self-built orchestrator. The wrong architecture fails with any framework. Well, actually it fails with the first auditor who says "show me the evidence chain."
For mid-market, this is good news. Recent data from Germany shows that 36% of mid-sized companies with 50+ employees now use AI, compared to six percent in the 2016 to 2018 period [7]. Most are early enough in the journey to resolve architecture questions before production. Mid-market has real advantages over large corporations here: faster decisions, less legacy, fewer parallel initiatives blocking each other.
In Part 3, we show what framework evaluation looks like once these three decisions are already made. The result differs significantly from the usual comparison tables, because we rate by criteria missing from most comparisons: auditability, checkpoint capability and traceability of every intermediate result.
If you want to put your AI compliance architecture to the test, we start with a 30-minute readiness check. We identify which of your compliance processes require deterministic results, and where AI has the biggest impact.