Two weeks back, we sat in a workshop with a RegTech scale-up. The opening question was the standard one: which multi-agent framework for the mid-market, LangGraph or CrewAI? After three hours, the whiteboard had something else on it. A longer list of evaluation criteria. Vendor stability. The skill pool inside the team. Regulatory roadmap. Audit primitives. Some of it technical, plenty of it organizational. By the end of the session, every single axis on that list weighed more than the opening question.
Two data points frame why this happens. ThoughtWorks moved LangGraph out of the Adopt ring in November 2025 [1]. Microsoft consolidated AutoGen and Semantic Kernel into Microsoft Agent Framework in April 2026, with the 1.0 release shipping the same month [2]. Anyone who placed a framework bet in early 2025 is looking at a different recommendation landscape today. That is the velocity of a field without established best practices, and it changes how you make architecture decisions.
Four frameworks lead the multi-agent architecture market: LangGraph, CrewAI, Microsoft Agent Framework, and Haystack. Which of them fits a mid-market company is decided by nine criteria that extend classical architecture checklists. GitHub stars and benchmark speed do not settle that question. Audit primitives, vendor velocity risk, and the in-house skill pool weigh heavier than any feature list. This article delivers the nine criteria, a cluster view of the framework landscape, and a selection heuristic that builds on the architecture decisions from Part 2.
Why do most multi-agent framework comparisons miss your actual problem?
Standard comparison tables grade speed, developer experience, GitHub stars, and token consumption in benchmark scenarios. These criteria are part of the picture. They fall short because they say little about whether a framework is still viable in 18 months.
After the three architecture decisions for AI compliance from Part 2, the starting position is different. You have decided where determinism is required and where probabilistic reasoning is enough, whether a knowledge graph and vector search are combined, and whether a workflow runs as a pipeline or as an orchestrated agent. The framework question now reads: which tool implements those decisions most reliably, in a market whose vendors revise their roadmaps every few months?
The evaluation axes shift. Audit primitives become the mandatory axis, since EU AI Act and DORA both require lifecycle logging. Vendor velocity risk replaces GitHub stars as the stability indicator. Maturity counts twice: for the framework itself, and for the team that has to operate it.
Which frameworks even exist?
Follow only the English-language tech discussion, and you see four to five names dominate: LangGraph, CrewAI, AutoGen, LlamaIndex. The German mid-market job ad pattern is different. Framework names appear less often than in enterprise listings. The typical mid-market posting reads "AI engineer" or "data scientist with LLM experience" without naming a stack. The reason is structural: a mid-market IT department with three seniors, one of whom does AI, picks the option that least devalues existing knowledge.
Skill bias of this kind is measurable. The Stanford-ADP study from August 2025 found that employment for software developers aged 22 to 25 in highly AI-exposed roles dropped by approximately 13 percent since late 2022 in the United States, while older cohorts in the same roles saw 6 to 9 percent growth [4]. The German pattern matches: Indeed Hiring Lab Germany reported a 33 percent drop in software development job postings between January and November 2024 [5], and the German Federal Employment Agency registered 31 percent more unemployed software developers in July 2025 compared to the prior year [6]. On top of that, 109,000 IT positions remain unfilled in Germany, with 85 percent of companies reporting shortages [3], while 70 percent of employees receive no AI training from their employer [7]. The senior pool defines the skill base, and junior pipelines have stopped feeding it at the previous rate.
A cluster view fits the actual market better than a top list. We group the field into five clusters with one or two representatives each.
The first cluster covers code-first orchestration with built-in persistence, with LangGraph as the leading representative. Strength: audit trail and checkpointing. The second cluster groups role-based multi-agent frameworks. CrewAI sits next to AutoGen, now folded into Microsoft Agent Framework. Strength: multi-step analysis tasks with role separation. The third cluster combines workflow engines with AI capabilities. LlamaIndex Workflows brings code-first composition; n8n adds low-code orchestration with self-hosting as a default. The fourth cluster matters most for German contexts: DACH open source plus controlled in-house build. Haystack from deepset Berlin handles knowledge graph and retrieval logic with an air-gapped deployment path; OpenAI Agents SDK and Claude Agent SDK enable controlled minimal stacks for teams that prefer direct API access over framework abstraction. The fifth cluster covers cloud-native agent services. Microsoft Agent Framework on Azure AI Foundry and AWS Bedrock Agents matter where existing cloud contracts and compliance certifications already live.
The grouping orders the landscape rather than ranking it. It sets up the next nine criteria.
Which nine criteria extend the classical architecture checklist?
A classical architecture checklist tests six non-functional requirements: resilience, scalability, maintainability, replaceability, security, cost. Those six remain mandatory. Once the decision is being made in a field that changes faster than the typical lifespan of a software system, they need extensions.
In the RegTech workshop, we extended the checklist with nine items. Each item has a technical and an organizational component.
Model agnosticism and LLM swappability. Is the business logic decoupled from the model provider? Can you swap GPT, Claude, and Mistral via configuration without touching application code? LangGraph, CrewAI, AutoGen, and Haystack are model-agnostic. OpenAI Agents SDK and Claude Agent SDK are not. That is a lock-in axis that can become expensive in 18 months.
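To make the axis concrete: a minimal sketch of the configuration seam, assuming LangChain's `init_chat_model` as the provider factory (LangGraph builds on it; CrewAI and Haystack expose comparable model-agnostic factories). The `LLM_PROVIDER` and `LLM_MODEL` environment variables are our own convention, not part of any framework.

```python
# A sketch of config-driven model selection; assumes langchain plus the
# relevant provider package (langchain-openai, langchain-anthropic, ...).
import os

from langchain.chat_models import init_chat_model

def build_llm():
    """Resolve the chat model from configuration, not from application code."""
    provider = os.environ.get("LLM_PROVIDER", "openai")  # e.g. "anthropic", "mistralai"
    model = os.environ.get("LLM_MODEL", "gpt-4o")
    return init_chat_model(model, model_provider=provider, temperature=0)

# Business logic depends only on the returned model's invoke() interface,
# so swapping GPT, Claude, or Mistral is a config change, not a refactor.
llm = build_llm()
print(llm.invoke("Summarise DORA Article 12 in one sentence.").content)
```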
Vendor velocity risk. Who carries the framework, and how stable is that carrier? LangChain Inc. closed a Series B in October 2025 at 125 million dollars on a 1.25 billion valuation [8]. Microsoft folded AutoGen into MAF. CrewAI was founded in 2024 and closed an 18 million dollar Series A, led by Insight Partners [17]. deepset closed a 30 million dollar Series B in August 2023, led by Balderton Capital [18]. These numbers say nothing about product quality. They do say something about the probability that the project is still alive and maintained two years from now.
Reproducibility despite stochastic output. EU AI Act Article 12 requires automatic logging across the lifecycle of high-risk AI systems [9]. DORA Article 12 requires similar coverage in the financial sector since January 2025 [10]. Frameworks with native eval suites, trace logs, and seed management hold a structural advantage here.
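What such logging needs to capture fits in a few lines. A minimal sketch of one trace record per model call; the field names are our own, not taken from any framework or from the regulation text. Frameworks with native tracing write an equivalent record automatically.

```python
# Append-only audit record for a single LLM call: enough to prove what ran,
# when, with which parameters, and to attempt a replay later.
import hashlib
import json
import time
import uuid

def trace_record(model: str, seed: int, temperature: float,
                 prompt: str, response: str) -> dict:
    return {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "seed": seed,                # pin the seed where the provider API supports it
        "temperature": temperature,  # 0 narrows, but does not guarantee, determinism
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_sha256": hashlib.sha256(response.encode()).hexdigest(),
    }

with open("llm_audit.log", "a") as log:
    log.write(json.dumps(trace_record("gpt-4o", 42, 0.0, "prompt", "response")) + "\n")
```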
Token cost trajectory and cost observability. A 30-step conversation on a leading model costs roughly 0.50 to 2.00 dollars per execution depending on model and context size (estimate based on current provider pricing). At 10,000 daily executions, that is 5,000 to 20,000 dollars per day in LLM calls alone [11]. A framework that does not surface those costs per trace makes later model substitution significantly harder.
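The arithmetic behind those numbers, as a quick check; the per-execution range is this article's estimate, not a provider quote, and the 30-day month is a simplification.

```python
# Back-of-envelope LLM cost projection for a 30-step conversation workload.
COST_PER_EXECUTION = (0.50, 2.00)  # USD per run, estimate from current pricing
DAILY_EXECUTIONS = 10_000

low, high = (c * DAILY_EXECUTIONS for c in COST_PER_EXECUTION)
print(f"LLM spend: {low:,.0f} to {high:,.0f} USD/day, "
      f"{low * 30:,.0f} to {high * 30:,.0f} USD/month")
# -> LLM spend: 5,000 to 20,000 USD/day, 150,000 to 600,000 USD/month
```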
Data residency and routing control. Which data goes where, and is that enforceable in the framework? Haystack supports air-gapped deployment. Azure AI Foundry offers EU regions. n8n self-hosts natively. OpenAI and Claude SDKs are tied to their respective US providers, with EU data boundary commitments.
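Enforceability can start at the application boundary. A hedged sketch of a routing check; the allowlist and the data classification labels are our own convention, and real enforcement needs network policy on top of this framework-level guard.

```python
# Refuse a model call before any payload leaves the boundary.
ALLOWED_ROUTES = {
    "public": {"openai", "anthropic", "azure-eu", "on-prem"},
    "internal": {"azure-eu", "on-prem"},  # EU region or self-hosted only
    "audit-restricted": {"on-prem"},      # air-gapped path only
}

def assert_route(data_classification: str, provider: str) -> None:
    if provider not in ALLOWED_ROUTES[data_classification]:
        raise PermissionError(
            f"{provider!r} is not approved for {data_classification!r} data"
        )

assert_route("internal", "azure-eu")          # passes silently
# assert_route("audit-restricted", "openai")  # would raise PermissionError
```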
Maturity relative to the skill pool. Which frameworks does your team handle without launching a training initiative? Which would require external consultants or career-switcher programmes? 22 percent of mid-market companies run career-switcher programmes [12]. Those programmes typically cover whatever is most visible in the discussion space, not necessarily what fits your architecture.
Lock-in per layer separately. Do not assess lock-in in aggregate. Separate model lock-in, framework lock-in, vector DB lock-in, and observability lock-in. A framework can be model-agnostic and still pull you into its observability platform. LangGraph draws teams toward LangSmith. That is legitimate but should be a deliberate decision.
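One way to keep the layers separate is to write them down separately. An illustrative per-layer sheet for a LangGraph-plus-LangSmith stack; the ratings are our own assessments in the spirit of the table below, not vendor statements.

```python
# Per-layer lock-in profile; a single aggregate score would hide the last entry.
LOCKIN_PROFILE = {
    "model": "low",           # swappable via config, see model agnosticism above
    "framework": "medium",    # graph definitions are framework-specific code
    "vector_db": "low",       # standard interfaces, exportable embeddings
    "observability": "high",  # traces live in the vendor platform
}

levels = ["low", "medium", "high"]
worst = max(LOCKIN_PROFILE, key=lambda layer: levels.index(LOCKIN_PROFILE[layer]))
print(f"Highest lock-in layer: {worst}")  # -> observability
```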
Capability gap detection. When is your stack too small for the task, and which signals show that early? A framework that does not log when it hits its limits delays the moment you notice you need to switch.
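Those signals can be counted cheaply. A minimal sketch, assuming the orchestrator exposes the step count and whether a fallback answer was used; the names are our own, not a framework API.

```python
# Count runs that hit a hard limit, so the "stack too small" moment shows up
# on a dashboard instead of in an incident review.
from collections import Counter

GAP_SIGNALS = Counter()

def record_run(steps_used: int, max_steps: int, fallback_used: bool) -> None:
    if steps_used >= max_steps:
        GAP_SIGNALS["max_steps_hit"] += 1    # agent ran out of iterations
    if fallback_used:
        GAP_SIGNALS["fallback_answer"] += 1  # framework gave up gracefully

record_run(steps_used=20, max_steps=20, fallback_used=True)
print(dict(GAP_SIGNALS))  # alert when these rates trend upward
```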
Regulatory roadmap alignment. The EU AI Act phases in through 2027. DORA has been in force since January 2025. NIS2 implementation is under way in Germany. BSI C5:2026 has been the new cloud security baseline since April 2026. Which framework features cover these requirements today, which are announced, and which are missing?
How do the four frameworks score against these criteria?
We rate LangGraph, CrewAI, Microsoft Agent Framework, and Haystack on the nine criteria plus three audit primitives: checkpointing, replay, and human-in-the-loop. The rating reflects published sources and our own engagements. It is a snapshot from April 2026.
Four frameworks against nine criteria plus three audit primitives, April 2026 snapshot
| Criterion | LangGraph | CrewAI | MAF (Microsoft) | Haystack |
| --- | --- | --- | --- | --- |
| Model agnosticism | high | high | high | high |
| Vendor velocity risk | medium (Series B 2025) | high (young company, 2024) | low (Microsoft) | medium (Series B 2023) |
| Audit trail native | native via LangSmith | external build | OpenTelemetry, Entra ID | logging built in |
| Checkpointing | native, time-travel | custom build via Celery/Redis | native (MAF 1.0) | via document stores |
| Human-in-the-loop | explicit | validation nodes | native | pipeline composition |
| Data residency | cloud + self-hosted | CrewAI Enterprise (cloud) | Azure EU regions | on-premise, air-gapped |
| Maturity (as of 04/2026) | 1.0 GA since 10/2025 | 1.0 stable since late 2025 | 1.0 GA 04/2026 (consolidation) | established since 2020 |
| Skill pool in DACH mid-market | growing | medium | growing (Microsoft ecosystem) | strong in DACH OSS teams |
| Lock-in per layer | low model, high observability | low | medium (Azure) | low |
Three observations from the table matter more than the individual scores. First, no framework wins on all nine axes. Second, the framework with the strongest audit trail story (LangGraph) is not the one with the strongest data residency story (Haystack). Third, lock-in axes vary by layer, not by framework. Choosing LangGraph plus LangSmith creates a different lock-in profile than running LangGraph with your own observability stack.
A note on Microsoft Agent Framework: 1.0 went GA in April 2026, which initially looks "new" in any maturity column. MAF is in fact the consolidation of AutoGen (Microsoft Research, in production use since 2023) and Semantic Kernel (stable since 2023). The codebase and engineering team carry several years of production experience. The unified API is new; the substance behind it is not [2].
Where framework choice does become architectural
The core thesis from Part 2 reads: architecture weighs heavier than framework choice. That thesis needs sharpening. There is one place where framework choice does materially affect compliance outcomes: audit primitives.
DORA Article 12 requires automatic logging across the lifecycle of ICT systems [10]. EU AI Act Article 12 requires similar coverage for high-risk AI [9]. Frameworks with native checkpointing, time-travel debugging, and persistent state graphs satisfy these requirements with significantly less custom build than frameworks where you wire up persistence, replay, and audit trail by hand.
LangGraph holds a measurable head start here. Classic AutoGen did not; MAF is catching up. CrewAI requires a custom build through Celery queues and Redis stores [13]. If DORA or a comparable audit obligation is the primary driver of your architecture, that is one place where framework choice becomes architectural. It is one criterion out of nine. Depending on the mandate, it can be the most important one.
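What that head start looks like in code: a minimal sketch of LangGraph's checkpointing seam, written against the LangGraph 1.x API as we understand it (verify against the current docs). The review node is a placeholder; the point is that every step is persisted per thread and the history is queryable afterwards, which is the primitive the audit requirements lean on.

```python
from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver  # swap for a Postgres saver in production
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    finding: str

def review(state: State) -> State:
    return {"finding": state["finding"] + " | reviewed"}

builder = StateGraph(State)
builder.add_node("review", review)
builder.add_edge(START, "review")
builder.add_edge("review", END)

graph = builder.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "case-4711"}}
graph.invoke({"finding": "flagged transaction"}, config)

# Replay: the full state history for the thread is available after the fact.
for snapshot in graph.get_state_history(config):
    print(snapshot.values)
```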
What does mid-market need to consider that enterprises can ignore?
Enterprises absorb switch costs. Their dedicated AI platform teams complete a framework migration in twelve months. Mid-market companies do not have that reserve. The KfW Mittelstand panel for February 2026 shows that 20 percent of SMEs use AI; the figure rises to 53 percent at R&D-driven companies. The gap typically opens where skill pool and platform reserve are missing [14]. 76 percent of SMEs have no AI governance framework, yet 91 percent consider governance critical [15]. Only 19 percent have a structured AI roadmap. That gap is the actual challenge.
Three points shift accordingly for mid-market. First, the skill pool. If your IT department has three seniors, the question shifts: which framework can those three run in production within four weeks? Technical superiority on paper helps little when the team cannot carry the implementation. Second, data residency from Part 2 of this series. A 200-person company typically does not send internal audit documents to a US cloud. Frameworks with air-gapped or self-hosting paths carry different weight here than in enterprise contexts. Third, vendor velocity. In mid-market, a forced framework switch costs several quarters of delivery speed because the platform team that could absorb the switch alongside production work is missing.
The hidden-champion reality often differs from what the English-language discussion suggests. Haystack, the German open source framework from deepset, has documented production deployments at the European Commission, the German Federal Ministry of Research, the German Armed Forces, the State of Baden-Württemberg, Airbus, Lufthansa Industry Solutions, Infineon, and LEGO [16]. That list rarely appears in the major English-language top-5 comparisons. For a Baden-Württemberg machine builder focused on data sovereignty, it is a substantial data point.
How do you select the right framework cluster for your architecture?
Instead of a recommendation per framework, we offer a five-step heuristic. It works through the drivers in priority order, from primary to secondary; a condensed version as code follows the five steps.
If DORA, EU AI Act, or a comparable audit obligation is the primary driver, look at cluster one first (code-first with persistence, representative LangGraph). Native checkpointing primitives save build effort and ease audit preparation.
If data sovereignty is non-negotiable (defense, KRITIS, BaFin-regulated firms, pharma with materials data), look at cluster four first (DACH open source, representative Haystack) or cluster three with self-hosting (n8n). Air-gapped or on-premise paths are best documented there.
If you are on a Microsoft or AWS stack with active enterprise contracts and existing C5 attestations, look at cluster five first (cloud-native, representatives MAF on Azure AI Foundry or Bedrock Agents). You inherit compliance certifications from the platform.
If the workload sits in multi-step research and review chains where roles split clearly (one agent gathers, another summarises, a third checks against policy), and audit trail is not the top priority, look at cluster two first (role-based, with CrewAI as the typical entry point). Fastest prototyping, lowest learning curve.
If the skill pool is thin and rapid integration into existing workflows matters, look at cluster three with n8n first. Low-code lowers the requirement on senior engineering.
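Condensed into code, the five steps read as a first-pass router; deliberately crude, since the real decision weighs all nine criteria, and the driver labels are our own shorthand.

```python
# Order encodes priority: the first matching driver decides the entry cluster.
def first_cluster_to_evaluate(drivers: set[str]) -> str:
    if "audit_obligation" in drivers:      # DORA, EU AI Act high-risk
        return "cluster 1: code-first with persistence (LangGraph)"
    if "data_sovereignty" in drivers:      # defense, KRITIS, BaFin, pharma
        return "cluster 4: DACH open source (Haystack) or self-hosted n8n"
    if "cloud_contract" in drivers:        # active Azure/AWS contracts, C5 attestations
        return "cluster 5: cloud-native (MAF on Azure AI Foundry, Bedrock Agents)"
    if "role_split_workload" in drivers:   # multi-step research and review chains
        return "cluster 2: role-based (CrewAI)"
    return "cluster 3: workflow engines (n8n) for thin skill pools"

print(first_cluster_to_evaluate({"data_sovereignty", "role_split_workload"}))
# -> cluster 4: DACH open source (Haystack) or self-hosted n8n
```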
This heuristic is not exclusive. A real architecture decision often combines two clusters. Haystack for KG and retrieval logic, LangGraph for the agent orchestration on top, Postgres for persistence. Such combinations are frequently the most robust answer because they pick the strongest option per layer.
Our Take
The three-hour workshop with the RegTech scale-up ended with an insight we had not expected. We had not picked a framework by the end. We had designed an evaluation method. The method carries the actual value, because it will still hold in six months, when the choice it produces today may already have been replaced by another.
The thesis from Part 2 holds: architecture decisions outweigh framework choice. It needs a refinement. Audit primitives are the one place where framework choice materially shifts the compliance outcome. If you carry DORA or EU AI Act responsibility, weight that axis higher than the other eight. If you build an internal knowledge management system, weight it lower.
What we see at Convios across mandates: the skill pool question is consistently underestimated. A framework that looks superior in the whitepaper costs double in the mid-market when the existing seniors do not run it fluently. Building that skill externally takes six to twelve months (build duration from Part 1 on multi-agent RAG). The market keeps moving during that window.
The honest, occasionally uncomfortable answer reads: pick the framework your existing seniors can run in production within four weeks, and invest the freed time in clean architecture around it. A modular architecture survives a framework switch with manageable effort. Where modularity is missing, every switch forces a rewrite.
ThoughtWorks moved LangGraph out of Adopt in November 2025. Six months earlier it was the default choice. Six months from now something else may be default. An architecture that survives that movement turns the framework question into a detail question. That is the goal.
If you want to test your framework selection against these nine criteria or build your own evaluation method, we start with a 30-minute intro call.