
    Multi-Agent Frameworks for Mid-Market: Four Contenders, Nine Criteria, One Uncomfortable Answer

    Dr. Oliver Gausmann · April 29, 2026 · 11 min read


    Two weeks back, in a workshop with a RegTech scale-up, the opening question was the standard one: which multi-agent framework for mid-market, LangGraph or CrewAI? After three hours, the whiteboard had something else on it: a longer list of evaluation criteria. Vendor stability. The skill pool inside the team. Regulatory roadmap. Audit primitives. Some of it technical, plenty of it organizational. By the end of the session, every single axis on that list weighed more than the opening question.

    Two data points frame why this happens. ThoughtWorks moved LangGraph out of the Adopt ring in November 2025 [1]. Microsoft consolidated AutoGen and Semantic Kernel into Microsoft Agent Framework in April 2026, with the 1.0 release shipping the same month [2]. Anyone who placed a framework bet in early 2025 is looking at a different recommendation landscape today. That is the velocity of a field without established best practices, and it changes how you make architecture decisions.

    Four frameworks lead the multi-agent architecture market: LangGraph, CrewAI, Microsoft Agent Framework, and Haystack. Which of them fits a mid-market company is decided by nine criteria that extend classical architecture checklists. GitHub stars and benchmark speed do not settle that question. Audit primitives, vendor velocity risk, and the in-house skill pool weigh heavier than any feature list. This article delivers the nine criteria, a cluster view of the framework landscape, and a selection heuristic that builds on the architecture decisions from Part 2.

    Why do most multi-agent framework comparisons miss your actual problem?

    Standard comparison tables grade speed, developer experience, GitHub stars, and token consumption in benchmark scenarios. These criteria are part of the picture. They fall short because they say little about whether a framework will still be viable in 18 months.

    After the three architecture decisions for AI compliance from Part 2, the starting position is different. You have decided where determinism is required and where probabilistic reasoning is enough, whether knowledge graph and vector search are combined, whether a workflow runs as a pipeline or as an orchestrated agent. The framework question now reads: which tool implements those decisions most reliably, in a market whose vendors revise their roadmaps every few months?

    The evaluation axes shift. Audit primitives become the mandatory axis, since EU AI Act and DORA both require lifecycle logging. Vendor velocity risk replaces GitHub stars as the stability indicator. Maturity counts twice: for the framework itself, and for the team that has to operate it.

    Which frameworks even exist?

    Follow only the English-language tech discussion, and you see four to five names dominate: LangGraph, CrewAI, AutoGen, LlamaIndex. The German mid-market job ad pattern is different. Framework names appear less often than in enterprise listings. The typical mid-market posting reads "AI engineer" or "data scientist with LLM experience" without naming a stack. The reason is structural: a mid-market IT department with three seniors, one of whom does AI, picks the option that least devalues existing knowledge.

    Skill bias of this kind is measurable. The Stanford-ADP study from August 2025 found that employment for software developers aged 22 to 25 in highly AI-exposed roles dropped by approximately 13 percent since late 2022 in the United States, while older cohorts in the same roles saw 6 to 9 percent growth [4]. The German pattern matches: Indeed Hiring Lab Germany reported a 33 percent drop in software development job postings between January and November 2024 [5], and the German Federal Employment Agency registered 31 percent more unemployed software developers in July 2025 compared to the prior year [6]. On top of that, 109,000 IT positions remain unfilled in Germany with 85 percent of companies reporting shortages [3], while 70 percent of employees receive no AI training from their employer [7]. The senior pool defines the skill base, and junior pipelines have stopped feeding it at the previous rate.

    A cluster view fits the actual market better than a top list: five clusters, each with one or two representatives.

    Figure: the five clusters of the multi-agent framework landscape. Cluster 1: code-first with persistence (LangGraph; audit trail, checkpointing). Cluster 2: role-based multi-agent (CrewAI, AutoGen/MAF; multi-step analysis). Cluster 3: workflow engines with AI (LlamaIndex, n8n; self-host, low-code). Cluster 4: DACH open source and controlled in-house (Haystack, OpenAI/Claude SDK; data residency, air-gapped). Cluster 5: cloud-native agent services (Azure AI Foundry, Bedrock; platform compliance inherited). Inset, skill bias reality in the DACH mid-market: 109,000 unfilled IT positions in Germany (Bitkom 2025); junior software jobs down 13 to 20 percent since 2022 (Stanford-ADP study); 70 percent of employees receive no AI training. When the skill pool runs thin, the familiar wins and the appropriate gets passed over.

    The first cluster covers code-first orchestration with built-in persistence, with LangGraph as the leading representative. Strength: audit trail and checkpointing. The second cluster groups role-based multi-agent frameworks. CrewAI sits next to AutoGen, now folded into Microsoft Agent Framework. Strength: multi-step analysis tasks with role separation. The third cluster combines workflow engines with AI capabilities. LlamaIndex Workflows brings code-first composition, n8n adds low-code orchestration with self-hosting as a default. The fourth cluster matters most for German contexts: DACH open source plus controlled in-house build. Haystack from deepset Berlin handles knowledge graph and retrieval logic with an air-gapped deployment path; OpenAI Agents SDK and Claude Agent SDK enable controlled minimal stacks for teams that prefer direct API access over framework abstraction. The fifth cluster covers cloud-native agent services. Microsoft Agent Framework on Azure AI Foundry and AWS Bedrock Agents matter where existing cloud contracts and compliance certifications already live.

    The grouping orders the landscape rather than ranking it. It sets up the next nine criteria.

    Which nine criteria extend the classical architecture checklist?

    A classical architecture checklist tests six non-functional requirements: resilience, scalability, maintainability, replaceability, security, cost. Those six remain mandatory. Once the decision is being made in a field that changes faster than the typical lifespan of a software system, they need extensions.

    In the RegTech workshop, we extended the checklist by nine items. Each item has a technical and an organizational component.

    Figure: the classical architecture checklist, extended by nine AI-specific criteria. Classical NFR checklist (still mandatory): 1. resilience, 2. scalability, 3. maintainability, 4. replaceability, 5. security, 6. cost. AI-specific extensions (new for multi-agent architectures): 1. model agnosticism, 2. vendor velocity risk, 3. reproducibility, 4. token cost trajectory, 5. data residency routing, 6. maturity vs. skill pool, 7. lock-in per layer, 8. capability gap detection, 9. regulatory roadmap. Trigger: the market reorders every six months, and classical NFRs do not cover a field without established best practices.

    Model agnosticism and LLM swappability. Is the business logic decoupled from the model provider? Can you swap GPT, Claude, and Mistral via configuration without touching application code? LangGraph, CrewAI, AutoGen, and Haystack are model-agnostic. OpenAI Agents SDK and Claude Agent SDK are not. That is a lock-in axis that can become expensive in 18 months.
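    What model agnosticism looks like in practice can be sketched in a few lines: business logic depends only on a narrow interface, and the concrete provider is picked via configuration. The class and config names below are illustrative, not any specific framework's API.

```python
# Sketch of a provider-agnostic model layer: swapping GPT for Mistral is a
# config change, not a code change. All names here are illustrative.
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...


@dataclass
class OpenAIModel:
    model: str

    def complete(self, prompt: str) -> str:
        return f"[openai:{self.model}] {prompt}"  # the real API call would go here


@dataclass
class MistralModel:
    model: str

    def complete(self, prompt: str) -> str:
        return f"[mistral:{self.model}] {prompt}"


PROVIDERS = {"openai": OpenAIModel, "mistral": MistralModel}


def model_from_config(cfg: dict) -> ChatModel:
    # The application never imports a provider class directly.
    return PROVIDERS[cfg["provider"]](model=cfg["model"])


agent_llm = model_from_config({"provider": "mistral", "model": "mistral-large"})
print(agent_llm.complete("Summarise the audit log."))
```

    The point of the indirection: when the lock-in axis bites in 18 months, the migration touches one config entry and one registry, not every agent definition.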

    Vendor velocity risk. Who carries the framework, and how stable is that carrier? LangChain Inc. closed a Series B in October 2025 at 125 million dollars on a 1.25 billion valuation [8]. Microsoft folded AutoGen into MAF. CrewAI was founded in 2024 and closed an 18 million dollar Series A, led by Insight Partners [17]. deepset closed a 30 million dollar Series B in August 2023, led by Balderton Capital [18]. These numbers say nothing about product quality. They do say something about the probability that the project is still alive and maintained two years from now.

    Reproducibility despite stochastic output. EU AI Act Article 12 requires automatic logging across the lifecycle of high-risk AI systems [9]. DORA Article 12 requires similar coverage in the financial sector since January 2025 [10]. Frameworks with native eval suites, trace logs, and seed management hold a structural advantage here.
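    The two primitives this criterion asks for can be illustrated with a minimal sketch: a pinned seed for every sampling step, and an append-only trace log per run. Class and method names are illustrative assumptions; mature frameworks ship far richer versions of both.

```python
# Minimal sketch of seed management plus run tracing. Given the same seed,
# every sampling decision is reproducible; every step lands in the trace.
import random
from datetime import datetime, timezone


class TracedRun:
    def __init__(self, run_id: str, seed: int):
        self.run_id = run_id
        self.rng = random.Random(seed)  # deterministic given the seed
        self.trace: list[dict] = []

    def log(self, step: str, **payload):
        self.trace.append({
            "run_id": self.run_id,
            "ts": datetime.now(timezone.utc).isoformat(),
            "step": step,
            **payload,
        })

    def sample_temperature(self) -> float:
        t = round(self.rng.uniform(0.0, 1.0), 3)
        self.log("sample_temperature", value=t)
        return t


run = TracedRun("audit-demo-001", seed=42)
run.sample_temperature()
print(run.trace)
```

    The trace, serialised to durable storage, is what an auditor asks for; the seed is what lets you rerun the non-LLM parts of the pipeline bit-for-bit.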

    Token cost trajectory and cost observability. A 30-step conversation on a leading model costs roughly 0.50 to 2.00 dollars per execution depending on model and context size (estimate based on current provider pricing). At 10,000 daily executions, that is 5,000 to 20,000 dollars per day in LLM calls alone [11]. A framework that does not surface those costs per trace makes later model substitution significantly harder.
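    The arithmetic behind those figures is worth writing out, because it is the calculation a cost observability layer has to surface per trace. The per-execution range is the article's estimate, not a provider quote.

```python
# Back-of-envelope check of the cost range above: per-execution cost times
# daily volume. Inputs are the article's estimates, not provider quotes.
def daily_llm_cost(cost_per_run_usd: float, runs_per_day: int) -> float:
    return cost_per_run_usd * runs_per_day


low = daily_llm_cost(0.50, 10_000)   # lower bound of the estimate
high = daily_llm_cost(2.00, 10_000)  # upper bound of the estimate
print(f"{low:,.0f} to {high:,.0f} USD per day")
```

    At the upper bound that is over 7 million dollars a year, which is why per-trace cost attribution is an architecture criterion rather than a finance afterthought.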

    Data residency and routing control. Which data goes where, and is that enforceable in the framework? Haystack supports air-gapped deployment. Azure AI Foundry offers EU regions. n8n self-hosts natively. OpenAI and Claude SDKs are tied to their respective US providers, with EU data boundary commitments.
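    "Enforceable in the framework" can be made concrete with a small routing sketch: every request carries a data classification, and the router refuses any provider whose region is not allowed for that class. The classifications, regions, and provider names below are illustrative assumptions.

```python
# Sketch of enforceable residency routing: the router, not developer
# discipline, decides which provider may see which data class.
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    region: str  # e.g. "eu", "us", "on_prem" -- illustrative labels


ALLOWED_REGIONS = {
    "public": {"eu", "us", "on_prem"},
    "internal": {"eu", "on_prem"},
    "audit_document": {"on_prem"},  # never leaves the house
}


def route(classification: str, providers: list[Provider]) -> Provider:
    allowed = ALLOWED_REGIONS[classification]
    for p in providers:
        if p.region in allowed:
            return p
    raise PermissionError(f"no provider satisfies residency for {classification!r}")


fleet = [
    Provider("openai-us", "us"),
    Provider("azure-eu", "eu"),
    Provider("haystack-local", "on_prem"),
]
print(route("internal", fleet).name)        # first provider in an allowed region
print(route("audit_document", fleet).name)  # only the on-prem path qualifies
```

    The design choice is that a violation raises instead of logging a warning: for the audit-document class from the 200-person-company example, a silent fallback to a US endpoint is exactly the failure mode this criterion exists to rule out.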

    Maturity relative to the skill pool. Which frameworks does your team handle without launching a training initiative? Which would require external consultants or career-switcher programmes? 22 percent of mid-market companies run career-switcher programmes [12]. Those programmes typically cover whatever is most visible in the discussion space, not necessarily what fits your architecture.

    Lock-in per layer separately. Do not assess lock-in in aggregate. Separate model lock-in, framework lock-in, vector DB lock-in, and observability lock-in. A framework can be model-agnostic and still pull you into its observability platform. LangGraph draws teams toward LangSmith. That is legitimate but should be a deliberate decision.

    Capability gap detection. When is your stack too small for the task, and which signals show that early? A framework that does not log when it hits its limits delays the moment you notice the switch.

    Regulatory roadmap alignment. EU AI Act phases through 2027. DORA has been in force since January 2025. NIS2 is in implementation in Germany. BSI C5:2026 has been the new cloud security baseline since April 2026. Which framework features cover these requirements today, which are announced, which are missing?

    How do the four frameworks score against these criteria?

    We rate LangGraph, CrewAI, Microsoft Agent Framework, and Haystack on the nine criteria plus the three audit primitives: checkpointing, replay, and human-in-the-loop. The rating reflects published sources and our own engagements. It is a snapshot from April 2026.

    Four frameworks against nine criteria plus three audit primitives, April 2026 snapshot

    Criterion | LangGraph | CrewAI | MAF (Microsoft) | Haystack
    Model agnosticism | high | high | high | high
    Vendor velocity risk | medium (Series B 2025) | high (young company, 2024) | low (Microsoft) | medium (Series B 2023)
    Audit trail native | native via LangSmith | external build | OpenTelemetry, Entra ID | logging built in
    Checkpointing | native, time-travel | custom build via Celery/Redis | native (MAF 1.0) | via document stores
    Human-in-the-loop | explicit | validation nodes | native | pipeline composition
    Data residency | cloud + self-hosted | CrewAI Enterprise (cloud) | Azure EU regions | on-premise, air-gapped
    Maturity (as of 04/2026) | 1.0 GA since 10/2025 | 1.0 stable since late 2025 | 1.0 GA 04/2026 (consolidation) | established since 2020
    Skill pool in DACH mid-market | growing | medium | growing (Microsoft ecosystem) | strong in DACH OSS teams
    Lock-in per layer | low model, high observability | low | medium (Azure) | low


    Three observations from the table matter more than the individual scores. First, no framework wins on all nine axes. Second, the framework with the strongest audit trail story (LangGraph) is not the one with the strongest data residency story (Haystack). Third, lock-in axes vary by layer, not by framework. Choosing LangGraph plus LangSmith creates a different lock-in profile than running LangGraph with your own observability stack.

    A note on Microsoft Agent Framework: 1.0 went GA in April 2026, which initially looks "new" in any maturity column. MAF is in fact the consolidation of AutoGen (Microsoft Research, in production use since 2023) and Semantic Kernel (stable since 2023). The codebase and engineering team carry several years of production experience. The unified API is new, the substance behind it is not [2].

    Where framework choice does become architectural

    The core thesis from Part 2 reads: architecture weighs heavier than framework choice. That thesis needs precision. There is one place where framework choice does materially affect compliance outcomes: audit primitives.

    DORA Article 12 requires automatic logging across the lifecycle of ICT systems [10]. EU AI Act Article 12 requires similar coverage for high-risk AI [9]. Frameworks with native checkpointing, time-travel debugging, and persistent state graphs satisfy these requirements with significantly less custom build than frameworks where you wire up persistence, replay, and audit trail by hand.

    LangGraph holds a measurable head start here. Classic AutoGen did not, MAF is catching up. CrewAI requires custom build through Celery queues and Redis stores [13]. If DORA or a comparable audit obligation is the primary driver of your architecture, that is one place where framework choice becomes architectural. It is one criterion out of nine. It can be the most important one depending on the mandate.
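    What "checkpointing and replay as native primitives" means mechanically can be shown with a from-scratch sketch: the full state is persisted after every step, so any run can be replayed from any checkpoint for an audit. This is a conceptual illustration under its own assumptions, not LangGraph's actual API.

```python
# From-scratch sketch of the checkpoint-and-replay primitive: an immutable
# state snapshot after every step, and replay of the remaining steps from
# any snapshot. Frameworks with this built in save exactly this custom build.
import copy


class CheckpointedWorkflow:
    def __init__(self, steps):
        self.steps = steps            # list of (name, fn) pairs
        self.checkpoints: list[dict] = []

    def run(self, state: dict) -> dict:
        for name, fn in self.steps:
            state = fn(state)
            # persist a deep copy so later steps cannot mutate the snapshot
            self.checkpoints.append({"step": name, "state": copy.deepcopy(state)})
        return state

    def replay_from(self, index: int) -> dict:
        # resume from checkpoint `index`, rerunning only the remaining steps
        state = copy.deepcopy(self.checkpoints[index]["state"])
        for name, fn in self.steps[index + 1:]:
            state = fn(state)
        return state


wf = CheckpointedWorkflow([
    ("extract", lambda s: {**s, "facts": ["revenue up"]}),
    ("assess", lambda s: {**s, "risk": "low"}),
])
final = wf.run({"doc": "q3-report"})
print(final)
print(wf.replay_from(0))  # reruns only "assess" from the post-extract snapshot
```

    In production the checkpoint store would be durable (Postgres, Redis) rather than a list, and the LLM calls inside the steps would need the seed and trace discipline from the reproducibility criterion for the replay to be meaningful.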

    What does mid-market need to consider that enterprises can ignore?

    Enterprises absorb switch costs. Their dedicated AI platform teams complete a framework migration in twelve months. Mid-market companies do not have that reserve. The KfW Mittelstand panel for February 2026 shows: 20 percent of SMEs use AI, the figure rises to 53 percent at R&D-driven companies. The gap typically opens where skill pool and platform reserve are missing [14]. 76 percent of SMEs have no AI governance framework, 91 percent consider it critical [15]. 19 percent have a structured AI roadmap. That gap is the actual challenge.

    Three points shift accordingly for mid-market. First, the skill pool. If your IT department has three seniors, the question shifts: which framework can those three run in production within four weeks? Technical superiority on paper helps little when the team cannot carry the implementation. Second, data residency from Part 2 of this series. A 200-person company typically does not send internal audit documents to a US cloud. Frameworks with air-gapped or self-hosting paths carry different weight here than in enterprise contexts. Third, vendor velocity. In mid-market, a forced framework switch costs several quarters of delivery speed because the platform team that could absorb the switch alongside production work is missing.

    Hidden champion reality often differs from what the English-language discussion suggests. Haystack, the German open source framework from deepset, has documented production deployments at the European Commission, the German Federal Ministry of Research, the German Armed Forces, the State of Baden-Württemberg, Airbus, Lufthansa Industry Solutions, Infineon, and LEGO [16]. That list rarely appears in the major English-language top-5 comparisons. For a Baden-Württemberg machine builder focused on data sovereignty, it is a substantial data point.

    How do you select the right framework cluster for your architecture?

    Instead of a recommendation per framework, a five-step heuristic. It prioritises from the primary driver to the secondary.

    Figure: decision heuristic showing which cluster is the better starting point, ordered from primary to secondary driver. Audit obligation primary (DORA, EU AI Act): Cluster 1 (LangGraph). Data sovereignty primary (KRITIS, defense, pharma): Cluster 4 (Haystack) or 3 (n8n). Cloud stack in place (Azure/AWS contracts): Cluster 5 (MAF, Bedrock). Secondary drivers, if none of the above apply: multi-step analysis (researcher, reviewer, approver): Cluster 2 (CrewAI); skill pool thin, rapid integration needed: Cluster 3 (n8n, low-code). Note: a real architecture often combines two clusters, for example Haystack for knowledge graph and retrieval logic plus LangGraph for agent orchestration plus Postgres for persistence; pick the strongest option per layer.
    1. If DORA, EU AI Act, or a comparable audit obligation is the primary driver, look at cluster one first (code-first with persistence, representative LangGraph). Native checkpointing primitives save build effort and ease audit preparation.
    2. If data sovereignty is non-negotiable (defense, KRITIS, BaFin-regulated firms, pharma with materials data), look at cluster four first (DACH open source, representative Haystack) or cluster three with self-hosting (n8n). Air-gapped or on-premise paths are best documented there.
    3. If you are on a Microsoft or AWS stack with active enterprise contracts and existing C5 attestations, look at cluster five first (cloud-native, representatives MAF on Azure AI Foundry or Bedrock Agents). You inherit compliance certifications from the platform.
    4. If the workload sits in multi-step research and review chains where roles split clearly (one agent gathers, another summarises, a third checks against policy), and audit trail is not the top priority, look at cluster two first (role-based, with CrewAI as the typical entry point). Fastest prototyping, lowest learning curve.
    5. If the skill pool is thin and rapid integration into existing workflows matters, look at cluster three with n8n first. Low-code lowers the requirement on senior engineering.

    This heuristic is not exclusive. A real architecture decision often combines two clusters. Haystack for KG and retrieval logic, LangGraph for the agent orchestration on top, Postgres for persistence. Such combinations are frequently the most robust answer because they pick the strongest option per layer.
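    The five-step heuristic above can be written out as a small decision function, which also makes the priority ordering explicit. The driver names and cluster labels follow the article; the mapping is the heuristic itself, not an exhaustive rule engine.

```python
# The article's selection heuristic as code: primary drivers are checked
# before secondary ones, so an audit obligation always wins over a thin
# skill pool. Driver names are this sketch's own labels.
def pick_cluster(drivers: set[str]) -> str:
    # primary drivers, in priority order
    if "audit_obligation" in drivers:           # DORA, EU AI Act
        return "Cluster 1: code-first with persistence (LangGraph)"
    if "data_sovereignty" in drivers:           # KRITIS, defense, pharma
        return "Cluster 4: DACH open source (Haystack) or 3 self-hosted (n8n)"
    if "existing_cloud_stack" in drivers:       # active Azure/AWS contracts
        return "Cluster 5: cloud-native (MAF on Azure AI Foundry, Bedrock)"
    # secondary drivers, only if no primary driver applies
    if "multi_step_roles" in drivers:           # researcher, reviewer, approver
        return "Cluster 2: role-based (CrewAI)"
    if "thin_skill_pool" in drivers:            # rapid integration needed
        return "Cluster 3: workflow engines, low-code (n8n)"
    return "No dominant driver: start from the classical NFR checklist"


print(pick_cluster({"data_sovereignty", "thin_skill_pool"}))
```

    Encoding the heuristic this way is mostly a communication device: it forces the team to name its primary driver before the framework debate starts, which was the actual outcome of the RegTech workshop.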

    Our Take

    The three-hour workshop with the RegTech scale-up ended with an insight we had not expected. We had not picked a framework by the end. We had designed an evaluation method. The method carries the actual value because it still holds in six months, when the choice it produces today will already be replaced by another.

    The thesis from Part 2 holds: architecture decisions outweigh framework choice. It needs a refinement. Audit primitives are the one place where framework choice materially shifts the compliance outcome. If you carry DORA or EU AI Act responsibility, weight that axis higher than the other eight. If you build an internal knowledge management system, weight it lower.

    What we see at Convios across mandates: the skill pool question is consistently underestimated. A framework that looks superior in the whitepaper costs double in mid-market when the existing seniors do not run it fluently. External buildup runs six to twelve months (build duration from Part 1 on multi-agent RAG). The market keeps moving during that window.

    The honest, occasionally uncomfortable answer reads: pick the framework your existing seniors can run in production within four weeks, and invest the freed time in clean architecture around it. A modular architecture survives a framework switch with manageable effort. Where modularity is missing, every switch forces a rewrite.

    ThoughtWorks moved LangGraph out of Adopt in November 2025. Six months earlier it was the default choice. Six months from now something else may be default. An architecture that survives that movement turns the framework question into a detail question. That is the goal.

    If you want to test your framework selection against these nine criteria or build your own evaluation method, we start with a 30-minute intro call.

    Sources

    [1] ThoughtWorks Technology Radar Volume 33, November 2025: LangGraph moved out of the Adopt ring

    [2] Microsoft Agent Framework 1.0 GA: consolidation of AutoGen and Semantic Kernel, April 2026

    [3] Bitkom skilled-labour study 2025: 109,000 unfilled IT positions, 85 percent of companies reporting shortages

    [4] Brynjolfsson, Chandar, Chen: Canaries in the Coal Mine?, Stanford Digital Economy Lab, August 2025: roughly 13 percent employment decline for software developers aged 22 to 25 in highly AI-exposed roles since late 2022

    [5] Indeed Hiring Lab Germany, Jobs and Hiring Trends Report 2025: software development postings down 33.3 percent, January to November 2024

    [6] German Federal Employment Agency, IT labour market report July 2025: 31.3 percent more unemployed software developers year over year

    [7] Bitkom AI study 2026: 70 percent of employees receive no AI training

    [8] LangChain Series B, October 2025: 125 million dollars at a 1.25 billion valuation

    [9] EU AI Act Article 12: logging obligations for high-risk AI systems

    [10] DORA Article 12: logging requirements for ICT systems in the financial sector, in force since 17 January 2025

    [11] Token cost analysis for multi-agent workflows, sample calculation with GPT-4o

    [12] Bitkom IT skilled-labour study 2025: 22 percent of companies run career-switcher programmes

    [13] Comparison of audit trail implementations in LangGraph, CrewAI, and AutoGen

    [14] KfW Mittelstand panel, Fokus Volkswirtschaft No. 533, February 2026: 20 percent of SMEs use AI, 53 percent among R&D-driven companies

    [15] Maximal Digital SME study 2025 (n=455): 76 percent without AI governance, 91 percent consider it critical, 19 percent with a structured AI roadmap

    [16] deepset production references for Haystack: documented public-sector and DACH industry deployments

    [17] Insight Partners, "Behind the Investment: CrewAI" (Series A lead investor)

    [18] TechCrunch, 9 August 2023: deepset secures 30m Series B led by Balderton Capital
