
    Multi-Agent Frameworks for Mid-Market: Four Contenders, Nine Criteria, One Uncomfortable Answer

    Dr. Oliver Gausmann · April 29, 2026 · 11 min read


    Two weeks back, in a workshop with a RegTech scale-up, the opening question was the standard one: which multi-agent framework for mid-market, LangGraph or CrewAI? After three hours, the whiteboard had something else on it: a longer list of evaluation criteria. Vendor stability. The skill pool inside the team. Regulatory roadmap. Audit primitives. Some of it technical, plenty of it organizational. By the end of the session, every single axis on that list weighed more than the opening question.

    Two data points frame why this happens. ThoughtWorks moved LangGraph out of the Adopt ring in November 2025 [1]. Microsoft consolidated AutoGen and Semantic Kernel into Microsoft Agent Framework in April 2026, with the 1.0 release shipping the same month [2]. Anyone who placed a framework bet in early 2025 is looking at a different recommendation landscape today. That is the velocity of a field without established best practices, and it changes how you make architecture decisions.

    Four frameworks lead the multi-agent architecture market: LangGraph, CrewAI, Microsoft Agent Framework, and Haystack. Which of them fits a mid-market company is decided by nine criteria that extend classical architecture checklists. GitHub stars and benchmark speed do not settle that question. Audit primitives, vendor velocity risk, and the in-house skill pool weigh heavier than any feature list. This article delivers the nine criteria, a cluster view of the framework landscape, and a selection heuristic that builds on the architecture decisions from Part 2.

    Why do most multi-agent framework comparisons miss your actual problem?

    Standard comparison tables grade speed, developer experience, GitHub stars, and token consumption in benchmark scenarios. These criteria are part of the picture. They fall short because they say little about whether a framework will still be viable in 18 months.

    After the three architecture decisions for AI compliance from Part 2, the starting position is different. You have decided where determinism is required and where probabilistic reasoning is enough, whether knowledge graph and vector search are combined, whether a workflow runs as a pipeline or as an orchestrated agent. The framework question now reads: which tool implements those decisions most reliably, in a market whose vendors revise their roadmaps every few months?

    The evaluation axes shift. Audit primitives become the mandatory axis, since EU AI Act and DORA both require lifecycle logging. Vendor velocity risk replaces GitHub stars as the stability indicator. Maturity counts twice: for the framework itself, and for the team that has to operate it.

    Which frameworks even exist?

    Follow only the English-language tech discussion, and you see four to five names dominate: LangGraph, CrewAI, AutoGen, LlamaIndex. The German mid-market job ad pattern is different. Framework names appear less often than in enterprise listings. The typical mid-market posting reads "AI engineer" or "data scientist with LLM experience" without naming a stack. The reason is structural: a mid-market IT department with three seniors, one of whom does AI, picks the option that least devalues existing knowledge.

    Skill bias of this kind is measurable. The Stanford-ADP study from August 2025 found that employment for software developers aged 22 to 25 in highly AI-exposed roles dropped by approximately 13 percent since late 2022 in the United States, while older cohorts in the same roles saw 6 to 9 percent growth [4]. The German pattern matches: Indeed Hiring Lab Germany reported a 33 percent drop in software development job postings between January and November 2024 [5], and the German Federal Employment Agency registered 31 percent more unemployed software developers in July 2025 compared to the prior year [6]. On top of that, 109,000 IT positions remain unfilled in Germany with 85 percent of companies reporting shortages [3], while 70 percent of employees receive no AI training from their employer [7]. The senior pool defines the skill base, and junior pipelines have stopped feeding it at the previous rate.

    A cluster view fits the actual market better than a top list: five clusters, each with one or two representatives.

    Figure: the five clusters of the multi-agent framework landscape. Cluster 1: code-first with persistence (LangGraph; audit trail, checkpointing). Cluster 2: role-based multi-agent (CrewAI, AutoGen/MAF; multi-step analysis). Cluster 3: workflow engines with AI (LlamaIndex, n8n; self-host, low-code). Cluster 4: DACH open source and controlled in-house (Haystack, OpenAI/Claude SDK; data residency, air-gapped). Cluster 5: cloud-native agent services (Azure AI Foundry, Bedrock; platform compliance inherited). Inset, skill bias reality in the DACH mid-market: 109,000 unfilled IT positions in Germany (Bitkom 2025); junior software jobs down 13 to 20 percent since 2022 (Stanford-ADP study); 70 percent of employees receive no AI training. When the skill pool runs thin, the familiar wins and the appropriate gets passed over.

    The first cluster covers code-first orchestration with built-in persistence, with LangGraph as the leading representative. Strength: audit trail and checkpointing. The second cluster groups role-based multi-agent frameworks. CrewAI sits next to AutoGen, now folded into Microsoft Agent Framework. Strength: multi-step analysis tasks with role separation. The third cluster combines workflow engines with AI capabilities. LlamaIndex Workflows brings code-first composition, n8n adds low-code orchestration with self-hosting as a default. The fourth cluster matters most for German contexts: DACH open source plus controlled in-house build. Haystack from deepset Berlin handles knowledge graph and retrieval logic with an air-gapped deployment path; OpenAI Agents SDK and Claude Agent SDK enable controlled minimal stacks for teams that prefer direct API access over framework abstraction. The fifth cluster covers cloud-native agent services. Microsoft Agent Framework on Azure AI Foundry and AWS Bedrock Agents matter where existing cloud contracts and compliance certifications already live.

    The grouping orders the landscape rather than ranking it. It sets up the next nine criteria.

    Which nine criteria extend the classical architecture checklist?

    A classical architecture checklist tests six non-functional requirements: resilience, scalability, maintainability, replaceability, security, cost. Those six remain mandatory. Once the decision is being made in a field that changes faster than the typical lifespan of a software system, they need extensions.

    In the RegTech workshop, we extended the checklist by nine items. Each item has a technical and an organizational component.

    Figure: the classical architecture checklist, extended by nine AI-specific criteria. Classical NFR checklist (still mandatory): 1. resilience, 2. scalability, 3. maintainability, 4. replaceability, 5. security, 6. cost. AI-specific extensions (new for multi-agent architectures): 1. model agnosticism, 2. vendor velocity risk, 3. reproducibility, 4. token cost trajectory, 5. data residency routing, 6. maturity vs. skill pool, 7. lock-in per layer, 8. capability gap detection, 9. regulatory roadmap. Trigger: the market reorders every six months, and classical NFRs do not cover a field without established best practices.

    Model agnosticism and LLM swappability. Is the business logic decoupled from the model provider? Can you swap GPT, Claude, and Mistral via configuration without touching application code? LangGraph, CrewAI, AutoGen, and Haystack are model-agnostic. OpenAI Agents SDK and Claude Agent SDK are not. That is a lock-in axis that can become expensive in 18 months.
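    What model agnosticism looks like in practice can be sketched in a few lines: business logic depends only on a narrow interface, and the concrete provider is picked via configuration. The class and config names below are illustrative, not any specific framework's API.

```python
# Sketch of a provider-agnostic model layer: swapping GPT for Mistral is a
# config change, not a code change. All names here are illustrative.
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...


@dataclass
class OpenAIModel:
    model: str

    def complete(self, prompt: str) -> str:
        return f"[openai:{self.model}] {prompt}"  # the real API call would go here


@dataclass
class MistralModel:
    model: str

    def complete(self, prompt: str) -> str:
        return f"[mistral:{self.model}] {prompt}"


PROVIDERS = {"openai": OpenAIModel, "mistral": MistralModel}


def model_from_config(cfg: dict) -> ChatModel:
    # The application never imports a provider class directly.
    return PROVIDERS[cfg["provider"]](model=cfg["model"])


agent_llm = model_from_config({"provider": "mistral", "model": "mistral-large"})
print(agent_llm.complete("Summarise the audit log."))
```

    The point of the indirection: when the lock-in axis bites in 18 months, the migration touches one config entry and one registry, not every agent definition.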

    Vendor velocity risk. Who carries the framework, and how stable is that carrier? LangChain Inc. closed a Series B in October 2025 at 125 million dollars on a 1.25 billion valuation [8]. Microsoft folded AutoGen into MAF. CrewAI was founded in 2024 and closed an 18 million dollar Series A, led by Insight Partners [17]. deepset closed a 30 million dollar Series B in August 2023, led by Balderton Capital [18]. These numbers say nothing about product quality. They do say something about the probability that the project is still alive and maintained two years from now.

    Reproducibility despite stochastic output. EU AI Act Article 12 requires automatic logging across the lifecycle of high-risk AI systems [9]. DORA Article 12 requires similar coverage in the financial sector since January 2025 [10]. Frameworks with native eval suites, trace logs, and seed management hold a structural advantage here.
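    The two primitives this criterion asks for can be illustrated with a minimal sketch: a pinned seed for every sampling step, and an append-only trace log per run. Class and method names are illustrative assumptions; mature frameworks ship far richer versions of both.

```python
# Minimal sketch of seed management plus run tracing. Given the same seed,
# every sampling decision is reproducible; every step lands in the trace.
import random
from datetime import datetime, timezone


class TracedRun:
    def __init__(self, run_id: str, seed: int):
        self.run_id = run_id
        self.rng = random.Random(seed)  # deterministic given the seed
        self.trace: list[dict] = []

    def log(self, step: str, **payload):
        self.trace.append({
            "run_id": self.run_id,
            "ts": datetime.now(timezone.utc).isoformat(),
            "step": step,
            **payload,
        })

    def sample_temperature(self) -> float:
        t = round(self.rng.uniform(0.0, 1.0), 3)
        self.log("sample_temperature", value=t)
        return t


run = TracedRun("audit-demo-001", seed=42)
run.sample_temperature()
print(run.trace)
```

    The trace, serialised to durable storage, is what an auditor asks for; the seed is what lets you rerun the non-LLM parts of the pipeline bit-for-bit.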

    Token cost trajectory and cost observability. A 30-step conversation on a leading model costs roughly 0.50 to 2.00 dollars per execution depending on model and context size (estimate based on current provider pricing). At 10,000 daily executions, that is 5,000 to 20,000 dollars per day in LLM calls alone [11]. A framework that does not surface those costs per trace makes later model substitution significantly harder.
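    The arithmetic behind those figures is worth writing out, because it is the calculation a cost observability layer has to surface per trace. The per-execution range is the article's estimate, not a provider quote.

```python
# Back-of-envelope check of the cost range above: per-execution cost times
# daily volume. Inputs are the article's estimates, not provider quotes.
def daily_llm_cost(cost_per_run_usd: float, runs_per_day: int) -> float:
    return cost_per_run_usd * runs_per_day


low = daily_llm_cost(0.50, 10_000)   # lower bound of the estimate
high = daily_llm_cost(2.00, 10_000)  # upper bound of the estimate
print(f"{low:,.0f} to {high:,.0f} USD per day")
```

    At the upper bound that is over 7 million dollars a year, which is why per-trace cost attribution is an architecture criterion rather than a finance afterthought.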

    Data residency and routing control. Which data goes where, and is that enforceable in the framework? Haystack supports air-gapped deployment. Azure AI Foundry offers EU regions. n8n self-hosts natively. OpenAI and Claude SDKs are tied to their respective US providers, with EU data boundary commitments.
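    "Enforceable in the framework" can be made concrete with a small routing sketch: every request carries a data classification, and the router refuses any provider whose region is not allowed for that class. The classifications, regions, and provider names below are illustrative assumptions.

```python
# Sketch of enforceable residency routing: the router, not developer
# discipline, decides which provider may see which data class.
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    region: str  # e.g. "eu", "us", "on_prem" -- illustrative labels


ALLOWED_REGIONS = {
    "public": {"eu", "us", "on_prem"},
    "internal": {"eu", "on_prem"},
    "audit_document": {"on_prem"},  # never leaves the house
}


def route(classification: str, providers: list[Provider]) -> Provider:
    allowed = ALLOWED_REGIONS[classification]
    for p in providers:
        if p.region in allowed:
            return p
    raise PermissionError(f"no provider satisfies residency for {classification!r}")


fleet = [
    Provider("openai-us", "us"),
    Provider("azure-eu", "eu"),
    Provider("haystack-local", "on_prem"),
]
print(route("internal", fleet).name)        # first provider in an allowed region
print(route("audit_document", fleet).name)  # only the on-prem path qualifies
```

    The design choice is that a violation raises instead of logging a warning: for the audit-document class from the 200-person-company example, a silent fallback to a US endpoint is exactly the failure mode this criterion exists to rule out.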

    Maturity relative to the skill pool. Which frameworks does your team handle without launching a training initiative? Which would require external consultants or career-switcher programmes? 22 percent of mid-market companies run career-switcher programmes [12]. Those programmes typically cover whatever is most visible in the discussion space, not necessarily what fits your architecture.

    Lock-in per layer separately. Do not assess lock-in in aggregate. Separate model lock-in, framework lock-in, vector DB lock-in, and observability lock-in. A framework can be model-agnostic and still pull you into its observability platform. LangGraph draws teams toward LangSmith. That is legitimate but should be a deliberate decision.

    Capability gap detection. When is your stack too small for the task, and which signals show that early? A framework that does not log when it hits its limits delays the moment you notice the switch.

    Regulatory roadmap alignment. EU AI Act phases through 2027. DORA has been in force since January 2025. NIS2 is in implementation in Germany. BSI C5:2026 has been the new cloud security baseline since April 2026. Which framework features cover these requirements today, which are announced, which are missing?

    How do the four frameworks score against these criteria?

    We rate LangGraph, CrewAI, Microsoft Agent Framework, and Haystack on the nine criteria plus the three audit primitives: checkpointing, replay, and human-in-the-loop. The rating reflects published sources and our own engagements. It is a snapshot from April 2026.

    Four frameworks against nine criteria plus three audit primitives, April 2026 snapshot

    Criterion | LangGraph | CrewAI | MAF (Microsoft) | Haystack
    Model agnosticism | high | high | high | high
    Vendor velocity risk | medium (Series B 2025) | high (young company, 2024) | low (Microsoft) | medium (Series B 2023)
    Audit trail native | native via LangSmith | external build | OpenTelemetry, Entra ID | logging built in
    Checkpointing | native, time-travel | custom build via Celery/Redis | native (MAF 1.0) | via document stores
    Human-in-the-loop | explicit | validation nodes | native | pipeline composition
    Data residency | cloud + self-hosted | CrewAI Enterprise (cloud) | Azure EU regions | on-premise, air-gapped
    Maturity (as of 04/2026) | 1.0 GA since 10/2025 | 1.0 stable since late 2025 | 1.0 GA 04/2026 (consolidation) | established since 2020
    Skill pool in DACH mid-market | growing | medium | growing (Microsoft ecosystem) | strong in DACH OSS teams
    Lock-in per layer | low model, high observability | low | medium (Azure) | low


    Three observations from the table matter more than the individual scores. First, no framework wins on all nine axes. Second, the framework with the strongest audit trail story (LangGraph) is not the one with the strongest data residency story (Haystack). Third, lock-in axes vary by layer, not by framework. Choosing LangGraph plus LangSmith creates a different lock-in profile than running LangGraph with your own observability stack.

    A note on Microsoft Agent Framework: 1.0 went GA in April 2026, which initially looks "new" in any maturity column. MAF is in fact the consolidation of AutoGen (Microsoft Research, in production use since 2023) and Semantic Kernel (stable since 2023). The codebase and engineering team carry several years of production experience. The unified API is new, the substance behind it is not [2].

    Where framework choice does become architectural

    The core thesis from Part 2 reads: architecture weighs heavier than framework choice. That thesis needs precision. There is one place where framework choice does materially affect compliance outcomes: audit primitives.

    DORA Article 12 requires automatic logging across the lifecycle of ICT systems [10]. EU AI Act Article 12 requires similar coverage for high-risk AI [9]. Frameworks with native checkpointing, time-travel debugging, and persistent state graphs satisfy these requirements with significantly less custom build than frameworks where you wire up persistence, replay, and audit trail by hand.

    LangGraph holds a measurable head start here. Classic AutoGen did not, MAF is catching up. CrewAI requires custom build through Celery queues and Redis stores [13]. If DORA or a comparable audit obligation is the primary driver of your architecture, that is one place where framework choice becomes architectural. It is one criterion out of nine. It can be the most important one depending on the mandate.
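    What "checkpointing and replay as native primitives" means mechanically can be shown with a from-scratch sketch: the full state is persisted after every step, so any run can be replayed from any checkpoint for an audit. This is a conceptual illustration under its own assumptions, not LangGraph's actual API.

```python
# From-scratch sketch of the checkpoint-and-replay primitive: an immutable
# state snapshot after every step, and replay of the remaining steps from
# any snapshot. Frameworks with this built in save exactly this custom build.
import copy


class CheckpointedWorkflow:
    def __init__(self, steps):
        self.steps = steps            # list of (name, fn) pairs
        self.checkpoints: list[dict] = []

    def run(self, state: dict) -> dict:
        for name, fn in self.steps:
            state = fn(state)
            # persist a deep copy so later steps cannot mutate the snapshot
            self.checkpoints.append({"step": name, "state": copy.deepcopy(state)})
        return state

    def replay_from(self, index: int) -> dict:
        # resume from checkpoint `index`, rerunning only the remaining steps
        state = copy.deepcopy(self.checkpoints[index]["state"])
        for name, fn in self.steps[index + 1:]:
            state = fn(state)
        return state


wf = CheckpointedWorkflow([
    ("extract", lambda s: {**s, "facts": ["revenue up"]}),
    ("assess", lambda s: {**s, "risk": "low"}),
])
final = wf.run({"doc": "q3-report"})
print(final)
print(wf.replay_from(0))  # reruns only "assess" from the post-extract snapshot
```

    In production the checkpoint store would be durable (Postgres, Redis) rather than a list, and the LLM calls inside the steps would need the seed and trace discipline from the reproducibility criterion for the replay to be meaningful.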

    What does mid-market need to consider that enterprises can ignore?

    Enterprises absorb switch costs. Their dedicated AI platform teams complete a framework migration in twelve months. Mid-market companies do not have that reserve. The KfW Mittelstand panel for February 2026 shows: 20 percent of SMEs use AI, the figure rises to 53 percent at R&D-driven companies. The gap typically opens where skill pool and platform reserve are missing [14]. 76 percent of SMEs have no AI governance framework, 91 percent consider it critical [15]. 19 percent have a structured AI roadmap. That gap is the actual challenge.

    Three points shift accordingly for mid-market. First, the skill pool. If your IT department has three seniors, the question shifts: which framework can those three run in production within four weeks? Technical superiority on paper helps little when the team cannot carry the implementation. Second, data residency from Part 2 of this series. A 200-person company typically does not send internal audit documents to a US cloud. Frameworks with air-gapped or self-hosting paths carry different weight here than in enterprise contexts. Third, vendor velocity. In mid-market, a forced framework switch costs several quarters of delivery speed because the platform team that could absorb the switch alongside production work is missing.

    Hidden champion reality often differs from what the English-language discussion suggests. Haystack, the German open source framework from deepset, has documented production deployments at the European Commission, the German Federal Ministry of Research, the German Armed Forces, the State of Baden-Württemberg, Airbus, Lufthansa Industry Solutions, Infineon, and LEGO [16]. That list rarely appears in the major English-language top-5 comparisons. For a Baden-Württemberg machine builder focused on data sovereignty, it is a substantial data point.

    How do you select the right framework cluster for your architecture?

    Instead of a recommendation per framework, a five-step heuristic. It prioritises from the primary driver to the secondary.

    Figure: decision heuristic showing which cluster is the better starting point, ordered from primary to secondary driver. Audit obligation primary (DORA, EU AI Act): Cluster 1 (LangGraph). Data sovereignty primary (KRITIS, defense, pharma): Cluster 4 (Haystack) or 3 (n8n). Cloud stack in place (Azure/AWS contracts): Cluster 5 (MAF, Bedrock). Secondary drivers, if none of the above apply: multi-step analysis (researcher, reviewer, approver): Cluster 2 (CrewAI); skill pool thin, rapid integration needed: Cluster 3 (n8n, low-code). Note: a real architecture often combines two clusters, for example Haystack for knowledge graph and retrieval logic plus LangGraph for agent orchestration plus Postgres for persistence; pick the strongest option per layer.
    1. If DORA, EU AI Act, or a comparable audit obligation is the primary driver, look at cluster one first (code-first with persistence, representative LangGraph). Native checkpointing primitives save build effort and ease audit preparation.
    2. If data sovereignty is non-negotiable (defense, KRITIS, BaFin-regulated firms, pharma with materials data), look at cluster four first (DACH open source, representative Haystack) or cluster three with self-hosting (n8n). Air-gapped or on-premise paths are best documented there.
    3. If you are on a Microsoft or AWS stack with active enterprise contracts and existing C5 attestations, look at cluster five first (cloud-native, representatives MAF on Azure AI Foundry or Bedrock Agents). You inherit compliance certifications from the platform.
    4. If the workload sits in multi-step research and review chains where roles split clearly (one agent gathers, another summarises, a third checks against policy), and audit trail is not the top priority, look at cluster two first (role-based, with CrewAI as the typical entry point). Fastest prototyping, lowest learning curve.
    5. If the skill pool is thin and rapid integration into existing workflows matters, look at cluster three with n8n first. Low-code lowers the requirement on senior engineering.

    This heuristic is not exclusive. A real architecture decision often combines two clusters. Haystack for KG and retrieval logic, LangGraph for the agent orchestration on top, Postgres for persistence. Such combinations are frequently the most robust answer because they pick the strongest option per layer.
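    The five-step heuristic above can be written out as a small decision function, which also makes the priority ordering explicit. The driver names and cluster labels follow the article; the mapping is the heuristic itself, not an exhaustive rule engine.

```python
# The article's selection heuristic as code: primary drivers are checked
# before secondary ones, so an audit obligation always wins over a thin
# skill pool. Driver names are this sketch's own labels.
def pick_cluster(drivers: set[str]) -> str:
    # primary drivers, in priority order
    if "audit_obligation" in drivers:           # DORA, EU AI Act
        return "Cluster 1: code-first with persistence (LangGraph)"
    if "data_sovereignty" in drivers:           # KRITIS, defense, pharma
        return "Cluster 4: DACH open source (Haystack) or 3 self-hosted (n8n)"
    if "existing_cloud_stack" in drivers:       # active Azure/AWS contracts
        return "Cluster 5: cloud-native (MAF on Azure AI Foundry, Bedrock)"
    # secondary drivers, only if no primary driver applies
    if "multi_step_roles" in drivers:           # researcher, reviewer, approver
        return "Cluster 2: role-based (CrewAI)"
    if "thin_skill_pool" in drivers:            # rapid integration needed
        return "Cluster 3: workflow engines, low-code (n8n)"
    return "No dominant driver: start from the classical NFR checklist"


print(pick_cluster({"data_sovereignty", "thin_skill_pool"}))
```

    Encoding the heuristic this way is mostly a communication device: it forces the team to name its primary driver before the framework debate starts, which was the actual outcome of the RegTech workshop.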

    Our Take

    The three-hour workshop with the RegTech scale-up ended with an insight we had not expected. We had not picked a framework by the end. We had designed an evaluation method. The method carries the actual value because it still holds in six months, when the choice it produces today will already be replaced by another.

    The thesis from Part 2 holds: architecture decisions outweigh framework choice. It needs a refinement. Audit primitives are the one place where framework choice materially shifts the compliance outcome. If you carry DORA or EU AI Act responsibility, weight that axis higher than the other eight. If you build an internal knowledge management system, weight it lower.

    What we see at Convios across mandates: the skill pool question is consistently underestimated. A framework that looks superior in the whitepaper costs double in mid-market when the existing seniors do not run it fluently. External buildup runs six to twelve months (build duration from Part 1 on multi-agent RAG). The market keeps moving during that window.

    The honest, occasionally uncomfortable answer reads: pick the framework your existing seniors can run in production within four weeks, and invest the freed time in clean architecture around it. A modular architecture survives a framework switch with manageable effort. Where modularity is missing, every switch forces a rewrite.

    ThoughtWorks moved LangGraph out of Adopt in November 2025. Six months earlier it was the default choice. Six months from now something else may be default. An architecture that survives that movement turns the framework question into a detail question. That is the goal.

    If you want to test your framework selection against these nine criteria or build your own evaluation method, we start with a 30-minute intro call.

    Sources

    [1] ThoughtWorks Technology Radar Volume 33, November 2025: LangGraph moved out of the Adopt ring

    [2] Microsoft Agent Framework 1.0 GA: consolidation of AutoGen and Semantic Kernel, April 2026

    [3] Bitkom skilled-labour study 2025: 109,000 unfilled IT positions, 85 percent of companies reporting shortages

    [4] Brynjolfsson, Chandar, Chen: Canaries in the Coal Mine?, Stanford Digital Economy Lab, August 2025: roughly 13 percent employment decline for software developers aged 22 to 25 in highly AI-exposed roles since late 2022

    [5] Indeed Hiring Lab Germany, Jobs and Hiring Trends Report 2025: software development postings down 33.3 percent, January to November 2024

    [6] German Federal Employment Agency, IT labour market report July 2025: 31.3 percent more unemployed software developers year over year

    [7] Bitkom AI study 2026: 70 percent of employees receive no AI training

    [8] LangChain Series B, October 2025: 125 million dollars at a 1.25 billion valuation

    [9] EU AI Act Article 12: logging obligations for high-risk AI systems

    [10] DORA Article 12: logging requirements for ICT systems in the financial sector, in force since 17 January 2025

    [11] Token cost analysis for multi-agent workflows, sample calculation with GPT-4o

    [12] Bitkom IT skilled-labour study 2025: 22 percent of companies run career-switcher programmes

    [13] Comparison of audit trail implementations in LangGraph, CrewAI, and AutoGen

    [14] KfW Mittelstand panel, Fokus Volkswirtschaft No. 533, February 2026: 20 percent of SMEs use AI, 53 percent among R&D-driven companies

    [15] Maximal Digital SME study 2025 (n=455): 76 percent without AI governance, 91 percent consider it critical, 19 percent with a structured AI roadmap

    [16] deepset production references for Haystack: documented public-sector and DACH industry deployments

    [17] Insight Partners, "Behind the Investment: CrewAI" (Series A lead investor)

    [18] TechCrunch, 9 August 2023: deepset secures 30m Series B led by Balderton Capital
