How to build a multi-agent system using Elasticsearch and LangGraph

Discover how to build a multi-agent LLM system using Elasticsearch and LangGraph, and implement the reflection pattern for self-correcting agents powered by hybrid search and ELSER embeddings.


Introduction

Orchestrating multiple AI agents is one of the central challenges in production LLM systems. Just as an orchestra needs a conductor to coordinate its musicians, multi-agent systems need intelligent orchestration to ensure specialized agents collaborate cohesively, learn from mistakes, and continuously improve.

We will build a multi-agent system that implements the reflection pattern, an emerging approach where agents collaborate in structured feedback loops to iteratively improve the quality of their responses. By the end, we will have built a demo that analyzes IT incidents: it performs semantic search on logs, generates a root cause analysis, and self-corrects until a defined quality threshold is met.

The architecture combines three complementary technologies.

  1. LangGraph orchestrates cyclical workflows that allow agents to critique and improve their own outputs, something impossible in traditional DAG-based engines.
  2. Elasticsearch acts as the data backbone, providing hybrid search through the ELSER model (combining semantic and keyword matching) and storing long-term memory for continuous learning.
  3. Ollama provides local LLM models for development, but the system is designed to work with major providers that expose their models via API (OpenAI, Anthropic, etc.).

Prerequisites

To follow this tutorial and run the demo, you will need:

Software and tools:

  • Python 3.10 or higher
  • Elasticsearch (any deployment type: Serverless, Cloud, or local; free trial at https://cloud.elastic.co)
  • Ollama installed for local LLM models (https://ollama.ai)

Setting up the Python environment:

After installing Ollama, verify it's running and download the model you plan to use:
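For example (the model name below is an assumption; use whichever model you configure in your .env):

```bash
# Check that the Ollama daemon is responding
ollama list

# Download the model you plan to use (llama3.1 is only an example)
ollama pull llama3.1
```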

All code is available in the accompanying GitHub repository.

We recommend using a virtual environment to avoid dependency conflicts:
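For example (assuming the repository ships a requirements.txt):

```bash
python -m venv .venv
source .venv/bin/activate        # on Windows: .venv\Scripts\activate
pip install -r requirements.txt  # assumes the repository provides a requirements.txt
```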

Configure environment variables (.env):

Create a .env file in the project root with the following variables:
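The variable names below are illustrative (except MAX_REFLECTION_ITERATIONS, which is referenced later in this article); match them to what the repository expects:

```
# Elasticsearch connection (see "How to get Elasticsearch credentials" below)
ELASTICSEARCH_URL=https://your-deployment.es.us-central1.gcp.elastic-cloud.com:443
ELASTICSEARCH_API_KEY=your-api-key

# Local LLM served by Ollama (model name is an assumption)
OLLAMA_MODEL=llama3.1

# Reflection loop limit used by the workflow
MAX_REFLECTION_ITERATIONS=3
```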

How to get Elasticsearch credentials:

1. Access Elastic Cloud Console

2. Create a Deployment or Project

a. For the serverless model, create a Project (a free trial is available).

b. For the managed model, create a Deployment.

3. Copy the Endpoint (e.g., https://xxx.es.region.gcp.elastic-cloud.com:443)

4. Create an API Key in Security → API Keys

Deploying ELSER in Kibana

Before running the Python script, you need to manually deploy the ELSER model in Kibana:

Step 1: Access Trained Models

  1. Open your Kibana
  2. Navigate to: Menu (☰) → Machine Learning → Trained Models

Step 2: Find and Deploy ELSER

3. Search for .elser_model_2_linux-x86_64 or .elser_model_2

4. Click "Deploy" (or "Start deployment")

Step 3: Configure Deployment

5. Deployment name: Leave default or use "elser-incident-analysis"

6. Optimize for: Select "Search" (important!)

7. Number of allocations: 1 (sufficient for development)

8. Threads per allocation: 1

9. Click "Start"

Step 4: Wait for Deployment

Deployment takes approximately 2-3 minutes. Monitor the status:

  • downloading → starting → started

When the status is "started" (green), the model is ready for use.

Initialize indices and example data

Now run the setup script:
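Assuming the repository's setup script is setup_elser.py (the script referenced later in this article):

```bash
python setup_elser.py
```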

When it finishes, the script will have created the incident-logs index and loaded the example incident logs used throughout this tutorial.

The tutorial is designed to be progressive. We start by explaining why multi-agent systems are needed, then detail the reflection pattern and its orchestration with LangGraph, explore how Elasticsearch provides essential data capabilities, and finally execute the POC step by step with detailed output explanations.

Architecture overview

Before diving into implementation details, it's important to understand how the components connect. The diagram below shows the complete system flow, from the initial user query to finalization and saving to long-term memory.

The workflow implements a cyclical pattern where output quality is continuously evaluated. The flow begins with the user's query and sequentially passes through the SearchAgent (hybrid search in Elasticsearch), AnalyserAgent (analysis generation), and ReflectionAgent (quality evaluation). The crucial point is the Router: if the quality reaches the 0.8 threshold, the system finalizes and saves the result; otherwise, the iteration counter is incremented and the flow returns to the AnalyserAgent, this time incorporating feedback from the previous reflection. This cycle repeats until quality is satisfactory or the maximum number of iterations (3 by default) is reached.

Agent specialization in our system:

1. SearchAgent: Queries Elasticsearch with hybrid search (semantic + keyword)

2. AnalyserAgent: Reasons about logs and generates root cause analysis

3. ReflectionAgent: Evaluates output quality and provides feedback

Why 3 iterations? This limit was chosen as a reasonable standard based on observations during the development of this demo:

  • Protects against infinite loops when the model cannot improve
  • We rarely saw significant improvements after 3 attempts
  • You can adjust it via the MAX_REFLECTION_ITERATIONS environment variable used by the LangGraph workflow

Elasticsearch acts as the data backbone, providing not only hybrid search for the SearchAgent, but also storing long-term memory (agent-memory index). LangGraph manages orchestration, ensuring the state is shared correctly between agents and that the cyclical flow works deterministically. Let's now explore each component in detail.

1. Why do we need multi-agents?

The problem with single LLM calls

A single API call to an LLM is powerful for simple tasks, but fails in complex workflows. Let's look at two real scenarios:

Scenario 1: Root Cause Analysis (RCA)

Limitations of a single LLM:

  • Context window too small (cannot fit all logs)
  • No access to tools (cannot fetch logs from Elasticsearch, query CPU/memory usage, or check service status)
  • No quality control (can hallucinate causes)
  • No memory (repeats analysis if the incident occurs again)

Scenario 2: Security Incident Triage

Limitations of a single LLM:

  • Cannot search threat intelligence databases
  • No structured investigation workflow
  • No audit trail (compliance requirement)
  • Does not learn from past incidents

The central limitation of single LLM calls: While modern LLMs can access external data through function calling and tools, a single request-response interaction cannot orchestrate complex multi-step workflows with feedback loops, maintain long-term memory across sessions, or coordinate specialized agents critiquing each other's outputs.

Multi-agent systems: specialization and coordination

Multi-agent architectures solve these problems by dividing responsibilities. Instead of a single LLM trying to do everything, each agent specializes in a specific task: one agent searches for data in external sources (solving the context window limit), another analyzes and reasons, and a third validates quality (eliminating hallucinations). The shared state between agents is persisted in a database, creating long-term memory that survives between executions.

Note on Architectural Patterns: This tutorial focuses on the Reflection Pattern, but production multi-agent systems often combine multiple patterns:

  • Planning Agents: Divide complex tasks into executable subtasks (consult RAG for plan templates)
  • Tool-Use Agents: Execute real-world actions (restart services, deploy, etc.)
  • Reflection Agents: Evaluate quality and provide feedback (our focus)

We chose Reflection because it is the most critical pattern for ensuring quality and reliability. By implementing a re-evaluation loop in which outputs are continuously critiqued and refined, the pattern significantly reduces hallucinations and improves response accuracy.

Orchestration: how agents coordinate

Having multiple specialized agents solves the problem of responsibilities, but creates a new challenge: who coordinates execution? This is where orchestration comes in.

Orchestration is the process of coordinating multiple agents to work together towards a common goal. Think of a conductor leading an orchestra: each musician (agent) plays their instrument (specific task), but the conductor decides when each plays and how the parts connect.

In our system, LangGraph acts as this conductor, coordinating the execution flow:

How LangGraph orchestrates:

  1. Manages shared state (Agent-to-Agent Communication): Each agent reads and writes to an IncidentState object that contains query, search_results, analysis, quality_score, etc. This communication ensures all agents work with the same data without conflicts (a minimal sketch of this object follows the list).
  2. Controls execution flow: Defines the order in which our agents are invoked (in this case, SearchAgent → AnalyserAgent → ReflectionAgent) and implements conditional routing: the router decides the next step based on the quality_score.
  3. Enables feedback cycles: Allows the workflow to return to the AnalyserAgent multiple times, something impossible in traditional engines that only support linear flows (DAGs).
  4. Ensures conflict-free coordination: Each agent executes in turn, receives the updated state, and passes the result forward deterministically.
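
To make the shared state concrete, here is a minimal sketch of what such an IncidentState object might look like (field names are illustrative; the repository defines the exact schema):

```python
from typing import List, TypedDict

class IncidentState(TypedDict):
    """Shared state that every agent in the workflow reads and writes."""
    query: str                  # the user's incident description
    search_results: List[dict]  # logs retrieved once by the SearchAgent
    analysis: str               # current analysis produced by the AnalyserAgent
    feedback: str               # critique returned by the ReflectionAgent
    quality_score: float        # 0.0-1.0 score assigned by the ReflectionAgent
    iteration: int              # number of reflection cycles executed so far
```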

Without orchestration, we would have 3 isolated agents with no capacity to collaborate. With LangGraph, we have a coordinated multi-agent system where each agent contributes to the final goal: generating a high-quality incident analysis.

Benefits of this approach:

Each agent focuses on a specific responsibility, allowing individual optimization. Agents can be swapped independently, facilitating maintenance and upgrades. Quality control happens through dedicated reflection, ensuring reliable outputs. Additionally, the system improves over time by learning from past decisions stored in long-term memory.

The role of persistent storage:

Multi-agent systems need three types of memory:

| Type | Scope | Managed By | Duration |
|---|---|---|---|
| Short-term (Context Window) | Current conversation | LLM | During chat |
| Working (State) | Between agents in the workflow | LangGraph | During execution |
| Long-term (Database) | Past decisions, patterns | Elasticsearch | Permanent |

In this tutorial, we use Elasticsearch for long-term memory because it provides semantic search (ELSER), hybrid queries, and natural integration with logs/metrics.

2. Orchestration: The cyclical reflection pattern (LangGraph)

In the previous section, we saw that our system has 3 specialized agents and that LangGraph coordinates their execution. Now, let's understand why we need a specific tool for orchestration and how it implements the reflection pattern.

AI agent orchestration has unique requirements that traditional tools do not meet:

  1. Feedback cycles: Agents need to repeat tasks based on quality evaluations
  2. Conditional routing: The next action depends on the result of the previous action (not a fixed flow)
  3. Mutable shared state: Multiple agents read and modify the same context
  4. Durability: The system needs to survive failures and resume from where it left off

Unlike DAGs (Directed Acyclic Graphs), which are linear flows without loops, AI agent workflows need cycles to implement reflection (where the agent critiques its own output), perform retries with feedback, and conduct multi-turn reasoning.

Using LangGraph for the workflow

LangGraph (from LangChain) was designed specifically for agent workflows, offering native cyclical flows, conditional routing based on agent outputs, and built-in state management.

The Reflection Pattern: self-correction loops

Without quality control, LLMs can produce incomplete or hallucinated outputs:

Problems:

  • No root cause identified
  • No impact quantification
  • Vague recommendations
  • No evidence from actual logs

The solution is to add a reflection loop where a specialized agent evaluates the quality of the output:

Implementation: The three components

1. ReflectionAgent: Critique with scoring
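A minimal sketch of such a critique step, assuming a local model served through Ollama via the langchain-ollama integration (model name, prompt wording, and the JSON-parsing convention are all illustrative; IncidentState is the state object shown earlier):

```python
import json
import re

from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1", temperature=0)  # model name is an assumption

def reflection_agent(state: IncidentState) -> dict:
    """Critique the current analysis and return a quality score plus feedback."""
    prompt = (
        "You are a strict reviewer of incident root cause analyses.\n"
        f"Incident: {state['query']}\n\n"
        f"Analysis to review:\n{state['analysis']}\n\n"
        "Check for: root cause, impact quantification, evidence from logs, "
        "actionable recommendations.\n"
        'Reply with JSON only: {"quality_score": 0.0-1.0, "feedback": "..."}'
    )
    raw = llm.invoke(prompt).content
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # tolerate extra prose around the JSON
    if match:
        result = json.loads(match.group(0))
        return {"quality_score": float(result["quality_score"]),
                "feedback": str(result["feedback"])}
    return {"quality_score": 0.0, "feedback": raw}  # unparseable critique forces another pass
```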

2. Router: conditional logic
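A sketch of the conditional logic, using the 0.8 quality threshold and the MAX_REFLECTION_ITERATIONS limit described earlier:

```python
import os

QUALITY_THRESHOLD = 0.8
MAX_REFLECTION_ITERATIONS = int(os.getenv("MAX_REFLECTION_ITERATIONS", "3"))

def decisor_router(state: IncidentState) -> str:
    """Decide whether to finalize or run another reflection iteration."""
    if state["quality_score"] >= QUALITY_THRESHOLD:
        return "finalize"
    if state["iteration"] >= MAX_REFLECTION_ITERATIONS:
        return "finalize"  # stop after the configured number of attempts
    return "increment"     # loop back to the AnalyserAgent with the reviewer's feedback
```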

3. LangGraph: The cyclical flow
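A sketch of how the cyclical graph can be wired with LangGraph's StateGraph, mirroring the nodes and edges listed in the workflow visualization below (the SearchAgent is assumed to have populated search_results before this loop runs, and finalize_and_save is a hypothetical name for the function that persists the result):

```python
from langgraph.graph import END, StateGraph

workflow = StateGraph(IncidentState)

workflow.add_node("AnalyserAgent", analyser_agent)        # generates / revises the analysis
workflow.add_node("ReflectionAgent", reflection_agent)    # critiques and scores it
workflow.add_node("increment", lambda state: {"iteration": state["iteration"] + 1})
workflow.add_node("finalize", finalize_and_save)          # persists the result to agent-memory

workflow.set_entry_point("AnalyserAgent")
workflow.add_edge("AnalyserAgent", "ReflectionAgent")
workflow.add_conditional_edges(
    "ReflectionAgent",
    decisor_router,
    {"increment": "increment", "finalize": "finalize"},
)
workflow.add_edge("increment", "AnalyserAgent")           # the feedback edge that closes the loop
workflow.add_edge("finalize", END)

app = workflow.compile()
```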

Workflow visualization:

The diagram above shows the complete LangGraph workflow structure. Note the key elements:

  • START → AnalyserAgent: Entry point defined via set_entry_point()
  • AnalyserAgent → ReflectionAgent: Linear edge via add_edge()
  • ReflectionAgent → decisor_router: Conditional decision
  • decisor_router → increment: When quality < 0.8 (increments counter and retries)
  • increment → AnalyserAgent: Closes the reflection loop (feedback edge marked in pink in the diagram)
  • decisor_router → finalize: When quality ≥ 0.8 or max iterations
  • finalize → END: Workflow conclusion

This feedback edge (highlighted in pink in the diagram for easy visualization) creates the reflection cycle that differentiates this pattern from traditional DAGs.

Results: quality improvement over iterations

Execution example:

3. The data layer: Elasticsearch as long-term memory and RAG

Multi-agent systems need persistent storage for two use cases:

  1. Long-Term Memory (LTM): Past decisions and learnings
  2. Retrieval Augmented Generation (RAG): Contextual data (logs, docs)

In this section, we will implement these capabilities using Elasticsearch.

The two indices of the system

The system uses two distinct indices in Elasticsearch, each with a specific purpose:

| Index | Purpose | Created By | Used For |
|---|---|---|---|
| incident-logs | Store incident logs | setup_elser.py | Hybrid search (SearchAgent) |
| agent-memory | Long-Term Memory (LTM) | System runtime | Save successful decisions |

Let's explore each index in detail, starting with the data source.

Long-term memory: Learning from past decisions

Without persistent memory, the system repeatedly solves the same incident from scratch:

The solution is to save successful decisions in Elasticsearch for future retrieval:
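A minimal sketch of that persistence step, using the Elasticsearch Python client (connection details and field names are illustrative; the repository defines the actual document schema):

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("https://your-deployment.es.elastic-cloud.com:443",
                   api_key="your-api-key")  # in the project these come from .env

def save_successful_decision(state: IncidentState) -> None:
    """Persist a high-quality analysis so future runs can reuse it."""
    es.index(
        index="agent-memory",
        document={
            "incident_query": state["query"],
            "content": state["analysis"],          # searched semantically on later runs
            "quality_score": state["quality_score"],
            "iterations": state["iteration"],
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    )
```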

Impact:

Hybrid search: Semantic (ELSER) + keyword (BM25)

Why hybrid search matters: Hybrid search combines the precision of keyword matching with the semantic understanding of ML models, ensuring you find both exact matches and conceptually related content that pure semantic search might miss.

Configuring ELSER (Elasticsearch semantic model):

Implementation:
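As a sketch, a hybrid query against the incident-logs index might combine a BM25 match clause with a semantic clause on a semantic_text field. Field names are assumptions, this query shape is only one common way to combine the two (the repository may instead pre-filter with the keyword clause, as described later in this section), and the semantic query requires a recent Elasticsearch version:

```python
from elasticsearch import Elasticsearch

def hybrid_search_logs(es: Elasticsearch, query_text: str, size: int = 15) -> list[dict]:
    """Hybrid search over incident logs: BM25 keyword match plus ELSER semantic match."""
    response = es.search(
        index="incident-logs",
        query={
            "bool": {
                "should": [
                    # Keyword relevance (BM25) on the raw log message
                    {"match": {"message": query_text}},
                    # Semantic relevance (ELSER) on the semantic_text field
                    {"semantic": {"field": "message_semantic", "query": query_text}},
                ]
            }
        },
        size=size,
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]
```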

Index 1: incident-logs (Data Source for RAG)

This is the index where the incident logs that the system analyzes are stored. It contains the special semantic_text field that automatically generates ELSER embeddings:
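A minimal mapping sketch for such an index (field names are illustrative; setup_elser.py in the repository defines the real mapping, and the semantic_text field must point at a working ELSER inference endpoint):

```python
# "es" is the Elasticsearch client created earlier.
es.indices.create(
    index="incident-logs",
    mappings={
        "properties": {
            "@timestamp": {"type": "date"},
            "service": {"type": "keyword"},
            "level": {"type": "keyword"},
            "message": {"type": "text"},                    # keyword (BM25) side of hybrid search
            "message_semantic": {"type": "semantic_text"},  # ELSER embeddings generated at index time
        }
    },
)
```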

Index 2: agent-memory (continuous learning)

Stores successful analyses for future retrieval (Long-Term Memory). Each document is a "memory" that the system can query:
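A sketch of a possible mapping for this index, with a semantic_text content field so memories can be recalled semantically (in the project it is created at runtime, and field names are illustrative):

```python
es.indices.create(
    index="agent-memory",
    mappings={
        "properties": {
            "incident_query": {"type": "text"},
            "content": {"type": "semantic_text"},  # the field searched semantically when recalling memories
            "quality_score": {"type": "float"},
            "iterations": {"type": "integer"},
            "timestamp": {"type": "date"},
        }
    },
)
```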

How the system uses these memories:
1. Semantic search on the content field: Finds similar solutions even with different words

2. Filters by quality_score >= 0.80: Only learns from high-quality decisions

3. Orders by _score (relevance) + timestamp: Prioritizes recent and relevant solutions

4. Injects top 3 into the AnalyserAgent's prompt: Accelerates analysis using past solutions as a template

Result: Recurring incidents are resolved faster, usually in 1 iteration.

How semantic search retrieves memories

When a new incident arrives, the system searches for similar memories by comparing concepts, not just words.

Code retrieval example:

1. Retrieval function (hybrid semantic search)
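A sketch of what this retrieval function might look like, following the index and field assumptions above; the quality filter and result size mirror the rules listed earlier:

```python
from elasticsearch import Elasticsearch

def retrieve_memories(es: Elasticsearch, query_text: str, k: int = 3) -> list[dict]:
    """Find past high-quality analyses similar to the current incident."""
    response = es.search(
        index="agent-memory",
        query={
            "bool": {
                "should": [
                    {"match": {"incident_query": query_text}},                # keyword side (BM25)
                    {"semantic": {"field": "content", "query": query_text}},  # semantic side (ELSER)
                ],
                "minimum_should_match": 1,
                "filter": [{"range": {"quality_score": {"gte": 0.80}}}],      # only learn from good decisions
            }
        },
        size=k,  # top 3 memories by default
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]
```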

2. Where it is called (inside the AnalyserAgent)

3. How it is used in the prompt
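The sketch below covers both the call site and the prompt injection; prompt wording is illustrative, and llm is the ChatOllama instance created earlier:

```python
def analyser_agent(state: IncidentState) -> dict:
    """Generate (or revise) the analysis, reusing past memories and reviewer feedback."""
    memories = retrieve_memories(es, state["query"])
    memory_block = "\n\n".join(m["content"] for m in memories) or "No similar past incidents."

    prompt = (
        f"Incident: {state['query']}\n\n"
        f"Relevant logs:\n{state['search_results']}\n\n"
        f"Similar past analyses (use as templates where applicable):\n{memory_block}\n\n"
        f"Reviewer feedback from the previous iteration, if any:\n{state.get('feedback', '')}\n\n"
        "Produce a root cause analysis with impact, evidence, and recommendations."
    )
    return {"analysis": llm.invoke(prompt).content}  # llm is the ChatOllama instance defined earlier
```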

Measurable gain: In recurring incidents, the system is faster, resolving in 1 iteration instead of 3, by using past solutions as a template.

Elasticsearch provides two capabilities (LTM, RAG) in a unified system. Hybrid search uses keyword matching (BM25) to pre-filter and reduce the search space, then applies semantic search (ELSER) on the filtered results—combining speed with semantic understanding. Optimization for time series allows efficient management of workflow history over time. Natural integration with observability logs and metrics means you can unify agents and operational data on the same platform. Finally, with ELSER semantically indexing all data, the agents themselves can query logs and past decisions using natural language—enabling them to retrieve contextually relevant information.

4. Running the demo

Quick setup

If you haven't already set up the environment, refer back to the Prerequisites section at the beginning of the article.

Run the demo

Run the analysis:

After confirming that Ollama is running and the model is downloaded, execute:
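For example (the entry-point script name and the query below are placeholders; check the repository for the actual command):

```bash
python run_analysis.py "Payment service returning intermittent 500 errors"
```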

Actual output (with annotations):

Complete execution flow

The diagram above shows the interaction between User, SearchAgent, AnalyserAgent, ReflectionAgent, and Elasticsearch during the 2 iterations. Observe how:

  • The SearchAgent queries Elasticsearch only once at the beginning of the workflow (15 logs found via hybrid search)
  • The AnalyserAgent reuses these logs and generates analysis 2 times (successfully improving from iteration 1 to iteration 2 based on reflection feedback)
  • The ReflectionAgent evaluates quality 2 times (providing critical feedback in iteration 1 and approval in iteration 2)
  • The Router decides between "increment" (retry) and "finalize" after each reflection
  • In iteration 2, the quality score (0.85) exceeds the threshold (0.80), triggering automatic finalization
  • The final high-quality result is saved in the agent-memory index for long-term memory, enabling faster resolution of similar future incidents

Conclusion

We built a self-correcting multi-agent system that demonstrates a new way of designing AI applications. Instead of relying on single LLM calls that can produce inconsistent outputs, we implemented a reflection pattern where specialized agents collaborate, critique, and iteratively improve their results.

The pillars of this system are equally important. The Reflection Pattern provides the self-correction mechanism through structured feedback loops. LangGraph orchestrates cyclical workflows that go beyond the limitations of traditional DAGs. And Elasticsearch unifies semantic search and long-term memory.

The complete code is available on GitHub. Experiment with your own data, adjust the quality criteria for your domain, and explore how the reflection pattern can improve your AI systems.

