
Markov chains and LLMs - hybrid architectures for smarter agents

Björn Roberg, Claude

The Hybrid Insight

Most discussions pit Markov chains against LLMs as competing approaches. But they’re complementary.

Markov chains excel at:

  • Structure: cheap, interpretable state transitions learned from data
  • Efficiency: sampling a next state takes microseconds, with no model call
  • Interpretability: every decision is a probability you can inspect

LLMs excel at:

  • Semantics: understanding what a task or utterance actually means
  • Flexibility: handling inputs the system was never explicitly built for
  • Zero-shot generalization: reasoning about situations with no prior examples

Hybrid systems use both: Markov chains for where and how; LLMs for what and why.


State-Based Agent Architectures

Markov Chains for Agent State Transitions

Instead of letting an LLM roam freely, structure its behavior as a state machine where:

  • States are high-level agent phases (Planning, Executing, Reflecting, ErrorHandling)
  • Transitions carry probabilities learned from successful agent runs
  • The LLM decides what to do within each state

Example state space and learned transition probabilities:

State: Planning
├─ → Executing (0.75) if plan is concrete
├─ → ErrorHandling (0.15) if planning failed
└─ → Planning (0.10) if more clarification needed

State: Executing
├─ → Reflecting (0.80) if task succeeded
├─ → ErrorHandling (0.15) if tool call failed
└─ → Executing (0.05) if retrying

State: Reflecting
├─ → Complete (0.85) if satisfied
└─ → Planning (0.15) if refinement needed

State: ErrorHandling
├─ → Planning (0.50) if recoverable
├─ → Executing (0.30) if retryable
└─ → Complete (0.20) if escalating

How it works:

  1. Agent reaches a decision point (e.g., “plan done, should I execute now?”)
  2. Sample next state from the Markov chain: P(Executing | Planning)
  3. If probability is high, the LLM gets stronger system prompt guidance to execute
  4. If probability is low, the LLM gets guidance to stay in Planning or investigate further

Benefit: More predictable agent behavior while maintaining flexibility within states.
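To make steps 2-4 concrete, here is a minimal sketch of the sampling and guidance logic, assuming the state space above. The sampleFrom helper, used informally throughout this post's snippets, is also sketched:

type AgentState = "Planning" | "Executing" | "Reflecting" | "ErrorHandling" | "Complete";

// Learned transition probabilities from the example above
const agentChain: Record<string, Record<string, number>> = {
  Planning: { Executing: 0.75, ErrorHandling: 0.15, Planning: 0.10 },
  Executing: { Reflecting: 0.80, ErrorHandling: 0.15, Executing: 0.05 },
  Reflecting: { Complete: 0.85, Planning: 0.15 },
  ErrorHandling: { Planning: 0.50, Executing: 0.30, Complete: 0.20 },
};

// Draw one key from a discrete distribution { outcome: probability }
function sampleFrom(dist: Record<string, number>): string {
  let r = Math.random();
  for (const [outcome, p] of Object.entries(dist)) {
    r -= p;
    if (r <= 0) return outcome;
  }
  return Object.keys(dist)[0];  // guard against floating-point rounding
}

// Turn the sampled transition into system-prompt guidance for the LLM
function transitionGuidance(current: Exclude<AgentState, "Complete">): string {
  const next = sampleFrom(agentChain[current]);
  const p = agentChain[current][next];
  return p >= 0.5
    ? `Move to the ${next} state now.`
    : `Consider moving to ${next}, or stay in ${current} and investigate further.`;
}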

Hierarchical Planning with MDPs

For complex tasks, combine Markov Decision Processes (MDPs) with LLM execution:

interface TaskMDP {
  states: string[];  // "understand", "design", "implement", "test", "complete"
  transitions: Record<string, Record<string, number>>;  // learned P(next | current)
  policy: (state: string) => string;  // optimal next state
}

async function hierarchicalPlanning(goal: string, mdp: TaskMDP) {
  let state = "understand";
  const results = [];

  while (state !== "complete") {
    // LLM executes the current subtask
    const result = await llmExecuteSubtask(goal, state);
    results.push({ state, result });

    // MDP decides next state
    state = mdp.policy(state);

    // Or sample stochastically:
    // state = sampleFrom(mdp.transitions[state]);
  }

  return results;
}

Benefit: Structure prevents the agent from jumping around; MDP ensures task decomposition follows learned optimal patterns.
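The post doesn't say how mdp.policy is derived. In the simplest case it can just follow the most probable learned transition; a hedged sketch (a real MDP policy would be derived from rewards via value or policy iteration):

// Greedy policy: always follow the highest-probability learned transition.
function greedyPolicy(
  transitions: Record<string, Record<string, number>>
): (state: string) => string {
  return state =>
    Object.entries(transitions[state] ?? {})
      .sort((a, b) => b[1] - a[1])[0]?.[0] ?? "complete";
}

// Usage:
// const mdp: TaskMDP = { states, transitions, policy: greedyPolicy(transitions) };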


Hybrid Generation and Sampling

Constrained Text Generation

Use Markov chains to enforce structure (format, style, meter) while LLMs handle semantics.

Example: Poetry generation with meter constraints

// Markov chain trained on poetry meter patterns
const meterChain = {
  "iamb": { "iamb": 0.6, "trochee": 0.2, "spondee": 0.2 },
  "trochee": { "iamb": 0.4, "trochee": 0.5, "spondee": 0.1 },
  // ... etc
};

async function generatePoetry(topic: string, numLines: number) {
  let currentMeter = "iamb";
  const lines = [];

  for (let i = 0; i < numLines; i++) {
    // LLM generates candidate line about topic
    const candidates = await llm.generateCandidates(
      `Write a line of poetry about ${topic} in ${currentMeter} meter:`,
      3  // get 3 options
    );

    // Score candidates by how well they match the meter
    const scored = candidates.map(c => ({
      text: c,
      score: evaluateMeter(c, currentMeter),
    }));

    // Pick the one that flows best
    const best = scored.sort((a, b) => b.score - a.score)[0];
    lines.push(best.text);

    // Transition to next meter
    currentMeter = sampleFrom(meterChain[currentMeter]);
  }

  return lines.join("\n");
}

Other uses:

  • Enforcing output format (e.g., section ordering in structured documents)
  • Maintaining a consistent style across generated passages
  • Any constraint where valid sequences can be learned as transition probabilities

Multi-Step Reasoning with Markov-Guided Exploration

Let the agent choose which reasoning strategy to use next via a Markov chain trained on successful problem-solving.

type ReasoningStrategy =
  | "deduction"      // apply logical rules
  | "analogy"        // find similar cases
  | "decomposition"  // break into subproblems
  | "verification"   // check validity
  | "backtrack";     // undo and retry

const reasoningChain: Record<
  ReasoningStrategy,
  Partial<Record<ReasoningStrategy | "complete", number>>
> = {
  "deduction": { "deduction": 0.4, "verification": 0.4, "backtrack": 0.2 },
  "analogy": { "decomposition": 0.5, "verification": 0.3, "deduction": 0.2 },
  "decomposition": { "deduction": 0.6, "analogy": 0.2, "backtrack": 0.2 },
  "verification": { "backtrack": 0.4, "deduction": 0.3, "complete": 0.3 },
  "backtrack": { "analogy": 0.4, "decomposition": 0.3, "deduction": 0.3 },
};

async function solveWithGuidedReasoning(problem: string) {
  let strategy: ReasoningStrategy = "decomposition";
  const trace = [];
  let steps = 0;
  const maxSteps = 10;

  while (steps < maxSteps) {
    // LLM applies the current reasoning strategy
    const result = await llm.reason(problem, strategy);
    trace.push({ strategy, result });

    if (result.isComplete) break;

    // Markov chain suggests the next strategy, or completion
    const next = sampleFrom(reasoningChain[strategy]) as ReasoningStrategy | "complete";
    if (next === "complete") break;
    strategy = next;
    steps++;
  }

  return { solution: trace.at(-1)?.result, trace };
}

Benefit: Agent explores a learned “good” space of reasoning strategies instead of flailing randomly.


Memory and Context Management

Markov-Based Memory Retrieval

Instead of simple recency or similarity, learn topic transition probabilities to predict which memories will be relevant.

Intuition: In a conversation, topics flow. If we’re talking about recipes, we might shift to cooking utensils, then kitchen design, then home improvement. A Markov chain can model these transitions.

interface MemoryTopic {
  name: string;
  memories: string[];  // actual memories under this topic
}

interface TopicChain {
  topics: string[];
  transitions: Record<string, Record<string, number>>;  // P(next | current)
}

async function retrieveRelevantContext(
  currentUtterance: string,
  conversationHistory: string[],
  memoryTopics: MemoryTopic[],
  topicChain: TopicChain
) {
  // 1. Use LLM to infer current topic
  const currentTopic = await llm.classifyTopic(currentUtterance, memoryTopics);

  // 2. Use Markov chain to predict likely next topics
  const topicProbs = topicChain.transitions[currentTopic];
  const likelyTopics = Object.entries(topicProbs)
    .sort((a, b) => b[1] - a[1])
    .slice(0, 3)
    .map(([topic]) => topic);

  // 3. Retrieve memories from current + likely next topics
  const context = [
    ...memoryTopics.find(t => t.name === currentTopic)?.memories || [],
    ...likelyTopics.flatMap(t => memoryTopics.find(m => m.name === t)?.memories || []),
  ];

  return context;
}

Benefit: More relevant context retrieval than simple similarity; captures conversational flow.

Dialogue State Tracking with Hidden Markov Models

Use HMMs to infer underlying user intent states from what the LLM generates/observes.

interface IntentState {
  name: string;
  description: string;
}

interface HMMModel {
  intents: IntentState[];
  emissionProbs: Record<string, Record<string, number>>;  // P(observation | intent)
  transitionProbs: Record<string, Record<string, number>>;  // P(next intent | intent)
}

async function trackDialogueIntent(
  userMessage: string,
  previousIntent: string,
  hmmModel: HMMModel
) {
  // 1. LLM generates candidate interpretations of what the user wants
  const interpretations = await llm.interpretIntent(userMessage, 3);

  // 2. For each intent, combine the emission likelihood with the transition prior
  const likelihoods = hmmModel.intents.map(intent => {
    // Take the best emission probability across the LLM's interpretations
    const emissionProb = Math.max(
      ...interpretations.map(
        (obs: string) => hmmModel.emissionProbs[intent.name]?.[obs] ?? 0.01
      )
    );
    const transitionProb = hmmModel.transitionProbs[previousIntent]?.[intent.name] ?? 0.05;
    return {
      intent: intent.name,
      likelihood: emissionProb * transitionProb,
    };
  });

  // 3. Pick the intent with highest likelihood
  const inferredIntent = likelihoods
    .sort((a, b) => b.likelihood - a.likelihood)[0]
    .intent;

  return inferredIntent;
}

Benefit: Multi-turn conversations stay consistent; you can track intent drift and detect when the user changes topics.


Training and Optimization

Reinforcement Learning from Markov Reward Models

Instead of calling the LLM every time you need a reward signal (expensive), train a lightweight Markov chain to approximate rewards, then use it during RL fine-tuning.

// Step 1: Collect successful agent trajectories
const successfulTraces = await collectTraces(agent, env, 1000);

// Step 2: Train a Markov reward model
// Learns: P(high reward | state, action) from observed outcomes
const markovRewardModel = trainMarkovRewardModel(successfulTraces);

// Step 3: Use it as a fast reward signal during RL
async function rlFineTuning(agent: LLMAgent) {
  for (let episode = 0; episode < 10000; episode++) {
    const trajectory = await agent.rollout();

    for (const step of trajectory) {
      // Fast: evaluate with Markov model
      const reward = markovRewardModel.evaluate(step.state, step.action);

      // Optionally, periodically call LLM to correct/update Markov model
      if (Math.random() < 0.01) {
        const trueReward = await llm.evaluateReward(step);
        updateMarkovRewardModel(trueReward, step.state, step.action);
      }

      agent.updateWeights(step, reward);
    }
  }
}

Benefit: 100-1000x speedup in RL training; Markov model is your “fast reward function.”
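trainMarkovRewardModel is not defined in the post; a minimal version might simply average observed rewards per (state, action) pair. A sketch, assuming each trajectory is an array of { state, action, reward } steps (TraceStep is a hypothetical type):

interface TraceStep {
  state: string;
  action: string;
  reward: number;
}

// Average the observed reward for each (state, action) pair
function trainMarkovRewardModel(traces: TraceStep[][]) {
  const sums = new Map<string, { total: number; count: number }>();
  for (const trace of traces) {
    for (const { state, action, reward } of trace) {
      const key = `${state}|${action}`;
      const entry = sums.get(key) ?? { total: 0, count: 0 };
      entry.total += reward;
      entry.count += 1;
      sums.set(key, entry);
    }
  }
  return {
    evaluate(state: string, action: string): number {
      const entry = sums.get(`${state}|${action}`);
      return entry ? entry.total / entry.count : 0;  // neutral default for unseen pairs
    },
  };
}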

Curriculum Learning via Markov Task Progression

Create adaptive training schedules where task difficulty follows a Markov process.

interface TaskDifficulty {
  level: number;  // 1-10
  tasks: Task[];
}

interface DifficultyChain {
  levels: TaskDifficulty[];
  transitions: Record<number, Record<number, number>>;  // P(next level | current)
}

async function adaptiveCurriculum(agent: LLMAgent, chain: DifficultyChain) {
  let currentLevel = 1;
  const performanceHistory: number[] = [];

  while (currentLevel < 10) {
    // Run agent on tasks at current difficulty
    const performance = await evaluateAgentOnLevel(agent, currentLevel);
    performanceHistory.push(performance);

    // Adjust transition probabilities based on performance
    if (performance > 0.8) {
      // Agent is doing well; increase prob of moving up
      chain.transitions[currentLevel][currentLevel + 1] += 0.1;
      chain.transitions[currentLevel][currentLevel] -= 0.1;
    } else if (performance < 0.5) {
      // Agent is struggling; decrease prob of moving up
      chain.transitions[currentLevel][currentLevel + 1] -= 0.1;
      chain.transitions[currentLevel][currentLevel] += 0.05;
    }

    // Sample next level from updated distribution
    currentLevel = sampleFrom(chain.transitions[currentLevel]);
  }
}

Benefit: Training adapts in real-time; tasks are neither too easy nor too hard.
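One caveat: the ±0.1 adjustments above can push a row's probabilities outside [0, 1] and break normalization. A small hedged helper to clamp and renormalize the row before sampling (renormalizeRow is not part of the original sketch):

// Clamp each probability to [0, 1], then rescale the row to sum to 1
function renormalizeRow(row: Record<number, number>): void {
  for (const key of Object.keys(row).map(Number)) {
    row[key] = Math.min(1, Math.max(0, row[key]));
  }
  const total = Object.values(row).reduce((a, b) => a + b, 0);
  if (total > 0) {
    for (const key of Object.keys(row).map(Number)) {
      row[key] /= total;
    }
  }
}

// Call before sampling: renormalizeRow(chain.transitions[currentLevel]);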


Reliability and Safety

Anomaly Detection Using Markov Baselines

Learn what “normal” agent behavior looks like, then flag deviations as potential safety issues.

interface MarkovAgentBaseline {
  states: string[];
  transitions: Record<string, Record<string, number>>;
  actionProbs: Record<string, Record<string, number>>;  // P(action | state)
  anomalyThreshold: number;
}

async function detectAnomalousActions(
  agent: LLMAgent,
  baseline: MarkovAgentBaseline,
  anomalyThreshold: number = baseline.anomalyThreshold  // default from the baseline itself
) {
  const flags: Array<{ step: number; action: string; anomalySeverity: number }> = [];

  let state = "initial";
  let step = 0;

  while (step < 100) {
    // Agent picks an action
    const action = await agent.selectAction(state);
    const expectedProb = baseline.actionProbs[state]?.[action] || 0.01;

    // If action is very unlikely given state, flag it
    if (expectedProb < anomalyThreshold) {
      flags.push({
        step,
        action,
        anomalySeverity: 1 - expectedProb,
      });

      // Optionally, intervene
      if (1 - expectedProb > 0.8) {
        console.warn(`ALERT: Highly anomalous action ${action} in state ${state}`);
        // Could pause, escalate, or require human approval
      }
    }

    // Transition to next state
    state = sampleFrom(baseline.transitions[state]);
    step++;
  }

  return flags;
}

Benefit: Catch “strange” agent behavior before it causes harm; interpretable safety (you can see why something is flagged).

Fallback Systems

When an LLM is slow or uncertain, switch to a lightweight Markov policy.

interface HybridAgent {
  llmAgent: LLMAgent;
  markovFallback: MarkovPolicy;
  confidenceThreshold: number;
}

async function hybridAction(
  agent: HybridAgent,
  state: AgentState,
  latencyBudget: number  // ms
) {
  // Condition 1: out of time, so use Markov immediately and skip the LLM call
  if (latencyBudget < 200) {
    console.log("Falling back to Markov policy (latency)");
    return agent.markovFallback.selectAction(state);
  }

  // Condition 2: LLM confidence is low, so use Markov
  const llmResult = await agent.llmAgent.selectActionWithConfidence(state);
  if (llmResult.confidence < agent.confidenceThreshold) {
    console.log("Falling back to Markov policy (low confidence)");
    return agent.markovFallback.selectAction(state);
  }

  return llmResult.action;
}

Benefit: Always have a safe, fast fallback; gracefully degrade under resource constraints.
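The post doesn't define MarkovPolicy; one minimal interpretation is a learned P(action | state) lookup table, sampled with the sampleFrom helper sketched earlier (makeMarkovPolicy is a hypothetical constructor):

interface MarkovPolicy {
  actionProbs: Record<string, Record<string, number>>;  // P(action | state)
  selectAction(state: string): string;
}

function makeMarkovPolicy(
  actionProbs: Record<string, Record<string, number>>
): MarkovPolicy {
  return {
    actionProbs,
    // Sample an action from the learned distribution for this state
    // (assumes the state was observed during training)
    selectAction(state) {
      return sampleFrom(actionProbs[state] ?? {});
    },
  };
}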


Specific Implementation Ideas

Code Generation Agents

Markov chain on abstract syntax trees (ASTs) guides structural decisions; LLM fills in semantics.

interface ASTMarkovChain {
  nodeTypes: string[];  // "if", "loop", "function_call", "assignment"
  transitions: Record<string, Record<string, number>>;
}

async function generateCode(spec: string, astChain: ASTMarkovChain) {
  const codeLines: string[] = [];
  let currentNodeType = "function_call";

  // (closing-brace node types are omitted from this sketch for brevity)
  while (currentNodeType !== "end") {
    if (currentNodeType === "if") {
      // LLM generates the condition
      const condition = await llm.generateCondition(spec);
      codeLines.push(`if (${condition}) {`);
      // Markov guides next node type (likely body, then close)
      currentNodeType = sampleFrom(astChain.transitions[currentNodeType]);
    } else if (currentNodeType === "loop") {
      const loopVar = await llm.generateLoopVariable(spec);
      codeLines.push(`for (let ${loopVar} = 0; ...) {`);
      currentNodeType = sampleFrom(astChain.transitions[currentNodeType]);
    } else if (currentNodeType === "assignment") {
      const assignment = await llm.generateAssignment(spec);
      codeLines.push(assignment);
      currentNodeType = sampleFrom(astChain.transitions[currentNodeType]);
    } else if (currentNodeType === "function_call") {
      const call = await llm.generateFunctionCall(spec);
      codeLines.push(call);
      currentNodeType = sampleFrom(astChain.transitions[currentNodeType]);
    }
  }

  return codeLines.join("\n");
}

Benefit: Valid code structure guaranteed by Markov chain; logic quality guaranteed by LLM.

Multi-Agent Coordination

Use Markov games for agent interaction patterns; LLMs for communication and local decisions.

interface MarkovGame {
  agents: string[];
  jointActions: string[];
  transitions: Record<string, Record<string, Record<string, number>>>;  // state -> joint action -> P(next state)
  rewards: Record<string, Record<string, number>>;  // rewards[agentId][state]
}

async function coordinatedMultiAgentSystem(
  agents: LLMAgent[],
  game: MarkovGame,
  state: string
) {
  // Each agent picks action using LLM
  const actions = await Promise.all(
    agents.map(agent => agent.decideAction(state))
  );
  const jointAction = actions.join("|");

  // But the next state follows the Markov game (learned equilibrium)
  const nextState = sampleFrom(game.transitions[state]?.[jointAction] ?? {});

  // Each agent's reward comes from the Markov game, for the state reached
  const rewards = Object.fromEntries(
    Object.entries(game.rewards).map(([agentId, byState]) => [agentId, byState[nextState] ?? 0])
  );

  return { actions, nextState, rewards };
}

Benefit: Agents communicate naturally via LLM but coordinate optimally via learned game equilibrium.

Streaming/Online Systems

Use Markov chains for real-time decisions when latency is critical; LLM for complex reasoning when time allows.

interface StreamingAgent {
  markovPolicy: MarkovPolicy;  // <50ms decisions
  llmAgent: LLMAgent;            // <5s decisions
  latencyBudget: number;         // ms
}

// Resolves to null after ms; used as a timeout sentinel below
const delay = (ms: number) =>
  new Promise<null>(resolve => setTimeout(() => resolve(null), ms));

async function streamingDecision(
  agent: StreamingAgent,
  state: AgentState,
  deadline: number  // unix timestamp
) {
  const now = Date.now();
  const timeLeft = deadline - now;

  if (timeLeft < 100) {
    // Immediate decision needed: use Markov
    return agent.markovPolicy.selectAction(state);
  } else if (timeLeft < 1000) {
    // Medium latency: fast approximation from Markov, then optionally refine
    const markovAction = agent.markovPolicy.selectAction(state);

    // Try to get the LLM answer before the timeout elapses
    const llmPromise = agent.llmAgent.selectAction(state).catch(() => null);
    const llmResult = await Promise.race([
      llmPromise,
      delay(Math.min(timeLeft - 100, 500)),  // resolves to null on timeout
    ]);

    if (llmResult != null) {
      return llmResult;
    }
    return markovAction;
  } else {
    // Plenty of time: use LLM for best answer
    return agent.llmAgent.selectAction(state);
  }
}

Benefit: Always meet latency SLAs; use best tool for the job given constraints.


Design Principles

When building a hybrid Markov + LLM system, keep these principles in mind:

1. Clear Division of Labor

Responsibility             Markov    LLM
Structure                    ✓
Semantics                              ✓
Efficiency                   ✓
Flexibility                            ✓
Interpretability             ✓
Zero-shot generalization               ✓

Use Markov for structure, efficiency, and interpretability. Use LLMs for understanding, reasoning, and flexibility.

2. Learn from Data

Don’t hand-code transition probabilities. Collect successful trajectories and learn Markov models from them. This ensures your chains reflect what actually works, not your assumptions.

// Good: learned from data
const chain = trainMarkovModel(successfulTraces);

// Bad: hand-coded assumptions
const chain = {
  "Planning": { "Executing": 0.8, "Planning": 0.2 },
  // ...
};
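What trainMarkovModel does is not spelled out, but for a first-order chain it can be as simple as counting adjacent state pairs and normalizing; a minimal sketch, assuming each trace is a sequence of state names:

// Estimate P(next | current) by counting adjacent state pairs in traces
function trainMarkovModel(
  traces: string[][]
): Record<string, Record<string, number>> {
  const counts: Record<string, Record<string, number>> = {};
  for (const trace of traces) {
    for (let i = 0; i < trace.length - 1; i++) {
      const current = trace[i];
      const next = trace[i + 1];
      counts[current] ??= {};
      counts[current][next] = (counts[current][next] ?? 0) + 1;
    }
  }
  // Normalize each row of counts into probabilities
  for (const current of Object.keys(counts)) {
    const total = Object.values(counts[current]).reduce((a, b) => a + b, 0);
    for (const next of Object.keys(counts[current])) {
      counts[current][next] /= total;
    }
  }
  return counts;
}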

3. Fallback and Graceful Degradation

Always have a Markov fallback for when LLMs are slow, unavailable, or too expensive.

try {
  return await llmAgent.decide(state);
} catch (error) {
  console.log("LLM unavailable, using Markov fallback");
  return markovAgent.decide(state);
}

4. Monitor and Adapt

Continuously:

  • Monitor recent agent traces for drift away from the learned transition distributions
  • Retrain chains on fresh successful trajectories
  • Adjust thresholds and guidance strength as behavior evolves

setInterval(async () => {
  const recentBehavior = await getRecentAgentTraces(1000);
  const drift = computeDistributionDrift(recentBehavior, currentChain);
  if (drift > 0.1) {
    console.log("Chain drift detected, retraining...");
    currentChain = trainMarkovModel(recentBehavior);
  }
}, 1 * 60 * 60 * 1000);  // hourly
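computeDistributionDrift is left undefined above; one reasonable reading is to retrain a candidate chain from the recent traces and compare transition rows, e.g. by mean total-variation distance. A sketch, reusing the trainMarkovModel sketch from principle 2:

type Chain = Record<string, Record<string, number>>;

// Retrain a candidate chain from recent traces, then measure the mean
// total-variation distance between matching transition rows
function computeDistributionDrift(recentTraces: string[][], chain: Chain): number {
  const recent = trainMarkovModel(recentTraces);
  const states = Object.keys(chain);
  let total = 0;
  for (const s of states) {
    const keys = new Set([
      ...Object.keys(chain[s] ?? {}),
      ...Object.keys(recent[s] ?? {}),
    ]);
    let tv = 0;
    for (const k of keys) {
      tv += Math.abs((chain[s]?.[k] ?? 0) - (recent[s]?.[k] ?? 0));
    }
    total += tv / 2;  // total variation for this row
  }
  return states.length > 0 ? total / states.length : 0;
}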

5. Make Probabilities Interpretable

When flagging anomalies or making decisions, explain in terms of probabilities.

// Good: interpretable
console.log(`Action 'delete_all' is 0.02% likely in state ${state}`);

// Bad: opaque
console.log(`Action flagged: ANOMALY_SCORE=0.98`);

When to Use This Pattern

Ideal Use Cases

  • Agents whose behavior has recurring structure (planning loops, tool-use phases)
  • Latency-critical or cost-sensitive systems that need a fast, cheap fallback
  • Safety-sensitive deployments where an interpretable behavioral baseline matters

Poor Fits

  • Fully open-ended tasks with no meaningful state space to learn
  • Domains without enough successful trajectories to estimate transitions
  • One-off tasks where collecting traces costs more than the structure saves


Getting Started

  1. Choose a task where agent behavior is somewhat structured (e.g., “my agent gets stuck in loops sometimes”)

  2. Collect successful trajectories (at least 100-1000 examples of agents doing the task well)

  3. Train a Markov model on state transitions or action sequences

    const chain = new MarkovChain(trajectories, { maxOrder: 2 });
  4. Instrument your agent to sample from or be guided by the chain

  5. Measure improvement:

    • Success rate ↑?
    • Latency ↓?
    • Interpretability ↑?
  6. Iterate: Retrain the chain as you collect more data; adjust the balance between Markov guidance and LLM flexibility


Key Takeaway

Markov chains and LLMs solve different problems.

By combining them—Markov chains for where and how, LLMs for what and why—you get agents that are more predictable, faster, safer, and more interpretable without sacrificing the semantic understanding that makes LLMs powerful.

The future of production AI systems likely isn’t pure end-to-end LLMs, nor is it simple Markov chains. It’s thoughtful hybrids that let each tool shine.

