1. Introduction
What is Markymarkov?
Markymarkov is a Markov Chain-based code guidance system designed to help LLM agents generate better code. Unlike traditional linters or type checkers that enforce fixed rules, Markymarkov learns patterns directly from your codebase and uses those patterns to validate and guide code generation in real-time.
At its core, Markymarkov operates on a simple but powerful principle: code is sequential, and code patterns are learnable. By analyzing existing code, it builds probabilistic models of what patterns typically follow other patterns—both at the syntactic level (how code is structured) and at the semantic level (what code idioms are preferred).
Key characteristics:
- Pattern-based: Learns from your actual codebase, not predefined rules
- Probabilistic: Provides confidence scores, not just pass/fail
- Fast: Sub-millisecond lookups enable real-time guidance during generation
- Interpretable: Can explain why a pattern is expected or unexpected
- Language-agnostic architecture: Currently focused on Python, but extensible to other languages
- Deployable: Exports models as executable Python modules with no external dependencies
The Problem It Solves
Modern LLM-based code generation has created a paradox: generated code is often syntactically correct but does not always follow stylistic idioms or algorithmic patterns.
Consider this scenario:
- An LLM generates a Python function that uses a valid but non-idiomatic pattern
- The code parses without errors
- Type checkers are satisfied
- But it violates your team’s coding conventions
- A human reviewer flags it for rewriting
This is the gap Markymarkov addresses. Today’s development teams face several related challenges:
- Style Enforcement at Scale
- Teams want consistency across codebases
- Manual code review can’t catch every style deviation
- Linters only check static rules, not learned patterns
- No way to programmatically enforce “this is how we write code”
- LLM Code Generation Quality
- LLMs generate valid code, but not idiomatic code
- Agents can’t distinguish between “correct” and “our style”
- Temperature and sampling can’t capture organizational conventions
- Training on diverse data means diverse output
- Validation Gaps
- Linters are rule-based (hard to maintain)
- Type checkers focus on types, not patterns
- AST visitors can find structure but miss intent
- No standard way to validate “does this follow our patterns?”
- Training Data Analysis
- Hard to understand what patterns dominate your codebase
- Difficult to identify anomalies or tech debt
- No way to measure consistency across teams
- Can’t extract “what makes our code unique?”
Markymarkov solves these by learning patterns from your code and providing:
- Real-time validation: Is this pattern expected?
- Confidence scores: How idiomatic is this code?
- Diagnostic information: Where does it diverge and why?
- Generation guidance: What patterns should come next?
Why Markov Chains for Code?
You might wonder: “Why not use deep learning, LSTMs, or transformers?” The answer reveals important truths about code and pattern matching.
- Code is Inherently Sequential
- Code flows through AST traversal order
- Control flow follows predictable paths
- Pattern composition is chain-like
- Markov chains were designed for exactly this kind of sequential modeling
- The Markov Property Holds for Code
- “The next pattern depends only on the current pattern” ✓
- Previous history can be captured in n-gram state
- Two-state or three-state context is often sufficient
- Rare cases benefit from higher-order models
- Computational Efficiency
- Training: O(n) pass through codebase
- Inference: O(1) hash table lookup
- Deep learning: GPU-heavy, requires infrastructure
- Marky: CPU-friendly, runs anywhere
- 1000 files/minute training speed (AST model)
- <1ms query latency (cached lookups)
- Interpretability
- Deep learning: “Why did it choose this?” → Inscrutable
- Markov chains: “What’s the probability?” → Auditable
- Can explain: “After X pattern, Y is expected 85% of the time”
- Developers can reason about confidence scores
- No black box; built on transparent math
- Data Efficiency
- Deep learning: Needs massive datasets
- Markov chains: Work well with codebases of any size
- Effective with 100+ files
- Scales gracefully to 100K+ files
- Graceful degradation (unknown patterns still provide value)
- Integration with LLMs
- LLMs already generate token sequences
- Markov models provide per-token guidance
- Natural fit for token-by-token validation
- Easy to integrate into sampling/generation loops
- No need to retrain LLM; validate output instead
Real-World Example: Why This Matters
Imagine your team uses these patterns:
- Guard clauses for early returns (not nested ifs)
- List comprehensions over manual loops
- Context managers for resource handling
- f-strings over .format()
An LLM might generate perfectly valid alternatives:
- Nested if-else chains (valid, not your style)
- For loops with manual appends (valid, not your style)
- Try-finally blocks (valid, not your style)
- .format() calls (valid, not your style)
Markymarkov learns these preferences from your codebase and can:
- During generation: Guide the LLM toward idiomatic patterns
- During validation: Flag deviations with confidence scores
- During review: Explain why code is unexpected
- For training: Help fine-tune models on your patterns
The Markymarkov Approach
Rather than debating “right” vs. “wrong” code style, Markymarkov asks: “What does this codebase do?” and “Does this code match those patterns?”
This shifts the conversation from:
- “That’s wrong” → “That’s not how we do it”
- “Use this style” → “Your code uses different patterns than training”
- “Bad practice” → “Low confidence match to learned patterns”
The result: objective, data-driven style validation that teams can understand and trust.
2. Core Concept
Two-Level Architecture (AST + Semantic)
Marky’s core innovation is its two-level validation architecture. This dual approach gives you the best of both worlds: structural correctness and stylistic idiomaticity.
Level 1: AST Patterns (Syntactic Correctness)
- Analyzes the Abstract Syntax Tree—how code is structurally composed
- Models transitions between node types: “What follows a function definition?”
- Answers: “Is the syntactic structure valid for this codebase?”
- Example patterns:
  - FunctionDef → Return (functions contain returns)
  - If → Assign (assignments can appear in if blocks)
  - For → Call (for loops can contain function calls)
- Think of it as: “Does the code structure make syntactic sense?”
Level 2: Semantic Patterns (Style & Idioms)
- Analyzes high-level coding idioms and patterns
- Models transitions between detected patterns: “What comes after a guard clause?”
- Answers: “Is the code written in the style we prefer?”
- Example patterns:
  - guard-clause → return-list (guard clauses often precede returns)
  - loop-filter → process-item (filtered loops often process items)
  - init-method → assign (constructors often assign properties)
- Think of it as: “Does this code match how we write code?”
Why Two Levels?
The two-level approach solves a fundamental problem: syntax and style are orthogonal concerns.
Consider this example: Both of these are syntactically valid:
# Style A: Nested conditionals
def validate(items):
    if items:
        if len(items) > 0:
            for item in items:
                if item.valid:
                    process(item)
            return True
    return False

# Style B: Guard clauses
def validate(items):
    if not items:
        return False
    valid_items = [item for item in items if item.valid]
    process_all(valid_items)
    return True
Both parse successfully. Both produce valid ASTs. But they’re very different in:
- Readability (guard clauses are clearer)
- Idiomaticity (Python community prefers B)
- Maintainability (B is easier to modify)
AST alone can’t distinguish these. Both have valid structures. You need semantic analysis to capture the how and why of code organization.
Conversely, semantic analysis alone can’t catch structure errors. A semantic pattern might be recognized, but applied to syntactically invalid code. AST validates the foundation.
Together, they form a powerful validation layer:
- AST ensures: “The code is well-formed”
- Semantic ensures: “The code follows our conventions”
- Combined: “The code is correct and idiomatic”
How Markov Models Learn Code Patterns
The process of learning happens in three stages:
Stage 1: Code Extraction
Input: Python codebase (100s-1000s of files)
┗ Parse each file to AST
┗ Extract semantic patterns (52 high-level idioms)
┗ Output: Lists of patterns and AST sequences
For example, a function might produce:
AST sequence: [Module, FunctionDef, FunctionDef, Return, Return]
Semantic sequence: [function-transformer, guard-clause, return-list, return-none]
Location tracking: [(line 10, col 4), (line 12, col 8), ...]
Stage 2: N-gram Creation The sequences are converted to n-grams (chains of N consecutive states):
For order=2 with semantic patterns:
Sequence: [function-transformer, guard-clause, return-list, return-none]
2-grams:
- (function-transformer, guard-clause) → next is return-list
- (guard-clause, return-list) → next is return-none
Each n-gram becomes a key in a frequency table:
{
('function-transformer', 'guard-clause'): {
'return-list': 3,
'return-none': 1,
'string-format': 1,
},
('guard-clause', 'return-list'): {
'return-none': 5,
},
}
Stage 3: Probability Calculation From frequencies, we calculate probabilities:
P(return-list | function-transformer, guard-clause) = 3 / (3+1+1) = 0.60
P(return-none | function-transformer, guard-clause) = 1 / (3+1+1) = 0.20
P(string-format | function-transformer, guard-clause) = 1 / (3+1+1) = 0.20
These probabilities become confidence scores during validation. A probability of 0.60 means: “In training data, when we see this context, the next state was return-list 60% of the time.”
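To make the mechanics concrete, here is a minimal sketch (standard library only, not the actual Markymarkov training code) of how a pattern sequence could be turned into 2-gram counts and then per-context probabilities:

from collections import Counter, defaultdict

def build_transition_table(sequence, order=2):
    """Count (context → next state) frequencies, then normalize into probabilities."""
    counts = defaultdict(Counter)
    for i in range(len(sequence) - order):
        context = tuple(sequence[i:i + order])   # e.g. ('function-transformer', 'guard-clause')
        counts[context][sequence[i + order]] += 1

    table = {}
    for context, next_counts in counts.items():
        total = sum(next_counts.values())
        table[context] = {state: n / total for state, n in next_counts.items()}
    return table

patterns = ['function-transformer', 'guard-clause', 'return-list', 'return-none']
print(build_transition_table(patterns, order=2))
# {('function-transformer', 'guard-clause'): {'return-list': 1.0},
#  ('guard-clause', 'return-list'): {'return-none': 1.0}}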
Transition Example
Let’s trace through a real example. Given training code:
def process_data(items):
    if not items:                                 # ← guard-clause pattern detected
        return None                               # ← return-none pattern
    results = [x.transform() for x in items]      # ← return-computed pattern
    return results                                # ← return-list pattern
Semantic patterns extracted: [guard-clause, return-none, return-computed, return-list]
2-grams learned:
guard-clause → return-none (confidence: 0.8)
return-none → return-computed (confidence: 0.3)
return-computed → return-list (confidence: 0.9)
Later, when validating generated code with sequence [guard-clause, return-none, return-computed, return-list]:
- ✓ guard-clause → return-none: Matches! (0.8 confidence)
- ✓ return-none → return-computed: Unusual but known (0.3 confidence)
- ✓ return-computed → return-list: Matches! (0.9 confidence)
Result: Valid code, moderate-to-high confidence.
From Training to Deployment
The journey from your codebase to real-time validation has four steps:
Step 1: Training (Offline, One-Time)
$ markymarkov train /path/to/codebase models/
- Scans all Python files
- Builds AST and semantic models
- Calculates transition probabilities
- Duration: ~1 minute for 1000 files
Step 2: Export (One-Time)
# Models are automatically exported as:
# models/ast_model.py
# models/semantic_model.py
Each is a Python file containing:
- Pre-calculated probability tables
- Pattern definitions
- No external dependencies
- ~5-50 KB per model
Example structure:
# semantic_model.py
TRANSITIONS = {
('guard-clause', 'return-none'): {
'return-computed': 0.8,
'return-list': 0.15,
'function-transformer': 0.05,
},
# ... 100s more transitions ...
}
MODEL_METADATA = {
'order': 2,
'total_transitions': 847,
'unique_patterns': 23,
}
Step 3: Loading (Startup, <100ms)
from models.semantic_model import TRANSITIONS, MODEL_METADATA
model = MarkovCodeGuide.from_table(TRANSITIONS, MODEL_METADATA)
- Just Python imports, no parsing
- Models load in milliseconds
- Fits in memory (1-50 MB)
- Ready for validation
Step 4: Validation (Real-Time, <1ms per query)
# During code generation
confidence = model.check_transition(current_pattern, next_pattern)
# confidence = 0.8 (80% match to training data)
- Hash table lookup: O(1)
- Sub-millisecond latency
- Cache-friendly
- Can handle 50K+ queries/second (a lookup sketch follows this list)
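For illustration, the deployed lookup can be a few lines over the exported table. The sketch below assumes the TRANSITIONS dictionary shape shown earlier and stands in for whatever the real check_transition does internally:

# Assumed shape of the exported table (see the semantic_model.py example above)
TRANSITIONS = {
    ('guard-clause', 'return-none'): {
        'return-computed': 0.8,
        'return-list': 0.15,
        'function-transformer': 0.05,
    },
}

def check_transition(context, next_pattern, transitions=TRANSITIONS):
    """Return P(next_pattern | context), or 0.0 if the pair was never seen in training."""
    return transitions.get(tuple(context), {}).get(next_pattern, 0.0)

print(check_transition(('guard-clause', 'return-none'), 'return-computed'))  # 0.8
print(check_transition(('guard-clause', 'return-none'), 'loop-filter'))      # 0.0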
The Deployment Advantage
Notice that once trained, Markymarkov needs:
- No Python AST parser (patterns pre-extracted)
- No semantic analyzer (patterns pre-detected)
- No machine learning framework (probabilities pre-calculated)
- Just a Python dict and some math
This makes Marky:
- Fast: No parsing overhead
- Portable: Works in any Python environment
- Reliable: Deterministic, no randomness
- Versionable: Export as .py files, commit to git
- Debuggable: Can inspect transition tables manually
The Full Pipeline
Putting it together:
TRAINING PHASE (Offline, One-Time)
├─ Codebase → Parser → AST/Semantic Patterns
├─ Patterns → N-gram Creator → Transition Frequencies
├─ Frequencies → Probability Calculator → Confidence Scores
└─ Export → Python Module (models/semantic_model.py)
INFERENCE PHASE (Online, Real-Time)
├─ Load Model (import models/semantic_model.py)
├─ Generated Code → Pattern Extractor → Pattern Sequence
├─ Sequence → N-gram Splitter → (context, next)
├─ Lookup in TRANSITIONS → Confidence Score
└─ Return Score to LLM Agent or Validator
This pipeline gives you the best of both worlds:
- Learning power: Captures real patterns from your code
- Runtime efficiency: No expensive computation at inference time
- Transparency: Understand why code matches or doesn’t match
- Flexibility: Retrain anytime your codebase evolves
3. Architecture Deep Dive
Level 1: AST Patterns (Syntactic Correctness)
The AST (Abstract Syntax Tree) level operates at the structural foundation of code. It answers the question: “Is this code structurally valid for our codebase?”
How AST Extraction Works
Python’s ast module parses code into a tree representation where each node represents a syntactic construct:
Module
├─ FunctionDef (name='process')
│ ├─ arguments
│ ├─ If
│ │ ├─ Compare
│ │ └─ Return
│ └─ For
│ ├─ expr
│ └─ Expr (Call)
└─ FunctionDef (name='validate')
├─ Return
└─ Return
Markymarkov extracts parent-child transitions from this tree:
(Module, FunctionDef)
(FunctionDef, If)
(FunctionDef, For)
(If, Return)
(For, Expr)
(Expr, Call)
...
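The extraction itself needs nothing beyond Python's standard ast module. A small sketch (not the production extractor) that counts parent-to-child node-type transitions:

import ast
from collections import Counter

def extract_ast_transitions(source: str) -> Counter:
    """Count (parent_node_type, child_node_type) pairs in a piece of Python source."""
    transitions = Counter()
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            transitions[(type(parent).__name__, type(child).__name__)] += 1
    return transitions

sample = """
def validate(x):
    if not x:
        return False
    return True
"""
for pair, count in extract_ast_transitions(sample).most_common(5):
    print(pair, count)
# e.g. ('Module', 'FunctionDef') 1, ('FunctionDef', 'If') 1, ('If', 'Return') 1, ...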
What This Captures
AST patterns capture structural rules about valid code composition:
- What can contain what? (FunctionDef can contain Return)
- What sequences are valid? (If blocks can be followed by For loops)
- What’s never valid? (Return outside function)
Common AST transitions you’d see in typical code:
(FunctionDef, Return) ✓ Functions can return values
(If, Assign) ✓ Can assign in if blocks
(For, If) ✓ Can nest if in for loop
(FunctionDef, ClassDef) ✗ Class defined inside a function (valid Python, but rare here)
(Return, FunctionDef) ✗ Function defined right after a return (unreachable, never seen)
Concrete Example
Let’s say your codebase has these patterns frequently:
def validate(x):
    if not x:
        return False
    return True

def process(items):
    if not items:
        return []
    return [x.upper() for x in items]

def transform(data):
    if isinstance(data, str):
        return data.strip()
    if data is None:
        return ""
    return str(data)
Markymarkov extracts:
(Module, FunctionDef) - appears 3 times
(FunctionDef, If) - appears 3 times
(If, Return) - appears 4 times
(If, If) - appears 1 time (nested ifs)
(Return, FunctionDef) - appears 2 times
Transition probabilities:
P(If | FunctionDef) = 3/3 = 1.0 (very common)
P(Return | If) = 4/4 = 1.0 (very common)
P(If | If) = 1/3 = 0.33 (less common)
P(ClassDef | FunctionDef) = 0 (never seen)
Later, when validating generated code with structure:
def my_func():
    if condition:
        return result
AST validation checks:
- ✓ Module → FunctionDef (0.99 confidence)
- ✓ FunctionDef → If (1.0 confidence)
- ✓ If → Return (1.0 confidence)
Result: Structurally valid and idiomatic for this codebase.
Why AST Alone Isn’t Enough
Two functions with identical AST structure:
# Version A
def validate(items):
    valid = []
    for item in items:
        if item.ok:
            valid.append(item)
    return valid

# Version B
def validate(items):
    return [item for item in items if item.ok]
Both have AST:
FunctionDef → Return
For/Comprehension → If
But Version B is more idiomatic Python. AST can’t tell you that.
Performance Characteristics
AST extraction is highly efficient:
- Single pass through code
- Linear time complexity O(n)
- Training speed: ~1000 files/minute
- Model size: 10-50 MB for typical codebases
- Lookup: O(1) hash table access, <1 microsecond
Level 2: Semantic Patterns (Code Style & Idioms)
While AST validates structure, semantic patterns validate intent and style. This level asks: “Does this code follow how we write code?”
How Semantic Pattern Detection Works
Semantic patterns are high-level abstractions detected by analyzing code behavior:
# Pattern: GUARD_CLAUSE
if not items:
    return []

# Pattern: RETURN_NONE
if condition:
    return None

# Pattern: LOOP_FILTER
for item in items:
    if item.valid:
        process(item)

# Pattern: LIST_COMPREHENSION
[x.transform() for x in items if x.valid]

# Pattern: CONTEXT_MANAGER
with open(file) as f:
    content = f.read()
Unlike AST (which just sees structure), semantic analysis understands:
- What the code is trying to do
- Whether it uses idiomatic patterns
- What intent the pattern expresses (a detection sketch follows this list)
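To illustrate how such detection could work, here is a toy detector for just two of the patterns (guard-clause and list-comprehension); the real detector covers 52+ patterns with more careful rules:

import ast

def detect_patterns(source: str) -> list:
    """Toy detector: recognizes guard clauses and list comprehensions from the AST."""
    patterns = []
    for node in ast.walk(ast.parse(source)):
        # guard-clause: an `if` whose body is a single early return
        if isinstance(node, ast.If) and len(node.body) == 1 and isinstance(node.body[0], ast.Return):
            patterns.append('guard-clause')
        # list-comprehension: any ListComp node
        elif isinstance(node, ast.ListComp):
            patterns.append('list-comprehension')
    return patterns

print(detect_patterns("""
def process(items):
    if not items:
        return []
    return [x.upper() for x in items]
"""))
# ['guard-clause', 'list-comprehension']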
The 52+ Semantic Patterns
Markymarkov detects and tracks 52+ distinct semantic patterns across these categories:
Control Flow Idioms (8 patterns)
- if-none-check: if x is None:
- if-not-none: if x is not None:
- if-empty-check: if not x: or if len(x) == 0:
- if-type-check: isinstance(x, Type)
- guard-clause: Early return for invalid input
- ternary-expression: x if condition else y
- boolean-return: Direct return of boolean expression
- fallthrough: Implicit None return
Loop Patterns (9 patterns)
- loop-accumulate: Collect items: result.append(item)
- loop-filter: Conditional processing: if condition: process(item)
- loop-transform: Transform items: yield transform(item)
- loop-enumerate: for i, item in enumerate(items):
- loop-zip: for x, y in zip(a, b):
- loop-dict-items: for k, v in dict.items():
- loop-dict-keys: for k in dict.keys():
- loop-dict-values: for v in dict.values():
- loop-reverse: for item in reversed(items):
Return Patterns (6 patterns)
- return-none: return None or bare return
- return-bool: return True or return False
- return-list: return [] or return [value]
- return-dict: return {} or return {k: v}
- return-computed: return a + b or return func(x)
- return-literal: return "string" or return 42
Data Structure Patterns (10 patterns)
- init-empty-list: x = []
- init-empty-dict: x = {}
- init-empty-set: x = set()
- init-counter: x = 0
- append-to-list: list.append(item)
- extend-list: list.extend(items)
- dict-get-default: dict.get(key, default)
- dict-setdefault: dict.setdefault(key, value)
- unpacking: x, y = data or *args
- string-concatenation: a + b or f"{a}{b}"
Comprehension Patterns (4 patterns)
- list-comprehension: [x for x in items if condition]
- dict-comprehension: {k: v for k, v in items}
- set-comprehension: {x for x in items}
- generator-expression: (x for x in items)
String Patterns (3 patterns)
- string-format-fstring: f"{x} {y}"
- string-format-method: "{} {}".format(x, y)
- string-format-join: "-".join(items)
Function/Class Patterns (7 patterns)
- function-validator: Checks and returns bool
- function-transformer: Transforms input to output
- function-factory: Creates and returns objects
- init-method: __init__(self, ...)
- property-getter: @property def x(self):
- class-method: @classmethod def ...
- static-method: @staticmethod def ...
Error Handling Patterns (5 patterns)
- try-except-pass: Suppress exceptions silently
- try-except-log: Log and continue
- try-except-reraise: Re-raise after handling
- try-finally: Cleanup code
- context-manager: with statement
Miscellaneous Patterns (4 patterns)
- import-statement: Module imports
- logging-call: logger.info(), etc.
- assertion: assert condition
- comment: Inline documentation
Example: Semantic Pattern Detection
Consider this function:
def find_user(user_id):
    if user_id is None:               # ← if-none-check
        return None                   # ← return-none
    user = database.get(user_id)      # ← database lookup
    if not user:                      # ← if-empty-check
        return None                   # ← return-none
    return user                       # ← return-computed
Semantic sequence detected: [if-none-check, return-none, if-empty-check, return-none, return-computed]
This pattern is classic: “validate inputs early, return early if invalid, then return the result.”
Why Semantic Matters
Two functions that do the same thing:
# Style A: Nested checks
def process(data):
    if data:
        if isinstance(data, list):
            if len(data) > 0:
                return [item.process() for item in data]
    return None

# Style B: Guard clauses (idiomatic Python)
def process(data):
    if not data or not isinstance(data, list) or len(data) == 0:
        return None
    return [item.process() for item in data]

# Style C: Even better (Python idiom)
def process(data):
    if not isinstance(data, list) or not data:
        return None
    return [item.process() for item in data]
All three are syntactically valid (all produce valid ASTs). But:
- Style A: Nested checks (less idiomatic)
- Style B: Guard clause (idiomatic)
- Style C: Guard clause + idiomatic ordering (most idiomatic)
Only semantic analysis can distinguish these.
Practical Impact
For code generation guidance, semantic patterns are crucial because they:
- Capture how your team prefers to solve problems
- Enable pattern-based suggestions
- Explain code at a higher level
- Match what developers reason about
- Drive meaningful validation signals
How They Work Together
The magic of Markymarkov comes from combining these two levels:
Scenario 1: Valid AST, Unknown Semantic Pattern
def validate(x):
    if x is None:
        return None
    return x.process()
- AST: ✓ Valid (FunctionDef → If → Return)
- Semantic: if-none-check → return-none → return-computed
- Confidence: Medium-to-high (recognized pattern)
Scenario 2: Valid AST, Unusual Semantic Pattern
def process(items):
    results = {}
    for item in items:
        try:
            results[item.id] = item.process()
        except Exception:
            pass
    return results
- AST: ✓ Valid
- Semantic: init-empty-dict → loop-filter → try-except-pass → return-dict
- Unusual but valid (low confidence in semantic)
- AST says “OK”, semantic says “uncommon”
Scenario 3: Syntactically Valid, Semantically Suspicious
def validate(items):
    if len(items) > 0:
        if isinstance(items, list):
            if items is not None:
                return items
    else:
        return None
- AST: ✓ Valid (just nested ifs)
- Semantic: Detects poor pattern composition
- Confidence: Low (guards are unusual for this structure)
- Even though syntactically OK, pattern matches poorly
Fallback Mechanism
When confidence is uncertain in one level, the other provides context:
# If AST says "unknown structure"
# Check: Does semantic pattern exist? If yes, probably OK.
# → Higher overall confidence
# If semantic says "unknown pattern"
# Check: Is AST structure valid? If yes, probably OK.
# → Higher overall confidence
Combined Confidence Scoring
Final confidence combines both:
overall_confidence = (ast_confidence × 0.4) + (semantic_confidence × 0.6)
Or for explicit validation:
Valid = (AST passes) AND (Semantic acceptable OR AST has >0.8 confidence)
Confidence = weighted_average(ast_score, semantic_score)
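A minimal sketch of this scoring, assuming the illustrative 0.4/0.6 weights above and an assumed "acceptable" semantic threshold of 0.3:

def combined_confidence(ast_score, semantic_score, ast_weight=0.4, semantic_weight=0.6):
    """Weighted average of the two validation levels."""
    return ast_score * ast_weight + semantic_score * semantic_weight

def is_valid(ast_passes, ast_score, semantic_score, semantic_threshold=0.3, ast_override=0.8):
    """Valid = (AST passes) AND (semantic acceptable OR AST is very confident)."""
    return ast_passes and (semantic_score >= semantic_threshold or ast_score > ast_override)

print(combined_confidence(0.9, 0.5))    # 0.66
print(is_valid(True, 0.9, 0.1))         # True (semantic is low, but AST confidence exceeds 0.8)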
This gives you:
- Structural guarantee: Code parses and flows correctly (AST)
- Style assurance: Code follows learned idioms (Semantic)
- Interpretability: Can explain both dimensions separately
- Flexibility: Can weight them differently per use case
Real-World Example
Given training code:
def find_users(search_term):
    if not search_term:                           # ← guard-clause
        return []                                 # ← return-list
    results = []
    for user in database.all():                   # ← loop-iterate
        if search_term in user.name:              # ← loop-filter
            results.append(user)                  # ← loop-accumulate
    return results                                # ← return-list
Pattern sequence: [guard-clause, return-list, loop-iterate, loop-filter, loop-accumulate, return-list]
When validating generated code:
def find_users(search_term):
    if not search_term:
        return []
    return [u for u in database.all() if search_term in u.name]
Pattern sequence: [guard-clause, return-list, list-comprehension]
Checking:
- AST: ✓ Valid (FunctionDef → If → Return → Comprehension)
- Semantic: guard-clause → return-list → list-comprehension
- Different from training! (training used loop-filter; the list comprehension filters inline)
- But still valid pattern chain
- Confidence: 0.7 (valid but different style)
Result: Accepted with medium confidence (semantically different but AST valid)
4. Practical Examples
Training Markymarkov on Your Codebase
markymarkov train /path/to/code models/ --model-type both
- Analyzes all Python files
- Builds AST and semantic models
- Exports as Python modules
- <1 minute for 1000 files
Using Markymarkov to Validate Generated Code
markymarkov validate models/semantic_model.py generated_code.py
- Shows pattern sequences
- Highlights known vs unknown transitions
- Provides coverage percentage
- Suggests expected alternatives
Real-World Validation Output with Diagnostics
> uv run markymarkov validate examples/pytest/semantic_model.py src/__main__.py
Built markymarkov @ file:///.../marky
Uninstalled 1 package in 0.21ms
Installed 1 package in 0.45ms
Loading model: examples/pytest/semantic_model.py
Validating code: src/__main__.py
Extracted 71 semantic patterns
First 20 patterns: ['init-method', 'function-transformer', 'if-empty-check', 'return-none', 'function-transformer', 'guard-clause', 'return-list', 'guard-clause', 'string-format', 'return-computed', 'function-transformer', 'if-empty-check', 'return-none', 'context-manager', 'string-format', 'context-manager', 'string-format', 'function-transformer', 'return-computed', 'loop-enumerate']
Model order: 2
Model has 211 pattern sequences
Validation Result (Semantic Model):
Valid: True
Confidence: 0.373
Pattern sequences checked: 15
Known transitions: 9/15
✓ Matching sequences (9):
1. function-transformer → if-empty-check → return-none (0.423) @ line 118:12
2. if-empty-check → return-none → function-transformer (0.370) @ line 135:4
3. return-none → function-transformer → guard-clause (0.275) @ line 138:12
4. guard-clause → return-list → guard-clause (1.000) @ line 143:8
5. string-format → return-computed → function-transformer (0.714) @ line 149:4
6. return-computed → function-transformer → if-empty-check (0.056) @ line 158:8
7. function-transformer → if-empty-check → return-none (0.423) @ line 160:12
✗ Non-matching sequences (6):
1. init-method → function-transformer → if-empty-check @ line 116:8
Expected one of: return-computed, if-type-check, unpacking
2. function-transformer → guard-clause → return-list @ line 139:16
Expected one of: return-computed, return-none, return-bool
3. return-list → guard-clause → string-format
Expected one of: return-list
4. Unknown sequence: guard-clause → string-format @ line 145:12
5. if-empty-check → return-none → context-manager
Expected one of: function-transformer, return-computed, guard-clause
... and 1 more
Summary:
Unique patterns found: 23
Coverage: 9/15 transitions (60.0%)
Issues: 4 unexpected, 2 unknown context
Understanding Coverage & Confidence Scores
- Coverage: % of transitions that match training data
- Confidence: Probability-weighted score (0.0-1.0); see the sketch after this list
- High coverage + high confidence = Very idiomatic
- Low coverage = Unusual but possibly valid style
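A small sketch of how these two numbers could be derived from per-transition scores (the threshold and values here are illustrative):

def summarize(transition_confidences, known_threshold=0.0):
    """Coverage = share of transitions seen in training; confidence = their average probability."""
    if not transition_confidences:
        return {'coverage': 0.0, 'confidence': 0.0}
    known = [c for c in transition_confidences if c > known_threshold]
    return {
        'coverage': len(known) / len(transition_confidences),
        'confidence': sum(transition_confidences) / len(transition_confidences),
    }

# Nine known transitions and six unseen ones, loosely mirroring the output above
scores = [0.42, 0.37, 0.28, 1.00, 0.71, 0.06, 0.42, 0.65, 0.55] + [0.0] * 6
print(summarize(scores))  # coverage = 9/15 = 0.6; confidence = mean of all fifteen scores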
5. 52 Semantic Patterns
Control Flow Patterns
- IF_NONE_CHECK: if x is None:
- IF_NOT_NONE: if x is not None:
- IF_EMPTY_CHECK: if not x: or if len(x) == 0:
- IF_TYPE_CHECK: if isinstance(x, Type):
- GUARD_CLAUSE: Early return for invalid cases
Loop Patterns
- LOOP_ACCUMULATE: result = []; for item in items: result.append(...)
- LOOP_TRANSFORM: for item in items: yield transform(item)
- LOOP_FILTER: for item in items: if condition: process(item)
- LOOP_ENUMERATE: for i, item in enumerate(items):
- LOOP_ZIP: for x, y in zip(a, b):
- LOOP_DICT_ITEMS: for k, v in dict.items():
Return Patterns
- RETURN_NONE: return None or bare return
- RETURN_BOOL: return True or return False
- RETURN_LIST: return [] or return [computed_value]
- RETURN_DICT: return {} or return {key: value}
- RETURN_COMPUTED: return a + b or return func(x)
Data Structure Patterns
- INIT_EMPTY_LIST: x = []
- INIT_EMPTY_DICT: x = {}
- INIT_COUNTER: x = 0
- APPEND_TO_LIST: list.append(item)
- DICT_UPDATE: dict.update(other)
- DICT_GET_DEFAULT: dict.get(key, default)
- UNPACKING: x, y = data
Error Handling Patterns
- TRY_EXCEPT_PASS: try: ...; except: pass
- TRY_EXCEPT_LOG: try: ...; except: logger.error(...)
- TRY_EXCEPT_RERAISE: try: ...; except: raise
- TRY_FINALLY: try: ...; finally: cleanup()
- CONTEXT_MANAGER: with open(file) as f:
Function/Class Patterns
- FUNCTION_VALIDATOR: Function that validates input
- FUNCTION_TRANSFORMER: Function that transforms data
- INIT_METHOD: __init__(self, ...)
- PROPERTY_GETTER: @property def x(self):
- CLASS_METHOD: @classmethod def ...
- STATIC_METHOD: @staticmethod def ...
Comprehension & Other Patterns
- LIST_COMPREHENSION: [x for x in items if ...]
- DICT_COMPREHENSION: {k: v for k, v in ...}
- GENERATOR_EXPRESSION: (x for x in items)
- TERNARY_EXPRESSION: x if condition else y
- STRING_FORMAT: f"{x} {y}" or x.format(y)
- LOGGING_CALL: logger.info(), logger.debug(), etc.
6. Integration with LLM Agents
Note: This chapter explores integration patterns between Markymarkov and LLM agents. While not yet implemented in practice, the architecture suggests several natural approaches worth considering.
Marky’s design lends itself to integration with LLM-based code generation agents. Rather than replacing the LLM, Markymarkov serves as a validation and guidance layer, helping agents generate more idiomatic code in real-time.
How Agents Use Markymarkov for Guidance
The integration pattern is straightforward: instead of a simple generate-then-review cycle, Markymarkov enables a continuous feedback loop during generation.
Step 1: Agent Starts Generation
User prompt: "Generate a Python function that validates user input"
↓
LLM begins token-by-token generation with temperature sampling
Step 2: Real-Time Validation
Generated tokens: [def, validate, (, user_input, ), :, \n, if, not, user_input, ...]
↓
After each statement, Markymarkov analyzes:
- AST structure validity
- Semantic pattern sequence
- Confidence score for next pattern
Step 3: Confidence-Based Feedback
Markymarkov output:
Current pattern: guard-clause
Next pattern options:
- return-none (0.85 confidence) ← high confidence
- return-bool (0.12 confidence) ← lower confidence
- loop-filter (0.03 confidence) ← very unusual
Recommendation: Steer toward return-none
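One way to produce that kind of ranking from the exported table (a sketch; the real model may expose a helper such as top_transitions for this, and the sample table below is hypothetical):

def top_transitions(transitions, context, k=3):
    """Return the k most likely next patterns for a given context, best first."""
    candidates = transitions.get(tuple(context), {})
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:k]

TRANSITIONS = {
    ('function-transformer', 'guard-clause'): {
        'return-none': 0.85, 'return-bool': 0.12, 'loop-filter': 0.03,
    },
}
print(top_transitions(TRANSITIONS, ('function-transformer', 'guard-clause')))
# [('return-none', 0.85), ('return-bool', 0.12), ('loop-filter', 0.03)]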
Step 4: Agent Adjusts Generation
Based on confidence scores, the agent decides:
- Continue: If high confidence (>0.7)
- Adjust temperature: If medium confidence (0.4-0.7)
- Use logit biasing: If low confidence (<0.4)
- Regenerate: If confidence below threshold
Example Flow
Here’s how a generation might proceed with Markymarkov validation:
INITIAL: Empty function stub
def validate_email(email):
ITERATION 1:
Generated: "if not email:"
Markymarkov analysis: guard-clause detected
Confidence: 0.92 (very common opening pattern)
Action: Continue with high confidence
ITERATION 2:
Generated: " return False"
Pattern after guard-clause: return-bool
Markymarkov analysis: return-bool confidence 0.8 (common after guards)
Markymarkov analysis: return-none confidence 0.15 (less common)
Action: Continue, both valid patterns
ITERATION 3:
Generated: " email_regex = r'...'"
Pattern: init-var (assignment)
Markymarkov analysis: After return-bool, return-var confidence 0.3
Markymarkov analysis: After return-bool, init-var confidence 0.15
Assessment: Medium-low confidence (unusual pattern chain)
Action: Alert agent, consider regeneration or steering
ITERATION 4:
Generated: " return email_regex.match(email) is not None"
Pattern: return-computed
Markymarkov analysis: After init-var, return-computed confidence 0.7
Assessment: Pattern sequence matches learned patterns
Action: Continue
RESULT:
def validate_email(email):
    if not email:
        return False
    return email_regex.match(email) is not None
Pattern sequence: [guard-clause, return-bool, init-var, return-computed]
Overall confidence: 0.65 average (valid but slightly unusual)
Temperature Sampling for Diversity
LLMs use temperature to control randomness in token selection. Marky’s confidence scores provide a natural signal for temperature adjustments.
How Temperature Works with LLM Sampling
Standard temperature sampling:
At each step, LLM computes logits for all possible next tokens:
logits = [0.8, 1.2, 0.3, -0.5, 2.1, ...]
Temperature=1.0 (default):
probabilities = softmax(logits)
Sample from probabilities (natural randomness)
Temperature=0.3 (low, deterministic):
probabilities = softmax(logits / 0.3)
Peaks sharpen, mostly samples highest probability token
Result: More predictable, less creative
Temperature=2.0 (high, creative):
probabilities = softmax(logits / 2.0)
Distribution flattens, samples more diverse tokens
Result: More creative, less predictable
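The scaling described above is plain math. A small sketch, independent of any particular LLM API:

import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits into sampling probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.8, 1.2, 0.3, -0.5, 2.1]
for t in (0.3, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
# At t=0.3 nearly all probability sits on the largest logit (2.1);
# at t=2.0 the distribution flattens and sampling becomes more diverse.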
Marky-Aware Temperature Tuning
Consider dynamic temperature adjustment based on confidence:
def adjust_temperature(marky_confidence):
    """
    Adjust generation temperature based on pattern confidence.

    High confidence → follow patterns closely
    Low confidence → explore alternatives
    """
    if marky_confidence > 0.8:
        return 0.3   # Follow learned patterns closely
    elif marky_confidence > 0.5:
        return 0.7   # Balanced exploration
    else:
        return 1.2   # High diversity, unusual patterns OK
Use Cases
Scenario 1: Style Enforcement Mode
Markymarkov confidence: 0.95 (return-list very common after guard-clause)
Temperature: 0.2 (low, enforce pattern)
Effect: Agent strongly prefers idiomatic code
Scenario 2: Experimental Code Generation
Markymarkov confidence: 0.4 (unusual pattern combination)
Temperature: 1.5 (high, explore alternatives)
Effect: Agent tries novel patterns while staying valid
Scenario 3: Adaptive Mode
Markymarkov confidence: varies dynamically
Temperature: adjusts per iteration
Effect: Agent balances idiomaticity with novelty
Logit Biasing for LLM Steering
Many LLM APIs (OpenAI, Anthropic, etc.) support logit biasing: adjusting the probability of specific tokens before sampling. Marky’s pattern knowledge maps naturally to this mechanism.
How Logit Biasing Works
Raw logits: [0.8, 1.2, 0.3, -0.5, 2.1, ...]
Token IDs: [123, 456, 789, 234, 567, ...]
With logit bias (boost token 456, suppress token 789):
bias = {456: +2.0, 789: -3.0}
adjusted_logits = [0.8, 3.2, -2.7, -0.5, 2.1, ...]
↑ ↑ ↑
unchanged boosted suppressed
Sample from adjusted distribution:
Token 456 becomes much more likely
Token 789 becomes much less likely
Marky-Driven Logit Biasing
Here’s an approach to recommend token biases:
def recommend_logit_bias(current_pattern, model_transitions):
    """
    Recommend token biases based on expected next patterns.
    """
    expected_next = model_transitions[current_pattern]

    # Boost tokens for high-confidence patterns
    positive_bias = {}
    for pattern, confidence in expected_next.items():
        if confidence > 0.7:
            tokens = pattern_to_tokens(pattern)
            for token in tokens:
                positive_bias[token] = confidence * 2.0

    # Suppress tokens for low-confidence patterns
    negative_bias = {}
    for pattern, confidence in expected_next.items():
        if confidence < 0.1:
            tokens = pattern_to_tokens(pattern)
            for token in tokens:
                negative_bias[token] = -3.0

    return {**positive_bias, **negative_bias}
Example: Steering Toward Guard Clauses
Suppose Markymarkov analysis shows: “After a function definition, guards have 0.85 confidence.”
# Generate logit bias for guard clause patterns
bias = {
    token_id("if"): +1.5,        # Boost "if" keyword
    token_id("not"): +1.2,       # Boost "not" (common in guards)
    token_id("is None"): +1.0,   # Boost None check
    # Suppress less idiomatic patterns
    token_id("while"): -2.0,     # Suppress while loops
    token_id("try"): -1.5,       # Suppress try blocks
}

# Pass to LLM API
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    logit_bias=bias,  # Steer generation toward guard clauses
)
Benefits of This Approach
- Precise control: Boost/suppress specific tokens or patterns
- Non-invasive: Works with any LLM that supports logit bias
- Interpretable: Clear explanation for token preferences
- Efficient: Minimal inference overhead
- Composable: Stack multiple biases for complex guidance
Real-Time Validation During Generation
While temperature and logit biasing shape generation proactively, real-time validation provides reactive feedback as code emerges.
Architecture
LLM Stream Generator
↓ (token by token)
Code Builder (accumulates tokens)
↓ (complete statement)
Pattern Detector
├─ Extract AST
├─ Extract Semantic Patterns
└─ Create N-grams
↓
Markov Model Validator
├─ Check AST transitions
├─ Check semantic transitions
└─ Calculate confidence
↓
Feedback Engine
├─ Report to agent/user
├─ Suggest corrections
└─ Recommend next patterns
Example: Real-Time Detection
def validate_stream(token_stream, model):
    """
    Validate code as it streams from LLM.
    """
    code = ""
    for token in token_stream:
        code += token

        # When we have a complete statement
        if is_complete_statement(code):
            # Extract patterns
            ast_patterns = extract_ast_patterns(code)
            semantic_patterns = extract_semantic_patterns(code)

            # Validate last transition
            if len(semantic_patterns) >= 2:
                prev_pattern = semantic_patterns[-2]
                curr_pattern = semantic_patterns[-1]

                confidence = model.check_transition(
                    prev_pattern,
                    curr_pattern
                )

                if confidence < CONFIDENCE_THRESHOLD:
                    yield {
                        'warning': f"Unusual pattern: {prev_pattern} → {curr_pattern}",
                        'confidence': confidence,
                        'expected': model.top_transitions(prev_pattern),
                    }

    yield {'valid': True, 'code': code}
Use Cases
1. Style Enforcement During Generation
Generated: "for item in items:\n try:"
Markymarkov feedback: loop-try-except pattern (confidence: 0.2)
Alert: "Unusual! After loop-iterate, try-except is uncommon"
"Expected: loop-filter (0.7), loop-transform (0.6)"
Action: Agent chooses to regenerate or accept the warning
2. Early Error Detection
Generated: "def func(): return\n print('unreachable')"
Markymarkov feedback: AST error detected - code after return
Alert: "Invalid AST structure! Code after return statement"
Action: Agent must regenerate
3. Coverage Tracking
As code generates, Markymarkov tracks:
Coverage so far: 45% of patterns matched training data
Current confidence: 0.62
Unique patterns used: 8/23
Alert: "Code is stylistically different from training"
"Consider: Adding more guard clauses, use comprehensions"
4. Guided Refinement
Agent generates initial code:
def process(data):
    results = []
    for item in data:
        try:
            results.append(item.process())
        except:
            pass
    return results
Markymarkov analysis:
Pattern: [init-empty-list, loop-filter, try-except-pass, return-list]
Confidence: 0.45 (valid but unusual combination)
Suggestion: "Consider rewriting with list comprehension?"
[item.process() for item in data if item.valid()]
This pattern has confidence: 0.85
Agent regenerates:
def process(data):
    return [item.process() for item in data if item.valid()]
New confidence: 0.88 ✓
Integration Patterns
Several integration patterns emerge naturally:
Pattern 1: Validation-Only
Agent generates code → Markymarkov validates → Report results
Simplest approach, no feedback loop.
Pattern 2: Temperature-Aware
Agent generates tokens → Markymarkov scores pattern → Adjust temperature
Agent regenerates with new temperature
Enables iterative refinement.
Pattern 3: Logit Bias Steering
Before generation, compute logit biases from Marky
Pass biases to LLM API
Generation steered toward idiomatic patterns from the start.
Pattern 4: Interactive Refinement
Agent generates skeleton code
User reviews with Markymarkov feedback
Agent refines based on confidence scores
Iterate until satisfied.
Pattern 5: Multi-Agent Validation
Agent A generates candidate code
Agent B (validator) uses Markymarkov to score
Agent A refines based on scores
Both agents converge toward idiomatic code.
Implementation Sketch
Here’s a sketch of how integration might look in practice:
from typing import Any, Dict
from models.semantic_model import MarkovCodeGuide
class MarkyGuidedAgent:
"""
Example of Markymarkov integration with an LLM agent.
"""
def __init__(self, marky_model_path: str, llm_client, temperature=0.7):
        self.marky = MarkovCodeGuide.load(marky_model_path)
self.llm = llm_client
self.temperature = temperature
self.generated_code = ""
self.pattern_history = []
def generate_with_guidance(
self,
prompt: str,
max_iterations: int = 3
    ) -> Dict[str, Any]:
"""
Generate code with Markymarkov guidance and iterative refinement.
Workflow:
1. Generate code with current temperature
2. Validate with Marky
3. Adjust temperature based on confidence
4. Regenerate if confidence is too low
5. Repeat until confident or max iterations reached
"""
best_result = None
best_confidence = 0.0
for iteration in range(max_iterations):
# Adjust temperature based on previous iteration's confidence
if iteration > 0:
self.temperature = self._adjust_temperature(best_confidence)
print(f"\n[Iteration {iteration + 1}]")
print(f" Temperature: {self.temperature:.2f}")
# Generate code
code = self._stream_generate(prompt)
# Validate with Marky
validation = self._validate_code(code)
print(f" Generated: {len(code)} chars")
print(f" Confidence: {validation['confidence']:.2f}")
print(f" Coverage: {validation['coverage']:.1%}")
# Store if best so far
if validation['confidence'] > best_confidence:
best_result = {
'code': code,
'validation': validation,
'temperature': self.temperature,
'iteration': iteration,
}
best_confidence = validation['confidence']
# Stop if confident enough
if best_confidence > 0.8:
print(f"\n Confidence threshold reached, stopping.")
break
return best_result
def _stream_generate(self, prompt: str) -> str:
"""Stream token generation from LLM."""
code = ""
response = self.llm.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=self.temperature,
stream=True,
)
for chunk in response:
if chunk.choices[0].delta.content:
token = chunk.choices[0].delta.content
code += token
return code
def _validate_code(self, code: str) -> Dict:
"""Validate with Markymarkov model."""
import ast as ast_module
try:
tree = ast_module.parse(code)
except SyntaxError as e:
return {
'valid': False,
'confidence': 0.0,
'coverage': 0.0,
'error': str(e),
}
# Extract patterns
patterns = self.marky.extract_patterns(code)
# Validate pattern transitions
transitions = []
for i in range(1, len(patterns)):
prev_pattern = patterns[i-1]
curr_pattern = patterns[i]
confidence = self.marky.check_transition(
prev_pattern,
curr_pattern
)
transitions.append({
'prev': prev_pattern,
'curr': curr_pattern,
'confidence': confidence,
})
# Calculate coverage and average confidence
valid_transitions = sum(
1 for t in transitions if t['confidence'] > 0.3
)
coverage = valid_transitions / max(len(transitions), 1)
avg_confidence = (
sum(t['confidence'] for t in transitions)
/ max(len(transitions), 1)
)
return {
'valid': True,
'confidence': avg_confidence,
'coverage': coverage,
'patterns': patterns,
'transitions': transitions,
}
def _adjust_temperature(self, confidence: float) -> float:
"""Dynamically adjust temperature based on Markymarkov confidence."""
if confidence > 0.8:
return 0.3 # Low temp: follow patterns closely
elif confidence > 0.6:
return 0.7 # Medium temp: balanced
elif confidence > 0.4:
return 1.0 # Normal temp: explore more
else:
return 1.5 # High temp: significant exploration
# Example usage:
"""
import openai
agent = MarkyGuidedAgent(
'models/semantic_model.py',
llm_client=openai.Client(),
)
result = agent.generate_with_guidance(
prompt="Write a function to validate email addresses",
max_iterations=3,
)
print(f"\nFinal Result:")
print(f"Code:\n{result['code']}")
print(f"Confidence: {result['validation']['confidence']:.2f}")
print(f"Iteration: {result['iteration'] + 1}")
"""
Design Considerations
- Feedback interval: Per token? Per statement? Per function?
- Multi-file generation: How to maintain context across files?
- Error recovery: Graceful handling of parse errors and edge cases
- Performance: Streaming validation overhead and caching strategies
- API compatibility: Adaptation layer for different LLM providers
Why This Integration Makes Sense
Combining Markymarkov with LLM agents offers several advantages:
- Consistency: Generated code more closely matches project conventions
- Reduced review cycles: Less stylistic feedback needed from humans
- Pattern awareness: Agent learns project patterns over time
- Flexibility maintained: LLM still free to generate novel code when appropriate
- Transparency: Confidence scores make decisions explainable
Open Questions
Some questions worth exploring through implementation:
- How much does temperature/logit biasing improve code quality in practice?
- What’s the performance impact of real-time validation?
- How to handle partial code (incomplete statements) during streaming?
- Can iterative refinement improve patterns over multiple generations?
- How well do confidence scores correlate with actual code quality?
- What’s the optimal balance between enforcement and exploration?
The answers to these questions would come from experimentation and real-world usage.
7. Performance Characteristics
Marky’s performance has been benchmarked against the Python 3.13 standard library (596 files, 20.23 MB of code). These are real-world numbers, not synthetic benchmarks.
Training Performance
Training speed is crucial for iterative development. Markymarkov processes code quickly:
Dataset Characteristics
- Files analyzed: 596 Python files
- Total code size: 20.23 MB (21,210,750 bytes)
- Average file size: 34.8 KB
- Code complexity: Real-world Python stdlib (highly varied patterns)
Training Speed
Both models (AST + Semantic):
Total time: 5.65 seconds
Throughput: 105.6 files/second
Throughput: 3.58 MB/second
Per-file average: 9.5 ms
AST model only:
Total time: 2.60 seconds
Throughput: 229.2 files/second
Throughput: 7.78 MB/second
Semantic model only:
Total time: 2.81 seconds
Throughput: 212.0 files/second
Throughput: 7.19 MB/second
What This Means:
- A 1000-file codebase trains in ~10 seconds
- A 10,000-file codebase trains in ~2 minutes
- Iterative development is practical (retrain frequently)
- CI/CD integration is feasible
Model Sizes
The trained models are compact:
AST model: 537.1 KB
Semantic model: 277.5 KB
Combined: 814.6 KB
Compression ratio: 25.4x (20 MB → 815 KB)
This makes models:
- Easy to version control (sub-1MB files)
- Fast to load (<100ms)
- Negligible memory footprint
- Distributable in repos
Validation Performance
Validation speed determines whether Markymarkov can be used in real-time workflows:
End-to-End Validation (including subprocess startup):
AST Validation:
Median: 216.9 ms
Mean: 218.8 ms
P95: 238.3 ms
Range: 207.1–246.8 ms
Semantic Validation:
Median: 211.4 ms
Mean: 380.0 ms
P95: 1233.5 ms
Range: 187.9–1251.7 ms
Note: Higher variance in semantic validation due to
pattern complexity in different code sections.
What This Means:
- Validation completes in ~200ms for typical files
- Fast enough for save-on-edit workflows
- Acceptable for CI/CD pre-commit hooks
- Could be optimized further with persistent process
The subprocess overhead dominates these times. In a long-running process (like an IDE plugin), validation would be 10-50x faster.
Query Latency
The core model lookup operations are extremely fast:
Model Transition Lookup
Warm lookup (cached in memory):
Mean: <1 microsecond
Median: <1 microsecond
P99: <2 microseconds
Cold lookup (first access):
Mean: ~1-2 microseconds
This is effectively instant. The lookup is just a Python dict access:
confidence = transitions.get((prev_pattern, curr_pattern), {})
# O(1) hash table lookup
At these speeds:
- 1,000,000 lookups/second per core
- No noticeable latency in real applications
- Cache warming unnecessary (already fast enough)
Memory Footprint
Markymarkov is memory-efficient:
Model Loading
AST model in memory: ~1-2 MB (loaded)
Semantic model in memory: ~500 KB - 1 MB (loaded)
Both models: ~2-3 MB total
Python overhead: ~10-20 MB (interpreter)
Total process: ~15-25 MB
Scaling Characteristics
Model size grows sub-linearly with codebase size:
100 files → ~100 KB model
1,000 files → ~800 KB model
10,000 files → ~5 MB model (estimated)
100,000 files → ~30 MB model (estimated)
The sub-linear growth occurs because:
- Pattern vocabulary is bounded (finite number of patterns)
- Transition frequency converges (common patterns dominate)
- Many codebases have repeated idioms
Throughput & Scalability
Training Scalability
Training time scales linearly with codebase size:
O(n) where n = number of AST nodes
Observed:
596 files (20.2 MB, 34.8 KB average) → 5.65 s
Predicted for 6,000 files: ~56s
Predicted for 60,000 files: ~9 minutes
Bottlenecks:
- AST parsing (Python’s ast module)
- Pattern detection (Python visitor pattern)
- File I/O
Not bottlenecks:
- Model building (accumulating counters)
- Serialization (writing Python dicts)
Validation Scalability
Validation time scales with file size:
O(m) where m = lines of code in file
Typical:
100 LOC file: ~50ms
1,000 LOC file: ~200ms
10,000 LOC file: ~2s
For very large files (>5000 LOC), consider:
- Chunked validation (validate functions separately)
- Parallel validation (multiple files in parallel)
- Incremental validation (only changed code)
Parallel Processing
Training is embarrassingly parallel:
Single-threaded: 105.6 files/second
8 cores: ~850 files/second (estimated)
16 cores: ~1,700 files/second (estimated)
The current implementation is single-threaded, but parallelization is straightforward (process files independently, merge results).
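A sketch of that approach using only the standard library: count transitions per file in worker processes, then merge the resulting Counter objects. The per-file extractor mirrors the AST sketch shown earlier and is not the actual Markymarkov trainer:

import ast
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def count_file_transitions(path: str) -> Counter:
    """Per-file work: parse and count (parent, child) AST transitions."""
    counts = Counter()
    try:
        tree = ast.parse(Path(path).read_text(encoding='utf-8'))
    except (SyntaxError, UnicodeDecodeError, ValueError):
        return counts                      # skip files that don't parse
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            counts[(type(parent).__name__, type(child).__name__)] += 1
    return counts

def train_parallel(root: str, workers: int = 8) -> Counter:
    """Map files across worker processes, then merge the independent counts."""
    files = [str(p) for p in Path(root).rglob('*.py')]
    merged = Counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for counts in pool.map(count_file_transitions, files, chunksize=16):
            merged.update(counts)
    return merged

if __name__ == '__main__':
    print(train_parallel('/usr/lib/python3.13', workers=8).most_common(5))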
Real-World Performance Examples
Small Project (50 files, 2 MB)
Training: ~0.5 seconds
Model size: ~100 KB
Validation: ~200ms per file
Total setup: <1 second
Medium Project (500 files, 20 MB)
Training: ~5 seconds
Model size: ~800 KB
Validation: ~200ms per file
Total setup: ~5 seconds
Large Project (5,000 files, 200 MB)
Training: ~50 seconds
Model size: ~5 MB
Validation: ~200ms per file
Total setup: ~1 minute
Very Large Project (50,000 files, 2 GB)
Training: ~8 minutes (estimated)
Model size: ~30 MB (estimated)
Validation: ~200ms per file
Total setup: ~10 minutes (one-time)
Optimizations & Tuning
Training Speed
To optimize training:
1. Use faster disk I/O (SSD vs HDD: 2-3x improvement)
2. Parallelize file processing (linear speedup)
3. Use lower n-gram order (order=1 is 2x faster)
4. Filter irrelevant files (tests, generated code)
5. Cache parsed ASTs (if re-training frequently)
Validation Speed
To optimize validation:
1. Run in persistent process (avoid subprocess overhead)
2. Pre-load models at startup (100ms → 1ms validation)
3. Validate incrementally (only changed code)
4. Use AST cache (if validating same file repeatedly)
5. Parallel validation (multiple files)
Memory Usage
To reduce memory:
1. Use lower n-gram order (order=1: 50% less memory)
2. Prune rare transitions (threshold=2: 30% less memory)
3. Quantize probabilities (float32 → int8: 75% less memory)
4. Stream model loading (don't load all at once)
Comparison with Alternatives
vs. Linters (pylint, flake8)
Markymarkov training: 5.6s for 596 files
Pylint: ~60s for 596 files (10x slower)
Flake8: ~15s for 596 files (3x slower)
Markymarkov validation: ~200ms per file
Pylint: ~300ms per file
Flake8: ~100ms per file
Advantage: Markymarkov learns from your code, not fixed rules
Trade-off: Markymarkov requires initial training step
vs. Type Checkers (mypy)
Markymarkov validation: ~200ms per file
Mypy: ~500ms per file (cold), ~100ms (warm)
Advantage: Markymarkov checks patterns, not types (complementary)
Trade-off: Different problem domain
vs. Deep Learning Models
Markymarkov training: 5.6s for 596 files
CodeBERT training: Hours/days
GPT fine-tuning: Days/weeks
Markymarkov inference: <1ms per lookup
CodeBERT inference: ~100ms per prediction
GPT inference: ~1s per completion
Advantage: Markymarkov is 100-1000x faster
Trade-off: Markymarkov doesn't understand semantics, only patterns
Performance Recommendations
For best performance in different scenarios:
Development/IDE Integration
- Use persistent process (avoid subprocess overhead)
- Pre-load models at startup
- Validate on save (200ms is acceptable)
- Update models nightly (don't retrain on every change)
CI/CD Pipelines
- Train models in dedicated step (cache for reuse)
- Validate changed files only (not entire codebase)
- Run in parallel (one validation per core)
- Fail fast (stop on first validation error)
Code Review Tools
- Load models once (per review session)
- Validate diffs only (not unchanged code)
- Show confidence scores (help reviewers prioritize)
- Cache results (same file validated multiple times)
Large-Scale Analysis
- Parallelize training (split files across workers)
- Stream processing (don't load all files at once)
- Sample validation (validate subset for estimates)
- Incremental updates (retrain only changed modules)
Benchmark Methodology
These numbers were obtained using:
- Hardware: x86_64 Linux system - 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
- Python: 3.13
- Dataset: Python 3.13 standard library (/usr/lib/python3.13)
- Tool version: Markymarkov current git HEAD
- Method:
- Training: Single run (warm filesystem cache)
- Validation: 50 iterations, median reported
- Lookup: 10,000 iterations, median reported
Benchmark code is available in the repository (benchmark.py and benchmark_summary.py).
Key Takeaways
✓ Fast training: 100+ files/second, practical for daily retraining
✓ Compact models: 25x compression, <1 MB for typical projects
✓ Quick validation: ~200ms per file including overhead
✓ Instant lookups: <1 microsecond for model queries
✓ Linear scaling: Performance predictable as codebase grows
✓ Memory efficient: 2-3 MB for loaded models
✓ Production-ready: Performance suitable for real-time use
Marky’s performance characteristics make it viable for:
- Interactive development workflows (IDE plugins)
- Automated code review (CI/CD integration)
- Large-scale code analysis (batch processing)
- Real-time code generation guidance (LLM integration)
8. Use Cases
Marky’s pattern-learning approach enables a variety of practical applications, from code quality enforcement to training data analysis. Here are detailed scenarios showing how Markymarkov solves real problems.
Code Quality Assurance
Problem: Teams need to ensure generated or contributed code follows project conventions, but manual review doesn’t scale and traditional linters only check syntax.
How Markymarkov Helps:
Markymarkov learns what “quality code” looks like from your existing codebase, then validates new code against those patterns.
Example Workflow:
# 1. Train on your production codebase
uv run markymarkov train src/ models/ --model-type both
# 2. Validate new code during PR review
uv run markymarkov validate models/semantic_model.py new_feature.py
# Output shows:
# - Confidence score (0.0-1.0)
# - Which patterns match your codebase
# - Which patterns are unusual
# - Suggestions for improvement
Real-World Scenario:
A team maintaining a Flask application trains Markymarkov on their existing routes, models, and utilities:
# Existing codebase pattern: Always validate input early
@app.route('/user/<user_id>')
def get_user(user_id):
    if not user_id:
        return jsonify({'error': 'Missing user_id'}), 400
    user = db.query(User).get(user_id)
    if not user:
        return jsonify({'error': 'User not found'}), 404
    return jsonify(user.to_dict())
When a new developer submits code without input validation:
@app.route('/product/<product_id>')
def get_product(product_id):
    product = db.query(Product).get(product_id)
    return jsonify(product.to_dict())
Markymarkov flags this:
Validation Result:
Confidence: 0.32 (Low - unusual pattern)
Missing expected patterns:
- guard-clause (expected after function-transformer: 0.85 confidence)
- if-empty-check (common in route handlers: 0.78 confidence)
Suggestion: Add input validation (guard clause pattern)
Benefits:
- Catches style violations automatically
- Provides objective feedback (“doesn’t match our patterns”)
- Reduces review time (reviewers focus on logic, not style)
- Teaches new developers team conventions
Style Enforcement
Problem: Different developers write code differently. Some use list comprehensions, others use loops. Some use guard clauses, others nest conditionals. You want consistency.
How Markymarkov Helps:
Markymarkov identifies the dominant patterns in your codebase and flags deviations. This creates a “style fingerprint” unique to your project.
Example Workflow:
# Train on well-styled reference code
uv run markymarkov train examples/good_style/ models/
# Validate new contributions
uv run markymarkov validate models/semantic_model.py contribution.py
# Integrate into pre-commit hook
# .git/hooks/pre-commit:
#!/bin/bash
for file in $(git diff --cached --name-only --diff-filter=ACM | grep '\.py$'); do
uv run markymarkov validate models/semantic_model.py "$file"
if [ $? -ne 0 ]; then
echo "Style validation failed for $file"
exit 1
fi
done
Real-World Scenario:
Your team prefers comprehensions over manual loops:
# Preferred style (80% of codebase)
result = [item.transform() for item in items if item.valid]
# Less preferred (20% of codebase)
result = []
for item in items:
    if item.valid:
        result.append(item.transform())
New code using manual loops gets flagged:
Pattern detected: loop-accumulate
Confidence: 0.25 (Low for this context)
Alternative suggestion:
list-comprehension has 0.82 confidence in this context
Consider: [item.transform() for item in items if item.valid]
Configuration Options:
You can tune enforcement strictness:
# Strict mode: Reject if confidence < 0.7
if confidence < 0.7:
    reject_code()

# Permissive mode: Warn if confidence < 0.5, reject if < 0.3
if confidence < 0.3:
    reject_code()
elif confidence < 0.5:
    warn_developer()

# Learning mode: Always accept, but log unusual patterns
log_pattern_stats(code, confidence)
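If you want a single entry point for these modes, a minimal sketch might look like the following; the apply_policy helper and its thresholds are illustrative, not part of Markymarkov:
def apply_policy(confidence: float, mode: str = "permissive") -> str:
    """Map a Markymarkov confidence score to an action.
    Thresholds are illustrative; tune them to your codebase."""
    if mode == "strict":
        return "reject" if confidence < 0.7 else "accept"
    if mode == "permissive":
        if confidence < 0.3:
            return "reject"
        if confidence < 0.5:
            return "warn"
        return "accept"
    # Learning mode: always accept; callers can log the score separately
    return "accept"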
Benefits:
- Enforces team style automatically
- Adapts to your codebase (not generic rules)
- Provides constructive feedback
- Reduces style debates (data-driven)
Training Data Analysis
Problem: You want to understand what patterns dominate your codebase, identify inconsistencies, or prepare data for model training.
How Markymarkov Helps:
Markymarkov extracts and quantifies patterns, giving you insights into code characteristics.
Example Workflow:
# Train model
uv run markymarkov train codebase/ analysis_models/
# View statistics
uv run markymarkov stats analysis_models/semantic_model.py
# Output shows:
# - Most common patterns
# - Pattern transition probabilities
# - Unusual pattern combinations
Real-World Scenario:
Analyzing a legacy codebase before refactoring:
$ uv run markymarkov stats analysis_models/semantic_model.py
Model Statistics:
Total transitions: 1,247
Unique patterns: 31
Most common patterns:
1. function-transformer (892 occurrences, 71.5%)
2. guard-clause (423 occurrences, 33.9%)
3. return-computed (389 occurrences, 31.2%)
4. loop-filter (276 occurrences, 22.1%)
5. try-except-pass (198 occurrences, 15.9%)
Unusual patterns (low confidence):
- try-except-pass after loop-filter (0.12 confidence)
→ Suggests error handling inside loops (potential issue)
- nested-if-else chains (0.08 confidence)
→ Only 8% of code uses this (refactoring candidate)
Pattern diversity: 0.62 (moderate - some patterns dominate)
From this analysis, you learn:
- 72% of functions are transformers (good for functional style)
- 34% use guard clauses (decent defensive programming)
- 16% silently swallow exceptions (tech debt!)
- Nested ifs are rare (codebase is generally clean)
Use Cases:
- Pre-refactoring analysis (identify problem areas)
- Code review focus (prioritize unusual patterns)
- Training data curation (find representative examples)
- Team onboarding (show “how we write code”)
Advanced Analysis:
Compare patterns across different modules:
# Train separate models
uv run markymarkov train src/api/ models/api_model.py
uv run markymarkov train src/db/ models/db_model.py
uv run markymarkov train src/utils/ models/utils_model.py
# Compare statistics
python compare_models.py models/api_model.py models/db_model.py
# Output:
# API layer: Heavy use of guard clauses (0.82)
# DB layer: Heavy use of context managers (0.71)
# Utils: Heavy use of list comprehensions (0.68)
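compare_models.py is not part of Markymarkov itself; a minimal sketch, assuming the `markymarkov stats` output format shown earlier, can shell out to the CLI and print each model's headline patterns side by side:
import subprocess
import sys

def top_patterns(model_path):
    """Run `markymarkov stats` and return the lines under 'Most common patterns:'."""
    output = subprocess.run(
        ["uv", "run", "markymarkov", "stats", model_path],
        capture_output=True, text=True, check=True,
    ).stdout
    lines = output.splitlines()
    start = next((i + 1 for i, line in enumerate(lines) if "Most common patterns" in line), None)
    if start is None:
        return []
    patterns = []
    for line in lines[start:]:
        if not line.strip() or not line.strip()[0].isdigit():
            break
        patterns.append(line.strip())
    return patterns

if __name__ == "__main__":
    for model in sys.argv[1:]:
        print(f"== {model} ==")
        for pattern in top_patterns(model):
            print(f"  {pattern}")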
Benefits:
- Objective codebase metrics
- Identify tech debt patterns
- Understand team coding style
- Prepare for refactoring efforts
Model-Driven Code Generation
Problem: You want LLMs to generate code that matches your project’s style, not generic Python.
How Markymarkov Helps:
Train Markymarkov on your codebase, then use it to guide or validate LLM-generated code.
Example Workflow:
import subprocess

# 1. Train on your codebase
subprocess.run(["uv", "run", "python", "-m", "src", "train", "src/", "models/"])

# 2. Generate code with an LLM (generate_code_with_llm, validate_code, and
#    get_expected_patterns are illustrative helpers from your own pipeline)
original_prompt = "Write a user validation function"
llm_code = generate_code_with_llm(prompt=original_prompt)

# 3. Validate with Markymarkov
validation = validate_code(llm_code, "models/semantic_model.py")

# 4. If confidence is low, regenerate with hints
if validation['confidence'] < 0.6:
    hints = get_expected_patterns(validation)
    llm_code = generate_code_with_llm(
        prompt=f"{original_prompt}\n\nUse these patterns: {hints}"
    )
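The validate_code and get_expected_patterns helpers above are not part of Markymarkov's public API; one hedged sketch wraps the CLI and parses the "Confidence:" line from its report, assuming the output format shown in this document:
import re
import subprocess
import tempfile

def validate_code(code: str, model_path: str) -> dict:
    """Write generated code to a temp file, run `markymarkov validate`,
    and extract the confidence score from its report."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(code)
        path = tmp.name
    result = subprocess.run(
        ["uv", "run", "markymarkov", "validate", model_path, path],
        capture_output=True, text=True,
    )
    match = re.search(r"Confidence:\s*([0-9.]+)", result.stdout)
    return {
        "confidence": float(match.group(1)) if match else 0.0,
        "report": result.stdout,
    }

def get_expected_patterns(validation: dict) -> list[str]:
    """Pull pattern names out of '- <pattern> (0.xx confidence)' report lines."""
    return re.findall(r"-\s+(.+?)\s+\(0\.", validation["report"])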
Real-World Scenario:
Your project uses specific error handling patterns:
# Your codebase pattern
def fetch_user(user_id):
    try:
        return db.get_user(user_id)
    except NotFoundError:
        logger.warning(f"User {user_id} not found")
        return None
    except DatabaseError as e:
        logger.error(f"Database error: {e}")
        raise
LLM generates generic code:
def fetch_product(product_id):
    try:
        return db.get_product(product_id)
    except Exception:
        return None
Markymarkov validation:
Confidence: 0.23 (Very low)
Issues:
- Catching generic Exception (your code uses specific exceptions)
- No logging (your code always logs errors)
- Silent failure (your code raises on critical errors)
Expected patterns:
- try-except with specific exception types (0.89 confidence)
- logging-call in except blocks (0.82 confidence)
- try-except-reraise for critical errors (0.71 confidence)
You regenerate with hints:
# Improved LLM output (after hints)
def fetch_product(product_id):
    try:
        return db.get_product(product_id)
    except NotFoundError:
        logger.warning(f"Product {product_id} not found")
        return None
    except DatabaseError as e:
        logger.error(f"Database error: {e}")
        raise
New confidence: 0.87 ✓
Benefits:
- LLM output matches your style
- Reduces post-generation cleanup
- Iterative improvement through feedback
- Project-specific code generation
Identifying Code Anomalies
Problem: You suspect certain code sections are unusual or buggy, but need objective evidence.
How Markymarkov Helps:
Low-confidence patterns indicate code that doesn’t match typical project style. This can surface bugs, anti-patterns, or quick hacks.
Example Workflow:
# Train on healthy codebase
uv run markymarkov train src/ models/ --exclude tests/
# Validate suspicious file
uv run markymarkov validate models/semantic_model.py suspicious.py
# Look for low-confidence patterns
Real-World Scenario:
A bug report comes in for a function. Markymarkov validation shows:
def process_payment(amount, user):
    result = charge_card(user.card, amount)
    if result:
        return result
    update_balance(user, amount)  # ← This line is unusual
    return {"status": "success"}
Markymarkov output:
Validation Result:
Confidence: 0.18 (Very low)
Unusual patterns:
Line 4: Unreachable code after return (0.02 confidence)
Expected: All code paths should be reachable
Similar functions in codebase:
- process_refund: Always checks result before proceeding (0.91 confidence)
- process_subscription: Uses guard clauses (0.87 confidence)
The bug: when charge_card succeeds, the early return means update_balance is never called. Markymarkov flagged it as unusual because similar functions in the codebase don’t follow this pattern.
Finding Tech Debt:
Run validation on entire codebase and sort by confidence:
# Validate all files, keeping each filename next to its confidence score
shopt -s globstar
for file in src/**/*.py; do
    echo "$file: $(uv run markymarkov validate models/semantic_model.py "$file" | grep 'Confidence:')" >> results.txt
done
# Sort by confidence (low = suspicious); the score is the third ':'-separated field
sort -t: -k3 -n results.txt
# Output:
# src/legacy/utils.py: Confidence: 0.12
# src/old_api/handlers.py: Confidence: 0.19
# src/deprecated/auth.py: Confidence: 0.23
# ...
Files with very low confidence often indicate:
- Legacy code (written before standards were established)
- Quick hacks (rushed implementation)
- Copy-pasted from external sources
- Potential bugs or anti-patterns
Benefits:
- Objective code quality metric
- Find bugs before they manifest
- Prioritize refactoring efforts
- Code review efficiency
Pre-Commit Hooks / CI Integration
Problem: You want to catch style issues before code reaches review.
How Markymarkov Helps:
Integrate Markymarkov into your development workflow to validate code automatically.
Example: Pre-Commit Hook
#!/bin/bash
# .git/hooks/pre-commit
THRESHOLD=0.5  # Minimum acceptable confidence

for file in $(git diff --cached --name-only --diff-filter=ACM | grep '\.py$'); do
    result=$(uv run markymarkov validate models/semantic_model.py "$file" 2>&1)
    confidence=$(echo "$result" | grep "Confidence:" | awk '{print $2}')
    if (( $(echo "$confidence < $THRESHOLD" | bc -l) )); then
        echo "❌ $file: Low confidence ($confidence)"
        echo "$result"
        exit 1
    else
        echo "✓ $file: Passed ($confidence)"
    fi
done

echo "All files passed Markymarkov validation!"
Example: GitHub Actions
name: Code Style Validation
on: [pull_request]

jobs:
  marky-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0  # fetch full history so origin/main is available for the diff below
      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.13'
      - name: Install Marky
        run: |
          pip install uv
          uv sync
      - name: Train model (or use cached)
        run: |
          if [ ! -f models/semantic_model.py ]; then
            uv run markymarkov train src/ models/
          fi
      - name: Validate changed files
        run: |
          for file in $(git diff --name-only origin/main...HEAD | grep '\.py$'); do
            uv run markymarkov validate models/semantic_model.py "$file"
          done
Benefits:
- Catch issues early (before review)
- Consistent enforcement
- Automated workflow
- Fast feedback loop
Code Migration / Refactoring Assistance
Problem: Migrating from one style to another (e.g., callbacks → async/await, classes → functional).
How Markymarkov Helps:
Train one model on old style, another on new style, then validate migrations.
Example Workflow:
# Train on old codebase (before refactor)
uv run markymarkov train old_src/ models/old_style.py
# Train on target examples (new style)
uv run markymarkov train examples/new_style/ models/new_style.py
# During migration, validate against new style
uv run markymarkov validate models/new_style.py refactored_file.py
Real-World Scenario:
Migrating from synchronous to async code:
# Old style (callback-based)
def fetch_user(user_id, callback):
    result = db.get_user(user_id)
    callback(result)

# New style (async/await)
async def fetch_user(user_id):
    result = await db.get_user(user_id)
    return result
Markymarkov trained on new style flags incomplete migrations:
# Partially migrated (still has callback)
async def fetch_product(product_id, callback):
    result = await db.get_product(product_id)
    callback(result)  # ← Should return instead
Validation:
Confidence: 0.31 (Low - mixed patterns)
Issues:
- Async function with callback parameter (not in new style)
- Expected: async functions return values (0.94 confidence in new_style.py)
Suggestion: Remove callback, use return statement
Benefits:
- Consistency during migrations
- Catch incomplete refactors
- Learn from target examples
- Objective migration progress tracking
Onboarding New Developers
Problem: New team members need to learn project conventions quickly.
How Markymarkov Helps:
Use Markymarkov statistics to show “how we write code here” with concrete examples.
Example Workflow:
# Generate onboarding report
uv run markymarkov stats models/semantic_model.py > ONBOARDING.md
# Show examples of each pattern
python extract_pattern_examples.py models/semantic_model.py src/
Onboarding Document Generated:
# Our Code Style (Generated from Codebase Analysis)
## Most Common Patterns
### 1. Guard Clauses (85% of functions)
We prefer early returns for invalid input:
\`\`\`python
# ✓ Our style
def process(data):
    if not data:
        return None
    if not data.valid:
        return None
    return data.process()

# ✗ Avoid
def process(data):
    if data and data.valid:
        return data.process()
    else:
        return None
\`\`\`
### 2. List Comprehensions (73% of loops)
Use comprehensions for transformations:
\`\`\`python
# ✓ Our style
results = [item.transform() for item in items if item.valid]
# ✗ Avoid
results = []
for item in items:
    if item.valid:
        results.append(item.transform())
\`\`\`
### Rare Patterns (Avoid Unless Necessary)
- try-except-pass (only 3% of code)
- nested-if-else (only 5% of code)
- while loops (only 8% of code)
Benefits:
- Data-driven onboarding
- Clear examples from actual code
- New developers write idiomatic code faster
- Reduces “that’s not how we do it” feedback
Documentation Generation
Problem: Code style guides are often out of date or incomplete.
How Markymarkov Helps:
Generate living documentation from actual code patterns.
# Auto-generate style guide
python generate_style_guide.py models/semantic_model.py > STYLE_GUIDE.md
This creates documentation that’s:
- Always up-to-date (regenerate from latest model)
- Based on actual code (not aspirational)
- Quantified (shows pattern prevalence)
- Example-rich (extracted from codebase)
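generate_style_guide.py is left to the reader; a minimal sketch, which wraps the `markymarkov stats` output in Markdown rather than touching any internal model API, could look like this:
import subprocess
import sys
from datetime import date

def main(model_path):
    stats = subprocess.run(
        ["uv", "run", "markymarkov", "stats", model_path],
        capture_output=True, text=True, check=True,
    ).stdout
    print(f"# Our Code Style (generated {date.today()} from {model_path})")
    print()
    print("Regenerate this file after retraining; do not edit by hand.")
    print()
    for line in stats.rstrip().splitlines():
        print("    " + line)  # indent so Markdown renders the stats as a code block

if __name__ == "__main__":
    main(sys.argv[1])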
Benefits:
- Living documentation
- No manual maintenance
- Objective style standards
- Easy to update
Research / Code Analysis
Problem: Understanding code evolution, pattern trends, or comparative analysis.
How Markymarkov Helps:
Train models on different versions or projects to study differences.
Example Workflow:
# Train on multiple versions
uv run markymarkov train v1.0/ models/v1.py
uv run markymarkov train v2.0/ models/v2.py
uv run markymarkov train v3.0/ models/v3.py
# Compare patterns across versions
python compare_evolution.py models/v*.py
Research Questions Answered:
- How did coding style evolve over time?
- What patterns were added/removed between versions?
- Do different teams have different style fingerprints?
- What patterns correlate with bugs?
Benefits:
- Empirical code analysis
- Track style evolution
- Comparative studies
- Research publications
Summary of Use Cases
| Use Case | Benefit | Typical Users |
|---|---|---|
| Code Quality Assurance | Automated style validation | All developers |
| Style Enforcement | Consistent codebase | Tech leads, DevOps |
| Training Data Analysis | Understand patterns | Data scientists, researchers |
| Model-Driven Generation | Better LLM output | AI/ML teams |
| Anomaly Detection | Find bugs early | QA, code reviewers |
| CI/CD Integration | Automated checks | DevOps, platform teams |
| Refactoring Assistance | Migration validation | Senior developers |
| Developer Onboarding | Learn conventions | New team members |
| Documentation | Living style guides | Tech writers, leads |
| Research | Code pattern studies | Researchers, academics |
All these use cases leverage the same core capability: learning patterns from code and validating new code against those patterns. The flexibility of Markymarkov’s approach makes it valuable across the entire development lifecycle.
9. Getting Started
You can run Markymarkov directly from the main branch:
uvx --from git+https://github.com/roobie/markymarkov markymarkov
# You can alias the above
alias marky="uvx --from git+https://github.com/roobie/markymarkov markymarkov"
Installation & Setup
git clone https://github.com/roobie/markymarkov
cd markymarkov
uv sync
Training Your First Model
# Train on your codebase
uv run markymarkov train /path/to/your/code models/
# Or with a specific model type and chain order
uv run markymarkov train /path/to/code models/ --model-type semantic --order 2
Running Validation
# Validate a file
uv run markymarkov validate models/semantic_model.py your_file.py
# See statistics
uv run markymarkov stats models/semantic_model.py
# Try the demo
uv run markymarkov demo
Integration Patterns
- Load model as Python module
- Call MarkovCodeGuide.suggest_next_nodes() (see the sketch below)
- Use with streaming code validator
- Integrate with LLM inference pipeline
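For the first two items, here is a minimal sketch of loading an exported model and querying it; the constructor arguments and the exact signature of suggest_next_nodes() are assumptions, so check the generated module:
import importlib.util

# Load the exported model file as an ordinary Python module
spec = importlib.util.spec_from_file_location("semantic_model", "models/semantic_model.py")
semantic_model = importlib.util.module_from_spec(spec)
spec.loader.exec_module(semantic_model)

# Ask which patterns typically follow the current one; the pattern name
# and return shape here are illustrative
guide = semantic_model.MarkovCodeGuide()
print(guide.suggest_next_nodes("guard-clause"))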
10. Conclusion
Markymarkov shows that small, interpretable probabilistic models add high value to code workflows. By learning patterns from your own codebase, Markymarkov provides fast, explainable guidance that complements existing tools rather than replacing them.
Key advantages:
- Fast to train and load, compact to store
- Interpretable confidence scores for actionable feedback
- Flexible integration points: CI, IDE, LLM workflows
If you care about consistency, developer productivity, and explainability, try training Markymarkov on a representative subset of your code and see what it highlights.