
MarkyMarkov - Markov Chain-Based Code Guidance for LLM Agents and Humans Alike

Björn Roberg, Claude


1. Introduction

What is Markymarkov?

Markymarkov is a Markov Chain-based code guidance system designed to help LLM agents generate better code. Unlike traditional linters or type checkers that enforce fixed rules, Markymarkov learns patterns directly from your codebase and uses those patterns to validate and guide code generation in real-time.

At its core, Markymarkov operates on a simple but powerful principle: code is sequential, and code patterns are learnable. By analyzing existing code, it builds probabilistic models of what patterns typically follow other patterns—both at the syntactic level (how code is structured) and at the semantic level (what code idioms are preferred).

Key characteristics:

The Problem It Solves

Modern LLM-based code generation has created a paradox: generated code is often syntactically correct but does not always follow stylistic idioms or algorithmic patterns.

Consider this scenario:

This is the gap Markymarkov addresses. Today’s development teams face several related challenges:

  1. Style Enforcement at Scale

    • Teams want consistency across codebases
    • Manual code review can’t catch every style deviation
    • Linters only check static rules, not learned patterns
    • No way to programmatically enforce “this is how we write code”
  2. LLM Code Generation Quality

    • LLMs generate valid code, but not idiomatic code
    • Agents can’t distinguish between “correct” and “our style”
    • Temperature and sampling can’t capture organizational conventions
    • Training on diverse data means diverse output
  3. Validation Gaps

    • Linters are rule-based (hard to maintain)
    • Type checkers focus on types, not patterns
    • AST visitors can find structure but miss intent
    • No standard way to validate “does this follow our patterns?”
  4. Training Data Analysis

    • Hard to understand what patterns dominate your codebase
    • Difficult to identify anomalies or tech debt
    • No way to measure consistency across teams
    • Can’t extract “what makes our code unique?”

Markymarkov solves these by learning patterns from your code and providing:

Why Markov Chains for Code?

You might wonder: “Why not use deep learning, LSTMs, or transformers?” The answer reveals important truths about code and pattern matching.

  1. Code is Inherently Sequential

    • Code flows through AST traversal order
    • Control flow follows predictable paths
    • Pattern composition is chain-like
    • Markov chains were designed for exactly this kind of sequence modeling
  2. The Markov Property Holds for Code

    • “The next pattern depends only on the current pattern” ✓
    • Previous history can be captured in n-gram state
    • Two-state or three-state context is often sufficient
    • Rare cases benefit from higher-order models
  3. Computational Efficiency

    • Training: O(n) pass through codebase
    • Inference: O(1) hash table lookup
    • Deep learning: GPU-heavy, requires infrastructure
    • Marky: CPU-friendly, runs anywhere
    • 1000 files/minute training speed (AST model)
    • <1ms query latency (cached lookups)
  4. Interpretability

    • Deep learning: “Why did it choose this?” → Inscrutable
    • Markov chains: “What’s the probability?” → Auditable
    • Can explain: “After X pattern, Y is expected 85% of the time”
    • Developers can reason about confidence scores
    • No black box; built on transparent math
  5. Data Efficiency

    • Deep learning: Needs massive datasets
    • Markov chains: Work well with codebases of any size
    • Effective with 100+ files
    • Scales gracefully to 100K+ files
    • Graceful degradation (unknown patterns still provide value)
  6. Integration with LLMs

    • LLMs already generate token sequences
    • Markov models provide per-token guidance
    • Natural fit for token-by-token validation
    • Easy to integrate into sampling/generation loops
    • No need to retrain LLM; validate output instead

Real-World Example: Why This Matters

Imagine your team uses these patterns:

An LLM might generate perfectly valid alternatives:

Markymarkov learns these preferences from your codebase and can:

  1. During generation: Guide the LLM toward idiomatic patterns
  2. During validation: Flag deviations with confidence scores
  3. During review: Explain why code is unexpected
  4. For training: Help fine-tune models on your patterns

The Markymarkov Approach

Rather than debating “right” vs. “wrong” code style, Markymarkov asks: “What does this codebase do?” and “Does this code match those patterns?”

This shifts the conversation from:

The result: objective, data-driven style validation that teams can understand and trust.

2. Core Concept

Two-Level Architecture (AST + Semantic)

Marky’s core innovation is its two-level validation architecture. This dual approach gives you the best of both worlds: structural correctness and stylistic idiomaticity.

Level 1: AST Patterns (Syntactic Correctness)

Level 2: Semantic Patterns (Style & Idioms)

Why Two Levels?

The two-level approach solves a fundamental problem: syntax and style are orthogonal concerns.

Consider this example: Both of these are syntactically valid:

# Style A: Nested conditionals
def validate(items):
    if items:
        if len(items) > 0:
            for item in items:
                if item.valid:
                    process(item)
            return True
    return False

# Style B: Guard clauses
def validate(items):
    if not items:
        return False

    valid_items = [item for item in items if item.valid]
    process_all(valid_items)
    return True

Both parse successfully. Both produce valid ASTs. But they’re very different in:

AST alone can’t distinguish these. Both have valid structures. You need semantic analysis to capture the how and why of code organization.

Conversely, semantic analysis alone can’t catch structure errors. A semantic pattern might be recognized, but applied to syntactically invalid code. AST validates the foundation.

Together, they form a powerful validation layer:

How Markov Models Learn Code Patterns

The process of learning happens in three stages:

Stage 1: Code Extraction

Input: Python codebase (100s-1000s of files)
├─ Parse each file to AST
├─ Extract semantic patterns (52 high-level idioms)
└─ Output: Lists of patterns and AST sequences

For example, a function might produce:

AST sequence: [Module, FunctionDef, FunctionDef, Return, Return]
Semantic sequence: [function-transformer, guard-clause, return-list, return-none]
Location tracking: [(line 10, col 4), (line 12, col 8), ...]

Stage 2: N-gram Creation The sequences are converted to n-grams (chains of N consecutive states):

For order=2 with semantic patterns:

Sequence: [function-transformer, guard-clause, return-list, return-none]
2-grams:
  - (function-transformer, guard-clause) → next is return-list
  - (guard-clause, return-list) → next is return-none

Each n-gram becomes a key in a frequency table:

{
  ('function-transformer', 'guard-clause'): {
    'return-list': 3,
    'return-none': 1,
    'string-format': 1,
  },
  ('guard-clause', 'return-list'): {
    'return-none': 5,
  },
}

Stage 3: Probability Calculation From frequencies, we calculate probabilities:

P(return-list | function-transformer, guard-clause) = 3 / (3+1+1) = 0.60
P(return-none | function-transformer, guard-clause) = 1 / 5 = 0.20
P(string-format | function-transformer, guard-clause) = 1 / 5 = 0.20

These probabilities become confidence scores during validation. A probability of 0.60 means: “In training data, when we see this context, the next state was return-list 60% of the time.”
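
To make Stages 2 and 3 concrete, here is a minimal sketch (not the project's actual implementation) that builds order-2 transition frequencies from a pattern sequence and normalizes them into probabilities:

from collections import Counter, defaultdict

def build_transitions(sequence, order=2):
    """Count (context -> next) occurrences, then normalize to probabilities."""
    counts = defaultdict(Counter)
    for i in range(len(sequence) - order):
        context = tuple(sequence[i:i + order])      # e.g. ('function-transformer', 'guard-clause')
        counts[context][sequence[i + order]] += 1   # the state that followed this context

    # P(next | context) = count / total count for that context
    return {
        context: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for context, nexts in counts.items()
    }

seq = ['function-transformer', 'guard-clause', 'return-list', 'return-none']
print(build_transitions(seq))
# {('function-transformer', 'guard-clause'): {'return-list': 1.0},
#  ('guard-clause', 'return-list'): {'return-none': 1.0}}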

Transition Example

Let’s trace through a real example. Given training code:

def process_data(items):
    if not items:                    # ← guard-clause pattern detected
        return None                   # ← return-none pattern

    results = [x.transform() for x in items]  # ← return-computed pattern
    return results                    # ← return-list pattern

Semantic patterns extracted: [guard-clause, return-none, return-computed, return-list]

2-grams learned:

guard-clause → return-none (confidence: 0.8)
return-none → return-computed (confidence: 0.3)
return-computed → return-list (confidence: 0.9)

Later, when validating generated code with sequence [guard-clause, return-none, return-computed, return-list]:

Result: Valid code, moderate-to-high confidence.

From Training to Deployment

The journey from your codebase to real-time validation has four steps:

Step 1: Training (Offline, One-Time)

$ markymarkov train /path/to/codebase models/

Step 2: Export (One-Time)

# Models are automatically exported as:
# models/ast_model.py
# models/semantic_model.py

Each is a Python file containing:

Example structure:

# semantic_model.py
TRANSITIONS = {
    ('guard-clause', 'return-none'): {
        'return-computed': 0.8,
        'return-list': 0.15,
        'function-transformer': 0.05,
    },
    # ... 100s more transitions ...
}

MODEL_METADATA = {
    'order': 2,
    'total_transitions': 847,
    'unique_patterns': 23,
}

Step 3: Loading (Startup, <100ms)

from models.semantic_model import TRANSITIONS, MODEL_METADATA

model = MarkovCodeGuide.from_table(TRANSITIONS, MODEL_METADATA)

Step 4: Validation (Real-Time, <1ms per query)

# During code generation
confidence = model.check_transition(current_pattern, next_pattern)
# confidence = 0.8 (80% match to training data)
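
Given the exported TRANSITIONS table shown in Step 2, the lookup behind such a check could be as simple as the following sketch (the real check_transition method may differ; here the context is the 2-tuple key of the table):

def check_transition(transitions, context, next_pattern):
    """Return P(next_pattern | context) seen in training, or 0.0 if never seen."""
    return transitions.get(context, {}).get(next_pattern, 0.0)

# Using the example table from Step 2:
confidence = check_transition(TRANSITIONS, ('guard-clause', 'return-none'), 'return-computed')
# confidence == 0.8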

The Deployment Advantage

Notice that once trained, Markymarkov needs:

This makes Marky:

The Full Pipeline

Putting it together:

TRAINING PHASE (Offline, One-Time)
├─ Codebase → Parser → AST/Semantic Patterns
├─ Patterns → N-gram Creator → Transition Frequencies
├─ Frequencies → Probability Calculator → Confidence Scores
└─ Export → Python Module (models/semantic_model.py)

INFERENCE PHASE (Online, Real-Time)
├─ Load Model (import models/semantic_model.py)
├─ Generated Code → Pattern Extractor → Pattern Sequence
├─ Sequence → N-gram Splitter → (context, next)
├─ Lookup in TRANSITIONS → Confidence Score
└─ Return Score to LLM Agent or Validator

This pipeline gives you the best of both worlds:

3. Architecture Deep Dive

Level 1: AST Patterns (Syntactic Correctness)

The AST (Abstract Syntax Tree) level operates at the structural foundation of code. It answers the question: “Is this code structurally valid for our codebase?”

How AST Extraction Works

Python’s ast module parses code into a tree representation where each node represents a syntactic construct:

Module
├─ FunctionDef (name='process')
│  ├─ arguments
│  ├─ If
│  │  ├─ Compare
│  │  └─ Return
│  └─ For
│     ├─ expr
│     └─ Expr (Call)
└─ FunctionDef (name='validate')
   ├─ Return
   └─ Return

Markymarkov extracts parent-child transitions from this tree:

(Module, FunctionDef)
(FunctionDef, If)
(FunctionDef, For)
(If, Return)
(For, Expr)
(Expr, Call)
...
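
A minimal sketch of how such parent-child transitions can be collected with Python's built-in ast module (the project's own extractor may differ in details):

import ast

def ast_transitions(source: str):
    """Yield (parent_node_type, child_node_type) pairs from the parsed tree."""
    tree = ast.parse(source)
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            yield (type(parent).__name__, type(child).__name__)

code = "def process(items):\n    if not items:\n        return []\n    return items\n"
print(list(ast_transitions(code)))
# [('Module', 'FunctionDef'), ('FunctionDef', 'arguments'), ('FunctionDef', 'If'), ...]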

What This Captures

AST patterns capture structural rules about valid code composition:

Common AST transitions you’d see in typical code:

(FunctionDef, Return)      ✓ Functions can return values
(If, Assign)               ✓ Can assign in if blocks
(For, If)                  ✓ Can nest if in for loop
(FunctionDef, ClassDef)    ✗ Class defined inside a function (rare)
(Return, FunctionDef)      ✗ Function defined right after a return (unreachable)

Concrete Example

Let’s say your codebase has these patterns frequently:

def validate(x):
    if not x:
        return False
    return True

def process(items):
    if not items:
        return []
    return [x.upper() for x in items]

def transform(data):
    if isinstance(data, str):
        return data.strip()
    if data is None:
        return ""
    return str(data)

Markymarkov extracts:

(Module, FunctionDef) - appears 3 times
(FunctionDef, If) - appears 3 times
(If, Return) - appears 4 times
(If, If) - appears 1 time (nested ifs)
(Return, FunctionDef) - appears 2 times

Transition probabilities:

P(If | FunctionDef) = 3/3 = 1.0    (very common)
P(Return | If) = 4/5 = 0.8         (very common)
P(If | If) = 1/5 = 0.2             (less common)
P(ClassDef | FunctionDef) = 0      (never seen)

Later, when validating generated code with structure:

def my_func():
    if condition:
        return result

AST validation checks:

Result: Structurally valid and idiomatic for this codebase.

Why AST Alone Isn’t Enough

Two functions with near-identical AST structure:

# Version A
def validate(items):
    valid = []
    for item in items:
        if item.ok:
            valid.append(item)
    return valid

# Version B
def validate(items):
    return [item for item in items if item.ok]

Both have AST:

FunctionDef → Return
For/Comprehension → If

But Version B is more idiomatic Python. AST can’t tell you that.

Performance Characteristics

AST extraction is highly efficient:


Level 2: Semantic Patterns (Code Style & Idioms)

While AST validates structure, semantic patterns validate intent and style. This level asks: “Does this code follow how we write code?”

How Semantic Pattern Detection Works

Semantic patterns are high-level abstractions detected by analyzing code behavior:

# Pattern: GUARD_CLAUSE
if not items:
    return []

# Pattern: RETURN_NONE
if condition:
    return None

# Pattern: LOOP_FILTER
for item in items:
    if item.valid:
        process(item)

# Pattern: LIST_COMPREHENSION
[x.transform() for x in items if x.valid]

# Pattern: CONTEXT_MANAGER
with open(file) as f:
    content = f.read()

Unlike AST (which just sees structure), semantic analysis understands:

The 52+ Semantic Patterns

Markymarkov detects and tracks 52+ distinct semantic patterns across these categories:

Control Flow Idioms (8 patterns)

Loop Patterns (9 patterns)

Return Patterns (6 patterns)

Data Structure Patterns (10 patterns)

Comprehension Patterns (4 patterns)

String Patterns (3 patterns)

Function/Class Patterns (7 patterns)

Error Handling Patterns (5 patterns)

Miscellaneous Patterns (0+ patterns)

Example: Semantic Pattern Detection

Consider this function:

def find_user(user_id):
    if user_id is None:              # ← if-none-check
        return None                   # ← return-none

    user = database.get(user_id)      # ← database lookup
    if not user:                      # ← if-empty-check
        return None                   # ← return-none

    return user                       # ← return-computed

Semantic sequence detected: [if-none-check, return-none, if-empty-check, return-none, return-computed]

This pattern is classic: “validate inputs early, return early if invalid, then return the result.”

Why Semantic Matters

Three versions of a function that do the same thing:

# Style A: Nested checks
def process(data):
    if data:
        if isinstance(data, list):
            if len(data) > 0:
                return [item.process() for item in data]
    return None

# Style B: Guard clauses (idiomatic Python)
def process(data):
    if not data or not isinstance(data, list) or len(data) == 0:
        return None
    return [item.process() for item in data]

# Style C: Even better (Python idiom)
def process(data):
    if not isinstance(data, list) or not data:
        return None
    return [item.process() for item in data]

All three are syntactically valid (all produce valid ASTs). But:

Only semantic analysis can distinguish these.

Practical Impact

For code generation guidance, semantic patterns are crucial because they:


How They Work Together

The magic of Markymarkov comes from combining these two levels:

Scenario 1: Valid AST, Unknown Semantic Pattern

def validate(x):
    if x is None:
        return None
    return x.process()

Scenario 2: Valid AST, Unusual Semantic Pattern

def process(items):
    results = {}
    for item in items:
        try:
            results[item.id] = item.process()
        except Exception:
            pass
    return results

Scenario 3: Syntactically Valid, Semantically Suspicious

def validate(items):
    if len(items) > 0:
        if isinstance(items, list):
            if items is not None:
                return items
    else:
        return None

Fallback Mechanism

When confidence is uncertain in one level, the other provides context:

# If AST says "unknown structure"
# Check: Does semantic pattern exist? If yes, probably OK.
# → Higher overall confidence

# If semantic says "unknown pattern"
# Check: Is AST structure valid? If yes, probably OK.
# → Higher overall confidence

Combined Confidence Scoring

Final confidence combines both:

overall_confidence = (ast_confidence × 0.4) + (semantic_confidence × 0.6)

Or for explicit validation:

Valid = (AST passes) AND (Semantic acceptable OR AST has >0.8 confidence)
Confidence = weighted_average(ast_score, semantic_score)
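
A small sketch of that weighted blend (the 0.4/0.6 weights are taken from the formula above; the shipped scoring may differ):

def combined_confidence(ast_confidence: float, semantic_confidence: float) -> float:
    """Blend structural and stylistic confidence, weighting style slightly higher."""
    return ast_confidence * 0.4 + semantic_confidence * 0.6

print(combined_confidence(0.9, 0.5))  # ≈ 0.66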

This gives you:

Real-World Example

Given training code:

def find_users(search_term):
    if not search_term:          # ← guard-clause
        return []                # ← return-list

    results = []
    for user in database.all():  # ← loop-iterate
        if search_term in user.name:  # ← loop-filter
            results.append(user) # ← loop-accumulate
    return results               # ← return-list

Pattern sequence: [guard-clause, return-list, loop-iterate, loop-filter, loop-accumulate, return-list]

When validating generated code:

def find_users(search_term):
    if not search_term:
        return []
    return [u for u in database.all() if search_term in u.name]

Pattern sequence: [guard-clause, return-list, list-comprehension]

Checking:

Result: Accepted with medium confidence (semantically different but AST valid)

4. Practical Examples

Training Markymarkov on Your Codebase

markymarkov train /path/to/code models/ --model-type both

Using Markymarkov to Validate Generated Code

markymarkov validate models/semantic_model.py generated_code.py

Real-World Validation Output with Diagnostics

> uv run markymarkov validate examples/pytest/semantic_model.py src/__main__.py
Built markymarkov @ file:///.../marky
Uninstalled 1 package in 0.21ms
Installed 1 package in 0.45ms
Loading model: examples/pytest/semantic_model.py
Validating code: src/__main__.py

Extracted 71 semantic patterns
First 20 patterns: ['init-method', 'function-transformer', 'if-empty-check', 'return-none', 'function-transformer', 'guard-clause', 'return-list', 'guard-clause', 'string-format', 'return-computed', 'function-transformer', 'if-empty-check', 'return-none', 'context-manager', 'string-format', 'context-manager', 'string-format', 'function-transformer', 'return-computed', 'loop-enumerate']
Model order: 2
Model has 211 pattern sequences

Validation Result (Semantic Model):
  Valid: True
  Confidence: 0.373
  Pattern sequences checked: 15
  Known transitions: 9/15

  ✓ Matching sequences (9):
    1. function-transformer → if-empty-check → return-none (0.423) @ line 118:12
    2. if-empty-check → return-none → function-transformer (0.370) @ line 135:4
    3. return-none → function-transformer → guard-clause (0.275) @ line 138:12
    4. guard-clause → return-list → guard-clause (1.000) @ line 143:8
    5. string-format → return-computed → function-transformer (0.714) @ line 149:4
    6. return-computed → function-transformer → if-empty-check (0.056) @ line 158:8
    7. function-transformer → if-empty-check → return-none (0.423) @ line 160:12

  ✗ Non-matching sequences (6):
    1. init-method → function-transformer → if-empty-check @ line 116:8
       Expected one of: return-computed, if-type-check, unpacking
    2. function-transformer → guard-clause → return-list @ line 139:16
       Expected one of: return-computed, return-none, return-bool
    3. return-list → guard-clause → string-format
       Expected one of: return-list
    4. Unknown sequence: guard-clause → string-format @ line 145:12
    5. if-empty-check → return-none → context-manager
       Expected one of: function-transformer, return-computed, guard-clause
    ... and 1 more

  Summary:
    Unique patterns found: 23
    Coverage: 9/15 transitions (60.0%)
    Issues: 4 unexpected, 2 unknown context

Understanding Coverage & Confidence Scores
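
In the report above, coverage is the share of checked pattern sequences whose transition was seen during training (9 of 15, i.e. 60.0%), while confidence aggregates the per-transition probabilities of the known transitions. A hedged sketch of that summary (the exact aggregation markymarkov uses may differ):

def summarize(transition_confidences):
    """transition_confidences: one probability per checked sequence, 0.0 if unseen."""
    known = [c for c in transition_confidences if c > 0.0]
    coverage = len(known) / max(len(transition_confidences), 1)
    confidence = sum(known) / max(len(known), 1)  # average over known transitions only
    return coverage, confidence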

5. 52 Semantic Patterns

Control Flow Patterns

Loop Patterns

Return Patterns

Data Structure Patterns

Error Handling Patterns

Function/Class Patterns

Comprehension & Other Patterns

6. Integration with LLM Agents

Note: This chapter explores integration patterns between Markymarkov and LLM agents. While not yet implemented in practice, the architecture suggests several natural approaches worth considering.

Marky’s design lends itself to integration with LLM-based code generation agents. Rather than replacing the LLM, Markymarkov serves as a validation and guidance layer, helping agents generate more idiomatic code in real-time.

How Agents Use Markymarkov for Guidance

The integration pattern is straightforward: instead of a simple generate-then-review cycle, Markymarkov enables a continuous feedback loop during generation.

Step 1: Agent Starts Generation

User prompt: "Generate a Python function that validates user input"

LLM begins token-by-token generation with temperature sampling

Step 2: Real-Time Validation

Generated tokens: [def, validate, (, user_input, ), :, \n, if, not, user_input, ...]

After each statement, Markymarkov analyzes:
- AST structure validity
- Semantic pattern sequence
- Confidence score for next pattern

Step 3: Confidence-Based Feedback

Markymarkov output:
  Current pattern: guard-clause
  Next pattern options:
    - return-none (0.85 confidence)     ← high confidence
    - return-bool (0.12 confidence)     ← lower confidence
    - loop-filter (0.03 confidence)     ← very unusual

  Recommendation: Steer toward return-none

Step 4: Agent Adjusts Generation

Based on confidence scores, the agent decides:
  - Continue: If high confidence (>0.7)
  - Adjust temperature: If medium confidence (0.4-0.7)
  - Use logit biasing: If low confidence (<0.4)
  - Regenerate: If confidence below threshold

Example Flow

Here’s how a generation might proceed with Markymarkov validation:

INITIAL: Empty function stub
def validate_email(email):


ITERATION 1:
  Generated: "if not email:"
  Markymarkov analysis: guard-clause detected
  Confidence: 0.92 (very common opening pattern)
  Action: Continue with high confidence

ITERATION 2:
  Generated: "    return False"
  Pattern after guard-clause: return-bool
  Markymarkov analysis: return-bool confidence 0.8 (common after guards)
  Markymarkov analysis: return-none confidence 0.15 (less common)
  Action: Continue, both valid patterns

ITERATION 3:
  Generated: "    email_regex = r'...'"
  Pattern: init-var (assignment)
  Markymarkov analysis: After return-bool, return-var confidence 0.3
  Markymarkov analysis: After return-bool, init-var confidence 0.15
  Assessment: Medium-low confidence (unusual pattern chain)
  Action: Alert agent, consider regeneration or steering

ITERATION 4:
  Generated: "    return email_regex.match(email) is not None"
  Pattern: return-computed
  Markymarkov analysis: After init-var, return-computed confidence 0.7
  Assessment: Pattern sequence matches learned patterns
  Action: Continue

RESULT:
def validate_email(email):
    if not email:
        return False
    email_regex = r'...'
    return email_regex.match(email) is not None

Pattern sequence: [guard-clause, return-bool, init-var, return-computed]
Overall confidence: 0.65 average (valid but slightly unusual)

Temperature Sampling for Diversity

LLMs use temperature to control randomness in token selection. Marky’s confidence scores provide a natural signal for temperature adjustments.

How Temperature Works with LLM Sampling

Standard temperature sampling:

At each step, LLM computes logits for all possible next tokens:
logits = [0.8, 1.2, 0.3, -0.5, 2.1, ...]

Temperature=1.0 (default):
  probabilities = softmax(logits)
  Sample from probabilities (natural randomness)

Temperature=0.3 (low, deterministic):
  probabilities = softmax(logits / 0.3)
  Peaks sharpen, mostly samples highest probability token
  Result: More predictable, less creative

Temperature=2.0 (high, creative):
  probabilities = softmax(logits / 2.0)
  Distribution flattens, samples more diverse tokens
  Result: More creative, less predictable
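
For reference, "temperature" is just a rescaling of the logits before the softmax; a minimal sketch:

import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/T and normalize: low T sharpens, high T flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.8, 1.2, 0.3, -0.5, 2.1]
print(softmax_with_temperature(logits, 0.3))  # sharply peaked on the 2.1 logit
print(softmax_with_temperature(logits, 2.0))  # much flatter distribution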

Marky-Aware Temperature Tuning

Consider dynamic temperature adjustment based on confidence:

def adjust_temperature(marky_confidence):
    """
    Adjust generation temperature based on pattern confidence.
    High confidence → follow patterns closely
    Low confidence → explore alternatives
    """
    if marky_confidence > 0.8:
        return 0.3  # Follow learned patterns closely
    elif marky_confidence > 0.5:
        return 0.7  # Balanced exploration
    else:
        return 1.2  # High diversity, unusual patterns OK

Use Cases

Scenario 1: Style Enforcement Mode
  Markymarkov confidence: 0.95 (return-list very common after guard-clause)
  Temperature: 0.2 (low, enforce pattern)
  Effect: Agent strongly prefers idiomatic code

Scenario 2: Experimental Code Generation
  Markymarkov confidence: 0.4 (unusual pattern combination)
  Temperature: 1.5 (high, explore alternatives)
  Effect: Agent tries novel patterns while staying valid

Scenario 3: Adaptive Mode
  Markymarkov confidence: varies dynamically
  Temperature: adjusts per iteration
  Effect: Agent balances idiomaticity with novelty

Logit Biasing for LLM Steering

Many LLM APIs (OpenAI, Anthropic, etc.) support logit biasing: adjusting the probability of specific tokens before sampling. Marky’s pattern knowledge maps naturally to this mechanism.

How Logit Biasing Works

Raw logits: [0.8, 1.2, 0.3, -0.5, 2.1, ...]
Token IDs:  [123, 456, 789, 234, 567, ...]

With logit bias (boost token 456, suppress token 789):
  bias = {456: +2.0, 789: -3.0}
  adjusted_logits = [0.8, 3.2, -2.7, -0.5, 2.1, ...]
           ↑              ↑             ↑
        unchanged      boosted      suppressed

Sample from adjusted distribution:
  Token 456 becomes much more likely
  Token 789 becomes much less likely

Marky-Driven Logit Biasing

Here’s an approach to recommend token biases:

def recommend_logit_bias(current_pattern, model_transitions):
    """
    Recommend token biases based on expected next patterns.
    """
    expected_next = model_transitions[current_pattern]

    # Boost tokens for high-confidence patterns
    positive_bias = {}
    for pattern, confidence in expected_next.items():
        if confidence > 0.7:
            tokens = pattern_to_tokens(pattern)
            for token in tokens:
                positive_bias[token] = confidence * 2.0

    # Suppress tokens for low-confidence patterns
    negative_bias = {}
    for pattern, confidence in expected_next.items():
        if confidence < 0.1:
            tokens = pattern_to_tokens(pattern)
            for token in tokens:
                negative_bias[token] = -3.0

    return {**positive_bias, **negative_bias}

Example: Steering Toward Guard Clauses

Suppose Markymarkov analysis shows: “After a function definition, guards have 0.85 confidence.”

# Generate logit bias for guard clause patterns
bias = {
    token_id("if"): +1.5,      # Boost "if" keyword
    token_id("not"): +1.2,     # Boost "not" (common in guards)
    token_id("is None"): +1.0, # Boost None check

    # Suppress less idiomatic patterns
    token_id("while"): -2.0,   # Suppress while loops
    token_id("try"): -1.5,     # Suppress try blocks
}

# Pass to LLM API
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    logit_bias=bias  # Steer generation toward guard clauses
)

Benefits of This Approach


Real-Time Validation During Generation

While temperature and logit biasing shape generation proactively, real-time validation provides reactive feedback as code emerges.

Architecture

LLM Stream Generator
    ↓ (token by token)
Code Builder (accumulates tokens)
    ↓ (complete statement)
Pattern Detector
    ├─ Extract AST
    ├─ Extract Semantic Patterns
    └─ Create N-grams

Markov Model Validator
    ├─ Check AST transitions
    ├─ Check semantic transitions
    └─ Calculate confidence

Feedback Engine
    ├─ Report to agent/user
    ├─ Suggest corrections
    └─ Recommend next patterns

Example: Real-Time Detection

def validate_stream(token_stream, model):
    """
    Validate code as it streams from LLM.
    """
    code = ""
    for token in token_stream:
        code += token

        # When we have a complete statement
        if is_complete_statement(code):
            # Extract patterns
            ast_patterns = extract_ast_patterns(code)
            semantic_patterns = extract_semantic_patterns(code)

            # Validate last transition
            if len(semantic_patterns) >= 2:
                prev_pattern = semantic_patterns[-2]
                curr_pattern = semantic_patterns[-1]

                confidence = model.check_transition(
                    prev_pattern,
                    curr_pattern
                )

                if confidence < CONFIDENCE_THRESHOLD:
                    yield {
                        'warning': f"Unusual pattern: {prev_pattern} → {curr_pattern}",
                        'confidence': confidence,
                        'expected': model.top_transitions(prev_pattern),
                    }

            yield {'valid': True, 'code': code}

Use Cases

1. Style Enforcement During Generation

Generated: "for item in items:\n    try:"
Markymarkov feedback: loop-try-except pattern (confidence: 0.2)
Alert: "Unusual! After loop-iterate, try-except is uncommon"
       "Expected: loop-filter (0.7), loop-transform (0.6)"
Action: Agent chooses to regenerate or accept the warning

2. Early Error Detection

Generated: "def func(): return\n    print('unreachable')"
Markymarkov feedback: AST error detected - code after return
Alert: "Invalid AST structure! Code after return statement"
Action: Agent must regenerate

3. Coverage Tracking

As code generates, Markymarkov tracks:
  Coverage so far: 45% of patterns matched training data
  Current confidence: 0.62
  Unique patterns used: 8/23

Alert: "Code is stylistically different from training"
       "Consider: Adding more guard clauses, use comprehensions"

4. Guided Refinement

Agent generates initial code:
def process(data):
    results = []
    for item in data:
        try:
            results.append(item.process())
        except:
            pass
    return results

Markymarkov analysis:
  Pattern: [init-empty-list, loop-filter, try-except-pass, return-list]
  Confidence: 0.45 (valid but unusual combination)

Suggestion: "Consider rewriting with list comprehension?"
  [item.process() for item in data if item.valid()]
  This pattern has confidence: 0.85

Agent regenerates:
def process(data):
    return [item.process() for item in data if item.valid()]

New confidence: 0.88 ✓

Integration Patterns

Several integration patterns emerge naturally:

Pattern 1: Validation-Only

Agent generates code → Markymarkov validates → Report results
Simplest approach, no feedback loop.

Pattern 2: Temperature-Aware

Agent generates tokens → Markymarkov scores pattern → Adjust temperature
Agent regenerates with new temperature
Enables iterative refinement.

Pattern 3: Logit Bias Steering

Before generation, compute logit biases from Marky
Pass biases to LLM API
Generation steered toward idiomatic patterns from the start.

Pattern 4: Interactive Refinement

Agent generates skeleton code
User reviews with Markymarkov feedback
Agent refines based on confidence scores
Iterate until satisfied.

Pattern 5: Multi-Agent Validation

Agent A generates candidate code
Agent B (validator) uses Markymarkov to score
Agent A refines based on scores
Both agents converge toward idiomatic code.

Implementation Sketch

Here’s a sketch of how integration might look in practice:

from typing import Any, Dict
from models.semantic_model import MarkovCodeGuide

class MarkyGuidedAgent:
    """
    Example of Markymarkov integration with an LLM agent.
    """

    def __init__(self, marky_model_path: str, llm_client, temperature=0.7):
        self.marky = MarkovCodeGuide.load(marky_model_path)
        self.llm = llm_client
        self.temperature = temperature
        self.generated_code = ""
        self.pattern_history = []

    def generate_with_guidance(
        self,
        prompt: str,
        max_iterations: int = 3
    ) -> Dict[str, Any]:
        """
        Generate code with Markymarkov guidance and iterative refinement.

        Workflow:
        1. Generate code with current temperature
        2. Validate with Marky
        3. Adjust temperature based on confidence
        4. Regenerate if confidence is too low
        5. Repeat until confident or max iterations reached
        """
        best_result = None
        best_confidence = 0.0

        for iteration in range(max_iterations):
            # Adjust temperature based on previous iteration's confidence
            if iteration > 0:
                self.temperature = self._adjust_temperature(best_confidence)

            print(f"\n[Iteration {iteration + 1}]")
            print(f"  Temperature: {self.temperature:.2f}")

            # Generate code
            code = self._stream_generate(prompt)

            # Validate with Marky
            validation = self._validate_code(code)

            print(f"  Generated: {len(code)} chars")
            print(f"  Confidence: {validation['confidence']:.2f}")
            print(f"  Coverage: {validation['coverage']:.1%}")

            # Store if best so far
            if validation['confidence'] > best_confidence:
                best_result = {
                    'code': code,
                    'validation': validation,
                    'temperature': self.temperature,
                    'iteration': iteration,
                }
                best_confidence = validation['confidence']

            # Stop if confident enough
            if best_confidence > 0.8:
                print(f"\n  Confidence threshold reached, stopping.")
                break

        return best_result

    def _stream_generate(self, prompt: str) -> str:
        """Stream token generation from LLM."""
        code = ""
        response = self.llm.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=self.temperature,
            stream=True,
        )

        for chunk in response:
            if chunk.choices[0].delta.content:
                token = chunk.choices[0].delta.content
                code += token

        return code

    def _validate_code(self, code: str) -> Dict:
        """Validate with Markymarkov model."""
        import ast as ast_module

        try:
            tree = ast_module.parse(code)
        except SyntaxError as e:
            return {
                'valid': False,
                'confidence': 0.0,
                'coverage': 0.0,
                'error': str(e),
            }

        # Extract patterns
        patterns = self.marky.extract_patterns(code)

        # Validate pattern transitions
        transitions = []
        for i in range(1, len(patterns)):
            prev_pattern = patterns[i-1]
            curr_pattern = patterns[i]
            confidence = self.marky.check_transition(
                prev_pattern,
                curr_pattern
            )
            transitions.append({
                'prev': prev_pattern,
                'curr': curr_pattern,
                'confidence': confidence,
            })

        # Calculate coverage and average confidence
        valid_transitions = sum(
            1 for t in transitions if t['confidence'] > 0.3
        )
        coverage = valid_transitions / max(len(transitions), 1)
        avg_confidence = (
            sum(t['confidence'] for t in transitions)
            / max(len(transitions), 1)
        )

        return {
            'valid': True,
            'confidence': avg_confidence,
            'coverage': coverage,
            'patterns': patterns,
            'transitions': transitions,
        }

    def _adjust_temperature(self, confidence: float) -> float:
        """Dynamically adjust temperature based on Markymarkov confidence."""
        if confidence > 0.8:
            return 0.3  # Low temp: follow patterns closely
        elif confidence > 0.6:
            return 0.7  # Medium temp: balanced
        elif confidence > 0.4:
            return 1.0  # Normal temp: explore more
        else:
            return 1.5  # High temp: significant exploration

# Example usage:
"""
import openai

agent = MarkyGuidedAgent(
    'models/semantic_model.py',
    llm_client=openai.Client(),
)

result = agent.generate_with_guidance(
    prompt="Write a function to validate email addresses",
    max_iterations=3,
)

print(f"\nFinal Result:")
print(f"Code:\n{result['code']}")
print(f"Confidence: {result['validation']['confidence']:.2f}")
print(f"Iteration: {result['iteration'] + 1}")
"""

Design Considerations


Why This Integration Makes Sense

Combining Markymarkov with LLM agents offers several advantages:

Open Questions

Some questions worth exploring through implementation:

The answers to these questions would come from experimentation and real-world usage.

7. Performance Characteristics

Marky’s performance has been benchmarked against the Python 3.13 standard library (596 files, 20.23 MB of code). These are real-world numbers, not synthetic benchmarks.

Training Performance

Training speed is crucial for iterative development. Markymarkov processes code quickly:

Dataset Characteristics

Training Speed

Both models (AST + Semantic):
  Total time: 5.65 seconds
  Throughput: 105.6 files/second
  Throughput: 3.58 MB/second
  Per-file average: 9.5 ms

AST model only:
  Total time: 2.60 seconds
  Throughput: 229.2 files/second
  Throughput: 7.78 MB/second

Semantic model only:
  Total time: 2.81 seconds
  Throughput: 212.0 files/second
  Throughput: 7.19 MB/second

What This Means:

Model Sizes

The trained models are compact:

AST model: 537.1 KB
Semantic model: 277.5 KB
Combined: 814.6 KB

Compression ratio: 25.4x (20 MB → 815 KB)

This makes models:


Validation Performance

Validation speed determines whether Markymarkov can be used in real-time workflows:

End-to-End Validation (including subprocess startup):

AST Validation:
  Median: 216.9 ms
  Mean: 218.8 ms
  P95: 238.3 ms
  Range: 207.1–246.8 ms

Semantic Validation:
  Median: 211.4 ms
  Mean: 380.0 ms
  P95: 1233.5 ms
  Range: 187.9–1251.7 ms

Note: Higher variance in semantic validation due to
      pattern complexity in different code sections.

What This Means:

The subprocess overhead dominates these times. In a long-running process (like an IDE plugin), validation would be 10-50x faster.


Query Latency

The core model lookup operations are extremely fast:

Model Transition Lookup

Warm lookup (cached in memory):
  Mean: <1 microsecond
  Median: <1 microsecond
  P99: <2 microseconds

Cold lookup (first access):
  Mean: ~1-2 microseconds

This is effectively instant. The lookup is just a Python dict access:

next_probs = transitions.get((prev_pattern, curr_pattern), {})
confidence = next_probs.get(next_pattern, 0.0)
# O(1) hash table lookups

At these speeds:


Memory Footprint

Markymarkov is memory-efficient:

Model Loading

AST model in memory: ~1-2 MB (loaded)
Semantic model in memory: ~500 KB - 1 MB (loaded)
Both models: ~2-3 MB total

Python overhead: ~10-20 MB (interpreter)
Total process: ~15-25 MB

Scaling Characteristics

Model size grows sub-linearly with codebase size:

100 files → ~100 KB model
1,000 files → ~800 KB model
10,000 files → ~5 MB model (estimated)
100,000 files → ~30 MB model (estimated)

The sub-linear growth occurs because:


Throughput & Scalability

Training Scalability

Training time scales linearly with codebase size:

O(n) where n = number of AST nodes

Observed:
  596 files (34.8 KB avg each): 5.65s
  Predicted for 6,000 files: ~56s
  Predicted for 60,000 files: ~9 minutes

Bottlenecks:

Not bottlenecks:

Validation Scalability

Validation time scales with file size:

O(m) where m = lines of code in file

Typical:
  100 LOC file: ~50ms
  1,000 LOC file: ~200ms
  10,000 LOC file: ~2s

For very large files (>5000 LOC), consider:

Parallel Processing

Training is embarrassingly parallel:

Single-threaded: 105.6 files/second
8 cores: ~850 files/second (estimated)
16 cores: ~1,700 files/second (estimated)

The current implementation is single-threaded, but parallelization is straightforward (process files independently, merge results).
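
A hedged sketch of what that parallelization could look like (not part of the current implementation): count transitions per file in worker processes, then merge the counters.

from collections import Counter, defaultdict
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def count_file(path):
    """Extract transition counts for one file (stand-in for the real extractor)."""
    counts = defaultdict(Counter)
    # ... parse `path` and count (context -> next) transitions here ...
    return counts

def train_parallel(root, workers=8):
    files = list(Path(root).rglob("*.py"))
    merged = defaultdict(Counter)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for per_file in pool.map(count_file, files):
            for context, nexts in per_file.items():
                merged[context].update(nexts)  # merging is just summing counters
    return merged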


Real-World Performance Examples

Small Project (50 files, 2 MB)

Training: ~0.5 seconds
Model size: ~100 KB
Validation: ~200ms per file
Total setup: <1 second

Medium Project (500 files, 20 MB)

Training: ~5 seconds
Model size: ~800 KB
Validation: ~200ms per file
Total setup: ~5 seconds

Large Project (5,000 files, 200 MB)

Training: ~50 seconds
Model size: ~5 MB
Validation: ~200ms per file
Total setup: ~1 minute

Very Large Project (50,000 files, 2 GB)

Training: ~8 minutes (estimated)
Model size: ~30 MB (estimated)
Validation: ~200ms per file
Total setup: ~10 minutes (one-time)

Optimizations & Tuning

Training Speed

To optimize training:

1. Use faster disk I/O (SSD vs HDD: 2-3x improvement)
2. Parallelize file processing (linear speedup)
3. Use lower n-gram order (order=1 is 2x faster)
4. Filter irrelevant files (tests, generated code)
5. Cache parsed ASTs (if re-training frequently)

Validation Speed

To optimize validation:

1. Run in persistent process (avoid subprocess overhead)
2. Pre-load models at startup (100ms → 1ms validation)
3. Validate incrementally (only changed code)
4. Use AST cache (if validating same file repeatedly)
5. Parallel validation (multiple files)

Memory Usage

To reduce memory:

1. Use lower n-gram order (order=1: 50% less memory)
2. Prune rare transitions (threshold=2: 30% less memory)
3. Quantize probabilities (float32 → int8: 75% less memory; see the sketch below)
4. Stream model loading (don't load all at once)
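
As one example, quantizing probabilities (option 3 above) could look like this hedged sketch: store each probability as an 8-bit integer and rescale on lookup, trading a little precision for a 4x smaller table.

def quantize(prob: float) -> int:
    """Map a probability in [0, 1] to an 8-bit integer (max error ~0.2%)."""
    return round(prob * 255)

def dequantize(q: int) -> float:
    return q / 255

q = quantize(0.8)     # 204
print(dequantize(q))  # 0.8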

Comparison with Alternatives

vs. Linters (pylint, flake8)

Markymarkov training: 5.6s for 596 files
Pylint: ~60s for 596 files (10x slower)
Flake8: ~15s for 596 files (3x slower)

Markymarkov validation: ~200ms per file
Pylint: ~300ms per file
Flake8: ~100ms per file

Advantage: Markymarkov learns from your code, not fixed rules
Trade-off: Markymarkov requires initial training step

vs. Type Checkers (mypy)

Markymarkov validation: ~200ms per file
Mypy: ~500ms per file (cold), ~100ms (warm)

Advantage: Markymarkov checks patterns, not types (complementary)
Trade-off: Different problem domain

vs. Deep Learning Models

Markymarkov training: 5.6s for 596 files
CodeBERT training: Hours/days
GPT fine-tuning: Days/weeks

Markymarkov inference: <1ms per lookup
CodeBERT inference: ~100ms per prediction
GPT inference: ~1s per completion

Advantage: Markymarkov is 100-1000x faster
Trade-off: Markymarkov doesn't understand semantics, only patterns

Performance Recommendations

For best performance in different scenarios:

Development/IDE Integration

- Use persistent process (avoid subprocess overhead)
- Pre-load models at startup
- Validate on save (200ms is acceptable)
- Update models nightly (don't retrain on every change)

CI/CD Pipelines

- Train models in dedicated step (cache for reuse)
- Validate changed files only (not entire codebase)
- Run in parallel (one validation per core)
- Fail fast (stop on first validation error)

Code Review Tools

- Load models once (per review session)
- Validate diffs only (not unchanged code)
- Show confidence scores (help reviewers prioritize)
- Cache results (same file validated multiple times)

Large-Scale Analysis

- Parallelize training (split files across workers)
- Stream processing (don't load all files at once)
- Sample validation (validate subset for estimates)
- Incremental updates (retrain only changed modules)

Benchmark Methodology

These numbers were obtained using:

Benchmark code is available in the repository (benchmark.py and benchmark_summary.py).


Key Takeaways

✓ Fast training: 100+ files/second, practical for daily retraining
✓ Compact models: 25x compression, <1 MB for typical projects
✓ Quick validation: ~200ms per file including overhead
✓ Instant lookups: <1 microsecond for model queries
✓ Linear scaling: Performance predictable as codebase grows
✓ Memory efficient: 2-3 MB for loaded models
✓ Production-ready: Performance suitable for real-time use

Marky’s performance characteristics make it viable for:

8. Use Cases

Marky’s pattern-learning approach enables a variety of practical applications, from code quality enforcement to training data analysis. Here are detailed scenarios showing how Markymarkov solves real problems.


Code Quality Assurance

Problem: Teams need to ensure generated or contributed code follows project conventions, but manual review doesn’t scale and traditional linters only check syntax.

How Markymarkov Helps:

Markymarkov learns what “quality code” looks like from your existing codebase, then validates new code against those patterns.

Example Workflow:

# 1. Train on your production codebase
uv run markymarkov train src/ models/ --model-type both

# 2. Validate new code during PR review
uv run markymarkov validate models/semantic_model.py new_feature.py

# Output shows:
#   - Confidence score (0.0-1.0)
#   - Which patterns match your codebase
#   - Which patterns are unusual
#   - Suggestions for improvement

Real-World Scenario:

A team maintaining a Flask application trains Markymarkov on their existing routes, models, and utilities:

# Existing codebase pattern: Always validate input early
@app.route('/user/<user_id>')
def get_user(user_id):
    if not user_id:
        return jsonify({'error': 'Missing user_id'}), 400
    user = db.query(User).get(user_id)
    if not user:
        return jsonify({'error': 'User not found'}), 404
    return jsonify(user.to_dict())

When a new developer submits code without input validation:

@app.route('/product/<product_id>')
def get_product(product_id):
    product = db.query(Product).get(product_id)
    return jsonify(product.to_dict())

Markymarkov flags this:

Validation Result:
  Confidence: 0.32 (Low - unusual pattern)

  Missing expected patterns:
    - guard-clause (expected after function-transformer: 0.85 confidence)
    - if-empty-check (common in route handlers: 0.78 confidence)

  Suggestion: Add input validation (guard clause pattern)

Benefits:


Style Enforcement

Problem: Different developers write code differently. Some use list comprehensions, others use loops. Some use guard clauses, others nest conditionals. You want consistency.

How Markymarkov Helps:

Markymarkov identifies the dominant patterns in your codebase and flags deviations. This creates a “style fingerprint” unique to your project.

Example Workflow:

# Train on well-styled reference code
uv run markymarkov train examples/good_style/ models/

# Validate new contributions
uv run markymarkov validate models/semantic_model.py contribution.py

# Integrate into pre-commit hook
# .git/hooks/pre-commit:
#!/bin/bash
for file in $(git diff --cached --name-only --diff-filter=ACM | grep '\.py$'); do
    uv run markymarkov validate models/semantic_model.py "$file"
    if [ $? -ne 0 ]; then
        echo "Style validation failed for $file"
        exit 1
    fi
done

Real-World Scenario:

Your team prefers comprehensions over manual loops:

# Preferred style (80% of codebase)
result = [item.transform() for item in items if item.valid]

# Less preferred (20% of codebase)
result = []
for item in items:
    if item.valid:
        result.append(item.transform())

New code using manual loops gets flagged:

Pattern detected: loop-accumulate
Confidence: 0.25 (Low for this context)

Alternative suggestion:
  list-comprehension has 0.82 confidence in this context
  Consider: [item.transform() for item in items if item.valid]

Configuration Options:

You can tune enforcement strictness:

# Strict mode: Reject if confidence < 0.7
if confidence < 0.7:
    reject_code()

# Permissive mode: Warn if confidence < 0.5, reject if < 0.3
if confidence < 0.3:
    reject_code()
elif confidence < 0.5:
    warn_developer()

# Learning mode: Always accept, but log unusual patterns
log_pattern_stats(code, confidence)

Benefits:


Training Data Analysis

Problem: You want to understand what patterns dominate your codebase, identify inconsistencies, or prepare data for model training.

How Markymarkov Helps:

Markymarkov extracts and quantifies patterns, giving you insights into code characteristics.

Example Workflow:

# Train model
uv run markymarkov train codebase/ analysis_models/

# View statistics
uv run markymarkov stats analysis_models/semantic_model.py

# Output shows:
#   - Most common patterns
#   - Pattern transition probabilities
#   - Unusual pattern combinations

Real-World Scenario:

Analyzing a legacy codebase before refactoring:

$ uv run markymarkov stats analysis_models/semantic_model.py

Model Statistics:
  Total transitions: 1,247
  Unique patterns: 31

Most common patterns:
  1. function-transformer (892 occurrences, 71.5%)
  2. guard-clause (423 occurrences, 33.9%)
  3. return-computed (389 occurrences, 31.2%)
  4. loop-filter (276 occurrences, 22.1%)
  5. try-except-pass (198 occurrences, 15.9%)

Unusual patterns (low confidence):
  - try-except-pass after loop-filter (0.12 confidence)
    → Suggests error handling inside loops (potential issue)

  - nested-if-else chains (0.08 confidence)
    → Only 8% of code uses this (refactoring candidate)

Pattern diversity: 0.62 (moderate - some patterns dominate)

From this analysis, you learn:

Use Cases:

Advanced Analysis:

Compare patterns across different modules:

# Train separate models
uv run markymarkov train src/api/ models/api_model.py
uv run markymarkov train src/db/ models/db_model.py
uv run markymarkov train src/utils/ models/utils_model.py

# Compare statistics
python compare_models.py models/api_model.py models/db_model.py

# Output:
#   API layer: Heavy use of guard clauses (0.82)
#   DB layer: Heavy use of context managers (0.71)
#   Utils: Heavy use of list comprehensions (0.68)

Benefits:


Model-Driven Code Generation

Problem: You want LLMs to generate code that matches your project’s style, not generic Python.

How Markymarkov Helps:

Train Markymarkov on your codebase, then use it to guide or validate LLM-generated code.

Example Workflow:

# 1. Train on your codebase
subprocess.run(["uv", "run", "python", "-m", "src", "train",
                "src/", "models/"])

# 2. Generate code with LLM
original_prompt = "Write a user validation function"
llm_code = generate_code_with_llm(prompt=original_prompt)

# 3. Validate with Marky
validation = validate_code(llm_code, "models/semantic_model.py")

# 4. If low confidence, regenerate with hints
if validation['confidence'] < 0.6:
    hints = get_expected_patterns(validation)
    llm_code = generate_code_with_llm(
        prompt=f"{original_prompt}\n\nUse these patterns: {hints}"
    )

Real-World Scenario:

Your project uses specific error handling patterns:

# Your codebase pattern
def fetch_user(user_id):
    try:
        return db.get_user(user_id)
    except NotFoundError:
        logger.warning(f"User {user_id} not found")
        return None
    except DatabaseError as e:
        logger.error(f"Database error: {e}")
        raise

LLM generates generic code:

def fetch_product(product_id):
    try:
        return db.get_product(product_id)
    except Exception:
        return None

Markymarkov validation:

Confidence: 0.23 (Very low)

Issues:
  - Catching generic Exception (your code uses specific exceptions)
  - No logging (your code always logs errors)
  - Silent failure (your code raises on critical errors)

Expected patterns:
  - try-except with specific exception types (0.89 confidence)
  - logging-call in except blocks (0.82 confidence)
  - try-except-reraise for critical errors (0.71 confidence)

You regenerate with hints:

# Improved LLM output (after hints)
def fetch_product(product_id):
    try:
        return db.get_product(product_id)
    except NotFoundError:
        logger.warning(f"Product {product_id} not found")
        return None
    except DatabaseError as e:
        logger.error(f"Database error: {e}")
        raise

New confidence: 0.87 ✓

Benefits:


Identifying Code Anomalies

Problem: You suspect certain code sections are unusual or buggy, but need objective evidence.

How Markymarkov Helps:

Low-confidence patterns indicate code that doesn’t match typical project style. This can surface bugs, anti-patterns, or quick hacks.

Example Workflow:

# Train on healthy codebase
uv run markymarkov train src/ models/ --exclude tests/

# Validate suspicious file
uv run markymarkov validate models/semantic_model.py suspicious.py

# Look for low-confidence patterns

Real-World Scenario:

A bug report comes in for a function. Markymarkov validation shows:

def process_payment(amount, user):
    result = charge_card(user.card, amount)
    if result:
        return result
    update_balance(user, amount)  # ← This line is unusual
    return {"status": "success"}

Markymarkov output:

Validation Result:
  Confidence: 0.18 (Very low)

  Unusual patterns:
    Line 4: Unreachable code after return (0.02 confidence)
    Expected: All code paths should be reachable

  Similar functions in codebase:
    - process_refund: Always checks result before proceeding (0.91 confidence)
    - process_subscription: Uses guard clauses (0.87 confidence)

The bug: on a successful charge, the early return skips update_balance, so the balance is never updated. Markymarkov flagged it as unusual because similar functions in the codebase don’t have this pattern.

Finding Tech Debt:

Run validation on entire codebase and sort by confidence:

# Validate all files (in bash, enable ** with `shopt -s globstar`)
for file in src/**/*.py; do
    uv run markymarkov validate models/semantic_model.py "$file" >> results.txt
done

# Sort by confidence (low = suspicious)
grep "Confidence:" results.txt | sort -t: -k2 -n

# Output:
#   src/legacy/utils.py: Confidence: 0.12
#   src/old_api/handlers.py: Confidence: 0.19
#   src/deprecated/auth.py: Confidence: 0.23
#   ...

Files with very low confidence often indicate:

Benefits:


Pre-Commit Hooks / CI Integration

Problem: You want to catch style issues before code reaches review.

How Markymarkov Helps:

Integrate Markymarkov into your development workflow to validate code automatically.

Example: Pre-Commit Hook

#!/bin/bash
# .git/hooks/pre-commit

THRESHOLD=0.5  # Minimum acceptable confidence

for file in $(git diff --cached --name-only --diff-filter=ACM | grep '\.py$'); do
    result=$(uv run markymarkov validate models/semantic_model.py "$file" 2>&1)
    confidence=$(echo "$result" | grep "Confidence:" | awk '{print $2}')

    if (( $(echo "$confidence < $THRESHOLD" | bc -l) )); then
        echo "$file: Low confidence ($confidence)"
        echo "$result"
        exit 1
    else
        echo "$file: Passed ($confidence)"
    fi
done

echo "All files passed Markymarkov validation!"

Example: GitHub Actions

name: Code Style Validation

on: [pull_request]

jobs:
  marky-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.13'

      - name: Install Marky
        run: |
          pip install uv
          uv sync

      - name: Train model (or use cached)
        run: |
          if [ ! -f models/semantic_model.py ]; then
            uv run markymarkov train src/ models/
          fi

      - name: Validate changed files
        run: |
          for file in $(git diff --name-only origin/main...HEAD | grep '\.py$'); do
            uv run markymarkov validate models/semantic_model.py "$file"
          done

Benefits:


Code Migration / Refactoring Assistance

Problem: Migrating from one style to another (e.g., callbacks → async/await, classes → functional).

How Markymarkov Helps:

Train one model on old style, another on new style, then validate migrations.

Example Workflow:

# Train on old codebase (before refactor)
uv run markymarkov train old_src/ models/old_style.py

# Train on target examples (new style)
uv run markymarkov train examples/new_style/ models/new_style.py

# During migration, validate against new style
uv run markymarkov validate models/new_style.py refactored_file.py

Real-World Scenario:

Migrating from synchronous to async code:

# Old style (callback-based)
def fetch_user(user_id, callback):
    result = db.get_user(user_id)
    callback(result)

# New style (async/await)
async def fetch_user(user_id):
    result = await db.get_user(user_id)
    return result

Markymarkov trained on new style flags incomplete migrations:

# Partially migrated (still has callback)
async def fetch_product(product_id, callback):
    result = await db.get_product(product_id)
    callback(result)  # ← Should return instead

Validation:

Confidence: 0.31 (Low - mixed patterns)

Issues:
  - Async function with callback parameter (not in new style)
  - Expected: async functions return values (0.94 confidence in new_style.py)

Suggestion: Remove callback, use return statement

Benefits:


Onboarding New Developers

Problem: New team members need to learn project conventions quickly.

How Markymarkov Helps:

Use Markymarkov statistics to show “how we write code here” with concrete examples.

Example Workflow:

# Generate onboarding report
uv run markymarkov stats models/semantic_model.py > ONBOARDING.md

# Show examples of each pattern
python extract_pattern_examples.py models/semantic_model.py src/

Onboarding Document Generated:

# Our Code Style (Generated from Codebase Analysis)

## Most Common Patterns

### 1. Guard Clauses (85% of functions)
We prefer early returns for invalid input:

\`\`\`python
# ✓ Our style
def process(data):
    if not data:
        return None
    if not data.valid:
        return None
    return data.process()

# ✗ Avoid
def process(data):
    if data and data.valid:
        return data.process()
    else:
        return None
\`\`\`

### 2. List Comprehensions (73% of loops)
Use comprehensions for transformations:

\`\`\`python
# ✓ Our style
results = [item.transform() for item in items if item.valid]

# ✗ Avoid
results = []
for item in items:
    if item.valid:
        results.append(item.transform())
\`\`\`

## Rare Patterns (Avoid Unless Necessary)

- try-except-pass (only 3% of code)
- nested-if-else (only 5% of code)
- while loops (only 8% of code)

Benefits:


Documentation Generation

Problem: Code style guides are often out of date or incomplete.

How Markymarkov Helps:

Generate living documentation from actual code patterns.

# Auto-generate style guide
python generate_style_guide.py models/semantic_model.py > STYLE_GUIDE.md

This creates documentation that’s:

Benefits:


Research / Code Analysis

Problem: Understanding code evolution, pattern trends, or comparative analysis.

How Markymarkov Helps:

Train models on different versions or projects to study differences.

Example Workflow:

# Train on multiple versions
uv run markymarkov train v1.0/ models/v1.py
uv run markymarkov train v2.0/ models/v2.py
uv run markymarkov train v3.0/ models/v3.py

# Compare patterns across versions
python compare_evolution.py models/v*.py

Research Questions Answered:

Benefits:


Summary of Use Cases

| Use Case | Benefit | Typical Users |
|---|---|---|
| Code Quality Assurance | Automated style validation | All developers |
| Style Enforcement | Consistent codebase | Tech leads, DevOps |
| Training Data Analysis | Understand patterns | Data scientists, researchers |
| Model-Driven Generation | Better LLM output | AI/ML teams |
| Anomaly Detection | Find bugs early | QA, code reviewers |
| CI/CD Integration | Automated checks | DevOps, platform teams |
| Refactoring Assistance | Migration validation | Senior developers |
| Developer Onboarding | Learn conventions | New team members |
| Documentation | Living style guides | Tech writers, leads |
| Research | Code pattern studies | Researchers, academics |

All these use cases leverage the same core capability: learning patterns from code and validating new code against those patterns. The flexibility of Marky’s approach makes it valuable across the entire development lifecycle.

9. Getting Started

Easily run from main branch:

uvx --from git+https://github.com/roobie/markymarkov markymarkov
# You can alias the above
alias marky="uvx --from git+https://github.com/roobie/markymarkov markymarkov"

Installation & Setup

git clone https://github.com/roobie/markymarkov
cd markymarkov
uv sync

Training Your First Model

# Train on your codebase
uv run markymarkov train /path/to/your/code models/

# Or on specific patterns
uv run markymarkov train /path/to/code models/ --model-type semantic --order 2

Running Validation

# Validate a file
uv run markymarkov validate models/semantic_model.py your_file.py

# See statistics
uv run markymarkov stats models/semantic_model.py

# Try the demo
uv run markymarkov demo

Integration Patterns
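
One lightweight way to wire markymarkov into your own tooling is to shell out to the CLI and parse the confidence from its report, mirroring what the pre-commit hook in the Use Cases chapter does. A hedged sketch (the exact report format may vary between versions):

import re
import subprocess

def validate_with_marky(model_path: str, file_path: str) -> float | None:
    """Run the CLI validator and extract the reported confidence score."""
    result = subprocess.run(
        ["uv", "run", "markymarkov", "validate", model_path, file_path],
        capture_output=True, text=True,
    )
    match = re.search(r"Confidence:\s*([0-9.]+)", result.stdout)
    return float(match.group(1)) if match else None

print(validate_with_marky("models/semantic_model.py", "your_file.py"))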

10. Conclusion

Markymarkov shows that small, interpretable probabilistic models add high value to code workflows. By learning patterns from your own codebase, Markymarkov provides fast, explainable guidance that complements existing tools rather than replacing them.

Key advantages:

If you care about consistency, developer productivity, and explainability, try training Markymarkov on a representative subset of your code and see what it highlights.

