1. Introduction
What is Markymarkov?
Markymarkov is a Markov Chain-based code guidance system designed to help LLM agents generate better code. Unlike traditional linters or type checkers that enforce fixed rules, Markymarkov learns patterns directly from your codebase and uses those patterns to validate and guide code generation in real-time.
At its core, Markymarkov operates on a simple but powerful principle: code is sequential, and code patterns are learnable. By analyzing existing code, it builds probabilistic models of what patterns typically follow other patterns—both at the syntactic level (how code is structured) and at the semantic level (what code idioms are preferred).
Key characteristics:
- Pattern-based: Learns from your actual codebase, not predefined rules
- Probabilistic: Provides confidence scores, not just pass/fail
- Fast: Sub-millisecond lookups enable real-time guidance during generation
- Interpretable: Can explain why a pattern is expected or unexpected
- Language-agnostic architecture: Currently focused on Python, but extensible to other languages
- Deployable: Exports models as executable Python modules with no external dependencies
The Problem It Solves
Modern LLM-based code generation has created a paradox: generated code is often syntactically correct but does not always follow stylistic idioms or algorithmic patterns.
Consider this scenario:
- An LLM generates a Python function that uses a valid but non-idiomatic pattern
- The code parses without errors
- Type checkers are satisfied
- But it violates your team’s coding conventions
- A human reviewer flags it for rewriting
This is the gap Markymarkov addresses. Today’s development teams face several related challenges:
- Style Enforcement at Scale
- Teams want consistency across codebases
- Manual code review can’t catch every style deviation
- Linters only check static rules, not learned patterns
- No way to programmatically enforce “this is how we write code”
- LLM Code Generation Quality
- LLMs generate valid code, but not idiomatic code
- Agents can’t distinguish between “correct” and “our style”
- Temperature and sampling can’t capture organizational conventions
- Training on diverse data means diverse output
- Validation Gaps
- Linters are rule-based (hard to maintain)
- Type checkers focus on types, not patterns
- AST visitors can find structure but miss intent
- No standard way to validate “does this follow our patterns?”
- Training Data Analysis
- Hard to understand what patterns dominate your codebase
- Difficult to identify anomalies or tech debt
- No way to measure consistency across teams
- Can’t extract “what makes our code unique?”
Markymarkov solves these by learning patterns from your code and providing:
- Real-time validation: Is this pattern expected?
- Confidence scores: How idiomatic is this code?
- Diagnostic information: Where does it diverge and why?
- Generation guidance: What patterns should come next?
Why Markov Chains for Code?
You might wonder: “Why not use deep learning, LSTMs, or transformers?” The answer reveals important truths about code and pattern matching.
- Code is Inherently Sequential
- Code flows through AST traversal order
- Control flow follows predictable paths
- Pattern composition is chain-like
- Markov chains were designed for exactly this kind of sequential modeling
- The Markov Property Holds for Code
- “The next pattern depends only on the current pattern” ✓
- Previous history can be captured in n-gram state
- Two-state or three-state context is often sufficient
- Rare cases benefit from higher-order models
- Computational Efficiency
- Training: O(n) pass through codebase
- Inference: O(1) hash table lookup
- Deep learning: GPU-heavy, requires infrastructure
- Marky: CPU-friendly, runs anywhere
- 1000 files/minute training speed (AST model)
- <1ms query latency (cached lookups)
- Interpretability
- Deep learning: “Why did it choose this?” → Inscrutable
- Markov chains: “What’s the probability?” → Auditable
- Can explain: “After X pattern, Y is expected 85% of the time”
- Developers can reason about confidence scores
- No black box; built on transparent math
- Data Efficiency
- Deep learning: Needs massive datasets
- Markov chains: Work well with codebases of any size
- Effective with 100+ files
- Scales gracefully to 100K+ files
- Graceful degradation (unknown patterns still provide value)
- Integration with LLMs
- LLMs already generate token sequences
- Markov models provide per-token guidance
- Natural fit for token-by-token validation
- Easy to integrate into sampling/generation loops
- No need to retrain LLM; validate output instead
Real-World Example: Why This Matters
Imagine your team uses these patterns:
- Guard clauses for early returns (not nested ifs)
- List comprehensions over manual loops
- Context managers for resource handling
- f-strings over .format()
An LLM might generate perfectly valid alternatives:
- Nested if-else chains (valid, not your style)
- For loops with manual appends (valid, not your style)
- Try-finally blocks (valid, not your style)
- .format() calls (valid, not your style)
Markymarkov learns these preferences from your codebase and can:
- During generation: Guide the LLM toward idiomatic patterns
- During validation: Flag deviations with confidence scores
- During review: Explain why code is unexpected
- For training: Help fine-tune models on your patterns
The Markymarkov Approach
Rather than debating “right” vs. “wrong” code style, Markymarkov asks: “What does this codebase do?” and “Does this code match those patterns?”
This shifts the conversation from:
- “That’s wrong” → “That’s not how we do it”
- “Use this style” → “Your code uses different patterns than training”
- “Bad practice” → “Low confidence match to learned patterns”
The result: objective, data-driven style validation that teams can understand and trust.
2. Core Concept
Two-Level Architecture (AST + Semantic)
Marky’s core innovation is its two-level validation architecture. This dual approach gives you the best of both worlds: structural correctness and stylistic idiomaticity.
Level 1: AST Patterns (Syntactic Correctness)
- Analyzes the Abstract Syntax Tree—how code is structurally composed
- Models transitions between node types: “What follows a function definition?”
- Answers: “Is the syntactic structure valid for this codebase?”
- Example patterns:
  - FunctionDef → Return (functions contain returns)
  - If → Assign (assignments can appear in if blocks)
  - For → Call (for loops can contain function calls)
- Think of it as: “Does the code structure make syntactic sense?”
Level 2: Semantic Patterns (Style & Idioms)
- Analyzes high-level coding idioms and patterns
- Models transitions between detected patterns: “What comes after a guard clause?”
- Answers: “Is the code written in the style we prefer?”
- Example patterns:
  - guard-clause → return-list (guard clauses often precede returns)
  - loop-filter → process-item (filtered loops often process items)
  - init-method → assign (constructors often assign properties)
- Think of it as: “Does this code match how we write code?”
Why Two Levels?
The two-level approach solves a fundamental problem: syntax and style are orthogonal concerns.
Consider this example: Both of these are syntactically valid:
# Style A: Nested conditionals
def validate(items):
    if items:
        if len(items) > 0:
            for item in items:
                if item.valid:
                    process(item)
            return True
    return False

# Style B: Guard clauses
def validate(items):
    if not items:
        return False
    valid_items = [item for item in items if item.valid]
    process_all(valid_items)
    return True
Both parse successfully. Both produce valid ASTs. But they’re very different in:
- Readability (guard clauses are clearer)
- Idiomaticity (Python community prefers B)
- Maintainability (B is easier to modify)
AST alone can’t distinguish these. Both have valid structures. You need semantic analysis to capture the how and why of code organization.
Conversely, semantic analysis alone can’t catch structure errors. A semantic pattern might be recognized, but applied to syntactically invalid code. AST validates the foundation.
Together, they form a powerful validation layer:
- AST ensures: “The code is well-formed”
- Semantic ensures: “The code follows our conventions”
- Combined: “The code is correct and idiomatic”
How Markov Models Learn Code Patterns
The process of learning happens in three stages:
Stage 1: Code Extraction
Input: Python codebase (100s-1000s of files)
┗ Parse each file to AST
┗ Extract semantic patterns (52 high-level idioms)
┗ Output: Lists of patterns and AST sequences
For example, a function might produce:
AST sequence: [Module, FunctionDef, FunctionDef, Return, Return]
Semantic sequence: [function-transformer, guard-clause, return-list, return-none]
Location tracking: [(line 10, col 4), (line 12, col 8), ...]
Stage 2: N-gram Creation The sequences are converted to n-grams (chains of N consecutive states):
For order=2 with semantic patterns:
Sequence: [function-transformer, guard-clause, return-list, return-none]
2-grams:
- (function-transformer, guard-clause) → next is return-list
- (guard-clause, return-list) → next is return-none
Each n-gram becomes a key in a frequency table:
{
('function-transformer', 'guard-clause'): {
'return-list': 3,
'return-none': 1,
'string-format': 1,
},
('guard-clause', 'return-list'): {
'return-none': 5,
},
}
Stage 3: Probability Calculation From frequencies, we calculate probabilities:
P(return-list | function-transformer, guard-clause) = 3 / (3+1+1) = 0.60
P(return-none | function-transformer, guard-clause) = 1 / (3+1+1) = 0.20
P(string-format | function-transformer, guard-clause) = 1 / (3+1+1) = 0.20
These probabilities become confidence scores during validation. A probability of 0.60 means: “In training data, when we see this context, the next state was return-list 60% of the time.”
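To make the mechanics concrete, here is a minimal sketch (standard library only, not the actual Markymarkov training code) of how a pattern sequence could be turned into 2-gram counts and then per-context probabilities:

from collections import Counter, defaultdict

def build_transition_table(sequence, order=2):
    """Count (context → next state) frequencies, then normalize into probabilities."""
    counts = defaultdict(Counter)
    for i in range(len(sequence) - order):
        context = tuple(sequence[i:i + order])   # e.g. ('function-transformer', 'guard-clause')
        counts[context][sequence[i + order]] += 1

    table = {}
    for context, next_counts in counts.items():
        total = sum(next_counts.values())
        table[context] = {state: n / total for state, n in next_counts.items()}
    return table

patterns = ['function-transformer', 'guard-clause', 'return-list', 'return-none']
print(build_transition_table(patterns, order=2))
# {('function-transformer', 'guard-clause'): {'return-list': 1.0},
#  ('guard-clause', 'return-list'): {'return-none': 1.0}}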
Transition Example
Let’s trace through a real example. Given training code:
def process_data(items):
    if not items:                                 # ← guard-clause pattern detected
        return None                               # ← return-none pattern
    results = [x.transform() for x in items]      # ← return-computed pattern
    return results                                # ← return-list pattern
Semantic patterns extracted: [guard-clause, return-none, return-computed, return-list]
2-grams learned:
guard-clause → return-none (confidence: 0.8)
return-none → return-computed (confidence: 0.3)
return-computed → return-list (confidence: 0.9)
Later, when validating generated code with sequence [guard-clause, return-none, return-computed, return-list]:
- ✓ guard-clause → return-none: Matches! (0.8 confidence)
- ✓ return-none → return-computed: Unusual but known (0.3 confidence)
- ✓ return-computed → return-list: Matches! (0.9 confidence)
Result: Valid code, moderate-to-high confidence.
From Training to Deployment
The journey from your codebase to real-time validation has four steps:
Step 1: Training (Offline, One-Time)
$ markymarkov train /path/to/codebase models/
- Scans all Python files
- Builds AST and semantic models
- Calculates transition probabilities
- Duration: ~1 minute for 1000 files
Step 2: Export (One-Time)
# Models are automatically exported as:
# models/ast_model.py
# models/semantic_model.py
Each is a Python file containing:
- Pre-calculated probability tables
- Pattern definitions
- No external dependencies
- ~5-50 KB per model
Example structure:
# semantic_model.py
TRANSITIONS = {
('guard-clause', 'return-none'): {
'return-computed': 0.8,
'return-list': 0.15,
'function-transformer': 0.05,
},
# ... 100s more transitions ...
}
MODEL_METADATA = {
'order': 2,
'total_transitions': 847,
'unique_patterns': 23,
}
Step 3: Loading (Startup, <100ms)
from models.semantic_model import TRANSITIONS, MODEL_METADATA
model = MarkovCodeGuide.from_table(TRANSITIONS, MODEL_METADATA)
- Just Python imports, no parsing
- Models load in milliseconds
- Fits in memory (1-50 MB)
- Ready for validation
Step 4: Validation (Real-Time, <1ms per query)
# During code generation
confidence = model.check_transition(current_pattern, next_pattern)
# confidence = 0.8 (80% match to training data)
- Hash table lookup: O(1)
- Sub-millisecond latency
- Cache-friendly
- Can handle 50K+ queries/second (a lookup sketch follows this list)
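For illustration, the deployed lookup can be a few lines over the exported table. The sketch below assumes the TRANSITIONS dictionary shape shown earlier and stands in for whatever the real check_transition does internally:

# Assumed shape of the exported table (see the semantic_model.py example above)
TRANSITIONS = {
    ('guard-clause', 'return-none'): {
        'return-computed': 0.8,
        'return-list': 0.15,
        'function-transformer': 0.05,
    },
}

def check_transition(context, next_pattern, transitions=TRANSITIONS):
    """Return P(next_pattern | context), or 0.0 if the pair was never seen in training."""
    return transitions.get(tuple(context), {}).get(next_pattern, 0.0)

print(check_transition(('guard-clause', 'return-none'), 'return-computed'))  # 0.8
print(check_transition(('guard-clause', 'return-none'), 'loop-filter'))      # 0.0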
The Deployment Advantage
Notice that once trained, Markymarkov needs:
- No Python AST parser (patterns pre-extracted)
- No semantic analyzer (patterns pre-detected)
- No machine learning framework (probabilities pre-calculated)
- Just a Python dict and some math
This makes Marky:
- Fast: No parsing overhead
- Portable: Works in any Python environment
- Reliable: Deterministic, no randomness
- Versionable: Export as .py files, commit to git
- Debuggable: Can inspect transition tables manually
The Full Pipeline
Putting it together:
TRAINING PHASE (Offline, One-Time)
├─ Codebase → Parser → AST/Semantic Patterns
├─ Patterns → N-gram Creator → Transition Frequencies
├─ Frequencies → Probability Calculator → Confidence Scores
└─ Export → Python Module (models/semantic_model.py)
INFERENCE PHASE (Online, Real-Time)
├─ Load Model (import models/semantic_model.py)
├─ Generated Code → Pattern Extractor → Pattern Sequence
├─ Sequence → N-gram Splitter → (context, next)
├─ Lookup in TRANSITIONS → Confidence Score
└─ Return Score to LLM Agent or Validator
This pipeline gives you the best of both worlds:
- Learning power: Captures real patterns from your code
- Runtime efficiency: No expensive computation at inference time
- Transparency: Understand why code matches or doesn’t match
- Flexibility: Retrain anytime your codebase evolves
3. Architecture Deep Dive
Level 1: AST Patterns (Syntactic Correctness)
The AST (Abstract Syntax Tree) level operates at the structural foundation of code. It answers the question: “Is this code structurally valid for our codebase?”
How AST Extraction Works
Python’s ast module parses code into a tree representation where each node represents a syntactic construct:
Module
├─ FunctionDef (name='process')
│ ├─ arguments
│ ├─ If
│ │ ├─ Compare
│ │ └─ Return
│ └─ For
│ ├─ expr
│ └─ Expr (Call)
└─ FunctionDef (name='validate')
├─ Return
└─ Return
Markymarkov extracts parent-child transitions from this tree:
(Module, FunctionDef)
(FunctionDef, If)
(FunctionDef, For)
(If, Return)
(For, Expr)
(Expr, Call)
...
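The extraction itself needs nothing beyond Python's standard ast module. A small sketch (not the production extractor) that counts parent-to-child node-type transitions:

import ast
from collections import Counter

def extract_ast_transitions(source: str) -> Counter:
    """Count (parent_node_type, child_node_type) pairs in a piece of Python source."""
    transitions = Counter()
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            transitions[(type(parent).__name__, type(child).__name__)] += 1
    return transitions

sample = """
def validate(x):
    if not x:
        return False
    return True
"""
for pair, count in extract_ast_transitions(sample).most_common(5):
    print(pair, count)
# e.g. ('Module', 'FunctionDef') 1, ('FunctionDef', 'If') 1, ('If', 'Return') 1, ...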
What This Captures
AST patterns capture structural rules about valid code composition:
- What can contain what? (FunctionDef can contain Return)
- What sequences are valid? (If blocks can be followed by For loops)
- What’s never valid? (Return outside function)
Common AST transitions you’d see in typical code:
(FunctionDef, Return) ✓ Functions can return values
(If, Assign) ✓ Can assign in if blocks
(For, If) ✓ Can nest if in for loop
(FunctionDef, ClassDef) ✗ Class defined inside a function (valid Python, but rare here)
(Return, FunctionDef) ✗ Function defined right after a return (unreachable, never seen)
Concrete Example
Let’s say your codebase has these patterns frequently:
def validate(x):
    if not x:
        return False
    return True

def process(items):
    if not items:
        return []
    return [x.upper() for x in items]

def transform(data):
    if isinstance(data, str):
        return data.strip()
    if data is None:
        return ""
    return str(data)
Markymarkov extracts:
(Module, FunctionDef) - appears 3 times
(FunctionDef, If) - appears 3 times
(If, Return) - appears 4 times
(If, If) - appears 1 time (nested ifs)
(Return, FunctionDef) - appears 2 times
Transition probabilities:
P(If | FunctionDef) = 3/3 = 1.0 (very common)
P(Return | If) = 4/4 = 1.0 (very common)
P(If | If) = 1/3 = 0.33 (less common)
P(ClassDef | FunctionDef) = 0 (never seen)
Later, when validating generated code with structure:
def my_func():
    if condition:
        return result
AST validation checks:
- ✓ Module → FunctionDef (0.99 confidence)
- ✓ FunctionDef → If (1.0 confidence)
- ✓ If → Return (1.0 confidence)
Result: Structurally valid and idiomatic for this codebase.
Why AST Alone Isn’t Enough
Two functions with identical AST structure:
# Version A
def validate(items):
    valid = []
    for item in items:
        if item.ok:
            valid.append(item)
    return valid

# Version B
def validate(items):
    return [item for item in items if item.ok]
Both have AST:
FunctionDef → Return
For/Comprehension → If
But Version B is more idiomatic Python. AST can’t tell you that.
Performance Characteristics
AST extraction is highly efficient:
- Single pass through code
- Linear time complexity O(n)
- Training speed: ~1000 files/minute
- Model size: 10-50 MB for typical codebases
- Lookup: O(1) hash table access, <1 microsecond
Level 2: Semantic Patterns (Code Style & Idioms)
While AST validates structure, semantic patterns validate intent and style. This level asks: “Does this code follow how we write code?”
How Semantic Pattern Detection Works
Semantic patterns are high-level abstractions detected by analyzing code behavior:
# Pattern: GUARD_CLAUSE
if not items:
    return []

# Pattern: RETURN_NONE
if condition:
    return None

# Pattern: LOOP_FILTER
for item in items:
    if item.valid:
        process(item)

# Pattern: LIST_COMPREHENSION
[x.transform() for x in items if x.valid]

# Pattern: CONTEXT_MANAGER
with open(file) as f:
    content = f.read()
Unlike AST (which just sees structure), semantic analysis understands:
- What the code is trying to do
- Whether it uses idiomatic patterns
- What intent the pattern expresses (a detection sketch follows this list)
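To illustrate how such detection could work, here is a toy detector for just two of the patterns (guard-clause and list-comprehension); the real detector covers 52+ patterns with more careful rules:

import ast

def detect_patterns(source: str) -> list:
    """Toy detector: recognizes guard clauses and list comprehensions from the AST."""
    patterns = []
    for node in ast.walk(ast.parse(source)):
        # guard-clause: an `if` whose body is a single early return
        if isinstance(node, ast.If) and len(node.body) == 1 and isinstance(node.body[0], ast.Return):
            patterns.append('guard-clause')
        # list-comprehension: any ListComp node
        elif isinstance(node, ast.ListComp):
            patterns.append('list-comprehension')
    return patterns

print(detect_patterns("""
def process(items):
    if not items:
        return []
    return [x.upper() for x in items]
"""))
# ['guard-clause', 'list-comprehension']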
The 52+ Semantic Patterns
Markymarkov detects and tracks 52+ distinct semantic patterns across these categories:
Control Flow Idioms (8 patterns)
- if-none-check: if x is None:
- if-not-none: if x is not None:
- if-empty-check: if not x: or if len(x) == 0:
- if-type-check: isinstance(x, Type)
- guard-clause: Early return for invalid input
- ternary-expression: x if condition else y
- boolean-return: Direct return of boolean expression
- fallthrough: Implicit None return
Loop Patterns (9 patterns)
- loop-accumulate: Collect items: result.append(item)
- loop-filter: Conditional processing: if condition: process(item)
- loop-transform: Transform items: yield transform(item)
- loop-enumerate: for i, item in enumerate(items):
- loop-zip: for x, y in zip(a, b):
- loop-dict-items: for k, v in dict.items():
- loop-dict-keys: for k in dict.keys():
- loop-dict-values: for v in dict.values():
- loop-reverse: for item in reversed(items):
Return Patterns (6 patterns)
- return-none: return None or bare return
- return-bool: return True or return False
- return-list: return [] or return [value]
- return-dict: return {} or return {k: v}
- return-computed: return a + b or return func(x)
- return-literal: return "string" or return 42
Data Structure Patterns (10 patterns)
- init-empty-list: x = []
- init-empty-dict: x = {}
- init-empty-set: x = set()
- init-counter: x = 0
- append-to-list: list.append(item)
- extend-list: list.extend(items)
- dict-get-default: dict.get(key, default)
- dict-setdefault: dict.setdefault(key, value)
- unpacking: x, y = data or *args
- string-concatenation: a + b or f"{a}{b}"
Comprehension Patterns (4 patterns)
- list-comprehension: [x for x in items if condition]
- dict-comprehension: {k: v for k, v in items}
- set-comprehension: {x for x in items}
- generator-expression: (x for x in items)
String Patterns (3 patterns)
- string-format-fstring: f"{x} {y}"
- string-format-method: "{} {}".format(x, y)
- string-format-join: "-".join(items)
Function/Class Patterns (7 patterns)
- function-validator: Checks and returns bool
- function-transformer: Transforms input to output
- function-factory: Creates and returns objects
- init-method: __init__(self, ...)
- property-getter: @property def x(self):
- class-method: @classmethod def ...
- static-method: @staticmethod def ...
Error Handling Patterns (5 patterns)
- try-except-pass: Suppress exceptions silently
- try-except-log: Log and continue
- try-except-reraise: Re-raise after handling
- try-finally: Cleanup code
- context-manager: with statement
Miscellaneous Patterns (4 patterns)
- import-statement: Module imports
- logging-call: logger.info(), etc.
- assertion: assert condition
- comment: Inline documentation
Example: Semantic Pattern Detection
Consider this function:
def find_user(user_id):
    if user_id is None:               # ← if-none-check
        return None                   # ← return-none
    user = database.get(user_id)      # ← database lookup
    if not user:                      # ← if-empty-check
        return None                   # ← return-none
    return user                       # ← return-computed
Semantic sequence detected: [if-none-check, return-none, if-empty-check, return-none, return-computed]
This pattern is classic: “validate inputs early, return early if invalid, then return the result.”
Why Semantic Matters
Two functions that do the same thing:
# Style A: Nested checks
def process(data):
    if data:
        if isinstance(data, list):
            if len(data) > 0:
                return [item.process() for item in data]
    return None

# Style B: Guard clauses (idiomatic Python)
def process(data):
    if not data or not isinstance(data, list) or len(data) == 0:
        return None
    return [item.process() for item in data]

# Style C: Even better (Python idiom)
def process(data):
    if not isinstance(data, list) or not data:
        return None
    return [item.process() for item in data]
All three are syntactically valid (all produce valid ASTs). But:
- Style A: Nested checks (less idiomatic)
- Style B: Guard clause (idiomatic)
- Style C: Guard clause + idiomatic ordering (most idiomatic)
Only semantic analysis can distinguish these.
Practical Impact
For code generation guidance, semantic patterns are crucial because they:
- Capture how your team prefers to solve problems
- Enable pattern-based suggestions
- Explain code at a higher level
- Match what developers reason about
- Drive meaningful validation signals
How They Work Together
The magic of Markymarkov comes from combining these two levels:
Scenario 1: Valid AST, Unknown Semantic Pattern
def validate(x):
    if x is None:
        return None
    return x.process()
- AST: ✓ Valid (FunctionDef → If → Return)
- Semantic: if-none-check → return-none → return-computed
- Confidence: Medium-to-high (recognized pattern)
Scenario 2: Valid AST, Unusual Semantic Pattern
def process(items):
    results = {}
    for item in items:
        try:
            results[item.id] = item.process()
        except Exception:
            pass
    return results
- AST: ✓ Valid
- Semantic: init-empty-dict → loop-filter → try-except-pass → return-dict
- Unusual but valid (low confidence in semantic)
- AST says “OK”, semantic says “uncommon”
Scenario 3: Syntactically Valid, Semantically Suspicious
def validate(items):
    if len(items) > 0:
        if isinstance(items, list):
            if items is not None:
                return items
    else:
        return None
- AST: ✓ Valid (just nested ifs)
- Semantic: Detects poor pattern composition
- Confidence: Low (guards are unusual for this structure)
- Even though syntactically OK, pattern matches poorly
Fallback Mechanism
When confidence is uncertain in one level, the other provides context:
# If AST says "unknown structure"
# Check: Does semantic pattern exist? If yes, probably OK.
# → Higher overall confidence
# If semantic says "unknown pattern"
# Check: Is AST structure valid? If yes, probably OK.
# → Higher overall confidence
Combined Confidence Scoring
Final confidence combines both:
overall_confidence = (ast_confidence × 0.4) + (semantic_confidence × 0.6)
Or for explicit validation:
Valid = (AST passes) AND (Semantic acceptable OR AST has >0.8 confidence)
Confidence = weighted_average(ast_score, semantic_score)
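A minimal sketch of this scoring, assuming the illustrative 0.4/0.6 weights above and an assumed "acceptable" semantic threshold of 0.3:

def combined_confidence(ast_score, semantic_score, ast_weight=0.4, semantic_weight=0.6):
    """Weighted average of the two validation levels."""
    return ast_score * ast_weight + semantic_score * semantic_weight

def is_valid(ast_passes, ast_score, semantic_score, semantic_threshold=0.3, ast_override=0.8):
    """Valid = (AST passes) AND (semantic acceptable OR AST is very confident)."""
    return ast_passes and (semantic_score >= semantic_threshold or ast_score > ast_override)

print(combined_confidence(0.9, 0.5))    # 0.66
print(is_valid(True, 0.9, 0.1))         # True (semantic is low, but AST confidence exceeds 0.8)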
This gives you:
- Structural guarantee: Code parses and flows correctly (AST)
- Style assurance: Code follows learned idioms (Semantic)
- Interpretability: Can explain both dimensions separately
- Flexibility: Can weight them differently per use case
Real-World Example
Given training code:
def find_users(search_term):
    if not search_term:                           # ← guard-clause
        return []                                 # ← return-list
    results = []
    for user in database.all():                   # ← loop-iterate
        if search_term in user.name:              # ← loop-filter
            results.append(user)                  # ← loop-accumulate
    return results                                # ← return-list
Pattern sequence: [guard-clause, return-list, loop-iterate, loop-filter, loop-accumulate, return-list]
When validating generated code:
def find_users(search_term):
    if not search_term:
        return []
    return [u for u in database.all() if search_term in u.name]
Pattern sequence: [guard-clause, return-list, list-comprehension]
Checking:
- AST: ✓ Valid (FunctionDef → If → Return → Comprehension)
- Semantic: guard-clause → return-list → list-comprehension
- Different from training! (training used loop-filter; the list comprehension filters inline)
- But still valid pattern chain
- Confidence: 0.7 (valid but different style)
Result: Accepted with medium confidence (semantically different but AST valid)
4. Practical Examples
Training Markymarkov on Your Codebase
markymarkov train /path/to/code models/ --model-type both
- Analyzes all Python files
- Builds AST and semantic models
- Exports as Python modules
- <1 minute for 1000 files
Using Markymarkov to Validate Generated Code
markymarkov validate models/semantic_model.py generated_code.py
- Shows pattern sequences
- Highlights known vs unknown transitions
- Provides coverage percentage
- Suggests expected alternatives
Real-World Validation Output with Diagnostics
> uv run markymarkov validate examples/pytest/semantic_model.py src/__main__.py
Built markymarkov @ file:///.../marky
Uninstalled 1 package in 0.21ms
Installed 1 package in 0.45ms
Loading model: examples/pytest/semantic_model.py
Validating code: src/__main__.py
Extracted 71 semantic patterns
First 20 patterns: ['init-method', 'function-transformer', 'if-empty-check', 'return-none', 'function-transformer', 'guard-clause', 'return-list', 'guard-clause', 'string-format', 'return-computed', 'function-transformer', 'if-empty-check', 'return-none', 'context-manager', 'string-format', 'context-manager', 'string-format', 'function-transformer', 'return-computed', 'loop-enumerate']
Model order: 2
Model has 211 pattern sequences
Validation Result (Semantic Model):
Valid: True
Confidence: 0.373
Pattern sequences checked: 15
Known transitions: 9/15
✓ Matching sequences (9):
1. function-transformer → if-empty-check → return-none (0.423) @ line 118:12
2. if-empty-check → return-none → function-transformer (0.370) @ line 135:4
3. return-none → function-transformer → guard-clause (0.275) @ line 138:12
4. guard-clause → return-list → guard-clause (1.000) @ line 143:8
5. string-format → return-computed → function-transformer (0.714) @ line 149:4
6. return-computed → function-transformer → if-empty-check (0.056) @ line 158:8
7. function-transformer → if-empty-check → return-none (0.423) @ line 160:12
✗ Non-matching sequences (6):
1. init-method → function-transformer → if-empty-check @ line 116:8
Expected one of: return-computed, if-type-check, unpacking
2. function-transformer → guard-clause → return-list @ line 139:16
Expected one of: return-computed, return-none, return-bool
3. return-list → guard-clause → string-format
Expected one of: return-list
4. Unknown sequence: guard-clause → string-format @ line 145:12
5. if-empty-check → return-none → context-manager
Expected one of: function-transformer, return-computed, guard-clause
... and 1 more
Summary:
Unique patterns found: 23
Coverage: 9/15 transitions (60.0%)
Issues: 4 unexpected, 2 unknown context
Understanding Coverage & Confidence Scores
- Coverage: % of transitions that match training data
- Confidence: Probability-weighted score (0.0-1.0); see the sketch after this list
- High coverage + high confidence = Very idiomatic
- Low coverage = Unusual but possibly valid style
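A small sketch of how these two numbers could be derived from per-transition scores (the threshold and values here are illustrative):

def summarize(transition_confidences, known_threshold=0.0):
    """Coverage = share of transitions seen in training; confidence = their average probability."""
    if not transition_confidences:
        return {'coverage': 0.0, 'confidence': 0.0}
    known = [c for c in transition_confidences if c > known_threshold]
    return {
        'coverage': len(known) / len(transition_confidences),
        'confidence': sum(transition_confidences) / len(transition_confidences),
    }

# Nine known transitions and six unseen ones, loosely mirroring the output above
scores = [0.42, 0.37, 0.28, 1.00, 0.71, 0.06, 0.42, 0.65, 0.55] + [0.0] * 6
print(summarize(scores))  # coverage = 9/15 = 0.6; confidence = mean of all fifteen scores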
5. 52 Semantic Patterns
Control Flow Patterns
- IF_NONE_CHECK: if x is None:
- IF_NOT_NONE: if x is not None:
- IF_EMPTY_CHECK: if not x: or if len(x) == 0:
- IF_TYPE_CHECK: if isinstance(x, Type):
- GUARD_CLAUSE: Early return for invalid cases
Loop Patterns
- LOOP_ACCUMULATE: result = []; for item in items: result.append(...)
- LOOP_TRANSFORM: for item in items: yield transform(item)
- LOOP_FILTER: for item in items: if condition: process(item)
- LOOP_ENUMERATE: for i, item in enumerate(items):
- LOOP_ZIP: for x, y in zip(a, b):
- LOOP_DICT_ITEMS: for k, v in dict.items():
Return Patterns
- RETURN_NONE: return None or bare return
- RETURN_BOOL: return True or return False
- RETURN_LIST: return [] or return [computed_value]
- RETURN_DICT: return {} or return {key: value}
- RETURN_COMPUTED: return a + b or return func(x)
Data Structure Patterns
- INIT_EMPTY_LIST: x = []
- INIT_EMPTY_DICT: x = {}
- INIT_COUNTER: x = 0
- APPEND_TO_LIST: list.append(item)
- DICT_UPDATE: dict.update(other)
- DICT_GET_DEFAULT: dict.get(key, default)
- UNPACKING: x, y = data
Error Handling Patterns
- TRY_EXCEPT_PASS: try: ...; except: pass
- TRY_EXCEPT_LOG: try: ...; except: logger.error(...)
- TRY_EXCEPT_RERAISE: try: ...; except: raise
- TRY_FINALLY: try: ...; finally: cleanup()
- CONTEXT_MANAGER: with open(file) as f:
Function/Class Patterns
- FUNCTION_VALIDATOR: Function that validates input
- FUNCTION_TRANSFORMER: Function that transforms data
- INIT_METHOD: __init__(self, ...)
- PROPERTY_GETTER: @property def x(self):
- CLASS_METHOD: @classmethod def ...
- STATIC_METHOD: @staticmethod def ...
Comprehension & Other Patterns
- LIST_COMPREHENSION: [x for x in items if ...]
- DICT_COMPREHENSION: {k: v for k, v in ...}
- GENERATOR_EXPRESSION: (x for x in items)
- TERNARY_EXPRESSION: x if condition else y
- STRING_FORMAT: f"{x} {y}" or x.format(y)
- LOGGING_CALL: logger.info(), logger.debug(), etc.
6. Integration with LLM Agents
Note: This chapter explores integration patterns between Markymarkov and LLM agents. While not yet implemented in practice, the architecture suggests several natural approaches worth considering.
Marky’s design lends itself to integration with LLM-based code generation agents. Rather than replacing the LLM, Markymarkov serves as a validation and guidance layer, helping agents generate more idiomatic code in real-time.
How Agents Use Markymarkov for Guidance
The integration pattern is straightforward: instead of a simple generate-then-review cycle, Markymarkov enables a continuous feedback loop during generation.
Step 1: Agent Starts Generation
User prompt: "Generate a Python function that validates user input"
↓
LLM begins token-by-token generation with temperature sampling
Step 2: Real-Time Validation
Generated tokens: [def, validate, (, user_input, ), :, \n, if, not, user_input, ...]
↓
After each statement, Markymarkov analyzes:
- AST structure validity
- Semantic pattern sequence
- Confidence score for next pattern
Step 3: Confidence-Based Feedback
Markymarkov output:
Current pattern: guard-clause
Next pattern options:
- return-none (0.85 confidence) ← high confidence
- return-bool (0.12 confidence) ← lower confidence
- loop-filter (0.03 confidence) ← very unusual
Recommendation: Steer toward return-none
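One way to produce that kind of ranking from the exported table (a sketch; the real model may expose a helper such as top_transitions for this, and the sample table below is hypothetical):

def top_transitions(transitions, context, k=3):
    """Return the k most likely next patterns for a given context, best first."""
    candidates = transitions.get(tuple(context), {})
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:k]

TRANSITIONS = {
    ('function-transformer', 'guard-clause'): {
        'return-none': 0.85, 'return-bool': 0.12, 'loop-filter': 0.03,
    },
}
print(top_transitions(TRANSITIONS, ('function-transformer', 'guard-clause')))
# [('return-none', 0.85), ('return-bool', 0.12), ('loop-filter', 0.03)]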
Step 4: Agent Adjusts Generation
Based on confidence scores, the agent decides:
- Continue: If high confidence (>0.7)
- Adjust temperature: If medium confidence (0.4-0.7)
- Use logit biasing: If low confidence (<0.4)
- Regenerate: If confidence below threshold
Example Flow
Here’s how a generation might proceed with Markymarkov validation:
INITIAL: Empty function stub
def validate_email(email):
ITERATION 1:
Generated: "if not email:"
Markymarkov analysis: guard-clause detected
Confidence: 0.92 (very common opening pattern)
Action: Continue with high confidence
ITERATION 2:
Generated: " return False"
Pattern after guard-clause: return-bool
Markymarkov analysis: return-bool confidence 0.8 (common after guards)
Markymarkov analysis: return-none confidence 0.15 (less common)
Action: Continue, both valid patterns
ITERATION 3:
Generated: " email_regex = r'...'"
Pattern: init-var (assignment)
Markymarkov analysis: After return-bool, return-var confidence 0.3
Markymarkov analysis: After return-bool, init-var confidence 0.15
Assessment: Medium-low confidence (unusual pattern chain)
Action: Alert agent, consider regeneration or steering
ITERATION 4:
Generated: " return email_regex.match(email) is not None"
Pattern: return-computed
Markymarkov analysis: After init-var, return-computed confidence 0.7
Assessment: Pattern sequence matches learned patterns
Action: Continue
RESULT:
def validate_email(email):
    if not email:
        return False
    return email_regex.match(email) is not None
Pattern sequence: [guard-clause, return-bool, init-var, return-computed]
Overall confidence: 0.65 average (valid but slightly unusual)
Temperature Sampling for Diversity
LLMs use temperature to control randomness in token selection. Marky’s confidence scores provide a natural signal for temperature adjustments.
How Temperature Works with LLM Sampling
Standard temperature sampling:
At each step, LLM computes logits for all possible next tokens:
logits = [0.8, 1.2, 0.3, -0.5, 2.1, ...]
Temperature=1.0 (default):
probabilities = softmax(logits)
Sample from probabilities (natural randomness)
Temperature=0.3 (low, deterministic):
probabilities = softmax(logits / 0.3)
Peaks sharpen, mostly samples highest probability token
Result: More predictable, less creative
Temperature=2.0 (high, creative):
probabilities = softmax(logits / 2.0)
Distribution flattens, samples more diverse tokens
Result: More creative, less predictable
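The scaling described above is plain math. A small sketch, independent of any particular LLM API:

import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits into sampling probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                                  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.8, 1.2, 0.3, -0.5, 2.1]
for t in (0.3, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
# At t=0.3 nearly all probability sits on the largest logit (2.1);
# at t=2.0 the distribution flattens and sampling becomes more diverse.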
Marky-Aware Temperature Tuning
Consider dynamic temperature adjustment based on confidence:
def adjust_temperature(marky_confidence):
    """
    Adjust generation temperature based on pattern confidence.

    High confidence → follow patterns closely
    Low confidence → explore alternatives
    """
    if marky_confidence > 0.8:
        return 0.3   # Follow learned patterns closely
    elif marky_confidence > 0.5:
        return 0.7   # Balanced exploration
    else:
        return 1.2   # High diversity, unusual patterns OK
Use Cases
Scenario 1: Style Enforcement Mode
Markymarkov confidence: 0.95 (return-list very common after guard-clause)
Temperature: 0.2 (low, enforce pattern)
Effect: Agent strongly prefers idiomatic code
Scenario 2: Experimental Code Generation
Markymarkov confidence: 0.4 (unusual pattern combination)
Temperature: 1.5 (high, explore alternatives)
Effect: Agent tries novel patterns while staying valid
Scenario 3: Adaptive Mode
Markymarkov confidence: varies dynamically
Temperature: adjusts per iteration
Effect: Agent balances idiomaticity with novelty
Logit Biasing for LLM Steering
Many LLM APIs (OpenAI, Anthropic, etc.) support logit biasing: adjusting the probability of specific tokens before sampling. Marky’s pattern knowledge maps naturally to this mechanism.
How Logit Biasing Works
Raw logits: [0.8, 1.2, 0.3, -0.5, 2.1, ...]
Token IDs: [123, 456, 789, 234, 567, ...]
With logit bias (boost token 456, suppress token 789):
bias = {456: +2.0, 789: -3.0}
adjusted_logits = [0.8, 3.2, -2.7, -0.5, 2.1, ...]
↑ ↑ ↑
unchanged boosted suppressed
Sample from adjusted distribution:
Token 456 becomes much more likely
Token 789 becomes much less likely
Marky-Driven Logit Biasing
Here’s an approach to recommend token biases:
def recommend_logit_bias(current_pattern, model_transitions):
    """
    Recommend token biases based on expected next patterns.
    """
    expected_next = model_transitions[current_pattern]

    # Boost tokens for high-confidence patterns
    positive_bias = {}
    for pattern, confidence in expected_next.items():
        if confidence > 0.7:
            tokens = pattern_to_tokens(pattern)
            for token in tokens:
                positive_bias[token] = confidence * 2.0

    # Suppress tokens for low-confidence patterns
    negative_bias = {}
    for pattern, confidence in expected_next.items():
        if confidence < 0.1:
            tokens = pattern_to_tokens(pattern)
            for token in tokens:
                negative_bias[token] = -3.0

    return {**positive_bias, **negative_bias}
Example: Steering Toward Guard Clauses
Suppose Markymarkov analysis shows: “After a function definition, guards have 0.85 confidence.”
# Generate logit bias for guard clause patterns
bias = {
    token_id("if"): +1.5,        # Boost "if" keyword
    token_id("not"): +1.2,       # Boost "not" (common in guards)
    token_id("is None"): +1.0,   # Boost None check
    # Suppress less idiomatic patterns
    token_id("while"): -2.0,     # Suppress while loops
    token_id("try"): -1.5,       # Suppress try blocks
}

# Pass to LLM API
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    logit_bias=bias,  # Steer generation toward guard clauses
)
Benefits of This Approach
- Precise control: Boost/suppress specific tokens or patterns
- Non-invasive: Works with any LLM that supports logit bias
- Interpretable: Clear explanation for token preferences
- Efficient: Minimal inference overhead
- Composable: Stack multiple biases for complex guidance
Real-Time Validation During Generation
While temperature and logit biasing shape generation proactively, real-time validation provides reactive feedback as code emerges.
Architecture
LLM Stream Generator
↓ (token by token)
Code Builder (accumulates tokens)
↓ (complete statement)
Pattern Detector
├─ Extract AST
├─ Extract Semantic Patterns
└─ Create N-grams
↓
Markov Model Validator
├─ Check AST transitions
├─ Check semantic transitions
└─ Calculate confidence
↓
Feedback Engine
├─ Report to agent/user
├─ Suggest corrections
└─ Recommend next patterns
Example: Real-Time Detection
def validate_stream(token_stream, model):
    """
    Validate code as it streams from LLM.
    """
    code = ""
    for token in token_stream:
        code += token

        # When we have a complete statement
        if is_complete_statement(code):
            # Extract patterns
            ast_patterns = extract_ast_patterns(code)
            semantic_patterns = extract_semantic_patterns(code)

            # Validate last transition
            if len(semantic_patterns) >= 2:
                prev_pattern = semantic_patterns[-2]
                curr_pattern = semantic_patterns[-1]

                confidence = model.check_transition(
                    prev_pattern,
                    curr_pattern
                )

                if confidence < CONFIDENCE_THRESHOLD:
                    yield {
                        'warning': f"Unusual pattern: {prev_pattern} → {curr_pattern}",
                        'confidence': confidence,
                        'expected': model.top_transitions(prev_pattern),
                    }

    yield {'valid': True, 'code': code}
Use Cases
1. Style Enforcement During Generation
Generated: "for item in items:\n try:"
Markymarkov feedback: loop-try-except pattern (confidence: 0.2)
Alert: "Unusual! After loop-iterate, try-except is uncommon"
"Expected: loop-filter (0.7), loop-transform (0.6)"
Action: Agent chooses to regenerate or accept the warning
2. Early Error Detection
Generated: "def func(): return\n print('unreachable')"
Markymarkov feedback: AST error detected - code after return
Alert: "Invalid AST structure! Code after return statement"
Action: Agent must regenerate
3. Coverage Tracking
As code generates, Markymarkov tracks:
Coverage so far: 45% of patterns matched training data
Current confidence: 0.62
Unique patterns used: 8/23
Alert: "Code is stylistically different from training"
"Consider: Adding more guard clauses, use comprehensions"
4. Guided Refinement
Agent generates initial code:
def process(data):
    results = []
    for item in data:
        try:
            results.append(item.process())
        except:
            pass
    return results
Markymarkov analysis:
Pattern: [init-empty-list, loop-filter, try-except-pass, return-list]
Confidence: 0.45 (valid but unusual combination)
Suggestion: "Consider rewriting with list comprehension?"
[item.process() for item in data if item.valid()]
This pattern has confidence: 0.85
Agent regenerates:
def process(data):
    return [item.process() for item in data if item.valid()]
New confidence: 0.88 ✓
Integration Patterns
Several integration patterns emerge naturally:
Pattern 1: Validation-Only
Agent generates code → Markymarkov validates → Report results
Simplest approach, no feedback loop.
Pattern 2: Temperature-Aware
Agent generates tokens → Markymarkov scores pattern → Adjust temperature
Agent regenerates with new temperature
Enables iterative refinement.
Pattern 3: Logit Bias Steering
Before generation, compute logit biases from Marky
Pass biases to LLM API
Generation steered toward idiomatic patterns from the start.
Pattern 4: Interactive Refinement
Agent generates skeleton code
User reviews with Markymarkov feedback
Agent refines based on confidence scores
Iterate until satisfied.
Pattern 5: Multi-Agent Validation
Agent A generates candidate code
Agent B (validator) uses Markymarkov to score
Agent A refines based on scores
Both agents converge toward idiomatic code.
Implementation Sketch
Here’s a sketch of how integration might look in practice:
from typing import Any, Dict
from models.semantic_model import MarkovCodeGuide
class MarkyGuidedAgent:
"""
Example of Markymarkov integration with an LLM agent.
"""
def __init__(self, marky_model_path: str, llm_client, temperature=0.7):
        self.marky = MarkovCodeGuide.load(marky_model_path)
self.llm = llm_client
self.temperature = temperature
self.generated_code = ""
self.pattern_history = []
def generate_with_guidance(
self,
prompt: str,
max_iterations: int = 3
    ) -> Dict[str, Any]:
"""
Generate code with Markymarkov guidance and iterative refinement.
Workflow:
1. Generate code with current temperature
2. Validate with Marky
3. Adjust temperature based on confidence
4. Regenerate if confidence is too low
5. Repeat until confident or max iterations reached
"""
best_result = None
best_confidence = 0.0
for iteration in range(max_iterations):
# Adjust temperature based on previous iteration's confidence
if iteration > 0:
self.temperature = self._adjust_temperature(best_confidence)
print(f"\n[Iteration {iteration + 1}]")
print(f" Temperature: {self.temperature:.2f}")
# Generate code
code = self._stream_generate(prompt)
# Validate with Marky
validation = self._validate_code(code)
print(f" Generated: {len(code)} chars")
print(f" Confidence: {validation['confidence']:.2f}")
print(f" Coverage: {validation['coverage']:.1%}")
# Store if best so far
if validation['confidence'] > best_confidence:
best_result = {
'code': code,
'validation': validation,
'temperature': self.temperature,
'iteration': iteration,
}
best_confidence = validation['confidence']
# Stop if confident enough
if best_confidence > 0.8:
print(f"\n Confidence threshold reached, stopping.")
break
return best_result
def _stream_generate(self, prompt: str) -> str:
"""Stream token generation from LLM."""
code = ""
response = self.llm.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=self.temperature,
stream=True,
)
for chunk in response:
if chunk.choices[0].delta.content:
token = chunk.choices[0].delta.content
code += token
return code
def _validate_code(self, code: str) -> Dict:
"""Validate with Markymarkov model."""
import ast as ast_module
try:
tree = ast_module.parse(code)
except SyntaxError as e:
return {
'valid': False,
'confidence': 0.0,
'coverage': 0.0,
'error': str(e),
}
# Extract patterns
patterns = self.marky.extract_patterns(code)
# Validate pattern transitions
transitions = []
for i in range(1, len(patterns)):
prev_pattern = patterns[i-1]
curr_pattern = patterns[i]
confidence = self.marky.check_transition(
prev_pattern,
curr_pattern
)
transitions.append({
'prev': prev_pattern,
'curr': curr_pattern,
'confidence': confidence,
})
# Calculate coverage and average confidence
valid_transitions = sum(
1 for t in transitions if t['confidence'] > 0.3
)
coverage = valid_transitions / max(len(transitions), 1)
avg_confidence = (
sum(t['confidence'] for t in transitions)
/ max(len(transitions), 1)
)
return {
'valid': True,
'confidence': avg_confidence,
'coverage': coverage,
'patterns': patterns,
'transitions': transitions,
}
def _adjust_temperature(self, confidence: float) -> float:
"""Dynamically adjust temperature based on Markymarkov confidence."""
if confidence > 0.8:
return 0.3 # Low temp: follow patterns closely
elif confidence > 0.6:
return 0.7 # Medium temp: balanced
elif confidence > 0.4:
return 1.0 # Normal temp: explore more
else:
return 1.5 # High temp: significant exploration
# Example usage:
"""
import openai
agent = MarkyGuidedAgent(
'models/semantic_model.py',
llm_client=openai.Client(),
)
result = agent.generate_with_guidance(
prompt="Write a function to validate email addresses",
max_iterations=3,
)
print(f"\nFinal Result:")
print(f"Code:\n{result['code']}")
print(f"Confidence: {result['validation']['confidence']:.2f}")
print(f"Iteration: {result['iteration'] + 1}")
"""
Design Considerations
- Feedback interval: Per token? Per statement? Per function?
- Multi-file generation: How to maintain context across files?
- Error recovery: Graceful handling of parse errors and edge cases
- Performance: Streaming validation overhead and caching strategies
- API compatibility: Adaptation layer for different LLM providers
Why This Integration Makes Sense
Combining Markymarkov with LLM agents offers several advantages:
- Consistency: Generated code more closely matches project conventions
- Reduced review cycles: Less stylistic feedback needed from humans
- Pattern awareness: Agent learns project patterns over time
- Flexibility maintained: LLM still free to generate novel code when appropriate
- Transparency: Confidence scores make decisions explainable
Open Questions
Some questions worth exploring through implementation:
- How much does temperature/logit biasing improve code quality in practice?
- What’s the performance impact of real-time validation?
- How to handle partial code (incomplete statements) during streaming?
- Can iterative refinement improve patterns over multiple generations?
- How well do confidence scores correlate with actual code quality?
- What’s the optimal balance between enforcement and exploration?
The answers to these questions would come from experimentation and real-world usage.
7. Performance Characteristics
Marky’s performance has been benchmarked against the Python 3.13 standard library (596 files, 20.23 MB of code). These are real-world numbers, not synthetic benchmarks.
Training Performance
Training speed is crucial for iterative development. Markymarkov processes code quickly:
Dataset Characteristics
- Files analyzed: 596 Python files
- Total code size: 20.23 MB (21,210,750 bytes)
- Average file size: 34.8 KB
- Code complexity: Real-world Python stdlib (highly varied patterns)
Training Speed
Both models (AST + Semantic):
Total time: 5.65 seconds
Throughput: 105.6 files/second
Throughput: 3.58 MB/second
Per-file average: 9.5 ms
AST model only:
Total time: 2.60 seconds
Throughput: 229.2 files/second
Throughput: 7.78 MB/second
Semantic model only:
Total time: 2.81 seconds
Throughput: 212.0 files/second
Throughput: 7.19 MB/second
What This Means:
- A 1000-file codebase trains in ~10 seconds
- A 10,000-file codebase trains in ~2 minutes
- Iterative development is practical (retrain frequently)
- CI/CD integration is feasible
Model Sizes
The trained models are compact:
AST model: 537.1 KB
Semantic model: 277.5 KB
Combined: 814.6 KB
Compression ratio: 25.4x (20 MB → 815 KB)
This makes models:
- Easy to version control (sub-1MB files)
- Fast to load (<100ms)
- Negligible memory footprint
- Distributable in repos
Validation Performance
Validation speed determines whether Markymarkov can be used in real-time workflows:
End-to-End Validation (including subprocess startup):
AST Validation:
Median: 216.9 ms
Mean: 218.8 ms
P95: 238.3 ms
Range: 207.1–246.8 ms
Semantic Validation:
Median: 211.4 ms
Mean: 380.0 ms
P95: 1233.5 ms
Range: 187.9–1251.7 ms
Note: Higher variance in semantic validation due to
pattern complexity in different code sections.
What This Means:
- Validation completes in ~200ms for typical files
- Fast enough for save-on-edit workflows
- Acceptable for CI/CD pre-commit hooks
- Could be optimized further with persistent process
The subprocess overhead dominates these times. In a long-running process (like an IDE plugin), validation would be 10-50x faster.
Query Latency
The core model lookup operations are extremely fast:
Model Transition Lookup
Warm lookup (cached in memory):
Mean: <1 microsecond
Median: <1 microsecond
P99: <2 microseconds
Cold lookup (first access):
Mean: ~1-2 microseconds
This is effectively instant. The lookup is just a Python dict access:
confidence = transitions.get((prev_pattern, curr_pattern), {})
# O(1) hash table lookup
At these speeds:
- 1,000,000 lookups/second per core
- No noticeable latency in real applications
- Cache warming unnecessary (already fast enough)
Memory Footprint
Markymarkov is memory-efficient:
Model Loading
AST model in memory: ~1-2 MB (loaded)
Semantic model in memory: ~500 KB - 1 MB (loaded)
Both models: ~2-3 MB total
Python overhead: ~10-20 MB (interpreter)
Total process: ~15-25 MB
Scaling Characteristics
Model size grows sub-linearly with codebase size:
100 files → ~100 KB model
1,000 files → ~800 KB model
10,000 files → ~5 MB model (estimated)
100,000 files → ~30 MB model (estimated)
The sub-linear growth occurs because:
- Pattern vocabulary is bounded (finite number of patterns)
- Transition frequency converges (common patterns dominate)
- Many codebases have repeated idioms
Throughput & Scalability
Training Scalability
Training time scales linearly with codebase size:
O(n) where n = number of AST nodes
Observed:
596 files (20.2 MB, 34.8 KB average) → 5.65 s
Predicted for 6,000 files: ~56s
Predicted for 60,000 files: ~9 minutes
Bottlenecks:
- AST parsing (Python’s ast module)
- Pattern detection (Python visitor pattern)
- File I/O
Not bottlenecks:
- Model building (accumulating counters)
- Serialization (writing Python dicts)
Validation Scalability
Validation time scales with file size:
O(m) where m = lines of code in file
Typical:
100 LOC file: ~50ms
1,000 LOC file: ~200ms
10,000 LOC file: ~2s
For very large files (>5000 LOC), consider:
- Chunked validation (validate functions separately)
- Parallel validation (multiple files in parallel)
- Incremental validation (only changed code)
Parallel Processing
Training is embarrassingly parallel:
Single-threaded: 105.6 files/second
8 cores: ~850 files/second (estimated)
16 cores: ~1,700 files/second (estimated)
The current implementation is single-threaded, but parallelization is straightforward (process files independently, merge results).
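A sketch of that approach using only the standard library: count transitions per file in worker processes, then merge the resulting Counter objects. The per-file extractor mirrors the AST sketch shown earlier and is not the actual Markymarkov trainer:

import ast
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def count_file_transitions(path: str) -> Counter:
    """Per-file work: parse and count (parent, child) AST transitions."""
    counts = Counter()
    try:
        tree = ast.parse(Path(path).read_text(encoding='utf-8'))
    except (SyntaxError, UnicodeDecodeError, ValueError):
        return counts                      # skip files that don't parse
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            counts[(type(parent).__name__, type(child).__name__)] += 1
    return counts

def train_parallel(root: str, workers: int = 8) -> Counter:
    """Map files across worker processes, then merge the independent counts."""
    files = [str(p) for p in Path(root).rglob('*.py')]
    merged = Counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for counts in pool.map(count_file_transitions, files, chunksize=16):
            merged.update(counts)
    return merged

if __name__ == '__main__':
    print(train_parallel('/usr/lib/python3.13', workers=8).most_common(5))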
Real-World Performance Examples
Small Project (50 files, 2 MB)
Training: ~0.5 seconds
Model size: ~100 KB
Validation: ~200ms per file
Total setup: <1 second
Medium Project (500 files, 20 MB)
Training: ~5 seconds
Model size: ~800 KB
Validation: ~200ms per file
Total setup: ~5 seconds
Large Project (5,000 files, 200 MB)
Training: ~50 seconds
Model size: ~5 MB
Validation: ~200ms per file
Total setup: ~1 minute
Very Large Project (50,000 files, 2 GB)
Training: ~8 minutes (estimated)
Model size: ~30 MB (estimated)
Validation: ~200ms per file
Total setup: ~10 minutes (one-time)
Optimizations & Tuning
Training Speed
To optimize training:
1. Use faster disk I/O (SSD vs HDD: 2-3x improvement)
2. Parallelize file processing (linear speedup)
3. Use lower n-gram order (order=1 is 2x faster)
4. Filter irrelevant files (tests, generated code)
5. Cache parsed ASTs (if re-training frequently)
Validation Speed
To optimize validation:
1. Run in persistent process (avoid subprocess overhead)
2. Pre-load models at startup (100ms → 1ms validation)
3. Validate incrementally (only changed code)
4. Use AST cache (if validating same file repeatedly)
5. Parallel validation (multiple files)
Memory Usage
To reduce memory:
1. Use lower n-gram order (order=1: 50% less memory)
2. Prune rare transitions (threshold=2: 30% less memory)
3. Quantize probabilities (float32 → int8: 75% less memory)
4. Stream model loading (don't load all at once)
Comparison with Alternatives
vs. Linters (pylint, flake8)
Markymarkov training: 5.6s for 596 files
Pylint: ~60s for 596 files (10x slower)
Flake8: ~15s for 596 files (3x slower)
Markymarkov validation: ~200ms per file
Pylint: ~300ms per file
Flake8: ~100ms per file
Advantage: Markymarkov learns from your code, not fixed rules
Trade-off: Markymarkov requires initial training step
vs. Type Checkers (mypy)
Markymarkov validation: ~200ms per file
Mypy: ~500ms per file (cold), ~100ms (warm)
Advantage: Markymarkov checks patterns, not types (complementary)
Trade-off: Different problem domain
vs. Deep Learning Models
Markymarkov training: 5.6s for 596 files
CodeBERT training: Hours/days
GPT fine-tuning: Days/weeks
Markymarkov inference: <1ms per lookup
CodeBERT inference: ~100ms per prediction
GPT inference: ~1s per completion
Advantage: Markymarkov is 100-1000x faster
Trade-off: Markymarkov doesn't understand semantics, only patterns
Performance Recommendations
For best performance in different scenarios:
Development/IDE Integration
- Use persistent process (avoid subprocess overhead)
- Pre-load models at startup
- Validate on save (200ms is acceptable)
- Update models nightly (don't retrain on every change)
CI/CD Pipelines
- Train models in dedicated step (cache for reuse)
- Validate changed files only (not entire codebase)
- Run in parallel (one validation per core)
- Fail fast (stop on first validation error)
Code Review Tools
- Load models once (per review session)
- Validate diffs only (not unchanged code)
- Show confidence scores (help reviewers prioritize)
- Cache results (same file validated multiple times)
Large-Scale Analysis
- Parallelize training (split files across workers)
- Stream processing (don't load all files at once)
- Sample validation (validate subset for estimates)
- Incremental updates (retrain only changed modules)
Benchmark Methodology
These numbers were obtained using:
- Hardware: x86_64 Linux system - 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
- Python: 3.13
- Dataset: Python 3.13 standard library (/usr/lib/python3.13)
- Tool version: Markymarkov current git HEAD
- Method:
- Training: Single run (warm filesystem cache)
- Validation: 50 iterations, median reported
- Lookup: 10,000 iterations, median reported
Benchmark code is available in the repository (benchmark.py and benchmark_summary.py).
Key Takeaways
✓ Fast training: 100+ files/second, practical for daily retraining
✓ Compact models: 25x compression, <1 MB for typical projects
✓ Quick validation: ~200ms per file including overhead
✓ Instant lookups: <1 microsecond for model queries
✓ Linear scaling: Performance predictable as codebase grows
✓ Memory efficient: 2-3 MB for loaded models
✓ Production-ready: Performance suitable for real-time use
Marky’s performance characteristics make it viable for:
- Interactive development workflows (IDE plugins)
- Automated code review (CI/CD integration)
- Large-scale code analysis (batch processing)
- Real-time code generation guidance (LLM integration)
8. Use Cases
Marky’s pattern-learning approach enables a variety of practical applications, from code quality enforcement to training data analysis. Here are detailed scenarios showing how Markymarkov solves real problems.
Code Quality Assurance
Problem: Teams need to ensure generated or contributed code follows project conventions, but manual review doesn’t scale and traditional linters only check syntax.
How Markymarkov Helps:
Markymarkov learns what “quality code” looks like from your existing codebase, then validates new code against those patterns.
Example Workflow:
# 1. Train on your production codebase
uv run markymarkov train src/ models/ --model-type both
# 2. Validate new code during PR review
uv run markymarkov validate models/semantic_model.py new_feature.py
# Output shows:
# - Confidence score (0.0-1.0)
# - Which patterns match your codebase
# - Which patterns are unusual
# - Suggestions for improvement
Real-World Scenario:
A team maintaining a Flask application trains Markymarkov on their existing routes, models, and utilities:
# Existing codebase pattern: Always validate input early
@app.route('/user/<user_id>')
def get_user(user_id):
    if not user_id:
        return jsonify({'error': 'Missing user_id'}), 400
    user = db.query(User).get(user_id)
    if not user:
        return jsonify({'error': 'User not found'}), 404
    return jsonify(user.to_dict())
When a new developer submits code without input validation:
@app.route('/product/<product_id>')
def get_product(product_id):
    product = db.query(Product).get(product_id)
    return jsonify(product.to_dict())
Markymarkov flags this:
Validation Result:
Confidence: 0.32 (Low - unusual pattern)
Missing expected patterns:
- guard-clause (expected after function-transformer: 0.85 confidence)
- if-empty-check (common in route handlers: 0.78 confidence)
Suggestion: Add input validation (guard clause pattern)
Benefits:
- Catches style violations automatically
- Provides objective feedback (“doesn’t match our patterns”)
- Reduces review time (reviewers focus on logic, not style)
- Teaches new developers team conventions
Style Enforcement
Problem: Different developers write code differently. Some use list comprehensions, others use loops. Some use guard clauses, others nest conditionals. You want consistency.
How Markymarkov Helps:
Markymarkov identifies the dominant patterns in your codebase and flags deviations. This creates a “style fingerprint” unique to your project.
Example Workflow:
# Train on well-styled reference code
uv run markymarkov train examples/good_style/ models/
# Validate new contributions
uv run markymarkov validate models/semantic_model.py contribution.py
# Integrate into pre-commit hook
# .git/hooks/pre-commit:
#!/bin/bash
for file in $(git diff --cached --name-only --diff-filter=ACM | grep '\.py$'); do
uv run markymarkov validate models/semantic_model.py "$file"
if [ $? -ne 0 ]; then
echo "Style validation failed for $file"
exit 1
fi
done
Real-World Scenario:
Your team prefers comprehensions over manual loops:
# Preferred style (80% of codebase)
result = [item.transform() for item in items if item.valid]
# Less preferred (20% of codebase)
result = []
for item in items:
    if item.valid:
        result.append(item.transform())
New code using manual loops gets flagged:
Pattern detected: loop-accumulate
Confidence: 0.25 (Low for this context)
Alternative suggestion:
list-comprehension has 0.82 confidence in this context
Consider: [item.transform() for item in items if item.valid]
Configuration Options:
You can tune enforcement strictness:
# Strict mode: Reject if confidence < 0.7
if confidence < 0.7:
    reject_code()

# Permissive mode: Warn if confidence < 0.5, reject if < 0.3
if confidence < 0.3:
    reject_code()
elif confidence < 0.5:
    warn_developer()

# Learning mode: Always accept, but log unusual patterns
log_pattern_stats(code, confidence)
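If you want a single entry point for these modes, a minimal sketch might look like the following; the apply_policy helper and its thresholds are illustrative, not part of Markymarkov:
def apply_policy(confidence: float, mode: str = "permissive") -> str:
    """Map a Markymarkov confidence score to an action.
    Thresholds are illustrative; tune them to your codebase."""
    if mode == "strict":
        return "reject" if confidence < 0.7 else "accept"
    if mode == "permissive":
        if confidence < 0.3:
            return "reject"
        if confidence < 0.5:
            return "warn"
        return "accept"
    # Learning mode: always accept; callers can log the score separately
    return "accept"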
Benefits:
- Enforces team style automatically
- Adapts to your codebase (not generic rules)
- Provides constructive feedback
- Reduces style debates (data-driven)
Training Data Analysis
Problem: You want to understand what patterns dominate your codebase, identify inconsistencies, or prepare data for model training.
How Markymarkov Helps:
Markymarkov extracts and quantifies patterns, giving you insights into code characteristics.
Example Workflow:
# Train model
uv run markymarkov train codebase/ analysis_models/
# View statistics
uv run markymarkov stats analysis_models/semantic_model.py
# Output shows:
# - Most common patterns
# - Pattern transition probabilities
# - Unusual pattern combinations
Real-World Scenario:
Analyzing a legacy codebase before refactoring:
$ uv run markymarkov stats analysis_models/semantic_model.py
Model Statistics:
Total transitions: 1,247
Unique patterns: 31
Most common patterns:
1. function-transformer (892 occurrences, 71.5%)
2. guard-clause (423 occurrences, 33.9%)
3. return-computed (389 occurrences, 31.2%)
4. loop-filter (276 occurrences, 22.1%)
5. try-except-pass (198 occurrences, 15.9%)
Unusual patterns (low confidence):
- try-except-pass after loop-filter (0.12 confidence)
→ Suggests error handling inside loops (potential issue)
- nested-if-else chains (0.08 confidence)
→ Only 8% of code uses this (refactoring candidate)
Pattern diversity: 0.62 (moderate - some patterns dominate)
From this analysis, you learn:
- 72% of functions are transformers (good for functional style)
- 34% use guard clauses (decent defensive programming)
- 16% silently swallow exceptions (tech debt!)
- Nested ifs are rare (codebase is generally clean)
Use Cases:
- Pre-refactoring analysis (identify problem areas)
- Code review focus (prioritize unusual patterns)
- Training data curation (find representative examples)
- Team onboarding (show “how we write code”)
Advanced Analysis:
Compare patterns across different modules:
# Train separate models
uv run markymarkov train src/api/ models/api_model.py
uv run markymarkov train src/db/ models/db_model.py
uv run markymarkov train src/utils/ models/utils_model.py
# Compare statistics
python compare_models.py models/api_model.py models/db_model.py
# Output:
# API layer: Heavy use of guard clauses (0.82)
# DB layer: Heavy use of context managers (0.71)
# Utils: Heavy use of list comprehensions (0.68)
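compare_models.py is not part of Markymarkov itself; a minimal sketch, assuming the `markymarkov stats` output format shown earlier, can shell out to the CLI and print each model's headline patterns side by side:
import subprocess
import sys

def top_patterns(model_path):
    """Run `markymarkov stats` and return the lines under 'Most common patterns:'."""
    output = subprocess.run(
        ["uv", "run", "markymarkov", "stats", model_path],
        capture_output=True, text=True, check=True,
    ).stdout
    lines = output.splitlines()
    start = next((i + 1 for i, line in enumerate(lines) if "Most common patterns" in line), None)
    if start is None:
        return []
    patterns = []
    for line in lines[start:]:
        if not line.strip() or not line.strip()[0].isdigit():
            break
        patterns.append(line.strip())
    return patterns

if __name__ == "__main__":
    for model in sys.argv[1:]:
        print(f"== {model} ==")
        for pattern in top_patterns(model):
            print(f"  {pattern}")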
Benefits:
- Objective codebase metrics
- Identify tech debt patterns
- Understand team coding style
- Prepare for refactoring efforts
Model-Driven Code Generation
Problem: You want LLMs to generate code that matches your project’s style, not generic Python.
How Markymarkov Helps:
Train Markymarkov on your codebase, then use it to guide or validate LLM-generated code.
Example Workflow:
import subprocess

# 1. Train on your codebase
subprocess.run(["uv", "run", "python", "-m", "src", "train", "src/", "models/"])

# 2. Generate code with an LLM (generate_code_with_llm, validate_code, and
#    get_expected_patterns are illustrative helpers from your own pipeline)
original_prompt = "Write a user validation function"
llm_code = generate_code_with_llm(prompt=original_prompt)

# 3. Validate with Markymarkov
validation = validate_code(llm_code, "models/semantic_model.py")

# 4. If confidence is low, regenerate with hints
if validation['confidence'] < 0.6:
    hints = get_expected_patterns(validation)
    llm_code = generate_code_with_llm(
        prompt=f"{original_prompt}\n\nUse these patterns: {hints}"
    )
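The validate_code and get_expected_patterns helpers above are not part of Markymarkov's public API; one hedged sketch wraps the CLI and parses the "Confidence:" line from its report, assuming the output format shown in this document:
import re
import subprocess
import tempfile

def validate_code(code: str, model_path: str) -> dict:
    """Write generated code to a temp file, run `markymarkov validate`,
    and extract the confidence score from its report."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(code)
        path = tmp.name
    result = subprocess.run(
        ["uv", "run", "markymarkov", "validate", model_path, path],
        capture_output=True, text=True,
    )
    match = re.search(r"Confidence:\s*([0-9.]+)", result.stdout)
    return {
        "confidence": float(match.group(1)) if match else 0.0,
        "report": result.stdout,
    }

def get_expected_patterns(validation: dict) -> list[str]:
    """Pull pattern names out of '- <pattern> (0.xx confidence)' report lines."""
    return re.findall(r"-\s+(.+?)\s+\(0\.", validation["report"])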
Real-World Scenario:
Your project uses specific error handling patterns:
# Your codebase pattern
def fetch_user(user_id):
    try:
        return db.get_user(user_id)
    except NotFoundError:
        logger.warning(f"User {user_id} not found")
        return None
    except DatabaseError as e:
        logger.error(f"Database error: {e}")
        raise
LLM generates generic code:
def fetch_product(product_id):
    try:
        return db.get_product(product_id)
    except Exception:
        return None
Markymarkov validation:
Confidence: 0.23 (Very low)
Issues:
- Catching generic Exception (your code uses specific exceptions)
- No logging (your code always logs errors)
- Silent failure (your code raises on critical errors)
Expected patterns:
- try-except with specific exception types (0.89 confidence)
- logging-call in except blocks (0.82 confidence)
- try-except-reraise for critical errors (0.71 confidence)
You regenerate with hints:
# Improved LLM output (after hints)
def fetch_product(product_id):
    try:
        return db.get_product(product_id)
    except NotFoundError:
        logger.warning(f"Product {product_id} not found")
        return None
    except DatabaseError as e:
        logger.error(f"Database error: {e}")
        raise
New confidence: 0.87 ✓
Benefits:
- LLM output matches your style
- Reduces post-generation cleanup
- Iterative improvement through feedback
- Project-specific code generation
Identifying Code Anomalies
Problem: You suspect certain code sections are unusual or buggy, but need objective evidence.
How Markymarkov Helps:
Low-confidence patterns indicate code that doesn’t match typical project style. This can surface bugs, anti-patterns, or quick hacks.
Example Workflow:
# Train on healthy codebase
uv run markymarkov train src/ models/ --exclude tests/
# Validate suspicious file
uv run markymarkov validate models/semantic_model.py suspicious.py
# Look for low-confidence patterns
Real-World Scenario:
A bug report comes in for a function. Markymarkov validation shows:
def process_payment(amount, user):
    result = charge_card(user.card, amount)
    if result:
        return result
    update_balance(user, amount)  # ← This line is unusual
    return {"status": "success"}
Markymarkov output:
Validation Result:
Confidence: 0.18 (Very low)
Unusual patterns:
Line 4: Unreachable code after return (0.02 confidence)
Expected: All code paths should be reachable
Similar functions in codebase:
- process_refund: Always checks result before proceeding (0.91 confidence)
- process_subscription: Uses guard clauses (0.87 confidence)
The bug: when charge_card succeeds, the early return means update_balance is never called. Markymarkov flagged it as unusual because similar functions in the codebase don’t follow this pattern.
Finding Tech Debt:
Run validation on entire codebase and sort by confidence:
# Validate all files, keeping each filename next to its confidence score
shopt -s globstar
for file in src/**/*.py; do
    echo "$file: $(uv run markymarkov validate models/semantic_model.py "$file" | grep 'Confidence:')" >> results.txt
done
# Sort by confidence (low = suspicious); the score is the third ':'-separated field
sort -t: -k3 -n results.txt
# Output:
# src/legacy/utils.py: Confidence: 0.12
# src/old_api/handlers.py: Confidence: 0.19
# src/deprecated/auth.py: Confidence: 0.23
# ...
Files with very low confidence often indicate:
- Legacy code (written before standards were established)
- Quick hacks (rushed implementation)
- Copy-pasted from external sources
- Potential bugs or anti-patterns
Benefits:
- Objective code quality metric
- Find bugs before they manifest
- Prioritize refactoring efforts
- Code review efficiency
Pre-Commit Hooks / CI Integration
Problem: You want to catch style issues before code reaches review.
How Markymarkov Helps:
Integrate Markymarkov into your development workflow to validate code automatically.
Example: Pre-Commit Hook
#!/bin/bash
# .git/hooks/pre-commit
THRESHOLD=0.5  # Minimum acceptable confidence

for file in $(git diff --cached --name-only --diff-filter=ACM | grep '\.py$'); do
    result=$(uv run markymarkov validate models/semantic_model.py "$file" 2>&1)
    confidence=$(echo "$result" | grep "Confidence:" | awk '{print $2}')
    if (( $(echo "$confidence < $THRESHOLD" | bc -l) )); then
        echo "❌ $file: Low confidence ($confidence)"
        echo "$result"
        exit 1
    else
        echo "✓ $file: Passed ($confidence)"
    fi
done

echo "All files passed Markymarkov validation!"
Example: GitHub Actions
name: Code Style Validation
on: [pull_request]

jobs:
  marky-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0  # fetch full history so origin/main is available for the diff below
      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.13'
      - name: Install Marky
        run: |
          pip install uv
          uv sync
      - name: Train model (or use cached)
        run: |
          if [ ! -f models/semantic_model.py ]; then
            uv run markymarkov train src/ models/
          fi
      - name: Validate changed files
        run: |
          for file in $(git diff --name-only origin/main...HEAD | grep '\.py$'); do
            uv run markymarkov validate models/semantic_model.py "$file"
          done
Benefits:
- Catch issues early (before review)
- Consistent enforcement
- Automated workflow
- Fast feedback loop
Code Migration / Refactoring Assistance
Problem: Migrating from one style to another (e.g., callbacks → async/await, classes → functional).
How Markymarkov Helps:
Train one model on old style, another on new style, then validate migrations.
Example Workflow:
# Train on old codebase (before refactor)
uv run markymarkov train old_src/ models/old_style.py
# Train on target examples (new style)
uv run markymarkov train examples/new_style/ models/new_style.py
# During migration, validate against new style
uv run markymarkov validate models/new_style.py refactored_file.py
Real-World Scenario:
Migrating from synchronous to async code:
# Old style (callback-based)
def fetch_user(user_id, callback):
    result = db.get_user(user_id)
    callback(result)

# New style (async/await)
async def fetch_user(user_id):
    result = await db.get_user(user_id)
    return result
Markymarkov trained on new style flags incomplete migrations:
# Partially migrated (still has callback)
async def fetch_product(product_id, callback):
    result = await db.get_product(product_id)
    callback(result)  # ← Should return instead
Validation:
Confidence: 0.31 (Low - mixed patterns)
Issues:
- Async function with callback parameter (not in new style)
- Expected: async functions return values (0.94 confidence in new_style.py)
Suggestion: Remove callback, use return statement
Benefits:
- Consistency during migrations
- Catch incomplete refactors
- Learn from target examples
- Objective migration progress tracking
Onboarding New Developers
Problem: New team members need to learn project conventions quickly.
How Markymarkov Helps:
Use Markymarkov statistics to show “how we write code here” with concrete examples.
Example Workflow:
# Generate onboarding report
uv run markymarkov stats models/semantic_model.py > ONBOARDING.md
# Show examples of each pattern
python extract_pattern_examples.py models/semantic_model.py src/
Onboarding Document Generated:
# Our Code Style (Generated from Codebase Analysis)
## Most Common Patterns
### 1. Guard Clauses (85% of functions)
We prefer early returns for invalid input:
\`\`\`python
# ✓ Our style
def process(data):
    if not data:
        return None
    if not data.valid:
        return None
    return data.process()

# ✗ Avoid
def process(data):
    if data and data.valid:
        return data.process()
    else:
        return None
\`\`\`
### 2. List Comprehensions (73% of loops)
Use comprehensions for transformations:
\`\`\`python
# ✓ Our style
results = [item.transform() for item in items if item.valid]
# ✗ Avoid
results = []
for item in items:
    if item.valid:
        results.append(item.transform())
\`\`\`
### Rare Patterns (Avoid Unless Necessary)
- try-except-pass (only 3% of code)
- nested-if-else (only 5% of code)
- while loops (only 8% of code)
Benefits:
- Data-driven onboarding
- Clear examples from actual code
- New developers write idiomatic code faster
- Reduces “that’s not how we do it” feedback
Documentation Generation
Problem: Code style guides are often out of date or incomplete.
How Markymarkov Helps:
Generate living documentation from actual code patterns.
# Auto-generate style guide
python generate_style_guide.py models/semantic_model.py > STYLE_GUIDE.md
This creates documentation that’s:
- Always up-to-date (regenerate from latest model)
- Based on actual code (not aspirational)
- Quantified (shows pattern prevalence)
- Example-rich (extracted from codebase)
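generate_style_guide.py is left to the reader; a minimal sketch, which wraps the `markymarkov stats` output in Markdown rather than touching any internal model API, could look like this:
import subprocess
import sys
from datetime import date

def main(model_path):
    stats = subprocess.run(
        ["uv", "run", "markymarkov", "stats", model_path],
        capture_output=True, text=True, check=True,
    ).stdout
    print(f"# Our Code Style (generated {date.today()} from {model_path})")
    print()
    print("Regenerate this file after retraining; do not edit by hand.")
    print()
    for line in stats.rstrip().splitlines():
        print("    " + line)  # indent so Markdown renders the stats as a code block

if __name__ == "__main__":
    main(sys.argv[1])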
Benefits:
- Living documentation
- No manual maintenance
- Objective style standards
- Easy to update
Research / Code Analysis
Problem: Understanding code evolution, pattern trends, or comparative analysis.
How Markymarkov Helps:
Train models on different versions or projects to study differences.
Example Workflow:
# Train on multiple versions
uv run markymarkov train v1.0/ models/v1.py
uv run markymarkov train v2.0/ models/v2.py
uv run markymarkov train v3.0/ models/v3.py
# Compare patterns across versions
python compare_evolution.py models/v*.py
Research Questions Answered:
- How did coding style evolve over time?
- What patterns were added/removed between versions?
- Do different teams have different style fingerprints?
- What patterns correlate with bugs?
Benefits:
- Empirical code analysis
- Track style evolution
- Comparative studies
- Research publications
Summary of Use Cases
| Use Case | Benefit | Typical Users |
|---|---|---|
| Code Quality Assurance | Automated style validation | All developers |
| Style Enforcement | Consistent codebase | Tech leads, DevOps |
| Training Data Analysis | Understand patterns | Data scientists, researchers |
| Model-Driven Generation | Better LLM output | AI/ML teams |
| Anomaly Detection | Find bugs early | QA, code reviewers |
| CI/CD Integration | Automated checks | DevOps, platform teams |
| Refactoring Assistance | Migration validation | Senior developers |
| Developer Onboarding | Learn conventions | New team members |
| Documentation | Living style guides | Tech writers, leads |
| Research | Code pattern studies | Researchers, academics |
All these use cases leverage the same core capability: learning patterns from code and validating new code against those patterns. The flexibility of Markymarkov’s approach makes it valuable across the entire development lifecycle.
9. Getting Started
You can run Markymarkov directly from the main branch:
uvx --from git+https://github.com/roobie/markymarkov markymarkov
# You can alias the above
alias marky="uvx --from git+https://github.com/roobie/markymarkov markymarkov"
Installation & Setup
git clone https://github.com/roobie/markymarkov
cd markymarkov
uv sync
Training Your First Model
# Train on your codebase
uv run markymarkov train /path/to/your/code models/
# Or with a specific model type and chain order
uv run markymarkov train /path/to/code models/ --model-type semantic --order 2
Running Validation
# Validate a file
uv run markymarkov validate models/semantic_model.py your_file.py
# See statistics
uv run markymarkov stats models/semantic_model.py
# Try the demo
uv run markymarkov demo
Integration Patterns
- Load model as Python module
- Call MarkovCodeGuide.suggest_next_nodes() (see the sketch below)
- Use with streaming code validator
- Integrate with LLM inference pipeline
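For the first two items, here is a minimal sketch of loading an exported model and querying it; the constructor arguments and the exact signature of suggest_next_nodes() are assumptions, so check the generated module:
import importlib.util

# Load the exported model file as an ordinary Python module
spec = importlib.util.spec_from_file_location("semantic_model", "models/semantic_model.py")
semantic_model = importlib.util.module_from_spec(spec)
spec.loader.exec_module(semantic_model)

# Ask which patterns typically follow the current one; the pattern name
# and return shape here are illustrative
guide = semantic_model.MarkovCodeGuide()
print(guide.suggest_next_nodes("guard-clause"))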
10. Conclusion
Markymarkov shows that small, interpretable probabilistic models add high value to code workflows. By learning patterns from your own codebase, Markymarkov provides fast, explainable guidance that complements existing tools rather than replacing them.
Key advantages:
- Fast to train and load, compact to store
- Interpretable confidence scores for actionable feedback
- Flexible integration points: CI, IDE, LLM workflows
If you care about consistency, developer productivity, and explainability, try training Markymarkov on a representative subset of your code and see what it highlights.