Show HN: Mdarena – Benchmark your Claude.md against your own PRs

https://news.ycombinator.com/rss Hits: 2

Summary

Benchmark your CLAUDE.md against your own PRs. Most CLAUDE.md files are written blindly. Research shows they often reduce agent success rates and cost 20%+ more tokens. mdarena lets you measure whether yours helps or hurts, on tasks from your actual codebase.

    pip install mdarena

    # Mine 50 merged PRs into a test set
    mdarena mine owner/repo --limit 50 --detect-tests

    # Benchmark multiple CLAUDE.md files + baseline (no context)
    mdarena run -c claude_v1.md -c claude_v2.md -c agents.md

    # See who wins
    mdarena report

How the three commands fit together:

    mdarena mine   -> Fetch merged PRs, filter, build task set
                      Auto-detect test commands from CI/package files
    mdarena run    -> For each task x condition:
                        - Checkout repo at pre-PR commit
                        - Baseline: all CLAUDE.md files stripped
                        - Context: inject CLAUDE.md, let Claude discover it
                        - Run tests if available, capture git diff
    mdarena report -> Compare patches against gold (actual PR diff)
                        - Test pass/fail (same as SWE-bench)
                        - File/hunk overlap, cost, tokens
                        - Statistical significance (paired t-test)

mdarena can run your repo's actual tests to grade agent patches, the same way SWE-bench does it.

    # Auto-detect from CI/CD
    mdarena mine owner/repo --detect-tests

    # Or specify manually
    mdarena mine owner/repo --test-cmd "make test" --setup-cmd "npm install"

Test detection parses .github/workflows/*.yml, package.json, pyproject.toml, Cargo.toml, and go.mod. When tests aren't available, mdarena falls back to diff overlap scoring.

Pass a directory to benchmark a full CLAUDE.md tree:

    mdarena run -c ./configs-v1/ -c ./configs-v2/

Each directory mirrors your repo structure. Baseline strips ALL CLAUDE.md and AGENTS.md files from the entire tree.

We ran mdarena against a large production monorepo: 20 merged PRs, Claude Opus 4.6, three conditions (bare baseline, existing CLAUDE.md, hand-written alternative). Patches were graded against real test suites. Not string matching, not LLM-as-judge.
Key findings:

- The existing CLAUDE.md improved test resolution by ~27% over bare baseline
- A consolidated alternative that merg...
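The summary mentions file/hunk overlap against the gold diff and a paired t-test across conditions, but does not say how mdarena computes either. As a rough illustration only, the sketch below assumes file overlap is a Jaccard score over the paths named in `diff --git` headers, and that each task yields one score per condition. The function names `changed_files`, `file_overlap`, and `paired_t` are hypothetical and are not part of mdarena.

```python
import math
import re
from statistics import mean, stdev


def changed_files(diff_text: str) -> set[str]:
    """Collect paths from 'diff --git a/<path> b/<path>' headers.

    Assumption: paths contain no whitespace.
    """
    pattern = r"^diff --git a/\S+ b/(\S+)$"
    return set(re.findall(pattern, diff_text, flags=re.MULTILINE))


def file_overlap(agent_diff: str, gold_diff: str) -> float:
    """Jaccard overlap of touched files (an assumed metric, not mdarena's own)."""
    agent, gold = changed_files(agent_diff), changed_files(gold_diff)
    if not agent and not gold:
        return 1.0  # both diffs are empty, so they trivially agree
    return len(agent & gold) / len(agent | gold)


def paired_t(scores_a: list[float], scores_b: list[float]) -> tuple[float, int]:
    """Return (t statistic, degrees of freedom) for per-task paired scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 tasks")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the per-task differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


if __name__ == "__main__":
    gold = "diff --git a/a.py b/a.py\n+x\ndiff --git a/b.py b/b.py\n+y\n"
    agent = "diff --git a/a.py b/a.py\n+x\ndiff --git a/c.py b/c.py\n+z\n"
    print(file_overlap(agent, gold))

    # Hypothetical per-task pass (1) / fail (0) results, same tasks in both lists.
    with_context = [1, 1, 0, 1]
    baseline = [0, 1, 0, 0]
    print(paired_t(with_context, baseline))
```

Pairing by task matters here because every condition runs on the same PRs: the test looks at the per-task difference rather than comparing two independent averages. Turning the t statistic into a p-value needs the Student t distribution, which the standard library does not provide, so that step is left out of this sketch.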

First seen: 2026-04-06 01:42

Last seen: 2026-04-06 02:43

