$500 GPU outperforms Claude Sonnet on coding benchmarks using open-source AI system

https://lobste.rs/rss Hits: 30
Summary

Adaptive Test-time Learning and Autonomous Specialization (A.T.L.A.S) achieves 74.6% LiveCodeBench pass@1-v(k=3) with a frozen 14B model on a single consumer GPU -- up from 36-41% in V2 -- through constraint-driven generation and self-verified iterative refinement.

The premise: wrap a frozen smaller model in intelligent infrastructure -- structured generation, energy-based verification, self-verified repair -- and it can compete with frontier API models at a fraction of the cost. No fine-tuning, no API calls, no cloud. Fully self-hosted -- no data leaves the machine, no API keys required, no usage metering. One GPU, one box.

Hardware: RTX 5060 Ti 16GB | Model: Qwen3-14B-Q4_K_M (frozen)

| Benchmark | Score | Tasks | Method |
| LiveCodeBench v5 | 74.6% pass@1-v(k=3)* | 599 | V3 pipeline: PlanSearch + self-verified PR-CoT repair, V3 Score |
| GPQA Diamond | 47.0% | 198 | k=5, multiple-choice knowledge reasoning, V2 Score |
| SciCode | 14.7% (sub-problems) | 341 | k=1, cross-domain scientific coding, V2 Score |

*pass@1-v(k=3) = one solution submitted per task, but generated via best-of-3 candidates + Lens selection + iterative repair on failures. This is not single-shot generation, so it is not pass@1. See methodology.

V3 ablation breakdown

| Condition | Configuration | Pass Rate | Delta |
| A | Baseline (no V3) | 54.9% | -- |
| B | +Phase 1 (PlanSearch + BudgetForcing + DivSampling) | 67.3% | +12.4pp |
| C | +Phase 1+2 (Lens routing) | 67.3% | +0.0pp |
| D | +Phase 1+3 (self-verified refinement) | 74.6% | +7.3pp |

Phase 3 uses self-generated test cases for internal verification -- the model never sees the answer key during repair. PR-CoT rescues 36/42 tasks (85.7% of Phase 3 rescues). Full report: V3_ABLATION_STUDY.md

Cost and Performance Context

| System | LCB pass@1 | Est. cost/task | Notes |
| DeepSeek V3.2 Reasoning | 86.2% | ~$0.002 | API, single-shot |
| GPT-5 (high) | 84.6% | ~$0.043 | API, single-shot |
| ATLAS V3 (pass@1-v(k=3)) | 74.6% | ~$0.004 | Local electricity only, best-of-3 + repair pipeline |
| Claude 4.5 Sonnet | 71.4% | ~$0.066 | API, single-shot |
| Claude 4 Sonnet | 65.5% | ~$0.066 | API, single-shot |

Meth...
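The pass@1-v(k=3) procedure described in the summary (generate several candidates, select one, repair it against self-generated tests, submit a single solution) can be illustrated with a minimal Python sketch. This is not the ATLAS implementation: every name here (`count_passed`, `best_of_k_with_repair`, the `solve` entry point, the toy `fake_generate` and `fake_repair` stand-ins for the model) is hypothetical, and candidate selection is done by counting passed self-generated tests rather than by the energy-based Lens scoring the project describes.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration only; not taken from the ATLAS codebase.
TestCase = Tuple[tuple, object]  # (positional args, expected return value)


def count_passed(candidate_src: str, tests: List[TestCase], entry: str = "solve") -> int:
    """Run a candidate against self-generated tests and count how many pass.

    Uses exec() for brevity; a real pipeline would run candidates in a sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)
        fn = namespace[entry]
    except Exception:
        return 0  # candidate does not even load
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed


def best_of_k_with_repair(
    generate: Callable[[int], str],
    repair: Callable[[str], str],
    tests: List[TestCase],
    k: int = 3,
    max_repairs: int = 2,
) -> str:
    """Return the single solution that would be submitted for one task."""
    # Step 1: sample k candidates (best-of-k).
    candidates = [generate(i) for i in range(k)]
    # Step 2: select one candidate. Assumption: score by self-generated tests,
    # standing in for the energy-based selection described in the summary.
    best = max(candidates, key=lambda c: count_passed(c, tests))
    # Step 3: iterative repair, keeping a repair only if it scores higher.
    for _ in range(max_repairs):
        score = count_passed(best, tests)
        if score == len(tests):
            break
        repaired = repair(best)
        if count_passed(repaired, tests) > score:
            best = repaired
    return best


# Toy stand-ins for the frozen model, so the sketch runs without any LLM.
def fake_generate(i: int) -> str:
    drafts = [
        "def solve(a, b):\n    return a - b\n",
        "def solve(a, b):\n    return a * b\n",
        "def solve(a, b):\n    return a + b + 1\n",
    ]
    return drafts[i]


def fake_repair(src: str) -> str:
    return "def solve(a, b):\n    return a + b\n"


self_tests: List[TestCase] = [((1, 2), 3), ((0, 0), 0), ((2, 2), 4)]
submitted = best_of_k_with_repair(fake_generate, fake_repair, self_tests, k=3)
```

Only `submitted` would be graded against the benchmark's hidden tests, which is why the score is reported as one submission per task even though several candidates and repair rounds were spent producing it.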

First seen: 2026-03-27 00:17

Last seen: 2026-03-27 17:29