Starting from scratch: Training a 30M Topological Transformer

https://news.ycombinator.com/rss Hits: 14
Summary

Training a 30M-parameter Topological Transformer

Tauformer is a topological transformer (see paper) that replaces dot-product attention with a Laplacian-derived scalar (taumode) per token/head, then attends using distances in that scalar space. Below is a post-style overview of the idea and the first training signals from a 30M-parameter run.

Tauformer in one idea

Tauformer's goal is to inject domain structure directly into attention by using a graph Laplacian built from a domain embedding space (a "domain memory") as a persistent reference. Instead of ranking keys by \(Q\cdot K\), Tauformer ranks them by how similar their Laplacian-derived taumode scalars are, which is intended to bias attention toward domain-relevant relations rather than generic geometric similarity.

At the implementation level, Tauformer keeps the familiar Q/K/V projections, RoPE, causal masking, and stable softmax/value-aggregation pipeline, but changes how the attention logits are computed. Each head vector is compressed into a scalar \(\lambda\) using a bounded Rayleigh-quotient energy computed with a feature-space Laplacian \(L\), and logits are then computed as a negative distance \(-|\lambda_q-\lambda_k|/\text{temperature}\).

Key building blocks (as implemented, sketched in the examples below):

- Taumode scalar: compute \(E_{\text{raw}}=(x^\top L x)/(x^\top x+\varepsilon)\), then bound it as \(E_{\text{raw}}/(E_{\text{raw}}+\tau)\) to produce \(\lambda\in[0,1)\).
- Logits: \(\text{att}_{ij} = -|\lambda^Q_i - \lambda^K_j|/\text{temperature}\), then reuse the causal mask \(\rightarrow\) subtract the row max \(\rightarrow\) softmax \(\rightarrow\) multiply by \(V\).

Why it can be cheaper

Because scoring no longer needs full key vectors, Tauformer's KV cache can store values plus a compact key-side scalar stream rather than both K and V tensors. Concretely, the cache payload is \((V,\lambda_k)\) rather than \((K,V)\), which yields roughly a 50% per-layer cache reduction for typical head dimensions (with a small overhead for the extra scalar per key). The design also anticipates...
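
To make the taumode step concrete, here is a minimal PyTorch sketch of the bounded Rayleigh-quotient scalar, assuming a per-head feature-space Laplacian of shape (head_dim, head_dim); the function and argument names (taumode, laplacian, tau, eps) are illustrative, not identifiers from the post.

```python
import torch

def taumode(x: torch.Tensor, laplacian: torch.Tensor,
            tau: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Compress each head vector x[..., d] into a bounded scalar in [0, 1).

    E_raw  = (x^T L x) / (x^T x + eps)   # Rayleigh-quotient energy
    lambda = E_raw / (E_raw + tau)       # bounded to [0, 1) since L is PSD
    """
    # x: (..., head_dim), laplacian: (head_dim, head_dim)
    Lx = torch.einsum('...d,de->...e', x, laplacian)
    e_raw = (x * Lx).sum(-1) / ((x * x).sum(-1) + eps)
    return e_raw / (e_raw + tau)
```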
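
Likewise, a hedged sketch of the distance-based logits feeding the reused mask/softmax/value pipeline, assuming self-attention over a full sequence (query and key lengths equal); tau_attention and its argument names are hypothetical, not the post's API.

```python
import torch
import torch.nn.functional as F

def tau_attention(lam_q: torch.Tensor, lam_k: torch.Tensor,
                  v: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """lam_q: (B, H, T), lam_k: (B, H, T), v: (B, H, T, head_dim)."""
    # Negative absolute distance in taumode space as the attention logit.
    logits = -torch.abs(lam_q.unsqueeze(-1) - lam_k.unsqueeze(-2)) / temperature
    # Causal mask: position i may only attend to keys j <= i.
    T = logits.shape[-1]
    causal = torch.ones(T, T, dtype=torch.bool, device=logits.device).tril()
    logits = logits.masked_fill(~causal, float('-inf'))
    # Stable softmax: subtract the per-row max before exponentiating.
    logits = logits - logits.max(dim=-1, keepdim=True).values
    weights = F.softmax(logits, dim=-1)
    # Aggregate values with the attention weights.
    return weights @ v
```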
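
As a back-of-the-envelope check of the cache claim: caching \((V,\lambda_k)\) costs head_dim + 1 floats per head per token instead of 2 × head_dim for \((K,V)\), which is roughly half for typical head dimensions. The numbers below (head_dim=64, n_heads=8) are illustrative assumptions, not figures from the post.

```python
def cache_floats_per_token(head_dim: int, n_heads: int) -> tuple[int, int]:
    standard = n_heads * (2 * head_dim)   # store full K and V vectors
    tauformer = n_heads * (head_dim + 1)  # store V vector plus one lambda_k scalar
    return standard, tauformer

std, tau = cache_floats_per_token(head_dim=64, n_heads=8)
print(std, tau, tau / std)  # 1024 520 ~0.51 -> roughly half the per-layer cache
```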

First seen: 2026-01-18 12:27

Last seen: 2026-01-19 01:29