Paper Tape Is All You Need – Training a Transformer on a 1976 Minicomputer

https://news.ycombinator.com/rss Hits: 12
Summary

ATTN/11 - Paper Tape Is All You Need

A single-layer, single-head transformer written in PDP-11 assembly language. This project is the spiritual successor of Xortran, a neural network that learns XOR with backpropagation in Fortran IV on the IBM 1130 (1965) and the PDP-11/20 (1970). The natural next step was to see whether those machines could train a small transformer in an acceptable amount of time (a few hours).

Architecturally, a transformer is a fairly modest extension of a basic neural network. The building blocks, such as matrix multiplies, backpropagation, SGD, and cross-entropy, are already there. The three new components are:

- Self-attention: dot-product scores between projected queries and keys
- Positional encoding: learned position embeddings, added to the input
- Softmax: turns scores into a probability distribution

The goal is to train the transformer to reverse a sequence of digits. Despite its apparent simplicity, reversal is not a trivial task for a neural network: the model must learn to route each token to a position that depends only on its index, with no content-based shortcut. This is exactly the kind of problem self-attention is designed for, and it is in fact one of the algorithmic benchmarks included in Tensor2Tensor, Google's 2017 reference implementation of the original transformer.

The data path is straightforward: tokens are embedded, passed through self-attention with a residual connection, then projected back to the vocabulary and softmaxed into a prediction:

Tokens -> Embedding -> Self-Attention -> Residual -> Projection -> Softmax

Hyperparameter    Value
Layers            1
Heads             1
d_model           16
Sequence length   8
Vocabulary        10 (digits 0-9)
Parameters        1,216

The model is an encoder-only transformer: embedding, self-attention with a residual connection, and an output projection. It is a genuine transformer with self-attention, but it is neither BERT nor GPT: it has no layer norm, no feed-forward network, and no decoder.
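The data path and parameter count above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the PDP-11 assembly implementation: all matrix names and the initialization scheme are assumptions, but the shapes follow the hyperparameter table, and the resulting parameter count matches the stated 1,216.

```python
# Illustrative NumPy sketch of the described forward pass (assumed layout,
# not the actual PDP-11 code): embedding + learned positional embedding,
# single-head self-attention, residual connection, projection, softmax.
import numpy as np

V, T, D = 10, 8, 16          # vocabulary, sequence length, d_model
rng = np.random.default_rng(0)

# Learned parameter matrices (randomly initialized here, trained in the real thing).
E  = rng.normal(0, 0.1, (V, D))   # token embedding: 10 x 16 = 160
P  = rng.normal(0, 0.1, (T, D))   # positional embedding: 8 x 16 = 128
Wq = rng.normal(0, 0.1, (D, D))   # query projection: 16 x 16 = 256
Wk = rng.normal(0, 0.1, (D, D))   # key projection: 256
Wv = rng.normal(0, 0.1, (D, D))   # value projection: 256
Wo = rng.normal(0, 0.1, (D, V))   # output projection to vocabulary: 160

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def forward(tokens):
    x = E[tokens] + P                    # (T, D): embed tokens, add positions
    q, k, v = x @ Wq, x @ Wk, x @ Wv     # project to queries, keys, values
    att = softmax(q @ k.T / np.sqrt(D))  # (T, T) scaled dot-product scores
    x = x + att @ v                      # self-attention with residual
    return softmax(x @ Wo)               # (T, V) per-position distributions

probs = forward(np.array([3, 1, 4, 1, 5, 9, 2, 6]))
n_params = sum(m.size for m in (E, P, Wq, Wk, Wv, Wo))
print(n_params)  # 1216, matching the table: 160 + 128 + 3*256 + 160
```

Note how the entire prediction head is just the single projection Wo: with no feed-forward block and no layer norm, those six matrices are the whole model.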
The task requires no transformation o...

First seen: 2026-03-28 11:39

Last seen: 2026-03-28 22:45