Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data


Summary

It’s 2022. I check out this cool new project, CUTLASS, with very fast matmuls. I take a large matmul, 8192 x 8192 x 8192, and benchmark it in PyTorch, which calls CuBLAS.

    python mm_bench.py
    > CuBLAS: 258 Teraflops

Not bad, 83% flop utilization. Now let’s check out CUTLASS’s performance using their profiler.

    ./cutlass_profiler --operation=Gemm --m=8192 --n=8192 --k=8192
    > CUTLASS: 288 Teraflops

!!! 10% higher perf? That’s incredible. CuBLAS is highly optimized for large compute-bound matmuls, and somehow CUTLASS + autotuning is outperforming it by 10%? We gotta start using these matmuls yesterday.

The next step is to bind the CUTLASS kernels into Python and compare against CuBLAS using my previous script.

    python cutlass_mm_bench.py
    > CuBLAS: 258 Teraflops
    > CUTLASS: 257 Teraflops

Somehow, in the light of Python, all of CUTLASS’s performance gains disappear. This in and of itself is not shocking: it’s notoriously difficult to ensure consistent benchmarking across setups.

I tediously ablate the two benchmark scripts until, finally, I find that CUTLASS’s profiler, by default, initializes the values in a fairly strange way: it only initializes the inputs with integers. Confused about whether this matters, I try:

    zero_inputs = torch.zeros(N, N)
    randn_inputs = torch.randn(N, N)
    benchmark(zero_inputs)   # 295 Teraflops
    benchmark(randn_inputs)  # 257 Teraflops

What? How could the values of the matrix affect the runtime of the model? I know Nvidia has some weird data compression thing on A100s, but I wouldn’t have expected that to be on in matmuls. Let’s try some other data distributions, like a uniform distribution [0,1].

This was … confusing, to say the least. Somehow, the actual content of the tensors being multiplied is leading to different matmul performance.

There certainly are cases where the runtime depends on the content of the tensor: indirect indexing (e.g. A[b]), or things like sparsity.

But matrix multiplications have nothing like that at all! No matter what the content...
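
For readers who want to try the comparison themselves, here is a minimal sketch of how such a benchmark might look. It is not the author's mm_bench.py, which the summary does not show. The helper names (matmul_flops, achieved_tflops, utilization, benchmark, compare_distributions), the float16 dtype, the CUDA-event timing, and the warmup and iteration counts are all assumptions made for illustration. The 312 TFLOPS peak in the closing note is the A100's published dense FP16 figure, which the summary implies but never states.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs in an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOPS for one matmul that took `seconds`."""
    return matmul_flops(m, n, k) / seconds / 1e12


def utilization(achieved: float, peak: float) -> float:
    """Fraction of the hardware's peak throughput actually reached."""
    return achieved / peak


def benchmark(a, b, warmup: int = 5, iters: int = 20) -> float:
    """Mean seconds per torch.mm(a, b) call, timed with CUDA events."""
    import torch  # imported lazily so the helpers above work without PyTorch

    for _ in range(warmup):  # let autotuning and clocks settle before timing
        torch.mm(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.mm(a, b)
    end.record()
    torch.cuda.synchronize()  # wait for the GPU before reading the timer
    return start.elapsed_time(end) / 1e3 / iters  # elapsed_time is in ms


def compare_distributions(n: int = 8192) -> None:
    """Time the same n x n matmul with differently initialized inputs."""
    import torch

    dtype = torch.float16  # assumption: the summary does not state the dtype
    makers = {
        "zeros": lambda: torch.zeros(n, n, device="cuda", dtype=dtype),
        "randn": lambda: torch.randn(n, n, device="cuda", dtype=dtype),
        "uniform [0, 1)": lambda: torch.rand(n, n, device="cuda", dtype=dtype),
    }
    for name, make in makers.items():
        secs = benchmark(make(), make())
        print(f"{name}: {achieved_tflops(n, n, n, secs):.0f} TFLOPS")
```

On a machine with a CUDA GPU, calling compare_distributions() prints one throughput line per input distribution. The pure helpers also reproduce the arithmetic in the text: utilization(258, 312) is roughly 0.83, matching the "83% flop utilization" figure under the assumed peak.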

First seen: 2026-05-27 13:53

Last seen: 2026-05-28 04:01
