An NSFW filter for Marginalia search

https://news.ycombinator.com/rss Hits: 2
Summary

… optional, that is.

I’ve been working on an NSFW filter for Marginalia Search, as that is something some people have asked for, primarily API consumers. The search engine has had some domain-based filtering for a while, based on the UT1 lists, but that isn’t a very comprehensive approach.

We’ll land on a single hidden layer neural network approach, implemented from scratch, but many other things were tried along the way. This is largely an abbreviated account of the way there.

There is a tension between speed and generality in classification. Building something that is both fast and reasonably correct in its assessments is incredibly fiddly work, even if the solution itself is often pretty straightforward.

The main constraint for a filter that runs in a search engine is that it needs to be really fast and run well on CPUs. This immediately disqualifies transformer-based models and other state-of-the-art approaches: capable as they are, they check neither of those boxes.

Fasttext

One of my early stabs at the problem was using fasttext, which is a classifier library from “Facebook, Inc.” (back when they were). It has a few years on it, but it’s well named in that it really is fast. The search engine already uses it for language identification, so no new dependencies! Worth a try at least.

The problem with training a classifier is that you need sample data, and kind of a lot of it. Thankfully, finding candidate samples is easy enough when you run a search engine. You can just search for them! Hook up a little script to the API, search for all manner of depravity, and save the results for labeling.

Training Data

To get a filter that is halfway decent we need tens of thousands of samples, and as exciting as manually labeling them sounds, I can’t help but feel there are better ways to spend a couple of weeks.

While NSFW sample sets nominally do exist, fast classifiers are very sensitive to the context and shape of the data.

Training a ...
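To make the "single hidden layer neural network, implemented from scratch" idea concrete, here is a minimal sketch in plain Python. It is not the search engine's actual code: the hashed bag-of-words input, the sizes `DIM` and `HIDDEN`, the ReLU and sigmoid choices, and the names `features` and `TinyClassifier` are all assumptions made for illustration.

```python
import math
import random
import zlib

DIM = 1024    # hypothetical size of the hashed bag-of-words input
HIDDEN = 16   # hypothetical width of the single hidden layer


def features(text):
    """Hash lowercase whitespace tokens into a sorted list of active input indices."""
    return sorted({zlib.crc32(tok.encode("utf-8")) % DIM for tok in text.lower().split()})


class TinyClassifier:
    """One hidden ReLU layer and a sigmoid output, trained with plain SGD on log loss."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(DIM)]
        self.b1 = [0.0] * HIDDEN
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
        self.b2 = 0.0

    def forward(self, idx):
        # The input is sparse, so only the rows of w1 for active features contribute.
        pre = [self.b1[j] + sum(self.w1[i][j] for i in idx) for j in range(HIDDEN)]
        hidden = [max(0.0, p) for p in pre]
        z = self.b2 + sum(h * w for h, w in zip(hidden, self.w2))
        return hidden, 1.0 / (1.0 + math.exp(-z))

    def predict(self, text):
        """Return a score in (0, 1); higher means more likely NSFW."""
        return self.forward(features(text))[1]

    def train_step(self, text, label, lr=0.1):
        idx = features(text)
        hidden, p = self.forward(idx)
        dz = p - label  # gradient of the log loss with respect to the output logit
        for j in range(HIDDEN):
            # Backpropagate through the ReLU before touching w2[j].
            dh = dz * self.w2[j] if hidden[j] > 0 else 0.0
            self.w2[j] -= lr * dz * hidden[j]
            self.b1[j] -= lr * dh
            for i in idx:
                self.w1[i][j] -= lr * dh
        self.b2 -= lr * dz
```

Because only the active feature rows are touched, inference cost scales with the number of tokens in a document rather than with `DIM`, which is the kind of property that keeps a model like this cheap on CPUs. A real filter would of course need the tens of thousands of labeled samples discussed above, not a toy training loop.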

First seen: 2026-03-30 18:12

Last seen: 2026-03-30 19:13