Scaling meme search to TODO million images

DRAFT

Updated 2024-12-03 / Created 2024-12-03 / 2.32k words

Downloading and indexing everything* on Reddit on one computer.

Be the first person to not do something that no one else has ever thought of not doing before.

— Brian Eno

Computers are very fast. It is easy to forget this when they routinely function so slowly, and now that many engineers are working on heavily abstracted cloud systems, but even my slightly outdated laptop is in principle capable of executing 15 billion instructions per core in each second it wastes stuttering and doing nothing in particular. People will sometimes talk about how their system has to serve "millions of requests a day", but a day is about 10^5 seconds, and the problem of serving tens of queries a second on much worse hardware than we have now was solved decades ago. The situation is even sillier for GPUs - every consumer GPU is roughly as fast as an entire 1990s supercomputer[1] and they mostly get used to shade triangles for games. In the spirit of Production Twitter on One Machine, Command-line Tools can be 235x Faster than your Hadoop Cluster and projects like Marginalia, I have assembled what I believe to be a competitively sized image dataset and search system on my one "server"[2] by avoiding work.

Scraping

The concept for this project was developed in May, when I was pondering how to get more memes and a more general collection without the existing semi-manual curation systems, particularly in order to provide open-domain image search. MemeThresher's crawler pulls from a small set of subreddits, and it seemed plausible that I could just switch it to r/all[3] to get a decent sample of recent data. However, after their IPO and/or some manager realizing unreasonably late that people might be willing to pay for unstructured text data now, Reddit does not want you to scrape much, and r/all listings consistently cut off after a few thousand items. Conveniently, however, in the words of Reddit's CTO:

“Existing” bulk data solutions that have been deployed (by others!) in the past generally include words such as “unsanctioned” and “bittorent”

The unsanctioned datasets distributed via BitTorrent, widely used in research and diligently maintained by PushShift and Arctic Shift, were pleasantly easy to download and use. After the slow process of decompressing and streaming all 500GB of historical submissions through some basic analysis tools on my staging VPS (it has a mechanical hard drive and two Broadwell cores...), I ran some rough estimates and realized that it would be possible for me to process all the images (up to TODO)[4] rather than just a subset.
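For concreteness, reading the dumps looks roughly like this: each file is a zstd-compressed stream of newline-delimited JSON submissions, which can be decompressed and filtered on the fly without ever writing the decompressed form to disk. This is an illustrative sketch (assuming the zstd and serde_json crates; the filename and the URL heuristics are placeholders), not the actual analysis code:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("RS_2023-01.zst")?; // placeholder filename
    let mut decoder = zstd::stream::read::Decoder::new(file)?;
    // The dumps are compressed with a large window; this raises the decoder's
    // limit accordingly (window_log_max in recent versions of the zstd crate).
    decoder.window_log_max(31)?;
    let reader = BufReader::new(decoder);

    let mut candidate_images = 0u64;
    for line in reader.lines() {
        let line = line?;
        let value: serde_json::Value = match serde_json::from_str(&line) {
            Ok(v) => v,
            Err(_) => continue, // the occasional malformed line
        };
        if let Some(url) = value.get("url").and_then(|u| u.as_str()) {
            // Cheap prefilter: only count things which are plausibly images.
            if url.ends_with(".jpg") || url.ends_with(".png") || url.contains("i.redd.it") {
                candidate_images += 1;
            }
        }
    }
    println!("{} candidate image URLs", candidate_images);
    Ok(())
}
```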

This may be unintuitive, since "all the images" was, based on my early estimates, about 250 million. Assuming a (slightly pessimistic) 1MB per image, I certainly don't have 250TB of storage. Usable thumbnails would occupy perhaps 50kB each with the best available compression, which would have been very costly to apply, but 12TB is still more than I actually have free. The trick is that it wasn't actually necessary to store any of that[5]: to do search, only the embedding vectors, occupying about 2kB each, are needed (as well as some metadata for practicality). Prior work like img2dataset retained resized images for later embedding: I avoided this by implementing the entire system as a monolithic minimal-buffering pipeline going straight from URLs to image buffers to embeddings to a very large compressed blob on disk, with backpressure to clamp download speed to the rate necessary to feed the GPU.
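The shape of that pipeline is roughly the following sketch (assuming tokio; the stage functions are stand-ins and the real system has more stages and far more error handling). The important part is that the channels between stages are small and bounded, so a busy GPU stage makes `send` wait, which propagates backwards and throttles the downloaders:

```rust
use tokio::sync::mpsc;

const BATCH: usize = 64;

#[tokio::main]
async fn main() {
    let (img_tx, mut img_rx) = mpsc::channel::<Vec<u8>>(256);  // downloaded bytes
    let (emb_tx, mut emb_rx) = mpsc::channel::<Vec<f32>>(256); // embeddings

    // Downloader stage (many of these run concurrently in the real system).
    tokio::spawn(async move {
        for url in ["https://example.com/a.jpg", "https://example.com/b.jpg"] {
            if let Some(bytes) = fetch_image(url).await {
                // Awaits if the channel is full: this is the backpressure.
                if img_tx.send(bytes).await.is_err() { break; }
            }
        }
    });

    // Preprocess + embed stage: batches images and sends them to the GPU.
    tokio::spawn(async move {
        let mut batch = Vec::with_capacity(BATCH);
        while let Some(bytes) = img_rx.recv().await {
            batch.push(preprocess(&bytes));
            if batch.len() == BATCH {
                for emb in embed_batch(&batch).await {
                    if emb_tx.send(emb).await.is_err() { return; }
                }
                batch.clear();
            }
        }
        // (flushing the final partial batch omitted for brevity)
    });

    // Writer stage: append embeddings (plus metadata) to the output blob.
    while let Some(embedding) = emb_rx.recv().await {
        write_record(&embedding);
    }
}

// Stand-ins for the real downloader, resizer, inference client and writer.
async fn fetch_image(_url: &str) -> Option<Vec<u8>> { None }
fn preprocess(_bytes: &[u8]) -> Vec<u8> { Vec::new() }
async fn embed_batch(_batch: &[Vec<u8>]) -> Vec<Vec<f32>> { Vec::new() }
fn write_record(_embedding: &[f32]) {}
```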

I spent a day or two implementing this, with a mode to randomly sample a small fraction of the images for initial testing. This revealed some bottlenecks - notably, the inference server was slower than it theoretically could be and substantially CPU-hungry - which I was able to partly fix by hackily rewriting the model using AITemplate. I had anticipated running close to network bandwidth limits, but with my GPU fully loaded and the inference server improved I only hit 200Mbps down at first; a surprising and more binding limit was actually the CPU-based image preprocessing code, which I "fixed" by compromising image quality very slightly. I also had to increase several resource limits (file descriptors especially) and add local DNS caching to handle the unreasonable number of parallel downloads. This more or less worked, but more detailed calculations showed that I'd need a month of runtime and significant additional storage for a full run, and the electricity/SSD costs were nontrivial, so the project was shelved.
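As an aside on the parallel downloads: the usual way to keep the number of in-flight requests (and therefore open sockets and file descriptors) bounded is a counting semaphore. A minimal sketch, assuming tokio and reqwest, with an illustrative limit:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let permits = Arc::new(Semaphore::new(512)); // max concurrent downloads
    let urls: Vec<String> = vec![/* ...filled from the dump... */];

    let mut tasks = Vec::new();
    for url in urls {
        // Blocks here once 512 downloads are in flight.
        let permit = permits.clone().acquire_owned().await.unwrap();
        let client = client.clone();
        tasks.push(tokio::spawn(async move {
            let _permit = permit; // held until this download finishes
            let resp = client.get(url.as_str()).send().await;
            match resp {
                Ok(r) => r.bytes().await.ok().map(|b| (url, b)),
                Err(_) => None,
            }
        }));
    }
    for task in tasks {
        if let Ok(Some((url, bytes))) = task.await {
            println!("{}: {} bytes", url, bytes.len());
        }
    }
}
```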

Recently, some reprioritization (and needing a lot of additional storage anyway) led me to resurrect the project from the archives. I had to make a few final tweaks to integrate it with the metrics system, reduce network traffic by ignoring probably-non-image URLs earlier, log some data I was missing and (slightly) handle links to things like Imgur galleries. After an early issue in which miswritten concurrency code read records in the wrong order, so that it would not correctly recover from a restart, it ran very smoothly for a few days. There were, however, several unexplained discontinuities in the metrics, as well as some gradual changes over time which resulted in me using far too much CPU time. I had to actually think about optimization.

The metrics dashboard just after starting it up. The white stripe is due to placeholder "image deleted" images, which weren't discarded early in the pipeline until later in the run.

While not constantly maxing out CPU, it was bad enough to worsen GPU utilization.

There were, conveniently, easy solutions. I reanalyzed some code and realized that I was using an inefficient msgpack library in the Python inference server for no particular reason, which was easy to swap out; that having the inference server client send images as PNGs to reduce network traffic was not actually necessary and was probably costing nontrivial CPU time in encode/decode (PNG uses outdated and slow compression); and that a fairly easy replacement for the Rust image resizing code was available with significant speedups. This reduced load enough to keep things functioning stably at slightly less than 100% CPU for a while, but it crept up again later. Further profiling revealed no obvious low-hanging fruit other than console-subscriber, a monitoring tool for tokio async code, using ~15% of runtime for no good reason - fixing this and switching again to slightly lower-quality image resizing fixed everything for the remainder of the run. There was a later spike in network bandwidth which appeared to be due to many more large PNGs turning up to download, which I worried would sink the project (or at least make it 20% slower), but this resolved itself after a few hours.
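The PNG change mentioned above amounts to shipping the resized raw pixels in a MessagePack frame instead of encoding them: more bytes on the wire, but no slow encode on the client or decode in the server. A sketch of the frame, assuming the serde, serde_bytes and rmp-serde crates (field names are illustrative, not the actual wire format):

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct ImageFrame {
    width: u32,
    height: u32,
    #[serde(with = "serde_bytes")] // pack as one binary blob, not a list of ints
    rgb: Vec<u8>,
}

fn encode_frame(width: u32, height: u32, rgb: Vec<u8>) -> Vec<u8> {
    let frame = ImageFrame { width, height, rgb };
    rmp_serde::to_vec(&frame).expect("msgpack encoding should not fail")
}

fn main() {
    // A dummy 224x224 RGB image (~150kB raw); the real pixels come out of the resizer.
    let rgb = vec![0u8; 224 * 224 * 3];
    let encoded = encode_frame(224, 224, rgb);
    println!("frame is {} bytes", encoded.len());
}
```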

Indexing

Previous Meme Search Engines have used a small dataset in the tens of thousands of items, so search was very easy and could work through trivial brute force in milliseconds, and I mostly treated the vector index as a black box. A four-order-of-magnitude scaleup makes many problems less easy: a brute-force scan takes an impractical several minutes, so ANN algorithms are necessary. Unfortunately, I have weird enough requirements that my problems went from "trivial" to "active area of research": I have ten times more data than fits in RAM uncompressed, I want reasonably high recall[6], I want low latency (hundreds of milliseconds at most), I have much less CPU power than the people studying this usually do, and I need to search image embeddings using text embeddings, which apparently makes ANN indices' jobs much harder.
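For scale: 250 million embeddings at ~2kB each is roughly 500GB of vectors, so even a perfectly efficient brute-force scan has to stream hundreds of gigabytes off disk per query, which is where the several minutes comes from. The baseline being replaced is just this (dimension and the dummy data are illustrative):

```rust
const DIM: usize = 512; // illustrative embedding dimension

/// Score every vector against the query and keep the top k. Fine for tens of
/// thousands of vectors; minutes for hundreds of millions streamed off disk.
fn brute_force_top_k(query: &[f32], vectors: &[f32], k: usize) -> Vec<(usize, f32)> {
    assert_eq!(query.len(), DIM);
    let mut scored: Vec<(usize, f32)> = vectors
        .chunks_exact(DIM)
        .enumerate()
        .map(|(i, v)| (i, query.iter().zip(v).map(|(a, b)| a * b).sum::<f32>()))
        .collect();
    // Highest dot product first; total_cmp sidesteps f32 not being Ord.
    scored.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}

fn main() {
    let query = vec![1.0f32; DIM];
    let vectors = vec![0.5f32; DIM * 1000]; // 1000 dummy database vectors
    println!("{:?}", &brute_force_top_k(&query, &vectors, 5)[..2]);
}
```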

The "default" option for this is FAISS's inverted lists (clustering vectors and searching a subset of clusters; these can be stored on disk and memory-mapped) and product quantization (lossy compression of vectors for faster scanning and smaller memory footprint). FAISS's documentation is somewhat haphazard, and while some of it shows this working very well, we also see significantly worse performance with product quantization alone on datasets I think are more representative (though they do not actually seem to list their dataset sizes there). Recent work has shown terrible recall performance on similar datasets (TEXT2IMAGE-100M and -1B)

Since big tech extensively uses vector indices, there are better solutions available. DiskANN uses an on-disk graph data structure[7] and has significantly better recall/QPS curves. It doesn't have Rust bindings, but there is apparently an entire Rust port. Said port was never finished, only builds on Windows, looks overcomplicated and seems to make some strange decisions, so I don't trust it. I considered trying to work out how to bind the DiskANN C++ code to Rust instead and ignoring its quirks, but I don't like C++ and it makes other strange design decisions. The algorithm did not look that hard, so I decided to implement it myself, which would also allow me to reduce latency by storing results' URLs and metadata along with their vectors and graph neighbours. It was, however, hard, especially on my compute budget. There were many confusing problems[8], smallish indices built on my test data turned out badly at first, and my code had to do far too many disk reads because I had ignored a component (storing product-quantized vectors in memory to decide which neighbours to visit in the graph) which was actually very important. I expected fixing this to take a while, so I used the spare time to run the scraper for longer; you get slightly better results out of this, at least.
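The search side of a DiskANN-style graph is, roughly, a greedy beam search: keep a small list of the best candidates found so far (ranked by the cheap in-memory PQ scores), repeatedly expand the best unexpanded one by reading its neighbour list from disk, then rerank the survivors using full-precision vectors. A simplified sketch, not my actual implementation, with closures standing in for the disk reads and the PQ scoring:

```rust
use std::collections::HashSet;

fn beam_search(
    entry: u32,
    beam_width: usize,
    k: usize,
    approx_score: impl Fn(u32) -> f32,   // in-memory PQ dot product
    neighbours: impl Fn(u32) -> Vec<u32>, // disk read of a node's adjacency list
    exact_score: impl Fn(u32) -> f32,    // full-precision score (more disk reads)
) -> Vec<(u32, f32)> {
    // Candidate list of (approximate score, id), best first, capped at beam_width.
    let mut candidates: Vec<(f32, u32)> = vec![(approx_score(entry), entry)];
    let mut visited: HashSet<u32> = HashSet::new();

    loop {
        // Pick the best not-yet-expanded candidate.
        let Some(&(_, node)) = candidates.iter().find(|&&(_, id)| !visited.contains(&id)) else {
            break; // everything in the beam has been expanded
        };
        visited.insert(node);
        for n in neighbours(node) {
            if !visited.contains(&n) && !candidates.iter().any(|&(_, id)| id == n) {
                candidates.push((approx_score(n), n));
            }
        }
        candidates.sort_by(|a, b| b.0.total_cmp(&a.0)); // highest score first
        candidates.truncate(beam_width);
    }

    // Rerank the surviving candidates with full-precision vectors (stored next
    // to the neighbour lists and metadata in the real layout).
    let mut out: Vec<(u32, f32)> =
        candidates.into_iter().map(|(_, id)| (id, exact_score(id))).collect();
    out.sort_by(|a, b| b.1.total_cmp(&a.1));
    out.truncate(k);
    out
}
```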

It was around that time that I read the RoarGraph paper, which implied that I would have to adjust the algorithms to use information from the query distribution (of text vectors) to build an index effectively.

Improving results

The index was generated from a filtered and deduplicated version of the dump - the embeddings are very amenable to useful classification, so I discarded anything which looked like a YouTube or Imgur "this has been deleted" image (I don't know why they couldn't just send HTTP 404; I eventually put this into the scraper itself) and anything NSFW Reddit might have missed. Deduplicating without an index already built was quite tricky since the image encoder is nondeterministic[9], so I went for hashing quantized vectors and URLs, though this likely missed things. Sadly, by Sturgeon's law, this is insufficient to get good results, since most things are bad. TODO.
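The dedup key works roughly like this: coarsely quantize each embedding so that the encoder's small nondeterministic wobble usually maps copies of the same image to identical bytes, hash that, and separately hash the URL; a record is dropped if either key has been seen before. A sketch (the quantization step and the hasher are illustrative, and near-boundary values can still slip through, hence the missed duplicates):

```rust
use std::collections::{hash_map::DefaultHasher, HashSet};
use std::hash::{Hash, Hasher};

/// Hash a coarsely quantized copy of the embedding so that slightly different
/// re-encodings of the same image usually collide.
fn embedding_key(embedding: &[f32]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for &x in embedding {
        let q = (x * 16.0).round() as i32; // snap each component to a coarse grid
        q.hash(&mut hasher);
    }
    hasher.finish()
}

/// Returns true if neither the (quantized) vector nor the URL has been seen.
fn is_new(
    seen_vecs: &mut HashSet<u64>,
    seen_urls: &mut HashSet<String>,
    embedding: &[f32],
    url: &str,
) -> bool {
    let fresh_vec = seen_vecs.insert(embedding_key(embedding));
    let fresh_url = seen_urls.insert(url.to_string());
    fresh_vec && fresh_url // drop the record if either key was seen before
}
```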

Notes

The meme search master plan.

  • I should have done more deduplication earlier in the pipeline to avoid wasting computing time.
  • The runtime of index building is dominated by computing dot products. I messed around for two days with integer quantization but had substantial accuracy problems, then looked more carefully at the dot product code I was depending on and beat it by a factor of three (in microbenchmarking) in twenty minutes using highly advanced techniques like "unrolling the loop at all", "using multiple accumulators" and "assuming vectors have lengths which are multiples of 32" (see the sketch after this list). There's probably a lesson here. I don't know what it is.
  • I don't have enough cores for the ParlayANN algorithms to be necessary, but the paper provided useful information about search quality on various datasets, and its code is nicer than the main DiskANN library.
  • Nearest Neighbour Normalization may have been helpful here, but for various reasons implementation would have been difficult. It may be implemented "later".
  • DiskANN++ has some relevant performance improvements. I adopted its entry vertex selection, though mostly because it was structurally easier.
  • Much of the difficulty of this project came from running on underpowered hardware. With a mere* terabyte or so of RAM, indexing with off-the-shelf tools would have been a non-issue, and the streaming embedding generation could have been replaced with normal disks, img2dataset and faster high-end GPUs. Many people have access to this and the idea has been possible since at least last year. However, nothing like this is, to my knowledge, currently available anywhere else - Google and Bing have bad image search, Yandex (for some reason) is slightly better and the big clip-retrieval deployments are dead. We can thus conclude that the EMH isn't real.
  • Filtering of results by non-vector metadata is done in the naive way - postprocessing results from the vector index. It is possible to do better, but the algorithms for this are quite limited and I did not want the extra complexity.
  • The frontend now also has some extra telemetry so I can hopefully train better recommendation systems later.
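For reference, the faster dot product mentioned above has roughly this shape: assume the length is a multiple of 32, unroll, and keep four independent accumulators so the floating-point adds don't pile up into one long dependency chain. This is an illustrative version, not the exact kernel:

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len() % 32, 0); // the "lengths are multiples of 32" assumption
    let (mut s0, mut s1, mut s2, mut s3) = (0.0f32, 0.0f32, 0.0f32, 0.0f32);
    let mut i = 0;
    while i < a.len() {
        // 32 elements per iteration, spread over four independent accumulators
        // so the adds can overlap instead of waiting on each other.
        for j in 0..8 {
            s0 += a[i + j] * b[i + j];
            s1 += a[i + 8 + j] * b[i + 8 + j];
            s2 += a[i + 16 + j] * b[i + 16 + j];
            s3 += a[i + 24 + j] * b[i + 24 + j];
        }
        i += 32;
    }
    (s0 + s1) + (s2 + s3)
}

fn main() {
    let a = vec![0.5f32; 512];
    let b = vec![2.0f32; 512];
    println!("{}", dot(&a, &b)); // 512 * (0.5 * 2.0) = 512
}
```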

  1. Yes, I know we count supercomputer power in FP64 and consumer hardware mostly won't do double-precision. I am ignoring that for the purposes of art. ↩︎

  2. It has an AMD Ryzen 5 5500, 64GB of slow RAM and an Nvidia RTX 3090. ↩︎

  3. For non-Redditors, this is the merged content of all subreddits. ↩︎

  4. I did not, technically, actually process all of them. Some of them are probably in formats my code can't process, some (so, so many) are dead links now, some are above the arbitrary filesize limit of 16MiB I imposed, some of them weren't directly linked to, and a few were skipped due to lazy restart-on-failure code which uses the last timestamp as a cutoff even though the output is not fully linearized. ↩︎

  5. It might also be legally fraught to, in any case. ↩︎

  6. You may be wondering why I didn't just do what the old LAION search index did: while I don't have access to it to check, I think it compromised lots on recall. Private discussions with rom1504 confirm this. ↩︎

  7. This seems weird since random reads are quite expensive, even with SSDs, but I guess the graph-based algorithm is good enough that it works out. Optane would do better if Intel still sold it. ↩︎

  8. I kept accidentally introducing complicated bugs which partly cancelled each other out. ↩︎

  9. Mathematically, it shouldn't be, of course, but fast inference on consumer Nvidia hardware introduces small numerical differences between runs, and forcing CUDA determinism has a ~10% performance hit. ↩︎
