Isn't it really expensive to dedupe images based on content? Don't you have to compare every image to every other image in the dataset?
How could one go about deduping images? Maybe using something similar to the rsync protocol: a cheap hash method, then a more expensive one, then a full comparison, maybe. Even so, 2B+ images... and you are talking mostly about saving on storage costs, which are quite cheap these days.
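That staged, rsync-like idea can be sketched roughly like this. A toy pipeline for exact-byte duplicates only (names and stage choices are illustrative, not a real implementation): a nearly free key first, a full hash only on collision, and a byte comparison only on a hash match.

```python
import hashlib

def cheap_key(data: bytes) -> tuple:
    # Stage 1: nearly free - size plus a hash of just the first 4 KiB.
    return (len(data), hashlib.md5(data[:4096]).hexdigest())

def full_hash(data: bytes) -> str:
    # Stage 2: only computed when stage 1 collides.
    return hashlib.sha256(data).hexdigest()

def dedupe(files: dict) -> dict:
    """Map each filename to the first filename holding identical bytes."""
    by_cheap = {}
    canonical = {}
    for name, data in files.items():
        candidates = by_cheap.setdefault(cheap_key(data), [])
        for other in candidates:
            # Stage 3: full comparison, reached only on a stage-2 match.
            if full_hash(files[other]) == full_hash(data) and files[other] == data:
                canonical[name] = other
                break
        else:
            candidates.append(name)
            canonical[name] = name
    return canonical
```

Note this only catches byte-identical files; a re-encoded or rescaled copy sails straight through, which is why the perceptual-hash approaches below exist.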
It depends on exactly what problem you're trying to solve. If the goal is to find the same image with slight differences caused by re-encoding, downsampling, scaling, etc. you can use something like phash.org pretty efficiently to build a database of image hashes, review the most similar ones, and use it to decide whether you've already “seen” new images.
That approach works well when the images are basically the same. It doesn't work so well when you're trying to find images which are either different photos of the same subject, or where one of them is a crop of a larger image or has been modified more heavily. A number of years back I used OpenCV for that task[1] to identify the source of a given thumbnail image in a larger master file, and used phash to validate that a new higher-resolution thumbnail was highly similar to the original low-res thumbnail after trying to match the original crop & rotation. I imagine there are far more sophisticated tools for that now, but at the time phash felt basically free in comparison to the amount of computation which OpenCV required.
No, you embed all images with CLIP and use an approximate nearest neighbour library (like faiss) to get the most similar ones to the query in logarithmic time. Embedding will also be invariant to small variations.
You can try this on images.yandex.com - they do similarity search with embeddings. Upload any photo and you'll get millions of similar photos, unlike Google that has only exact duplicate search. It's diverse like Pinterest but without the logins.
You don't have to compare all images to one another, and doing so wouldn't reliably dedupe anyway - what if one image has a slightly different resolution, a different image type, different metadata, etc.? They would have different hashes but still be basically the same data.
I think the way you do it is to train a model to represent images as vectors. Then you put those vectors into a spatial index (a k-d tree or similar), which lets you efficiently query for the "nearest neighbor" to an image in log(n) time. You calibrate to find a distance threshold that picks up duplicates without getting too many non-duplicates, and then it's n log(n) time rather than n^2.
If that's still too slow, there is also a library called Annoy which does approximate nearest neighbor search even faster.
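The calibrated-threshold idea can be sketched in a few lines. This uses brute force for clarity (the O(n^2) double loop is exactly what an ANN index like Annoy or faiss would replace); the vectors stand in for model embeddings, and the 0.1 threshold is a made-up calibration value:

```python
import math

def cosine_distance(a, b):
    # 0.0 for identical directions, 2.0 for opposite ones.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def find_duplicates(vectors, threshold=0.1):
    # Brute force O(n^2) for clarity; a real system would use an
    # ANN index to pull only the near neighbours of each vector.
    dupes = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_distance(vectors[i], vectors[j]) < threshold:
                dupes.append((i, j))
    return dupes
```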
There are hash algorithms for image similarity. For a toy example, imagine scaling the image to 8x8px, making it grayscale, thresholding each pixel against the mean brightness, and using those 64 bits as the hash. That way you only have to hash each picture once, and can find duplicates by searching for hashes with a low hamming distance (number of bit flips) to each other, which is very fast.
Of course actual hash algorithms are a bit cleverer, there are a number to choose from depending on what you want to consider a duplicate (cropping, flips, rotations, etc)
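The 8x8 toy above is essentially the classic "average hash". A minimal sketch, operating on an already-scaled 8x8 grayscale grid since image decoding is out of scope here:

```python
def average_hash(pixels):
    """Toy 'aHash': pixels is an 8x8 grid of grayscale values (0-255).
    Each pixel becomes one bit: 1 if brighter than the mean, else 0."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return sum(1 << i for i, p in enumerate(flat) if p > mean)

def hamming(h1, h2):
    # Number of differing bits; a small distance means similar images.
    return bin(h1 ^ h2).count("1")
```

Because the hash survives rescaling and recompression (only the coarse brightness pattern matters), re-encoded copies land within a few bits of each other.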
You're probably confusing "cryptographic hash functions" with "perceptual hashing" (or other forms of "locality-sensitive hashing"). In the case of the latter, what you say is almost always not true (that's the point of using "perceptual hashing" after all: similar objects get mapped to similar/same hash).
I don't have experience with image duplication, but if you can make a decent hash a 2.3 billion item hashtable is really cheap.
If you need to do something closer to pairwise (for instance, because you can't make a cheap hash of images which papers over differences in compression), make the hash table for the text descriptions, then compare the images within buckets. Of the 5 or 6 text fields I just spot-checked (not even close to a random selection), the worst false positive I found (in the 12M data set) was 3 pairs of two duplicates with the same description. On the other hand, I found one set of 76 identical images with the same description.
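A sketch of that bucket-by-description scheme (the byte-equality check is a stand-in for whatever image comparison you'd actually run inside a bucket; the normalization is illustrative):

```python
import hashlib
from collections import defaultdict

def find_candidate_duplicates(items):
    """items: list of (description, image_bytes) pairs.
    Bucket by a hash of the normalized description, then compare
    images pairwise only inside each (small) bucket."""
    buckets = defaultdict(list)
    for desc, img in items:
        key = hashlib.sha1(desc.strip().lower().encode()).hexdigest()
        buckets[key].append(img)
    dupes = 0
    for imgs in buckets.values():
        for i in range(len(imgs)):
            for j in range(i + 1, len(imgs)):
                if imgs[i] == imgs[j]:  # stand-in for a real image comparison
                    dupes += 1
    return dupes
```

The pairwise cost is now quadratic only in the bucket size, not in the whole dataset, which is why a few small false-positive buckets are acceptable.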
Convert the images to embeddings, and perform an approximate nearest neighbor search on them and identify images that are very close together (e.g. with faiss, which the page alludes to using).
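One way approximate nearest neighbour search avoids the all-pairs comparison is locality-sensitive hashing over the embeddings. A toy random-hyperplane sketch (faiss and similar libraries use far more sophisticated index structures, but the bucketing intuition is the same):

```python
import random

def lsh_buckets(vectors, n_planes=8, seed=0):
    """Toy LSH via random hyperplanes: vectors whose dot-product signs
    agree on every plane land in the same bucket, so near neighbours are
    found by scanning one small bucket instead of the whole dataset."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = {}
    for idx, v in enumerate(vectors):
        # The bucket key is the pattern of which side of each plane v falls on.
        sig = tuple(sum(p * x for p, x in zip(plane, v)) >= 0 for plane in planes)
        buckets.setdefault(sig, []).append(idx)
    return buckets
```

After bucketing, you only run the expensive distance comparisons within each bucket (or across a handful of neighbouring buckets), which is what turns the naive n^2 scan into something tractable at billions of images.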