Useful relevance ranking is definitely difficult, but a lot of people are gettin...

throwaway81523 · on Oct 15, 2022

The only way I found to get any useful ranking in Solr was to implement application-level scoring of the documents based on semantic considerations of their contents, and also of relationships between the documents. And even that was pretty bad. I concocted some dubious schemes to condition those scores on external info as well, but the project ended before I got to try any of that.

Yes, that book described tf/idf and bm25, but also a bunch of other stuff that was somewhat more promising. Really though, IMHO there is no getting away from understanding the actual documents, and possibly their connection with the outer world (pagerank being an example of the latter). For web search it's now probably worse than before, since instead of merely being overwhelmed by noise, the data is now actually adversarial in the sense of having SEO trying to game your ranking.
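For reference, bm25 only looks at term counts and document lengths, which is why it knows nothing about what a document actually means. A minimal sketch of the usual formula, assuming the Lucene-style idf variant and naive whitespace tokenization (the function name and defaults here are illustrative, not taken from the book):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with plain BM25.

    Assumptions: lowercase whitespace tokenization, Lucene-style idf,
    and the common default parameters k1=1.2, b=0.75.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents get less credit per hit.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Nothing in there can tell a relevant page from a keyword-stuffed one, which is the gap that application-level or external signals have to fill.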

Even without that though, go on any retail site and try a search. The relevance ranking will be so awful that sorting by price or age or alphabetically will work a lot better. Same thing with the Algolia search here on HN: chronological is almost always more useful than the search engine's idea of relevance. To automate relevance, the scoring system needs much more semantic understanding of the data. Maybe that is more feasible with recent advances in NLP. I don't know whether that is good or bad.

ramraj07 · on Oct 15, 2022

As long as you can calculate a metric of importance for each of your rows, Postgres search rank seems to work plenty fine. We use Postgres FTS to power many of our search efforts and are replacing elasticsearch implementations because they’re harder to manage and no one seems to notice any quality differences. We simply use the product of the importance metric and the ts_rank, and that seems to do the trick.
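As a rough sketch of what multiplying the two can look like: the table and column names (`docs`, `tsv`, `importance`) are made up for illustration, and the `%(name)s` placeholders assume a psycopg-style driver. `ts_rank`, `plainto_tsquery` and `@@` are the standard Postgres full-text pieces.

```python
# Hypothetical schema: docs(id, title, tsv tsvector, importance float).
# The query ranks matches by importance * ts_rank, highest first.
RANKED_SEARCH_SQL = """
SELECT id, title,
       importance * ts_rank(tsv, query) AS score
FROM docs, plainto_tsquery('english', %(q)s) AS query
WHERE tsv @@ query
ORDER BY score DESC
LIMIT %(limit)s
"""

def combined_score(importance: float, text_rank: float) -> float:
    """Same product as in the SQL, for rows that were already fetched."""
    return importance * text_rank

def rerank(rows):
    """Sort (doc_id, importance, text_rank) tuples by the combined score."""
    return sorted(
        rows,
        key=lambda r: combined_score(r[1], r[2]),
        reverse=True,
    )
```

With a plain product, a row with zero importance never surfaces no matter how well the text matches, so the importance metric probably wants a floor above zero.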