Show HN: Pisa – Full Text Search Engine Written in C++

rubyn00bie · on March 14, 2020

Ugh, I hate it when readme's have zero useful information; and literally only a way to cite them as a source... this is a big problem with C++ libraries. I 100% think the author(s) deserve credit for their work, and I'm sure it's brilliant, but I'll never use your work if I don't know what it does.

How does this compare to anything other than being fast? Why should I use it? Can't you include a few examples, use cases, or reasons why it exists? Is it in memory, distributed, on disk... when is it fast? Does it have a binary I can run?

[clicks around documentation site]

... oh shit there is a binary of something, maybe multiple... how about showing those on the readme? Or maybe talk about what algorithms are actually implemented instead of just saying lots of them are.

PisaSheeya · on March 14, 2020

I had the same problem. I really wanted to like this project, but I found it hard to get a handle on what their goal is. Performant, c++17, open source, IR tools sounds great to me. But.. how does someone use this? Are they a parallel to Lucene? ElasticSearch? Grep?

Talking about scale would help a lot here. What's the largest dataset they've indexed? Do they shard across multiple nodes? etc.

I found they had a research paper about PISA at OSIRRC (a replicability challenge for ir?) last year with some details. You can get the paper and slides off the conference site:

https://osirrc.github.io/osirrc2019/

They have run the dataset on things like ClueWeb12 (1.5TB web), but the paper was about replicable search and lacked performance comparisons to other systems. It's hard to call yourself performant unless you show you're at least as good as other implementations.

BubRoss · on March 14, 2020

The only mention of a benchmark I see is an issue raised over a year ago, so I wonder what they are basing this claim on.

epr · on March 14, 2020

It is quite confusing when projects advertise that they have best-in-class performance then don't provide benchmark results in the readme or documentation. This may be an extreme example of that.

b52_ · on March 15, 2020

Where do they claim to be best in class? I don't seem to see that anywhere.

rozim · on March 14, 2020

Would be nice to see it compared with Xapian as Xapian has been around for a long time.

https://xapian.org/

ddorian43 · on March 14, 2020

There is also vespa.ai proton search-core in c++ https://docs.vespa.ai/documentation/proton.html

_xnmw · on March 14, 2020

I want someone to use one of these libraries to build a local-desktop-offline personal search engine which only crawls sources I select, including my personal emails and browsing history.

syspec · on March 14, 2020

Finder on OSX meets that criteria

jdc · on March 14, 2020

Now if only it would index the actual web pages of the browser history.

SanchoPanda · on March 14, 2020

Recoll does this.

ternaryoperator · on March 14, 2020

That looks like a really cool tool. Thanks for mentioning it!

rubyn00bie · on March 14, 2020

Already on it :)

prtaylor · on March 15, 2020

What makes it specific to academia? Can't it be used as a general purpose search?

skyde · on March 14, 2020

nice I didn’t know about this project and it’s using my implementation of VarIntG8IU codec.

I will give it a try.

amallia · on March 14, 2020

Maxime Caron?