Re: pg_trgm vs. Solr ngram - Mailing list pgsql-general

From Bertrand Mamasam
Subject Re: pg_trgm vs. Solr ngram
Date
Msg-id CACZ67_UD38B7r3ug5uSW2_TgtdGt+hPhhw0BnsM+bK1vQE_01g@mail.gmail.com
In response to pg_trgm vs. Solr ngram  (Chris <rc@networkz.ch>)
List pgsql-general


On Fri, Feb 10, 2023, 03:20, Chris <rc@networkz.ch> wrote:
Hello list

I'm pondering migrating an FTS application from Solr to Postgres, just
because we use Postgres for everything else.

The application is basically fgrep with a web frontend. However the
indexed documents are very computer network specific and contain a lot
of hyphenated hostnames with dot-separated domains, as well as IPv4 and
IPv6 addresses. In Solr I was using ngrams and customized the
TokenizerFactories until more or less only whitespace acted as a separator,
while [.:-_\d] remained part of the ngrams. This allows searching for
".12.255/32" or "xzy-eth5.example.org" without any false positives.

It looks like a straight conversion of this method is not possible since
the tokenization in pg_trgm is not configurable afaict. Is there some
other good method to search for a random substring including all the
punctuation using an index? Or a pg_trgm-style module that is more
flexible like the Solr/Lucene variant?

Or maybe hacking my own pg_trgm wouldn't be so hard and could be fun. Do
I pretty much just need to change the emitted tokens, or will this lead
to significant complications in the operators, indexes, etc.?

thanks for any hints & cheers
Christian

In Solr you used FTS, so I suggest that you do the same in Postgres and look at the full text search functions. You can build a tsvector yourself in many different ways or use one of the provided functions such as to_tsvector. That way you could add complete IP addresses to your index and then search for them using something like phrase search. You can also create your own text search configurations, or just use the "simple" one if you only need something like fgrep. Of course, the end result will be more like Solr and less like fgrep.
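
To illustrate the "create a tsvector yourself" route, here is a rough Python sketch that splits a document on whitespace only and formats the tokens as tsvector input text, so that hostnames and IP addresses survive as whole lexemes. The function names quote_lexeme and build_tsvector_literal are made up for this example, and the lowercasing is my own assumption (drop it if you need case-sensitive matching). It relies on the tsvector text format accepting quoted lexemes with embedded quotes and backslashes doubled, followed by a comma-separated position list.

```python
from collections import defaultdict


def quote_lexeme(token: str) -> str:
    """Quote a token for tsvector text input (hypothetical helper).

    Embedded single quotes and backslashes are doubled.
    """
    escaped = token.replace("\\", "\\\\").replace("'", "''")
    return f"'{escaped}'"


def build_tsvector_literal(document: str) -> str:
    """Split on whitespace only and emit 'lexeme':pos,pos entries.

    Punctuation such as '.', ':', '-', '_' and '/' stays inside the token.
    Lowercasing is an assumption made for this sketch.
    """
    positions = defaultdict(list)
    for pos, token in enumerate(document.split(), start=1):
        positions[token.lower()].append(pos)
    return " ".join(
        f"{quote_lexeme(tok)}:{','.join(map(str, plist))}"
        for tok, plist in sorted(positions.items())
    )


if __name__ == "__main__":
    print(build_tsvector_literal("xzy-eth5.example.org via 10.0.12.255/32"))
```

The resulting string could be passed as a query parameter and cast with ::tsvector on the Postgres side, with the search terms tokenized the same way. Note that this matches whole tokens (or prefixes, via :* in a tsquery), so a search for an arbitrary substring like ".12.255/32" would still not hit, which is the "more like Solr and less like fgrep" trade-off.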


