pg_trgm vs. Solr ngram - Mailing list pgsql-general

From Chris
Subject pg_trgm vs. Solr ngram
Date
Msg-id 4628c3f6-e2c5-1484-71cf-62446cec984d@networkz.ch
Responses Re: pg_trgm vs. Solr ngram  (Laurenz Albe <laurenz.albe@cybertec.at>)
Re: pg_trgm vs. Solr ngram  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: pg_trgm vs. Solr ngram  (Bertrand Mamasam <golgote@gmail.com>)
List pgsql-general
Hello list

I'm pondering migrating an FTS application from Solr to Postgres, just 
because we use Postgres for everything else.

The application is basically fgrep with a web frontend. However, the 
indexed documents are very computer-network specific and contain a lot 
of hyphenated hostnames with dot-separated domains, as well as IPv4 and 
IPv6 addresses. In Solr I was using ngrams and customized the 
TokenizerFactories so that more or less only whitespace acts as a 
separator, while [.:-_\d] remain part of the ngrams. This allows 
searching for ".12.255/32" or "xzy-eth5.example.org" without any false 
positives.

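In Postgres terms, the kind of lookup I want served from an index would 
be roughly the following (table and column names are just made up for 
illustration):

    -- hypothetical schema: one row per indexed document
    CREATE TABLE docs (id bigint PRIMARY KEY, body text);

    -- exact substring search, punctuation and all
    SELECT id FROM docs WHERE body LIKE '%.12.255/32%';
    SELECT id FROM docs WHERE body LIKE '%xzy-eth5.example.org%';
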
It looks like a straight conversion of this approach is not possible, 
since the tokenization in pg_trgm is not configurable as far as I can 
tell. Is there some other good method to search for an arbitrary 
substring, including all the punctuation, using an index? Or is there a 
pg_trgm-style module that is more flexible, like the Solr/Lucene 
variant?
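
For example, as far as I understand the default behaviour, pg_trgm 
throws away exactly the characters I care about (untested sketch):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    -- anything non-alphanumeric acts as a word separator, so the
    -- trigrams should come only from 'xzy', 'eth5', 'example' and
    -- 'org'; the '-' and '.' are not part of any trigram
    SELECT show_trgm('xzy-eth5.example.org');

I know a GIN index with gin_trgm_ops can accelerate LIKE '%...%' 
directly, but I'm unsure how selective that is when the pattern is 
mostly digits and punctuation, as in '.12.255/32'.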

Or maybe hacking my own variant of pg_trgm wouldn't be so hard and 
could even be fun. Would I pretty much just need to change the emitted 
tokens, or would that lead to significant complications in the 
operators, indexes, etc.?

thanks for any hints & cheers
Christian


