Thread: pg_trgm vs. Solr ngram

pg_trgm vs. Solr ngram

From: Chris
Date: 2023-02-10 03:20 +0100
Hello list

I'm pondering migrating an FTS application from Solr to Postgres, just 
because we use Postgres for everything else.

The application is basically fgrep with a web frontend. However, the 
indexed documents are very computer-network specific and contain a lot 
of hyphenated hostnames with dot-separated domains, as well as IPv4 and 
IPv6 addresses. In Solr I was using ngrams and customized the 
TokenizerFactories until more or less only whitespace acted as a 
separator, while [.:-_\d] remained part of the ngrams. This allows 
searching for ".12.255/32" or "xzy-eth5.example.org" without any false 
positives.

It looks like a straight conversion of this method is not possible, 
since the tokenization in pg_trgm is not configurable afaict. Is there 
some other good method to search for an arbitrary substring, including 
all the punctuation, using an index? Or a pg_trgm-style module that is 
more flexible, like the Solr/Lucene variant?
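
To illustrate, as far as I can tell the default tokenization simply 
throws the punctuation away:

  SELECT show_trgm('127.0.0.1/32');

afaict this yields only digit trigrams such as "  1", " 12", "127" and 
"27 ", with the dots and the slash gone entirely.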

Or maybe hacking my own pg_trgm wouldn't be so hard and could be fun. 
Do I pretty much just need to change the emitted tokens, or will this 
lead to significant complications in the operators, indexes, etc.?

thanks for any hints & cheers
Christian



Re: pg_trgm vs. Solr ngram

From: Laurenz Albe
Date: 2023-02-10 04:48
On Fri, 2023-02-10 at 03:20 +0100, Chris wrote:
> I'm pondering migrating an FTS application from Solr to Postgres, just
> because we use Postgres for everything else.
>
> The application is basically fgrep with a web frontend. However, the
> indexed documents are very computer-network specific and contain a lot
> of hyphenated hostnames with dot-separated domains, as well as IPv4 and
> IPv6 addresses. In Solr I was using ngrams and customized the
> TokenizerFactories until more or less only whitespace acted as a
> separator, while [.:-_\d] remained part of the ngrams. This allows
> searching for ".12.255/32" or "xzy-eth5.example.org" without any false
> positives.
>
> It looks like a straight conversion of this method is not possible,
> since the tokenization in pg_trgm is not configurable afaict. Is there
> some other good method to search for an arbitrary substring, including
> all the punctuation, using an index? Or a pg_trgm-style module that is
> more flexible, like the Solr/Lucene variant?
>
> Or maybe hacking my own pg_trgm wouldn't be so hard and could be fun.
> Do I pretty much just need to change the emitted tokens, or will this
> lead to significant complications in the operators, indexes, etc.?

Here is a hack that you can try: pre-process your strings and replace
symbols with rare characters:

  SELECT show_trgm(translate('127.0.0.1/32', './', 'qx'));

                          show_trgm
  ═════════════════════════════════════════════════════════
   {"  1"," 12",0q0,0q1,127,1x3,27q,"32 ",7q0,q0q,q1x,x32}
  (1 row)

Then you could search like this:

  WHERE translate(search_string, './', 'qx') LIKE translate('%127.0.0.1/32%', './', 'qx')
    AND search_string LIKE '%127.0.0.1/32%'

The first condition can use a trigram index, and the second filters out
false positives.
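
For the first condition to actually use a trigram index, the index has
to be built on the same "translate" expression. A minimal sketch,
assuming a table "docs" with a text column "search_string" (names made
up):

  CREATE EXTENSION IF NOT EXISTS pg_trgm;
  CREATE TABLE docs (search_string text);
  -- the indexed expression must match the one used in the WHERE clause
  CREATE INDEX docs_translate_trgm_idx ON docs
      USING gin (translate(search_string, './', 'qx') gin_trgm_ops);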

Yours,
Laurenz Albe



Re: pg_trgm vs. Solr ngram

From: Tom Lane
Chris <rc@networkz.ch> writes:
> Or maybe hacking my own pg_trgm wouldn't be so hard and could be fun. 
> Do I pretty much just need to change the emitted tokens, or will this 
> lead to significant complications in the operators, indexes, etc.?

See KEEPONLYALNUM in pg_trgm/trgm.h ...

Now, using a custom-modified pg_trgm module in production is likely
not something you want to do for long.  It might be interesting to
look into whether the relatively-recently-invented "operator class
parameter" mechanism could be exploited to allow this behavior to be
customized without hacking C code.
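
For comparison, the GiST opclass already accepts one such parameter,
"siglen", since v13.  A tokenization switch could conceivably be wired
up the same way (table and column names made up):

  -- existing opclass parameter in pg_trgm:
  CREATE INDEX ON docs USING gist (orig gist_trgm_ops(siglen = 256));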

            regards, tom lane



Re: pg_trgm vs. Solr ngram

From: Bertrand Mamasam
On Fri, Feb 10, 2023, 03:20, Chris <rc@networkz.ch> wrote:
> I'm pondering migrating an FTS application from Solr to Postgres, just
> because we use Postgres for everything else.
> [...]
> Is there some other good method to search for an arbitrary substring,
> including all the punctuation, using an index? Or a pg_trgm-style
> module that is more flexible, like the Solr/Lucene variant?

In Solr you used FTS, so I suggest you do the same in Postgres and look 
at the full text search functions. You can create a tsvector yourself 
in many different ways, or use one of the provided functions. So you 
could add complete IP addresses to your index and then search for them 
using something like phrase search. You can also create your own text 
search configurations, or just use the "simple" one if you need 
something like fgrep. Of course, the end result will be more like Solr 
and less like fgrep.
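
A minimal sketch, with a made-up "docs" table and "orig" column. The
default parser already keeps things like hostnames together as single
"host" tokens, and the "simple" configuration applies no stemming:

  -- index the raw text with the "simple" configuration
  CREATE INDEX docs_fts_idx ON docs
      USING gin (to_tsvector('simple', orig));

  -- then search for whole tokens, e.g. a hostname
  SELECT * FROM docs
  WHERE to_tsvector('simple', orig)
        @@ phraseto_tsquery('simple', 'xzy-eth5.example.org');

Note that this matches whole tokens only, so arbitrary substrings like
".12.255/32" would not be found.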



Re: pg_trgm vs. Solr ngram

From: "Christian Ramseyer (mx04)"
On 10.02.23 04:48, Laurenz Albe wrote:
> On Fri, 2023-02-10 at 03:20 +0100, Chris wrote:
>> In Solr I was using ngrams and customized the
>> TokenizerFactories until more or less only whitespace acted as a
>> separator, while [.:-_\d] remained part of the ngrams. This allows
>> searching for ".12.255/32" or "xzy-eth5.example.org" without any false
>> positives.
>>
>> It looks like a straight conversion of this method is not possible
>
> Here is a hack that you can try: pre-process your strings and replace
> symbols with rare characters:
> 
>    SELECT show_trgm(translate('127.0.0.1/32', './', 'qx'));
> 
>                            show_trgm
>    ═════════════════════════════════════════════════════════
>     {"  1"," 12",0q0,0q1,127,1x3,27q,"32 ",7q0,q0q,q1x,x32}
>    (1 row)
> 
> Then you could search like
> 
>    WHERE translate(search_string, './', 'qx') LIKE translate('%127.0.0.1/32%', './', 'qx')
>      AND search_string LIKE '%127.0.0.1/32%'
> 
> The first condition can use a trigram index, and the second filters out
> false positives.


Hehe, that is a nifty idea. I went to try it, but then it turned out 
that I was probably overthinking the whole issue. Using a gist_trgm_ops 
index and LIKE with % wildcards works perfectly well:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE TABLE docs (orig text);

insert into docs (orig) values ('120.2.10.22');
insert into docs (orig) values ('120 2 10 22');
CREATE INDEX iorig ON docs USING GIST (orig gist_trgm_ops);

set enable_seqscan = off;
explain analyze verbose select * from docs where orig like '%.10.22%';

Index Scan using iorig on public.docs  (cost=0.14..8.16 rows=1 width=32) 
(actual time=0.952..1.018 rows=1 loops=1)
   Output: orig
   Index Cond: (docs.orig ~~ '%.10.22%'::text)
   Rows Removed by Index Recheck: 1

Even though this query has the same trigrams as e.g. '% 10 22%', the 
index recheck takes care of it and only the matching row is returned. 
Excellent. I'm not quite sure why I was expecting false positives in 
the first place; that would be a pretty stark violation of how LIKE is 
supposed to behave.

I'm not quite sure how big the performance hit will be from not having 
the optimal punctuation-including trigrams in the index and having to 
recheck some hits, but for now I'll assume it's negligible. Otherwise 
I'll try the translate variant.
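
If it does turn out to matter, I'll also compare a GIN variant of the
index; as far as I know, GIN trigram lookups are usually faster than
GiST, at the cost of slower builds and updates:

  CREATE INDEX iorig_gin ON docs USING GIN (orig gin_trgm_ops);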

Many thanks!