Re: WIP: index support for regexp search - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: WIP: index support for regexp search
Date
Msg-id 4F187D5C.30701@enterprisedb.com
Whole thread Raw
In response to WIP: index support for regexp search  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: WIP: index support for regexp search
Re: WIP: index support for regexp search
List pgsql-hackers
On 22.11.2011 21:38, Alexander Korotkov wrote:
> WIP patch with index support for regexp search for pg_trgm contrib is
> attached.
> In spite of techniques which extracts continuous text parts from regexp,
> this patch presents technique of automatum transformation. That allows more
> comprehensive trigrams extraction.

Nice!

> Current version of patch have some limitations:
> 1) Algorithm of logical expression extraction on trigrams have high
> computational complexity. So, it can become really slow on regexp with many
> branches. Probably, improvements of this algorithm is possible.
> 2) Surely, no perfomance benefit if no trigrams can be extracted from
> regexp. It's inevitably.
> 3) Currently, only GIN index is supported. There are no serious problems,
> GiST code for it just not written yet.
> 4) It appear to be some kind of problem to extract multibyte encoded
> character from pg_wchar. I've posted question about it here:
> http://archives.postgresql.org/pgsql-hackers/2011-11/msg01222.php
> While I've hardcoded some dirty solution. So
> PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not
> supported yet.

This is pretty far from being in committable state, so I'm going to mark 
this as "returned with feedback" in the commitfest app. The feedback:

The code badly needs comments. There is no explanation of how the 
trigram extraction code in trgm_regexp.c works. Guessing from the 
variable names, it seems to be some sort of a coloring algorithm that 
works on a graph, but that all needs to be explained. Can this algorithm 
be found somewhere in literature, perhaps? A link to a paper would be nice.

Apart from that, the multibyte issue seems like the big one. Any way 
around that?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Arithmetic operators for macaddr type
Next
From: Dimitri Fontaine
Date:
Subject: Re: Inline Extension