Re: WIP: index support for regexp search - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: WIP: index support for regexp search
Date	January 19, 2012 16:30:47
Msg-id	4F187D5C.30701@enterprisedb.com
In response to	WIP: index support for regexp search (Alexander Korotkov <aekorotkov@gmail.com>)
List	pgsql-hackers

On 22.11.2011 21:38, Alexander Korotkov wrote:
> A WIP patch adding index support for regexp search to the pg_trgm contrib
> module is attached.
> Unlike techniques that extract contiguous text fragments from the regexp,
> this patch uses an automaton transformation, which allows more
> comprehensive trigram extraction.

Nice!
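
For context, the simpler approach the patch is contrasted with (pulling contiguous literal text out of the regexp and requiring its trigrams) can be sketched roughly as below. This is only an illustration in Python, not the algorithm in trgm_regexp.c. The function names are hypothetical, only literals, "." and ".*" are handled, and pg_trgm's padding and case folding are ignored. Any other regexp syntax yields no trigrams, which means no filtering at all.

```python
import re


def trigrams(text):
    """Return the set of 3-character substrings of text (no padding)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def required_trigrams(pattern):
    """Trigrams that every match of pattern must contain.

    Assumption for this sketch: only literal characters, '.' and '.*'
    are understood. Anything else returns an empty set, i.e. the index
    cannot narrow the search and every row must be rechecked.
    """
    # Bail out on syntax this sketch does not interpret.
    if re.search(r"[\\\[\](){}|+?^$]", pattern):
        return set()
    # '.' and '.*' break the pattern into contiguous literal runs.
    runs = re.split(r"\.\*|\.", pattern)
    # A '*' left inside a run quantifies a literal character: unsupported.
    if any("*" in run for run in runs):
        return set()
    needed = set()
    for run in runs:
        needed |= trigrams(run)
    return needed


def search(pattern, rows):
    """Filter rows by required trigrams, then recheck with the real regexp."""
    needed = required_trigrams(pattern)
    candidates = [row for row in rows if needed <= trigrams(row)]
    return [row for row in candidates if re.search(pattern, row)]


rows = ["foobar", "foo and bar", "barfoo", "food"]
print(search("foo.*bar", rows))  # ['foobar', 'foo and bar']
```

Note that "barfoo" passes the trigram filter but is dropped by the recheck, and that a pattern such as "a|b" produces no trigrams at all. Those are the kinds of cases where extraction from literal fragments alone is weak.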

> The current version of the patch has some limitations:
> 1) The algorithm that extracts a logical expression over trigrams has high
> computational complexity, so it can become really slow on regexps with many
> branches. Improvements to this algorithm are probably possible.
> 2) Naturally, there is no performance benefit if no trigrams can be
> extracted from the regexp. That is unavoidable.
> 3) Currently, only GIN indexes are supported. There are no serious
> obstacles; the GiST code for it just has not been written yet.
> 4) There appears to be a problem with extracting multibyte-encoded
> characters from pg_wchar. I've posted a question about it here:
> http://archives.postgresql.org/pgsql-hackers/2011-11/msg01222.php
> For now I've hardcoded a dirty workaround, so
> PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not
> supported yet.

This is pretty far from being in committable state, so I'm going to mark 
this as "returned with feedback" in the commitfest app. The feedback:

The code badly needs comments. There is no explanation of how the 
trigram extraction code in trgm_regexp.c works. Guessing from the 
variable names, it seems to be some sort of a coloring algorithm that 
works on a graph, but that all needs to be explained. Can this algorithm 
be found somewhere in the literature, perhaps? A link to a paper would be nice.

Apart from that, the multibyte issue seems like the big one. Any way 
around that?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com
