Re: WIP: index support for regexp search - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: WIP: index support for regexp search
Date
Msg-id 50BC795C.5000509@vmware.com
Whole thread Raw
In response to Re: WIP: index support for regexp search  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: WIP: index support for regexp search
List pgsql-hackers
On 02.12.2012 20:19, Tom Lane wrote:
> Alexander Korotkov<aekorotkov@gmail.com>  writes:
>> Nice idea to delay expanding colors to characters! Obviously, we should
>> delay expanding inly alphanumerical characters. Because non-alphanumberical
>> characters influence graph structure. Trying to implement...
>
> Uh, why would that be?  Colors are colors.  The regexp machinery doesn't
> care whether they represent alphanumerics or not.  (Or to be more
> precise, if there is a situation where it makes a difference, separate
> colors will have been created for each set of characters that need to be
> distinguished.)

The regexp machinery doesn't care, but the trigrams that pg_trgm 
extracts only contain alphanumeric characters. So if by looking at the 
CNFA graph produced by the regexp machinery you conclude that any 
matching strings must contain three-letter sequences "%oo" and "#oo", 
you can just club them together into " oo" trigram.

I think you can run a pre-processing step to the colors, and merge 
colors that are equivalent as far as trigrams are considered. For 
example, if you have a color that contains only character '%', and 
another that contains character '#', you can treat them as the same 
hcolor. You might then be able to simplify the CNFA. Actually, it would 
be even better if you could apply the pre-processing to the regexp 
before the regexp machinery turns it into a CNFA. Not sure how easy it 
would be to do such pre-processing.

BTW, why create the path matrix? You could check the "check" array of 
trigrams in the consistent function directly against the graph. 
Consistent should return true, iff there is a path through the graph 
following only arcs that contain trigrams present in the check array. 
Finding a path through a complex graph could be expensive, O(|E|), but 
if the path is complex, the path matrix would be large as well, and 
checking against a large matrix isn't exactly free either. It would 
allow you to avoid "overflows" caused by having too many paths through 
the graph.

- Heikki



pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Enabling Checksums
Next
From: Dimitri Fontaine
Date:
Subject: Re: ALTER command reworks