Re: WIP: index support for regexp search - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: WIP: index support for regexp search
Date
Msg-id CAPpHfdtTxBiwGX5CuwQn2NKQJTYLsCghQDAcUR8D-BF=_hoSMw@mail.gmail.com
Whole thread Raw
In response to Re: WIP: index support for regexp search  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: WIP: index support for regexp search  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Thu, Mar 14, 2013 at 9:40 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 23.01.2013 09:36, Alexander Korotkov wrote:
>> On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>>> The biggest problem is that I really don't care for the idea of
>>> contrib/pg_trgm being this cozy with the innards of regex_t.

>> The only option I see now is to provide a method like "export_cnfa" which
>> would export corresponding CNFA in fixed format.

> Yeah, I think that makes sense. The transformation code in trgm_regexp.c
> would probably be more readable too, if it didn't have to deal with the
> regex guts representation of the CNFA. Also, once you have intermediate
> representation of the original CNFA, you could do some of the
> transformation work on that representation, before building the
> "tranformed graph" containing trigrams. You could eliminate any
> non-alphanumeric characters, joining states connected by arcs with
> non-alphanumeric characters, for example.

It's not just the CNFA though; the other big API problem is with mapping
colors back to characters.  Right now, that not only knows way too much
about a part of the regex internals we have ambitions to change soon,
but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
which should be known to the regex library IMO.  So I'm not sure how we
divvy that up sanely.  To be clear: I'm not going to insist that we have
to have a clean API factorization before we commit this at all.  But it
worries me if we don't even know how we could get to that, because we
are going to need it eventually.

Now I have following idea about API.
Put code of stage 2 (transform the original CNFA into an automaton-like graph) into regex engine. It would use API which describes what exactly are we going to extract from CNFA. This API could look like this.

typedef char *Prefix;
typedef char *ArcLabel;

typedef struct
{
Prefix newPrefix;
ArcLabel label;
} ArcInfo;

typedef struct
{
Prefix (*getInitialPrefix) ();
bool (*prefixContains) (Prefix prefix1, Prefix prefix2);
Prefix * (*getPrefixes) (Prefix prefix, color c, int *n);
ArcInfo * (*getArcs) (Prefix prefix, color c, int *n);
void (*freePrefix) (Prefix prefix);
void (*freeArcLabel) (ArcLabel arcLabel);
} CFNATransformAPI;

getInitialPrefix returns initial prefix value like now this code does:
> initkey.prefix.colors[0] = UNKNOWN_COLOR;
> initkey.prefix.colors[1] = UNKNOWN_COLOR;
prefixContains are exactly same as function with this name.
getPrefixes and getArcs cycle step work of addKeys an addArcs.
freePrefix and freeArcLabel frees used memory of Prefix and ArcLabel strutures.

Additionally regex engine should provide correct way to examine colormap.
int getColorCharsCount(colormap *cm, color c);
pg_wchar *getColorChars(colormap *cm, color c);
getColorCharsCount would return -1 if this color should be considered as unexpandable.

Now I have working implemetation of this API. Comments still need rework. Could you give me any feedback?

------
With best regards,
Alexander Korotkov.
Attachment

pgsql-hackers by date:

Previous
From: Euler Taveira
Date:
Subject: Re: SIGHUP not received by custom bgworkers if postmaster is notified
Next
From: Alvaro Herrera
Date:
Subject: Re: SIGHUP not received by custom bgworkers if postmaster is notified