Re: WIP: index support for regexp search - Mailing list pgsql-hackers

From Tom Lane
Subject Re: WIP: index support for regexp search
Date
Msg-id 6275.1364164727@sss.pgh.pa.us
Whole thread Raw
In response to Re: WIP: index support for regexp search  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: WIP: index support for regexp search  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Similarly, pushing PG-specific declarations like RE_compile_and_cache()
>> into regex/regex.h is completely not the right thing for preserving a
>> clear library boundary (even positing that we want to expose that
>> function outside adt/regexp.c, which I'd rather we didn't).
>> 
>> Perhaps we could avoid these issues by defining a library API that
>> provides accessors like these for the opaque regex_t struct:
>> 
>> * get the number of states in the CNFA
>> 
>> * get the numbers of the initial and final states
>> 
>> * get the number of out-arcs for the N'th state
>> 
>> * get the out-arcs for the N'th state into a caller-provided array
>> (sized using the previous function), where each out-arc is represented
>> by a color and an end-state
>> 
>> * get the number of character codes represented by color C
>> 
>> * get the wchar codes for color C into a caller-provided array
>> 
>> (The reason for letting the caller allocate the result arrays is so we
>> can use palloc for that; if we allocate it in backend/regex/ we must
>> use malloc, which will greatly increase the risk of leakages.  Also,
>> as far as the color API goes, the above lets the caller decide how
>> many characters is "too many" to bother with.)

> I like the this idea. Seems like clear and not over-engineered API. I can
> implement it. Could you propose something particular to do with
> RE_compile_and_cache in this patch? With this API we still need a way to
> get regex_t or equivalent from string.

Well, the brute force answer is for pg_trgm to not go through
RE_compile_and_cache at all, but just call the regex library for itself.
That would lose the ability to cache regex compilations, but I'm not
sure we care.  The motivation for RE_compile_and_cache is mainly to
prevent having to compile the regex again for *every row*, which is what
would otherwise happen in a simple SELECT using a regex function.
Unless I'm misunderstanding something, pg_trgm would only need to
compile the regex once per indexscan, which is probably tolerable.

Another approach we could take is for adt/regexp.c to expose a function
that collects the deconstructed CNFA data into a palloc'd structure
using the above-proposed functions.  This'd be a reasonable way to go if
we decide the caching behavior is actually needed for this purpose; but
otherwise it seems like it's just involving one more module than
necessary.

BTW, one reason I don't like exposing RE_compile_and_cache directly is
that then you have the question of how long is the regex_t it hands back
good for.  The caller doesn't own that struct, and at some point
regexp.c is likely to recycle it, so it's a bit risky to just assume
it'll stay around "long enough" without any further interaction with
regexp.c.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Alexander Korotkov
Date:
Subject: Re: WIP: index support for regexp search
Next
From: Nicholas White
Date:
Subject: Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls