Re: WIP: index support for regexp search - Mailing list pgsql-hackers

From Alexander Korotkov
Subject Re: WIP: index support for regexp search
Date
Msg-id CAPpHfdsRdG1M9HU1=71WM1eXYHGDXSooyuAKPXYZQmrfmnKcGA@mail.gmail.com
Whole thread Raw
In response to Re: WIP: index support for regexp search  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: WIP: index support for regexp search  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <aekorotkov@gmail.com> writes:
> Now I have working implemetation of this API. Comments still need rework.
> Could you give me any feedback?

I looked at this a little bit, but it's not very far along at all
towards resolving my API worries.  The basic point that I'm concerned
about is that we would like to split off the regex library (ie,
backend/regex/) as a standalone project someday.  There are already
some problems to be resolved to make that possible, and I don't want
this patch to introduce new ones.  From that standpoint, the proposed
regextract.c file is a disaster, because it depends on a boatload of
Postgres-specific stuff (at least StringInfo, List, HTAB, and pg_wchar;
to say nothing of palloc).  We can't consider that to be on the regex
side of the fence.
 
Now I can see your position about API much more clearly. Previously I thought only about algorithmic side of things.

Similarly, pushing PG-specific declarations like RE_compile_and_cache()
into regex/regex.h is completely not the right thing for preserving a
clear library boundary (even positing that we want to expose that
function outside adt/regexp.c, which I'd rather we didn't).

Given that you've already got a notion of callbacks provided by
contrib/pg_trgm, perhaps this can be fixed by pushing more of the work
into those callbacks, so that the heavy-duty data structures like the
hash table live over there and the API exposed by backend/regex/ is at
a much simpler level than what you have here.  But right now I don't
see any usable library API here.

Perhaps we could avoid these issues by defining a library API that
provides accessors like these for the opaque regex_t struct:

* get the number of states in the CNFA

* get the numbers of the initial and final states

* get the number of out-arcs for the N'th state

* get the out-arcs for the N'th state into a caller-provided array
(sized using the previous function), where each out-arc is represented
by a color and an end-state

* get the number of character codes represented by color C

* get the wchar codes for color C into a caller-provided array

(The reason for letting the caller allocate the result arrays is so we
can use palloc for that; if we allocate it in backend/regex/ we must
use malloc, which will greatly increase the risk of leakages.  Also,
as far as the color API goes, the above lets the caller decide how
many characters is "too many" to bother with.)

I like the this idea. Seems like clear and not over-engineered API. I can implement it. Could you propose something particular to do with RE_compile_and_cache in this patch? With this API we still need a way to get regex_t or equivalent from string.

------
With best regards,
Alexander Korotkov.

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Interesting post-mortem on a near disaster with git
Next
From: Tom Lane
Date:
Subject: Re: WIP: index support for regexp search