Re: WIP: index support for regexp search - Mailing list pgsql-hackers

From Tom Lane
Subject Re: WIP: index support for regexp search
Date
Msg-id 5280.1364161844@sss.pgh.pa.us
Whole thread Raw
In response to Re: WIP: index support for regexp search  (Alexander Korotkov <aekorotkov@gmail.com>)
Responses Re: WIP: index support for regexp search  (Alexander Korotkov <aekorotkov@gmail.com>)
List pgsql-hackers
Alexander Korotkov <aekorotkov@gmail.com> writes:
> Now I have working implemetation of this API. Comments still need rework.
> Could you give me any feedback?

I looked at this a little bit, but it's not very far along at all
towards resolving my API worries.  The basic point that I'm concerned
about is that we would like to split off the regex library (ie,
backend/regex/) as a standalone project someday.  There are already
some problems to be resolved to make that possible, and I don't want
this patch to introduce new ones.  From that standpoint, the proposed
regextract.c file is a disaster, because it depends on a boatload of
Postgres-specific stuff (at least StringInfo, List, HTAB, and pg_wchar;
to say nothing of palloc).  We can't consider that to be on the regex
side of the fence.

Similarly, pushing PG-specific declarations like RE_compile_and_cache()
into regex/regex.h is completely not the right thing for preserving a
clear library boundary (even positing that we want to expose that
function outside adt/regexp.c, which I'd rather we didn't).

Given that you've already got a notion of callbacks provided by
contrib/pg_trgm, perhaps this can be fixed by pushing more of the work
into those callbacks, so that the heavy-duty data structures like the
hash table live over there and the API exposed by backend/regex/ is at
a much simpler level than what you have here.  But right now I don't
see any usable library API here.

Perhaps we could avoid these issues by defining a library API that
provides accessors like these for the opaque regex_t struct:

* get the number of states in the CNFA

* get the numbers of the initial and final states

* get the number of out-arcs for the N'th state

* get the out-arcs for the N'th state into a caller-provided array
(sized using the previous function), where each out-arc is represented
by a color and an end-state

* get the number of character codes represented by color C

* get the wchar codes for color C into a caller-provided array

(The reason for letting the caller allocate the result arrays is so we
can use palloc for that; if we allocate it in backend/regex/ we must
use malloc, which will greatly increase the risk of leakages.  Also,
as far as the color API goes, the above lets the caller decide how
many characters is "too many" to bother with.)
        regards, tom lane



pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Limiting setting of hint bits by read-only queries; vacuum_delay
Next
From: Michael Paquier
Date:
Subject: Re: Interesting post-mortem on a near disaster with git