Re: WIP: index support for regexp search - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: WIP: index support for regexp search |
Date | |
Msg-id | 6275.1364164727@sss.pgh.pa.us Whole thread Raw |
In response to | Re: WIP: index support for regexp search (Alexander Korotkov <aekorotkov@gmail.com>) |
Responses |
Re: WIP: index support for regexp search
|
List | pgsql-hackers |
Alexander Korotkov <aekorotkov@gmail.com> writes: > On Mon, Mar 25, 2013 at 1:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Similarly, pushing PG-specific declarations like RE_compile_and_cache() >> into regex/regex.h is completely not the right thing for preserving a >> clear library boundary (even positing that we want to expose that >> function outside adt/regexp.c, which I'd rather we didn't). >> >> Perhaps we could avoid these issues by defining a library API that >> provides accessors like these for the opaque regex_t struct: >> >> * get the number of states in the CNFA >> >> * get the numbers of the initial and final states >> >> * get the number of out-arcs for the N'th state >> >> * get the out-arcs for the N'th state into a caller-provided array >> (sized using the previous function), where each out-arc is represented >> by a color and an end-state >> >> * get the number of character codes represented by color C >> >> * get the wchar codes for color C into a caller-provided array >> >> (The reason for letting the caller allocate the result arrays is so we >> can use palloc for that; if we allocate it in backend/regex/ we must >> use malloc, which will greatly increase the risk of leakages. Also, >> as far as the color API goes, the above lets the caller decide how >> many characters is "too many" to bother with.) > I like the this idea. Seems like clear and not over-engineered API. I can > implement it. Could you propose something particular to do with > RE_compile_and_cache in this patch? With this API we still need a way to > get regex_t or equivalent from string. Well, the brute force answer is for pg_trgm to not go through RE_compile_and_cache at all, but just call the regex library for itself. That would lose the ability to cache regex compilations, but I'm not sure we care. The motivation for RE_compile_and_cache is mainly to prevent having to compile the regex again for *every row*, which is what would otherwise happen in a simple SELECT using a regex function. Unless I'm misunderstanding something, pg_trgm would only need to compile the regex once per indexscan, which is probably tolerable. Another approach we could take is for adt/regexp.c to expose a function that collects the deconstructed CNFA data into a palloc'd structure using the above-proposed functions. This'd be a reasonable way to go if we decide the caching behavior is actually needed for this purpose; but otherwise it seems like it's just involving one more module than necessary. BTW, one reason I don't like exposing RE_compile_and_cache directly is that then you have the question of how long is the regex_t it hands back good for. The caller doesn't own that struct, and at some point regexp.c is likely to recycle it, so it's a bit risky to just assume it'll stay around "long enough" without any further interaction with regexp.c. regards, tom lane
pgsql-hackers by date: