regex_fixed_prefix() is still a few bricks shy of a load - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | regex_fixed_prefix() is still a few bricks shy of a load |
Date | |
Msg-id | 11509.1341686788@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: regex_fixed_prefix() is still a few bricks shy of a load
|
List | pgsql-hackers |
In http://archives.postgresql.org/pgsql-general/2012-07/msg00107.php it's pointed out that regex_fixed_prefix() gets the wrong answer when presented with a regex like '^(xyz...)?...'. It thinks this pattern has a fixed prefix of "xyz", that is can only match strings beginning with "xyz". But because of the quantifier attached to the parenthesized subexpression, that ain't so. This leads to generating incorrect indexable conditions from a regex WHERE clause. I can see a few ways we might attack this: 1. Give up on index-optimizing patterns that begin with "^(". This is fairly undesirable because psql's "\d object-name-pattern" commands all generate regexes that start that way; so for example "\df myfunc" would no longer be able to use the index on pg_proc.proname. 2. Teach regex_fixed_prefix() to recognize whether the parenthesized subexpression has a quantifier. This would require a quantum jump in regex_fixed_prefix's intelligence, though, because now it would have to be capable of finding the matching right paren. We could perhaps restrict the set of things it knows how to skip over to find the matching paren to things that are likely to appear in psql \d usage, but it still seems messy and fragile. 3. Try another approach entirely. The idea that I've got in mind here is to compile the regex using the regex library and then look at the compiled NFA representation to see if there must be a fixed prefix. I would not have risked this before this year, but after last winter's expeditions into the darkest corners of the regex library I feel competent to do it, and some quick study suggests that it might not take very much code to produce something that is significantly brighter than what we have now. However, there are a couple of arguments against pursuing this path: * We would now require the regex library to expose an API that seems rather Postgres-specific, namely "give me the fixed prefix if any for this regex_t". Given our ambition of eventually splitting off Spencer's library as a standalone project, it's somewhat annoying to saddle it with such a requirement. On the other hand, it's a rather small tax to pay for Postgres effectively sponsoring further development of the regex library. * Character values in the compiled regex are in pg_wchar representation, so we'd need a wchar-to-server-encoding transformation. Such code exists in HEAD, as of a couple days ago, but it's practically untested; so back-porting it into stable branches and relying on it to be correct seems a tad optimistic. Possibly in the back branches we could restrict regex_fixed_prefix to only cope with 7-bit ASCII prefix strings? I bet there would be squawks from non-English-speakers, though. I think that solution #3 is probably the best way to go forward, but it's not at all clear what to do for the back branches. Thoughts? regards, tom lane
pgsql-hackers by date: