regex_fixed_prefix() is still a few bricks shy of a load - Mailing list pgsql-hackers

From Tom Lane
Subject regex_fixed_prefix() is still a few bricks shy of a load
Msg-id 11509.1341686788@sss.pgh.pa.us
List pgsql-hackers
In
http://archives.postgresql.org/pgsql-general/2012-07/msg00107.php
it's pointed out that regex_fixed_prefix() gets the wrong answer
when presented with a regex like '^(xyz...)?...'.  It thinks this
pattern has a fixed prefix of "xyz", that is, it can only match strings
beginning with "xyz".  But because of the quantifier attached to the
parenthesized subexpression, that ain't so.  This leads to generating
incorrect indexable conditions from a regex WHERE clause.
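
To make the failure concrete, here is a minimal sketch.  It uses Python's
re module as a stand-in for the Postgres regex library, and '^(xyz)?abc'
as a hypothetical instance of the '^(xyz...)?...' shape.  The helper
in_prefix_range() is invented for illustration; it only mimics the kind
of range condition that would be derived from a supposed fixed prefix.

```python
import re

# Hypothetical instance of the problematic shape '^(xyz...)?...'.
pattern = re.compile(r"^(xyz)?abc")

def in_prefix_range(s, prefix):
    """Mimic a range condition derived from a fixed prefix:
    prefix <= s < (prefix with its last character incremented)."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix <= s < upper

candidates = ["xyzabc", "abc", "abcdef", "xyz"]

# Strings the regex really matches.
matches = [s for s in candidates if pattern.search(s)]

# Matching strings that a range scan on the bogus prefix "xyz" would skip.
wrongly_excluded = [s for s in matches if not in_prefix_range(s, "xyz")]
```

Here "abc" and "abcdef" match the regex but fall outside the range built
from "xyz", so an index scan using that range would silently lose rows.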

I can see a few ways we might attack this:

1. Give up on index-optimizing patterns that begin with "^(".  This is
fairly undesirable because psql's "\d object-name-pattern" commands
all generate regexes that start that way; so for example "\df myfunc"
would no longer be able to use the index on pg_proc.proname.

2. Teach regex_fixed_prefix() to recognize whether the parenthesized
subexpression has a quantifier.  This would require a quantum jump in
regex_fixed_prefix's intelligence, though, because now it would have
to be capable of finding the matching right paren.  We could perhaps
restrict the set of things it knows how to skip over to find the
matching paren to things that are likely to appear in psql \d usage,
but it still seems messy and fragile.
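
A rough sketch of what option 2 involves, written in Python rather than C
and not taken from regex_fixed_prefix() itself.  Both function names are
hypothetical.  The scanner only knows about backslash escapes, nested
parens, and simple bracket expressions; it does not understand POSIX
classes such as [[:alpha:]], which is a fair sample of the fragility
mentioned above.

```python
def find_matching_paren(pattern, start):
    """Return the index of the ')' matching the '(' at pattern[start],
    or -1 if there is none."""
    depth = 0
    i = start
    n = len(pattern)
    while i < n:
        c = pattern[i]
        if c == "\\":
            i += 2                      # skip the escaped character
            continue
        if c == "[":
            # Skip a bracket expression.  A ']' that comes first
            # (optionally after '^') is a literal, not the terminator.
            i += 1
            if i < n and pattern[i] == "^":
                i += 1
            if i < n and pattern[i] == "]":
                i += 1
            while i < n and pattern[i] != "]":
                i += 1
            i += 1                      # step past the closing ']'
            continue
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    return -1

def leading_group_is_quantified(pattern):
    """True if the pattern starts with '^(' and the group is followed by
    a quantifier, so its contents cannot be trusted as a fixed prefix."""
    if not pattern.startswith("^("):
        return False
    close = find_matching_paren(pattern, 1)
    if close < 0:
        return True                     # unbalanced: be conservative
    return pattern[close + 1:close + 2] in ("*", "+", "?", "{")
```

With this, '^(xyz)?abc' is flagged as quantified while a pattern of the
shape '^(myfunc)$' is not.  Treating '+' as unsafe is a deliberately
conservative choice in this sketch.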

3. Try another approach entirely.  The idea that I've got in mind here
is to compile the regex using the regex library and then look at the
compiled NFA representation to see if there must be a fixed prefix.
I would not have risked this before this year, but after last winter's
expeditions into the darkest corners of the regex library I feel
competent to do it, and some quick study suggests that it might not take
very much code to produce something that is significantly brighter than
what we have now.  However, there are a couple of arguments against
pursuing this path:

* We would now require the regex library to expose an API that seems
rather Postgres-specific, namely "give me the fixed prefix if any for
this regex_t".  Given our ambition of eventually splitting off Spencer's
library as a standalone project, it's somewhat annoying to saddle it
with such a requirement.  On the other hand, it's a rather small tax to
pay for Postgres effectively sponsoring further development of the regex
library.

* Character values in the compiled regex are in pg_wchar representation,
so we'd need a wchar-to-server-encoding transformation.  Such code
exists in HEAD, as of a couple days ago, but it's practically untested;
so back-porting it into stable branches and relying on it to be correct
seems a tad optimistic.  Possibly in the back branches we could restrict
regex_fixed_prefix to only cope with 7-bit ASCII prefix strings?  I bet
there would be squawks from non-English-speakers, though.
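
For illustration only, the idea in option 3 can be sketched on a toy NFA.
Everything here is an assumption: the dict-of-arcs representation, the
function name fixed_prefix(), and the ascii_only flag are invented for
the example and do not reflect the regex library's compiled form, which
stores pg_wchar values.  The toy NFA is assumed to be anchored at the
start and to have no empty (epsilon) arcs.  The walk follows arcs from
the start state for as long as there is exactly one way forward; the
optional flag models the 7-bit ASCII restriction floated for the back
branches.

```python
def fixed_prefix(arcs, start, finals, ascii_only=False):
    """Collect the characters every match must begin with.

    arcs   : dict mapping state -> list of (character, target_state)
    start  : initial state
    finals : set of accepting states
    """
    prefix = []
    state = start
    seen = set()
    while state not in finals and state not in seen:
        seen.add(state)                 # guard against looping forever
        out = arcs.get(state, [])
        if len(out) != 1:
            break                       # a choice point ends the prefix
        ch, target = out[0]
        if ascii_only and ord(ch) > 127:
            break                       # stop at the first non-ASCII char
        prefix.append(ch)
        state = target
    return "".join(prefix)

# Toy NFA for '^(xyz)?abc': the start state can take 'x' or 'a'.
optional_nfa = {
    0: [("x", 1), ("a", 4)],
    1: [("y", 2)],
    2: [("z", 3)],
    3: [("a", 4)],
    4: [("b", 5)],
    5: [("c", 6)],
}

# Toy NFA for '^(xyz)abc': a single path all the way through.
plain_nfa = {
    0: [("x", 1)],
    1: [("y", 2)],
    2: [("z", 3)],
    3: [("a", 4)],
    4: [("b", 5)],
    5: [("c", 6)],
}

optional_prefix = fixed_prefix(optional_nfa, 0, {6})
plain_prefix = fixed_prefix(plain_nfa, 0, {6})
```

Because the quantified group shows up as two arcs leaving the start state,
the walk reports no fixed prefix for the first NFA, with no need to parse
the regex text at all.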

I think that solution #3 is probably the best way to go forward, but
it's not at all clear what to do for the back branches.  Thoughts?
        regards, tom lane

