Thread: text_pattern_ops and complex regexps

text_pattern_ops and complex regexps

From
Stephen Frost
Date:
Greetings,
 I've run into an annoying issue which I would think could be handled better.  Basically, indexes using
text_pattern_opsdon't work with some complex regexps even when they (imv anyway) could.  I'm willing to believe I'm
wrongabout the potential to use them, or that my regexp is wrong, but I don't see it. 
 Test case:
 create table text_test (name text); insert into text_test values ('North'); insert into text_test values ('North
West');create index text_test_name_idx on text_test using btree (name text_pattern_ops); set enable_seqscan = false; --
justto show the test -- works fine explain analyze select * from text_test where name ~ '^(North)'; -- works fine
explainanalyze select * from text_test where name ~ '^(North)( West)'; -- doesn't work explain analyze select * from
text_testwhere name ~ '^(North)(| West)'; 
 Results:

CREATE TABLE
INSERT 0 1
INSERT 0 1
CREATE INDEX
SET                                                         QUERY PLAN
         

-------------------------------------------------------------------------------------------------------------------------------Index
Scanusing text_test_name_idx on text_test  (cost=0.00..8.27 rows=1 width=32) (actual time=0.071..0.077 rows=2 loops=1)
IndexCond: ((name ~>=~ 'North'::text) AND (name ~<~ 'Norti'::text))  Filter: (name ~ '^(North)'::text)Total runtime:
0.121ms 
(4 rows)
                                                         QUERY PLAN
      

-------------------------------------------------------------------------------------------------------------------------------Index
Scanusing text_test_name_idx on text_test  (cost=0.00..8.27 rows=1 width=32) (actual time=0.176..0.178 rows=1 loops=1)
IndexCond: ((name ~>=~ 'North'::text) AND (name ~<~ 'Norti'::text))  Filter: (name ~ '^(North)( West)'::text)Total
runtime:0.209 ms 
(4 rows)
                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------Seq
Scanon text_test  (cost=100000000.00..100000001.03 rows=1 width=32) (actual time=0.013..0.019 rows=2 loops=1)  Filter:
(name~ '^(North)(| West)'::text)Total runtime: 0.045 ms 
(3 rows)
 I don't see why the last case can't use the index.  Obviously, for this example case, doing a Seq Scan is fine but
withthe real data set there are cases where an index could help. 
 Any help would be greatly appreciated.
     Thanks,
    Stephen

Re: text_pattern_ops and complex regexps

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
>   I don't see why the last case can't use the index.

The planner's understanding of regexps is far weaker than yours.

(In particular, I think it's set up to abandon optimization if it
sees | anywhere.)
        regards, tom lane


Re: text_pattern_ops and complex regexps

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> >   I don't see why the last case can't use the index.
>
> The planner's understanding of regexps is far weaker than yours.
>
> (In particular, I think it's set up to abandon optimization if it
> sees | anywhere.)

That's kind of what I figured from the empirical data.  My hope was that
it might be something which could be fixed.  Is this entirely the
planner's doing (eg: PG code)?  Perhaps this is misguided but I would
think that the regexp libraries might have some support for "give me all
anchored required text for this regexp" which we could then use in the
planner.  Certainly in an ideal world we wouldn't have to teach the
planner the knowledge that the regexp libraries include for this.
Thoughts?
    Thanks,
        Stephen

Re: text_pattern_ops and complex regexps

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> (In particular, I think it's set up to abandon optimization if it
>> sees | anywhere.)

> That's kind of what I figured from the empirical data.  My hope was that
> it might be something which could be fixed.

See regex_fixed_prefix(), but it's a pretty hard problem without writing
a complete regex parser.

> Perhaps this is misguided but I would
> think that the regexp libraries might have some support for "give me all
> anchored required text for this regexp" which we could then use in the
> planner.

I wouldn't see why.  It's certainly worth considering to hand the
pattern to the regex engine and then burrow into the data structure it
builds; but right now we consider that structure to be entirely private
to backend/regex/.  There's also the problem that we'd have no easy
way to determine how much the result depends on the current regex flavor
setting.  There are some cases now where regex_fixed_prefix deliberately
omits possible optimizations because of uncertainty about the flavor.
        regards, tom lane


Re: text_pattern_ops and complex regexps

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:

> > Perhaps this is misguided but I would think that the regexp
> > libraries might have some support for "give me all anchored required
> > text for this regexp" which we could then use in the planner.
> 
> I wouldn't see why.  It's certainly worth considering to hand the
> pattern to the regex engine and then burrow into the data structure it
> builds; but right now we consider that structure to be entirely private
> to backend/regex/.  There's also the problem that we'd have no easy
> way to determine how much the result depends on the current regex flavor
> setting.  There are some cases now where regex_fixed_prefix deliberately
> omits possible optimizations because of uncertainty about the flavor.

I think changeable regex flavors turned out to be a bad idea.  They can
wreak all sorts of havoc.  You change the setting, SIGHUP, and suddenly
your application fails to work as expected.  Maybe we should make that
setting PGC_POSTMASTER (or just get rid of it?), and provide was to pass
flags to change the flavor for particular operations (this is easy for
function-based stuff but not so easy for operators).  That way it
doesn't intrude in stuff like cached plans and so on.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: text_pattern_ops and complex regexps

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I think changeable regex flavors turned out to be a bad idea.  They can
> wreak all sorts of havoc.  You change the setting, SIGHUP, and suddenly
> your application fails to work as expected.  Maybe we should make that
> setting PGC_POSTMASTER (or just get rid of it?), and provide was to pass
> flags to change the flavor for particular operations (this is easy for
> function-based stuff but not so easy for operators).  That way it
> doesn't intrude in stuff like cached plans and so on.

Maybe so.  I think it was originally intended mostly as a
backwards-compatibility measure when we added the support for ARE
flavor.  It's pretty likely that no one changes the flavor setting
in practice anymore.  If we just locked it down as "advanced always"
then we could simplify the documentation by a measurable amount ...
        regards, tom lane


Re: text_pattern_ops and complex regexps

From
David Fetter
Date:
On Wed, May 06, 2009 at 12:10:49PM -0400, Alvaro Herrera wrote:
> Tom Lane wrote:
> > Stephen Frost <sfrost@snowman.net> writes:
> 
> > > Perhaps this is misguided but I would think that the regexp
> > > libraries might have some support for "give me all anchored
> > > required text for this regexp" which we could then use in the
> > > planner.
> > 
> > I wouldn't see why.  It's certainly worth considering to hand the
> > pattern to the regex engine and then burrow into the data
> > structure it builds; but right now we consider that structure to
> > be entirely private to backend/regex/.  There's also the problem
> > that we'd have no easy way to determine how much the result
> > depends on the current regex flavor setting.  There are some cases
> > now where regex_fixed_prefix deliberately omits possible
> > optimizations because of uncertainty about the flavor.
> 
> I think changeable regex flavors turned out to be a bad idea.

+1

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: text_pattern_ops and complex regexps

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> I think changeable regex flavors turned out to be a bad idea.  They can
>> wreak all sorts of havoc.  You change the setting, SIGHUP, and suddenly
>> your application fails to work as expected.  Maybe we should make that
>> setting PGC_POSTMASTER (or just get rid of it?), and provide was to pass
>> flags to change the flavor for particular operations (this is easy for
>> function-based stuff but not so easy for operators).  That way it
>> doesn't intrude in stuff like cached plans and so on.
> 
> Maybe so.  I think it was originally intended mostly as a
> backwards-compatibility measure when we added the support for ARE
> flavor.  It's pretty likely that no one changes the flavor setting
> in practice anymore.  If we just locked it down as "advanced always"
> then we could simplify the documentation by a measurable amount ...

yeah I don't recall a single incident in the last few years that 
required playing with the regex flavours....



Stefan


Re: text_pattern_ops and complex regexps

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> ... Maybe we should make that
> setting PGC_POSTMASTER (or just get rid of it?), and provide was to pass
> flags to change the flavor for particular operations (this is easy for
> function-based stuff but not so easy for operators).

BTW, if you are putting it on the application to use some other syntax
to get at the old flavors, then there already is an adequate feature
built into the regex library: a pattern beginning with (?b) or (?e)
will be taken as a BRE or ERE respectively, cf. table 9-19 in current
docs.  So I don't see any value in inventing something additional.
The only reason for regex_flavor to exist is to satisfy applications
that were written to expect the pre-7.4 regex syntax to work as-is.
If we think there aren't any of those anymore, let's just kill the
GUC and be done with it.
        regards, tom lane


Re: text_pattern_ops and complex regexps

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> If we think there aren't any of those anymore, let's just kill the
> GUC and be done with it.

+1.

I'll try to spend some time in backend/regexp and regex_fixed_prefix
soon.
Thanks,
    Stephen

Re: text_pattern_ops and complex regexps

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>   
>> I think changeable regex flavors turned out to be a bad idea.  They can
>> wreak all sorts of havoc.  You change the setting, SIGHUP, and suddenly
>> your application fails to work as expected.  Maybe we should make that
>> setting PGC_POSTMASTER (or just get rid of it?), and provide was to pass
>> flags to change the flavor for particular operations (this is easy for
>> function-based stuff but not so easy for operators).  That way it
>> doesn't intrude in stuff like cached plans and so on.
>>     
>
> Maybe so.  I think it was originally intended mostly as a
> backwards-compatibility measure when we added the support for ARE
> flavor.  It's pretty likely that no one changes the flavor setting
> in practice anymore.  If we just locked it down as "advanced always"
> then we could simplify the documentation by a measurable amount ...
>
>             
>   

I know of at least one significant client (OpenACS) that still 
apparently requires extended flavor. Removing the compatibility option 
would be a major pain point for some of my clients. PGC_POSTMASTER would 
be fine, though.

cheers

andrew


Re: text_pattern_ops and complex regexps

From
"Joshua D. Drake"
Date:
On Wed, 2009-05-06 at 15:55 -0400, Andrew Dunstan wrote:

> I know of at least one significant client (OpenACS) that still 
> apparently requires extended flavor. Removing the compatibility option 
> would be a major pain point for some of my clients. PGC_POSTMASTER would 
> be fine, though.

Isn't that why we wouldn't remove it from back releases? 

Joshua D. Drake

> 
> cheers
> 
> andrew
> 
-- 
PostgreSQL - XMPP: jdrake@jabber.postgresql.org  Consulting, Development, Support, Training  503-667-4564 -
http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
 



Re: text_pattern_ops and complex regexps

From
Andrew Dunstan
Date:

Joshua D. Drake wrote:
> On Wed, 2009-05-06 at 15:55 -0400, Andrew Dunstan wrote:
>
>   
>> I know of at least one significant client (OpenACS) that still 
>> apparently requires extended flavor. Removing the compatibility option 
>> would be a major pain point for some of my clients. PGC_POSTMASTER would 
>> be fine, though.
>>     
>
> Isn't that why we wouldn't remove it from back releases? 
>
>   

My clients aren't going to be very happy if they can't upgrade because 
of this.

cheers

andrew


Re: text_pattern_ops and complex regexps

From
"Joshua D. Drake"
Date:
On Wed, 2009-05-06 at 16:10 -0400, Andrew Dunstan wrote:

> > Isn't that why we wouldn't remove it from back releases? 
> >
> >   
> 
> My clients aren't going to be very happy if they can't upgrade because 
> of this.

Certainly. Nobody wants to make clients unhappy but for the good of the
code man, for the good of the code :). 8.3 will be supported for a very
long time to come.

Sincerely,

Joshua D. Drake

-- 
PostgreSQL - XMPP: jdrake@jabber.postgresql.org  Consulting, Development, Support, Training  503-667-4564 -
http://www.commandprompt.com/ The PostgreSQL Company, serving since 1997
 



Re: text_pattern_ops and complex regexps

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> I know of at least one significant client (OpenACS) that still 
> apparently requires extended flavor.

Is this demonstrable, or just speculation?  The incompatibilities
between ARE mode and (legal) ERE patterns are pretty darn small.
        regards, tom lane


Re: text_pattern_ops and complex regexps

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> ...  Maybe we should make that
> setting PGC_POSTMASTER (or just get rid of it?),

Another thought here: if we do get persuaded that the regex_flavor GUC
has to stay, we could eliminate it as a hazard for planning by changing
its scope to PGC_BACKEND.  That would be much less restrictive than
PGC_POSTMASTER; for instance it'd still work to set it for a particular
application via ALTER ROLE.
        regards, tom lane


Re: text_pattern_ops and complex regexps

From
Andrew Dunstan
Date:

Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> I know of at least one significant client (OpenACS) that still 
>> apparently requires extended flavor.
>>     
>
> Is this demonstrable, or just speculation?  The incompatibilities
> between ARE mode and (legal) ERE patterns are pretty darn small.
>
>             
>   

It's explicitly documented. Whether or not there is a good basis for the 
documentation I can't yet say. If that's going to be influential I will 
dig deeper.


cheers

andrew


Re: text_pattern_ops and complex regexps

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> Is this demonstrable, or just speculation?  The incompatibilities
>> between ARE mode and (legal) ERE patterns are pretty darn small.

> It's explicitly documented. Whether or not there is a good basis for the 
> documentation I can't yet say. If that's going to be influential I will 
> dig deeper.

Our fine manual asserts that the only such incompatibility is that
inside square brackets (character alternative lists), ERE mode takes
backslash as an ordinary character while ARE mode thinks it begins an
escape.  Given the additional capabilities afforded by the latter
interpretation, and that every other modern regex engine on the planet
takes the latter approach *without* giving you any option, it doesn't
seem too unreasonable to ask OpenACS to join the twenty-first century
...
        regards, tom lane