Thread: POSIX regex performance bug in 7.3 Vs. 7.2

POSIX regex performance bug in 7.3 Vs. 7.2

From

wade

Date:

31 January 2003, 19:50:00

Hello, We recently upgraded a project from 7.2 to 7.3.1 to make use of some of
the cool new features in 7.3.  The installed version is CVS stable from
yesterday.  However, we noticed a major performance hit in POSIX regular
expression matches against columns using the ~* operator.  
http://arch.wavefire.com/badregex73.txt has explain analyze output from 7.2
and 7.3.1 respectively.  Both of these tables have only 101 rows.  The
7.3.1 install is using the default settings from postgresql.conf.
 Any ideas as to why this should be happening?
 Should anyone require additional information, please let me know.

-Wade Klaver

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

31 January 2003, 22:52:51

wade <wade@wavefire.com> writes:
>   We recently upgraded a project from 7.2 to 7.3.1 to make use of some of
> the cool new features in 7.3.  The installed version is CVS stable from
> yesterday.  However, we noticed a major performance hit in POSIX regular
> expression matches against columns using the ~* operator.  

The only thought that comes to mind is that multibyte character sets are
supported in 7.3 whereas they were optional (and not default) in 7.2.
I'd not have expected a factor-of-150 performance hit from that, though.

Could you rebuild your backend with profiling enabled and get a gprof
summary of where the time is going?
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Christopher Kings-Lynne

Date:

01 February 2003, 07:31:17

Why on earth are you using a CVS version!?!?!?!

Chris

On Fri, 31 Jan 2003, wade wrote:

> Hello,
>   We recently upgraded a project from 7.2 to 7.3.1 to make use of some of
> the cool new features in 7.3.  The installed version is CVS stable from
> yesterday.  However, we noticed a major performance hit in POSIX regular
> expression matches against columns using the ~* operator.
> http://arch.wavefire.com/badregex73.txt has explain analyze output from 7.2
> and 7.3.1 respectively.  Both of these tables have only 101 rows.  The
> 7.3.1 install is using the default settings from postgresql.conf.
>
>   Any ideas as to why this should be happening?
>
>   Should anyone require additional information, please let me know.
>
> -Wade Klaver
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html
>

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

01 February 2003, 11:41:36

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> Why on earth are you using a CVS version!?!?!?!

I assume he meant tip of REL7_3 branch --- which is a perfectly
reasonable thing to install, even if there are still a few fixes
to go before we call it 7.3.2.
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

wade

Date:

03 February 2003, 12:39:59

At 08:31 PM 2/1/03 +0800, Christopher Kings-Lynne wrote:
>Why on earth are you using a CVS version!?!?!?!
>
>Chris
>
This problem manifests itself under 7.3.1 release as well.  CVS is used so
we can access patches to the SRF stuff implemented after 7.3.1 was released.

Tom... any links that document the procedure for obtaining the profile
information you requested?  I've used a profiler briefly, but am not sure
how to go about gathering information pertainent to this problematic query
only.
-Wade Klaver.

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

wade

Date:

03 February 2003, 17:42:04

At 10:52 PM 1/31/03 -0500, Tom Lane wrote:
>wade <wade@wavefire.com> writes:
>>   We recently upgraded a project from 7.2 to 7.3.1 to make use of some of
>> the cool new features in 7.3.  The installed version is CVS stable from
>> yesterday.  However, we noticed a major performance hit in POSIX regular
>> expression matches against columns using the ~* operator.  
>
>The only thought that comes to mind is that multibyte character sets are
>supported in 7.3 whereas they were optional (and not default) in 7.2.
>I'd not have expected a factor-of-150 performance hit from that, though.
>
>Could you rebuild your backend with profiling enabled and get a gprof
>summary of where the time is going?
>
>            regards, tom lane
>
>
Here is the profile information.  I included a log of the session that
generated it at the top of the gprof  output.  If there is any other info I
can help you with, please let me know.
http://arch.wavefire.com/pgregexgmon.txt
 -Wade Klaver

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

03 February 2003, 17:51:28

wade <wade@wavefire.com> writes:
> Here is the profile information.  I included a log of the session that
> generated it at the top of the gprof  output.  If there is any other info I
> can help you with, please let me know.

A four-second test isn't long enough to gather any statistically
meaningful profile info.  On most machines, gprof samples 100 times per
second, so realistically you need a minute or two of runtime to have
trustworthy numbers.

Please replicate the rows in the table by a factor of ten or twenty or
so and try again.
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

wade

Date:

03 February 2003, 19:45:10

At 05:51 PM 2/3/03 -0500, Tom Lane wrote:
>wade <wade@wavefire.com> writes:
>> Here is the profile information.  I included a log of the session that
>> generated it at the top of the gprof  output.  If there is any other info I
>> can help you with, please let me know.
>
>A four-second test isn't long enough to gather any statistically
>meaningful profile info.  On most machines, gprof samples 100 times per
>second, so realistically you need a minute or two of runtime to have
>trustworthy numbers.
>
>Please replicate the rows in the table by a factor of ten or twenty or
>so and try again.
>
>            regards, tom lane
OK, here goes again.
I have tried a different table, this one with 27444 rows.
In this case, the query with the regex of the form "row ~* 'regex'"
runs in 1.113 seconds and the other runs in 600.
The session idled for a while after the query completed if that makes a
difference.
Queries and explain output are included at the top of the gprof output.
http://arch.wavefire.com/pgregexgmon.txt -Wade Klaver

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

03 February 2003, 20:27:18

Sigh.  It seems that somebody broke caching of compiled regexes,
so that your regex is recompiled each time it's used.  I haven't
dug into the logic yet, but I think it must have been a mistake
in Thomas' change to make the regex cache be searched circularly:

2002-06-14 22:49  thomas
* src/backend/utils/adt/regexp.c: Search the existing regularexpression cache as a ring buffer.  Will optimize the case
forrepeatedcalls for the same expression,  which seems to be the mostcommon case. Formerly, always searched    from the
firstentry.  Maywant to look at the least-recently-used algorithm to make sure it is identifying the right slots to
reclaim.Seems silly to do mathwhen  it seems that we could simply use an incrementing counter...
 

Considering that we now know that this is a factor-of-150 performance
hit, I wonder if this is a "must fix" for 7.3.2?  We already wrapped
the tarball, but ...
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

wade

Date:

03 February 2003, 20:37:03

  Well, IMHO I would rather see a delay of the roll-out by a day or two
than see a release with such a serious performance glitch.  Especially
since I personally have been shooting my big mouth off to all my geek
friends on the leaps and bounds PG has made in the last few releases.  With
my luck one of them will find it. :) I guess in the end, it comes down to the rest of you developer types, but
I would be inclined to re-wrap.  However, this is easy for me to say given
that I have no idea how much work it actually is to re-wrap. -Wade Klaver

At 08:07 PM 2/3/03 -0500, Tom Lane wrote:
>Sigh.  It seems that somebody broke caching of compiled regexes,
>so that your regex is recompiled each time it's used.  I haven't
>dug into the logic yet, but I think it must have been a mistake
>in Thomas' change to make the regex cache be searched circularly:
>
>2002-06-14 22:49  thomas
>
>    * src/backend/utils/adt/regexp.c: Search the existing regular
>    expression cache as a ring buffer.  Will optimize the case for
>    repeated calls for the same expression,  which seems to be the most
>    common case. Formerly, always searched    from the first entry.  May
>    want to look at the least-recently-used algorithm to make sure it 
>    is identifying the right slots to reclaim. Seems silly to do math
>    when  it seems that we could simply use an incrementing counter...
>
>Considering that we now know that this is a factor-of-150 performance
>hit, I wonder if this is a "must fix" for 7.3.2?  We already wrapped
>the tarball, but ...
>
>            regards, tom lane
>
>

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

03 February 2003, 22:02:41

Wade, how many distinct patterns do you have in that table?  What's the
population distribution (in particular, do the top 32 patterns account
for most of the table)?

It's looking like the issue is not so much that the 7.3 code is
completely broken, as that its LRU replacement policy for precompiled
regex patterns got rejiggered in a way that doesn't match your data.
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

03 February 2003, 23:08:24

Next question: may I guess that you weren't using MULTIBYTE in 7.2?

After still more digging, I'm coming round to the opinion that the
problem is that MULTIBYTE is forced on in 7.3, and this imposes a
factor-of-256 overhead in a bunch of the operations in regcomp.c.
In particular, compiling a case-independent regex is now hugely
more expensive than it used to be.

The parties who wanted to force MULTIBYTE on promised that there 
would be no such penalties :-(
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

wade

Date:

04 February 2003, 11:32:30

OK,  I redid my trials with the same data set on 7.2.3 --with-multibyte and I
get the same brutal performance hit, so it is definitely a
multibyte-specific problem. WRT the distribution of the data in the table, I used the following:
All g-words in /usr/share/dict with different processes attached: no process init caps. word || row_id etc...

There are only about 1000 words that appear more than once (2 or 3 times)
in 27k rows. -Wade Klaver

At 11:08 PM 2/3/03 -0500, Tom Lane wrote:
>Next question: may I guess that you weren't using MULTIBYTE in 7.2?
>
>After still more digging, I'm coming round to the opinion that the
>problem is that MULTIBYTE is forced on in 7.3, and this imposes a
>factor-of-256 overhead in a bunch of the operations in regcomp.c.
>In particular, compiling a case-independent regex is now hugely
>more expensive than it used to be.
>
>The parties who wanted to force MULTIBYTE on promised that there 
>would be no such penalties :-(
>
>            regards, tom lane
>
>

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

04 February 2003, 11:46:28

wade <wade@wavefire.com> writes:
>   I redid my trials with the same data set on 7.2.3 --with-multibyte and I
> get the same brutal performance hit, so it is definitely a
> multibyte-specific problem.
>
> There are only about 1000 words that appear more than once (2 or 3 times)
> in 27k rows.

Right, so the caching of compiled regexps that regexp.c does is of no
help, and any change in its behavior in 7.3 wouldn't have made any
difference anyway.  I leapt to a conclusion after reviewing the CVS
logs for pertinent changes, but it was the wrong conclusion.  The true
problem is that MULTIBYTE is now forced on, and that causes some
loops in the regexp compiler to change from 256 to 65536 iterations.

I believe if you change NC in src/include/regex/utils.h from its new
value of 65536 back to 256, performance will go back where it was.
This will *not* do if you run any multibyte character sets --- but
as long as the database is all ASCII or ISO-8859-whatever, it should
be a safe hack that will let you use 7.3.*.

Rather than trying to band-aid a solution like this in the main sources,
I think I shall go investigate Spencer's new regexp code in Tcl, which
reputedly is designed for wider-than-8-bit chars from the get-go.  We've
had it on the TODO list for a long time to assimilate that code; it's
probably time to make it happen.
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Neil Conway

Date:

04 February 2003, 11:51:04

On Tue, 2003-02-04 at 11:24, wade wrote:
>   I redid my trials with the same data set on 7.2.3 --with-multibyte and I
> get the same brutal performance hit, so it is definitely a
> multibyte-specific problem.

Given that this problem isn't a regression, I don't think we need to
delay 7.3.2 to fix it (of course, a fix for 7.3.3 and 7.4 is essential,
IMHO).

Cheers,

Neil
-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

04 February 2003, 12:02:42

Neil Conway <neilc@samurai.com> writes:
> Given that this problem isn't a regression, I don't think we need to
> delay 7.3.2 to fix it (of course, a fix for 7.3.3 and 7.4 is essential,
> IMHO).

No, I've had to abandon my original thought that it was a localized bug,
so it's not going to be fixed in 7.3.2.

The real problem is simply that we're up against design limitations of
the existing regex package, which was never designed for wider-than-8-bit
character sets.  It's been rather crudely hacked while it was in our
hands (Henry Spencer would probably disown the code if he saw it now ;-))
so that it sorta kinda does MULTIBYTE, but it's slow and I don't think
it's complete either.

I'm about to go off and look at whether we can absorb the Tcl regex
package, which is Spencer's new baby.  That will not be a solution for
7.3.anything, but it could be an answer for 7.4.
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Neil Conway

Date:

04 February 2003, 12:22:28

On Tue, 2003-02-04 at 11:59, Tom Lane wrote:
> I'm about to go off and look at whether we can absorb the Tcl regex
> package, which is Spencer's new baby.  That will not be a solution for
> 7.3.anything, but it could be an answer for 7.4.

Sounds like we had about the same idea at about the same time -- I
emailed Henry Spencer inquiring about the new RE engine last night. I
came across a post this post that indicates he was planning to package
the new RE engine separately:

http://infosoc.uni-koeln.de/pipermail/php/1999-February/000019.html

but I wasn't able to find a release of it anywhere -- I'll let the list
know if/when he gets back to me.

Another option is to consider a different regular expression engine. At
least according to the benchmarks here,

http://ourworld.compuserve.com/homepages/john_maddock/proposals/exregex.htm

Spencer's implementation is outperformed by some other RE engines,
notably PCRE (www.pcre.org). But switching to another engine might
impose backward-compatibility problems, in terms of the details of the
RE syntax.

Cheers,

Neil
-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Hannu Krosing

Date:

04 February 2003, 12:29:02

On Tue, 2003-02-04 at 16:59, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > Given that this problem isn't a regression, I don't think we need to
> > delay 7.3.2 to fix it (of course, a fix for 7.3.3 and 7.4 is essential,
> > IMHO).
> 
> No, I've had to abandon my original thought that it was a localized bug,
> so it's not going to be fixed in 7.3.2.
> 
> The real problem is simply that we're up against design limitations of
> the existing regex package, which was never designed for wider-than-8-bit
> character sets.  It's been rather crudely hacked while it was in our
> hands (Henry Spencer would probably disown the code if he saw it now ;-))
> so that it sorta kinda does MULTIBYTE, but it's slow and I don't think
> it's complete either.
> 
> I'm about to go off and look at whether we can absorb the Tcl regex
> package, which is Spencer's new baby. 

Why not PCRE ( http://pcre.sourceforge.net/ ) ? 

They claim at least utf-8 (I don't remember other multibyte charsets
being mentioned) support and have a BSD-ish license,
http://pcre.sourceforge.net/license.txt .

-- 
Hannu Krosing <hannu@tm.ee>

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Hannu Krosing

Date:

04 February 2003, 12:36:03

On Tue, 2003-02-04 at 17:15, Neil Conway wrote:
> On Tue, 2003-02-04 at 11:59, Tom Lane wrote:
> > I'm about to go off and look at whether we can absorb the Tcl regex
> > package, which is Spencer's new baby.  That will not be a solution for
> > 7.3.anything, but it could be an answer for 7.4.
> 
> Sounds like we had about the same idea at about the same time -- I
> emailed Henry Spencer inquiring about the new RE engine last night. I
> came across a post this post that indicates he was planning to package
> the new RE engine separately:
> 
> http://infosoc.uni-koeln.de/pipermail/php/1999-February/000019.html
> 
> but I wasn't able to find a release of it anywhere -- I'll let the list
> know if/when he gets back to me.
> 
> Another option is to consider a different regular expression engine. At
> least according to the benchmarks here,
> 
> http://ourworld.compuserve.com/homepages/john_maddock/proposals/exregex.htm

I did not see anything about MULTIBYTE there, so it might be worth
experimenting on some MB charsets as well.

> Spencer's implementation is outperformed by some other RE engines,
> notably PCRE (www.pcre.org). But switching to another engine might
> impose backward-compatibility problems, in terms of the details of the
> RE syntax.

Yeah, it seems that POSIX (seemingly Spencers code) there is unable to
do some tests (the NA's in benchmark tables).

We could try a soft switch by having both implementations for a release
or two with different operators. At least good for comparing/testing.

-- 
Hannu Krosing <hannu@tm.ee>

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

04 February 2003, 12:36:10

Neil Conway <neilc@samurai.com> writes:
> Sounds like we had about the same idea at about the same time -- I
> emailed Henry Spencer inquiring about the new RE engine last night.

I just did that this morning ;-) ... but more as politeness than
anything else.  AFAICT from searching the net, packaging his new code
as a separate library is something that's been on Spencer's TODO list
for several years now.  We've been waiting for him to do it, but I'm now
thinking that it's time to quit waiting.  We can lift the code from Tcl
with probably not all that much more work than if it were an official
separate package.

> Another option is to consider a different regular expression engine. At
> least according to the benchmarks here,
> http://ourworld.compuserve.com/homepages/john_maddock/proposals/exregex.htm
> Spencer's implementation is outperformed by some other RE engines,
> notably PCRE (www.pcre.org).

AFAICT, that page is benchmarking Spencer's old code (the same library
we started from).  His new code is state-of-the-art according to Friedl
in _Mastering Regular Expressions_, 2nd ed 2002 (O'Reilly).
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Jon Jensen

Date:

04 February 2003, 12:36:39

On Tue, 4 Feb 2003, Neil Conway wrote:

> Spencer's implementation is outperformed by some other RE engines,
> notably PCRE (www.pcre.org). But switching to another engine might
> impose backward-compatibility problems, in terms of the details of the
> RE syntax.

It would be a delight to be able to use more advanced (IMHO) Perl-
compatible regexes in PostgreSQL. Perhaps a global configuration variable 
to select which regex engine to use would help work around any backward 
compatibility problems? Or, somewhat uglier, a different syntax in SQL for 
the new regex engine?

Jon

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

04 February 2003, 13:21:47

Jon Jensen <jon@endpoint.com> writes:
> It would be a delight to be able to use more advanced (IMHO) Perl-
> compatible regexes in PostgreSQL.

After some further research, pcre does seem like an interesting
alternative.  Both pcre and Spencer's new code have essentially
Berkeley-style licenses, so there's no problem there.  Some
relevant comparisons:

1. pcre tries to be exactly compatible with Perl, so details of its
regex flavor will be familiar to many more people than the Tcl flavor
(by and large the features are similar, but there are differences).

2. pcre is already distributed as a nice tidy library; we need not
extract code from the Tcl distribution.

3. pcre is actively maintained (although tracking a new release every
couple months may not be something we really want to do, anyway).
AFAICT Henry's not doing anything much with his code, so it'd be
pretty much take-once-and-maintain-for-ourselves.

4. pcre looks like it's probably *not* as well suited to a multibyte
environment.  In particular, I doubt that its UTF8 compile option was
even turned on for the performance comparison Neil cited --- and the man
page only promises "experimental, incomplete support for UTF-8 encoded
strings".  The Tcl code by contrast is used only in a multibyte
environment, so that's the supported, optimized path.  It doesn't even
assume null-terminated strings (yay).

5. As best I can tell so far, neither code is currently set up for
run-time choice of encoding; we'd have to do some work for that in
either case.  (This probably means that tracking pcre update releases
would be problematic anyhow.)

6. According to Friedl's book, the Tcl engine (Spencer's new code)
is way faster than Perl's, and so presumably faster than pcre, though
I can't find any specific measurements of pcre in the book.  It uses a
hybrid DFA/NFA approach which Friedl considers state of the art.

Strict Perl compatibility would be a nice feature, but right at the
moment the multibyte issue is looking like the determining factor.
If we don't get a multibyte-optimized engine out of this change, we're
wasting our time.
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Sean Chittenden

Date:

04 February 2003, 14:03:24

> > It would be a delight to be able to use more advanced (IMHO) Perl-
> > compatible regexes in PostgreSQL.
> 
> After some further research, pcre does seem like an interesting
> alternative.  Both pcre and Spencer's new code have essentially
> Berkeley-style licenses, so there's no problem there.  Some relevant
> comparisons:
> 
> 1. pcre tries to be exactly compatible with Perl, so details of its
> regex flavor will be familiar to many more people than the Tcl flavor
> (by and large the features are similar, but there are differences).

pcre is lgpl, iirc.  Ruby went off and wrote an explicitly BSD
licensed regexp engine to replace it's GPL'ed Perl/pcre based bits.

> 2. pcre is already distributed as a nice tidy library; we need not
> extract code from the Tcl distribution.

http://www.ruby-lang.org/cgi-bin/cvsweb.cgi/oniguruma/

> 3. pcre is actively maintained (although tracking a new release every
> couple months may not be something we really want to do, anyway).
> AFAICT Henry's not doing anything much with his code, so it'd be
> pretty much take-once-and-maintain-for-ourselves.

Oniguruma is pretty well maintained given that it's Ruby's regexp
engine, and has the perk of being maintained outside of Ruby as a
standalone module that gets periodically imported.

> 4. pcre looks like it's probably *not* as well suited to a multibyte
> environment.  In particular, I doubt that its UTF8 compile option
> was even turned on for the performance comparison Neil cited --- and
> the man page only promises "experimental, incomplete support for
> UTF-8 encoded strings".  The Tcl code by contrast is used only in a
> multibyte environment, so that's the supported, optimized path.  It
> doesn't even assume null-terminated strings (yay).

Oniguruma only supports ASCII, UTF-8, EUC-JP, and Shift_JIS, but
boasts being 10-20% faster than PCRE for ASCII (no clue about
multi-byte character sets).  In terms of development/API, it supports
the GNU regex, POSIX, Oniguruma APIs (the latter is what ruby uses to
hook in).

Just another option to add to the table, don't know if it fully fits
our requirements, but since it is actively being developed by
resources outside of this project, and it has support for 16-bit and
32-bit encodings (UCS-2, UCS-4, UTF-16) is on the TODO list, it might
be nice to keep this in mind and let Ruby maintain it instead of
PostgreSQL.

-sc

-- 
Sean Chittenden

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Hannu Krosing

Date:

04 February 2003, 14:10:23

On Tue, 2003-02-04 at 18:21, Tom Lane wrote:
> 4. pcre looks like it's probably *not* as well suited to a multibyte
> environment.  In particular, I doubt that its UTF8 compile option was
> even turned on for the performance comparison Neil cited --- and the man
> page only promises "experimental, incomplete support for UTF-8 encoded
> strings".  The Tcl code by contrast is used only in a multibyte
> environment, so that's the supported, optimized path.  It doesn't even
> assume null-terminated strings (yay).

If we are going into code-lifting business, we should also consider
Pythons sre (a modified pcre, that works both on 8-bit and python's
unicode (either 16 or 32 byte chars, depending on compile options))

It has no specific support for "raw" utf-8 or other variable-width
encodings.

-- 
Hannu Krosing <hannu@tm.ee>

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

04 February 2003, 14:18:25

Hannu Krosing <hannu@tm.ee> writes:
> If we are going into code-lifting business, we should also consider
> Pythons sre

What advantages does it have to make it worth considering?
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Neil Conway

Date:

04 February 2003, 14:41:20

On Tue, 2003-02-04 at 13:21, Tom Lane wrote:
> After some further research, pcre does seem like an interesting
> alternative.  Both pcre and Spencer's new code have essentially
> Berkeley-style licenses, so there's no problem there.

Keep in mind that pcre has an advertising clause in its license
(software that distributes pcre commercially or non-commercially needs
to add a note to the effect in its documentation / online help). Since
PostgreSQL's license doesn't have this restriction, it would be shame to
impose that requirement on PostgreSQL users.

(Note that as I'm not a lawyer, my interpretation of the license may not
be correct.)

> Strict Perl compatibility would be a nice feature, but right at the
> moment the multibyte issue is looking like the determining factor.

Agreed -- ISTM that Spencer's new engine is probably the best choice.

Cheers,

Neil
-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Hannu Krosing

Date:

04 February 2003, 16:36:30

Tom Lane kirjutas T, 04.02.2003 kell 21:18:
> Hannu Krosing <hannu@tm.ee> writes:
> > If we are going into code-lifting business, we should also consider
> > Pythons sre
> 
> What advantages does it have to make it worth considering?

Should be the same as pcre + support for wide chars.


-- 
Hannu Krosing <hannu@tm.ee>

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

04 February 2003, 17:26:15

Proof of concept:

PG 7.3 using regression database:

regression=# select count(*) from tenk1 where 'quotidian' ~ string4;count
-------    0
(1 row)

Time: 676.14 ms
regression=# select count(*) from tenk1 where 'quotidian' ~ stringu1;count
-------    0
(1 row)

Time: 3426.96 ms
regression=# select count(*) from tenk1 where 'quotidian' ~* stringu1;count
-------    0
(1 row)

Time: 466344.48 ms


CVS tip plus code extracted from Tcl:

regression=# select count(*) from tenk1 where 'quotidian' ~ string4;count
-------    0
(1 row)

Time: 472.48 ms
regression=# select count(*) from tenk1 where 'quotidian' ~ stringu1;count
-------    0
(1 row)

Time: 4414.91 ms
regression=# select count(*) from tenk1 where 'quotidian' ~* stringu1;count
-------    0
(1 row)

Time: 4608.49 ms

In the first case there are only four distinct patterns used, so we're
running with cached precompiled regexes.  In the other cases a new regex
compilation must occur at each row.  So, regex execution is a little
faster than before (at least for trivial regexes); compilation seems to
be a shade slower, but it doesn't fall over and die when compiling
case-insensitive patterns (or bracket expressions, which is what the
code actually reduces a case-insensitive pattern to).

This is nowhere near ready to commit, but it compiles cleanly and passes
regression tests ...
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

04 February 2003, 17:46:53

Hannu Krosing <hannu@tm.ee> writes:
> Tom Lane kirjutas T, 04.02.2003 kell 21:18:
>> What advantages does it have to make it worth considering?

> Should be the same as pcre + support for wide chars.

Well, if someone wants to do the legwork to try it, that interface
should work just about comparably to Spencer's package: I found that
the easiest way to make it work is to use pg_mb2wchar to expand our
internal encoding into an array of pg_wchar's and then apply the
regex package to that form.  So as long as sre can handle 4-byte
wide chars it ought to more or less drop in.

I've got a fair amount of cleanup to do on Spencer's package before
I can even think of committing it (ANSIfy function headers, fix comment
formatting so that pg_indent won't completely destroy 'em, etc).
But I'll be glad to send the modified interface file (adt/regexp.c)
to anyone who'd like to try getting it to work with sre.
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Neil Conway

Date:

04 February 2003, 17:54:59

On Tue, 2003-02-04 at 17:26, Tom Lane wrote:
> Proof of concept:
> [...]

Very cool work, Tom.

> In the first case there are only four distinct patterns used, so we're
> running with cached precompiled regexes.  In the other cases a new regex
> compilation must occur at each row.

Speaking of which, is there (or should there be) some mechanism for
increasing the size of the compiled pattern cache? Perhaps a GUC var?

Cheers,

Neil
-- 
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

04 February 2003, 18:35:42

Neil Conway <neilc@samurai.com> writes:
> Speaking of which, is there (or should there be) some mechanism for
> increasing the size of the compiled pattern cache? Perhaps a GUC var?

I thought about that while I was messing with the code, but I don't
think there's much point in it, unless someone wants to invest the work
to make the cache search much smarter (maybe a hash table?).  At present
a larger cache will cost you in extra search time, especially in the
case where the pattern isn't in the cache.

I did do the work last night to convert the cache management algorithm
into a self-organizing list (a la Knuth) rather than a round-robin
search as it was before.  This should reduce the expected number of
comparisons for cases where the cache is actually accomplishing
something, but of course it's no help if you have too many patterns
for the cache.
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tatsuo Ishii

Date:

04 February 2003, 21:32:37

Ok. The original complain can be sasily solved at least for single
byte encoding databases. With the small patches(against 7.3.1)
included, I got following result.

test1:
select count(*) from tenk1 where 'quotidian' ~ string4;count 
-------    0
(1 row)

Time: 113.81 ms

test2:
select count(*) from tenk1 where 'quotidian' ~ stringu1;count 
-------    0
(1 row)

Time: 419.36 ms

test3:
select count(*) from tenk1 where 'quotidian' ~* stringu1;count 
-------    0
(1 row)

Time: 1633.21 ms

The ratio for test3/test1 is now 14.35. Although not great as the
Spencer's new code according to Tom (with the code test3/test1 =
9.75), it seems much better than the original 7.3 code (test3/test1 =
689.71).

Note that if the database encoding is not a single byte one, it should
be as slow as the original 7.3. There is no easy way to fix it.

P.S. With the patches all the regression tests have passed on my Linux
box.
--
Tatsuo Ishii

----------------------------------- cut here -------------------------------------
*** postgresql-7.3.1/src/backend/regex/regcomp.c.orig    2002-09-05 05:31:24.000000000 +0900
--- postgresql-7.3.1/src/backend/regex/regcomp.c    2003-02-05 10:05:03.000000000 +0900
***************
*** 178,183 ****
--- 178,186 ----     int            i;     size_t        len;     pg_wchar   *wcp;
+     size_t    csetsize;
+ 
+     csetsize = (pg_database_encoding_max_length() == 1)?(SCHAR_MAX - SCHAR_MIN + 1):NC;      if (cclasses == NULL)
    cclasses = cclass_init();
 
***************
*** 211,217 ****      /* do the mallocs early so failure handling is easy */     g = (struct re_guts *)
malloc(sizeof(structre_guts) +
 
!                                   (NC - 1) * sizeof(cat_t));     if (g == NULL)         return REG_ESPACE;
p->ssize= len / (size_t) 2 *(size_t) 3 + (size_t) 1;        /* ugh */
 
--- 214,220 ----      /* do the mallocs early so failure handling is easy */     g = (struct re_guts *)
malloc(sizeof(structre_guts) +
 
!                                   (csetsize - 1) * sizeof(cat_t));     if (g == NULL)         return REG_ESPACE;
p->ssize= len / (size_t) 2 *(size_t) 3 + (size_t) 1;        /* ugh */
 
***************
*** 235,241 ****         p->pbegin[i] = 0;         p->pend[i] = 0;     }
!     g->csetsize = NC;     g->sets = NULL;     g->setbits = NULL;     g->ncsets = 0;
--- 238,244 ----         p->pbegin[i] = 0;         p->pend[i] = 0;     }
!     g->csetsize = csetsize;     g->sets = NULL;     g->setbits = NULL;     g->ncsets = 0;
***************
*** 248,254 ****     g->nsub = 0;     g->ncategories = 1;            /* category 0 is "everything else" */
g->categories= &g->catspace[-(CHAR_MIN)];
 
!     memset((char *) g->catspace, 0, NC * sizeof(cat_t));     g->backrefs = 0;      /* do it */
--- 251,257 ----     g->nsub = 0;     g->ncategories = 1;            /* category 0 is "everything else" */
g->categories= &g->catspace[-(CHAR_MIN)];
 
!     memset((char *) g->catspace, 0, csetsize * sizeof(cat_t));     g->backrefs = 0;      /* do it */

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Hannu Krosing

Date:

05 February 2003, 01:06:45

Tom Lane kirjutas K, 05.02.2003 kell 01:35:
> Neil Conway <neilc@samurai.com> writes:
> > Speaking of which, is there (or should there be) some mechanism for
> > increasing the size of the compiled pattern cache? Perhaps a GUC var?
> 
> I thought about that while I was messing with the code, but I don't
> think there's much point in it, unless someone wants to invest the work
> to make the cache search much smarter (maybe a hash table?).  At present
> a larger cache will cost you in extra search time, especially in the
> case where the pattern isn't in the cache.
> 
> I did do the work last night to convert the cache management algorithm
> into a self-organizing list (a la Knuth) rather than a round-robin
> search as it was before.  This should reduce the expected number of
> comparisons for cases where the cache is actually accomplishing
> something, but of course it's no help if you have too many patterns
> for the cache.

Perhaps the decision weather to try to use the cache at all could be
done at planning time depending on statistics information ?

Another idea is to make special regex type and store the regexes
pre-parsed (i.e. in some fast-load form) ?

-- 
Hannu Krosing <hannu@tm.ee>

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

05 February 2003, 01:12:56

Hannu Krosing <hannu@tm.ee> writes:
> Another idea is to make special regex type and store the regexes
> pre-parsed (i.e. in some fast-load form) ?

Seems unlikely that going out to disk could beat just recompiling the
regexp.  They're not *that* slow to compile ... at least not when we
avoid the performance problem that we introduced into the older regexp
code.
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

05 February 2003, 10:40:44

Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Ok. The original complain can be sasily solved at least for single
> byte encoding databases. With the small patches(against 7.3.1)
> included, I got following result.

Nice work, Tatsuo!  Wade, can you confirm that this patch solves your
problem?

Tatsuo, please commit into REL7_3 branch only --- I'm nearly ready to do
a wholesale replacement of the regex code in HEAD, so you wouldn't
accomplish much except to create a merge problem for me ...
        regards, tom lane

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

wade

Date:

05 February 2003, 11:57:53

Confirmed.  Looks like a 100-fold increase. Thanx guys.
Explain output can be seen here:
http://arch.wavefire.com/pgregex.txt -Wade Klaver

At 09:59 AM 2/5/03 -0500, Tom Lane wrote:
>Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> Ok. The original complain can be sasily solved at least for single
>> byte encoding databases. With the small patches(against 7.3.1)
>> included, I got following result.
>
>Nice work, Tatsuo!  Wade, can you confirm that this patch solves your
>problem?
>
>Tatsuo, please commit into REL7_3 branch only --- I'm nearly ready to do
>a wholesale replacement of the regex code in HEAD, so you wouldn't
>accomplish much except to create a merge problem for me ...
>
>            regards, tom lane
>
>

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tatsuo Ishii

Date:

05 February 2003, 19:03:15

> Nice work, Tatsuo!  Wade, can you confirm that this patch solves your
> problem?
>
> Tatsuo, please commit into REL7_3 branch only --- I'm nearly ready to do
> a wholesale replacement of the regex code in HEAD, so you wouldn't
> accomplish much except to create a merge problem for me ...

Ok. I have just committed into the 7.3 stable branch.
--
Tatsuo Ishii

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Hannu Krosing

Date:

06 February 2003, 14:53:51

Tom Lane kirjutas K, 05.02.2003 kell 08:12:
> Hannu Krosing <hannu@tm.ee> writes:
> > Another idea is to make special regex type and store the regexes
> > pre-parsed (i.e. in some fast-load form) ?
> 
> Seems unlikely that going out to disk could beat just recompiling the
> regexp. 

We have to get _something_ from disk anyway. Currently we fetch regex
source code, but if there were some format that is faster to load then
that could be an option.

-- 
Hannu Krosing <hannu@tm.ee>

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Bruce Momjian

Date:

17 February 2003, 16:02:22

Can this improvement get merged up into CVS current, or did you already
do that Tom?

---------------------------------------------------------------------------

Tatsuo Ishii wrote:
> > Nice work, Tatsuo!  Wade, can you confirm that this patch solves your
> > problem?
> >
> > Tatsuo, please commit into REL7_3 branch only --- I'm nearly ready to do
> > a wholesale replacement of the regex code in HEAD, so you wouldn't
> > accomplish much except to create a merge problem for me ...
> 
> Ok. I have just committed into the 7.3 stable branch.
> --
> Tatsuo Ishii
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: POSIX regex performance bug in 7.3 Vs. 7.2

From

Tom Lane

Date:

17 February 2003, 17:09:06

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Can this improvement get merged up into CVS current, or did you already
> do that Tom?

It's irrelevant to current.
        regards, tom lane