Re: POSIX regex performance bug in 7.3 Vs. 7.2 - Mailing list pgsql-hackers

From Tom Lane
Subject Re: POSIX regex performance bug in 7.3 Vs. 7.2
Date
Msg-id 14971.1044377191@sss.pgh.pa.us
In response to Re: POSIX regex performance bug in 7.3 Vs. 7.2  (wade <wade@wavefire.com>)
List pgsql-hackers
wade <wade@wavefire.com> writes:
>   I redid my trials with the same data set on 7.2.3 --with-multibyte and I
> get the same brutal performance hit, so it is definitely a
> multibyte-specific problem.
>
> There are only about 1000 words that appear more than once (2 or 3 times)
> in 27k rows.

Right, so the caching of compiled regexps that regexp.c does is of no
help, and any change in its behavior in 7.3 wouldn't have made any
difference anyway.  I leapt to a conclusion after reviewing the CVS
logs for pertinent changes, but it was the wrong conclusion.  The true
problem is that MULTIBYTE is now forced on, and that causes some
loops in the regexp compiler to change from 256 to 65536 iterations.
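
To make the scaling concrete, here is a toy model in C.  It is not the
PostgreSQL regex source: toy_compile() and toy_total_cost() are hypothetical
names, and the assumption that each compile makes exactly one full sweep over
all NC character codes is mine (the real compiler has several such loops).
It only shows that the per-compile cost tracks NC rather than the pattern
length, and that a cache cannot help when nearly every pattern is distinct.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for one regexp compile: sweep every one of the nc
 * possible character codes once and record which of them occur in the
 * pattern.  Returns the number of loop iterations performed, or 0 if the
 * table could not be allocated.
 */
unsigned long
toy_compile(const char *pattern, unsigned long nc)
{
    unsigned char *member;
    unsigned long c;
    unsigned long iterations = 0;

    assert(nc >= 256);
    member = malloc(nc);
    if (member == NULL)
        return 0;

    /* The cost of this loop depends on nc, not on strlen(pattern). */
    for (c = 0; c < nc; c++)
    {
        member[c] = (c != 0 && c < 256 &&
                     strchr(pattern, (int) c) != NULL);
        iterations++;
    }

    free(member);
    return iterations;
}

/*
 * Total sweep iterations when n_patterns distinct patterns are compiled,
 * i.e. when a compiled-regexp cache never gets a hit.
 */
unsigned long
toy_total_cost(unsigned long n_patterns, unsigned long nc)
{
    unsigned long i;
    unsigned long total = 0;

    for (i = 0; i < n_patterns; i++)
        total += toy_compile("foo", nc);
    return total;
}
```

Under those assumptions, 27000 mostly-distinct patterns cost 27000 * 256 =
6,912,000 iterations with NC = 256, but 27000 * 65536 = 1,769,472,000 with
NC = 65536: a factor of 256 regardless of how short the patterns are.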

I believe if you change NC in src/include/regex/utils.h from its new
value of 65536 back to 256, performance will go back where it was.
This will *not* do if you run any multibyte character sets --- but
as long as the database is all ASCII or ISO-8859-whatever, it should
be a safe hack that will let you use 7.3.*.
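
A rough sketch of the hack and of why it is single-byte-only.  The exact form
of the definition in the 7.3 utils.h is an assumption here, not quoted from
the tree, and code_fits_nc() is a hypothetical helper that exists only to
illustrate the limit; nothing like it is claimed to be in the sources.

```c
/*
 * Sketch of the change in src/include/regex/utils.h.  The 7.3 tree has NC
 * at 65536; how that definition is actually spelled there is an assumption.
 */
#define NC 256      /* single-byte hack: ASCII / ISO-8859-* databases only */

/*
 * Hypothetical helper: nonzero if a character code can index a table that
 * has NC entries.
 */
int
code_fits_nc(unsigned long code)
{
    return code < NC;
}
```

With NC at 256, an ISO-8859-1 code such as 0xE9 still fits, while any code
wider than 8 bits (0x3042 is used below purely as an illustrative value)
does not, which is why the hack is off limits for multibyte databases.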

Rather than apply a band-aid like this to the main sources,
I think I shall go investigate Henry Spencer's new regexp code in Tcl, which
reputedly is designed for wider-than-8-bit chars from the get-go.  Assimilating
that code has been on the TODO list for a long time; it's
probably time to make it happen.
        regards, tom lane

