Thread: POSIX regex performance bug in 7.3 vs. 7.2
Hello,

We recently upgraded a project from 7.2 to 7.3.1 to make use of some of the cool new features in 7.3. The installed version is CVS stable from yesterday. However, we noticed a major performance hit in POSIX regular expression matches against columns using the ~* operator.

http://arch.wavefire.com/badregex73.txt has explain analyze output from 7.2 and 7.3.1 respectively. Both of these tables have only 101 rows. The 7.3.1 install is using the default settings from postgresql.conf.

Any ideas as to why this should be happening?

Should anyone require additional information, please let me know.

-Wade Klaver
wade <wade@wavefire.com> writes:
> We recently upgraded a project from 7.2 to 7.3.1 to make use of some of
> the cool new features in 7.3. The installed version is CVS stable from
> yesterday. However, we noticed a major performance hit in POSIX regular
> expression matches against columns using the ~* operator.

The only thought that comes to mind is that multibyte character sets are supported in 7.3 whereas they were optional (and not default) in 7.2. I'd not have expected a factor-of-150 performance hit from that, though.

Could you rebuild your backend with profiling enabled and get a gprof summary of where the time is going?

			regards, tom lane
Why on earth are you using a CVS version!?!?!?!

Chris

On Fri, 31 Jan 2003, wade wrote:
> Hello,
> We recently upgraded a project from 7.2 to 7.3.1 to make use of some of
> the cool new features in 7.3. The installed version is CVS stable from
> yesterday. However, we noticed a major performance hit in POSIX regular
> expression matches against columns using the ~* operator.
> http://arch.wavefire.com/badregex73.txt has explain analyze output from 7.2
> and 7.3.1 respectively. Both of these tables have only 101 rows. The
> 7.3.1 install is using the default settings from postgresql.conf.
>
> Any ideas as to why this should be happening?
>
> Should anyone require additional information, please let me know.
>
> -Wade Klaver
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> Why on earth are you using a CVS version!?!?!?!

I assume he meant tip of REL7_3 branch --- which is a perfectly reasonable thing to install, even if there are still a few fixes to go before we call it 7.3.2.

			regards, tom lane
At 08:31 PM 2/1/03 +0800, Christopher Kings-Lynne wrote:
>Why on earth are you using a CVS version!?!?!?!
>
>Chris

This problem manifests itself under the 7.3.1 release as well. CVS is used so we can access patches to the SRF stuff implemented after 7.3.1 was released.

Tom... any links that document the procedure for obtaining the profile information you requested? I've used a profiler briefly, but am not sure how to go about gathering information pertinent to this problematic query only.

-Wade Klaver
At 10:52 PM 1/31/03 -0500, Tom Lane wrote:
>wade <wade@wavefire.com> writes:
>> We recently upgraded a project from 7.2 to 7.3.1 to make use of some of
>> the cool new features in 7.3. The installed version is CVS stable from
>> yesterday. However, we noticed a major performance hit in POSIX regular
>> expression matches against columns using the ~* operator.
>
>The only thought that comes to mind is that multibyte character sets are
>supported in 7.3 whereas they were optional (and not default) in 7.2.
>I'd not have expected a factor-of-150 performance hit from that, though.
>
>Could you rebuild your backend with profiling enabled and get a gprof
>summary of where the time is going?
>
>			regards, tom lane

Here is the profile information. I included a log of the session that generated it at the top of the gprof output. If there is any other info I can help you with, please let me know.

http://arch.wavefire.com/pgregexgmon.txt

-Wade Klaver
wade <wade@wavefire.com> writes:
> Here is the profile information. I included a log of the session that
> generated it at the top of the gprof output. If there is any other info I
> can help you with, please let me know.

A four-second test isn't long enough to gather any statistically meaningful profile info. On most machines, gprof samples 100 times per second, so realistically you need a minute or two of runtime to have trustworthy numbers.

Please replicate the rows in the table by a factor of ten or twenty or so and try again.

			regards, tom lane
At 05:51 PM 2/3/03 -0500, Tom Lane wrote:
>wade <wade@wavefire.com> writes:
>> Here is the profile information. I included a log of the session that
>> generated it at the top of the gprof output. If there is any other info I
>> can help you with, please let me know.
>
>A four-second test isn't long enough to gather any statistically
>meaningful profile info. On most machines, gprof samples 100 times per
>second, so realistically you need a minute or two of runtime to have
>trustworthy numbers.
>
>Please replicate the rows in the table by a factor of ten or twenty or
>so and try again.
>
>			regards, tom lane

OK, here goes again. I have tried a different table, this one with 27444 rows. In this case, the query with the regex of the form "row ~* 'regex'" runs in 1.113 seconds and the other runs in 600 seconds. The session idled for a while after the query completed, if that makes a difference. Queries and explain output are included at the top of the gprof output.

http://arch.wavefire.com/pgregexgmon.txt

-Wade Klaver
Sigh. It seems that somebody broke caching of compiled regexes, so that your regex is recompiled each time it's used. I haven't dug into the logic yet, but I think it must have been a mistake in Thomas' change to make the regex cache be searched circularly:

2002-06-14 22:49  thomas

	* src/backend/utils/adt/regexp.c: Search the existing regular
	expression cache as a ring buffer. Will optimize the case for
	repeated calls for the same expression, which seems to be the most
	common case. Formerly, always searched from the first entry. May
	want to look at the least-recently-used algorithm to make sure it
	is identifying the right slots to reclaim. Seems silly to do math
	when it seems that we could simply use an incrementing counter...

Considering that we now know that this is a factor-of-150 performance hit, I wonder if this is a "must fix" for 7.3.2? We already wrapped the tarball, but ...

			regards, tom lane
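(For reference, the circular search that commit message describes amounts to roughly the following -- a minimal sketch with hypothetical names, not the actual regexp.c code:)

#include <string.h>

#define MAX_CACHED_RES 32

/* Hypothetical cache of regex source strings; the real regexp.c slots
 * also hold the compiled form alongside the pattern text. */
static char *re_patterns[MAX_CACHED_RES];
static int   re_last_hit;	/* slot where the previous lookup matched */

/* Ring-buffer search: start at the last hit and wrap around, so repeated
 * calls for the same expression succeed on the first comparison. */
static int
lookup_re(const char *pattern)
{
	int		i;

	for (i = 0; i < MAX_CACHED_RES; i++)
	{
		int		slot = (re_last_hit + i) % MAX_CACHED_RES;

		if (re_patterns[slot] != NULL &&
			strcmp(re_patterns[slot], pattern) == 0)
		{
			re_last_hit = slot;
			return slot;	/* cache hit: reuse compiled regex */
		}
	}
	return -1;			/* miss: caller must compile and evict a slot */
}

The risk with a scheme like this is on the eviction side: if the replacement bookkeeping picks the wrong slot, a hot pattern can get thrown out and recompiled on every call, which is the symptom being discussed here.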
Well, IMHO I would rather see a delay of the roll-out by a day or two than see a release with such a serious performance glitch. Especially since I personally have been shooting my big mouth off to all my geek friends about the leaps and bounds PG has made in the last few releases. With my luck one of them will find it. :)

I guess in the end it comes down to the rest of you developer types, but I would be inclined to re-wrap. However, this is easy for me to say given that I have no idea how much work it actually is to re-wrap.

-Wade Klaver

At 08:07 PM 2/3/03 -0500, Tom Lane wrote:
>Sigh. It seems that somebody broke caching of compiled regexes,
>so that your regex is recompiled each time it's used. I haven't
>dug into the logic yet, but I think it must have been a mistake
>in Thomas' change to make the regex cache be searched circularly:
>
>2002-06-14 22:49  thomas
>
>	* src/backend/utils/adt/regexp.c: Search the existing regular
>	expression cache as a ring buffer. Will optimize the case for
>	repeated calls for the same expression, which seems to be the most
>	common case. Formerly, always searched from the first entry. May
>	want to look at the least-recently-used algorithm to make sure it
>	is identifying the right slots to reclaim. Seems silly to do math
>	when it seems that we could simply use an incrementing counter...
>
>Considering that we now know that this is a factor-of-150 performance
>hit, I wonder if this is a "must fix" for 7.3.2? We already wrapped
>the tarball, but ...
>
>			regards, tom lane
Wade, how many distinct patterns do you have in that table? What's the population distribution (in particular, do the top 32 patterns account for most of the table)?

It's looking like the issue is not so much that the 7.3 code is completely broken, as that its LRU replacement policy for precompiled regex patterns got rejiggered in a way that doesn't match your data.

			regards, tom lane
Next question: may I guess that you weren't using MULTIBYTE in 7.2?

After still more digging, I'm coming round to the opinion that the problem is that MULTIBYTE is forced on in 7.3, and this imposes a factor-of-256 overhead in a bunch of the operations in regcomp.c. In particular, compiling a case-independent regex is now hugely more expensive than it used to be.

The parties who wanted to force MULTIBYTE on promised that there would be no such penalties :-(

			regards, tom lane
OK, I redid my trials with the same data set on 7.2.3 --with-multibyte and I get the same brutal performance hit, so it is definitely a multibyte-specific problem.

WRT the distribution of the data in the table, I used the following: all g-words in /usr/share/dict, with different processes attached:

  no process
  init caps
  word || row_id
  etc...

There are only about 1000 words that appear more than once (2 or 3 times) in 27k rows.

-Wade Klaver

At 11:08 PM 2/3/03 -0500, Tom Lane wrote:
>Next question: may I guess that you weren't using MULTIBYTE in 7.2?
>
>After still more digging, I'm coming round to the opinion that the
>problem is that MULTIBYTE is forced on in 7.3, and this imposes a
>factor-of-256 overhead in a bunch of the operations in regcomp.c.
>In particular, compiling a case-independent regex is now hugely
>more expensive than it used to be.
>
>The parties who wanted to force MULTIBYTE on promised that there
>would be no such penalties :-(
>
>			regards, tom lane
wade <wade@wavefire.com> writes:
> I redid my trials with the same data set on 7.2.3 --with-multibyte and I
> get the same brutal performance hit, so it is definitely a
> multibyte-specific problem.
>
> There are only about 1000 words that appear more than once (2 or 3 times)
> in 27k rows.

Right, so the caching of compiled regexps that regexp.c does is of no help, and any change in its behavior in 7.3 wouldn't have made any difference anyway. I leapt to a conclusion after reviewing the CVS logs for pertinent changes, but it was the wrong conclusion.

The true problem is that MULTIBYTE is now forced on, and that causes some loops in the regexp compiler to change from 256 to 65536 iterations. I believe if you change NC in src/include/regex/utils.h from its new value of 65536 back to 256, performance will go back where it was. This will *not* do if you run any multibyte character sets --- but as long as the database is all ASCII or ISO-8859-whatever, it should be a safe hack that will let you use 7.3.*.

Rather than trying to band-aid a solution like this in the main sources, I think I shall go investigate Spencer's new regexp code in Tcl, which reputedly is designed for wider-than-8-bit chars from the get-go. We've had it on the TODO list for a long time to assimilate that code; it's probably time to make it happen.

			regards, tom lane
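(To make the factor-of-256 concrete: compiling a case-independent regex means probing every code in the character set for case-equivalents of each literal. A sketch of that kind of loop -- illustrative only, with hypothetical names, not the actual regcomp.c code:)

#include <wctype.h>

/* Illustrative sketch: add all case-equivalents of 'ch' to a bracket-class
 * bitmap.  With 8-bit chars, csetsize is 256; with MULTIBYTE forced on,
 * NC (and hence csetsize) becomes 65536, so this loop does 256x the work
 * for every literal character in the pattern. */
static void
add_case_variants(unsigned char *bits, int csetsize, wint_t ch)
{
	wint_t	c;

	for (c = 0; c < (wint_t) csetsize; c++)
	{
		if (towlower(c) == towlower(ch))
			bits[c / 8] |= 1 << (c % 8);	/* include code c in the class */
	}
}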
On Tue, 2003-02-04 at 11:24, wade wrote:
> I redid my trials with the same data set on 7.2.3 --with-multibyte and I
> get the same brutal performance hit, so it is definitely a
> multibyte-specific problem.

Given that this problem isn't a regression, I don't think we need to delay 7.3.2 to fix it (of course, a fix for 7.3.3 and 7.4 is essential, IMHO).

Cheers,

Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Neil Conway <neilc@samurai.com> writes:
> Given that this problem isn't a regression, I don't think we need to
> delay 7.3.2 to fix it (of course, a fix for 7.3.3 and 7.4 is essential,
> IMHO).

No, I've had to abandon my original thought that it was a localized bug, so it's not going to be fixed in 7.3.2.

The real problem is simply that we're up against design limitations of the existing regex package, which was never designed for wider-than-8-bit character sets. It's been rather crudely hacked while it was in our hands (Henry Spencer would probably disown the code if he saw it now ;-)) so that it sorta kinda does MULTIBYTE, but it's slow and I don't think it's complete either.

I'm about to go off and look at whether we can absorb the Tcl regex package, which is Spencer's new baby. That will not be a solution for 7.3.anything, but it could be an answer for 7.4.

			regards, tom lane
On Tue, 2003-02-04 at 11:59, Tom Lane wrote:
> I'm about to go off and look at whether we can absorb the Tcl regex
> package, which is Spencer's new baby. That will not be a solution for
> 7.3.anything, but it could be an answer for 7.4.

Sounds like we had about the same idea at about the same time -- I emailed Henry Spencer inquiring about the new RE engine last night. I came across a post that indicates he was planning to package the new RE engine separately:

http://infosoc.uni-koeln.de/pipermail/php/1999-February/000019.html

but I wasn't able to find a release of it anywhere -- I'll let the list know if/when he gets back to me.

Another option is to consider a different regular expression engine. At least according to the benchmarks here,

http://ourworld.compuserve.com/homepages/john_maddock/proposals/exregex.htm

Spencer's implementation is outperformed by some other RE engines, notably PCRE (www.pcre.org). But switching to another engine might impose backward-compatibility problems, in terms of the details of the RE syntax.

Cheers,

Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
On Tue, 2003-02-04 at 16:59, Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > Given that this problem isn't a regression, I don't think we need to
> > delay 7.3.2 to fix it (of course, a fix for 7.3.3 and 7.4 is essential,
> > IMHO).
>
> No, I've had to abandon my original thought that it was a localized bug,
> so it's not going to be fixed in 7.3.2.
>
> The real problem is simply that we're up against design limitations of
> the existing regex package, which was never designed for wider-than-8-bit
> character sets. It's been rather crudely hacked while it was in our
> hands (Henry Spencer would probably disown the code if he saw it now ;-))
> so that it sorta kinda does MULTIBYTE, but it's slow and I don't think
> it's complete either.
>
> I'm about to go off and look at whether we can absorb the Tcl regex
> package, which is Spencer's new baby.

Why not PCRE ( http://pcre.sourceforge.net/ )? They claim at least utf-8 support (I don't remember other multibyte charsets being mentioned) and have a BSD-ish license: http://pcre.sourceforge.net/license.txt

--
Hannu Krosing <hannu@tm.ee>
On Tue, 2003-02-04 at 17:15, Neil Conway wrote:
> Sounds like we had about the same idea at about the same time -- I
> emailed Henry Spencer inquiring about the new RE engine last night. I
> came across a post that indicates he was planning to package the new
> RE engine separately:
>
> http://infosoc.uni-koeln.de/pipermail/php/1999-February/000019.html
>
> but I wasn't able to find a release of it anywhere -- I'll let the list
> know if/when he gets back to me.
>
> Another option is to consider a different regular expression engine. At
> least according to the benchmarks here,
>
> http://ourworld.compuserve.com/homepages/john_maddock/proposals/exregex.htm

I did not see anything about MULTIBYTE there, so it might be worth experimenting on some MB charsets as well.

> Spencer's implementation is outperformed by some other RE engines,
> notably PCRE (www.pcre.org). But switching to another engine might
> impose backward-compatibility problems, in terms of the details of the
> RE syntax.

Yeah, it seems that POSIX (seemingly Spencer's code) there is unable to do some tests (the NA's in the benchmark tables).

We could try a soft switch by having both implementations for a release or two with different operators. At least good for comparing/testing.

--
Hannu Krosing <hannu@tm.ee>
Neil Conway <neilc@samurai.com> writes:
> Sounds like we had about the same idea at about the same time -- I
> emailed Henry Spencer inquiring about the new RE engine last night.

I just did that this morning ;-) ... but more as politeness than anything else. AFAICT from searching the net, packaging his new code as a separate library is something that's been on Spencer's TODO list for several years now. We've been waiting for him to do it, but I'm now thinking that it's time to quit waiting. We can lift the code from Tcl with probably not all that much more work than if it were an official separate package.

> Another option is to consider a different regular expression engine. At
> least according to the benchmarks here,
> http://ourworld.compuserve.com/homepages/john_maddock/proposals/exregex.htm
> Spencer's implementation is outperformed by some other RE engines,
> notably PCRE (www.pcre.org).

AFAICT, that page is benchmarking Spencer's old code (the same library we started from). His new code is state-of-the-art according to Friedl in _Mastering Regular Expressions_, 2nd ed 2002 (O'Reilly).

			regards, tom lane
On Tue, 4 Feb 2003, Neil Conway wrote:
> Spencer's implementation is outperformed by some other RE engines,
> notably PCRE (www.pcre.org). But switching to another engine might
> impose backward-compatibility problems, in terms of the details of the
> RE syntax.

It would be a delight to be able to use more advanced (IMHO) Perl-compatible regexes in PostgreSQL. Perhaps a global configuration variable to select which regex engine to use would help work around any backward compatibility problems? Or, somewhat uglier, a different syntax in SQL for the new regex engine?

Jon
Jon Jensen <jon@endpoint.com> writes:
> It would be a delight to be able to use more advanced (IMHO) Perl-
> compatible regexes in PostgreSQL.

After some further research, pcre does seem like an interesting alternative. Both pcre and Spencer's new code have essentially Berkeley-style licenses, so there's no problem there. Some relevant comparisons:

1. pcre tries to be exactly compatible with Perl, so details of its regex flavor will be familiar to many more people than the Tcl flavor (by and large the features are similar, but there are differences).

2. pcre is already distributed as a nice tidy library; we need not extract code from the Tcl distribution.

3. pcre is actively maintained (although tracking a new release every couple months may not be something we really want to do, anyway). AFAICT Henry's not doing anything much with his code, so it'd be pretty much take-once-and-maintain-for-ourselves.

4. pcre looks like it's probably *not* as well suited to a multibyte environment. In particular, I doubt that its UTF8 compile option was even turned on for the performance comparison Neil cited --- and the man page only promises "experimental, incomplete support for UTF-8 encoded strings". The Tcl code by contrast is used only in a multibyte environment, so that's the supported, optimized path. It doesn't even assume null-terminated strings (yay).

5. As best I can tell so far, neither code is currently set up for run-time choice of encoding; we'd have to do some work for that in either case. (This probably means that tracking pcre update releases would be problematic anyhow.)

6. According to Friedl's book, the Tcl engine (Spencer's new code) is way faster than Perl's, and so presumably faster than pcre, though I can't find any specific measurements of pcre in the book. It uses a hybrid DFA/NFA approach which Friedl considers state of the art.

Strict Perl compatibility would be a nice feature, but right at the moment the multibyte issue is looking like the determining factor. If we don't get a multibyte-optimized engine out of this change, we're wasting our time.

			regards, tom lane
> > It would be a delight to be able to use more advanced (IMHO) Perl-
> > compatible regexes in PostgreSQL.
>
> After some further research, pcre does seem like an interesting
> alternative. Both pcre and Spencer's new code have essentially
> Berkeley-style licenses, so there's no problem there. Some relevant
> comparisons:
>
> 1. pcre tries to be exactly compatible with Perl, so details of its
> regex flavor will be familiar to many more people than the Tcl flavor
> (by and large the features are similar, but there are differences).

pcre is LGPL, iirc. Ruby went off and wrote an explicitly BSD-licensed regexp engine to replace its GPL'ed Perl/pcre-based bits.

> 2. pcre is already distributed as a nice tidy library; we need not
> extract code from the Tcl distribution.

http://www.ruby-lang.org/cgi-bin/cvsweb.cgi/oniguruma/

> 3. pcre is actively maintained (although tracking a new release every
> couple months may not be something we really want to do, anyway).
> AFAICT Henry's not doing anything much with his code, so it'd be
> pretty much take-once-and-maintain-for-ourselves.

Oniguruma is pretty well maintained given that it's Ruby's regexp engine, and has the perk of being maintained outside of Ruby as a standalone module that gets periodically imported.

> 4. pcre looks like it's probably *not* as well suited to a multibyte
> environment. In particular, I doubt that its UTF8 compile option
> was even turned on for the performance comparison Neil cited --- and
> the man page only promises "experimental, incomplete support for
> UTF-8 encoded strings". The Tcl code by contrast is used only in a
> multibyte environment, so that's the supported, optimized path. It
> doesn't even assume null-terminated strings (yay).

Oniguruma only supports ASCII, UTF-8, EUC-JP, and Shift_JIS, but boasts being 10-20% faster than PCRE for ASCII (no clue about multibyte character sets). In terms of development/API, it supports the GNU regex, POSIX, and Oniguruma APIs (the latter is what Ruby uses to hook in).

Just another option to add to the table; I don't know if it fully fits our requirements, but since it is actively being developed by resources outside of this project, and support for 16-bit and 32-bit encodings (UCS-2, UCS-4, UTF-16) is on its TODO list, it might be nice to keep this in mind and let Ruby maintain it instead of PostgreSQL.

-sc

--
Sean Chittenden
On Tue, 2003-02-04 at 18:21, Tom Lane wrote:
> 4. pcre looks like it's probably *not* as well suited to a multibyte
> environment. In particular, I doubt that its UTF8 compile option was
> even turned on for the performance comparison Neil cited --- and the man
> page only promises "experimental, incomplete support for UTF-8 encoded
> strings". The Tcl code by contrast is used only in a multibyte
> environment, so that's the supported, optimized path. It doesn't even
> assume null-terminated strings (yay).

If we are going into the code-lifting business, we should also consider Python's sre (a modified pcre that works both on 8-bit strings and Python's unicode, which uses either 16- or 32-bit chars depending on compile options). It has no specific support for "raw" utf-8 or other variable-width encodings.

--
Hannu Krosing <hannu@tm.ee>
Hannu Krosing <hannu@tm.ee> writes:
> If we are going into the code-lifting business, we should also consider
> Python's sre

What advantages does it have to make it worth considering?

			regards, tom lane
On Tue, 2003-02-04 at 13:21, Tom Lane wrote:
> After some further research, pcre does seem like an interesting
> alternative. Both pcre and Spencer's new code have essentially
> Berkeley-style licenses, so there's no problem there.

Keep in mind that pcre has an advertising clause in its license (software that distributes pcre commercially or non-commercially needs to add a note to that effect in its documentation / online help). Since PostgreSQL's license doesn't have this restriction, it would be a shame to impose that requirement on PostgreSQL users. (Note that as I'm not a lawyer, my interpretation of the license may not be correct.)

> Strict Perl compatibility would be a nice feature, but right at the
> moment the multibyte issue is looking like the determining factor.

Agreed -- ISTM that Spencer's new engine is probably the best choice.

Cheers,

Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Tom Lane wrote on Tue, 2003-02-04 at 21:18:
> Hannu Krosing <hannu@tm.ee> writes:
> > If we are going into the code-lifting business, we should also consider
> > Python's sre
>
> What advantages does it have to make it worth considering?

Should be the same as pcre + support for wide chars.

--
Hannu Krosing <hannu@tm.ee>
Proof of concept: PG 7.3 using regression database:

regression=# select count(*) from tenk1 where 'quotidian' ~ string4;
 count
-------
     0
(1 row)

Time: 676.14 ms

regression=# select count(*) from tenk1 where 'quotidian' ~ stringu1;
 count
-------
     0
(1 row)

Time: 3426.96 ms

regression=# select count(*) from tenk1 where 'quotidian' ~* stringu1;
 count
-------
     0
(1 row)

Time: 466344.48 ms

CVS tip plus code extracted from Tcl:

regression=# select count(*) from tenk1 where 'quotidian' ~ string4;
 count
-------
     0
(1 row)

Time: 472.48 ms

regression=# select count(*) from tenk1 where 'quotidian' ~ stringu1;
 count
-------
     0
(1 row)

Time: 4414.91 ms

regression=# select count(*) from tenk1 where 'quotidian' ~* stringu1;
 count
-------
     0
(1 row)

Time: 4608.49 ms

In the first case there are only four distinct patterns used, so we're running with cached precompiled regexes. In the other cases a new regex compilation must occur at each row.

So, regex execution is a little faster than before (at least for trivial regexes); compilation seems to be a shade slower, but it doesn't fall over and die when compiling case-insensitive patterns (or bracket expressions, which is what the code actually reduces a case-insensitive pattern to).

This is nowhere near ready to commit, but it compiles cleanly and passes regression tests ...

			regards, tom lane
Hannu Krosing <hannu@tm.ee> writes:
> Tom Lane wrote on Tue, 2003-02-04 at 21:18:
>> What advantages does it have to make it worth considering?

> Should be the same as pcre + support for wide chars.

Well, if someone wants to do the legwork to try it, that interface should work just about comparably to Spencer's package: I found that the easiest way to make it work is to use pg_mb2wchar to expand our internal encoding into an array of pg_wchar's and then apply the regex package to that form. So as long as sre can handle 4-byte wide chars it ought to more or less drop in.

I've got a fair amount of cleanup to do on Spencer's package before I can even think of committing it (ANSIfy function headers, fix comment formatting so that pg_indent won't completely destroy 'em, etc). But I'll be glad to send the modified interface file (adt/regexp.c) to anyone who'd like to try getting it to work with sre.

			regards, tom lane
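(For anyone wanting to try that experiment, the expand-then-match idea looks like this self-contained sketch, which uses the C library's mbstowcs() as a stand-in for PostgreSQL's pg_mb2wchar -- the real adt/regexp.c glue differs:)

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

int
main(void)
{
	const char *mb_input = "caf\xc3\xa9 quotidien";	/* UTF-8 bytes */
	wchar_t		wide[64];
	size_t		nchars;

	setlocale(LC_ALL, "en_US.UTF-8");

	/* decode the multibyte string once, up front */
	nchars = mbstowcs(wide, mb_input, sizeof(wide) / sizeof(wide[0]));
	if (nchars == (size_t) -1)
	{
		perror("mbstowcs");
		return 1;
	}

	/* a regex engine run over this fixed-width array pays a constant
	 * cost per character, instead of re-decoding multibyte sequences
	 * while matching */
	printf("%zu wide chars from %zu bytes\n", nchars, strlen(mb_input));
	return 0;
}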
On Tue, 2003-02-04 at 17:26, Tom Lane wrote:
> Proof of concept:
> [...]

Very cool work, Tom.

> In the first case there are only four distinct patterns used, so we're
> running with cached precompiled regexes. In the other cases a new regex
> compilation must occur at each row.

Speaking of which, is there (or should there be) some mechanism for increasing the size of the compiled pattern cache? Perhaps a GUC var?

Cheers,

Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
Neil Conway <neilc@samurai.com> writes:
> Speaking of which, is there (or should there be) some mechanism for
> increasing the size of the compiled pattern cache? Perhaps a GUC var?

I thought about that while I was messing with the code, but I don't think there's much point in it, unless someone wants to invest the work to make the cache search much smarter (maybe a hash table?). At present a larger cache will cost you in extra search time, especially in the case where the pattern isn't in the cache.

I did do the work last night to convert the cache management algorithm into a self-organizing list (a la Knuth) rather than a round-robin search as it was before. This should reduce the expected number of comparisons for cases where the cache is actually accomplishing something, but of course it's no help if you have too many patterns for the cache.

			regards, tom lane
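(A self-organizing list here means move-to-front: each cache hit is pulled to the head of the list, so hot patterns are found after very few comparisons. A minimal sketch with hypothetical names, not the committed code:)

#include <string.h>

#define MAX_CACHED_RES 32

/* Hypothetical cache entry; the real regexp.c entry also carries the
 * compiled regex alongside the source text. */
typedef struct cached_re
{
	char	   *pattern;		/* regex source text */
} cached_re;

static cached_re *re_cache[MAX_CACHED_RES];
static int		num_cached;

/* Move-to-front search (Knuth's self-organizing list). */
static cached_re *
lookup_re(const char *pattern)
{
	int		i;

	for (i = 0; i < num_cached; i++)
	{
		if (strcmp(re_cache[i]->pattern, pattern) == 0)
		{
			cached_re  *hit = re_cache[i];

			/* slide entries 0..i-1 down one slot, put the hit first */
			memmove(&re_cache[1], &re_cache[0], i * sizeof(cached_re *));
			re_cache[0] = hit;
			return hit;
		}
	}
	return NULL;		/* miss: compile, insert at the front, and evict
						 * re_cache[MAX_CACHED_RES - 1] if full */
}

Unlike the ring-buffer scheme, a miss still scans the whole list, which is why a larger cache would cost extra search time without a smarter lookup structure.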
Ok. The original complaint can be easily solved, at least for single-byte encoding databases. With the small patch (against 7.3.1) included, I got the following results.

test1:

select count(*) from tenk1 where 'quotidian' ~ string4;
 count
-------
     0
(1 row)

Time: 113.81 ms

test2:

select count(*) from tenk1 where 'quotidian' ~ stringu1;
 count
-------
     0
(1 row)

Time: 419.36 ms

test3:

select count(*) from tenk1 where 'quotidian' ~* stringu1;
 count
-------
     0
(1 row)

Time: 1633.21 ms

The ratio for test3/test1 is now 14.35. Although not as great as Spencer's new code according to Tom (with that code test3/test1 = 9.75), it seems much better than the original 7.3 code (test3/test1 = 689.71). Note that if the database encoding is not a single-byte one, it should be as slow as the original 7.3. There is no easy way to fix that.

P.S. With the patch all the regression tests have passed on my Linux box.
--
Tatsuo Ishii

----------------------------------- cut here -------------------------------------
*** postgresql-7.3.1/src/backend/regex/regcomp.c.orig	2002-09-05 05:31:24.000000000 +0900
--- postgresql-7.3.1/src/backend/regex/regcomp.c	2003-02-05 10:05:03.000000000 +0900
***************
*** 178,183 ****
--- 178,186 ----
  	int			i;
  	size_t		len;
  	pg_wchar   *wcp;
+ 	size_t		csetsize;
+ 
+ 	csetsize = (pg_database_encoding_max_length() == 1) ? (SCHAR_MAX - SCHAR_MIN + 1) : NC;
  
  	if (cclasses == NULL)
  		cclasses = cclass_init();
***************
*** 211,217 ****
  
  	/* do the mallocs early so failure handling is easy */
  	g = (struct re_guts *) malloc(sizeof(struct re_guts) +
! 								  (NC - 1) * sizeof(cat_t));
  	if (g == NULL)
  		return REG_ESPACE;
  	p->ssize = len / (size_t) 2 * (size_t) 3 + (size_t) 1;	/* ugh */
--- 214,220 ----
  
  	/* do the mallocs early so failure handling is easy */
  	g = (struct re_guts *) malloc(sizeof(struct re_guts) +
! 								  (csetsize - 1) * sizeof(cat_t));
  	if (g == NULL)
  		return REG_ESPACE;
  	p->ssize = len / (size_t) 2 * (size_t) 3 + (size_t) 1;	/* ugh */
***************
*** 235,241 ****
  		p->pbegin[i] = 0;
  		p->pend[i] = 0;
  	}
! 	g->csetsize = NC;
  	g->sets = NULL;
  	g->setbits = NULL;
  	g->ncsets = 0;
--- 238,244 ----
  		p->pbegin[i] = 0;
  		p->pend[i] = 0;
  	}
! 	g->csetsize = csetsize;
  	g->sets = NULL;
  	g->setbits = NULL;
  	g->ncsets = 0;
***************
*** 248,254 ****
  	g->nsub = 0;
  	g->ncategories = 1;			/* category 0 is "everything else" */
  	g->categories = &g->catspace[-(CHAR_MIN)];
! 	memset((char *) g->catspace, 0, NC * sizeof(cat_t));
  	g->backrefs = 0;
  
  	/* do it */
--- 251,257 ----
  	g->nsub = 0;
  	g->ncategories = 1;			/* category 0 is "everything else" */
  	g->categories = &g->catspace[-(CHAR_MIN)];
! 	memset((char *) g->catspace, 0, csetsize * sizeof(cat_t));
  	g->backrefs = 0;
  
  	/* do it */
Tom Lane wrote on Wed, 2003-02-05 at 01:35:
> Neil Conway <neilc@samurai.com> writes:
> > Speaking of which, is there (or should there be) some mechanism for
> > increasing the size of the compiled pattern cache? Perhaps a GUC var?
>
> I thought about that while I was messing with the code, but I don't
> think there's much point in it, unless someone wants to invest the work
> to make the cache search much smarter (maybe a hash table?). At present
> a larger cache will cost you in extra search time, especially in the
> case where the pattern isn't in the cache.
>
> I did do the work last night to convert the cache management algorithm
> into a self-organizing list (a la Knuth) rather than a round-robin
> search as it was before. This should reduce the expected number of
> comparisons for cases where the cache is actually accomplishing
> something, but of course it's no help if you have too many patterns
> for the cache.

Perhaps the decision whether to try to use the cache at all could be made at planning time, depending on statistics information?

Another idea is to make a special regex type and store the regexes pre-parsed (i.e. in some fast-load form)?

--
Hannu Krosing <hannu@tm.ee>
Hannu Krosing <hannu@tm.ee> writes:
> Another idea is to make a special regex type and store the regexes
> pre-parsed (i.e. in some fast-load form)?

Seems unlikely that going out to disk could beat just recompiling the regexp. They're not *that* slow to compile ... at least not when we avoid the performance problem that we introduced into the older regexp code.

			regards, tom lane
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> Ok. The original complaint can be easily solved, at least for single-byte
> encoding databases. With the small patch (against 7.3.1) included, I got
> the following results.

Nice work, Tatsuo! Wade, can you confirm that this patch solves your problem?

Tatsuo, please commit into REL7_3 branch only --- I'm nearly ready to do a wholesale replacement of the regex code in HEAD, so you wouldn't accomplish much except to create a merge problem for me ...

			regards, tom lane
Confirmed. Looks like a 100-fold increase. Thanx guys. Explain output can be seen here:

http://arch.wavefire.com/pgregex.txt

-Wade Klaver

At 09:59 AM 2/5/03 -0500, Tom Lane wrote:
>Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> Ok. The original complaint can be easily solved, at least for single-byte
>> encoding databases. With the small patch (against 7.3.1) included, I got
>> the following results.
>
>Nice work, Tatsuo! Wade, can you confirm that this patch solves your
>problem?
>
>Tatsuo, please commit into REL7_3 branch only --- I'm nearly ready to do
>a wholesale replacement of the regex code in HEAD, so you wouldn't
>accomplish much except to create a merge problem for me ...
>
>			regards, tom lane
> Nice work, Tatsuo! Wade, can you confirm that this patch solves your
> problem?
>
> Tatsuo, please commit into REL7_3 branch only --- I'm nearly ready to do
> a wholesale replacement of the regex code in HEAD, so you wouldn't
> accomplish much except to create a merge problem for me ...

Ok. I have just committed it into the 7.3 stable branch.
--
Tatsuo Ishii
Tom Lane wrote on Wed, 2003-02-05 at 08:12:
> Hannu Krosing <hannu@tm.ee> writes:
> > Another idea is to make a special regex type and store the regexes
> > pre-parsed (i.e. in some fast-load form)?
>
> Seems unlikely that going out to disk could beat just recompiling the
> regexp.

We have to get _something_ from disk anyway. Currently we fetch the regex source code, but if there were some format that is faster to load then that could be an option.

--
Hannu Krosing <hannu@tm.ee>
Can this improvement get merged up into CVS current, or did you already do that, Tom?

---------------------------------------------------------------------------

Tatsuo Ishii wrote:
> > Nice work, Tatsuo! Wade, can you confirm that this patch solves your
> > problem?
> >
> > Tatsuo, please commit into REL7_3 branch only --- I'm nearly ready to do
> > a wholesale replacement of the regex code in HEAD, so you wouldn't
> > accomplish much except to create a merge problem for me ...
>
> Ok. I have just committed it into the 7.3 stable branch.
> --
> Tatsuo Ishii

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Can this improvement get merged up into CVS current, or did you already
> do that, Tom?

It's irrelevant to current.

			regards, tom lane