Thread: Assert failure found in 8.1RC1
Hey all,

While trying to get a reproducible test case for my CS storm problem (see http://archives.postgresql.org/pgsql-hackers/2005-10/msg00585.php), I upgraded to 8.1RC1 and encountered the following assert:

TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno && shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c", Line: 309)

On the good side, I have so far been unable to get a sustained CS storm with this level of code. Looks like something might have changed for the better in the last 2 weeks?

For the assert, I had 5 sets of my app running, each with 8 potential outstanding queries. I then threw my test at the db with 20 more queries, and took the above failure.

creagrs=# select version();
 version
---------------------------------------------------------------------------------------------------------
 PostgreSQL 8.1RC1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.1 (Mandrake Linux 9.2 3.3.1-2mdk)

BINDIR = /usr/local/pgsql810/bin
DOCDIR = /usr/local/pgsql810/doc
INCLUDEDIR = /usr/local/pgsql810/include
PKGINCLUDEDIR = /usr/local/pgsql810/include
INCLUDEDIR-SERVER = /usr/local/pgsql810/include/server
LIBDIR = /usr/local/pgsql810/lib
PKGLIBDIR = /usr/local/pgsql810/lib
LOCALEDIR =
MANDIR = /usr/local/pgsql810/man
SHAREDIR = /usr/local/pgsql810/share
SYSCONFDIR = /usr/local/pgsql810/etc
PGXS = /usr/local/pgsql810/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--enable-syslog' '--prefix=/usr/local/pgsql810' '--enable-debug' '--enable-cassert'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Winline -Wendif-labels -fno-strict-aliasing -g
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,/usr/local/pgsql810/lib
LDFLAGS_SL =
LIBS = -lpgport -lz -lreadline -lncurses -lcrypt -lresolv -lnsl -ldl -lm -lbsd
VERSION = PostgreSQL 8.1RC1

Thanks,
Rob

-- Robert Creager Advisory Software Engineer Data Management Group Sun Microsystems Robert.Creager@Sun.com 303.673.2365 Office 888.912.4458 Pager
Robert Creager <Robert.Creager@Sun.com> writes: > TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno && > shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c", > Line: 309) http://archives.postgresql.org/pgsql-hackers/2005-10/msg01385.php If you can reproduce the failure with any reliability, please try one or both of the proposed patches: http://archives.postgresql.org/pgsql-patches/2005-10/msg00240.php http://archives.postgresql.org/pgsql-patches/2005-10/msg00248.php regards, tom lane
On Wed, 02 Nov 2005 15:37:05 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Creager <Robert.Creager@Sun.com> writes: > > I can reproduce very quickly. Looks like I should try the patch in 248 > > first to see if it fixes 8.1RC1? > > Excellent. Yes, the second patch is higher priority, but please try > both while you're at it. > I've put in patch 2. I'm kicking the s**t out of it, with no problems so far. I'll let it run for a while longer. One note is that I did hit the CS switch problem, but with a combination of production app and my test app. But, it took much more activity, wasn't as severe (queries were typically staying < 10 seconds) and the db came out of it a few minutes after my test app stopped. I'll put in the first patch and re-run the tests. Cheers, Rob
On Wed, 02 Nov 2005 15:19:44 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Creager <Robert.Creager@Sun.com> writes: > > TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno && > > shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c", > > Line: 309) > > http://archives.postgresql.org/pgsql-hackers/2005-10/msg01385.php > > If you can reproduce the failure with any reliability, please try > one or both of the proposed patches: > > http://archives.postgresql.org/pgsql-patches/2005-10/msg00240.php > http://archives.postgresql.org/pgsql-patches/2005-10/msg00248.php > Ran with both for an hour with no problem, where I could produce the ASSERT failure within minutes for the non patched version. Thanks, Rob
Robert Creager <Robert.Creager@Sun.com> writes: > Ran with both for an hour with no problem, where I could produce the ASSERT > failure within minutes for the non patched version. Great. I'll go ahead and commit the smaller fix into HEAD and the back branches, and hold the larger fix for 8.2. It's curious that two different people stumbled across this just recently, when the bug has been there since 7.2. I suppose that the addition of pg_subtrans increased the probability of seeing the bug by a considerable amount, but I'm still surprised it wasn't identified before. At the very least, we should have heard about it earlier in the 8.0 release cycle ... regards, tom lane
On Wed, Nov 02, 2005 at 06:45:21PM -0500, Tom Lane wrote: > Robert Creager <Robert.Creager@Sun.com> writes: > > Ran with both for an hour with no problem, where I could produce the ASSERT > > failure within minutes for the non patched version. > > Great. I'll go ahead and commit the smaller fix into HEAD and the back > branches, and hold the larger fix for 8.2. > > It's curious that two different people stumbled across this just > recently, when the bug has been there since 7.2. I suppose that the > addition of pg_subtrans increased the probability of seeing the bug by > a considerable amount, but I'm still surprised it wasn't identified > before. At the very least, we should have heard about it earlier in > the 8.0 release cycle ... Well, the common theme in each case IIRC is a fairly high transaction rate; on the order of hundreds if not thousands per second. Could something like that be added to regression, or maybe as a separate test case for the buildfarm? -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > Could something like that be added to regression, or maybe as a seperate > test case for the buildfarm? If you don't have a self-contained, reproducible test case, it's a bit pointless to suggest adding the nonexistent test case to the regression suite. regards, tom lane
On Fri, Nov 04, 2005 at 04:35:10PM -0500, Tom Lane wrote: > "Jim C. Nasby" <jnasby@pervasive.com> writes: > > Could something like that be added to regression, or maybe as a separate > > test case for the buildfarm? > > If you don't have a self-contained, reproducible test case, it's a bit > pointless to suggest adding the nonexistent test case to the regression > suite. Well, for things like race conditions I don't know that you can create reproducible test cases. My point was that this bug was exposed by databases with workloads that involved very high transaction rates. I know in the case of my client this is due to some sub-optimal design decisions, and I believe the other case was similar. My suggestion is that having a test that involves a lot of row-by-row type operations that generate a very high transaction rate would help expose these kinds of bugs. Of course if someone can come up with a self-contained reproducible test case for this race condition that would be great as well. :) -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim C. Nasby wrote: >On Fri, Nov 04, 2005 at 04:35:10PM -0500, Tom Lane wrote: > > >>"Jim C. Nasby" <jnasby@pervasive.com> writes: >> >> >>>Could something like that be added to regression, or maybe as a separate >>>test case for the buildfarm? >>> >>> >>If you don't have a self-contained, reproducible test case, it's a bit >>pointless to suggest adding the nonexistent test case to the regression >>suite. >> >> > >Well, for things like race conditions I don't know that you can create >reproducible test cases. My point was that this bug was exposed by >databases with workloads that involved very high transaction rates. I >know in the case of my client this is due to some sub-optimal design >decisions, and I believe the other case was similar. My suggestion is >that having a test that involves a lot of row-by-row type operations >that generate a very high transaction rate would help expose these kinds >of bugs. > >Of course if someone can come up with a self-contained reproducible test >case for this race condition that would be great as well. :) > > These conditions make it quite unsuitable for buildfarm, which is designed as a thin veneer over the postgres build process, and intended to run anywhere you can build postgres. Maybe you could use one of the Linux labs, since your client is on RHEL. cheers andrew
On Fri, Nov 04, 2005 at 05:26:25PM -0500, Andrew Dunstan wrote: > >Well, for things like race conditions I don't know that you can create > >reproducible test cases. My point was that this bug was exposed by > >databases with workloads that involved very high transaction rates. I > >know in the case of my client this is due to some sub-optimal design > >decisions, and I believe the other case was similar. My suggestion is > >that having a test that involves a lot of row-by-row type operations > >that generate a very high transaction rate would help expose these kinds > >of bugs. > > > >Of course if someone can come up with a self-contained reproducible test > >case for this race condition that would be great as well. :) > > > > > > These conditions make it quite unsuitable for buildfarm, which is > designed as a thin veneer over the postgres build process, and intended > to run anywhere you can build postgres. > > Maybe you could use one of the Linux labs, since your client is on RHEL. I'm not worried about my client, I'm just thinking of a way to better ferret out bugs like this. And there's no real reason why something like this couldn't be part of regression, or an additional build target. BTW, I just realized that part of the answer to Tom's musing about why this hasn't been seen before now is that few (if any) regular users are running with asserts turned on, so odds are good that they'd never know if this problem occurred or not. Further argument for trying to test this on the buildfarm and/or enabling assertions by default, IMHO. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Fri, 4 Nov 2005, Jim C. Nasby wrote: > On Fri, Nov 04, 2005 at 05:26:25PM -0500, Andrew Dunstan wrote: >>> Well, for things like race conditions I don't know that you can create >>> reproducible test cases. My point was that this bug was exposed by >>> databases with workloads that involved very high transaction rates. I >>> know in the case of my client this is due to some sub-optimal design >>> decisions, and I believe the other case was similar. My suggestion is >>> that having a test that involves a lot of row-by-row type operations >>> that generate a very high transaction rate would help expose these kinds >>> of bugs. >>> >>> Of course if someone can come up with a self-contained reproducible test >>> case for this race condition that would be great as well. :) >>> >>> >> >> These conditions make it quite unsuitable for buildfarm, which is >> designed as a thin veneer over the postgres build process, and intended >> to run anywhere you can build postgres. >> >> Maybe you could use one of the Linux labs, since your client is on RHEL. > > I'm not worried about my client, I'm just thinking of a way to better > ferret out bugs like this. And there's no real reason why something like > this couldn't be part of regression, or an additional build target. For all the talk about "couldn't it be part of regression", I haven't seen anyone submit a patch that would test for it ... since I believe both you and Tom have both stated that "for things like race conditions, I don't know that you can create reproducible cases", can you submit a patch for how you propose this should be added to the regression tests? ---- Marc G. Fournier Hub.Org Networking Services (http://www.hub.org) Email: scrappy@hub.org Yahoo!: yscrappy ICQ: 7615664
On Fri, Nov 04, 2005 at 08:46:27PM -0400, Marc G. Fournier wrote: > On Fri, 4 Nov 2005, Jim C. Nasby wrote: > For all the talk about "couldn't it be part of regression", I haven't seen > anyone submit a patch that would test for it ... since I believe both you > and Tom have both stated that "for things like race conditions, I don't > know that you can create reproducible cases", can you submit a patch for > how you propose this should be added to the regression tests? I have an idea, but it might be better if Robert could produce a test case since it would cover both a context storm issue as well as this race condition. Barring that, my idea was to spawn a number of processes, all of which were trying to insert/update a random value in a table using David Fetter's plpgsql code for doing a merge. This would produce a heavy workload that also used subtransactions (due to the exception handling in plpgsql). Suggestions for a better test welcome... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
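[Editor's sketch] The workload described in the message above (many concurrent sessions, each calling a plpgsql merge function with random keys, so that every call opens a subtransaction) might be generated roughly as follows. This is only an illustration under stated assumptions: the table `merge_test`, the function `merge_kv`, and the Python helper names are hypothetical, and the PL/pgSQL body is the common UPDATE-then-INSERT retry loop, not David Fetter's actual code. The script only produces SQL text; it does not connect to a server, and it is not known whether this load would actually trigger the assert.

```python
import random

# Hypothetical schema and merge function (names are assumptions, not from
# the thread). The inner BEGIN ... EXCEPTION block runs as a subtransaction,
# which is the property the proposed test relies on.
SETUP_SQL = """
CREATE TABLE merge_test (k integer PRIMARY KEY, v text);

CREATE FUNCTION merge_kv(key integer, data text) RETURNS void AS $$
BEGIN
    LOOP
        UPDATE merge_test SET v = data WHERE k = key;
        IF found THEN
            RETURN;
        END IF;
        BEGIN
            INSERT INTO merge_test (k, v) VALUES (key, data);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- another session inserted first; retry the UPDATE
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
"""


def worker_script(worker_id, calls, key_space, seed=0):
    """Return SQL text for one session: `calls` merges on random keys.

    Each SELECT is its own statement, so in autocommit mode each call
    is a separate transaction.
    """
    rng = random.Random(seed + worker_id)  # deterministic per worker
    lines = []
    for i in range(calls):
        key = rng.randrange(key_space)
        lines.append(f"SELECT merge_kv({key}, 'w{worker_id}-{i}');")
    return "\n".join(lines) + "\n"


def build_workload(workers, calls, key_space, seed=0):
    """Map a file name to the SQL script for each concurrent session."""
    return {
        f"worker_{w}.sql": worker_script(w, calls, key_space, seed)
        for w in range(workers)
    }
```

After loading `SETUP_SQL` once, each generated file could be fed to its own concurrent session, for example with `psql -f worker_0.sql`. A small `key_space` makes sessions collide on the same keys more often, which exercises the exception path; the worker and call counts needed to stress the server are left open.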
On Tue, 08 Nov 2005 14:09:58 -0600 "Jim C. Nasby" <jnasby@pervasive.com> wrote: > On Fri, Nov 04, 2005 at 08:46:27PM -0400, Marc G. Fournier wrote: > > On Fri, 4 Nov 2005, Jim C. Nasby wrote: > > For all the talk about "couldn't it be part of regression", I haven't seen > > anyone submit a patch that would test for it ... since I believe both you > > and Tom have both stated that "for things like race conditions, I don't > > know that you can create reproducible cases", can you submit a patch for > > how you propose this should be added to the regression tests? > > I have an idea, but it might be better if Robert could produce a test > case since it would cover both a context storm issue as well as this > race condition. > Actually, I have a test case. I just sent it out to Tom a couple of hours ago. The quick and dirty is that it shows the problem after running for about 20 minutes on my Xeon system with 8.1.0... I cannot get it to fail on my AMD system with a much higher load... I can send it to others who are interested. The e-mail with dump, module and script is just over 1Mb. Cheers, Rob
On Tue, Nov 08, 2005 at 02:09:35PM -0700, Robert Creager wrote: > On Tue, 08 Nov 2005 14:09:58 -0600 > "Jim C. Nasby" <jnasby@pervasive.com> wrote: > > > On Fri, Nov 04, 2005 at 08:46:27PM -0400, Marc G. Fournier wrote: > > > On Fri, 4 Nov 2005, Jim C. Nasby wrote: > > > For all the talk about "couldn't it be part of regression", I haven't seen > > > anyone submit a patch that would test for it ... since I believe both you > > > and Tom have both stated that "for things like race conditions, I don't > > > know that you can create reproducible cases", can you submit a patch for > > > how you propose this should be added to the regression tests? > > > > I have an idea, but it might be better if Robert could produce a test > > case since it would cover both a context storm issue as well as this > > race condition. > > > > Actually, I have a test case. I just sent it out to Tom a couple of hours ago. > The quick and dirty is that it shows the problem after running for about 20 > minutes on my Xeon system with 8.1.0... I cannot get it to fail on my AMD > system with a much higher load... > > I can send it to others who are interested. The e-mail with dump, module and > script is just over 1Mb. Just to clarify, did it show the assert failure, the context switch storm, or both? Yes, I'd like to take a look at this if you could send it on to me. Is there any simple way to populate the database? I doubt people would be keen on having a 1MB dump in CVS... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Tue, 08 Nov 2005 15:36:18 -0600 "Jim C. Nasby" <jnasby@pervasive.com> wrote: > > Just to clarify, did it show the assert failure, the context switch > storm, or both? I didn't try for the assert after the patch. I was developing the test when I ran across the assert problem. It should trigger the assert problem. > > Yes, I'd like to take a look at this if you could send it on to me. Is > there any simple way to populate the database? I doubt people would be > keen on having a 1MB dump in CVS... Hmmm... Should be possible to populate all the data algorithmically. For the most part, the specific data doesn't matter, just the general patterns in the data. I'll re-send the e-mail to you. Cheers, Rob