Thread: Assert failure found in 8.1RC1

Assert failure found in 8.1RC1

From
Robert Creager
Date:
Hey all,

While trying to get a reproducible test case for my CS storm problem (see
http://archives.postgresql.org/pgsql-hackers/2005-10/msg00585.php), I upgraded
to 8.1RC1 and encountered the following assert:

TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno &&
shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c",
Line: 309)

On the good side, I have so far been unable to get a sustained CS storm with
this level of code.  Looks like something might have changed for the better in
the last 2 weeks?

For the assert, I had 5 sets of my app running, each with 8 potential
outstanding queries.  I then threw my test at the db with 20 more queries, and
took the above failure.

creagrs=# select version();
                                                 version
---------------------------------------------------------------------------------------------------------
 PostgreSQL 8.1RC1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.1 (Mandrake Linux 9.2 3.3.1-2mdk)

BINDIR = /usr/local/pgsql810/bin
DOCDIR = /usr/local/pgsql810/doc
INCLUDEDIR = /usr/local/pgsql810/include
PKGINCLUDEDIR = /usr/local/pgsql810/include
INCLUDEDIR-SERVER = /usr/local/pgsql810/include/server
LIBDIR = /usr/local/pgsql810/lib
PKGLIBDIR = /usr/local/pgsql810/lib
LOCALEDIR =
MANDIR = /usr/local/pgsql810/man
SHAREDIR = /usr/local/pgsql810/share
SYSCONFDIR = /usr/local/pgsql810/etc
PGXS = /usr/local/pgsql810/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--enable-syslog' '--prefix=/usr/local/pgsql810' '--enable-debug'
'--enable-cassert'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Winline -Wendif-labels
-fno-strict-aliasing -g
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,/usr/local/pgsql810/lib
LDFLAGS_SL =
LIBS = -lpgport -lz -lreadline -lncurses -lcrypt -lresolv -lnsl -ldl -lm -lbsd
VERSION = PostgreSQL 8.1RC1

Thanks,
Rob

-- 
Robert Creager
Advisory Software Engineer
Data Management Group
Sun Microsystems
Robert.Creager@Sun.com
303.673.2365 Office
888.912.4458 Pager



Re: Assert failure found in 8.1RC1

From
Tom Lane
Date:
Robert Creager <Robert.Creager@Sun.com> writes:
> TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno &&
> shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c",
> Line: 309)

http://archives.postgresql.org/pgsql-hackers/2005-10/msg01385.php

If you can reproduce the failure with any reliability, please try
one or both of the proposed patches:

http://archives.postgresql.org/pgsql-patches/2005-10/msg00240.php
http://archives.postgresql.org/pgsql-patches/2005-10/msg00248.php
        regards, tom lane


Re: Assert failure found in 8.1RC1

From
Robert Creager
Date:
On Wed, 02 Nov 2005 15:37:05 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Robert Creager <Robert.Creager@Sun.com> writes:
> > I can reproduce very quickly.  Looks like I should try the patch in 248
> > first to see if it fixes 8.1RC1?
> 
> Excellent.  Yes, the second patch is higher priority, but please try
> both while you're at it.
> 

I've put in patch 2.  I'm kicking the s**t out of it, with no problems so far. 
I'll let it run for a while longer.

One note is that I did hit the CS storm problem, but only with a combination of
the production app and my test app.  It took much more activity, wasn't as
severe (queries typically stayed under 10 seconds), and the db came out of it a
few minutes after my test app stopped.

I'll put in the first patch and re-run the tests.

Cheers,
Rob


Re: Assert failure found in 8.1RC1

From
Robert Creager
Date:
On Wed, 02 Nov 2005 15:19:44 -0500
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Robert Creager <Robert.Creager@Sun.com> writes:
> > TRAP: FailedAssertion("!(shared->page_number[slotno] == pageno &&
> > shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)", File: "slru.c",
> > Line: 309)
> 
> http://archives.postgresql.org/pgsql-hackers/2005-10/msg01385.php
> 
> If you can reproduce the failure with any reliability, please try
> one or both of the proposed patches:
> 
> http://archives.postgresql.org/pgsql-patches/2005-10/msg00240.php
> http://archives.postgresql.org/pgsql-patches/2005-10/msg00248.php
> 

Ran with both for an hour with no problem, where I could produce the ASSERT
failure within minutes for the non patched version.

Thanks,
Rob


Re: Assert failure found in 8.1RC1

From
Tom Lane
Date:
Robert Creager <Robert.Creager@Sun.com> writes:
> Ran with both for an hour with no problem, where I could produce the ASSERT
> failure within minutes for the non patched version.

Great.  I'll go ahead and commit the smaller fix into HEAD and the back
branches, and hold the larger fix for 8.2.

It's curious that two different people stumbled across this just
recently, when the bug has been there since 7.2.  I suppose that the
addition of pg_subtrans increased the probability of seeing the bug by
a considerable amount, but I'm still surprised it wasn't identified
before.  At the very least, we should have heard about it earlier in
the 8.0 release cycle ...
        regards, tom lane


Re: Assert failure found in 8.1RC1

From
"Jim C. Nasby"
Date:
On Wed, Nov 02, 2005 at 06:45:21PM -0500, Tom Lane wrote:
> Robert Creager <Robert.Creager@Sun.com> writes:
> > Ran with both for an hour with no problem, where I could produce the ASSERT
> > failure within minutes for the non patched version.
> 
> Great.  I'll go ahead and commit the smaller fix into HEAD and the back
> branches, and hold the larger fix for 8.2.
> 
> It's curious that two different people stumbled across this just
> recently, when the bug has been there since 7.2.  I suppose that the
> addition of pg_subtrans increased the probability of seeing the bug by
> a considerable amount, but I'm still surprised it wasn't identified
> before.  At the very least, we should have heard about it earlier in
> the 8.0 release cycle ...

Well, the common theme in each case IIRC is a fairly high transaction
rate; on the order of hundreds if not thousands per second.

Could something like that be added to regression, or maybe as a separate
test case for the buildfarm?
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Assert failure found in 8.1RC1

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> Could something like that be added to regression, or maybe as a separate
> test case for the buildfarm?

If you don't have a self-contained, reproducible test case, it's a bit
pointless to suggest adding the nonexistent test case to the regression
suite.
        regards, tom lane


Re: Assert failure found in 8.1RC1

From
"Jim C. Nasby"
Date:
On Fri, Nov 04, 2005 at 04:35:10PM -0500, Tom Lane wrote:
> "Jim C. Nasby" <jnasby@pervasive.com> writes:
> > Could something like that be added to regression, or maybe as a separate
> > test case for the buildfarm?
> 
> If you don't have a self-contained, reproducible test case, it's a bit
> pointless to suggest adding the nonexistent test case to the regression
> suite.

Well, for things like race conditions I don't know that you can create
reproducible test cases. My point was that this bug was exposed by
databases with workloads that involved very high transaction rates. I
know in the case of my client this is due to some sub-optimal design
decisions, and I believe the other case was similar. My suggestion is
that having a test that involves a lot of row-by-row type operations
that generate a very high transaction rate would help expose these kinds
of bugs.

Of course if someone can come up with a self-contained reproducible test
case for this race condition that would be great as well. :)
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Assert failure found in 8.1RC1

From
Andrew Dunstan
Date:

Jim C. Nasby wrote:

>On Fri, Nov 04, 2005 at 04:35:10PM -0500, Tom Lane wrote:
>
>>"Jim C. Nasby" <jnasby@pervasive.com> writes:
>>
>>>Could something like that be added to regression, or maybe as a separate
>>>test case for the buildfarm?
>>
>>If you don't have a self-contained, reproducible test case, it's a bit
>>pointless to suggest adding the nonexistent test case to the regression
>>suite.
>
>Well, for things like race conditions I don't know that you can create
>reproducible test cases. My point was that this bug was exposed by
>databases with workloads that involved very high transaction rates. I
>know in the case of my client this is due to some sub-optimal design
>decisions, and I believe the other case was similar. My suggestion is
>that having a test that involves a lot of row-by-row type operations
>that generate a very high transaction rate would help expose these kinds
>of bugs.
>
>Of course if someone can come up with a self-contained reproducible test
>case for this race condition that would be great as well. :)

These conditions make it quite unsuitable for buildfarm, which is 
designed as a thin veneer over the postgres build process, and intended 
to run anywhere you can build postgres.

Maybe you could use one of the Linux labs, since your client is on RHEL.

cheers

andrew


Re: Assert failure found in 8.1RC1

From
"Jim C. Nasby"
Date:
On Fri, Nov 04, 2005 at 05:26:25PM -0500, Andrew Dunstan wrote:
> >Well, for things like race conditions I don't know that you can create
> >reproducible test cases. My point was that this bug was exposed by
> >databases with workloads that involved very high transaction rates. I
> >know in the case of my client this is due to some sub-optimal design
> >decisions, and I believe the other case was similar. My suggestion is
> >that having a test that involves a lot of row-by-row type operations
> >that generate a very high transaction rate would help expose these kinds
> >of bugs.
> >
> >Of course if someone can come up with a self-contained reproducible test
> >case for this race condition that would be great as well. :)
> > 
> >
> 
> These conditions make it quite unsuitable for buildfarm, which is 
> designed as a thin veneer over the postgres build process, and intended 
> to run anywhere you can build postgres.
> 
> Maybe you could use one of the Linux labs, since your client is on RHEL.

I'm not worried about my client, I'm just thinking of a way to better
ferret out bugs like this. And there's no real reason why something like
this couldn't be part of regression, or an additional build target.

BTW, I just realized that part of the answer to Tom's musing about why
this hasn't been seen before now is that few (if any) regular users are
running with asserts turned on, so odds are good that they'd never know
if this problem occurred or not. Further argument for trying to test this
on the buildfarm and/or enabling assertions by default, IMHO.
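
Whether a given build has assertions compiled in can be read off its
pg_config output, as in the CONFIGURE line of the first message in this
thread. A minimal Python sketch, assuming the output has already been captured
as text (possibly re-wrapped by a mail client); the helper names
parse_pg_config and built_with_cassert are hypothetical:

```python
import re
import shlex

# A pg_config line looks like "NAME = value"; the value may be empty.
KEY_LINE = re.compile(r"^([A-Z][A-Z_-]*) = ?(.*)$")


def parse_pg_config(text):
    """Parse pg_config output into a dict, joining wrapped continuation lines."""
    values = {}
    last_key = None
    for line in text.splitlines():
        m = KEY_LINE.match(line)
        if m:
            last_key = m.group(1)
            values[last_key] = m.group(2).strip()
        elif last_key is not None and line.strip():
            # Not a "NAME = value" line: treat it as the wrapped tail
            # of the previous value.
            values[last_key] += " " + line.strip()
    return values


def built_with_cassert(text):
    """True if the CONFIGURE options include --enable-cassert."""
    configure = parse_pg_config(text).get("CONFIGURE", "")
    return "--enable-cassert" in shlex.split(configure)
```

Fed the pg_config dump from the original report, built_with_cassert should
come back true; a typical packaged build would presumably come back false.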
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Assert failure found in 8.1RC1

From
"Marc G. Fournier"
Date:
On Fri, 4 Nov 2005, Jim C. Nasby wrote:

> On Fri, Nov 04, 2005 at 05:26:25PM -0500, Andrew Dunstan wrote:
>>> Well, for things like race conditions I don't know that you can create
>>> reproducible test cases. My point was that this bug was exposed by
>>> databases with workloads that involved very high transaction rates. I
>>> know in the case of my client this is due to some sub-optimal design
>>> decisions, and I believe the other case was similar. My suggestion is
>>> that having a test that involves a lot of row-by-row type operations
>>> that generate a very high transaction rate would help expose these kinds
>>> of bugs.
>>>
>>> Of course if someone can come up with a self-contained reproducible test
>>> case for this race condition that would be great as well. :)
>>>
>>>
>>
>> These conditions make it quite unsuitable for buildfarm, which is
>> designed as a thin veneer over the postgres build process, and intended
>> to run anywhere you can build postgres.
>>
>> Maybe you could use one of the Linux labs, since your client is on RHEL.
>
> I'm not worried about my client, I'm just thinking of a way to better
> ferret out bugs like this. And there's no real reason why something like
> this couldn't be part of regression, or an additional build target.

For all the talk about "couldn't it be part of regression", I haven't seen 
anyone submit a patch that would test for it ... since I believe both you 
and Tom have both stated that "for things like race conditions, I don't 
know that you can create reproducible cases", can you submit a patch for 
how you propose this should be added to the regression tests?

----
Marc G. Fournier           Hub.Org Networking Services (http://www.hub.org)
Email: scrappy@hub.org           Yahoo!: yscrappy              ICQ: 7615664


Re: Assert failure found in 8.1RC1

From
"Jim C. Nasby"
Date:
On Fri, Nov 04, 2005 at 08:46:27PM -0400, Marc G. Fournier wrote:
> On Fri, 4 Nov 2005, Jim C. Nasby wrote:
> For all the talk about "couldn't it be part of regression", I haven't seen 
> anyone submit a patch that would test for it ... since I believe both you 
> and Tom have both stated that "for things like race conditions, I don't 
> know that you can create reproducible cases", can you submit a patch for 
> how you propose this should be added to the regression tests?

I have an idea, but it might be better if Robert could produce a test
case since it would cover both a context storm issue as well as this
race condition.

Barring that, my idea was to spawn a number of processes, all of which
were trying to insert/update a random value in a table using David
Fetter's plpgsql code for doing a merge. This would produce a heavy
workload that also used subtransactions (due to the exception handling
in plpgsql).
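
A rough sketch of that idea, under stated assumptions: David Fetter's function
is not reproduced in this thread, so the PL/pgSQL below is a stand-in written
in the same update-then-insert style, and the table merge_test, the function
merge_test_upsert, and the Python helpers are all hypothetical names. It needs
psql on the PATH and an existing database, and it has not been shown to
trigger the assert:

```python
import os
import random
import subprocess
import tempfile

# Stand-in for the merge function. Entering the inner BEGIN ... EXCEPTION
# block starts a subtransaction, which is what touches pg_subtrans.
SETUP_SQL = """
CREATE TABLE merge_test (k integer PRIMARY KEY, v integer);

CREATE OR REPLACE FUNCTION merge_test_upsert(p_key integer, p_val integer)
RETURNS void AS $$
BEGIN
    LOOP
        UPDATE merge_test SET v = p_val WHERE k = p_key;
        IF found THEN
            RETURN;
        END IF;
        BEGIN
            INSERT INTO merge_test (k, v) VALUES (p_key, p_val);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            NULL;  -- another session inserted first; loop and retry
        END;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
"""


def worker_script(n_calls, key_space, seed):
    """Build one worker's SQL: n_calls single-statement transactions."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n_calls):
        key = rng.randrange(key_space)
        val = rng.randrange(1000000)
        lines.append("SELECT merge_test_upsert(%d, %d);" % (key, val))
    return "\n".join(lines) + "\n"


def run_stress(dbname, n_workers=20, n_calls=10000, key_space=1000000):
    """Create the objects, then run n_workers psql sessions concurrently."""
    subprocess.run(["psql", "-X", "-q", "-d", dbname],
                   input=SETUP_SQL, text=True, check=True)
    procs, paths = [], []
    for i in range(n_workers):
        f = tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False)
        f.write(worker_script(n_calls, key_space, seed=i))
        f.close()
        paths.append(f.name)
        procs.append(subprocess.Popen(
            ["psql", "-X", "-q", "-d", dbname, "-f", f.name],
            stdout=subprocess.DEVNULL))
    codes = [p.wait() for p in procs]  # wait for every session to finish
    for path in paths:
        os.remove(path)
    return codes
```

Each SELECT runs as its own transaction under psql's default autocommit, which
gives the high transaction rate. The inner block is only entered when the
UPDATE finds no row, so a key space much larger than the total number of calls
keeps the subtransaction path busy; a small one quickly degenerates into plain
UPDATEs.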

Suggestions for a better test welcome...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Assert failure found in 8.1RC1

From
Robert Creager
Date:
On Tue, 08 Nov 2005 14:09:58 -0600
"Jim C. Nasby" <jnasby@pervasive.com> wrote:

> On Fri, Nov 04, 2005 at 08:46:27PM -0400, Marc G. Fournier wrote:
> > On Fri, 4 Nov 2005, Jim C. Nasby wrote:
> > For all the talk about "couldn't it be part of regression", I haven't seen 
> > anyone submit a patch that would test for it ... since I believe both you 
> > and Tom have both stated that "for things like race conditions, I don't 
> > know that you can create reproducible cases", can you submit a patch for 
> > how you propose this should be added to the regression tests?
> 
> I have an idea, but it might be better if Robert could produce a test
> case since it would cover both a context storm issue as well as this
> race condition.
> 

Actually, I have a test case.  I just sent it out to Tom a couple of hours ago. 
The quick and dirty is that it shows the problem after running for about 20
minutes on my Xeon system with 8.1.0...  I cannot get it to fail on my AMD
system with a much higher load...

I can send it to others who are interested.  The e-mail with dump, module and
script is just over 1Mb.

Cheers,
Rob


Re: Assert failure found in 8.1RC1

From
"Jim C. Nasby"
Date:
On Tue, Nov 08, 2005 at 02:09:35PM -0700, Robert Creager wrote:
> On Tue, 08 Nov 2005 14:09:58 -0600
> "Jim C. Nasby" <jnasby@pervasive.com> wrote:
> 
> > On Fri, Nov 04, 2005 at 08:46:27PM -0400, Marc G. Fournier wrote:
> > > On Fri, 4 Nov 2005, Jim C. Nasby wrote:
> > > For all the talk about "couldn't it be part of regression", I haven't seen 
> > > anyone submit a patch that would test for it ... since I believe both you 
> > > and Tom have both stated that "for things like race conditions, I don't 
> > > know that you can create reproducible cases", can you submit a patch for 
> > > how you propose this should be added to the regression tests?
> > 
> > I have an idea, but it might be better if Robert could produce a test
> > case since it would cover both a context storm issue as well as this
> > race condition.
> > 
> 
> Actually, I have a test case.  I just sent it out to Tom a couple of hours ago. 
> The quick and dirty is that it shows the problem after running for about 20
> minutes on my Xeon system with 8.1.0...  I cannot get it to fail on my AMD
> system with a much higher load...
> 
> I can send it to others who are interested.  The e-mail with dump, module and
> script is just over 1Mb.

Just to clarify, did it show the assert failure, the context switch
storm, or both?

Yes, I'd like to take a look at this if you could send it on to me. Is
there any simple way to populate the database? I doubt people would be
keen on having a 1MB dump in CVS...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Assert failure found in 8.1RC1

From
Robert Creager
Date:
On Tue, 08 Nov 2005 15:36:18 -0600
"Jim C. Nasby" <jnasby@pervasive.com> wrote:
> 
> Just to clarify, did it show the assert failure, the context switch
> storm, or both?

I didn't try for the assert after the patch.  I was developing the test when I
ran across the assert problem.  It should trigger the assert problem.

> 
> Yes, I'd like to take a look at this if you could send it on to me. Is
> there any simple way to populate the database? I doubt people would be
> keen on having a 1MB dump in CVS...

Hmmm...  Should be possible to populate all the data algorithmically.  For the
most part, the specific data doesn't matter, just the general patterns in the
data.

I'll re-send the e-mail to you.

Cheers,
Rob