Re: database vacuum from cron hanging - Mailing list pgsql-hackers

From Tom Lane
Subject Re: database vacuum from cron hanging
Msg-id 1046.1129127186@sss.pgh.pa.us
In response to Re: database vacuum from cron hanging  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> (gdb) p BufferDescriptors[781]
>> $1 = {tag = {rnode = {spcNode = 1663, dbNode = 16385, relNode = 2666}, blockNum = 1}, flags = 70, usage_count = 5, refcount = 4294967294,
>> wait_backend_pid = 748, buf_hdr_lock = 0 '\0', buf_id = 781, freeNext = -2, io_in_progress_lock = 1615, content_lock = 1616}

> Whoa.  refcount -2?
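
For reference, a minimal sketch of why the value gdb printed reads as -2. It assumes refcount is stored as an unsigned 32-bit field, which the printed value (2^32 - 2) suggests; the variable names are illustrative only.

```python
import ctypes

# Value gdb printed for refcount; assumed here to be an unsigned 32-bit field.
raw_refcount = 4294967294

# Reinterpret the same 32 bits as a signed integer.
as_signed = ctypes.c_int32(raw_refcount).value

print(as_signed)
```

In other words, the counter appears to have been decremented two more times than it was incremented.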

After meditating overnight, I have a theory.  There seem to be two basic
categories of possible explanations for the above state:

1. Some path of control decrements refcount more times than it increments it.
2. Occasionally, an intended increment gets lost.

Yesterday I was thinking in terms of #1, but it really doesn't seem to
fit the observed facts very well.  I don't see a reason why such a bug
would preferentially affect pg_constraint_contypid_index; also it seems
like it would be fairly easily repeatable by many people.  The pin
tracking logic is all internal to individual backends and doesn't look
very vulnerable to, say, timing-related glitches.

On the other hand, it's not hard to concoct a plausible explanation
using #2: suppose that two backends wanting to pin the same buffer at
about the same time pick up the same original value of refcount, add
one, store back.  This is not supposed to happen of course, but maybe
the compiler is optimizing some code in a way that gives this effect
(i.e., by reading refcount before the buffer header spinlock has been
acquired).  Now we can account for pg_constraint_contypid_index being
hit: we know you use domains a lot, and that uncached catalog search in
GetDomainConstraints would result in a whole lot of concurrent accesses
to that particular index, so it would be a likely place for such a bug
to manifest.  And we can account for you being the only one seeing it:
this theory makes it compiler- and platform-dependent.
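
To make theory #2 concrete, here is a rough single-threaded replay of the suspected interleaving. It is only a sketch, not PostgreSQL code: `BufferHeader`, `racy_pin_pair`, and `unpin` are hypothetical stand-ins, and the choice of exactly two lost increments is an assumption picked to match the observed value.

```python
class BufferHeader:
    """Hypothetical stand-in for a shared buffer header; only refcount is modeled."""

    def __init__(self):
        self.refcount = 0


def racy_pin_pair(buf):
    """Replay two backends pinning the buffer with a lost update."""
    # Both backends read refcount before either one has stored its result,
    # as if the read had been moved ahead of the spinlock acquisition.
    seen_by_a = buf.refcount
    seen_by_b = buf.refcount
    buf.refcount = seen_by_a + 1
    buf.refcount = seen_by_b + 1  # overwrites backend A's increment


def unpin(buf):
    """Release one pin; the decrement itself is assumed to work correctly."""
    buf.refcount -= 1


buf = BufferHeader()

# Two racy episodes: four pins are taken, but only two are recorded.
racy_pin_pair(buf)
racy_pin_pair(buf)
after_pins = buf.refcount

# All four pins are later released normally.
for _ in range(4):
    unpin(buf)
after_unpins = buf.refcount

# The same value viewed as an unsigned 32-bit field, as gdb would print it.
as_unsigned = after_unpins & 0xFFFFFFFF

print(after_pins, after_unpins, as_unsigned)
```

Under these assumptions, each lost increment leaves the counter one lower than it should be once the corresponding pins are released, so two such collisions would be enough to produce the refcount shown above.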

Accordingly: what's the platform exactly? (CPU type, and OS just in
case.)  What compiler was used?  (If gcc, show "gcc -v" output.)
Also please show the output of "pg_config".
        regards, tom lane

