Re: Problem with locks - Mailing list pgsql-hackers

From Gregory Stark
Subject Re: Problem with locks
Date
Msg-id 87odhgelpg.fsf@oxford.xeocode.com
In response to Re: Problem with locks  (Gregory Stark <stark@enterprisedb.com>)
Responses Re: Problem with locks  (Gregory Stark <stark@enterprisedb.com>)
List pgsql-hackers
"Gregory Stark" <stark@enterprisedb.com> writes:

> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
>> Gregory Stark <stark@enterprisedb.com> writes:
>>> We're seeing a problem where occasionally a process appears to be granted a
>>> lock but miss its semaphore signal.
>>
>> Kernel bug maybe?  What's the platform?
>
> It does sound like it given the way my description went. I was worried it may
> be some code path not setting waitStatus properly or the compiler caching it
> incorrectly somehow.
>
> But now that I check I see it's a pretty old kernel version (Linux 2.6.5) 

For what it's worth we've reproduced the problem with 2.6.16.21 which is
"only" about a year old. I want to rerun this with a shiny new 2.6.22 kernel
but really 2.6.16 is recent enough that I don't know of any major bugs fixed
in IPC handling since then (with the exception of hugetlb interaction which
we're not using on this machine).

So now this is probably either an ongoing kernel bug affecting Postgres or
it's elsewhere -- either in Postgres or GCC.

I'm really concerned about this because while the behaviour with
deadlock_timeout set quite high (we have it set to 60s on this machine) is bad
enough -- the behaviour with it set to the default 1s is far more scary.

With the default 1s timeout, on a machine whose lock waits are mostly under 1s,
you will probably never notice anything recognizably similar to this. You'll
occasionally have some lock waits which last a full second for no good reason,
but you'll never notice that.

*But* if you should have a lock wait which lasts more than 1s before it's
granted, and the semaphore signal gets lost when it's granted, you're in
serious doo doo. The deadlock timeout only fires once and then nothing's going
to wake up that process ever again.
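
The two cases above can be put into a toy model. This is only a sketch of the
behaviour as described in this message, not the actual ProcSleep code: the
function name and parameters are made up for illustration, and it assumes that
the single timer expiry is the only other chance the waiter gets to recheck
its status.

```python
def one_shot_wait(grant_time, deadlock_timeout=1.0, signal_lost=True):
    """Toy model: time (s) at which a waiter notices its lock grant.

    Returns None if the waiter never notices. Hypothetical helper, not
    PostgreSQL code.
    """
    if not signal_lost:
        # Normal case: the semaphore signal wakes the waiter at grant time.
        return grant_time
    if grant_time <= deadlock_timeout:
        # Signal lost, but the one-shot timer still fires after the grant,
        # so the waiter rechecks then and finds the lock already granted.
        return deadlock_timeout
    # Signal lost and the timer already fired before the grant: nothing
    # is left to wake the waiter.
    return None


print(one_shot_wait(0.3))   # 1.0  (a wait that lasts a second for no good reason)
print(one_shot_wait(2.5))   # None (the process sleeps forever)
```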

IIRC we've actually gotten a couple reports of people claiming they've got a
"deadlock" when there was no evidence of a deadlock in pg_locks. We always
chalked it up to a single long-lived process holding the lock and blocking,
but never did much analysis on those reports to see if that was really the
case. It's quite possible we had users already observing this problem.

If it's a real problem then we're in a bit of a bind. Even if we find and fix
a Linux kernel problem we'll still have users on versions of the kernel prior
to 2.6.23 or whatever has the bug fixed. We may be best off including an
option to have the deadlock timer refire every deadlock_timeout interval
instead of just firing once. Then we could print a message any time it occurs
and include a HINT about upgrading to a kernel with the bug fixed.
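
As a rough illustration of what a refiring timer would buy us, here is a toy
model (again with made-up names, and assuming every timer expiry lets the
waiter recheck whether its lock has been granted). With a lost semaphore
signal the waiter notices the grant at the first expiry at or after the grant,
so the extra delay is bounded by one deadlock_timeout interval instead of
being unbounded.

```python
import math


def refiring_wait(grant_time, deadlock_timeout=1.0):
    """Toy model: time (s) at which a waiter whose semaphore signal was lost
    notices its grant when the timer refires every deadlock_timeout seconds.

    Hypothetical helper, not PostgreSQL code.
    """
    # Number of timer expiries up to and including the first one that
    # happens at or after the grant (at least one expiry always occurs).
    expiries = max(1, math.ceil(grant_time / deadlock_timeout))
    return expiries * deadlock_timeout


print(refiring_wait(2.5))        # 3.0: noticed at the third expiry
print(refiring_wait(61, 60))     # 120: with deadlock_timeout = 60s
```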

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com