Re: heap_update temporary release of buffer lock - Mailing list pgsql-hackers

From Robert Haas
Subject Re: heap_update temporary release of buffer lock
Date
Msg-id CA+TgmobiZBoqv0L80XVHbQ8qi8ObHLQd0Bh+7tPrhm5A08ayuQ@mail.gmail.com
Whole thread Raw
In response to Re: heap_update temporary release of buffer lock  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: heap_update temporary release of buffer lock
List pgsql-hackers
On Tue, Sep 20, 2011 at 2:28 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 20.09.2011 20:42, Alvaro Herrera wrote:
>> I notice that heap_update releases the buffer lock, after checking the
>> HeapTupleSatifiesUpdate result, and before marking the tuple as updated,
>> to pin the visibility map page -- heapam.c lines 2638ff in master branch.
>>
>> Is this not a bug?  I imagine that while this code releases the lock,
>> someone else could acquire it and grab a FOR SHARE lock on the tuple; if
>> the update later ends up modifying the tuple completely, this could
>> cause an FK to be broken, for example.
>
> Yeah, I think you're right. Not only might someone grab a FOR SHARE lock on
> the tuple, but someone might even update it under your nose.

Yeah, I think he's right, too.  :-(

The easiest fix seems to be (as you suggest) to add "goto l2" after
reacquiring the lock.  Can we get away with (and is there any benefit
to) doing that only if xmax has changed?

> BTW, I think we're playing a bit fast and loose with that
> set-xmax-before-unlocking trick. We haven't WAL-logged anything at that
> point yet, so it's possible that while we're busy toasting the tuple, the
> page is flushed from shared buffers with the xmax and the infomask already
> update. Now, the system crashes, and you get a torn page, so that the xmax
> is already updated, but the HEAP_XMAX_COMMITTED flag was *not* cleared, so
> it's still set. Oops. Highly unlikely in practice, because xmax and infomask
> are very close to each other, but it violates the principles of what we
> usually expect from the underlying system wrt. torn pages.

I think our usual assumption is that the disk might write in chunks as
small as 512 bytes.  IIUC, tuples are only guaranteed to have 8-byte
alignment, and xmax is at offset 4, while the infomask is at offset
20.  So that doesn't seem safe.  If we were doing this over again we
could rearrange things to put them in the same eight-byte aligned
chunk by  (i.e. swap t_xmax and t_field3, move t_infomask before
t_ctid), but it's a bit late for that now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Darren Duncan
Date:
Subject: Re: Back-branch releases upcoming this week
Next
From: Tom Lane
Date:
Subject: Re: unite recovery.conf and postgresql.conf