Thread: missing LockBuffer(buffer, BUFFER_LOCK_SHARE) in trigger.c GetTupleForTrigger?

missing LockBuffer(buffer, BUFFER_LOCK_SHARE) in trigger.c GetTupleForTrigger?

From
Andres Freund
Date:
Hi,

while looking at trigger.c from the POV of foreign key locks I noticed
that GetTupleForTrigger assumes, without comment, that it can just look
at a page's contents without a LockBuffer(buffer, BUFFER_LOCK_SHARE);

That code path is only reached for AFTER ... FOR EACH ROW ... triggers,
so it's fine that it doesn't lock the tuple itself; that already needs to
have happened before.

The code in question:

if (newslot_which_is_passed_by_before_triggers)
        ...
else
{
        Page            page;
        ItemId          lp;

        buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));

        page = BufferGetPage(buffer);
        lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));

        Assert(ItemIdIsNormal(lp));

        tuple.t_data = (HeapTupleHeader) PageGetItem(page, lp);
        tuple.t_len = ItemIdGetLength(lp);
        tuple.t_self = *tid;
        tuple.t_tableOid = RelationGetRelid(relation);
}

result = heap_copytuple(&tuple);
ReleaseBuffer(buffer);

As can be seen in the excerpt above, this is basically a very stripped-down
heap_fetch(), but without any locking on the buffer we just read.

I can't believe this is safe? Just about anything but eviction could
happen to that buffer at that moment?

Am I missing something?

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: missing LockBuffer(buffer, BUFFER_LOCK_SHARE) in trigger.c GetTupleForTrigger?

From
Pavan Deolasee
Date:


On Fri, Nov 30, 2012 at 3:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Hi,

while looking at trigger.c from the POV of foreign key locks I noticed
that GetTupleForTrigger assumes, without comment, that it can just look
at a page's contents without a
    LockBuffer(buffer, BUFFER_LOCK_SHARE);

That code path is only reached for AFTER ... FOR EACH ROW ... triggers,
so it's fine that it doesn't lock the tuple itself; that already needs to
have happened before.

The code in question:

if (newslot_which_is_passed_by_before_triggers)
        ...
else
{
        Page            page;
        ItemId          lp;

        buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));

        page = BufferGetPage(buffer);
        lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));

        Assert(ItemIdIsNormal(lp));

        tuple.t_data = (HeapTupleHeader) PageGetItem(page, lp);
        tuple.t_len = ItemIdGetLength(lp);
        tuple.t_self = *tid;
        tuple.t_tableOid = RelationGetRelid(relation);
}

result = heap_copytuple(&tuple);
ReleaseBuffer(buffer);

As can be seen in the excerpt above, this is basically a very stripped-down
heap_fetch(), but without any locking on the buffer we just read.

I can't believe this is safe? Just about anything but eviction could
happen to that buffer at that moment?

Am I missing something?


That seems safe to me. Anything that's been read above can't really change. The tuple is already locked, so a concurrent update/delete has to wait on us. We have a pin on the buffer, so VACUUM or HOT-prune can't happen either. I can't see any other operation that could really change those fields.

heap_fetch(), though it looks very similar, has an important difference: it might be reading a tuple without a lock on it, and there we need the buffer lock to protect against a concurrent update/delete of the tuple. AFAIK, once the tuple has passed qualification checks etc., even callers of heap_fetch() can read the tuple data without any lock on the buffer itself.
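For reference, the pin-versus-content-lock protocol this argument leans on, as a minimal sketch following the rules described in src/backend/storage/buffer/README (the variable names here are illustrative, not from trigger.c):

    Buffer      buf;
    Page        page;
    ItemId      lp;

    buf = ReadBuffer(relation, blkno);      /* pin only: page stays resident
                                             * and cleanup locks are blocked */
    LockBuffer(buf, BUFFER_LOCK_SHARE);     /* content lock: page layout is
                                             * now stable */

    page = BufferGetPage(buf);
    lp = PageGetItemId(page, offnum);       /* safe to chase line pointers */

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* keep the pin: the tuple body
                                             * we already located can't move */
    ReleaseBuffer(buf);                     /* finally drop the pin */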

Thanks,
Pavan
--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: missing LockBuffer(buffer, BUFFER_LOCK_SHARE) in trigger.c GetTupleForTrigger?

From
Andres Freund
Date:
On 2012-11-30 08:52:18 +0530, Pavan Deolasee wrote:
> On Fri, Nov 30, 2012 at 3:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > Hi,
> >
> > while looking at trigger.c from the POV of foreign key locks I noticed
> > that GetTupleForTrigger assumes, without comment, that it can just look
> > at a page's contents without a
> >     LockBuffer(buffer, BUFFER_LOCK_SHARE);
> >
> > That code path is only reached for AFTER ... FOR EACH ROW ... triggers,
> > so it's fine that it doesn't lock the tuple itself; that already needs to
> > have happened before.
> >
> > The code in question:
> >
> > if (newslot_which_is_passed_by_before_triggers)
> >         ...
> > else
> > {
> >         Page            page;
> >         ItemId          lp;
> >
> >         buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
> >
> >         page = BufferGetPage(buffer);
> >         lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
> >
> >         Assert(ItemIdIsNormal(lp));
> >
> >         tuple.t_data = (HeapTupleHeader) PageGetItem(page, lp);
> >         tuple.t_len = ItemIdGetLength(lp);
> >         tuple.t_self = *tid;
> >         tuple.t_tableOid = RelationGetRelid(relation);
> > }
> >
> > result = heap_copytuple(&tuple);
> > ReleaseBuffer(buffer);
> >
> > As can be seen in the excerpt above, this is basically a very stripped-down
> > heap_fetch(), but without any locking on the buffer we just read.
> >
> > I can't believe this is safe? Just about anything but eviction could
> > happen to that buffer at that moment?
> >
> > Am I missing something?
> >
> >
> That seems safe to me. Anything that's been read above can't really
> change. The tuple is already locked, so a concurrent update/delete has to
> wait on us. We have a pin on the buffer, so VACUUM or HOT-prune can't
> happen either. I can't see any other operation that could really change
> those fields.

We only get the pin right there; I don't see any preexisting pin. Which
means we might have just opened a page that's in the process of being
pruned/vacuumed by another backend.
I think a concurrent heap_(insert|update)/PageAddItem might actually
already be dangerous, because it might move line pointers around.

> heap_fetch(), though it looks very similar, has an important difference: it
> might be reading a tuple without a lock on it, and there we need the buffer
> lock to protect against a concurrent update/delete of the tuple. AFAIK, once
> the tuple has passed qualification checks etc., even callers of heap_fetch()
> can read the tuple data without any lock on the buffer itself.

Sure, but the page header isn't accessed there, just the tuple data
itself, which always stays at the same place on the page.
(A normal heap_fetch shouldn't be worried about updates/deletions of its
tuple; MVCC to the rescue.)

Even if it turns out to be safe, I think this deserves at least a
comment...


Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Andres Freund wrote:
> On 2012-11-30 08:52:18 +0530, Pavan Deolasee wrote:
> > On Fri, Nov 30, 2012 at 3:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:

> > > I can't believe this is safe? Just about anything but eviction could
> > > happen to that buffer at that moment?

Yeah, it does seem fishy to me as well.

> Even if it turns out to be safe, I think this deserves at least a
> comment...

No doubt that trigger.c is hugely underdocumented in some places.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: missing LockBuffer(buffer, BUFFER_LOCK_SHARE) in trigger.c GetTupleForTrigger?

From
Pavan Deolasee
Date:


On Fri, Nov 30, 2012 at 5:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:

> >
> That seems safe to me. Anything that's been read above can't really
> change. The tuple is already locked, so a concurrent update/delete has to
> wait on us. We have a pin on the buffer, so VACUUM or HOT-prune can't
> happen either. I can't see any other operation that could really change
> those fields.

We only get the pin right there; I don't see any preexisting pin. Which
means we might have just opened a page that's in the process of being
pruned/vacuumed by another backend.

Hmm. Yeah, you're right. That is a possible risky scenario. Even though a cleanup lock waits for all pins to go away, that only works if every reader takes at least a SHARE lock, unless it was continuously holding a pin on the buffer (in which case it's OK to drop the lock and read the tuple body without reacquiring it). Otherwise, as you rightly pointed out, we could actually be reading a page which is being actively cleaned up, with tuples being moved around.
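For reference, a minimal sketch of the cleanup-lock discipline described above (the bufmgr calls are the real API; the surrounding flow is illustrative):

    Buffer buf = ReadBuffer(relation, blkno);

    LockBufferForCleanup(buf);      /* waits until ours is the only pin */

    /* Only now may tuple bodies be moved to defragment the page; a reader
     * re-reading line pointers without at least a share lock could
     * otherwise see them mid-move. */

    UnlockReleaseBuffer(buf);       /* drop content lock and pin together */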
 
I think a concurrent heap_(insert|update)/PageAddItem might actually
already be dangerous, because it might move line pointers around.


I don't think we ever move the line pointers around, because the indexes will be pointing to them. But vacuum/prune is dangerous enough to require a SHARE lock here in any case.
 
Thanks,
Pavan

On 30 November 2012 11:58, Andres Freund <andres@2ndquadrant.com> wrote:

> We only get the pin right there; I don't see any preexisting pin.

Seems easy enough to test with an Assert patch.

If the Assert doesn't fail, we apply it as "documentation" of the
requirement for a pin.

If it fails, we fix the bug.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: missing LockBuffer(buffer, BUFFER_LOCK_SHARE) in trigger.c GetTupleForTrigger?

From
Andres Freund
Date:
On 2012-11-30 18:09:49 +0530, Pavan Deolasee wrote:
> On Fri, Nov 30, 2012 at 5:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> >
> > > >
> > > That seems safe to me. Anything that's been read above can't really
> > > change. The tuple is already locked, so a concurrent update/delete has to
> > > wait on us. We have a pin on the buffer, so VACUUM or HOT-prune can't
> > > happen either. I can't see any other operation that could really change
> > > those fields.
> >
> > We only get the pin right there; I don't see any preexisting pin. Which
> > means we might have just opened a page that's in the process of being
> > pruned/vacuumed by another backend.
> >
>
> Hmm. Yeah, you're right. That is a possible risky scenario. Even though a
> cleanup lock waits for all pins to go away, that only works if every reader
> takes at least a SHARE lock, unless it was continuously holding a pin on
> the buffer (in which case it's OK to drop the lock and read the tuple body
> without reacquiring it). Otherwise, as you rightly pointed out, we could
> actually be reading a page which is being actively cleaned up, with tuples
> being moved around.

Well, live (or recently dead) tuples can't just be moved around unless
I'm missing something. That would cause problems with HeapTuples
directly pointing into the page, as you've noticed.

> > I think a concurrent heap_(insert|update)/PageAddItem might actually
> > already be dangerous, because it might move line pointers around.
> >
> >
> I don't think we ever move the line pointers around, because the indexes
> will be pointing to them.

The indexes point to a tid, not to a line pointer. So reshuffling the
line-pointer array - while keeping the tids valid - is entirely
fine. Right?
Also, PageAddItem does that all the time, so I think we would have
noticed if not ;)

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On 2012-11-30 12:50:06 +0000, Simon Riggs wrote:
> On 30 November 2012 11:58, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > We only get the pin right there; I don't see any preexisting pin.
>
> Seems easy enough to test with an Assert patch.
>
> If the Assert doesn't fail, we apply it as "documentation" of the
> requirement for a pin.
>
> If it fails, we fix the bug.

I think it's wrong even if we were holding a pin all the time, due to the
aforementioned PageAddItem reshuffling of line pointers. So that Assert
wouldn't be proof enough.

I can try to prove corruption there, but I would rather see somebody
come along and tell me why it's safe and that I am dumb for not
realizing it.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: missing LockBuffer(buffer, BUFFER_LOCK_SHARE) in trigger.c GetTupleForTrigger?

From
Pavan Deolasee
Date:


On Fri, Nov 30, 2012 at 6:21 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2012-11-30 18:09:49 +0530, Pavan Deolasee wrote:
> On Fri, Nov 30, 2012 at 5:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> >
> > > >
> > > That seems safe to me. Anything that's been read above can't really
> > > change. The tuple is already locked, so a concurrent update/delete has to
> > > wait on us. We have a pin on the buffer, so VACUUM or HOT-prune can't
> > > happen either. I can't see any other operation that could really change
> > > those fields.
> >
> > We only get the pin right there; I don't see any preexisting pin. Which
> > means we might have just opened a page that's in the process of being
> > pruned/vacuumed by another backend.
> >
>
> Hmm. Yeah, you're right. That is a possible risky scenario. Even though a
> cleanup lock waits for all pins to go away, that only works if every reader
> takes at least a SHARE lock, unless it was continuously holding a pin on
> the buffer (in which case it's OK to drop the lock and read the tuple body
> without reacquiring it). Otherwise, as you rightly pointed out, we could
> actually be reading a page which is being actively cleaned up, with tuples
> being moved around.

Well, live (or recently dead) tuples can't just be moved around unless
I'm missing something. That would cause problems with HeapTuples
directly pointing into the page, as you've noticed.


I think we're confusing terminology. The tuple headers and bodies (live or dead) can be moved around freely as long as you have a cleanup lock on the page. That's how HOT-prune frees up fragmented space.
 
> > I think a concurrent heap_(insert|update)/PageAddItem might actually
> > already be dangerous, because it might move line pointers around.
> >
> >
> I don't think we ever move the line pointers around, because the indexes
> will be pointing to them.

The indexes point to a tid, not to a line pointer. So reshuffling the
line-pointer array - while keeping the tids valid - is entirely
fine. Right?

I think it's terminology confusion again :-) I thought TIDs and line pointers were the same thing, used interchangeably; or at least I've used them that way for a long time. But I think you are reading "line pointer" as the thing stored in the ItemId structure, and not the ItemId itself.
 
Also, PageAddItem() doesn't move things around normally. It does that only during recovery, at least for heap pages. Elsewhere it will either reuse an UNUSED pointer or add at the end. Isn't that how it is?

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: missing LockBuffer(buffer, BUFFER_LOCK_SHARE) in trigger.c GetTupleForTrigger?

From
Pavan Deolasee
Date:


On Fri, Nov 30, 2012 at 5:28 PM, Andres Freund <andres@2ndquadrant.com> wrote:


> heap_fetch(), though it looks very similar, has an important difference: it
> might be reading a tuple without a lock on it, and there we need the buffer
> lock to protect against a concurrent update/delete of the tuple. AFAIK, once
> the tuple has passed qualification checks etc., even callers of heap_fetch()
> can read the tuple data without any lock on the buffer itself.

Sure, but the page header isn't accessed there, just the tuple data
itself, which always stays at the same place on the page.
(A normal heap_fetch shouldn't be worried about updates/deletions of its
tuple; MVCC to the rescue.)


heap_fetch() reads the tuple header, though, to apply snapshot visibility rules. So a SHARE lock is a must for that purpose, because a concurrent update/delete may set XMAX or other visibility-related fields. On the other hand, you can read the tuple body without a page lock if you have been holding a pin on the buffer continuously since you last dropped the lock.
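For reference, the shape of heap_fetch()'s locked section, loosely paraphrased from heapam.c (a sketch, not verbatim source):

    bool    valid;

    LockBuffer(buffer, BUFFER_LOCK_SHARE);

    /* ... locate the tuple via its line pointer ... */

    /* header fields like xmax may be stamped concurrently, so the
     * visibility check must run under the share lock */
    valid = HeapTupleSatisfiesVisibility(tuple, snapshot, buffer);

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

    /* The caller keeps the pin, so reading tuple->t_data afterwards is
     * fine: the pin alone prevents the tuple body from being moved. */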

In any case, a lock is needed in GetTupleForTrigger unless someone can argue that a pin has been held continuously on the buffer since the lock was last dropped, something I don't see happening in this case.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Re: missing LockBuffer(buffer, BUFFER_LOCK_SHARE) in trigger.c GetTupleForTrigger?

From
Andres Freund
Date:
On 2012-11-30 18:35:05 +0530, Pavan Deolasee wrote:
> On Fri, Nov 30, 2012 at 6:21 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > On 2012-11-30 18:09:49 +0530, Pavan Deolasee wrote:
> > > On Fri, Nov 30, 2012 at 5:28 PM, Andres Freund <andres@2ndquadrant.com>
> > > wrote:
> > >
> > > >
> > > > > >
> > > > > That seems safe to me. Anything that's been read above can't really
> > > > > change. The tuple is already locked, so a concurrent update/delete has
> > > > > to wait on us. We have a pin on the buffer, so VACUUM or HOT-prune
> > > > > can't happen either. I can't see any other operation that could really
> > > > > change those fields.
> > > >
> > > > We only get the pin right there; I don't see any preexisting pin. Which
> > > > means we might have just opened a page that's in the process of being
> > > > pruned/vacuumed by another backend.
> > > >
> > >
> > > Hmm. Yeah, you're right. That is a possible risky scenario. Even though a
> > > cleanup lock waits for all pins to go away, that only works if every
> > > reader takes at least a SHARE lock, unless it was continuously holding a
> > > pin on the buffer (in which case it's OK to drop the lock and read the
> > > tuple body without reacquiring it). Otherwise, as you rightly pointed out,
> > > we could actually be reading a page which is being actively cleaned up,
> > > with tuples being moved around.
> >
> > Well, live (or recently dead) tuples can't just be moved around unless
> > I'm missing something. That would cause problems with HeapTuples
> > directly pointing into the page, as you've noticed.

> I think we're confusing terminology. The tuple headers and bodies
> (live or dead) can be moved around freely as long as you have a cleanup
> lock on the page. That's how HOT-prune frees up fragmented space.

I need to read more code here; this is nothing I had really looked into
before. You're probably right.

Up to now I thought HOT pruning only removed intermediate & surely dead
tuples but didn't move stuff around, and that page fragmentation was
repaired separately. But it looks like I am wrong.
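For reference, a simplified sketch of the opportunistic prune path (see heap_page_prune_opt() and friends in pruneheap.c; the control flow is heavily condensed here):

    if (ConditionalLockBufferForCleanup(buffer))
    {
        /* ... decide which line pointers become redirected/dead/unused ... */

        /* compact the page: surviving tuple bodies are moved and every
         * line pointer's lp_off is rewritten, so any t_data pointer that
         * was cached without a lock now dangles */
        PageRepairFragmentation(page);

        MarkBufferDirty(buffer);
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    }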

> > > > I think a concurrent heap_(insert|update)/PageAddItem might actually
> > > > already be dangerous, because it might move line pointers around.
> > > >
> > > >
> > > I don't think we ever move the line pointers around, because the indexes
> > > will be pointing to them.
> >
> > The indexes point to a tid, not to a line pointer. So reshuffling the
> > line-pointer array - while keeping the tids valid - is entirely
> > fine. Right?

> I think it's terminology confusion again :-) I thought TIDs and line
> pointers were the same thing, used interchangeably; or at least I've used
> them that way for a long time. But I think you are reading "line pointer"
> as the thing stored in the ItemId structure, and not the ItemId itself.

I always understood ItemIds to be line pointers and ItemPointers to be
tids. Line pointers are only meaningful within a single page, store the
actual offset within the page, and so on. ItemPointers (tids) are
longer-lasting and store (page, offset_number)…
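For reference, the two structures in question, paraphrased from src/include/storage/itemid.h and src/include/storage/itemptr.h:

    /* A line pointer (ItemId): lives in the page's item array, page-local */
    typedef struct ItemIdData
    {
        unsigned    lp_off:15,      /* offset of the tuple within the page */
                    lp_flags:2,     /* unused/normal/redirect/dead */
                    lp_len:15;      /* byte length of the tuple */
    } ItemIdData;

    /* A tid (ItemPointer): a global address; this is what indexes store */
    typedef struct ItemPointerData
    {
        BlockIdData     ip_blkid;   /* block number */
        OffsetNumber    ip_posid;   /* slot in that page's item array */
    } ItemPointerData;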

> Also, PageAddItem() doesn't move things around normally. It does that only
> during recovery, at least for heap pages. Elsewhere it will either reuse an
> UNUSED pointer or add at the end. Isn't that how it is?

Hm? It doesn't move the page contents around, but it moves the ItemId
array during completely normal operation (c.f. needshuffle in
PageAddItem). Or am I missing something?
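For reference, the fragment in question, roughly as it appears in PageAddItem() in bufpage.c:

    /* shift the existing ItemId array up to make room at offsetNumber;
     * the tids stay valid, but the array is physically rearranged */
    itemId = PageGetItemId(phdr, offsetNumber);

    if (needshuffle)
        memmove(itemId + 1, itemId,
                (limit - offsetNumber) * sizeof(ItemIdData));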

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: missing LockBuffer(buffer, BUFFER_LOCK_SHARE) in trigger.c GetTupleForTrigger?

From
Pavan Deolasee
Date:


On Fri, Nov 30, 2012 at 6:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:


Hm? It doesn't move the page contents around, but it moves the ItemId
array during completely normal operation (c.f. needshuffle in
PageAddItem). Or am I missing something?


I think that's probably only used for non-heap pages. For heap pages it just doesn't make sense to shuffle the ItemId array; that would defeat the entire purpose of having them in the first place.

Thanks,
Pavan 

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 2012-11-30 13:57:46 +0100, Andres Freund wrote:
> On 2012-11-30 12:50:06 +0000, Simon Riggs wrote:
> > On 30 November 2012 11:58, Andres Freund <andres@2ndquadrant.com> wrote:
> >
> > > We only get the pin right there; I don't see any preexisting pin.
> >
> > Seems easy enough to test with an Assert patch.
> >
> > If the Assert doesn't fail, we apply it as "documentation" of the
> > requirement for a pin.
> >
> > If it fails, we fix the bug.
> 
> I think it's wrong even if we were holding a pin all the time, due to the
> aforementioned PageAddItem reshuffling of line pointers. So that Assert
> wouldn't be proof enough.

But a failing Assert obviously proves something. Stupid me. So here we
go:

diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 98b8207..3b61d06 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2662,6 +2662,16 @@ ltrmark:;
         buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
 
+#ifdef USE_ASSERT_CHECKING
+        if (!BufferIsLocal(buffer))
+        {
+            /* Only pinned by the above ReadBuffer */
+            if (PrivateRefCount[buffer - 1] <= 1)
+                elog(ERROR, "too low local pin count: %d",
+                     PrivateRefCount[buffer - 1]);
+        }
+#endif
+
         page = BufferGetPage(buffer);
         lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));


CREATE OR REPLACE FUNCTION raise_notice_id()
    RETURNS trigger
    LANGUAGE plpgsql AS $body$
BEGIN
    RAISE NOTICE 'id: %', OLD.id;
    RETURN NULL;
END
$body$;

postgres=# CREATE TABLE crashdummy(id serial primary key, data int);

postgres=# CREATE TRIGGER crashdummy_after_delete
    AFTER DELETE ON crashdummy
    FOR EACH ROW EXECUTE PROCEDURE raise_notice_id();

postgres=# INSERT INTO crashdummy(data) SELECT * FROM generate_series(1, 1000);

postgres=# DELETE FROM crashdummy WHERE ctid IN (SELECT ctid FROM crashdummy WHERE data < 1000);
ERROR:  too low local pin count: 1
Time: 4.515 ms

A plain DELETE without the subselect doesn't trigger the problem though;
that's probably the reason we haven't seen any problems so far.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Andres Freund <andres@2ndquadrant.com> writes:
> But a failing Assert obviously proves something.

It doesn't prove that the code is unsafe though.

I am suspicious that it is safe, because it's pretty darn hard to
believe we'd not have seen field reports of problems with triggers
otherwise.  It's not like this is a seldom-executed code path.

But I concede that I don't see *why* it's safe --- it looks like
it ought to be possible for a VACUUM to move the tuple while the
trigger code is fetching it.  You'd need the timing to be just
so...
        regards, tom lane



Re: missing LockBuffer(buffer, BUFFER_LOCK_SHARE) in trigger.c GetTupleForTrigger?

From
Andres Freund
Date:
On 2012-11-30 10:42:21 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > But a failing Assert obviously proves something.
>
> It doesn't prove that the code is unsafe though.

Hehe.

> I am suspicious that it is safe, because it's pretty darn hard to
> believe we'd not have seen field reports of problems with triggers
> otherwise.  It's not like this is a seldom-executed code path.

That's true, but due to the fact that plain DELETEs do *not* trigger the
Assert, I think we might just not have noticed it. In fact, a make check
with the error check in place doesn't error out.
Also, I think the wrong data caused by it isn't that likely to be
noticed. It's just the OLD tuple in triggers, so it's not going to be
looked at too closely all the time, and we might return data that looks
somewhat valid.

I think most executor paths actually hold a separate pin "by accident"
while this is executing. I think the reason my example hits the case is
the ReleaseBuffer() after ExecStoreTuple() in TidNext(). Seqscans have
another pin via scan->rs_cbuf, indexscans one via ->xs_cbuf. Many of the
other nodes that are likely below ExecUpdate/Delete probably work
similarly.

I think most cases with high throughput (which you probably need to
actually hit the relatively small window) won't use plans that are
complex enough to trigger the bug.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Andres Freund <andres@2ndquadrant.com> writes:
> On 2012-11-30 10:42:21 -0500, Tom Lane wrote:
>> I am suspicious that it is safe, because it's pretty darn hard to
>> believe we'd not have seen field reports of problems with triggers
>> otherwise.  It's not like this is a seldom-executed code path.

> That's true, but due to the fact that plain DELETEs do *not* trigger the
> Assert, I think we might just not have noticed it.

OK, I've managed to reproduce an actual failure, ie VACUUM moving the
tuple underneath the fetch.  It's not easy to do though, and the timing
has to be *really* tight.

The reason that simple update/delete queries don't show the issue is
that the tuple being fetched is the "old" one that was just
updated/deleted, and in a simple query that one was just returned by a
relation-scan plan node that's underneath the ModifyTable node, and
*that scan node still has pin* on the table page; or more accurately,
the TupleTableSlot it's returned the scan tuple in has the pin.  This
pin's been held since we did a LockBuffer to examine the tuple's
liveness, so it's sufficient to prevent anyone from getting a cleanup
lock.  However, in a more complex plan such as a join, the ModifyTable
node could be receiving tuples that aren't the current tuple of a scan
anymore.  (Your example case passes the tuples through a Sort, so none
of them are pinned by the time the ModifyTable node gets them.)
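For reference, that pin transfer happens in ExecStoreTuple() (execTuples.c, as the API stood then); a sketch of a scan node's return path:

    ExecStoreTuple(tuple,           /* tuple to return */
                   slot,            /* the node's result slot */
                   scan->rs_cbuf,   /* buffer holding the tuple: the slot
                                     * takes its own pin on it */
                   false);          /* tuple is not slot-owned memory */

The slot keeps that pin until it is cleared or overwritten, which is why the "old" tuple's page stays pinned under simple plans.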

Another thing that reduces the odds of seeing this, in recent versions,
is that when we first scanned the page containing the "old" tuple,
we'll have (tried to) do a page prune.  So even if a VACUUM comes along
now, the odds are that it will not find any newly-reclaimable space.
To reproduce the problem, I had to arrange for another tuple on the
same page to become reclaimable between the relation scan and the
GetTupleForTrigger fetch (which I did by having another, serializable
transaction scan the table, then deleting the other tuple, then allowing
the serializable transaction to commit after the page prune step).

Lastly, the only way you see a problem is if VACUUM acquires cleanup
lock before GetTupleForTrigger pins the page, finds some reclaimable
space, and moves the target tuple on the page in order to compact the
free space, and that data movement happens in the narrow time window
between where GetTupleForTrigger does PageGetItem() and where it
finishes the heap_copytuple() call just below that.  Even then, you
might escape seeing a problem, unless the data movement also shifts
some *other* tuple into the space formerly occupied by the target.

So that's why nobody's reported the problem --- it's not happening
often enough for anyone to see it as a repeatable issue.

I'll go insert a LockBuffer call.  Good catch!
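For reference, a minimal sketch of that fix: bracket the fetch with a share lock (this is the shape of it, not necessarily the committed text):

    buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));

    /*
     * Although the tuple is already locked, we must share-lock the
     * buffer: otherwise a concurrent cleanup (VACUUM / HOT prune) could
     * move the tuple while we are copying it.
     */
    LockBuffer(buffer, BUFFER_LOCK_SHARE);

    page = BufferGetPage(buffer);
    lp = PageGetItemId(page, ItemPointerGetOffsetNumber(tid));
    Assert(ItemIdIsNormal(lp));

    tuple.t_data = (HeapTupleHeader) PageGetItem(page, lp);
    tuple.t_len = ItemIdGetLength(lp);
    tuple.t_self = *tid;
    tuple.t_tableOid = RelationGetRelid(relation);

    result = heap_copytuple(&tuple);

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buffer);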
        regards, tom lane



On 2012-11-30 12:53:27 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2012-11-30 10:42:21 -0500, Tom Lane wrote:
> >> I am suspicious that it is safe, because it's pretty darn hard to
> >> believe we'd not have seen field reports of problems with triggers
> >> otherwise.  It's not like this is a seldom-executed code path.
>
> > That's true, but due to the fact that plain DELETEs do *not* trigger the
> > Assert, I think we might just not have noticed it.
>
> OK, I've managed to reproduce an actual failure, ie VACUUM moving the
> tuple underneath the fetch.  It's not easy to do though, and the timing
> has to be *really* tight.
>
> So that's why nobody's reported the problem --- it's not happening
> often enough for anyone to see it as a repeatable issue.

Yea. I have been playing with trying to reproduce the issue as well, and
I needed 3 sessions and two gdb's to cause any problem... And even then
it took me several tries to get all the conditions right.

We call heap_prune* surprisingly often...

> I'll go insert a LockBuffer call.  Good catch!

Thanks.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



BTW, I went looking for other places that might have a similar mistake.
I found that contrib/pageinspect/btreefuncs.c pokes around in btree
pages without any buffer lock, which seems pretty broken --- don't we
want it to take share lock?
        regards, tom lane



Re: missing LockBuffer(buffer, BUFFER_LOCK_SHARE) in trigger.c GetTupleForTrigger?

From
Andres Freund
Date:
On 2012-11-30 14:08:05 -0500, Tom Lane wrote:
> BTW, I went looking for other places that might have a similar mistake.
> I found that contrib/pageinspect/btreefuncs.c pokes around in btree
> pages without any buffer lock, which seems pretty broken --- don't we
> want it to take share lock?

I seem to remember comments somewhere indicating that pageinspect (?)
doesn't take locks by intention, to make debugging of locking problems
easier. Not sure whether that's really realistic, but ...

Andres

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Andres Freund <andres@2ndquadrant.com> writes:
> On 2012-11-30 14:08:05 -0500, Tom Lane wrote:
>> BTW, I went looking for other places that might have a similar mistake.
>> I found that contrib/pageinspect/btreefuncs.c pokes around in btree
>> pages without any buffer lock, which seems pretty broken --- don't we
>> want it to take share lock?

> I seem to remember comments somewhere indicating that pageinspect (?)
> doesn't take locks by intention, to make debugging of locking problems
> easier. Not sure whether that's really realistic, but ...

Dunno, that seems like a pretty lame argument when compared to the very
real possibility of crashing due to processing corrupt data.  Besides,
the heap-page examination functions in the same module take buffer lock,
so why shouldn't these?
        regards, tom lane



BTW, on further inspection, there's yet another reason why we've not
heard about this from the field: even if all the wrong things happen and
GetTupleForTrigger manages to copy bogus data for the old tuple, that
data *is not passed to the trigger function*.  The only thing we do with
it is decide whether to queue an event for the trigger.  So if you've
got a WHEN condition for the trigger, that expression would see the bad
data, or if it's an RI trigger the bad data is passed to
RI_FKey_pk_upd_check_required or RI_FKey_fk_upd_check_required.  But
unless the data is corrupt enough to cause a crash, the worst outcome
would be a wrong decision about whether the trigger needs to be run.

It's possible this explains some of the reports we've seen about RI
constraints being violated.
        regards, tom lane