Re: Reviewing freeze map code - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Reviewing freeze map code
Date
Msg-id CAA4eK1Kop7jfQFUfeuSdDgZAKEm-z51Ya5zHuJXEXG7PGG6Taw@mail.gmail.com
Whole thread Raw
In response to Re: Reviewing freeze map code  (Andres Freund <andres@anarazel.de>)
Responses Re: Reviewing freeze map code  (Thomas Munro <thomas.munro@enterprisedb.com>)
Re: Reviewing freeze map code  (Thomas Munro <thomas.munro@enterprisedb.com>)
Re: Reviewing freeze map code  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2016-06-15 08:56:52 -0400, Robert Haas wrote:
> > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
> > and CTID without logging anything or clearing the all-visible flag and
> > then releases the lock on the heap page to go do some more work that
> > might even ERROR out.  Only if that other work all goes OK do we
> > relock the page and perform the WAL-logged actions.
> >
> > That doesn't seem like a good idea even in existing releases, because
> > you've taken a tuple on an all-visible page and made it not
> > all-visible, and you've made a page modification that is not
> > necessarily atomic without logging it.
>
> Right, that's broken.
>
>
> > I'm not sure what to do about this: this part of the heap_update()
> > logic has been like this forever, and I assume that if it were easy to
> > refactor this away, somebody would have done it by now.
>
> Well, I think generally nobody seriously looked at actually refactoring
> heap_update(), even though that'd be a good idea.  But in this instance,
> the problem seems relatively fundamental:
>
> We need to lock the origin page, to do visibility checks, etc. Then we
> need to figure out the target page. Even disregarding toasting - which
> we could be doing earlier with some refactoring - we're going to have to
> release the page level lock, to lock them in ascending order. Otherwise
> we'll risk kinda likely deadlocks.
>

Can we consider to use some strategy to avoid deadlocks without releasing the lock on old page?  Consider if we could have a mechanism such that RelationGetBufferForTuple() will ensure that it always returns a new buffer which has targetblock greater than the old block (on which we already held a lock).  I think here tricky part is whether we can get anything like that from FSM. Also, there could be cases where we need to extend the heap when there were pages in heap with space available, but we have ignored them because there block number is smaller than the block number on which we have lock.

>  We also certainly don't want to nest
> the lwlocks for the toast stuff, inside a content lock for the old
> tupe's page.
>
> So far the best idea I have - and it's really not a good one - is to
> invent a new hint-bit that tells concurrent updates to acquire a
> heavyweight tuple lock, while releasing the page-level lock. If that
> hint bit does not require any other modifications - and we don't need an
> xid in xmax for this use case - that'll avoid doing all the other
> `already_marked` stuff early, which should address the correctness
> issue.
>

Don't we need to clear such a flag in case of error?  Also don't we need to reset it later, like when modifying the old page later before WAL.

>  It's kinda invasive though, and probably has performance
> implications.
>

Do you see performance implication due to requirement of heavywieht tuple lock in more cases than now or something else?

Some others ways could be:

Before releasing the lock on buffer containing old tuple, clear the VM and visibility info from page and WAL log it.  I think this could impact performance depending on how frequently we need to perform this action.

Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic was introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a ) and set the same in old tuple header before releasing lock on buffer and teach tqual.c to honor the flag.  I think tqual.c should consider  HEAP_XMAX_UNLOGGED as an indication of aborted transaction unless it is currently in-progress.  Also, I think we need to clear this flag before WAL logging in heap_update.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: parallel.c is not marked as test covered
Next
From: Michael Paquier
Date:
Subject: Re: primary_conninfo missing from pg_stat_wal_receiver