Re: Parallel scan with SubTransGetTopmostTransaction assert coredump - Mailing list pgsql-hackers

From Maxim Orlov
Subject Re: Parallel scan with SubTransGetTopmostTransaction assert coredump
Date
Msg-id 150825360832da68872abb4d891dfc3a@postgrespro.ru
Whole thread Raw
In response to Re: Parallel scan with SubTransGetTopmostTransaction assert coredump  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: Parallel scan with SubTransGetTopmostTransaction assert coredump  (Pavel Borisov <pashkin.elfe@gmail.com>)
List pgsql-hackers
On 2021-07-09 20:36, Tomas Vondra wrote:
> Hi,
> 
> I took a quick look on this - I'm no expert in the details of 
> snapshots,
> so take my comments with a grain of salt.
> 
> AFAICS both Greg Nancarrow and Pavel Borisov are kinda right. I think
> Greg is right the v3 patch does not seem like the right (or correct)
> solution, for a couple reasons:
> 
> 
> 1) It fixes the symptoms, not the cause. If we set TransactionXmin to a
> bogus value, this only fixes it locally in 
> SubTransGetTopmostTransaction
> but I'd be surprised if other places did not have the same issue.
> 
> 
> 2) The xid/TransactionXmin comparison in the v2 fix:
> 
>   xid = xid > TransactionXmin ? xid : TransactionXmin;
> 
> seems broken, because it compares the XIDs directly, but that's not
> going to work after generating enough transactions.
> 
> 
> 3) But even if this uses TransactionIdFollowsOrEquals, it seems very
> strange because the "xid" comes from
> 
>   XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
> 
> i.e. it's the raw xmin from the tuple, so why should we be setting it 
> to
> TransactionXmin? That seems pretty strange, TBH.
> 
> 
> So yeah, I think this is due to confusion with two snapshots and 
> failing
> to consider both of them when calculating TransactionXmin.
> 
> But I think removing one of the snapshots (as the v2 does it) is rather
> strange too. I very much doubt having both the transaction and active
> snapshots in the parallel worker is not intentional, and Pavel may very
> well be right this breaks isolation levels that use the xact snapshot
> (i.e. REPEATABLE READ and SERIALIZABLE). I haven't checked, though.
> 
> So I think we need to keep both snapshots, but fix TransactionXmin to
> consider both of them - I suppose ParallelWorkerMain could simply look
> at the two snapshots, and use the minimum. Which is what [1] (already
> linked by Pavel) talks about, although that's related to concerns about
> one of the processes dying, which is not what's happening here.
> 
> 
> I'm wondering what consequences this may have on production systems,
> though. We've only seen this failing because of the assert, so what
> happens when the build has asserts disabled?
> 
> Looking at SubTransGetTopmostTransaction, it seems the while loop ends
> immediately (because it's pretty much the opposite of the assert), so 
> we
> just return the "xid" as topmost XID. But will that produce incorrect
> query results, or what?
> 
> 
> 
> regards
> 
> 
> [1]
> https://www.postgresql.org/message-id/20150208002027.GH9201%40alap3.anarazel.de

PFA v4 patch based on the ideas above.

In principle I see little difference with v3. But I agree it is more 
correct.

I did test this patch on Linux and MacOS using testing algo above and 
got no error. On master branch before the patch I still see this error.

---
Best regards,
Maxim Orlov.
Attachment

pgsql-hackers by date:

Previous
From: Masahiro Ikeda
Date:
Subject: Re: Fix comments of heap_prune_chain()
Next
From: Masahiro Ikeda
Date:
Subject: Re: Fix comments of heap_prune_chain()