Re: Parallel scan with SubTransGetTopmostTransaction assert coredump - Mailing list pgsql-hackers

From Pavel Borisov
Subject Re: Parallel scan with SubTransGetTopmostTransaction assert coredump
Date
Msg-id CALT9ZEFA47XhPPZ9ggcnQi=++otKsZVOJgT+WjXQqfmF-YCfRw@mail.gmail.com
Whole thread Raw
In response to Re: Parallel scan with SubTransGetTopmostTransaction assert coredump  (Greg Nancarrow <gregn4422@gmail.com>)
Responses Re: Parallel scan with SubTransGetTopmostTransaction assert coredump  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
пт, 23 июл. 2021 г. в 10:00, Greg Nancarrow <gregn4422@gmail.com>:
On Thu, Jul 22, 2021 at 2:15 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Thanks to Thomas Munro for drawing my attention to this thread. I
> wasn't intentionally ignoring it, but there's a lot of email in the
> world and only so much time.
>
> When I wrote this code originally, the idea that I had in mind was
> simply this: whatever state we have in the leader ought to be
> reproduced in each worker. So if there's an active snapshot in the
> leader, we ought to make that active in all the workers, and if
> there's a transaction snapshot in the leader, we ought to make that
> the transaction snapshot in all of the workers.
>
> But I see now that my thinking was fuzzy, and I'm going to blame that
> on the name GetTransactionSnapshot() being slightly misleading. If
> IsolationUsesXactSnapshot() is true, then there's really such a thing
> as a transaction snapshot and reproducing that in the worker is a
> sensible thing to do. But when !IsolationUsesXactSnapshot(),
> GetTransactionSnapshot() doesn't just "get the transaction snapshot",
> because there isn't any such thing. It takes a whole new snapshot, on
> the theory that you wouldn't be calling this function unless you had
> finished up with the snapshot you got the last time you called this
> function. And in the case of initiating parallel query, that is the
> wrong thing.
>
> I think that, at least in the case where IsolationUsesXactSnapshot()
> is true, we need to make sure that calling GetTransactionSnapshot() in
> a worker produces the same result that it would have produced in the
> leader. Say one of the workers calls an sql or plpgsql function and
> that function runs a bunch of SQL statements. It seems to me that
> there's probably a way for this to result in calls inside the worker
> to GetTransactionSnapshot(), and if that doesn't return the same
> snapshot as in the leader, then we've broken MVCC.
>
> What about when IsolationUsesXactSnapshot() is false? Perhaps it's OK
> to just skip this altogether in that case. Certainly what we're doing
> can't be right, because copying a snapshot that wouldn't have been
> taken without parallel query can't ever be the right thing to do.
> Perhaps we need to copy something else instead. I'm not really sure.
>
> So I think v2 is probably on the right track, but wrong when the
> transaction isolation level is REPEATABLE READ or SERIALIZABLE, and v3
> and v4 just seem like unprincipled hacks that try to avoid the
> assertion failure by lying about whether there's a problem.
>

Many thanks for taking time to respond to this (and thanks to Thomas Munro too).
It's much appreciated, as this is a complex area.
For the time being, I'll reinstate the v2 patch (say as v5) as a
partial solution, and then work on addressing the REPEATABLE READ and
SERIALIZABLE transaction isolation levels, which you point out are not
handled correctly by the patch.
`
I've looked at v5 patch. It is completely similar to v2 patch, which I've already tested using the workflow, I've described in the comments above.  Before the patch I get the errors quite soon, the patch corrects them. Furthermore I've tested the same patch under REPEATABLE READ and SERIALIZABLE and detected no flaws. So, now, when we've got Robert's explanation, it seems to me that v2 (aka v5) patch can be committed (leaving possible REPEATABLE READ and SERIALIZABLE improvements for the future). I don't really sure it is still possible on 07/21 СF, but I'd change status of the patch to the ready-for-committer. Also I'd like the bugfix to be backported to the previous PG versions. 

Further consideration on the patch and on the backporting is very much welcome!

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Next
From: Tomas Vondra
Date:
Subject: Re: Use generation context to speed up tuplesorts