Hi,
I've spent some time looking at the code and history of this bug
because I also ran into it [1] and I was deeply curious why concurrent
updates worked with a hash join + seq scan but not a nested loop +
index scan.
I was also curious if this bug is why Yuri didn't see deadlocks in
this related bug [2].
FWIW, I found a simpler way to reproduce this:
create table t(id int primary key, count int);
insert into t values (1, 0), (2, 0);
--session 1
begin;
update t set count = count + 1;
--session 2
set enable_seqscan = off;
update t set count = count + 1 from (values (1), (2)) v(id) where t.id = v.id;
Only one update succeeds in session 2.
Tender Wang wrote:
> Dean Rasheed wrote:
>> I'm somewhat conflicted as to which approach is better. I think maybe
>> there is less chance of other unintended side-effects if the set of
>> RTEs included in PlannerGlobal.allRelids, unprunableRelids, and
>> es_unpruned_relids is not changed. However, as it stands,
>> PlannerGlobal.allRelids is misnamed (it should probably have been
>> called "relationRelids", in line with the "relationOids" field).
>> Making it actually include all RTEs would solve that.
> Per the comment, I guess Amit didn't plan to include all RTEs in his design.
While digging, I noticed that allRelids seems only to be used to
create unprunableRelids. But unprunableRelids is used among other
things, to create es_unpruned_relids and as an unpruned_relids
parameter to ExecInitRangeTable. It's difficult to follow all the
paths where the missing relids will matter. So I think Tender's
approach is safer if amended to include all relids. I'm very new to
postgres source code, so this is a very tentative opinion. Amit can
best answer after his holiday.
I was curious enough to test in the meantime. I made an experimental
patch that just adds all rels to allRelids. This fixed all the issues
I'm aware of. It caused overexplain tests to fail by adding in new
non-relation relIds as expected. Apologies if I'm getting the
terminology wrong.
I ran Yuri's reproducer on my patched branch and indeed got several deadlocks.
psql:crash.sql:3: ERROR: deadlock detected
DETAIL: Process 71673 waits for ShareLock on transaction 2766;
blocked by process 71643.
Process 71643 waits for ShareLock on transaction 2777; blocked by process 71673.
Curiously, in the test I added (based on Dean's), enable_indexscan or
enable_seqscan didn't make any difference to the plan. But when I run
the reproducer manually, it does change the plan. Since the bug
disappears with a hash join + seq scan, the plan is critical. So I
included 'set enable_seqscan to 0' for reference and future proofing.
I didn't attempt to fix the comment because I don't understand it. I
also didn't check all my tabs vs spaces, or run TAP tests, as this
patch is just meant for info. Most of the code is still way over my
head, so my apologies if none of this is helpful.
As to why this works for a hash join but not a nested loop, I have
some ideas, but I'm still digging.
Happy New Year!
Bernice
[1]
https://www.postgresql.org/message-id/flat/CAMbWs4-uC8DWRZvByHej%2B9E5em7V_%2BzNyRfFx-UUpR4HwifK4w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/ly6rss443iajiv5cfypzu6fgrmxdbwzdtuujkddr4eis7s2gcq%40gcdytigety6e