On Fri, Oct 29, 2021 at 9:30 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I can propose the debugging patch to reproduce the issue that replaces
> the hang with the assert and modifies a pair of crash-causing test
> scripts to simplify the reproducing. (Sorry, I have no time now to prune
> down the scripts further as I have to leave for a week.)
Just FYI, I tried to reproduce this today on v16, using this formula,
with some hacking around to try to get it working on my MacBook, and I
couldn't get it to crash.
I doubt my failure to reproduce is because anything has been fixed.
It's probably due to some kind of dumbitude on my part. The patch
doesn't apply to the head of v16, unsurprisingly, and a bunch of the
Linux commands don't work here, so it's all kind of a muddle for me
trying to get the same setup in place. I should probably be better at
reproducing things like this than I am.
I also tried to reproduce with a simpler setup where I just ran a
normal pgbench in one terminal and a pgbench running only "vacuum
pgbench_accounts;" in another. What I see is that such a setup never
does a "goto retry;". I tried putting a loop around this code:
    res = HeapTupleSatisfiesVacuum(&tuple, vacrel->cutoffs.OldestXmin,
                                   buf);
...to do the same thing 1000 times, and I do see that if I do that,
the return value of HeapTupleSatisfiesVacuum() sometimes changes part
way through the thousand iterations. But so far I haven't seen a
single instance of it changing to HEAPTUPLE_DEAD. It seems to be all
stuff like HEAPTUPLE_LIVE -> HEAPTUPLE_DELETE_IN_PROGRESS or
HEAPTUPLE_DELETE_IN_PROGRESS -> HEAPTUPLE_RECENTLY_DEAD. There are a
few other combinations that I see appearing too, but the point is that
the new state in this test never seems to be HEAPTUPLE_DEAD, and
therefore there's no retry and no ability to reproduce the bug.
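For anyone who wants to try the same experiment, the loop amounts to
roughly the following. This is only a standalone sketch, not the actual
hack I applied to the server code: fake_check() and FakeTuple are
made-up stand-ins for HeapTupleSatisfiesVacuum() and the tuple,
count_transitions() is a hypothetical helper, and the enum just copies
the names of the HTSV_Result values so that the file compiles on its
own.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Local copy of the HTSV_Result value names, so this sketch builds without
 * any PostgreSQL headers.  Nothing here calls into PostgreSQL.
 */
typedef enum
{
    HEAPTUPLE_DEAD,
    HEAPTUPLE_LIVE,
    HEAPTUPLE_RECENTLY_DEAD,
    HEAPTUPLE_INSERT_IN_PROGRESS,
    HEAPTUPLE_DELETE_IN_PROGRESS
} HTSV_Result;

/* Hypothetical stand-in for "call the visibility check on this tuple". */
typedef HTSV_Result (*check_fn) (void *arg);

/* Hypothetical tuple state: only counts how often it has been checked. */
typedef struct
{
    int         calls;
} FakeTuple;

/*
 * Assumption for illustration: the answer drifts from LIVE to
 * DELETE_IN_PROGRESS to RECENTLY_DEAD as concurrent activity proceeds,
 * but never reaches DEAD.
 */
HTSV_Result
fake_check(void *arg)
{
    FakeTuple  *t = (FakeTuple *) arg;
    int         n = t->calls++;

    if (n < 300)
        return HEAPTUPLE_LIVE;
    if (n < 700)
        return HEAPTUPLE_DELETE_IN_PROGRESS;
    return HEAPTUPLE_RECENTLY_DEAD;
}

/*
 * Repeat the check 'iterations' times.  Returns how often the result
 * changed; *to_dead counts the changes whose new state is HEAPTUPLE_DEAD.
 */
int
count_transitions(check_fn check, void *arg, int iterations, int *to_dead)
{
    HTSV_Result prev;
    int         changes = 0;

    assert(iterations > 0);
    *to_dead = 0;
    prev = check(arg);

    for (int i = 1; i < iterations; i++)
    {
        HTSV_Result cur = check(arg);

        if (cur != prev)
        {
            changes++;
            if (cur == HEAPTUPLE_DEAD)
                (*to_dead)++;
            printf("iteration %d: %d -> %d\n", i, (int) prev, (int) cur);
            prev = cur;
        }
    }
    return changes;
}
```

Calling count_transitions(fake_check, &t, 1000, &to_dead) on a
zero-initialized FakeTuple reports the state changes while leaving
to_dead at zero, which is the pattern I keep seeing: the result moves
around, but never to HEAPTUPLE_DEAD.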
Any ideas?
P.S. See also discussion on the "relfrozenxid may disagree with row
XIDs after 1ccc1e05ae" thread.
--
Robert Haas
EDB: http://www.enterprisedb.com