Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune() - Mailing list pgsql-bugs

From Alena Rybakina
Subject Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date
Msg-id 0a994343-c552-4535-a9cf-b4caa4edc1e8@yandex.ru
In response to Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Alena Rybakina <lena.ribackina@yandex.ru>)
Responses Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
List pgsql-bugs
On 02.05.2024 21:01, Alena Rybakina wrote:
On 02.05.2024 19:52, Peter Geoghegan wrote:
> On Sat, Apr 27, 2024 at 10:38 AM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
>> In 17, we don't ever get a new HTSV_Result, so if the tuple is not
>> removed, it would be because HeapTupleSatisfiesVacuumHorizon()
>> returned HEAPTUPLE_RECENTLY_DEAD and, if GlobalVisTestIsRemovableXid()
>> was called, dead_after did not precede GlobalVisState->maybe_needed.
>> This tuple, during this vacuum of the relation, would never be
>> determined to be HEAPTUPLE_DEAD or it would have been removed.
> That makes sense.
>
>> It will always be HEAPTUPLE_RECENTLY_DEAD in 17 and in <= 16, if
>> HeapTupleSatisfiesVacuum() returns HEAPTUPLE_DEAD, we wouldn't call
>> heap_prepare_freeze_tuple() because of the retry loop.
> The retry loop exists precisely because heap_prepare_freeze_tuple()
> isn't prepared to deal with HEAPTUPLE_DEAD tuples. So I agree that
> that won't be allowed to happen on versions that have the retry loop
> (14 - 16).
>> So, it can't happen in back branches. Let's just address 17. Help me
>> understand how this can happen in 17.
> Just to be clear, I never said that it was possible in 17. If I
> somehow implied it, then I didn't mean to.

Hi! I also investigated this issue and reproduced it with a test that I added to the isolation tests. The test inserts 2 tuples, deletes them, runs vacuum, and prints the deleted-tuple and dead-tuple statistics (I attached the test to this email as a patch). For the first 400 or so iterations, I got this result:

 n_dead_tup|n_live_tup|n_tup_del
 ----------+----------+---------
          0|         0|        0
 (1 row)

After 400 or more iterations, I saw the following difference:

 n_dead_tup|n_live_tup|n_tup_del
 ----------+----------+---------
-         0|         0|        0
+         2|         0|        0
 (1 row)


I debugged this and found that the test reports 0 dead tuples when GlobalVisTempRels.maybe_needed is greater than the xmax of the tuple, that is, when the xmax precedes maybe_needed. In the code, this condition is checked in heap_prune_satisfies_vacuum():

else if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
{
     res = HEAPTUPLE_DEAD;
}

But when GlobalVisTempRels.maybe_needed is equal to the xmax of the tuple, vacuum does not touch this tuple, because heap_prune_satisfies_vacuum() returns HEAPTUPLE_RECENTLY_DEAD for it.

Unfortunately, I have not yet found an explanation for why GlobalVisTempRels.maybe_needed does not change after 400 or more iterations. I'm still studying it. Perhaps this information will help you.

I reproduced the problem on REL_16_STABLE.

I reproduced this in the master branch as well, but used a more complex test for it: I added 700 tuples to the table, deleted half of them, and then ran vacuum. I expected to get 350 live tuples and 0 dead and deleted tuples, but after 800 iterations I got 350 dead tuples and 350 live tuples:

 n_dead_tup|n_live_tup|n_tup_del
 ----------+----------+---------
-         0|       350|        0
+       350|       350|        0
 (1 row)

I have added other steps to the test, but so far I have either not seen any failures there or the runs have not reached them.


Just in case, I ran the test with this bash command:

for i in `seq 2000`;do echo "ITER $i"; make -s installcheck -C src/test/isolation/ || break;done

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company