Re: should vacuum's first heap pass be read-only? - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: should vacuum's first heap pass be read-only?
Date
Msg-id CAFiTN-tf=cg8Xxcz5vpDozTQO3Q2tXvqmt4uVPYn_WmZOBDcdA@mail.gmail.com
In response to Re: should vacuum's first heap pass be read-only?  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: should vacuum's first heap pass be read-only?
List pgsql-hackers
On Mon, Feb 7, 2022 at 10:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Feb 4, 2022 at 4:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I had imagined that we'd
> > want to do heap vacuuming in the same way as today with the dead TID
> > conveyor belt stuff -- it just might take several VACUUM operations
> > until we are ready to do a round of heap vacuuming.
>
> I am trying to understand exactly what you are imagining here. Do you
> mean we'd continue to lazy_scan_heap() at the start of every vacuum,
> and lazy_vacuum_heap_rel() at the end? I had assumed that we didn't
> want to do that, because we might already know from the conveyor belt
> that there are some dead TIDs that could be marked unused, and it
> seems strange to just ignore that knowledge at a time when we're
> scanning the heap anyway. However, on reflection, that approach has
> something to recommend it, because it would be somewhat simpler to
> understand what's actually being changed. We could just:
>
> 1. Teach lazy_scan_heap() that it should add TIDs to the conveyor
> belt, if we're using one, unless they're already there, but otherwise
> work as today.
>
> 2. Teach lazy_vacuum_heap_rel() that it, if there is a conveyor belt,
> it should try to clear from the indexes all of the dead TIDs that are
> eligible.
>
> 3. If there is a conveyor belt, use some kind of magic to decide when
> to skip vacuuming some or all indexes. When we skip one or more
> indexes, the subsequent lazy_vacuum_heap_rel() can't possibly mark as
> unused any of the dead TIDs we found this time, so we should just skip
> it, unless somehow there are TIDs on the conveyor belt that were
> already ready to be marked unused at the start of this VACUUM, in
> which case we can still handle those.

Based on this discussion, IIUC, we are saying that we will still run
lazy_scan_heap() on every vacuum, as we do today.  We will then
conditionally skip index vacuuming for some or all of the indexes and,
depending on how much index vacuuming was done, conditionally run
lazy_vacuum_heap_rel().  Is my understanding correct?
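
The flow as I understand it can be sketched roughly as below.  This is
only an illustration of the decision logic, not PostgreSQL code: the
VacuumPlan struct, plan_vacuum(), and its parameters are hypothetical
names, and the sketch assumes a table with at least one index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what one VACUUM would do under the proposal. */
typedef struct VacuumPlan
{
    bool   scan_heap;        /* first pass, i.e. lazy_scan_heap() */
    size_t indexes_vacuumed; /* number of indexes vacuumed this time */
    bool   vacuum_heap;      /* second pass, i.e. lazy_vacuum_heap_rel() */
} VacuumPlan;

/*
 * vacuum_index[i] says whether index i is vacuumed in this VACUUM (the
 * "magic" decision from the thread, made elsewhere).
 * belt_has_ready_tids says whether the conveyor belt already held TIDs
 * that had been removed from every index before this VACUUM started.
 */
VacuumPlan
plan_vacuum(const bool *vacuum_index, size_t nindexes, bool belt_has_ready_tids)
{
    VacuumPlan plan = {true, 0, false}; /* the heap scan always happens */

    assert(vacuum_index != NULL && nindexes > 0);

    for (size_t i = 0; i < nindexes; i++)
    {
        if (vacuum_index[i])
            plan.indexes_vacuumed++;
    }

    /*
     * TIDs found by this heap scan can be marked unused only if every
     * index was vacuumed.  Otherwise the second heap pass is useful only
     * for TIDs that were already ready at the start.
     */
    plan.vacuum_heap = (plan.indexes_vacuumed == nindexes) || belt_has_ready_tids;

    return plan;
}
```

For example, with two indexes of which only one is vacuumed, and nothing
already ready on the conveyor belt, plan.vacuum_heap comes out false, so
the second heap pass is skipped.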

IMHO, if we do the heap scan every time, we are going to find the same
dead items again that we had already collected in the conveyor belt.
I agree that we will not add them to the conveyor belt a second time,
but why store them in the conveyor belt at all if we are going to redo
the whole scan anyway?

I think (without global indexes) the main advantage of the conveyor
belt is that, if we skip the index scan for some of the indexes, we can
save the dead items somewhere, so that we can do the index vacuuming at
some point in the future without scanning the heap again.  But if we
are going to rescan the heap next time before doing any index
vacuuming, why store them at all?

IMHO, what we should do is this: if there are not many new dead tuples
in the heap (total dead tuples according to the statistics, minus the
items already in the conveyor belt), we should conditionally skip the
heap scan (first pass) and jump directly to index vacuuming for some or
all of the indexes, based on how bloated each index is.
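
A minimal sketch of that check, under stated assumptions, might look
like this.  The function name, its parameters, and the min_new_dead
threshold are all hypothetical; how the statistics and the conveyor
belt would actually expose these counts is not settled in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether the first heap pass can be skipped.
 *
 * stats_dead_tuples: dead tuple estimate from the statistics
 * belt_items:        dead items already stored in the conveyor belt
 * min_new_dead:      hypothetical threshold; below it, a heap scan is
 *                    assumed not to be worth the cost
 */
bool
should_skip_first_heap_pass(int64_t stats_dead_tuples,
                            int64_t belt_items,
                            int64_t min_new_dead)
{
    int64_t new_dead;

    assert(stats_dead_tuples >= 0 && belt_items >= 0 && min_new_dead >= 0);

    /* Estimated dead tuples not yet recorded in the conveyor belt. */
    new_dead = stats_dead_tuples - belt_items;

    /* The statistics are only an estimate and may lag behind the belt. */
    if (new_dead < 0)
        new_dead = 0;

    return new_dead < min_new_dead;
}
```

With 1000 dead tuples in the statistics, 950 items on the conveyor belt,
and a threshold of 100, the function returns true and we would go
straight to index vacuuming; with 5000 dead tuples in the statistics it
returns false and the heap scan runs as usual.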

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


