Re: decoupling table and index vacuum - Mailing list pgsql-hackers

From: Masahiko Sawada
Subject: Re: decoupling table and index vacuum
Date:
Msg-id: CAD21AoB04JYc_z6OGoZU6udW2GqO=zUP9bcgR=HxXMZWgy-9tg@mail.gmail.com
In response to: Re: decoupling table and index vacuum (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Sat, Apr 24, 2021 at 12:22 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Apr 23, 2021 at 7:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I think we can divide the TID fork into 16MB or 32MB chunks like WAL
> > segment files so that we can easily remove old chunks. Regarding the
> > efficient search part, I think we need to consider the case where the
> > TID fork gets bigger than maintenance_work_mem. In that case, during
> > the heap scan, we need to check if the discovered TID exists in a
> > chunk of the TID fork that could be on the disk. Even if all
> > known-dead-TIDs are loaded into an array on the memory, it could get
> > much slower than the current heap scan to bsearch over the array for
> > each dead TID discovered during heap scan. So it would be better to
> > have a way to skip searching by already recorded TIDs. For example,
> > during heap scan or HOT pruning, I think that when marking TIDs dead
> > and recording it to the dead TID fork we can mark them “dead and
> > recorded” instead of just “dead” so that future heap scans can skip
> > those TIDs without existence check.
>
> I'm not very excited about this. If we did this, and if we ever
> generated dead-but-not-recorded TIDs, then we will potentially dirty
> those blocks again later to mark them recorded.

The idea I imagined is that we always mark a TID as recorded at the
same time we mark it dead, so we don't dirty the page again. But,
yes, if we do that, the recorded flag is unnecessary: we can simply
assume that any TID marked dead has been recorded to the TID fork,
and future vacuums can skip TIDs that are already marked dead.
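
To illustrate, here is a rough sketch of what the skip could look
like in a lazy_scan_heap()-style page loop (just illustrative C using
the existing page/line-pointer macros; the pruning logic itself is
elided):

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"
#include "storage/off.h"

/*
 * Sketch only: scan one heap page, skipping line pointers that are
 * already LP_DEAD. Under this scheme a dead line pointer implies the
 * TID was recorded to the TID fork when it was marked dead, so no
 * existence check (bsearch or otherwise) is needed for it.
 */
static void
scan_page_skipping_recorded(Page page)
{
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    for (OffsetNumber offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        if (ItemIdIsDead(itemid))
            continue;           /* already in the TID fork; skip */

        /* ... prune as usual; newly dead TIDs are appended ... */
    }
}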

>
> Also, if bsearch() is a bottleneck, how about just using an O(1)
> algorithm instead of an O(lg n) algorithm, rather than changing the
> on-disk format?
>
> Also, can you clarify exactly what you think the problem case is here?
> It seems to me that:
>
> 1. If we're pruning the heap to collect dead TIDs, we should stop when
> the number of TIDs we've accumulated reaches maintenance_work_mem. It
> is possible that we could find when starting to prune that there are
> *already* more dead TIDs than will fit, because maintenance_work_mem
> might have been reduced since they were gathered. But that's OK: we
> can figure out that there are more than will fit without loading them
> all, and since we shouldn't do additional pruning in this case,
> there's no issue.

The case I'm thinking of is heap pruning and index sanitizing running
concurrently, since you mentioned that such concurrency is one of the
benefits of decoupling the vacuum phases. In that case, one process
is vacuuming indexes using the known-dead TIDs in the TID fork while
another process is appending new dead TIDs to it. We could suspend
heap pruning until the TID fork gets smaller, as you mentioned, but
that seems inefficient.

>
> 2. If we're sanitizing indexes, we should normally discover that there
> are few enough TIDs that we can still fit them all in memory. But if
> that proves not to be the case, again because for example
> maintenance_work_mem has been reduced, then we can handle that with
> multiple index passes just as we do today.

Yeah, there seems to be room for improvement there, but it's not
worse than today. I imagine users will want to set a high
maintenance_work_mem for sanitizing a global index, separately from
the setting used for heap pruning.

>
> 3. If we're going back to the heap to permit TIDs to be recycled by
> setting dead line pointers to unused, we can load in as many of those
> as will fit in maintenance_work_mem, sort them by block number, and go
> through block by block and DTRT. Then, we can release all that memory
> and, if necessary, do the whole thing again. This isn't even
> particularly inefficient.

Agreed.
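
For what it's worth, here is a minimal sketch of one batch of that
recycle pass as I understand it (the function and variable names are
mine, and the heap access itself is elided):

#include "postgres.h"
#include "storage/itemptr.h"

/* Order dead TIDs by (block, offset) so we visit each page once. */
static int
dead_tid_cmp(const void *a, const void *b)
{
    return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
}

/*
 * Process one maintenance_work_mem-sized batch of dead TIDs: sort by
 * block number, then walk the array block by block, setting the dead
 * line pointers on each block to unused.
 */
static void
recycle_batch(ItemPointerData *deadtids, int ntids)
{
    qsort(deadtids, ntids, sizeof(ItemPointerData), dead_tid_cmp);

    for (int i = 0; i < ntids;)
    {
        BlockNumber blkno = ItemPointerGetBlockNumber(&deadtids[i]);

        /* ... set the dead line pointers on blkno to LP_UNUSED ... */

        while (i < ntids &&
               ItemPointerGetBlockNumber(&deadtids[i]) == blkno)
            i++;
    }
}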

Just an idea: while pruning the heap, we could buffer the collected
dead TIDs before writing the TID fork to disk, so that we can sort
the dead TIDs within a chunk (say, a 16MB chunk consisting of 8KB
blocks). We would write the chunk to disk either when it fills up
with dead TIDs or when index sanitizing starts. The latter timing is
needed so we can remember the chunk ID, or the uint64 ID of the dead
TID, indicating how far index sanitizing has removed dead TIDs. One
benefit would be reduced disk I/O on the dead TID fork. Another is
that we would likely complete the recycle phase in a single heap
scan, since we load only one block per chunk while scanning the heap.
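
A very rough sketch of the chunk buffer I have in mind (the 16MB
figure and all names here are illustrative only):

#include "postgres.h"
#include "storage/itemptr.h"

#define TID_CHUNK_BYTES     (16 * 1024 * 1024)
#define TID_CHUNK_MAXTIDS   ((int) (TID_CHUNK_BYTES / sizeof(ItemPointerData)))

typedef struct TidChunkBuffer
{
    uint64          next_tid_id;    /* global ID given to the next TID */
    int             ntids;          /* dead TIDs currently buffered */
    ItemPointerData tids[TID_CHUNK_MAXTIDS];
} TidChunkBuffer;

static int
dead_tid_cmp(const void *a, const void *b)
{
    return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
}

/* Sort the buffered TIDs and append them to the dead TID fork. */
static void
tid_chunk_flush(TidChunkBuffer *buf)
{
    qsort(buf->tids, buf->ntids, sizeof(ItemPointerData), dead_tid_cmp);
    /* ... write the sorted chunk out to the dead TID fork ... */
    buf->next_tid_id += buf->ntids;
    buf->ntids = 0;
}

/*
 * Buffer one dead TID; flush when the chunk fills up. tid_chunk_flush()
 * would also be called when index sanitizing starts, so the sanitizer
 * sees a consistent boundary (next_tid_id) in the fork.
 */
static void
tid_chunk_add(TidChunkBuffer *buf, ItemPointer tid)
{
    buf->tids[buf->ntids++] = *tid;
    if (buf->ntids >= TID_CHUNK_MAXTIDS)
        tid_chunk_flush(buf);
}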

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


