Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From John Naylor
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAFBsxsFH3f6r3hSmXSX1bvsDe7EFGExa9Sf3Ysd5DKJ=C1bxgQ@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum
List pgsql-hackers
On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> >
> > I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault.
>
> Right. I guess we've discussed what we use for calculating the *used*
> memory amount but I don't remember.
>
> I think I was confused by the fact that we use some different
> approaches to calculate the amount of used memory. Parallel hash and
> tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
> in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
> allocated block size.

That's good to know. The latter says:

 * After adding a new group to the hash table, check whether we need to enter
 * spill mode. Allocations may happen without adding new groups (for instance,
 * if the transition state size grows), so this check is imperfect.

I'm willing to claim that vacuum can be imperfect also, given the tid store's properties: 1) on average much more efficient in used space, and 2) no longer bound by the 1GB limit. 

> > I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
>
> Yes, the progress reporting could be confusable. Particularly, in
> shared tidstore cases, the dead_tuple_bytes could be much bigger than
> max_dead_tuple_bytes. Probably what we need might be functions for
> MemoryContext and dsa_area to get the amount of memory that has been
> allocated, by not tracking every chunk space. For example, the
> functions would be like what SlabStats() does; iterate over every
> block and calculates the total/free memory usage.

I'm not sure we need to invent new infrastructure for this. Looking at v29 in vacuumlazy.c, the order of operations for memory accounting is:

First, get the block-level space -- stop and vacuum indexes if we exceed the limit:

/*
 * Consider if we definitely have enough space to process TIDs on page
 * already.  If we are close to overrunning the available space for
 * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
 * this page.
 */
if (TidStoreIsFull(vacrel->dead_items)) --> which is basically "if (TidStoreMemoryUsage(ts) > ts->control->max_bytes)"

Then, after pruning the current page, store the tids and then get the block-level space again:

else if (prunestate.num_offsets > 0)
{
  /* Save details of the LP_DEAD items from the page in dead_items */
  TidStoreSetBlockOffsets(...);

  pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
                               TidStoreMemoryUsage(dead_items));
}

Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.

Thoughts?

But now that I'm looking more closely at the details of memory accounting, I don't like that TidStoreMemoryUsage() is called twice per page pruned (see above). Maybe it wouldn't noticeably slow things down, but it's a bit sloppy. It seems like we should call it once per loop and save the result somewhere. If that's the right way to go, that possibly indicates that TidStoreIsFull() is not a useful interface, at least in this form.

--
John Naylor
EDB: http://www.enterprisedb.com

pgsql-hackers by date:

Previous
From: Pavel Luzanov
Date:
Subject: Re: psql: Add role's membership options to the \du+ command
Next
From: Amit Kapila
Date:
Subject: Re: Deduplicate logicalrep_read_tuple()