Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoAYDcPFrYwHphnPHVXqJODKtxJRM_u8fgztRP06d31f1A@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
List pgsql-hackers
On Fri, Mar 3, 2023 at 8:04 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > >
> > > I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that
needsto stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing
thereis what invokes a page fault. 
> >
> > Right. I guess we've discussed what we use for calculating the *used*
> > memory amount but I don't remember.
> >
> > I think I was confused by the fact that we use some different
> > approaches to calculate the amount of used memory. Parallel hash and
> > tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
> > in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
> > allocated block size.
>
> That's good to know. The latter says:
>
>  * After adding a new group to the hash table, check whether we need to enter
>  * spill mode. Allocations may happen without adding new groups (for instance,
>  * if the transition state size grows), so this check is imperfect.
>
> I'm willing to claim that vacuum can be imperfect also, given the tid store's properties: 1) on average much more
efficientin used space, and 2) no longer bound by the 1GB limit. 
>
> > > I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes
biggerthan max_dead_tuple_bytes. 
> >
> > Yes, the progress reporting could be confusable. Particularly, in
> > shared tidstore cases, the dead_tuple_bytes could be much bigger than
> > max_dead_tuple_bytes. Probably what we need might be functions for
> > MemoryContext and dsa_area to get the amount of memory that has been
> > allocated, by not tracking every chunk space. For example, the
> > functions would be like what SlabStats() does; iterate over every
> > block and calculates the total/free memory usage.
>
> I'm not sure we need to invent new infrastructure for this. Looking at v29 in vacuumlazy.c, the order of operations
formemory accounting is: 
>
> First, get the block-level space -- stop and vacuum indexes if we exceed the limit:
>
> /*
>  * Consider if we definitely have enough space to process TIDs on page
>  * already.  If we are close to overrunning the available space for
>  * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
>  * this page.
>  */
> if (TidStoreIsFull(vacrel->dead_items)) --> which is basically "if (TidStoreMemoryUsage(ts) >
ts->control->max_bytes)"
>
> Then, after pruning the current page, store the tids and then get the block-level space again:
>
> else if (prunestate.num_offsets > 0)
> {
>   /* Save details of the LP_DEAD items from the page in dead_items */
>   TidStoreSetBlockOffsets(...);
>
>   pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
>                                TidStoreMemoryUsage(dead_items));
> }
>
> Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the
actionshere, effectively reporting progress for the *last page* and not the current one: First update progress with the
currentmemory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be
writtento. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our
earlierattempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to,
Ithink it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense. 
>
> Thoughts?

It looks to work but it still doesn't work in a case where a shared
tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
from the beginning.

BTW I realized that since the caller can pass dsa_area to tidstore
(and radix tree), if other data are allocated in the same DSA are,
TidStoreMemoryUsage() (and RT_MEMORY_USAGE()) returns the memory usage
that includes not only itself but also other data. Probably it's
better to comment that the passed dsa_area should be dedicated to a
tidstore (or a radix tree).

>
> But now that I'm looking more closely at the details of memory accounting, I don't like that TidStoreMemoryUsage() is
calledtwice per page pruned (see above). Maybe it wouldn't noticeably slow things down, but it's a bit sloppy. It seems
likewe should call it once per loop and save the result somewhere. If that's the right way to go, that possibly
indicatesthat TidStoreIsFull() is not a useful interface, at least in this form. 

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Next
From: Ajin Cherian
Date:
Subject: Re: Support logical replication of DDLs