Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers
From: Masahiko Sawada
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: CAD21AoAYDcPFrYwHphnPHVXqJODKtxJRM_u8fgztRP06d31f1A@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (John Naylor <john.naylor@enterprisedb.com>)
List: pgsql-hackers
On Fri, Mar 3, 2023 at 8:04 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > >
> > > I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault.
> >
> > Right. I guess we've discussed what we use for calculating the *used*
> > memory amount but I don't remember.
> >
> > I think I was confused by the fact that we use some different
> > approaches to calculate the amount of used memory. Parallel hash and
> > tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
> > in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
> > allocated block size.
>
> That's good to know. The latter says:
>
> * After adding a new group to the hash table, check whether we need to enter
> * spill mode. Allocations may happen without adding new groups (for instance,
> * if the transition state size grows), so this check is imperfect.
>
> I'm willing to claim that vacuum can be imperfect also, given the tid store's properties: 1) on average much more efficient in used space, and 2) no longer bound by the 1GB limit.
>
> > > I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
> >
> > Yes, the progress reporting could be confusing. In particular, in
> > shared tidstore cases, the dead_tuple_bytes could be much bigger than
> > max_dead_tuple_bytes. Probably what we need are functions for
> > MemoryContext and dsa_area to get the amount of memory that has been
> > allocated, without tracking every chunk space.
> > For example, the
> > functions would be like what SlabStats() does; iterate over every
> > block and calculate the total/free memory usage.
>
> I'm not sure we need to invent new infrastructure for this. Looking at v29 in vacuumlazy.c, the order of operations for memory accounting is:
>
> First, get the block-level space -- stop and vacuum indexes if we exceed the limit:
>
> /*
>  * Consider if we definitely have enough space to process TIDs on page
>  * already. If we are close to overrunning the available space for
>  * dead_items TIDs, pause and do a cycle of vacuuming before we tackle
>  * this page.
>  */
> if (TidStoreIsFull(vacrel->dead_items))
>
> --> which is basically "if (TidStoreMemoryUsage(ts) > ts->control->max_bytes)"
>
> Then, after pruning the current page, store the tids and then get the block-level space again:
>
> else if (prunestate.num_offsets > 0)
> {
>     /* Save details of the LP_DEAD items from the page in dead_items */
>     TidStoreSetBlockOffsets(...);
>
>     pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
>                                  TidStoreMemoryUsage(dead_items));
> }
>
> Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
>
> Thoughts?

It looks like it works, but it still doesn't work in the case where a shared tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true from the beginning.

BTW I realized that since the caller can pass a dsa_area to the tidstore (and the radix tree), if other data are allocated in the same DSA area, TidStoreMemoryUsage() (and RT_MEMORY_USAGE()) returns a memory usage that includes not only the tidstore itself but also the other data. It's probably better to comment that the passed dsa_area should be dedicated to a tidstore (or a radix tree).

> But now that I'm looking more closely at the details of memory accounting, I don't like that TidStoreMemoryUsage() is called twice per page pruned (see above). Maybe it wouldn't noticeably slow things down, but it's a bit sloppy. It seems like we should call it once per loop and save the result somewhere. If that's the right way to go, that possibly indicates that TidStoreIsFull() is not a useful interface, at least in this form.

Agreed.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com