Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From: Masahiko Sawada
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Msg-id: CAD21AoAYA0fnPC7ww2v1USvn3VQ52Zn5XK6gyhWbaXuXn16Unw@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>)
List: pgsql-hackers
On Mon, Dec 19, 2022 at 4:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Dec 12, 2022 at 7:14 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > >
> > > On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > > >
> > >
> > > > > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in
> > > > > more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well*
> > > > > IIUC, just do some useful work and not fail.)
> > > >
> > > > The minimum requirement is 2MB. In the PoC patch, TIDStore checks how
> > > > big the radix tree is using dsa_get_total_size(). If the size returned
> > > > by dsa_get_total_size() (plus some memory used for TIDStore meta
> > > > information) exceeds maintenance_work_mem, lazy vacuum starts to do
> > > > index vacuuming and heap vacuuming. However, when allocating DSA memory
> > > > for radix_tree_control at creation, we allocate 1MB
> > > > (DSA_INITIAL_SEGMENT_SIZE) of DSM memory and take the memory required
> > > > for radix_tree_control from it, so dsa_get_total_size() returns 1MB
> > > > even if no TIDs have been collected yet.
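
A rough sketch of that check, for clarity. The TidStore struct and function
name here are illustrative only, not the PoC patch's actual API;
dsa_get_total_size() and maintenance_work_mem are the real symbols:

typedef struct TidStore
{
    dsa_area   *area;       /* DSA area holding the radix tree */
    Size        meta_size;  /* TIDStore meta information */
} TidStore;

static bool
TidStoreIsFull(TidStore *ts)
{
    /*
     * dsa_get_total_size() counts whole DSM segments, so it reports at
     * least DSA_INITIAL_SEGMENT_SIZE (1MB) even before any TID is stored.
     */
    Size        used = dsa_get_total_size(ts->area) + ts->meta_size;

    return used > (Size) maintenance_work_mem * 1024L;
}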
> > >
> > > 2MB makes sense.
> > >
> > > If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be
> > > driven by blocks allocated. I have an idea on that below.
> > >
> > > > > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen
> > > > > kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always
> > > > > doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could*
> > > > > overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
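
A hedged sketch of that local-array idea; the buffer size, struct, and
function names below are made up for illustration (tidstore_add_tids() and
TidStoreIsFull() are hypothetical, while MaxHeapTuplesPerPage and
ItemPointerData are the real PostgreSQL symbols):

#define LOCAL_TID_BUF_SIZE  8192    /* "a few dozen kilobytes" of TIDs */

typedef struct LocalTidBuffer
{
    int             ntids;
    ItemPointerData tids[LOCAL_TID_BUF_SIZE];
} LocalTidBuffer;

/*
 * Called after pruning each heap page.  Flush only when the next page
 * could overflow the local array; returns true if the caller should stop
 * scanning and run index/heap vacuuming.
 */
static bool
maybe_flush_local_tids(LocalTidBuffer *buf, TidStore *store)
{
    if (buf->ntids + MaxHeapTuplesPerPage > LOCAL_TID_BUF_SIZE)
    {
        tidstore_add_tids(store, buf->tids, buf->ntids);
        buf->ntids = 0;
        return TidStoreIsFull(store);
    }
    return false;
}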
> > > >
> > > > Right, I think that's not a problem in the slab case. In the DSA case,
> > > > the new segment size follows a geometric series that approximately
> > > > doubles the total storage each time we create a new segment. This
> > > > behavior comes from the fact that the underlying DSM system isn't
> > > > designed for large numbers of segments.
> > >
> > > And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area
> > > allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >= 8MB,
> > > then the area will contain just under a power of two the last time it passes the test. The next segment will
> > > bring it to about 3/4 full, like this:
> > >
> > > maintenance work mem = 256MB, so stop if we go over 128MB:
> > >
> > > 2*(1+2+4+8+16+32) = 126MB -> keep going
> > > 126MB + 64 = 190MB        -> stop
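
To make that arithmetic concrete, here is a small standalone sketch (not
PostgreSQL code) that models the segment growth as in the example above,
with two segments at each size before doubling -- an approximation of
dsa.c's actual policy:

#include <stdio.h>

int
main(void)
{
    long    maintenance_work_mem_mb = 256;
    long    limit_mb = maintenance_work_mem_mb / 2;     /* stop threshold */
    long    total_mb = 0;
    long    seg_mb = 1;         /* DSA_INITIAL_SEGMENT_SIZE is 1MB */
    int     i;

    for (i = 0; total_mb <= limit_mb; i++)
    {
        total_mb += seg_mb;
        printf("segment %d: %ldMB, total %ldMB%s\n",
               i, seg_mb, total_mb,
               total_mb > limit_mb ? "  -> stop" : "");
        if (i % 2 == 1)         /* double every other segment */
            seg_mb *= 2;
    }
    return 0;
}

With these inputs it keeps going at 126MB total and stops after the 64MB
segment brings the total to 190MB, matching the numbers above.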
> > >
> > > That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last
> > > segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now
> > > (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
> >
> > Right. In this case, even if we allocate 64MB, we will use only 2088
> > bytes of it at maximum. So I think the memory space used for vacuum is
> > practically limited to about half of maintenance_work_mem.
> >
> > >
> > > And as for the minimum possible maintenance_work_mem, I think this would work with 2MB, if the community is okay
> > > with technically going over the limit by a few bytes of overhead if a buildfarm animal is set to that value. I
> > > imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM
> > > snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.
> >
> > Looking at other code using DSA, such as tidbitmap.c and nodeHash.c, it
> > seems that they look only at memory that is actually dsa_allocate'd.
> > To be exact, we estimate the number of hash buckets based on work_mem
> > (and hash_mem_multiplier) and use that as the upper limit. So I've
> > confirmed that the result of dsa_get_total_size() can exceed the
> > limit. I'm not sure whether that is a known and legitimate usage. If we
> > can follow that usage, we can probably track how much dsa_allocate'd
> > memory is used in the radix tree.
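
A sketch of what such tracking could look like inside the radix tree; the
rt_alloc_node()/rt_free_node() names and the ctl->mem_used field are
illustrative only, while dsa_allocate() and dsa_free() are the real API:

static dsa_pointer
rt_alloc_node(radix_tree *tree, Size size)
{
    dsa_pointer node = dsa_allocate(tree->area, size);

    /* count only memory we explicitly asked the DSA area for */
    tree->ctl->mem_used += size;
    return node;
}

static void
rt_free_node(radix_tree *tree, dsa_pointer node, Size size)
{
    dsa_free(tree->area, node);
    tree->ctl->mem_used -= size;
}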
>
> I've experimented with this idea. The newly added 0008 patch changes
> the radix tree so that it tracks memory usage in both the local and
> shared cases.
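
For context, a memory-usage entry point covering both cases might look
roughly like this (illustrative only, not necessarily how the 0008 patch
implements it; MemoryContextMemAllocated() is the real backend function):

Size
rt_memory_usage(radix_tree *tree)
{
    if (tree->area != NULL)
        return tree->ctl->mem_used;     /* shared: tracked dsa_allocate'd bytes */

    /* local: ask the tree's memory context how much it has allocated */
    return MemoryContextMemAllocated(tree->context, true);
}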

I've attached updated version patches to make cfbot happy.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

