
From: Masahiko Sawada
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date:
Msg-id: CAD21AoDEuTs8c4j0Pz-YN_yq8DC1B=Tz7P+_zCG0a=eG6QDe2g@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses: Re: [PoC] Improve dead tuple storage for lazy vacuum (Masahiko Sawada <sawada.mshk@gmail.com>)
           Re: [PoC] Improve dead tuple storage for lazy vacuum (John Naylor <john.naylor@enterprisedb.com>)
List: pgsql-hackers
On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Dec 12, 2022 at 7:14 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >
> > On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > > >
> >
> > > > I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in
> > > > more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well*
> > > > IIUC, just do some useful work and not fail).
> > >
> > > The minimum requirement is 2MB. In the PoC patch, TIDStore checks how big
> > > the radix tree is using dsa_get_total_size(). If the size returned by
> > > dsa_get_total_size() (+ some memory used by TIDStore meta information)
> > > exceeds maintenance_work_mem, lazy vacuum starts to do index vacuuming
> > > and heap vacuuming. However, when allocating DSA memory for
> > > radix_tree_control at creation, we allocate 1MB
> > > (DSA_INITIAL_SEGMENT_SIZE) of DSM memory and take the memory required
> > > for radix_tree_control from it, so dsa_get_total_size() returns 1MB
> > > even if no TIDs have been collected yet.
> >
> > 2MB makes sense.
> >
> > If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be
> > driven by blocks allocated. I have an idea on that below.
> >
> > > > Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen
> > > > kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always
> > > > doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could*
> > > > overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
> > >
> > > Right, I think that's no problem in the slab case. In the DSA case, the
> > > new segment size follows a geometric series that approximately doubles
> > > the total storage each time we create a new segment. This behavior
> > > comes from the fact that the underlying DSM system isn't designed for
> > > large numbers of segments.
> >
> > And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area
> > allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >= 8MB,
> > then the area will contain just under a power of two the last time it passes the test. The next segment will
> > bring it to about 3/4 full, like this:
> >
> > maintenance work mem = 256MB, so stop if we go over 128MB:
> >
> > 2*(1+2+4+8+16+32) = 126MB -> keep going
> > 126MB + 64 = 190MB        -> stop
> >
> > That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last
> > segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now
> > (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
>
> Right. In this case, even if we allocate 64MB, we will use only 2088
> bytes of it at most. So the memory space used for vacuum is practically
> limited to about half of maintenance_work_mem.
>
> >
> > And as for the minimum possible maintenance_work_mem, I think this would work with 2MB, if the community is okay
> > with technically going over the limit by a few bytes of overhead if a buildfarm animal is set to that value. I
> > imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM
> > snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.
>
> Looking at other code using DSA, such as tidbitmap.c and nodeHash.c, it
> seems that they look only at the memory that is actually dsa_allocate'd.
> To be exact, we estimate the number of hash buckets based on work_mem
> (and hash_mem_multiplier) and use that as the upper limit. So I've
> confirmed that the result of dsa_get_total_size() can exceed the
> limit. I'm not sure whether that is a known and legitimate usage, but if
> we can follow such usage, we can probably track how much dsa_allocate'd
> memory is used in the radix tree.
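
In code form, that tracking would look roughly like this -- a sketch
only, with illustrative field and function names rather than the actual
patch code, and with locking for the shared case omitted:

typedef struct radix_tree_control
{
    /* ... existing fields ... */
    uint64      mem_used;   /* bytes we have dsa_allocate'd so far */
} radix_tree_control;

/* used at every allocation site in the tree */
static dsa_pointer
rt_alloc(radix_tree *tree, size_t size)
{
    dsa_pointer p = dsa_allocate(tree->area, size);

    tree->ctl->mem_used += size;
    return p;
}

/* the caller passes the size, since dsa_free() doesn't report it */
static void
rt_free(radix_tree *tree, dsa_pointer p, size_t size)
{
    dsa_free(tree->area, p);
    tree->ctl->mem_used -= size;
}

/* what TIDStore would consult instead of dsa_get_total_size() */
static uint64
rt_memory_usage(radix_tree *tree)
{
    return tree->ctl->mem_used;
}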

I've experimented with this idea. The newly added 0008 patch changes
the radix tree so that it counts the memory usage itself, in both the
local and shared cases. As shown below, there is some overhead for that:

w/o 0008 patch

=# select * from bench_load_random_int(1000000)
NOTICE:  num_keys = 1000000, height = 7, n4 = 4970924, n15 = 38277,
n32 = 27205, n125 = 0, n256 = 257
 mem_allocated | load_ms
---------------+---------
     298453544 |     282
(1 row)

w/ 0008 patch

=# select * from bench_load_random_int(1000000)
NOTICE:  num_keys = 1000000, height = 7, n4 = 4970924, n15 = 38277,
n32 = 27205, n125 = 0, n256 = 257
 mem_allocated | load_ms
---------------+---------
     293603184 |     297
(1 row)

Although it adds some overhead, I think this idea is straightforward
and the most practical for users. It also seems consistent with other
components using DSA. We can improve this part in the future for
better memory control, for example by introducing slab-like DSA
memory management.
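
For context, the vacuum side would then be a simple comparison, along
these lines (hypothetical names; maintenance_work_mem is in kilobytes,
hence the conversion):

    /* inside lazy vacuum's heap scan, after inserting a page's TIDs */
    if (TidStoreMemoryUsage(dead_items) >= (Size) maintenance_work_mem * 1024)
    {
        /* over the limit: vacuum indexes and heap, then reuse the store */
        lazy_vacuum(vacrel);
        TidStoreReset(dead_items);
    }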

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

