Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoBJtOzhC6pCKrZg0vNiUPbyM-uZ=Xz5BNS_UFgGatj1cw@mail.gmail.com
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: [PoC] Improve dead tuple storage for lazy vacuum
List pgsql-hackers
On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Feb 28, 2023 at 3:42 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> >
> > On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Wed, Feb 22, 2023 at 6:55 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> > > >
> > > > That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
> >
> > > My idea is to make the bug investigation easier but on
> > > reflection, it seems not the best idea given this purpose.
> >
> > My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
> >
> > Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
> >
> > > What do you think
> > > about the attached patch? Please note that it also includes the
> > > changes for minimum memory requirement.
> >
> > Most of the asserts look logical, or at least harmless.
> >
> > - int max_off; /* the maximum offset number */
> > + OffsetNumber max_off; /* the maximum offset number */
> >
> > I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.
>
> Right. I'll separate this change as a separate patch.
>
> >
> > This change, however, defies common sense:
> >
> > +/*
> > + * The minimum amount of memory required by TidStore is 2MB, the current minimum
> > + * valid value for the maintenance_work_mem GUC. This is required to allocate the
> > + * DSA initial segment, 1MB, and some meta data. This number is applied also to
> > + * the local TidStore cases for simplicity.
> > + */
> > +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
> >
> > + /* Sanity check for the max_bytes */
> > + if (max_bytes < TIDSTORE_MIN_MEMORY)
> > + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
> > + TIDSTORE_MIN_MEMORY, max_bytes);
> >
> > Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
> >
> > This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of someday unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
> >
> > But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.
>
> Right.
>
> > It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).
>
> IIUC both don't care about the allocated DSA segment size. Parallel
> hash accounts actual tuple (+ header) size as used memory but doesn't
> consider how much DSA segment is allocated behind. Both parallel hash
> and parallel bitmap scan can work even with work_mem = 64kB, but when
> checking the total DSA segment size allocated during these operations,
> it was 1MB.
>
> I realized that there is a similar memory limit design issue also on
> the non-shared tidstore cases. We deduct 70kB from max_bytes but it
> won't work fine with work_mem = 64kB.  Probably we need to reconsider
> it. FYI 70kB comes from the maximum slab block size for node256.

Currently, we calculate the slab block size so that it can hold 32
chunks. For node256, the leaf node is 2,088 bytes, so the slab block
size is 66,816 bytes. One idea to fix this issue is to decrease the
number of chunks per block: with 16 chunks the slab block size is
33,408 bytes, and with 8 chunks it is 16,704 bytes. I ran a brief
benchmark test comparing the ~70kB and ~16kB block sizes:

* 70kB slab blocks:
select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFFFFFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     143085184 |    1216 |       750
(1 row)

* 16kB slab blocks:
select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFFFFFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793
 mem_allocated | load_ms | search_ms
---------------+---------+-----------
     157601248 |    1220 |       786
(1 row)

There is a slight performance difference, but a smaller slab block
size seems acceptable if there is no better way.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


