
From Masahiko Sawada
Subject Re: [PoC] Improve dead tuple storage for lazy vacuum
Date
Msg-id CAD21AoCdnyT+8Zah7JwNsAjebUw65gpG3tqdBHEi7p7rpWZVig@mail.gmail.com
In response to Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
On Fri, Mar 17, 2023 at 4:03 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Tue, Mar 14, 2023 at 8:27 PM John Naylor
> > <john.naylor@enterprisedb.com> wrote:
> > >
> > > I wrote:
> > >
> > > > > > Since the block-level measurement is likely overestimating quite a bit, I propose to
> > > > > > simply reverse the order of the actions here, effectively reporting progress for the
> > > > > > *last page* and not the current one: First update progress with the current memory
> > > > > > usage, then add tids for this page. If this allocated a new block, only a small bit of
> > > > > > that will be written to. If this block pushes it over the limit, we will detect that up
> > > > > > at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but
> > > > > > simpler and less brittle. And, as far as OS pages we have actually written to, I think
> > > > > > it'll effectively respect the memory limit, at least in the local mem case. And the
> > > > > > numbers will make sense.
>
> > > I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not
> > > completely clear if it's right but I've got nothing better. It also ignores the work_mem
> > > issue, but I've given up anticipating all future cases at the moment.
>
> > IIUC you suggested measuring memory usage by tracking how much memory
> > chunks are allocated within a block. If your idea at the top of the
> > page follows this method, it still doesn't deal with the point Andres
> > mentioned.
>
> Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks
> specifically. The re-ordering was just to make sure that progress reporting didn't show
> current-use > max-use.

Right. I still like your re-ordering idea. It's true that most of the last
allocated block is not actually used yet at the point heap scanning stops.
I'm guessing we can just check whether the context's memory usage has gone
over the limit. But I'm concerned that might not work well on systems where
memory overcommit is disabled.
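To make sure we're on the same page, the ordering would be roughly like the
following (just a sketch; dead_items_add(), TidStoreMemoryUsage(),
dead_items_max_bytes, and the progress counter name are placeholders, not
the actual APIs in the patch):

for (blkno = 0; blkno < nblocks; blkno++)
{
    /* detect the overrun caused by the previous page's insertions */
    if (TidStoreMemoryUsage(dead_items) > dead_items_max_bytes)
        break;          /* stop scanning and do index vacuuming */

    /* report the memory used so far, i.e. up to the *last* page ... */
    pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
                                 TidStoreMemoryUsage(dead_items));

    /* ... then add the TIDs collected from the current page */
    dead_items_add(dead_items, blkno, offsets, noffsets);
}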

>
> However, the big question remains DSA, since a new segment can be as large as the entire previous
> set of allocations. It seems it just wasn't designed for things where memory growth is
> unpredictable.
>
> I'm starting to wonder if we need to give DSA a bit more info at the start. Imagine a "soft"
> limit given to the DSA area when it is initialized. If the total segment usage exceeds this, it
> stops doubling and instead new segments get smaller. Modifying an example we used for the
> fudge-factor idea some time ago:
>
> m_w_m = 1GB, so calculate the soft limit to be 512MB and pass it to the DSA area.
>
> 2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> hit soft limit, so "stairstep down" the new segment sizes:
>
> 766 + 2*(128) + 64 = 1086MB -> stop
>
> That's just an undeveloped idea, however, so likely v17 development, even assuming it's not a bad idea (could be).

This is an interesting idea. But I'm concerned we don't have enough time to
become confident about adding this new concept to DSA.
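IIUC the segment sizing would become something like the following (just a
rough sketch to check my understanding; this function does not exist in the
current DSA code):

static size_t
choose_next_segment_size(size_t total_size, size_t prev_segment_size,
                         size_t soft_limit)
{
    size_t      new_size;

    if (total_size < soft_limit)
        new_size = prev_segment_size * 2;   /* keep doubling as today */
    else
        new_size = prev_segment_size / 2;   /* "stairstep down" once over the soft limit */

    /* never go below the current 1MB initial segment size */
    if (new_size < DSA_INITIAL_SEGMENT_SIZE)
        new_size = DSA_INITIAL_SEGMENT_SIZE;

    return new_size;
}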

>
> And sadly, unless we find some other, simpler answer soon for tracking and limiting shared
> memory, the tid store is looking like v17 material.

Another problem we need to deal with is the minimum memory a shared tidstore
can support. Since the initial DSA segment size is 1MB, the memory usage of a
shared tidstore starts at 1MB or more, which is above the minimum values of
both work_mem and maintenance_work_mem, 64kB and 1MB respectively. Raising
the minimum m_w_m to 2MB seems to be acceptable in the community, but that is
not an option for work_mem. One idea is to reject memory limits below 2MB, so
the tidstore simply won't work with small m_w_m settings. That might be an
acceptable restriction at this stage (there is no use case for a tidstore
with work_mem in core), but it would become a blocker for future adoption
such as unifying with tidbitmap.c. Another idea is to let the caller specify
the initial segment size at dsa_create() so that DSA can start with a smaller
segment, say 32kB (a rough sketch is below). That way, a tidstore with a 32kB
limit becomes full as soon as it allocates the next 32kB DSA segment. A
downside of this idea is that it increases the number of DSM segments backing
the DSA area, but assuming it's relatively rare to use such a low work_mem,
that might be acceptable. FYI, the total number of DSM segments available on
the system is calculated by:

/* from src/backend/storage/ipc/dsm.c */
#define PG_DYNSHMEM_FIXED_SLOTS         64
#define PG_DYNSHMEM_SLOTS_PER_BACKEND   5

/* maximum number of DSM segments (control slots) on the system */
maxitems = PG_DYNSHMEM_FIXED_SLOTS
    + PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
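Regarding the second idea, what I have in mind is something like this
(dsa_create() takes only a tranche id today, so the extra argument is
hypothetical and just for illustration):

/* hypothetical signature -- not the current dsa_create() */
extern dsa_area *dsa_create_ext(int tranche_id, size_t init_segment_size);

/* e.g. a shared tidstore with a small memory limit could then do */
dsa_area   *area = dsa_create_ext(tranche_id, 32 * 1024);  /* 32kB initial segment */

The default would presumably stay at the current 1MB so that existing callers
are unaffected.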

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


