Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem - Mailing list pgsql-hackers

From: Claudio Freire
Subject: Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
Msg-id: CAGTBQpaTkEVKt_bWC0T9bRXsrHn6SHT=0BYsJcbpvBWm1g2bjQ@mail.gmail.com
In response to: Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Tue, Apr 11, 2017 at 4:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 11, 2017 at 2:59 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Tue, Apr 11, 2017 at 3:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> 1TB / 8kB per page * 60 tuples/page * 20% * 6 bytes/tuple = 9216MB of
>>> maintenance_work_mem
>>>
>>> So we'll allocate 128MB+256MB+512MB+1GB+2GB+4GB which won't be quite
>>> enough so we'll allocate another 8GB, for a total of 16256MB, but more
>>> than three-quarters of that last allocation ends up being wasted.
>>> I've been told on this list before that doubling is the one true way
>>> of increasing the size of an allocated chunk of memory, but I'm still
>>> a bit unconvinced.
>>
>> There you're wrong. The allocation is capped to 1GB, so wastage has an
>> upper bound of 1GB.
>
> Ah, OK.  Sorry, didn't really look at the code.  I stand corrected,
> but then it seems a bit strange to me that the largest and smallest
> allocations are only 8x different. I still don't really understand
> what that buys us.

Basically, it's attacking the problem (which, I think, you mentioned) of
very small systems, where overallocation for small vacuums was an
issue.

The "slow start" behavior of starting with smaller segments tries to
improve the situation for small vacuums, not big ones.

By starting at 128M and growing up to 1GB, overallocation is bounded to
the range 128M-1GB and is proportional to the amount of dead tuples,
not to table size as it was before. Starting at 128M helps the initial
segment search, but I could readily go with starting at 64M; I don't
think it would make a huge difference. Removing the exponential growth,
however, would.
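
To make the growth schedule concrete, here's roughly the shape of it
(just a sketch with made-up names, not the code in the patch):

/* Sketch only (invented names, not the patch code): segment sizes
 * "slow start" at 128MB and double up to a 1GB cap. */
#include <stddef.h>

#define INITIAL_SEG_BYTES ((size_t) 128 * 1024 * 1024)
#define MAX_SEG_BYTES     ((size_t) 1024 * 1024 * 1024)

static size_t
next_segment_bytes(size_t prev_bytes)
{
    if (prev_bytes == 0)
        return INITIAL_SEG_BYTES;   /* first segment: 128MB */
    if (prev_bytes >= MAX_SEG_BYTES / 2)
        return MAX_SEG_BYTES;       /* stop doubling at 1GB */
    return prev_bytes * 2;          /* 256MB, 512MB, ... */
}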

As the patch stands, small systems (say, 32-bit systems) without
overcommit and with slowly-changing data can now set a high m_w_m
without autovacuum reserving too much virtual space, since it reserves
memory proportional to the amount of dead tuples rather than to the
configured limit. Previously, it would reserve all of m_w_m whether it
was needed or not, the only exception being really small tables, so
m_w_m=1GB was unworkable on such systems. Now it should be fine.
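
To put rough numbers on it (using the 6 bytes/tuple figure from above):
with m_w_m = 1GB, the old code would palloc room for roughly 180
million TIDs, about 1GB, up front on any table that isn't tiny, even if
the vacuum only ends up finding, say, 10 million dead tuples (~60MB).
With the patch, that same vacuum reserves a single 128MB segment and
never grows past it.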

> What would we lose if we just made 'em all 128MB?

TBH, not that much. We'd need 8x as many compares to find the right
segment, which forces a switch to binary search over the segments, and
that is less cache-friendly. So: more complex code, less cache
locality, and I'm just not sure what the benefit is given the current
limits.

The only aim of this multiarray approach was to make *virtual address
space reservations* proportional to the amount of memory actually
needed, as opposed to the configured limits. It doesn't need to be a
tight fit, because calling palloc on its own doesn't actually use the
memory, at least for big allocations like these: the OS won't map the
pages until they're first touched. That's true on most modern systems,
and on many ancient ones too.
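
As a tiny illustration of that point (plain C, nothing
PostgreSQL-specific, and assuming a typical OS with lazy page mapping):

/* Sketch: reserving a large chunk doesn't consume physical memory
 * until the pages are actually touched, on typical modern systems. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int
main(void)
{
    size_t  reserved = (size_t) 1024 * 1024 * 1024;   /* reserve 1GB */
    char   *buf = malloc(reserved);

    if (buf == NULL)
        return 1;

    /* Only the pages we touch become resident; the remaining ~896MB
     * stays as untouched virtual address space. */
    memset(buf, 0, (size_t) 128 * 1024 * 1024);

    printf("reserved %zu bytes, touched 128MB\n", reserved);
    free(buf);
    return 0;
}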

In essence, the patch as proposed doesn't *need* a binary search,
because the segment list can only grow to 15 segments at most, and
that's small enough that linear search will outperform (or at least
match) binary search. Reducing the initial segment size wouldn't change
that. If the 12GB limit were lifted, or the maximum segment size
reduced (from 1GB to 128MB, for example), that would change.
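
For reference, the lookup I'm describing is essentially this (a
simplified, self-contained sketch with invented names, not the patch
code; "Tid" stands in for ItemPointerData):

/* Scan the short segment list linearly, then binary-search within the
 * matching segment.  Segments are assumed sorted and non-overlapping. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

typedef struct
{
    Tid     last_tid;   /* highest TID stored in this segment */
    size_t  num_tids;
    Tid    *tids;       /* sorted array of dead TIDs */
} TidSegment;

static int
tid_cmp(const void *a, const void *b)
{
    const Tid *ta = a, *tb = b;

    if (ta->block != tb->block)
        return (ta->block < tb->block) ? -1 : 1;
    if (ta->offset != tb->offset)
        return (ta->offset < tb->offset) ? -1 : 1;
    return 0;
}

static bool
tid_is_dead(const TidSegment *segs, size_t nsegs, const Tid *tid)
{
    /* Linear scan: with at most ~15 segments this beats binary search. */
    for (size_t i = 0; i < nsegs; i++)
    {
        if (tid_cmp(tid, &segs[i].last_tid) <= 0)
            return bsearch(tid, segs[i].tids, segs[i].num_tids,
                           sizeof(Tid), tid_cmp) != NULL;
    }
    return false;
}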

I'd be more in favor of lifting the 12GB limit than of reducing the
maximum segment size, for the reasons above. Raising the 12GB limit has
concrete and readily apparent benefits, whereas using bigger (or
smaller) segments is far more debatable. Yes, lifting it would require
a binary search. But I was hoping that could be a second (or third)
patch, to keep things simple and the benefits measurable.

Also, the plan, as discussed in this very long thread, was to
eventually try turning segments into bitmaps when dead-tuple density is
high enough. That benefits considerably from big segments, since lookup
within a bitmap is O(1): the bigger the segments, the fewer of them
there are, and the faster the overall lookup, because the search over
the segment list would be the dominant cost.
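
If we do get there, the per-segment check would turn into a
constant-time bit test, roughly like this (again only a sketch with
invented names; 291 stands in for MaxHeapTuplesPerPage on 8kB pages):

/* A segment converted to a bitmap over a contiguous range of heap
 * blocks: membership is an O(1) bit test instead of a binary search. */
#include <stdbool.h>
#include <stdint.h>

#define MAX_OFFSETS_PER_BLOCK 291

typedef struct
{
    uint32_t       start_block;  /* first heap block this bitmap covers */
    const uint8_t *bits;         /* one bit per (block, offset) slot */
} TidBitmapSegment;

static bool
bitmap_tid_is_dead(const TidBitmapSegment *seg, uint32_t block, uint16_t offset)
{
    /* offsets are 1-based within a heap page */
    uint64_t slot = (uint64_t) (block - seg->start_block) * MAX_OFFSETS_PER_BLOCK
                    + (offset - 1);

    return (seg->bits[slot / 8] >> (slot % 8)) & 1;
}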

So... what shall we do?

At this point, I've given all my arguments for the current design. If
the more senior developers don't agree, I'll be happy to try your way.


