Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem - Mailing list pgsql-hackers

From: Claudio Freire
Subject: Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
Date:
Msg-id: CAGTBQpbXt8nj-vPk70YFSEYcXmkZF4COBqxdxVAqzQZqUo5NgA@mail.gmail.com
In response to: Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem (Andres Freund <andres@anarazel.de>)
Responses: Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
List: pgsql-hackers
On Fri, Apr 7, 2017 at 10:12 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-04-07 22:06:13 -0300, Claudio Freire wrote:
>> On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Hi,
>> >
>> >
>> > On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
>> >> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote:
>> >> > Hi,
>> >> >
>> >> > I've *not* read the history of this thread.  So I really might be
>> >> > missing some context.
>> >> >
>> >> >
>> >> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
>> >> >> From: Claudio Freire <klaussfreire@gmail.com>
>> >> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
>> >> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
>> >> >>
>> >> >> Turn the dead_tuples array into a structure composed of several
>> >> >> exponentially bigger arrays, to enable usage of more than 1GB
>> >> >> of work mem during vacuum and thus reduce the number of full
>> >> >> index scans necessary to remove all dead tids when the memory is
>> >> >> available.
>> >> >
>> >> >>   * We are willing to use at most maintenance_work_mem (or perhaps
>> >> >>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>> >> >> - * initially allocate an array of TIDs of that size, with an upper limit that
>> >> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that
>> >> >>   * depends on table size (this limit ensures we don't allocate a huge area
>> >> >> - * uselessly for vacuuming small tables).  If the array threatens to overflow,
>> >> >> - * we suspend the heap scan phase and perform a pass of index cleanup and page
>> >> >> - * compaction, then resume the heap scan with an empty TID array.
>> >> >> + * uselessly for vacuuming small tables). Additional arrays of increasingly
>> >> >> + * large sizes are allocated as they become necessary.
>> >> >> + *
>> >> >> + * The TID array is thus represented as a list of multiple segments of
>> >> >> + * varying size, beginning with the initial size of up to 128MB, and growing
>> >> >> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
>> >> >> + * is used up.
>> >> >
>> >> > When the chunk size is 128MB, I'm a bit unconvinced that using
>> >> > exponential growth is worth it. The allocator overhead can't be
>> >> > meaningful in comparison to collecting 128MB dead tuples, the potential
>> >> > waste is pretty big, and it increases memory fragmentation.
>> >>
>> >> The exponential strategy is mainly to improve lookup time (ie: to
>> >> avoid large segment lists).
>> >
>> > Well, if we were to do binary search on the segment list, that'd not be
>> > necessary.
>>
>> True, but the initial lookup might be slower in the end, since the
>> array would be bigger and cache locality worse.
>>
>> Why do you say exponential growth fragments memory? AFAIK, all those
>> allocations are well beyond the point where malloc starts mmaping
>> memory, so each of those segments should be a mmap segment,
>> independently freeable.
>
> Not all platforms have that, and even on platforms with it, frequent,
> unevenly sized, very large allocations can lead to enough fragmentation
> that further allocations are harder and fragment / enlarge the
> pagetable.

I wouldn't call this frequent. You can get at most slightly more than
a dozen such allocations given the current limits.
And allocation sizes are quite regular - you get 128M or multiples of
128M, so each free block can be reused for N smaller allocations if
needed. I don't think it has much potential to fragment memory.
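
To put rough numbers on that: assuming the first segment holds 128MB of
TIDs and each later segment doubles the previous one (the actual growth
rule in the patch may differ, but it's exponential and 128MB-granular),
a quick back-of-the-envelope program like this one counts how many
allocations it takes to exhaust a given budget:

#include <stdio.h>

/*
 * Back-of-the-envelope sketch, not patch code: count the segment
 * allocations needed to exhaust a maintenance_work_mem budget, assuming
 * the first segment is 128MB and each subsequent one doubles.
 */
int
main(void)
{
    const double initial_mb = 128.0;
    /* 1GB (the old limit), 16GB, and ~2TB (about as high as the GUC goes) */
    const double budgets_mb[] = {1024.0, 16384.0, 2097152.0};

    for (int i = 0; i < 3; i++)
    {
        double  remaining = budgets_mb[i];
        double  seg = initial_mb;
        int     nsegs = 0;

        while (remaining > 0)
        {
            if (seg > remaining)
                seg = remaining;    /* last segment is capped by the budget */
            remaining -= seg;
            nsegs++;
            seg *= 2;               /* assumed doubling growth */
        }
        printf("budget %.0f MB -> %d segments\n", budgets_mb[i], nsegs);
    }
    return 0;
}

That prints 4 segments for 1GB, 8 for 16GB and 15 for ~2TB, which is
where the "slightly more than a dozen" above comes from.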

This isn't significantly different from tuplesort or any other code
that can do big allocations, and where it does differ, the differences
favor less fragmentation, so I don't see why this would need special
treatment.

My point is that it hasn't been simple to get to a point where this
beats the original single binary search even in CPU time.
If we're to scrap this implementation and go for a double binary
search, I'd like a clear, measurable benefit to chase before doing so.
Fragmentation is hard to measure, and I cannot get CPU-bound vacuums on
the test hardware I have, so I can't test lookup performance at big
scales.
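
For reference, a lookup over the segment list (the "double binary
search") would look roughly like the sketch below. It's a simplified
mock-up with made-up names and TIDs reduced to plain integers rather
than ItemPointers, not the patch's code; the outer search is over
per-segment upper bounds and the inner one is within the chosen
segment:

#include <stdbool.h>
#include <stddef.h>

typedef struct DeadTuplesSegment
{
    size_t  num_dead_tuples;    /* TIDs currently stored in this segment */
    long    last_tid;           /* highest TID in the segment */
    long   *dead_tuples;        /* sorted array of TIDs */
} DeadTuplesSegment;

typedef struct DeadTuplesMultiArray
{
    size_t  num_segs;
    DeadTuplesSegment *segs;    /* segments ordered by last_tid */
} DeadTuplesMultiArray;

/* returns true if tid was recorded as dead */
bool
lazy_tid_reaped_sketch(const DeadTuplesMultiArray *dt, long tid)
{
    /* outer binary search: first segment whose last_tid >= tid */
    size_t  lo = 0,
            hi = dt->num_segs;

    while (lo < hi)
    {
        size_t  mid = lo + (hi - lo) / 2;

        if (dt->segs[mid].last_tid < tid)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == dt->num_segs)
        return false;           /* tid is beyond every segment */

    /* inner binary search within the chosen segment */
    const DeadTuplesSegment *seg = &dt->segs[lo];
    size_t  l = 0,
            r = seg->num_dead_tuples;

    while (l < r)
    {
        size_t  mid = l + (r - l) / 2;

        if (seg->dead_tuples[mid] < tid)
            l = mid + 1;
        else
            r = mid;
    }
    return l < seg->num_dead_tuples && seg->dead_tuples[l] == tid;
}

The cache locality concern above is about the outer array: with fixed
128MB segments it grows linearly with the memory used, whereas with
exponential growth it stays at a handful of entries.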

>> >> Yes, the benchmarks are upthread. The earlier runs were run on my
>> >> laptop and made little sense, so I'd ignore them as inaccurate. The
>> >> latest run[1] with a pgbench scale of 4000 gave an improvement in CPU
>> >> time (ie: faster) of about 20%. Anastasia did another one[2] and saw
>> >> improvements as well, roughly 30%, though it's not measuring CPU time
>> >> but rather elapsed time.
>> >
>> > I'd be more concerned about cases that'd already fit into memory, not ones
>> > where we avoid doing another scan - and I think you mostly measured that?
>> >
>> > - Andres
>>
>> Well, scale 400 is pretty much as big as you can get with the old 1GB
>> limit, and also suffered no significant regression. Although, true, it
>> didn't significantly improve either.
>
> Aren't more interesting cases those where not that many dead tuples are
> found, but the indexes are pretty large?  IIRC the index vacuum scans
> still visit every leaf index tuple, no?

Indeed they do, and that's what motivated this patch. But I'd need
TB-sized tables to set up something like that, and I don't have the
hardware or time available to do it (vacuum on bloated TB-sized
tables can take days in my experience). Scale 4000 is as big as I can
get without running out of disk space on my test hardware.

If anybody else has the ability, I'd be thankful if they tested it
under those conditions, but I cannot. I think Anastasia's test is
closer to such a scenario, which is probably why it shows a bigger
improvement in total elapsed time.

Our production database could possibly be used, but it can take about
a week to clone it, upgrade it (it's 9.5 currently), and run the
relevant vacuum.

I did perform tests against the same pgbench databases referenced in
the post I linked earlier, but deleting only a fraction of the rows,
or on uncorrelated indexes. The benchmarks weren't very interesting,
and results were consistent with the linked benchmark (slight CPU time
improvement, just less impactful), so I didn't post them.

I think all those tests show that, if there's a workload that
regresses, it's a rare one, running on very powerful I/O hardware (the
kind needed to make vacuum CPU-bound). And even if that were to happen,
considering that a single (or at least fewer) index scan, even if
slower, generates less WAL traffic to be archived/streamed, it would
still most likely be a win overall.


