Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem - Mailing list pgsql-hackers
From | Claudio Freire
---|---
Subject | Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
Date |
Msg-id | CAGTBQpbXt8nj-vPk70YFSEYcXmkZF4COBqxdxVAqzQZqUo5NgA@mail.gmail.com
In response to | Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem (Andres Freund <andres@anarazel.de>)
Responses | Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
List | pgsql-hackers
On Fri, Apr 7, 2017 at 10:12 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-04-07 22:06:13 -0300, Claudio Freire wrote:
>> On Fri, Apr 7, 2017 at 9:56 PM, Andres Freund <andres@anarazel.de> wrote:
>> > Hi,
>> >
>> >
>> > On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
>> >> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote:
>> >> > Hi,
>> >> >
>> >> > I've *not* read the history of this thread. So I really might be
>> >> > missing some context.
>> >> >
>> >> >
>> >> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
>> >> >> From: Claudio Freire <klaussfreire@gmail.com>
>> >> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
>> >> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
>> >> >>
>> >> >> Turn the dead_tuples array into a structure composed of several
>> >> >> exponentially bigger arrays, to enable usage of more than 1GB
>> >> >> of work mem during vacuum and thus reduce the number of full
>> >> >> index scans necessary to remove all dead tids when the memory is
>> >> >> available.
>> >> >
>> >> >> * We are willing to use at most maintenance_work_mem (or perhaps
>> >> >> * autovacuum_work_mem) memory space to keep track of dead tuples. We
>> >> >> - * initially allocate an array of TIDs of that size, with an upper limit that
>> >> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that
>> >> >> * depends on table size (this limit ensures we don't allocate a huge area
>> >> >> - * uselessly for vacuuming small tables). If the array threatens to overflow,
>> >> >> - * we suspend the heap scan phase and perform a pass of index cleanup and page
>> >> >> - * compaction, then resume the heap scan with an empty TID array.
>> >> >> + * uselessly for vacuuming small tables). Additional arrays of increasingly
>> >> >> + * large sizes are allocated as they become necessary.
>> >> >> + *
>> >> >> + * The TID array is thus represented as a list of multiple segments of
>> >> >> + * varying size, beginning with the initial size of up to 128MB, and growing
>> >> >> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
>> >> >> + * is used up.
>> >> >
>> >> > When the chunk size is 128MB, I'm a bit unconvinced that using
>> >> > exponential growth is worth it. The allocator overhead can't be
>> >> > meaningful in comparison to collecting 128MB dead tuples, the potential
>> >> > waste is pretty big, and it increases memory fragmentation.
>> >>
>> >> The exponential strategy is mainly to improve lookup time (ie: to
>> >> avoid large segment lists).
>> >
>> > Well, if we were to do binary search on the segment list, that'd not be
>> > necessary.
>>
>> True, but the initial lookup might be slower in the end, since the
>> array would be bigger and cache locality worse.
>>
>> Why do you say exponential growth fragments memory? AFAIK, all those
>> allocations are well beyond the point where malloc starts mmaping
>> memory, so each of those segments should be a mmap segment,
>> independently freeable.
>
> Not all platforms have that, and even on platforms with it, frequent,
> unevenly sized, very large allocations can lead to enough fragmentation
> that further allocations are harder and fragment / enlarge the
> pagetable.

I wouldn't call this frequent.

You can get at most slightly more than a dozen such allocations given
the current limits. And allocation sizes are quite regular - you get
128M or multiples of 128M, so each free block can be reused for N
smaller allocations if needed.
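To put rough numbers on that, a throwaway sketch, assuming a plain
doubling policy starting at 128MB (the names and the exact growth rule
here are illustrative, not the patch's actual code):

/*
 * Illustration only: number and size of segment allocations under
 * exponential (doubling) growth from a 128MB first segment, for a
 * given memory budget. Not the patch's actual code.
 */
#include <stdio.h>

int
main(void)
{
    unsigned long long budget = 64ULL << 30;    /* pretend 64GB budget */
    unsigned long long seg = 128ULL << 20;      /* initial 128MB segment */
    int nsegs = 0;

    while (budget > 0)
    {
        if (seg > budget)
            seg = budget;       /* last segment takes the remainder */
        printf("segment %d: %lluMB\n", ++nsegs, seg >> 20);
        budget -= seg;
        seg *= 2;               /* exponential growth */
    }
    printf("total allocations: %d\n", nsegs);
    return 0;
}

Even with a 64GB budget that's only 10 allocations, all of them
power-of-two multiples of 128MB except possibly the trailing remainder.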
I don't think it has much potential to fragment memory.

This isn't significantly different from tuplesort or any other code that
can do big allocations, and the differences favor less fragmentation than
those, so I don't see why this would need special treatment.

My point being that it hasn't been simple to get to a point where this
beats the original single binary search even in CPU time. If we're to
scrap this implementation and go for a double binary search, I'd like to
have a clear measurable benefit to chase from doing so. Fragmentation is
hard to measure, and I cannot get CPU-bound vacuums on the test hardware
I have to test lookup performance at big scales.

>> >> Yes, the benchmarks are upthread. The earlier runs were run on my
>> >> laptop and made little sense, so I'd ignore them as inaccurate. The
>> >> latest run[1] with a pgbench scale of 4000 gave an improvement in CPU
>> >> time (ie: faster) of about 20%. Anastasia did another one[2] and saw
>> >> improvements as well, roughly 30%, though it's not measuring CPU time
>> >> but rather elapsed time.
>> >
>> > I'd be more concerned about cases that'd already fit into memory, not ones
>> > where we avoid doing another scan - and I think you mostly measured that?
>> >
>> > - Andres
>>
>> Well, scale 400 is pretty much as big as you can get with the old 1GB
>> limit, and also suffered no significant regression. Although, true, it
>> didn't significantly improve either.
>
> Aren't more interesting cases those where not that many dead tuples are
> found, but the indexes are pretty large? IIRC the index vacuum scans
> still visit every leaf index tuple, no?

Indeed they do, and that's what motivated this patch. But I'd need
TB-sized tables to set up something like that, and I don't have the
hardware or time available to do it (vacuum on bloated TB-sized tables
can take days in my experience). Scale 4000 is as big as I can get
without running out of space for the tests on my test hardware.

If anybody else has the ability, I'd be thankful if they did test it
under those conditions, but I cannot.

I think Anastasia's test is closer to such a test; that's probably why
it shows a bigger improvement in total elapsed time.

Our production database could possibly be used, but it can take about a
week to clone it, upgrade it (it's on 9.5 currently), and run the
relevant vacuum.

I did perform tests against the same pgbench databases referenced in the
post I linked earlier, but deleting only a fraction of the rows, or on
uncorrelated indexes. The benchmarks weren't very interesting, and the
results were consistent with the linked benchmark (a slight CPU time
improvement, just less pronounced), so I didn't post them.

I think all those tests show that, if there's a workload that regresses,
it's a rare one, running on very powerful I/O hardware (which is what it
takes to make vacuum CPU-bound). And even if that were to happen, a
single (or fewer) index scan, even if slower, causes less WAL traffic
that has to be archived/streamed, so it would still most likely be a win
overall.
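For reference, the "double binary search" shape we're discussing would
look roughly like this. It's a simplified standalone sketch; the names,
types and layout are made up for illustration, not taken from the patch:
binary-search the segment list by each segment's last dead TID, then
binary-search inside the chosen segment.

/*
 * Simplified sketch of a "double binary search" lookup. Segments hold
 * sorted runs of dead TIDs, and the segment list itself is sorted, so a
 * lookup is a binary search over segment boundaries followed by a
 * binary search within one segment. Illustrative names only.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct DeadTid
{
    unsigned int   blkno;       /* heap block number */
    unsigned short offnum;      /* offset within the block */
} DeadTid;

typedef struct TidSegment
{
    DeadTid *tids;              /* sorted array of dead TIDs */
    size_t   ntids;             /* entries used in this segment */
} TidSegment;

static int
tid_cmp(const DeadTid *a, const DeadTid *b)
{
    if (a->blkno != b->blkno)
        return a->blkno < b->blkno ? -1 : 1;
    if (a->offnum != b->offnum)
        return a->offnum < b->offnum ? -1 : 1;
    return 0;
}

/* inner search: is the key present in this one sorted segment? */
static bool
segment_contains(const TidSegment *seg, const DeadTid *key)
{
    size_t lo = 0, hi = seg->ntids;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        int    c = tid_cmp(&seg->tids[mid], key);

        if (c == 0)
            return true;
        if (c < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return false;
}

/* outer search: pick the first segment whose last TID is >= key */
bool
dead_tuples_contains(const TidSegment *segs, size_t nsegs, const DeadTid *key)
{
    size_t lo = 0, hi = nsegs;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        const TidSegment *seg = &segs[mid];

        if (seg->ntids == 0 || tid_cmp(&seg->tids[seg->ntids - 1], key) < 0)
            lo = mid + 1;       /* key sorts after this whole segment */
        else
            hi = mid;
    }
    return lo < nsegs && segment_contains(&segs[lo], key);
}

The outer search is the part that degrades as the segment list grows,
which is the cache-locality trade-off I mentioned above.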