Re: maintenance_work_mem used by Vacuum - Mailing list pgsql-hackers
From | Amit Kapila
---|---
Subject | Re: maintenance_work_mem used by Vacuum
Date |
Msg-id | CAA4eK1JhpNsTiHj+JOy3N8uCGyTBMH8xDhUEtBw8ZeCAPRGp6Q@mail.gmail.com
In response to | Re: maintenance_work_mem used by Vacuum (Robert Haas <robertmhaas@gmail.com>)
List | pgsql-hackers
On Tue, Oct 8, 2019 at 12:57 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Oct 6, 2019 at 6:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > We wanted to decide how a parallel vacuum should use memory. Can each
> > worker consume maintenance_work_mem to clean up the GIN index, or should
> > all workers together use no more than maintenance_work_mem? We were
> > thinking of the latter, but before we decide what the right behavior for
> > parallel vacuum is, I thought it would be better to first discuss whether
> > the current memory usage model is right.
>
> Well, I had the idea when we were developing parallel query that we
> should just ignore the problem of work_mem: every node can use X
> amount of work_mem, and if there are multiple copies of the node in
> multiple processes, then you probably end up using more memory. I
> have been informed by Thomas Munro -- in very polite terminology --
> that this was a terrible decision which is causing all kinds of
> problems for users. I haven't actually encountered that situation
> myself, but I don't doubt that it's an issue.
>
> I think it's a lot easier to do better when we're talking about
> maintenance commands rather than queries. Maintenance operations
> typically don't have the problem that queries do with an unknown
> number of nodes using memory; you typically know all of your memory
> needs up front. So it's easier to budget that out across workers or
> whatever. It's a little harder in this case, because you could have
> any number of GIN indexes (1 to infinity) and the amount of memory you
> can use depends not only on how many of them there are but,
> presumably, also on the number of those that are going to be vacuumed
> at the same time. So you might have 8 indexes, 3 workers, and 2 of the
> indexes are GIN. In that case, you know that you can't have more than
> 2 GIN indexes being processed at the same time, but it's likely to be
> only one, and maybe with proper scheduling you could make sure it's
> only one. On the other hand, if you dole out the memory assuming it's
> only 1, what happens if you start that one, then process all 6 of the
> non-GIN indexes, and that one isn't done yet? I guess you could wait
> to start cleanup on the other GIN indexes until the previous index
> cleanup finishes, but that kinda sucks too. So I'm not really sure how
> to handle this particular case. I think the principle of dividing up
> the memory rather than just using more is probably a good one, but
> figuring out exactly how that should work seems tricky.
>

Yeah, and what if we have as many workers as indexes? Doing the cleanup of GIN indexes serially (waiting for the prior index to finish before starting cleanup of the next GIN index) would be bad in that case too.

I think we can do something simple like choosing the minimum among the number of GIN indexes, the number of workers requested for parallel vacuum, and max_parallel_maintenance_workers, and then dividing maintenance_work_mem by that to get the memory used for each GIN index. This has some caveats: for example, we might not be able to launch the number of workers we decided on, in which case we could probably have computed a bigger work_mem value for the GIN indexes to use. I think whatever we pick here will be good for some cases and not-so-good for others, so why not pick something general and simple.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
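The min-then-divide budget Amit proposes is simple arithmetic; the sketch below illustrates it. This is only an illustration of the idea under stated assumptions, not PostgreSQL source: the function names, the KB units, and the example values are hypothetical.

```c
/*
 * Rough sketch of the proposed split: cap the number of concurrent GIN
 * cleanups by the smaller of the GIN index count, the requested parallel
 * degree, and max_parallel_maintenance_workers, then divide
 * maintenance_work_mem evenly across that many indexes.
 * All identifiers here are hypothetical, for illustration only.
 */
#include <stdio.h>

static int
min3(int a, int b, int c)
{
    int m = a < b ? a : b;

    return m < c ? m : c;
}

/* Returns the work_mem budget (in KB) for each GIN index cleanup. */
static int
gin_cleanup_work_mem(int maintenance_work_mem_kb,
                     int num_gin_indexes,
                     int requested_workers,
                     int max_parallel_maintenance_workers)
{
    int divisor = min3(num_gin_indexes,
                       requested_workers,
                       max_parallel_maintenance_workers);

    if (divisor < 1)
        divisor = 1;            /* fall back to the full budget */

    return maintenance_work_mem_kb / divisor;
}

int
main(void)
{
    /*
     * Example: 64MB maintenance_work_mem, 2 GIN indexes, 3 workers
     * requested, max_parallel_maintenance_workers = 2: each GIN index
     * cleanup gets 32MB.
     */
    printf("per-index budget: %d KB\n",
           gin_cleanup_work_mem(64 * 1024, 2, 3, 2));
    return 0;
}
```

The caveat mentioned above shows up directly in this scheme: if fewer workers are actually launched than the divisor assumed, the per-index budget is smaller than it needed to be.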