Thread: 9.3 feature proposal: vacuumdb -j #
Hackers,

It occurs to me that I would find it quite personally useful if the vacuumdb utility was multiprocess capable.

For example, just today I needed to manually analyze a database with over 500 tables, on a server with 24 cores. And I needed to know when the analyze was done, because it was part of a downtime. I had to resort to a python script.

I'm picturing doing this in the simplest way possible: get the list of tables and indexes, divide them by the number of processes, and give each child process its own list.

Any reason not to hack on this for 9.3?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
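For reference, here is a minimal sketch of that simplest scheme as a client-side wrapper around vacuumdb, rather than a change to vacuumdb itself; the use of psycopg2 to list the tables, the database name, and the process count are illustrative assumptions:

# Sketch: the "simplest way": split the table list into N contiguous chunks and
# give each chunk to its own vacuumdb child. Assumes psycopg2 is installed and
# vacuumdb is on PATH; the database name and process count are examples.
import subprocess
import psycopg2

DBNAME = "mydb"
NPROC = 24

conn = psycopg2.connect(dbname=DBNAME)
cur = conn.cursor()
cur.execute("""SELECT quote_ident(schemaname) || '.' || quote_ident(tablename)
                 FROM pg_tables
                WHERE schemaname NOT IN ('pg_catalog', 'information_schema')""")
tables = [r[0] for r in cur.fetchall()]
conn.close()

# Naive division: contiguous chunks, one per child process.
chunksize = max(1, (len(tables) + NPROC - 1) // NPROC)
chunks = [tables[i:i + chunksize] for i in range(0, len(tables), chunksize)]

procs = []
for chunk in chunks:
    cmd = ["vacuumdb", "--analyze-only", "-d", DBNAME]
    for t in chunk:
        cmd += ["-t", t]
    procs.append(subprocess.Popen(cmd))

# Wait for every child, so we know when the whole analyze is finished.
for p in procs:
    p.wait()

As the follow-ups note, the weak spot of this split is that whichever chunk happens to contain the big tables keeps running long after the other children have finished.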
On 13.01.2012 22:50, Josh Berkus wrote:
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
>
> For example, just today I needed to manually analyze a database with
> over 500 tables, on a server with 24 cores. And I needed to know when
> the analyze was done, because it was part of a downtime. I had to
> resort to a python script.
>
> I'm picturing doing this in the simplest way possible: get the list of
> tables and indexes, divide them by the number of processes, and give
> each child process its own list.
>
> Any reason not to hack on this for 9.3?

I don't see any reason not to do it, but plenty of reasons to do it. Right now I have systems hosting many databases that I need to vacuum full from time to time. I have wrapped vacuumdb with a shell script to actually use all the capacity that is available. A vacuumdb -faz just isn't that useful on large machines anymore.

Jan
On Friday, January 13, 2012 10:50:32 PM Josh Berkus wrote:
> Hackers,
>
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
>
> For example, just today I needed to manually analyze a database with
> over 500 tables, on a server with 24 cores. And I needed to know when
> the analyze was done, because it was part of a downtime. I had to
> resort to a python script.
>
> I'm picturing doing this in the simplest way possible: get the list of
> tables and indexes, divide them by the number of processes, and give
> each child process its own list.

That doesn't sound like a good idea. It's way too likely that you will end up with one backend doing all the work because it got some big tables.

I don't think this task deserves using threads or subprocesses. Multiple connections from one process seem way more sensible and mostly avoid the above problem.

Andres
On 13-01-2012 18:50, Josh Berkus wrote:
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
>
It is in the middle of my TODO list. reindexdb is in the plans too.

--
Euler Taveira de Oliveira - Timbira
http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
On Fri, Jan 13, 2012 at 4:50 PM, Josh Berkus <josh@agliodbs.com> wrote:
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
>
> For example, just today I needed to manually analyze a database with
> over 500 tables, on a server with 24 cores. And I needed to know when
> the analyze was done, because it was part of a downtime. I had to
> resort to a python script.
>
> I'm picturing doing this in the simplest way possible: get the list of
> tables and indexes, divide them by the number of processes, and give
> each child process its own list.

I think "simplest" isn't *quite* best... there's the risk that all the big tables get tied to one child, so that one child ends up doing them serially.

Better: Have two logical tasks:
a) A process that manages the list, and
b) Child processes doing vacuums.

Each time a child completes a table, it asks the parent for another one. So if there are 8 big tables and 12 child processes, it's *certain* that the 8 big tables will be spread across the children, and all the child processes stay busy until there are fewer tables left than there are child processes.

--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
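A minimal sketch of the work-queue scheme Christopher describes, again as a client-side Python wrapper (the database name, worker count, and use of psycopg2 are assumptions, not part of any proposed patch); multiprocessing.Pool's lazy dispatch plays the role of the parent handing out the next table:

# Sketch: dynamic work queue. Each worker takes the next table as soon as it is
# free, so the big tables cannot all pile up on one child.
# Assumes psycopg2 is installed; database name and worker count are examples.
import multiprocessing
import psycopg2

DBNAME = "mydb"
NWORKERS = 24

def analyze_one(table):
    # One connection per task keeps the worker trivial; a real tool would keep
    # one connection per worker instead.
    conn = psycopg2.connect(dbname=DBNAME)
    conn.autocommit = True                        # needed if you run VACUUM instead of ANALYZE
    conn.cursor().execute("ANALYZE %s" % table)   # table names are pre-quoted below
    conn.close()
    return table

if __name__ == "__main__":
    conn = psycopg2.connect(dbname=DBNAME)
    cur = conn.cursor()
    cur.execute("""SELECT quote_ident(schemaname) || '.' || quote_ident(tablename)
                     FROM pg_tables
                    WHERE schemaname NOT IN ('pg_catalog', 'information_schema')""")
    tables = [r[0] for r in cur.fetchall()]
    conn.close()

    pool = multiprocessing.Pool(NWORKERS)
    # imap_unordered hands out one table at a time: a worker that finishes early
    # immediately gets the next one, which is the "ask the parent" behaviour.
    for done in pool.imap_unordered(analyze_one, tables):
        print("done:", done)
    pool.close()
    pool.join()

The same structure works with one vacuumdb -t invocation per task instead of a psycopg2 connection; the point is only that the queue is drained dynamically rather than pre-divided.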
On 1/13/12 2:12 PM, Euler Taveira de Oliveira wrote:
> On 13-01-2012 18:50, Josh Berkus wrote:
>> It occurs to me that I would find it quite personally useful if the
>> vacuumdb utility was multiprocess capable.
>>
> It is in the middle of my TODO list. reindexdb is in the plans too.

I'm even happier to have someone else do it. ;-)

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 13.01.2012 22:50, Josh Berkus wrote:
> Hackers,
>
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
>
> For example, just today I needed to manually analyze a database with
> over 500 tables, on a server with 24 cores. And I needed to know when
> the analyze was done, because it was part of a downtime. I had to
> resort to a python script.
>
> I'm picturing doing this in the simplest way possible: get the list of
> tables and indexes, divide them by the number of processes, and give
> each child process its own list.
>
> Any reason not to hack on this for 9.3?
>

Hello,

I like the idea - but ...

I would prefer to have an option where the user is able to tell on how many cores it should be shared. Something like --share-cores=N.

The default would be the total number of cores on the machine, but users should be able to say: OK, my machine has 24 cores, but I want vacuumdb to use just 12 of them.

Especially at startups you find machines that aren't database-only machines. Often the database and web server share a single machine. You could also be running several clusters on the same machine to offer your business in different languages (one cluster per language); I have already seen such a setup.

There might be other work running on those cores, so it should be possible for the user to decide how many cores to share with vacuumdb.

Susanne

--
Dipl. Inf. Susanne Ebrecht - 2ndQuadrant
PostgreSQL Development, 24x7 Support, Training and Services
www.2ndQuadrant.com
On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:
> I would prefer to have an option where the user is able to tell on how many
> cores it should be shared. Something like --share-cores=N.

Uhm, -j # does exactly that, or am I missing your point?

Andres
On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund <andres@anarazel.de> wrote:
> On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:
>> I would prefer to have an option where the user is able to tell on how many
>> cores it should be shared. Something like --share-cores=N.
> Uhm, -j # does exactly that, or am I missing your point?
>

Not really.

If you have 12 cores and you say -j 12, you would have 1 process per core. With Susanne's suggestion, AFAIUI, you could say -j 12 --share-cores=6, so you would only use 6 of the 12 cores and have 2 processes per core.

--
Jaime Casanova
www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
On 01/17/2012 07:33 AM, Jaime Casanova wrote:
> On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund <andres@anarazel.de> wrote:
>> On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:
>>> I would prefer to have an option where the user is able to tell on how many
>>> cores it should be shared. Something like --share-cores=N.
>> Uhm, -j # does exactly that, or am I missing your point?
>>
> Not really.
>
> If you have 12 cores and you say -j 12, you would have 1 process per
> core. With Susanne's suggestion, AFAIUI, you could say -j 12
> --share-cores=6, so you would only use 6 of the 12 cores and have 2
> processes per core.
>

That looks messy. IMNSHO it should work just like pg_restore's -j.

cheers

andrew
On Tuesday, January 17, 2012 01:33:06 PM Jaime Casanova wrote:
> On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund <andres@anarazel.de> wrote:
>> On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:
>>> I would prefer to have an option where the user is able to tell on how
>>> many cores it should be shared. Something like --share-cores=N.
>>
>> Uhm, -j # does exactly that, or am I missing your point?
>
> Not really.
>
> If you have 12 cores and you say -j 12, you would have 1 process per
> core. With Susanne's suggestion, AFAIUI, you could say -j 12
> --share-cores=6, so you would only use 6 of the 12 cores and have 2
> processes per core.

I don't really get what that should do. If vacuumdb itself is a bottleneck in any of this, we did something *very* wrong. (In my opinion, using processes for this is pointless anyway; using async queries seems much easier for this special case, especially for distributing individual commands.)

I don't really see how you could enforce sharing cores on the server side (well, there are cpusets, but we're surely not going to introduce something like that just for vacuumdb).

Andres
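Andres's async-queries point can be sketched from the client side too: a single process opens several connections and multiplexes them with select(), handing each idle connection the next table. psycopg2's asynchronous connection mode is used below as a stand-in for the libpq async calls vacuumdb itself would use; the DSN, the hard-coded table list, and the absence of error handling are simplifying assumptions, not a proposal for the actual implementation.

# Sketch: one process, several asynchronous connections, no threads or children.
# psycopg2's async mode stands in for libpq's PQsendQuery/PQconsumeInput here.
# Assumes psycopg2 >= 2.7 (async_ keyword); DSN, tables and sizing are examples.
import select
import psycopg2
import psycopg2.extensions as ext

DSN = "dbname=mydb"
NCONN = 8
TABLES = ["public.t1", "public.t2", "public.t3"]   # would really come from pg_class

def wait(conn):
    # Drive poll() to completion; only used while the connections are being set up.
    while True:
        state = conn.poll()
        if state == ext.POLL_OK:
            return
        elif state == ext.POLL_READ:
            select.select([conn], [], [])
        elif state == ext.POLL_WRITE:
            select.select([], [conn], [])

conns = []
for _ in range(NCONN):
    c = psycopg2.connect(DSN, async_=True)
    wait(c)                                        # async connect: poll until established
    conns.append(c)

todo = list(TABLES)
running = {}                                       # conn -> (cursor, table, last poll state)

def start_next(conn):
    # Hand an idle connection the next table, if any is left.
    if not todo:
        return
    table = todo.pop(0)
    cur = conn.cursor()
    cur.execute("ANALYZE %s" % table)              # returns immediately on an async connection
    running[conn] = (cur, table, ext.POLL_WRITE)   # may still need flushing; poll() advances it

for c in conns:
    start_next(c)

while running:
    rlist = [c for c, (_, _, s) in running.items() if s == ext.POLL_READ]
    wlist = [c for c, (_, _, s) in running.items() if s == ext.POLL_WRITE]
    ready_r, ready_w, _ = select.select(rlist, wlist, [])
    for conn in ready_r + ready_w:
        cur, table, _ = running[conn]
        state = conn.poll()
        if state == ext.POLL_OK:                   # this connection's ANALYZE is done
            print("done:", table)
            del running[conn]
            start_next(conn)                       # feed it the next table from the shared list
        else:
            running[conn] = (cur, table, state)

for c in conns:
    c.close()

In vacuumdb itself the same loop would presumably sit on top of PQsendQuery, PQconsumeInput and PQisBusy over a set of PGconn handles, keeping everything in one process while still distributing individual commands.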
On Jan 13, 2012, at 4:15 PM, Christopher Browne wrote:
> Have two logical tasks:
> a) A process that manages the list, and
> b) Child processes doing vacuums.
>
> Each time a child completes a table, it asks the parent for another one.

There is also a middle ground, because having the scheduling process sounds like a lot more work than what Josh was proposing.

CREATE TEMP SEQUENCE s;
SELECT relname, s mod <number of backends> AS backend_number
  FROM ( SELECT relname
           FROM pg_class
          ORDER BY relpages
       );

Of course, having an actual scheduling process is most likely the most efficient.

--
Jim C. Nasby, Database Architect
jim@nasby.net
512.569.9461 (cell)
http://jim.nasby.net
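The snippet above is pseudocode (the sequence reference and the bare subquery would need touching up), but the idea, dealing tables out round-robin in size order, is easy to sketch client-side. In the hedged example below the query, database name, and worker count are assumptions; ordering by relpages descending is my choice, so the biggest tables land on different workers up front:

# Sketch: the "middle ground". No scheduler process, but tables are dealt out
# round-robin in descending size order so the big ones go to different workers.
# Assumes psycopg2 and vacuumdb on PATH; database name and worker count are examples.
import subprocess
import psycopg2

DBNAME = "mydb"
NWORKERS = 8

conn = psycopg2.connect(dbname=DBNAME)
cur = conn.cursor()
cur.execute("""SELECT quote_ident(n.nspname) || '.' || quote_ident(c.relname)
                 FROM pg_class c
                 JOIN pg_namespace n ON n.oid = c.relnamespace
                WHERE c.relkind = 'r'
                  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
                ORDER BY c.relpages DESC""")
tables = [r[0] for r in cur.fetchall()]
conn.close()

# backend_number = row_number % NWORKERS, computed client-side instead of with
# the temp sequence in the pseudocode above.
buckets = [tables[i::NWORKERS] for i in range(NWORKERS)]

procs = []
for bucket in buckets:
    if not bucket:
        continue
    cmd = ["vacuumdb", "--analyze-only", "-d", DBNAME]
    for t in bucket:
        cmd += ["-t", t]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()

The big tables start on different workers, but a worker whose bucket empties early just sits idle; that is the gap the scheduling-process approach closes.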
On 01/17/2012 07:09 PM, Jim Nasby wrote:
> On Jan 13, 2012, at 4:15 PM, Christopher Browne wrote:
>> Have two logical tasks:
>> a) A process that manages the list, and
>> b) Child processes doing vacuums.
>>
>> Each time a child completes a table, it asks the parent for another one.
> There is also a middle ground, because having the scheduling process sounds like a lot more work than what Josh was proposing.
>
> CREATE TEMP SEQUENCE s;
> SELECT relname, s mod <number of backends> AS backend_number
>   FROM ( SELECT relname
>            FROM pg_class
>           ORDER BY relpages
>        );
>
> Of course, having an actual scheduling process is most likely the most efficient.

We already have a model for this in parallel pg_restore. It would probably not be terribly hard to adapt it to parallel vacuum.

cheers

andrew