Thread: 9.3 feature proposal: vacuumdb -j #

9.3 feature proposal: vacuumdb -j #

From: Josh Berkus

Hackers,

It occurs to me that I would find it quite personally useful if the
vacuumdb utility was multiprocess capable.

For example, just today I needed to manually analyze a database with
over 500 tables, on a server with 24 cores.   And I needed to know when
the analyze was done, because it was part of a downtime.  I had to
resort to a python script.

I'm picturing doing this in the simplest way possible: get the list of
tables and indexes, divide them by the number of processes, and give
each child process its own list.
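
To make it concrete, roughly this shape (an untested sketch; the
database name and table list are hypothetical, and a real version
would read the list from pg_class):

#!/usr/bin/env python
# Untested sketch of the "simplest way": chop the table list into
# one chunk per worker and let each child analyze its own chunk.
import subprocess
from multiprocessing import Pool

NPROC = 24
TABLES = ["t1", "t2", "t3"]      # hypothetical; really from pg_class

def analyze_chunk(tables):
    for t in tables:
        subprocess.check_call(
            ["psql", "-d", "mydb", "-c", 'ANALYZE "%s";' % t])

if __name__ == "__main__":
    chunks = [TABLES[i::NPROC] for i in range(NPROC)]
    # map() returns only when every chunk is finished, which also
    # tells you when the whole analyze is done.
    Pool(NPROC).map(analyze_chunk, chunks)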

Any reason not to hack on this for 9.3?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: 9.3 feature proposal: vacuumdb -j #

From: Jan Lentfer

On 13.01.2012 22:50, Josh Berkus wrote:
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
>
> For example, just today I needed to manually analyze a database with
> over 500 tables, on a server with 24 cores.   And I needed to know when
> the analyze was done, because it was part of a downtime.  I had to
> resort to a python script.
>
> I'm picturing doing this in the simplest way possible: get the list of
> tables and indexes, divide them by the number of processes, and give
> each child process its own list.
>
> Any reason not to hack on this for 9.3?

I don't see any reason not to do it, but plenty of reasons to do it.
Right now I have systems hosting many databases that I need to vacuum
full from time to time. I have wrapped vacuumdb in a shell script to
actually use all the capacity that is available. A plain vacuumdb -faz
just isn't that useful on large machines anymore.

Jan




Re: 9.3 feature proposal: vacuumdb -j #

From: Andres Freund

On Friday, January 13, 2012 10:50:32 PM Josh Berkus wrote:
> Hackers,
> 
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
> 
> For example, just today I needed to manually analyze a database with
> over 500 tables, on a server with 24 cores.   And I needed to know when
> the analyze was done, because it was part of a downtime.  I had to
> resort to a python script.
> 
> I'm picturing doing this in the simplest way possible: get the list of
> tables and indexes, divide them by the number of processes, and give
> each child process its own list.
That doesn't sound like a good idea. It's way too likely that you will
end up with one backend doing all the work because it got some big tables.

I don't think this task deserves using threads or subprocesses. Driving
multiple connections from one process seems way more sensible and
mostly avoids the above problem.
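
A rough sketch of the single-process shape, here in Python using
psycopg2's asynchronous mode as a stand-in for libpq's
PQsendQuery/PQconsumeInput (untested; the DSN and table names are
hypothetical):

# One process, several asynchronous connections multiplexed with
# select(): whichever connection finishes first is handed the next
# table, so nothing sits idle behind a big table.
import select
import psycopg2
from psycopg2.extensions import POLL_OK, POLL_READ, POLL_WRITE

DSN = "dbname=mydb"                    # hypothetical
TABLES = ["t1", "t2", "t3", "t4"]      # hypothetical; really from pg_class
NCONN = 2

def wait(conn):
    # Standard psycopg2 async loop: drive poll() until the current
    # operation on this connection has completed.
    while True:
        state = conn.poll()
        if state == POLL_OK:
            return
        elif state == POLL_READ:
            select.select([conn], [], [])
        elif state == POLL_WRITE:
            select.select([], [conn], [])

todo = list(TABLES)
busy = []
for _ in range(min(NCONN, len(todo))):
    conn = psycopg2.connect(DSN, async_=1)
    wait(conn)                         # finish the asynchronous connect
    conn.cursor().execute('VACUUM "%s"' % todo.pop())
    busy.append(conn)

while busy:
    # Every connection here has a command in flight and is waiting
    # on the server, so select() until one of them has news for us.
    ready, _, _ = select.select(busy, [], [])
    for conn in ready:
        if conn.poll() != POLL_OK:
            continue                   # still working on its table
        if todo:                       # finished: grab the next table
            conn.cursor().execute('VACUUM "%s"' % todo.pop())
        else:
            busy.remove(conn)
            conn.close()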

Andres


Re: 9.3 feature proposal: vacuumdb -j #

From: Euler Taveira de Oliveira

On 13-01-2012 18:50, Josh Berkus wrote:
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
> 
It is in the middle of my TODO list; reindexdb is in the plans too.


-- 
Euler Taveira de Oliveira - Timbira
http://www.timbira.com.br/
PostgreSQL: Consulting, Development, 24x7 Support and Training


Re: 9.3 feature proposal: vacuumdb -j #

From: Christopher Browne

On Fri, Jan 13, 2012 at 4:50 PM, Josh Berkus <josh@agliodbs.com> wrote:
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
>
> For example, just today I needed to manually analyze a database with
> over 500 tables, on a server with 24 cores.   And I needed to know when
> the analyze was done, because it was part of a downtime.  I had to
> resort to a python script.
>
> I'm picturing doing this in the simplest way possible: get the list of
> tables and indexes, divide them by the number of processes, and give
> each child process its own list.

I think "simplest" isn't *quite* best...

There's the risk that all the big tables get tied to one child, and so
the one child is doing them serially.

Better:

Have two logical tasks:
a) A process that manages the list, and
b) Child processes doing vacuums.

Each time a child completes a table, it asks the parent for another one.

So if there are 8 big tables and 12 child processes, it's *certain*
that the 8 big tables will be spread across the children.

It guarantees that the child processes will all be busy until there
are fewer tables left than there are child processes.
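
An untested Python sketch of the scheme (table names are hypothetical;
a real version would pull the list from pg_class, biggest first):

# The pool hands each child exactly one table at a time, so a child
# that finishes early immediately pulls the next table instead of
# working through a fixed, pre-assigned list.
import subprocess
from multiprocessing import Pool

def vacuum_table(table):
    subprocess.check_call(
        ["psql", "-d", "mydb", "-c", 'VACUUM "%s";' % table])

if __name__ == "__main__":
    tables = ["big1", "big2", "small1", "small2"]   # hypothetical
    pool = Pool(12)                    # 12 child processes
    # chunksize=1 is the "ask the parent for another one" behaviour:
    # each completed table triggers the dispatch of the next.
    for _ in pool.imap_unordered(vacuum_table, tables, chunksize=1):
        pass
    pool.close()
    pool.join()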

--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: 9.3 feature proposal: vacuumdb -j #

From: Josh Berkus

On 1/13/12 2:12 PM, Euler Taveira de Oliveira wrote:
> On 13-01-2012 18:50, Josh Berkus wrote:
>> It occurs to me that I would find it quite personally useful if the
>> vacuumdb utility was multiprocess capable.
>>
> It is in the middle of my TODO list; reindexdb is in the plans too.

I'm even happier to have someone else do it.  ;-)


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: 9.3 feature proposal: vacuumdb -j #

From: Susanne Ebrecht

On 13.01.2012 22:50, Josh Berkus wrote:
> Hackers,
>
> It occurs to me that I would find it quite personally useful if the
> vacuumdb utility was multiprocess capable.
>
> For example, just today I needed to manually analyze a database with
> over 500 tables, on a server with 24 cores.   And I needed to know when
> the analyze was done, because it was part of a downtime.  I had to
> resort to a python script.
>
> I'm picturing doing this in the simplest way possible: get the list of
> tables and indexes, divide them by the number of processes, and give
> each child process its own list.
>
> Any reason not to hack on this for 9.3?
>

Hello,

I like the idea, but I would prefer an option that lets the user say how
many cores it should be shared across. Something like --share-cores=N.

The default would be the machine's total number of cores, but users
should be able to say: my machine has 24 cores, but I want vacuumdb to
use just 12 of them.

Especially at startups you find machines that aren't database-only
machines; often the database and the web server share a single machine.

You could also be running several clusters on the same machine, for
example to offer your business in different languages (one cluster per
language). I have already seen such a setup.

There may be other work competing for the cores, so the user should be
able to decide how many of them vacuumdb gets to use.

Susanne


-- 
Dipl. Inf. Susanne Ebrecht - 2ndQuadrant
PostgreSQL Development, 24x7 Support, Training and Services
www.2ndQuadrant.com



Re: 9.3 feature proposal: vacuumdb -j #

From: Andres Freund

On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:
> I would prefer an option that lets the user say how many cores it
> should be shared across. Something like --share-cores=N.
Uhm. -j # does exactly that or am I missing your point?

Andres


Re: 9.3 feature proposal: vacuumdb -j #

From: Jaime Casanova

On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund <andres@anarazel.de> wrote:
> On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:
>> I would prefer an option that lets the user say how many cores it
>> should be shared across. Something like --share-cores=N.
> Uhm. -j # does exactly that or am I missing your point?
>

Not really.

If you have 12 cores and you say -j 12, you get one process per core.
With Susanne's suggestion, AFAIUI, you could say -j 12 --share-cores=6,
so you would use only 6 of the 12 cores and have 2 processes per core.

--
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación


Re: 9.3 feature proposal: vacuumdb -j #

From: Andrew Dunstan

On 01/17/2012 07:33 AM, Jaime Casanova wrote:
> On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund<andres@anarazel.de>  wrote:
>> On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:
>>> I would prefer an option that lets the user say how many cores it
>>> should be shared across. Something like --share-cores=N.
>> Uhm. -j # does exactly that or am I missing your point?
>>
> Not really.
>
> If you have 12 cores and you say -j 12, you get one process per core.
> With Susanne's suggestion, AFAIUI, you could say -j 12 --share-cores=6,
> so you would use only 6 of the 12 cores and have 2 processes per core.
>

That looks messy. IMNSHO it should work just like pg_restore's -j.

cheers

andrew


Re: 9.3 feature proposal: vacuumdb -j #

From: Andres Freund

On Tuesday, January 17, 2012 01:33:06 PM Jaime Casanova wrote:
> On Tue, Jan 17, 2012 at 7:23 AM, Andres Freund <andres@anarazel.de> wrote:
> > On Tuesday, January 17, 2012 01:18:53 PM Susanne Ebrecht wrote:
> >> I would prefer an option that lets the user say how many cores it
> >> should be shared across. Something like --share-cores=N.
> > 
> > Uhm. -j # does exactly that or am I missing your point?
> 
> Not really.
>
> If you have 12 cores and you say -j 12, you get one process per core.
> With Susanne's suggestion, AFAIUI, you could say -j 12 --share-cores=6,
> so you would use only 6 of the 12 cores and have 2 processes per core.
I don't really get what that would do. If vacuumdb itself were a
bottleneck in any form here, we did something *very* wrong. (In my
opinion using processes for this is pointless anyway; using async
queries seems much easier for this special case, especially for
distributing individual commands.)
I also don't see how you could enforce sharing cores on the server side
(well, there are cpusets, but we're surely not going to introduce use of
those just for vacuumdb).

Andres


Re: 9.3 feature proposal: vacuumdb -j #

From: Jim Nasby

On Jan 13, 2012, at 4:15 PM, Christopher Browne wrote:
> Have two logical tasks:
> a) A process that manages the list, and
> b) Child processes doing vacuums.
>
> Each time a child completes a table, it asks the parent for another one.

There is also a middle ground, because having the scheduling process
sounds like a lot more work than what Josh was proposing.

CREATE TEMP SEQUENCE s;
SELECT relname, nextval('s') % <number of backends> AS backend_number
  FROM ( SELECT relname
           FROM pg_class
          ORDER BY relpages
       ) t;

Of course, having an actual scheduling process is most likely the most efficient.
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: 9.3 feature proposal: vacuumdb -j #

From: Andrew Dunstan

On 01/17/2012 07:09 PM, Jim Nasby wrote:
> On Jan 13, 2012, at 4:15 PM, Christopher Browne wrote:
>> Have two logical tasks:
>> a) A process that manages the list, and
>> b) Child processes doing vacuums.
>>
>> Each time a child completes a table, it asks the parent for another one.
> There is also a middle ground, because having the scheduling process
> sounds like a lot more work than what Josh was proposing.
>
> CREATE TEMP SEQUENCE s;
> SELECT relname, nextval('s') % <number of backends> AS backend_number
>   FROM ( SELECT relname
>            FROM pg_class
>           ORDER BY relpages
>        ) t;
>
> Of course, having an actual scheduling process is most likely the most efficient.

We already have a model for this in parallel pg_restore. It would 
probably not be terribly hard to adapt to parallel vacuum.

cheers

andrew