Thread: TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
This patch implements the following TODO item:
Allow parallel cores to be used by vacuumdb
http://www.postgresql.org/message-id/4F10A728.7090403@agliodbs.com
Like parallel pg_dump, vacuumdb is provided with an option to run the vacuum of multiple tables in parallel. [ vacuumdb -j ]
1. A new option is provided with vacuumdb to give the number of workers.
2. All workers are started at the beginning, and all wait for vacuum instructions from the master.
3. If a table list is provided to the vacuumdb command using -t, it sends the vacuum of one table to one IDLE worker, the next table to the next IDLE worker, and so on.
4. If vacuum is given for one DB, it executes a select on pg_class to get the table list, fetches the table names one by one, and assigns the vacuum responsibility to IDLE workers.
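For illustration only -- this is not the patch's code -- the per-database case in point 4 amounts to a catalog query along these lines, issued over libpq, after which each returned table is handed to the next idle worker:

    /*
     * Illustrative sketch, not taken from the patch: fetch the ordinary user
     * tables so the master can hand them out one by one to idle workers.
     * Error handling is minimal on purpose.
     */
    #include <stdio.h>
    #include "libpq-fe.h"

    static void
    fetch_table_list(PGconn *conn)
    {
        PGresult   *res = PQexec(conn,
            "SELECT ns.nspname, c.relname "
            "FROM pg_class c JOIN pg_namespace ns ON ns.oid = c.relnamespace "
            "WHERE c.relkind = 'r' "
            "  AND ns.nspname NOT IN ('pg_catalog', 'information_schema')");

        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
            PQclear(res);
            return;
        }

        for (int i = 0; i < PQntuples(res); i++)
            printf("%s.%s\n", PQgetvalue(res, i, 0), PQgetvalue(res, i, 1));

        PQclear(res);
    }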
Performance Data by parallel vacuumdb:
Machine Configuration:
Core : 8
RAM: 24GB
Test Scenario:
16 tables, all with 4M records. [many records are deleted and inserted using some pattern (file is attached in the mail)]
Test Result
{Base Code}
Time(s)   %CPU Usage   Avg Read(kB/s)   Avg Write(kB/s)
521       3%           12000            20000

{With Parallel Vacuum Patch}
Workers   Time(s)   %CPU Usage   Avg Read(kB/s)   Avg Write(kB/s)
1         518       3%           12000            20000   --> this takes the same path as the base code
2         390       5%           14000            30000
8         235       7%           18000            40000
16        197       8%           20000            50000
Conclusion:
Running vacuumdb in parallel increases CPU and I/O throughput and can give a >50% performance improvement.
Work to be Done:
1. Documentation for the new option.
2. Parallel support for vacuuming all databases.
Is it required to move the common code for parallel operation of pg_dump and vacuumdb to one place and reuse it ?
Prototype patch is attached in the mail, please provide your feedback/Suggestions…
Thanks & Regards,
Dilip Kumar
Attachment
On 07-11-2013 09:42, Dilip kumar wrote:

Dilip, this is on my TODO for 9.4. I've already had a half-baked patch for it. Let's see what I can come up with.

> Is it required to move the common code for parallel operation of pg_dump and vacuumdb to one place and reuse it ?
>
I'm not sure about that because the pg_dump parallel code is tied to TOC entries. Also, dependency matters for pg_dump, while in the scripts case a chosen order will be used. However, vacuumdb can share the parallel code with clusterdb and reindexdb (my patch does it). Of course, a refactor to unify the parallel code (pg_dump and scripts) can be done in a separate patch.

> Prototype patch is attached in the mail, please provide your feedback/Suggestions...
>
I'll try to merge your patch with the one I have here until the next CF.

--
Euler Taveira
Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
On 07.11.2013 12:42, Dilip kumar wrote:
This patch implementing the following TODO item
Allow parallel cores to be used by vacuumdb
http://www.postgresql.org/message-id/4F10A728.7090403@agliodbs.com
Like Parallel pg_dump, vacuumdb is provided with the option to run the vacuum of multiple tables in parallel. [ vacuumdb -j ]
1. One new option is provided with vacuumdb to give the number of workers.
2. All worker will be started in beginning and all will be waiting for the vacuum instruction from the master.
3. Now, if table list is provided in vacuumdb command using -t then, it will send the vacuum of one table to one of the IDLE worker, next table to next IDLE worker and so on.
4. If vacuum is given for one DB then, it will execute select on pg_class to get the table list and fetch the table name one by one and also assign the vacuum responsibility to IDLE workers.
[...]
For this use case, would it make sense to queue work (tables) in order of their size, starting on the largest one?
For the case where you have tables of varying size this would lead to a reduced overall processing time, as it prevents large (read: long processing time) tables from being processed in the last step. Processing large tables first and filling up "processing slots/jobs", as they become free, with smaller tables one after the other would save overall execution time.
Regards
Jan
--
professional: http://www.oscar-consult.de
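For illustration only (the modified patch's actual query may differ), Jan's size-ordering suggestion comes down to adding an ORDER BY to whatever catalog query builds the task list, for example:

    /* Illustrative only: list user tables, largest first, so the longest-running
     * vacuums are started early rather than left for the final slot. */
    const char *table_list_query =
        "SELECT ns.nspname, c.relname "
        "FROM pg_class c JOIN pg_namespace ns ON ns.oid = c.relnamespace "
        "WHERE c.relkind = 'r' "
        "  AND ns.nspname NOT IN ('pg_catalog', 'information_schema') "
        "ORDER BY pg_total_relation_size(c.oid) DESC";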
On 08 November 2013 03:22, Euler Taveira Wrote
> On 07-11-2013 09:42, Dilip kumar wrote:
>
> Dilip, this is on my TODO for 9.4. I've already had a half-baked patch for it. Let's see what I can come up with.

Ok, let me know if I can contribute to this..

> > Is it required to move the common code for parallel operation of pg_dump and vacuumdb to one place and reuse it ?
> >
> I'm not sure about that because the pg_dump parallel code is tied to TOC entries. Also, dependency matters for pg_dump, while in the scripts case a chosen order will be used. However, vacuumdb can share the parallel code with clusterdb and reindexdb (my patch does it).

+1

> Of course, a refactor to unify the parallel code (pg_dump and scripts) can be done in a separate patch.
>
> > Prototype patch is attached in the mail, please provide your feedback/Suggestions...
> >
> I'll try to merge your patch with the one I have here until the next CF.

Regards,
Dilip
On 08 November 2013 13:38, Jan Lentfer
> For this use case, would it make sense to queue work (tables) in order of their size, starting on the largest one?
> For the case where you have tables of varying size this would lead to a reduced overall processing time as it prevents large (read: long processing time) tables to be processed in the last step. While processing large tables at first and filling up "processing slots/jobs" when they get free with smaller tables one after the other would save overall execution time.
Good point, I have made the change and attached the modified patch.
Regards,
Dilip
Attachment
On 08-11-2013 05:07, Jan Lentfer wrote:
> For the case where you have tables of varying size this would lead to a reduced overall processing time as it prevents large (read: long processing time) tables to be processed in the last step. While processing large tables at first and filling up "processing slots/jobs" when they get free with smaller tables one after the other would save overall execution time.
>
That is certainly a good strategy (not the optimal [1] -- that is hard to achieve). Also, the strategy must: (i) consider the relation age before size (for vacuum); (ii) consider that you can't pick indexes for the same relation (for reindex).

[1] http://www.postgresql.org/message-id/CA+TgmobwxqsagXKtyQ1S8+gMpqxF_MLXv=4350tFZVqAwKEqgQ@mail.gmail.com

--
Euler Taveira
Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
On Thu, Nov 7, 2013 at 8:42 PM, Dilip kumar <dilip.kumar@huawei.com> wrote: > This patch implementing the following TODO item > > Allow parallel cores to be used by vacuumdb > http://www.postgresql.org/message-id/4F10A728.7090403@agliodbs.com Cool. Could you add this patch to the next commit fest for 9.4? It begins officially in a couple of days. Here is the URL to it: https://commitfest.postgresql.org/action/commitfest_view?id=20 Regards, -- Michael
On 08-11-2013 06:20, Dilip kumar wrote:
> On 08 November 2013 13:38, Jan Lentfer
>
> >> For this use case, would it make sense to queue work (tables) in order of their size, starting on the largest one?
>
> >> For the case where you have tables of varying size this would lead to a reduced overall processing time as it prevents large (read: long processing time) tables to be processed in the last step. While processing large tables at first and filling up "processing slots/jobs" when they get free with smaller tables one after the other would save overall execution time.
> Good point, I have made the change and attached the modified patch.
>
Don't you submit it for a CF, do you? Is it too late for this CF?

--
Euler Taveira
Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Euler Taveira wrote:
> On 08-11-2013 06:20, Dilip kumar wrote:
> > On 08 November 2013 13:38, Jan Lentfer
> >
> > >> For this use case, would it make sense to queue work (tables) in order of their size, starting on the largest one?
> >
> > >> For the case where you have tables of varying size this would lead to a reduced overall processing time as it prevents large (read: long processing time) tables to be processed in the last step. While processing large tables at first and filling up "processing slots/jobs" when they get free with smaller tables one after the other would save overall execution time.
> > Good point, I have made the change and attached the modified patch.
> >
> Don't you submit it for a CF, do you? Is it too late for this CF?

Not too late.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 16 January 2014 19:53, Euler Taveira Wrote,
>
> >> For the case where you have tables of varying size this would lead to a reduced overall processing time as it prevents large (read: long processing time) tables to be processed in the last step. While processing large tables at first and filling up "processing slots/jobs" when they get free with smaller tables one after the other would save overall execution time.
> > Good point, I have made the change and attached the modified patch.
> >
> Don't you submit it for a CF, do you? Is it too late for this CF?

Attached the latest updated patch:
1. Rebased the patch to current GIT head.
2. Doc is updated.
3. Supported parallel execution for all db option also.

Same I will add to the current open commitfest..

Regards,
Dilip
Attachment
On Fri, Mar 21, 2014 at 12:48 AM, Dilip kumar <dilip.kumar@huawei.com> wrote:
> On 16 January 2014 19:53, Euler Taveira Wrote,
>>
>> >> For the case where you have tables of varying size this would lead to a reduced overall processing time as it prevents large (read: long processing time) tables to be processed in the last step. While processing large tables at first and filling up "processing slots/jobs" when they get free with smaller tables one after the other would save overall execution time.
>> > Good point, I have made the change and attached the modified patch.
>> >
>> Don't you submit it for a CF, do you? Is it too late for this CF?
>
> Attached the latest updated patch
> 1. Rebased the patch to current GIT head.
> 2. Doc is updated.
> 3. Supported parallel execution for all db option also.

This patch needs to be rebased after the analyze-in-stages patch, c92c3d50d7fbe7391b5fc864b44434.

Although that patch still needs some work itself, despite being committed, as it still loops over the stages for each db, rather than the dbs for each stage.

So I don't know if this patch is really reviewable at this point, as it is not clear how those things are going to interact with each other.

Cheers,
Jeff
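To illustrate Jeff's point about the loop order (run_analyze_stage() and the surrounding names are made up for this sketch; neither function is taken from either patch):

    #define NUM_STAGES 3

    /* hypothetical helper: run one ANALYZE stage against one database */
    extern void run_analyze_stage(const char *dbname, int stage);

    /*
     * Committed behaviour: databases outer, stages inner -- the first database
     * gets all three stages before the second database sees any statistics.
     */
    static void
    analyze_per_database(const char **dbnames, int ndbs)
    {
        for (int d = 0; d < ndbs; d++)
            for (int s = 0; s < NUM_STAGES; s++)
                run_analyze_stage(dbnames[d], s);
    }

    /*
     * Intended behaviour: stages outer, databases inner -- every database gets
     * rough statistics quickly, then progressively better ones.
     */
    static void
    analyze_per_stage(const char **dbnames, int ndbs)
    {
        for (int s = 0; s < NUM_STAGES; s++)
            for (int d = 0; d < ndbs; d++)
                run_analyze_stage(dbnames[d], s);
    }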
On 24 June 2014 03:31, Jeff Wrote,
> > Attached the latest updated patch
> > 1. Rebased the patch to current GIT head.
> > 2. Doc is updated.
> > 3. Supported parallel execution for all db option also.
>
> This patch needs to be rebased after the analyze-in-stages patch, c92c3d50d7fbe7391b5fc864b44434.

Thank you for giving your attention to this, I will rebase this..

> Although that patch still needs some work itself, despite being committed, as it still loops over the stages for each db, rather than the dbs for each stage.

If I understood your comment properly, you mean that in vacuum_all_databases, instead of running all DBs in parallel, we are running db by db in parallel? I think we can fix this..

> So I don't know if this patch is really reviewable at this point, as it is not clear how those things are going to interact with each other.

Exactly which points do you want to mention here?

Regards,
Dilip Kumar
On 24 June 2014 11:02, Jeff Wrote,
> I mean that the other commit, the one conflicting with your patch, is still not finished. It probably would not have been committed if we realized the problem at the time. That other patch runs analyze in stages at different settings of default_statistics_target, but it has the loops in the wrong order, so it analyzes one database in all three stages, then moves to the next database. I think that these two changes are going to interact with each other. But I can't predict right now what that interaction will look like. So it is hard for me to evaluate your patch, until the other one is resolved.
>
> Normally I would evaluate your patch in isolation, but since the conflicting patch is already committed (and is in the 9.4 branch) that would probably not be very useful in this case.

Ok, got your point, I will also try to think how these two patches can interact together..

Regards,
Dilip
I got following FAILED when I patched v3 to HEAD.

$ patch -d. -p1 < ../patch/vacuumdb_parallel_v3.patch
patching file doc/src/sgml/ref/vacuumdb.sgml
Hunk #1 succeeded at 224 (offset 20 lines).
patching file src/bin/scripts/Makefile
Hunk #2 succeeded at 65 with fuzz 2 (offset -1 lines).
patching file src/bin/scripts/vac_parallel.c
patching file src/bin/scripts/vac_parallel.h
patching file src/bin/scripts/vacuumdb.c
Hunk #3 succeeded at 61 with fuzz 2.
Hunk #4 succeeded at 87 (offset 2 lines).
Hunk #5 succeeded at 143 (offset 2 lines).
Hunk #6 succeeded at 158 (offset 5 lines).
Hunk #7 succeeded at 214 with fuzz 2 (offset 5 lines).
Hunk #8 FAILED at 223.
Hunk #9 succeeded at 374 with fuzz 1 (offset 35 lines).
Hunk #10 FAILED at 360.
Hunk #11 FAILED at 387.
3 out of 11 hunks FAILED -- saving rejects to file src/bin/scripts/vacuumdb.c.rej
On Friday, March 21, 2014, Dilip kumar <dilip.kumar@huawei.com> wrote:
On 16 January 2014 19:53, Euler Taveira Wrote,
> >
> >> For the case where you have tables of varying size this would lead
> to a reduced overall processing time as it prevents large (read: long
> processing time) tables to be processed in the last step. While
> processing large tables at first and filling up "processing slots/jobs"
> when they get free with smaller tables one after the other would safe
> overall execution time.
> > Good point, I have made the change and attached the modified patch.
> >
> Don't you submit it for a CF, do you? Is it too late for this CF?
Attached the latest updated patch
1. Rebased the patch to current GIT head.
2. Doc is updated.
3. Supported parallel execution for all db option also.
Same I will add to current open commitfest..
Regards,
Dilip
--
Regards,
-------
Sawada Masahiko
On 25 June 2014 23:37 Sawada Masahiko Wrote
>I got following FAILED when I patched v3 to HEAD.
>$ patch -d. -p1 < ../patch/vacuumdb_parallel_v3.patch
>patching file doc/src/sgml/ref/vacuumdb.sgml
>Hunk #1 succeeded at 224 (offset 20 lines).
>patching file src/bin/scripts/Makefile
>Hunk #2 succeeded at 65 with fuzz 2 (offset -1 lines).
>patching file src/bin/scripts/vac_parallel.c
>patching file src/bin/scripts/vac_parallel.h
>patching file src/bin/scripts/vacuumdb.c
>Hunk #3 succeeded at 61 with fuzz 2.
>Hunk #4 succeeded at 87 (offset 2 lines).
>Hunk #5 succeeded at 143 (offset 2 lines).
>Hunk #6 succeeded at 158 (offset 5 lines).
>Hunk #7 succeeded at 214 with fuzz 2 (offset 5 lines).
>Hunk #8 FAILED at 223.
>Hunk #9 succeeded at 374 with fuzz 1 (offset 35 lines).
>Hunk #10 FAILED at 360.
>Hunk #11 FAILED at 387.
>3 out of 11 hunks FAILED -- saving rejects to file src/bin/scripts/vacuumdb.c.rej
Thank you for giving your time, Please review the updated patch attached in the mail.
1. Rebased the patch
2. Implemented parallel execution for new option --analyze-in-stages
Regards,
Dilip Kumar
Attachment
On Thu, Jun 26, 2014 at 2:35 AM, Dilip kumar <dilip.kumar@huawei.com> wrote:
> > Thank you for giving your time, Please review the updated patch attached in the mail.
> > 1. Rebased the patch
> > 2. Implemented parallel execution for new option --analyze-in-stages

Hi Dilip,

Thanks for rebasing. I haven't done an architectural or code review on it, I just applied it and used it a little on Linux.

Based on that, I find most importantly that it doesn't seem to correctly vacuum tables which have upper case letters in the name, because it does not quote the table names when they need quotes.

Of course that needs to be fixed, but taking it as it is, the resulting error message to the console is just:

: Execute failed

Which is not very informative. I get the same error if I do a "pg_ctl shutdown -mi" while running the parallel vacuumdb. Without the -j option it produces a more descriptive error message "FATAL: terminating connection due to administrator command", so something about the new feature suppresses the informative error messages.

I get some compiler warnings with the new patch:

vac_parallel.c: In function 'parallel_msg_master':
vac_parallel.c:147: warning: function might be possible candidate for 'gnu_printf' format attribute
vac_parallel.c:147: warning: function might be possible candidate for 'gnu_printf' format attribute
vac_parallel.c: In function 'exit_horribly':
vac_parallel.c:1071: warning: 'noreturn' function does return

In the usage message, the string has a tab embedded within it (immediately before "use") that should be converted to literal spaces, otherwise the output of --help gets misaligned:

printf(_(" -j, --jobs=NUM use this many parallel jobs to vacuum\n"));

Thanks for the work on this.

Cheers,
Jeff
On 27 June 2014 02:57, Jeff Wrote,
> Based on that, I find most importantly that it doesn't seem to correctly vacuum tables which have upper case letters in the name, because it does not quote the table names when they need quotes.

Thanks for your comments....

There are two problems.
First -> When doing the vacuum of the complete database, if any table has an upper case letter it was giving an error.
--FIXED by adding quotes for the table name
Second -> When the user passes the table using the -t option, and it has an uppercase letter.
--This is an existing problem (without the parallel implementation). One solution to this is to always add quotes to the relation name passed by the user, but this can break existing applications for some users..

> Of course that needs to be fixed, but taking it as it is, the resulting error message to the console is just:
FIXED

> Which is not very informative. I get the same error if I do a "pg_ctl shutdown -mi" while running the parallel vacuumdb. Without the -j option it produces a more descriptive error message "FATAL: terminating connection due to administrator command", so something about the new feature suppresses the informative error messages.
>
> I get some compiler warnings with the new patch:
>
> vac_parallel.c: In function 'parallel_msg_master':
> vac_parallel.c:147: warning: function might be possible candidate for 'gnu_printf' format attribute
> vac_parallel.c:147: warning: function might be possible candidate for 'gnu_printf' format attribute
> vac_parallel.c: In function 'exit_horribly':
> vac_parallel.c:1071: warning: 'noreturn' function does return
FIXED

> In the usage message, the string has a tab embedded within it (immediately before "use") that should be converted to literal spaces, otherwise the output of --help gets misaligned:
>
> printf(_(" -j, --jobs=NUM use this many parallel jobs to vacuum\n"));
FIXED

Updated patch is attached in the mail..

Thanks & Regards,
Dilip Kumar
Attachment
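A hedged sketch of the kind of client-side quoting fix being described here; the function name is made up, but libpq's PQescapeIdentifier (available since 9.0) is one way to quote a mixed-case relation name, and the schema-qualified form that comes up later in the thread works the same way:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "libpq-fe.h"

    /* Illustrative only: build a properly quoted  VACUUM "Nsp"."MixedCase";  */
    static char *
    build_vacuum_command(PGconn *conn, const char *nspname, const char *relname)
    {
        char   *qnsp = PQescapeIdentifier(conn, nspname, strlen(nspname));
        char   *qrel = PQescapeIdentifier(conn, relname, strlen(relname));
        char   *cmd = NULL;

        if (qnsp != NULL && qrel != NULL)
        {
            cmd = malloc(strlen(qnsp) + strlen(qrel) + 16);
            if (cmd != NULL)
                sprintf(cmd, "VACUUM %s.%s;", qnsp, qrel);
        }

        PQfreemem(qnsp);
        PQfreemem(qrel);
        return cmd;             /* caller frees with free(); NULL on error */
    }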
On Fri, Jun 27, 2014 at 4:10 AM, Dilip kumar <dilip.kumar@huawei.com> wrote:
...
> Updated patch is attached in the mail..

Thanks Dilip.

I get a compiler warning when building on Windows. When I started looking into that, I see that two files have too much code duplication between them:

src/bin/scripts/vac_parallel.c (new file)
src/bin/pg_dump/parallel.c (existing file)

In particular, pgpipe is almost an exact duplicate between them, except the copy in vac_parallel.c has fallen behind changes made to parallel.c. (Those changes would have fixed the Windows warnings). I think that this function (and perhaps other parts as well--"exit_horribly" for example) needs to be refactored into a common file that both files can include. I don't know where the best place for that would be, though. (I haven't done this type of refactoring myself.)

Also, there are several places in the patch which use spaces for indentation where tabs are called for by the coding style. It looks like you may have copied the code from one terminal window and copied it into another one, converting tabs to spaces in the process. This makes it hard to evaluate the amount of code duplication.

In some places the code spins in a tight loop while waiting for a worker process to become free. If I strace the process, I got a long list of selects with 0 time outs:

select(13, [6 8 10 12], NULL, NULL, {0, 0}) = 0 (Timeout)

I have not tried to track down the code that causes it. I did notice that vacuumdb spends an awful lot of time at the top of the Linux "top" output, and this is probably why.

Cheers,
Jeff
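As a side note on the busy-wait Jeff observed: with a NULL timeout, select(2) blocks until a worker socket becomes readable, instead of returning immediately the way the {0, 0} timeout in the strace output does. A minimal sketch (illustrative names, not the patch's code):

    #include <sys/select.h>

    /* Illustrative only: block until at least one worker socket is readable. */
    static int
    wait_for_any_worker(const int *sockets, int nworkers)
    {
        fd_set      readfds;
        int         maxfd = -1;

        FD_ZERO(&readfds);
        for (int i = 0; i < nworkers; i++)
        {
            FD_SET(sockets[i], &readfds);
            if (sockets[i] > maxfd)
                maxfd = sockets[i];
        }

        /* NULL timeout => sleep here rather than spinning */
        return select(maxfd + 1, &readfds, NULL, NULL, NULL);
    }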
Jeff Janes wrote: > In particular, pgpipe is almost an exact duplicate between them, > except the copy in vac_parallel.c has fallen behind changes made to > parallel.c. (Those changes would have fixed the Windows warnings). I > think that this function (and perhaps other parts as > well--"exit_horribly" for example) need to refactored into a common > file that both files can include. I don't know where the best place > for that would be, though. (I haven't done this type of refactoring > myself.) I think commit d2c1740dc275543a46721ed254ba3623f63d2204 is apropos. Maybe we should move pgpipe back to src/port and have pg_dump and this new thing use that. I'm not sure about the rest of duplication in vac_parallel.c; there might be a lot in common with what pg_dump/parallel.c does too. Having two copies of code is frowned upon for good reasons. This patch introduces 1200 lines of new code in vac_parallel.c, ugh. If we really require 1200 lines to get parallel vacuum working for vacuumdb, I would question the wisdom of this effort. To me, it seems better spent improving autovacuum to cover whatever it is that this patch is supposed to be good for --- or maybe just enable having a shell script that launches multiple vacuumdb instances in parallel ... -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 01 July 2014 03:31, Jeff Janes Wrote,
>
> I get a compiler warning when building on Windows. When I started looking into that, I see that two files have too much code duplication between them:

Thanks for reviewing,

> src/bin/scripts/vac_parallel.c (new file)
> src/bin/pg_dump/parallel.c (existing file)
>
> In particular, pgpipe is almost an exact duplicate between them, except the copy in vac_parallel.c has fallen behind changes made to parallel.c. (Those changes would have fixed the Windows warnings). I think that this function (and perhaps other parts as well--"exit_horribly" for example) needs to be refactored into a common file that both files can include. I don't know where the best place for that would be, though. (I haven't done this type of refactoring myself.)

When I started doing this patch, I thought of sharing the common code b/w vacuumdb and pg_dump. But if we notice, the pg_dump code is tightly coupled with ArchiveHandle: almost all functions take this parameter as input or operate on it, and other functions use structures like ParallelState or ParallelSlot which have an ArchiveHandle member. I think making this code common may need a change to the complete code of parallel pg_dump.

However, there are some functions which are independent of ArchiveHandle and can directly move to common code; as you mention, pgpipe, piperead, readMessageFromPipe, select_loop. For moving them to a common place we need to decide where the common file is to be placed.

Thoughts ?

> Also, there are several places in the patch which use spaces for indentation where tabs are called for by the coding style. It looks like you may have copied the code from one terminal window and copied it into another one, converting tabs to spaces in the process. This makes it hard to evaluate the amount of code duplication.
>
> In some places the code spins in a tight loop while waiting for a worker process to become free. If I strace the process, I got a long list of selects with 0 time outs:
>
> select(13, [6 8 10 12], NULL, NULL, {0, 0}) = 0 (Timeout)
>
> I have not tried to track down the code that causes it. I did notice that vacuumdb spends an awful lot of time at the top of the Linux "top" output, and this is probably why.

I will look into these and fix..

Thanks & Regards,
Dilip Kumar
On 01 July 2014 03:48, Alvaro Wrote,
> > In particular, pgpipe is almost an exact duplicate between them, except the copy in vac_parallel.c has fallen behind changes made to parallel.c. (Those changes would have fixed the Windows warnings). I think that this function (and perhaps other parts as well--"exit_horribly" for example) needs to be refactored into a common file that both files can include. I don't know where the best place for that would be, though. (I haven't done this type of refactoring myself.)
>
> I think commit d2c1740dc275543a46721ed254ba3623f63d2204 is apropos. Maybe we should move pgpipe back to src/port and have pg_dump and this new thing use that. I'm not sure about the rest of duplication in vac_parallel.c; there might be a lot in common with what pg_dump/parallel.c does too. Having two copies of code is frowned upon for good reasons. This patch introduces 1200 lines of new code in vac_parallel.c, ugh.
>
> If we really require 1200 lines to get parallel vacuum working for vacuumdb, I would question the wisdom of this effort. To me, it seems better spent improving autovacuum to cover whatever it is that this patch is supposed to be good for --- or maybe just enable having a shell script that launches multiple vacuumdb instances in parallel ...

Thanks for looking into the patch,

I think if we use a shell script for launching parallel vacuumdb, we cannot get complete control of dividing the task. If we directly divide tables b/w multiple processes, it may happen that some process gets very big tables; then it will be as good as one process doing the operation.

In this patch, at a time we assign only one table to each process, and whichever process finishes fast, we assign it a new table; this way all processes get an equal share of the task.

Thanks & Regards,
Dilip Kumar
On Tue, Jul 1, 2014 at 1:25 PM, Dilip kumar <dilip.kumar@huawei.com> wrote:
> On 01 July 2014 03:48, Alvaro Wrote,
>
>> > In particular, pgpipe is almost an exact duplicate between them, except the copy in vac_parallel.c has fallen behind changes made to parallel.c. (Those changes would have fixed the Windows warnings). I think that this function (and perhaps other parts as well--"exit_horribly" for example) needs to be refactored into a common file that both files can include. I don't know where the best place for that would be, though. (I haven't done this type of refactoring myself.)
>>
>> I think commit d2c1740dc275543a46721ed254ba3623f63d2204 is apropos. Maybe we should move pgpipe back to src/port and have pg_dump and this new thing use that. I'm not sure about the rest of duplication in vac_parallel.c; there might be a lot in common with what pg_dump/parallel.c does too. Having two copies of code is frowned upon for good reasons. This patch introduces 1200 lines of new code in vac_parallel.c, ugh.
>>
>> If we really require 1200 lines to get parallel vacuum working for vacuumdb, I would question the wisdom of this effort. To me, it seems better spent improving autovacuum to cover whatever it is that this patch is supposed to be good for --- or maybe just enable having a shell script that launches multiple vacuumdb instances in parallel ...
>
> Thanks for looking into the patch,
>
> I think if we use a shell script for launching parallel vacuumdb, we cannot get complete control of dividing the task. If we directly divide tables b/w multiple processes, it may happen that some process gets very big tables; then it will be as good as one process doing the operation.
>
> In this patch, at a time we assign only one table to each process, and whichever process finishes fast, we assign it a new table; this way all processes get an equal share of the task.
>
> Thanks & Regards,
> Dilip Kumar

I have executed the latest patch.
One question: is this the correct way to use the --jobs option?

$ vacuumdb -d postgres --jobs=30

I got the following error.

vacuumdb: unrecognized option '--jobs=30'
Try "vacuumdb --help" for more information.

Regards,
-------
Sawada Masahiko
On 01 July 2014 22:17, Sawada Masahiko Wrote,
> I have executed the latest patch.
> One question: is this the correct way to use the --jobs option?
> $ vacuumdb -d postgres --jobs=30
>
> I got the following error.
> vacuumdb: unrecognized option '--jobs=30'
> Try "vacuumdb --help" for more information.

Thanks for the comments. Your usage is correct, but there were some problems in the code, and I have fixed them in the attached patch.

Apart from this fix, I am currently working on Jeff's comments about making the code common between pg_dump/parallel.c and scripts/vac_parallel.c. I found that almost 300 lines of code can be moved to a common place; the only problem is where to keep the common code.

I am thinking of:
1. Keeping a common folder in the bin folder --> src/bin/common, and moving the common code which is specific to parallel operation into src/bin/common/parallel_common.c
2. Both vacuumdb and pg_dump will compile this file in while generating their executables (in future, other executables like reindex can also use the same for parallel functionality).

Thoughts ?

Thanks & Regards,
Dilip Kumar
Attachment
On Mon, Jun 30, 2014 at 3:17 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Jeff Janes wrote: > >> In particular, pgpipe is almost an exact duplicate between them, >> except the copy in vac_parallel.c has fallen behind changes made to >> parallel.c. (Those changes would have fixed the Windows warnings). I >> think that this function (and perhaps other parts as >> well--"exit_horribly" for example) need to refactored into a common >> file that both files can include. I don't know where the best place >> for that would be, though. (I haven't done this type of refactoring >> myself.) > > I think commit d2c1740dc275543a46721ed254ba3623f63d2204 is apropos. > Maybe we should move pgpipe back to src/port and have pg_dump and this > new thing use that. I'm not sure about the rest of duplication in > vac_parallel.c; there might be a lot in common with what > pg_dump/parallel.c does too. Having two copies of code is frowned upon > for good reasons. This patch introduces 1200 lines of new code in > vac_parallel.c, ugh. > > If we really require 1200 lines to get parallel vacuum working for > vacuumdb, I would question the wisdom of this effort. To me, it seems > better spent improving autovacuum to cover whatever it is that this > patch is supposed to be good for --- or maybe just enable having a shell > script that launches multiple vacuumdb instances in parallel ... I would only envision using the parallel feature for vacuumdb after a pg_upgrade or some other major maintenance window (that is the only time I ever envision using vacuumdb at all). I don't think autovacuum can be expected to handle such situations well, as it is designed to be a smooth background process. I guess the ideal solution would be for manual VACUUM to have a PARALLEL option, then vacuumdb could just invoke that one table at a time. That way you would get within-table parallelism which would be important if one table dominates the entire database cluster. But I don't foresee that happening any time soon. I don't know how to calibrate the number of lines that is worthwhile. If you write in C and need to have cross-platform compatibility and robust error handling, it seems to take hundreds of lines to do much of anything. The code duplication is a problem, but I don't think just raw line count is, especially since it has already been written. The trend in this project seems to be for shell scripts to eventually get converted into C programs. In fact, src/bin/scripts now has no scripts at all. Also it is important to vacuum/analyze tables in the same database at the same time, otherwise you will not get much speed-up in the ordinary case where there is only one meaningful database. Doing that in a shell script would be fairly hard. It should be pretty easy in Perl (at least for me--I'm sure others disagree), but that also doesn't seem to be the way we do things for programs intended for end users. Cheers, Jeff
Jeff Janes wrote: > I would only envision using the parallel feature for vacuumdb after a > pg_upgrade or some other major maintenance window (that is the only > time I ever envision using vacuumdb at all). I don't think autovacuum > can be expected to handle such situations well, as it is designed to > be a smooth background process. That's a fair point. One thing that would be pretty neat but I don't think I would get anyone to implement it, is having the user control the autovacuum launcher in some way. For instance "please vacuum this set of tables as quickly as possible", and it would launch as many workers are configured. It would take months to get a UI settled for this, however. > I guess the ideal solution would be for manual VACUUM to have a > PARALLEL option, then vacuumdb could just invoke that one table at a > time. That way you would get within-table parallelism which would be > important if one table dominates the entire database cluster. But I > don't foresee that happening any time soon. I see this as a completely different feature, which might also be pretty neat, at least if you're open to spending more I/O bandwidth processing a single table: have several processes scanning the heap simultaneously. Since I think vacuum is mostly I/O bound at the moment, I'm not sure there is much point in this currently. > I don't know how to calibrate the number of lines that is worthwhile. > If you write in C and need to have cross-platform compatibility and > robust error handling, it seems to take hundreds of lines to do much > of anything. The code duplication is a problem, but I don't think > just raw line count is, especially since it has already been written. Well, there are (at least) two types of duplicate code: first you have these common routines such as pgpipe that are duplicates for no good reason. Just move them to src/port or something and it's all good. But the OP said there is code that cannot be shared even though it's very similar in both incarnations. That means we cannot (or it's difficult to) just have one copy, which means as they fix bugs in one copy we need to update the other. This is bad -- witness the situation with ecpg's copy of date/time code, where there are bugs fixed in the backend version but the ecpg version does not have the fix. It's difficult to keep track of these things. > The trend in this project seems to be for shell scripts to eventually > get converted into C programs. In fact, src/bin/scripts now has no > scripts at all. Also it is important to vacuum/analyze tables in the > same database at the same time, otherwise you will not get much > speed-up in the ordinary case where there is only one meaningful > database. Doing that in a shell script would be fairly hard. It > should be pretty easy in Perl (at least for me--I'm sure others > disagree), but that also doesn't seem to be the way we do things for > programs intended for end users. Yeah, shipping shell scripts doesn't work very well for us. I'm thinking perhaps we can have sample scripts in which we show how to use parallel(1) to run multiple vacuumdb's in parallel in Unix and some similar mechanism in Windows, and that's it. So we wouldn't provide the complete toolset, but the platform surely has ways to make it happen. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 2, 2014 at 2:27 PM, Dilip kumar <dilip.kumar@huawei.com> wrote:
> On 01 July 2014 22:17, Sawada Masahiko Wrote,
>
>> I have executed the latest patch.
>> One question: is this the correct way to use the --jobs option?
>> $ vacuumdb -d postgres --jobs=30
>>
>> I got the following error.
>> vacuumdb: unrecognized option '--jobs=30'
>> Try "vacuumdb --help" for more information.
>
> Thanks for the comments. Your usage is correct, but there were some problems in the code, and I have fixed them in the attached patch.

Does this patch allow setting -j to 0?
When I set -j to 0, I think the behavior is the same as when I set it to 1.

Regards,
-------
Sawada Masahiko
On 03 July 2014 00:01, Sawada Masahiko Wrote,
>
> Does this patch allow setting -j to 0?
> When I set -j to 0, I think the behavior is the same as when I set it to 1.

I have changed the patch; now it will return an error if -j is set to 0 or less.

"vacuumdb: Number of parallel "jobs" should be at least 1"

Thanks & Regards,
Dilip
Attachment
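For illustration, the check described above boils down to something like the following (the function name and surrounding details are made up for this sketch, not lifted from the patch):

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative only: parse and validate the -j/--jobs argument */
    static int
    parse_jobs_option(const char *progname, const char *arg)
    {
        int     njobs = atoi(arg);

        if (njobs <= 0)
        {
            fprintf(stderr, "%s: number of parallel \"jobs\" should be at least 1\n",
                    progname);
            exit(1);
        }
        return njobs;
    }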
>
> Jeff Janes wrote:
>
> > I would only envision using the parallel feature for vacuumdb after a
> > pg_upgrade or some other major maintenance window (that is the only
> > time I ever envision using vacuumdb at all). I don't think autovacuum
> > can be expected to handle such situations well, as it is designed to
> > be a smooth background process.
>
> That's a fair point. One thing that would be pretty neat but I don't
> think I would get anyone to implement it, is having the user control the
> autovacuum launcher in some way. For instance "please vacuum this set
> of tables as quickly as possible", and it would launch as many workers
> are configured. It would take months to get a UI settled for this,
> however.
This sounds like a better way to have multiple workers working
> > I don't know how to calibrate the number of lines that is worthwhile.
> > If you write in C and need to have cross-platform compatibility and
> > robust error handling, it seems to take hundreds of lines to do much
> > of anything. The code duplication is a problem, but I don't think
> > just raw line count is, especially since it has already been written.
>
> Well, there are (at least) two types of duplicate code: first you have
> these common routines such as pgpipe that are duplicates for no good
> reason. Just move them to src/port or something and it's all good. But
> the OP said there is code that cannot be shared even though it's very
> similar in both incarnations. That means we cannot (or it's difficult
> to) just have one copy, which means as they fix bugs in one copy we need
> to update the other.
I checked briefly the duplicate code among both versions and I think,
On 02 July 2014 23:45, Alvaro Herrera Wrote,
>
> Well, there are (at least) two types of duplicate code: first you have these common routines such as pgpipe that are duplicates for no good reason. Just move them to src/port or something and it's all good. But the OP said there is code that cannot be shared even though it's very similar in both incarnations. That means we cannot (or it's difficult to) just have one copy, which means as they fix bugs in one copy we need to update the other. This is bad -- witness the situation with ecpg's copy of date/time code, where there are bugs fixed in the backend version but the ecpg version does not have the fix. It's difficult to keep track of these things.

In the attached patch, I have moved the pgpipe and piperead functions to src/port/pipe.c.

There are some more common functions, which Jeff and Amit also mentioned, to move to a common place; currently I am not sure where we can move the other functions to. Can we move the other parallel functions to src/port, maybe in one new file parallel.c under src/port ?

Thanks & Regards,
Dilip Kumar
Attachment
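For context, a hedged sketch of what the Unix side of a pgpipe()-style helper amounts to; the Windows build needs a loopback-socket emulation (as pg_dump's parallel.c carries), which is exactly what makes sharing a single copy in src/port attractive. The names mirror the existing helpers but this is not the moved code itself:

    #include <unistd.h>

    /* Illustrative Unix-only sketch: pgpipe() can simply wrap pipe(2) here;
     * the real portability work is the Windows socket-pair emulation. */
    static int
    pgpipe_sketch(int handles[2])
    {
        return pipe(handles);       /* 0 on success, -1 on failure */
    }

    static int
    piperead_sketch(int sock, char *buf, int len)
    {
        return read(sock, buf, len);
    }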
On Fri, Jul 4, 2014 at 1:15 AM, Dilip kumar <dilip.kumar@huawei.com> wrote: > In attached patch, I have moved pgpipe, piperead functions to src/port/pipe.c If we want to consider proceeding with this approach, you should probably separate this into a refactoring patch that doesn't do anything but move code around and a feature patch that applies on top of it. (As to whether this is the right approach, I'm not sure.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 07 July 2014 17:55, Robert Haas Wrote,
>
> On Fri, Jul 4, 2014 at 1:15 AM, Dilip kumar <dilip.kumar@huawei.com> wrote:
> > In attached patch, I have moved pgpipe, piperead functions to src/port/pipe.c
>
> If we want to consider proceeding with this approach, you should probably separate this into a refactoring patch that doesn't do anything but move code around and a feature patch that applies on top of it.
>
> (As to whether this is the right approach, I'm not sure.)

I have done the refactoring of the code. Two patches are attached:
1. vacuumdb_parallel_refactor.patch --> Moved pg_dump parallel code to port/parallel_utils.c (almost 800 lines are moved to the common code).
2. vacuumdb_parallel_v9 --> Feature changes for vacuumdb parallel (created on top of the first patch).

I think with these changes we are able to address all the concerns we were having related to duplicate code.

Thanks & Regards,
Dilip Kumar
Attachment
On Fri, Jun 27, 2014 at 4:10 AM, Dilip kumar <dilip.kumar@huawei.com> wrote: > On 27 June 2014 02:57, Jeff Wrote, > >> Based on that, I find most importantly that it doesn't seem to >> correctly vacuum tables which have upper case letters in the name, >> because it does not quote the table names when they need quotes. > > Thanks for your comments.... > > There are two problem > First -> When doing the vacuum of complete database that time if any table with upper case letter, it was giving error > --FIXED by adding quotes for table name > > Second -> When user pass the table using -t option, and if it has uppercase letter > --This is the existing problem (without parallel implementation), Just for the record, I don't think the second one is actually a bug. If someone uses -t option from the command line, they are required to provide the quotes if quotes are needed, just like they would need to in psql. That can be annoying to do from a shell, as you then need to protect the quotes themselves from the shell, but that is the way it is. vacuumdb -t '"CrAzY QuOtE"' or vacuumdb -t \"CrAzY\ QuOtE\" Cheers, Jeff
On Tue, Jul 1, 2014 at 6:25 AM, Dilip kumar <dilip.kumar@huawei.com> wrote:
> On 01 July 2014 03:48, Alvaro Wrote,
>
>> > In particular, pgpipe is almost an exact duplicate between them, except the copy in vac_parallel.c has fallen behind changes made to parallel.c. (Those changes would have fixed the Windows warnings). I think that this function (and perhaps other parts as well--"exit_horribly" for example) needs to be refactored into a common file that both files can include. I don't know where the best place for that would be, though. (I haven't done this type of refactoring myself.)
>>
>> I think commit d2c1740dc275543a46721ed254ba3623f63d2204 is apropos. Maybe we should move pgpipe back to src/port and have pg_dump and this new thing use that. I'm not sure about the rest of duplication in vac_parallel.c; there might be a lot in common with what pg_dump/parallel.c does too. Having two copies of code is frowned upon for good reasons. This patch introduces 1200 lines of new code in vac_parallel.c, ugh.
>>
>> If we really require 1200 lines to get parallel vacuum working for vacuumdb, I would question the wisdom of this effort. To me, it seems better spent improving autovacuum to cover whatever it is that this patch is supposed to be good for --- or maybe just enable having a shell script that launches multiple vacuumdb instances in parallel ...
>
> Thanks for looking into the patch,
>
> I think if we use a shell script for launching parallel vacuumdb, we cannot get complete control of dividing the task. If we directly divide tables b/w multiple processes, it may happen that some process gets very big tables; then it will be as good as one process doing the operation.
>
> In this patch, at a time we assign only one table to each process, and whichever process finishes fast, we assign it a new table; this way all processes get an equal share of the task.

I am late to this game, but the first thing to my mind was - do we really need the whole forking/threading thing on the client at all? We need it for things like pg_dump/pg_restore because they can themselves benefit from parallelism at the client level, but for something like this, might the code become a lot simpler if we just use multiple database connections and async queries? That would also bring the benefit of less platform dependent code, less cleanup needs etc.

(Oh, and for some reason at my quick review I also noticed - you added quoting of the table name, but forgot to do it for the schema name. You should probably also look at using something like quote_identifier(), that'll make things easier).

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On 15 July 2014 19:01, Magnus Hagander Wrote,
> I am late to this game, but the first thing to my mind was - do we really need the whole forking/threading thing on the client at all? We need it for things like pg_dump/pg_restore because they can themselves benefit from parallelism at the client level, but for something like this, might the code become a lot simpler if we just use multiple database connections and async queries? That would also bring the benefit of less platform dependent code, less cleanup needs etc.

Thanks for the review. I understand your point, but I think if we do this directly with independent connections, it's difficult to divide the jobs equally b/w multiple independent connections.

As per this implementation we are able to share the load b/w the processes quite well:
1. If one process finishes its work faster, it can take on more of the load.
2. Especially while vacuuming the whole database, it's very difficult to divide the load without centralized control.

By the above points, I think that we can have this patch..

> (Oh, and for some reason at my quick review I also noticed - you added quoting of the table name, but forgot to do it for the schema name. You should probably also look at using something like quote_identifier(), that'll make things easier).

Thanks for the comments, I have attached the updated patch.
vacuumdb_parallel_refactor --> No change
vacuumdb_parallel_v9 --> Quotes added for namespace

Thanks & Regards,
Dilip
Attachment
Dilip kumar <dilip.kumar@huawei.com> writes: > On 15 July 2014 19:01, Magnus Hagander Wrote, >> I am late to this game, but the first thing to my mind was - do we >> really need the whole forking/threading thing on the client at all? > Thanks for the review, I understand you point, but I think if we have do this directly by independent connection, > It's difficult to equally divide the jobs b/w multiple independent connections. That argument seems like complete nonsense. You're confusing work allocation strategy with the implementation technology for the multiple working threads. I see no reason why a good allocation strategy couldn't work with either approach; indeed, I think it would likely be easier to do some things *without* client-side physical parallelism, because that makes it much simpler to handle feedback between the results of different operational threads. regards, tom lane
Tom Lane wrote: > Dilip kumar <dilip.kumar@huawei.com> writes: > > On 15 July 2014 19:01, Magnus Hagander Wrote, > >> I am late to this game, but the first thing to my mind was - do we > >> really need the whole forking/threading thing on the client at all? > > > Thanks for the review, I understand you point, but I think if we have do this directly by independent connection, > > It's difficult to equally divide the jobs b/w multiple independent connections. > > That argument seems like complete nonsense. You're confusing work > allocation strategy with the implementation technology for the multiple > working threads. I see no reason why a good allocation strategy couldn't > work with either approach; indeed, I think it would likely be easier to > do some things *without* client-side physical parallelism, because that > makes it much simpler to handle feedback between the results of different > operational threads. So you would have one initial connection, which generates a task list; then open N libpq connections. Launch one vacuum on each, and then sleep on select() on the three sockets. Whenever one returns read-ready, the vacuuming is done and we send another item from the task list. Repeat until tasklist is empty. No need to fork anything. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Jul 16, 2014 7:05 AM, "Alvaro Herrera" <alvherre@2ndquadrant.com> wrote:
>
> Tom Lane wrote:
> > Dilip kumar <dilip.kumar@huawei.com> writes:
> > > On 15 July 2014 19:01, Magnus Hagander Wrote,
> > >> I am late to this game, but the first thing to my mind was - do we
> > >> really need the whole forking/threading thing on the client at all?
> >
> > > Thanks for the review, I understand you point, but I think if we have do this directly by independent connection,
> > > It's difficult to equally divide the jobs b/w multiple independent connections.
> >
> > That argument seems like complete nonsense. You're confusing work
> > allocation strategy with the implementation technology for the multiple
> > working threads. I see no reason why a good allocation strategy couldn't
> > work with either approach; indeed, I think it would likely be easier to
> > do some things *without* client-side physical parallelism, because that
> > makes it much simpler to handle feedback between the results of different
> > operational threads.
>
> So you would have one initial connection, which generates a task list;
> then open N libpq connections. Launch one vacuum on each, and then
> sleep on select() on the three sockets. Whenever one returns
> read-ready, the vacuuming is done and we send another item from the task
> list. Repeat until tasklist is empty. No need to fork anything.
>

Yeah, those are exactly my points. I think it would be significantly simpler to do it that way, rather than forking and threading. And also easier to make portable...

(and as a optimization on Alvaros suggestion, you can of course reuse the initial connection as one of the workers as long as you got the full list of tasks from it up front, which I think you do anyway in order to do sorting of tasks...)

/Magnus
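A hedged sketch of the design Alvaro and Magnus describe: N plain libpq connections, one asynchronous VACUUM in flight on each, and a select() loop that hands the next table to whichever connection finishes first. get_next_command(), MAX_CONNS, and the omitted error handling are illustrative assumptions, not the eventual patch's code:

    #include <stdbool.h>
    #include <sys/select.h>
    #include "libpq-fe.h"

    #define MAX_CONNS 64            /* illustrative cap for this sketch */

    /* hypothetical helper: next "VACUUM ..." command, or NULL when the list is empty */
    extern const char *get_next_command(void);

    static void
    dispatch_loop(PGconn **conns, int nconns)
    {
        bool        busy[MAX_CONNS] = {false};
        int         running = 0;

        /* prime every connection with one command */
        for (int i = 0; i < nconns; i++)
        {
            const char *cmd = get_next_command();

            if (cmd != NULL && PQsendQuery(conns[i], cmd))
            {
                busy[i] = true;
                running++;
            }
        }

        while (running > 0)
        {
            fd_set      readfds;
            int         maxfd = -1;

            FD_ZERO(&readfds);
            for (int i = 0; i < nconns; i++)
            {
                if (!busy[i])
                    continue;
                FD_SET(PQsocket(conns[i]), &readfds);
                if (PQsocket(conns[i]) > maxfd)
                    maxfd = PQsocket(conns[i]);
            }

            /* block (no busy-wait) until some backend has finished */
            if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0)
                break;

            for (int i = 0; i < nconns; i++)
            {
                PGresult   *res;
                const char *cmd;

                if (!busy[i] || !FD_ISSET(PQsocket(conns[i]), &readfds))
                    continue;

                PQconsumeInput(conns[i]);
                if (PQisBusy(conns[i]))
                    continue;       /* result not fully received yet */

                /* drain the result(s) of the finished VACUUM */
                while ((res = PQgetResult(conns[i])) != NULL)
                    PQclear(res);

                /* hand this connection the next table, if any remains */
                cmd = get_next_command();
                if (cmd == NULL || !PQsendQuery(conns[i], cmd))
                {
                    busy[i] = false;
                    running--;
                }
            }
        }
    }

As Magnus notes, the connection used to fetch the task list up front can simply be reused as conns[0].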
On 16 July 2014 12:13 Magnus Hagander Wrote,
>>Yeah, those are exactly my points. I think it would be significantly simpler to do it that way, rather than forking and threading. And also easier to make portable...
>>(and as a optimization on Alvaros suggestion, you can of course reuse the initial connection as one of the workers as long as you got the full list of tasks from it up front, which I think you do anyway in order to do sorting of tasks...)
Oh, I got your point, I will update my patch and send,
Now we can completely remove the vac_parallel.h file, and no refactoring is needed either :)
Thanks & Regards,
Dilip Kumar
From: Magnus Hagander [mailto:magnus@hagander.net]
Sent: 16 July 2014 12:13
To: Alvaro Herrera
Cc: Dilip kumar; Jan Lentfer; Tom Lane; PostgreSQL-development; Sawada Masahiko; Euler Taveira
Subject: Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Jul 16, 2014 7:05 AM, "Alvaro Herrera" <alvherre@2ndquadrant.com> wrote:
>
> Tom Lane wrote:
> > Dilip kumar <dilip.kumar@huawei.com> writes:
> > > On 15 July 2014 19:01, Magnus Hagander Wrote,
> > >> I am late to this game, but the first thing to my mind was - do we
> > >> really need the whole forking/threading thing on the client at all?
> >
> > > Thanks for the review, I understand you point, but I think if we have do this directly by independent connection,
> > > It's difficult to equally divide the jobs b/w multiple independent connections.
> >
> > That argument seems like complete nonsense. You're confusing work
> > allocation strategy with the implementation technology for the multiple
> > working threads. I see no reason why a good allocation strategy couldn't
> > work with either approach; indeed, I think it would likely be easier to
> > do some things *without* client-side physical parallelism, because that
> > makes it much simpler to handle feedback between the results of different
> > operational threads.
>
> So you would have one initial connection, which generates a task list;
> then open N libpq connections. Launch one vacuum on each, and then
> sleep on select() on the three sockets. Whenever one returns
> read-ready, the vacuuming is done and we send another item from the task
> list. Repeat until tasklist is empty. No need to fork anything.
>
Yeah, those are exactly my points. I think it would be significantly simpler to do it that way, rather than forking and threading. And also easier to make portable...
(and as an optimization on Alvaro's suggestion, you can of course reuse the initial connection as one of the workers, as long as you got the full list of tasks from it up front, which I think you do anyway in order to do sorting of tasks...)
/Magnus
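To make the proposed flow concrete, here is a minimal sketch of the select()-based dispatch loop described above, written against documented libpq calls (PQsendQuery, PQsocket, PQconsumeInput, PQisBusy, PQgetResult). The ParallelSlot struct, the helper names, and the hard-coded task list are assumptions of this sketch, not code taken from the patch; a real client would also set the connections non-blocking and handle signals and errors more carefully.

/*
 * Sketch only: dispatch VACUUMs over N asynchronous libpq connections and
 * sleep on select() until one of them finishes, then hand it the next table.
 * Compile with something like: cc sketch.c -lpq
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <libpq-fe.h>

typedef struct
{
    PGconn *conn;
    int     sock;
    int     busy;   /* a command is currently running on this connection */
} ParallelSlot;

/* Send "VACUUM <table>" without waiting for the result.
 * (The table name is assumed to be already quoted/qualified.) */
static void
dispatch_vacuum(ParallelSlot *slot, const char *table)
{
    char    sql[1024];

    snprintf(sql, sizeof(sql), "VACUUM %s", table);
    if (!PQsendQuery(slot->conn, sql))
    {
        fprintf(stderr, "failed to send command: %s", PQerrorMessage(slot->conn));
        exit(1);
    }
    slot->busy = 1;
}

/* Absorb input; if the command has finished, drain its results and mark idle. */
static void
reap_slot(ParallelSlot *slot)
{
    PGresult   *res;

    if (!PQconsumeInput(slot->conn))
    {
        fprintf(stderr, "%s", PQerrorMessage(slot->conn));
        exit(1);
    }
    if (PQisBusy(slot->conn))
        return;                             /* still working */
    while ((res = PQgetResult(slot->conn)) != NULL)
        PQclear(res);                       /* a real client would check errors here */
    slot->busy = 0;
}

int
main(void)
{
    /* Hypothetical task list; vacuumdb would build this from pg_class. */
    const char *tasks[] = {"t1", "t2", "t3", "t4", "t5"};
    int         ntasks = 5, next = 0, pending = 0, i;
    int         nconn = 2;
    ParallelSlot slot[2];

    for (i = 0; i < nconn; i++)
    {
        slot[i].conn = PQconnectdb("dbname=postgres");
        if (PQstatus(slot[i].conn) != CONNECTION_OK)
            exit(1);
        slot[i].sock = PQsocket(slot[i].conn);
        slot[i].busy = 0;
    }

    while (next < ntasks || pending > 0)
    {
        fd_set  rfds;
        int     maxfd = -1;

        /* Hand the next table to every idle connection. */
        for (i = 0; i < nconn; i++)
            if (!slot[i].busy && next < ntasks)
            {
                dispatch_vacuum(&slot[i], tasks[next++]);
                pending++;
            }

        /* Sleep until at least one busy connection has data for us. */
        FD_ZERO(&rfds);
        for (i = 0; i < nconn; i++)
            if (slot[i].busy)
            {
                FD_SET(slot[i].sock, &rfds);
                if (slot[i].sock > maxfd)
                    maxfd = slot[i].sock;
            }
        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
            exit(1);                        /* a real client would retry on EINTR */

        for (i = 0; i < nconn; i++)
            if (slot[i].busy && FD_ISSET(slot[i].sock, &rfds))
            {
                reap_slot(&slot[i]);
                if (!slot[i].busy)
                    pending--;
            }
    }

    for (i = 0; i < nconn; i++)
        PQfinish(slot[i].conn);
    return 0;
}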
On 16 July 2014 12:13, Magnus Hagander Wrote,
>Yeah, those are exactly my points. I think it would be significantly simpler to do it that way, rather than forking and threading. And also easier to make portable...
>(and as a optimization on Alvaros suggestion, you can of course reuse the initial connection as one of the workers as long as you got the full list of tasks from it up front, which I think you do anyway in order to sorting of tasks...)
I have modified the patch as per the suggestion.
Now all connections are created at the beginning, and the first connection is used to get the table list; after that, all connections take part in the vacuum task.
Please have a look and provide your opinion…
Thanks & Regards,
Dilip Kumar
Attachment
Jeff Janes wrote:

> Should we push the refactoring through anyway?  I have a hard time
> believing that pg_dump is going to be the only client program we ever have
> that will need process-level parallelism, even if this feature itself does
> not need it.  Why make the next person who comes along re-invent that
> re-factoring of this wheel?

I gave the refactoring patch a look some days ago, and my conclusion was
that it is reasonably sound but it needed quite some cleanup in order for
it to be committable.  Without any immediate use case, it's hard to
justify going through all that effort.  Maybe we can add a TODO item and
have it point to the posted patch, so that if in the future we see a need
for another parallel client program we can easily rebase the current patch.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> On 16 July 2014 12:13, Magnus Hagander Wrote,
> >Yeah, those are exactly my points. I think it would be significantly simpler to do it that way, rather than forking and threading. And also easier to make portable...
>
> >(and as a optimization on Alvaros suggestion, you can of course reuse the initial connection as one of the workers as long as you got the full list of tasks from it up front, which I think you do anyway in order to sorting of tasks...)
>
> I have modified the patch as per the suggestion,
>
> Now in beginning we create all connections, and first connection we use for getting table list in beginning, After that all connections will be involved in vacuum task.
>
> Please have a look and provide your opinion…
On 31 July 2014 10:59, Amit kapila Wrote,
Thanks for the review and valuable comments.
I have fixed all the comments and attached the revised patch.
As per your suggestion, I have also collected the performance numbers…
Test1:
Machine Configuration:
Core : 8 (Intel(R) Xeon(R) CPU E5520 @ 2.27GHz)
RAM: 48GB
Test Scenario:
8 tables, all with 1M+ records. [Many records are deleted and inserted using some pattern; the file is attached in the mail.]
Test Result
Base Code: 43.126s
Parallel Vacuum Code
2 Threads : 29.687s
8 Threads : 14.647s
Test2: (as per your scenario, where actual vacuum time is very less)
Vacuum done for complete DB
8 tables all with 10000 records and few dead tuples
Test Result
Base Code: 0.59s
Parallel Vacuum Code
2 Threads : 0.50s
4 Threads : 0.29s
8 Threads : 0.18s
Regards,
Dilip Kumar
From: Amit Kapila [mailto:amit.kapila16@gmail.com]
Sent: 31 July 2014 10:59
To: Dilip kumar
Cc: Magnus Hagander; Alvaro Herrera; Jan Lentfer; Tom Lane; PostgreSQL-development; Sawada Masahiko; Euler Taveira
Subject: Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Fri, Jul 18, 2014 at 10:22 AM, Dilip kumar <dilip.kumar@huawei.com> wrote:
> On 16 July 2014 12:13, Magnus Hagander Wrote,
> >Yeah, those are exactly my points. I think it would be significantly simpler to do it that way, rather than forking and threading. And also easier to make portable...
>
> >(and as a optimization on Alvaros suggestion, you can of course reuse the initial connection as one of the workers as long as you got the full list of tasks from it up front, which I think you do anyway in order to sorting of tasks...)
>
> I have modified the patch as per the suggestion,
>
> Now in beginning we create all connections, and first connection we use for getting table list in beginning, After that all connections will be involved in vacuum task.
>
> Please have a look and provide your opinion…
1.
+ connSlot = (ParallelSlot*)pg_malloc(parallel * sizeof(ParallelSlot));
+ for (i = 0; i < parallel; i++)
+ {
+ connSlot[i].connection = connectDatabase(dbname, host, port, username,
+ prompt_password, progname, false);
+
+ PQsetnonblocking(connSlot[i].connection, 1);
+ connSlot[i].isFree = true;
+ connSlot[i].sock = PQsocket(connSlot[i].connection);
+ }
Here it seems to me that you are opening connections before
getting or checking tables list, so in case you have lesser
number of tables, won't the extra connections be always idle.
A simple case to verify the same is with the below example
vacuumdb -t t1 -d postgres -j 4
2.
+ res = executeQuery(conn,
+ "select relname, nspname from pg_class c, pg_namespace ns"
+ " where relkind= \'r\' and c.relnamespace = ns.oid"
+ " order by relpages desc",
+ progname, echo);
Here it is just trying to get the list of relations, however
Vacuum command processes materialized views as well, so
I think here the list should include materialized views as well
unless you have any specific reason for not including those.
3. In function vacuum_parallel(), if user has not provided list of tables,
then it is retrieving all the tables in database and then in run_parallel_vacuum(),
it tries to do Vacuum for each of table using Async mechanism, now
consider a case when after getting list if any table is dropped by user
from some other session, then patch will error out. However without patch
or Vacuum command will internally ignore such a case and complete
the Vacuum for other tables. Don't you think the patch should maintain
the existing behaviour?
4.
+ <term><option>-j <replaceable class="parameter">jobs</replaceable></></term>
+ Number of parallel process to perform the operation.
Change this description as per new implementation. Also I think
there is a need of some explanation for this new option.
5.
It seems there is no change in the below function declaration:
static void vacuum_one_database(const char *dbname, bool full, bool verbose,
! bool and_analyze, bool analyze_only, bool analyze_in_stages,
! bool freeze, const char *table, const char *host,
! const char *port, const char *username,
! enum trivalue prompt_password,
const char *progname, bool echo);
6.
+ printf(_(" -j, --jobs=NUM use this many parallel jobs to vacuum\n"));
Change the description as per new implementation.
7.
/* This will give the free connection slot, if no slot is free it will
wait for atleast one slot to get free.*/
Multiline comments should be written like (refer other places)
/*
* This will give the free connection slot, if no slot is free it will
* wait for atleast one slot to get free.
*/
Kindly correct at other places if similar instance exist in patch.
8.
Isn't it a good idea to check performance of this new patch
especially for some worst cases like when there is not much
to vacuum in the tables inside a database. The reason I wanted
to check is that because with new algorithm (for a vacuum of database,
now it will get the list of tables and perform vacuum on individual
tables) we have to repeat certain actions in server side like
allocation/deallocation of context, sending stats, which would have
been otherwise done once.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
<div dir="ltr">On Mon, Aug 4, 2014 at 11:41 AM, Dilip kumar <<a href="mailto:dilip.kumar@huawei.com">dilip.kumar@huawei.com</a>>wrote:<br />><br />> On 31 July 2014 10:59, Amitkapila Wrote,<br />><br />> <br />><br /> > Thanks for the review and valuable comments.<br />> I havefixed all the comments and attached the revised patch.<br /><br />I have again looked into your revised patch and wouldlike<br />to share my findings with you.<br /><br />1. <br />+ Number of parallel connections to perform theoperation. This option will enable the vacuum<br />+ operation to run on parallel connections, at a time one tablewill be operated on one connection.<br /><br />a. How about describing w.r.t asynchronous connections<br />instead ofparallel connections?<br />b. It is better to have line length as lesser than 80.<br />c. As you are using multiple connectionsto achieve parallelism,<br /> I suggest you add a line in your description indicating user should<br />verifymax_connections parameters. Something similar to pg_dump:<br /><br />"pg_dump will open njobs + 1 connections tothe database, so make<br /> sure your max_connections setting is high enough to accommodate<br /> all connections."<br/> <br /><br />2. <br />+ So at one time as many tables will be vacuumed parallely as number of jobs.<br/><br />can you briefly mention about the case when number of jobs<br />is more than number of tables?<br /><br />3.<br/>+ /* When user is giving the table list, and list is smaller then<br />+ * number of tables<br />+ */<br />+ if(tbl_count && (parallel > tbl_count))<br />+ parallel = tbl_count;<br />+ <br /><br />Again here multiline commentsare wrong.<br /><br />Some other instances are as below:<br />a.<br />/* This will give the free connection slot,if no slot is free it will<br />* wait for atleast one slot to get free.<br />*/<br />b.<br />/* if table list is notprovided then we need to do vaccum for whole DB<br /> * get the list of all tables and prpare the list<br />*/<br />c.<br/>/* Some of the slot are free, Process the results for slots whichever<br />* are free<br />*/<br /><br /><br />4.<br/>src/bin/scripts/vacuumdb.c:51: indent with spaces.<br /> + bool analyze_only, bool freeze, PQExpBuffersql);<br />src/bin/scripts/vacuumdb.c:116: indent with spaces.<br />+ int parallel = 0;<br />src/bin/scripts/vacuumdb.c:198:indent with spaces.<br />+ optind++;<br /> src/bin/scripts/vacuumdb.c:299: space beforetab in indent.<br />+ vacuum_one_database(dbname, full, verbose, and_analyze,<br /><br />Thereare lot of redundant whitespaces, check with<br />git diff --check and fix them.<br /><br /><br />5.<br />res = executeQuery(conn,<br/> "select relname, nspname from pg_class c, pg_namespace ns"<br /> " where (relkind= \'r\' or relkind = \'m\')"<br /> " and c.relnamespace = ns.oid order by relpages desc",<br /> progname, echo);<br /><br />a. Here you need to use SQL keywords in capital letters, refer one<br />of the other callerof executeQuery() in vacuumdb.c<br />b. Why do you need this condition c.relnamespace = ns.oid in above<br /> query?<br/>I think to get the list of required objects from pg_class, you don't<br />need to have a join with pg_namespace.<br/><br />6.<br />vacuum_parallel()<br />{<br />..<br />if (!tables || !tables->head)<br />{<br />..<br/>tbl_count++;<br /> }<br />..<br />}<br /><br />a. Here why you need a separate variable (tbl_count) to verify number<br/>asynchronous/parallel connections you want, why can't we use ntuple?<br />b. 
there is a warning in code (I havecompiled it on windows) as well<br /> related to this variable.<br />vacuumdb.c(695): warning C4700: uninitialized localvariable 'tbl_count' used<br /><br /><br />7.<br />Fix for one of my previous comment is as below:<br />GetQueryResult()<br/>{<br />..<br />if (!r && !completedb)<br /> ..<br />}<br /><br />Here I think some genericerrors like connection broken or others<br />will also get ignored. Is it possible that we can ignore particular<br/>error which we want to ignore without complicating the code?<br /><br /> Also in anycase add comments to explainwhy you are ignoring<br />error for particular case.<br /><br /><br />8.<br />+ fprintf(stderr, _("%s: Number of parallel\"jobs\" should be at least 1\n"),<br />+ progname);<br /> formatting of 2nd line progname is not as per standard(you can refer other fprintf in the same file).<br /><br />9. + int parallel = 0;<br />I think it is better toname it as numAsyncCons or something similar.<br /><br />10. It is better if you can add function header for newly added<br/> functions. <br /> <br />><br />> Test2: (as per your scenario, where actual vacuum time is very less)<br/>><br />> Vacuum done for complete DB<br />><br />> 8 tables all with 10000records and few dead tuples<br /><br />I think this test is missing in attached file. Few means?<br />Can you try with0.1%, 1% of dead tuples in table and try to<br />take time in milliseconds if it is taking less time to complete<br />thetest.<br /><br /><br />PS - <br /> It is better if you mention against each review comment/suggestion<br />what youhave done, because in some cases it will help reviewer to<br />understand your fix easily and as author you will alsobe sure that<br />all got fixed.<br /><br />With Regards,<br />Amit Kapila.<br />EnterpriseDB: <a href="http://www.enterprisedb.com">http://www.enterprisedb.com</a></div>
On 11 August 2014 10:29, Amit kapila wrote,
1. I have fixed all the review comments except few, and modified patch is attached.
2. For not fixed comments, find inline reply in the mail..
>1.
>+ Number of parallel connections to perform the operation. This option will enable the vacuum
>+ operation to run on parallel connections, at a time one table will be operated on one connection.
>
>a. How about describing w.r.t asynchronous connections
>instead of parallel connections?
>b. It is better to have line length as lesser than 80.
>c. As you are using multiple connections to achieve parallelism,
>I suggest you add a line in your description indicating user should
>verify max_connections parameters. Something similar to pg_dump:
>
>2.
>+ So at one time as many tables will be vacuumed parallely as number of jobs.
>
>can you briefly mention about the case when number of jobs
>is more than number of tables?
1 and 2 ARE FIXED in DOC.
>3.
>+ /* When user is giving the table list, and list is smaller then
>+ * number of tables
>+ */
>+ if (tbl_count && (parallel > tbl_count))
>+ parallel = tbl_count;
>+
>
>Again here multiline comments are wrong.
>
>Some other instances are as below:
>a.
>/* This will give the free connection slot, if no slot is free it will
>* wait for atleast one slot to get free.
>*/
>b.
>/* if table list is not provided then we need to do vaccum for whole DB
>* get the list of all tables and prpare the list
>*/
>c.
>/* Some of the slot are free, Process the results for slots whichever
>* are free
>*/
COMMENTS are FIXED
>4.
>src/bin/scripts/vacuumdb.c:51: indent with spaces.
>+ bool analyze_only, bool freeze, PQExpBuffer sql);
>src/bin/scripts/vacuumdb.c:116: indent with spaces.
>+ int parallel = 0;
>src/bin/scripts/vacuumdb.c:198: indent with spaces.
>+ optind++;
>src/bin/scripts/vacuumdb.c:299: space before tab in indent.
>+ vacuum_one_database(dbname, full, verbose, and_analyze,
>
>There are lot of redundant whitespaces, check with
>git diff --check and fix them.
ALL are FIXED
>
>5.
>res = executeQuery(conn,
> "select relname, nspname from pg_class c, pg_namespace ns"
> " where (relkind = \'r\' or relkind = \'m\')"
> " and c.relnamespace = ns.oid order by relpages desc",
> progname, echo);
>
>a. Here you need to use SQL keywords in capital letters, refer one
>of the other caller of executeQuery() in vacuumdb.c
>b. Why do you need this condition c.relnamespace = ns.oid in above
>query?
IT IS POSSIBLE THAT TWO NAMESPACES HAVE A TABLE WITH THE SAME NAME, SO WHEN WE SEND THE COMMAND FROM THE CLIENT WE NEED TO QUALIFY IT WITH THE NAMESPACE, BECAUSE WE NEED TO VACUUM ALL THE
TABLES. (OTHERWISE TWO TABLES WITH THE SAME NAME IN DIFFERENT NAMESPACES WOULD BE TREATED AS THE SAME TABLE.)
>I think to get the list of required objects from pg_class, you don't
>need to have a join with pg_namespace.
DONE
>6.
>vacuum_parallel()
>{
>..
>if (!tables || !tables->head)
>{
>..
>tbl_count++;
>}
>..
>}
>
>a. Here why you need a separate variable (tbl_count) to verify number
>asynchronous/parallel connections you want, why can't we use ntuple?
>b. there is a warning in code (I have compiled it on windows) as well
>related to this variable.
>vacuumdb.c(695): warning C4700: uninitialized local variable 'tbl_count' used
Variable REMOVED
>
>7.
>Fix for one of my previous comment is as below:
>GetQueryResult()
>{
>..
>if (!r && !completedb)
>..
>}
>
>Here I think some generic errors like connection broken or others
>will also get ignored. Is it possible that we can ignore particular
>error which we want to ignore without complicating the code?
>
>Also in anycase add comments to explain why you are ignoring
>error for particular case.
Here we only get the message string; I think if we need to find the error code, we have to parse the string and then compare it against the error codes.
Is there any other way to do this?
Comments are added
>8.
>+ fprintf(stderr, _("%s: Number of parallel \"jobs\" should be at least 1\n"),
>+ progname);
>formatting of 2nd line progname is not as per standard (you can refer other fprintf in the same file).
DONE
>9. + int parallel = 0;
>I think it is better to name it as numAsyncCons or something similar.
CHANGED as PER SUGGESTION
>10. It is better if you can add function header for newly added
>functions.
ADDED
>>
>> Test2: (as per your scenario, where actual vacuum time is very less)
>>
>> Vacuum done for complete DB
>>
>> 8 tables all with 10000 records and few dead tuples
>
>I think this test is missing in attached file. Few means?
>Can you try with 0.1%, 1% of dead tuples in table and try to
>take time in milliseconds if it is taking less time to complete
>the test.
TESTED with 1%, 0.1% and 0.01 % and results are as follows
1. With 1% (file test1%.sql)
Base Code : 22.26 s
2 Threads : 12.82 s
4 Threads : 9.19s
8 Threads : 8.89s
2. With 0.1%
Base Code : 3.83 s
2 Threads : 2.01 s
4 Threads : 2.02s
8 Threads : 2.25s
3. With 0.01%
Base Code : 0.60 s
2 Threads : 0.32 s
4 Threads : 0.26s
8 Threads : 0.31s
Thanks & Regards,
Dilip
Attachment
On Mon, Aug 11, 2014 at 12:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> 1.
> + Number of parallel connections to perform the operation. This
> option will enable the vacuum
> + operation to run on parallel connections, at a time one table will
> be operated on one connection.
>
> a. How about describing w.r.t asynchronous connections
> instead of parallel connections?

I don't think "asynchronous" is a good choice of word.  Maybe "simultaneous"?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On 11 August 2014 10:29, Amit kapila wrote,
> >5.
>
> >res = executeQuery(conn,
>
> > "select relname, nspname from pg_class c, pg_namespace ns"
>
> > " where (relkind = \'r\' or relkind = \'m\')"
>
> > " and c.relnamespace = ns.oid order by relpages desc",
>
> > progname, echo);
>
> >
>
> >a. Here you need to use SQL keywords in capital letters, refer one
> >of the other caller of executeQuery() in vacuumdb.c
>
> >b. Why do you need this condition c.relnamespace = ns.oid in above
>
> >query?
>
>
>
> IT IS POSSIBLE THAT, TWO NAMESPACE HAVE THE SAME TABLE NAME, SO WHEN WE ARE SENDING COMMAND FROM CLIENT WE NEED TO GIVE NAMESPACE WISE BECAUSE WE NEED TO VACUUM ALL THE
>
> TABLES.. (OTHERWISE TWO TABLE WITH SAME NAME FROM DIFFERENT NAMESPACE WILL BE TREATED SAME.)
> >7.
>
>
> Here we are getting message string, I think if we need to find the error code then we need to parse the string, and after that we need to compare with error codes.
>
> Is there any other way to do this ?
>
> On Mon, Aug 11, 2014 at 12:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > 1.
> > + Number of parallel connections to perform the operation. This
> > option will enable the vacuum
> > + operation to run on parallel connections, at a time one table will
> > be operated on one connection.
> >
> > a. How about describing w.r.t asynchronous connections
> > instead of parallel connections?
>
> I don't think "asynchronous" is a good choice of word.
Not sure. How about *concurrent* or *multiple*?
On Tue, Aug 19, 2014 at 7:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Aug 15, 2014 at 12:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, Aug 11, 2014 at 12:59 AM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > 1.
>> > + Number of parallel connections to perform the operation. This
>> > option will enable the vacuum
>> > + operation to run on parallel connections, at a time one table
>> > will
>> > be operated on one connection.
>> >
>> > a. How about describing w.r.t asynchronous connections
>> > instead of parallel connections?
>>
>> I don't think "asynchronous" is a good choice of word.
>
> Agreed.
>
>> Maybe "simultaneous"?
>
> Not sure. How about *concurrent* or *multiple*?

multiple isn't right, but we could say concurrent.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> On Tue, Aug 19, 2014 at 7:08 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Aug 15, 2014 at 12:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >>
> >> On Mon, Aug 11, 2014 at 12:59 AM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> >
> >> > a. How about describing w.r.t asynchronous connections
> >> > instead of parallel connections?
> >>
> >> I don't think "asynchronous" is a good choice of word.
> >
> > Agreed.
> >
> >>Maybe "simultaneous"?
> >
> > Not sure. How about *concurrent* or *multiple*?
>
> multiple isn't right, but we could say concurrent.
<div class="WordSection1"><p class="MsoNormal">On<span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"></span><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">21August 2014 08:31, Amit Kapila Wrote, </span><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"></span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span><pclass="MsoNormal" style="margin-bottom:12.0pt">>>><br /> > > >Not sure. How about *concurrent* or *multiple*?<br /> >><br /> > >multiple isn't right, but we could say concurrent.<p class="MsoNormal">>I also find concurrentmore appropriate.<p class="MsoNormal">>Dilip, could you please change it to concurrent in doc updates,<p class="MsoNormal">>variables,functions unless you see any objection for the same.<p class="MsoNormal"> <p class="MsoNormal">Ok,I will take care along with other comments fix..<p class="MsoNormal"> <p class="MsoNormal">Regards,<pclass="MsoNormal">Dilip Kumar</div>
>
> Few more comments:
>
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 24 August 2014 11:33, Amit Kapila Wrote
Thanks for your comments; I have worked on both of the review comment lists, sent on 19 August and 24 August.
Latest patch is attached with the mail..
on 19 August:
------------
>You can compare against SQLSTATE by using below API.
>val = PQresultErrorField(res, PG_DIAG_SQLSTATE);
>You need to handle *42P01* SQLSTATE, also please refer below
>usage where we are checking SQLSTATE.
>fe-connect.c
>PQresultErrorField(conn->result, PG_DIAG_SQLSTATE),
> ERRCODE_INVALID_PASSWORD) == 0)
DONE
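For reference, a minimal sketch of the SQLSTATE check being discussed, using the documented PQresultErrorField()/PG_DIAG_SQLSTATE interface; the function name and the completedb flag are illustrative assumptions, not the patch's exact code.

/*
 * Sketch only: decide whether a per-table VACUUM error can be ignored when
 * vacuuming the whole database.  "42P01" is SQLSTATE undefined_table; the
 * table may have been dropped after the task list was built, a case which a
 * plain server-side "VACUUM;" would silently skip.
 */
#include <string.h>
#include <libpq-fe.h>

#define SQLSTATE_UNDEFINED_TABLE "42P01"

static int
error_can_be_ignored(const PGresult *res, int completedb)
{
    const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

    return completedb &&
           sqlstate != NULL &&
           strcmp(sqlstate, SQLSTATE_UNDEFINED_TABLE) == 0;
}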
>Few more comments:
>
>1.
>* If user has not given the vacuum of complete db, then if
>
>I think here you have said reverse of what code is doing.
>You don't need *not* in above sentence.
DONE
>2.
>+ appendPQExpBuffer(&sql, "\"%s\".\"%s\"", nspace, relName);
>I think here you need to use function fmtQualifiedId() or fmtId()
>or something similar to handle quotes appropriately.
DONE
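The comment above refers to the frontend fmtId()/fmtQualifiedId() helpers; as an illustration of the same idea using only libpq, a schema-qualified command could be built with PQescapeIdentifier(). The function name and buffer handling below are assumptions of this sketch, and error handling is omitted for brevity.

/*
 * Sketch: build a schema-qualified VACUUM command, quoting both the
 * namespace and the relation name so that odd identifiers (mixed case,
 * spaces, embedded quotes) survive the round trip.  PQescapeIdentifier()
 * returns a malloc'd, double-quoted string that must be freed with
 * PQfreemem(); a real client would also check it for NULL.
 */
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

static void
build_vacuum_command(PGconn *conn, const char *nspname, const char *relname,
                     char *buf, size_t buflen)
{
    char   *qnsp = PQescapeIdentifier(conn, nspname, strlen(nspname));
    char   *qrel = PQescapeIdentifier(conn, relname, strlen(relname));

    /* e.g.  VACUUM "public"."My Table"  */
    snprintf(buf, buflen, "VACUUM %s.%s", qnsp, qrel);

    PQfreemem(qnsp);
    PQfreemem(qrel);
}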
>3.
>
>+ */
>+ if (!r && !completedb)
>Here the usage of completedb variable is reversed which means
>that it goes into error path when actually whole database is
>getting vacuumed and the reason is that you are setting it
>to false in below code:
>+ /* Vaccuming full database*/
>+ vacuum_tables = false;
FIXED
>4.
>Functions prepare_command() and vacuum_one_database() contain
>duplicate code, is there any problem in using prepare_command()
>function in vacuum_one_database(). Another point in this context
>is that I think it is better to name function prepare_command()
>as append_vacuum_options() or something on that lines, also it will
>be better if you can write function header for this function as well.
DONE
>5.
>+ if (error)
>+ {
>+ for (i = 0; i < max_slot; i++)
>+ {
>+ DisconnectDatabase(&connSlot[i]);
>+ }
>
>Here why do we need DisconnectDatabase() type of function?
>Why can't we simply call PQfinish() as in base code?
Because the base code just needs to handle the main connection, and when we send PQfinish it means the work has either completed or errored out;
in both cases control is with the client. But in the multi-connection case, if one connection fails then we need to send
a cancel on the other connections, and that is what DisconnectDatabase does.
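A rough sketch of what such a DisconnectDatabase-style cleanup can look like with documented libpq calls (PQtransactionStatus, PQgetCancel, PQcancel, PQfinish); the slot type and function name are assumptions of this sketch rather than the patch's exact code.

/*
 * Sketch: shut down all worker connections.  If a connection still has a
 * command in flight, first send a cancel request so the server stops
 * working on it, then close the connection.
 */
#include <libpq-fe.h>

typedef struct
{
    PGconn *conn;
} ParallelSlot;

static void
disconnect_all(ParallelSlot *slots, int nslots)
{
    int     i;
    char    errbuf[256];

    for (i = 0; i < nslots; i++)
    {
        if (slots[i].conn == NULL)
            continue;

        /* A command is still running: ask the server to cancel it first. */
        if (PQtransactionStatus(slots[i].conn) == PQTRANS_ACTIVE)
        {
            PGcancel   *cancel = PQgetCancel(slots[i].conn);

            if (cancel != NULL)
            {
                (void) PQcancel(cancel, errbuf, sizeof(errbuf));
                PQfreeCancel(cancel);
            }
        }

        PQfinish(slots[i].conn);
        slots[i].conn = NULL;
    }
}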
>
>6.
>+ /*
>+ * if table list is not provided then we need to do vaccum for whole DB
>+ * get the list of all tables and prpare the list
>+ */
>spelling of prepare is wrong. I have noticed spell mistake
>in comments at some other place as well, please check all
>comments once
FIXED
>
>7. I think in new mechanism cancel handler will not work.
>In single connection vacuum it was always set/reset
>in function executeMaintenanceCommand(). You might need
>to set/reset it in function run_parallel_vacuum().
Good catch. Now I call SetCancelConn(pSlot[0].connection) on the first connection; this enables the cancel
handler to cancel the query on the first connection so that the select loop will come out.
24 August
---------
>
>1. I could see one shortcoming in the way the patch has currently parallelize the
> work for --analyze-in-stages. Basically patch is performing the work for each stage
> for multiple tables in concurrent connections that seems okay for the cases when
> number of parallel connections is less than equal to number of tables, but for
> the case when user has asked for more number of connections than number of tables,
> then I think this strategy will not be able to use the extra connections.
I think --analyze-in-stages should maintain the order.
>2. Similarly for the case of multiple databases, currently it will not be able
> to use connections more than number of tables in each database because the
> parallelizing strategy is to just use the conncurrent connections for
> tables inside single database.
I think for multiple databases, doing the parallel execution means we need to maintain multiple connections to multiple databases,
and we need to maintain a combined table list for all the databases in order to run them concurrently. I think this may impact the startup cost,
as we need to create multiple connections and disconnect them just to prepare the list, and I think it will increase the complexity as well.
>I am not completely sure whether current strategy is good enough or
>we should try to address the above problems. What do you think?
>3.
>+ do
>+ {
>+ i = select_loop(maxFd, &slotset);
>+ Assert(i != 0);
>
>Could you explain the reason of using this loop, I think you
>want to wait for data on socket descriptor, but why for maxFd?
>Also it is better if you explain this logic in comments.
We send a vacuum job to a connection, and when none of the connection slots is free we wait on all the sockets
until one of them gets freed.
>4.
>+ for (i = 0; i < max_slot; i++)
>+ {
>+ if (!FD_ISSET(pSlot[i].sock, &slotset))
>+ continue;
>+
>+ PQconsumeInput(pSlot[i].connection);
>+ if (PQisBusy(pSlot[i].connection))
>+ continue;
>
>I think it is better to call PQconsumeInput() only if you find
>connection is busy.
I think here the logic is a bit different. In other places in the code, PQconsumeInput is called together with PQisBusy in a loop to consume all the data;
but in our case we have already sent the query, so we consume whatever is on the network and then check PQisBusy:
if it is not busy, it means this connection has become free.
PQconsumeInput only consumes input when input is available, so I think we do not need to check PQisBusy before calling it externally.
However, in the case where the caller has to fetch the complete data, they need to call PQisBusy and then PQconsumeInput in a loop, so that PQconsumeInput
is not called in a tight loop and is called only while the connection is busy.
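A small sketch of the pattern just described: for each slot whose socket select() reported readable, absorb the pending input once, then ask PQisBusy(); only when the connection is no longer busy are its results drained and the slot marked free. The slot fields and the function name are illustrative, not the patch's exact code.

/*
 * Sketch only.  "readable" is the fd_set returned by select() over the
 * worker sockets.
 */
#include <stdio.h>
#include <sys/select.h>
#include <libpq-fe.h>

typedef struct
{
    PGconn *connection;
    int     sock;
    int     isFree;
} ParallelSlot;

static void
mark_finished_slots(ParallelSlot *slots, int nslots, fd_set *readable)
{
    int         i;
    PGresult   *res;

    for (i = 0; i < nslots; i++)
    {
        if (slots[i].isFree || !FD_ISSET(slots[i].sock, readable))
            continue;

        if (!PQconsumeInput(slots[i].connection))   /* read what arrived */
            fprintf(stderr, "%s", PQerrorMessage(slots[i].connection));

        if (PQisBusy(slots[i].connection))          /* result not complete yet */
            continue;

        while ((res = PQgetResult(slots[i].connection)) != NULL)
            PQclear(res);                           /* drain finished results */

        slots[i].isFree = 1;
    }
}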
Regards,
Dilip
Attachment
On Wed, Sep 24, 2014 at 2:48 AM, Dilip kumar <dilip.kumar@huawei.com> wrote:
> On 24 August 2014 11:33, Amit Kapila Wrote
> Thanks for you comments, i have worked on both the review comment lists, sent on 19 August, and 24 August.
> Latest patch is attached with the mail..

Hi Dilip,

I think you have an off-by-one error in the index into the array of file handles.
> On 24 August 2014 11:33, Amit Kapila Wrote
>
> >7. I think in new mechanism cancel handler will not work.
>
> >In single connection vacuum it was always set/reset
>
> >in function executeMaintenanceCommand(). You might need
>
> >to set/reset it in function run_parallel_vacuum().
>
>
>
> Good catch, Now i have called SetCancelConn(pSlot[0].connection), on first connection. this will enable cancle
>
> handler to cancle the query on first connection so that select loop will come out.
>
>
>
>
> >1. I could see one shortcoming in the way the patch has currently parallelize the
>
> > work for --analyze-in-stages. Basically patch is performing the work for each stage
>
> > for multiple tables in concurrent connections that seems okay for the cases when
>
> > number of parallel connections is less than equal to number of tables, but for
>
> > the case when user has asked for more number of connections than number of tables,
>
> > then I think this strategy will not be able to use the extra connections.
>
>
>
> I think --analyze-in-stages should maintain the order.
Yes, you are right. So let's keep the code as it is for this case.
>
>
> >2. Similarly for the case of multiple databases, currently it will not be able
>
> > to use connections more than number of tables in each database because the
>
> > parallelizing strategy is to just use the conncurrent connections for
>
> > tables inside single database.
>
>
>
> I think for multiple database doing the parallel execution we need to maintain the multiple connection with multiple databases.
> And we need to maintain a table list for all the databases together to run them concurrently. I think this may impact the startup cost,
> as we need to create a multiple connection and disconnect for preparing the list
Amit Kapila wrote:

> Today while again thinking about the startegy used in patch to
> parallelize the operation (vacuum database), I think we can
> improve the same for cases when number of connections are
> lesser than number of tables in database (which I presume
> will normally be the case).  Currently we are sending command
> to vacuum one table per connection, how about sending multiple
> commands (example Vacuum t1; Vacuum t2) on one connection.
> It seems to me there is extra roundtrip for cases when there
> are many small tables in database and few large tables.  Do
> you think we should optimize for any such cases?

I don't think this is a good idea; at least not in a first cut of this
patch.  It's easy to imagine that a table you initially think is small
enough turns out to have grown much larger since last analyze.  In that
case, putting one worker to process that one together with some other
table could end up being bad for parallelism, if later it turns out that
some other worker has no table to process.  (Table t2 in your example
could grown between the time the command is sent and t1 is vacuumed.)

It's simpler to have workers do one thing at a time only.

I don't think it's a very good idea to call pg_relation_size() on every
table in the database from vacuumdb.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>
> Amit Kapila wrote:
>
> > Today while again thinking about the startegy used in patch to
> > parallelize the operation (vacuum database), I think we can
> > improve the same for cases when number of connections are
> > lesser than number of tables in database (which I presume
> > will normally be the case). Currently we are sending command
> > to vacuum one table per connection, how about sending multiple
> > commands (example Vacuum t1; Vacuum t2) on one connection.
> > It seems to me there is extra roundtrip for cases when there
> > are many small tables in database and few large tables. Do
> > you think we should optimize for any such cases?
>
> I don't think this is a good idea; at least not in a first cut of this
> patch. It's easy to imagine that a table you initially think is small
> enough turns out to have grown much larger since last analyze.
> case, putting one worker to process that one together with some other
> table could end up being bad for parallelism, if later it turns out that
> some other worker has no table to process. (Table t2 in your example
> could grown between the time the command is sent and t1 is vacuumed.)
>
> It's simpler to have workers do one thing at a time only.
Yeah, probably that is best, at least for the initial patch.
> I don't think it's a very good idea to call pg_relation_size() on every
> table in the database from vacuumdb.
On 27/09/14 01:36, Alvaro Herrera wrote:
> Amit Kapila wrote:
>
>> Today while again thinking about the startegy used in patch to
>> parallelize the operation (vacuum database), I think we can
>> improve the same for cases when number of connections are
>> lesser than number of tables in database (which I presume
>> will normally be the case). Currently we are sending command
>> to vacuum one table per connection, how about sending multiple
>> commands (example Vacuum t1; Vacuum t2) on one connection.
>> It seems to me there is extra roundtrip for cases when there
>> are many small tables in database and few large tables. Do
>> you think we should optimize for any such cases?
> I don't think this is a good idea; at least not in a first cut of this
> patch.  It's easy to imagine that a table you initially think is small
> enough turns out to have grown much larger since last analyze.  In that
> case, putting one worker to process that one together with some other
> table could end up being bad for parallelism, if later it turns out that
> some other worker has no table to process.  (Table t2 in your example
> could grown between the time the command is sent and t1 is vacuumed.)
>
> It's simpler to have workers do one thing at a time only.
>
> I don't think it's a very good idea to call pg_relation_size() on every
> table in the database from vacuumdb.
>
Curious: would it be both feasible and useful to have multiple workers
process a 'large' table, without complicating things too much? The
could each start at a different position in the file.

Cheers,
Gavin
Gavin Flower wrote:

> Curious: would it be both feasible and useful to have multiple
> workers process a 'large' table, without complicating things too
> much? The could each start at a different position in the file.

Feasible: no.  Useful: maybe, we don't really know.  (You could just as
well have a worker at double the speed, i.e. double vacuum_cost_limit).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 9/26/14, 2:38 PM, Gavin Flower wrote:
> Curious: would it be both feasible and useful to have multiple workers
> process a 'large' table, without complicating things too much? The
> could each start at a different position in the file.

Not really feasible without a major overhaul.  It might be mildly useful
in one rare case.  Occasionally I'll find very hot single tables that
vacuum is constantly processing, despite mostly living in RAM because the
server has a lot of memory.  You can set vacuum_cost_page_hit=0 in order
to get vacuum to chug through such a table as fast as possible.

However, the speed at which that happens will often then be limited by
how fast a single core can read from memory, for things in
shared_buffers.  That is limited by the speed of memory transfers from a
single NUMA memory bank.  Which bank you get will vary depending on the
core that owns that part of shared_buffers' memory, but it's only one at
a time.

On large servers, that can be only a small fraction of the total memory
bandwidth the server is able to reach.  I've attached a graph showing how
this works on a system with many NUMA banks of RAM, and this is only a
medium sized system.  This server can hit 40GB/s of memory transfers in
total; no one process will ever see more than 8GB/s.

If we had more vacuum processes running against the same table, there
would then be more situations where they were doing work against
different NUMA memory banks at the same time, therefore making faster
progress through the hits in shared_buffers possible.  In the real world,
this situation is rare enough compared to disk-bound vacuum work that I
doubt it's worth getting excited over.  Systems with lots of RAM where
performance is regularly dominated by one big ugly table are common
though, so I wouldn't just rule the idea out as not useful either.

--
Greg Smith greg.smith@crunchydatasolutions.com
Chief PostgreSQL Evangelist - http://crunchydatasolutions.com/
Attachment
On 27/09/14 11:36, Gregory Smith wrote:
> On 9/26/14, 2:38 PM, Gavin Flower wrote:
>> Curious: would it be both feasible and useful to have multiple
>> workers process a 'large' table, without complicating things too
>> much? The could each start at a different position in the file.
>
> Not really feasible without a major overhaul.  It might be mildly
> useful in one rare case.  Occasionally I'll find very hot single
> tables that vacuum is constantly processing, despite mostly living in
> RAM because the server has a lot of memory.  You can set
> vacuum_cost_page_hit=0 in order to get vacuum to chug through such a
> table as fast as possible.
>
> However, the speed at which that happens will often then be limited by
> how fast a single core can read from memory, for things in
> shared_buffers.  That is limited by the speed of memory transfers from
> a single NUMA memory bank.  Which bank you get will vary depending on
> the core that owns that part of shared_buffers' memory, but it's only
> one at a time.
>
> On large servers, that can be only a small fraction of the total
> memory bandwidth the server is able to reach.  I've attached a graph
> showing how this works on a system with many NUMA banks of RAM, and
> this is only a medium sized system.  This server can hit 40GB/s of
> memory transfers in total; no one process will ever see more than 8GB/s.
>
> If we had more vacuum processes running against the same table, there
> would then be more situations where they were doing work against
> different NUMA memory banks at the same time, therefore making faster
> progress through the hits in shared_buffers possible.  In the real
> world, this situation is rare enough compared to disk-bound vacuum
> work that I doubt it's worth getting excited over.  Systems with lots
> of RAM where performance is regularly dominated by one big ugly table
> are common though, so I wouldn't just rule the idea out as not useful
> either.
>
Thanks for the very detailed reply of yours, and the comments from others.

Cheers,
Gavin
On 26 September 2014 01:24, Jeff Janes Wrote,
>I think you have an off-by-one error in the index into the array of file handles.
>Actually the problem is that the socket for the master connection was not getting initialized, see my one line addition here.
> connSlot = (ParallelSlot*)pg_malloc(concurrentCons * sizeof(ParallelSlot));
> connSlot[0].connection = conn;
>+ connSlot[0].sock = PQsocket(conn);
Thanks for the review, I have fixed this.
>However, I don't think it is good to just ignore errors from the select call (like the EBADF) and go into a busy loop instead, so there are more changes needed than this.
Actually, I implemented this select_loop function the same way other client applications handle it, i.e. pg_dump in parallel.c; however, parallel.c also handles the case where the process is aborting (if Ctrl+C is received),
and we need to handle the same, so I have fixed this in the attached patch.
>Also, cancelling the run (by hitting ctrl-C in the shell that invoked it) does not seem to work on linux. I get a message that says "Cancel request sent", but then it continues to finish the job anyway.
Apart from the above-mentioned reason, GetQueryResult was also not setting SetCancelConn, as Amit pointed out; now this is also fixed.
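As an illustration of the kind of select_loop() handling being discussed (retry when interrupted by a signal, stop when a cancel has been requested rather than spinning on a persistent error such as EBADF), here is a hedged sketch; the cancel_requested flag is a stand-in for whatever abort flag the real code consults, and the function shape is an assumption of this sketch.

/*
 * Sketch: wait for any of the worker sockets to become readable.  Restores
 * the fd_set each iteration because select() modifies it in place.
 */
#include <errno.h>
#include <stdbool.h>
#include <sys/select.h>

/* Stand-in for the real "user pressed Ctrl+C" flag (assumption). */
static volatile bool cancel_requested = false;

static int
select_loop(int maxFd, fd_set *workerset)
{
    int     i;
    fd_set  saveSet = *workerset;

    for (;;)
    {
        *workerset = saveSet;
        i = select(maxFd + 1, workerset, NULL, NULL, NULL);

        if (cancel_requested)
            return -1;              /* caller should clean up and exit */

        if (i < 0 && errno == EINTR)
            continue;               /* interrupted by a signal: retry */

        return i;                   /* ready count, or a hard error such as EBADF */
    }
}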
Regards,
Dilip Kumar
Attachment
<div class="WordSection1"><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">On</span><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">26September 2014 12:24, Amit Kapila Wrote,</span><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"></span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>Idon't think this can handle cancel requests properly because</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>you are just settingit in GetIdleSlot() what if the cancel</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>requestcame during GetQueryResult() after sending sql for</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>all connections (probablythats the reason why Jeff is not able</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>tocancel the vacuumdb when using parallel option).</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">You are right, I have fixed, it in latestpatch, please check latest patch @ (</span><span style="font-size:8.5pt;font-family:"Verdana","sans-serif";color:black"><a href="http://www.postgresql.org/message-id/4205E661176A124FAF891E0A6BA9135266363710@szxeml509-mbs.china.huawei.com">4205E661176A124FAF891E0A6BA9135266363710@szxeml509-mbs.china.huawei.com</a></span><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">)</span><pclass="MsoNormal"><i><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span></i><pclass="MsoNormal"><i><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">dilip@linux-ltr9:/home/dilip/9.4/install/bin>./vacuumdb -z-a -j 8 -p 9005</span></i><p class="MsoNormal"><i><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">vacuumdb:vacuuming database "db1"</span></i><p class="MsoNormal"><i><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">vacuumdb: vacuuming database "postgres"</span></i><pclass="MsoNormal"><i><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">Cancel requestsent</span></i><p class="MsoNormal"><i><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif";color:red">vacuumdb:vacuuming of database "postgres" failed: ERROR: canceling statement due to user request</span></i><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif";color:red"></span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>Fewother points</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>1.</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+vacuum_parallel(const char *dbname, bool full, bool verbose,</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>{</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>..</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ connSlot = (ParallelSlot*)pg_malloc(concurrentCons* 
sizeof(ParallelSlot));</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ connSlot[0].connection = conn;</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">Fixed</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>a.</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>Doesabove memory gets freed anywhere, if not isn't it</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>good idea to do the same</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>b. For slot 0, you arenot seeting it as PQsetnonblocking,</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>whereas I think it can be used to run commands like any other</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>connection.</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">Yes,this was missing in the code, I have fixed it..</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>2.</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ /*</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ * If user has given the vacuum of complete db, thenif</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ * any ofthe object vacuum </span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>failedit can be ignored and vacuuming</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ * of other object can be continued,this is the same </span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>behavioras</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ * vacuuming of complete db is handled without --jobsoption</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ */</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">> </span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>s/object/object's</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">FIXED</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>3.</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ if(!completedb ||</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ (sqlState&& strcmp(sqlState, </span><p class="MsoNormal"><span 
style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>ERRCODE_UNDEFINED_TABLE)!= 0))</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ {</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ </span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ fprintf(stderr,_("%s: vacuuming of </span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>database\"%s\" failed: %s"),</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ progname, dbname, PQerrorMessage</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>(conn));</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">> </span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>Indentation on both places is wrong. Check other palces for</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>similarissues.</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">FIXED</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>4.</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>+ boolanalyze_only, bool freeze, int numAsyncCons,</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>Incode still there is reference to AsyncCons, as decided lets</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">>change it to concurrent_connections| conc_cons</span><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">FIXED</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">Regards,</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">Dilip</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span></div>
On 27 September 2014 03:55, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Sep 26, 2014 at 11:47 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>>
>> Gavin Flower wrote:
>>
>> > Curious: would it be both feasible and useful to have multiple
>> > workers process a 'large' table, without complicating things too
>> > much? The could each start at a different position in the file.
>>
>> Feasible: no.  Useful: maybe, we don't really know.  (You could just as
>> well have a worker at double the speed, i.e. double vacuum_cost_limit).
>
> Vacuum_cost_delay is already 0 by default.  So unless you changed that,
> vacuum_cost_limit does not take effect under vacuumdb.
>
> It is pretty easy for vacuum to be CPU limited, and even easier for analyze
> to be CPU limited (It does a lot of sorting).  I think analyzing is the main
> use case for this patch, to shorten the pg_upgrade window.  At least, that
> is how I anticipate using it.

I've been trying to review this thread with the thought "what does
this give me?". I am keen to encourage contributions and also keen to
extend our feature set, but I do not wish to complicate our code base.
Dilip's developments do seem to be good quality; what I question is
whether we want this feature.

This patch seems to allow me to run multiple VACUUMs at once. But I
can already do this, with autovacuum.

Is there anything this patch can do that cannot be already done with autovacuum?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>
>
> I've been trying to review this thread with the thought "what does
> this give me?". I am keen to encourage contributions and also keen to
> extend our feature set, but I do not wish to complicate our code base.
> Dilip's developments do seem to be good quality; what I question is
> whether we want this feature.
>
> This patch seems to allow me to run multiple VACUUMs at once. But I
> can already do this, with autovacuum.
>
> Is there anything this patch can do that cannot be already done with autovacuum?
The difference lies in the fact that vacuumdb (or VACUUM) gives the user the option to control the vacuum activity for cases when autovacuum doesn't suffice the need; one example is to perform vacuum via vacuumdb after pg_upgrade or some other maintenance activity (as mentioned by Jeff upthread). So I think in all such cases having a parallel option can give a benefit in terms of performance, which has already been shown by Dilip upthread by running some tests (with and without the patch).
On 16 October 2014 06:05, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Oct 16, 2014 at 8:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> I've been trying to review this thread with the thought "what does
>> this give me?". I am keen to encourage contributions and also keen to
>> extend our feature set, but I do not wish to complicate our code base.
>> Dilip's developments do seem to be good quality; what I question is
>> whether we want this feature.
>>
>> This patch seems to allow me to run multiple VACUUMs at once. But I
>> can already do this, with autovacuum.
>>
>> Is there anything this patch can do that cannot be already done with
>> autovacuum?
>
> The difference lies in the fact that vacuumdb (or VACUUM) gives
> the option to user to control the vacuum activity for cases when
> autovacuum doesn't suffice the need, one of the example is to perform
> vacuum via vacuumdb after pg_upgrade or some other maintenance
> activity (as mentioned by Jeff upthread). So I think in all such cases
> having parallel option can give benefit in terms of performance which
> is already shown by Dilip upthread by running some tests (with and
> without patch).

Why do we need 2 ways to do the same thing?

Why not ask autovacuum to do this for you?

Just send a message to autovacuum to request an immediate action. Let
it manage the children and the tasks.

SELECT pg_autovacuum_immediate(nworkers = N, list_of_tables);

Request would allocate an additional N workers and immediately begin
vacuuming the stated tables.

vacuumdb can still issue the request, but the guts of this are done by
the server, not a heavily modified client.

If we are going to heavily modify a client then it needs to be able to
run more than just one thing. Parallel psql would be nice. pg_batch?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> On 16 October 2014 06:05, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Oct 16, 2014 at 8:08 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >>
> >> This patch seems to allow me to run multiple VACUUMs at once. But I
> >> can already do this, with autovacuum.
> >>
> >> Is there anything this patch can do that cannot be already done with
> >> autovacuum?
> >
> > The difference lies in the fact that vacuumdb (or VACUUM) gives
> > the option to user to control the vacuum activity for cases when
> > autovacuum doesn't suffice the need, one of the example is to perform
> > vacuum via vacuumdb after pg_upgrade or some other maintenance
> > activity (as mentioned by Jeff upthread). So I think in all such cases
> > having parallel option can give benefit in terms of performance which
> > is already shown by Dilip upthread by running some tests (with and
> > without patch).
>
> Why do we need 2 ways to do the same thing?
> Why not ask autovacuum to do this for you?
>
> Just send a message to autovacuum to request an immediate action. Let
> it manage the children and the tasks.
>
> SELECT pg_autovacuum_immediate(nworkers = N, list_of_tables);
>
> Request would allocate an additional N workers and immediately begin
> vacuuming the stated tables.
I think doing anything on the server side can have higher complexity like:
> vacuumdb can still issue the request, but the guts of this are done by
> the server, not a heavily modified client.
>
> If we are going to heavily modify a client then it needs to be able to
> run more than just one thing. Parallel psql would be nice. pg_batch?
Could you be more specific in this point, I am not able to see how
On 16 October 2014 15:09, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Just send a message to autovacuum to request an immediate action. Let
>> it manage the children and the tasks.
>>
>> SELECT pg_autovacuum_immediate(nworkers = N, list_of_tables);
>>
>> Request would allocate an additional N workers and immediately begin
>> vacuuming the stated tables.
>
> I think doing anything on the server side can have higher complexity like:
> a. Does this function return immediately after sending request to
>    autovacuum, if yes then the behaviour of this new functionality
>    will be different as compare to vacuumdb which user might not
>    expect?
> b. We need to have some way to handle errors that can occur in
>    autovacuum (may be need to find a way to pass back to user),
>    vacuumdb or Vacuum can report error to user.
> c. How does nworkers input relates to autovacuum_max_workers
>    which is needed at start for shared memory initialization and in calc
>    of maxbackends.
> d. How to handle database wide vacuum which is possible via vacuumdb
> e. What is the best UI (a command like above or via config parameters)?

c) seems like the only issue that needs any thought. I don't think its
going to be that hard.

I don't see any problems with the other points. You can make a
function wait, if you wish.

> I think we can find a way for above and may be if any other similar things
> needs to be taken care, but still I think it is better that we have this
> feature
> via vacuumdb rather than adding complexity in server code. Also the
> current algorithm used in patch is discussed and agreed upon in this
> thread and if now we want to go via some other method (auto vacuum),
> it might take much more time to build consensus on all the points.

Well, I read Alvaro's point from earlier in the thread and agreed with
it. All we really need is an instruction to autovacuum to say "be
aggressive".

Just because somebody added something to the TODO list doesn't make it
a good idea. I apologise to Dilip for saying this, it is not anything
against him, just the idea.

Perhaps we just accept the patch and change AV in the future.

>> vacuumdb can still issue the request, but the guts of this are done by
>> the server, not a heavily modified client.
>>
>> If we are going to heavily modify a client then it needs to be able to
>> run more than just one thing. Parallel psql would be nice. pg_batch?
>
> Could you be more specific in this point, I am not able to see how
> vacuumdb utility has anything to do with parallel psql?

That's my point. All this code in vacuumdb just for this one isolated
use case? Twice the maintenance burden. A more generic utility to run
commands in parallel would be useful.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>
> On 16 October 2014 15:09, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > I think doing anything on the server side can have higher complexity like:
> > a. Does this function return immediately after sending request to
> > autovacuum, if yes then the behaviour of this new functionality
> > will be different as compare to vacuumdb which user might not
> > expect?
> > b. We need to have some way to handle errors that can occur in
> > autovacuum (may be need to find a way to pass back to user),
> > vacuumdb or Vacuum can report error to user.
> > c. How does nworkers input relates to autovacuum_max_workers
> > which is needed at start for shared memory initialization and in calc
> > of maxbackends.
> > d. How to handle database wide vacuum which is possible via vacuumdb
> > e. What is the best UI (a command like above or via config parameters)?
>
>
> c) seems like the only issue that needs any thought. I don't think its
> going to be that hard.
>
> I don't see any problems with the other points. You can make a
> function wait, if you wish.
It is quite possible, but still I think to accomplish such a function,
> > I think we can find a way for above and may be if any other similar things
> > needs to be taken care, but still I think it is better that we have this
> > feature
> > via vacuumdb rather than adding complexity in server code. Also the
> > current algorithm used in patch is discussed and agreed upon in this
> > thread and if now we want to go via some other method (auto vacuum),
> > it might take much more time to build consensus on all the points.
>
> Well, I read Alvaro's point from earlier in the thread and agreed with
> it. All we really need is an instruction to autovacuum to say "be
> aggressive".
>
> Just because somebody added something to the TODO list doesn't make it
> a good idea. I apologise to Dilip for saying this, it is not anything
> against him, just the idea.
>
> Perhaps we just accept the patch and change AV in the future.
On 17 October 2014 12:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
> It is quite possible, but still I think to accomplish such a function,
> we need to have some mechanism where it can inform auto vacuum
> and then some changes in auto vacuum to receive/read that information
> and reply back. I don't think any such mechanism exists.

That's exactly how the CHECKPOINT command works, in conjunction with
the checkpointer process.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Amit Kapila wrote:
> On Fri, Oct 17, 2014 at 1:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> > On 16 October 2014 15:09, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > c) seems like the only issue that needs any thought. I don't think its
> > going to be that hard.
> >
> > I don't see any problems with the other points. You can make a
> > function wait, if you wish.
>
> It is quite possible, but still I think to accomplish such a function,
> we need to have some mechanism where it can inform auto vacuum
> and then some changes in auto vacuum to receive/read that information
> and reply back. I don't think any such mechanism exists.

You're right, it doesn't. I think we have plenty more infrastructure
for that than we had when autovacuum was initially developed. It
shouldn't be that hard.

Of course, this is a task that requires much more thinking, design, and
discussion than just adding multi-process capability to vacuumdb ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 17 October 2014 14:05, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Of course, this is a task that requires much more thinking, design, and
> discussion than just adding multi-process capability to vacuumdb ...

Yes, please proceed with this patch as originally envisaged. No more
comments from me.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>
> On 26 September 2014 12:24, Amit Kapila Wrote,
>
> >I don't think this can handle cancel requests properly because
>
> >you are just setting it in GetIdleSlot() what if the cancel
>
> >request came during GetQueryResult() after sending sql for
>
> >all connections (probably thats the reason why Jeff is not able
>
> >to cancel the vacuumdb when using parallel option).
>
>
>
> You are right, I have fixed it in the latest patch; please check the latest patch @ (4205E661176A124FAF891E0A6BA9135266363710@szxeml509-mbs.china.huawei.com)
>
*** 358,363 **** handle_sigint(SIGNAL_ARGS)
--- 358,364 ----
/* Send QueryCancel if we are processing a database query */
if (cancelConn != NULL)
{
+ inAbort = true;
if (PQcancel(cancelConn, errbuf, sizeof(errbuf)))
fprintf(stderr, _("Cancel request sent\n"));
else
>
>
> ***************
> *** 358,363 **** handle_sigint(SIGNAL_ARGS)
> --- 358,364 ----
>
> /* Send QueryCancel if we are processing a database query */
> if (cancelConn != NULL)
> {
> + inAbort = true;
> if (PQcancel(cancelConn, errbuf, sizeof(errbuf)))
> fprintf(stderr, _("Cancel request sent\n"));
> else
>
> Do we need to set inAbort flag incase PQcancel is successful?
> Basically if PQCancel fails due to any reason, I think behaviour
> can be undefined as the executing thread can assume that cancel is
> done.
>
> *** 391,396 **** consoleHandler(DWORD dwCtrlType)
> --- 392,399 ----
> EnterCriticalSection
> (&cancelConnLock);
> if (cancelConn != NULL)
> {
> + inAbort =
> true;
> +
>
> You have set this flag in case of windows handler, however the same
> is never used incase of windows, are you expecting any use of this
> flag for windows?
RAM = 64GB
max_connections = 128
checkpoint_segments=256
checkpoint_timeout =15min
Attachment
<div class="WordSection1"><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">On</span><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"></span><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">25October 2014 17:52, Amit Kapila Wrote,</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>***************</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>*** 358,363 **** handle_sigint(SIGNAL_ARGS)</span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>---358,364 ----</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">> </span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">> /* Send QueryCancelif we are processing a database query */</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">> if (cancelConn != NULL)</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">> {</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>+ inAbort = true;</span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">> if(PQcancel(cancelConn, errbuf, sizeof(errbuf)))</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">> fprintf(stderr, _("Cancel request sent\n"));</span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">> else</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">> </span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>Dowe need to set inAbort flag incase PQcancelis successful?</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>Basicallyif PQCancel fails due to any reason,I think behaviour</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>canbe undefined as the executing thread canassume that cancel is</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>done.</span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">> </span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>***391,396 **** consoleHandler(DWORD dwCtrlType)</span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>---392,399 ----</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">> EnterCriticalSection</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>(&cancelConnLock);</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">> if (cancelConn != NULL)</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">> {</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>+ inAbort = </span><p 
class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>true;</span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">>+</span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black"> </span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">In“<b>handle_sigint</b>” function if we are goingto cancel the query that time I am setting the flag <b>inAbort </b>(even when it is success), so that in “<b>select_loop</b>”function </span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">If<b>select(maxFd + 1, workerset, NULL, NULL, &tv);</b>come out, we can know whether it came out because of cancel query and handle it accordingly. </span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black"> i = select(maxFd+ 1, workerset, NULL, NULL, NULL);</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black"> </span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black"> if (in_abort()) <b>//loopbreak because of cancel query, so return fail…</b></span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black"> {</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black"> return-1;</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black"> }</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black"> </span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black"> if (i < 0 &&errno == EINTR)</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black"> continue;</span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">Regards,</span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">DilipKumar</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span></div>
>
> On 25 October 2014 17:52, Amit Kapila Wrote,
>
> >***************
>
> >*** 358,363 **** handle_sigint(SIGNAL_ARGS)
>
> >--- 358,364 ----
>
> >
>
> > /* Send QueryCancel if we are processing a database query */
>
> > if (cancelConn != NULL)
>
> > {
>
> >+ inAbort = true;
>
> > if (PQcancel(cancelConn, errbuf, sizeof(errbuf)))
>
> > fprintf(stderr, _("Cancel request sent\n"));
>
> > else
>
> >
>
> >Do we need to set inAbort flag incase PQcancel is successful?
>
> >Basically if PQCancel fails due to any reason, I think behaviour
>
> >can be undefined as the executing thread can assume that cancel is
>
> >done.
>
> >
>
> >*** 391,396 **** consoleHandler(DWORD dwCtrlType)
>
> >--- 392,399 ----
>
> > EnterCriticalSection
>
> >(&cancelConnLock);
>
> > if (cancelConn != NULL)
>
> > {
>
> >+ inAbort =
>
> >true;
>
> >+
>
>
>
> In “handle_sigint” function if we are going to cancel the query that time I am setting the flag inAbort (even when it is success), so that in “select_loop” function
>
> If select(maxFd + 1, workerset, NULL, NULL, &tv); come out, we can know whether it came out because of cancel query and handle it accordingly.
>
Yeah, it is fine for the case when PQCancel() is successful, what
<div class="WordSection1"><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">On</span><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"></span><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">28October 2014 09:18, Amit Kapila Wrote,</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">>Iam worried about the case if after setting theinAbort flag,</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">>PQCancel()fails (returns error).</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">> </span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">>> If select(maxFd +1, workerset, NULL, NULL, &tv); come out, we can know whether it came out because of cancel query and handle it accordingly.</span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">>> </span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">> </span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">>Yeah,it is fine for the case when PQCancel()is successful, what</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">>ifit fails?</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">>I think even if selectcomes out due to any other reason, it will behave</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">>asif it came out due to Cancel, even though actuallyCancel is failed,</span><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">>howare planning to handle that case?</span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black"> </span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">Ithink If PQcancel fails then also there is no problem,because we are setting inAbort flag in handle_sigint handler, it means user have tried to terminate.</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black"> </span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">So in this case as well wewill find that <b>inAbort </b>is set so we return error, and in error case we finally call <b>DisconnectDatabase</b>, andin this function we will send the <b>PQcancel</b> for all active connection and then only we disconnect.</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black"> </span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">Regards,</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:black">DIlip</span></div>
> On 28 October 2014 09:18, Amit Kapila Wrote,
>
> >I am worried about the case if after setting the inAbort flag,
>
> >PQCancel() fails (returns error).
>
> >
>
> >> If select(maxFd + 1, workerset, NULL, NULL, &tv); come out, we can know whether it came out because of cancel query and handle it accordingly.
>
> >>
>
> >
>
> >Yeah, it is fine for the case when PQCancel() is successful, what
>
> >if it fails?
>
> >I think even if select comes out due to any other reason, it will behave
>
> >as if it came out due to Cancel, even though actually Cancel is failed,
>
> >how are planning to handle that case?
>
>
>
> I think If PQcancel fails then also there is no problem, because we are setting inAbort flag in handle_sigint handler, it means user have tried to terminate.
>
message: "Could not send cancel request" in such a case and still
{
..
for (cell = tables->head; cell; cell = cell->next)
{
/*
* This will give the free connection slot, if no slot is free it will
* wait for atleast one slot to get free.
*/
free_slot = GetIdleSlot(connSlot, max_slot, dbname, progname,
completedb);
if (free_slot == NO_SLOT)
{
error = true;
goto fail;
}
prepare_command(connSlot[free_slot].connection, full, verbose,
and_analyze, analyze_only, freeze, &sql);
appendPQExpBuffer(&sql, " %s", cell->val);
connSlot[free_slot].isFree = false;
slotconn = connSlot[free_slot].connection;
PQsendQuery(slotconn, sql.data);
resetPQExpBuffer(&sql);
}
..
}
>
> Going further with verification of this patch, I found below issue:
> Run the testcase.sql file at below link:
> http://www.postgresql.org/message-id/4205E661176A124FAF891E0A6BA9135266347F25@szxeml509-mbs.china.huawei.com
> ./vacuumdb --analyze-in-stages -j 8 -d postgres
> Generating minimal optimizer statistics (1 target)
> Segmentation fault
>
> Server Log:
> ERROR: syntax error at or near "minimal" at character 12
> STATEMENT: ANALYZE ng minimal optimizer statistics (1 target)
> LOG: could not receive data from client: Connection reset by peer
>
As mentioned by you offlist that you are not able reproduce this
<div class="WordSection1"><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">On</span><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"></span><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">13November 2014 15:35 Amit Kapila Wrote, </span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><p class="MsoNormal">>As mentionedby you offlist that you are not able reproduce this<p class="MsoNormal">>issue, I have tried again and what Iobserve is that I am able to<p class="MsoNormal">>reproduce it only on *release* build and some cases work without<pclass="MsoNormal">>this issue as well,<p class="MsoNormal">>example:<p class="MsoNormal">>./vacuumdb --analyze-in-stages-t t1 -t t2 -t t3 -t t4 -t t5 -t t6 -t t7 -t t8 -j 8 -d postgres<p class="MsoNormal">>Generating minimaloptimizer statistics (1 target)<p class="MsoNormal">>Generating medium optimizer statistics (10 targets)<p class="MsoNormal">>Generatingdefault (full) optimizer statistics<p class="MsoNormal"><br clear="all" /><p class="MsoNormal">>Soto me, it looks like this is a timing issue and please notice<p class="MsoNormal">>why in errorthe statement looks like<p class="MsoNormal">>"ANALYZE ng minimal optimizer statistics (1 target)". I think this<pclass="MsoNormal" style="margin-bottom:12.0pt">>is not a valid statement.<p class="MsoNormal">>Let me know ifyou still could not reproduce it.<p class="MsoNormal"> <p class="MsoNormal">Thank you for looking into it once again..<pclass="MsoNormal"> <p class="MsoNormal">I have tried with the release mode, but could not reproduce the same. Bylooking at server LOG sent by you “"ANALYZE ng minimal optimizer statistics (1 target)". ”, seems like some corruption.<pclass="MsoNormal"> <p class="MsoNormal">So actually looks like two issues here.<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0level1 lfo1"><span style="mso-list:Ignore">1.<span style="font:7.0pt "Times New Roman""> </span></span>Query string sent to server is getting corrupted.<p class="MsoListParagraph" style="text-indent:-18.0pt;mso-list:l0level1 lfo1"><span style="mso-list:Ignore">2.<span style="font:7.0pt "Times New Roman""> </span></span>Client is getting crashed.<p class="MsoNormal"> <p class="MsoNormal">I will review the code andtry to find the same, meanwhile if you can find some time to debug this, it will be really helpful.<p class="MsoNormal"> <pclass="MsoNormal"> <p class="MsoNormal">Regards,<p class="MsoNormal">Dilip</div>
>
> On 13 November 2014 15:35 Amit Kapila Wrote,
> >As mentioned by you offlist that you are not able reproduce this
>
> >issue, I have tried again and what I observe is that I am able to
>
> >reproduce it only on *release* build and some cases work without
>
> >this issue as well,
>
> >example:
>
> >./vacuumdb --analyze-in-stages -t t1 -t t2 -t t3 -t t4 -t t5 -t t6 -t t7 -t t8 -j 8 -d postgres
>
> >Generating minimal optimizer statistics (1 target)
>
> >Generating medium optimizer statistics (10 targets)
>
> >Generating default (full) optimizer statistics
>
>
> >So to me, it looks like this is a timing issue and please notice
>
> >why in error the statement looks like
>
> >"ANALYZE ng minimal optimizer statistics (1 target)". I think this
>
> >is not a valid statement.
>
> >Let me know if you still could not reproduce it.
>
>
> I will review the code and try to find the same, meanwhile if you can find some time to debug this, it will be really helpful.
>
I think I have found the problem and fix for the same.
{
..
if (!tables || !tables->head)
{
SimpleStringList dbtables = {NULL, NULL};
...
..
tables = &dbtables;
}
..
}
and_analyze, analyze_only, freeze, &sql);
appendPQExpBuffer(&sql, " %s", cell->val);
..
Attachment
On 23 November 2014 14:45, Amit Kapila Wrote
Thanks a lot for debugging and fixing the issue..
>The stacktrace of crash is as below:
>#0 0x00000080108cf3a4 in .strlen () from /lib64/libc.so.6
>#1 0x00000080108925bc in ._IO_vfprintf () from /lib64/libc.so.6
>#2 0x00000080108bc1e0 in .__GL__IO_vsnprintf_vsnprintf () from /lib64/libc.so.6
>#3 0x00000fff7e581590 in .appendPQExpBufferVA () from
>/data/akapila/workspace/master/installation/lib/libpq.so.5
>#4 0x00000fff7e581774 in .appendPQExpBuffer () from
>/data/akapila/workspace/master/installation/lib/libpq.so.5
>#5 0x0000000010003748 in .run_parallel_vacuum ()
>#6 0x0000000010003f60 in .vacuum_parallel ()
>#7 0x0000000010002ae4 in .main ()
>(gdb) f 5
>#5 0x0000000010003748 in .run_parallel_vacuum ()
>So now the real reason here is that the list of tables passed to
>function is corrupted. The below code seems to be the real
>culprit:
>
>vacuum_parallel()
>{
>..
>if (!tables || !tables->head)
>{
>SimpleStringList dbtables = {NULL, NULL};
>...
>..
> tables = &dbtables;
>}
>..
>}
>In above code dbtables is local to if loop and code
>is using the address of same to assign it to tables which
>is used out of if block scope, moving declaration to the
>outer scope fixes the problem in my environment. Find the
>updated patch that fixes this problem attached with this
>mail. Let me know your opinion about the same.
Yes, that is the reason for the corruption; it must be causing both issues, the corrupted query being sent to the server as well as the crash on the client side.
>While looking at this problem, I have noticed couple of other
>improvements:
>a. In prepare_command() function, patch is doing init of sql
>buffer (initPQExpBuffer(sql);) which I think is not required
>as both places from where this function is called, it is done by
>caller. I think this will lead to memory leak.
Fixed..
>b. In prepare_command() function, for fixed strings you can
>use appendPQExpBufferStr() which is what used in original code
>as well.
Changed as per comment..
>c.
>run_parallel_vacuum()
>{
>..
>prepare_command(connSlot[free_slot].connection, full, verbose,
>and_analyze, analyze_only, freeze, &sql);
>
>appendPQExpBuffer(&sql, " %s", cell->val);
>..
>}
>I think it is better to end command with ';' by using
>appendPQExpBufferStr(&sql, ";"); in above code.
Done
Latest patch is attached, please have a look.
Regards,
Dilip Kumar
Attachment
>
> On 23 November 2014 14:45, Amit Kapila Wrote
>
>
>
> Thanks a lot for debugging and fixing the issue..
>
>
>
> Latest patch is attached, please have a look.
>
I think still some of the comments given upthread are not handled:
On 24 November 2014 11:29, Amit Kapila Wrote,
>I think still some of the comments given upthread are not handled:
>
>a. About cancel handling
Your Actual comment was -->
>One other related point is that I think still cancel handling mechanism
>is not completely right, code is doing that when there are not enough
>number of freeslots, but otherwise it won't handle the cancel request,
>basically I am referring to below part of code:
run_parallel_vacuum()
{
..
for (cell = tables->head; cell; cell = cell->next)
{
..
free_slot = GetIdleSlot(connSlot, max_slot, dbname, progname,
completedb);
…
PQsendQuery(slotconn, sql.data);
resetPQExpBuffer(&sql);
}
1. I think the connection blocks only when it is waiting for a slot or while in GetQueryResult, so we have to handle SetCancelRequest in both places.
2. Now for the case (as you mentioned) when there are enough slots and the above for loop is running: if the user presses Ctrl+C, this loop will not break. I have handled this by checking the inAbort
mode inside the for loop before sending the new command. I think we cannot do this just by setting the cancel connection, because a query only realizes it was cancelled and fails once it receives some data; until the connection receives data it will not see the failure. So I have checked inAbort directly.
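For illustration, a sketch in the shape of the dispatch loop quoted earlier in this message, with the extra check described in point 2; in_abort() and the surrounding variables (connSlot, tables, sql, NO_SLOT, prepare_command) are assumed from the patch fragments in this thread, so treat this as a sketch rather than the patch itself:

    /* Sketch: re-check the Ctrl+C flag before handing out the next table. */
    for (cell = tables->head; cell; cell = cell->next)
    {
        int free_slot;

        if (in_abort())             /* SIGINT already seen: stop dispatching */
        {
            error = true;
            goto fail;
        }

        free_slot = GetIdleSlot(connSlot, max_slot, dbname, progname, completedb);
        if (free_slot == NO_SLOT)   /* cancelled or failed while waiting */
        {
            error = true;
            goto fail;
        }

        prepare_command(connSlot[free_slot].connection, full, verbose,
                        and_analyze, analyze_only, freeze, &sql);
        appendPQExpBuffer(&sql, " %s", cell->val);

        connSlot[free_slot].isFree = false;
        PQsendQuery(connSlot[free_slot].connection, sql.data);
        resetPQExpBuffer(&sql);
    }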
>b. Setting of inAbort flag for case where PQCancel is successful
Your Actual comment was -->
>Yeah, user has tried to terminate, however utility will emit the
>message: "Could not send cancel request" in such a case and still
>silently tries to cancel and disconnect all connections.
You are right, I have fixed the code; now in case of failure we need not set the inAbort flag..
>c. Performance data of --analyze-in-stages switch
Performance Data
------------------------------
CPU 8 cores
RAM = 16GB
checkpoint_segments=256
Before each test, run the test.sql (attached)
Un-patched -
dilip@linux-ltr9:/home/dilip/9.4/install/bin> time ./vacuumdb -p 9005 --analyze-in-stages -d postgres
Generating minimal optimizer statistics (1 target)
Generating medium optimizer statistics (10 targets)
Generating default (full) optimizer statistics
real 0m0.843s
user 0m0.000s
sys 0m0.000s
Patched
dilip@linux-ltr9:/home/dilip/9.4/install/bin> time ./vacuumdb -p 9005 --analyze-in-stages -j 2 -d postgres
Generating minimal optimizer statistics (1 target)
Generating medium optimizer statistics (10 targets)
Generating default (full) optimizer statistics
real 0m0.593s
user 0m0.004s
sys 0m0.004s
dilip@linux-ltr9:/home/dilip/9.4/install/bin> time ./vacuumdb -p 9005 --analyze-in-stages -j 4 -d postgres
Generating minimal optimizer statistics (1 target)
Generating medium optimizer statistics (10 targets)
Generating default (full) optimizer statistics
real 0m0.421s
user 0m0.004s
sys 0m0.004s
I think with 2 connections we can get a 30% improvement.
>d. Have one pass over the comments in patch. I could still some
>wrong multiline comments. Refer below:
>+ /* Otherwise, we got a stage from vacuum_all_databases(), so run
>+ * only that one. */
Checked all, and fixed..
While testing, I found one more behavior difference compared to the base code:
Base Code:
dilip@linux-ltr9:/home/dilip/9.4/install/bin> time ./vacuumdb -p 9005 -t "t1" -t "t2" -t "t3" -t "t4" --analyze-in-stages -d Postgres
Generating minimal optimizer statistics (1 target)
Generating medium optimizer statistics (10 targets)
Generating default (full) optimizer statistics
Generating minimal optimizer statistics (1 target)
Generating medium optimizer statistics (10 targets)
Generating default (full) optimizer statistics
Generating minimal optimizer statistics (1 target)
Generating medium optimizer statistics (10 targets)
Generating default (full) optimizer statistics
Generating minimal optimizer statistics (1 target)
Generating medium optimizer statistics (10 targets)
Generating default (full) optimizer statistics
real 0m0.605s
user 0m0.004s
sys 0m0.000s
I think it should be like this (a small sketch of this ordering follows below):
SET default_statistics_target=1; do this for all the tables
SET default_statistics_target=10; do this for all the tables, and so on..
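To make that ordering concrete, here is a small sketch of running the stage setup commands as the outer loop and the listed tables as the inner loop. The stage strings mirror the targets that --analyze-in-stages prints (1, 10, full); the function name, the fixed buffer size, and the omitted error checking are illustrative assumptions, not the patch code:

    #include <stdio.h>
    #include "libpq-fe.h"

    /* Stage setup commands, one per --analyze-in-stages stage. */
    static const char *const stage_commands[] = {
        "SET default_statistics_target = 1; SET vacuum_cost_delay = 0;",
        "SET default_statistics_target = 10; RESET vacuum_cost_delay;",
        "RESET default_statistics_target;"
    };

    /* Hypothetical helper: analyze every table at stage 1 before any table
     * moves on to stage 2, and so on.  Error checking omitted for brevity. */
    static void
    analyze_stage_by_stage(PGconn *conn, const char *const *tables, int ntables)
    {
        for (int stage = 0; stage < 3; stage++)
        {
            PQclear(PQexec(conn, stage_commands[stage]));

            for (int i = 0; i < ntables; i++)
            {
                char sql[1024];

                snprintf(sql, sizeof(sql), "ANALYZE %s;", tables[i]);
                PQclear(PQexec(conn, sql));
            }
        }
    }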
With Patched
dilip@linux-ltr9:/home/dilip/9.4/install/bin> time ./vacuumdb -p 9005 -t "t1" -t "t2" -t "t3" -t "t4" --analyze-in-stages -j 2 -d postgres
Generating minimal optimizer statistics (1 target)
Generating medium optimizer statistics (10 targets)
Generating default (full) optimizer statistics
real 0m0.395s
user 0m0.000s
sys 0m0.004s
here we are setting each target once and doing it for all the tables..
Please provide your opinion.
Regards,
Dilip Kumar
Attachment
On Mon, Dec 1, 2014 at 12:18 PM, Dilip kumar <dilip.kumar@huawei.com> wrote:
>
> On 24 November 2014 11:29, Amit Kapila Wrote,
>
> here we are setting each target once and doing for all the tables..
>
>
> Please provide you opinion.
On Sat, Dec 6, 2014 at 9:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> If you agree, then we should try to avoid this change in new behaviour.

Still seeing many concerns about this patch, so marking it as returned
with feedback. If possible, switching it to the next CF would be fine I
guess as this work is still being continued.

-- 
Michael
On 06 December 2014 20:01 Amit Kapila Wrote
>I wanted to understand what exactly the above loop is doing.
>a.
>first of all the comment on top of it says "Some of the slot
>are free, ...", if some slot is free, then why do you want
>to process the results? (Do you mean to say that *None* of
>the slot is free....?)
This comment is wrong, I will remove this.
>b.
>IIUC, you have called function select_loop(maxFd, &slotset)
>to check if socket descriptor is readable, if yes then why
>in do..while loop the same maxFd is checked always, don't
>you want to check different socket descriptors? I am not sure
>if I am missing something here
select_loop(maxFd, &slotset)
maxFd is the highest descriptor among all the sets, and slotset contains all the descriptors, so if any descriptor gets some message, select_loop will return; once select_loop returns,
we need to check how many descriptors have received a message from the server, so we loop over them and process the results.
So it is not only for maxFd, it is for all the descriptors. And it is inside a do..while loop because it is possible that select_loop returns due to some intermediate message on any of the sockets while the query is still not complete,
and if none of the sockets has become free (which we check in the for loop below), we go back to select_loop again.
>c.
>After checking the socket descriptor for maxFd why you want
>to run run the below for loop for all slots?
>for (i = 0; i < max_slot; i++)
After the select loop returns, it is possible that we have got results on multiple connections, so we consume input on each and check whether it is still busy; if it is, there is nothing to do, but if it has finished we process the result and mark the connection free.
And if any of the connections is free, we break out of the do..while loop.
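A self-contained sketch of the loop described above, assuming a simplified slot struct; the struct and function names are illustrative, and only the libpq calls (select(), PQconsumeInput(), PQisBusy(), PQgetResult()) are the real interfaces:

    #include <stdbool.h>
    #include <sys/select.h>
    #include "libpq-fe.h"

    typedef struct
    {
        PGconn *connection;
        int     sock;           /* PQsocket(connection) */
        bool    isFree;
    } slot_t;

    /* Return the index of a free slot, waiting for a running query to finish
     * if necessary; returns -1 on a select() failure. */
    static int
    wait_for_idle_slot(slot_t *slots, int nslots)
    {
        int firstFree = -1;

        /* If a slot is already free we can hand it out immediately. */
        for (int i = 0; i < nslots; i++)
            if (slots[i].isFree)
                return i;

        do
        {
            fd_set  slotset;
            int     maxFd = -1;

            FD_ZERO(&slotset);
            for (int i = 0; i < nslots; i++)
            {
                FD_SET(slots[i].sock, &slotset);
                if (slots[i].sock > maxFd)
                    maxFd = slots[i].sock;
            }

            /* Block until any of the descriptors becomes readable. */
            if (select(maxFd + 1, &slotset, NULL, NULL, NULL) < 0)
                return -1;

            /* Check every connection, not just one: several may have data. */
            for (int i = 0; i < nslots; i++)
            {
                PGresult *res;

                if (!FD_ISSET(slots[i].sock, &slotset))
                    continue;               /* nothing arrived here */

                PQconsumeInput(slots[i].connection);
                if (PQisBusy(slots[i].connection))
                    continue;               /* partial result; keep waiting */

                /* Query done: drain the results and mark the slot free. */
                while ((res = PQgetResult(slots[i].connection)) != NULL)
                    PQclear(res);
                slots[i].isFree = true;
                if (firstFree < 0)
                    firstFree = i;
            }
        } while (firstFree < 0);

        return firstFree;
    }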
From: Amit Kapila [mailto:amit.kapila16@gmail.com]
Sent: 06 December 2014 20:01
To: Dilip kumar
Cc: Magnus Hagander; Alvaro Herrera; Jan Lentfer; Tom Lane; PostgreSQL-development; Sawada Masahiko; Euler Taveira
Subject: Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Mon, Dec 1, 2014 at 12:18 PM, Dilip kumar <dilip.kumar@huawei.com> wrote:
>
> On 24 November 2014 11:29, Amit Kapila Wrote,
>
I have verified that all previous comments are addressed and
the new version is much better than previous version.
>
> here we are setting each target once and doing for all the tables..
>
Hmm, theoretically I think new behaviour could lead to more I/O in
certain cases as compared to the existing behaviour. The reason for more I/O
is that in the new behaviour, while doing Analyze for a particular table at
different targets, in-between it has Analyze of different table as well,
so the pages in shared buffers or OS cache for a particular table needs to
be reloaded again for a new target whereas currently it will do all stages
of Analyze for a particular table in one-go which means that each stage
of Analyze could get benefit from the pages of a table loaded by previous
stage. If you agree, then we should try to avoid this change in new
behaviour.
>
> Please provide you opinion.
I have a few questions regarding the function GetIdleSlot()
+ static int
+ GetIdleSlot(ParallelSlot *pSlot, int max_slot, const char *dbname,
+             const char *progname, bool completedb)
{
..
+     /*
+      * Some of the slot are free, Process the results for slots whichever
+      * are free
+      */
+     do
+     {
+         SetCancelConn(pSlot[0].connection);
+
+         i = select_loop(maxFd, &slotset);
+
+         ResetCancelConn();
+
+         if (i < 0)
+         {
+             /*
+              * This can only happen if user has sent the cancel request using
+              * Ctrl+C, Cancel is handled by 0th slot, so fetch the error result.
+              */
+             GetQueryResult(pSlot[0].connection, dbname, progname,
+                            completedb);
+             return NO_SLOT;
+         }
+
+         Assert(i != 0);
+
+         for (i = 0; i < max_slot; i++)
+         {
+             if (!FD_ISSET(pSlot[i].sock, &slotset))
+                 continue;
+
+             PQconsumeInput(pSlot[i].connection);
+             if (PQisBusy(pSlot[i].connection))
+                 continue;
+
+             pSlot[i].isFree = true;
+
+             if (!GetQueryResult(pSlot[i].connection, dbname, progname,
+                                 completedb))
+                 return NO_SLOT;
+
+             if (firstFree < 0)
+                 firstFree = i;
+         }
+     } while (firstFree < 0);
}
I wanted to understand what exactly the above loop is doing.
a.
first of all the comment on top of it says "Some of the slot
are free, ...", if some slot is free, then why do you want
to process the results? (Do you mean to say that *None* of
the slot is free....?)
b.
IIUC, you have called function select_loop(maxFd, &slotset)
to check if socket descriptor is readable, if yes then why
in do..while loop the same maxFd is checked always, don't
you want to check different socket descriptors? I am not sure
if I am missing something here
c.
After checking the socket descriptor for maxFd, why do you want
to run the below for loop for all slots?
for (i = 0; i < max_slot; i++)
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
> On 06 December 2014 20:01 Amit Kapila Wrote
>
> >I wanted to understand what exactly the above loop is doing.
>
>
>
> >a.
>
> >first of all the comment on top of it says "Some of the slot
>
> >are free, ...", if some slot is free, then why do you want
>
> >to process the results? (Do you mean to say that *None* of
>
> >the slot is free....?)
>
>
>
> This comment is wrong, I will remove this.
>
> >b.
>
> >IIUC, you have called function select_loop(maxFd, &slotset)
>
> >to check if socket descriptor is readable, if yes then why
>
> >in do..while loop the same maxFd is checked always, don't
>
> >you want to check different socket descriptors? I am not sure
>
> >if I am missing something here
>
>
>
> select_loop(maxFd, &slotset)
>
>
> So it’s not only for a maxFd, it’s for all the descriptors. And it’s in do..while loop, because it possible that select_loop come out because of some intermediate message on any of the socket but still query is not complete,
>
+ {
+ /*
+ * This can only happen if user has sent the cancel request using
+ * Ctrl+C, Cancel is handled by 0th slot, so fetch the error result.
+ */
+
+ GetQueryResult(pSlot[0].connection, dbname, progname,
+ completedb);
On December 2014 17:31 Amit Kapila Wrote,
>I suggest rather than removing, edit the comment to indicate
>the idea behind code at that place.
Done
>Okay, I think this part of code is somewhat similar to what
>is done in pg_dump/parallel.c with some differences related
>to handling of inAbort. One thing I have noticed here which
>could lead to a problem is that caller of select_loop() function
>assumes that return value is less than zero only if there is a
>cancel request which I think is wrong, because select system
>call could also return -1 in case of error. I am referring below
>code in above context:
+ if (i < 0)
+ {
+ /*
+ * This can only happen if user has sent the cancel request using
+ * Ctrl+C, Cancel is handled by 0th slot, so fetch the error result.
+ */
+
+ GetQueryResult(pSlot[0].connection, dbname, progname,
+ completedb);
Now for the abort case I am using a special error code, and for any other case we will assert; this behavior is the same as in pg_dump.
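A minimal sketch of that convention, assuming a hypothetical in_abort() accessor set by the SIGINT handler; the dedicated return code and the assert are illustrative, not the exact patch code:

    #include <assert.h>
    #include <errno.h>
    #include <sys/select.h>

    #define SELECT_CANCELLED (-2)      /* hypothetical "user hit Ctrl+C" code */

    extern int in_abort(void);         /* assumed: set by the SIGINT handler */

    static int
    select_loop(int maxFd, fd_set *workerset)
    {
        fd_set saveSet = *workerset;

        for (;;)
        {
            int i;

            *workerset = saveSet;
            i = select(maxFd + 1, workerset, NULL, NULL, NULL);

            if (in_abort())
                return SELECT_CANCELLED;   /* cancellation, reported explicitly */
            if (i < 0 && errno == EINTR)
                continue;                  /* interrupted by a signal: retry */
            assert(i > 0);                 /* any other failure is unexpected */
            return i;
        }
    }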
>Hmm, theoretically I think new behaviour could lead to more I/O in
>certain cases as compare to existing behaviour. The reason for more I/O
>is that in the new behaviour, while doing Analyze for a particular table at
>different targets, in-between it has Analyze of different table as well,
>so the pages in shared buffers or OS cache for a particular table needs to
>be reloded again for a new target whereas currently it will do all stages
>of Analyze for a particular table in one-go which means that each stage
>of Analyze could get benefit from the pages of a table loaded by previous
>stage. If you agree, then we should try to avoid this change in new
>behaviour.
I will work on this comment and post the updated patch..
I will move this patch to the latest commitfest.
Regards,
Dilip
Attachment
>
> On December 2014 17:31 Amit Kapila Wrote,
>
>
> >Hmm, theoretically I think new behaviour could lead to more I/O in
>
> >certain cases as compare to existing behaviour. The reason for more I/O
>
> >is that in the new behaviour, while doing Analyze for a particular table at
>
> >different targets, in-between it has Analyze of different table as well,
>
> >so the pages in shared buffers or OS cache for a particular table needs to
>
> >be reloded again for a new target whereas currently it will do all stages
>
> >of Analyze for a particular table in one-go which means that each stage
>
> >of Analyze could get benefit from the pages of a table loaded by previous
>
> >stage. If you agree, then we should try to avoid this change in new
>
> >behaviour.
>
>
>
> I will work on this comment and post the updated patch..
>
One idea is to send all the stages and corresponding Analyze commands
>
> I will move this patch to the latest commitfest.
>
By the way, I think this patch should be in Waiting On Author stage.
<div class="WordSection1"><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">On</span><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"></span><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">19December 2014 16:41, Amit Kapila Wrote,</span><p class="MsoNormal"><spanstyle="font-size:10.0pt;font-family:"Tahoma","sans-serif""> </span><p class="MsoNormal">>One ideais to send all the stages and corresponding Analyze commands<p class="MsoNormal">>to server in one go which meanssomething like<p class="MsoNormal">>"BEGIN; SET default_statistics_target=1; SET vacuum_cost_delay=0;<p class="MsoNormal">> Analyze t1; COMMIT;" <p class="MsoNormal">>"BEGIN; SET default_statistics_target=10; RESET vacuum_cost_delay;<pclass="MsoNormal">> Analyze t1; COMMIT;"<p class="MsoNormal">>"BEGIN; RESET default_statistics_target;<pclass="MsoNormal">> Analyze t1; COMMIT;"<p class="MsoNormal"> <p class="MsoNormal">Case1:<u>InCase for CompleteDB:</u><p class="MsoNormal"> <p class="MsoNormal">In base code first it willprocess all the tables in stage 1 then in stage2 and so on, so that at some time all the tables are analyzed at leastup to certain stage.<p class="MsoNormal"> <p class="MsoNormal">But If we process all the stages for one table first,and then take the other table for processing the stage 1, then it may happen that for some table all the stages areprocessed, <p class="MsoNormal">but others are waiting for even first stage to be processed, this will affect the functionalityfor analyze-in-stages.<p class="MsoNormal"> <p class="MsoNormal"><u>Case2: In case for independent tables like–t “t1” –t “t2”</u><p class="MsoNormal"> <p class="MsoNormal">In base code also currently we are processing all the stagesfor first table and processing same for next table and so on.<p class="MsoNormal"> <p class="MsoNormal">I think, ifuser is giving multiple tables together then his purpose might be to analyze those tables together stage by stage, <p class="MsoNormal">butin our code we analyze table1 in all stages and then only considering the next table.<p class="MsoNormal"> <pclass="MsoNormal">So for tables also it should be like<p class="MsoNormal">Stage1:<p class="MsoNormal"> T1<p class="MsoNormal"> T2<p class="MsoNormal"> ..<p class="MsoNormal">Stage2:<pclass="MsoNormal"> T1<p class="MsoNormal"> T2<p class="MsoNormal">…<p class="MsoNormal"> <pclass="MsoNormal">Thoughts?<p class="MsoNormal"> <p class="MsoNormal">>Now, still parallel operationsin other backends could lead to<p class="MsoNormal">>page misses, but I think the impact will be minimized.<pclass="MsoNormal"> <p class="MsoNormal">Regards,<p class="MsoNormal">Dilip<p class="MsoNormal"> <p class="MsoNormal"> <pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span></div>
>
> Case1:In Case for CompleteDB:
>
> In base code first it will process all the tables in stage 1 then in stage2 and so on, so that at some time all the tables are analyzed at least up to certain stage.
>
> But If we process all the stages for one table first, and then take the other table for processing the stage 1, then it may happen that for some table all the stages are processed,
>
> but others are waiting for even first stage to be processed, this will affect the functionality for analyze-in-stages.
>
> Case2: In case for independent tables like –t “t1” –t “t2”
>
> In base code also currently we are processing all the stages for first table and processing same for next table and so on.
>
> I think, if user is giving multiple tables together then his purpose might be to analyze those tables together stage by stage,
> but in our code we analyze table1 in all stages and then only considering the next table.
>
So basically you want to say that currently the processing for
<div class="WordSection1"><p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D">On</span><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">29December 2014 10:22 Amit Kapila Wrote,</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span><p class="MsoNormal">>>Case1:In Case for CompleteDB:<br /> >><br /> >> In base code first it will processall the tables in stage 1 then in stage2 and so on, so that at some time all the tables are analyzed at least up tocertain stage.<br /> >><br /> >> But If we process all the stages for one table first, and then take the othertable for processing the stage 1, then it may happen that for some table all the stages are processed,<br /> >><br/> >> but others are waiting for even first stage to be processed, this will affect the functionality foranalyze-in-stages.<br /> >><br /> >> Case2: In case for independent tables like –t “t1” –t “t2”<br /> >><br/> > In base code also currently we are processing all the stages for first table and processing same for nexttable and so on.<br /> >><br /> >> I think, if user is giving multiple tables together then his purpose mightbe to analyze those tables together stage by stage,<br /> >> but in our code we analyze table1 in all stages andthen only considering the next table.<br /> >><br /> >So basically you want to say that currently the processingfor<p class="MsoNormal">>tables with --analyze-in-stages switch is different when the user<p class="MsoNormal">>executesvacuumdb for whole database versus when it does for<p class="MsoNormal">>individual tables(multiple tables together). In the proposed patch<p class="MsoNormal">>the processing for tables will be same foreither cases (whole<p class="MsoNormal">>database or independent tables). I think your point has merit, so<p class="MsoNormal">>letsproceed with this as it is in your patch.<p class="MsoNormal"> <p class="MsoNormal">>Do youhave anything more to handle in patch or shall I take one<p class="MsoNormal">>another look and pass it to committerif it is ready for the same.<p class="MsoNormal"> <p class="MsoNormal">I think nothing more to be handled from myside, you can go ahead with review..<p class="MsoNormal"> <p class="MsoNormal">Regards,<p class="MsoNormal">Dilip<p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span></div>
>
> On 29 December 2014 10:22 Amit Kapila Wrote,
>
>
> I think nothing more to be handled from my side, you can go ahead with review..
>
Attachment
<div class="WordSection1"><p class="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">On</span><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"></span><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">31December 2014 18:36, Amit Kapila Wrote,</span><p class="MsoNormal"><spanstyle="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span><p class="MsoNormal">>Thepatch looks good to me. I have done couple of<p class="MsoNormal">>cosmetic changes (spellingmistakes, improve some comments,<p class="MsoNormal">>etc.), check the same once and if you are okay, we canmove<p class="MsoNormal">>ahead. <p class="MsoNormal"> <p class="MsoNormal">Thanks for review and changes, changeslooks fine to me..<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">Regards,</span><pclass="MsoNormal"><span style="font-size:10.0pt;font-family:"Tahoma","sans-serif"">Dilip</span><pclass="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D"> </span></div>
>
> On 31 December 2014 18:36, Amit Kapila Wrote,
>
> >The patch looks good to me. I have done couple of
>
> >cosmetic changes (spelling mistakes, improve some comments,
>
> >etc.), check the same once and if you are okay, we can move
>
> >ahead.
>
>
>
> Thanks for review and changes, changes looks fine to me..
>
Okay, I have marked this patch as "Ready For Committer"
Amit Kapila <amit.kapila16@gmail.com> wrote:

> Notes for Committer -
> There is one behavioural difference in the handling of --analyze-in-stages
> switch, when individual tables (by using -t option) are analyzed by
> using this switch, patch will process (in case of concurrent jobs) all the
> given tables for stage-1 and then for stage-2 and so on whereas in the
> unpatched code it will process all the three stages table by table
> (table-1 all three stages, table-2 all three stages and so on). I think
> the new behaviour is okay as the same is done when this utility does
> vacuum for whole database.

IMV, the change is for the better. The whole point of
--analyze-in-stages is to get minimal statistics so that "good
enough" plans will be built for most queries to allow a production
database to be brought back on-line quickly, followed by generating
increasing granularity (which takes longer but should help ensure
"best plan") while the database is in use with the initial
statistics. Doing all three levels for one table before generating
the rough statistics for the others doesn't help with that, so I
see this change as fixing a bug. Is it feasible to break that part
out as a separate patch?

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
> Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Notes for Committer -
> > There is one behavioural difference in the handling of --analyze-in-stages
> > switch, when individual tables (by using -t option) are analyzed by
> > using this switch, patch will process (in case of concurrent jobs) all the
> > given tables for stage-1 and then for stage-2 and so on whereas in the
> > unpatched code it will process all the three stages table by table
> > (table-1 all three stages, table-2 all three stages and so on). I think
> > the new behaviour is okay as the same is done when this utility does
> > vacuum for whole database.
>
> IMV, the change is for the better. The whole point of
> --analyze-in-stages is to get minimal statistics so that "good
> enough" plans will be built for most queries to allow a production
> database to be brought back on-line quickly, followed by generating
> increasing granularity (which takes longer but should help ensure
> "best plan") while the database is in use with the initial
> statistics. Doing all three levels for one table before generating
> the rough statistics for the others doesn't help with that, so I
> see this change as fixing a bug. Is it feasible to break that part
> out as a separate patch?
>
Currently as the patch stands the fix (or new behaviour) is only
On 2014-12-31 18:35:38 +0530, Amit Kapila wrote:
> +      <term><option>-j <replaceable class="parameter">jobs</replaceable></option></term>
> +      <term><option>--jobs=<replaceable class="parameter">njobs</replaceable></option></term>
> +      <listitem>
> +       <para>
> +        Number of concurrent connections to perform the operation.
> +        This option will enable the vacuum operation to run on asynchronous
> +        connections, at a time one table will be operated on one connection.
> +        So at one time as many tables will be vacuumed parallely as number of
> +        jobs. If number of jobs given are more than number of tables then
> +        number of jobs will be set to number of tables.

"asynchronous connections" isn't a very well defined term. Also, the
second part of that sentence doesn't seem to be gramattically correct.

> +       </para>
> +       <para>
> +        <application>vacuumdb</application> will open
> +        <replaceable class="parameter"> njobs</replaceable> connections to the
> +        database, so make sure your <xref linkend="guc-max-connections">
> +        setting is high enough to accommodate all connections.
> +       </para>

Isn't it njobs+1?

> @@ -141,6 +199,7 @@ main(int argc, char *argv[])
>   }
>   }
>
> + optind++;

Hm, where's that coming from?

> + PQsetnonblocking(connSlot[0].connection, 1);
> +
> + for (i = 1; i < concurrentCons; i++)
> + {
> +     connSlot[i].connection = connectDatabase(dbname, host, port, username,
> +                                              prompt_password, progname, false);
> +
> +     PQsetnonblocking(connSlot[i].connection, 1);
> +     connSlot[i].isFree = true;
> +     connSlot[i].sock = PQsocket(connSlot[i].connection);
> + }

Are you sure about this global PQsetnonblocking()? This means that you
might not be able to send queries... And you don't seem to be waiting
for sockets waiting for writes in the select loop - which means you
might end up being stuck waiting for reads when you haven't submitted
the query.

I think you might need a more complex select() loop. On nonfree
connections also wait for writes if PQflush() returns != 0.

> +/*
> + * GetIdleSlot
> + *     Process the slot list, if any free slot is available then return
> + *     the slotid else perform the select on all the socket's and wait
> + *     until atleast one slot becomes available.
> + */
> +static int
> +GetIdleSlot(ParallelSlot *pSlot, int max_slot, const char *dbname,
> +            const char *progname, bool completedb)
> +{
> +     int     i;
> +     fd_set  slotset;

Hm, you probably need to limit -j to FD_SETSIZE - 1 or so.

> +     int     firstFree = -1;
> +     pgsocket maxFd;
> +
> +     for (i = 0; i < max_slot; i++)
> +         if (pSlot[i].isFree)
> +             return i;

> +     FD_ZERO(&slotset);
> +
> +     maxFd = pSlot[0].sock;
> +
> +     for (i = 0; i < max_slot; i++)
> +     {
> +         FD_SET(pSlot[i].sock, &slotset);
> +         if (pSlot[i].sock > maxFd)
> +             maxFd = pSlot[i].sock;
> +     }

So we're waiting for idle connections?

I think you'll have to use two fdsets here, and set the write
set based on PQflush() != 0.

> +/*
> + * A select loop that repeats calling select until a descriptor in the read
> + * set becomes readable. On Windows we have to check for the termination event
> + * from time to time, on Unix we can just block forever.
> + */

Should a) mention why we have to check regularly on windows b) that on
linux we don't have to because we send a cancel event from the signal
handler.

> +static int
> +select_loop(int maxFd, fd_set *workerset)
> +{
> +     int     i;
> +     fd_set  saveSet = *workerset;

> +#ifdef WIN32
> +     /* should always be the master */

Hm?

I have to say, this is a fairly large patch for such a minor feature...
Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
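
For concreteness, a minimal sketch of the two-fd-set select() loop suggested in the review might look like the following. ParallelSlot, isFree and max_slot follow the patch under review (the struct is simplified here), while the function name wait_on_slots and everything else is illustrative only, not code from the patch:

#include <stdbool.h>
#include <sys/select.h>

#include "libpq-fe.h"

typedef struct ParallelSlot        /* simplified version of the patch's struct */
{
    PGconn     *connection;
    bool        isFree;
} ParallelSlot;

/*
 * Wait until at least one busy connection has something to read, also
 * watching for writability on connections whose outgoing buffer is not
 * fully flushed yet (PQflush() != 0), as suggested in the review.
 */
static void
wait_on_slots(ParallelSlot *slots, int max_slot)
{
    fd_set      readset;
    fd_set      writeset;
    int         maxFd = -1;
    int         i;

    FD_ZERO(&readset);
    FD_ZERO(&writeset);

    for (i = 0; i < max_slot; i++)
    {
        int         sock = PQsocket(slots[i].connection);

        if (slots[i].isFree || sock < 0)
            continue;

        FD_SET(sock, &readset);             /* always interested in results */
        if (PQflush(slots[i].connection) != 0)
            FD_SET(sock, &writeset);        /* query not fully sent yet */
        if (sock > maxFd)
            maxFd = sock;
    }

    if (maxFd >= 0)
        (void) select(maxFd + 1, &readset, &writeset, NULL, NULL);

    /*
     * After select() returns, the caller would PQconsumeInput() on readable
     * sockets and call PQflush() again on the writable ones.
     */
}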
On Sun, Jan 4, 2015 at 10:57 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2014-12-31 18:35:38 +0530, Amit Kapila wrote: >> + <term><option>-j <replaceable class="parameter">jobs</replaceable></option></term> >> + <term><option>--jobs=<replaceable class="parameter">njobs</replaceable></option></term> >> + <listitem> >> + <para> >> + Number of concurrent connections to perform the operation. >> + This option will enable the vacuum operation to run on asynchronous >> + connections, at a time one table will be operated on one connection. >> + So at one time as many tables will be vacuumed parallely as number of >> + jobs. If number of jobs given are more than number of tables then >> + number of jobs will be set to number of tables. > > "asynchronous connections" isn't a very well defined term. Also, the > second part of that sentence doesn't seem to be gramattically correct. > >> + </para> >> + <para> >> + <application>vacuumdb</application> will open >> + <replaceable class="parameter"> njobs</replaceable> connections to the >> + database, so make sure your <xref linkend="guc-max-connections"> >> + setting is high enough to accommodate all connections. >> + </para> > > Isn't it njobs+1? > >> @@ -141,6 +199,7 @@ main(int argc, char *argv[]) >> } >> } >> >> + optind++; > > Hm, where's that coming from? > >> + PQsetnonblocking(connSlot[0].connection, 1); >> + >> + for (i = 1; i < concurrentCons; i++) >> + { >> + connSlot[i].connection = connectDatabase(dbname, host, port, username, >> + prompt_password, progname, false); >> + >> + PQsetnonblocking(connSlot[i].connection, 1); >> + connSlot[i].isFree = true; >> + connSlot[i].sock = PQsocket(connSlot[i].connection); >> + } > > Are you sure about this global PQsetnonblocking()? This means that you > might not be able to send queries... And you don't seem to be waiting > for sockets waiting for writes in the select loop - which means you > might end up being stuck waiting for reads when you haven't submitted > the query. > > I think you might need a more complex select() loop. On nonfree > connections also wait for writes if PQflush() returns != 0. > > >> +/* >> + * GetIdleSlot >> + * Process the slot list, if any free slot is available then return >> + * the slotid else perform the select on all the socket's and wait >> + * until atleast one slot becomes available. >> + */ >> +static int >> +GetIdleSlot(ParallelSlot *pSlot, int max_slot, const char *dbname, >> + const char *progname, bool completedb) >> +{ >> + int i; >> + fd_set slotset; > > > Hm, you probably need to limit -j to FD_SETSIZE - 1 or so. > >> + int firstFree = -1; >> + pgsocket maxFd; >> + >> + for (i = 0; i < max_slot; i++) >> + if (pSlot[i].isFree) >> + return i; > >> + FD_ZERO(&slotset); >> + >> + maxFd = pSlot[0].sock; >> + >> + for (i = 0; i < max_slot; i++) >> + { >> + FD_SET(pSlot[i].sock, &slotset); >> + if (pSlot[i].sock > maxFd) >> + maxFd = pSlot[i].sock; >> + } > > So we're waiting for idle connections? > > I think you'll have to have to use two fdsets here, and set the write > set based on PQflush() != 0. > >> +/* >> + * A select loop that repeats calling select until a descriptor in the read >> + * set becomes readable. On Windows we have to check for the termination event >> + * from time to time, on Unix we can just block forever. >> + */ > > Should a) mention why we have to check regularly on windows b) that on > linux we don't have to because we send a cancel event from the signal > handler. 
> >> +static int >> +select_loop(int maxFd, fd_set *workerset) >> +{ >> + int i; >> + fd_set saveSet = *workerset; >> >> +#ifdef WIN32 >> + /* should always be the master */ > > Hm? > > > I have to say, this is a fairly large patch for such a minor feature... Andres, this patch needs more effort from the author, right? So marking it as returned with feedback. -- Michael
Michael Paquier wrote: > Andres, this patch needs more effort from the author, right? So > marking it as returned with feedback. I will give this patch a look in the current commitfest, if you can please set as 'needs review' instead with me as reviewer, so that I don't forget, I would appreciate it. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 16, 2015 at 12:53 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Michael Paquier wrote: > >> Andres, this patch needs more effort from the author, right? So >> marking it as returned with feedback. > > I will give this patch a look in the current commitfest, if you can > please set as 'needs review' instead with me as reviewer, so that I > don't forget, I would appreciate it. Fine for me, done this way. -- Michael
On 04 January 2015 07:27, Andres Freund Wrote,

> On 2014-12-31 18:35:38 +0530, Amit Kapila wrote:
> > + <term><option>-j <replaceable class="parameter">jobs</replaceable></option></term>
> > + <term><option>--jobs=<replaceable class="parameter">njobs</replaceable></option></term>
> > + <listitem>
> > + <para>
> > + Number of concurrent connections to perform the operation.
> > + This option will enable the vacuum operation to run on asynchronous
> > + connections, at a time one table will be operated on one connection.
> > + So at one time as many tables will be vacuumed parallely as number of
> > + jobs. If number of jobs given are more than number of tables then
> > + number of jobs will be set to number of tables.
>
> "asynchronous connections" isn't a very well defined term. Also, the
> second part of that sentence doesn't seem to be gramattically correct.

I have changed this to concurrent connections, is this ok?

> > + </para>
> > + <para>
> > + <application>vacuumdb</application> will open
> > + <replaceable class="parameter"> njobs</replaceable> connections to the
> > + database, so make sure your <xref linkend="guc-max-connections">
> > + setting is high enough to accommodate all connections.
> > + </para>
>
> Isn't it njobs+1?

The main connection that we use for getting the table information is also
used as the first slot connection, so the total number of connections is
still njobs.

> > @@ -141,6 +199,7 @@ main(int argc, char *argv[])
> > 		}
> > 	}
> >
> > + optind++;
>
> Hm, where's that coming from?

This is wrong, I have removed it.

> > + PQsetnonblocking(connSlot[0].connection, 1);
> > +
> > + for (i = 1; i < concurrentCons; i++)
> > + {
> > + 	connSlot[i].connection = connectDatabase(dbname, host, port, username,
> > + 						prompt_password, progname, false);
> > +
> > + 	PQsetnonblocking(connSlot[i].connection, 1);
> > + 	connSlot[i].isFree = true;
> > + 	connSlot[i].sock = PQsocket(connSlot[i].connection);
> > + }
>
> Are you sure about this global PQsetnonblocking()? This means that you
> might not be able to send queries... And you don't seem to be waiting
> for sockets waiting for writes in the select loop - which means you
> might end up being stuck waiting for reads when you haven't submitted
> the query.
>
> I think you might need a more complex select() loop. On nonfree
> connections also wait for writes if PQflush() returns != 0.

1. In GetIdleSlot we make sure that we only wait if a connection is busy,
i.e. if we have already sent a query on that connection.
2. When all the connections are busy, we select() on all FDs and wait for a
response on any connection. When select() returns, we consume the input and
check whether the connection is now idle or whether it was just an
intermediate response; if it is no longer busy, we process all the results
and mark the slot as free.

> > +/*
> > + * GetIdleSlot
> > + *	Process the slot list, if any free slot is available then return
> > + *	the slotid else perform the select on all the socket's and wait
> > + *	until atleast one slot becomes available.
> > + */
> > +static int
> > +GetIdleSlot(ParallelSlot *pSlot, int max_slot, const char *dbname,
> > +			const char *progname, bool completedb)
> > +{
> > +	int			i;
> > +	fd_set		slotset;
>
> Hm, you probably need to limit -j to FD_SETSIZE - 1 or so.

I will change this in next patch..

> > +	int			firstFree = -1;
> > +	pgsocket	maxFd;
> > +
> > +	for (i = 0; i < max_slot; i++)
> > +		if (pSlot[i].isFree)
> > +			return i;
> > +	FD_ZERO(&slotset);
> > +
> > +	maxFd = pSlot[0].sock;
> > +
> > +	for (i = 0; i < max_slot; i++)
> > +	{
> > +		FD_SET(pSlot[i].sock, &slotset);
> > +		if (pSlot[i].sock > maxFd)
> > +			maxFd = pSlot[i].sock;
> > +	}
>
> So we're waiting for idle connections?
>
> I think you'll have to have to use two fdsets here, and set the write
> set based on PQflush() != 0.

I did not get this? The logic here is: we wait for any connection to
respond, using select() on all fds. When select() returns, we check all the
sockets to see which ones are no longer busy and mark all the finished
connections as idle at once. If none of the connections is free, we go back
to select(); otherwise we return the first idle connection.

> > +/*
> > + * A select loop that repeats calling select until a descriptor in
> > + * the read set becomes readable. On Windows we have to check for the
> > + * termination event from time to time, on Unix we can just block forever.
> > + */
>
> Should a) mention why we have to check regularly on windows b) that on
> linux we don't have to because we send a cancel event from the signal
> handler.

I have added the comments..

> > +static int
> > +select_loop(int maxFd, fd_set *workerset)
> > +{
> > +	int			i;
> > +	fd_set		saveSet = *workerset;
> >
> > +#ifdef WIN32
> > +	/* should always be the master */
>
> Hm?
>
>
> I have to say, this is a fairly large patch for such a minor feature...
>
> Greetings,
>
> Andres Freund
>
> --
> Andres Freund                     http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
Attachment
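
A minimal sketch of the consume-and-check step described in the reply above. pSlot, max_slot and isFree follow the patch (the struct is simplified here); the function name reap_finished_slots and the error handling are illustrative only, not code from the patch:

#include <stdbool.h>
#include "libpq-fe.h"

typedef struct ParallelSlot        /* simplified; the patch's struct has more fields */
{
    PGconn     *connection;
    bool        isFree;
} ParallelSlot;

/*
 * After select() wakes up, consume input on every busy connection and mark
 * a slot free once its connection has no more pending results -- the check
 * described in the mail above.
 */
static void
reap_finished_slots(ParallelSlot *pSlot, int max_slot)
{
    int         i;

    for (i = 0; i < max_slot; i++)
    {
        PGconn     *conn = pSlot[i].connection;

        if (pSlot[i].isFree)
            continue;

        if (!PQconsumeInput(conn))
            continue;           /* connection trouble; real code would report it */

        if (!PQisBusy(conn))
        {
            PGresult   *res;

            /* all results are available: drain them, then reuse the slot */
            while ((res = PQgetResult(conn)) != NULL)
                PQclear(res);

            pSlot[i].isFree = true;
        }
    }
}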
I didn't understand the coding in GetQueryResult(); why do we check the
result status of the last returned result only?  It seems simpler to me to
check it inside the loop, but maybe there's a reason you didn't do it like
that?

Also, what is the reason we were ignoring those errors only in "completedb"
mode?  It doesn't seem like it would cause any harm if we did it in all
cases.  That means we can just not have the "completeDb" flag at all.

Finally, I think it's better to report the "missing relation" error, even if
we're going to return true to continue processing other tables.  That makes
the situation clearer to the user.  So the function would end up looking
like this:

/*
 * GetQueryResult
 *
 * Process the query result.  Returns true if there's no error, false
 * otherwise -- but errors about trying to vacuum a missing relation are
 * ignored.
 */
static bool
GetQueryResult(PGconn *conn, errorOptions *erropts)
{
    PGresult   *result = NULL;

    SetCancelConn(conn);
    while ((result = PQgetResult(conn)) != NULL)
    {
        /*
         * If errors are found, report them.  Errors about a missing table
         * are harmless so we continue processing, but die for other errors.
         */
        if (PQresultStatus(result) != PGRES_COMMAND_OK)
        {
            char       *sqlState = PQresultErrorField(result, PG_DIAG_SQLSTATE);

            fprintf(stderr, _("%s: vacuuming of database \"%s\" failed: %s"),
                    erropts->progname, erropts->dbname, PQerrorMessage(conn));

            if (sqlState && strcmp(sqlState, ERRCODE_UNDEFINED_TABLE) != 0)
            {
                PQclear(result);
                return false;
            }
        }

        PQclear(result);
    }
    ResetCancelConn();

    return true;
}

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
> I didn't understand the coding in GetQueryResult(); why do we check the
> result status of the last returned result only? It seems simpler to me
> to check it inside the loop, but maybe there's a reason you didn't do it
> like that?
>
> Also, what is the reason we were ignoring those errors only in
> "completedb" mode? It doesn't seem like it would cause any harm if we
> did it in all cases. That means we can just not have the "completeDb"
> flag at all.
>
Amit Kapila wrote:
> On Wed, Jan 21, 2015 at 8:51 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
> >
> > I didn't understand the coding in GetQueryResult(); why do we check the
> > result status of the last returned result only? It seems simpler to me
> > to check it inside the loop, but maybe there's a reason you didn't do it
> > like that?
> >
> > Also, what is the reason we were ignoring those errors only in
> > "completedb" mode? It doesn't seem like it would cause any harm if we
> > did it in all cases. That means we can just not have the "completeDb"
> > flag at all.
>
> IIRC it is done to match the existing behaviour where such errors are
> ignored we use this utility to vacuum database.

I think that's fine, but we should do it always, not just in whole-database
mode.

I've been hacking this patch today BTW; hope to have something to show
tomorrow.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
> Amit Kapila wrote:
> > On Wed, Jan 21, 2015 at 8:51 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> > wrote:
> > >
> > > I didn't understand the coding in GetQueryResult(); why do we check the
> > > result status of the last returned result only? It seems simpler to me
> > > to check it inside the loop, but maybe there's a reason you didn't do it
> > > like that?
> > >
> > > Also, what is the reason we were ignoring those errors only in
> > > "completedb" mode? It doesn't seem like it would cause any harm if we
> > > did it in all cases. That means we can just not have the "completeDb"
> > > flag at all.
> >
> > IIRC it is done to match the existing behaviour where such errors are
> > ignored we use this utility to vacuum database.
>
> I think that's fine, but we should do it always, not just in
> whole-database mode.
>
> I've been hacking this patch today BTW; hope to have something to show
> tomorrow.
>
Great!
Here's v23.  I reworked a number of things.

First, I changed trivial stuff like grouping all the vacuuming options in a
struct, to avoid passing an excessive number of arguments to functions.
full, freeze, analyze_only, and_analyze and verbose are all in a single
struct now.  Also, the stage_commands and stage_messages was duplicated by
your patch; I moved them to a file-level static struct.

I made prepare_command reset the string buffer and receive an optional table
name, so that it can append it to the generated command, and append the
semicolon as well.  Forcing the callers to reset the string before calling,
and having them add the table name and semicolon afterwards was awkward and
unnecessarily verbose.

You had a new in_abort() function in common.c which seems an unnecessary
layer; in its place I just exported the inAbort boolean flag it was
returning, and renamed to CancelRequested.

I was then troubled by the fact that vacuum_one_database() was being called
in a loop by main() when multiple tables are vacuumed, but vacuum_parallel()
was doing the loop internally.  I found this discrepancy confusing, so I
renamed that new function to vacuum_one_database_parallel and modified the
original vacuum_one_database to do the loop internally as well.  Now they
are, in essence, a mirror of each other, one doing the parallel stuff and
one doing it serially.  This seems to make more sense to me -- but see
below.

I also modified some underlying stuff like GetIdleSlot returning a
ParallelSlot pointer instead of an array index.  Since its caller always has
to dereference the array with the given index, it makes more sense to return
the right element pointer instead, so I made it do that.  Also, that way,
instead of returning NO_SLOT in case of error it can just return NULL; no
need for extra cognitive burden.

I also changed select_loop.  In your patch it had two implementations, one
WIN32 and another one for the rest.  It looks nicer to me to have only one
with small exceptions in the places that need it.  (I haven't tested the
WIN32 path.)  Also, instead of returning ERROR_IN_ABORT I made it set a
boolean flag in case of error, which seems cleaner.

I changed GetQueryResult as I described in a previous message.

There are two things that continue to bother me and I would like you, dear
patch author, to change them before committing this patch:

1. I don't like having vacuum_one_database() and a separate
vacuum_one_database_parallel().  I think we should merge them into one
function, which does either thing according to parameters.  There's plenty
in there that's duplicated.

2. in particular, the above means that run_parallel_vacuum can no longer
exist as it is.  Right now vacuum_one_database_parallel relies on
run_parallel_vacuum to do the actual job parallellization.  I would like to
have that looping in the improved vacuum_one_database() function instead.

Looking forward to v24,

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
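
A rough sketch of the GetIdleSlot() interface change described above (returning a slot pointer, or NULL on error, instead of an array index / NO_SLOT). The wait_for_any_result() helper is hypothetical and the struct is simplified; this is not the v23 code:

#include <stdbool.h>
#include <stddef.h>
#include "libpq-fe.h"

typedef struct ParallelSlot        /* simplified for illustration */
{
    PGconn     *connection;
    bool        isFree;
} ParallelSlot;

/* hypothetical helper: blocks until at least one connection finishes;
 * returns false on a fatal error */
extern bool wait_for_any_result(ParallelSlot *slots, int numslots);

/*
 * Return a pointer to a free slot, or NULL on error, instead of an array
 * index / NO_SLOT sentinel -- the interface change described above.
 */
static ParallelSlot *
GetIdleSlot(ParallelSlot *slots, int numslots)
{
    int         i;

    for (;;)
    {
        /* hand back any slot that is already free */
        for (i = 0; i < numslots; i++)
        {
            if (slots[i].isFree)
                return slots + i;
        }

        /* otherwise block until some connection finishes its command */
        if (!wait_for_any_result(slots, numslots))
            return NULL;
    }
}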
On 22 January 2015 23:16, Alvaro Herrera Wrote,

> Here's v23.
>
> There are two things that continue to bother me and I would like you,
> dear patch author, to change them before committing this patch:
>
> 1. I don't like having vacuum_one_database() and a separate
> vacuum_one_database_parallel().  I think we should merge them into one
> function, which does either thing according to parameters.  There's
> plenty in there that's duplicated.
>
> 2. in particular, the above means that run_parallel_vacuum can no
> longer exist as it is.  Right now vacuum_one_database_parallel relies
> on run_parallel_vacuum to do the actual job parallellization.  I would
> like to have that looping in the improved vacuum_one_database()
> function instead.
>
> Looking forward to v24,

Thank you for your effort. I have tried to change the patch as per your
instructions and come up with v24.

Changes:

1. In the current patch, vacuum_one_database (for a table list) has the
table loop outside and the analyze_stage loop inside, so it will finish all
three stages for one table first and then pick the next table.  But
vacuum_one_database_parallel does the stage loop outside and calls
run_parallel_vacuum, which has the table loop, so for one stage all the
tables are vacuumed first before going to the next stage.  For merging the
two functions, both functions' behaviour should be identical.  I think that
if the user has given a list of tables with analyze-in-stages, then doing
all the tables for at least one stage before picking the next stage is the
better solution, so I have done it that way.

2. In select_loop, for WIN32 I replaced the TranslateSocketError function
with

   if (WSAGetLastError() == WSAEINTR)
       errno = EINTR;

   otherwise I would have to expose the TranslateSocketError function from
   socket and include an extra header.

I have tested on Windows also, it's working fine.

Regards,
Dilip
Attachment
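
A sketch of the select_loop() retry-on-EINTR behaviour described in point 2, with WSAEINTR mapped to EINTR on Windows. It is illustrative only and omits the periodic cancel check that the real function needs on Windows:

#include <errno.h>

#ifdef WIN32
#include <winsock2.h>
#else
#include <sys/select.h>
#endif

/*
 * Call select() repeatedly until it returns something other than an
 * interrupted system call.  On Windows, WSAEINTR is mapped to EINTR so the
 * retry logic can stay the same on both platforms.
 */
static int
select_loop(int maxFd, fd_set *workerset)
{
    int         i;
    fd_set      saveSet = *workerset;

    for (;;)
    {
        *workerset = saveSet;
        i = select(maxFd + 1, workerset, NULL, NULL, NULL);

#ifdef WIN32
        if (i == SOCKET_ERROR)
        {
            i = -1;
            if (WSAGetLastError() == WSAEINTR)
                errno = EINTR;
        }
#endif

        if (i < 0 && errno == EINTR)
            continue;           /* interrupted by a signal: just retry */

        return i;
    }
}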
Dilip kumar wrote:

> Changes:
> 1. In the current patch, vacuum_one_database (for a table list) has the
> table loop outside and the analyze_stage loop inside, so it will finish
> all three stages for one table first and then pick the next table.  But
> vacuum_one_database_parallel does the stage loop outside and calls
> run_parallel_vacuum, which has the table loop, so for one stage all the
> tables are vacuumed first before going to the next stage.  For merging
> the two functions, both functions' behaviour should be identical.  I
> think that if the user has given a list of tables with
> analyze-in-stages, then doing all the tables for at least one stage
> before picking the next stage is the better solution, so I have done it
> that way.

Yeah, I think the stages loop should be outermost, as discussed upthread
somewhere -- it's better to have initial stats for all tables as soon as
possible, and improve them later, than have some tables/dbs with no stats
for a longer period while full stats are computed for some specific
tables/database.

I'm tweaking your v24 a bit more now, thanks -- main change is to make
vacuum_one_database be called only to run one analyze stage, so it never
iterates for each stage; callers must iterate calling it multiple times in
those cases.  (There's only one callsite that needs changing anyway.)

> 2. In select_loop, for WIN32 I replaced the TranslateSocketError
> function with
>     if (WSAGetLastError() == WSAEINTR)
>         errno = EINTR;
>
> otherwise I would have to expose the TranslateSocketError function from
> socket and include an extra header.

Grumble.  Don't like this bit, but moving TranslateSocketError to src/common
is outside the scope of this patch, so okay.  (pg_dump's parallel stuff has
the same issue anyway.)

In case you're up for doing some more work later on, there are two ideas
here: move the backend's TranslateSocketError to src/common, and try to
merge pg_dump's select_loop function with the one in this new code.  But
that's for another patch anyway (and this new function needs a little
battle-testing, of course.)

> I have tested on Windows also, it's working fine.

Great, thanks.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
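
A sketch of the "stages loop outermost" ordering discussed above, using hypothetical names (analyze_in_stages, analyze_table_stage) purely for illustration:

#define ANALYZE_NUM_STAGES 3

/* hypothetical per-table worker, purely for illustration */
extern void analyze_table_stage(const char *table, int stage);

/*
 * Run the analyze stages with the stage loop outermost: every table gets
 * its cheap stage-1 statistics before any table moves on to the more
 * expensive stages, so minimal statistics are available for the whole set
 * as early as possible.
 */
static void
analyze_in_stages(const char **tables, int ntables)
{
    int         stage;
    int         t;

    for (stage = 0; stage < ANALYZE_NUM_STAGES; stage++)
    {
        for (t = 0; t < ntables; t++)
            analyze_table_stage(tables[t], stage);
    }
}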
Andres Freund wrote:
> On 2014-12-31 18:35:38 +0530, Amit Kapila wrote:
> > + PQsetnonblocking(connSlot[0].connection, 1);
> > +
> > + for (i = 1; i < concurrentCons; i++)
> > + {
> > + 	connSlot[i].connection = connectDatabase(dbname, host, port, username,
> > + 						prompt_password, progname, false);
> > +
> > + 	PQsetnonblocking(connSlot[i].connection, 1);
> > + 	connSlot[i].isFree = true;
> > + 	connSlot[i].sock = PQsocket(connSlot[i].connection);
> > + }
>
> Are you sure about this global PQsetnonblocking()? This means that you
> might not be able to send queries... And you don't seem to be waiting
> for sockets waiting for writes in the select loop - which means you
> might end up being stuck waiting for reads when you haven't submitted
> the query.
>
> I think you might need a more complex select() loop. On nonfree
> connections also wait for writes if PQflush() returns != 0.

I removed the PQsetnonblocking() calls.  They were a misunderstanding, I
think.

> > +/*
> > + * GetIdleSlot
> > + *	Process the slot list, if any free slot is available then return
> > + *	the slotid else perform the select on all the socket's and wait
> > + *	until atleast one slot becomes available.
> > + */
> > +static int
> > +GetIdleSlot(ParallelSlot *pSlot, int max_slot, const char *dbname,
> > +			const char *progname, bool completedb)
> > +{
> > +	int			i;
> > +	fd_set		slotset;
>
> Hm, you probably need to limit -j to FD_SETSIZE - 1 or so.

I tried without the check to use 1500 connections, and select() didn't even
blink -- everything worked fine vacuuming 1500 tables in parallel on a set
of 2000 tables.  Not sure what's the actual limit but my FD_SETSIZE says
1024, so I'm clearly over the limit.  (I tried to run it with -j2000 but the
server didn't start with that many connections.  I didn't try any
intermediate numbers.)  Anyway I added the check.

I fixed some more minor issues and pushed.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
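
For illustration, the kind of guard described above (capping the number of jobs below FD_SETSIZE) could look roughly like this; the function name and the message wording are not from the committed code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>

/*
 * Refuse to run with more parallel jobs than select() can track; one
 * descriptor is kept in reserve.  concurrentCons and progname follow the
 * patch, the rest is illustrative.
 */
static void
check_job_limit(int concurrentCons, const char *progname)
{
    if (concurrentCons > FD_SETSIZE - 1)
    {
        fprintf(stderr, "%s: too many parallel jobs requested (maximum: %d)\n",
                progname, FD_SETSIZE - 1);
        exit(1);
    }
}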
Alvaro Herrera wrote:

> I'm tweaking your v24 a bit more now, thanks -- main change is to make
> vacuum_one_database be called only to run one analyze stage, so it never
> iterates for each stage; callers must iterate calling it multiple times
> in those cases.  (There's only one callsite that needs changing anyway.)

I made some more changes, particularly so that the TAP test pass (we were
missing the semicolon when a table name was not specified to
prepare_vacuum_command).  I reordered the code in a more sensible manner,
remove the vacuum_database_stage layer (which was pretty useless), and
changed the analyze-in-stages mode: if we pass a valid stage number, run
that stage, if not, then we're not in analyze-in-stage mode.  So I got rid
of the boolean flag altogether.  I also moved the per-stage commands and
messages back into a struct inside a function, since there's no need to have
them be file-level variables anymore.

-j1 is now the same as not specifying anything, and vacuum_one_database uses
more common code in the parallel and not-parallel cases: the not-parallel
case is just the parallel case with a single connection, so the setup and
shutdown is mostly the same in both cases.

I pushed the result.  Please test, particularly on Windows.  If you can use
configure --enable-tap-tests and run them ("make check" in the
src/bin/scripts subdir) that would be good too .. not sure whether that's
expected to work on Windows.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
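
A sketch of the prepare_vacuum_command() behaviour described above: the buffer is reset inside the function and the optional table name plus the trailing semicolon are appended there too. Option handling is trimmed to a single flag; this is illustrative, not the committed function:

#include <stdbool.h>
#include "pqexpbuffer.h"

/*
 * Build a VACUUM command for one table (or for the whole database when
 * table is NULL).  The buffer is reset here and the trailing semicolon is
 * appended here too, so callers no longer have to do either.
 */
static void
prepare_vacuum_command(PQExpBuffer sql, bool full, const char *table)
{
    resetPQExpBuffer(sql);

    appendPQExpBufferStr(sql, "VACUUM");
    if (full)
        appendPQExpBufferStr(sql, " FULL");
    if (table)
        appendPQExpBuffer(sql, " %s", table);
    appendPQExpBufferStr(sql, ";");
}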
On 23 January 2015 21:10, Alvaro Herrera Wrote,

> In case you're up for doing some more work later on, there are two ideas
> here: move the backend's TranslateSocketError to src/common, and try to
> merge pg_dump's select_loop function with the one in this new code.
> But that's for another patch anyway (and this new function needs a
> little battle-testing, of course.)

Thanks for your effort, I will take it up for next commitfest..
On 23 January 2015 23:55, Alvaro Herrera,

> -j1 is now the same as not specifying anything, and vacuum_one_database
> uses more common code in the parallel and not-parallel cases: the not-
> parallel case is just the parallel case with a single connection, so
> the setup and shutdown is mostly the same in both cases.
>
> I pushed the result.  Please test, particularly on Windows.  If you can
> use configure --enable-tap-tests and run them ("make check" in the
> src/bin/scripts subdir) that would be good too .. not sure whether
> that's expected to work on Windows.

I have tested on Windows, it's working fine. I am not sure how to enable the
TAP tests on Windows; I will check and run them if possible.

Thanks,
Dilip
[pavel@localhost bin]$ /usr/local/pgsql/bin/vacuumdb test2 -fz -j 4
vacuumdb: vacuuming database "test2"
vacuumdb: vacuuming of database "test2" failed: ERROR: deadlock detected
DETAIL: Process 24689 waits for RowExclusiveLock on relation 2608 of database 194769; blocked by process 24690.
Process 24690 waits for AccessShareLock on relation 1249 of database 194769; blocked by process 24689.
HINT: See server log for query details.
ERROR: deadlock detected
DETAIL: Process 24689 waits for RowExclusiveLock on relation 2608 of database 194769; blocked by process 24690.
Process 24690 waits for AccessShareLock on relation 1249 of database 194769; blocked by process 24689.
Process 24689: VACUUM (FULL, ANALYZE) pg_catalog.pg_attribute;
Process 24690: VACUUM (FULL, ANALYZE) pg_catalog.pg_depend;
HINT: See server log for query details.
STATEMENT: VACUUM (FULL, ANALYZE) pg_catalog.pg_attribute;
ERROR: canceling statement due to user request
STATEMENT: VACUUM (FULL, ANALYZE) pg_catalog.pg_depend;
ERROR: canceling statement due to user request
STATEMENT: VACUUM (FULL, ANALYZE) pg_catalog.pg_class;
ERROR: canceling statement due to user request
STATEMENT: VACUUM (FULL, ANALYZE) pg_catalog.pg_proc;
LOG: could not send data to client: Broken pipe
STATEMENT: VACUUM (FULL, ANALYZE) pg_catalog.pg_proc;
FATAL: connection to client lost
LOG: could not send data to client: Broken pipe
ERROR: canceling statement due to user request
FATAL: connection to client lost
Schema | Name | Type | Owner | Size | Description
------------+-------------------------+-------+----------+------------+-------------
pg_catalog | pg_attribute | table | postgres | 439 MB |
pg_catalog | pg_rewrite | table | postgres | 314 MB |
pg_catalog | pg_proc | table | postgres | 136 MB |
pg_catalog | pg_depend | table | postgres | 133 MB |
pg_catalog | pg_class | table | postgres | 69 MB |
pg_catalog | pg_attrdef | table | postgres | 55 MB |
pg_catalog | pg_trigger | table | postgres | 47 MB |
pg_catalog | pg_type | table | postgres | 31 MB |
pg_catalog | pg_description | table | postgres | 23 MB |
pg_catalog | pg_index | table | postgres | 20 MB |
pg_catalog | pg_constraint | table | postgres | 17 MB |
pg_catalog | pg_shdepend | table | postgres | 17 MB |
pg_catalog | pg_statistic | table | postgres | 928 kB |
pg_catalog | pg_operator | table | postgres | 552 kB |
pg_catalog | pg_collation | table | postgres | 232 kB |
Pavel Stehule
On Thursday, 29 January 2015, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hi, I am testing this feature on a relatively complex schema (38619 tables in the db) and I got a deadlock:
[pavel@localhost bin]$ /usr/local/pgsql/bin/vacuumdb test2 -fz -j 4
vacuumdb: vacuuming database "test2"
vacuumdb: vacuuming of database "test2" failed: ERROR: deadlock detected
DETAIL: Process 24689 waits for RowExclusiveLock on relation 2608 of database 194769; blocked by process 24690.
Process 24690 waits for AccessShareLock on relation 1249 of database 194769; blocked by process 24689.
HINT: See server log for query details.
ERROR: deadlock detected
DETAIL: Process 24689 waits for RowExclusiveLock on relation 2608 of database 194769; blocked by process 24690.
Process 24690 waits for AccessShareLock on relation 1249 of database 194769; blocked by process 24689.
Process 24689: VACUUM (FULL, ANALYZE) pg_catalog.pg_attribute;
Process 24690: VACUUM (FULL, ANALYZE) pg_catalog.pg_depend;
HINT: See server log for query details.
STATEMENT: VACUUM (FULL, ANALYZE) pg_catalog.pg_attribute;
ERROR: canceling statement due to user request
STATEMENT: VACUUM (FULL, ANALYZE) pg_catalog.pg_depend;
ERROR: canceling statement due to user request
STATEMENT: VACUUM (FULL, ANALYZE) pg_catalog.pg_class;
ERROR: canceling statement due to user request
STATEMENT: VACUUM (FULL, ANALYZE) pg_catalog.pg_proc;
LOG: could not send data to client: Broken pipe
STATEMENT: VACUUM (FULL, ANALYZE) pg_catalog.pg_proc;
FATAL: connection to client lost
LOG: could not send data to client: Broken pipe
ERROR: canceling statement due to user request
FATAL: connection to client lost
Schema | Name | Type | Owner | Size | Description
------------+-------------------------+-------+----------+------------+-------------
pg_catalog | pg_attribute | table | postgres | 439 MB |
pg_catalog | pg_rewrite | table | postgres | 314 MB |
pg_catalog | pg_proc | table | postgres | 136 MB |
pg_catalog | pg_depend | table | postgres | 133 MB |
pg_catalog | pg_class | table | postgres | 69 MB |
pg_catalog | pg_attrdef | table | postgres | 55 MB |
pg_catalog | pg_trigger | table | postgres | 47 MB |
pg_catalog | pg_type | table | postgres | 31 MB |
pg_catalog | pg_description | table | postgres | 23 MB |
pg_catalog | pg_index | table | postgres | 20 MB |
pg_catalog | pg_constraint | table | postgres | 17 MB |
pg_catalog | pg_shdepend | table | postgres | 17 MB |
pg_catalog | pg_statistic | table | postgres | 928 kB |
pg_catalog | pg_operator | table | postgres | 552 kB |
pg_catalog | pg_collation | table | postgres | 232 kB |
--
Consultoria/Coaching PostgreSQL
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
Pavel Stehule wrote:
> should not be used a pessimist controlled locking instead?

Patches welcome.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Okay, I have marked this patch as "Ready For Committer".

Notes for Committer -
There is one behavioural difference in the handling of the
--analyze-in-stages switch: when individual tables (by using the -t option)
are analyzed by using this switch, the patch will process (in case of
concurrent jobs) all the given tables for stage-1 and then for stage-2 and
so on, whereas in the unpatched code it will process all the three stages
table by table (table-1 all three stages, table-2 all three stages and so
on). I think the new behaviour is okay as the same is done when this utility
does vacuum for the whole database. As there was no input from any committer
on this point, I thought it is better to get the same rather than waiting
more just for one point.
--
Laurent Laborde wrote:

> Friendly greetings !
>
> What's the status of parallel clusterdb please ?
> I'm having fun (and troubles) applying the vacuumdb patch to clusterdb.
>
> This thread also talk about unifying code for parallelizing clusterdb and
> reindex.
> Was anything done about it ? Because i can't see it and my work currently
> involve a lot of copy/pasting from vacuumdb to clusterdb :)

Honestly, I have to wonder whether there are really valid use cases for
clusterdb.  Are you actually using it and want to see it improved, or is
this just an academical exercise?

> And no, (i'm pretty sure) i don't have the required postgresql knowledge to
> do this unification if it isn't done already.

You may or may not lack it *now*, but that doesn't mean you will continue to
lack it forever.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
<div dir="ltr"><p dir="ltr">Le 23 juil. 2015 19:27, "Alvaro Herrera" <<a href="mailto:alvherre@2ndquadrant.com" target="_blank">alvherre@2ndquadrant.com</a>>a écrit :<br /> ><br /> > Laurent Laborde wrote:<br /> ><br /> >> Friendly greetings !<br /> > ><br /> > > What's the status of parallel clusterdb please ?<br /> >> I'm having fun (and troubles) applying the vacuumdb patch to clusterdb.<br /> > ><br /> > > This threadalso talk about unifying code for parallelizing clusterdb and<br /> > > reindex.<br /> > > Was anythingdone about it ? Because i can't see it and my work currently<br /> > > involve a lot of copy/pasting from vacuumdbto clusterdb :)<br /> ><br /> > Honestly, I have to wonder whether there are really valid use cases for<br/> > clusterdb. Are you actually using it and want to see it improved, or is<br /> > this just an academicalexercise?<p dir="ltr">Purely academical. I don't use it.<br /><p dir="ltr">> > And no, (i'm pretty sure)i don't have the required postgresql knowledge to<br /> > > do this unification if it isn't done already.<br />><br /> > You may or may not lack it *now*, but that doesn't mean you will<br /> > continue to lack it forever.<pdir="ltr">That's why i'm working on it :)<br /><br />--<br />Laurent "ker2x" Laborde<br />DBA \o/ <a href="http://gandi.net">gandi.net</a></div>