Inefficiency in parallel pg_restore with many tables

From
Tom Lane
Date:
I looked into the performance gripe at [1] about pg_restore not making
effective use of parallel workers when there are a lot of tables.
I was able to reproduce that by dumping and parallel restoring 100K
tables made according to this script:

do $$
begin
for i in 1..100000 loop
  execute format('create table t%s (f1 int unique, f2 int unique);', i);
  execute format('insert into t%s select x, x from generate_series(1,1000) x',
                 i);
  if i % 100 = 0 then commit; end if;
end loop;
end
$$;

Once pg_restore reaches the parallelizable part of the restore, what
I see is that the parent pg_restore process goes to 100% CPU while its
children (and the server) mostly sit idle; that is, the task dispatch
logic in pg_backup_archiver.c is unable to dispatch tasks fast enough
to keep the children busy.  A quick perf check showed most of the time
being eaten by pg_qsort and TocEntrySizeCompare.

What I believe is happening is that we start the parallel restore phase
with 100K TableData items that are ready to go (they are in the
ready_list) and 200K AddConstraint items that are pending, because
we make those have dependencies on the corresponding TableData so we
don't build an index until after its table is populated.  Each time
one of the TableData items is completed by some worker, the two
AddConstraint items for its table are moved from the pending_list
to the ready_list --- and that means ready_list_insert marks the
ready_list as no longer sorted.  When we go to pop the next task
from the ready_list, we re-sort that entire list first.  So
we spend something like O(N^2 * log(N)) time just sorting, if
there are N tables.  Clearly, this code is much less bright
than it thinks it is (and that's all my fault, if memory serves).
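
(For anyone following along, the logic in question looks roughly like
this -- abridged from memory of pg_backup_archiver.c, so check the real
file rather than trusting me:)

/* Abridged sketch: each insert invalidates the sort flag, and each pop
 * re-sorts the entire remaining list before choosing an item. */
static void
ready_list_insert(ParallelReadyList *ready_list, TocEntry *te)
{
    ready_list->tes[++ready_list->last_te] = te;
    /* List is (probably) no longer sorted. */
    ready_list->sorted = false;
}

static void
ready_list_sort(ParallelReadyList *ready_list)
{
    if (!ready_list->sorted)
    {
        int     n = ready_list->last_te - ready_list->first_te + 1;

        if (n > 1)
            qsort(ready_list->tes + ready_list->first_te, n,
                  sizeof(TocEntry *), TocEntrySizeCompare);
        ready_list->sorted = true;
    }
}

With 200K constraint entries trickling back into the list a couple at a
time, that qsort fires on nearly every pop.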

I'm not sure how big a deal this is in practice: in most situations
the individual jobs are larger than they are in this toy example,
plus the initial non-parallelizable part of the restore is a bigger
bottleneck anyway with this many tables.  Still, we do have one
real-world complaint, so maybe we should look into improving it.

I wonder if we could replace the sorted ready-list with a priority heap,
although that might be complicated by the fact that pop_next_work_item
has to be capable of popping something that's not necessarily the
largest remaining job.  Another idea could be to be a little less eager
to sort the list every time; I think in practice scheduling wouldn't
get much worse if we only re-sorted every so often.

I don't have time to pursue this right now, but perhaps someone
else would like to.

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/CAEzn%3DHSPXi6OS-5KzGMcZeKzWKOOX1me2u2eCiGtMEZDz9Fqdg%40mail.gmail.com



Re: Inefficiency in parallel pg_restore with many tables

From
Andres Freund
Date:
Hi,

On 2023-07-15 13:47:12 -0400, Tom Lane wrote:
> I wonder if we could replace the sorted ready-list with a priority heap,
> although that might be complicated by the fact that pop_next_work_item
> has to be capable of popping something that's not necessarily the
> largest remaining job.  Another idea could be to be a little less eager
> to sort the list every time; I think in practice scheduling wouldn't
> get much worse if we only re-sorted every so often.

Perhaps we could keep track of where the newly inserted items are, and use
insertion sort or some such when the number of new elements is much smaller
than the number of already-sorted elements?

As you say, a straight priority heap might not be easy. But we could just
open-code this using two sorted arrays: one large, and one for recent
additions that needs to be sorted afresh. We could occasionally merge the
small array into the big array, once the small one has gotten large enough
that sorting it becomes expensive.  We could go for a heap of N>2 such
arrays, but I doubt it would be worth much.
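
Untested sketch of what I mean -- te_size_cmp(), merge_arrays(), and
MERGE_THRESHOLD are invented names, and capacity checks are omitted:

typedef struct ReadyArrays
{
    TocEntry  **sorted;         /* large, fully sorted array */
    int         nsorted;
    TocEntry  **recent;         /* small array of recent insertions */
    int         nrecent;
} ReadyArrays;

static void
ready_insert(ReadyArrays *ra, TocEntry *te)
{
    int     lo = 0;
    int     hi = ra->nrecent;

    /* binary-search the insertion point in the small array */
    while (lo < hi)
    {
        int     mid = (lo + hi) / 2;

        if (te_size_cmp(ra->recent[mid], te) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* shift the tail up one slot and drop the new entry in */
    memmove(&ra->recent[lo + 1], &ra->recent[lo],
            (ra->nrecent - lo) * sizeof(TocEntry *));
    ra->recent[lo] = te;
    ra->nrecent++;

    /* merge into the big array once keeping "recent" sorted gets costly */
    if (ra->nrecent >= MERGE_THRESHOLD)
        merge_arrays(ra);       /* standard O(n + m) two-finger merge */
}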

Greetings,

Andres Freund



Re: Inefficiency in parallel pg_restore with many tables

From
Andrew Dunstan
Date:


On 2023-07-15 Sa 13:47, Tom Lane wrote:
> I wonder if we could replace the sorted ready-list with a priority heap,
> although that might be complicated by the fact that pop_next_work_item
> has to be capable of popping something that's not necessarily the
> largest remaining job.  Another idea could be to be a little less eager
> to sort the list every time; I think in practice scheduling wouldn't
> get much worse if we only re-sorted every so often.


Yeah, I think that last idea is reasonable. Something like: if the number added since the last sort is more than min(50, list_length/4), then sort. That shouldn't be too invasive.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Inefficiency in parallel pg_restore with many tables

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2023-07-15 Sa 13:47, Tom Lane wrote:
>> I wonder if we could replace the sorted ready-list with a priority heap,
>> although that might be complicated by the fact that pop_next_work_item
>> has to be capable of popping something that's not necessarily the
>> largest remaining job.  Another idea could be to be a little less eager
>> to sort the list every time; I think in practice scheduling wouldn't
>> get much worse if we only re-sorted every so often.

> Yeah, I think that last idea is reasonable. Something like if the number 
> added since the last sort is more than min(50, list_length/4) then sort. 
> That shouldn't be too invasive.

Actually, as long as we're talking about approximately-correct behavior:
let's make the ready_list be a priority heap, and then just make
pop_next_work_item scan forward from the array start until it finds an
item that's runnable per the lock heuristic.  If the heap root is
blocked, the next things we'll examine will be its two children.
We might pick the lower-priority of those two, but it's still known to
be higher priority than at least 50% of the remaining heap entries, so
it shouldn't be too awful as a choice.  The argument gets weaker the
further you go into the heap, but we're not expecting that having most
of the top entries blocked will be a common case.  (Besides which, the
priorities are pretty crude to begin with.)  Once selected, pulling out
an entry that is not the heap root is no problem: you just start the
sift-down process from there.
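
In rough pseudo-C (a sketch only -- it assumes binaryheap grows accessors
like binaryheap_size()/binaryheap_get_node() and a remove-nth-node
operation, and blocked_by_running_items() is an invented stand-in for the
existing lock-conflict checks against running workers):

static TocEntry *
pop_next_work_item(binaryheap *ready_heap, ParallelState *pstate)
{
    /* Scan the heap's array from the root: slot 0 is the largest item,
     * slots 1 and 2 are its children, and so on. */
    for (int i = 0; i < binaryheap_size(ready_heap); i++)
    {
        TocEntry   *te = (TocEntry *) binaryheap_get_node(ready_heap, i);

        if (!blocked_by_running_items(te, pstate))
        {
            binaryheap_remove_node(ready_heap, i);  /* sift down from slot i */
            return te;
        }
    }

    return NULL;                /* everything ready is currently blocked */
}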

The main advantage of this over the only-sort-sometimes idea is that
we can guarantee that the largest ready item will always be dispatched
as soon as it can be (because it will be the heap root).  So cases
involving one big table (with big indexes) and a lot of little ones
should get scheduled sanely, which is the main thing we want this
algorithm to ensure.  With the other approach we can't really promise
much at all.

            regards, tom lane



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Sun, Jul 16, 2023 at 09:45:54AM -0400, Tom Lane wrote:
> Actually, as long as we're talking about approximately-correct behavior:
> let's make the ready_list be a priority heap, and then just make
> pop_next_work_item scan forward from the array start until it finds an
> item that's runnable per the lock heuristic.  If the heap root is
> blocked, the next things we'll examine will be its two children.
> We might pick the lower-priority of those two, but it's still known to
> be higher priority than at least 50% of the remaining heap entries, so
> it shouldn't be too awful as a choice.  The argument gets weaker the
> further you go into the heap, but we're not expecting that having most
> of the top entries blocked will be a common case.  (Besides which, the
> priorities are pretty crude to begin with.)  Once selected, pulling out
> an entry that is not the heap root is no problem: you just start the
> sift-down process from there.
> 
> The main advantage of this over the only-sort-sometimes idea is that
> we can guarantee that the largest ready item will always be dispatched
> as soon as it can be (because it will be the heap root).  So cases
> involving one big table (with big indexes) and a lot of little ones
> should get scheduled sanely, which is the main thing we want this
> algorithm to ensure.  With the other approach we can't really promise
> much at all.

This seems worth a try.  IIUC you are suggesting making binaryheap.c
frontend-friendly and expanding its API a bit.  If no one has volunteered,
I could probably hack something together.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Sun, Jul 16, 2023 at 08:54:24PM -0700, Nathan Bossart wrote:
> This seems worth a try.  IIUC you are suggesting making binaryheap.c
> frontend-friendly and expanding its API a bit.  If no one has volunteered,
> I could probably hack something together.

I spent some time on the binaryheap changes.  I haven't had a chance to
plug it into the ready_list yet.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Inefficiency in parallel pg_restore with many tables

From
Alvaro Herrera
Date:
On 2023-Jul-17, Nathan Bossart wrote:

> @@ -35,7 +42,11 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
>      binaryheap *heap;
>  
>      sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
> +#ifdef FRONTEND
> +    heap = (binaryheap *) pg_malloc(sz);
> +#else
>      heap = (binaryheap *) palloc(sz);
> +#endif

Hmm, as I recall fe_memutils.c provides you with palloc() in the
frontend environment, so you don't actually need this one.
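
That is, with common/fe_memutils.h in the picture the allocation shouldn't
need any #ifdef at all, something like:

/* palloc() is a thin wrapper over pg_malloc() in frontend builds
 * (via common/fe_memutils.h), so this compiles in both environments: */
sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
heap = (binaryheap *) palloc(sz);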

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there.  Did
somebody make percentages logarithmic while I wasn't looking?"
                http://smylers.hates-software.com/2005/09/08/1995c749.html



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Tue, Jul 18, 2023 at 06:05:11PM +0200, Alvaro Herrera wrote:
> On 2023-Jul-17, Nathan Bossart wrote:
> 
>> @@ -35,7 +42,11 @@ binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
>>      binaryheap *heap;
>>  
>>      sz = offsetof(binaryheap, bh_nodes) + sizeof(Datum) * capacity;
>> +#ifdef FRONTEND
>> +    heap = (binaryheap *) pg_malloc(sz);
>> +#else
>>      heap = (binaryheap *) palloc(sz);
>> +#endif
> 
> Hmm, as I recall fe_memutils.c provides you with palloc() in the
> frontend environment, so you don't actually need this one.

Ah, yes it does.  Thanks for the pointer.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
Here is a work-in-progress patch set for converting ready_list to a
priority queue.  On my machine, Tom's 100k-table example [0] takes 11.5
minutes without these patches and 1.5 minutes with them.

One item that requires more thought is binaryheap's use of Datum.  AFAICT
the Datum definitions live in postgres.h and aren't available to frontend
code.  I think we'll either need to move the Datum definitions to c.h or to
adjust binaryheap to use "void *".

[0] https://postgr.es/m/3612876.1689443232%40sss.pgh.pa.us

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Thu, Jul 20, 2023 at 12:06:44PM -0700, Nathan Bossart wrote:
> Here is a work-in-progress patch set for converting ready_list to a
> priority queue.  On my machine, Tom's 100k-table example [0] takes 11.5
> minutes without these patches and 1.5 minutes with them.
> 
> One item that requires more thought is binaryheap's use of Datum.  AFAICT
> the Datum definitions live in postgres.h and aren't available to frontend
> code.  I think we'll either need to move the Datum definitions to c.h or to
> adjust binaryheap to use "void *".

In v3, I moved the Datum definitions to c.h.  I first tried modifying
binaryheap to use "int" or "void *" instead, but that ended up requiring
some rather invasive changes in backend code, not to mention any extensions
that happen to be using it.  I also looked into moving the definitions to a
separate datumdefs.h header that postgres.h would include, but that felt
awkward because 1) postgres.h clearly states that it is intended for things
"that never escape the backend" and 2) the definitions seem relatively
inexpensive.  However, I think the latter option is still viable, so I'm
fine with switching to it if folks think that is a better approach.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Sat, Jul 22, 2023 at 04:19:41PM -0700, Nathan Bossart wrote:
> In v3, I moved the Datum definitions to c.h.  I first tried modifying
> binaryheap to use "int" or "void *" instead, but that ended up requiring
> some rather invasive changes in backend code, not to mention any extensions
> that happen to be using it.  I also looked into moving the definitions to a
> separate datumdefs.h header that postgres.h would include, but that felt
> awkward because 1) postgres.h clearly states that it is intended for things
> "that never escape the backend" and 2) the definitions seem relatively
> inexpensive.  However, I think the latter option is still viable, so I'm
> fine with switching to it if folks think that is a better approach.

BTW we might be able to replace the open-coded heap in pg_dump_sort.c
(added by 79273cc) with a binaryheap, too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Inefficiency in parallel pg_restore with many tables

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Thu, Jul 20, 2023 at 12:06:44PM -0700, Nathan Bossart wrote:
>> One item that requires more thought is binaryheap's use of Datum.  AFAICT
>> the Datum definitions live in postgres.h and aren't available to frontend
>> code.  I think we'll either need to move the Datum definitions to c.h or to
>> adjust binaryheap to use "void *".

> In v3, I moved the Datum definitions to c.h.  I first tried modifying
> binaryheap to use "int" or "void *" instead, but that ended up requiring
> some rather invasive changes in backend code, not to mention any extensions
> that happen to be using it.

I'm quite uncomfortable with putting Datum in c.h.  I know that the
typedef is merely a uintptr_t, but this solution seems to me to be
blowing all kinds of holes in the abstraction, because exactly none
of the infrastructure that goes along with Datum is or is ever likely
to be in any frontend build.  At the very least, frontend code that
refers to Datum will be misleading as hell.

I wonder whether we can't provide some alternate definition or "skin"
for binaryheap that preserves the Datum API for backend code that wants
that, while providing a void *-based API for frontend code to use.

            regards, tom lane



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Sat, Jul 22, 2023 at 07:47:50PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> I first tried modifying
>> binaryheap to use "int" or "void *" instead, but that ended up requiring
>> some rather invasive changes in backend code, not to mention any extensions
>> that happen to be using it.

I followed through with the "void *" approach (attached), and it wasn't as
bad as I expected.

> I wonder whether we can't provide some alternate definition or "skin"
> for binaryheap that preserves the Datum API for backend code that wants
> that, while providing a void *-based API for frontend code to use.

I can give this a try next, but it might be rather #ifdef-heavy.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Inefficiency in parallel pg_restore with many tables

From
Pierre Ducroquet
Date:
On Saturday, July 15, 2023 7:47:12 PM CEST Tom Lane wrote:
> I'm not sure how big a deal this is in practice: in most situations
> the individual jobs are larger than they are in this toy example,
> plus the initial non-parallelizable part of the restore is a bigger
> bottleneck anyway with this many tables.  Still, we do have one
> real-world complaint, so maybe we should look into improving it.

Hi

For what it's worth, at my current job it's kind of a big deal. I was about
to start looking into the bad pg_restore performance I was seeing on some
databases with over 50k tables (in 200 namespaces) when I found this thread.
The dump weighs in at about 2.8GB; the toc.dat file is 230MB and covers
50,120 tables, 142,069 constraints, and 73,669 indexes.

HEAD pg_restore duration: 30 minutes
pg_restore with latest patch from Nathan Bossart: 23 minutes

This is indeed better, but there is still a lot of room for improvement. For
such use cases, I was able to go much faster by driving the patched
pg_restore with a script that parallelizes across schemas instead of relying
on the choices made by pg_restore. It seems the choice to parallelize only
the data loading loses nice speedup opportunities when there is a huge
number of objects.

patched pg_restore + parallel restore of schemas: 10 minutes

Anyway, the patch works really fine as is, and I will certainly keep trying 
future iterations.

Regards

 Pierre


Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Sat, Jul 22, 2023 at 10:57:03PM -0700, Nathan Bossart wrote:
> On Sat, Jul 22, 2023 at 07:47:50PM -0400, Tom Lane wrote:
>> I wonder whether we can't provide some alternate definition or "skin"
>> for binaryheap that preserves the Datum API for backend code that wants
>> that, while providing a void *-based API for frontend code to use.
> 
> I can give this a try next, but it might be rather #ifdef-heavy.

Here is a sketch of this approach.  It required fewer #ifdefs than I was
expecting.  At the moment, this one seems like the winner to me.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Mon, Jul 24, 2023 at 12:00:15PM -0700, Nathan Bossart wrote:
> Here is a sketch of this approach.  It required fewer #ifdefs than I was
> expecting.  At the moment, this one seems like the winner to me.

Here is a polished patch set for this approach.  I've also added a 0004
that replaces the open-coded heap in pg_dump_sort.c with a binaryheap.
IMHO these patches are in decent shape.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Tue, Jul 25, 2023 at 11:53:36AM -0700, Nathan Bossart wrote:
> Here is a polished patch set for this approach.  I've also added a 0004
> that replaces the open-coded heap in pg_dump_sort.c with a binaryheap.
> IMHO these patches are in decent shape.

I'm hoping to commit these patches at some point in the current commitfest.
I don't sense anything tremendously controversial, and they provide a
pretty nice speedup in some cases.  Are there any remaining concerns?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Inefficiency in parallel pg_restore with many tables

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> I'm hoping to commit these patches at some point in the current commitfest.
> I don't sense anything tremendously controversial, and they provide a
> pretty nice speedup in some cases.  Are there any remaining concerns?

I've not actually looked at any of these patchsets after the first one.
I have added myself as a reviewer and will hopefully get to it within
a week or so.

            regards, tom lane



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Fri, Sep 01, 2023 at 01:41:41PM -0400, Tom Lane wrote:
> I've not actually looked at any of these patchsets after the first one.
> I have added myself as a reviewer and will hopefully get to it within
> a week or so.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Inefficiency in parallel pg_restore with many tables

From
Robert Haas
Date:
On Tue, Jul 25, 2023 at 2:53 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Mon, Jul 24, 2023 at 12:00:15PM -0700, Nathan Bossart wrote:
> > Here is a sketch of this approach.  It required fewer #ifdefs than I was
> > expecting.  At the moment, this one seems like the winner to me.
>
> Here is a polished patch set for this approach.  I've also added a 0004
> that replaces the open-coded heap in pg_dump_sort.c with a binaryheap.
> IMHO these patches are in decent shape.

[ drive-by comment that hopefully doesn't cause too much pain ]

In hindsight, I think that making binaryheap depend on Datum was a bad
idea. I think that was my idea, and I think it wasn't very smart.
Considering that people have coded to that decision up until now, it
might not be too easy to change at this point. But in principle I
guess you'd want to be able to make a heap out of any C data type,
rather than just Datum, or just Datum in the backend and just void *
in the frontend.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Fri, Sep 01, 2023 at 04:00:44PM -0400, Robert Haas wrote:
> In hindsight, I think that making binaryheap depend on Datum was a bad
> idea. I think that was my idea, and I think it wasn't very smart.
> Considering that people have coded to that decision up until now, it
> might not be too easy to change at this point. But in principle I
> guess you'd want to be able to make a heap out of any C data type,
> rather than just Datum, or just Datum in the backend and just void *
> in the frontend.

Yeah, something similar to simplehash for binary heaps could be nice.  That
being said, I don't know if there's a strong reason to specialize the
implementation for a given C data type in most cases.  I suspect many
callers are just fine with dealing with pointers (e.g., I wouldn't store an
entire TocEntry in the array), and smaller types like integers are already
stored directly in the array thanks to the use of Datum.  However, it
_would_ allow us to abandon this frontend/backend void */Datum kludge,
which is something.
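
To make the simplehash analogy concrete, I'm picturing something like this
(entirely hypothetical -- neither these macros nor the header exist today):

/* Hypothetical type-templated binary heap, modeled on simplehash.h. */
#define BH_PREFIX        toc_heap
#define BH_ELEMENT_TYPE  TocEntry *
#define BH_SCOPE         static inline
#define BH_DECLARE
#define BH_DEFINE
#include "lib/binaryheap_templ.h"

/* ...after which callers get a typed API with no Datum/void * casts: */
toc_heap_add(heap, te);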

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Fri, Sep 01, 2023 at 01:52:48PM -0700, Nathan Bossart wrote:
> On Fri, Sep 01, 2023 at 04:00:44PM -0400, Robert Haas wrote:
>> In hindsight, I think that making binaryheap depend on Datum was a bad
>> idea. I think that was my idea, and I think it wasn't very smart.
>> Considering that people have coded to that decision up until now, it
>> might not be too easy to change at this point. But in principle I
>> guess you'd want to be able to make a heap out of any C data type,
>> rather than just Datum, or just Datum in the backend and just void *
>> in the frontend.
> 
> Yeah, something similar to simplehash for binary heaps could be nice.  That
> being said, I don't know if there's a strong reason to specialize the
> implementation for a given C data type in most cases.  I suspect many
> callers are just fine with dealing with pointers (e.g., I wouldn't store an
> entire TocEntry in the array), and smaller types like integers are already
> stored directly in the array thanks to the use of Datum.  However, it
> _would_ allow us to abandon this frontend/backend void */Datum kludge,
> which is something.

I ended up hacking together a (nowhere near committable) patch to see how
hard it would be to allow using any type with binaryheap.  It doesn't seem
too bad.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Inefficiency in parallel pg_restore with many tables

From
Alvaro Herrera
Date:
On 2023-Sep-02, Nathan Bossart wrote:

> On Fri, Sep 01, 2023 at 01:52:48PM -0700, Nathan Bossart wrote:

> > Yeah, something similar to simplehash for binary heaps could be nice.  That
> > being said, I don't know if there's a strong reason to specialize the
> > implementation for a given C data type in most cases.
> 
> I ended up hacking together a (nowhere near committable) patch to see how
> hard it would be to allow using any type with binaryheap.  It doesn't seem
> too bad.

Yeah, using void * seems to lead to interfaces that are pretty much the
same as bsearch() or qsort().  (Why isn't your payload type const,
though?)

I do wonder why you changed _remove_first and _first to have a
'result' output argument instead of a return value.  Does this change
actually buy you anything?  simplehash.h doesn't do that either.

> -extern void binaryheap_add(binaryheap *heap, Datum d);
> -extern Datum binaryheap_first(binaryheap *heap);
> -extern Datum binaryheap_remove_first(binaryheap *heap);
> -extern void binaryheap_replace_first(binaryheap *heap, Datum d);
> +extern void binaryheap_add(binaryheap *heap, void *d);
> +extern void binaryheap_first(binaryheap *heap, void *result);
> +extern void binaryheap_remove_first(binaryheap *heap, void *result);
> +extern void binaryheap_replace_first(binaryheap *heap, void *d);

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Sun, Sep 03, 2023 at 12:04:00PM +0200, Alvaro Herrera wrote:
> On 2023-Sep-02, Nathan Bossart wrote:
>> I ended up hacking together a (nowhere near committable) patch to see how
>> hard it would be to allow using any type with binaryheap.  It doesn't seem
>> too bad.
> 
> Yeah, using void * seems to lead to interfaces that are pretty much the
> same as bsearch() or qsort().

Right.  This is what I had in mind.

> (Why isn't your payload type const,
> though?)

It probably should be const.  This patch was just a proof-of-concept and
still requires a bit of work.

> I do wonder why you changed _remove_first and _first to have a
> 'result' output argument instead of a return value.  Does this change
> actually buy you anything?  simplehash.h doesn't do that either.
> 
>> -extern void binaryheap_add(binaryheap *heap, Datum d);
>> -extern Datum binaryheap_first(binaryheap *heap);
>> -extern Datum binaryheap_remove_first(binaryheap *heap);
>> -extern void binaryheap_replace_first(binaryheap *heap, Datum d);
>> +extern void binaryheap_add(binaryheap *heap, void *d);
>> +extern void binaryheap_first(binaryheap *heap, void *result);
>> +extern void binaryheap_remove_first(binaryheap *heap, void *result);
>> +extern void binaryheap_replace_first(binaryheap *heap, void *d);

_first could likely just return a pointer to the data in the binary heap's
array.  However, _remove_first has to copy the data somewhere, so I think
the alternative would be to return a palloc'd value.  Is there another way
that I'm not thinking of?
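
To spell out the shapes I'm weighing (sketches only):

/* (1) _first can hand back a pointer into the heap's own array, since
 *     nothing moves until the next heap operation: */
extern void *binaryheap_first(binaryheap *heap);

/* (2) _remove_first with caller-supplied storage, as in my patch: */
extern void binaryheap_remove_first(binaryheap *heap, void *result);

/* (3) or a palloc'd copy per pop (invented variant, not posted): */
extern void *binaryheap_remove_first_copy(binaryheap *heap);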

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Sat, Sep 02, 2023 at 11:55:21AM -0700, Nathan Bossart wrote:
> I ended up hacking together a (nowhere near committable) patch to see how
> hard it would be to allow using any type with binaryheap.  It doesn't seem
> too bad.

I spent some more time on this patch and made the relevant adjustments to
the rest of the set.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Inefficiency in parallel pg_restore with many tables

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> I spent some more time on this patch and made the relevant adjustments to
> the rest of the set.

Hmm ... I do not like v7 very much at all.  It requires rather ugly
changes to all of the existing callers, and what are we actually
buying?  If anything, it makes things slower for pass-by-value items
like integers.  I'd stick with the Datum convention in the backend.

Instead, I took a closer look through the v6 patch set.
I think that's in pretty good shape and nearly committable,
but I have a few thoughts:

* I'm not sure about defining bh_node_type as a macro:

+#ifdef FRONTEND
+#define bh_node_type void *
+#else
+#define bh_node_type Datum
+#endif

rather than an actual typedef:

+#ifdef FRONTEND
+typedef void *bh_node_type;
+#else
+typedef Datum bh_node_type;
+#endif

My concern here is that bh_node_type is effectively acting as a
typedef, so that pgindent might misbehave if it's not declared as a
typedef.  On the other hand, there doesn't seem to be any indentation
problem in the patchset as it stands, and we don't expect any code
outside binaryheap.h/.c to refer to bh_node_type, so maybe it's fine.
(If you do choose to make it a typedef, remember to add it to
typedefs.list.)

* As a matter of style, I'd recommend adding braces in places
like this:

     if (heap->bh_size >= heap->bh_space)
+    {
+#ifdef FRONTEND
+        pg_fatal("out of binary heap slots");
+#else
         elog(ERROR, "out of binary heap slots");
+#endif
+    }
     heap->bh_nodes[heap->bh_size] = d;

It's not wrong as you have it, but I think it's more readable
and less easy to accidentally break with the extra braces.

* In 0002, isn't the comment for binaryheap_remove_node wrong?

+ * Removes the nth node from the heap.  The caller must ensure that there are
+ * at least (n - 1) nodes in the heap.  O(log n) worst case.

Shouldn't that be "(n + 1)"?  Also, I'd specify "n'th (zero based) node"
for clarity.

* I would say that this bit in 0004:

-        j = removeHeapElement(pendingHeap, heapLength--);
+        j = (intptr_t) binaryheap_remove_first(pendingHeap);

needs an explicit cast to int:

+        j = (int) (intptr_t) binaryheap_remove_first(pendingHeap);

otherwise some compilers might complain about the result possibly
not fitting in "j".

Other than those nitpicks, I like v6.  I'll mark this RfC.

            regards, tom lane



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Sun, Sep 10, 2023 at 12:35:10PM -0400, Tom Lane wrote:
> Hmm ... I do not like v7 very much at all.  It requires rather ugly
> changes to all of the existing callers, and what are we actually
> buying?  If anything, it makes things slower for pass-by-value items
> like integers.  I'd stick with the Datum convention in the backend.
> 
> Instead, I took a closer look through the v6 patch set.
> I think that's in pretty good shape and nearly committable,
> but I have a few thoughts:

Thanks for reviewing.  I'm fine with proceeding with the v6 approach.  Even
though the alternative approach makes the API consistent for the frontend
and backend, I'm also not a huge fan of the pointer gymnastics required in
the comparators.  Granted, we still have to do some intptr_t conversions in
pg_dump_sort.c with the v6 approach, but that seems to be an exception.

> * I'm not sure about defining bh_node_type as a macro:
> 
> +#ifdef FRONTEND
> +#define bh_node_type void *
> +#else
> +#define bh_node_type Datum
> +#endif
> 
> rather than an actual typedef:
> 
> +#ifdef FRONTEND
> +typedef void *bh_node_type;
> +#else
> +typedef Datum bh_node_type;
> +#endif
> 
> My concern here is that bh_node_type is effectively acting as a
> typedef, so that pgindent might misbehave if it's not declared as a
> typedef.  On the other hand, there doesn't seem to be any indentation
> problem in the patchset as it stands, and we don't expect any code
> outside binaryheap.h/.c to refer to bh_node_type, so maybe it's fine.
> (If you do choose to make it a typedef, remember to add it to
> typedefs.list.)

I think a typedef makes more sense here.

> * As a matter of style, I'd recommend adding braces in places
> like this:
> 
>      if (heap->bh_size >= heap->bh_space)
> +    {
> +#ifdef FRONTEND
> +        pg_fatal("out of binary heap slots");
> +#else
>          elog(ERROR, "out of binary heap slots");
> +#endif
> +    }
>      heap->bh_nodes[heap->bh_size] = d;
> 
> It's not wrong as you have it, but I think it's more readable
> and less easy to accidentally break with the extra braces.

Fair point.

> * In 0002, isn't the comment for binaryheap_remove_node wrong?
> 
> + * Removes the nth node from the heap.  The caller must ensure that there are
> + * at least (n - 1) nodes in the heap.  O(log n) worst case.
> 
> Shouldn't that be "(n + 1)"?  Also, I'd specify "n'th (zero based) node"
> for clarity.

Yeah, that's a mistake.

> * I would say that this bit in 0004:
> 
> -        j = removeHeapElement(pendingHeap, heapLength--);
> +        j = (intptr_t) binaryheap_remove_first(pendingHeap);
> 
> needs an explicit cast to int:
> 
> +        j = (int) (intptr_t) binaryheap_remove_first(pendingHeap);
> 
> otherwise some compilers might complain about the result possibly
> not fitting in "j".

Sure.  IMO it's a tad more readable, too.

> Other than those nitpicks, I like v6.  I'll mark this RfC.

Great.  I've posted a v8 with your comments addressed in order to get one
more round of cfbot coverage.  Assuming those tests pass and there is no
additional feedback, I'll plan on committing this in the next few days.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Wed, Sep 13, 2023 at 11:34:50AM -0700, Nathan Bossart wrote:
> On Sun, Sep 10, 2023 at 12:35:10PM -0400, Tom Lane wrote:
>> Other than those nitpicks, I like v6.  I'll mark this RfC.
> 
> Great.  I've posted a v8 with your comments addressed in order to get one
> more round of cfbot coverage.  Assuming those tests pass and there is no
> additional feedback, I'll plan on committing this in the next few days.

Upon closer inspection, I found a rather nasty problem.  The qsort
comparator expects a TocEntry **, but the binaryheap comparator expects a
TocEntry *, and we simply pass the arguments through to the qsort
comparator.  In v9, I added the requisite ampersands.  I'm surprised this
worked at all.  I'm planning to run some additional tests to make sure this
patch set works as expected.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Inefficiency in parallel pg_restore with many tables

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> Upon closer inspection, I found a rather nasty problem.  The qsort
> comparator expects a TocEntry **, but the binaryheap comparator expects a
> TocEntry *, and we simply pass the arguments through to the qsort
> comparator.  In v9, I added the requisite ampersands.

Ooops :-(

> I'm surprised this
> worked at all.

Probably it was not sorting things appropriately.  Might be worth adding
some test scaffolding to check that bigger tasks are chosen before
smaller ones.

            regards, tom lane



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Wed, Sep 13, 2023 at 08:01:39PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> Upon closer inspection, I found a rather nasty problem.  The qsort
>> comparator expects a TocEntry **, but the binaryheap comparator expects a
>> TocEntry *, and we simply pass the arguments through to the qsort
>> comparator.  In v9, I added the requisite ampersands.
> 
> Ooops :-(
> 
>> I'm surprised this
>> worked at all.
> 
> Probably it was not sorting things appropriately.  Might be worth adding
> some test scaffolding to check that bigger tasks are chosen before
> smaller ones.

Further testing revealed that the binaryheap comparator function was
actually generating a min-heap since the qsort comparator sorts by
decreasing dataLength.  This is fixed in v10.  And I am 0 for 2 today...
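
For the archives, the comparator in v10 ends up along these lines
(paraphrasing from memory; see the attached patch for the authoritative
version):

static int
TocEntrySizeCompareBinaryheap(void *p1, void *p2, void *arg)
{
    /* Pass the addresses of our void * arguments, since the qsort-style
     * comparator expects TocEntry **, and negate its descending-order
     * result so the binary heap behaves as a max-heap. */
    return -TocEntrySizeCompareQsort(&p1, &p2);
}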

Now that this appears to be functioning as expected, I see that the larger
entries are typically picked up earlier, but we do sometimes pick entries
quite a bit further down the list, as anticipated.  The case I was testing
(10k tables with the number of rows equal to the table number) was much
faster with this patch (just over a minute) than without it (over 16
minutes).

Sincerest apologies for the noise.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
For now, I've committed 0001 and 0002.  I intend to commit the others soon.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Inefficiency in parallel pg_restore with many tables

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> For now, I've committed 0001 and 0002.  I intend to commit the others soon.

bowerbird is unhappy with this.  I suppose you missed out updating
the src/tools/msvc/ scripts.  (Weren't we about ready to nuke those?)

            regards, tom lane



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Mon, Sep 18, 2023 at 09:23:20PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> For now, I've committed 0001 and 0002.  I intend to commit the others soon.
> 
> bowerbird is unhappy with this.  I suppose you missed out updating
> the src/tools/msvc/ scripts.  (Weren't we about ready to nuke those?)

I saw that and have attempted to fix it with 83223f5.  I'm still waiting
for an MSVC animal to report back.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Inefficiency in parallel pg_restore with many tables

From
Michael Paquier
Date:
On Mon, Sep 18, 2023 at 09:23:20PM -0400, Tom Lane wrote:
> bowerbird is unhappy with this.  I suppose you missed out updating
> the src/tools/msvc/ scripts.
> (Weren't we about ready to nuke those?)

hamerkop seems to be the only buildfarm member that would complain if
these were to be gone today, on top of bowerbird, of course.
--
Michael

Attachment

Re: Inefficiency in parallel pg_restore with many tables

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Mon, Sep 18, 2023 at 09:23:20PM -0400, Tom Lane wrote:
>> bowerbird is unhappy with this.  I suppose you missed out updating
>> the src/tools/msvc/ scripts.  (Weren't we about ready to nuke those?)

> I saw that and have attempted to fix it with 83223f5.

Ah, right, sorry for the noise.

But in any case, how long are we keeping src/tools/msvc/ ?

            regards, tom lane



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Mon, Sep 18, 2023 at 09:36:03PM -0400, Tom Lane wrote:
> But in any case, how long are we keeping src/tools/msvc/ ?

From a skim of [0], it seems like it could be removed now.  I see a couple
of work-in-progress patches from Andres [1] that would probably serve as a
good starting point.  I won't have much time for this for the next few
weeks, so if someone else wants to pick it up, please feel free.

[0] https://postgr.es/m/20230408191007.7lysd42euafwl74f%40awork3.anarazel.de
[1] https://github.com/anarazel/postgres/commits/drop-homegrown-msvc

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Inefficiency in parallel pg_restore with many tables

From
Nathan Bossart
Date:
On Mon, Sep 18, 2023 at 02:22:32PM -0700, Nathan Bossart wrote:
> For now, I've committed 0001 and 0002.  I intend to commit the others soon.

I've committed the rest of the patches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com