I looked into the performance gripe at [1] about pg_restore not making
effective use of parallel workers when there are a lot of tables.
I was able to reproduce that by dumping and parallel restoring 100K
tables made according to this script:
do $$
begin
  for i in 1..100000 loop
    execute format('create table t%s (f1 int unique, f2 int unique);', i);
    execute format('insert into t%s select x, x from generate_series(1,1000) x',
                   i);
    if i % 100 = 0 then commit; end if;
  end loop;
end
$$;
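(Concretely, "dumping and parallel restoring" means something like

    pg_dump -Fd -j 8 -f /tmp/dump100k mydb
    pg_restore -j 8 -d mydb_restored /tmp/dump100k

though the exact archive format, paths, and job count here are just
illustrative; any pg_restore -j run with enough workers should expose
the dispatch bottleneck.)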
Once pg_restore reaches the parallelizable part of the restore, what
I see is that the parent pg_restore process goes to 100% CPU while its
children (and the server) mostly sit idle; that is, the task dispatch
logic in pg_backup_archiver.c is unable to dispatch tasks fast enough
to keep the children busy. A quick perf check showed most of the time
being eaten by pg_qsort and TocEntrySizeCompare.
What I believe is happening is that we start the parallel restore phase
with 100K TableData items that are ready to go (they are in the
ready_list) and 200K AddConstraint items that are pending, because
we make those have dependencies on the corresponding TableData so we
don't build an index until after its table is populated. Each time
one of the TableData items is completed by some worker, the two
AddConstraint items for its table are moved from the pending_list
to the ready_list --- and that means ready_list_insert marks the
ready_list as no longer sorted. When we go to pop the next task
from the ready_list, we re-sort that entire list first. So
we spend something like O(N^2 * log(N)) time just sorting, if
there are N tables. Clearly, this code is much less bright
than it thinks it is (and that's all my fault, if memory serves).
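To make the failure mode concrete, here's the shape of the logic in
miniature. This is a from-memory paraphrase, not the actual
pg_backup_archiver.c source; the TocEntry/ReadyList fields and the
size_compare helper are simplified stand-ins:

#include <stdbool.h>
#include <stdlib.h>

typedef struct TocEntry
{
    double      dataLength;     /* estimated size of this job */
} TocEntry;

typedef struct ReadyList
{
    TocEntry  **tes;            /* items that are ready to dispatch */
    int         num_tes;        /* current number of items */
    bool        sorted;         /* is tes[] currently sorted by size? */
} ReadyList;

/* sort ascending by size, so the largest job ends up last */
static int
size_compare(const void *a, const void *b)
{
    double      sa = (*(TocEntry *const *) a)->dataLength;
    double      sb = (*(TocEntry *const *) b)->dataLength;

    return (sa > sb) - (sa < sb);
}

static void
ready_list_insert(ReadyList *rl, TocEntry *te)
{
    rl->tes[rl->num_tes++] = te;
    rl->sorted = false;         /* one insert invalidates the whole order */
}

static TocEntry *
pop_next_work_item(ReadyList *rl)
{
    if (rl->num_tes == 0)
        return NULL;
    if (!rl->sorted)
    {
        /* O(N log N) re-sort; with two constraints becoming ready after
         * every finished table, this fires on nearly every pop */
        qsort(rl->tes, rl->num_tes, sizeof(TocEntry *), size_compare);
        rl->sorted = true;
    }
    /* (the real code scans down from here for a non-conflicting item;
     * taking the last, largest element is the common case) */
    return rl->tes[--rl->num_tes];
}

With N tables the insert/re-sort/pop cycle repeats O(N) times over a
list of size O(N), which is where the N^2 * log(N) comes from.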
I'm not sure how big a deal this is in practice: in most situations
the individual jobs are larger than they are in this toy example,
plus the initial non-parallelizable part of the restore is a bigger
bottleneck anyway with this many tables. Still, we do have one
real-world complaint, so maybe we should look into improving it.
I wonder if we could replace the sorted ready-list with a priority heap,
although that might be complicated by the fact that pop_next_work_item
has to be capable of popping something that's not necessarily the
largest remaining job. Another idea would be to be a little less eager
about re-sorting: I suspect scheduling wouldn't get much worse in
practice if we only re-sorted every so often, say after some number of
insertions rather than before every pop.
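For whatever it's worth, the heap version doesn't look too bad on
paper. Here is a sketch only, built on the toy types above:
heap_insert, heap_pop, pop_next_runnable, and the conflicts callback
are invented names, and the fixed-size skip buffer is a hack. The
point is that skipped non-largest items can be popped aside and pushed
back, keeping a dispatch at O(k log N) for k skipped items rather
than a full re-sort:

static void
heap_swap(ReadyList *rl, int i, int j)
{
    TocEntry   *tmp = rl->tes[i];

    rl->tes[i] = rl->tes[j];
    rl->tes[j] = tmp;
}

static void
heap_insert(ReadyList *rl, TocEntry *te)
{
    int         i = rl->num_tes++;

    rl->tes[i] = te;
    /* sift up until the parent is at least as large */
    while (i > 0 && rl->tes[(i - 1) / 2]->dataLength < rl->tes[i]->dataLength)
    {
        heap_swap(rl, i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

static TocEntry *
heap_pop(ReadyList *rl)
{
    TocEntry   *top;
    int         i = 0;

    if (rl->num_tes == 0)
        return NULL;
    top = rl->tes[0];
    rl->tes[0] = rl->tes[--rl->num_tes];
    for (;;)                    /* sift down */
    {
        int         l = 2 * i + 1;
        int         r = 2 * i + 2;
        int         big = i;

        if (l < rl->num_tes && rl->tes[l]->dataLength > rl->tes[big]->dataLength)
            big = l;
        if (r < rl->num_tes && rl->tes[r]->dataLength > rl->tes[big]->dataLength)
            big = r;
        if (big == i)
            break;
        heap_swap(rl, i, big);
        i = big;
    }
    return top;
}

/*
 * pop_next_work_item must be able to skip the largest job (lock
 * conflicts with running workers): pop skipped items aside, then
 * push them back once a runnable one is found.
 */
static TocEntry *
pop_next_runnable(ReadyList *rl, bool (*conflicts) (TocEntry *te))
{
    TocEntry   *skipped[32];    /* arbitrary bound, fine for a sketch */
    int         nskipped = 0;
    TocEntry   *result = NULL;

    while (rl->num_tes > 0 && nskipped < 32)
    {
        TocEntry   *te = heap_pop(rl);

        if (!conflicts(te))
        {
            result = te;
            break;
        }
        skipped[nskipped++] = te;
    }
    while (nskipped > 0)
        heap_insert(rl, skipped[--nskipped]);
    return result;
}

The lazier-sorting alternative would be simpler still: keep the sorted
flag but only act on it every K pops, accepting slightly stale
ordering in between.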
I don't have time to pursue this right now, but perhaps someone
else would like to.
regards, tom lane
[1] https://www.postgresql.org/message-id/flat/CAEzn%3DHSPXi6OS-5KzGMcZeKzWKOOX1me2u2eCiGtMEZDz9Fqdg%40mail.gmail.com