Re: Inefficiency in parallel pg_restore with many tables - Mailing list pgsql-hackers
From: Andrew Dunstan
Subject: Re: Inefficiency in parallel pg_restore with many tables
Msg-id: cf80eacd-8446-2a25-370b-40b2eb1cf890@dunslane.net
In response to: Inefficiency in parallel pg_restore with many tables (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Inefficiency in parallel pg_restore with many tables
List: pgsql-hackers
On 2023-07-15 Sa 13:47, Tom Lane wrote:
I looked into the performance gripe at [1] about pg_restore not making effective use of parallel workers when there are a lot of tables. I was able to reproduce that by dumping and parallel restoring 100K tables made according to this script:

do $$
begin
for i in 1..100000 loop
  execute format('create table t%s (f1 int unique, f2 int unique);', i);
  execute format('insert into t%s select x, x from generate_series(1,1000) x', i);
  if i % 100 = 0 then commit; end if;
end loop;
end $$;

Once pg_restore reaches the parallelizable part of the restore, what I see is that the parent pg_restore process goes to 100% CPU while its children (and the server) mostly sit idle; that is, the task dispatch logic in pg_backup_archiver.c is unable to dispatch tasks fast enough to keep the children busy. A quick perf check showed most of the time being eaten by pg_qsort and TocEntrySizeCompare.

What I believe is happening is that we start the parallel restore phase with 100K TableData items that are ready to go (they are in the ready_list) and 200K AddConstraint items that are pending, because we make those have dependencies on the corresponding TableData so we don't build an index until after its table is populated. Each time one of the TableData items is completed by some worker, the two AddConstraint items for its table are moved from the pending_list to the ready_list --- and that means ready_list_insert marks the ready_list as no longer sorted. When we go to pop the next task from the ready_list, we re-sort that entire list first. So we spend something like O(N^2 * log(N)) time just sorting, if there are N tables. Clearly, this code is much less bright than it thinks it is (and that's all my fault, if memory serves).

I'm not sure how big a deal this is in practice: in most situations the individual jobs are larger than they are in this toy example, plus the initial non-parallelizable part of the restore is a bigger bottleneck anyway with this many tables. Still, we do have one real-world complaint, so maybe we should look into improving it.

I wonder if we could replace the sorted ready-list with a priority heap, although that might be complicated by the fact that pop_next_work_item has to be capable of popping something that's not necessarily the largest remaining job. Another idea could be to be a little less eager to sort the list every time; I think in practice scheduling wouldn't get much worse if we only re-sorted every so often.
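For concreteness, here is a minimal standalone sketch of what the priority-heap idea could look like. It is not the actual pg_backup_archiver.c code; the Task struct, the fixed-size heap, and the not_blocked() callback are invented stand-ins. The point it illustrates is that a max-heap keyed on data length can still meet pop_next_work_item's requirement of sometimes returning a non-maximal entry: pop candidates off the top, set aside the ones that cannot run yet (in the real code, because of lock conflicts), and push the skipped ones back afterwards.

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for a TOC entry; the real dispatcher works with TocEntry. */
typedef struct Task
{
    int         id;
    long        data_length;    /* priority: schedule the biggest jobs first */
} Task;

/* Fixed-capacity binary max-heap keyed on data_length. */
#define MAX_TASKS 1024
static Task *heap[MAX_TASKS];
static int  heap_count = 0;

static void
heap_push(Task *t)
{
    int     i = heap_count++;

    /* Sift up until the parent is at least as large. */
    while (i > 0 && heap[(i - 1) / 2]->data_length < t->data_length)
    {
        heap[i] = heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    heap[i] = t;
}

static Task *
heap_pop(void)
{
    Task   *top = heap[0];
    Task   *last = heap[--heap_count];
    int     i = 0;

    /* Sift the former last element down into place. */
    for (;;)
    {
        int     child = 2 * i + 1;

        if (child >= heap_count)
            break;
        if (child + 1 < heap_count &&
            heap[child + 1]->data_length > heap[child]->data_length)
            child++;
        if (last->data_length >= heap[child]->data_length)
            break;
        heap[i] = heap[child];
        i = child;
    }
    heap[i] = last;
    return top;
}

/* Pop the largest runnable task.  Candidates that cannot run yet are set
 * aside and pushed back afterwards, so skipping k entries costs O(k log n). */
static Task *
pop_next_runnable(bool (*runnable) (const Task *))
{
    Task   *skipped[MAX_TASKS];
    int     nskipped = 0;
    Task   *result = NULL;

    while (heap_count > 0)
    {
        Task   *t = heap_pop();

        if (runnable(t))
        {
            result = t;
            break;
        }
        skipped[nskipped++] = t;
    }
    while (nskipped > 0)
        heap_push(skipped[--nskipped]);
    return result;
}

/* Demo: pretend the odd-numbered tasks are blocked by a lock conflict. */
static bool
not_blocked(const Task *t)
{
    return t->id % 2 == 0;
}

int
main(void)
{
    Task    tasks[] = {{1, 500}, {2, 900}, {3, 100}, {4, 700}, {5, 300}, {6, 200}};
    Task   *t;
    int     i;

    for (i = 0; i < 6; i++)
        heap_push(&tasks[i]);
    while ((t = pop_next_runnable(not_blocked)) != NULL)
        printf("dispatch task %d (size %ld)\n", t->id, t->data_length);
    return 0;
}

The push-back step restores the heap property, so later pops still see the remaining jobs in size order; whether that bookkeeping is worth it compared with simply sorting less often is the trade-off discussed below.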
Yeah, I think that last idea (re-sorting less eagerly) is reasonable. Something like: if the number added since the last sort is more than min(50, list_length/4), then sort. That shouldn't be too invasive.
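To make that threshold concrete, here is a toy sketch; it is not the real ready-list code, and the ReadyList struct, the field names, and the exact placement of the min(50, list_length/4) test are assumptions. The shape is: inserts just append and bump a counter, and the pop path re-sorts only once the number of unsorted entries exceeds the threshold.

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-ins; the real code works with TocEntry and TocEntrySizeCompare. */
typedef struct Task
{
    long    data_length;
} Task;

typedef struct ReadyList
{
    Task  **items;
    int     first;              /* index of first valid entry */
    int     last;               /* index of last valid entry (inclusive) */
    int     num_unsorted;       /* entries appended since the last sort */
} ReadyList;

/* Sort descending by size so the biggest jobs are dispatched first. */
static int
task_size_compare(const void *a, const void *b)
{
    long    la = (*(Task *const *) a)->data_length;
    long    lb = (*(Task *const *) b)->data_length;

    return (la < lb) - (la > lb);
}

/* Appending no longer forces a full re-sort on the very next pop. */
static void
ready_list_insert(ReadyList *rl, Task *t)
{
    rl->items[++rl->last] = t;
    rl->num_unsorted++;
}

static Task *
pop_next_work_item(ReadyList *rl)
{
    int     nitems = rl->last - rl->first + 1;

    if (nitems <= 0)
        return NULL;

    /* Re-sort only once enough new entries have piled up to matter. */
    int     threshold = nitems / 4 < 50 ? nitems / 4 : 50;

    if (rl->num_unsorted > threshold)
    {
        qsort(rl->items + rl->first, nitems, sizeof(Task *), task_size_compare);
        rl->num_unsorted = 0;
    }

    /*
     * The real code also skips entries whose table locks conflict with
     * running jobs; here we simply take the front of the list, which may
     * not be strictly the largest job if appends happened since the sort.
     */
    return rl->items[rl->first++];
}

int
main(void)
{
    Task    tasks[] = {{300}, {900}, {100}, {700}, {500}};
    Task   *slots[5];
    ReadyList rl = {slots, 0, -1, 0};
    Task   *t;
    int     i;

    for (i = 0; i < 5; i++)
        ready_list_insert(&rl, &tasks[i]);
    while ((t = pop_next_work_item(&rl)) != NULL)
        printf("dispatch job of size %ld\n", t->data_length);
    return 0;
}

Between sorts the front of the list may not be strictly the largest remaining job; that small amount of scheduling slack is what the heuristic deliberately accepts in exchange for far fewer sorts.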
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com