Re: Inefficiency in parallel pg_restore with many tables - Mailing list pgsql-hackers
From: Andrew Dunstan
Subject: Re: Inefficiency in parallel pg_restore with many tables
Msg-id: cf80eacd-8446-2a25-370b-40b2eb1cf890@dunslane.net
In response to: Inefficiency in parallel pg_restore with many tables (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Inefficiency in parallel pg_restore with many tables
List: pgsql-hackers
On 2023-07-15 Sa 13:47, Tom Lane wrote:
I looked into the performance gripe at [1] about pg_restore not making effective use of parallel workers when there are a lot of tables. I was able to reproduce that by dumping and parallel restoring 100K tables made according to this script:

do $$
begin
for i in 1..100000 loop
  execute format('create table t%s (f1 int unique, f2 int unique);', i);
  execute format('insert into t%s select x, x from generate_series(1,1000) x', i);
  if i % 100 = 0 then commit; end if;
end loop;
end $$;

Once pg_restore reaches the parallelizable part of the restore, what I see is that the parent pg_restore process goes to 100% CPU while its children (and the server) mostly sit idle; that is, the task dispatch logic in pg_backup_archiver.c is unable to dispatch tasks fast enough to keep the children busy. A quick perf check showed most of the time being eaten by pg_qsort and TocEntrySizeCompare.

What I believe is happening is that we start the parallel restore phase with 100K TableData items that are ready to go (they are in the ready_list) and 200K AddConstraint items that are pending, because we make those have dependencies on the corresponding TableData so we don't build an index until after its table is populated. Each time one of the TableData items is completed by some worker, the two AddConstraint items for its table are moved from the pending_list to the ready_list --- and that means ready_list_insert marks the ready_list as no longer sorted. When we go to pop the next task from the ready_list, we re-sort that entire list first. So we spend something like O(N^2 * log(N)) time just sorting, if there are N tables. Clearly, this code is much less bright than it thinks it is (and that's all my fault, if memory serves).

I'm not sure how big a deal this is in practice: in most situations the individual jobs are larger than they are in this toy example, plus the initial non-parallelizable part of the restore is a bigger bottleneck anyway with this many tables. Still, we do have one real-world complaint, so maybe we should look into improving it.

I wonder if we could replace the sorted ready-list with a priority heap, although that might be complicated by the fact that pop_next_work_item has to be capable of popping something that's not necessarily the largest remaining job. Another idea could be to be a little less eager to sort the list every time; I think in practice scheduling wouldn't get much worse if we only re-sorted every so often.
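For concreteness, here is a minimal standalone sketch of what the priority-heap idea could look like. It is not the actual pg_backup_archiver.c code; the Task struct, the fixed-size heap, and the not_blocked() callback are invented stand-ins. The point it illustrates is that a max-heap keyed on data length can still meet pop_next_work_item's requirement of sometimes returning a non-maximal entry: pop candidates off the top, set aside the ones that cannot run yet (in the real code, because of lock conflicts), and push the skipped ones back afterwards.

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for a TOC entry; the real dispatcher works with TocEntry. */
typedef struct Task
{
    int         id;
    long        data_length;    /* priority: schedule the biggest jobs first */
} Task;

/* Fixed-capacity binary max-heap keyed on data_length. */
#define MAX_TASKS 1024
static Task *heap[MAX_TASKS];
static int  heap_count = 0;

static void
heap_push(Task *t)
{
    int     i = heap_count++;

    /* Sift up until the parent is at least as large. */
    while (i > 0 && heap[(i - 1) / 2]->data_length < t->data_length)
    {
        heap[i] = heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    heap[i] = t;
}

static Task *
heap_pop(void)
{
    Task   *top = heap[0];
    Task   *last = heap[--heap_count];
    int     i = 0;

    /* Sift the former last element down into place. */
    for (;;)
    {
        int     child = 2 * i + 1;

        if (child >= heap_count)
            break;
        if (child + 1 < heap_count &&
            heap[child + 1]->data_length > heap[child]->data_length)
            child++;
        if (last->data_length >= heap[child]->data_length)
            break;
        heap[i] = heap[child];
        i = child;
    }
    heap[i] = last;
    return top;
}

/* Pop the largest runnable task.  Candidates that cannot run yet are set
 * aside and pushed back afterwards, so skipping k entries costs O(k log n). */
static Task *
pop_next_runnable(bool (*runnable) (const Task *))
{
    Task   *skipped[MAX_TASKS];
    int     nskipped = 0;
    Task   *result = NULL;

    while (heap_count > 0)
    {
        Task   *t = heap_pop();

        if (runnable(t))
        {
            result = t;
            break;
        }
        skipped[nskipped++] = t;
    }
    while (nskipped > 0)
        heap_push(skipped[--nskipped]);
    return result;
}

/* Demo: pretend the odd-numbered tasks are blocked by a lock conflict. */
static bool
not_blocked(const Task *t)
{
    return t->id % 2 == 0;
}

int
main(void)
{
    Task    tasks[] = {{1, 500}, {2, 900}, {3, 100}, {4, 700}, {5, 300}, {6, 200}};
    Task   *t;
    int     i;

    for (i = 0; i < 6; i++)
        heap_push(&tasks[i]);
    while ((t = pop_next_runnable(not_blocked)) != NULL)
        printf("dispatch task %d (size %ld)\n", t->id, t->data_length);
    return 0;
}

The push-back step restores the heap property, so later pops still see the remaining jobs in size order; whether that bookkeeping is worth it compared with simply sorting less often is the trade-off discussed below.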
Yeah, I think that last idea (re-sorting less eagerly) is reasonable. Something like: if the number added since the last sort is more than min(50, list_length/4), then sort. That shouldn't be too invasive.
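To make that threshold concrete, here is a toy sketch; it is not the real ready-list code, and the ReadyList struct, the field names, and the exact placement of the min(50, list_length/4) test are assumptions. The shape is: inserts just append and bump a counter, and the pop path re-sorts only once the number of unsorted entries exceeds the threshold.

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-ins; the real code works with TocEntry and TocEntrySizeCompare. */
typedef struct Task
{
    long    data_length;
} Task;

typedef struct ReadyList
{
    Task  **items;
    int     first;              /* index of first valid entry */
    int     last;               /* index of last valid entry (inclusive) */
    int     num_unsorted;       /* entries appended since the last sort */
} ReadyList;

/* Sort descending by size so the biggest jobs are dispatched first. */
static int
task_size_compare(const void *a, const void *b)
{
    long    la = (*(Task *const *) a)->data_length;
    long    lb = (*(Task *const *) b)->data_length;

    return (la < lb) - (la > lb);
}

/* Appending no longer forces a full re-sort on the very next pop. */
static void
ready_list_insert(ReadyList *rl, Task *t)
{
    rl->items[++rl->last] = t;
    rl->num_unsorted++;
}

static Task *
pop_next_work_item(ReadyList *rl)
{
    int     nitems = rl->last - rl->first + 1;

    if (nitems <= 0)
        return NULL;

    /* Re-sort only once enough new entries have piled up to matter. */
    int     threshold = nitems / 4 < 50 ? nitems / 4 : 50;

    if (rl->num_unsorted > threshold)
    {
        qsort(rl->items + rl->first, nitems, sizeof(Task *), task_size_compare);
        rl->num_unsorted = 0;
    }

    /*
     * The real code also skips entries whose table locks conflict with
     * running jobs; here we simply take the front of the list, which may
     * not be strictly the largest job if appends happened since the sort.
     */
    return rl->items[rl->first++];
}

int
main(void)
{
    Task    tasks[] = {{300}, {900}, {100}, {700}, {500}};
    Task   *slots[5];
    ReadyList rl = {slots, 0, -1, 0};
    Task   *t;
    int     i;

    for (i = 0; i < 5; i++)
        ready_list_insert(&rl, &tasks[i]);
    while ((t = pop_next_work_item(&rl)) != NULL)
        printf("dispatch job of size %ld\n", t->data_length);
    return 0;
}

Between sorts the front of the list may not be strictly the largest remaining job; that small amount of scheduling slack is what the heuristic deliberately accepts in exchange for far fewer sorts.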
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com