Re: Fix pg_upgrade to preserve datdba - Mailing list pgsql-hackers

From: Jan Wieck
Subject: Re: Fix pg_upgrade to preserve datdba
Msg-id: a1e200e6-adde-2561-422b-a166ec084e3b@wi3ck.info
In response to: Re: Fix pg_upgrade to preserve datdba (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On 3/21/21 3:56 PM, Tom Lane wrote:
> Jan Wieck <jan@wi3ck.info> writes:
>> So let's focus on the actual problem of running out of XIDs and memory 
>> while doing the upgrade involving millions of small large objects.
> 
> Right.  So as far as --single-transaction vs. --create goes, that's
> mostly a definitional problem.  As long as the contents of a DB are
> restored in one transaction, it's not gonna matter if we eat one or
> two more XIDs while creating the DB itself.  So we could either
> relax pg_restore's complaint, or invent a different switch that's
> named to acknowledge that it's not really only one transaction.
> 
> That still leaves us with the lots-o-locks problem.  However, once
> we've crossed the Rubicon of "it's not really only one transaction",
> you could imagine that the switch is "--fewer-transactions", and the
> idea is for pg_restore to commit after every (say) 100000 operations.
> That would both bound its lock requirements and greatly cut its XID
> consumption.

That leaves us with three problems:

1) a tremendous number of locks
2) a tremendous amount of memory needed
3) taking forever because it is single-threaded.

I created a pathological case here on a VM with 24GB of RAM and 80GB of 
swap sitting on NVMe. The database has 20 million large objects, each of 
which has 2 GRANTs, 1 COMMENT and 1 SECURITY LABEL (dummy provider). 
Each LO contains only the string "large object <oid>", so the whole 
database on 9.5 is about 15GB in size.
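
For illustration, a batch of such large objects can be created with 
something along these lines (just a sketch, not the exact script; it 
assumes roles reader1 and writer1 exist and the dummy_seclabel test 
module is installed as the label provider):

DO $$
DECLARE
    loid oid;
BEGIN
    -- one batch of 10,000 in a single transaction; repeat from the
    -- client until reaching 20 million
    -- (assumes roles reader1/writer1 and the dummy_seclabel provider)
    FOR i IN 1..10000 LOOP
        loid := lo_create(0);   -- let the server pick the OID
        PERFORM lo_put(loid, 0, convert_to('large object ' || loid, 'UTF8'));
        EXECUTE format('GRANT SELECT ON LARGE OBJECT %s TO reader1', loid);
        EXECUTE format('GRANT UPDATE ON LARGE OBJECT %s TO writer1', loid);
        EXECUTE format('COMMENT ON LARGE OBJECT %s IS ''test blob''', loid);
        EXECUTE format('SECURITY LABEL FOR dummy ON LARGE OBJECT %s IS ''unclassified''', loid);
    END LOOP;
END
$$;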

A stock pg_upgrade to version 14devel using --link takes about 15 hours. 
This is partly because the pg_dump and pg_restore processes each grow to 
something like 50GB+ just to hold the TOC, which seems out of proportion 
considering that the entire system catalog on disk is less than 15GB. 
But aside from the ridiculous amount of swapping, the whole thing also 
suffers from consuming about 80 million transactions and apparently 
making just as many network round trips over a single client connection.

> 
> The work you described sounded like it could fit into that paradigm,
> with the additional ability to run some parallel restore tasks
> that are each consuming a bounded number of locks.

I have attached a POC patch that implements two new options for pg_upgrade.

   --restore-jobs=NUM             --jobs parameter passed to pg_restore
   --restore-blob-batch-size=NUM  number of blobs restored in one xact

It does a bit more than just that. It rearranges the way large objects 
are dumped so that most of the commands end up in one TOC entry, and 
that entry is emitted into SECTION_DATA when in binary upgrade mode 
(which guarantees that there isn't any actual BLOB data in the dump). 
This greatly reduces the number of network round trips and, with 8 
parallel restore jobs, almost saturates the 4-core VM. Reducing the 
number of TOC entries also cuts the total virtual memory footprint of 
pg_restore down to 15GB, so there is a lot less swapping going on.
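
To give an idea of the shape, the per-object metadata that gets batched 
up boils down to roughly the following (illustrative only; OID and role 
names are placeholders, the real commands are generated by the attached 
patch), repeated for every blob in the batch and committed once per 
--restore-blob-batch-size objects:

SELECT pg_catalog.lo_create('16384');  -- placeholder OID, preserved from the old cluster
ALTER LARGE OBJECT 16384 OWNER TO lo_owner;
GRANT SELECT ON LARGE OBJECT 16384 TO reader1;
GRANT UPDATE ON LARGE OBJECT 16384 TO writer1;
COMMENT ON LARGE OBJECT 16384 IS 'test blob';
SECURITY LABEL FOR dummy ON LARGE OBJECT 16384 IS 'unclassified';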

It cuts the pg_upgrade time down from 15 hours to 1.5 hours. In that run 
I used --restore-jobs=8 and --restore-blob-batch-size=10000 (with 
max_locks_per_transaction=12000).
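
For the record, an invocation with those settings would look roughly 
like this (binary and data directory paths are placeholders, not the 
actual setup):

pg_upgrade --link \
    --old-bindir=/usr/pgsql-9.5/bin --new-bindir=/usr/pgsql-14/bin \
    --old-datadir=/data/pg95 --new-datadir=/data/pg14 \
    --restore-jobs=8 --restore-blob-batch-size=10000

# the new cluster's postgresql.conf needs max_locks_per_transaction
# raised to cover the blob batch size (12000 here for a batch of 10000)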

As said, this isn't a "one size fits all" solution. The right pg_upgrade 
settings for --jobs and --restore-jobs will really depend on the 
situation: hundreds of small databases want --jobs, while one database 
with millions of large objects wants --restore-jobs.


Regards, Jan

-- 
Jan Wieck
Principal Database Engineer
Amazon Web Services

