Thread: Re: pg_upgrade failing for 200+ million Large Objects
On 3/8/21 11:58 AM, Tom Lane wrote:
> The answer up to now has been "raise max_locks_per_transaction enough
> so you don't see the failure". Having now consumed a little more
> caffeine, I remember that that works in pg_upgrade scenarios too,
> since the user can fiddle with the target cluster's postgresql.conf
> before starting pg_upgrade.
>
> So it seems like the path of least resistance is
>
> (a) make pg_upgrade use --single-transaction when calling pg_restore
>
> (b) document (better) how to get around too-many-locks failures.

That would first require fixing how pg_upgrade creates the databases. It uses "pg_restore --create", which is mutually exclusive with --single-transaction because we cannot create a database inside of a transaction. Along the way pg_upgrade also mangles pg_database.datdba (all databases are owned by postgres after an upgrade; I will submit a separate patch for that, as I consider it a bug by itself).

All that aside, the entire approach doesn't scale.

In a hacked-up pg_upgrade that runs "createdb" first and then calls pg_restore with --single-transaction, I can upgrade 1M large objects with

    max_locks_per_transaction = 5300
    max_connections = 100

which contradicts the docs: that configuration should only have room for 530,000 locks, not 1M, so I need to find out where that math went off the rails. The same test fails with max_locks_per_transaction = 5200.

But this would mean that one has to raise max_locks_per_transaction to something like 530,000 at 100 max_connections in order to actually run a successful upgrade of 100M large objects. That configuration requires 26GB of memory just for locks. Add to that the memory pg_restore needs to load the entire TOC before restoring even a single object.

Not going to work. But tests are still ongoing ...

Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
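To put concrete numbers on the lock math above, here is an illustrative sketch. The slot formula is the documented one; the ~500 bytes per slot is an assumed round figure back-derived from the 26GB estimate, not a PostgreSQL constant:

```python
# Shared lock-table capacity, per the documented formula:
#   slots = max_locks_per_transaction * (max_connections + max_prepared_transactions)

def lock_table_slots(max_locks_per_transaction, max_connections,
                     max_prepared_transactions=0):
    return max_locks_per_transaction * (max_connections + max_prepared_transactions)

# The 1M-blob test configuration above:
assert lock_table_slots(5300, 100) == 530_000

# Scaling linearly to 100M blobs implies roughly:
slots_100m = lock_table_slots(530_000, 100)   # 53,000,000 slots

# At an assumed ~500 bytes of shared memory per slot (hypothetical figure),
# that lands in the same ballpark as the 26GB quoted above:
approx_gb = slots_100m * 500 / 1024**3
print(round(approx_gb, 1))
```

The per-slot byte count is the shaky assumption here; the point is only that the memory requirement grows linearly with the blob count.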
On 3/20/21 12:39 AM, Jan Wieck wrote:
> All that aside, the entire approach doesn't scale.
> [...]
> Not going to work. But tests are still ongoing ...
I thought Tom's suggestion upthread:

> Would it be sane to have the backend not bother to
> take any locks in binary-upgrade mode?

was interesting. Could we do that on the restore side? After all, what are we locking against in binary upgrade mode?

cheers

andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Jan Wieck <jan@wi3ck.info> writes:
> On 3/8/21 11:58 AM, Tom Lane wrote:
>> So it seems like the path of least resistance is
>> (a) make pg_upgrade use --single-transaction when calling pg_restore
>> (b) document (better) how to get around too-many-locks failures.

> That would first require to fix how pg_upgrade is creating the
> databases. It uses "pg_restore --create", which is mutually exclusive
> with --single-transaction because we cannot create a database inside of
> a transaction.

Ugh.

> All that aside, the entire approach doesn't scale.

Yeah, agreed. When we gave large objects individual ownership and ACL info, it was argued that pg_dump could afford to treat each one as a separate TOC entry because "you wouldn't have that many of them, if they're large". The limits of that approach were obvious even at the time, and I think now we're starting to see people for whom it really doesn't work.

I wonder if pg_dump could improve matters cheaply by aggregating the large objects by owner and ACL contents. That is, do

    select distinct lomowner, lomacl from pg_largeobject_metadata;

and make just *one* BLOB TOC entry for each result. Then dump out all the matching blobs under that heading.

A possible objection is that it'd reduce the ability to restore blobs selectively, so maybe we'd need to make it optional.

Of course, that just reduces the memory consumption on the client side; it does nothing for the locks. Can we get away with releasing the lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?

			regards, tom lane
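The aggregation idea can be sketched in miniature like this (a toy model, not pg_dump code; the tuples below are hypothetical stand-ins for pg_largeobject_metadata rows):

```python
from collections import defaultdict

# Hypothetical metadata rows: (oid, lomowner, lomacl)
rows = [
    (1001, "alice", None),
    (1002, "alice", None),
    (1003, "bob", "{bob=rw/bob}"),
    (1004, "alice", None),
]

# One TOC entry per distinct (owner, acl) pair, instead of one per blob.
toc_entries = defaultdict(list)
for oid, owner, acl in rows:
    toc_entries[(owner, acl)].append(oid)

print(len(toc_entries))                      # 2 entries instead of 4
print(sorted(toc_entries[("alice", None)]))  # [1001, 1002, 1004]
```

With millions of blobs but only a handful of distinct (owner, acl) combinations, the TOC shrinks from O(blobs) to O(distinct combinations), which is where the client-side memory saving comes from.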
On Sat, Mar 20, 2021 at 11:23:19AM -0400, Tom Lane wrote:
> I wonder if pg_dump could improve matters cheaply by aggregating the
> large objects by owner and ACL contents. That is, do
>
> select distinct lomowner, lomacl from pg_largeobject_metadata;
>
> and make just *one* BLOB TOC entry for each result. Then dump out
> all the matching blobs under that heading.
>
> A possible objection is that it'd reduce the ability to restore blobs
> selectively, so maybe we'd need to make it optional.
>
> Of course, that just reduces the memory consumption on the client
> side; it does nothing for the locks. Can we get away with releasing the
> lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?

Well, in pg_upgrade mode you can, since there are no other cluster users, but you might be asking for general pg_dump usage.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.
Bruce Momjian <bruce@momjian.us> writes:
> On Sat, Mar 20, 2021 at 11:23:19AM -0400, Tom Lane wrote:
>> Of course, that just reduces the memory consumption on the client
>> side; it does nothing for the locks. Can we get away with releasing the
>> lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?

> Well, in pg_upgrade mode you can, since there are no other cluster
> users, but you might be asking for general pg_dump usage.

Yeah, this problem doesn't only affect pg_upgrade scenarios, so it'd really be better to find a way that isn't dependent on binary-upgrade mode.

			regards, tom lane
On 3/20/21 11:23 AM, Tom Lane wrote:
> Jan Wieck <jan@wi3ck.info> writes:
>> All that aside, the entire approach doesn't scale.
>
> Yeah, agreed. When we gave large objects individual ownership and ACL
> info, it was argued that pg_dump could afford to treat each one as a
> separate TOC entry because "you wouldn't have that many of them, if
> they're large". The limits of that approach were obvious even at the
> time, and I think now we're starting to see people for whom it really
> doesn't work.

It actually looks more like some users have millions of "small objects". I am still wondering where that is coming from and why they are abusing LOs in that way, but that is more out of curiosity. Fact is that they are out there and that they cannot upgrade from their 9.5 databases, which are now past EOL.

> I wonder if pg_dump could improve matters cheaply by aggregating the
> large objects by owner and ACL contents. That is, do
>
> select distinct lomowner, lomacl from pg_largeobject_metadata;
>
> and make just *one* BLOB TOC entry for each result. Then dump out
> all the matching blobs under that heading.

What I am currently experimenting with is moving the BLOB TOC entries into the parallel data phase of pg_restore "when doing binary upgrade". It seems to scale nicely with the number of cores in the system. In addition to that, I have options for pg_upgrade and pg_restore that cause the restore to batch the blobs into transactions, like 10,000 objects at a time. There was a separate thread for that, but I guess it is better to keep it all together here now.

> A possible objection is that it'd reduce the ability to restore blobs
> selectively, so maybe we'd need to make it optional.

I fully intend to make all of this into new "options". I am afraid there is no one-size-fits-all solution here.

> Of course, that just reduces the memory consumption on the client
> side; it does nothing for the locks. Can we get away with releasing the
> lock immediately after doing an ALTER OWNER or GRANT/REVOKE on a blob?

I'm not very fond of the idea of going lockless while at the same time trying to parallelize the restore phase. That can lead to really nasty race conditions. For now I'm aiming at batches in transactions.

Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
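The "batches in transactions" idea can be sketched as follows. This is a standalone model with a stub executor, not pg_restore code; the per-blob statement is illustrative, not the exact commands pg_restore emits:

```python
def restore_blobs_batched(blob_oids, execute, batch_size=10_000):
    """Group per-blob restore commands into transactions of batch_size
    objects, so lock usage and XID consumption per transaction stay bounded."""
    in_batch = 0
    for oid in blob_oids:
        if in_batch == 0:
            execute("BEGIN")
        execute(f"SELECT lo_create({oid})")   # illustrative per-blob work
        in_batch += 1
        if in_batch >= batch_size:
            execute("COMMIT")
            in_batch = 0
    if in_batch > 0:
        execute("COMMIT")

# Stub executor that just records statements, so we can count transactions.
stmts = []
restore_blobs_batched(range(25_000), stmts.append, batch_size=10_000)
commits = stmts.count("COMMIT")
print(commits)  # 3 transactions for 25,000 blobs at 10,000 per batch
```

Each transaction holds at most batch_size locks and consumes one XID, which is the trade-off between the one-transaction-per-command and single-transaction extremes.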
On 3/20/21 12:55 PM, Jan Wieck wrote:
> It actually looks more like some users have millions of "small
> objects". I am still wondering where that is coming from and why they
> are abusing LOs in that way, but that is more out of curiosity. Fact
> is that they are out there and that they cannot upgrade from their 9.5
> databases, which are now past EOL.

One possible (probable?) source is the JDBC driver, which currently treats all Blobs (and Clobs, for that matter) as LOs. I'm working on improving that some: <https://github.com/pgjdbc/pgjdbc/pull/2093>

cheers

andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Fix pg_upgrade to preserve datdba (was: Re: pg_upgrade failing for 200+ million Large Objects)
From: Jan Wieck
On 3/20/21 12:39 AM, Jan Wieck wrote:
> On the way pg_upgrade also mangles the pg_database.datdba
> (all databases are owned by postgres after an upgrade; will submit a
> separate patch for that as I consider that a bug by itself).

Patch attached.

Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
On 3/21/21 7:47 AM, Andrew Dunstan wrote:
> One possible (probable?) source is the JDBC driver, which currently
> treats all Blobs (and Clobs, for that matter) as LOs. I'm working on
> improving that some: <https://github.com/pgjdbc/pgjdbc/pull/2093>

You mean the user is using OID columns pointing to large objects, and the JDBC driver is mapping those for streaming operations?

Yeah, that would explain a lot.

Thanks, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
Re: Fix pg_upgrade to preserve datdba (was: Re: pg_upgrade failing for 200+ million Large Objects)
From: Tom Lane
Jan Wieck <jan@wi3ck.info> writes:
> On 3/20/21 12:39 AM, Jan Wieck wrote:
>> On the way pg_upgrade also mangles the pg_database.datdba
>> (all databases are owned by postgres after an upgrade; will submit a
>> separate patch for that as I consider that a bug by itself).

> Patch attached.

Hmm, doesn't this lose all *other* database-level properties?

I think maybe what we have here is a bug in pg_restore: its --create switch ought to be trying to update the database's ownership.

			regards, tom lane
On 3/21/21 12:57 PM, Tom Lane wrote:
> Hmm, doesn't this lose all *other* database-level properties?
>
> I think maybe what we have here is a bug in pg_restore, its
> --create switch ought to be trying to update the database's
> ownership.

Possibly. I didn't look into that route.

Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
On 3/21/21 1:15 PM, Jan Wieck wrote:
> On 3/21/21 12:57 PM, Tom Lane wrote:
>> I think maybe what we have here is a bug in pg_restore, its
>> --create switch ought to be trying to update the database's
>> ownership.
>
> Possibly. I didn't look into that route.

Thanks for that. I like this patch a lot better.

Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
On 3/21/21 12:56 PM, Jan Wieck wrote:
> On 3/21/21 7:47 AM, Andrew Dunstan wrote:
>> One possible (probable?) source is the JDBC driver, which currently
>> treats all Blobs (and Clobs, for that matter) as LOs. I'm working on
>> improving that some: <https://github.com/pgjdbc/pgjdbc/pull/2093>
>
> You mean the user is using OID columns pointing to large objects and
> the JDBC driver is mapping those for streaming operations?
>
> Yeah, that would explain a lot.

Probably. In most cases the database is designed by Hibernate, and the front-end programmers know nothing at all of OIDs or LOs; they just ask for and get a Blob.

cheers

andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Jan Wieck <jan@wi3ck.info> writes:
>> On 3/21/21 12:57 PM, Tom Lane wrote:
>>> I think maybe what we have here is a bug in pg_restore, its
>>> --create switch ought to be trying to update the database's
>>> ownership.

> Thanks for that. I like this patch a lot better.

Needs a little more work than that --- we should allow it to respond to the --no-owner switch, for example. But I think likely we can do it where other object ownership is handled. I'll look in a bit.

			regards, tom lane
I wrote:
> Needs a little more work than that --- we should allow it to respond
> to the --no-owner switch, for example. But I think likely we can do
> it where other object ownership is handled. I'll look in a bit.

Actually ... said code already DOES do that, so now I'm confused. I tried

regression=# create user joe;
CREATE ROLE
regression=# create database joe owner joe;
CREATE DATABASE
regression=# \q
$ pg_dump -Fc joe >joe.dump
$ pg_restore --create -f - joe.dump | more

and I see

--
-- Name: joe; Type: DATABASE; Schema: -; Owner: joe
--

CREATE DATABASE joe WITH TEMPLATE = template0 ENCODING = 'SQL_ASCII' LOCALE = 'C';

ALTER DATABASE joe OWNER TO joe;

so at least in this case it's doing the right thing. We need a bit more detail about the context in which it's doing the wrong thing for you.

			regards, tom lane
I wrote:
> ... so at least in this case it's doing the right thing. We need a bit
> more detail about the context in which it's doing the wrong thing
> for you.

Just to cross-check, I tried modifying pg_upgrade's regression test as attached, and it still passes. (And inspection of the leftover dump2.sql file verifies that the database ownership was correct.) So I'm not sure what's up here.

			regards, tom lane

diff --git a/src/bin/pg_upgrade/test.sh b/src/bin/pg_upgrade/test.sh
index 9c6deae294..436646b5ba 100644
--- a/src/bin/pg_upgrade/test.sh
+++ b/src/bin/pg_upgrade/test.sh
@@ -150,6 +150,9 @@ export EXTRA_REGRESS_OPTS
 standard_initdb "$oldbindir"/initdb
 "$oldbindir"/pg_ctl start -l "$logdir/postmaster1.log" -o "$POSTMASTER_OPTS" -w
 
+# Create another user (just to exercise database ownership restoration).
+createuser regression_dbowner || createdb_status=$?
+
 # Create databases with names covering the ASCII bytes other than NUL, BEL,
 # LF, or CR.  BEL would ring the terminal bell in the course of this test, and
 # it is not otherwise a special case.  PostgreSQL doesn't support the rest.
@@ -160,7 +163,7 @@ dbname1='\"\'$dbname1'\\"\\\'
 dbname2=`awk 'BEGIN { for (i = 46; i < 91; i++) printf "%c", i }' </dev/null`
 dbname3=`awk 'BEGIN { for (i = 91; i < 128; i++) printf "%c", i }' </dev/null`
 createdb "regression$dbname1" || createdb_status=$?
-createdb "regression$dbname2" || createdb_status=$?
+createdb --owner=regression_dbowner "regression$dbname2" || createdb_status=$?
 createdb "regression$dbname3" || createdb_status=$?
 
 if "$MAKE" -C "$oldsrc" installcheck-parallel; then
@@ -227,7 +230,7 @@ PGDATA="$BASE_PGDATA" standard_initdb 'initdb'
 
-pg_upgrade $PG_UPGRADE_OPTS -d "${PGDATA}.old" -D "$PGDATA" -b "$oldbindir" -p "$PGPORT" -P "$PGPORT"
+pg_upgrade $PG_UPGRADE_OPTS -d "${PGDATA}.old" -D "$PGDATA" -b "$oldbindir" -p "$PGPORT" -P "$PGPORT" -j 4
 
 # make sure all directories and files have group permissions, on Unix hosts
 # Windows hosts don't support Unix-y permissions.
On 3/21/21 2:34 PM, Tom Lane wrote:
> and I see
>
> --
> -- Name: joe; Type: DATABASE; Schema: -; Owner: joe
> --
>
> CREATE DATABASE joe WITH TEMPLATE = template0 ENCODING = 'SQL_ASCII' LOCALE = 'C';
>
> ALTER DATABASE joe OWNER TO joe;
>
> so at least in this case it's doing the right thing. We need a bit
> more detail about the context in which it's doing the wrong thing
> for you.

After moving all of this to a pristine postgresql.org-based repo I see the same. My best guess at this point is that the permission hoops that RDS and Aurora PostgreSQL jump through were messing with this. But that has nothing to do with the actual topic.

So let's focus on the actual problem of running out of XIDs and memory while doing an upgrade involving millions of small large objects.

Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
Jan Wieck <jan@wi3ck.info> writes:
> So let's focus on the actual problem of running out of XIDs and memory
> while doing the upgrade involving millions of small large objects.

Right. So as far as --single-transaction vs. --create goes, that's mostly a definitional problem. As long as the contents of a DB are restored in one transaction, it's not gonna matter if we eat one or two more XIDs while creating the DB itself. So we could either relax pg_restore's complaint, or invent a different switch that's named to acknowledge that it's not really only one transaction.

That still leaves us with the lots-o-locks problem. However, once we've crossed the Rubicon of "it's not really only one transaction", you could imagine that the switch is "--fewer-transactions", and the idea is for pg_restore to commit after every (say) 100000 operations. That would both bound its lock requirements and greatly cut its XID consumption.

The work you described sounded like it could fit into that paradigm, with the additional ability to run some parallel restore tasks that are each consuming a bounded number of locks.

			regards, tom lane
On 3/21/21 3:56 PM, Tom Lane wrote:
> Right. So as far as --single-transaction vs. --create goes, that's
> mostly a definitional problem. As long as the contents of a DB are
> restored in one transaction, it's not gonna matter if we eat one or
> two more XIDs while creating the DB itself. So we could either
> relax pg_restore's complaint, or invent a different switch that's
> named to acknowledge that it's not really only one transaction.
>
> That still leaves us with the lots-o-locks problem. However, once
> we've crossed the Rubicon of "it's not really only one transaction",
> you could imagine that the switch is "--fewer-transactions", and the
> idea is for pg_restore to commit after every (say) 100000 operations.
> That would both bound its lock requirements and greatly cut its XID
> consumption.

It leaves us with three things:

1) tremendous amounts of locks
2) tremendous amounts of memory needed
3) taking forever because it is single-threaded

I created a pathological case here on a VM with 24GB of RAM and 80GB of swap sitting on NVMe. The database has 20 million large objects, each of which has 2 GRANTs, 1 COMMENT and 1 SECURITY LABEL (dummy). Each LO contains only the string "large object <oid>", so the whole database in 9.5 is about 15GB in size.

A stock pg_upgrade to version 14devel using --link takes about 15 hours. This is partly because pg_dump and pg_restore both grow to 50GB+ to hold the TOC, which sounds out of touch considering that the entire system catalog on disk is less than 15GB. But aside from the ridiculous amount of swapping, the whole thing also suffers from consuming about 80 million transactions and apparently having just as many network round trips, with a single client.

> The work you described sounded like it could fit into that paradigm,
> with the additional ability to run some parallel restore tasks
> that are each consuming a bounded number of locks.

I have attached a POC patch that implements two new options for pg_upgrade:

    --restore-jobs=NUM              --jobs parameter passed to pg_restore
    --restore-blob-batch-size=NUM   number of blobs restored in one xact

It does a bit more than just that. It rearranges the way large objects are dumped so that most of the commands are all in one TOC entry, and the entry is emitted into SECTION_DATA when in binary-upgrade mode (which guarantees that there isn't any actual BLOB data in the dump). This greatly reduces the number of network round trips, and when using 8 parallel restore jobs it almost saturates the 4-core VM. Reducing the number of TOC entries also reduces the total virtual memory need of pg_restore to 15GB, so there is a lot less swapping going on.

It cuts down the pg_upgrade time from 15 hours to 1.5 hours. In that run I used --restore-jobs=8 and --restore-blob-batch-size=10000 (with max_locks_per_transaction=12000).

As said, this isn't a "one size fits all" solution. The pg_upgrade parameters for --jobs and --restore-jobs will really depend on the situation. Hundreds of small databases want --jobs, but one database with millions of large objects wants --restore-jobs.

Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
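The transaction arithmetic behind those numbers can be checked directly. This is an illustrative back-of-the-envelope sketch: it assumes the ~80 million figure comes from one autocommitted command per blob for each of the 2 GRANTs, the COMMENT and the SECURITY LABEL described above:

```python
blobs = 20_000_000
cmds_per_blob = 4          # 2 GRANTs + 1 COMMENT + 1 SECURITY LABEL

# Stock pg_restore: roughly one transaction (and one round trip) per command.
stock_xacts = blobs * cmds_per_blob
print(stock_xacts)  # 80,000,000

# With all commands for one blob combined into a single multi-statement
# query and 10,000 blobs committed per transaction:
batch_size = 10_000
batched_xacts = -(-blobs // batch_size)   # ceiling division
print(batched_xacts)  # 2,000 transactions instead of 80 million
```

The round-trip count drops by the same factors, which is why the batched, parallel restore is the part that turns 15 hours into 1.5.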
Hi,

w.r.t. pg_upgrade_improvements.v2.diff:

+ blobBatchCount = 0;
+ blobInXact = false;

The count and bool flag are always reset in tandem. It seems variable blobInXact is not needed.

Cheers
On 3/22/21 5:36 PM, Zhihong Yu wrote:
> Hi,
>
> w.r.t. pg_upgrade_improvements.v2.diff.
>
> + blobBatchCount = 0;
> + blobInXact = false;
>
> The count and bool flag are always reset in tandem. It seems
> variable blobInXact is not needed.

You are right. I will fix that.

Thanks, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
On 3/22/21 7:18 PM, Jan Wieck wrote:
> On 3/22/21 5:36 PM, Zhihong Yu wrote:
>> The count and bool flag are always reset in tandem. It seems
>> variable blobInXact is not needed.
>
> You are right. I will fix that.

New patch v3 attached.

Thanks, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
On Tue, Mar 23, 2021 at 08:51:32AM -0400, Jan Wieck wrote:
> [...]
> New patch v3 attached.

Would it be better to allow pg_upgrade to pass arbitrary arguments to pg_restore, instead of just these specific ones?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.
On 3/23/21 10:56 AM, Bruce Momjian wrote:
> Would it be better to allow pg_upgrade to pass arbitrary arguments to
> pg_restore, instead of just these specific ones?

That would mean arbitrary parameters to pg_dump as well as pg_restore. But yes, that would probably be better in the long run.

Any suggestion as to what that would actually look like? Unfortunately pg_restore already uses -[dDoOr], so it doesn't look like there will be any naturally intelligible short options for that.

Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
On Tue, Mar 23, 2021 at 01:25:15PM -0400, Jan Wieck wrote:
> That would mean arbitrary parameters to pg_dump as well as pg_restore. But
> yes, that would probably be better in the long run.
>
> Any suggestion as to how that would actually look like? Unfortunately
> pg_restore has -[dDoOr] already used, so it doesn't look like there will be
> any naturally intelligible short options for that.

We have the postmaster which can pass arbitrary arguments to postgres processes using -o.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.
On 3/23/21 2:06 PM, Bruce Momjian wrote:
> We have the postmaster which can pass arbitrary arguments to postgres
> processes using -o.

Right, and -o is already taken in pg_upgrade for sending options to the old postmaster.

What we are looking for are options for sending options to pg_dump and pg_restore, which are not postmasters or children of postmaster, but rather clients. There is no option for sending options to clients of postmasters.

So the question remains: how do we name this?

    --pg-dump-options "<string>"
    --pg-restore-options "<string>"

where "<string>" could be something like "--whatever[=NUM] [...]" would be unambiguous.

Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
On Tue, Mar 23, 2021 at 02:23:03PM -0400, Jan Wieck wrote:
> So the question remains, how do we name this?
>
> --pg-dump-options "<string>"
> --pg-restore-options "<string>"
>
> where "<string>" could be something like "--whatever[=NUM] [...]" would be
> something unambiguous.

Sure. I don't think the letter you use is a problem.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.
Jan Wieck <jan@wi3ck.info> writes:
> So the question remains, how do we name this?

> --pg-dump-options "<string>"
> --pg-restore-options "<string>"

If you're passing multiple options, that is

    --pg-dump-options "--foo=x --bar=y"

it seems just horribly fragile. Lose the double quotes and suddenly --bar is a separate option to pg_upgrade itself, not part of the argument for the previous option. That's pretty easy to do when passing things through shell scripts, too. So it'd likely be safer to write

    --pg-dump-option=--foo=x --pg-dump-option=--bar=y

which requires pg_upgrade to allow aggregating multiple options, but you'd probably want it to act that way anyway.

			regards, tom lane
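The fragility described here is easy to demonstrate with POSIX word splitting; Python's shlex follows the same rules and serves as a stand-in for a shell here (the option names are the ones proposed above, not existing pg_upgrade switches):

```python
import shlex

# Quoted correctly, the whole string is one argument to pg_upgrade:
argv = ["pg_upgrade"] + shlex.split('--pg-dump-options "--foo=x --bar=y"')
print(argv)  # ['pg_upgrade', '--pg-dump-options', '--foo=x --bar=y']

# Lose the inner quotes somewhere in a wrapper script, and --bar=y
# suddenly becomes a separate option to pg_upgrade itself:
argv_broken = ["pg_upgrade"] + shlex.split('--pg-dump-options --foo=x --bar=y')
print(argv_broken)  # ['pg_upgrade', '--pg-dump-options', '--foo=x', '--bar=y']

# The repeatable one-option-per-switch form has no such failure mode,
# because each word already carries its own option name:
argv_safe = ["pg_upgrade"] + shlex.split('--pg-dump-option=--foo=x --pg-dump-option=--bar=y')
print(argv_safe)
```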
On 3/23/21 2:35 PM, Tom Lane wrote:
> If you're passing multiple options, that is
>
> --pg-dump-options "--foo=x --bar=y"
>
> it seems just horribly fragile. Lose the double quotes and suddenly
> --bar is a separate option to pg_upgrade itself, not part of the argument
> for the previous option. That's pretty easy to do when passing things
> through shell scripts, too. So it'd likely be safer to write
>
> --pg-dump-option=--foo=x --pg-dump-option=--bar=y
>
> which requires pg_upgrade to allow aggregating multiple options,
> but you'd probably want it to act that way anyway.

... which would all be really easy if pg_upgrade weren't itself assembling a shell script string to pass into parallel_exec_prog(). But I will see what I can do ...

Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
Jan Wieck <jan@wi3ck.info> writes:
> On 3/23/21 2:35 PM, Tom Lane wrote:
>> If you're passing multiple options, that is
>> --pg-dump-options "--foo=x --bar=y"
>> it seems just horribly fragile. Lose the double quotes and suddenly
>> --bar is a separate option to pg_upgrade itself, not part of the argument
>> for the previous option. That's pretty easy to do when passing things
>> through shell scripts, too.

> ... which would be all really easy if pg_upgrade wouldn't be assembling
> a shell script string to pass into parallel_exec_prog() by itself.

No, what I was worried about is shell script(s) that invoke pg_upgrade and have to pass down some of these options through multiple levels of option parsing.

BTW, it doesn't seem like the "pg-" prefix has any value-add here, so maybe "--dump-option" and "--restore-option" would be suitable spellings.

			regards, tom lane
On 3/23/21 2:59 PM, Tom Lane wrote:
> No, what I was worried about is shell script(s) that invoke pg_upgrade
> and have to pass down some of these options through multiple levels of
> option parsing.

The problem here is that pg_upgrade itself is invoking a shell again. It is not assembling an array of arguments to pass into exec*(). I'd be a happy camper if it did the latter. But as things are, we'd have to add full shell escaping for arbitrary strings.

> BTW, it doesn't seem like the "pg-" prefix has any value-add here,
> so maybe "--dump-option" and "--restore-option" would be suitable
> spellings.

Agreed.

Regards, Jan
--
Jan Wieck
Principal Database Engineer
Amazon Web Services
Jan Wieck <jan@wi3ck.info> writes: > The problem here is that pg_upgrade itself is invoking a shell again. It > is not assembling an array of arguments to pass into exec*(). I'd be a > happy camper if it did the latter. But as things are we'd have to add > full shell escaping for arbitrary strings. Surely we need that (and have it already) anyway? I think we've stayed away from exec* because we'd have to write an emulation for Windows. Maybe somebody will get fed up and produce such code, but it's not likely to be the least-effort route to the goal. regards, tom lane
On 3/23/21 3:35 PM, Tom Lane wrote: > Jan Wieck <jan@wi3ck.info> writes: >> The problem here is that pg_upgrade itself is invoking a shell again. It >> is not assembling an array of arguments to pass into exec*(). I'd be a >> happy camper if it did the latter. But as things are we'd have to add >> full shell escaping for arbitrary strings. > > Surely we need that (and have it already) anyway? There are functions to shell escape a single string, like appendShellString(), but that is hardly enough when a single optarg for --restore-option could look like any of
--jobs 8
--jobs=8
--jobs='8'
--jobs '8'
--jobs "8"
--jobs="8"
--dont-bother-about-jobs
When placed into a shell string, those things have very different effects on your args[]. I also want to say that we are overengineering this whole thing. Yes, there is the problem of shell quoting possibly going wrong as it passes from one shell to another. But for now this is all about passing a few numbers down from pg_upgrade to pg_restore (and eventually pg_dump). Have we even reached a consensus yet that doing it the way my patch proposes is the right way to go? For instance, that emitting BLOB TOC entries into SECTION_DATA when in binary upgrade mode is a good thing? Or that bunching all the SQL statements for creating the blob and changing its ACL, COMMENT and SECLABEL into one multi-statement query is? Maybe we should focus on those details before getting into all the parameter naming stuff. Regards, Jan -- Jan Wieck Principal Database Engineer Amazon Web Services
Jan Wieck <jan@wi3ck.info> writes: > Have we even reached a consensus yet on that doing it the way, my patch > is proposing, is the right way to go? Like that emitting BLOB TOC > entries into SECTION_DATA when in binary upgrade mode is a good thing? > Or that bunching all the SQL statements for creating the blob, changing > the ACL and COMMENT and SECLABEL all in one multi-statement-query is. Now you're asking for actual review effort, which is a little hard to come by towards the tail end of the last CF of a cycle. I'm interested in this topic, but I can't justify spending much time on it right now. regards, tom lane
On 3/23/21 4:55 PM, Tom Lane wrote: > Jan Wieck <jan@wi3ck.info> writes: >> Have we even reached a consensus yet on that doing it the way, my patch >> is proposing, is the right way to go? Like that emitting BLOB TOC >> entries into SECTION_DATA when in binary upgrade mode is a good thing? >> Or that bunching all the SQL statements for creating the blob, changing >> the ACL and COMMENT and SECLABEL all in one multi-statement-query is. > > Now you're asking for actual review effort, which is a little hard > to come by towards the tail end of the last CF of a cycle. I'm > interested in this topic, but I can't justify spending much time > on it right now. Understood. In any case I changed the options so that they behave the same way the existing -o and -O options (for old/new postmaster options) work. I don't think it would be wise to have option forwarding work differently between options for the postmaster and options for pg_dump/pg_restore. Regards, Jan -- Jan Wieck Principal Database Engineer Amazon Web Services
On 3/24/21 12:04 PM, Jan Wieck wrote: > In any case I changed the options so that they behave the same way, the > existing -o and -O (for old/new postmaster options) work. I don't think > it would be wise to have option forwarding work differently between > options for postmaster and options for pg_dump/pg_restore. Attaching the actual diff might help. -- Jan Wieck Principal Database Engineer Amazon Web Services
Attachment
On Wed, Mar 24, 2021 at 12:05:27PM -0400, Jan Wieck wrote: > On 3/24/21 12:04 PM, Jan Wieck wrote: >> In any case I changed the options so that they behave the same way, the >> existing -o and -O (for old/new postmaster options) work. I don't think >> it would be wise to have option forwarding work differently between >> options for postmaster and options for pg_dump/pg_restore. > > Attaching the actual diff might help. I'd like to revive this thread, so I've created a commitfest entry [0] and attached a hastily rebased patch that compiles and passes the tests. I am aiming to spend some more time on this in the near future. [0] https://commitfest.postgresql.org/39/3841/ -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On 8/24/22 17:32, Nathan Bossart wrote: > I'd like to revive this thread, so I've created a commitfest entry [0] and > attached a hastily rebased patch that compiles and passes the tests. I am > aiming to spend some more time on this in the near future. Just to clarify, was Justin's statement upthread (that the XID problem is fixed) correct? And is this patch just trying to improve the remaining memory and lock usage problems? I took a quick look at the pg_upgrade diffs. I agree with Jan that the escaping problem is a pretty bad smell, but even putting that aside for a bit, is it safe to expose arbitrary options to pg_dump/restore during upgrade? It's super flexible, but I can imagine that some of those flags might really mess up the new cluster... And yeah, if you choose to do that then you get to keep both pieces, I guess, but I like that pg_upgrade tries to be (IMO) fairly bulletproof. --Jacob
On Wed, Sep 07, 2022 at 02:42:05PM -0700, Jacob Champion wrote: > Just to clarify, was Justin's statement upthread (that the XID problem > is fixed) correct? And is this patch just trying to improve the > remaining memory and lock usage problems? I think "fixed" might not be totally accurate, but that is the gist. > I took a quick look at the pg_upgrade diffs. I agree with Jan that the > escaping problem is a pretty bad smell, but even putting that aside for > a bit, is it safe to expose arbitrary options to pg_dump/restore during > upgrade? It's super flexible, but I can imagine that some of those flags > might really mess up the new cluster... > > And yeah, if you choose to do that then you get to keep both pieces, I > guess, but I like that pg_upgrade tries to be (IMO) fairly bulletproof. IIUC the main benefit of this approach is that it isn't dependent on binary-upgrade mode, which seems to be a goal based on the discussion upthread [0]. I think it'd be easily possible to fix only pg_upgrade by simply dumping and restoring pg_largeobject_metadata, as Andres suggested in 2018 [1]. In fact, it seems like it ought to be possible to just copy pg_largeobject_metadata's files as was done before 12a53c7. AFAICT this would only work for clusters upgrading from v12 and newer, and it'd break if any of the underlying data types change their storage format. This seems unlikely for OIDs, but there is ongoing discussion about changing aclitem. I still think this is a problem worth fixing, but it's not yet clear how to proceed. [0] https://postgr.es/m/227228.1616259220%40sss.pgh.pa.us [1] https://postgr.es/m/20181122001415.ef5bncxqin2y3esb%40alap3.anarazel.de -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Sep 8, 2022 at 4:18 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > IIUC the main benefit of this approach is that it isn't dependent on > binary-upgrade mode, which seems to be a goal based on the discussion > upthread [0]. To clarify, I agree that pg_dump should contain the core fix. What I'm questioning is the addition of --dump-options to make use of that fix from pg_upgrade, since it also lets the user do "exciting" new things like --exclude-schema and --include-foreign-data and so on. I don't think we should let them do that without a good reason. Thanks, --Jacob
On Thu, Sep 08, 2022 at 04:29:10PM -0700, Jacob Champion wrote: > On Thu, Sep 8, 2022 at 4:18 PM Nathan Bossart <nathandbossart@gmail.com> wrote: >> IIUC the main benefit of this approach is that it isn't dependent on >> binary-upgrade mode, which seems to be a goal based on the discussion >> upthread [0]. > > To clarify, I agree that pg_dump should contain the core fix. What I'm > questioning is the addition of --dump-options to make use of that fix > from pg_upgrade, since it also lets the user do "exciting" new things > like --exclude-schema and --include-foreign-data and so on. I don't > think we should let them do that without a good reason. Ah, yes, I think that is a fair point. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, Sep 08, 2022 at 04:34:07PM -0700, Nathan Bossart wrote: > On Thu, Sep 08, 2022 at 04:29:10PM -0700, Jacob Champion wrote: >> To clarify, I agree that pg_dump should contain the core fix. What I'm >> questioning is the addition of --dump-options to make use of that fix >> from pg_upgrade, since it also lets the user do "exciting" new things >> like --exclude-schema and --include-foreign-data and so on. I don't >> think we should let them do that without a good reason. > > Ah, yes, I think that is a fair point. It has been more than four weeks since the last activity of this thread and there has been what looks like some feedback to me, so marked as RwF for the time being. -- Michael
Attachment
Hi everyone, I want to continue this thread. I have rebased the patch to the latest master and fixed an issue when pg_restore prints to a file:
$ pg_restore dump_small.custom --restore-blob-batch-size=2 --file=a
--
-- End BLOB restore batch
--
COMMIT;
> On 09/11/2023, 17:05, "Jacob Champion" <jchampion@timescale.com> wrote: > To clarify, I agree that pg_dump should contain the core fix. What I'm > questioning is the addition of --dump-options to make use of that fix > from pg_upgrade, since it also lets the user do "exciting" new things > like --exclude-schema and --include-foreign-data and so on. I don't > think we should let them do that without a good reason. The earlier idea was not to expose these options to users, and instead use [1]
--restore-jobs=NUM --jobs parameter passed to pg_restore
--restore-blob-batch-size=NUM number of blobs restored in one xact
But this was later expanded to use --dump-options and --restore-options [2]. With --restore-options a user can pass --exclude-schema, so maybe we can go back to [1]. [1] https://www.postgresql.org/message-id/a1e200e6-adde-2561-422b-a166ec084e3b%40wi3ck.info [2] https://www.postgresql.org/message-id/8d8d3961-8e8b-3dbe-f911-6f418c5fb1d3%40wi3ck.info Regards Sachin Amazon Web Services: https://aws.amazon.com
[ Jacob's email address updated ] "Kumar, Sachin" <ssetiya@amazon.com> writes: > Hi Everyone , I want to continue this thread , I have rebased the patch to latest > master and fixed an issue when pg_restore prints to file. Um ... you didn't attach the patch? FWIW, I agree with Jacob's concern about it being a bad idea to let users of pg_upgrade pass down arbitrary options to pg_dump/pg_restore. I think we'd regret going there, because it'd hugely expand the set of cases pg_upgrade has to deal with. Also, pg_upgrade is often invoked indirectly via scripts, so I do not especially buy the idea that we're going to get useful control input from some human somewhere. I think we'd be better off to assume that pg_upgrade is on its own to manage the process, so that if we need to switch strategies based on object count or whatever, we should put in a heuristic to choose the strategy automatically. It might not be perfect, but that will give better results for the pretty large fraction of users who are not going to mess with weird little switches. regards, tom lane
Hi, On November 9, 2023 10:41:01 AM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Also, pg_upgrade is often invoked indirectly via scripts, so I do >not especially buy the idea that we're going to get useful control >input from some human somewhere. I think we'd be better off to >assume that pg_upgrade is on its own to manage the process, so that >if we need to switch strategies based on object count or whatever, >we should put in a heuristic to choose the strategy automatically. >It might not be perfect, but that will give better results for the >pretty large fraction of users who are not going to mess with >weird little switches. +1 - even leaving everything else aside, just about no user would know about the option. There are cases where we can't do better than giving the user control, but we are certainly adding options at a rate that doesn't seem sustainable. And here it doesn't seem that hard to do better. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
> On 09/11/2023, 18:41, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > Um ... you didn't attach the patch? Sorry, patch attached. Regards Sachin
Attachment
> "Tom Lane" <tgl@sss.pgh.pa.us> wrote: > FWIW, I agree with Jacob's concern about it being a bad idea to let > users of pg_upgrade pass down arbitrary options to pg_dump/pg_restore. > I think we'd regret going there, because it'd hugely expand the set > of cases pg_upgrade has to deal with. > Also, pg_upgrade is often invoked indirectly via scripts, so I do > not especially buy the idea that we're going to get useful control > input from some human somewhere. I think we'd be better off to > assume that pg_upgrade is on its own to manage the process, so that > if we need to switch strategies based on object count or whatever, > we should put in a heuristic to choose the strategy automatically. > It might not be perfect, but that will give better results for the > pretty large fraction of users who are not going to mess with > weird little switches. I have updated the patch to use a heuristic: during pg_upgrade we count large objects per database. During pg_restore execution, if the database's large object count is greater than LARGE_OBJECTS_THRESOLD (1k), we will use --restore-blob-batch-size. I also modified the pg_upgrade --jobs behavior if we have large objects (> LARGE_OBJECTS_THRESOLD):
+ /* Restore all the dbs where LARGE_OBJECTS_THRESOLD is not breached */
+ restore_dbs(stats, true);
+ /* reap all children */
+ while (reap_child(true) == true)
+ ;
+ /* Restore rest of the dbs one by one with pg_restore --jobs = user_opts.jobs */
+ restore_dbs(stats, false);
 /* reap all children */
 while (reap_child(true) == true)
 ;
Regards Sachin
Attachment
> I have updated the patch to use heuristic, During pg_upgrade we count > Large objects per database. During pg_restore execution if db large_objects > count is greater than LARGE_OBJECTS_THRESOLD (1k) we will use > --restore-blob-batch-size. I think both SECTION_DATA and SECTION_POST_DATA can be parallelized by pg_restore, so instead of using the large object count in the heuristic, we can use the count of SECTION_DATA + SECTION_POST_DATA entries. Regards Sachin
I spent some time looking at the v7 patch. I can't help feeling that this is going off in the wrong direction, primarily for these reasons: * It focuses only on cutting the number of transactions needed to restore a large number of blobs (large objects). Certainly that's a pain point, but it's not the only one of this sort. If you have a lot of tables, restore will consume just as many transactions as it would for a similar number of blobs --- probably more, in fact, since we usually need more commands per table than per blob. * I'm not too thrilled with the (undocumented) rearrangements in pg_dump. I really don't like the idea of emitting a fundamentally different TOC layout in binary-upgrade mode; that seems unmaintainably bug-prone. Plus, the XID-consumption problem is not really confined to pg_upgrade. What I think we actually ought to do is one of the alternatives discussed upthread: teach pg_restore to be able to commit every so often, without trying to provide the all-or-nothing guarantees of --single-transaction mode. This cuts its XID consumption by whatever multiple "every so often" is, while allowing us to limit the number of locks taken during any one transaction. It also seems a great deal safer than the idea I floated of not taking locks at all during a binary upgrade; plus, it has some usefulness with regular pg_restore that's not under control of pg_upgrade. So I had a go at coding that, and attached is the result. It invents a --transaction-size option, and when that's active it will COMMIT after every N TOC items. (This seems simpler to implement and less bug-prone than every-N-SQL-commands.) I had initially supposed that in a parallel restore we could have child workers also commit after every N TOC items, but was soon disabused of that idea. After a worker processes a TOC item, any dependent items (such as index builds) might get dispatched to some other worker, which had better be able to see the results of the first worker's step. 
So at least in this implementation, we disable the multi-command-per-COMMIT behavior during the parallel part of the restore. Maybe that could be improved in future, but it seems like it'd add a lot more complexity, and it wouldn't make life any better for pg_upgrade (which doesn't use parallel pg_restore, and seems unlikely to want to in future). I've not spent a lot of effort on pg_upgrade changes here: I just hard-wired it to select --transaction-size=1000. Given the default lock table size of 64*100, that gives us enough headroom for each TOC to take half a dozen locks. We could go higher than that by making pg_upgrade force the destination postmaster to create a larger-than-default lock table, but I'm not sure if it's worth any trouble. We've already bought three orders of magnitude improvement as it stands, which seems like enough ambition for today. (Also, having pg_upgrade override the user's settings in the destination cluster might not be without downsides.) Another thing I'm wondering about is why this is only a pg_restore option not also a pg_dump/pg_dumpall option. I did it like that because --single-transaction is pg_restore only, but that seems more like an oversight or laziness than a well-considered decision. Maybe we should back-fill that omission; but it could be done later. Thoughts? regards, tom lane diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml index 1a23874da6..2e3ba80258 100644 --- a/doc/src/sgml/ref/pg_restore.sgml +++ b/doc/src/sgml/ref/pg_restore.sgml @@ -786,6 +786,30 @@ PostgreSQL documentation </listitem> </varlistentry> + <varlistentry> + <term><option>--transaction-size=<replaceable class="parameter">N</replaceable></option></term> + <listitem> + <para> + Execute the restore as a series of transactions, each processing + up to <replaceable class="parameter">N</replaceable> database + objects. This option implies <option>--exit-on-error</option>. 
+ </para> + <para> + <option>--transaction-size</option> offers an intermediate choice + between the default behavior (one transaction per SQL command) + and <option>-1</option>/<option>--single-transaction</option> + (one transaction for all restored objects). + While <option>--single-transaction</option> has the least + overhead, it may be impractical for large databases because the + transaction will take a lock on each restored object, possibly + exhausting the server's lock table space. + Using <option>--transaction-size</option> with a size of a few + thousand objects offers nearly the same performance benefits while + capping the amount of lock table space needed. + </para> + </listitem> + </varlistentry> + <varlistentry> <term><option>--use-set-session-authorization</option></term> <listitem> diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h index 9ef2f2017e..fbf5f1c515 100644 --- a/src/bin/pg_dump/pg_backup.h +++ b/src/bin/pg_dump/pg_backup.h @@ -149,7 +149,9 @@ typedef struct _restoreOptions * compression */ int suppressDumpWarnings; /* Suppress output of WARNING entries * to stderr */ - bool single_txn; + + bool single_txn; /* restore all TOCs in one transaction */ + int txn_size; /* restore this many TOCs per txn, if > 0 */ bool *idWanted; /* array showing which dump IDs to emit */ int enable_row_security; diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 256d1e35a4..600482c93c 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -502,7 +502,28 @@ RestoreArchive(Archive *AHX) /* Otherwise, drop anything that's selected and has a dropStmt */ if (((te->reqs & (REQ_SCHEMA | REQ_DATA)) != 0) && te->dropStmt) { + bool not_allowed_in_txn = false; + pg_log_info("dropping %s %s", te->desc, te->tag); + + /* + * In --transaction-size mode, we have to temporarily exit our + * transaction block to drop objects that can't be dropped + * within a transaction. 
+ */ + if (ropt->txn_size > 0) + { + if (strcmp(te->desc, "DATABASE") == 0 || + strcmp(te->desc, "DATABASE PROPERTIES") == 0) + { + not_allowed_in_txn = true; + if (AH->connection) + CommitTransaction(AHX); + else + ahprintf(AH, "COMMIT;\n"); + } + } + /* Select owner and schema as necessary */ _becomeOwner(AH, te); _selectOutputSchema(AH, te->namespace); @@ -615,6 +636,33 @@ RestoreArchive(Archive *AHX) } } } + + /* + * In --transaction-size mode, re-establish the transaction + * block if needed; otherwise, commit after every N drops. + */ + if (ropt->txn_size > 0) + { + if (not_allowed_in_txn) + { + if (AH->connection) + StartTransaction(AHX); + else + ahprintf(AH, "BEGIN;\n"); + AH->txnCount = 0; + } + else if (++AH->txnCount >= ropt->txn_size) + { + if (AH->connection) + { + CommitTransaction(AHX); + StartTransaction(AHX); + } + else + ahprintf(AH, "COMMIT;\nBEGIN;\n"); + AH->txnCount = 0; + } + } } } @@ -711,7 +759,11 @@ RestoreArchive(Archive *AHX) } } - if (ropt->single_txn) + /* + * Close out any persistent transaction we may have. While these two + * cases are started in different places, we can end both cases here. + */ + if (ropt->single_txn || ropt->txn_size > 0) { if (AH->connection) CommitTransaction(AHX); @@ -772,6 +824,25 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel) */ if ((reqs & REQ_SCHEMA) != 0) { + bool object_is_db = false; + + /* + * In --transaction-size mode, must exit our transaction block to + * create a database or set its properties. 
+ */ + if (strcmp(te->desc, "DATABASE") == 0 || + strcmp(te->desc, "DATABASE PROPERTIES") == 0) + { + object_is_db = true; + if (ropt->txn_size > 0) + { + if (AH->connection) + CommitTransaction(&AH->public); + else + ahprintf(AH, "COMMIT;\n\n"); + } + } + /* Show namespace in log message if available */ if (te->namespace) pg_log_info("creating %s \"%s.%s\"", @@ -822,10 +893,10 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel) /* * If we created a DB, connect to it. Also, if we changed DB * properties, reconnect to ensure that relevant GUC settings are - * applied to our session. + * applied to our session. (That also restarts the transaction block + * in --transaction-size mode.) */ - if (strcmp(te->desc, "DATABASE") == 0 || - strcmp(te->desc, "DATABASE PROPERTIES") == 0) + if (object_is_db) { pg_log_info("connecting to new database \"%s\"", te->tag); _reconnectToDB(AH, te->tag); @@ -951,6 +1022,25 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel) } } + /* + * If we emitted anything for this TOC entry, that counts as one action + * against the transaction-size limit. Commit if it's time to. + */ + if ((reqs & (REQ_SCHEMA | REQ_DATA)) != 0 && ropt->txn_size > 0) + { + if (++AH->txnCount >= ropt->txn_size) + { + if (AH->connection) + { + CommitTransaction(&AH->public); + StartTransaction(&AH->public); + } + else + ahprintf(AH, "COMMIT;\nBEGIN;\n\n"); + AH->txnCount = 0; + } + } + if (AH->public.n_errors > 0 && status == WORKER_OK) status = WORKER_IGNORED_ERRORS; @@ -1297,7 +1387,12 @@ StartRestoreLOs(ArchiveHandle *AH) { RestoreOptions *ropt = AH->public.ropt; - if (!ropt->single_txn) + /* + * LOs must be restored within a transaction block, since we need the LO + * handle to stay open while we write it. Establish a transaction unless + * there's one being used globally. 
+ */ + if (!(ropt->single_txn || ropt->txn_size > 0)) { if (AH->connection) StartTransaction(&AH->public); @@ -1316,7 +1411,7 @@ EndRestoreLOs(ArchiveHandle *AH) { RestoreOptions *ropt = AH->public.ropt; - if (!ropt->single_txn) + if (!(ropt->single_txn || ropt->txn_size > 0)) { if (AH->connection) CommitTransaction(&AH->public); @@ -3149,6 +3244,19 @@ _doSetFixedOutputState(ArchiveHandle *AH) else ahprintf(AH, "SET row_security = off;\n"); + /* + * In --transaction-size mode, we should always be in a transaction when + * we begin to restore objects. + */ + if (ropt && ropt->txn_size > 0) + { + if (AH->connection) + StartTransaction(&AH->public); + else + ahprintf(AH, "\nBEGIN;\n"); + AH->txnCount = 0; + } + ahprintf(AH, "\n"); } @@ -3991,6 +4099,14 @@ restore_toc_entries_prefork(ArchiveHandle *AH, TocEntry *pending_list) } } + /* + * In --transaction-size mode, we must commit the open transaction before + * dropping the database connection. This also ensures that child workers + * can see the objects we've created so far. + */ + if (AH->public.ropt->txn_size > 0) + CommitTransaction(&AH->public); + /* * Now close parent connection in prep for parallel steps. 
We do this * mainly to ensure that we don't exceed the specified number of parallel @@ -4730,6 +4846,10 @@ CloneArchive(ArchiveHandle *AH) clone = (ArchiveHandle *) pg_malloc(sizeof(ArchiveHandle)); memcpy(clone, AH, sizeof(ArchiveHandle)); + /* Likewise flat-copy the RestoreOptions, so we can alter them locally */ + clone->public.ropt = (RestoreOptions *) pg_malloc(sizeof(RestoreOptions)); + memcpy(clone->public.ropt, AH->public.ropt, sizeof(RestoreOptions)); + /* Handle format-independent fields */ memset(&(clone->sqlparse), 0, sizeof(clone->sqlparse)); @@ -4748,6 +4868,13 @@ CloneArchive(ArchiveHandle *AH) /* clone has its own error count, too */ clone->public.n_errors = 0; + /* + * Clone connections disregard --transaction-size; they must commit after + * each command so that the results are immediately visible to other + * workers. + */ + clone->public.ropt->txn_size = 0; + /* * Connect our new clone object to the database, using the same connection * parameters used for the original connection. 
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h index 917283fd34..c21fdfe596 100644 --- a/src/bin/pg_dump/pg_backup_archiver.h +++ b/src/bin/pg_dump/pg_backup_archiver.h @@ -322,6 +322,9 @@ struct _archiveHandle char *currTablespace; /* current tablespace, or NULL */ char *currTableAm; /* current table access method, or NULL */ + /* in --transaction-size mode, this counts objects emitted in cur xact */ + int txnCount; + void *lo_buf; size_t lo_buf_used; size_t lo_buf_size; diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c index c3beacdec1..5ea78cf7cc 100644 --- a/src/bin/pg_dump/pg_restore.c +++ b/src/bin/pg_dump/pg_restore.c @@ -120,6 +120,7 @@ main(int argc, char **argv) {"role", required_argument, NULL, 2}, {"section", required_argument, NULL, 3}, {"strict-names", no_argument, &strict_names, 1}, + {"transaction-size", required_argument, NULL, 5}, {"use-set-session-authorization", no_argument, &use_setsessauth, 1}, {"no-comments", no_argument, &no_comments, 1}, {"no-publications", no_argument, &no_publications, 1}, @@ -289,10 +290,18 @@ main(int argc, char **argv) set_dump_section(optarg, &(opts->dumpSections)); break; - case 4: + case 4: /* filter */ read_restore_filters(optarg, opts); break; + case 5: /* transaction-size */ + if (!option_parse_int(optarg, "--transaction-size", + 1, INT_MAX, + &opts->txn_size)) + exit(1); + opts->exit_on_error = true; + break; + default: /* getopt_long already emitted a complaint */ pg_log_error_hint("Try \"%s --help\" for more information.", progname); @@ -337,6 +346,9 @@ main(int argc, char **argv) if (opts->dataOnly && opts->dropSchema) pg_fatal("options -c/--clean and -a/--data-only cannot be used together"); + if (opts->single_txn && opts->txn_size > 0) + pg_fatal("options -1/--single-transaction and --transaction-size cannot be used together"); + /* * -C is not compatible with -1, because we can't create a database inside * a transaction block. 
@@ -484,6 +496,7 @@ usage(const char *progname) printf(_(" --section=SECTION restore named section (pre-data, data, or post-data)\n")); printf(_(" --strict-names require table and/or schema include patterns to\n" " match at least one entity each\n")); + printf(_(" --transaction-size=N commit after every N objects\n")); printf(_(" --use-set-session-authorization\n" " use SET SESSION AUTHORIZATION commands instead of\n" " ALTER OWNER commands to set ownership\n")); diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c index 3960af4036..5cfd2282e1 100644 --- a/src/bin/pg_upgrade/pg_upgrade.c +++ b/src/bin/pg_upgrade/pg_upgrade.c @@ -548,6 +548,7 @@ create_new_objects(void) true, true, "\"%s/pg_restore\" %s %s --exit-on-error --verbose " + "--transaction-size=1000 " "--dbname postgres \"%s/%s\"", new_cluster.bindir, cluster_conn_opts(&new_cluster), @@ -586,6 +587,7 @@ create_new_objects(void) parallel_exec_prog(log_file_name, NULL, "\"%s/pg_restore\" %s %s --exit-on-error --verbose " + "--transaction-size=1000 " "--dbname template1 \"%s/%s\"", new_cluster.bindir, cluster_conn_opts(&new_cluster),
I have spent some more effort in this area and developed a patch series that I think addresses all of the performance issues that we've discussed in this thread, both for pg_upgrade and more general use of pg_dump/pg_restore. Concretely, it absorbs the pg_restore --transaction-size switch that I proposed before to cut the number of transactions needed during restore, rearranges the representation of BLOB-related TOC entries to reduce the client-side memory requirements, and fixes some ancient mistakes that prevent both selective restore of BLOBs and parallel restore of BLOBs. As a demonstration, I made a database containing 100K empty blobs, and measured the time needed to dump/restore that using -Fd and -j 10. HEAD doesn't get any useful parallelism on blobs, but with this patch series we do:
             dump   restore
HEAD:        14sec  15sec
after 0002:   7sec  10sec
after 0003:   7sec   3sec
There are a few loose ends:
* I did not invent a switch to control the batching of blobs; it's just hard-wired at 1000 blobs per group here. Probably we need some user knob for that, but I'm unsure if we want to expose a count or just a boolean for one vs more than one blob per batch. The point of forcing one blob per batch would be to allow exact control during selective restore, and I'm not sure if there's any value in random other settings. On the other hand, selective restore of blobs has been completely broken for the last dozen years and I can't recall any user complaints about that; so maybe nobody cares and we could just leave this as an internal choice.
* Likewise, there's no user-accessible knob to control what transaction size pg_upgrade uses. Do we need one? In any case, it's likely that the default needs a bit more thought than I've given it. I used 1000, but if pg_upgrade is launching parallel restore jobs we likely need to divide that by the number of restore jobs.
* As the patch stands, we still build a separate TOC entry for each comment or seclabel or ACL attached to a blob.
If you have a lot of blobs with non-default properties then the TOC bloat problem comes back again. We could do something about that, but it would take a bit of tedious refactoring, and the most obvious way to handle it probably re-introduces too-many-locks problems. Is this a scenario that's worth spending a lot of time on? More details appear in the commit messages below. Patch 0004 is nearly the same as the v8 patch I posted before, although it adds some logic to ensure that a large blob metadata batch doesn't create too many locks. Comments? regards, tom lane PS: I don't see any active CF entry for this thread, so I'm going to go make one. From eecef8f312967ff7cc0f47899c6db2c3e654371d Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Wed, 20 Dec 2023 13:52:28 -0500 Subject: [PATCH v9 1/4] Some small preliminaries for pg_dump changes. Centralize management of the lo_buf used to hold data while restoring blobs. The code previously had each format handler create lo_buf, which seems rather pointless given that the format handlers all make it the same way. Moreover, the format handlers never use lo_buf directly, making this setup a failure from a separation-of-concerns standpoint. Let's move the responsibility into pg_backup_archiver.c, which is the only module concerned with lo_buf. The main reason to do this now is that it allows a centralized fix for the soon-to-be-false assumption that we never restore blobs in parallel. Also, get rid of dead code in DropLOIfExists: it's been a long time since we had any need to be able to restore to a pre-9.0 server. 
--- src/bin/pg_dump/pg_backup_archiver.c | 9 +++++++++ src/bin/pg_dump/pg_backup_custom.c | 7 ------- src/bin/pg_dump/pg_backup_db.c | 27 +++++---------------------- src/bin/pg_dump/pg_backup_directory.c | 6 ------ src/bin/pg_dump/pg_backup_null.c | 4 ---- src/bin/pg_dump/pg_backup_tar.c | 4 ---- 6 files changed, 14 insertions(+), 43 deletions(-) diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 256d1e35a4..26c2c684c8 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -1343,6 +1343,12 @@ StartRestoreLO(ArchiveHandle *AH, Oid oid, bool drop) AH->loCount++; /* Initialize the LO Buffer */ + if (AH->lo_buf == NULL) + { + /* First time through (in this process) so allocate the buffer */ + AH->lo_buf_size = LOBBUFSIZE; + AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE); + } AH->lo_buf_used = 0; pg_log_info("restoring large object with OID %u", oid); @@ -4748,6 +4754,9 @@ CloneArchive(ArchiveHandle *AH) /* clone has its own error count, too */ clone->public.n_errors = 0; + /* clones should not share lo_buf */ + clone->lo_buf = NULL; + /* * Connect our new clone object to the database, using the same connection * parameters used for the original connection. diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c index b576b29924..7c6ac89dd4 100644 --- a/src/bin/pg_dump/pg_backup_custom.c +++ b/src/bin/pg_dump/pg_backup_custom.c @@ -140,10 +140,6 @@ InitArchiveFmt_Custom(ArchiveHandle *AH) ctx = (lclContext *) pg_malloc0(sizeof(lclContext)); AH->formatData = (void *) ctx; - /* Initialize LO buffering */ - AH->lo_buf_size = LOBBUFSIZE; - AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE); - /* * Now open the file */ @@ -902,9 +898,6 @@ _Clone(ArchiveHandle *AH) * share knowledge about where the data blocks are across threads. * _PrintTocData has to be careful about the order of operations on that * state, though. 
- * - * Note: we do not make a local lo_buf because we expect at most one BLOBS - * entry per archive, so no parallelism is possible. */ } diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c index f766b65059..b297ca049d 100644 --- a/src/bin/pg_dump/pg_backup_db.c +++ b/src/bin/pg_dump/pg_backup_db.c @@ -544,26 +544,9 @@ CommitTransaction(Archive *AHX) void DropLOIfExists(ArchiveHandle *AH, Oid oid) { - /* - * If we are not restoring to a direct database connection, we have to - * guess about how to detect whether the LO exists. Assume new-style. - */ - if (AH->connection == NULL || - PQserverVersion(AH->connection) >= 90000) - { - ahprintf(AH, - "SELECT pg_catalog.lo_unlink(oid) " - "FROM pg_catalog.pg_largeobject_metadata " - "WHERE oid = '%u';\n", - oid); - } - else - { - /* Restoring to pre-9.0 server, so do it the old way */ - ahprintf(AH, - "SELECT CASE WHEN EXISTS(" - "SELECT 1 FROM pg_catalog.pg_largeobject WHERE loid = '%u'" - ") THEN pg_catalog.lo_unlink('%u') END;\n", - oid, oid); - } + ahprintf(AH, + "SELECT pg_catalog.lo_unlink(oid) " + "FROM pg_catalog.pg_largeobject_metadata " + "WHERE oid = '%u';\n", + oid); } diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c index 679c60420b..16491d6a95 100644 --- a/src/bin/pg_dump/pg_backup_directory.c +++ b/src/bin/pg_dump/pg_backup_directory.c @@ -143,10 +143,6 @@ InitArchiveFmt_Directory(ArchiveHandle *AH) ctx->dataFH = NULL; ctx->LOsTocFH = NULL; - /* Initialize LO buffering */ - AH->lo_buf_size = LOBBUFSIZE; - AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE); - /* * Now open the TOC file */ @@ -823,8 +819,6 @@ _Clone(ArchiveHandle *AH) ctx = (lclContext *) AH->formatData; /* - * Note: we do not make a local lo_buf because we expect at most one BLOBS - * entry per archive, so no parallelism is possible. Likewise, * TOC-entry-local state isn't an issue because any one TOC entry is * touched by just one worker child. 
*/ diff --git a/src/bin/pg_dump/pg_backup_null.c b/src/bin/pg_dump/pg_backup_null.c index 08f096251b..776f057770 100644 --- a/src/bin/pg_dump/pg_backup_null.c +++ b/src/bin/pg_dump/pg_backup_null.c @@ -63,10 +63,6 @@ InitArchiveFmt_Null(ArchiveHandle *AH) AH->ClonePtr = NULL; AH->DeClonePtr = NULL; - /* Initialize LO buffering */ - AH->lo_buf_size = LOBBUFSIZE; - AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE); - /* * Now prevent reading... */ diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c index aad88ad559..4cb9707e63 100644 --- a/src/bin/pg_dump/pg_backup_tar.c +++ b/src/bin/pg_dump/pg_backup_tar.c @@ -156,10 +156,6 @@ InitArchiveFmt_Tar(ArchiveHandle *AH) ctx->filePos = 0; ctx->isSpecialScript = 0; - /* Initialize LO buffering */ - AH->lo_buf_size = LOBBUFSIZE; - AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE); - /* * Now open the tar file, and load the TOC if we're in read mode. */ -- 2.39.3 From b3239164371648ccb0053f045ddc14a762e88d49 Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Wed, 20 Dec 2023 15:34:19 -0500 Subject: [PATCH v9 2/4] In dumps, group large objects into matching metadata and data entries. Commit c0d5be5d6 caused pg_dump to create a separate BLOB metadata TOC entry for each large object (blob), but it did not touch the ancient decision to put all the blobs' data into a single BLOBS TOC entry. This is bad for a few reasons: for databases with millions of blobs, the TOC becomes unreasonably large, causing performance issues; selective restore of just some blobs is quite impossible; and we cannot parallelize either dump or restore of the blob data, since our architecture for that relies on farming out whole TOC entries to worker processes. To improve matters, let's group multiple blobs into each blob metadata TOC entry, and then make corresponding per-group blob data TOC entries. Selective restore using pg_restore's -l/-L switches is then possible, though only at the group level. 
(We should provide a switch to allow forcing one-blob-per-group for users who need precise selective restore and don't have huge numbers of blobs.  This patch doesn't yet do that, instead just hard-wiring the maximum number of blobs per entry at 1000.)

The blobs in a group must all have the same owner, since the TOC entry format only allows one owner to be named.  In this implementation we also require them to all share the same ACL (grants); the archive format wouldn't require that, but pg_dump's representation of DumpableObjects does.  It seems unlikely that either restriction will be problematic for databases with huge numbers of blobs.

The metadata TOC entries now have a "desc" string of "BLOB METADATA", and their "defn" string is just a newline-separated list of blob OIDs.  The restore code has to generate creation commands, ALTER OWNER commands, and drop commands (for --clean mode) from that.  We would need special-case code for ALTER OWNER and drop in any case, so the alternative of keeping the "defn" as directly executable SQL code for creation wouldn't buy much, and it seems like it'd bloat the archive to little purpose.

The data TOC entries ("BLOBS") can be exactly the same as before, except that now there can be more than one, so we'd better give them identifying tag strings.

We have to bump the archive file format version number, since existing versions of pg_restore wouldn't know they need to do something special for BLOB METADATA, plus they aren't going to work correctly with multiple BLOBS entries.  Also, the directory and tar-file format handlers need some work for multiple BLOBS entries: they used to hard-wire the file name as "blobs.toc", which is replaced here with "blobs_<dumpid>.toc".  The 002_pg_dump.pl test script also knows about that and requires minor updates.
(I had to drop the test for manually-compressed blobs.toc files with LZ4, because lz4's obtuse command line design requires explicit specification of the output file name which seems impractical here. I don't think we're losing any useful test coverage thereby; that test stanza seems completely duplicative with the gzip and zstd cases anyway.) As this stands, we still generate a separate TOC entry for any comment, security label, or ACL attached to a blob. I feel comfortable in believing that comments and security labels on blobs are rare; but we might have to do something about aggregating blob ACLs into grouped TOC entries to avoid blowing up the TOC size, if there are use cases with large numbers of non-default blob ACLs. That can be done later though, as it would not create any compatibility issues. --- src/bin/pg_dump/common.c | 26 +++ src/bin/pg_dump/pg_backup_archiver.c | 76 +++++-- src/bin/pg_dump/pg_backup_archiver.h | 6 +- src/bin/pg_dump/pg_backup_custom.c | 4 +- src/bin/pg_dump/pg_backup_db.c | 27 +++ src/bin/pg_dump/pg_backup_directory.c | 38 ++-- src/bin/pg_dump/pg_backup_null.c | 4 +- src/bin/pg_dump/pg_backup_tar.c | 39 +++- src/bin/pg_dump/pg_dump.c | 280 +++++++++++++++----------- src/bin/pg_dump/pg_dump.h | 11 + src/bin/pg_dump/t/002_pg_dump.pl | 30 ++- 11 files changed, 354 insertions(+), 187 deletions(-) diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c index 8b0c1e7b53..c38700c21e 100644 --- a/src/bin/pg_dump/common.c +++ b/src/bin/pg_dump/common.c @@ -46,6 +46,8 @@ static DumpId lastDumpId = 0; /* Note: 0 is InvalidDumpId */ * expects that it can move them around when resizing the table. So we * cannot make the DumpableObjects be elements of the hash table directly; * instead, the hash table elements contain pointers to DumpableObjects. + * This does have the advantage of letting us map multiple CatalogIds + * to one DumpableObject, which is useful for blobs. 
* * It turns out to be convenient to also use this data structure to map * CatalogIds to owning extensions, if any. Since extension membership @@ -696,6 +698,30 @@ AssignDumpId(DumpableObject *dobj) } } +/* + * recordAdditionalCatalogID + * Record an additional catalog ID for the given DumpableObject + */ +void +recordAdditionalCatalogID(CatalogId catId, DumpableObject *dobj) +{ + CatalogIdMapEntry *entry; + bool found; + + /* CatalogId hash table must exist, if we have a DumpableObject */ + Assert(catalogIdHash != NULL); + + /* Add reference to CatalogId hash */ + entry = catalogid_insert(catalogIdHash, catId, &found); + if (!found) + { + entry->dobj = NULL; + entry->ext = NULL; + } + Assert(entry->dobj == NULL); + entry->dobj = dobj; +} + /* * Assign a DumpId that's not tied to a DumpableObject. * diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 26c2c684c8..73b9972da4 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -512,7 +512,20 @@ RestoreArchive(Archive *AHX) * don't necessarily emit it verbatim; at this point we add an * appropriate IF EXISTS clause, if the user requested it. */ - if (*te->dropStmt != '\0') + if (strcmp(te->desc, "BLOB METADATA") == 0) + { + /* We must generate the per-blob commands */ + if (ropt->if_exists) + IssueCommandPerBlob(AH, te, + "SELECT pg_catalog.lo_unlink(oid) " + "FROM pg_catalog.pg_largeobject_metadata " + "WHERE oid = '", "'"); + else + IssueCommandPerBlob(AH, te, + "SELECT pg_catalog.lo_unlink('", + "')"); + } + else if (*te->dropStmt != '\0') { if (!ropt->if_exists || strncmp(te->dropStmt, "--", 2) == 0) @@ -528,12 +541,12 @@ RestoreArchive(Archive *AHX) { /* * Inject an appropriate spelling of "if exists". For - * large objects, we have a separate routine that + * old-style large objects, we have a routine that * knows how to do it, without depending on * te->dropStmt; use that. For other objects we need * to parse the command. 
*/ - if (strncmp(te->desc, "BLOB", 4) == 0) + if (strcmp(te->desc, "BLOB") == 0) { DropLOIfExists(AH, te->catalogId.oid); } @@ -1290,7 +1303,7 @@ EndLO(Archive *AHX, Oid oid) **********/ /* - * Called by a format handler before any LOs are restored + * Called by a format handler before a group of LOs is restored */ void StartRestoreLOs(ArchiveHandle *AH) @@ -1309,7 +1322,7 @@ StartRestoreLOs(ArchiveHandle *AH) } /* - * Called by a format handler after all LOs are restored + * Called by a format handler after a group of LOs is restored */ void EndRestoreLOs(ArchiveHandle *AH) @@ -2994,13 +3007,14 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH) { /* * Special Case: If 'SEQUENCE SET' or anything to do with LOs, then it - * is considered a data entry. We don't need to check for the BLOBS - * entry or old-style BLOB COMMENTS, because they will have hadDumper - * = true ... but we do need to check new-style BLOB ACLs, comments, + * is considered a data entry. We don't need to check for BLOBS or + * old-style BLOB COMMENTS entries, because they will have hadDumper = + * true ... but we do need to check new-style BLOB ACLs, comments, * etc. */ if (strcmp(te->desc, "SEQUENCE SET") == 0 || strcmp(te->desc, "BLOB") == 0 || + strcmp(te->desc, "BLOB METADATA") == 0 || (strcmp(te->desc, "ACL") == 0 && strncmp(te->tag, "LARGE OBJECT ", 13) == 0) || (strcmp(te->desc, "COMMENT") == 0 && @@ -3041,6 +3055,7 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH) if (!(ropt->sequence_data && strcmp(te->desc, "SEQUENCE SET") == 0) && !(ropt->binary_upgrade && (strcmp(te->desc, "BLOB") == 0 || + strcmp(te->desc, "BLOB METADATA") == 0 || (strcmp(te->desc, "ACL") == 0 && strncmp(te->tag, "LARGE OBJECT ", 13) == 0) || (strcmp(te->desc, "COMMENT") == 0 && @@ -3612,18 +3627,26 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData) } /* - * Actually print the definition. + * Actually print the definition. 
Normally we can just print the defn + * string if any, but we have two special cases: * - * Really crude hack for suppressing AUTHORIZATION clause that old pg_dump + * 1. A crude hack for suppressing AUTHORIZATION clause that old pg_dump * versions put into CREATE SCHEMA. Don't mutate the variant for schema * "public" that is a comment. We have to do this when --no-owner mode is * selected. This is ugly, but I see no other good way ... + * + * 2. BLOB METADATA entries need special processing since their defn + * strings are just lists of OIDs, not complete SQL commands. */ if (ropt->noOwner && strcmp(te->desc, "SCHEMA") == 0 && strncmp(te->defn, "--", 2) != 0) { ahprintf(AH, "CREATE SCHEMA %s;\n\n\n", fmtId(te->tag)); } + else if (strcmp(te->desc, "BLOB METADATA") == 0) + { + IssueCommandPerBlob(AH, te, "SELECT pg_catalog.lo_create('", "')"); + } else { if (te->defn && strlen(te->defn) > 0) @@ -3644,18 +3667,31 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData) te->owner && strlen(te->owner) > 0 && te->dropStmt && strlen(te->dropStmt) > 0) { - PQExpBufferData temp; + if (strcmp(te->desc, "BLOB METADATA") == 0) + { + /* BLOB METADATA needs special code to handle multiple LOs */ + char *cmdEnd = psprintf(" OWNER TO %s", fmtId(te->owner)); + + IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", cmdEnd); + pg_free(cmdEnd); + } + else + { + /* For all other cases, we can use _getObjectDescription */ + PQExpBufferData temp; - initPQExpBuffer(&temp); - _getObjectDescription(&temp, te); + initPQExpBuffer(&temp); + _getObjectDescription(&temp, te); - /* - * If _getObjectDescription() didn't fill the buffer, then there is no - * owner. - */ - if (temp.data[0]) - ahprintf(AH, "ALTER %s OWNER TO %s;\n\n", temp.data, fmtId(te->owner)); - termPQExpBuffer(&temp); + /* + * If _getObjectDescription() didn't fill the buffer, then there + * is no owner. 
+ */ + if (temp.data[0]) + ahprintf(AH, "ALTER %s OWNER TO %s;\n\n", + temp.data, fmtId(te->owner)); + termPQExpBuffer(&temp); + } } /* diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h index 917283fd34..e4dd395582 100644 --- a/src/bin/pg_dump/pg_backup_archiver.h +++ b/src/bin/pg_dump/pg_backup_archiver.h @@ -68,10 +68,12 @@ #define K_VERS_1_15 MAKE_ARCHIVE_VERSION(1, 15, 0) /* add * compression_algorithm * in header */ +#define K_VERS_1_16 MAKE_ARCHIVE_VERSION(1, 16, 0) /* BLOB METADATA entries + * and multiple BLOBS */ /* Current archive version number (the format we can output) */ #define K_VERS_MAJOR 1 -#define K_VERS_MINOR 15 +#define K_VERS_MINOR 16 #define K_VERS_REV 0 #define K_VERS_SELF MAKE_ARCHIVE_VERSION(K_VERS_MAJOR, K_VERS_MINOR, K_VERS_REV) @@ -448,6 +450,8 @@ extern void InitArchiveFmt_Tar(ArchiveHandle *AH); extern bool isValidTarHeader(char *header); extern void ReconnectToServer(ArchiveHandle *AH, const char *dbname); +extern void IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te, + const char *cmdBegin, const char *cmdEnd); extern void DropLOIfExists(ArchiveHandle *AH, Oid oid); void ahwrite(const void *ptr, size_t size, size_t nmemb, ArchiveHandle *AH); diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c index 7c6ac89dd4..55107b2005 100644 --- a/src/bin/pg_dump/pg_backup_custom.c +++ b/src/bin/pg_dump/pg_backup_custom.c @@ -338,7 +338,7 @@ _EndData(ArchiveHandle *AH, TocEntry *te) } /* - * Called by the archiver when starting to save all BLOB DATA (not schema). + * Called by the archiver when starting to save BLOB DATA (not schema). * This routine should save whatever format-specific information is needed * to read the LOs back into memory. * @@ -398,7 +398,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) } /* - * Called by the archiver when finishing saving all BLOB DATA. + * Called by the archiver when finishing saving BLOB DATA. * * Optional. 
*/ diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c index b297ca049d..c14d813b21 100644 --- a/src/bin/pg_dump/pg_backup_db.c +++ b/src/bin/pg_dump/pg_backup_db.c @@ -541,6 +541,33 @@ CommitTransaction(Archive *AHX) ExecuteSqlCommand(AH, "COMMIT", "could not commit database transaction"); } +/* + * Issue per-blob commands for the large object(s) listed in the TocEntry + * + * The TocEntry's defn string is assumed to consist of large object OIDs, + * one per line. Wrap these in the given SQL command fragments and issue + * the commands. (cmdEnd need not include a semicolon.) + */ +void +IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te, + const char *cmdBegin, const char *cmdEnd) +{ + /* Make a writable copy of the command string */ + char *buf = pg_strdup(te->defn); + char *st; + char *en; + + st = buf; + while ((en = strchr(st, '\n')) != NULL) + { + *en++ = '\0'; + ahprintf(AH, "%s%s%s;\n", cmdBegin, st, cmdEnd); + st = en; + } + ahprintf(AH, "\n"); + pg_free(buf); +} + void DropLOIfExists(ArchiveHandle *AH, Oid oid) { diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c index 16491d6a95..829832586f 100644 --- a/src/bin/pg_dump/pg_backup_directory.c +++ b/src/bin/pg_dump/pg_backup_directory.c @@ -5,8 +5,10 @@ * A directory format dump is a directory, which contains a "toc.dat" file * for the TOC, and a separate file for each data entry, named "<oid>.dat". * Large objects are stored in separate files named "blob_<oid>.dat", - * and there's a plain-text TOC file for them called "blobs.toc". If - * compression is used, each data file is individually compressed and the + * and there's a plain-text TOC file for each BLOBS TOC entry named + * "blobs_<dumpID>.toc" (or just "blobs.toc" in archive versions before 16). + * + * If compression is used, each data file is individually compressed and the * ".gz" suffix is added to the filenames. 
The TOC files are never * compressed by pg_dump, however they are accepted with the .gz suffix too, * in case the user has manually compressed them with 'gzip'. @@ -51,7 +53,7 @@ typedef struct char *directory; CompressFileHandle *dataFH; /* currently open data file */ - CompressFileHandle *LOsTocFH; /* file handle for blobs.toc */ + CompressFileHandle *LOsTocFH; /* file handle for blobs_NNN.toc */ ParallelState *pstate; /* for parallel backup / restore */ } lclContext; @@ -81,7 +83,7 @@ static void _StartLOs(ArchiveHandle *AH, TocEntry *te); static void _StartLO(ArchiveHandle *AH, TocEntry *te, Oid oid); static void _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid); static void _EndLOs(ArchiveHandle *AH, TocEntry *te); -static void _LoadLOs(ArchiveHandle *AH); +static void _LoadLOs(ArchiveHandle *AH, TocEntry *te); static void _PrepParallelRestore(ArchiveHandle *AH); static void _Clone(ArchiveHandle *AH); @@ -232,7 +234,10 @@ _ArchiveEntry(ArchiveHandle *AH, TocEntry *te) tctx = (lclTocEntry *) pg_malloc0(sizeof(lclTocEntry)); if (strcmp(te->desc, "BLOBS") == 0) - tctx->filename = pg_strdup("blobs.toc"); + { + snprintf(fn, MAXPGPATH, "blobs_%d.toc", te->dumpId); + tctx->filename = pg_strdup(fn); + } else if (te->dataDumper) { snprintf(fn, MAXPGPATH, "%d.dat", te->dumpId); @@ -415,7 +420,7 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te) return; if (strcmp(te->desc, "BLOBS") == 0) - _LoadLOs(AH); + _LoadLOs(AH, te); else { char fname[MAXPGPATH]; @@ -426,17 +431,23 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te) } static void -_LoadLOs(ArchiveHandle *AH) +_LoadLOs(ArchiveHandle *AH, TocEntry *te) { Oid oid; lclContext *ctx = (lclContext *) AH->formatData; + lclTocEntry *tctx = (lclTocEntry *) te->formatData; CompressFileHandle *CFH; char tocfname[MAXPGPATH]; char line[MAXPGPATH]; StartRestoreLOs(AH); - setFilePath(AH, tocfname, "blobs.toc"); + /* + * Note: before archive v16, there was always only one BLOBS TOC entry, + * now there can be multiple. 
We don't need to worry what version we are + * reading though, because tctx->filename should be correct either way. + */ + setFilePath(AH, tocfname, tctx->filename); CFH = ctx->LOsTocFH = InitDiscoverCompressFileHandle(tocfname, PG_BINARY_R); @@ -632,7 +643,7 @@ _ReopenArchive(ArchiveHandle *AH) */ /* - * Called by the archiver when starting to save all BLOB DATA (not schema). + * Called by the archiver when starting to save BLOB DATA (not schema). * It is called just prior to the dumper's DataDumper routine. * * We open the large object TOC file here, so that we can append a line to @@ -642,10 +653,11 @@ static void _StartLOs(ArchiveHandle *AH, TocEntry *te) { lclContext *ctx = (lclContext *) AH->formatData; + lclTocEntry *tctx = (lclTocEntry *) te->formatData; pg_compress_specification compression_spec = {0}; char fname[MAXPGPATH]; - setFilePath(AH, fname, "blobs.toc"); + setFilePath(AH, fname, tctx->filename); /* The LO TOC file is never compressed */ compression_spec.algorithm = PG_COMPRESSION_NONE; @@ -690,7 +702,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) pg_fatal("could not close LO data file: %m"); ctx->dataFH = NULL; - /* register the LO in blobs.toc */ + /* register the LO in blobs_NNN.toc */ len = snprintf(buf, sizeof(buf), "%u blob_%u.dat\n", oid, oid); if (!CFH->write_func(buf, len, CFH)) { @@ -703,7 +715,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) } /* - * Called by the archiver when finishing saving all BLOB DATA. + * Called by the archiver when finishing saving BLOB DATA. * * We close the LOs TOC file. */ @@ -795,7 +807,7 @@ _PrepParallelRestore(ArchiveHandle *AH) } /* - * If this is the BLOBS entry, what we stat'd was blobs.toc, which + * If this is a BLOBS entry, what we stat'd was blobs_NNN.toc, which * most likely is a lot smaller than the actual blob data. 
We don't * have a cheap way to estimate how much smaller, but fortunately it * doesn't matter too much as long as we get the LOs processed diff --git a/src/bin/pg_dump/pg_backup_null.c b/src/bin/pg_dump/pg_backup_null.c index 776f057770..a3257f4fc8 100644 --- a/src/bin/pg_dump/pg_backup_null.c +++ b/src/bin/pg_dump/pg_backup_null.c @@ -113,7 +113,7 @@ _EndData(ArchiveHandle *AH, TocEntry *te) } /* - * Called by the archiver when starting to save all BLOB DATA (not schema). + * Called by the archiver when starting to save BLOB DATA (not schema). * This routine should save whatever format-specific information is needed * to read the LOs back into memory. * @@ -170,7 +170,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) } /* - * Called by the archiver when finishing saving all BLOB DATA. + * Called by the archiver when finishing saving BLOB DATA. * * Optional. */ diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c index 4cb9707e63..41ee52b1d6 100644 --- a/src/bin/pg_dump/pg_backup_tar.c +++ b/src/bin/pg_dump/pg_backup_tar.c @@ -94,7 +94,7 @@ typedef struct char *filename; } lclTocEntry; -static void _LoadLOs(ArchiveHandle *AH); +static void _LoadLOs(ArchiveHandle *AH, TocEntry *te); static TAR_MEMBER *tarOpen(ArchiveHandle *AH, const char *filename, char mode); static void tarClose(ArchiveHandle *AH, TAR_MEMBER *th); @@ -634,13 +634,13 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te) } if (strcmp(te->desc, "BLOBS") == 0) - _LoadLOs(AH); + _LoadLOs(AH, te); else _PrintFileData(AH, tctx->filename); } static void -_LoadLOs(ArchiveHandle *AH) +_LoadLOs(ArchiveHandle *AH, TocEntry *te) { Oid oid; lclContext *ctx = (lclContext *) AH->formatData; @@ -651,7 +651,26 @@ _LoadLOs(ArchiveHandle *AH) StartRestoreLOs(AH); - th = tarOpen(AH, NULL, 'r'); /* Open next file */ + /* + * The blobs_NNN.toc or blobs.toc file is fairly useless to us because it + * will appear only after the associated blob_NNN.dat files. 
For archive + * versions >= 16 we can look at the BLOBS entry's te->tag to discover the + * OID of the first blob we want to restore, and then search forward to + * find the appropriate blob_<oid>.dat file. For older versions we rely + * on the knowledge that there was only one BLOBS entry and just search + * for the first blob_<oid>.dat file. Once we find the first blob file to + * restore, restore all blobs until we reach the blobs[_NNN].toc file. + */ + if (AH->version >= K_VERS_1_16) + { + /* We rely on atooid to not complain about nnnn..nnnn tags */ + oid = atooid(te->tag); + snprintf(buf, sizeof(buf), "blob_%u.dat", oid); + th = tarOpen(AH, buf, 'r'); /* Advance to first desired file */ + } + else + th = tarOpen(AH, NULL, 'r'); /* Open next file */ + while (th != NULL) { ctx->FH = th; @@ -681,9 +700,9 @@ _LoadLOs(ArchiveHandle *AH) /* * Once we have found the first LO, stop at the first non-LO entry - * (which will be 'blobs.toc'). This coding would eat all the - * rest of the archive if there are no LOs ... but this function - * shouldn't be called at all in that case. + * (which will be 'blobs[_NNN].toc'). This coding would eat all + * the rest of the archive if there are no LOs ... but this + * function shouldn't be called at all in that case. */ if (foundLO) break; @@ -847,7 +866,7 @@ _scriptOut(ArchiveHandle *AH, const void *buf, size_t len) */ /* - * Called by the archiver when starting to save all BLOB DATA (not schema). + * Called by the archiver when starting to save BLOB DATA (not schema). * This routine should save whatever format-specific information is needed * to read the LOs back into memory. 
* @@ -862,7 +881,7 @@ _StartLOs(ArchiveHandle *AH, TocEntry *te) lclContext *ctx = (lclContext *) AH->formatData; char fname[K_STD_BUF_SIZE]; - sprintf(fname, "blobs.toc"); + sprintf(fname, "blobs_%d.toc", te->dumpId); ctx->loToc = tarOpen(AH, fname, 'w'); } @@ -908,7 +927,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) } /* - * Called by the archiver when finishing saving all BLOB DATA. + * Called by the archiver when finishing saving BLOB DATA. * * Optional. * diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 8c0b5486b9..ecb1156f5e 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -3560,11 +3560,10 @@ getLOs(Archive *fout) { DumpOptions *dopt = fout->dopt; PQExpBuffer loQry = createPQExpBuffer(); - LoInfo *loinfo; - DumpableObject *lodata; PGresult *res; int ntups; int i; + int n; int i_oid; int i_lomowner; int i_lomacl; @@ -3572,11 +3571,15 @@ getLOs(Archive *fout) pg_log_info("reading large objects"); - /* Fetch LO OIDs, and owner/ACL data */ + /* + * Fetch LO OIDs and owner/ACL data. Order the data so that all the blobs + * with the same owner/ACL appear together. + */ appendPQExpBufferStr(loQry, "SELECT oid, lomowner, lomacl, " "acldefault('L', lomowner) AS acldefault " - "FROM pg_largeobject_metadata"); + "FROM pg_largeobject_metadata " + "ORDER BY lomowner, lomacl::pg_catalog.text, oid"); res = ExecuteSqlQuery(fout, loQry->data, PGRES_TUPLES_OK); @@ -3588,30 +3591,72 @@ getLOs(Archive *fout) ntups = PQntuples(res); /* - * Each large object has its own "BLOB" archive entry. + * Group the blobs into suitably-sized groups that have the same owner and + * ACL setting, and build a metadata and a data DumpableObject for each + * group. (If we supported initprivs for blobs, we'd have to insist that + * groups also share initprivs settings, since the DumpableObject only has + * room for one.) i is the index of the first tuple in the current group, + * and n is the number of tuples we include in the group. 
*/ - loinfo = (LoInfo *) pg_malloc(ntups * sizeof(LoInfo)); + for (i = 0; i < ntups; i += n) + { + Oid thisoid = atooid(PQgetvalue(res, i, i_oid)); + char *thisowner = PQgetvalue(res, i, i_lomowner); + char *thisacl = PQgetvalue(res, i, i_lomacl); + LoInfo *loinfo; + DumpableObject *lodata; + char namebuf[64]; + + /* Scan to find first tuple not to be included in group */ + n = 1; + while (n < 1000 && i + n < ntups) + { + if (strcmp(thisowner, PQgetvalue(res, i + n, i_lomowner)) != 0 || + strcmp(thisacl, PQgetvalue(res, i + n, i_lomacl)) != 0) + break; + n++; + } - for (i = 0; i < ntups; i++) - { - loinfo[i].dobj.objType = DO_LARGE_OBJECT; - loinfo[i].dobj.catId.tableoid = LargeObjectRelationId; - loinfo[i].dobj.catId.oid = atooid(PQgetvalue(res, i, i_oid)); - AssignDumpId(&loinfo[i].dobj); + /* Build the metadata DumpableObject */ + loinfo = (LoInfo *) pg_malloc(offsetof(LoInfo, looids) + n * sizeof(Oid)); - loinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_oid)); - loinfo[i].dacl.acl = pg_strdup(PQgetvalue(res, i, i_lomacl)); - loinfo[i].dacl.acldefault = pg_strdup(PQgetvalue(res, i, i_acldefault)); - loinfo[i].dacl.privtype = 0; - loinfo[i].dacl.initprivs = NULL; - loinfo[i].rolname = getRoleName(PQgetvalue(res, i, i_lomowner)); + loinfo->dobj.objType = DO_LARGE_OBJECT; + loinfo->dobj.catId.tableoid = LargeObjectRelationId; + loinfo->dobj.catId.oid = thisoid; + AssignDumpId(&loinfo->dobj); + + if (n > 1) + snprintf(namebuf, sizeof(namebuf), "%u..%u", thisoid, + atooid(PQgetvalue(res, i + n - 1, i_oid))); + else + snprintf(namebuf, sizeof(namebuf), "%u", thisoid); + loinfo->dobj.name = pg_strdup(namebuf); + loinfo->dacl.acl = pg_strdup(thisacl); + loinfo->dacl.acldefault = pg_strdup(PQgetvalue(res, i, i_acldefault)); + loinfo->dacl.privtype = 0; + loinfo->dacl.initprivs = NULL; + loinfo->rolname = getRoleName(thisowner); + loinfo->numlos = n; + loinfo->looids[0] = thisoid; + /* Collect OIDs of the remaining blobs in this group */ + for (int k = 1; k < n; k++) 
+ { + CatalogId extraID; + + loinfo->looids[k] = atooid(PQgetvalue(res, i + k, i_oid)); + + /* Make sure we can look up loinfo by any of the blobs' OIDs */ + extraID.tableoid = LargeObjectRelationId; + extraID.oid = loinfo->looids[k]; + recordAdditionalCatalogID(extraID, &loinfo->dobj); + } /* LOs have data */ - loinfo[i].dobj.components |= DUMP_COMPONENT_DATA; + loinfo->dobj.components |= DUMP_COMPONENT_DATA; - /* Mark whether LO has an ACL */ + /* Mark whether LO group has a non-empty ACL */ if (!PQgetisnull(res, i, i_lomacl)) - loinfo[i].dobj.components |= DUMP_COMPONENT_ACL; + loinfo->dobj.components |= DUMP_COMPONENT_ACL; /* * In binary-upgrade mode for LOs, we do *not* dump out the LO data, @@ -3621,21 +3666,22 @@ getLOs(Archive *fout) * pg_largeobject_metadata, after the dump is restored. */ if (dopt->binary_upgrade) - loinfo[i].dobj.dump &= ~DUMP_COMPONENT_DATA; - } + loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA; - /* - * If we have any large objects, a "BLOBS" archive entry is needed. This - * is just a placeholder for sorting; it carries no data now. - */ - if (ntups > 0) - { + /* + * Create a "BLOBS" data item for the group, too. This is just a + * placeholder for sorting; it carries no data now. 
+ */ lodata = (DumpableObject *) pg_malloc(sizeof(DumpableObject)); lodata->objType = DO_LARGE_OBJECT_DATA; lodata->catId = nilCatalogId; AssignDumpId(lodata); - lodata->name = pg_strdup("BLOBS"); + lodata->name = pg_strdup(namebuf); lodata->components |= DUMP_COMPONENT_DATA; + /* Set up explicit dependency from data to metadata */ + lodata->dependencies = (DumpId *) pg_malloc(sizeof(DumpId)); + lodata->dependencies[0] = loinfo->dobj.dumpId; + lodata->nDeps = lodata->allocDeps = 1; } PQclear(res); @@ -3645,123 +3691,109 @@ getLOs(Archive *fout) /* * dumpLO * - * dump the definition (metadata) of the given large object + * dump the definition (metadata) of the given large object group */ static void dumpLO(Archive *fout, const LoInfo *loinfo) { PQExpBuffer cquery = createPQExpBuffer(); - PQExpBuffer dquery = createPQExpBuffer(); - - appendPQExpBuffer(cquery, - "SELECT pg_catalog.lo_create('%s');\n", - loinfo->dobj.name); - appendPQExpBuffer(dquery, - "SELECT pg_catalog.lo_unlink('%s');\n", - loinfo->dobj.name); + /* + * The "definition" is just a newline-separated list of OIDs. We need to + * put something into the dropStmt too, but it can just be a comment. 
+ */ + for (int i = 0; i < loinfo->numlos; i++) + appendPQExpBuffer(cquery, "%u\n", loinfo->looids[i]); if (loinfo->dobj.dump & DUMP_COMPONENT_DEFINITION) ArchiveEntry(fout, loinfo->dobj.catId, loinfo->dobj.dumpId, ARCHIVE_OPTS(.tag = loinfo->dobj.name, .owner = loinfo->rolname, - .description = "BLOB", + .description = "BLOB METADATA", .section = SECTION_PRE_DATA, .createStmt = cquery->data, - .dropStmt = dquery->data)); - - /* Dump comment if any */ - if (loinfo->dobj.dump & DUMP_COMPONENT_COMMENT) - dumpComment(fout, "LARGE OBJECT", loinfo->dobj.name, - NULL, loinfo->rolname, - loinfo->dobj.catId, 0, loinfo->dobj.dumpId); - - /* Dump security label if any */ - if (loinfo->dobj.dump & DUMP_COMPONENT_SECLABEL) - dumpSecLabel(fout, "LARGE OBJECT", loinfo->dobj.name, - NULL, loinfo->rolname, - loinfo->dobj.catId, 0, loinfo->dobj.dumpId); - - /* Dump ACL if any */ - if (loinfo->dobj.dump & DUMP_COMPONENT_ACL) - dumpACL(fout, loinfo->dobj.dumpId, InvalidDumpId, "LARGE OBJECT", - loinfo->dobj.name, NULL, - NULL, loinfo->rolname, &loinfo->dacl); + .dropStmt = "-- dummy")); + + /* + * Dump per-blob comments, seclabels, and ACLs if any. We assume these + * are rare enough that it's okay to generate retail TOC entries for them. 
+ */ + if (loinfo->dobj.dump & (DUMP_COMPONENT_COMMENT | + DUMP_COMPONENT_SECLABEL | + DUMP_COMPONENT_ACL)) + { + for (int i = 0; i < loinfo->numlos; i++) + { + CatalogId catId; + char namebuf[32]; + + /* Build identifying info for this blob */ + catId.tableoid = loinfo->dobj.catId.tableoid; + catId.oid = loinfo->looids[i]; + snprintf(namebuf, sizeof(namebuf), "%u", loinfo->looids[i]); + + if (loinfo->dobj.dump & DUMP_COMPONENT_COMMENT) + dumpComment(fout, "LARGE OBJECT", namebuf, + NULL, loinfo->rolname, + catId, 0, loinfo->dobj.dumpId); + + if (loinfo->dobj.dump & DUMP_COMPONENT_SECLABEL) + dumpSecLabel(fout, "LARGE OBJECT", namebuf, + NULL, loinfo->rolname, + catId, 0, loinfo->dobj.dumpId); + + if (loinfo->dobj.dump & DUMP_COMPONENT_ACL) + dumpACL(fout, loinfo->dobj.dumpId, InvalidDumpId, + "LARGE OBJECT", namebuf, NULL, + NULL, loinfo->rolname, &loinfo->dacl); + } + } destroyPQExpBuffer(cquery); - destroyPQExpBuffer(dquery); } /* * dumpLOs: - * dump the data contents of all large objects + * dump the data contents of the large objects in the given group */ static int dumpLOs(Archive *fout, const void *arg) { - const char *loQry; - const char *loFetchQry; + const LoInfo *loinfo = (const LoInfo *) arg; PGconn *conn = GetConnection(fout); - PGresult *res; char buf[LOBBUFSIZE]; - int ntups; - int i; - int cnt; - - pg_log_info("saving large objects"); - /* - * Currently, we re-fetch all LO OIDs using a cursor. Consider scanning - * the already-in-memory dumpable objects instead... 
- */ - loQry = - "DECLARE looid CURSOR FOR " - "SELECT oid FROM pg_largeobject_metadata ORDER BY 1"; + pg_log_info("saving large objects \"%s\"", loinfo->dobj.name); - ExecuteSqlStatement(fout, loQry); + for (int i = 0; i < loinfo->numlos; i++) + { + Oid loOid = loinfo->looids[i]; + int loFd; + int cnt; - /* Command to fetch from cursor */ - loFetchQry = "FETCH 1000 IN looid"; + /* Open the LO */ + loFd = lo_open(conn, loOid, INV_READ); + if (loFd == -1) + pg_fatal("could not open large object %u: %s", + loOid, PQerrorMessage(conn)); - do - { - /* Do a fetch */ - res = ExecuteSqlQuery(fout, loFetchQry, PGRES_TUPLES_OK); + StartLO(fout, loOid); - /* Process the tuples, if any */ - ntups = PQntuples(res); - for (i = 0; i < ntups; i++) + /* Now read it in chunks, sending data to archive */ + do { - Oid loOid; - int loFd; - - loOid = atooid(PQgetvalue(res, i, 0)); - /* Open the LO */ - loFd = lo_open(conn, loOid, INV_READ); - if (loFd == -1) - pg_fatal("could not open large object %u: %s", + cnt = lo_read(conn, loFd, buf, LOBBUFSIZE); + if (cnt < 0) + pg_fatal("error reading large object %u: %s", loOid, PQerrorMessage(conn)); - StartLO(fout, loOid); - - /* Now read it in chunks, sending data to archive */ - do - { - cnt = lo_read(conn, loFd, buf, LOBBUFSIZE); - if (cnt < 0) - pg_fatal("error reading large object %u: %s", - loOid, PQerrorMessage(conn)); - - WriteData(fout, buf, cnt); - } while (cnt > 0); - - lo_close(conn, loFd); + WriteData(fout, buf, cnt); + } while (cnt > 0); - EndLO(fout, loOid); - } + lo_close(conn, loFd); - PQclear(res); - } while (ntups > 0); + EndLO(fout, loOid); + } return 1; } @@ -10413,28 +10445,34 @@ dumpDumpableObject(Archive *fout, DumpableObject *dobj) case DO_LARGE_OBJECT_DATA: if (dobj->dump & DUMP_COMPONENT_DATA) { + LoInfo *loinfo; TocEntry *te; + loinfo = (LoInfo *) findObjectByDumpId(dobj->dependencies[0]); + if (loinfo == NULL) + pg_fatal("missing metadata for large objects \"%s\"", + dobj->name); + te = ArchiveEntry(fout, 
dobj->catId, dobj->dumpId, ARCHIVE_OPTS(.tag = dobj->name, + .owner = loinfo->rolname, .description = "BLOBS", .section = SECTION_DATA, - .dumpFn = dumpLOs)); + .deps = dobj->dependencies, + .nDeps = dobj->nDeps, + .dumpFn = dumpLOs, + .dumpArg = loinfo)); /* * Set the TocEntry's dataLength in case we are doing a * parallel dump and want to order dump jobs by table size. * (We need some size estimate for every TocEntry with a * DataDumper function.) We don't currently have any cheap - * way to estimate the size of LOs, but it doesn't matter; - * let's just set the size to a large value so parallel dumps - * will launch this job first. If there's lots of LOs, we - * win, and if there aren't, we don't lose much. (If you want - * to improve on this, really what you should be thinking - * about is allowing LO dumping to be parallelized, not just - * getting a smarter estimate for the single TOC entry.) + * way to estimate the size of LOs, but fortunately it doesn't + * matter too much as long as we get large batches of LOs + * processed reasonably early. Assume 8K per blob. */ - te->dataLength = INT_MAX; + te->dataLength = loinfo->numlos * (pgoff_t) 8192; } break; case DO_POLICY: diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h index 2fe3cbed9a..9105210693 100644 --- a/src/bin/pg_dump/pg_dump.h +++ b/src/bin/pg_dump/pg_dump.h @@ -589,11 +589,21 @@ typedef struct _defaultACLInfo char defaclobjtype; } DefaultACLInfo; +/* + * LoInfo represents a group of large objects (blobs) that share the same + * owner and ACL setting. dobj.components has the DUMP_COMPONENT_COMMENT bit + * set if any blob in the group has a comment; similarly for sec labels. + * If there are many blobs with the same owner/ACL, we can divide them into + * multiple LoInfo groups, which will each spawn a BLOB METADATA and a BLOBS + * (data) TOC entry. This allows more parallelism during restore. 
+ */ typedef struct _loInfo { DumpableObject dobj; DumpableAcl dacl; const char *rolname; + int numlos; + Oid looids[FLEXIBLE_ARRAY_MEMBER]; } LoInfo; /* @@ -680,6 +690,7 @@ typedef struct _SubscriptionInfo extern TableInfo *getSchemaData(Archive *fout, int *numTablesPtr); extern void AssignDumpId(DumpableObject *dobj); +extern void recordAdditionalCatalogID(CatalogId catId, DumpableObject *dobj); extern DumpId createDumpId(void); extern DumpId getMaxDumpId(void); extern DumpableObject *findObjectByDumpId(DumpId dumpId); diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index eb3ec534b4..76548561c8 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -109,11 +109,11 @@ my %pgdump_runs = ( '--format=directory', '--compress=gzip:1', "--file=$tempdir/compression_gzip_dir", 'postgres', ], - # Give coverage for manually compressed blob.toc files during + # Give coverage for manually compressed blobs.toc files during # restore. compress_cmd => { program => $ENV{'GZIP_PROGRAM'}, - args => [ '-f', "$tempdir/compression_gzip_dir/blobs.toc", ], + args => [ '-f', "$tempdir/compression_gzip_dir/blobs_*.toc", ], }, # Verify that only data files were compressed glob_patterns => [ @@ -172,16 +172,6 @@ my %pgdump_runs = ( '--format=directory', '--compress=lz4:1', "--file=$tempdir/compression_lz4_dir", 'postgres', ], - # Give coverage for manually compressed blob.toc files during - # restore. 
- compress_cmd => { - program => $ENV{'LZ4'}, - args => [ - '-z', '-f', '--rm', - "$tempdir/compression_lz4_dir/blobs.toc", - "$tempdir/compression_lz4_dir/blobs.toc.lz4", - ], - }, # Verify that data files were compressed glob_patterns => [ "$tempdir/compression_lz4_dir/toc.dat", @@ -242,14 +232,13 @@ my %pgdump_runs = ( '--format=directory', '--compress=zstd:1', "--file=$tempdir/compression_zstd_dir", 'postgres', ], - # Give coverage for manually compressed blob.toc files during + # Give coverage for manually compressed blobs.toc files during # restore. compress_cmd => { program => $ENV{'ZSTD'}, args => [ '-z', '-f', - '--rm', "$tempdir/compression_zstd_dir/blobs.toc", - "-o", "$tempdir/compression_zstd_dir/blobs.toc.zst", + '--rm', "$tempdir/compression_zstd_dir/blobs_*.toc", ], }, # Verify that data files were compressed @@ -413,7 +402,7 @@ my %pgdump_runs = ( }, glob_patterns => [ "$tempdir/defaults_dir_format/toc.dat", - "$tempdir/defaults_dir_format/blobs.toc", + "$tempdir/defaults_dir_format/blobs_*.toc", $supports_gzip ? "$tempdir/defaults_dir_format/*.dat.gz" : "$tempdir/defaults_dir_format/*.dat", ], @@ -4821,8 +4810,13 @@ foreach my $run (sort keys %pgdump_runs) # not defined. next if (!defined($compress_program) || $compress_program eq ''); - my @full_compress_cmd = - ($compress_cmd->{program}, @{ $compress_cmd->{args} }); + # Arguments may require globbing. + my @full_compress_cmd = ($compress_program); + foreach my $arg (@{ $compress_cmd->{args} }) + { + push @full_compress_cmd, glob($arg); + } + command_ok(\@full_compress_cmd, "$run: compression commands"); } -- 2.39.3 From 17ace22d028b24a89561e76f94f9defd92da9e8d Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Wed, 20 Dec 2023 16:56:54 -0500 Subject: [PATCH v9 3/4] Move BLOBS METADATA TOC entries into SECTION_DATA. 
Commit c0d5be5d6 put the new BLOB metadata TOC entries into SECTION_PRE_DATA, which perhaps is defensible in some ways, but it's a rather odd choice considering that we go out of our way to treat blobs as data. Moreover, because parallel restore handles the PRE_DATA section serially, this means we're only getting part of the parallelism speedup we could hope for. Moving these entries into SECTION_DATA means that we can parallelize the lo_create calls not only the data loading when there are many blobs. The dependencies established by the previous patch ensure that we won't try to load data for a blob we've not yet created. --- src/bin/pg_dump/pg_dump.c | 4 ++-- src/bin/pg_dump/t/002_pg_dump.pl | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index ecb1156f5e..4b34638cb1 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -3710,7 +3710,7 @@ dumpLO(Archive *fout, const LoInfo *loinfo) ARCHIVE_OPTS(.tag = loinfo->dobj.name, .owner = loinfo->rolname, .description = "BLOB METADATA", - .section = SECTION_PRE_DATA, + .section = SECTION_DATA, .createStmt = cquery->data, .dropStmt = "-- dummy")); @@ -18534,12 +18534,12 @@ addBoundaryDependencies(DumpableObject **dobjs, int numObjs, case DO_FDW: case DO_FOREIGN_SERVER: case DO_TRANSFORM: - case DO_LARGE_OBJECT: /* Pre-data objects: must come before the pre-data boundary */ addObjectDependency(preDataBound, dobj->dumpId); break; case DO_TABLE_DATA: case DO_SEQUENCE_SET: + case DO_LARGE_OBJECT: case DO_LARGE_OBJECT_DATA: /* Data objects: must come between the boundaries */ addObjectDependency(dobj, preDataBound->dumpId); diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index 76548561c8..f0ea6e3dd8 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -912,7 +912,7 @@ my %tests = ( column_inserts => 1, data_only => 1, inserts => 1, - section_pre_data => 1, + 
section_data => 1, test_schema_plus_large_objects => 1, }, unlike => { @@ -1289,7 +1289,7 @@ my %tests = ( column_inserts => 1, data_only => 1, inserts => 1, - section_pre_data => 1, + section_data => 1, test_schema_plus_large_objects => 1, }, unlike => { @@ -1497,7 +1497,7 @@ my %tests = ( column_inserts => 1, data_only => 1, inserts => 1, - section_pre_data => 1, + section_data => 1, test_schema_plus_large_objects => 1, }, unlike => { @@ -4241,7 +4241,7 @@ my %tests = ( column_inserts => 1, data_only => 1, inserts => 1, - section_pre_data => 1, + section_data => 1, test_schema_plus_large_objects => 1, binary_upgrade => 1, }, -- 2.39.3 From 3ab3558a236e6ad17fe48087aac3cabb4b02aa3e Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Wed, 20 Dec 2023 17:42:39 -0500 Subject: [PATCH v9 4/4] Invent --transaction-size option for pg_restore. This patch allows pg_restore to wrap its commands into transaction blocks, somewhat like --single-transaction, except that we commit and start a new block after every N objects. Using this mode with a size limit of 1000 or so objects greatly reduces the number of transactions consumed by the restore, while preventing any one transaction from taking enough locks to overrun the receiving server's shared lock table. (A value of 1000 works well with the default lock table size of around 6400 locks. Higher --transaction-size values can be used if one has increased the receiving server's lock table size.) In this patch I have just hard-wired pg_upgrade to use --transaction-size 1000. Perhaps there would be value in adding another pg_upgrade option to allow user control of that, but I'm unsure that it's worth the trouble; I think few users would use it, and any who did would see not that much benefit. However, we might need to adjust the logic to make the size be 1000 divided by the number of parallel restore jobs allowed. 
--- doc/src/sgml/ref/pg_restore.sgml | 24 +++++ src/bin/pg_dump/pg_backup.h | 4 +- src/bin/pg_dump/pg_backup_archiver.c | 139 +++++++++++++++++++++++++-- src/bin/pg_dump/pg_backup_archiver.h | 3 + src/bin/pg_dump/pg_backup_db.c | 18 ++++ src/bin/pg_dump/pg_restore.c | 15 ++- src/bin/pg_upgrade/pg_upgrade.c | 2 + 7 files changed, 197 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml index 1a23874da6..2e3ba80258 100644 --- a/doc/src/sgml/ref/pg_restore.sgml +++ b/doc/src/sgml/ref/pg_restore.sgml @@ -786,6 +786,30 @@ PostgreSQL documentation </listitem> </varlistentry> + <varlistentry> + <term><option>--transaction-size=<replaceable class="parameter">N</replaceable></option></term> + <listitem> + <para> + Execute the restore as a series of transactions, each processing + up to <replaceable class="parameter">N</replaceable> database + objects. This option implies <option>--exit-on-error</option>. + </para> + <para> + <option>--transaction-size</option> offers an intermediate choice + between the default behavior (one transaction per SQL command) + and <option>-1</option>/<option>--single-transaction</option> + (one transaction for all restored objects). + While <option>--single-transaction</option> has the least + overhead, it may be impractical for large databases because the + transaction will take a lock on each restored object, possibly + exhausting the server's lock table space. + Using <option>--transaction-size</option> with a size of a few + thousand objects offers nearly the same performance benefits while + capping the amount of lock table space needed. 
+ </para> + </listitem> + </varlistentry> + <varlistentry> <term><option>--use-set-session-authorization</option></term> <listitem> diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h index 9ef2f2017e..fbf5f1c515 100644 --- a/src/bin/pg_dump/pg_backup.h +++ b/src/bin/pg_dump/pg_backup.h @@ -149,7 +149,9 @@ typedef struct _restoreOptions * compression */ int suppressDumpWarnings; /* Suppress output of WARNING entries * to stderr */ - bool single_txn; + + bool single_txn; /* restore all TOCs in one transaction */ + int txn_size; /* restore this many TOCs per txn, if > 0 */ bool *idWanted; /* array showing which dump IDs to emit */ int enable_row_security; diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 73b9972da4..ec74846998 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -502,7 +502,28 @@ RestoreArchive(Archive *AHX) /* Otherwise, drop anything that's selected and has a dropStmt */ if (((te->reqs & (REQ_SCHEMA | REQ_DATA)) != 0) && te->dropStmt) { + bool not_allowed_in_txn = false; + pg_log_info("dropping %s %s", te->desc, te->tag); + + /* + * In --transaction-size mode, we have to temporarily exit our + * transaction block to drop objects that can't be dropped + * within a transaction. + */ + if (ropt->txn_size > 0) + { + if (strcmp(te->desc, "DATABASE") == 0 || + strcmp(te->desc, "DATABASE PROPERTIES") == 0) + { + not_allowed_in_txn = true; + if (AH->connection) + CommitTransaction(AHX); + else + ahprintf(AH, "COMMIT;\n"); + } + } + /* Select owner and schema as necessary */ _becomeOwner(AH, te); _selectOutputSchema(AH, te->namespace); @@ -628,6 +649,33 @@ RestoreArchive(Archive *AHX) } } } + + /* + * In --transaction-size mode, re-establish the transaction + * block if needed; otherwise, commit after every N drops. 
+ */ + if (ropt->txn_size > 0) + { + if (not_allowed_in_txn) + { + if (AH->connection) + StartTransaction(AHX); + else + ahprintf(AH, "BEGIN;\n"); + AH->txnCount = 0; + } + else if (++AH->txnCount >= ropt->txn_size) + { + if (AH->connection) + { + CommitTransaction(AHX); + StartTransaction(AHX); + } + else + ahprintf(AH, "COMMIT;\nBEGIN;\n"); + AH->txnCount = 0; + } + } } } @@ -724,7 +772,11 @@ RestoreArchive(Archive *AHX) } } - if (ropt->single_txn) + /* + * Close out any persistent transaction we may have. While these two + * cases are started in different places, we can end both cases here. + */ + if (ropt->single_txn || ropt->txn_size > 0) { if (AH->connection) CommitTransaction(AHX); @@ -785,6 +837,25 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel) */ if ((reqs & REQ_SCHEMA) != 0) { + bool object_is_db = false; + + /* + * In --transaction-size mode, must exit our transaction block to + * create a database or set its properties. + */ + if (strcmp(te->desc, "DATABASE") == 0 || + strcmp(te->desc, "DATABASE PROPERTIES") == 0) + { + object_is_db = true; + if (ropt->txn_size > 0) + { + if (AH->connection) + CommitTransaction(&AH->public); + else + ahprintf(AH, "COMMIT;\n\n"); + } + } + /* Show namespace in log message if available */ if (te->namespace) pg_log_info("creating %s \"%s.%s\"", @@ -835,10 +906,10 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel) /* * If we created a DB, connect to it. Also, if we changed DB * properties, reconnect to ensure that relevant GUC settings are - * applied to our session. + * applied to our session. (That also restarts the transaction block + * in --transaction-size mode.) 
*/ - if (strcmp(te->desc, "DATABASE") == 0 || - strcmp(te->desc, "DATABASE PROPERTIES") == 0) + if (object_is_db) { pg_log_info("connecting to new database \"%s\"", te->tag); _reconnectToDB(AH, te->tag); @@ -964,6 +1035,25 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel) } } + /* + * If we emitted anything for this TOC entry, that counts as one action + * against the transaction-size limit. Commit if it's time to. + */ + if ((reqs & (REQ_SCHEMA | REQ_DATA)) != 0 && ropt->txn_size > 0) + { + if (++AH->txnCount >= ropt->txn_size) + { + if (AH->connection) + { + CommitTransaction(&AH->public); + StartTransaction(&AH->public); + } + else + ahprintf(AH, "COMMIT;\nBEGIN;\n\n"); + AH->txnCount = 0; + } + } + if (AH->public.n_errors > 0 && status == WORKER_OK) status = WORKER_IGNORED_ERRORS; @@ -1310,7 +1400,12 @@ StartRestoreLOs(ArchiveHandle *AH) { RestoreOptions *ropt = AH->public.ropt; - if (!ropt->single_txn) + /* + * LOs must be restored within a transaction block, since we need the LO + * handle to stay open while we write it. Establish a transaction unless + * there's one being used globally. + */ + if (!(ropt->single_txn || ropt->txn_size > 0)) { if (AH->connection) StartTransaction(&AH->public); @@ -1329,7 +1424,7 @@ EndRestoreLOs(ArchiveHandle *AH) { RestoreOptions *ropt = AH->public.ropt; - if (!ropt->single_txn) + if (!(ropt->single_txn || ropt->txn_size > 0)) { if (AH->connection) CommitTransaction(&AH->public); @@ -3170,6 +3265,19 @@ _doSetFixedOutputState(ArchiveHandle *AH) else ahprintf(AH, "SET row_security = off;\n"); + /* + * In --transaction-size mode, we should always be in a transaction when + * we begin to restore objects. 
+ */ + if (ropt && ropt->txn_size > 0) + { + if (AH->connection) + StartTransaction(&AH->public); + else + ahprintf(AH, "\nBEGIN;\n"); + AH->txnCount = 0; + } + ahprintf(AH, "\n"); } @@ -4033,6 +4141,14 @@ restore_toc_entries_prefork(ArchiveHandle *AH, TocEntry *pending_list) } } + /* + * In --transaction-size mode, we must commit the open transaction before + * dropping the database connection. This also ensures that child workers + * can see the objects we've created so far. + */ + if (AH->public.ropt->txn_size > 0) + CommitTransaction(&AH->public); + /* * Now close parent connection in prep for parallel steps. We do this * mainly to ensure that we don't exceed the specified number of parallel @@ -4772,6 +4888,10 @@ CloneArchive(ArchiveHandle *AH) clone = (ArchiveHandle *) pg_malloc(sizeof(ArchiveHandle)); memcpy(clone, AH, sizeof(ArchiveHandle)); + /* Likewise flat-copy the RestoreOptions, so we can alter them locally */ + clone->public.ropt = (RestoreOptions *) pg_malloc(sizeof(RestoreOptions)); + memcpy(clone->public.ropt, AH->public.ropt, sizeof(RestoreOptions)); + /* Handle format-independent fields */ memset(&(clone->sqlparse), 0, sizeof(clone->sqlparse)); @@ -4793,6 +4913,13 @@ CloneArchive(ArchiveHandle *AH) /* clones should not share lo_buf */ clone->lo_buf = NULL; + /* + * Clone connections disregard --transaction-size; they must commit after + * each command so that the results are immediately visible to other + * workers. + */ + clone->public.ropt->txn_size = 0; + /* * Connect our new clone object to the database, using the same connection * parameters used for the original connection. 
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h index e4dd395582..1b9f142dea 100644 --- a/src/bin/pg_dump/pg_backup_archiver.h +++ b/src/bin/pg_dump/pg_backup_archiver.h @@ -324,6 +324,9 @@ struct _archiveHandle char *currTablespace; /* current tablespace, or NULL */ char *currTableAm; /* current table access method, or NULL */ + /* in --transaction-size mode, this counts objects emitted in cur xact */ + int txnCount; + void *lo_buf; size_t lo_buf_used; size_t lo_buf_size; diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c index c14d813b21..6b3bf174f2 100644 --- a/src/bin/pg_dump/pg_backup_db.c +++ b/src/bin/pg_dump/pg_backup_db.c @@ -554,6 +554,7 @@ IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te, { /* Make a writable copy of the command string */ char *buf = pg_strdup(te->defn); + RestoreOptions *ropt = AH->public.ropt; char *st; char *en; @@ -562,6 +563,23 @@ IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te, { *en++ = '\0'; ahprintf(AH, "%s%s%s;\n", cmdBegin, st, cmdEnd); + + /* In --transaction-size mode, count each command as an action */ + if (ropt && ropt->txn_size > 0) + { + if (++AH->txnCount >= ropt->txn_size) + { + if (AH->connection) + { + CommitTransaction(&AH->public); + StartTransaction(&AH->public); + } + else + ahprintf(AH, "COMMIT;\nBEGIN;\n\n"); + AH->txnCount = 0; + } + } + st = en; } ahprintf(AH, "\n"); diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c index c3beacdec1..5ea78cf7cc 100644 --- a/src/bin/pg_dump/pg_restore.c +++ b/src/bin/pg_dump/pg_restore.c @@ -120,6 +120,7 @@ main(int argc, char **argv) {"role", required_argument, NULL, 2}, {"section", required_argument, NULL, 3}, {"strict-names", no_argument, &strict_names, 1}, + {"transaction-size", required_argument, NULL, 5}, {"use-set-session-authorization", no_argument, &use_setsessauth, 1}, {"no-comments", no_argument, &no_comments, 1}, {"no-publications", no_argument, &no_publications, 1}, 
@@ -289,10 +290,18 @@ main(int argc, char **argv) set_dump_section(optarg, &(opts->dumpSections)); break; - case 4: + case 4: /* filter */ read_restore_filters(optarg, opts); break; + case 5: /* transaction-size */ + if (!option_parse_int(optarg, "--transaction-size", + 1, INT_MAX, + &opts->txn_size)) + exit(1); + opts->exit_on_error = true; + break; + default: /* getopt_long already emitted a complaint */ pg_log_error_hint("Try \"%s --help\" for more information.", progname); @@ -337,6 +346,9 @@ main(int argc, char **argv) if (opts->dataOnly && opts->dropSchema) pg_fatal("options -c/--clean and -a/--data-only cannot be used together"); + if (opts->single_txn && opts->txn_size > 0) + pg_fatal("options -1/--single-transaction and --transaction-size cannot be used together"); + /* * -C is not compatible with -1, because we can't create a database inside * a transaction block. @@ -484,6 +496,7 @@ usage(const char *progname) printf(_(" --section=SECTION restore named section (pre-data, data, or post-data)\n")); printf(_(" --strict-names require table and/or schema include patterns to\n" " match at least one entity each\n")); + printf(_(" --transaction-size=N commit after every N objects\n")); printf(_(" --use-set-session-authorization\n" " use SET SESSION AUTHORIZATION commands instead of\n" " ALTER OWNER commands to set ownership\n")); diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c index 3960af4036..5cfd2282e1 100644 --- a/src/bin/pg_upgrade/pg_upgrade.c +++ b/src/bin/pg_upgrade/pg_upgrade.c @@ -548,6 +548,7 @@ create_new_objects(void) true, true, "\"%s/pg_restore\" %s %s --exit-on-error --verbose " + "--transaction-size=1000 " "--dbname postgres \"%s/%s\"", new_cluster.bindir, cluster_conn_opts(&new_cluster), @@ -586,6 +587,7 @@ create_new_objects(void) parallel_exec_prog(log_file_name, NULL, "\"%s/pg_restore\" %s %s --exit-on-error --verbose " + "--transaction-size=1000 " "--dbname template1 \"%s/%s\"", new_cluster.bindir, 
cluster_conn_opts(&new_cluster), -- 2.39.3
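To make the 0004 patch's commit cadence concrete, here is a minimal Python
sketch (not part of the patch; the function name and counts are illustrative)
of the counter-and-commit logic that RestoreArchive and IssueCommandPerBlob
apply in --transaction-size mode:

```python
def plan_commits(num_objects, txn_size):
    """Simulate the --transaction-size cadence: each restored object
    bumps a counter (AH->txnCount in the patch); once it reaches
    txn_size, a COMMIT;BEGIN pair is issued and the counter resets."""
    txn_count = 0        # objects emitted in the current transaction
    transactions = 1     # the BEGIN issued up front
    peak_objects = 0     # worst case objects (hence roughly locks) held at once
    for _ in range(num_objects):
        txn_count += 1
        peak_objects = max(peak_objects, txn_count)
        if txn_count >= txn_size:
            transactions += 1    # COMMIT;\nBEGIN;
            txn_count = 0
    return transactions, peak_objects

# 100k objects with the hard-wired pg_upgrade batch of 1000:
txns, peak = plan_commits(100_000, 1000)
```

Under these assumptions, 100,000 objects restore in about 101 transactions
while the server never holds locks for more than ~1000 objects at a time,
versus 100,000 single-statement transactions today or one lock-hungry
transaction under --single-transaction.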
On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote:
> I have spent some more effort in this area and developed a patch
> series that I think addresses all of the performance issues that
> we've discussed in this thread, both for pg_upgrade and more
> general use of pg_dump/pg_restore. Concretely, it absorbs
> the pg_restore --transaction-size switch that I proposed before
> to cut the number of transactions needed during restore, and
> rearranges the representation of BLOB-related TOC entries to
> reduce the client-side memory requirements, and fixes some
> ancient mistakes that prevent both selective restore of BLOBs
> and parallel restore of BLOBs.
>
> As a demonstration, I made a database containing 100K empty blobs,
> and measured the time needed to dump/restore that using -Fd
> and -j 10. HEAD doesn't get any useful parallelism on blobs,
> but with this patch series we do:
>
>              dump    restore
> HEAD:        14sec   15sec
> after 0002:   7sec   10sec
> after 0003:   7sec    3sec

Wow, thanks for putting together these patches. I intend to help review,
but I'm not sure I'll find much time to do so before the new year.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
> Wow, thanks for putting together these patches. I intend to help review,

Thanks!

> but I'm not sure I'll find much time to do so before the new year.

There's no urgency, surely. If we can get these in during the
January CF, I'll be happy.

			regards, tom lane
On Thu, 21 Dec 2023 at 10:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I have spent some more effort in this area and developed a patch
> series that I think addresses all of the performance issues that
> we've discussed in this thread, both for pg_upgrade and more
> general use of pg_dump/pg_restore.
Applying all 4 patches, I also see good performance improvement.
With more Large Objects, although pg_dump improved significantly,
pg_restore is now comfortably an order of magnitude faster.
pg_dump times (seconds):
NumLOs dump-patch0004 dump-HEAD improvement (%)
1 0.09 0.09 ~
10 0.10 0.12 ~
100 0.12 0.12 ~
1,000 0.41 0.44 ~
10,000 3 5 76%
100,000 35 47 36%
1,000,000 111 251 126%
pg_restore times (seconds):
NumLOs restore-patch0004 restore-HEAD improvement (%)
1 0.02 0.02 ~
10 0.03 0.03 ~
100 0.13 0.12 ~
1,000 0.98 0.97 ~
10,000 2 9 ~5x
100,000 6 93 13x
1,000,000 53 973 17x
Test details:
- pg_dump -Fd -j32 / pg_restore -j32
- 32vCPU / Ubuntu 20.04 / 260GB Memory / r6id.8xlarge
- Client & Server on same machine
- Empty LOs / Empty ACLs
- HEAD = 7d7ef075d2b3f3bac4db323c2a47fb15a4a9a817
- See attached graphs
IMHO the knob (for configuring batch size) is a non-blocker. The
default (1k) here is already way better than what we have today.

Look forward to feedback on the tests, or I'll continue testing
whether ACLs / non-empty LOs etc. adversely affect these numbers.
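As a back-of-the-envelope check on why ~1000 is a comfortable default (a
sketch only, assuming stock GUC values and the documented lock-table sizing
formula), the shared lock table holds roughly max_locks_per_transaction *
(max_connections + max_prepared_transactions) entries:

```python
def lock_table_slots(max_locks_per_transaction=64,
                     max_connections=100,
                     max_prepared_transactions=0):
    # Approximate capacity of the server's shared lock table, per the
    # documented sizing formula; defaults here are the stock GUC values.
    return max_locks_per_transaction * (max_connections +
                                        max_prepared_transactions)

stock = lock_table_slots()   # ~6400 slots out of the box
batch = 1000                 # the hard-wired pg_upgrade --transaction-size
```

A batch of 1000 therefore fits comfortably within a default-configured
server's ~6400 slots, whereas restoring everything in one transaction
requires configurations like the 530,000-slot one (5300 * 100) discussed
earlier in the thread.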
-
Robins Tharakan
Amazon Web Services
Robins Tharakan <tharakan@gmail.com> writes:
> Applying all 4 patches, I also see good performance improvement.
> With more Large Objects, although pg_dump improved significantly,
> pg_restore is now comfortably an order of magnitude faster.

Yeah. The key thing here is that pg_dump can only parallelize
the data transfer, while (with 0004) pg_restore can parallelize
large object creation and owner-setting as well as data transfer.
I don't see any simple way to improve that on the dump side,
but I'm not sure we need to. Zillions of empty objects is not
really the use case to worry about. I suspect that a more realistic
case with moderate amounts of data in the blobs would make pg_dump
look better.

			regards, tom lane
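That asymmetry can be approximated with Amdahl's law; the serial fractions
below are purely illustrative placeholders, not measurements from these tests:

```python
def amdahl_speedup(serial_fraction, workers):
    """Upper bound on speedup when a serial_fraction of the work
    cannot be spread across the given number of workers."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# If lo_create and owner-setting stayed serialized (pre-0003/0004) and
# made up, say, half the restore work, 10 workers could gain at most ~1.8x;
# parallelizing that work too pushes the bound toward the worker count.
bound_serialized = amdahl_speedup(0.5, 10)
bound_parallel = amdahl_speedup(0.0, 10)
```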
On Thu, 28 Dec 2023 at 01:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robins Tharakan <tharakan@gmail.com> writes:
> Applying all 4 patches, I also see good performance improvement.
> With more Large Objects, although pg_dump improved significantly,
> pg_restore is now comfortably an order of magnitude faster.
> Yeah. The key thing here is that pg_dump can only parallelize
> the data transfer, while (with 0004) pg_restore can parallelize
> large object creation and owner-setting as well as data transfer.
> I don't see any simple way to improve that on the dump side,
> but I'm not sure we need to. Zillions of empty objects is not
> really the use case to worry about. I suspect that a more realistic
> case with moderate amounts of data in the blobs would make pg_dump
> look better.
Thanks for elaborating, and yes pg_dump times do reflect that
expectation. The first test involved a fixed number (32k) of
Large Objects (LOs) with varying sizes - I chose that number
intentionally since this was being tested on a 32vCPU instance
and the patch employs 1k batches.
We again see that pg_restore is an order of magnitude faster.
LO Size (bytes) restore-HEAD restore-patched improvement (Nx)
1 24.182 1.4 17x
10 24.741 1.5 17x
100 24.574 1.6 15x
1,000 25.314 1.7 15x
10,000 25.644 1.7 15x
100,000 50.046 4.3 12x
1,000,000 281.549 30.0 9x
pg_dump also sees improvements. Really small sized LOs
see a decent ~20% improvement which grows considerably as LOs
get bigger (beyond ~10-100kb).
LO Size (bytes) dump-HEAD dump-patched improvement (%)
1 12.9 10.7 18%
10 12.9 10.4 19%
100 12.8 10.3 20%
1,000 13.0 10.3 21%
10,000 14.2 10.3 27%
100,000 32.8 11.5 65%
1,000,000 211.8 23.6 89%
To test pg_restore scaling, 1 Million LOs (100kb each)
were created and pg_restore times tested for increasing
concurrency (on a 192vCPU instance). We see major speedup
up to -j64 and the best time was at -j96, after which
performance decreases slowly - see attached image.
Concurrency pg_restore-patched
384 75.87
352 75.63
320 72.11
288 70.05
256 70.98
224 66.98
192 63.04
160 61.37
128 58.82
96 58.55
64 60.46
32 77.29
16 115.51
8 203.48
4 366.33
Test details:
- Command used to generate SQL - create 1k LOs of 1kb each
- echo "SELECT lo_from_bytea(0, '\x` printf 'ff%.0s' {1..1000}`') FROM generate_series(1,1000);" > /tmp/tempdel
- Verify the LO size: select pg_column_size(lo_get(oid));
- Only GUC changed: max_connections=1000 (for the last test)
-
Robins Tharakan
Amazon Web Services
> On 11/12/2023, 01:43, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> I had initially supposed that in a parallel restore we could
> have child workers also commit after every N TOC items, but was
> soon disabused of that idea.  After a worker processes a TOC
> item, any dependent items (such as index builds) might get
> dispatched to some other worker, which had better be able to
> see the results of the first worker's step.  So at least in
> this implementation, we disable the multi-command-per-COMMIT
> behavior during the parallel part of the restore.  Maybe that
> could be improved in future, but it seems like it'd add a
> lot more complexity, and it wouldn't make life any better for
> pg_upgrade (which doesn't use parallel pg_restore, and seems
> unlikely to want to in future).

I was not able to find an email thread which details why we are not
using parallel pg_restore for pg_upgrade. IMHO most customers will
have a single large database, and not using parallel restore will
make pg_upgrade slow.

I am attaching a patch which enables parallel pg_restore for the DATA
and POST-DATA parts of the dump. It will push down the --jobs value
to pg_restore and will restore databases sequentially.
Benchmarks {5 million LOs, 1 large DB}

Patched {v9}

time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir   17.51s user 65.80s system 35% cpu 3:56.64 total

time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir   17.51s user 65.85s system 34% cpu 3:58.39 total

HEAD

time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir   53.95s user 82.44s system 41% cpu 5:25.23 total

time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir   54.94s user 81.26s system 41% cpu 5:24.86 total

Fix with --jobs propagation to pg_restore {on top of v9}

time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=20
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir   29.12s user 69.85s system 275% cpu 35.930 total

Although parallel restore does have a small regression in the ideal
case of pg_upgrade --jobs.

Multiple DBs {4 DBs, each having 2 million LOs}

Fix with --jobs scheduling

time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=4
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir   31.80s user 109.52s system 120% cpu 1:57.35 total

Patched {v9}

time pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir ~/upgrade/data/pub --new-datadir ~/data/sub -r --jobs=4
pg_upgrade --old-bindir ~/15/bin --new-bindir ~/install/bin --old-datadir   30.88s user 110.05s system 135% cpu 1:43.97 total

Regards
Sachin
"Kumar, Sachin" <ssetiya@amazon.com> writes:
>> On 11/12/2023, 01:43, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> ... Maybe that
>> could be improved in future, but it seems like it'd add a
>> lot more complexity, and it wouldn't make life any better for
>> pg_upgrade (which doesn't use parallel pg_restore, and seems
>> unlikely to want to in future).

> I was not able to find email thread which details why we are not using
> parallel pg_restore for pg_upgrade.

Well, it's pretty obvious isn't it?  The parallelism is being applied
at the per-database level instead.

> IMHO most of the customer will have single large
> database, and not using parallel restore will cause slow pg_upgrade.

You've offered no justification for that opinion ...

> I am attaching a patch which enables parallel pg_restore for DATA and POST-DATA part
> of dump. It will push down --jobs value to pg_restore and will restore
> database sequentially.

I don't think I trust this patch one bit.  It makes way too many
assumptions about how the --section options work, or even that they
will work at all in a binary-upgrade situation.  I've spent enough
time with that code to know that --section is pretty close to being
a fiction.  One point in particular is that this would change the
order of ACL restore relative to other steps, which almost certainly
will cause problems for somebody.

			regards, tom lane
I wrote:
> "Kumar, Sachin" <ssetiya@amazon.com> writes:
>> I was not able to find email thread which details why we are not using
>> parallel pg_restore for pg_upgrade.

> Well, it's pretty obvious isn't it?  The parallelism is being applied
> at the per-database level instead.

On further reflection, there is a very good reason why it's done like
that.  Because pg_upgrade is doing schema-only dump and restore,
there's next to no opportunity for parallelism within either pg_dump
or pg_restore.  There's no data-loading steps, and there's no
index-building either, so the time-consuming stuff that could be
parallelized just isn't happening in pg_upgrade's usage.

Now it's true that my 0003 patch moves the needle a little bit:
since it makes BLOB creation (as opposed to loading) parallelizable,
there'd be some hope for parallel pg_restore doing something useful in
a database with very many blobs.  But it makes no sense to remove the
existing cross-database parallelism in pursuit of that; you'd make
many more people unhappy than happy.

Conceivably something could be salvaged of your idea by having
pg_upgrade handle databases with many blobs differently from those
without, applying parallelism within pg_restore for the first kind
and then using cross-database parallelism for the rest.  But that
seems like a lot of complexity compared to the possible win.  In any
case I'd stay far away from using --section in pg_upgrade.  Too many
moving parts there.

			regards, tom lane
On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote:
> * I did not invent a switch to control the batching of blobs; it's
> just hard-wired at 1000 blobs per group here.  Probably we need some
> user knob for that, but I'm unsure if we want to expose a count or
> just a boolean for one vs more than one blob per batch.  The point of
> forcing one blob per batch would be to allow exact control during
> selective restore, and I'm not sure if there's any value in random
> other settings.  On the other hand, selective restore of blobs has
> been completely broken for the last dozen years and I can't recall any
> user complaints about that; so maybe nobody cares and we could just
> leave this as an internal choice.

I think the argument for making this configurable is that folks might
have fewer larger blobs, in which case it might make sense to lower the
batch size, or they might have many smaller blobs, in which case they
might want to increase it.  But I'm a bit skeptical that will make any
sort of tremendous difference in practice, and I'm not sure how a user
would decide on the right value besides trial-and-error.  In any case,
at the moment I think it'd be okay to keep this an internal setting and
wait for feedback from the field.

> * As the patch stands, we still build a separate TOC entry for each
> comment or seclabel or ACL attached to a blob.  If you have a lot of
> blobs with non-default properties then the TOC bloat problem comes
> back again.  We could do something about that, but it would take a bit
> of tedious refactoring, and the most obvious way to handle it probably
> re-introduces too-many-locks problems.  Is this a scenario that's
> worth spending a lot of time on?

I'll ask around about this.

> Subject: [PATCH v9 1/4] Some small preliminaries for pg_dump changes.

This one looked good to me.

> Subject: [PATCH v9 2/4] In dumps, group large objects into matching metadata
>  and data entries.

I spent most of my review time reading this one.  Overall, it looks
pretty good to me, and I think you've called out most of the
interesting design choices.

> +			char	   *cmdEnd = psprintf(" OWNER TO %s", fmtId(te->owner));
> +
> +			IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", cmdEnd);

This is just a nitpick, but is there any reason not to have
IssueCommandPerBlob() accept a format string and the corresponding
arguments?

> +		while (n < 1000 && i + n < ntups)

Another nitpick: it might be worth moving these hard-wired constants to
macros.  Without a control switch, that'd at least make it easy for
anyone determined to change the value for their installation.

> Subject: [PATCH v9 3/4] Move BLOBS METADATA TOC entries into SECTION_DATA.

This one looks reasonable, too.

> In this patch I have just hard-wired pg_upgrade to use
> --transaction-size 1000.  Perhaps there would be value in adding
> another pg_upgrade option to allow user control of that, but I'm
> unsure that it's worth the trouble; I think few users would use it,
> and any who did would see not that much benefit.  However, we
> might need to adjust the logic to make the size be 1000 divided
> by the number of parallel restore jobs allowed.

I wonder if it'd be worth making this configurable for pg_upgrade as an
escape hatch in case the default setting is problematic.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
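For readers following the batching discussion, the grouping loop Nathan quotes can be modeled in isolation. This is an illustrative sketch only: blob_group_count() and the MAX_BLOBS_PER_GROUP macro are hypothetical names (the real pg_dump code hard-wires 1000 inline), but it shows how promoting the constant to a macro would make the group count easy to reason about:

```c
#include <stdio.h>

/*
 * Illustrative stand-in for the hard-wired 1000 that the review above
 * suggests promoting to a macro.  Not the actual pg_dump symbol.
 */
#define MAX_BLOBS_PER_GROUP 1000

/*
 * Toy model of the grouping loop: how many metadata/data TOC entry
 * pairs would ntups blobs produce?
 */
static int
blob_group_count(int ntups)
{
	int			groups = 0;

	for (int i = 0; i < ntups;)
	{
		int			n = 0;

		/* same shape as "while (n < 1000 && i + n < ntups)" quoted above */
		while (n < MAX_BLOBS_PER_GROUP && i + n < ntups)
			n++;

		i += n;
		groups++;
	}
	return groups;
}
```

Under this model, 100M blobs yield 100,000 grouped TOC entries instead of 100M individual ones, which is where the TOC-size win comes from.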
On Fri, Jan 05, 2024 at 03:02:34PM -0500, Tom Lane wrote:
> On further reflection, there is a very good reason why it's done like
> that.  Because pg_upgrade is doing schema-only dump and restore,
> there's next to no opportunity for parallelism within either pg_dump
> or pg_restore.  There's no data-loading steps, and there's no
> index-building either, so the time-consuming stuff that could be
> parallelized just isn't happening in pg_upgrade's usage.
>
> Now it's true that my 0003 patch moves the needle a little bit:
> since it makes BLOB creation (as opposed to loading) parallelizable,
> there'd be some hope for parallel pg_restore doing something useful in
> a database with very many blobs.  But it makes no sense to remove the
> existing cross-database parallelism in pursuit of that; you'd make
> many more people unhappy than happy.

I assume the concern is that we'd end up multiplying the effective
number of workers if we parallelized both in-database and
cross-database?  Would it be sufficient to make those separately
configurable with a note about the multiplicative effects of setting
both?  I think it'd be unfortunate if pg_upgrade completely missed out
on this improvement.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Jan 12, 2024 at 04:42:27PM -0600, Nathan Bossart wrote:
> On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote:
>> +			char	   *cmdEnd = psprintf(" OWNER TO %s", fmtId(te->owner));
>> +
>> +			IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", cmdEnd);
>
> This is just a nitpick, but is there any reason not to have
> IssueCommandPerBlob() accept a format string and the corresponding
> arguments?

Eh, I guess you'd have to find some other way of specifying where the
OID is supposed to go, which would probably be weird.  Please disregard
this one.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Wed, Dec 20, 2023 at 06:47:44PM -0500, Tom Lane wrote:
>> +			char	   *cmdEnd = psprintf(" OWNER TO %s", fmtId(te->owner));
>> +
>> +			IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", cmdEnd);

> This is just a nitpick, but is there any reason not to have
> IssueCommandPerBlob() accept a format string and the corresponding
> arguments?

The problem is how to combine the individual per-blob OID with the
command string given by the caller.  If we want the caller to also be
able to inject data values, I don't really see how to combine the OID
with the va_args list from the caller.  If we stick with the design
that the caller provides separate "front" and "back" strings, but ask
to be able to inject data values into those, then we need two va_args
lists which C doesn't support, or else an arbitrary decision about
which one gets the va_args.  (Admittedly, with only one caller that
needs a non-constant string, we could make such a decision; but by the
same token we'd gain little.)

It'd be notationally simpler if we could have the caller supply one
string that includes %u where the OID is supposed to go; but then we
have problems if an owner name includes %.  So on the whole I'm not
seeing anything much better than what I did.  Maybe I missed an idea
though.

> Another nitpick: it might be worth moving these hard-wired constants to
> macros.  Without a control switch, that'd at least make it easy for anyone
> determined to change the value for their installation.

OK.

			regards, tom lane
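To make the front/back-string trade-off concrete, here is a toy model of the design under discussion. The name issue_command_per_blob and the buffer-based output are purely illustrative; the real IssueCommandPerBlob walks the newline-separated OID list in te->defn and writes through ahprintf:

```c
#include <stdio.h>

/*
 * Toy model of the "front"/"back" string design: splice each blob's
 * OID between two caller-supplied literal strings.  Because the
 * caller's strings are emitted verbatim (never as printf formats),
 * a '%' in an owner name cannot be misinterpreted.
 */
static void
issue_command_per_blob(const unsigned int *oids, int noids,
					   const char *cmdBegin, const char *cmdEnd,
					   char *out, size_t outsize)
{
	size_t		used = 0;

	out[0] = '\0';
	for (int i = 0; i < noids; i++)
	{
		/* one complete SQL command per blob OID */
		int			n = snprintf(out + used, outsize - used, "%s%u%s;\n",
								 cmdBegin, oids[i], cmdEnd);

		if (n < 0 || (size_t) n >= outsize - used)
			break;				/* out of room; this is a sketch, not robust */
		used += n;
	}
}
```

Calling it with cmdBegin = "SELECT pg_catalog.lo_create('" and cmdEnd = "')" yields one lo_create() call per OID, which illustrates why the caller can keep both pieces as plain literals rather than a format string.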
On Tue, 2 Jan 2024 at 23:03, Kumar, Sachin <ssetiya@amazon.com> wrote:
>
> > On 11/12/2023, 01:43, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> > > I had initially supposed that in a parallel restore we could
> > > have child workers also commit after every N TOC items, but was
> > > soon disabused of that idea.  After a worker processes a TOC
> > > item, any dependent items (such as index builds) might get
> > > dispatched to some other worker, which had better be able to
> > > see the results of the first worker's step.  So at least in
> > > this implementation, we disable the multi-command-per-COMMIT
> > > behavior during the parallel part of the restore.  Maybe that
> > > could be improved in future, but it seems like it'd add a
> > > lot more complexity, and it wouldn't make life any better for
> > > pg_upgrade (which doesn't use parallel pg_restore, and seems
> > > unlikely to want to in future).
>
> I was not able to find email thread which details why we are not using
> parallel pg_restore for pg_upgrade. IMHO most of the customer will have
> single large database, and not using parallel restore will cause slow
> pg_upgrade.
>
> I am attaching a patch which enables parallel pg_restore for DATA and
> POST-DATA part of dump. It will push down --jobs value to pg_restore
> and will restore database sequentially.

CFBot shows that the patch does not apply anymore as in [1]:
=== Applying patches on top of PostgreSQL commit ID
46a0cd4cefb4d9b462d8cc4df5e7ecdd190bea92 ===
=== applying patch ./v9-005-parallel_pg_restore.patch
patching file src/bin/pg_upgrade/pg_upgrade.c
Hunk #3 FAILED at 650.
1 out of 3 hunks FAILED -- saving rejects to file
src/bin/pg_upgrade/pg_upgrade.c.rej

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_4713.log

Regards,
Vignesh
vignesh C <vignesh21@gmail.com> writes:
> CFBot shows that the patch does not apply anymore as in [1]:
> === Applying patches on top of PostgreSQL commit ID
> 46a0cd4cefb4d9b462d8cc4df5e7ecdd190bea92 ===
> === applying patch ./v9-005-parallel_pg_restore.patch
> patching file src/bin/pg_upgrade/pg_upgrade.c
> Hunk #3 FAILED at 650.
> 1 out of 3 hunks FAILED -- saving rejects to file
> src/bin/pg_upgrade/pg_upgrade.c.rej

That's because v9-005 was posted by itself.  But I don't think we
should use it anyway.  Here's 0001-0004 again, updated to current HEAD
(only line numbers changed) and with Nathan's suggestion to define
some macros for the magic constants.

			regards, tom lane

From c48b11547c6eb95ab217dddc047da5378042452c Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 26 Jan 2024 11:10:00 -0500
Subject: [PATCH v10 1/4] Some small preliminaries for pg_dump changes.

Centralize management of the lo_buf used to hold data while restoring
blobs.  The code previously had each format handler create lo_buf,
which seems rather pointless given that the format handlers all make
it the same way.  Moreover, the format handlers never use lo_buf
directly, making this setup a failure from a separation-of-concerns
standpoint.  Let's move the responsibility into pg_backup_archiver.c,
which is the only module concerned with lo_buf.  The main reason to do
this now is that it allows a centralized fix for the soon-to-be-false
assumption that we never restore blobs in parallel.

Also, get rid of dead code in DropLOIfExists: it's been a long time
since we had any need to be able to restore to a pre-9.0 server.
---
 src/bin/pg_dump/pg_backup_archiver.c  |  9 +++++++++
 src/bin/pg_dump/pg_backup_custom.c    |  7 -------
 src/bin/pg_dump/pg_backup_db.c        | 27 +++++----------------------
 src/bin/pg_dump/pg_backup_directory.c |  6 ------
 src/bin/pg_dump/pg_backup_null.c      |  4 ----
 src/bin/pg_dump/pg_backup_tar.c       |  4 ----
 6 files changed, 14 insertions(+), 43 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 256d1e35a4..26c2c684c8 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -1343,6 +1343,12 @@ StartRestoreLO(ArchiveHandle *AH, Oid oid, bool drop)
 	AH->loCount++;
 
 	/* Initialize the LO Buffer */
+	if (AH->lo_buf == NULL)
+	{
+		/* First time through (in this process) so allocate the buffer */
+		AH->lo_buf_size = LOBBUFSIZE;
+		AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
+	}
 	AH->lo_buf_used = 0;
 
 	pg_log_info("restoring large object with OID %u", oid);
@@ -4748,6 +4754,9 @@ CloneArchive(ArchiveHandle *AH)
 	/* clone has its own error count, too */
 	clone->public.n_errors = 0;
 
+	/* clones should not share lo_buf */
+	clone->lo_buf = NULL;
+
 	/*
 	 * Connect our new clone object to the database, using the same connection
 	 * parameters used for the original connection.
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index b576b29924..7c6ac89dd4 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -140,10 +140,6 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 	ctx = (lclContext *) pg_malloc0(sizeof(lclContext));
 	AH->formatData = (void *) ctx;
 
-	/* Initialize LO buffering */
-	AH->lo_buf_size = LOBBUFSIZE;
-	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
-
 	/*
 	 * Now open the file
 	 */
@@ -902,9 +898,6 @@ _Clone(ArchiveHandle *AH)
 	 * share knowledge about where the data blocks are across threads.
 	 * _PrintTocData has to be careful about the order of operations on that
 	 * state, though.
-	 *
-	 * Note: we do not make a local lo_buf because we expect at most one BLOBS
-	 * entry per archive, so no parallelism is possible.
 	 */
 }
 
diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c
index f766b65059..b297ca049d 100644
--- a/src/bin/pg_dump/pg_backup_db.c
+++ b/src/bin/pg_dump/pg_backup_db.c
@@ -544,26 +544,9 @@ CommitTransaction(Archive *AHX)
 void
 DropLOIfExists(ArchiveHandle *AH, Oid oid)
 {
-	/*
-	 * If we are not restoring to a direct database connection, we have to
-	 * guess about how to detect whether the LO exists.  Assume new-style.
-	 */
-	if (AH->connection == NULL ||
-		PQserverVersion(AH->connection) >= 90000)
-	{
-		ahprintf(AH,
-				 "SELECT pg_catalog.lo_unlink(oid) "
-				 "FROM pg_catalog.pg_largeobject_metadata "
-				 "WHERE oid = '%u';\n",
-				 oid);
-	}
-	else
-	{
-		/* Restoring to pre-9.0 server, so do it the old way */
-		ahprintf(AH,
-				 "SELECT CASE WHEN EXISTS("
-				 "SELECT 1 FROM pg_catalog.pg_largeobject WHERE loid = '%u'"
-				 ") THEN pg_catalog.lo_unlink('%u') END;\n",
-				 oid, oid);
-	}
+	ahprintf(AH,
+			 "SELECT pg_catalog.lo_unlink(oid) "
+			 "FROM pg_catalog.pg_largeobject_metadata "
+			 "WHERE oid = '%u';\n",
+			 oid);
 }
diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c
index dba57443e8..de3cfea02e 100644
--- a/src/bin/pg_dump/pg_backup_directory.c
+++ b/src/bin/pg_dump/pg_backup_directory.c
@@ -143,10 +143,6 @@ InitArchiveFmt_Directory(ArchiveHandle *AH)
 	ctx->dataFH = NULL;
 	ctx->LOsTocFH = NULL;
 
-	/* Initialize LO buffering */
-	AH->lo_buf_size = LOBBUFSIZE;
-	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
-
 	/*
 	 * Now open the TOC file
 	 */
@@ -823,8 +819,6 @@ _Clone(ArchiveHandle *AH)
 	ctx = (lclContext *) AH->formatData;
 
 	/*
-	 * Note: we do not make a local lo_buf because we expect at most one BLOBS
-	 * entry per archive, so no parallelism is possible.  Likewise,
 	 * TOC-entry-local state isn't an issue because any one TOC entry is
 	 * touched by just one worker child.
 	 */
diff --git a/src/bin/pg_dump/pg_backup_null.c b/src/bin/pg_dump/pg_backup_null.c
index 08f096251b..776f057770 100644
--- a/src/bin/pg_dump/pg_backup_null.c
+++ b/src/bin/pg_dump/pg_backup_null.c
@@ -63,10 +63,6 @@ InitArchiveFmt_Null(ArchiveHandle *AH)
 	AH->ClonePtr = NULL;
 	AH->DeClonePtr = NULL;
 
-	/* Initialize LO buffering */
-	AH->lo_buf_size = LOBBUFSIZE;
-	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
-
 	/*
 	 * Now prevent reading...
 	 */
diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c
index aad88ad559..4cb9707e63 100644
--- a/src/bin/pg_dump/pg_backup_tar.c
+++ b/src/bin/pg_dump/pg_backup_tar.c
@@ -156,10 +156,6 @@ InitArchiveFmt_Tar(ArchiveHandle *AH)
 	ctx->filePos = 0;
 	ctx->isSpecialScript = 0;
 
-	/* Initialize LO buffering */
-	AH->lo_buf_size = LOBBUFSIZE;
-	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
-
 	/*
 	 * Now open the tar file, and load the TOC if we're in read mode.
 	 */
-- 
2.39.3

From ea600dd34ebddef9e3c7d267b0e73340bb77f48a Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 26 Jan 2024 11:25:26 -0500
Subject: [PATCH v10 2/4] In dumps, group large objects into matching metadata
 and data entries.

Commit c0d5be5d6 caused pg_dump to create a separate BLOB metadata TOC
entry for each large object (blob), but it did not touch the ancient
decision to put all the blobs' data into a single BLOBS TOC entry.
This is bad for a few reasons: for databases with millions of blobs,
the TOC becomes unreasonably large, causing performance issues;
selective restore of just some blobs is quite impossible; and we
cannot parallelize either dump or restore of the blob data, since our
architecture for that relies on farming out whole TOC entries to
worker processes.

To improve matters, let's group multiple blobs into each blob metadata
TOC entry, and then make corresponding per-group blob data TOC
entries.  Selective restore using pg_restore's -l/-L switches is then
possible, though only at the group level.

(Perhaps we should provide a switch to allow forcing one-blob-per-group
for users who need precise selective restore and don't have huge
numbers of blobs.  This patch doesn't yet do that, instead just
hard-wiring the maximum number of blobs per entry at 1000.)

The blobs in a group must all have the same owner, since the TOC entry
format only allows one owner to be named.  In this implementation we
also require them to all share the same ACL (grants); the archive
format wouldn't require that, but pg_dump's representation of
DumpableObjects does.  It seems unlikely that either restriction will
be problematic for databases with huge numbers of blobs.

The metadata TOC entries now have a "desc" string of "BLOB METADATA",
and their "defn" string is just a newline-separated list of blob OIDs.
The restore code has to generate creation commands, ALTER OWNER
commands, and drop commands (for --clean mode) from that.  We would
need special-case code for ALTER OWNER and drop in any case, so the
alternative of keeping the "defn" as directly executable SQL code for
creation wouldn't buy much, and it seems like it'd bloat the archive
to little purpose.

The data TOC entries ("BLOBS") can be exactly the same as before,
except that now there can be more than one, so we'd better give them
identifying tag strings.

We have to bump the archive file format version number, since existing
versions of pg_restore wouldn't know they need to do something special
for BLOB METADATA, plus they aren't going to work correctly with
multiple BLOBS entries.  Also, the directory and tar-file format
handlers need some work for multiple BLOBS entries: they used to
hard-wire the file name as "blobs.toc", which is replaced here with
"blobs_<dumpid>.toc".  The 002_pg_dump.pl test script also knows
about that and requires minor updates.

(I had to drop the test for manually-compressed blobs.toc files with
LZ4, because lz4's obtuse command line design requires explicit
specification of the output file name which seems impractical here.
I don't think we're losing any useful test coverage thereby; that
test stanza seems completely duplicative with the gzip and zstd
cases anyway.)

As this stands, we still generate a separate TOC entry for any comment,
security label, or ACL attached to a blob.  I feel comfortable in
believing that comments and security labels on blobs are rare; but we
might have to do something about aggregating blob ACLs into grouped
TOC entries to avoid blowing up the TOC size, if there are use cases
with large numbers of non-default blob ACLs.  That can be done later
though, as it would not create any compatibility issues.
---
 src/bin/pg_dump/common.c              |  26 +++
 src/bin/pg_dump/pg_backup_archiver.c  |  76 +++++--
 src/bin/pg_dump/pg_backup_archiver.h  |   6 +-
 src/bin/pg_dump/pg_backup_custom.c    |   4 +-
 src/bin/pg_dump/pg_backup_db.c        |  27 +++
 src/bin/pg_dump/pg_backup_directory.c |  38 ++--
 src/bin/pg_dump/pg_backup_null.c      |   4 +-
 src/bin/pg_dump/pg_backup_tar.c       |  39 +++-
 src/bin/pg_dump/pg_dump.c             | 287 +++++++++++++++-----------
 src/bin/pg_dump/pg_dump.h             |  11 +
 src/bin/pg_dump/t/002_pg_dump.pl      |  30 ++-
 11 files changed, 361 insertions(+), 187 deletions(-)

diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c
index 0ed18b72d6..c7dd0b11fd 100644
--- a/src/bin/pg_dump/common.c
+++ b/src/bin/pg_dump/common.c
@@ -47,6 +47,8 @@ static DumpId lastDumpId = 0;	/* Note: 0 is InvalidDumpId */
  * expects that it can move them around when resizing the table.  So we
  * cannot make the DumpableObjects be elements of the hash table directly;
  * instead, the hash table elements contain pointers to DumpableObjects.
+ * This does have the advantage of letting us map multiple CatalogIds
+ * to one DumpableObject, which is useful for blobs.
  *
  * It turns out to be convenient to also use this data structure to map
  * CatalogIds to owning extensions, if any.  Since extension membership
@@ -700,6 +702,30 @@ AssignDumpId(DumpableObject *dobj)
 	}
 }
 
+/*
+ * recordAdditionalCatalogID
+ *	  Record an additional catalog ID for the given DumpableObject
+ */
+void
+recordAdditionalCatalogID(CatalogId catId, DumpableObject *dobj)
+{
+	CatalogIdMapEntry *entry;
+	bool		found;
+
+	/* CatalogId hash table must exist, if we have a DumpableObject */
+	Assert(catalogIdHash != NULL);
+
+	/* Add reference to CatalogId hash */
+	entry = catalogid_insert(catalogIdHash, catId, &found);
+	if (!found)
+	{
+		entry->dobj = NULL;
+		entry->ext = NULL;
+	}
+	Assert(entry->dobj == NULL);
+	entry->dobj = dobj;
+}
+
 /*
  * Assign a DumpId that's not tied to a DumpableObject.
  *
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 26c2c684c8..73b9972da4 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -512,7 +512,20 @@ RestoreArchive(Archive *AHX)
 			 * don't necessarily emit it verbatim; at this point we add an
 			 * appropriate IF EXISTS clause, if the user requested it.
 			 */
-			if (*te->dropStmt != '\0')
+			if (strcmp(te->desc, "BLOB METADATA") == 0)
+			{
+				/* We must generate the per-blob commands */
+				if (ropt->if_exists)
+					IssueCommandPerBlob(AH, te,
+										"SELECT pg_catalog.lo_unlink(oid) "
+										"FROM pg_catalog.pg_largeobject_metadata "
+										"WHERE oid = '", "'");
+				else
+					IssueCommandPerBlob(AH, te,
+										"SELECT pg_catalog.lo_unlink('",
+										"')");
+			}
+			else if (*te->dropStmt != '\0')
 			{
 				if (!ropt->if_exists ||
 					strncmp(te->dropStmt, "--", 2) == 0)
@@ -528,12 +541,12 @@ RestoreArchive(Archive *AHX)
 				{
 					/*
 					 * Inject an appropriate spelling of "if exists".  For
-					 * large objects, we have a separate routine that
+					 * old-style large objects, we have a routine that
 					 * knows how to do it, without depending on
 					 * te->dropStmt; use that.  For other objects we need
 					 * to parse the command.
 					 */
-					if (strncmp(te->desc, "BLOB", 4) == 0)
+					if (strcmp(te->desc, "BLOB") == 0)
 					{
 						DropLOIfExists(AH, te->catalogId.oid);
 					}
@@ -1290,7 +1303,7 @@ EndLO(Archive *AHX, Oid oid)
  **********/
 
 /*
- * Called by a format handler before any LOs are restored
+ * Called by a format handler before a group of LOs is restored
 */
 void
 StartRestoreLOs(ArchiveHandle *AH)
@@ -1309,7 +1322,7 @@ StartRestoreLOs(ArchiveHandle *AH)
 }
 
 /*
- * Called by a format handler after all LOs are restored
+ * Called by a format handler after a group of LOs is restored
 */
 void
 EndRestoreLOs(ArchiveHandle *AH)
@@ -2994,13 +3007,14 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 	{
 		/*
 		 * Special Case: If 'SEQUENCE SET' or anything to do with LOs, then it
-		 * is considered a data entry.  We don't need to check for the BLOBS
-		 * entry or old-style BLOB COMMENTS, because they will have hadDumper
-		 * = true ... but we do need to check new-style BLOB ACLs, comments,
+		 * is considered a data entry.  We don't need to check for BLOBS or
+		 * old-style BLOB COMMENTS entries, because they will have hadDumper =
+		 * true ... but we do need to check new-style BLOB ACLs, comments,
		 * etc.
		 */
 		if (strcmp(te->desc, "SEQUENCE SET") == 0 ||
 			strcmp(te->desc, "BLOB") == 0 ||
+			strcmp(te->desc, "BLOB METADATA") == 0 ||
 			(strcmp(te->desc, "ACL") == 0 &&
 			 strncmp(te->tag, "LARGE OBJECT ", 13) == 0) ||
 			(strcmp(te->desc, "COMMENT") == 0 &&
@@ -3041,6 +3055,7 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH)
 		if (!(ropt->sequence_data && strcmp(te->desc, "SEQUENCE SET") == 0) &&
 			!(ropt->binary_upgrade &&
 			  (strcmp(te->desc, "BLOB") == 0 ||
+			   strcmp(te->desc, "BLOB METADATA") == 0 ||
 			   (strcmp(te->desc, "ACL") == 0 &&
 				strncmp(te->tag, "LARGE OBJECT ", 13) == 0) ||
 			   (strcmp(te->desc, "COMMENT") == 0 &&
@@ -3612,18 +3627,26 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData)
 	}
 
 	/*
-	 * Actually print the definition.
+	 * Actually print the definition.  Normally we can just print the defn
+	 * string if any, but we have two special cases:
 	 *
-	 * Really crude hack for suppressing AUTHORIZATION clause that old pg_dump
+	 * 1. A crude hack for suppressing AUTHORIZATION clause that old pg_dump
 	 * versions put into CREATE SCHEMA.  Don't mutate the variant for schema
 	 * "public" that is a comment.  We have to do this when --no-owner mode is
 	 * selected.  This is ugly, but I see no other good way ...
+	 *
+	 * 2. BLOB METADATA entries need special processing since their defn
+	 * strings are just lists of OIDs, not complete SQL commands.
 	 */
 	if (ropt->noOwner &&
 		strcmp(te->desc, "SCHEMA") == 0 && strncmp(te->defn, "--", 2) != 0)
 	{
 		ahprintf(AH, "CREATE SCHEMA %s;\n\n\n", fmtId(te->tag));
 	}
+	else if (strcmp(te->desc, "BLOB METADATA") == 0)
+	{
+		IssueCommandPerBlob(AH, te, "SELECT pg_catalog.lo_create('", "')");
+	}
 	else
 	{
 		if (te->defn && strlen(te->defn) > 0)
@@ -3644,18 +3667,31 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData)
 		te->owner && strlen(te->owner) > 0 &&
 		te->dropStmt && strlen(te->dropStmt) > 0)
 	{
-		PQExpBufferData temp;
+		if (strcmp(te->desc, "BLOB METADATA") == 0)
+		{
+			/* BLOB METADATA needs special code to handle multiple LOs */
+			char	   *cmdEnd = psprintf(" OWNER TO %s", fmtId(te->owner));
+
+			IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", cmdEnd);
+			pg_free(cmdEnd);
+		}
+		else
+		{
+			/* For all other cases, we can use _getObjectDescription */
+			PQExpBufferData temp;
 
-		initPQExpBuffer(&temp);
-		_getObjectDescription(&temp, te);
+			initPQExpBuffer(&temp);
+			_getObjectDescription(&temp, te);
 
-		/*
-		 * If _getObjectDescription() didn't fill the buffer, then there is no
-		 * owner.
-		 */
-		if (temp.data[0])
-			ahprintf(AH, "ALTER %s OWNER TO %s;\n\n", temp.data, fmtId(te->owner));
-		termPQExpBuffer(&temp);
+			/*
+			 * If _getObjectDescription() didn't fill the buffer, then there
+			 * is no owner.
+			 */
+			if (temp.data[0])
+				ahprintf(AH, "ALTER %s OWNER TO %s;\n\n",
+						 temp.data, fmtId(te->owner));
+			termPQExpBuffer(&temp);
+		}
 	}
 
 	/*
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 917283fd34..e4dd395582 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -68,10 +68,12 @@
 #define K_VERS_1_15 MAKE_ARCHIVE_VERSION(1, 15, 0)	/* add
													 * compression_algorithm
													 * in header */
+#define K_VERS_1_16 MAKE_ARCHIVE_VERSION(1, 16, 0)	/* BLOB METADATA entries
+													 * and multiple BLOBS */
 
 /* Current archive version number (the format we can output) */
 #define K_VERS_MAJOR 1
-#define K_VERS_MINOR 15
+#define K_VERS_MINOR 16
 #define K_VERS_REV 0
 #define K_VERS_SELF MAKE_ARCHIVE_VERSION(K_VERS_MAJOR, K_VERS_MINOR, K_VERS_REV)
 
@@ -448,6 +450,8 @@ extern void InitArchiveFmt_Tar(ArchiveHandle *AH);
 extern bool isValidTarHeader(char *header);
 
 extern void ReconnectToServer(ArchiveHandle *AH, const char *dbname);
+extern void IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te,
+								const char *cmdBegin, const char *cmdEnd);
 extern void DropLOIfExists(ArchiveHandle *AH, Oid oid);
 
 void		ahwrite(const void *ptr, size_t size, size_t nmemb, ArchiveHandle *AH);
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 7c6ac89dd4..55107b2005 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -338,7 +338,7 @@ _EndData(ArchiveHandle *AH, TocEntry *te)
 }
 
 /*
- * Called by the archiver when starting to save all BLOB DATA (not schema).
+ * Called by the archiver when starting to save BLOB DATA (not schema).
  * This routine should save whatever format-specific information is needed
  * to read the LOs back into memory.
  *
@@ -398,7 +398,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid)
 }
 
 /*
- * Called by the archiver when finishing saving all BLOB DATA.
+ * Called by the archiver when finishing saving BLOB DATA.
 *
 * Optional.
*/ diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c index b297ca049d..c14d813b21 100644 --- a/src/bin/pg_dump/pg_backup_db.c +++ b/src/bin/pg_dump/pg_backup_db.c @@ -541,6 +541,33 @@ CommitTransaction(Archive *AHX) ExecuteSqlCommand(AH, "COMMIT", "could not commit database transaction"); } +/* + * Issue per-blob commands for the large object(s) listed in the TocEntry + * + * The TocEntry's defn string is assumed to consist of large object OIDs, + * one per line. Wrap these in the given SQL command fragments and issue + * the commands. (cmdEnd need not include a semicolon.) + */ +void +IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te, + const char *cmdBegin, const char *cmdEnd) +{ + /* Make a writable copy of the command string */ + char *buf = pg_strdup(te->defn); + char *st; + char *en; + + st = buf; + while ((en = strchr(st, '\n')) != NULL) + { + *en++ = '\0'; + ahprintf(AH, "%s%s%s;\n", cmdBegin, st, cmdEnd); + st = en; + } + ahprintf(AH, "\n"); + pg_free(buf); +} + void DropLOIfExists(ArchiveHandle *AH, Oid oid) { diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c index de3cfea02e..7be8d5487d 100644 --- a/src/bin/pg_dump/pg_backup_directory.c +++ b/src/bin/pg_dump/pg_backup_directory.c @@ -5,8 +5,10 @@ * A directory format dump is a directory, which contains a "toc.dat" file * for the TOC, and a separate file for each data entry, named "<oid>.dat". * Large objects are stored in separate files named "blob_<oid>.dat", - * and there's a plain-text TOC file for them called "blobs.toc". If - * compression is used, each data file is individually compressed and the + * and there's a plain-text TOC file for each BLOBS TOC entry named + * "blobs_<dumpID>.toc" (or just "blobs.toc" in archive versions before 16). + * + * If compression is used, each data file is individually compressed and the * ".gz" suffix is added to the filenames. 
The TOC files are never * compressed by pg_dump, however they are accepted with the .gz suffix too, * in case the user has manually compressed them with 'gzip'. @@ -51,7 +53,7 @@ typedef struct char *directory; CompressFileHandle *dataFH; /* currently open data file */ - CompressFileHandle *LOsTocFH; /* file handle for blobs.toc */ + CompressFileHandle *LOsTocFH; /* file handle for blobs_NNN.toc */ ParallelState *pstate; /* for parallel backup / restore */ } lclContext; @@ -81,7 +83,7 @@ static void _StartLOs(ArchiveHandle *AH, TocEntry *te); static void _StartLO(ArchiveHandle *AH, TocEntry *te, Oid oid); static void _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid); static void _EndLOs(ArchiveHandle *AH, TocEntry *te); -static void _LoadLOs(ArchiveHandle *AH); +static void _LoadLOs(ArchiveHandle *AH, TocEntry *te); static void _PrepParallelRestore(ArchiveHandle *AH); static void _Clone(ArchiveHandle *AH); @@ -232,7 +234,10 @@ _ArchiveEntry(ArchiveHandle *AH, TocEntry *te) tctx = (lclTocEntry *) pg_malloc0(sizeof(lclTocEntry)); if (strcmp(te->desc, "BLOBS") == 0) - tctx->filename = pg_strdup("blobs.toc"); + { + snprintf(fn, MAXPGPATH, "blobs_%d.toc", te->dumpId); + tctx->filename = pg_strdup(fn); + } else if (te->dataDumper) { snprintf(fn, MAXPGPATH, "%d.dat", te->dumpId); @@ -415,7 +420,7 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te) return; if (strcmp(te->desc, "BLOBS") == 0) - _LoadLOs(AH); + _LoadLOs(AH, te); else { char fname[MAXPGPATH]; @@ -426,17 +431,23 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te) } static void -_LoadLOs(ArchiveHandle *AH) +_LoadLOs(ArchiveHandle *AH, TocEntry *te) { Oid oid; lclContext *ctx = (lclContext *) AH->formatData; + lclTocEntry *tctx = (lclTocEntry *) te->formatData; CompressFileHandle *CFH; char tocfname[MAXPGPATH]; char line[MAXPGPATH]; StartRestoreLOs(AH); - setFilePath(AH, tocfname, "blobs.toc"); + /* + * Note: before archive v16, there was always only one BLOBS TOC entry, + * now there can be multiple. 
We don't need to worry what version we are + * reading though, because tctx->filename should be correct either way. + */ + setFilePath(AH, tocfname, tctx->filename); CFH = ctx->LOsTocFH = InitDiscoverCompressFileHandle(tocfname, PG_BINARY_R); @@ -632,7 +643,7 @@ _ReopenArchive(ArchiveHandle *AH) */ /* - * Called by the archiver when starting to save all BLOB DATA (not schema). + * Called by the archiver when starting to save BLOB DATA (not schema). * It is called just prior to the dumper's DataDumper routine. * * We open the large object TOC file here, so that we can append a line to @@ -642,10 +653,11 @@ static void _StartLOs(ArchiveHandle *AH, TocEntry *te) { lclContext *ctx = (lclContext *) AH->formatData; + lclTocEntry *tctx = (lclTocEntry *) te->formatData; pg_compress_specification compression_spec = {0}; char fname[MAXPGPATH]; - setFilePath(AH, fname, "blobs.toc"); + setFilePath(AH, fname, tctx->filename); /* The LO TOC file is never compressed */ compression_spec.algorithm = PG_COMPRESSION_NONE; @@ -690,7 +702,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) pg_fatal("could not close LO data file: %m"); ctx->dataFH = NULL; - /* register the LO in blobs.toc */ + /* register the LO in blobs_NNN.toc */ len = snprintf(buf, sizeof(buf), "%u blob_%u.dat\n", oid, oid); if (!CFH->write_func(buf, len, CFH)) { @@ -703,7 +715,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) } /* - * Called by the archiver when finishing saving all BLOB DATA. + * Called by the archiver when finishing saving BLOB DATA. * * We close the LOs TOC file. */ @@ -795,7 +807,7 @@ _PrepParallelRestore(ArchiveHandle *AH) } /* - * If this is the BLOBS entry, what we stat'd was blobs.toc, which + * If this is a BLOBS entry, what we stat'd was blobs_NNN.toc, which * most likely is a lot smaller than the actual blob data. 
We don't * have a cheap way to estimate how much smaller, but fortunately it * doesn't matter too much as long as we get the LOs processed diff --git a/src/bin/pg_dump/pg_backup_null.c b/src/bin/pg_dump/pg_backup_null.c index 776f057770..a3257f4fc8 100644 --- a/src/bin/pg_dump/pg_backup_null.c +++ b/src/bin/pg_dump/pg_backup_null.c @@ -113,7 +113,7 @@ _EndData(ArchiveHandle *AH, TocEntry *te) } /* - * Called by the archiver when starting to save all BLOB DATA (not schema). + * Called by the archiver when starting to save BLOB DATA (not schema). * This routine should save whatever format-specific information is needed * to read the LOs back into memory. * @@ -170,7 +170,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) } /* - * Called by the archiver when finishing saving all BLOB DATA. + * Called by the archiver when finishing saving BLOB DATA. * * Optional. */ diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c index 4cb9707e63..41ee52b1d6 100644 --- a/src/bin/pg_dump/pg_backup_tar.c +++ b/src/bin/pg_dump/pg_backup_tar.c @@ -94,7 +94,7 @@ typedef struct char *filename; } lclTocEntry; -static void _LoadLOs(ArchiveHandle *AH); +static void _LoadLOs(ArchiveHandle *AH, TocEntry *te); static TAR_MEMBER *tarOpen(ArchiveHandle *AH, const char *filename, char mode); static void tarClose(ArchiveHandle *AH, TAR_MEMBER *th); @@ -634,13 +634,13 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te) } if (strcmp(te->desc, "BLOBS") == 0) - _LoadLOs(AH); + _LoadLOs(AH, te); else _PrintFileData(AH, tctx->filename); } static void -_LoadLOs(ArchiveHandle *AH) +_LoadLOs(ArchiveHandle *AH, TocEntry *te) { Oid oid; lclContext *ctx = (lclContext *) AH->formatData; @@ -651,7 +651,26 @@ _LoadLOs(ArchiveHandle *AH) StartRestoreLOs(AH); - th = tarOpen(AH, NULL, 'r'); /* Open next file */ + /* + * The blobs_NNN.toc or blobs.toc file is fairly useless to us because it + * will appear only after the associated blob_NNN.dat files. 
For archive + * versions >= 16 we can look at the BLOBS entry's te->tag to discover the + * OID of the first blob we want to restore, and then search forward to + * find the appropriate blob_<oid>.dat file. For older versions we rely + * on the knowledge that there was only one BLOBS entry and just search + * for the first blob_<oid>.dat file. Once we find the first blob file to + * restore, restore all blobs until we reach the blobs[_NNN].toc file. + */ + if (AH->version >= K_VERS_1_16) + { + /* We rely on atooid to not complain about nnnn..nnnn tags */ + oid = atooid(te->tag); + snprintf(buf, sizeof(buf), "blob_%u.dat", oid); + th = tarOpen(AH, buf, 'r'); /* Advance to first desired file */ + } + else + th = tarOpen(AH, NULL, 'r'); /* Open next file */ + while (th != NULL) { ctx->FH = th; @@ -681,9 +700,9 @@ _LoadLOs(ArchiveHandle *AH) /* * Once we have found the first LO, stop at the first non-LO entry - * (which will be 'blobs.toc'). This coding would eat all the - * rest of the archive if there are no LOs ... but this function - * shouldn't be called at all in that case. + * (which will be 'blobs[_NNN].toc'). This coding would eat all + * the rest of the archive if there are no LOs ... but this + * function shouldn't be called at all in that case. */ if (foundLO) break; @@ -847,7 +866,7 @@ _scriptOut(ArchiveHandle *AH, const void *buf, size_t len) */ /* - * Called by the archiver when starting to save all BLOB DATA (not schema). + * Called by the archiver when starting to save BLOB DATA (not schema). * This routine should save whatever format-specific information is needed * to read the LOs back into memory. 
* @@ -862,7 +881,7 @@ _StartLOs(ArchiveHandle *AH, TocEntry *te) lclContext *ctx = (lclContext *) AH->formatData; char fname[K_STD_BUF_SIZE]; - sprintf(fname, "blobs.toc"); + sprintf(fname, "blobs_%d.toc", te->dumpId); ctx->loToc = tarOpen(AH, fname, 'w'); } @@ -908,7 +927,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) } /* - * Called by the archiver when finishing saving all BLOB DATA. + * Called by the archiver when finishing saving BLOB DATA. * * Optional. * diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index a19443becd..0245b22ef0 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -160,6 +160,13 @@ static int nseclabels = 0; */ #define DUMP_DEFAULT_ROWS_PER_INSERT 1 +/* + * Maximum number of large objects to group into a single ArchiveEntry. + * At some point we might want to make this user-controllable, but for now + * a hard-wired setting will suffice. + */ +#define MAX_BLOBS_PER_ARCHIVE_ENTRY 1000 + /* * Macro for producing quoted, schema-qualified name of a dumpable object. */ @@ -3581,11 +3588,10 @@ getLOs(Archive *fout) { DumpOptions *dopt = fout->dopt; PQExpBuffer loQry = createPQExpBuffer(); - LoInfo *loinfo; - DumpableObject *lodata; PGresult *res; int ntups; int i; + int n; int i_oid; int i_lomowner; int i_lomacl; @@ -3593,11 +3599,15 @@ getLOs(Archive *fout) pg_log_info("reading large objects"); - /* Fetch LO OIDs, and owner/ACL data */ + /* + * Fetch LO OIDs and owner/ACL data. Order the data so that all the blobs + * with the same owner/ACL appear together. + */ appendPQExpBufferStr(loQry, "SELECT oid, lomowner, lomacl, " "acldefault('L', lomowner) AS acldefault " - "FROM pg_largeobject_metadata"); + "FROM pg_largeobject_metadata " + "ORDER BY lomowner, lomacl::pg_catalog.text, oid"); res = ExecuteSqlQuery(fout, loQry->data, PGRES_TUPLES_OK); @@ -3609,30 +3619,72 @@ getLOs(Archive *fout) ntups = PQntuples(res); /* - * Each large object has its own "BLOB" archive entry. 
+ * Group the blobs into suitably-sized groups that have the same owner and + * ACL setting, and build a metadata and a data DumpableObject for each + * group. (If we supported initprivs for blobs, we'd have to insist that + * groups also share initprivs settings, since the DumpableObject only has + * room for one.) i is the index of the first tuple in the current group, + * and n is the number of tuples we include in the group. */ - loinfo = (LoInfo *) pg_malloc(ntups * sizeof(LoInfo)); + for (i = 0; i < ntups; i += n) + { + Oid thisoid = atooid(PQgetvalue(res, i, i_oid)); + char *thisowner = PQgetvalue(res, i, i_lomowner); + char *thisacl = PQgetvalue(res, i, i_lomacl); + LoInfo *loinfo; + DumpableObject *lodata; + char namebuf[64]; + + /* Scan to find first tuple not to be included in group */ + n = 1; + while (n < MAX_BLOBS_PER_ARCHIVE_ENTRY && i + n < ntups) + { + if (strcmp(thisowner, PQgetvalue(res, i + n, i_lomowner)) != 0 || + strcmp(thisacl, PQgetvalue(res, i + n, i_lomacl)) != 0) + break; + n++; + } - for (i = 0; i < ntups; i++) - { - loinfo[i].dobj.objType = DO_LARGE_OBJECT; - loinfo[i].dobj.catId.tableoid = LargeObjectRelationId; - loinfo[i].dobj.catId.oid = atooid(PQgetvalue(res, i, i_oid)); - AssignDumpId(&loinfo[i].dobj); + /* Build the metadata DumpableObject */ + loinfo = (LoInfo *) pg_malloc(offsetof(LoInfo, looids) + n * sizeof(Oid)); + + loinfo->dobj.objType = DO_LARGE_OBJECT; + loinfo->dobj.catId.tableoid = LargeObjectRelationId; + loinfo->dobj.catId.oid = thisoid; + AssignDumpId(&loinfo->dobj); + + if (n > 1) + snprintf(namebuf, sizeof(namebuf), "%u..%u", thisoid, + atooid(PQgetvalue(res, i + n - 1, i_oid))); + else + snprintf(namebuf, sizeof(namebuf), "%u", thisoid); + loinfo->dobj.name = pg_strdup(namebuf); + loinfo->dacl.acl = pg_strdup(thisacl); + loinfo->dacl.acldefault = pg_strdup(PQgetvalue(res, i, i_acldefault)); + loinfo->dacl.privtype = 0; + loinfo->dacl.initprivs = NULL; + loinfo->rolname = getRoleName(thisowner); + loinfo->numlos 
= n; + loinfo->looids[0] = thisoid; + /* Collect OIDs of the remaining blobs in this group */ + for (int k = 1; k < n; k++) + { + CatalogId extraID; + + loinfo->looids[k] = atooid(PQgetvalue(res, i + k, i_oid)); - loinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_oid)); - loinfo[i].dacl.acl = pg_strdup(PQgetvalue(res, i, i_lomacl)); - loinfo[i].dacl.acldefault = pg_strdup(PQgetvalue(res, i, i_acldefault)); - loinfo[i].dacl.privtype = 0; - loinfo[i].dacl.initprivs = NULL; - loinfo[i].rolname = getRoleName(PQgetvalue(res, i, i_lomowner)); + /* Make sure we can look up loinfo by any of the blobs' OIDs */ + extraID.tableoid = LargeObjectRelationId; + extraID.oid = loinfo->looids[k]; + recordAdditionalCatalogID(extraID, &loinfo->dobj); + } /* LOs have data */ - loinfo[i].dobj.components |= DUMP_COMPONENT_DATA; + loinfo->dobj.components |= DUMP_COMPONENT_DATA; - /* Mark whether LO has an ACL */ + /* Mark whether LO group has a non-empty ACL */ if (!PQgetisnull(res, i, i_lomacl)) - loinfo[i].dobj.components |= DUMP_COMPONENT_ACL; + loinfo->dobj.components |= DUMP_COMPONENT_ACL; /* * In binary-upgrade mode for LOs, we do *not* dump out the LO data, @@ -3642,21 +3694,22 @@ getLOs(Archive *fout) * pg_largeobject_metadata, after the dump is restored. */ if (dopt->binary_upgrade) - loinfo[i].dobj.dump &= ~DUMP_COMPONENT_DATA; - } + loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA; - /* - * If we have any large objects, a "BLOBS" archive entry is needed. This - * is just a placeholder for sorting; it carries no data now. - */ - if (ntups > 0) - { + /* + * Create a "BLOBS" data item for the group, too. This is just a + * placeholder for sorting; it carries no data now. 
+ */ lodata = (DumpableObject *) pg_malloc(sizeof(DumpableObject)); lodata->objType = DO_LARGE_OBJECT_DATA; lodata->catId = nilCatalogId; AssignDumpId(lodata); - lodata->name = pg_strdup("BLOBS"); + lodata->name = pg_strdup(namebuf); lodata->components |= DUMP_COMPONENT_DATA; + /* Set up explicit dependency from data to metadata */ + lodata->dependencies = (DumpId *) pg_malloc(sizeof(DumpId)); + lodata->dependencies[0] = loinfo->dobj.dumpId; + lodata->nDeps = lodata->allocDeps = 1; } PQclear(res); @@ -3666,123 +3719,109 @@ getLOs(Archive *fout) /* * dumpLO * - * dump the definition (metadata) of the given large object + * dump the definition (metadata) of the given large object group */ static void dumpLO(Archive *fout, const LoInfo *loinfo) { PQExpBuffer cquery = createPQExpBuffer(); - PQExpBuffer dquery = createPQExpBuffer(); - appendPQExpBuffer(cquery, - "SELECT pg_catalog.lo_create('%s');\n", - loinfo->dobj.name); - - appendPQExpBuffer(dquery, - "SELECT pg_catalog.lo_unlink('%s');\n", - loinfo->dobj.name); + /* + * The "definition" is just a newline-separated list of OIDs. We need to + * put something into the dropStmt too, but it can just be a comment. 
+ */ + for (int i = 0; i < loinfo->numlos; i++) + appendPQExpBuffer(cquery, "%u\n", loinfo->looids[i]); if (loinfo->dobj.dump & DUMP_COMPONENT_DEFINITION) ArchiveEntry(fout, loinfo->dobj.catId, loinfo->dobj.dumpId, ARCHIVE_OPTS(.tag = loinfo->dobj.name, .owner = loinfo->rolname, - .description = "BLOB", + .description = "BLOB METADATA", .section = SECTION_PRE_DATA, .createStmt = cquery->data, - .dropStmt = dquery->data)); - - /* Dump comment if any */ - if (loinfo->dobj.dump & DUMP_COMPONENT_COMMENT) - dumpComment(fout, "LARGE OBJECT", loinfo->dobj.name, - NULL, loinfo->rolname, - loinfo->dobj.catId, 0, loinfo->dobj.dumpId); - - /* Dump security label if any */ - if (loinfo->dobj.dump & DUMP_COMPONENT_SECLABEL) - dumpSecLabel(fout, "LARGE OBJECT", loinfo->dobj.name, - NULL, loinfo->rolname, - loinfo->dobj.catId, 0, loinfo->dobj.dumpId); - - /* Dump ACL if any */ - if (loinfo->dobj.dump & DUMP_COMPONENT_ACL) - dumpACL(fout, loinfo->dobj.dumpId, InvalidDumpId, "LARGE OBJECT", - loinfo->dobj.name, NULL, - NULL, loinfo->rolname, &loinfo->dacl); + .dropStmt = "-- dummy")); + + /* + * Dump per-blob comments, seclabels, and ACLs if any. We assume these + * are rare enough that it's okay to generate retail TOC entries for them. 
+ */ + if (loinfo->dobj.dump & (DUMP_COMPONENT_COMMENT | + DUMP_COMPONENT_SECLABEL | + DUMP_COMPONENT_ACL)) + { + for (int i = 0; i < loinfo->numlos; i++) + { + CatalogId catId; + char namebuf[32]; + + /* Build identifying info for this blob */ + catId.tableoid = loinfo->dobj.catId.tableoid; + catId.oid = loinfo->looids[i]; + snprintf(namebuf, sizeof(namebuf), "%u", loinfo->looids[i]); + + if (loinfo->dobj.dump & DUMP_COMPONENT_COMMENT) + dumpComment(fout, "LARGE OBJECT", namebuf, + NULL, loinfo->rolname, + catId, 0, loinfo->dobj.dumpId); + + if (loinfo->dobj.dump & DUMP_COMPONENT_SECLABEL) + dumpSecLabel(fout, "LARGE OBJECT", namebuf, + NULL, loinfo->rolname, + catId, 0, loinfo->dobj.dumpId); + + if (loinfo->dobj.dump & DUMP_COMPONENT_ACL) + dumpACL(fout, loinfo->dobj.dumpId, InvalidDumpId, + "LARGE OBJECT", namebuf, NULL, + NULL, loinfo->rolname, &loinfo->dacl); + } + } destroyPQExpBuffer(cquery); - destroyPQExpBuffer(dquery); } /* * dumpLOs: - * dump the data contents of all large objects + * dump the data contents of the large objects in the given group */ static int dumpLOs(Archive *fout, const void *arg) { - const char *loQry; - const char *loFetchQry; + const LoInfo *loinfo = (const LoInfo *) arg; PGconn *conn = GetConnection(fout); - PGresult *res; char buf[LOBBUFSIZE]; - int ntups; - int i; - int cnt; - pg_log_info("saving large objects"); - - /* - * Currently, we re-fetch all LO OIDs using a cursor. Consider scanning - * the already-in-memory dumpable objects instead... 
- */ - loQry = - "DECLARE looid CURSOR FOR " - "SELECT oid FROM pg_largeobject_metadata ORDER BY 1"; + pg_log_info("saving large objects \"%s\"", loinfo->dobj.name); - ExecuteSqlStatement(fout, loQry); + for (int i = 0; i < loinfo->numlos; i++) + { + Oid loOid = loinfo->looids[i]; + int loFd; + int cnt; - /* Command to fetch from cursor */ - loFetchQry = "FETCH 1000 IN looid"; + /* Open the LO */ + loFd = lo_open(conn, loOid, INV_READ); + if (loFd == -1) + pg_fatal("could not open large object %u: %s", + loOid, PQerrorMessage(conn)); - do - { - /* Do a fetch */ - res = ExecuteSqlQuery(fout, loFetchQry, PGRES_TUPLES_OK); + StartLO(fout, loOid); - /* Process the tuples, if any */ - ntups = PQntuples(res); - for (i = 0; i < ntups; i++) + /* Now read it in chunks, sending data to archive */ + do { - Oid loOid; - int loFd; - - loOid = atooid(PQgetvalue(res, i, 0)); - /* Open the LO */ - loFd = lo_open(conn, loOid, INV_READ); - if (loFd == -1) - pg_fatal("could not open large object %u: %s", + cnt = lo_read(conn, loFd, buf, LOBBUFSIZE); + if (cnt < 0) + pg_fatal("error reading large object %u: %s", loOid, PQerrorMessage(conn)); - StartLO(fout, loOid); - - /* Now read it in chunks, sending data to archive */ - do - { - cnt = lo_read(conn, loFd, buf, LOBBUFSIZE); - if (cnt < 0) - pg_fatal("error reading large object %u: %s", - loOid, PQerrorMessage(conn)); + WriteData(fout, buf, cnt); + } while (cnt > 0); - WriteData(fout, buf, cnt); - } while (cnt > 0); + lo_close(conn, loFd); - lo_close(conn, loFd); - - EndLO(fout, loOid); - } - - PQclear(res); - } while (ntups > 0); + EndLO(fout, loOid); + } return 1; } @@ -10595,28 +10634,34 @@ dumpDumpableObject(Archive *fout, DumpableObject *dobj) case DO_LARGE_OBJECT_DATA: if (dobj->dump & DUMP_COMPONENT_DATA) { + LoInfo *loinfo; TocEntry *te; + loinfo = (LoInfo *) findObjectByDumpId(dobj->dependencies[0]); + if (loinfo == NULL) + pg_fatal("missing metadata for large objects \"%s\"", + dobj->name); + te = ArchiveEntry(fout, 
dobj->catId, dobj->dumpId, ARCHIVE_OPTS(.tag = dobj->name, + .owner = loinfo->rolname, .description = "BLOBS", .section = SECTION_DATA, - .dumpFn = dumpLOs)); + .deps = dobj->dependencies, + .nDeps = dobj->nDeps, + .dumpFn = dumpLOs, + .dumpArg = loinfo)); /* * Set the TocEntry's dataLength in case we are doing a * parallel dump and want to order dump jobs by table size. * (We need some size estimate for every TocEntry with a * DataDumper function.) We don't currently have any cheap - * way to estimate the size of LOs, but it doesn't matter; - * let's just set the size to a large value so parallel dumps - * will launch this job first. If there's lots of LOs, we - * win, and if there aren't, we don't lose much. (If you want - * to improve on this, really what you should be thinking - * about is allowing LO dumping to be parallelized, not just - * getting a smarter estimate for the single TOC entry.) + * way to estimate the size of LOs, but fortunately it doesn't + * matter too much as long as we get large batches of LOs + * processed reasonably early. Assume 8K per blob. */ - te->dataLength = INT_MAX; + te->dataLength = loinfo->numlos * (pgoff_t) 8192; } break; case DO_POLICY: diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h index 93d97a4090..c83e0298ca 100644 --- a/src/bin/pg_dump/pg_dump.h +++ b/src/bin/pg_dump/pg_dump.h @@ -583,11 +583,21 @@ typedef struct _defaultACLInfo char defaclobjtype; } DefaultACLInfo; +/* + * LoInfo represents a group of large objects (blobs) that share the same + * owner and ACL setting. dobj.components has the DUMP_COMPONENT_COMMENT bit + * set if any blob in the group has a comment; similarly for sec labels. + * If there are many blobs with the same owner/ACL, we can divide them into + * multiple LoInfo groups, which will each spawn a BLOB METADATA and a BLOBS + * (data) TOC entry. This allows more parallelism during restore. 
+ */ typedef struct _loInfo { DumpableObject dobj; DumpableAcl dacl; const char *rolname; + int numlos; + Oid looids[FLEXIBLE_ARRAY_MEMBER]; } LoInfo; /* @@ -695,6 +705,7 @@ typedef struct _SubRelInfo extern TableInfo *getSchemaData(Archive *fout, int *numTablesPtr); extern void AssignDumpId(DumpableObject *dobj); +extern void recordAdditionalCatalogID(CatalogId catId, DumpableObject *dobj); extern DumpId createDumpId(void); extern DumpId getMaxDumpId(void); extern DumpableObject *findObjectByDumpId(DumpId dumpId); diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index 00b5092713..df11724e93 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -109,11 +109,11 @@ my %pgdump_runs = ( '--format=directory', '--compress=gzip:1', "--file=$tempdir/compression_gzip_dir", 'postgres', ], - # Give coverage for manually compressed blob.toc files during + # Give coverage for manually compressed blobs.toc files during # restore. compress_cmd => { program => $ENV{'GZIP_PROGRAM'}, - args => [ '-f', "$tempdir/compression_gzip_dir/blobs.toc", ], + args => [ '-f', "$tempdir/compression_gzip_dir/blobs_*.toc", ], }, # Verify that only data files were compressed glob_patterns => [ @@ -172,16 +172,6 @@ my %pgdump_runs = ( '--format=directory', '--compress=lz4:1', "--file=$tempdir/compression_lz4_dir", 'postgres', ], - # Give coverage for manually compressed blob.toc files during - # restore. 
- compress_cmd => { - program => $ENV{'LZ4'}, - args => [ - '-z', '-f', '--rm', - "$tempdir/compression_lz4_dir/blobs.toc", - "$tempdir/compression_lz4_dir/blobs.toc.lz4", - ], - }, # Verify that data files were compressed glob_patterns => [ "$tempdir/compression_lz4_dir/toc.dat", @@ -242,14 +232,13 @@ my %pgdump_runs = ( '--format=directory', '--compress=zstd:1', "--file=$tempdir/compression_zstd_dir", 'postgres', ], - # Give coverage for manually compressed blob.toc files during + # Give coverage for manually compressed blobs.toc files during # restore. compress_cmd => { program => $ENV{'ZSTD'}, args => [ '-z', '-f', - '--rm', "$tempdir/compression_zstd_dir/blobs.toc", - "-o", "$tempdir/compression_zstd_dir/blobs.toc.zst", + '--rm', "$tempdir/compression_zstd_dir/blobs_*.toc", ], }, # Verify that data files were compressed @@ -413,7 +402,7 @@ my %pgdump_runs = ( }, glob_patterns => [ "$tempdir/defaults_dir_format/toc.dat", - "$tempdir/defaults_dir_format/blobs.toc", + "$tempdir/defaults_dir_format/blobs_*.toc", $supports_gzip ? "$tempdir/defaults_dir_format/*.dat.gz" : "$tempdir/defaults_dir_format/*.dat", ], @@ -4858,8 +4847,13 @@ foreach my $run (sort keys %pgdump_runs) # not defined. next if (!defined($compress_program) || $compress_program eq ''); - my @full_compress_cmd = - ($compress_cmd->{program}, @{ $compress_cmd->{args} }); + # Arguments may require globbing. + my @full_compress_cmd = ($compress_program); + foreach my $arg (@{ $compress_cmd->{args} }) + { + push @full_compress_cmd, glob($arg); + } + command_ok(\@full_compress_cmd, "$run: compression commands"); } -- 2.39.3 From cc316ccc43c6e7c3b2d1631a5cdc48d57fa26344 Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Fri, 26 Jan 2024 11:27:51 -0500 Subject: [PATCH v10 3/4] Move BLOBS METADATA TOC entries into SECTION_DATA. 
Commit c0d5be5d6 put the new BLOB metadata TOC entries into SECTION_PRE_DATA, which perhaps is defensible in some ways, but it's a rather odd choice considering that we go out of our way to treat blobs as data. Moreover, because parallel restore handles the PRE_DATA section serially, this means we're only getting part of the parallelism speedup we could hope for. Moving these entries into SECTION_DATA means that we can parallelize the lo_create calls, not only the data loading, when there are many blobs. The dependencies established by the previous patch ensure that we won't try to load data for a blob we've not yet created. --- src/bin/pg_dump/pg_dump.c | 4 ++-- src/bin/pg_dump/t/002_pg_dump.pl | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 0245b22ef0..77d745d55e 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -3738,7 +3738,7 @@ dumpLO(Archive *fout, const LoInfo *loinfo) ARCHIVE_OPTS(.tag = loinfo->dobj.name, .owner = loinfo->rolname, .description = "BLOB METADATA", - .section = SECTION_PRE_DATA, + .section = SECTION_DATA, .createStmt = cquery->data, .dropStmt = "-- dummy")); @@ -18612,12 +18612,12 @@ addBoundaryDependencies(DumpableObject **dobjs, int numObjs, case DO_FDW: case DO_FOREIGN_SERVER: case DO_TRANSFORM: - case DO_LARGE_OBJECT: /* Pre-data objects: must come before the pre-data boundary */ addObjectDependency(preDataBound, dobj->dumpId); break; case DO_TABLE_DATA: case DO_SEQUENCE_SET: + case DO_LARGE_OBJECT: case DO_LARGE_OBJECT_DATA: /* Data objects: must come between the boundaries */ addObjectDependency(dobj, preDataBound->dumpId); diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index df11724e93..25b099395a 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -912,7 +912,7 @@ my %tests = ( column_inserts => 1, data_only => 1, inserts => 1, - section_pre_data => 1, +
section_data => 1, test_schema_plus_large_objects => 1, }, unlike => { @@ -1325,7 +1325,7 @@ column_inserts => 1, data_only => 1, inserts => 1, - section_pre_data => 1, + section_data => 1, test_schema_plus_large_objects => 1, }, unlike => { @@ -1533,7 +1533,7 @@ column_inserts => 1, data_only => 1, inserts => 1, - section_pre_data => 1, + section_data => 1, test_schema_plus_large_objects => 1, }, unlike => { @@ -4278,7 +4278,7 @@ column_inserts => 1, data_only => 1, inserts => 1, - section_pre_data => 1, + section_data => 1, test_schema_plus_large_objects => 1, binary_upgrade => 1, }, -- 2.39.3 From df4c150100416e1c6030221081302d7b274309da Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Fri, 26 Jan 2024 11:37:37 -0500 Subject: [PATCH v10 4/4] Invent --transaction-size option for pg_restore. This patch allows pg_restore to wrap its commands into transaction blocks, somewhat like --single-transaction, except that we commit and start a new block after every N objects. Using this mode with a size limit of 1000 or so objects greatly reduces the number of transactions consumed by the restore, while preventing any one transaction from taking enough locks to overrun the receiving server's shared lock table. (A value of 1000 works well with the default lock table size of around 6400 locks. Higher --transaction-size values can be used if one has increased the receiving server's lock table size.) In this patch I have just hard-wired pg_upgrade to use --transaction-size 1000. Perhaps there would be value in adding another pg_upgrade option to allow user control of that, but I'm unsure that it's worth the trouble; I think few users would use it, and any who did would see little benefit. However, we might need to adjust the logic to make the size 1000 divided by the number of parallel restore jobs allowed.
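For reference, the lock-table arithmetic behind the chosen batch size can be sketched as follows. The capacity formula is the one documented for PostgreSQL's shared lock table; the settings shown are the shipped defaults, and the pg_restore invocation in the comment is purely illustrative.

```shell
# Shared lock table capacity:
#   max_locks_per_transaction * (max_connections + max_prepared_transactions)
# With stock defaults (64, 100, 0):
locks=$(( 64 * (100 + 0) ))
echo "$locks"    # prints 6400
# A batch of 1000 objects, each typically holding only a handful of
# locks, therefore stays within the default table, e.g. (illustrative
# database and file names):
#   pg_restore --transaction-size=1000 -d targetdb dump.custom
```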
--- doc/src/sgml/ref/pg_restore.sgml | 24 +++++ src/bin/pg_dump/pg_backup.h | 4 +- src/bin/pg_dump/pg_backup_archiver.c | 139 +++++++++++++++++++++++++-- src/bin/pg_dump/pg_backup_archiver.h | 3 + src/bin/pg_dump/pg_backup_db.c | 18 ++++ src/bin/pg_dump/pg_restore.c | 15 ++- src/bin/pg_upgrade/pg_upgrade.c | 11 +++ 7 files changed, 206 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml index 1a23874da6..2e3ba80258 100644 --- a/doc/src/sgml/ref/pg_restore.sgml +++ b/doc/src/sgml/ref/pg_restore.sgml @@ -786,6 +786,30 @@ PostgreSQL documentation </listitem> </varlistentry> + <varlistentry> + <term><option>--transaction-size=<replaceable class="parameter">N</replaceable></option></term> + <listitem> + <para> + Execute the restore as a series of transactions, each processing + up to <replaceable class="parameter">N</replaceable> database + objects. This option implies <option>--exit-on-error</option>. + </para> + <para> + <option>--transaction-size</option> offers an intermediate choice + between the default behavior (one transaction per SQL command) + and <option>-1</option>/<option>--single-transaction</option> + (one transaction for all restored objects). + While <option>--single-transaction</option> has the least + overhead, it may be impractical for large databases because the + transaction will take a lock on each restored object, possibly + exhausting the server's lock table space. + Using <option>--transaction-size</option> with a size of a few + thousand objects offers nearly the same performance benefits while + capping the amount of lock table space needed. 
+ </para> + </listitem> + </varlistentry> + <varlistentry> <term><option>--use-set-session-authorization</option></term> <listitem> diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h index 9ef2f2017e..fbf5f1c515 100644 --- a/src/bin/pg_dump/pg_backup.h +++ b/src/bin/pg_dump/pg_backup.h @@ -149,7 +149,9 @@ typedef struct _restoreOptions * compression */ int suppressDumpWarnings; /* Suppress output of WARNING entries * to stderr */ - bool single_txn; + + bool single_txn; /* restore all TOCs in one transaction */ + int txn_size; /* restore this many TOCs per txn, if > 0 */ bool *idWanted; /* array showing which dump IDs to emit */ int enable_row_security; diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 73b9972da4..ec74846998 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -502,7 +502,28 @@ RestoreArchive(Archive *AHX) /* Otherwise, drop anything that's selected and has a dropStmt */ if (((te->reqs & (REQ_SCHEMA | REQ_DATA)) != 0) && te->dropStmt) { + bool not_allowed_in_txn = false; + pg_log_info("dropping %s %s", te->desc, te->tag); + + /* + * In --transaction-size mode, we have to temporarily exit our + * transaction block to drop objects that can't be dropped + * within a transaction. + */ + if (ropt->txn_size > 0) + { + if (strcmp(te->desc, "DATABASE") == 0 || + strcmp(te->desc, "DATABASE PROPERTIES") == 0) + { + not_allowed_in_txn = true; + if (AH->connection) + CommitTransaction(AHX); + else + ahprintf(AH, "COMMIT;\n"); + } + } + /* Select owner and schema as necessary */ _becomeOwner(AH, te); _selectOutputSchema(AH, te->namespace); @@ -628,6 +649,33 @@ RestoreArchive(Archive *AHX) } } } + + /* + * In --transaction-size mode, re-establish the transaction + * block if needed; otherwise, commit after every N drops. 
+ */ + if (ropt->txn_size > 0) + { + if (not_allowed_in_txn) + { + if (AH->connection) + StartTransaction(AHX); + else + ahprintf(AH, "BEGIN;\n"); + AH->txnCount = 0; + } + else if (++AH->txnCount >= ropt->txn_size) + { + if (AH->connection) + { + CommitTransaction(AHX); + StartTransaction(AHX); + } + else + ahprintf(AH, "COMMIT;\nBEGIN;\n"); + AH->txnCount = 0; + } + } } } @@ -724,7 +772,11 @@ RestoreArchive(Archive *AHX) } } - if (ropt->single_txn) + /* + * Close out any persistent transaction we may have. While these two + * cases are started in different places, we can end both cases here. + */ + if (ropt->single_txn || ropt->txn_size > 0) { if (AH->connection) CommitTransaction(AHX); @@ -785,6 +837,25 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel) */ if ((reqs & REQ_SCHEMA) != 0) { + bool object_is_db = false; + + /* + * In --transaction-size mode, must exit our transaction block to + * create a database or set its properties. + */ + if (strcmp(te->desc, "DATABASE") == 0 || + strcmp(te->desc, "DATABASE PROPERTIES") == 0) + { + object_is_db = true; + if (ropt->txn_size > 0) + { + if (AH->connection) + CommitTransaction(&AH->public); + else + ahprintf(AH, "COMMIT;\n\n"); + } + } + /* Show namespace in log message if available */ if (te->namespace) pg_log_info("creating %s \"%s.%s\"", @@ -835,10 +906,10 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel) /* * If we created a DB, connect to it. Also, if we changed DB * properties, reconnect to ensure that relevant GUC settings are - * applied to our session. + * applied to our session. (That also restarts the transaction block + * in --transaction-size mode.) 
*/ - if (strcmp(te->desc, "DATABASE") == 0 || - strcmp(te->desc, "DATABASE PROPERTIES") == 0) + if (object_is_db) { pg_log_info("connecting to new database \"%s\"", te->tag); _reconnectToDB(AH, te->tag); @@ -964,6 +1035,25 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel) } } + /* + * If we emitted anything for this TOC entry, that counts as one action + * against the transaction-size limit. Commit if it's time to. + */ + if ((reqs & (REQ_SCHEMA | REQ_DATA)) != 0 && ropt->txn_size > 0) + { + if (++AH->txnCount >= ropt->txn_size) + { + if (AH->connection) + { + CommitTransaction(&AH->public); + StartTransaction(&AH->public); + } + else + ahprintf(AH, "COMMIT;\nBEGIN;\n\n"); + AH->txnCount = 0; + } + } + if (AH->public.n_errors > 0 && status == WORKER_OK) status = WORKER_IGNORED_ERRORS; @@ -1310,7 +1400,12 @@ StartRestoreLOs(ArchiveHandle *AH) { RestoreOptions *ropt = AH->public.ropt; - if (!ropt->single_txn) + /* + * LOs must be restored within a transaction block, since we need the LO + * handle to stay open while we write it. Establish a transaction unless + * there's one being used globally. + */ + if (!(ropt->single_txn || ropt->txn_size > 0)) { if (AH->connection) StartTransaction(&AH->public); @@ -1329,7 +1424,7 @@ EndRestoreLOs(ArchiveHandle *AH) { RestoreOptions *ropt = AH->public.ropt; - if (!ropt->single_txn) + if (!(ropt->single_txn || ropt->txn_size > 0)) { if (AH->connection) CommitTransaction(&AH->public); @@ -3170,6 +3265,19 @@ _doSetFixedOutputState(ArchiveHandle *AH) else ahprintf(AH, "SET row_security = off;\n"); + /* + * In --transaction-size mode, we should always be in a transaction when + * we begin to restore objects. 
+ */ + if (ropt && ropt->txn_size > 0) + { + if (AH->connection) + StartTransaction(&AH->public); + else + ahprintf(AH, "\nBEGIN;\n"); + AH->txnCount = 0; + } + ahprintf(AH, "\n"); } @@ -4033,6 +4141,14 @@ restore_toc_entries_prefork(ArchiveHandle *AH, TocEntry *pending_list) } } + /* + * In --transaction-size mode, we must commit the open transaction before + * dropping the database connection. This also ensures that child workers + * can see the objects we've created so far. + */ + if (AH->public.ropt->txn_size > 0) + CommitTransaction(&AH->public); + /* * Now close parent connection in prep for parallel steps. We do this * mainly to ensure that we don't exceed the specified number of parallel @@ -4772,6 +4888,10 @@ CloneArchive(ArchiveHandle *AH) clone = (ArchiveHandle *) pg_malloc(sizeof(ArchiveHandle)); memcpy(clone, AH, sizeof(ArchiveHandle)); + /* Likewise flat-copy the RestoreOptions, so we can alter them locally */ + clone->public.ropt = (RestoreOptions *) pg_malloc(sizeof(RestoreOptions)); + memcpy(clone->public.ropt, AH->public.ropt, sizeof(RestoreOptions)); + /* Handle format-independent fields */ memset(&(clone->sqlparse), 0, sizeof(clone->sqlparse)); @@ -4793,6 +4913,13 @@ CloneArchive(ArchiveHandle *AH) /* clones should not share lo_buf */ clone->lo_buf = NULL; + /* + * Clone connections disregard --transaction-size; they must commit after + * each command so that the results are immediately visible to other + * workers. + */ + clone->public.ropt->txn_size = 0; + /* * Connect our new clone object to the database, using the same connection * parameters used for the original connection. 
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h index e4dd395582..1b9f142dea 100644 --- a/src/bin/pg_dump/pg_backup_archiver.h +++ b/src/bin/pg_dump/pg_backup_archiver.h @@ -324,6 +324,9 @@ struct _archiveHandle char *currTablespace; /* current tablespace, or NULL */ char *currTableAm; /* current table access method, or NULL */ + /* in --transaction-size mode, this counts objects emitted in cur xact */ + int txnCount; + void *lo_buf; size_t lo_buf_used; size_t lo_buf_size; diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c index c14d813b21..6b3bf174f2 100644 --- a/src/bin/pg_dump/pg_backup_db.c +++ b/src/bin/pg_dump/pg_backup_db.c @@ -554,6 +554,7 @@ IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te, { /* Make a writable copy of the command string */ char *buf = pg_strdup(te->defn); + RestoreOptions *ropt = AH->public.ropt; char *st; char *en; @@ -562,6 +563,23 @@ IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te, { *en++ = '\0'; ahprintf(AH, "%s%s%s;\n", cmdBegin, st, cmdEnd); + + /* In --transaction-size mode, count each command as an action */ + if (ropt && ropt->txn_size > 0) + { + if (++AH->txnCount >= ropt->txn_size) + { + if (AH->connection) + { + CommitTransaction(&AH->public); + StartTransaction(&AH->public); + } + else + ahprintf(AH, "COMMIT;\nBEGIN;\n\n"); + AH->txnCount = 0; + } + } + st = en; } ahprintf(AH, "\n"); diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c index c3beacdec1..5ea78cf7cc 100644 --- a/src/bin/pg_dump/pg_restore.c +++ b/src/bin/pg_dump/pg_restore.c @@ -120,6 +120,7 @@ main(int argc, char **argv) {"role", required_argument, NULL, 2}, {"section", required_argument, NULL, 3}, {"strict-names", no_argument, &strict_names, 1}, + {"transaction-size", required_argument, NULL, 5}, {"use-set-session-authorization", no_argument, &use_setsessauth, 1}, {"no-comments", no_argument, &no_comments, 1}, {"no-publications", no_argument, &no_publications, 1}, 
@@ -289,10 +290,18 @@ main(int argc, char **argv) set_dump_section(optarg, &(opts->dumpSections)); break; - case 4: + case 4: /* filter */ read_restore_filters(optarg, opts); break; + case 5: /* transaction-size */ + if (!option_parse_int(optarg, "--transaction-size", + 1, INT_MAX, + &opts->txn_size)) + exit(1); + opts->exit_on_error = true; + break; + default: /* getopt_long already emitted a complaint */ pg_log_error_hint("Try \"%s --help\" for more information.", progname); @@ -337,6 +346,9 @@ main(int argc, char **argv) if (opts->dataOnly && opts->dropSchema) pg_fatal("options -c/--clean and -a/--data-only cannot be used together"); + if (opts->single_txn && opts->txn_size > 0) + pg_fatal("options -1/--single-transaction and --transaction-size cannot be used together"); + /* * -C is not compatible with -1, because we can't create a database inside * a transaction block. @@ -484,6 +496,7 @@ usage(const char *progname) printf(_(" --section=SECTION restore named section (pre-data, data, or post-data)\n")); printf(_(" --strict-names require table and/or schema include patterns to\n" " match at least one entity each\n")); + printf(_(" --transaction-size=N commit after every N objects\n")); printf(_(" --use-set-session-authorization\n" " use SET SESSION AUTHORIZATION commands instead of\n" " ALTER OWNER commands to set ownership\n")); diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c index 10c94a6c1f..6a698e0c87 100644 --- a/src/bin/pg_upgrade/pg_upgrade.c +++ b/src/bin/pg_upgrade/pg_upgrade.c @@ -51,6 +51,13 @@ #include "fe_utils/string_utils.h" #include "pg_upgrade.h" +/* + * Maximum number of pg_restore actions (TOC entries) to process within one + * transaction. At some point we might want to make this user-controllable, + * but for now a hard-wired setting will suffice. 
+ */ +#define RESTORE_TRANSACTION_SIZE 1000 + static void set_locale_and_encoding(void); static void prepare_new_cluster(void); static void prepare_new_globals(void); @@ -548,10 +555,12 @@ create_new_objects(void) true, true, "\"%s/pg_restore\" %s %s --exit-on-error --verbose " + "--transaction-size=%d " "--dbname postgres \"%s/%s\"", new_cluster.bindir, cluster_conn_opts(&new_cluster), create_opts, + RESTORE_TRANSACTION_SIZE, log_opts.dumpdir, sql_file_name); @@ -586,10 +595,12 @@ create_new_objects(void) parallel_exec_prog(log_file_name, NULL, "\"%s/pg_restore\" %s %s --exit-on-error --verbose " + "--transaction-size=%d " "--dbname template1 \"%s/%s\"", new_cluster.bindir, cluster_conn_opts(&new_cluster), create_opts, + RESTORE_TRANSACTION_SIZE, log_opts.dumpdir, sql_file_name); } -- 2.39.3
This patch seems to have stalled out again.  In hopes of getting it over
the finish line, I've done a bit more work to address the two loose ends
I felt were probably essential to deal with:

* Duplicative blob ACLs are now merged into a single TOC entry (per
metadata group) with the GRANT/REVOKE commands stored only once.  This is
to address the possibly-common case where a database has a ton of blobs
that have identical-but-not-default ACLs.

I have not done anything about improving efficiency for blob comments or
security labels.  I think it's reasonable to assume that blobs with
comments are pets not cattle, and there won't be many of them.  I suppose
it could be argued that seclabels might be used like ACLs with a lot of
duplication, but I doubt that there's anyone out there at all putting
seclabels on blobs in practice.  So I don't care to expend effort on that.

* Parallel pg_upgrade cuts the --transaction-size given to concurrent
pg_restore jobs by the -j factor.  This is to ensure we keep the shared
lock table within bounds even in parallel mode.

Now we could go further than that and provide some direct user control
over these hard-wired settings, but I think that could be left for later,
after getting some field experience before we design an API.

In short, I think this patchset is more or less committable.  0001-0004
are rebased up to HEAD, but differ only in line numbers from the v10
patchset.  0005 handles ACL merging, and 0006 does the other thing.

regards, tom lane

From 1df5cc3ffdee85db8f9815dc0839769192b57158 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 26 Jan 2024 11:10:00 -0500
Subject: [PATCH v11 1/6] Some small preliminaries for pg_dump changes.

Centralize management of the lo_buf used to hold data while restoring
blobs.  The code previously had each format handler create lo_buf, which
seems rather pointless given that the format handlers all make it the
same way.
Moreover, the format handlers never use lo_buf directly, making this setup a failure from a separation-of-concerns standpoint. Let's move the responsibility into pg_backup_archiver.c, which is the only module concerned with lo_buf. The main reason to do this now is that it allows a centralized fix for the soon-to-be-false assumption that we never restore blobs in parallel. Also, get rid of dead code in DropLOIfExists: it's been a long time since we had any need to be able to restore to a pre-9.0 server. --- src/bin/pg_dump/pg_backup_archiver.c | 9 +++++++++ src/bin/pg_dump/pg_backup_custom.c | 7 ------- src/bin/pg_dump/pg_backup_db.c | 27 +++++---------------------- src/bin/pg_dump/pg_backup_directory.c | 6 ------ src/bin/pg_dump/pg_backup_null.c | 4 ---- src/bin/pg_dump/pg_backup_tar.c | 4 ---- 6 files changed, 14 insertions(+), 43 deletions(-) diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index d97ebaff5b..f5935b08bb 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -1343,6 +1343,12 @@ StartRestoreLO(ArchiveHandle *AH, Oid oid, bool drop) AH->loCount++; /* Initialize the LO Buffer */ + if (AH->lo_buf == NULL) + { + /* First time through (in this process) so allocate the buffer */ + AH->lo_buf_size = LOBBUFSIZE; + AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE); + } AH->lo_buf_used = 0; pg_log_info("restoring large object with OID %u", oid); @@ -4749,6 +4755,9 @@ CloneArchive(ArchiveHandle *AH) /* clone has its own error count, too */ clone->public.n_errors = 0; + /* clones should not share lo_buf */ + clone->lo_buf = NULL; + /* * Connect our new clone object to the database, using the same connection * parameters used for the original connection. 
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c index b576b29924..7c6ac89dd4 100644 --- a/src/bin/pg_dump/pg_backup_custom.c +++ b/src/bin/pg_dump/pg_backup_custom.c @@ -140,10 +140,6 @@ InitArchiveFmt_Custom(ArchiveHandle *AH) ctx = (lclContext *) pg_malloc0(sizeof(lclContext)); AH->formatData = (void *) ctx; - /* Initialize LO buffering */ - AH->lo_buf_size = LOBBUFSIZE; - AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE); - /* * Now open the file */ @@ -902,9 +898,6 @@ _Clone(ArchiveHandle *AH) * share knowledge about where the data blocks are across threads. * _PrintTocData has to be careful about the order of operations on that * state, though. - * - * Note: we do not make a local lo_buf because we expect at most one BLOBS - * entry per archive, so no parallelism is possible. */ } diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c index f766b65059..b297ca049d 100644 --- a/src/bin/pg_dump/pg_backup_db.c +++ b/src/bin/pg_dump/pg_backup_db.c @@ -544,26 +544,9 @@ CommitTransaction(Archive *AHX) void DropLOIfExists(ArchiveHandle *AH, Oid oid) { - /* - * If we are not restoring to a direct database connection, we have to - * guess about how to detect whether the LO exists. Assume new-style. 
- */ - if (AH->connection == NULL || - PQserverVersion(AH->connection) >= 90000) - { - ahprintf(AH, - "SELECT pg_catalog.lo_unlink(oid) " - "FROM pg_catalog.pg_largeobject_metadata " - "WHERE oid = '%u';\n", - oid); - } - else - { - /* Restoring to pre-9.0 server, so do it the old way */ - ahprintf(AH, - "SELECT CASE WHEN EXISTS(" - "SELECT 1 FROM pg_catalog.pg_largeobject WHERE loid = '%u'" - ") THEN pg_catalog.lo_unlink('%u') END;\n", - oid, oid); - } + ahprintf(AH, + "SELECT pg_catalog.lo_unlink(oid) " + "FROM pg_catalog.pg_largeobject_metadata " + "WHERE oid = '%u';\n", + oid); } diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c index dba57443e8..de3cfea02e 100644 --- a/src/bin/pg_dump/pg_backup_directory.c +++ b/src/bin/pg_dump/pg_backup_directory.c @@ -143,10 +143,6 @@ InitArchiveFmt_Directory(ArchiveHandle *AH) ctx->dataFH = NULL; ctx->LOsTocFH = NULL; - /* Initialize LO buffering */ - AH->lo_buf_size = LOBBUFSIZE; - AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE); - /* * Now open the TOC file */ @@ -823,8 +819,6 @@ _Clone(ArchiveHandle *AH) ctx = (lclContext *) AH->formatData; /* - * Note: we do not make a local lo_buf because we expect at most one BLOBS - * entry per archive, so no parallelism is possible. Likewise, * TOC-entry-local state isn't an issue because any one TOC entry is * touched by just one worker child. */ diff --git a/src/bin/pg_dump/pg_backup_null.c b/src/bin/pg_dump/pg_backup_null.c index 08f096251b..776f057770 100644 --- a/src/bin/pg_dump/pg_backup_null.c +++ b/src/bin/pg_dump/pg_backup_null.c @@ -63,10 +63,6 @@ InitArchiveFmt_Null(ArchiveHandle *AH) AH->ClonePtr = NULL; AH->DeClonePtr = NULL; - /* Initialize LO buffering */ - AH->lo_buf_size = LOBBUFSIZE; - AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE); - /* * Now prevent reading... 
*/ diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c index aad88ad559..4cb9707e63 100644 --- a/src/bin/pg_dump/pg_backup_tar.c +++ b/src/bin/pg_dump/pg_backup_tar.c @@ -156,10 +156,6 @@ InitArchiveFmt_Tar(ArchiveHandle *AH) ctx->filePos = 0; ctx->isSpecialScript = 0; - /* Initialize LO buffering */ - AH->lo_buf_size = LOBBUFSIZE; - AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE); - /* * Now open the tar file, and load the TOC if we're in read mode. */ -- 2.39.3 From 049b5964e4c3b5add51b02b5a74732df7af72beb Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Fri, 26 Jan 2024 11:25:26 -0500 Subject: [PATCH v11 2/6] In dumps, group large objects into matching metadata and data entries. Commit c0d5be5d6 caused pg_dump to create a separate BLOB metadata TOC entry for each large object (blob), but it did not touch the ancient decision to put all the blobs' data into a single BLOBS TOC entry. This is bad for a few reasons: for databases with millions of blobs, the TOC becomes unreasonably large, causing performance issues; selective restore of just some blobs is quite impossible; and we cannot parallelize either dump or restore of the blob data, since our architecture for that relies on farming out whole TOC entries to worker processes. To improve matters, let's group multiple blobs into each blob metadata TOC entry, and then make corresponding per-group blob data TOC entries. Selective restore using pg_restore's -l/-L switches is then possible, though only at the group level. (Perhaps we should provide a switch to allow forcing one-blob-per-group for users who need precise selective restore and don't have huge numbers of blobs. This patch doesn't yet do that, instead just hard-wiring the maximum number of blobs per entry at 1000.) The blobs in a group must all have the same owner, since the TOC entry format only allows one owner to be named. 
In this implementation we also require them to all share the same ACL (grants); the archive format wouldn't require that, but pg_dump's representation of DumpableObjects does. It seems unlikely that either restriction will be problematic for databases with huge numbers of blobs. The metadata TOC entries now have a "desc" string of "BLOB METADATA", and their "defn" string is just a newline-separated list of blob OIDs. The restore code has to generate creation commands, ALTER OWNER commands, and drop commands (for --clean mode) from that. We would need special-case code for ALTER OWNER and drop in any case, so the alternative of keeping the "defn" as directly executable SQL code for creation wouldn't buy much, and it seems like it'd bloat the archive to little purpose. The data TOC entries ("BLOBS") can be exactly the same as before, except that now there can be more than one, so we'd better give them identifying tag strings. We have to bump the archive file format version number, since existing versions of pg_restore wouldn't know they need to do something special for BLOB METADATA, plus they aren't going to work correctly with multiple BLOBS entries. Also, the directory and tar-file format handlers need some work for multiple BLOBS entries: they used to hard-wire the file name as "blobs.toc", which is replaced here with "blobs_<dumpid>.toc". The 002_pg_dump.pl test script also knows about that and requires minor updates. (I had to drop the test for manually-compressed blobs.toc files with LZ4, because lz4's obtuse command line design requires explicit specification of the output file name which seems impractical here. I don't think we're losing any useful test coverage thereby; that test stanza seems completely duplicative with the gzip and zstd cases anyway.) As this stands, we still generate a separate TOC entry for any comment, security label, or ACL attached to a blob. 
I feel comfortable in believing that comments and security labels on blobs are rare; but we might have to do something about aggregating blob ACLs into grouped TOC entries to avoid blowing up the TOC size, if there are use cases with large numbers of non-default blob ACLs. That can be done later though, as it would not create any compatibility issues. --- src/bin/pg_dump/common.c | 26 +++ src/bin/pg_dump/pg_backup_archiver.c | 76 +++++-- src/bin/pg_dump/pg_backup_archiver.h | 6 +- src/bin/pg_dump/pg_backup_custom.c | 4 +- src/bin/pg_dump/pg_backup_db.c | 27 +++ src/bin/pg_dump/pg_backup_directory.c | 38 ++-- src/bin/pg_dump/pg_backup_null.c | 4 +- src/bin/pg_dump/pg_backup_tar.c | 39 +++- src/bin/pg_dump/pg_dump.c | 287 +++++++++++++++----------- src/bin/pg_dump/pg_dump.h | 11 + src/bin/pg_dump/t/002_pg_dump.pl | 30 ++- 11 files changed, 361 insertions(+), 187 deletions(-) diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c index 0ed18b72d6..c7dd0b11fd 100644 --- a/src/bin/pg_dump/common.c +++ b/src/bin/pg_dump/common.c @@ -47,6 +47,8 @@ static DumpId lastDumpId = 0; /* Note: 0 is InvalidDumpId */ * expects that it can move them around when resizing the table. So we * cannot make the DumpableObjects be elements of the hash table directly; * instead, the hash table elements contain pointers to DumpableObjects. + * This does have the advantage of letting us map multiple CatalogIds + * to one DumpableObject, which is useful for blobs. * * It turns out to be convenient to also use this data structure to map * CatalogIds to owning extensions, if any. 
Since extension membership @@ -700,6 +702,30 @@ AssignDumpId(DumpableObject *dobj) } } +/* + * recordAdditionalCatalogID + * Record an additional catalog ID for the given DumpableObject + */ +void +recordAdditionalCatalogID(CatalogId catId, DumpableObject *dobj) +{ + CatalogIdMapEntry *entry; + bool found; + + /* CatalogId hash table must exist, if we have a DumpableObject */ + Assert(catalogIdHash != NULL); + + /* Add reference to CatalogId hash */ + entry = catalogid_insert(catalogIdHash, catId, &found); + if (!found) + { + entry->dobj = NULL; + entry->ext = NULL; + } + Assert(entry->dobj == NULL); + entry->dobj = dobj; +} + /* * Assign a DumpId that's not tied to a DumpableObject. * diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index f5935b08bb..aabcd8dde3 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -512,7 +512,20 @@ RestoreArchive(Archive *AHX) * don't necessarily emit it verbatim; at this point we add an * appropriate IF EXISTS clause, if the user requested it. */ - if (*te->dropStmt != '\0') + if (strcmp(te->desc, "BLOB METADATA") == 0) + { + /* We must generate the per-blob commands */ + if (ropt->if_exists) + IssueCommandPerBlob(AH, te, + "SELECT pg_catalog.lo_unlink(oid) " + "FROM pg_catalog.pg_largeobject_metadata " + "WHERE oid = '", "'"); + else + IssueCommandPerBlob(AH, te, + "SELECT pg_catalog.lo_unlink('", + "')"); + } + else if (*te->dropStmt != '\0') { if (!ropt->if_exists || strncmp(te->dropStmt, "--", 2) == 0) @@ -528,12 +541,12 @@ RestoreArchive(Archive *AHX) { /* * Inject an appropriate spelling of "if exists". For - * large objects, we have a separate routine that + * old-style large objects, we have a routine that * knows how to do it, without depending on * te->dropStmt; use that. For other objects we need * to parse the command. 
*/ - if (strncmp(te->desc, "BLOB", 4) == 0) + if (strcmp(te->desc, "BLOB") == 0) { DropLOIfExists(AH, te->catalogId.oid); } @@ -1290,7 +1303,7 @@ EndLO(Archive *AHX, Oid oid) **********/ /* - * Called by a format handler before any LOs are restored + * Called by a format handler before a group of LOs is restored */ void StartRestoreLOs(ArchiveHandle *AH) @@ -1309,7 +1322,7 @@ StartRestoreLOs(ArchiveHandle *AH) } /* - * Called by a format handler after all LOs are restored + * Called by a format handler after a group of LOs is restored */ void EndRestoreLOs(ArchiveHandle *AH) @@ -2994,13 +3007,14 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH) { /* * Special Case: If 'SEQUENCE SET' or anything to do with LOs, then it - * is considered a data entry. We don't need to check for the BLOBS - * entry or old-style BLOB COMMENTS, because they will have hadDumper - * = true ... but we do need to check new-style BLOB ACLs, comments, + * is considered a data entry. We don't need to check for BLOBS or + * old-style BLOB COMMENTS entries, because they will have hadDumper = + * true ... but we do need to check new-style BLOB ACLs, comments, * etc. */ if (strcmp(te->desc, "SEQUENCE SET") == 0 || strcmp(te->desc, "BLOB") == 0 || + strcmp(te->desc, "BLOB METADATA") == 0 || (strcmp(te->desc, "ACL") == 0 && strncmp(te->tag, "LARGE OBJECT ", 13) == 0) || (strcmp(te->desc, "COMMENT") == 0 && @@ -3041,6 +3055,7 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH) if (!(ropt->sequence_data && strcmp(te->desc, "SEQUENCE SET") == 0) && !(ropt->binary_upgrade && (strcmp(te->desc, "BLOB") == 0 || + strcmp(te->desc, "BLOB METADATA") == 0 || (strcmp(te->desc, "ACL") == 0 && strncmp(te->tag, "LARGE OBJECT ", 13) == 0) || (strcmp(te->desc, "COMMENT") == 0 && @@ -3613,18 +3628,26 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData) } /* - * Actually print the definition. + * Actually print the definition. 
Normally we can just print the defn + * string if any, but we have two special cases: * - * Really crude hack for suppressing AUTHORIZATION clause that old pg_dump + * 1. A crude hack for suppressing AUTHORIZATION clause that old pg_dump * versions put into CREATE SCHEMA. Don't mutate the variant for schema * "public" that is a comment. We have to do this when --no-owner mode is * selected. This is ugly, but I see no other good way ... + * + * 2. BLOB METADATA entries need special processing since their defn + * strings are just lists of OIDs, not complete SQL commands. */ if (ropt->noOwner && strcmp(te->desc, "SCHEMA") == 0 && strncmp(te->defn, "--", 2) != 0) { ahprintf(AH, "CREATE SCHEMA %s;\n\n\n", fmtId(te->tag)); } + else if (strcmp(te->desc, "BLOB METADATA") == 0) + { + IssueCommandPerBlob(AH, te, "SELECT pg_catalog.lo_create('", "')"); + } else { if (te->defn && strlen(te->defn) > 0) @@ -3645,18 +3668,31 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData) te->owner && strlen(te->owner) > 0 && te->dropStmt && strlen(te->dropStmt) > 0) { - PQExpBufferData temp; + if (strcmp(te->desc, "BLOB METADATA") == 0) + { + /* BLOB METADATA needs special code to handle multiple LOs */ + char *cmdEnd = psprintf(" OWNER TO %s", fmtId(te->owner)); + + IssueCommandPerBlob(AH, te, "ALTER LARGE OBJECT ", cmdEnd); + pg_free(cmdEnd); + } + else + { + /* For all other cases, we can use _getObjectDescription */ + PQExpBufferData temp; - initPQExpBuffer(&temp); - _getObjectDescription(&temp, te); + initPQExpBuffer(&temp); + _getObjectDescription(&temp, te); - /* - * If _getObjectDescription() didn't fill the buffer, then there is no - * owner. - */ - if (temp.data[0]) - ahprintf(AH, "ALTER %s OWNER TO %s;\n\n", temp.data, fmtId(te->owner)); - termPQExpBuffer(&temp); + /* + * If _getObjectDescription() didn't fill the buffer, then there + * is no owner. 
+ */ + if (temp.data[0]) + ahprintf(AH, "ALTER %s OWNER TO %s;\n\n", + temp.data, fmtId(te->owner)); + termPQExpBuffer(&temp); + } } /* diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h index 917283fd34..e4dd395582 100644 --- a/src/bin/pg_dump/pg_backup_archiver.h +++ b/src/bin/pg_dump/pg_backup_archiver.h @@ -68,10 +68,12 @@ #define K_VERS_1_15 MAKE_ARCHIVE_VERSION(1, 15, 0) /* add * compression_algorithm * in header */ +#define K_VERS_1_16 MAKE_ARCHIVE_VERSION(1, 16, 0) /* BLOB METADATA entries + * and multiple BLOBS */ /* Current archive version number (the format we can output) */ #define K_VERS_MAJOR 1 -#define K_VERS_MINOR 15 +#define K_VERS_MINOR 16 #define K_VERS_REV 0 #define K_VERS_SELF MAKE_ARCHIVE_VERSION(K_VERS_MAJOR, K_VERS_MINOR, K_VERS_REV) @@ -448,6 +450,8 @@ extern void InitArchiveFmt_Tar(ArchiveHandle *AH); extern bool isValidTarHeader(char *header); extern void ReconnectToServer(ArchiveHandle *AH, const char *dbname); +extern void IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te, + const char *cmdBegin, const char *cmdEnd); extern void DropLOIfExists(ArchiveHandle *AH, Oid oid); void ahwrite(const void *ptr, size_t size, size_t nmemb, ArchiveHandle *AH); diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c index 7c6ac89dd4..55107b2005 100644 --- a/src/bin/pg_dump/pg_backup_custom.c +++ b/src/bin/pg_dump/pg_backup_custom.c @@ -338,7 +338,7 @@ _EndData(ArchiveHandle *AH, TocEntry *te) } /* - * Called by the archiver when starting to save all BLOB DATA (not schema). + * Called by the archiver when starting to save BLOB DATA (not schema). * This routine should save whatever format-specific information is needed * to read the LOs back into memory. * @@ -398,7 +398,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) } /* - * Called by the archiver when finishing saving all BLOB DATA. + * Called by the archiver when finishing saving BLOB DATA. * * Optional. 
*/ diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c index b297ca049d..c14d813b21 100644 --- a/src/bin/pg_dump/pg_backup_db.c +++ b/src/bin/pg_dump/pg_backup_db.c @@ -541,6 +541,33 @@ CommitTransaction(Archive *AHX) ExecuteSqlCommand(AH, "COMMIT", "could not commit database transaction"); } +/* + * Issue per-blob commands for the large object(s) listed in the TocEntry + * + * The TocEntry's defn string is assumed to consist of large object OIDs, + * one per line. Wrap these in the given SQL command fragments and issue + * the commands. (cmdEnd need not include a semicolon.) + */ +void +IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te, + const char *cmdBegin, const char *cmdEnd) +{ + /* Make a writable copy of the command string */ + char *buf = pg_strdup(te->defn); + char *st; + char *en; + + st = buf; + while ((en = strchr(st, '\n')) != NULL) + { + *en++ = '\0'; + ahprintf(AH, "%s%s%s;\n", cmdBegin, st, cmdEnd); + st = en; + } + ahprintf(AH, "\n"); + pg_free(buf); +} + void DropLOIfExists(ArchiveHandle *AH, Oid oid) { diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c index de3cfea02e..7be8d5487d 100644 --- a/src/bin/pg_dump/pg_backup_directory.c +++ b/src/bin/pg_dump/pg_backup_directory.c @@ -5,8 +5,10 @@ * A directory format dump is a directory, which contains a "toc.dat" file * for the TOC, and a separate file for each data entry, named "<oid>.dat". * Large objects are stored in separate files named "blob_<oid>.dat", - * and there's a plain-text TOC file for them called "blobs.toc". If - * compression is used, each data file is individually compressed and the + * and there's a plain-text TOC file for each BLOBS TOC entry named + * "blobs_<dumpID>.toc" (or just "blobs.toc" in archive versions before 16). + * + * If compression is used, each data file is individually compressed and the * ".gz" suffix is added to the filenames. 
The TOC files are never * compressed by pg_dump, however they are accepted with the .gz suffix too, * in case the user has manually compressed them with 'gzip'. @@ -51,7 +53,7 @@ typedef struct char *directory; CompressFileHandle *dataFH; /* currently open data file */ - CompressFileHandle *LOsTocFH; /* file handle for blobs.toc */ + CompressFileHandle *LOsTocFH; /* file handle for blobs_NNN.toc */ ParallelState *pstate; /* for parallel backup / restore */ } lclContext; @@ -81,7 +83,7 @@ static void _StartLOs(ArchiveHandle *AH, TocEntry *te); static void _StartLO(ArchiveHandle *AH, TocEntry *te, Oid oid); static void _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid); static void _EndLOs(ArchiveHandle *AH, TocEntry *te); -static void _LoadLOs(ArchiveHandle *AH); +static void _LoadLOs(ArchiveHandle *AH, TocEntry *te); static void _PrepParallelRestore(ArchiveHandle *AH); static void _Clone(ArchiveHandle *AH); @@ -232,7 +234,10 @@ _ArchiveEntry(ArchiveHandle *AH, TocEntry *te) tctx = (lclTocEntry *) pg_malloc0(sizeof(lclTocEntry)); if (strcmp(te->desc, "BLOBS") == 0) - tctx->filename = pg_strdup("blobs.toc"); + { + snprintf(fn, MAXPGPATH, "blobs_%d.toc", te->dumpId); + tctx->filename = pg_strdup(fn); + } else if (te->dataDumper) { snprintf(fn, MAXPGPATH, "%d.dat", te->dumpId); @@ -415,7 +420,7 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te) return; if (strcmp(te->desc, "BLOBS") == 0) - _LoadLOs(AH); + _LoadLOs(AH, te); else { char fname[MAXPGPATH]; @@ -426,17 +431,23 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te) } static void -_LoadLOs(ArchiveHandle *AH) +_LoadLOs(ArchiveHandle *AH, TocEntry *te) { Oid oid; lclContext *ctx = (lclContext *) AH->formatData; + lclTocEntry *tctx = (lclTocEntry *) te->formatData; CompressFileHandle *CFH; char tocfname[MAXPGPATH]; char line[MAXPGPATH]; StartRestoreLOs(AH); - setFilePath(AH, tocfname, "blobs.toc"); + /* + * Note: before archive v16, there was always only one BLOBS TOC entry, + * now there can be multiple. 
We don't need to worry what version we are + * reading though, because tctx->filename should be correct either way. + */ + setFilePath(AH, tocfname, tctx->filename); CFH = ctx->LOsTocFH = InitDiscoverCompressFileHandle(tocfname, PG_BINARY_R); @@ -632,7 +643,7 @@ _ReopenArchive(ArchiveHandle *AH) */ /* - * Called by the archiver when starting to save all BLOB DATA (not schema). + * Called by the archiver when starting to save BLOB DATA (not schema). * It is called just prior to the dumper's DataDumper routine. * * We open the large object TOC file here, so that we can append a line to @@ -642,10 +653,11 @@ static void _StartLOs(ArchiveHandle *AH, TocEntry *te) { lclContext *ctx = (lclContext *) AH->formatData; + lclTocEntry *tctx = (lclTocEntry *) te->formatData; pg_compress_specification compression_spec = {0}; char fname[MAXPGPATH]; - setFilePath(AH, fname, "blobs.toc"); + setFilePath(AH, fname, tctx->filename); /* The LO TOC file is never compressed */ compression_spec.algorithm = PG_COMPRESSION_NONE; @@ -690,7 +702,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) pg_fatal("could not close LO data file: %m"); ctx->dataFH = NULL; - /* register the LO in blobs.toc */ + /* register the LO in blobs_NNN.toc */ len = snprintf(buf, sizeof(buf), "%u blob_%u.dat\n", oid, oid); if (!CFH->write_func(buf, len, CFH)) { @@ -703,7 +715,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) } /* - * Called by the archiver when finishing saving all BLOB DATA. + * Called by the archiver when finishing saving BLOB DATA. * * We close the LOs TOC file. */ @@ -795,7 +807,7 @@ _PrepParallelRestore(ArchiveHandle *AH) } /* - * If this is the BLOBS entry, what we stat'd was blobs.toc, which + * If this is a BLOBS entry, what we stat'd was blobs_NNN.toc, which * most likely is a lot smaller than the actual blob data. 
We don't * have a cheap way to estimate how much smaller, but fortunately it * doesn't matter too much as long as we get the LOs processed diff --git a/src/bin/pg_dump/pg_backup_null.c b/src/bin/pg_dump/pg_backup_null.c index 776f057770..a3257f4fc8 100644 --- a/src/bin/pg_dump/pg_backup_null.c +++ b/src/bin/pg_dump/pg_backup_null.c @@ -113,7 +113,7 @@ _EndData(ArchiveHandle *AH, TocEntry *te) } /* - * Called by the archiver when starting to save all BLOB DATA (not schema). + * Called by the archiver when starting to save BLOB DATA (not schema). * This routine should save whatever format-specific information is needed * to read the LOs back into memory. * @@ -170,7 +170,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) } /* - * Called by the archiver when finishing saving all BLOB DATA. + * Called by the archiver when finishing saving BLOB DATA. * * Optional. */ diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c index 4cb9707e63..41ee52b1d6 100644 --- a/src/bin/pg_dump/pg_backup_tar.c +++ b/src/bin/pg_dump/pg_backup_tar.c @@ -94,7 +94,7 @@ typedef struct char *filename; } lclTocEntry; -static void _LoadLOs(ArchiveHandle *AH); +static void _LoadLOs(ArchiveHandle *AH, TocEntry *te); static TAR_MEMBER *tarOpen(ArchiveHandle *AH, const char *filename, char mode); static void tarClose(ArchiveHandle *AH, TAR_MEMBER *th); @@ -634,13 +634,13 @@ _PrintTocData(ArchiveHandle *AH, TocEntry *te) } if (strcmp(te->desc, "BLOBS") == 0) - _LoadLOs(AH); + _LoadLOs(AH, te); else _PrintFileData(AH, tctx->filename); } static void -_LoadLOs(ArchiveHandle *AH) +_LoadLOs(ArchiveHandle *AH, TocEntry *te) { Oid oid; lclContext *ctx = (lclContext *) AH->formatData; @@ -651,7 +651,26 @@ _LoadLOs(ArchiveHandle *AH) StartRestoreLOs(AH); - th = tarOpen(AH, NULL, 'r'); /* Open next file */ + /* + * The blobs_NNN.toc or blobs.toc file is fairly useless to us because it + * will appear only after the associated blob_NNN.dat files. 
For archive + * versions >= 16 we can look at the BLOBS entry's te->tag to discover the + * OID of the first blob we want to restore, and then search forward to + * find the appropriate blob_<oid>.dat file. For older versions we rely + * on the knowledge that there was only one BLOBS entry and just search + * for the first blob_<oid>.dat file. Once we find the first blob file to + * restore, restore all blobs until we reach the blobs[_NNN].toc file. + */ + if (AH->version >= K_VERS_1_16) + { + /* We rely on atooid to not complain about nnnn..nnnn tags */ + oid = atooid(te->tag); + snprintf(buf, sizeof(buf), "blob_%u.dat", oid); + th = tarOpen(AH, buf, 'r'); /* Advance to first desired file */ + } + else + th = tarOpen(AH, NULL, 'r'); /* Open next file */ + while (th != NULL) { ctx->FH = th; @@ -681,9 +700,9 @@ _LoadLOs(ArchiveHandle *AH) /* * Once we have found the first LO, stop at the first non-LO entry - * (which will be 'blobs.toc'). This coding would eat all the - * rest of the archive if there are no LOs ... but this function - * shouldn't be called at all in that case. + * (which will be 'blobs[_NNN].toc'). This coding would eat all + * the rest of the archive if there are no LOs ... but this + * function shouldn't be called at all in that case. */ if (foundLO) break; @@ -847,7 +866,7 @@ _scriptOut(ArchiveHandle *AH, const void *buf, size_t len) */ /* - * Called by the archiver when starting to save all BLOB DATA (not schema). + * Called by the archiver when starting to save BLOB DATA (not schema). * This routine should save whatever format-specific information is needed * to read the LOs back into memory. 
* @@ -862,7 +881,7 @@ _StartLOs(ArchiveHandle *AH, TocEntry *te) lclContext *ctx = (lclContext *) AH->formatData; char fname[K_STD_BUF_SIZE]; - sprintf(fname, "blobs.toc"); + sprintf(fname, "blobs_%d.toc", te->dumpId); ctx->loToc = tarOpen(AH, fname, 'w'); } @@ -908,7 +927,7 @@ _EndLO(ArchiveHandle *AH, TocEntry *te, Oid oid) } /* - * Called by the archiver when finishing saving all BLOB DATA. + * Called by the archiver when finishing saving BLOB DATA. * * Optional. * diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 171e591696..369816daef 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -160,6 +160,13 @@ static int nseclabels = 0; */ #define DUMP_DEFAULT_ROWS_PER_INSERT 1 +/* + * Maximum number of large objects to group into a single ArchiveEntry. + * At some point we might want to make this user-controllable, but for now + * a hard-wired setting will suffice. + */ +#define MAX_BLOBS_PER_ARCHIVE_ENTRY 1000 + /* * Macro for producing quoted, schema-qualified name of a dumpable object. */ @@ -3592,11 +3599,10 @@ getLOs(Archive *fout) { DumpOptions *dopt = fout->dopt; PQExpBuffer loQry = createPQExpBuffer(); - LoInfo *loinfo; - DumpableObject *lodata; PGresult *res; int ntups; int i; + int n; int i_oid; int i_lomowner; int i_lomacl; @@ -3604,11 +3610,15 @@ getLOs(Archive *fout) pg_log_info("reading large objects"); - /* Fetch LO OIDs, and owner/ACL data */ + /* + * Fetch LO OIDs and owner/ACL data. Order the data so that all the blobs + * with the same owner/ACL appear together. + */ appendPQExpBufferStr(loQry, "SELECT oid, lomowner, lomacl, " "acldefault('L', lomowner) AS acldefault " - "FROM pg_largeobject_metadata"); + "FROM pg_largeobject_metadata " + "ORDER BY lomowner, lomacl::pg_catalog.text, oid"); res = ExecuteSqlQuery(fout, loQry->data, PGRES_TUPLES_OK); @@ -3620,30 +3630,72 @@ getLOs(Archive *fout) ntups = PQntuples(res); /* - * Each large object has its own "BLOB" archive entry. 
+ * Group the blobs into suitably-sized groups that have the same owner and + * ACL setting, and build a metadata and a data DumpableObject for each + * group. (If we supported initprivs for blobs, we'd have to insist that + * groups also share initprivs settings, since the DumpableObject only has + * room for one.) i is the index of the first tuple in the current group, + * and n is the number of tuples we include in the group. */ - loinfo = (LoInfo *) pg_malloc(ntups * sizeof(LoInfo)); + for (i = 0; i < ntups; i += n) + { + Oid thisoid = atooid(PQgetvalue(res, i, i_oid)); + char *thisowner = PQgetvalue(res, i, i_lomowner); + char *thisacl = PQgetvalue(res, i, i_lomacl); + LoInfo *loinfo; + DumpableObject *lodata; + char namebuf[64]; + + /* Scan to find first tuple not to be included in group */ + n = 1; + while (n < MAX_BLOBS_PER_ARCHIVE_ENTRY && i + n < ntups) + { + if (strcmp(thisowner, PQgetvalue(res, i + n, i_lomowner)) != 0 || + strcmp(thisacl, PQgetvalue(res, i + n, i_lomacl)) != 0) + break; + n++; + } - for (i = 0; i < ntups; i++) - { - loinfo[i].dobj.objType = DO_LARGE_OBJECT; - loinfo[i].dobj.catId.tableoid = LargeObjectRelationId; - loinfo[i].dobj.catId.oid = atooid(PQgetvalue(res, i, i_oid)); - AssignDumpId(&loinfo[i].dobj); + /* Build the metadata DumpableObject */ + loinfo = (LoInfo *) pg_malloc(offsetof(LoInfo, looids) + n * sizeof(Oid)); + + loinfo->dobj.objType = DO_LARGE_OBJECT; + loinfo->dobj.catId.tableoid = LargeObjectRelationId; + loinfo->dobj.catId.oid = thisoid; + AssignDumpId(&loinfo->dobj); + + if (n > 1) + snprintf(namebuf, sizeof(namebuf), "%u..%u", thisoid, + atooid(PQgetvalue(res, i + n - 1, i_oid))); + else + snprintf(namebuf, sizeof(namebuf), "%u", thisoid); + loinfo->dobj.name = pg_strdup(namebuf); + loinfo->dacl.acl = pg_strdup(thisacl); + loinfo->dacl.acldefault = pg_strdup(PQgetvalue(res, i, i_acldefault)); + loinfo->dacl.privtype = 0; + loinfo->dacl.initprivs = NULL; + loinfo->rolname = getRoleName(thisowner); + loinfo->numlos 
= n; + loinfo->looids[0] = thisoid; + /* Collect OIDs of the remaining blobs in this group */ + for (int k = 1; k < n; k++) + { + CatalogId extraID; + + loinfo->looids[k] = atooid(PQgetvalue(res, i + k, i_oid)); - loinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i, i_oid)); - loinfo[i].dacl.acl = pg_strdup(PQgetvalue(res, i, i_lomacl)); - loinfo[i].dacl.acldefault = pg_strdup(PQgetvalue(res, i, i_acldefault)); - loinfo[i].dacl.privtype = 0; - loinfo[i].dacl.initprivs = NULL; - loinfo[i].rolname = getRoleName(PQgetvalue(res, i, i_lomowner)); + /* Make sure we can look up loinfo by any of the blobs' OIDs */ + extraID.tableoid = LargeObjectRelationId; + extraID.oid = loinfo->looids[k]; + recordAdditionalCatalogID(extraID, &loinfo->dobj); + } /* LOs have data */ - loinfo[i].dobj.components |= DUMP_COMPONENT_DATA; + loinfo->dobj.components |= DUMP_COMPONENT_DATA; - /* Mark whether LO has an ACL */ + /* Mark whether LO group has a non-empty ACL */ if (!PQgetisnull(res, i, i_lomacl)) - loinfo[i].dobj.components |= DUMP_COMPONENT_ACL; + loinfo->dobj.components |= DUMP_COMPONENT_ACL; /* * In binary-upgrade mode for LOs, we do *not* dump out the LO data, @@ -3653,21 +3705,22 @@ getLOs(Archive *fout) * pg_largeobject_metadata, after the dump is restored. */ if (dopt->binary_upgrade) - loinfo[i].dobj.dump &= ~DUMP_COMPONENT_DATA; - } + loinfo->dobj.dump &= ~DUMP_COMPONENT_DATA; - /* - * If we have any large objects, a "BLOBS" archive entry is needed. This - * is just a placeholder for sorting; it carries no data now. - */ - if (ntups > 0) - { + /* + * Create a "BLOBS" data item for the group, too. This is just a + * placeholder for sorting; it carries no data now. 
+ */ lodata = (DumpableObject *) pg_malloc(sizeof(DumpableObject)); lodata->objType = DO_LARGE_OBJECT_DATA; lodata->catId = nilCatalogId; AssignDumpId(lodata); - lodata->name = pg_strdup("BLOBS"); + lodata->name = pg_strdup(namebuf); lodata->components |= DUMP_COMPONENT_DATA; + /* Set up explicit dependency from data to metadata */ + lodata->dependencies = (DumpId *) pg_malloc(sizeof(DumpId)); + lodata->dependencies[0] = loinfo->dobj.dumpId; + lodata->nDeps = lodata->allocDeps = 1; } PQclear(res); @@ -3677,123 +3730,109 @@ getLOs(Archive *fout) /* * dumpLO * - * dump the definition (metadata) of the given large object + * dump the definition (metadata) of the given large object group */ static void dumpLO(Archive *fout, const LoInfo *loinfo) { PQExpBuffer cquery = createPQExpBuffer(); - PQExpBuffer dquery = createPQExpBuffer(); - appendPQExpBuffer(cquery, - "SELECT pg_catalog.lo_create('%s');\n", - loinfo->dobj.name); - - appendPQExpBuffer(dquery, - "SELECT pg_catalog.lo_unlink('%s');\n", - loinfo->dobj.name); + /* + * The "definition" is just a newline-separated list of OIDs. We need to + * put something into the dropStmt too, but it can just be a comment. 
+ */ + for (int i = 0; i < loinfo->numlos; i++) + appendPQExpBuffer(cquery, "%u\n", loinfo->looids[i]); if (loinfo->dobj.dump & DUMP_COMPONENT_DEFINITION) ArchiveEntry(fout, loinfo->dobj.catId, loinfo->dobj.dumpId, ARCHIVE_OPTS(.tag = loinfo->dobj.name, .owner = loinfo->rolname, - .description = "BLOB", + .description = "BLOB METADATA", .section = SECTION_PRE_DATA, .createStmt = cquery->data, - .dropStmt = dquery->data)); - - /* Dump comment if any */ - if (loinfo->dobj.dump & DUMP_COMPONENT_COMMENT) - dumpComment(fout, "LARGE OBJECT", loinfo->dobj.name, - NULL, loinfo->rolname, - loinfo->dobj.catId, 0, loinfo->dobj.dumpId); - - /* Dump security label if any */ - if (loinfo->dobj.dump & DUMP_COMPONENT_SECLABEL) - dumpSecLabel(fout, "LARGE OBJECT", loinfo->dobj.name, - NULL, loinfo->rolname, - loinfo->dobj.catId, 0, loinfo->dobj.dumpId); - - /* Dump ACL if any */ - if (loinfo->dobj.dump & DUMP_COMPONENT_ACL) - dumpACL(fout, loinfo->dobj.dumpId, InvalidDumpId, "LARGE OBJECT", - loinfo->dobj.name, NULL, - NULL, loinfo->rolname, &loinfo->dacl); + .dropStmt = "-- dummy")); + + /* + * Dump per-blob comments, seclabels, and ACLs if any. We assume these + * are rare enough that it's okay to generate retail TOC entries for them. 
+ */ + if (loinfo->dobj.dump & (DUMP_COMPONENT_COMMENT | + DUMP_COMPONENT_SECLABEL | + DUMP_COMPONENT_ACL)) + { + for (int i = 0; i < loinfo->numlos; i++) + { + CatalogId catId; + char namebuf[32]; + + /* Build identifying info for this blob */ + catId.tableoid = loinfo->dobj.catId.tableoid; + catId.oid = loinfo->looids[i]; + snprintf(namebuf, sizeof(namebuf), "%u", loinfo->looids[i]); + + if (loinfo->dobj.dump & DUMP_COMPONENT_COMMENT) + dumpComment(fout, "LARGE OBJECT", namebuf, + NULL, loinfo->rolname, + catId, 0, loinfo->dobj.dumpId); + + if (loinfo->dobj.dump & DUMP_COMPONENT_SECLABEL) + dumpSecLabel(fout, "LARGE OBJECT", namebuf, + NULL, loinfo->rolname, + catId, 0, loinfo->dobj.dumpId); + + if (loinfo->dobj.dump & DUMP_COMPONENT_ACL) + dumpACL(fout, loinfo->dobj.dumpId, InvalidDumpId, + "LARGE OBJECT", namebuf, NULL, + NULL, loinfo->rolname, &loinfo->dacl); + } + } destroyPQExpBuffer(cquery); - destroyPQExpBuffer(dquery); } /* * dumpLOs: - * dump the data contents of all large objects + * dump the data contents of the large objects in the given group */ static int dumpLOs(Archive *fout, const void *arg) { - const char *loQry; - const char *loFetchQry; + const LoInfo *loinfo = (const LoInfo *) arg; PGconn *conn = GetConnection(fout); - PGresult *res; char buf[LOBBUFSIZE]; - int ntups; - int i; - int cnt; - pg_log_info("saving large objects"); - - /* - * Currently, we re-fetch all LO OIDs using a cursor. Consider scanning - * the already-in-memory dumpable objects instead... 
- */ - loQry = - "DECLARE looid CURSOR FOR " - "SELECT oid FROM pg_largeobject_metadata ORDER BY 1"; + pg_log_info("saving large objects \"%s\"", loinfo->dobj.name); - ExecuteSqlStatement(fout, loQry); + for (int i = 0; i < loinfo->numlos; i++) + { + Oid loOid = loinfo->looids[i]; + int loFd; + int cnt; - /* Command to fetch from cursor */ - loFetchQry = "FETCH 1000 IN looid"; + /* Open the LO */ + loFd = lo_open(conn, loOid, INV_READ); + if (loFd == -1) + pg_fatal("could not open large object %u: %s", + loOid, PQerrorMessage(conn)); - do - { - /* Do a fetch */ - res = ExecuteSqlQuery(fout, loFetchQry, PGRES_TUPLES_OK); + StartLO(fout, loOid); - /* Process the tuples, if any */ - ntups = PQntuples(res); - for (i = 0; i < ntups; i++) + /* Now read it in chunks, sending data to archive */ + do { - Oid loOid; - int loFd; - - loOid = atooid(PQgetvalue(res, i, 0)); - /* Open the LO */ - loFd = lo_open(conn, loOid, INV_READ); - if (loFd == -1) - pg_fatal("could not open large object %u: %s", + cnt = lo_read(conn, loFd, buf, LOBBUFSIZE); + if (cnt < 0) + pg_fatal("error reading large object %u: %s", loOid, PQerrorMessage(conn)); - StartLO(fout, loOid); - - /* Now read it in chunks, sending data to archive */ - do - { - cnt = lo_read(conn, loFd, buf, LOBBUFSIZE); - if (cnt < 0) - pg_fatal("error reading large object %u: %s", - loOid, PQerrorMessage(conn)); + WriteData(fout, buf, cnt); + } while (cnt > 0); - WriteData(fout, buf, cnt); - } while (cnt > 0); + lo_close(conn, loFd); - lo_close(conn, loFd); - - EndLO(fout, loOid); - } - - PQclear(res); - } while (ntups > 0); + EndLO(fout, loOid); + } return 1; } @@ -10623,28 +10662,34 @@ dumpDumpableObject(Archive *fout, DumpableObject *dobj) case DO_LARGE_OBJECT_DATA: if (dobj->dump & DUMP_COMPONENT_DATA) { + LoInfo *loinfo; TocEntry *te; + loinfo = (LoInfo *) findObjectByDumpId(dobj->dependencies[0]); + if (loinfo == NULL) + pg_fatal("missing metadata for large objects \"%s\"", + dobj->name); + te = ArchiveEntry(fout, 
dobj->catId, dobj->dumpId, ARCHIVE_OPTS(.tag = dobj->name, + .owner = loinfo->rolname, .description = "BLOBS", .section = SECTION_DATA, - .dumpFn = dumpLOs)); + .deps = dobj->dependencies, + .nDeps = dobj->nDeps, + .dumpFn = dumpLOs, + .dumpArg = loinfo)); /* * Set the TocEntry's dataLength in case we are doing a * parallel dump and want to order dump jobs by table size. * (We need some size estimate for every TocEntry with a * DataDumper function.) We don't currently have any cheap - * way to estimate the size of LOs, but it doesn't matter; - * let's just set the size to a large value so parallel dumps - * will launch this job first. If there's lots of LOs, we - * win, and if there aren't, we don't lose much. (If you want - * to improve on this, really what you should be thinking - * about is allowing LO dumping to be parallelized, not just - * getting a smarter estimate for the single TOC entry.) + * way to estimate the size of LOs, but fortunately it doesn't + * matter too much as long as we get large batches of LOs + * processed reasonably early. Assume 8K per blob. */ - te->dataLength = INT_MAX; + te->dataLength = loinfo->numlos * (pgoff_t) 8192; } break; case DO_POLICY: diff --git a/src/bin/pg_dump/pg_dump.h b/src/bin/pg_dump/pg_dump.h index 9bc93520b4..2a7c5873a0 100644 --- a/src/bin/pg_dump/pg_dump.h +++ b/src/bin/pg_dump/pg_dump.h @@ -582,11 +582,21 @@ typedef struct _defaultACLInfo char defaclobjtype; } DefaultACLInfo; +/* + * LoInfo represents a group of large objects (blobs) that share the same + * owner and ACL setting. dobj.components has the DUMP_COMPONENT_COMMENT bit + * set if any blob in the group has a comment; similarly for sec labels. + * If there are many blobs with the same owner/ACL, we can divide them into + * multiple LoInfo groups, which will each spawn a BLOB METADATA and a BLOBS + * (data) TOC entry. This allows more parallelism during restore. 
+ */ typedef struct _loInfo { DumpableObject dobj; DumpableAcl dacl; const char *rolname; + int numlos; + Oid looids[FLEXIBLE_ARRAY_MEMBER]; } LoInfo; /* @@ -695,6 +705,7 @@ typedef struct _SubRelInfo extern TableInfo *getSchemaData(Archive *fout, int *numTablesPtr); extern void AssignDumpId(DumpableObject *dobj); +extern void recordAdditionalCatalogID(CatalogId catId, DumpableObject *dobj); extern DumpId createDumpId(void); extern DumpId getMaxDumpId(void); extern DumpableObject *findObjectByDumpId(DumpId dumpId); diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index c8b489d94e..1da2fc6575 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -109,11 +109,11 @@ my %pgdump_runs = ( '--format=directory', '--compress=gzip:1', "--file=$tempdir/compression_gzip_dir", 'postgres', ], - # Give coverage for manually compressed blob.toc files during + # Give coverage for manually compressed blobs.toc files during # restore. compress_cmd => { program => $ENV{'GZIP_PROGRAM'}, - args => [ '-f', "$tempdir/compression_gzip_dir/blobs.toc", ], + args => [ '-f', "$tempdir/compression_gzip_dir/blobs_*.toc", ], }, # Verify that only data files were compressed glob_patterns => [ @@ -172,16 +172,6 @@ my %pgdump_runs = ( '--format=directory', '--compress=lz4:1', "--file=$tempdir/compression_lz4_dir", 'postgres', ], - # Give coverage for manually compressed blob.toc files during - # restore. 
- compress_cmd => { - program => $ENV{'LZ4'}, - args => [ - '-z', '-f', '--rm', - "$tempdir/compression_lz4_dir/blobs.toc", - "$tempdir/compression_lz4_dir/blobs.toc.lz4", - ], - }, # Verify that data files were compressed glob_patterns => [ "$tempdir/compression_lz4_dir/toc.dat", @@ -242,14 +232,13 @@ my %pgdump_runs = ( '--format=directory', '--compress=zstd:1', "--file=$tempdir/compression_zstd_dir", 'postgres', ], - # Give coverage for manually compressed blob.toc files during + # Give coverage for manually compressed blobs.toc files during # restore. compress_cmd => { program => $ENV{'ZSTD'}, args => [ '-z', '-f', - '--rm', "$tempdir/compression_zstd_dir/blobs.toc", - "-o", "$tempdir/compression_zstd_dir/blobs.toc.zst", + '--rm', "$tempdir/compression_zstd_dir/blobs_*.toc", ], }, # Verify that data files were compressed @@ -413,7 +402,7 @@ my %pgdump_runs = ( }, glob_patterns => [ "$tempdir/defaults_dir_format/toc.dat", - "$tempdir/defaults_dir_format/blobs.toc", + "$tempdir/defaults_dir_format/blobs_*.toc", $supports_gzip ? "$tempdir/defaults_dir_format/*.dat.gz" : "$tempdir/defaults_dir_format/*.dat", ], @@ -4858,8 +4847,13 @@ foreach my $run (sort keys %pgdump_runs) # not defined. next if (!defined($compress_program) || $compress_program eq ''); - my @full_compress_cmd = - ($compress_cmd->{program}, @{ $compress_cmd->{args} }); + # Arguments may require globbing. + my @full_compress_cmd = ($compress_program); + foreach my $arg (@{ $compress_cmd->{args} }) + { + push @full_compress_cmd, glob($arg); + } + command_ok(\@full_compress_cmd, "$run: compression commands"); } -- 2.39.3 From a29eaab42326a0e7296dfa4b7779c0bc998ef0f4 Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Fri, 26 Jan 2024 11:27:51 -0500 Subject: [PATCH v11 3/6] Move BLOBS METADATA TOC entries into SECTION_DATA. 
Commit c0d5be5d6 put the new BLOB metadata TOC entries into SECTION_PRE_DATA, which perhaps is defensible in some ways, but it's a rather odd choice considering that we go out of our way to treat blobs as data. Moreover, because parallel restore handles the PRE_DATA section serially, this means we're only getting part of the parallelism speedup we could hope for. Moving these entries into SECTION_DATA means that we can parallelize the lo_create calls, not only the data loading, when there are many blobs. The dependencies established by the previous patch ensure that we won't try to load data for a blob we've not yet created. --- src/bin/pg_dump/pg_dump.c | 4 ++-- src/bin/pg_dump/t/002_pg_dump.pl | 8 ++++---- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 369816daef..13431db769 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -3749,7 +3749,7 @@ dumpLO(Archive *fout, const LoInfo *loinfo) ARCHIVE_OPTS(.tag = loinfo->dobj.name, .owner = loinfo->rolname, .description = "BLOB METADATA", - .section = SECTION_PRE_DATA, + .section = SECTION_DATA, .createStmt = cquery->data, .dropStmt = "-- dummy")); @@ -18654,12 +18654,12 @@ addBoundaryDependencies(DumpableObject **dobjs, int numObjs, case DO_FDW: case DO_FOREIGN_SERVER: case DO_TRANSFORM: - case DO_LARGE_OBJECT: /* Pre-data objects: must come before the pre-data boundary */ addObjectDependency(preDataBound, dobj->dumpId); break; case DO_TABLE_DATA: case DO_SEQUENCE_SET: + case DO_LARGE_OBJECT: case DO_LARGE_OBJECT_DATA: /* Data objects: must come between the boundaries */ addObjectDependency(dobj, preDataBound->dumpId); diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index 1da2fc6575..a273bc07a2 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -912,7 +912,7 @@ my %tests = ( column_inserts => 1, data_only => 1, inserts => 1, - section_pre_data => 1, +
section_data => 1, test_schema_plus_large_objects => 1, }, unlike => { @@ -1325,7 +1325,7 @@ my %tests = ( column_inserts => 1, data_only => 1, inserts => 1, - section_pre_data => 1, + section_data => 1, test_schema_plus_large_objects => 1, }, unlike => { @@ -1533,7 +1533,7 @@ my %tests = ( column_inserts => 1, data_only => 1, inserts => 1, - section_pre_data => 1, + section_data => 1, test_schema_plus_large_objects => 1, }, unlike => { @@ -4278,7 +4278,7 @@ my %tests = ( column_inserts => 1, data_only => 1, inserts => 1, - section_pre_data => 1, + section_data => 1, test_schema_plus_large_objects => 1, binary_upgrade => 1, }, -- 2.39.3 From 837be1244a9e867c4d9387c2734203c62fcac243 Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Fri, 26 Jan 2024 11:37:37 -0500 Subject: [PATCH v11 4/6] Invent --transaction-size option for pg_restore. This patch allows pg_restore to wrap its commands into transaction blocks, somewhat like --single-transaction, except that we commit and start a new block after every N objects. Using this mode with a size limit of 1000 or so objects greatly reduces the number of transactions consumed by the restore, while preventing any one transaction from taking enough locks to overrun the receiving server's shared lock table. (A value of 1000 works well with the default lock table size of around 6400 locks. Higher --transaction-size values can be used if one has increased the receiving server's lock table size.) In this patch I have just hard-wired pg_upgrade to use --transaction-size 1000. Perhaps there would be value in adding another pg_upgrade option to allow user control of that, but I'm unsure that it's worth the trouble; I think few users would use it, and any who did would see not that much benefit. However, we might need to adjust the logic to make the size be 1000 divided by the number of parallel restore jobs allowed. 
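[Editor's note, not part of the patch: the commit-every-N behavior described in the message above can be sketched as follows. restore_in_batches, objects, and execute are illustrative stand-ins for the patch's AH->txnCount bookkeeping in restore_toc_entry(), not actual pg_restore internals.]

```python
def restore_in_batches(objects, execute, txn_size=1000):
    """Restore objects, committing after every txn_size of them, so no
    single transaction holds more than ~txn_size locks (mirrors the
    ++AH->txnCount >= ropt->txn_size check in the patch)."""
    execute("BEGIN")
    count = 0
    for obj in objects:
        execute(obj)            # one restored object == one action
        count += 1
        if count >= txn_size:   # batch full: commit and start a new block
            execute("COMMIT")
            execute("BEGIN")
            count = 0
    execute("COMMIT")           # close out the final (possibly partial) batch
```

With txn_size=1000 and the default max_locks_per_transaction, each batch stays comfortably inside the server's shared lock table, which is the trade-off the option is meant to offer over --single-transaction.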
--- doc/src/sgml/ref/pg_restore.sgml | 24 +++++ src/bin/pg_dump/pg_backup.h | 4 +- src/bin/pg_dump/pg_backup_archiver.c | 139 +++++++++++++++++++++++++-- src/bin/pg_dump/pg_backup_archiver.h | 3 + src/bin/pg_dump/pg_backup_db.c | 18 ++++ src/bin/pg_dump/pg_restore.c | 15 ++- src/bin/pg_upgrade/pg_upgrade.c | 11 +++ 7 files changed, 206 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml index 1a23874da6..2e3ba80258 100644 --- a/doc/src/sgml/ref/pg_restore.sgml +++ b/doc/src/sgml/ref/pg_restore.sgml @@ -786,6 +786,30 @@ PostgreSQL documentation </listitem> </varlistentry> + <varlistentry> + <term><option>--transaction-size=<replaceable class="parameter">N</replaceable></option></term> + <listitem> + <para> + Execute the restore as a series of transactions, each processing + up to <replaceable class="parameter">N</replaceable> database + objects. This option implies <option>--exit-on-error</option>. + </para> + <para> + <option>--transaction-size</option> offers an intermediate choice + between the default behavior (one transaction per SQL command) + and <option>-1</option>/<option>--single-transaction</option> + (one transaction for all restored objects). + While <option>--single-transaction</option> has the least + overhead, it may be impractical for large databases because the + transaction will take a lock on each restored object, possibly + exhausting the server's lock table space. + Using <option>--transaction-size</option> with a size of a few + thousand objects offers nearly the same performance benefits while + capping the amount of lock table space needed. 
+ </para> + </listitem> + </varlistentry> + <varlistentry> <term><option>--use-set-session-authorization</option></term> <listitem> diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h index 9ef2f2017e..fbf5f1c515 100644 --- a/src/bin/pg_dump/pg_backup.h +++ b/src/bin/pg_dump/pg_backup.h @@ -149,7 +149,9 @@ typedef struct _restoreOptions * compression */ int suppressDumpWarnings; /* Suppress output of WARNING entries * to stderr */ - bool single_txn; + + bool single_txn; /* restore all TOCs in one transaction */ + int txn_size; /* restore this many TOCs per txn, if > 0 */ bool *idWanted; /* array showing which dump IDs to emit */ int enable_row_security; diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index aabcd8dde3..29a358d16d 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -502,7 +502,28 @@ RestoreArchive(Archive *AHX) /* Otherwise, drop anything that's selected and has a dropStmt */ if (((te->reqs & (REQ_SCHEMA | REQ_DATA)) != 0) && te->dropStmt) { + bool not_allowed_in_txn = false; + pg_log_info("dropping %s %s", te->desc, te->tag); + + /* + * In --transaction-size mode, we have to temporarily exit our + * transaction block to drop objects that can't be dropped + * within a transaction. + */ + if (ropt->txn_size > 0) + { + if (strcmp(te->desc, "DATABASE") == 0 || + strcmp(te->desc, "DATABASE PROPERTIES") == 0) + { + not_allowed_in_txn = true; + if (AH->connection) + CommitTransaction(AHX); + else + ahprintf(AH, "COMMIT;\n"); + } + } + /* Select owner and schema as necessary */ _becomeOwner(AH, te); _selectOutputSchema(AH, te->namespace); @@ -628,6 +649,33 @@ RestoreArchive(Archive *AHX) } } } + + /* + * In --transaction-size mode, re-establish the transaction + * block if needed; otherwise, commit after every N drops. 
+ */ + if (ropt->txn_size > 0) + { + if (not_allowed_in_txn) + { + if (AH->connection) + StartTransaction(AHX); + else + ahprintf(AH, "BEGIN;\n"); + AH->txnCount = 0; + } + else if (++AH->txnCount >= ropt->txn_size) + { + if (AH->connection) + { + CommitTransaction(AHX); + StartTransaction(AHX); + } + else + ahprintf(AH, "COMMIT;\nBEGIN;\n"); + AH->txnCount = 0; + } + } } } @@ -724,7 +772,11 @@ RestoreArchive(Archive *AHX) } } - if (ropt->single_txn) + /* + * Close out any persistent transaction we may have. While these two + * cases are started in different places, we can end both cases here. + */ + if (ropt->single_txn || ropt->txn_size > 0) { if (AH->connection) CommitTransaction(AHX); @@ -785,6 +837,25 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel) */ if ((reqs & REQ_SCHEMA) != 0) { + bool object_is_db = false; + + /* + * In --transaction-size mode, must exit our transaction block to + * create a database or set its properties. + */ + if (strcmp(te->desc, "DATABASE") == 0 || + strcmp(te->desc, "DATABASE PROPERTIES") == 0) + { + object_is_db = true; + if (ropt->txn_size > 0) + { + if (AH->connection) + CommitTransaction(&AH->public); + else + ahprintf(AH, "COMMIT;\n\n"); + } + } + /* Show namespace in log message if available */ if (te->namespace) pg_log_info("creating %s \"%s.%s\"", @@ -835,10 +906,10 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel) /* * If we created a DB, connect to it. Also, if we changed DB * properties, reconnect to ensure that relevant GUC settings are - * applied to our session. + * applied to our session. (That also restarts the transaction block + * in --transaction-size mode.) 
*/ - if (strcmp(te->desc, "DATABASE") == 0 || - strcmp(te->desc, "DATABASE PROPERTIES") == 0) + if (object_is_db) { pg_log_info("connecting to new database \"%s\"", te->tag); _reconnectToDB(AH, te->tag); @@ -964,6 +1035,25 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te, bool is_parallel) } } + /* + * If we emitted anything for this TOC entry, that counts as one action + * against the transaction-size limit. Commit if it's time to. + */ + if ((reqs & (REQ_SCHEMA | REQ_DATA)) != 0 && ropt->txn_size > 0) + { + if (++AH->txnCount >= ropt->txn_size) + { + if (AH->connection) + { + CommitTransaction(&AH->public); + StartTransaction(&AH->public); + } + else + ahprintf(AH, "COMMIT;\nBEGIN;\n\n"); + AH->txnCount = 0; + } + } + if (AH->public.n_errors > 0 && status == WORKER_OK) status = WORKER_IGNORED_ERRORS; @@ -1310,7 +1400,12 @@ StartRestoreLOs(ArchiveHandle *AH) { RestoreOptions *ropt = AH->public.ropt; - if (!ropt->single_txn) + /* + * LOs must be restored within a transaction block, since we need the LO + * handle to stay open while we write it. Establish a transaction unless + * there's one being used globally. + */ + if (!(ropt->single_txn || ropt->txn_size > 0)) { if (AH->connection) StartTransaction(&AH->public); @@ -1329,7 +1424,7 @@ EndRestoreLOs(ArchiveHandle *AH) { RestoreOptions *ropt = AH->public.ropt; - if (!ropt->single_txn) + if (!(ropt->single_txn || ropt->txn_size > 0)) { if (AH->connection) CommitTransaction(&AH->public); @@ -3171,6 +3266,19 @@ _doSetFixedOutputState(ArchiveHandle *AH) else ahprintf(AH, "SET row_security = off;\n"); + /* + * In --transaction-size mode, we should always be in a transaction when + * we begin to restore objects. 
+ */ + if (ropt && ropt->txn_size > 0) + { + if (AH->connection) + StartTransaction(&AH->public); + else + ahprintf(AH, "\nBEGIN;\n"); + AH->txnCount = 0; + } + ahprintf(AH, "\n"); } @@ -4034,6 +4142,14 @@ restore_toc_entries_prefork(ArchiveHandle *AH, TocEntry *pending_list) } } + /* + * In --transaction-size mode, we must commit the open transaction before + * dropping the database connection. This also ensures that child workers + * can see the objects we've created so far. + */ + if (AH->public.ropt->txn_size > 0) + CommitTransaction(&AH->public); + /* * Now close parent connection in prep for parallel steps. We do this * mainly to ensure that we don't exceed the specified number of parallel @@ -4773,6 +4889,10 @@ CloneArchive(ArchiveHandle *AH) clone = (ArchiveHandle *) pg_malloc(sizeof(ArchiveHandle)); memcpy(clone, AH, sizeof(ArchiveHandle)); + /* Likewise flat-copy the RestoreOptions, so we can alter them locally */ + clone->public.ropt = (RestoreOptions *) pg_malloc(sizeof(RestoreOptions)); + memcpy(clone->public.ropt, AH->public.ropt, sizeof(RestoreOptions)); + /* Handle format-independent fields */ memset(&(clone->sqlparse), 0, sizeof(clone->sqlparse)); @@ -4794,6 +4914,13 @@ CloneArchive(ArchiveHandle *AH) /* clones should not share lo_buf */ clone->lo_buf = NULL; + /* + * Clone connections disregard --transaction-size; they must commit after + * each command so that the results are immediately visible to other + * workers. + */ + clone->public.ropt->txn_size = 0; + /* * Connect our new clone object to the database, using the same connection * parameters used for the original connection. 
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h index e4dd395582..1b9f142dea 100644 --- a/src/bin/pg_dump/pg_backup_archiver.h +++ b/src/bin/pg_dump/pg_backup_archiver.h @@ -324,6 +324,9 @@ struct _archiveHandle char *currTablespace; /* current tablespace, or NULL */ char *currTableAm; /* current table access method, or NULL */ + /* in --transaction-size mode, this counts objects emitted in cur xact */ + int txnCount; + void *lo_buf; size_t lo_buf_used; size_t lo_buf_size; diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c index c14d813b21..6b3bf174f2 100644 --- a/src/bin/pg_dump/pg_backup_db.c +++ b/src/bin/pg_dump/pg_backup_db.c @@ -554,6 +554,7 @@ IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te, { /* Make a writable copy of the command string */ char *buf = pg_strdup(te->defn); + RestoreOptions *ropt = AH->public.ropt; char *st; char *en; @@ -562,6 +563,23 @@ IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te, { *en++ = '\0'; ahprintf(AH, "%s%s%s;\n", cmdBegin, st, cmdEnd); + + /* In --transaction-size mode, count each command as an action */ + if (ropt && ropt->txn_size > 0) + { + if (++AH->txnCount >= ropt->txn_size) + { + if (AH->connection) + { + CommitTransaction(&AH->public); + StartTransaction(&AH->public); + } + else + ahprintf(AH, "COMMIT;\nBEGIN;\n\n"); + AH->txnCount = 0; + } + } + st = en; } ahprintf(AH, "\n"); diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c index c3beacdec1..5ea78cf7cc 100644 --- a/src/bin/pg_dump/pg_restore.c +++ b/src/bin/pg_dump/pg_restore.c @@ -120,6 +120,7 @@ main(int argc, char **argv) {"role", required_argument, NULL, 2}, {"section", required_argument, NULL, 3}, {"strict-names", no_argument, &strict_names, 1}, + {"transaction-size", required_argument, NULL, 5}, {"use-set-session-authorization", no_argument, &use_setsessauth, 1}, {"no-comments", no_argument, &no_comments, 1}, {"no-publications", no_argument, &no_publications, 1}, 
@@ -289,10 +290,18 @@ main(int argc, char **argv) set_dump_section(optarg, &(opts->dumpSections)); break; - case 4: + case 4: /* filter */ read_restore_filters(optarg, opts); break; + case 5: /* transaction-size */ + if (!option_parse_int(optarg, "--transaction-size", + 1, INT_MAX, + &opts->txn_size)) + exit(1); + opts->exit_on_error = true; + break; + default: /* getopt_long already emitted a complaint */ pg_log_error_hint("Try \"%s --help\" for more information.", progname); @@ -337,6 +346,9 @@ main(int argc, char **argv) if (opts->dataOnly && opts->dropSchema) pg_fatal("options -c/--clean and -a/--data-only cannot be used together"); + if (opts->single_txn && opts->txn_size > 0) + pg_fatal("options -1/--single-transaction and --transaction-size cannot be used together"); + /* * -C is not compatible with -1, because we can't create a database inside * a transaction block. @@ -484,6 +496,7 @@ usage(const char *progname) printf(_(" --section=SECTION restore named section (pre-data, data, or post-data)\n")); printf(_(" --strict-names require table and/or schema include patterns to\n" " match at least one entity each\n")); + printf(_(" --transaction-size=N commit after every N objects\n")); printf(_(" --use-set-session-authorization\n" " use SET SESSION AUTHORIZATION commands instead of\n" " ALTER OWNER commands to set ownership\n")); diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c index f6143b6bc4..8c6fb96478 100644 --- a/src/bin/pg_upgrade/pg_upgrade.c +++ b/src/bin/pg_upgrade/pg_upgrade.c @@ -51,6 +51,13 @@ #include "fe_utils/string_utils.h" #include "pg_upgrade.h" +/* + * Maximum number of pg_restore actions (TOC entries) to process within one + * transaction. At some point we might want to make this user-controllable, + * but for now a hard-wired setting will suffice. 
+ */ +#define RESTORE_TRANSACTION_SIZE 1000 + static void set_locale_and_encoding(void); static void prepare_new_cluster(void); static void prepare_new_globals(void); @@ -562,10 +569,12 @@ create_new_objects(void) true, true, "\"%s/pg_restore\" %s %s --exit-on-error --verbose " + "--transaction-size=%d " "--dbname postgres \"%s/%s\"", new_cluster.bindir, cluster_conn_opts(&new_cluster), create_opts, + RESTORE_TRANSACTION_SIZE, log_opts.dumpdir, sql_file_name); @@ -600,10 +609,12 @@ create_new_objects(void) parallel_exec_prog(log_file_name, NULL, "\"%s/pg_restore\" %s %s --exit-on-error --verbose " + "--transaction-size=%d " "--dbname template1 \"%s/%s\"", new_cluster.bindir, cluster_conn_opts(&new_cluster), create_opts, + RESTORE_TRANSACTION_SIZE, log_opts.dumpdir, sql_file_name); } -- 2.39.3 From bc0e962f09ec59795c49046924445193190702a4 Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Fri, 15 Mar 2024 18:45:02 -0400 Subject: [PATCH v11 5/6] Improve storage efficiency for BLOB ACLs. Since we are already requiring the blobs of a metadata group to share the same ACL, it's not terribly hard to store only one copy of that ACL, and then make pg_restore regenerate the appropriate commands for each blob. This saves space in the dump file not only by removing duplicative SQL command strings, but by not needing a separate TOC entry for each blob's ACL. In turn, that reduces client-side memory requirements for handling many blobs. TOC entries that need special processing are labeled as "ACL"/"LARGE OBJECTS nnn..nnn". If we have a blob with a unique ACL, continue to label it as "ACL"/"LARGE OBJECT nnn". We don't actually have to make such a distinction, but it saves a few cycles during restore for the easy case, and it seems like a good idea to not change the TOC contents unnecessarily. 
We already bumped the archive file format version number earlier in this patch series, so it's not necessary to do so again, even though this adds another thing that pg_restore needs to know. --- src/bin/pg_dump/pg_backup_archiver.c | 23 +++++-- src/bin/pg_dump/pg_backup_archiver.h | 1 + src/bin/pg_dump/pg_backup_db.c | 83 ++++++++++++++++++++++++ src/bin/pg_dump/pg_dump.c | 97 ++++++++++++++++++---------- 4 files changed, 164 insertions(+), 40 deletions(-) diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 29a358d16d..c7a6c918a6 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -3111,11 +3111,11 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH) strcmp(te->desc, "BLOB") == 0 || strcmp(te->desc, "BLOB METADATA") == 0 || (strcmp(te->desc, "ACL") == 0 && - strncmp(te->tag, "LARGE OBJECT ", 13) == 0) || + strncmp(te->tag, "LARGE OBJECT", 12) == 0) || (strcmp(te->desc, "COMMENT") == 0 && - strncmp(te->tag, "LARGE OBJECT ", 13) == 0) || + strncmp(te->tag, "LARGE OBJECT", 12) == 0) || (strcmp(te->desc, "SECURITY LABEL") == 0 && - strncmp(te->tag, "LARGE OBJECT ", 13) == 0)) + strncmp(te->tag, "LARGE OBJECT", 12) == 0)) res = res & REQ_DATA; else res = res & ~REQ_DATA; @@ -3152,11 +3152,11 @@ _tocEntryRequired(TocEntry *te, teSection curSection, ArchiveHandle *AH) (strcmp(te->desc, "BLOB") == 0 || strcmp(te->desc, "BLOB METADATA") == 0 || (strcmp(te->desc, "ACL") == 0 && - strncmp(te->tag, "LARGE OBJECT ", 13) == 0) || + strncmp(te->tag, "LARGE OBJECT", 12) == 0) || (strcmp(te->desc, "COMMENT") == 0 && - strncmp(te->tag, "LARGE OBJECT ", 13) == 0) || + strncmp(te->tag, "LARGE OBJECT", 12) == 0) || (strcmp(te->desc, "SECURITY LABEL") == 0 && - strncmp(te->tag, "LARGE OBJECT ", 13) == 0)))) + strncmp(te->tag, "LARGE OBJECT", 12) == 0)))) res = res & REQ_SCHEMA; } @@ -3737,7 +3737,7 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData) /* * Actually 
print the definition. Normally we can just print the defn - * string if any, but we have two special cases: + * string if any, but we have three special cases: * * 1. A crude hack for suppressing AUTHORIZATION clause that old pg_dump * versions put into CREATE SCHEMA. Don't mutate the variant for schema @@ -3746,6 +3746,10 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData) * * 2. BLOB METADATA entries need special processing since their defn * strings are just lists of OIDs, not complete SQL commands. + * + * 3. ACL LARGE OBJECTS entries need special processing because they + * contain only one copy of the ACL GRANT/REVOKE commands, which we must + * apply to each large object listed in the associated BLOB METADATA. */ if (ropt->noOwner && strcmp(te->desc, "SCHEMA") == 0 && strncmp(te->defn, "--", 2) != 0) @@ -3756,6 +3760,11 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData) { IssueCommandPerBlob(AH, te, "SELECT pg_catalog.lo_create('", "')"); } + else if (strcmp(te->desc, "ACL") == 0 && + strncmp(te->tag, "LARGE OBJECTS", 13) == 0) + { + IssueACLPerBlob(AH, te); + } else { if (te->defn && strlen(te->defn) > 0) diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h index 1b9f142dea..d6104a7196 100644 --- a/src/bin/pg_dump/pg_backup_archiver.h +++ b/src/bin/pg_dump/pg_backup_archiver.h @@ -455,6 +455,7 @@ extern bool isValidTarHeader(char *header); extern void ReconnectToServer(ArchiveHandle *AH, const char *dbname); extern void IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te, const char *cmdBegin, const char *cmdEnd); +extern void IssueACLPerBlob(ArchiveHandle *AH, TocEntry *te); extern void DropLOIfExists(ArchiveHandle *AH, Oid oid); void ahwrite(const void *ptr, size_t size, size_t nmemb, ArchiveHandle *AH); diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c index 6b3bf174f2..a02841c405 100644 --- a/src/bin/pg_dump/pg_backup_db.c +++ b/src/bin/pg_dump/pg_backup_db.c @@ 
-586,6 +586,89 @@ IssueCommandPerBlob(ArchiveHandle *AH, TocEntry *te, pg_free(buf); } +/* + * Process a "LARGE OBJECTS" ACL TocEntry. + * + * To save space in the dump file, the TocEntry contains only one copy + * of the required GRANT/REVOKE commands, written to apply to the first + * blob in the group (although we do not depend on that detail here). + * We must expand the text to generate commands for all the blobs listed + * in the associated BLOB METADATA entry. + */ +void +IssueACLPerBlob(ArchiveHandle *AH, TocEntry *te) +{ + TocEntry *blobte = getTocEntryByDumpId(AH, te->dependencies[0]); + char *buf; + char *st; + char *st2; + char *en; + bool inquotes; + + if (!blobte) + pg_fatal("could not find entry for ID %d", te->dependencies[0]); + Assert(strcmp(blobte->desc, "BLOB METADATA") == 0); + + /* Make a writable copy of the ACL commands string */ + buf = pg_strdup(te->defn); + + /* + * We have to parse out the commands sufficiently to locate the blob OIDs + * and find the command-ending semicolons. The commands should not + * contain anything hard to parse except for double-quoted role names, + * which are easy to ignore. Once we've split apart the first and second + * halves of a command, apply IssueCommandPerBlob. (This means the + * updates on the blobs are interleaved if there's multiple commands, but + * that should cause no trouble.) 
+ */ + inquotes = false; + st = en = buf; + st2 = NULL; + while (*en) + { + /* Ignore double-quoted material */ + if (*en == '"') + inquotes = !inquotes; + if (inquotes) + { + en++; + continue; + } + /* If we found "LARGE OBJECT", that's the end of the first half */ + if (strncmp(en, "LARGE OBJECT ", 13) == 0) + { + /* Terminate the first-half string */ + en += 13; + Assert(isdigit((unsigned char) *en)); + *en++ = '\0'; + /* Skip the rest of the blob OID */ + while (isdigit((unsigned char) *en)) + en++; + /* Second half starts here */ + Assert(st2 == NULL); + st2 = en; + } + /* If we found semicolon, that's the end of the second half */ + else if (*en == ';') + { + /* Terminate the second-half string */ + *en++ = '\0'; + Assert(st2 != NULL); + /* Issue this command for each blob */ + IssueCommandPerBlob(AH, blobte, st, st2); + /* For neatness, skip whitespace before the next command */ + while (isspace((unsigned char) *en)) + en++; + /* Reset for new command */ + st = en; + st2 = NULL; + } + else + en++; + } + pg_free(buf); +} + void DropLOIfExists(ArchiveHandle *AH, Oid oid) { diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 13431db769..d875fe756d 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -271,7 +271,7 @@ static void dumpDefaultACL(Archive *fout, const DefaultACLInfo *daclinfo); static DumpId dumpACL(Archive *fout, DumpId objDumpId, DumpId altDumpId, const char *type, const char *name, const char *subname, - const char *nspname, const char *owner, + const char *nspname, const char *tag, const char *owner, const DumpableAcl *dacl); static void getDependencies(Archive *fout); @@ -3278,7 +3278,7 @@ dumpDatabase(Archive *fout) dumpACL(fout, dbDumpId, InvalidDumpId, "DATABASE", qdatname, NULL, NULL, - dba, &dbdacl); + NULL, dba, &dbdacl); /* * Now construct a DATABASE PROPERTIES archive entry to restore any @@ -3754,12 +3754,11 @@ dumpLO(Archive *fout, const LoInfo *loinfo) .dropStmt = "-- dummy")); /* - * Dump 
per-blob comments, seclabels, and ACLs if any. We assume these - * are rare enough that it's okay to generate retail TOC entries for them. + * Dump per-blob comments and seclabels if any. We assume these are rare + * enough that it's okay to generate retail TOC entries for them. */ if (loinfo->dobj.dump & (DUMP_COMPONENT_COMMENT | - DUMP_COMPONENT_SECLABEL | - DUMP_COMPONENT_ACL)) + DUMP_COMPONENT_SECLABEL)) { for (int i = 0; i < loinfo->numlos; i++) { @@ -3780,11 +3779,39 @@ dumpLO(Archive *fout, const LoInfo *loinfo) dumpSecLabel(fout, "LARGE OBJECT", namebuf, NULL, loinfo->rolname, catId, 0, loinfo->dobj.dumpId); + } + } + + /* + * Dump the ACLs if any (remember that all blobs in the group will have + * the same ACL). If there's just one blob, dump a simple ACL entry; if + * there's more, make a "LARGE OBJECTS" entry that really contains only + * the ACL for the first blob. _printTocEntry() will be cued by the tag + * string to emit a mutated version for each blob. + */ + if (loinfo->dobj.dump & DUMP_COMPONENT_ACL) + { + char namebuf[32]; + + /* Build identifying info for the first blob */ + snprintf(namebuf, sizeof(namebuf), "%u", loinfo->looids[0]); + + if (loinfo->numlos > 1) + { + char tagbuf[64]; + + snprintf(tagbuf, sizeof(tagbuf), "LARGE OBJECTS %u..%u", + loinfo->looids[0], loinfo->looids[loinfo->numlos - 1]); - if (loinfo->dobj.dump & DUMP_COMPONENT_ACL) - dumpACL(fout, loinfo->dobj.dumpId, InvalidDumpId, - "LARGE OBJECT", namebuf, NULL, - NULL, loinfo->rolname, &loinfo->dacl); + dumpACL(fout, loinfo->dobj.dumpId, InvalidDumpId, + "LARGE OBJECT", namebuf, NULL, NULL, + tagbuf, loinfo->rolname, &loinfo->dacl); + } + else + { + dumpACL(fout, loinfo->dobj.dumpId, InvalidDumpId, + "LARGE OBJECT", namebuf, NULL, NULL, + NULL, loinfo->rolname, &loinfo->dacl); } } @@ -10787,7 +10814,7 @@ dumpNamespace(Archive *fout, const NamespaceInfo *nspinfo) if (nspinfo->dobj.dump & DUMP_COMPONENT_ACL) dumpACL(fout, nspinfo->dobj.dumpId, InvalidDumpId, "SCHEMA", qnspname, 
NULL, NULL, - nspinfo->rolname, &nspinfo->dacl); + NULL, nspinfo->rolname, &nspinfo->dacl); free(qnspname); @@ -11084,7 +11111,7 @@ dumpEnumType(Archive *fout, const TypeInfo *tyinfo) dumpACL(fout, tyinfo->dobj.dumpId, InvalidDumpId, "TYPE", qtypname, NULL, tyinfo->dobj.namespace->dobj.name, - tyinfo->rolname, &tyinfo->dacl); + NULL, tyinfo->rolname, &tyinfo->dacl); PQclear(res); destroyPQExpBuffer(q); @@ -11237,7 +11264,7 @@ dumpRangeType(Archive *fout, const TypeInfo *tyinfo) dumpACL(fout, tyinfo->dobj.dumpId, InvalidDumpId, "TYPE", qtypname, NULL, tyinfo->dobj.namespace->dobj.name, - tyinfo->rolname, &tyinfo->dacl); + NULL, tyinfo->rolname, &tyinfo->dacl); PQclear(res); destroyPQExpBuffer(q); @@ -11308,7 +11335,7 @@ dumpUndefinedType(Archive *fout, const TypeInfo *tyinfo) dumpACL(fout, tyinfo->dobj.dumpId, InvalidDumpId, "TYPE", qtypname, NULL, tyinfo->dobj.namespace->dobj.name, - tyinfo->rolname, &tyinfo->dacl); + NULL, tyinfo->rolname, &tyinfo->dacl); destroyPQExpBuffer(q); destroyPQExpBuffer(delq); @@ -11555,7 +11582,7 @@ dumpBaseType(Archive *fout, const TypeInfo *tyinfo) dumpACL(fout, tyinfo->dobj.dumpId, InvalidDumpId, "TYPE", qtypname, NULL, tyinfo->dobj.namespace->dobj.name, - tyinfo->rolname, &tyinfo->dacl); + NULL, tyinfo->rolname, &tyinfo->dacl); PQclear(res); destroyPQExpBuffer(q); @@ -11710,7 +11737,7 @@ dumpDomain(Archive *fout, const TypeInfo *tyinfo) dumpACL(fout, tyinfo->dobj.dumpId, InvalidDumpId, "TYPE", qtypname, NULL, tyinfo->dobj.namespace->dobj.name, - tyinfo->rolname, &tyinfo->dacl); + NULL, tyinfo->rolname, &tyinfo->dacl); /* Dump any per-constraint comments */ for (i = 0; i < tyinfo->nDomChecks; i++) @@ -11924,7 +11951,7 @@ dumpCompositeType(Archive *fout, const TypeInfo *tyinfo) dumpACL(fout, tyinfo->dobj.dumpId, InvalidDumpId, "TYPE", qtypname, NULL, tyinfo->dobj.namespace->dobj.name, - tyinfo->rolname, &tyinfo->dacl); + NULL, tyinfo->rolname, &tyinfo->dacl); /* Dump any per-column comments */ if (tyinfo->dobj.dump & 
DUMP_COMPONENT_COMMENT) @@ -12200,7 +12227,7 @@ dumpProcLang(Archive *fout, const ProcLangInfo *plang) if (plang->lanpltrusted && plang->dobj.dump & DUMP_COMPONENT_ACL) dumpACL(fout, plang->dobj.dumpId, InvalidDumpId, "LANGUAGE", qlanname, NULL, NULL, - plang->lanowner, &plang->dacl); + NULL, plang->lanowner, &plang->dacl); free(qlanname); @@ -12664,7 +12691,7 @@ dumpFunc(Archive *fout, const FuncInfo *finfo) dumpACL(fout, finfo->dobj.dumpId, InvalidDumpId, keyword, funcsig, NULL, finfo->dobj.namespace->dobj.name, - finfo->rolname, &finfo->dacl); + NULL, finfo->rolname, &finfo->dacl); PQclear(res); @@ -14524,7 +14551,7 @@ dumpAgg(Archive *fout, const AggInfo *agginfo) dumpACL(fout, agginfo->aggfn.dobj.dumpId, InvalidDumpId, "FUNCTION", aggsig, NULL, agginfo->aggfn.dobj.namespace->dobj.name, - agginfo->aggfn.rolname, &agginfo->aggfn.dacl); + NULL, agginfo->aggfn.rolname, &agginfo->aggfn.dacl); free(aggsig); free(aggfullsig); @@ -14921,7 +14948,7 @@ dumpForeignDataWrapper(Archive *fout, const FdwInfo *fdwinfo) /* Handle the ACL */ if (fdwinfo->dobj.dump & DUMP_COMPONENT_ACL) dumpACL(fout, fdwinfo->dobj.dumpId, InvalidDumpId, - "FOREIGN DATA WRAPPER", qfdwname, NULL, + "FOREIGN DATA WRAPPER", qfdwname, NULL, NULL, NULL, fdwinfo->rolname, &fdwinfo->dacl); free(qfdwname); @@ -15008,7 +15035,7 @@ dumpForeignServer(Archive *fout, const ForeignServerInfo *srvinfo) /* Handle the ACL */ if (srvinfo->dobj.dump & DUMP_COMPONENT_ACL) dumpACL(fout, srvinfo->dobj.dumpId, InvalidDumpId, - "FOREIGN SERVER", qsrvname, NULL, + "FOREIGN SERVER", qsrvname, NULL, NULL, NULL, srvinfo->rolname, &srvinfo->dacl); /* Dump user mappings */ @@ -15208,6 +15235,8 @@ dumpDefaultACL(Archive *fout, const DefaultACLInfo *daclinfo) * 'subname' is the formatted name of the sub-object, if any. Must be quoted. * (Currently we assume that subname is only provided for table columns.) * 'nspname' is the namespace the object is in (NULL if none). 
+ * 'tag' is the tag to use for the ACL TOC entry; typically, this is NULL + * to use the default for the object type. * 'owner' is the owner, NULL if there is no owner (for languages). * 'dacl' is the DumpableAcl struct for the object. * @@ -15218,7 +15247,7 @@ dumpDefaultACL(Archive *fout, const DefaultACLInfo *daclinfo) static DumpId dumpACL(Archive *fout, DumpId objDumpId, DumpId altDumpId, const char *type, const char *name, const char *subname, - const char *nspname, const char *owner, + const char *nspname, const char *tag, const char *owner, const DumpableAcl *dacl) { DumpId aclDumpId = InvalidDumpId; @@ -15290,14 +15319,16 @@ dumpACL(Archive *fout, DumpId objDumpId, DumpId altDumpId, if (sql->len > 0) { - PQExpBuffer tag = createPQExpBuffer(); + PQExpBuffer tagbuf = createPQExpBuffer(); DumpId aclDeps[2]; int nDeps = 0; - if (subname) - appendPQExpBuffer(tag, "COLUMN %s.%s", name, subname); + if (tag) + appendPQExpBufferStr(tagbuf, tag); + else if (subname) + appendPQExpBuffer(tagbuf, "COLUMN %s.%s", name, subname); else - appendPQExpBuffer(tag, "%s %s", type, name); + appendPQExpBuffer(tagbuf, "%s %s", type, name); aclDeps[nDeps++] = objDumpId; if (altDumpId != InvalidDumpId) @@ -15306,7 +15337,7 @@ dumpACL(Archive *fout, DumpId objDumpId, DumpId altDumpId, aclDumpId = createDumpId(); ArchiveEntry(fout, nilCatalogId, aclDumpId, - ARCHIVE_OPTS(.tag = tag->data, + ARCHIVE_OPTS(.tag = tagbuf->data, .namespace = nspname, .owner = owner, .description = "ACL", @@ -15315,7 +15346,7 @@ dumpACL(Archive *fout, DumpId objDumpId, DumpId altDumpId, .deps = aclDeps, .nDeps = nDeps)); - destroyPQExpBuffer(tag); + destroyPQExpBuffer(tagbuf); } destroyPQExpBuffer(sql); @@ -15697,8 +15728,8 @@ dumpTable(Archive *fout, const TableInfo *tbinfo) tableAclDumpId = dumpACL(fout, tbinfo->dobj.dumpId, InvalidDumpId, objtype, namecopy, NULL, - tbinfo->dobj.namespace->dobj.name, tbinfo->rolname, - &tbinfo->dacl); + tbinfo->dobj.namespace->dobj.name, + NULL, tbinfo->rolname, 
&tbinfo->dacl); } /* @@ -15791,8 +15822,8 @@ dumpTable(Archive *fout, const TableInfo *tbinfo) */ dumpACL(fout, tbinfo->dobj.dumpId, tableAclDumpId, "TABLE", namecopy, attnamecopy, - tbinfo->dobj.namespace->dobj.name, tbinfo->rolname, - &coldacl); + tbinfo->dobj.namespace->dobj.name, + NULL, tbinfo->rolname, &coldacl); free(attnamecopy); } PQclear(res); -- 2.39.3 From 04d1271254731f9f4e878f932816832599af3d7d Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Fri, 15 Mar 2024 19:02:40 -0400 Subject: [PATCH v11 6/6] Be more conservative about --transaction-size in parallel pg_upgrade. Patch 0004 just blindly set the --transaction-size for pg_restore invoked by parallel pg_upgrade to 1000. If the user asks for 5 or 10 parallel restore jobs, our headroom to avoid running out of lock table entries disappears. Be safe by dividing the transaction size by the number of parallel jobs. We could go further and give users direct control of this setting, but it's quite unclear that the extra API complexity would be useful. --- src/bin/pg_upgrade/pg_upgrade.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c index 8c6fb96478..af370768b6 100644 --- a/src/bin/pg_upgrade/pg_upgrade.c +++ b/src/bin/pg_upgrade/pg_upgrade.c @@ -587,6 +587,7 @@ create_new_objects(void) log_file_name[MAXPGPATH]; DbInfo *old_db = &old_cluster.dbarr.dbs[dbnum]; const char *create_opts; + int txn_size; /* Skip template1 in this pass */ if (strcmp(old_db->db_name, "template1") == 0) @@ -606,6 +607,19 @@ create_new_objects(void) else create_opts = "--create"; + /* + * In parallel mode, reduce the --transaction-size of each restore job + * so that the total number of locks that could be held across all the + * jobs stays in bounds. 
+ */ + txn_size = RESTORE_TRANSACTION_SIZE; + if (user_opts.jobs > 1) + { + txn_size /= user_opts.jobs; + /* Keep some sanity if -j is huge */ + txn_size = Max(txn_size, 10); + } + parallel_exec_prog(log_file_name, NULL, "\"%s/pg_restore\" %s %s --exit-on-error --verbose " @@ -614,7 +628,7 @@ create_new_objects(void) new_cluster.bindir, cluster_conn_opts(&new_cluster), create_opts, - RESTORE_TRANSACTION_SIZE, + txn_size, log_opts.dumpdir, sql_file_name); } -- 2.39.3
On Fri, 2024-03-15 at 19:18 -0400, Tom Lane wrote: > This patch seems to have stalled out again. In hopes of getting it > over the finish line, I've done a bit more work to address the two > loose ends I felt were probably essential to deal with: Applies and builds fine. I didn't scrutinize the code, but I gave it a spin on a database with 15 million (small) large objects. I tried pg_upgrade --link with and without the patch on a debug build with the default configuration. Without the patch: Runtime: 74.5 minutes Memory usage: ~7GB Disk usage: an extra 5GB dump file + log file during the dump With the patch: Runtime: 70 minutes Memory usage: ~1GB Disk usage: an extra 0.5GB during the dump Memory usage stayed stable once it reached its peak, so no noticeable memory leaks. The reduced memory usage is great. I was surprised by the difference in disk usage: the lion's share is the dump file, and that got substantially smaller. But also the log file shrank considerably, because not every individual large object gets logged. I had a look at "perf top", and the profile looked pretty similar in both cases. The patch is a clear improvement. Yours, Laurenz Albe
Laurenz Albe <laurenz.albe@cybertec.at> writes: > On Fri, 2024-03-15 at 19:18 -0400, Tom Lane wrote: >> This patch seems to have stalled out again. In hopes of getting it >> over the finish line, I've done a bit more work to address the two >> loose ends I felt were probably essential to deal with: > Applies and builds fine. > I didn't scrutinize the code, but I gave it a spin on a database with > 15 million (small) large objects. I tried pg_upgrade --link with and > without the patch on a debug build with the default configuration. Thanks for looking at it! > Without the patch: > Runtime: 74.5 minutes > With the patch: > Runtime: 70 minutes Hm, I'd have hoped for a bit more runtime improvement. But perhaps not --- most of the win we saw upthread was from parallelism, and I don't think you'd get any parallelism in a pg_upgrade with all the data in one database. (Perhaps there is more to do there later, but I'm still not clear on how this should interact with the existing cross-DB parallelism; so I'm content to leave that question for another patch.) regards, tom lane
On Sat, 2024-03-16 at 18:46 -0400, Tom Lane wrote: > > Without the patch: > > Runtime: 74.5 minutes > > > With the patch: > > Runtime: 70 minutes > > Hm, I'd have hoped for a bit more runtime improvement. I did a second run with the patch, and that finished in 66 minutes, so there is some jitter there. I think the reduced memory footprint and the reduced transaction ID consumption alone make this patch worthwhile. Yours, Laurenz Albe
Hi, On Sat, Mar 16, 2024 at 06:46:15PM -0400, Tom Lane wrote: > Laurenz Albe <laurenz.albe@cybertec.at> writes: > > On Fri, 2024-03-15 at 19:18 -0400, Tom Lane wrote: > >> This patch seems to have stalled out again. In hopes of getting it > >> over the finish line, I've done a bit more work to address the two > >> loose ends I felt were probably essential to deal with: > > > Applies and builds fine. > > I didn't scrutinize the code, but I gave it a spin on a database with > > 15 million (small) large objects. I tried pg_upgrade --link with and > > without the patch on a debug build with the default configuration. > > Thanks for looking at it! > > > Without the patch: > > Runtime: 74.5 minutes > > > With the patch: > > Runtime: 70 minutes > > Hm, I'd have hoped for a bit more runtime improvement. I also think that this is quite a large runtime for pg_upgrade, but the more important savings should be the memory usage. > But perhaps not --- most of the win we saw upthread was from > parallelism, and I don't think you'd get any parallelism in a > pg_upgrade with all the data in one database. (Perhaps there is more > to do there later, but I'm still not clear on how this should interact > with the existing cross-DB parallelism; so I'm content to leave that > question for another patch.) What is the status of this? In the commitfest, this patch is marked as "Needs Review" with Nathan as reviewer - Nathan, were you going to take another look at this or was your mail from January 12th a full review? My feeling is that this patch is "Ready for Committer", and it is Tom's call whether to commit it in the next few days or not. I am +1 that this is an important feature/bug fix to have. Because we have customers stuck on older versions due to their pathological large-object usage, I did some benchmarks (just doing pg_dump, not pg_upgrade) a while ago which were also very promising; however, I lost the exact numbers/results.
I am happy to do further tests if that is required for this patch to go forward. Also, is there a chance this is going to be back-patched? I guess it would be enough if the upgrade target is v17 so it is less of a concern, but it would be nice if people with millions of large objects are not stuck until they are ready to upgrade to v17. Michael
On Wed, 2024-03-27 at 10:20 +0100, Michael Banck wrote: > Also, is there a chance this is going to be back-patched? I guess it > would be enough if the upgrade target is v17 so it is less of a concern, > but it would be nice if people with millions of large objects are not > stuck until they are ready to upgrade to v17. It is quite an invasive patch, and it adds new features (pg_restore in bigger transaction batches), so I think this is not for backpatching, desirable as it may seem from the usability angle. Yours, Laurenz Albe
Hi, On Wed, Mar 27, 2024 at 10:53:51AM +0100, Laurenz Albe wrote: > On Wed, 2024-03-27 at 10:20 +0100, Michael Banck wrote: > > Also, is there a chance this is going to be back-patched? I guess it > > would be enough if the upgrade target is v17 so it is less of a concern, > > but it would be nice if people with millions of large objects are not > > stuck until they are ready to upgrade to v17. > > It is quite an invasive patch, and it adds new features (pg_restore in > bigger transaction batches), so I think this is not for backpatching, > desirable as it may seem from the usability angle. Right, I forgot about those changes; that makes sense. Michael
Michael Banck <mbanck@gmx.net> writes: > What is the status of this? In the commitfest, this patch is marked as > "Needs Review" with Nathan as reviewer - Nathan, were you going to take > another look at this or was your mail from January 12th a full review? In my mind the ball is in Nathan's court. I feel it's about committable, but he might not agree. > Also, is there a chance this is going to be back-patched? No chance of that I'm afraid. The patch bumps the archive version number, because it creates TOC entries that older pg_restore would not know what to do with. We can't put that kind of compatibility break into stable branches. regards, tom lane
On Wed, Mar 27, 2024 at 10:54:05AM -0400, Tom Lane wrote: > Michael Banck <mbanck@gmx.net> writes: >> What is the status of this? In the commitfest, this patch is marked as >> "Needs Review" with Nathan as reviewer - Nathan, were you going to take >> another look at this or was your mail from January 12th a full review? > > In my mind the ball is in Nathan's court. I feel it's about > committable, but he might not agree. I'll prioritize another round of review on this one. FWIW I don't remember having any major concerns on a previous version of the patch set I looked at. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Mar 27, 2024 at 10:08:26AM -0500, Nathan Bossart wrote: > On Wed, Mar 27, 2024 at 10:54:05AM -0400, Tom Lane wrote: >> Michael Banck <mbanck@gmx.net> writes: >>> What is the status of this? In the commitfest, this patch is marked as >>> "Needs Review" with Nathan as reviewer - Nathan, were you going to take >>> another look at this or was your mail from January 12th a full review? >> >> In my mind the ball is in Nathan's court. I feel it's about >> committable, but he might not agree. > > I'll prioritize another round of review on this one. FWIW I don't remember > having any major concerns on a previous version of the patch set I looked > at. Sorry for taking so long to get back to this one. Overall, I think the code is in decent shape. Nothing stands out after a couple of passes. The small amount of runtime improvement cited upthread is indeed a bit disappointing, but IIUC this at least sets the stage for additional parallelism in the future, and the memory/disk usage improvements are nothing to sneeze at, either. The one design point that worries me a little is the non-configurability of --transaction-size in pg_upgrade. I think it's fine to default it to 1,000 or something, but given how often I've had to fiddle with max_locks_per_transaction, I'm wondering if we might regret hard-coding it. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes: > Sorry for taking so long to get back to this one. Overall, I think the > code is in decent shape. Thanks for looking at it! > The one design point that worries me a little is the non-configurability of > --transaction-size in pg_upgrade. I think it's fine to default it to 1,000 > or something, but given how often I've had to fiddle with > max_locks_per_transaction, I'm wondering if we might regret hard-coding it. Well, we could add a command-line switch to pg_upgrade, but I'm unconvinced that it'd be worth the trouble. I think a very large fraction of users invoke pg_upgrade by means of packager-supplied scripts that are unlikely to provide a way to pass through such a switch. I'm inclined to say let's leave it as-is until we get some actual field requests for a switch. regards, tom lane
On Mon, Apr 01, 2024 at 03:28:26PM -0400, Tom Lane wrote: > Nathan Bossart <nathandbossart@gmail.com> writes: >> The one design point that worries me a little is the non-configurability of >> --transaction-size in pg_upgrade. I think it's fine to default it to 1,000 >> or something, but given how often I've had to fiddle with >> max_locks_per_transaction, I'm wondering if we might regret hard-coding it. > > Well, we could add a command-line switch to pg_upgrade, but I'm > unconvinced that it'd be worth the trouble. I think a very large > fraction of users invoke pg_upgrade by means of packager-supplied > scripts that are unlikely to provide a way to pass through such > a switch. I'm inclined to say let's leave it as-is until we get > some actual field requests for a switch. Okay. I'll let you know if I see anything. IIRC usually the pg_dump side of pg_upgrade is more prone to lock exhaustion, so you may very well be right that this is unnecessary. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
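The lock-exhaustion concern in this exchange comes down to simple arithmetic against the shared lock table. Per the PostgreSQL documentation, that table holds roughly max_locks_per_transaction * (max_connections + max_prepared_transactions) object locks cluster-wide; treating each object restored inside one batch transaction as consuming about one lock entry (a simplification) gives a quick feasibility check. A sketch in Python, not PostgreSQL code:

```python
# Back-of-envelope check of whether a pg_restore --transaction-size batch
# fits in the server's shared lock table.  The capacity formula is from
# the max_locks_per_transaction documentation; assuming one lock entry
# per restored object is a simplification for illustration.

def lock_table_capacity(max_locks_per_transaction: int,
                        max_connections: int,
                        max_prepared_transactions: int = 0) -> int:
    """Approximate number of object locks the whole cluster can hold."""
    return max_locks_per_transaction * (max_connections +
                                        max_prepared_transactions)

def batch_fits(batch_size: int, **gucs) -> bool:
    """True if one transaction touching batch_size objects should fit."""
    return batch_size <= lock_table_capacity(**gucs)

# Default settings: 64 * (100 + 0) = 6400 slots, so the hard-coded
# batch of 1000 leaves headroom for other sessions.
print(lock_table_capacity(64, 100))   # 6400
print(batch_fits(1000, max_locks_per_transaction=64, max_connections=100))
```

By this estimate the default batch of 1000 fits comfortably under default settings, which may be why the hard-coded value was considered acceptable.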
On Mon, Apr 01, 2024 at 03:28:26PM -0400, Tom Lane wrote: > Nathan Bossart <nathandbossart@gmail.com> writes: > > The one design point that worries me a little is the non-configurability of > > --transaction-size in pg_upgrade. I think it's fine to default it to 1,000 > > or something, but given how often I've had to fiddle with > > max_locks_per_transaction, I'm wondering if we might regret hard-coding it. > > Well, we could add a command-line switch to pg_upgrade, but I'm > unconvinced that it'd be worth the trouble. I think a very large > fraction of users invoke pg_upgrade by means of packager-supplied > scripts that are unlikely to provide a way to pass through such > a switch. I'm inclined to say let's leave it as-is until we get > some actual field requests for a switch. I've been importing our schemas and doing upgrade testing, and was surprised when a postgres backend was killed for OOM during pg_upgrade: Killed process 989302 (postgres) total-vm:5495648kB, anon-rss:5153292kB, ... Upgrading from v16 => v16 doesn't use nearly as much RAM. While tracking down the responsible commit, I reproduced the problem using a subset of tables; at 959b38d770, the backend process used ~650 MB RAM, and at its parent commit used at most ~120 MB. 959b38d770b Invent --transaction-size option for pg_restore. By changing RESTORE_TRANSACTION_SIZE to 100, backend RAM use goes to 180 MB during pg_upgrade, which is reasonable. With partitioning, we have a lot of tables, some of them wide (126 partitioned tables, 8942 children, 1019315 columns in total). I didn't track whether certain parts of our schema contribute most to the high backend memory use, just that it's now 5x (while testing a subset) to 50x higher. We'd surely prefer that the transaction size be configurable. -- Justin
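For what it's worth, the two measurements in this report (batch size 100 giving ~180 MB, batch size 1000 giving ~650 MB, parent commit ~120 MB) are consistent with backend memory growing roughly linearly in RESTORE_TRANSACTION_SIZE. A throwaway extrapolation from just those two data points (the linearity is an assumption, not something the thread establishes):

```python
# Rough linear fit of the two measurements above: backend RSS during
# pg_upgrade was ~180 MB with RESTORE_TRANSACTION_SIZE = 100 and
# ~650 MB with RESTORE_TRANSACTION_SIZE = 1000.  This is only an
# extrapolation from the numbers in this thread, not a model of the
# actual allocator behavior.

def fitted_rss_mb(batch_size: int) -> float:
    b1, m1 = 100, 180.0    # (batch size, observed MB)
    b2, m2 = 1000, 650.0
    slope = (m2 - m1) / (b2 - b1)   # ~0.52 MB per unit of batch size
    return m1 + slope * (batch_size - b1)

for batch in (1, 100, 500, 1000):
    print(batch, round(fitted_rss_mb(batch)))
```

On this crude fit, each additional unit of batch size costs about half a megabyte of backend memory for this particular schema.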
Hi, Justin! Thank you for sharing this. On Wed, Jul 24, 2024 at 5:18 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Mon, Apr 01, 2024 at 03:28:26PM -0400, Tom Lane wrote: > > Nathan Bossart <nathandbossart@gmail.com> writes: > > > The one design point that worries me a little is the non-configurability of > > > --transaction-size in pg_upgrade. I think it's fine to default it to 1,000 > > > or something, but given how often I've had to fiddle with > > > max_locks_per_transaction, I'm wondering if we might regret hard-coding it. > > > > Well, we could add a command-line switch to pg_upgrade, but I'm > > unconvinced that it'd be worth the trouble. I think a very large > > fraction of users invoke pg_upgrade by means of packager-supplied > > scripts that are unlikely to provide a way to pass through such > > a switch. I'm inclined to say let's leave it as-is until we get > > some actual field requests for a switch. > > I've been importing our schemas and doing upgrade testing, and was > surprised when a postgres backend was killed for OOM during pg_upgrade: > > Killed process 989302 (postgres) total-vm:5495648kB, anon-rss:5153292kB, ... > > Upgrading from v16 => v16 doesn't use nearly as much RAM. > > While tracking down the responsible commit, I reproduced the problem > using a subset of tables; at 959b38d770, the backend process used > ~650 MB RAM, and at its parent commit used at most ~120 MB. > > 959b38d770b Invent --transaction-size option for pg_restore. > > By changing RESTORE_TRANSACTION_SIZE to 100, backend RAM use goes to > 180 MB during pg_upgrade, which is reasonable. > > With partitioning, we have a lot of tables, some of them wide (126 > partitioned tables, 8942 children, total 1019315 columns). I didn't track > if certain parts of our schema contribute most to the high backend mem > use, just that it's now 5x (while testing a subset) to 50x higher. Do you think there is a way to anonymize the schema and share it?
> We'd surely prefer that the transaction size be configurable. I think we can add an option to pg_upgrade. But I wonder if there is something else we can do. It seems that restoring some objects is much more expensive than restoring others. It would be nice to identify such cases and check which memory contexts are growing and why. It would be helpful if you could share your data schema, so we could dig into it. I can imagine we need to count some DDL commands differently than others when enforcing the maximum restore transaction size. Also, we probably need to change the default restore transaction size. ------ Regards, Alexander Korotkov Supabase
Alexander Korotkov <aekorotkov@gmail.com> writes: > On Wed, Jul 24, 2024 at 5:18 PM Justin Pryzby <pryzby@telsasoft.com> wrote: >> We'd surely prefer that the transaction size be configurable. > I think we can add an option to pg_upgrade. But I wonder if there is > something else we can do. Yeah, I'm not enamored of adding a command-line option, if only because I think a lot of people invoke pg_upgrade through vendor-provided scripts that aren't going to cooperate with that. If we can find some way to make it adapt without help, that would be much better. regards, tom lane
On Wed, Jul 24, 2024 at 09:17:51AM -0500, Justin Pryzby wrote: > With partitioning, we have a lot of tables, some of them wide (126 > partitioned tables, 8942 children, total 1019315 columns). On Fri, Jul 26, 2024 at 10:53:30PM +0300, Alexander Korotkov wrote: > It would be nice to identify such cases and check which memory contexts are > growing and why. I reproduced the problem with this schema:

SELECT format('CREATE TABLE p(i int, %s) PARTITION BY RANGE(i)', array_to_string(a, ', ')) FROM (SELECT array_agg(format('i%s int', i))a FROM generate_series(1,999)i);
SELECT format('CREATE TABLE t%s PARTITION OF p FOR VALUES FROM (%s)TO(%s)', i,i,i+1) FROM generate_series(1,999)i;

This used over 4 GB of RAM. 3114201 pryzbyj 20 0 5924520 4.2g 32476 T 0.0 53.8 0:27.35 postgres: pryzbyj postgres [local] UPDATE The large context is: 2024-07-26 15:22:19.280 CDT [3114201] LOG: level: 1; CacheMemoryContext: 5211209088 total in 50067 blocks; 420688 free (14 chunks); 5210788400 used Note that there seemed to be no issue when I created 999 tables without partitioning:

SELECT format('CREATE TABLE t%s(LIKE p)', i,i,i+1) FROM generate_series(1,999)i;

-- Justin
On Fri, Jul 26, 2024 at 11:36 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Wed, Jul 24, 2024 at 09:17:51AM -0500, Justin Pryzby wrote: > > With partitioning, we have a lot of tables, some of them wide (126 > > partitioned tables, 8942 children, total 1019315 columns). > > On Fri, Jul 26, 2024 at 10:53:30PM +0300, Alexander Korotkov wrote: > > It would be nice to identify such cases and check which memory contexts are > > growing and why. > > I reproduced the problem with this schema: > > SELECT format('CREATE TABLE p(i int, %s) PARTITION BY RANGE(i)', array_to_string(a, ', ')) FROM (SELECT array_agg(format('i%s int', i))a FROM generate_series(1,999)i); > SELECT format('CREATE TABLE t%s PARTITION OF p FOR VALUES FROM (%s)TO(%s)', i,i,i+1) FROM generate_series(1,999)i; > > This used over 4 GB of RAM. > 3114201 pryzbyj 20 0 5924520 4.2g 32476 T 0.0 53.8 0:27.35 postgres: pryzbyj postgres [local] UPDATE > > The large context is: > 2024-07-26 15:22:19.280 CDT [3114201] LOG: level: 1; CacheMemoryContext: 5211209088 total in 50067 blocks; 420688 free (14 chunks); 5210788400 used > > Note that there seemed to be no issue when I created 999 tables without > partitioning: > > SELECT format('CREATE TABLE t%s(LIKE p)', i,i,i+1) FROM generate_series(1,999)i; Thank you! That was quick. I'm looking into this. ------ Regards, Alexander Korotkov Supabase
Justin Pryzby <pryzby@telsasoft.com> writes: > On Fri, Jul 26, 2024 at 10:53:30PM +0300, Alexander Korotkov wrote: >> It would be nice to identify such cases and check which memory contexts are >> growing and why. > I reproduced the problem with this schema: > SELECT format('CREATE TABLE p(i int, %s) PARTITION BY RANGE(i)', array_to_string(a, ', ')) FROM (SELECT array_agg(format('i%s int', i))a FROM generate_series(1,999)i); > SELECT format('CREATE TABLE t%s PARTITION OF p FOR VALUES FROM (%s)TO(%s)', i,i,i+1) FROM generate_series(1,999)i; > This used over 4 GB of RAM. Interesting. This doesn't bloat particularly much in a regular pg_restore, even with --transaction-size=1000; but it does in pg_upgrade, as you say. I found that the bloat was occurring during these long sequences of UPDATE commands issued by pg_upgrade:

-- For binary upgrade, recreate inherited column.
UPDATE pg_catalog.pg_attribute
SET attislocal = false
WHERE attname = 'i'
  AND attrelid = '"public"."t139"'::pg_catalog.regclass;

-- For binary upgrade, recreate inherited column.
UPDATE pg_catalog.pg_attribute
SET attislocal = false
WHERE attname = 'i1'
  AND attrelid = '"public"."t139"'::pg_catalog.regclass;

-- For binary upgrade, recreate inherited column.
UPDATE pg_catalog.pg_attribute
SET attislocal = false
WHERE attname = 'i2'
  AND attrelid = '"public"."t139"'::pg_catalog.regclass;

I think the problem is basically that each one of these commands causes a relcache inval, for which we can't reclaim space right away, so that we end up consuming O(N^2) cache space for an N-column inherited table. It's fairly easy to fix things so that this example doesn't cause that to happen: we just need to issue these updates as one command not N commands per table. See attached.
However, I fear this should just be considered a draft, because the other code for binary upgrade in the immediate vicinity is just as aggressively stupid and unoptimized as this bit, and can probably also be driven to O(N^2) behavior with enough CHECK constraints etc. We've gone out of our way to make ALTER TABLE capable of handling many updates to a table's DDL in one command, but whoever wrote this code appears not to have read that memo, or at least to have believed that performance of pg_upgrade isn't of concern. > Note that there seemed to be no issue when I created 999 tables without > partitioning: > SELECT format('CREATE TABLE t%s(LIKE p)', i,i,i+1) FROM generate_series(1,999)i; Yeah, because then we don't need to play games with attislocal. regards, tom lane diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index b8b1888bd3..19f98bdf43 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -16094,6 +16094,12 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo) tbinfo->relkind == RELKIND_FOREIGN_TABLE || tbinfo->relkind == RELKIND_PARTITIONED_TABLE)) { + bool firstinhcol = true; + + /* + * Drop any dropped columns. We don't really expect there to be a + * lot, else this code would be pretty inefficient. + */ for (j = 0; j < tbinfo->numatts; j++) { if (tbinfo->attisdropped[j]) @@ -16120,18 +16126,37 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo) appendPQExpBuffer(q, "DROP COLUMN %s;\n", fmtId(tbinfo->attnames[j])); } - else if (!tbinfo->attislocal[j]) + } + + /* + * Fix up inherited columns. There could be a lot of these, so we + * do the operation in a single SQL command; otherwise, we risk + * O(N^2) relcache bloat thanks to repeatedly invalidating the + * table's relcache entry. 
+ */ + for (j = 0; j < tbinfo->numatts; j++) + { + if (!tbinfo->attisdropped[j] && + !tbinfo->attislocal[j]) { - appendPQExpBufferStr(q, "\n-- For binary upgrade, recreate inherited column.\n"); - appendPQExpBufferStr(q, "UPDATE pg_catalog.pg_attribute\n" - "SET attislocal = false\n" - "WHERE attname = "); + if (firstinhcol) + { + appendPQExpBufferStr(q, "\n-- For binary upgrade, recreate inherited columns.\n"); + appendPQExpBufferStr(q, "UPDATE pg_catalog.pg_attribute\n" + "SET attislocal = false\n" + "WHERE attrelid = "); + appendStringLiteralAH(q, qualrelname, fout); + appendPQExpBufferStr(q, "::pg_catalog.regclass\n" + " AND attname IN ("); + firstinhcol = false; + } + else + appendPQExpBufferStr(q, ", "); appendStringLiteralAH(q, tbinfo->attnames[j], fout); - appendPQExpBufferStr(q, "\n AND attrelid = "); - appendStringLiteralAH(q, qualrelname, fout); - appendPQExpBufferStr(q, "::pg_catalog.regclass;\n"); } } + if (!firstinhcol) + appendPQExpBufferStr(q, ");\n"); /* * Add inherited CHECK constraints, if any.
On Sat, Jul 27, 2024 at 1:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Justin Pryzby <pryzby@telsasoft.com> writes: > > On Fri, Jul 26, 2024 at 10:53:30PM +0300, Alexander Korotkov wrote: > >> It would be nice to identify such cases and check which memory contexts are > >> growing and why. > > > I reproduced the problem with this schema: > > > SELECT format('CREATE TABLE p(i int, %s) PARTITION BY RANGE(i)', array_to_string(a, ', ')) FROM (SELECT array_agg(format('i%s int', i))a FROM generate_series(1,999)i); > > SELECT format('CREATE TABLE t%s PARTITION OF p FOR VALUES FROM (%s)TO(%s)', i,i,i+1) FROM generate_series(1,999)i; > > > This used over 4 GB of RAM. > > Interesting. This doesn't bloat particularly much in a regular > pg_restore, even with --transaction-size=1000; but it does in > pg_upgrade, as you say. I found that the bloat was occurring > during these long sequences of UPDATE commands issued by pg_upgrade: > > -- For binary upgrade, recreate inherited column. > UPDATE pg_catalog.pg_attribute > SET attislocal = false > WHERE attname = 'i' > AND attrelid = '"public"."t139"'::pg_catalog.regclass; > > -- For binary upgrade, recreate inherited column. > UPDATE pg_catalog.pg_attribute > SET attislocal = false > WHERE attname = 'i1' > AND attrelid = '"public"."t139"'::pg_catalog.regclass; > > -- For binary upgrade, recreate inherited column. > UPDATE pg_catalog.pg_attribute > SET attislocal = false > WHERE attname = 'i2' > AND attrelid = '"public"."t139"'::pg_catalog.regclass; > > I think the problem is basically that each one of these commands > causes a relcache inval, for which we can't reclaim space right > away, so that we end up consuming O(N^2) cache space for an > N-column inherited table. I was about to report the same. > It's fairly easy to fix things so that this example doesn't cause > that to happen: we just need to issue these updates as one command > not N commands per table. See attached.
> However, I fear this should > just be considered a draft, because the other code for binary upgrade > in the immediate vicinity is just as aggressively stupid and > unoptimized as this bit, and can probably also be driven to O(N^2) > behavior with enough CHECK constraints etc. We've gone out of our way > to make ALTER TABLE capable of handling many updates to a table's DDL > in one command, but whoever wrote this code appears not to have read > that memo, or at least to have believed that performance of pg_upgrade > isn't of concern. I was thinking about counting the actual number of queries, rather than TOC entries, toward the transaction size as a more universal solution. But that would require using psql_scan() or writing a simpler alternative for this particular purpose. That looks quite annoying. What do you think? ------ Regards, Alexander Korotkov Supabase
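The O(N^2) growth described in these two messages can be sketched with a toy counting model: each per-column UPDATE on pg_attribute invalidates the target table's relcache entry, and inside one long transaction every superseded copy stays pinned until commit. This is only an illustration of the growth rate, not of PostgreSQL's actual relcache bookkeeping:

```python
# Toy model of the relcache bloat: treat the entry size as proportional
# to the number of columns processed so far.  With one UPDATE per column,
# the rebuilt entry after command j holds j slots, and earlier copies
# cannot be reclaimed until end of transaction, so N commands retain
# O(N^2) slots; a single combined UPDATE retains only O(N).

def retained_slots_per_command(ncols: int) -> int:
    # one invalidation per column: copies of size 1, 2, ..., N pile up
    return sum(range(1, ncols + 1))        # N*(N+1)/2

def retained_slots_single_command(ncols: int) -> int:
    # one UPDATE covering all columns: a single invalidation, one copy
    return ncols

print(retained_slots_per_command(999))     # 499500 -- quadratic
print(retained_slots_single_command(999))  # 999    -- linear
```

For the 999-column repro above, merging the UPDATEs reduces the modeled retained cache footprint by a factor of about 500, which matches the shape (though not the exact magnitude) of the observed multi-gigabyte bloat.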
Alexander Korotkov <aekorotkov@gmail.com> writes: > On Sat, Jul 27, 2024 at 1:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It's fairly easy to fix things so that this example doesn't cause >> that to happen: we just need to issue these updates as one command >> not N commands per table. > I was thinking about counting actual number of queries, not TOC > entries for transaction number as a more universal solution. But that > would require usage of psql_scan() or writing simpler alternative for > this particular purpose. That looks quite annoying. What do you > think? The assumption underlying what we're doing now is that the number of SQL commands per TOC entry is limited. I'd prefer to fix the code so that that assumption is correct, at least in normal cases. I confess I'd not looked closely enough at the binary-upgrade support code to realize it wasn't correct already :-(. If we go that way, we can fix this while also making pg_upgrade faster rather than slower. I also expect that it'll be a lot simpler than putting a full SQL parser in pg_restore. regards, tom lane
On Sat, Jul 27, 2024 at 2:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alexander Korotkov <aekorotkov@gmail.com> writes: > > On Sat, Jul 27, 2024 at 1:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> It's fairly easy to fix things so that this example doesn't cause > >> that to happen: we just need to issue these updates as one command > >> not N commands per table. > > > I was thinking about counting actual number of queries, not TOC > > entries for transaction number as a more universal solution. But that > > would require usage of psql_scan() or writing simpler alternative for > > this particular purpose. That looks quite annoying. What do you > > think? > > The assumption underlying what we're doing now is that the number > of SQL commands per TOC entry is limited. I'd prefer to fix the > code so that that assumption is correct, at least in normal cases. > I confess I'd not looked closely enough at the binary-upgrade support > code to realize it wasn't correct already :-(. If we go that way, > we can fix this while also making pg_upgrade faster rather than > slower. I also expect that it'll be a lot simpler than putting > a full SQL parser in pg_restore. I'm good with that, as long as we're not going to meet many cases of a high number of SQL commands per TOC entry. J4F, I have an idea to count the number of ';' signs and use it for the transaction size counter, since it is an upper bound estimate of the number of SQL commands :-) ------ Regards, Alexander Korotkov Supabase
Alexander Korotkov <aekorotkov@gmail.com> writes: > J4F, I have an idea to count the number of ';' signs and use it for > the transaction size counter, since it is an upper bound estimate of > the number of SQL commands :-) Hmm ... that's not a completely silly idea. Let's keep it in the back pocket in case we can't easily reduce the number of SQL commands in some cases. It's late here, and I've got some other commitments tomorrow, but I'll try to produce a patch to merge more of the SQL commands in a day or two. regards, tom lane
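The semicolon-counting idea can be sketched as an adaptive budget in a hypothetical restore loop (illustrative Python, not pg_restore's actual code; counting ';' over-counts when semicolons appear inside string literals or function bodies, which is exactly what makes it a safe upper bound on the command count):

```python
# Sketch of charging each TOC entry against the transaction budget by
# its semicolon count instead of a flat cost of 1, so that entries that
# expand to many catalog UPDATEs force earlier commits.  The restore()
# loop and its entry format are hypothetical, for illustration only.

def command_upper_bound(sql: str) -> int:
    """Upper-bound estimate of the number of SQL commands in one entry."""
    return max(1, sql.count(";"))

def restore(entries, transaction_size=1000):
    """Return the command stream, committing when the budget is spent."""
    budget = 0
    out = ["BEGIN"]
    for sql in entries:
        cost = command_upper_bound(sql)
        if budget + cost > transaction_size and budget > 0:
            out += ["COMMIT", "BEGIN"]   # start a fresh batch
            budget = 0
        out.append(sql)
        budget += cost
    out.append("COMMIT")
    return out

# A 3-command entry counts as 3 toward the budget instead of 1:
plan = restore(["CREATE TABLE t(i int);",
                "UPDATE a SET x=1; UPDATE a SET y=2; UPDATE a SET z=3;"],
               transaction_size=3)
print(plan.count("COMMIT"))   # 2 -- the multi-command entry forced an early commit
```

The appeal of the heuristic is that it needs no SQL parsing at all, only a character scan, at the cost of occasionally committing earlier than strictly necessary.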
I wrote: > Alexander Korotkov <aekorotkov@gmail.com> writes: >> J4F, I have an idea to count the number of ';' signs and use it for >> the transaction size counter, since it is an upper bound estimate of >> the number of SQL commands :-) > Hmm ... that's not a completely silly idea. Let's keep it in > the back pocket in case we can't easily reduce the number of > SQL commands in some cases. After poking at this for awhile, we can fix Justin's example case by avoiding repeated UPDATEs on pg_attribute, so I think we should do that. It seems clearly a win, with no downside other than a small increment of complexity in pg_dump. However, that's probably not sufficient to mark this issue as closed. It seems likely that there are other patterns that would cause backend memory bloat. One case that I found is tables with a lot of inherited constraints (not partitions, but old-style inheritance). For example, load the output of this Perl script into a database:

-----
for (my $i = 0; $i < 100; $i++)
{
    print "CREATE TABLE test_inh_check$i (\n";
    for (my $j = 0; $j < 1000; $j++)
    {
        print "a$j float check (a$j > 10.2),\n";
    }
    print "b float);\n";
    print "CREATE TABLE test_inh_check_child$i() INHERITS(test_inh_check$i);\n";
}
-----

pg_dump is horrendously slow on this, thanks to O(N^2) behavior in ruleutils.c, and pg_upgrade is worse --- and leaks memory too in HEAD/v17. The slowness was there before, so I think the lack of field complaints indicates that this isn't a real-world use case. Still, it's bad if pg_upgrade fails when it would not have before, and there may be other similar issues. So I'm forced to the conclusion that we'd better make the transaction size adaptive as per Alexander's suggestion. In addition to the patches attached, I experimented with making dumpTableSchema fold all the ALTER TABLE commands for a single table into one command. That's do-able without too much effort, but I'm now convinced that we shouldn't.
It would break the semicolon-counting hack for detecting that tables like these involve extra work. I'm also not very confident that the backend won't have trouble with ALTER TABLE commands containing hundreds of subcommands. That's something we ought to work on probably, but it's not a project that I want to condition v17 pg_upgrade's stability on. Anyway, proposed patches attached. 0001 is some trivial cleanup that I noticed while working on the failed single-ALTER-TABLE idea. 0002 merges the catalog-UPDATE commands that dumpTableSchema issues, and 0003 is Alexander's suggestion. regards, tom lane From 34ebed72e9029f690e5d3f0cb7464670e83caa5c Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Sun, 28 Jul 2024 13:02:27 -0400 Subject: [PATCH v1 1/3] Fix missing ONLY in one dumpTableSchema command. The various ALTER [FOREIGN] TABLE commands issued by dumpTableSchema all use ONLY, except for this one. I think it's not a live bug because we don't permit foreign tables to have children, but it's still inconsistent. I happened across this while refactoring dumpTableSchema to merge all its ALTERs into one; while I'm not certain we should actually do that, this seems like good cleanup. 
--- src/bin/pg_dump/pg_dump.c | 2 +- src/bin/pg_dump/t/002_pg_dump.pl | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index b8b1888bd3..7cd6a7fb97 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -16344,7 +16344,7 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo) if (tbinfo->relkind == RELKIND_FOREIGN_TABLE && tbinfo->attfdwoptions[j][0] != '\0') appendPQExpBuffer(q, - "ALTER FOREIGN TABLE %s ALTER COLUMN %s OPTIONS (\n" + "ALTER FOREIGN TABLE ONLY %s ALTER COLUMN %s OPTIONS (\n" " %s\n" ");\n", qualrelname, diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl index d3dd8784d6..5bcc2244d5 100644 --- a/src/bin/pg_dump/t/002_pg_dump.pl +++ b/src/bin/pg_dump/t/002_pg_dump.pl @@ -1154,7 +1154,7 @@ my %tests = ( 'ALTER FOREIGN TABLE foreign_table ALTER COLUMN c1 OPTIONS' => { regexp => qr/^ - \QALTER FOREIGN TABLE dump_test.foreign_table ALTER COLUMN c1 OPTIONS (\E\n + \QALTER FOREIGN TABLE ONLY dump_test.foreign_table ALTER COLUMN c1 OPTIONS (\E\n \s+\Qcolumn_name 'col1'\E\n \Q);\E\n /xm, -- 2.43.5 From c8f0d0252e292f276fe9631ae31e6aea11d294d2 Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Sun, 28 Jul 2024 16:19:48 -0400 Subject: [PATCH v1 2/3] Reduce number of commands dumpTableSchema emits for binary upgrade. Avoid issuing a separate SQL UPDATE command for each column when directly manipulating pg_attribute contents in binary upgrade mode. With the separate updates, we triggered a relcache invalidation with each update. For a table with N columns, that causes O(N^2) relcache bloat in txn_size mode because the table's newly-created relcache entry can't be flushed till end of transaction. Reducing the number of commands is marginally faster as well as avoiding that problem. While at it, likewise avoid issuing a separate UPDATE on pg_constraint for each inherited constraint. 
This is less exciting, first because inherited (non-partitioned) constraints are relatively rare, and second because the backend has a good deal of trouble anyway with restoring tables containing many such constraints, due to MergeConstraintsIntoExisting being horribly inefficient. But it seems more consistent to do it this way here too, and it surely can't hurt. Per report from Justin Pryzby. Back-patch to v17 where txn_size mode was introduced. Discussion: https://postgr.es/m/ZqEND4ZcTDBmcv31@pryzbyj2023 --- src/bin/pg_dump/pg_dump.c | 132 ++++++++++++++++++++++++++++---------- 1 file changed, 97 insertions(+), 35 deletions(-) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 7cd6a7fb97..2b02148559 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -15670,6 +15670,7 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo) DumpOptions *dopt = fout->dopt; PQExpBuffer q = createPQExpBuffer(); PQExpBuffer delq = createPQExpBuffer(); + PQExpBuffer extra = createPQExpBuffer(); char *qrelname; char *qualrelname; int numParents; @@ -15736,7 +15737,7 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo) char *partkeydef = NULL; char *ftoptions = NULL; char *srvname = NULL; - char *foreign = ""; + const char *foreign = ""; /* * Set reltypename, and collect any relkind-specific data that we @@ -16094,51 +16095,98 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo) tbinfo->relkind == RELKIND_FOREIGN_TABLE || tbinfo->relkind == RELKIND_PARTITIONED_TABLE)) { + bool firstitem; + + /* + * Drop any dropped columns. Merge the pg_attribute manipulations + * into a single SQL command, so that we don't cause repeated + * relcache flushes on the target table. Otherwise we risk O(N^2) + * relcache bloat while dropping N columns. 
+ */ + resetPQExpBuffer(extra); + firstitem = true; for (j = 0; j < tbinfo->numatts; j++) { if (tbinfo->attisdropped[j]) { - appendPQExpBufferStr(q, "\n-- For binary upgrade, recreate dropped column.\n"); - appendPQExpBuffer(q, "UPDATE pg_catalog.pg_attribute\n" - "SET attlen = %d, " - "attalign = '%c', attbyval = false\n" - "WHERE attname = ", + if (firstitem) + { + appendPQExpBufferStr(q, "\n-- For binary upgrade, recreate dropped columns.\n" + "UPDATE pg_catalog.pg_attribute\n" + "SET attlen = v.dlen, " + "attalign = v.dalign, " + "attbyval = false\n" + "FROM (VALUES "); + firstitem = false; + } + else + appendPQExpBufferStr(q, ",\n "); + appendPQExpBufferChar(q, '('); + appendStringLiteralAH(q, tbinfo->attnames[j], fout); + appendPQExpBuffer(q, ", %d, '%c')", tbinfo->attlen[j], tbinfo->attalign[j]); - appendStringLiteralAH(q, tbinfo->attnames[j], fout); - appendPQExpBufferStr(q, "\n AND attrelid = "); - appendStringLiteralAH(q, qualrelname, fout); - appendPQExpBufferStr(q, "::pg_catalog.regclass;\n"); - - if (tbinfo->relkind == RELKIND_RELATION || - tbinfo->relkind == RELKIND_PARTITIONED_TABLE) - appendPQExpBuffer(q, "ALTER TABLE ONLY %s ", - qualrelname); - else - appendPQExpBuffer(q, "ALTER FOREIGN TABLE ONLY %s ", - qualrelname); - appendPQExpBuffer(q, "DROP COLUMN %s;\n", + /* The ALTER ... DROP COLUMN commands must come after */ + appendPQExpBuffer(extra, "ALTER %sTABLE ONLY %s ", + foreign, qualrelname); + appendPQExpBuffer(extra, "DROP COLUMN %s;\n", fmtId(tbinfo->attnames[j])); } - else if (!tbinfo->attislocal[j]) + } + if (!firstitem) + { + appendPQExpBufferStr(q, ") v(dname, dlen, dalign)\n" + "WHERE attrelid = "); + appendStringLiteralAH(q, qualrelname, fout); + appendPQExpBufferStr(q, "::pg_catalog.regclass\n" + " AND attname = v.dname;\n"); + /* Now we can issue the actual DROP COLUMN commands */ + appendBinaryPQExpBuffer(q, extra->data, extra->len); + } + + /* + * Fix up inherited columns. 
As above, do the pg_attribute + * manipulations in a single SQL command. + */ + firstitem = true; + for (j = 0; j < tbinfo->numatts; j++) + { + if (!tbinfo->attisdropped[j] && + !tbinfo->attislocal[j]) { - appendPQExpBufferStr(q, "\n-- For binary upgrade, recreate inherited column.\n"); - appendPQExpBufferStr(q, "UPDATE pg_catalog.pg_attribute\n" - "SET attislocal = false\n" - "WHERE attname = "); + if (firstitem) + { + appendPQExpBufferStr(q, "\n-- For binary upgrade, recreate inherited columns.\n"); + appendPQExpBufferStr(q, "UPDATE pg_catalog.pg_attribute\n" + "SET attislocal = false\n" + "WHERE attrelid = "); + appendStringLiteralAH(q, qualrelname, fout); + appendPQExpBufferStr(q, "::pg_catalog.regclass\n" + " AND attname IN ("); + firstitem = false; + } + else + appendPQExpBufferStr(q, ", "); appendStringLiteralAH(q, tbinfo->attnames[j], fout); - appendPQExpBufferStr(q, "\n AND attrelid = "); - appendStringLiteralAH(q, qualrelname, fout); - appendPQExpBufferStr(q, "::pg_catalog.regclass;\n"); } } + if (!firstitem) + appendPQExpBufferStr(q, ");\n"); /* * Add inherited CHECK constraints, if any. * * For partitions, they were already dumped, and conislocal * doesn't need fixing. + * + * As above, issue only one direct manipulation of pg_constraint. + * Although it is tempting to merge the ALTER ADD CONSTRAINT + * commands into one as well, refrain for now due to concern about + * possible backend memory bloat if there are many such + * constraints. 
*/ + resetPQExpBuffer(extra); + firstitem = true; for (k = 0; k < tbinfo->ncheck; k++) { ConstraintInfo *constr = &(tbinfo->checkexprs[k]); @@ -16146,18 +16194,31 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo) if (constr->separate || constr->conislocal || tbinfo->ispartition) continue; - appendPQExpBufferStr(q, "\n-- For binary upgrade, set up inherited constraint.\n"); + if (firstitem) + appendPQExpBufferStr(q, "\n-- For binary upgrade, set up inherited constraints.\n"); appendPQExpBuffer(q, "ALTER %sTABLE ONLY %s ADD CONSTRAINT %s %s;\n", foreign, qualrelname, fmtId(constr->dobj.name), constr->condef); - appendPQExpBufferStr(q, "UPDATE pg_catalog.pg_constraint\n" - "SET conislocal = false\n" - "WHERE contype = 'c' AND conname = "); - appendStringLiteralAH(q, constr->dobj.name, fout); - appendPQExpBufferStr(q, "\n AND conrelid = "); - appendStringLiteralAH(q, qualrelname, fout); - appendPQExpBufferStr(q, "::pg_catalog.regclass;\n"); + /* Update pg_constraint after all the ALTER TABLEs */ + if (firstitem) + { + appendPQExpBufferStr(extra, "UPDATE pg_catalog.pg_constraint\n" + "SET conislocal = false\n" + "WHERE contype = 'c' AND conrelid = "); + appendStringLiteralAH(extra, qualrelname, fout); + appendPQExpBufferStr(extra, "::pg_catalog.regclass\n"); + appendPQExpBufferStr(extra, " AND conname IN ("); + firstitem = false; + } + else + appendPQExpBufferStr(extra, ", "); + appendStringLiteralAH(extra, constr->dobj.name, fout); + } + if (!firstitem) + { + appendPQExpBufferStr(extra, ");\n"); + appendBinaryPQExpBuffer(q, extra->data, extra->len); } if (numParents > 0 && !tbinfo->ispartition) @@ -16445,6 +16506,7 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo) destroyPQExpBuffer(q); destroyPQExpBuffer(delq); + destroyPQExpBuffer(extra); free(qrelname); free(qualrelname); } -- 2.43.5 From 7cfea69d3f5df4c15681f4902a86b366f0ed4292 Mon Sep 17 00:00:00 2001 From: Tom Lane <tgl@sss.pgh.pa.us> Date: Sun, 28 Jul 2024 16:40:35 -0400 Subject: [PATCH v1 3/3] 
Count individual SQL commands in pg_restore's --transaction-size mode. The initial implementation counted one action per TOC entry (except for some special cases for multi-blob BLOBS entries). This assumes that TOC entries are all about equally complex, but it turns out that that assumption doesn't hold up very well in binary-upgrade mode. For example, even after the previous patch I was able to cause backend bloat with tables having many inherited constraints. To fix, count multi-command TOC entries as N actions, allowing the transaction size to be scaled down when we hit a complex TOC entry. Rather than add a SQL parser to pg_restore, approximate "multi command" by counting semicolons in the TOC entry's defn string. This will be fooled by semicolons appearing in string literals --- but the error is in the conservative direction, so it doesn't seem worth working harder. The biggest risk is with function/procedure TOC entries, but we can just explicitly skip those. (This is undoubtedly a hack, and maybe someday we'll be able to revert it after fixing the backend's bloat issues or rethinking what pg_dump emits in binary upgrade mode. But that surely isn't a project for v17.) Thanks to Alexander Korotkov for the let's-count-semicolons idea. Per report from Justin Pryzby. Back-patch to v17 where txn_size mode was introduced. 
Discussion: https://postgr.es/m/ZqEND4ZcTDBmcv31@pryzbyj2023 --- src/bin/pg_dump/pg_backup_archiver.c | 28 +++++++++++++++++++++++++--- 1 file changed, 25 insertions(+), 3 deletions(-) diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c index 68e321212d..8c20c263c4 100644 --- a/src/bin/pg_dump/pg_backup_archiver.c +++ b/src/bin/pg_dump/pg_backup_archiver.c @@ -3827,10 +3827,32 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData) { IssueACLPerBlob(AH, te); } - else + else if (te->defn && strlen(te->defn) > 0) { - if (te->defn && strlen(te->defn) > 0) - ahprintf(AH, "%s\n\n", te->defn); + ahprintf(AH, "%s\n\n", te->defn); + + /* + * If the defn string contains multiple SQL commands, txn_size mode + * should count it as N actions not one. But rather than build a full + * SQL parser, approximate this by counting semicolons. One case + * where that tends to be badly fooled is function definitions, so + * ignore them. (restore_toc_entry will count one action anyway.) + */ + if (ropt->txn_size > 0 && + strcmp(te->desc, "FUNCTION") != 0 && + strcmp(te->desc, "PROCEDURE") != 0) + { + const char *p = te->defn; + int nsemis = 0; + + while ((p = strchr(p, ';')) != NULL) + { + nsemis++; + p++; + } + if (nsemis > 1) + AH->txnCount += nsemis - 1; + } } /* -- 2.43.5
On Mon, Jul 29, 2024 at 12:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > So I'm forced to the conclusion that we'd better make the transaction > size adaptive as per Alexander's suggestion. > > In addition to the patches attached, I experimented with making > dumpTableSchema fold all the ALTER TABLE commands for a single table > into one command. That's do-able without too much effort, but I'm now > convinced that we shouldn't. It would break the semicolon-counting > hack for detecting that tables like these involve extra work. > I'm also not very confident that the backend won't have trouble with > ALTER TABLE commands containing hundreds of subcommands. That's > something we ought to work on probably, but it's not a project that > I want to condition v17 pg_upgrade's stability on. > > Anyway, proposed patches attached. 0001 is some trivial cleanup > that I noticed while working on the failed single-ALTER-TABLE idea. > 0002 merges the catalog-UPDATE commands that dumpTableSchema issues, > and 0003 is Alexander's suggestion. Nice to see you picked up my idea. I took a look over the patchset. Looks good to me. ------ Regards, Alexander Korotkov Supabase