Thread: Horribly slow pg_upgrade performance with many Large Objects
Hi Hackers

## The issue

I have now met a not insignificant number of cases where pg_upgrade
performance is really bad when the database has a large number of Large
Objects.

The average time to `pg_dump --binary-upgrade --format=custom ...` a database
and then `pg_restore ...` it back is 1 minute per 1 million LOs. This does not
seem too bad until you realize that it is

- 100 minutes, or 1h40min, for 100M LOs
- 16h40min for 1 billion large objects
- 66h40min, or 2 days 18 hours and 40 min, for those unfortunate enough to
  have the full 4 billion LOs

My direct tests took 10m5s for 10M LOs and 100m54s for 100M. Also, a smaller
server just runs out of shared memory when restoring 100M LOs.

I know that our semi-official recommendation for using Large Objects is
"Don't!", but people still use them and sometimes they have a lot of them. One
of the reasons seems to be an older version of Hibernate which converted any
column of type CLOB into a Large Object, so we see cases where the average LO
size is a few hundred to a few thousand bytes.

I read through the old discussions around the time of the v12 release, and the
consensus at the time seemed to be that dumping the LOs without data was OK
"because there will not be too many LOs". That has unfortunately turned out to
be wishful thinking: there are lots of Tiny Large Objects out there, and these
really slow down pg_upgrade.

## So what to do about this?

The obvious solution would be to handle the table `pg_largeobject_metadata`
the same way as we currently handle `pg_largeobject`: do nothing with it in
`pg_dump --binary-upgrade` and handle its contents like we do for user tables
in pg_upgrade itself.

This should work fine for all source database versions starting from PgSQL
v12. For the older supported versions 9.2 to 11, where `oid` was a system
column, we should just dump the three fields of `pg_largeobject_metadata`
directly, so that the dump command would be `COPY (SELECT oid, lomowner,
lomacl) TO stdout`, and then the restore would happen more or less
automatically (see the PS at the end of this mail for a rough sketch).

Or we could just modify `pg_restore` so that it restores LOs using COPY
instead of doing the current dance of 3 separate SELECTs from lo_xxx()
functions.

`pg_dump --binary-upgrade --format=custom` is relatively fast for LOs - 1m58s
for 100M objects - though still much slower than `pg_dump --data-only -t
pg_largeobject_metadata`, which is 20 sec, or 23 when also run through gzip.

## Preferred solution for --binary-upgrade

- link/copy data files when source is v12+
- dump the triplet of pg_largeobject_metadata columns if source is v9.2 - v11

An acceptable faster solution would be to just always dump
pg_largeobject_metadata (oid, lomowner, lomacl), as this would not need any
changes to `pg_upgrade`, just to `pg_dump --binary-upgrade`. It would not be
as fast as linking/copying, but it would be several orders of magnitude faster
than the current behaviour.

Does anyone have any objections to this?

Would anyone be willing to take up fixing this? Or perhaps advise where in the
code the change should be done? I started looking at this, and at a high level
it looks like I need to create a task to copy out specific data columns rather
than the structure of pg_largeobject_metadata, but as pg_dump.c is 582,673
bytes it would take some time to understand where exactly to do this and all
the possible interactions with flags like --binary-upgrade and specific
formats.

---
Cheers,
Hannu
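PS: For illustration only, a rough sketch of the dump/restore pair I have in
mind (the file path is made up for the example; a real implementation would
stream this through the pg_dump archive instead):

-- on the old cluster: dump just the three metadata columns; selecting oid
-- explicitly also works on 9.2 - 11, where oid is a system column
COPY (SELECT oid, lomowner, lomacl FROM pg_catalog.pg_largeobject_metadata)
    TO '/tmp/lo_metadata.copy';

-- on the new cluster, as superuser during the binary upgrade: load them back
COPY pg_catalog.pg_largeobject_metadata (oid, lomowner, lomacl)
    FROM '/tmp/lo_metadata.copy';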
Hannu Krosing <hannuk@google.com> writes:
> I have now met a not insignificant number of cases where pg_upgrade
> performance is really bad when the database has a large number of
> Large Objects.

What version are you testing?  We did some work in that area in the
v17 cycle (a45c78e32).

            regards, tom lane
On Mon, Apr 07, 2025 at 10:33:47PM +0200, Hannu Krosing wrote:
> The obvious solution would be to handle the table
> `pg_largeobject_metadata` the same way as we currently handle
> `pg_largeobject` by not doing anything with it in `pg_dump
> --binary-upgrade` and just handle the contents like we do for user
> tables in pg_upgrade itself.
>
> This should work fine for all source database versions starting from PgSQL v12.

Unfortunately, the storage format for aclitem changed in v16, so this would
need to be restricted to upgrades from v16 and newer.  That being said, I
regularly hear about slow upgrades with many LOs, so I think it'd be
worthwhile to try to improve matters in v19.

-- 
nathan
On Mon, Apr 07, 2025 at 05:25:32PM -0400, Tom Lane wrote:
> What version are you testing?  We did some work in that area in the
> v17 cycle (a45c78e32).

I am puzzled by the target version used here, as well.  If there is
more that can be improved, v19 would be the version to consider for
future improvements at this stage.

--
Michael
I was testing on version 17

On Tue, Apr 8, 2025 at 6:52 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Apr 07, 2025 at 05:25:32PM -0400, Tom Lane wrote:
> > What version are you testing?  We did some work in that area in the
> > v17 cycle (a45c78e32).
>
> I am puzzled by the target version used here, as well.

I was testing on version 17

Here is how you can easily test too (as --binary-upgrade does not dump
the actual data, it is OK for the test to not put anything there)

hannuk@db01-c1a:~/work/lo-testing$ createdb -p 5433 lodb
hannuk@db01-c1a:~/work/lo-testing$ psql -p 5433 lodb
psql (17.4 (Ubuntu 17.4-1.pgdg22.04+2))
Type "help" for help.

lodb=# insert into pg_largeobject_metadata(oid, lomowner) SELECT i,
16384 FROM generate_series(1, 100_000_000) g(i);
INSERT 0 100000000
Time: 162414.216 ms (02:42.414)
lodb=# \q

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --data-only -t
pg_largeobject_metadata -p 5433 lodb | gzip > pg_largeobject_metadata.data.gz
real 0m22.094s
user 0m20.741s
sys 0m2.085s

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --data-only -t
pg_largeobject_metadata --format=custom -p 5433 lodb -f
pg_largeobject_metadata.dump
real 0m20.226s
user 0m18.068s
sys 0m0.824s

> If there is
> more that can be improved, v19 would be the version to consider for
> future improvements at this stage.

If the internal format has changed in v16, the correct way would be to
go through the data-only dump of pg_largeobject_metadata in all cases.
Even for the 100M case you get the restore in 2 minutes instead of
100 minutes

hannuk@db01-c1a:~/work/lo-testing$ createdb -p 5434 lodb
hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb
pg_largeobject_metadata.dump
real 2m2.277s
user 0m2.594s
sys 0m0.549s

And even in the case of the user-visible change in acl format, it is
most likely that changing the visible format using some regexp magic,
or even a dedicated function, would still be much faster than creating
all the LOs through creation commands.

------
The commands I used to do the pg_upgrade-like test were

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
--quote-all-identifiers --binary-upgrade --format=custom
--file=lodb100m.dump -p 5433 lodb
real 1m58.241s
user 0m35.229s
sys 0m17.854s

hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb lodb100m.dump
real 100m54.878s
user 3m23.885s
sys 20m33.761s

(I left out the --verbose part that pg_upgrade also sets, as I did not
want to get 100M lines of "large object created" messages)

Also, the postgres server at -p 5434 needs to be started with the -b flag
to accept loading a dump made with --binary-upgrade.  In Debian/Ubuntu
this can be passed directly to pg_ctlcluster as follows

sudo pg_ctlcluster 17 target -o -b

----
Hannu

> --
> Michael
The order looked a bit illogical on re-reading, so I want to make clear
that the pg_upgrade-like test showing 100 min for 100 million LOs is at
the end of the last message and the proposed solution is at the beginning

On Tue, Apr 8, 2025 at 9:15 AM Hannu Krosing <hannuk@google.com> wrote:
>
> I was testing on version 17
>
> On Tue, Apr 8, 2025 at 6:52 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Mon, Apr 07, 2025 at 05:25:32PM -0400, Tom Lane wrote:
> > > What version are you testing?  We did some work in that area in the
> > > v17 cycle (a45c78e32).
> >
> > I am puzzled by the target version used here, as well.
>
> I was testing on version 17
>
> Here is how you can easily test too (as --binary-upgrade does not dump
> the actual data, it is OK for the test to not put anything there)
>
> hannuk@db01-c1a:~/work/lo-testing$ createdb -p 5433 lodb
> hannuk@db01-c1a:~/work/lo-testing$ psql -p 5433 lodb
> psql (17.4 (Ubuntu 17.4-1.pgdg22.04+2))
> Type "help" for help.
>
> lodb=# insert into pg_largeobject_metadata(oid, lomowner) SELECT i,
> 16384 FROM generate_series(1, 100_000_000) g(i);
> INSERT 0 100000000
> Time: 162414.216 ms (02:42.414)
> lodb=# \q
>
> hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --data-only -t
> pg_largeobject_metadata -p 5433 lodb | gzip > pg_largeobject_metadata.data.gz
> real 0m22.094s
> user 0m20.741s
> sys 0m2.085s
>
> hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --data-only -t
> pg_largeobject_metadata --format=custom -p 5433 lodb -f
> pg_largeobject_metadata.dump
> real 0m20.226s
> user 0m18.068s
> sys 0m0.824s
>
> > If there is
> > more that can be improved, v19 would be the version to consider for
> > future improvements at this stage.
>
> If the internal format has changed in v16, the correct way would be to
> go through the data-only dump of pg_largeobject_metadata in all cases.
> Even for the 100M case you get the restore in 2 minutes instead of
> 100 minutes
>
> hannuk@db01-c1a:~/work/lo-testing$ createdb -p 5434 lodb
> hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
> --exit-on-error --transaction-size=1000 --dbname lodb
> pg_largeobject_metadata.dump
> real 2m2.277s
> user 0m2.594s
> sys 0m0.549s
>
> And even in the case of the user-visible change in acl format, it is
> most likely that changing the visible format using some regexp magic,
> or even a dedicated function, would still be much faster than creating
> all the LOs through creation commands.
>
> ------
> The commands I used to do the pg_upgrade-like test were
>
> hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
> --quote-all-identifiers --binary-upgrade --format=custom
> --file=lodb100m.dump -p 5433 lodb
> real 1m58.241s
> user 0m35.229s
> sys 0m17.854s
>
> hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
> --exit-on-error --transaction-size=1000 --dbname lodb lodb100m.dump
> real 100m54.878s
> user 3m23.885s
> sys 20m33.761s
>
> (I left out the --verbose part that pg_upgrade also sets, as I did not
> want to get 100M lines of "large object created" messages)
>
> Also, the postgres server at -p 5434 needs to be started with the -b flag
> to accept loading a dump made with --binary-upgrade.  In Debian/Ubuntu
> this can be passed directly to pg_ctlcluster as follows
>
> sudo pg_ctlcluster 17 target -o -b
>
> ----
> Hannu
>
> > --
> > Michael
On Tue, Apr 8, 2025 at 12:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Apr 07, 2025 at 10:33:47PM +0200, Hannu Krosing wrote:
> > The obvious solution would be to handle the table
> > `pg_largeobject_metadata` the same way as we currently handle
> > `pg_largeobject` by not doing anything with it in `pg_dump
> > --binary-upgrade` and just handle the contents like we do for user
> > tables in pg_upgrade itself.
> >
> > This should work fine for all source database versions starting from PgSQL v12.
>
> Unfortunately, the storage format for aclitem changed in v16, so this would
> need to be restricted to upgrades from v16 and newer.

Have we also changed the external format of aclitem any time since v9.2,
or are the changes just to storage?

If the external format has been stable, we can still get reasonable
performance by dumping the data (2 min for 100M rows).  Plus, dumping
the data would work for all the supported source versions.

The worst case would still be quite bad, with 80+ min for the full set
of 4 billion LOs, but even that would be much better than the 3 days
with the current way.

> That being said, I
> regularly hear about slow upgrades with many LOs, so I think it'd be
> worthwhile to try to improve matters in v19.

Changing the LO export to dumping pg_largeobject_metadata content
instead of creating the LOs should be a nice small change confined to
pg_dump --binary-upgrade only, so perhaps we could squeeze it into v18
still.

--
Hannu
On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:
> On Tue, Apr 8, 2025 at 12:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> That being said, I
>> regularly hear about slow upgrades with many LOs, so I think it'd be
>> worthwhile to try to improve matters in v19.
>
> Changing the LO export to dumping pg_largeobject_metadata content
> instead of creating the LOs should be a nice small change confined to
> pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
> still.

Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
material at this point.

-- 
nathan
On Tue, Apr 8, 2025 at 5:46 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:
> > On Tue, Apr 8, 2025 at 12:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> That being said, I
> >> regularly hear about slow upgrades with many LOs, so I think it'd be
> >> worthwhile to try to improve matters in v19.
> >
> > Changing the LO export to dumping pg_largeobject_metadata content
> > instead of creating the LOs should be a nice small change confined to
> > pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
> > still.
>
> Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
> material at this point.

Sure.

But I actually think this is something that should be back-ported to at
least all supported versions at some point, possibly made dependent on
some environment flag so that only people who desperately need it will
get it.

Btw, who would be the right person(s) to ask questions about the
internals of pg_dump?  I have a few more things in the pipeline to add
there and would like to make sure that I have the right approach.

------
Hannu
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:
>> Changing the LO export to dumping pg_largeobject_metadata content
>> instead of creating the LOs should be a nice small change confined to
>> pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
>> still.

> Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
> material at this point.

Yeah, even if we had a patch in hand, it's too late for v18.  However,
there are additional problems with this idea:

1. The idea requires role OIDs to match across the upgrade.
I don't believe that pg_upgrade tries to preserve role OIDs --- and
doing so would be problematic, because what if the new cluster's
bootstrap superuser is named differently in the old and new clusters?

It might be possible to work around that with some casting to/from
regrole, but I don't think a simple COPY into pg_largeobject_metadata
will play along with that.

2. If you just do the equivalent of an INSERT or COPY into
pg_largeobject_metadata, you could create entries that look right,
but they are actually not right because there should be pg_shdepend
entries backing each ownership or permission reference (for non-pinned
roles) and there won't be.

I guess you could think of also manually inserting rows into
pg_shdepend, but (a) ugh and (b) the claimed speedup is kind
of vanishing into the distance at this point.

            regards, tom lane
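PS: for illustration, the regrole round-trip would have to be something
roughly like dumping the owner by name,

    COPY (SELECT oid, lomowner::regrole, lomacl FROM pg_largeobject_metadata)
        TO STDOUT;

and then resolving the name back to an OID on the target, along the lines of

    INSERT INTO pg_largeobject_metadata (oid, lomowner, lomacl)
    SELECT oid, owner_name::regrole::oid, lomacl FROM staging_table;

(owner_name and staging_table are just placeholders here), which is exactly
why a plain COPY straight into the catalog doesn't play along with it.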
This is what the opening comment in pg_upgrade says; I think we do
preserve role oids

/*
 * To simplify the upgrade process, we force certain system values to be
 * identical between old and new clusters:
 *
 * We control all assignments of pg_class.oid (and relfilenode) so toast
 * oids are the same between old and new clusters.  This is important
 * because toast oids are stored as toast pointers in user tables.
 *
 * While pg_class.oid and pg_class.relfilenode are initially the same in a
 * cluster, they can diverge due to CLUSTER, REINDEX, or VACUUM FULL. We
 * control assignments of pg_class.relfilenode because we want the filenames
 * to match between the old and new cluster.
 *
 * We control assignment of pg_tablespace.oid because we want the oid to match
 * between the old and new cluster.
 *
 * We control all assignments of pg_type.oid because these oids are stored
 * in user composite type values.
 *
 * We control all assignments of pg_enum.oid because these oids are stored
 * in user tables as enum values.
 *
 * We control all assignments of pg_authid.oid for historical reasons (the
 * oids used to be stored in pg_largeobject_metadata, which is now copied via
 * SQL commands), that might change at some point in the future.
 */

On Tue, Apr 8, 2025 at 6:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Nathan Bossart <nathandbossart@gmail.com> writes:
> > On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:
> >> Changing the LO export to dumping pg_largeobject_metadata content
> >> instead of creating the LOs should be a nice small change confined to
> >> pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
> >> still.
>
> > Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
> > material at this point.
>
> Yeah, even if we had a patch in hand, it's too late for v18.  However,
> there are additional problems with this idea:
>
> 1. The idea requires role OIDs to match across the upgrade.
> I don't believe that pg_upgrade tries to preserve role OIDs --- and
> doing so would be problematic, because what if the new cluster's
> bootstrap superuser is named differently in the old and new clusters?
>
> It might be possible to work around that with some casting to/from
> regrole, but I don't think a simple COPY into pg_largeobject_metadata
> will play along with that.
>
> 2. If you just do the equivalent of an INSERT or COPY into
> pg_largeobject_metadata, you could create entries that look right,
> but they are actually not right because there should be pg_shdepend
> entries backing each ownership or permission reference (for non-pinned
> roles) and there won't be.
>
> I guess you could think of also manually inserting rows into
> pg_shdepend, but (a) ugh and (b) the claimed speedup is kind
> of vanishing into the distance at this point.
>
>             regards, tom lane
Hannu Krosing <hannuk@google.com> writes:
> I think we do preserve role oids

Oh ... I'd been looking for mentions of "role" in
pg_upgrade_support.c, but what I should have looked for was
"pg_authid".  So yeah, we do preserve role OIDs, and maybe that's
enough to make this workable, at least with source versions that
share the same rules for what goes into pg_largeobject_metadata and
pg_shdepend.  It's not something I'd risk back-patching though.

            regards, tom lane
On Tue, Apr 08, 2025 at 12:37:43PM -0400, Tom Lane wrote:
> Hannu Krosing <hannuk@google.com> writes:
>> I think we do preserve role oids
>
> Oh ... I'd been looking for mentions of "role" in
> pg_upgrade_support.c, but what I should have looked for was
> "pg_authid".  So yeah, we do preserve role OIDs, and maybe that's
> enough to make this workable, at least with source versions that
> share the same rules for what goes into pg_largeobject_metadata and
> pg_shdepend.  It's not something I'd risk back-patching though.

I do think it's worth considering going back to copying
pg_largeobject_metadata's files for upgrades from v16 and newer.  That
sounds restrictive at the moment, but it'll mean that all but one
supported major version can copy the files during upgrade to v19.  I'll
admit I'm a tad worried about having to go back to copying via SQL
commands in the future and re-regressing things (leading to unpredictable
differences in upgrade downtime), but I'm not sure that's a great reason
to withhold this optimization.  Of course, I wouldn't be opposed to
optimizing the SQL command strategy, too, but I suspect that won't compare
to copying the files.

-- 
nathan
On Tue, Apr 8, 2025 at 6:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Hannu Krosing <hannuk@google.com> writes:
> > I think we do preserve role oids
>
> Oh ... I'd been looking for mentions of "role" in
> pg_upgrade_support.c, but what I should have looked for was
> "pg_authid".  So yeah, we do preserve role OIDs, and maybe that's
> enough to make this workable, at least with source versions that
> share the same rules for what goes into pg_largeobject_metadata and
> pg_shdepend.  It's not something I'd risk back-patching though.

The risk is why I suggest having this in backports as a last resort,
gated on an environment variable.

I have been forced to do this half-manually a few times to make
pg_upgrade feasible at all.  The current, really UGH, workaround is to
use pg_repack's support for swapping relfilenodes to swap
pg_largeobject_metadata with a user table before and after the upgrade
and then manually patch up leftovers like pg_shdepend after pg_upgrade.

These have fortunately been cases where there were no other dependencies
like comments or security labels on LOs, but those too should be doable.

> regards, tom lane
Nathan Bossart <nathandbossart@gmail.com> writes:
> I do think it's worth considering going back to copying
> pg_largeobject_metadata's files for upgrades from v16 and newer.

(If we do this) I don't see why we'd need to stop at v16.  I'm
envisioning that we'd use COPY, which will be dealing in the
text representation of aclitems, and I don't think that's changed
in a long time.  The sort of thing that would break it is changes
in the set of available/default privilege bits for large objects.

That is, where the dump currently contains something like

SELECT pg_catalog.lo_create('2121');
ALTER LARGE OBJECT 2121 OWNER TO postgres;
GRANT ALL ON LARGE OBJECT 2121 TO joe;

we'd have

COPY pg_largeobject_metadata FROM STDIN;
...
2121 10 {postgres=rw/postgres,joe=rw/postgres}
...

and some appropriate COPY data for pg_shdepend too.  (The fact that
this representation will contain both numeric and symbolic role OIDs
is why I was concerned about OID stability.)

The key thing that worries me here is if the source and target versions
have different ideas of which roles are pinned, which would silently
change what appears in pg_shdepend.  But it'd only really break if a
role mentioned in some LO's owner or ACL is pinned in the source and
not in the target, which seems unlikely.  (In the other direction,
we'd just be adding a useless row in pg_shdepend.)

            regards, tom lane
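PS: roughly, the matching pg_shdepend data for that example would be
something like (columns are dbid, classid, objid, objsubid, refclassid,
refobjid, deptype; assuming the database OID is 16384 and joe's role OID
is 16385 for the example; the pinned bootstrap superuser gets no row, so
only joe's ACL reference appears):

COPY pg_shdepend FROM STDIN;
...
16384 2613 2121 0 1260 16385 a
...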
On Tue, Apr 8, 2025 at 7:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Nathan Bossart <nathandbossart@gmail.com> writes:
> > I do think it's worth considering going back to copying
> > pg_largeobject_metadata's files for upgrades from v16 and newer.
>
> (If we do this) I don't see why we'd need to stop at v16.  I'm
> envisioning that we'd use COPY, which will be dealing in the
> text representation of aclitems, and I don't think that's changed
> in a long time.  The sort of thing that would break it is changes
> in the set of available/default privilege bits for large objects.
>
> That is, where the dump currently contains something like
>
> SELECT pg_catalog.lo_create('2121');
> ALTER LARGE OBJECT 2121 OWNER TO postgres;
> GRANT ALL ON LARGE OBJECT 2121 TO joe;

Also note that in my --binary-upgrade tests the 100 min / 100M objects
ratio was for the case with no grants.  I would expect this to grow to
at least 120 to 150 minutes when grants are also involved.

In the COPY case I would expect the presence of grants to not make much
difference.

> we'd have
>
> COPY pg_largeobject_metadata FROM STDIN;
> ...
> 2121 10 {postgres=rw/postgres,joe=rw/postgres}
> ...

--
Hannu
On Tue, Apr 08, 2025 at 01:07:09PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> I do think it's worth considering going back to copying
>> pg_largeobject_metadata's files for upgrades from v16 and newer.
>
> (If we do this) I don't see why we'd need to stop at v16.  I'm
> envisioning that we'd use COPY, which will be dealing in the
> text representation of aclitems, and I don't think that's changed
> in a long time.  The sort of thing that would break it is changes
> in the set of available/default privilege bits for large objects.

I was thinking of actually reverting commit 12a53c7 for upgrades from v16,
which AFAICT is the last release where any relevant storage formats
changed (aclitem changed in v16).  But if COPY gets us pretty close to
that and is less likely to be disrupted by future changes, it could be a
better long-term approach.

> That is, where the dump currently contains something like
>
> SELECT pg_catalog.lo_create('2121');
> ALTER LARGE OBJECT 2121 OWNER TO postgres;
> GRANT ALL ON LARGE OBJECT 2121 TO joe;
>
> we'd have
>
> COPY pg_largeobject_metadata FROM STDIN;
> ...
> 2121 10 {postgres=rw/postgres,joe=rw/postgres}
> ...
>
> and some appropriate COPY data for pg_shdepend too.

Unless I'm missing something, we don't seem to have had any dependency
handling before commit 12a53c7.  Was that broken before we moved to SQL
commands?

-- 
nathan
Hannu Krosing <hannuk@google.com> writes:
> In the COPY case I would expect the presence of grants to not make much
> difference.

aclitemin is slower than a lot of other datatype input functions, but
it's still got to be faster than a GRANT.

            regards, tom lane
Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
does *not* preserve database OIDs, which is problematic for using
COPY to load pg_shdepend rows.

            regards, tom lane
On Tue, Apr 08, 2025 at 01:36:58PM -0400, Tom Lane wrote:
> Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
> does *not* preserve database OIDs, which is problematic for using
> COPY to load pg_shdepend rows.

I think it does; see commit aa01051.

-- 
nathan
Nathan Bossart <nathandbossart@gmail.com> writes:
> Unless I'm missing something, we don't seem to have had any dependency
> handling before commit 12a53c7.  Was that broken before we moved to SQL
> commands?

Sounds like it :-(

            regards, tom lane
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Tue, Apr 08, 2025 at 01:36:58PM -0400, Tom Lane wrote:
>> Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
>> does *not* preserve database OIDs, which is problematic for using
>> COPY to load pg_shdepend rows.

> I think it does; see commit aa01051.

Ah --- I thought I remembered something having been done about that,
but I failed to find it because I was looking in pg_upgrade not
pg_dump.  Too bad aa01051 didn't update the comment at the top of
pg_upgrade.c.

            regards, tom lane
On Tue, Apr 08, 2025 at 01:42:20PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> Unless I'm missing something, we don't seem to have had any dependency
>> handling before commit 12a53c7.  Was that broken before we moved to SQL
>> commands?
>
> Sounds like it :-(

Huh.  Sure enough, it seems to be lost during an upgrade from 9.6 to 10.

v9.6:

postgres=# select lo_from_bytea(1234, '1234');
 lo_from_bytea
---------------
          1234
(1 row)

postgres=# create role bob;
CREATE ROLE
postgres=# grant select on large object 1234 to bob;
GRANT
postgres=# drop role bob;
ERROR:  role "bob" cannot be dropped because some objects depend on it
DETAIL:  privileges for large object 1234

v10 (upgraded from v9.6):

postgres=# select lo_get(1234);
   lo_get
------------
 \x31323334
(1 row)

postgres=# drop role bob;
DROP ROLE

If I then try to upgrade that database to v17, it fails like this:

pg_restore: from TOC entry 2422; 0 0 ACL LARGE OBJECT 1234 nathan
pg_restore: error: could not execute query: ERROR:  role "16384" does not exist
Command was: GRANT SELECT ON LARGE OBJECT 1234 TO "16384";

I've also verified that the dependency information is carried over in
upgrades to later versions (AFAICT all the supported ones).

-- 
nathan
On Tue, Apr 8, 2025 at 8:39 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
...
>
> I've also verified that the dependency information is carried over in
> upgrades to later versions (AFAICT all the supported ones).

If I remember correctly, the change from copying the
pg_largeobject_metadata data file to instead moving LOs as part of the
schema was made in v12, when oid, which had been a system column in
v11, became a user column, so an upgrade to v11 is likely also missing
the dependencies.
On 4/8/25 15:41, Hannu Krosing wrote:
> On Tue, Apr 8, 2025 at 8:39 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>>
> ...
>>
>> I've also verified that the dependency information is carried over in
>> upgrades to later versions (AFAICT all the supported ones).
>
> If I remember correctly, the change from copying the
> pg_largeobject_metadata data file to instead moving LOs as part of the
> schema was made in v12, when oid, which had been a system column in
> v11, became a user column, so an upgrade to v11 is likely also missing
> the dependencies

I remember an incident where large amounts of LOs ran pg_upgrade into a
transaction-ID wraparound, because the restore part would create individual
single-statement transactions per LO to create it, then change permissions
and ownership and finally fill in the data.  Could that be related here?


Regards, Jan
On Tue, Apr 08, 2025 at 05:37:50PM -0400, Jan Wieck wrote:
> I remember an incident where large amounts of LOs ran pg_upgrade into a
> transaction-ID wraparound, because the restore part would create individual
> single-statement transactions per LO to create it, then change permissions
> and ownership and finally fill in the data.  Could that be related here?

I believe commits 74cf7d4 and 959b38d have largely fixed that problem.

-- 
nathan
On Tue, Apr 08, 2025 at 01:51:22PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> On Tue, Apr 08, 2025 at 01:36:58PM -0400, Tom Lane wrote:
>>> Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
>>> does *not* preserve database OIDs, which is problematic for using
>>> COPY to load pg_shdepend rows.
>
>> I think it does; see commit aa01051.
>
> Ah --- I thought I remembered something having been done about that,
> but I failed to find it because I was looking in pg_upgrade not
> pg_dump.  Too bad aa01051 didn't update the comment at the top of
> pg_upgrade.c.

I'll apply the attached patch to fix the comment shortly.

-- 
nathan
On Tue, Apr 08, 2025 at 09:41:06PM +0200, Hannu Krosing wrote:
> On Tue, Apr 8, 2025 at 8:39 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I've also verified that the dependency information is carried over in
>> upgrades to later versions (AFAICT all the supported ones).
>
> If I remember correctly, the change from copying the
> pg_largeobject_metadata data file to instead moving LOs as part of the
> schema was made in v12, when oid, which had been a system column in
> v11, became a user column, so an upgrade to v11 is likely also missing
> the dependencies

Right.

-- 
nathan
On Tue, Apr 08, 2025 at 12:22:00PM -0500, Nathan Bossart wrote:
> On Tue, Apr 08, 2025 at 01:07:09PM -0400, Tom Lane wrote:
>> Nathan Bossart <nathandbossart@gmail.com> writes:
>>> I do think it's worth considering going back to copying
>>> pg_largeobject_metadata's files for upgrades from v16 and newer.
>>
>> (If we do this) I don't see why we'd need to stop at v16.  I'm
>> envisioning that we'd use COPY, which will be dealing in the
>> text representation of aclitems, and I don't think that's changed
>> in a long time.  The sort of thing that would break it is changes
>> in the set of available/default privilege bits for large objects.
>
> I was thinking of actually reverting commit 12a53c7 for upgrades from v16,
> which AFAICT is the last release where any relevant storage formats changed
> (aclitem changed in v16).  But if COPY gets us pretty close to that and is
> less likely to be disrupted by future changes, it could be a better
> long-term approach.
>
>> That is, where the dump currently contains something like
>>
>> SELECT pg_catalog.lo_create('2121');
>> ALTER LARGE OBJECT 2121 OWNER TO postgres;
>> GRANT ALL ON LARGE OBJECT 2121 TO joe;
>>
>> we'd have
>>
>> COPY pg_largeobject_metadata FROM STDIN;
>> ...
>> 2121 10 {postgres=rw/postgres,joe=rw/postgres}
>> ...
>>
>> and some appropriate COPY data for pg_shdepend too.

I did some more research here.  For many large objects without ACLs to
dump, I noticed that the vast majority of time is going to restoring the
ALTER OWNER commands.  For 1 million such large objects, restoring took ~73
seconds on my machine.  If I instead invented an lo_create_with_owner()
function and created 100 per SELECT command, the same restore takes ~7
seconds.  Copying the relevant pg_shdepend rows out and back in takes ~2.5
seconds.  I imagine using COPY for pg_largeobject_metadata would also take
a couple of seconds in this case.

For upgrading, I don't think there's any huge benefit to optimizing the
restore commands versus using COPY.  It might make future catalog changes
for large object stuff easier, but I'd expect those to be rare.  However,
the optimized restore commands could be nice for non-pg_upgrade use-cases.

-- 
nathan
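PS: To be clear, lo_create_with_owner() is a function I invented for this
experiment; it does not exist in core, and the exact call shape below is
just for illustration.  The batched creation idea is roughly

-- hypothetical function that creates the large object and sets its owner
-- in the same pass, batching ~100 LOs per statement
SELECT pg_catalog.lo_create_with_owner(v.loid, v.owner)
FROM (VALUES ('2121'::oid, 'postgres'::regrole),
             ('2122'::oid, 'postgres'::regrole)
             -- ... up to ~100 (loid, owner) pairs
     ) AS v(loid, owner);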
And in case there *is* an ACL present, each user mentioned in the ACL adds
more overhead.

Also, the separate GRANT calls cause bloat, as the pg_largeobject_metadata
row gets updated for each ALTER OWNER or GRANT.

The following is for 10 million LOs with 1 and 3 users being GRANTed SELECT
on each object (with no grants the pg_restore run was 10 minutes)

 Nr of GRANTS | pg_dump time | pg_restore time
--------------+--------------+-----------------
            0 | 0m 10s       | 10m 5s
            1 | 0m 17s       | 15m 3s
            3 | 0m 21s       | 27m 15s

NB! - I left out the --verbose flag from pg_dump as used by pg_upgrade, as
it will emit one line per LO dumped

## 1 GRANT / LO

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
--quote-all-identifiers --binary-upgrade --format=custom
--file=lodb10m.dump -p 5433 lodb10m
real 0m17.022s
user 0m2.956s
sys 0m1.453s

hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb10m lodb10m.dump
real 15m3.136s
user 0m28.991s
sys 2m54.164s

## 3 GRANTs / LO

Make a sample LO with 3 grants

ALTER LARGE OBJECT 1 OWNER TO "hannuk";
GRANT SELECT ON LARGE OBJECT 1 TO "bob";
GRANT SELECT ON LARGE OBJECT 1 TO "joe";
GRANT SELECT ON LARGE OBJECT 1 TO "tom";

lodb10m=# select * from pg_shdepend where objid = 1;
┌───────┬─────────┬───────┬──────────┬────────────┬──────────┬─────────┐
│ dbid  │ classid │ objid │ objsubid │ refclassid │ refobjid │ deptype │
├───────┼─────────┼───────┼──────────┼────────────┼──────────┼─────────┤
│ 16406 │    2613 │     1 │        0 │       1260 │    16384 │ o       │
│ 16406 │    2613 │     1 │        0 │       1260 │    16393 │ a       │
│ 16406 │    2613 │     1 │        0 │       1260 │    16394 │ a       │
│ 16406 │    2613 │     1 │        0 │       1260 │    16395 │ a       │
└───────┴─────────┴───────┴──────────┴────────────┴──────────┴─────────┘

lodb10m=# select * from pg_largeobject_metadata ;
┌─────┬──────────┬───────────────────────────────────────────────────────────┐
│ oid │ lomowner │ lomacl                                                    │
├─────┼──────────┼───────────────────────────────────────────────────────────┤
│   1 │    16384 │ {hannuk=rw/hannuk,bob=r/hannuk,joe=r/hannuk,tom=r/hannuk} │
└─────┴──────────┴───────────────────────────────────────────────────────────┘

Make the remaining 10M-1 LOs

lodb10m=# insert into pg_largeobject_metadata(oid, lomowner, lomacl)
SELECT i, 16384, '{hannuk=rw/hannuk,bob=r/hannuk,joe=r/hannuk,tom=r/hannuk}'
FROM generate_series(2, 10_000_000) g(i);
INSERT 0 9999999
Time: 18859.341 ms (00:18.859)

And add their shared dependencies

lodb10m=# WITH refdeps (robj, rdeptype) AS (
  VALUES (16384, 'o'), (16393, 'a'), (16394, 'a'), (16395, 'a')
)
INSERT INTO pg_shdepend
SELECT 16396, 2613, i, 0, 1260, robj, rdeptype
  FROM generate_series(2, 10_000_000) g(i), refdeps;
INSERT 0 39999996
Time: 116697.342 ms (01:56.697)

Time pg_upgrade's pg_dump and pg_restore

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
--quote-all-identifiers --binary-upgrade --format=custom
--file=lodb10m-3grants.dump -p 5433 lodb10m
real 0m21.519s
user 0m2.951s
sys 0m1.723s

hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb10m lodb10m-3grants.dump
real 27m15.372s
user 0m45.157s
sys 4m57.513s

On Fri, Apr 11, 2025 at 10:11 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Apr 08, 2025 at 12:22:00PM -0500, Nathan Bossart wrote:
> > On Tue, Apr 08, 2025 at 01:07:09PM -0400, Tom Lane wrote:
> >> Nathan Bossart <nathandbossart@gmail.com> writes:
> >>> I do think it's worth considering going back to copying
> >>> pg_largeobject_metadata's files for upgrades from v16 and newer.
> >>
> >> (If we do this) I don't see why we'd need to stop at v16.  I'm
> >> envisioning that we'd use COPY, which will be dealing in the
> >> text representation of aclitems, and I don't think that's changed
> >> in a long time.  The sort of thing that would break it is changes
> >> in the set of available/default privilege bits for large objects.
> >
> > I was thinking of actually reverting commit 12a53c7 for upgrades from v16,
> > which AFAICT is the last release where any relevant storage formats changed
> > (aclitem changed in v16).  But if COPY gets us pretty close to that and is
> > less likely to be disrupted by future changes, it could be a better
> > long-term approach.
> >
> >> That is, where the dump currently contains something like
> >>
> >> SELECT pg_catalog.lo_create('2121');
> >> ALTER LARGE OBJECT 2121 OWNER TO postgres;
> >> GRANT ALL ON LARGE OBJECT 2121 TO joe;
> >>
> >> we'd have
> >>
> >> COPY pg_largeobject_metadata FROM STDIN;
> >> ...
> >> 2121 10 {postgres=rw/postgres,joe=rw/postgres}
> >> ...
> >>
> >> and some appropriate COPY data for pg_shdepend too.
>
> I did some more research here.  For many large objects without ACLs to
> dump, I noticed that the vast majority of time is going to restoring the
> ALTER OWNER commands.  For 1 million such large objects, restoring took ~73
> seconds on my machine.  If I instead invented an lo_create_with_owner()
> function and created 100 per SELECT command, the same restore takes ~7
> seconds.  Copying the relevant pg_shdepend rows out and back in takes ~2.5
> seconds.  I imagine using COPY for pg_largeobject_metadata would also take
> a couple of seconds in this case.
>
> For upgrading, I don't think there's any huge benefit to optimizing the
> restore commands versus using COPY.  It might make future catalog changes
> for large object stuff easier, but I'd expect those to be rare.  However,
> the optimized restore commands could be nice for non-pg_upgrade use-cases.
>
> --
> nathan