Thread: Horribly slow pg_upgrade performance with many Large Objects

Horribly slow pg_upgrade performance with many Large Objects

From
Hannu Krosing
Date:
Hi Hackers

## The issue

I have now met a not insignificant number of cases where pg_upgrade
performance is really bad when the database has a large number of
Large Objects.

The average time to `pg_dump --binary-upgrade --format=custom ...` a
database and then `pg_restore ...` it back is about 1 minute per 1
million LOs.

This does not seem too bad until you realize that it adds up to
- 100 minutes, or 1h40min, for 100M LOs
- 16h40min for 1 billion large objects
- 66h40min, or 2 days 18 hours and 40 minutes, for those unfortunate
enough to have the full 4 billion LOs

My direct tests took 10m5s for 10M LOs and 100m54s for 100M.

Also a smaller server just runs out of shared memory when restoring 100M LOs.

I know that our semi-official recommendation for using Large Objects
is "Don't!", but people still use them and sometimes they have a lot
of them. One of the reasons seems to be an older version of Hibernate
which converted any column of type CLOB into a Large Object, so we see
cases where the average LO size is a few hundred to a few thousand bytes.

I read through the old discussions from around the time of the v12
release, and the consensus at that time seemed to be that dumping the
LOs without their data was OK "because there will not be too many
LOs". That has unfortunately turned out to be wishful thinking, as
there are lots of tiny Large Objects out there, and these really slow
down pg_upgrade.

## So what to do about this?

The obvious solution would be to handle the table
`pg_largeobject_metadata` the same way as we currently handle
`pg_largeobject`: by not doing anything with it in `pg_dump
--binary-upgrade` and just handling its contents like we do for user
tables in pg_upgrade itself.

This should work fine for all source database versions starting from PgSQL v12.

For the older supported versions 9.2 to 11, where `oid` was a system
column, we should just dump the three fields of
`pg_largeobject_metadata` directly, so that the dump command would be
`COPY (SELECT oid, lomowner, lomacl FROM pg_largeobject_metadata) TO
stdout`, and the restore would then happen more or less automatically.
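
For illustration, here is a minimal sketch of that dump/restore pair
(assuming superuser on both ends; pg_dump/pg_restore would of course
generate and wrap these commands themselves):

-- on the old cluster (9.2 - 11): oid is a system column, so it has to
-- be selected explicitly
COPY (SELECT oid, lomowner, lomacl FROM pg_largeobject_metadata) TO STDOUT;

-- on the new cluster (v12+): oid is an ordinary column, so a plain
-- catalog COPY as superuser puts the rows back
COPY pg_largeobject_metadata (oid, lomowner, lomacl) FROM STDIN;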

Or we could just modify `pg_restore` so that it restores LOs using
COPY instead of doing the current dance of three separate SELECTs from
lo_xxx() functions.

`pg_dump --binary-upgrade --format=custom` is relatively fast for LOs
(1m58s for 100M objects), though still much slower than `pg_dump
--data-only -t pg_largeobject_metadata`, which takes 20 seconds, or 23
seconds when also piped through gzip.

## Preferred solution for --binary-upgrade

- link/copy data files when source is v12+
- dump the triplet of pg_largeobject_metadata columns if source is v9.2 - v11

An acceptable, simpler solution would be to just always dump
pg_largeobject_metadata (oid, lomowner, lomacl), as this would not
need any changes to `pg_upgrade`, just to `pg_dump --binary-upgrade`.
It would not be as fast as linking/copying, but it would still be
several orders of magnitude faster than the current behaviour.


Does anyone have any objections to this?

Would anyone be willing to take up fixing this?

Or perhaps advise where in the code the change should be done?

I started looking at this, and at a high level it looks like I need to
create a task that copies out specific data columns, and not the
structure, of pg_largeobject_metadata. But as pg_dump.c is 582,673
bytes it would take some time to understand where exactly to do this
and all the possible interactions with flags like --binary-upgrade and
specific formats.

---
Cheers,
Hannu



Hannu Krosing <hannuk@google.com> writes:
> I have now met a not insignificant number of cases where pg_upgrade
> performance is really bad when the database has a large number of
> Large Objects.

What version are you testing?  We did some work in that area in the
v17 cycle (a45c78e32).

            regards, tom lane



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Nathan Bossart
Date:
On Mon, Apr 07, 2025 at 10:33:47PM +0200, Hannu Krosing wrote:
> The obvious solution would be to handle the table
> `pg_largeobject_metadata` the same way as we currently handle
> `pg_largeobject `by not doing anything with it in `pg_dump
> --binary-upgrade` and just handle the contents it like we do for user
> tables in pg_upgrade itself.
> 
> This should work fine for all source database versions starting from PgSQL v12.

Unfortunately, the storage format for aclitem changed in v16, so this would
need to be restricted to upgrades from v16 and newer.  That being said, I
regularly hear about slow upgrades with many LOs, so I think it'd be
worthwhile to try to improve matters in v19.

-- 
nathan



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Michael Paquier
Date:
On Mon, Apr 07, 2025 at 05:25:32PM -0400, Tom Lane wrote:
> What version are you testing?  We did some work in that area in the
> v17 cycle (a45c78e32).

I am puzzled by the target version used here, as well.  If there is
more that can be improved, v19 would be the version to consider for
future improvements at this stage.
--
Michael


Re: Horribly slow pg_upgrade performance with many Large Objects

From
Hannu Krosing
Date:
I was testing on version 17


On Tue, Apr 8, 2025 at 6:52 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Apr 07, 2025 at 05:25:32PM -0400, Tom Lane wrote:
> > What version are you testing?  We did some work in that area in the
> > v17 cycle (a45c78e32).
>
> I am puzzled by the target version used here, as well.

I was testing on version 17

Here is how you can easily test this yourself (as --binary-upgrade
does not dump the actual LO data, it is OK for the test to not put
anything there):

hannuk@db01-c1a:~/work/lo-testing$ createdb -p 5433 lodb
hannuk@db01-c1a:~/work/lo-testing$ psql -p 5433 lodb
psql (17.4 (Ubuntu 17.4-1.pgdg22.04+2))
Type "help" for help.

lodb=# insert into pg_largeobject_metadata(oid, lomowner) SELECT i,
16384 FROM generate_series(1, 100_000_000) g(i);
INSERT 0 100000000
Time: 162414.216 ms (02:42.414)
lodb=#
\q

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --data-only -t
pg_largeobject_metadata -p 5433 lodb | gzip >
pg_largeobject_metadata.data.gz
real 0m22.094s
user 0m20.741s
sys 0m2.085s

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --data-only -t
pg_largeobject_metadata --format=custom -p 5433 lodb -f
pg_largeobject_metadata.dump
real 0m20.226s
user 0m18.068s
sys 0m0.824s

>  If there is
> more that can be improved, v19 would be the version to consider for
> future improvements at this stage.

If the internal format has changed in 16, the correct way would be to
go through a data-only dump of pg_largeobject_metadata in all cases,
even for the 100M case, where you get the restore in 2 minutes instead
of 100 minutes:

hannuk@db01-c1a:~/work/lo-testing$ createdb -p 5434 lodb
hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb
pg_largeobject_metadata.dump

real 2m2.277s
user 0m2.594s
sys 0m0.549s

And even in case of a user-visible change in the acl format, it is
most likely that converting the visible format using some regexp
magic, or even a dedicated function, would still be much faster than
creating all the LOs through creation commands.

------
The commands I used to do the pg_upgrade-like test were

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
--quote-all-identifiers --binary-upgrade --format=custom
--file=lodb100m.dump -p 5433 lodb
real 1m58.241s
user 0m35.229s
sys 0m17.854s

hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb lodb100m.dump
real    100m54.878s
user    3m23.885s
sys     20m33.761s

(I left out the --verbose flag that pg_upgrade also sets, as I did not
want to get 100M lines of "large object created" messages.)

Also, the postgres server at -p 5434 needs to be started with the -b
flag to accept loading a dump made with --binary-upgrade. In
Debian/Ubuntu this can be passed directly to pg_ctlcluster as follows:

sudo pg_ctlcluster 17 target -o -b
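
On other platforms roughly the same thing can be done by passing the
flag through pg_ctl (a sketch; the data directory path is a placeholder):

pg_ctl -D /path/to/new/cluster/data -o "-b" start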

----
Hannu




> --
> Michael



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Hannu Krosing
Date:
On re-reading this, the order looked a bit illogical, so I want to
make clear that the pg_upgrade-like test showing 100 min for 100
million LOs is at the end of the last message and the proposed
solution is at the beginning.

On Tue, Apr 8, 2025 at 9:15 AM Hannu Krosing <hannuk@google.com> wrote:
>
> I was testing on version 17
>
>
> On Tue, Apr 8, 2025 at 6:52 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Mon, Apr 07, 2025 at 05:25:32PM -0400, Tom Lane wrote:
> > > What version are you testing?  We did some work in that area in the
> > > v17 cycle (a45c78e32).
> >
> > I am puzzled by the target version used here, as well.
>
> I was testing on version 17
>
> Here is how you can easily test too (as --binary-upgrade does not dump
> the actual data it is ok for the test to not put anything there)
>
> hannuk@db01-c1a:~/work/lo-testing$ createdb -p 5433 lodb
> hannuk@db01-c1a:~/work/lo-testing$ psql -p 5433 lodb
> psql (17.4 (Ubuntu 17.4-1.pgdg22.04+2))
> Type "help" for help.
>
> lodb=# insert into pg_largeobject_metadata(oid, lomowner) SELECT i,
> 16384 FROM generate_series(1, 100_000_000) g(i);
> INSERT 0 100000000
> Time: 162414.216 ms (02:42.414)
> lodb=#
> \q
>
> hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --data-only -t
> pg_largeobject_metadata -p 5433 lodb | gzip >
> pg_largeobject_metadata.data.gz
> real 0m22.094s
> user 0m20.741s
> sys 0m2.085s
>
> hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --data-only -t
> pg_largeobject_metadata --format=custom -p 5433 lodb -f
> pg_largeobject_metadata.dump
> real 0m20.226s
> user 0m18.068s
> sys 0m0.824s
>
> >  If there is
> > more that can be improved, v19 would be the version to consider for
> > future improvements at this stage.
>
> If the internal format has changed in 16 the correct way would be to
> go through the data-only dump of pg_largeobject_metadata in all cases.
> Even for the 100M case where you get the restore in 2 minutes instead
> of 100 minutes
>
> hannuk@db01-c1a:~/work/lo-testing$ createdb -p 5434 lodb
> hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
> --exit-on-error --transaction-size=1000 --dbname lodb
> pg_largeobject_metadata.dump
>
> real 2m2.277s
> user 0m2.594s
> sys 0m0.549s
>
> And even in case of the user-visible format change in acl format it is
> most likely that changing the visible format using some regexp magic,
> or even a dedicated function, would still me much faster than creating
> all the LOs though creation commands.
>
> ------
> The commands I used to do the pg_upgrade-like test were
>
> hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
> --quote-all-identifiers --binary-upgrade --format=custom
> --file=lodb100m.dump -p 5433 lodb
> real 1m58.241s
> user 0m35.229s
> sys 0m17.854s
>
> hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
> --exit-on-error --transaction-size=1000 --dbname lodb lodb100m.dump
> real    100m54.878s
> user    3m23.885s
> sys     20m33.761s
>
> (I left out the --verbose part that pg_upgrade also sets as I did not
> want to get 100M lines of  "large object created " messages )
>
> also the postgres server at -p 5434 needs to be started with -b flag
> to accept the loading a dump from --binary-upgrade. In Debian/Ubuntu
> this can be directly passed to pg_ctlcluster as follows
>
> sudo pg_ctlcluster 17 target -o -b
>
> ----
> Hannu
>
>
>
>
> > --
> > Michael



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Hannu Krosing
Date:
On Tue, Apr 8, 2025 at 12:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Apr 07, 2025 at 10:33:47PM +0200, Hannu Krosing wrote:
> > The obvious solution would be to handle the table
> > `pg_largeobject_metadata` the same way as we currently handle
> > `pg_largeobject `by not doing anything with it in `pg_dump
> > --binary-upgrade` and just handle the contents it like we do for user
> > tables in pg_upgrade itself.
> >
> > This should work fine for all source database versions starting from PgSQL v12.
>
> Unfortunately, the storage format for aclitem changed in v16, so this would
> need to be restricted to upgrades from v16 and newer.

Have we also changed the external format of aclitem any time since
v9.2, or are the changes just to storage?

If the external formats have been stable, we can still get reasonable
performance by dumping the data (2 min for 100M rows). Plus, dumping
the data would work for all supported source versions.

The worst case would still be quite bad, with 80+ min for the full set
of 4 billion LOs, but even that would be much better than the 3 days
with the current approach.

> That being said, I
> regularly hear about slow upgrades with many LOs, so I think it'd be
> worthwhile to try to improve matters in v19.

Changing the LO export to dumping pg_largeobject_metadata content
instead of creating the LOs should be a nice small change confined to
pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
still.


--
Hannu



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Nathan Bossart
Date:
On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:
> On Tue, Apr 8, 2025 at 12:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> That being said, I
>> regularly hear about slow upgrades with many LOs, so I think it'd be
>> worthwhile to try to improve matters in v19.
> 
> Changing the LO export to dumping pg_largeobject_metadata content
> instead of creating the LOs should be a nice small change confined to
> pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
> still.

Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
material at this point.

-- 
nathan



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Hannu Krosing
Date:
On Tue, Apr 8, 2025 at 5:46 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:
> > On Tue, Apr 8, 2025 at 12:17 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> That being said, I
> >> regularly hear about slow upgrades with many LOs, so I think it'd be
> >> worthwhile to try to improve matters in v19.
> >
> > Changing the LO export to dumping pg_largeobject_metadata content
> > instead of creating the LOs should be a nice small change confined to
> > pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
> > still.
>
> Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
> material at this point.

Sure. But I actually think that this is something that should be
back-ported to at least all supported versions at some point, possibly
made dependent on some environment flag so that only people who
desperately need it will get it.

Btw, who would be the right person(s) to ask questions about the
internals of pg_dump?
I have a few more things in the pipeline to add there and would like
to make sure that I have the right approach.

------
Hannu



Nathan Bossart <nathandbossart@gmail.com> writes:
> On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:
>> Changing the LO export to dumping pg_largeobject_metadata content
>> instead of creating the LOs should be a nice small change confined to
>> pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
>> still.

> Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
> material at this point.

Yeah, even if we had a patch in hand, it's too late for v18.  However
there are additional problems with this idea:

1. The idea requires role OIDs to match across the upgrade.
I don't believe that pg_upgrade tries to preserve role OIDs --- and
doing so would be problematic, because what if the new cluster's
bootstrap superuser is named differently in the old and new clusters?

It might be possible to work around that with some casting to/from
regrole, but I don't think a simple COPY into pg_largeobject_metadata
will play along with that.

2. If you just do the equivalent of an INSERT or COPY into
pg_largeobject_metadata, you could create entries that look right,
but they are actually not right because there should be pg_shdepend
entries backing each ownership or permission reference (for non-pinned
roles) and there won't be.

I guess you could think of also manually inserting rows into
pg_shdepend, but (a) ugh and (b) the claimed speedup is kind
of vanishing into the distance at this point.

            regards, tom lane



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Hannu Krosing
Date:
I think we do preserve role oids. This is what the opening comment in
pg_upgrade.c says:

/*
 * To simplify the upgrade process, we force certain system values to be
 * identical between old and new clusters:
 *
 * We control all assignments of pg_class.oid (and relfilenode) so toast
 * oids are the same between old and new clusters.  This is important
 * because toast oids are stored as toast pointers in user tables.
 *
 * While pg_class.oid and pg_class.relfilenode are initially the same in a
 * cluster, they can diverge due to CLUSTER, REINDEX, or VACUUM FULL. We
 * control assignments of pg_class.relfilenode because we want the filenames
 * to match between the old and new cluster.
 *
 * We control assignment of pg_tablespace.oid because we want the oid to match
 * between the old and new cluster.
 *
 * We control all assignments of pg_type.oid because these oids are stored
 * in user composite type values.
 *
 * We control all assignments of pg_enum.oid because these oids are stored
 * in user tables as enum values.
 *
 * We control all assignments of pg_authid.oid for historical reasons (the
 * oids used to be stored in pg_largeobject_metadata, which is now copied via
 * SQL commands), that might change at some point in the future.
 */



On Tue, Apr 8, 2025 at 6:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Nathan Bossart <nathandbossart@gmail.com> writes:
> > On Tue, Apr 08, 2025 at 09:35:24AM +0200, Hannu Krosing wrote:
> >> Changing the LO export to dumping pg_largeobject_metadata content
> >> instead of creating the LOs should be a nice small change confined to
> >> pg_dump --binary-upgrade only so perhaps we could squeeze it in v18
> >> still.
>
> > Feature freeze for v18 was ~4 hours ago, so unfortunately this is v19
> > material at this point.
>
> Yeah, even if we had a patch in hand, it's too late for v18.  However
> there are additional problems with this idea:
>
> 1. The idea requires role OIDs to match across the upgrade.
> I don't believe that pg_upgrade tries to preserve role OIDs --- and
> doing so would be problematic, because what if the new cluster's
> bootstrap superuser is named differently in the old and new clusters?
>
> It might be possible to work around that with some casting to/from
> regrole, but I don't think a simple COPY into pg_largeobject_metadata
> will play along with that.
>
> 2. If you just do the equivalent of an INSERT or COPY into
> pg_largeobject_metadata, you could create entries that look right,
> but they are actually not right because there should be pg_shdepend
> entries backing each ownership or permission reference (for non-pinned
> roles) and there won't be.
>
> I guess you could think of also manually inserting rows into
> pg_shdepend, but (a) ugh and (b) the claimed speedup is kind
> of vanishing into the distance at this point.
>
>                         regards, tom lane



Hannu Krosing <hannuk@google.com> writes:
> I think we do preserve role oids

Oh ... I'd been looking for mentions of "role" in
pg_upgrade_support.c, but what I should have looked for was
"pg_authid".  So yeah, we do preserve role OIDs, and maybe that's
enough to make this workable, at least with source versions that
share the same rules for what goes into pg_largeobject_metadata and
pg_shdepend.  It's not something I'd risk back-patching though.

            regards, tom lane



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Nathan Bossart
Date:
On Tue, Apr 08, 2025 at 12:37:43PM -0400, Tom Lane wrote:
> Hannu Krosing <hannuk@google.com> writes:
>> I think we do preserve role oids
> 
> Oh ... I'd been looking for mentions of "role" in
> pg_upgrade_support.c, but what I should have looked for was
> "pg_authid".  So yeah, we do preserve role OIDs, and maybe that's
> enough to make this workable, at least with source versions that
> share the same rules for what goes into pg_largeobject_metadata and
> pg_shdepend.  It's not something I'd risk back-patching though.

I do think it's worth considering going back to copying
pg_largeobject_metadata's files for upgrades from v16 and newer.  That
sounds restrictive at the moment, but it'll mean that all but one supported
major version can copy the files during upgrade to v19.  I'll admit I'm a
tad worried about having to go back to copying via SQL commands in the
future and re-regressing things (leading to unpredictable differences in
upgrade downtime), but I'm not sure that's a great reason to withhold this
optimization.

Of course, I wouldn't be opposed to optimizing the SQL command strategy,
too, but I suspect that won't compare to copying the files.

-- 
nathan



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Hannu Krosing
Date:
On Tue, Apr 8, 2025 at 6:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Hannu Krosing <hannuk@google.com> writes:
> > I think we do preserve role oids
>
> Oh ... I'd been looking for mentions of "role" in
> pg_upgrade_support.c, but what I should have looked for was
> "pg_authid".  So yeah, we do preserve role OIDs, and maybe that's
> enough to make this workable, at least with source versions that
> share the same rules for what goes into pg_largeobject_metadata and
> pg_shdepend.  It's not something I'd risk back-patching though.

The risk is why I suggest having this in the back branches as a
last-resort option gated on an environment variable.

I have been forced to do this half-manually a few times to make
pg_upgrade feasible at all.

The current, really UGH, workaround is to use pg_repack's support for
swapping relfilenodes to swap pg_largeobject_metadata with a user
table before and after the upgrade, and then manually patch up
leftovers like pg_shdepend after pg_upgrade.

These have fortunately been cases where there were no other
dependencies, like comments or security labels, on the LOs, but those
too should be doable.

>                         regards, tom lane



Nathan Bossart <nathandbossart@gmail.com> writes:
> I do think it's worth considering going back to copying
> pg_largobject_metadata's files for upgrades from v16 and newer.

(If we do this) I don't see why we'd need to stop at v16.  I'm
envisioning that we'd use COPY, which will be dealing in the
text representation of aclitems, and I don't think that's changed
in a long time.  The sort of thing that would break it is changes
in the set of available/default privilege bits for large objects.

That is, where the dump currently contains something like

SELECT pg_catalog.lo_create('2121');
ALTER LARGE OBJECT 2121 OWNER TO postgres;
GRANT ALL ON LARGE OBJECT 2121 TO joe;

we'd have

COPY pg_largeobject_metadata FROM STDIN;
...
2121    10    {postgres=rw/postgres,joe=rw/postgres}
...

and some appropriate COPY data for pg_shdepend too.

(The fact that this representation will contain both numeric and
symbolic role OIDs is why I was concerned about OID stability.)
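
For concreteness, the corresponding pg_shdepend rows might look
roughly like this in COPY form (a sketch only: the dbid and joe's role
OID are made-up placeholders; 2613 is the pg_largeobject catalog OID
and 1260 is pg_authid):

COPY pg_shdepend FROM STDIN;
16401    2613    2121    0    1260    10       o
16401    2613    2121    0    1260    16385    a
\.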

The key thing that worries me here is if the source and target
versions have different ideas of which roles are pinned, which would
silently change what appears in pg_shdepend.  But it'd only really
break if a role mentioned in some LO's owner or ACL is pinned in the
source and not in the target, which seems unlikely.  (In the other
direction, we'd just be adding a useless row in pg_shdepend.)

            regards, tom lane



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Hannu Krosing
Date:
On Tue, Apr 8, 2025 at 7:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Nathan Bossart <nathandbossart@gmail.com> writes:
> > I do think it's worth considering going back to copying
> > pg_largobject_metadata's files for upgrades from v16 and newer.
>
> (If we do this) I don't see why we'd need to stop at v16.  I'm
> envisioning that we'd use COPY, which will be dealing in the
> text representation of aclitems, and I don't think that's changed
> in a long time.  The sort of thing that would break it is changes
> in the set of available/default privilege bits for large objects.
>
> That is, where the dump currently contains something like
>
> SELECT pg_catalog.lo_create('2121');
> ALTER LARGE OBJECT 2121 OWNER TO postgres;
> GRANT ALL ON LARGE OBJECT 2121 TO joe;

Also note that in my --binary-upgrade tests the 100 min / 100M objects
ratio was for the case with no grants. I would expect this to grow to
at least 120 to 150 minutes when grants are also involved.

In the COPY case I would expect the presence of grants to not make much difference.

> we'd have
>
> COPY pg_largeobject_metadata FROM STDIN;
> ...
> 2121    10      {postgres=rw/postgres,joe=rw/postgres}
> ...

--
Hannu



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Nathan Bossart
Date:
On Tue, Apr 08, 2025 at 01:07:09PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> I do think it's worth considering going back to copying
>> pg_largobject_metadata's files for upgrades from v16 and newer.
> 
> (If we do this) I don't see why we'd need to stop at v16.  I'm
> envisioning that we'd use COPY, which will be dealing in the
> text representation of aclitems, and I don't think that's changed
> in a long time.  The sort of thing that would break it is changes
> in the set of available/default privilege bits for large objects.

I was thinking of actually reverting commit 12a53c7 for upgrades from v16,
which AFAICT is the last release where any relevant storage formats changed
(aclitem changed in v16).  But if COPY gets us pretty close to that and is
less likely to be disrupted by future changes, it could be a better
long-term approach.

> That is, where the dump currently contains something like
> 
> SELECT pg_catalog.lo_create('2121');
> ALTER LARGE OBJECT 2121 OWNER TO postgres;
> GRANT ALL ON LARGE OBJECT 2121 TO joe;
> 
> we'd have
> 
> COPY pg_largeobject_metadata FROM STDIN;
> ...
> 2121    10    {postgres=rw/postgres,joe=rw/postgres}
> ...
> 
> and some appropriate COPY data for pg_shdepend too.

Unless I'm missing something, we don't seem to have had any dependency
handling before commit 12a53c7.  Was that broken before we moved to SQL
commands?

-- 
nathan



Hannu Krosing <hannuk@google.com> writes:
> In copy case I would expect the presence of grants to not make much difference.

aclitemin is slower than a lot of other datatype input functions,
but it's still got to be faster than a GRANT.

            regards, tom lane



Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
does *not* preserve database OIDs, which is problematic for using
COPY to load pg_shdepend rows.

            regards, tom lane



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Nathan Bossart
Date:
On Tue, Apr 08, 2025 at 01:36:58PM -0400, Tom Lane wrote:
> Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
> does *not* preserve database OIDs, which is problematic for using
> COPY to load pg_shdepend rows.

I think it does; see commit aa01051.

-- 
nathan



Nathan Bossart <nathandbossart@gmail.com> writes:
> Unless I'm missing something, we don't seem to have had any dependency
> handling before commit 12a53c7.  Was that broken before we moved to SQL
> commands?

Sounds like it :-(

            regards, tom lane



Nathan Bossart <nathandbossart@gmail.com> writes:
> On Tue, Apr 08, 2025 at 01:36:58PM -0400, Tom Lane wrote:
>> Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
>> does *not* preserve database OIDs, which is problematic for using
>> COPY to load pg_shdepend rows.

> I think it does; see commit aa01051.

Ah --- I thought I remembered something having been done about that,
but I failed to find it because I was looking in pg_upgrade not
pg_dump.  Too bad aa01051 didn't update the comment at the top of
pg_upgrade.c.

            regards, tom lane



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Nathan Bossart
Date:
On Tue, Apr 08, 2025 at 01:42:20PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> Unless I'm missing something, we don't seem to have had any dependency
>> handling before commit 12a53c7.  Was that broken before we moved to SQL
>> commands?
> 
> Sounds like it :-(

Huh.  Sure enough, it seems to be lost during an upgrade from 9.6 to 10.

v9.6:

    postgres=# select lo_from_bytea(1234, '1234');
     lo_from_bytea
    ---------------
              1234
    (1 row)

    postgres=# create role bob;
    CREATE ROLE
    postgres=# grant select on large object 1234 to bob;
    GRANT
    postgres=# drop role bob;
    ERROR:  role "bob" cannot be dropped because some objects depend on it
    DETAIL:  privileges for large object 1234

v10 (upgraded from v9.6):

    postgres=# select lo_get(1234);
       lo_get
    ------------
     \x31323334
    (1 row)

    postgres=# drop role bob;
    DROP ROLE

If I then try to upgrade that database to v17, it fails like this:

    pg_restore: from TOC entry 2422; 0 0 ACL LARGE OBJECT 1234 nathan
    pg_restore: error: could not execute query: ERROR:  role "16384" does not exist
    Command was: GRANT SELECT ON LARGE OBJECT 1234 TO "16384";

I've also verified that the dependency information is carried over in
upgrades to later versions (AFAICT all the supported ones).

-- 
nathan



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Hannu Krosing
Date:
On Tue, Apr 8, 2025 at 8:39 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
...
>
> I've also verified that the dependency information is carried over in
> upgrades to later versions (AFAICT all the supported ones).

If I remember correctly, the change from copying the
pg_largeobject_metadata data file to moving the LOs as part of the
schema was made in v12, when oid, which had been a system column in
v11, became a user column, so an upgrade to v11 is likely also missing
the dependencies.



On 4/8/25 15:41, Hannu Krosing wrote:
> On Tue, Apr 8, 2025 at 8:39 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>>
> ...
>>
>> I've also verified that the dependency information is carried over in
>> upgrades to later versions (AFAICT all the supported ones).
> 
> If I remember correctly the change to not copying
> pg_largeobject_metadata data file but instead moving LOs as part of
> schema was done in v12 when oid,, which had been a system column in
> v11, became a user column, so upgrade to v11 is likely also missing
> the dependencies
> 
> 

I remember an incident where large numbers of LOs ran pg_upgrade into a
transaction-ID wraparound because the restore part would create
individual single-statement transactions per LO to create it, then change
permissions and ownership, and finally fill in the data. Could that be
related here?


Regards, Jan



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Nathan Bossart
Date:
On Tue, Apr 08, 2025 at 05:37:50PM -0400, Jan Wieck wrote:
> I remember an incident where large amounts of LOs ran pg_upgrade into a
> transaction-ID wrap around because the restore part would create individual
> single statement transactions per LO to create, then change permissions and
> ownership and finally fill in the data. Could that be related here?

I believe commits 74cf7d4 and 959b38d have largely fixed that problem.

-- 
nathan



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Nathan Bossart
Date:
On Tue, Apr 08, 2025 at 01:51:22PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> On Tue, Apr 08, 2025 at 01:36:58PM -0400, Tom Lane wrote:
>>> Hmm ... one annoying thing for this project is that AFAICS pg_upgrade
>>> does *not* preserve database OIDs, which is problematic for using
>>> COPY to load pg_shdepend rows.
> 
>> I think it does; see commit aa01051.
> 
> Ah --- I thought I remembered something having been done about that,
> but I failed to find it because I was looking in pg_upgrade not
> pg_dump.  Too bad aa01051 didn't update the comment at the top of
> pg_upgrade.c.

I'll apply the attached patch to fix the comment shortly.

-- 
nathan


Re: Horribly slow pg_upgrade performance with many Large Objects

From
Nathan Bossart
Date:
On Tue, Apr 08, 2025 at 09:41:06PM +0200, Hannu Krosing wrote:
> On Tue, Apr 8, 2025 at 8:39 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I've also verified that the dependency information is carried over in
>> upgrades to later versions (AFAICT all the supported ones).
> 
> If I remember correctly the change to not copying
> pg_largeobject_metadata data file but instead moving LOs as part of
> schema was done in v12 when oid,, which had been a system column in
> v11, became a user column, so upgrade to v11 is likely also missing
> the dependencies

Right.

-- 
nathan



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Nathan Bossart
Date:
On Tue, Apr 08, 2025 at 12:22:00PM -0500, Nathan Bossart wrote:
> On Tue, Apr 08, 2025 at 01:07:09PM -0400, Tom Lane wrote:
>> Nathan Bossart <nathandbossart@gmail.com> writes:
>>> I do think it's worth considering going back to copying
>>> pg_largobject_metadata's files for upgrades from v16 and newer.
>> 
>> (If we do this) I don't see why we'd need to stop at v16.  I'm
>> envisioning that we'd use COPY, which will be dealing in the
>> text representation of aclitems, and I don't think that's changed
>> in a long time.  The sort of thing that would break it is changes
>> in the set of available/default privilege bits for large objects.
> 
> I was thinking of actually reverting commit 12a53c7 for upgrades from v16,
> which AFAICT is the last release where any relevant storage formats changed
> (aclitem changed in v16).  But if COPY gets us pretty close to that and is
> less likely to be disrupted by future changes, it could be a better
> long-term approach.
> 
>> That is, where the dump currently contains something like
>> 
>> SELECT pg_catalog.lo_create('2121');
>> ALTER LARGE OBJECT 2121 OWNER TO postgres;
>> GRANT ALL ON LARGE OBJECT 2121 TO joe;
>> 
>> we'd have
>> 
>> COPY pg_largeobject_metadata FROM STDIN;
>> ...
>> 2121    10    {postgres=rw/postgres,joe=rw/postgres}
>> ...
>> 
>> and some appropriate COPY data for pg_shdepend too.

I did some more research here.  For many large objects without ACLs to
dump, I noticed that the vast majority of time is going to restoring the
ALTER OWNER commands.  For 1 million such large objects, restoring took ~73
seconds on my machine.  If I instead invented an lo_create_with_owner()
function and created 100 per SELECT command, the same restore takes ~7
seconds.  Copying the relevant pg_shdepend rows out and back in takes ~2.5
seconds.  I imagine using COPY for pg_largeobject_metadata would also take
a couple of seconds in this case.
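
For illustration, the batched restore statements I have in mind would
look roughly like this (lo_create_with_owner() is the invented helper
described above, not an existing function, and the OIDs here are made
up):

SELECT lo_create_with_owner(lo_oid, lo_owner)
  FROM (VALUES (2121::oid, 10::oid),
               (2122::oid, 10::oid)
               -- ... up to 100 rows per statement ...
       ) AS v(lo_oid, lo_owner);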

For upgrading, I don't think there's any huge benefit to optimizing the
restore commands versus using COPY.  It might make future catalog changes
for large object stuff easier, but I'd expect those to be rare.  However,
the optimized restore commands could be nice for non-pg_upgrade use-cases.

-- 
nathan



Re: Horribly slow pg_upgrade performance with many Large Objects

From
Hannu Krosing
Date:
And in case there *is* an ACL present, each user mentioned in the ACL
adds more overhead.

Also, the separate GRANT calls cause bloat, as the
pg_largeobject_metadata row gets updated for each ALTER LARGE OBJECT
... OWNER or GRANT.

The following numbers are for 10 million LOs with 1 and 3 users being
GRANTed SELECT on each object (with no grants, the pg_restore run took
10 minutes).

Nr of GRANTS  | pg_dump time | pg_restore time
--------------+--------------+----------------
       0      |    0m 10s    |   10m  5s
       1      |    0m 17s    |   15m  3s
       3      |    0m 21s    |   27m 15s

NB! I left out the --verbose flag that pg_upgrade passes to pg_dump,
as it would emit one line per LO dumped.

## 1 GRANT / LO

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
--quote-all-identifiers --binary-upgrade --format=custom
--file=lodb10m.dump -p 5433 lodb10m
real    0m17.022s
user    0m2.956s
sys     0m1.453s

hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb10m lodb10m.dump
real    15m3.136s
user    0m28.991s
sys     2m54.164s

## 3 GRANTs / LO

Make a sample LO with 3 grants:

ALTER LARGE OBJECT 1 OWNER TO "hannuk";
GRANT SELECT ON LARGE OBJECT 1 TO "bob";
GRANT SELECT ON LARGE OBJECT 1 TO "joe";
GRANT SELECT ON LARGE OBJECT 1 TO "tom";
lodb10m=# select *  from pg_shdepend where objid = 1;

┌───────┬─────────┬───────┬──────────┬────────────┬──────────┬─────────┐
│ dbid  │ classid │ objid │ objsubid │ refclassid │ refobjid │ deptype │
├───────┼─────────┼───────┼──────────┼────────────┼──────────┼─────────┤
│ 16406 │    2613 │     1 │        0 │       1260 │    16384 │ o       │
│ 16406 │    2613 │     1 │        0 │       1260 │    16393 │ a       │
│ 16406 │    2613 │     1 │        0 │       1260 │    16394 │ a       │
│ 16406 │    2613 │     1 │        0 │       1260 │    16395 │ a       │
└───────┴─────────┴───────┴──────────┴────────────┴──────────┴─────────┘

lodb10m=# select *  from pg_largeobject_metadata ;
┌─────┬──────────┬───────────────────────────────────────────────────────────┐
│ oid │ lomowner │                          lomacl                           │
├─────┼──────────┼───────────────────────────────────────────────────────────┤
│   1 │    16384 │ {hannuk=rw/hannuk,bob=r/hannuk,joe=r/hannuk,tom=r/hannuk} │
└─────┴──────────┴───────────────────────────────────────────────────────────┘

Make the remaining 10M-1 LOs

lodb10m=# insert into pg_largeobject_metadata(oid, lomowner, lomacl)
SELECT i, 16384,
'{hannuk=rw/hannuk,bob=r/hannuk,joe=r/hannuk,tom=r/hannuk}' FROM
generate_series(2, 10_000_000) g(i);
INSERT 0 9999999
Time: 18859.341 ms (00:18.859)

And add their shared dependencies (pg_shdepend rows):

lodb10m=# WITH refdeps (robj, rdeptype)
AS ( VALUES
  (16384, 'o'),
  (16393, 'a'),
  (16394, 'a'),
  (16395, 'a')
)
INSERT INTO pg_shdepend SELECT 16396, 2613, i, 0, 1260, robj, rdeptype
  FROM generate_series(2, 10_000_000) g(i)
     , refdeps
;
INSERT 0 39999996
Time: 116697.342 ms (01:56.697)

Time pg_upgrade's pg_dump and pg_restore:

hannuk@db01-c1a:~/work/lo-testing$ time pg_dump --schema-only
--quote-all-identifiers --binary-upgrade --format=custom
--file=lodb10m-3grants.dump -p 5433 lodb10m
real    0m21.519s
user    0m2.951s
sys     0m1.723s

hannuk@db01-c1a:~/work/lo-testing$ time pg_restore -p 5434
--exit-on-error --transaction-size=1000 --dbname lodb10m
lodb10m-3grants.dump
real    27m15.372s
user    0m45.157s
sys     4m57.513s

On Fri, Apr 11, 2025 at 10:11 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Tue, Apr 08, 2025 at 12:22:00PM -0500, Nathan Bossart wrote:
> > On Tue, Apr 08, 2025 at 01:07:09PM -0400, Tom Lane wrote:
> >> Nathan Bossart <nathandbossart@gmail.com> writes:
> >>> I do think it's worth considering going back to copying
> >>> pg_largobject_metadata's files for upgrades from v16 and newer.
> >>
> >> (If we do this) I don't see why we'd need to stop at v16.  I'm
> >> envisioning that we'd use COPY, which will be dealing in the
> >> text representation of aclitems, and I don't think that's changed
> >> in a long time.  The sort of thing that would break it is changes
> >> in the set of available/default privilege bits for large objects.
> >
> > I was thinking of actually reverting commit 12a53c7 for upgrades from v16,
> > which AFAICT is the last release where any relevant storage formats changed
> > (aclitem changed in v16).  But if COPY gets us pretty close to that and is
> > less likely to be disrupted by future changes, it could be a better
> > long-term approach.
> >
> >> That is, where the dump currently contains something like
> >>
> >> SELECT pg_catalog.lo_create('2121');
> >> ALTER LARGE OBJECT 2121 OWNER TO postgres;
> >> GRANT ALL ON LARGE OBJECT 2121 TO joe;
> >>
> >> we'd have
> >>
> >> COPY pg_largeobject_metadata FROM STDIN;
> >> ...
> >> 2121 10      {postgres=rw/postgres,joe=rw/postgres}
> >> ...
> >>
> >> and some appropriate COPY data for pg_shdepend too.
>
> I did some more research here.  For many large objects without ACLs to
> dump, I noticed that the vast majority of time is going to restoring the
> ALTER OWNER commands.  For 1 million such large objects, restoring took ~73
> seconds on my machine.  If I instead invented an lo_create_with_owner()
> function and created 100 per SELECT command, the same restore takes ~7
> seconds.  Copying the relevant pg_shdepend rows out and back in takes ~2.5
> seconds.  I imagine using COPY for pg_largeobject_metadata would also take
> a couple of seconds in this case.
>
> For upgrading, I don't think there's any huge benefit to optimizing the
> restore commands versus using COPY.  It might make future catalog changes
> for large object stuff easier, but I'd expect those to be rare.  However,
> the optimized restore commands could be nice for non-pg_upgrade use-cases.
>
> --
> nathan