From: Hannu Krosing
Subject: Horribly slow pg_upgrade performance with many Large Objects
Msg-id: CAMT0RQSS-6qLH+zYsOeUbAYhop3wmQTkNmQpo5--QRDUR+qYmQ@mail.gmail.com
List: pgsql-hackers
Hi Hackers

## The issue

I have now met a not insignificant number of cases where pg_upgrade
performance is really bad when the database has a large number of
Large Objects.

The average time to `pg_dump --binary-upgrade --format=custom ...` a
database and then `pg_restore ...` it back is 1 minute per 1 million
LOs.

This does not seem too bad until you realize that it is
- 100 minutes, or 1h 40min, for 100M LOs
- 16h 40min for 1 billion large objects
- 66h 40min, or 2 days 18 hours and 40 minutes, for those unfortunate
enough to have the full 4 billion LOs

My direct tests took 10m5s for 10M LOs and 100m54s for 100M.

Also, a smaller server just runs out of shared memory when restoring 100M LOs.
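
For anyone who wants to reproduce the numbers above, a pile of empty
LOs is enough to show the effect, since the slowness comes from the
per-object handling rather than the amount of data. Something like the
following creates them (a quick sketch, not necessarily the exact setup
behind the timings above):

```
-- Create 1 million empty large objects; scale up or repeat as needed.
-- lo_create(0) lets the server pick a free OID for each object,
-- and wrapping it in count() avoids printing a million result rows.
SELECT count(lo_create(0)) FROM generate_series(1, 1000000);
```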

I know that our semi-official recommendation for using Large Objects
is "Don't!", but people still use them, and sometimes they have a lot
of them. One of the reasons seems to be an older version of Hibernate
that converted any column of type CLOB into a Large Object, so we see
cases where the average LO size is a few hundred to a few thousand bytes.

I read through the old discussions from around the time v12 was
released, and the consensus at that time seemed to be that dumping the
LOs without their data was OK "because there will not be too many
LOs". That has unfortunately turned out to be wishful thinking, as
there are lots of tiny Large Objects out there, and they really slow
down pg_upgrade.

## So what to do about this?

The obvious solution would be to handle the table
`pg_largeobject_metadata` the same way we currently handle
`pg_largeobject`: do nothing with it in `pg_dump --binary-upgrade` and
just handle its contents in pg_upgrade itself, like we do for user
tables.

This should work fine for all source database versions starting from PgSQL v12.

For the older supported versions, 9.2 to 11, where `oid` was a system
column, we should just dump the three fields of
`pg_largeobject_metadata` directly, so that the dump command would be
something like `COPY (SELECT oid, lomowner, lomacl FROM
pg_largeobject_metadata) TO stdout`, and then the restore would happen
more or less automatically.
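
Spelled out, the dump and restore sides of that would look roughly
like this (just a sketch of the idea, not a finished implementation):

```
-- Dump side, run against the old (9.2 - 11) cluster:
COPY (SELECT oid, lomowner, lomacl FROM pg_largeobject_metadata) TO STDOUT;

-- Restore side, run against the new cluster as superuser; loading a
-- system catalog directly like this is only intended for the
-- controlled pg_upgrade / --binary-upgrade context:
COPY pg_largeobject_metadata (oid, lomowner, lomacl) FROM STDIN;
```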

Or we could just modify `pg_restore` so that it restores LOs using
COPY instead of doing the current dance of three separate SELECTs of
the lo_xxx() functions.
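
To make the contrast concrete: with --binary-upgrade the restore
currently issues a round of per-object commands roughly along these
lines (a hand-written approximation with made-up OIDs and role names,
not literal pg_dump output):

```
-- One round of commands for every single large object:
SELECT pg_catalog.lo_create('16385');
ALTER LARGE OBJECT 16385 OWNER TO app_owner;
GRANT SELECT ON LARGE OBJECT 16385 TO app_reader;
-- ... repeated once per LO, i.e. 100 million times in the test above,
-- versus the single COPY of pg_largeobject_metadata shown earlier.
```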

`pg_dump --binary-upgrade --format=custom` is relatively fast for LOs -
1m58s for 100M objects -

though still much slower than `pg_dump --data-only -t
pg_largeobject_metadata`, which takes 20 seconds, or 23 when also run
through gzip.

## Preferred solution for --binary-upgrade

- link/copy data files when source is v12+
- dump the triplet of pg_largeobject_metadata columns if source is v9.2 - v11

An acceptable, simpler-to-implement solution would be to just always
dump pg_largeobject_metadata (oid, lomowner, lomacl), as this would
not need any changes to `pg_upgrade`, just to `pg_dump
--binary-upgrade`. It would not be as fast as linking/copying, but it
would still be several orders of magnitude faster than the current
behaviour.


Does anyone have any objections to this?

Would anyone be willing to take up fixing this?

Or perhaps advise where in the code the change should be done?

I started looking at this, and at a high level it looks like I need to
create a task to copy out specific data columns of
pg_largeobject_metadata, rather than its structure, but as pg_dump.c
is 582,673 bytes it will take some time to understand where exactly to
do this and all the possible interactions with flags like
--binary-upgrade and the specific formats.

---
Cheers,
Hannu


