Thread: pg_dump: optimize dumpFunc()

pg_dump: optimize dumpFunc()

From
Nathan Bossart
Date:
I've recently committed some optimizations for dumping sequences and
pg_class information (commits 68e9629, bd15b7d, and 2329cad), and I noticed
that we are also executing a query per function in pg_dump.  Commit be85727
optimized this by preparing the query ahead of time, but I found that we
can improve performance further by gathering all the relevant data in a
single query.  Here are the results I see for a database with 10k simple
functions with and without the attached patch:

    with patch:

    $ time pg_dump postgres >/dev/null
    pg_dump postgres > /dev/null  0.04s user 0.01s system 40% cpu 0.118 total
    $ time pg_dump postgres >/dev/null
    pg_dump postgres > /dev/null  0.04s user 0.01s system 41% cpu 0.107 total
    $ time pg_dump postgres >/dev/null
    pg_dump postgres > /dev/null  0.04s user 0.01s system 42% cpu 0.103 total
    $ time pg_dump postgres >/dev/null
    pg_dump postgres > /dev/null  0.04s user 0.01s system 44% cpu 0.105 total

    without patch:

    $ time pg_dump postgres >/dev/null
    pg_dump postgres > /dev/null  0.05s user 0.03s system 32% cpu 0.253 total
    $ time pg_dump postgres >/dev/null
    pg_dump postgres > /dev/null  0.05s user 0.03s system 32% cpu 0.252 total
    $ time pg_dump postgres >/dev/null
    pg_dump postgres > /dev/null  0.06s user 0.03s system 32% cpu 0.251 total
    $ time pg_dump postgres >/dev/null
    pg_dump postgres > /dev/null  0.06s user 0.03s system 33% cpu 0.254 total

This one looks a little different than the sequence/pg_class commits.  Much
of the function information isn't terribly conducive to parsing into
fixed-size variables in an array, so instead I've opted to just leave the
PGresult around for reference by dumpFunc().  This patch also creates an
ordered array of function OIDs to speed up locating the relevant index in
the PGresult for use in calls to PQgetvalue().

I may be running out of opportunities where this style of optimization
makes much difference.  I'll likely start focusing on the restore side
soon.

-- 
nathan

Attachment

Re: pg_dump: optimize dumpFunc()

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> I've recently committed some optimizations for dumping sequences and
> pg_class information (commits 68e9629, bd15b7d, and 2329cad), and I noticed
> that we are also executing a query per function in pg_dump.  Commit be85727
> optimized this by preparing the query ahead of time, but I found that we
> can improve performance further by gathering all the relevant data in a
> single query.  Here are the results I see for a database with 10k simple
> functions with and without the attached patch:

I'm a bit concerned about this on two grounds:

1. Is it a win for DBs with not so many functions?

2. On the other end of the scale, if you've got a *boatload* of
functions, what does it do to pg_dump's memory requirements?
I'm recalling my days at Salesforce, where they had quite a few
thousand pl/pgsql functions totalling very many megabytes of source
text.  (Don't recall precise numbers offhand, and they'd be obsolete
by now even if I did.)

I'm not sure that the results you're showing justify taking any
risk here.

            regards, tom lane



Re: pg_dump: optimize dumpFunc()

From
Nathan Bossart
Date:
On Fri, Aug 02, 2024 at 01:33:45AM -0400, Tom Lane wrote:
> I'm a bit concerned about this on two grounds:
> 
> 1. Is it a win for DBs with not so many functions?
> 
> 2. On the other end of the scale, if you've got a *boatload* of
> functions, what does it do to pg_dump's memory requirements?
> I'm recalling my days at Salesforce, where they had quite a few
> thousand pl/pgsql functions totalling very many megabytes of source
> text.  (Don't recall precise numbers offhand, and they'd be obsolete
> by now even if I did.)
> 
> I'm not sure that the results you're showing justify taking any
> risk here.

Hm.  I think this is sufficient reason to withdraw this patch on the spot.

-- 
nathan



Re: pg_dump: optimize dumpFunc()

From
Neil Conway
Date:
On Fri, Aug 2, 2024 at 4:00 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Fri, Aug 02, 2024 at 01:33:45AM -0400, Tom Lane wrote:
> 2. On the other end of the scale, if you've got a *boatload* of
> functions, what does it do to pg_dump's memory requirements?

Hm.  I think this is sufficient reason to withdraw this patch on the spot.

We could potentially rework the list of dumpable objects so that each list item represents one or more objects of the same type that we can fetch via a single query. We could then make whatever tradeoff we want in terms of fetch batch size vs client-side memory consumption.

Debatable whether it is worth the hit to code readability though, and might be a bit grotty to do the tsort we need to do for dependency handling...

Neil