Thread: BUG #17717: Regression in vacuumdb (15 is slower than 10/11 and possible memory issue)

From: PG Bug reporting form <noreply@postgresql.org>

The following bug has been logged on the website:

Bug reference:      17717
Logged by:          Gunnar L
Email address:      postgresql@taljaren.se
PostgreSQL version: 15.0
Operating system:   Ubuntu Linux
Description:

We have observed a significant slowdown in vacuumdb performance between
different versions of PostgreSQL, and possibly also a memory issue.

We run a specific data model, where each customer has its own schema with
its own set of tables. Each database server hosts 16 databases, each
containing around 250 customer schemas. Due to postgres creating a new file
for each database object, we end up with around 5 million files on each
database server. This may or may not be related to the issue we're seeing
(new algorithms with new time complexity?)

We upgraded from postgresql 9.5 to postgresql 13, and noticed a significant
slowdown in how vacuumdb performs. Before, we could run a vacuumdb -a -z
each night, taking around 2 hours to complete. After the upgrade, we see a
constant 100% CPU utilization during the vacuumdb process (almost no I/O
activity), and vacuumdb cannot complete within a reasonable time. We're able
to vacuum about 3-4 databases each night.

We are able to recreate this issue using a simple bash script that
generates a similar setup.

From local testing, here are our findings:

Concerning speed:
* Versions 9.5, 10, and 11 are fast (9.5 slower than 10 and 11)
* Versions 12, 13, and 14 are very, very slow
* Version 15 is faster (a lot faster than 12, 13, and 14) but not nearly as
fast as 10 or 11.

Concerning memory usage:
* Version 15 uses a lot more shared memory, OR it might not be releasing
it properly after vacuuming a db.

These are the timings for vacuuming the 16 dbs.

Version   Seconds   Completed
------------------------------
9.5       412       16/16
10        178       16/16
11        166       16/16
12       8319        1/16 or 2/16 (manually aborted)
13      18853        3/16 or 4/16 (manually aborted)
14      16857        3/16 or 4/16 (manually aborted)
15        617        1/16 (crashed!)
15       4158        6/16 (crashed! --shm-size=256mb)
15       9500       16/16 (--shm-size=4096mb)

The timing of the only fully successful run for Postgres 15 is somewhat
unreliable, since the machine was suspended for about 1-1.5 hours, so
9500 seconds is only an estimate. Still, the first run (1 db completed
in 10 minutes) indicates that 15 is faster than 12-14 but slower than
10 and 11 (which completed everything in about 3 minutes).


The following describes our setup.
This is the script (called setup.sh) we're using to populate the
databases (we pass a port number as a parameter):

##### start of setup.sh
export PGPASSWORD=mysecretpassword
PORT=$1

echo ""> tables_$PORT.sql
for schema in `seq -w 1 250`; do
    echo "create schema schema$schema;" >> tables_$PORT.sql
    for table in `seq -w 1 500`; do
        echo "create table schema$schema.table$table (id int);" >>
tables_$PORT.sql
    done
done

echo "Setting up db: 01"
createdb -h localhost -U postgres -p $PORT  db01
psql -q -h localhost -U postgres -p $PORT db01 -f tables_$PORT.sql

# This seems to be the fastest way to create the databases
for db in `seq -w 2 16`; do
    echo "Setting up db: $db"
    createdb -h localhost -U postgres -p $PORT --template db01 db$db
done
####### end of setup.sh



To execute a test for a particular Postgres version (in this example PG
9.5), we run the following. It will set up PG 9.5 on port 15432.

docker run --rm --name pg95 -e POSTGRES_PASSWORD=mysecretpassword \
    -p 15432:5432 -d postgres:9.5
./setup.sh 15432
date; time docker exec -it pg95 bash -c "vacuumdb -a -z -U postgres"; date

(The date commands are added to keep track of when tasks were started).


Here is the complete set of commands, output, and comments.
(We use different ports for the different versions of PG.)

time docker exec -it pg95 bash -c "vacuumdb -a -z -U postgres"
vacuumdb: vacuuming database "db01"
…<snip>...
vacuumdb: vacuuming database "db16"
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template1"

real    6m52,070s
user    0m0,048s
sys    0m0,029s


time docker exec -it pg10 bash -c "vacuumdb -a -z -U postgres"
vacuumdb: vacuuming database "db01"
…<snip>...
vacuumdb: vacuuming database "db16"
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template1"

real    2m58,354s
user    0m0,043s
sys    0m0,013s





time docker exec -it pg11 bash -c "vacuumdb -a -z -U postgres"
vacuumdb: vacuuming database "db01"
…<snip>...
vacuumdb: vacuuming database "db16"
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template1"

real    2m46,181s
user    0m0,047s
sys    0m0,012s




date; time docker exec -it pg12 bash -c "vacuumdb -a -z -U postgres"; date
Sat 10 Dec 2022 18:57:43 CET
vacuumdb: vacuuming database "db01"
vacuumdb: vacuuming database "db02"
^CCancel request sent
vacuumdb: error: vacuuming of table "schema241.table177" in database "db02" failed: ERROR:  canceling statement due to user request

real    138m39,600s
user    0m0,177s
sys    0m0,418s
Sat 10 Dec 2022 21:16:22 CET




date;time docker exec -it pg13 bash -c "vacuumdb -a -z -U postgres"
Sat 10 Dec 2022 07:22:32 CET
vacuumdb: vacuuming database "db01"
vacuumdb: vacuuming database "db02"
vacuumdb: vacuuming database "db03"
vacuumdb: vacuuming database "db04"
^CCancel request sent

real    314m13,172s
user    0m0,551s
sys    0m0,663s
Sat 10 Dec 2022 12:37:03 CET



date;time docker exec -it pg14 bash -c "vacuumdb -a -z -U postgres"; date
Sat 10 Dec 2022 14:15:37 CET
vacuumdb: vacuuming database "db01"
vacuumdb: vacuuming database "db02"
vacuumdb: vacuuming database "db03"
vacuumdb: vacuuming database "db04"
^CCancel request sent

real    280m57,172s
user    0m0,586s
sys    0m0,559s
Sat 10 Dec 2022 18:56:34 CET



date;time docker exec -it pg15 bash -c "vacuumdb -a -z -U postgres"; date
Sat 10 Dec 2022 12:50:25 CET
vacuumdb: vacuuming database "db01"
vacuumdb: vacuuming database "db02"
vacuumdb: error: processing of database "db02" failed: ERROR:  could not resize shared memory segment "/PostgreSQL.2952321776" to 27894720 bytes: No space left on device

real    10m17,913s
user    0m0,030s
sys    0m0,049s
Sat 10 Dec 2022 13:00:43 CET

# it was faster, but we need to extend shared memory to make it work


docker run --rm --name pg15 --shm-size=256mb \
    -e POSTGRES_PASSWORD=mysecretpassword -p 55555:5432 -d postgres:15

date;time docker exec -it pg15 bash -c "vacuumdb -a -z -U postgres"; date
Mon 12 Dec 2022 08:56:17 CET
vacuumdb: vacuuming database "db01"
…<snip>...
vacuumdb: vacuuming database "db07"
vacuumdb: error: processing of database "db07" failed: ERROR:  could not resize shared memory segment "/PostgreSQL.1003084622" to 27894720 bytes: No space left on device

real    69m18,345s
user    0m0,217s
sys    0m0,086s
Mon 12 Dec 2022 10:05:36 CET



docker run --rm --name pg15 --shm-size=4096mb \
    -e POSTGRES_PASSWORD=mysecretpassword -p 55555:5432 -d postgres:15

date;time docker exec -it pg15 bash -c "vacuumdb -a -z -U postgres"; date
Mon 12 Dec 2022 11:16:11 CET
vacuumdb: vacuuming database "db01"
…<snip>...
vacuumdb: vacuuming database "db16"
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template1"

real    232m46,168s
user    0m0,227s
sys    0m0,467s
Mon 12 Dec 2022 15:08:57 CET



Here is the hardware that was used:
AMD Ryzen 7 PRO 5850U with Radeon Graphics
8 cores, 16 threads

$ free
               total        used        free      shared  buff/cache   available
Mem:        28562376     5549716      752624     1088488    22260036    21499752
Swap:         999420      325792      673628

Disk:    NVMe device, Samsung SSD 980 1TB


PG Bug reporting form <noreply@postgresql.org> writes:
> We run a specific data model, where each customer has its own schema with
> its own set of tables. Each database server hosts 16 databases, each
> containing around 250 customer schemas. Due to postgres creating a new file
> for each database object, we end up with around 5 million files on each
> database server. This may or may not be related to the issue we're seeing
> (new algorithms with new time complexity?)

> We upgraded from postgresql 9.5 to postgresql 13, and noticed a significant
> slowdown in how vacuumdb performs. Before, we could run a vacuumdb -a -z
> each night, taking around 2 hours to complete. After the upgrade, we see a
> constant 100% CPU utilization during the vacuumdb process (almost no I/O
> activity), and vacuumdb cannot complete within a reasonable time. We're able
> to vacuum about 3-4 databases each night.

I poked into this a little bit.  On HEAD, watching things with "perf"
identifies vac_update_datfrozenxid() as the main time sink.  It's not
hard to see why: that does a seqscan of pg_class, and it's invoked
at the end of each vacuum() call.  So if you try to vacuum each table
in the DB separately, you're going to end up spending O(N^2) time
in often-useless rescans of pg_class.  This isn't a huge problem in
ordinary-sized DBs, but with 125000 small tables in the DB it becomes
the dominant cost.
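
For scale: the reproduction above creates 250 schemas x 500 tables =
125000 tables per database, so pg_class holds at least 125000 rows,
and every per-table VACUUM rescans all of them, on the order of
125000^2 ~= 1.6e10 tuple visits per database. A quick way to confirm
the catalog size, as a diagnostic sketch against the Docker setup
above (assuming PGPASSWORD is exported as in setup.sh; adjust the
port to the instance under test):

# Count the pg_class rows that every vac_update_datfrozenxid call
# has to scan:
psql -h localhost -U postgres -p 15432 -d db01 \
    -c "SELECT count(*) FROM pg_class;"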

> Concerning speed:
> * Versions 9.5, 10, and 11 are fast (9.5 slower than 10 and 11)
> * Versions 12, 13, and 14 are very, very slow
> * Version 15 is faster (a lot faster than 12, 13, and 14) but not nearly as
> fast as 10 or 11.

The reason for the v12 performance change is that up through v11,
"vacuumdb -a -z" would just issue "VACUUM (ANALYZE);" in each DB.
So vac_update_datfrozenxid only ran once.  Beginning in v12 (commit
e0c2933a7), vacuumdb issues a separate VACUUM command for each
targeted table, which causes the problem.
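
Schematically, the difference in the SQL that vacuumdb issues looks
roughly like this (a sketch; the real tool quotes identifiers and
reuses a single connection per database):

# Through v11, "vacuumdb -a -z" sent one statement per database,
# hence one vac_update_datfrozenxid pass per database:
psql -U postgres -d db01 -c 'VACUUM (ANALYZE);'

# From v12 on, it sends one statement per table, hence ~125000
# pg_class rescans per database in this setup:
psql -U postgres -d db01 -c 'VACUUM (ANALYZE) schema001.table001;'
psql -U postgres -d db01 -c 'VACUUM (ANALYZE) schema001.table002;'
# ...one command per table, for every table in the database...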

I'm not sure why there's a performance delta from 14 to 15.
It doesn't look like vacuumdb itself had any material changes,
so we must have done something different on the backend side.
This may indicate that there's another O(N^2) behavior that
we got rid of in v15.  Anyway, that change isn't bad, so I did
not poke into it too much.

Conclusions:

* As a short-term fix, you could try using vacuumdb from v11
with the newer servers.  Or just do "psql -c 'vacuum analyze'"
and not bother with vacuumdb at all; a sketch of that loop
follows this list.  (On HEAD, with this example database,
'vacuum analyze' takes about 7 seconds per DB for me, versus
~10 minutes using vacuumdb.)

* To fix vacuumdb properly, it might be enough to get it to
batch VACUUMs, say by naming up to 1000 tables per command
instead of just one.  I'm not sure how that would interact
with its parallelization logic, though.  It's not really
solving the O(N^2) issue either, just pushing it further out.

* A better idea, though sadly not very back-patchable, could
be to expose a VACUUM option to control whether it runs
vac_update_datfrozenxid, so that vacuumdb can do that just
once at the end.  Considering that vac_update_datfrozenxid
requires an exclusive lock, the current behavior is poison for
parallel vacuuming quite aside from the O(N^2) issue.  This
might tie into some work Peter G. has been pursuing, too.
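
A minimal sketch of the per-database workaround from the first point
above, against the Docker setup from the report (assuming the
db01..db16 naming and the exported PGPASSWORD from setup.sh):

PORT=15432    # adjust to the instance under test
for db in `seq -w 1 16`; do
    # One VACUUM statement per database, so vac_update_datfrozenxid
    # runs once per database instead of once per table:
    psql -h localhost -U postgres -p $PORT -d db$db -c 'VACUUM (ANALYZE);'
done

As for the batching idea: VACUUM has accepted a comma-separated list
of tables since v11, so a batched vacuumdb would boil down to emitting
commands of roughly this shape (batch size and identifier quoting left
aside):

# One pg_class rescan per batch rather than one per table:
psql -h localhost -U postgres -p $PORT -d db01 -c \
    'VACUUM (ANALYZE) schema001.table001, schema001.table002, schema001.table003;'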

            regards, tom lane



On Thu, Dec 15, 2022 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> * A better idea, though sadly not very back-patchable, could
> be to expose a VACUUM option to control whether it runs
> vac_update_datfrozenxid, so that vacuumdb can do that just
> once at the end.  Considering that vac_update_datfrozenxid
> requires an exclusive lock, the current behavior is poison for
> parallel vacuuming quite aside from the O(N^2) issue.  This
> might tie into some work Peter G. has been pursuing, too.

That sounds like a good idea to me. But do we actually need a VACUUM
option for this? I wonder if we could get away with having the VACUUM
command never call vac_update_datfrozenxid(), except when run in
single-user mode. It would be nice to make pg_xact/clog truncation
autovacuum's responsibility.

Autovacuum already does things differently to the VACUUM command, and
for reasons that seem related to this complaint about vacuumdb.
Besides, autovacuum is already on the hook to call
vac_update_datfrozenxid() for the benefit of databases that haven't
actually been vacuumed, per the do_autovacuum() comments right above
its vac_update_datfrozenxid() call.

-- 
Peter Geoghegan



Peter Geoghegan <pg@bowt.ie> writes:
> On Thu, Dec 15, 2022 at 10:56 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * A better idea, though sadly not very back-patchable, could
>> be to expose a VACUUM option to control whether it runs
>> vac_update_datfrozenxid, so that vacuumdb can do that just
>> once at the end.  Considering that vac_update_datfrozenxid
>> requires an exclusive lock, the current behavior is poison for
>> parallel vacuuming quite aside from the O(N^2) issue.  This
>> might tie into some work Peter G. has been pursuing, too.

> That sounds like a good idea to me. But do we actually need a VACUUM
> option for this? I wonder if we could get away with having the VACUUM
> command never call vac_update_datfrozenxid(), except when run in
> single-user mode. It would be nice to make pg_xact/clog truncation
> autovacuum's responsibility.

I could get behind manual VACUUM not invoking vac_update_datfrozenxid
by default, perhaps.  But if it can never call it, then that is a
fairly important bit of housekeeping that is unreachable except by
autovacuum.  No doubt the people who turn off autovacuum are benighted,
but they're still out there.

Could we get somewhere by saying that manual VACUUM calls
vac_update_datfrozenxid only if it's a full-DB vacuum (ie, no table
was specified)?  That would fix the problem at hand.  However, it'd
mean (since v12) that a vacuumdb run never calls vac_update_datfrozenxid
at all, which would result in horrible problems for any poor sods
who think that a cronjob running "vacuumdb -a" is an adequate substitute
for autovacuum.

Or maybe we could modify things so that "autovacuum = off" doesn't prevent
occasional cycles of vac_update_datfrozenxid-and-nothing-else?

In the end I feel like a manual way to call vac_update_datfrozenxid
would be the least magical way of running this.

            regards, tom lane



On Thu, Dec 15, 2022 at 1:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I could get behind manual VACUUM not invoking vac_update_datfrozenxid
> by default, perhaps.  But if it can never call it, then that is a
> fairly important bit of housekeeping that is unreachable except by
> autovacuum.  No doubt the people who turn off autovacuum are benighted,
> but they're still out there.

I wouldn't mind adding another option for this to VACUUM. We already
have a couple of VACUUM options that are only really needed as escape
hatches, or perhaps as testing tools used by individual Postgres
hackers. Another one doesn't seem too bad. The VACUUM command should
eventually become totally niche, so I'm not too concerned about going
overboard here.

> Could we get somewhere by saying that manual VACUUM calls
> vac_update_datfrozenxid only if it's a full-DB vacuum (ie, no table
> was specified)?  That would fix the problem at hand.

That definitely seems reasonable.

> Or maybe we could modify things so that "autovacuum = off" doesn't prevent
> occasional cycles of vac_update_datfrozenxid-and-nothing-else?

That's what I was thinking of. It seems like a more natural approach
to me, at least offhand.

I have to imagine that the vast majority of individual calls to
vac_update_datfrozenxid have just about zero chance of updating
datfrozenxid or datminmxid as things stand. There is bound to be some
number of completely static tables in every database (maybe just
system catalogs). Those static tables are bound to be the tables that
hold back datfrozenxid/datminmxid approximately all the time. To me
this suggests that vac_update_datfrozenxid should fully own the fact
that it's supposed to be called out of band, possibly only in
autovacuum.
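
One way to see those laggard tables in a given database, as a
diagnostic sketch against the setup from the report (not part of any
proposed change):

# The tables with the oldest relfrozenxid are the ones that can
# hold back datfrozenxid:
psql -h localhost -U postgres -p 15432 -d db01 -c "
    SELECT c.oid::regclass AS table_name, age(c.relfrozenxid) AS xid_age
    FROM pg_class c
    WHERE c.relkind IN ('r', 'm', 't')
    ORDER BY age(c.relfrozenxid) DESC
    LIMIT 10;"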

Separately, I wonder if it would make sense to invent a new fast-path
for the VACUUM command that is designed to inexpensively determine
that it cannot possibly matter if vac_update_datfrozenxid is never
called, given the specifics (the details of the target rel and its
TOAST rel).

-- 
Peter Geoghegan



Peter Geoghegan <pg@bowt.ie> writes:
> On Thu, Dec 15, 2022 at 1:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Or maybe we could modify things so that "autovacuum = off" doesn't prevent
>> occasional cycles of vac_update_datfrozenxid-and-nothing-else?

> That's what I was thinking of. It seems like a more natural approach
> to me, at least offhand.

Seems worth looking into.  But I suppose the launch frequency would
have to be more often than the current behavior for autovacuum = off,
so it would complicate the logic in that area.

> I have to imagine that the vast majority of individual calls to
> vac_update_datfrozenxid have just about zero chance of updating
> datfrozenxid or datminmxid as things stand.

That is a really good point.  How about teaching VACUUM to track
the oldest original relfrozenxid and relminmxid among the table(s)
it processed, and skip vac_update_datfrozenxid unless at least one
of those matches the database's values?  For extra credit, also
skip if we didn't successfully advance the source rel's value.

This might lead to a fix that solves the OP's problem while not
changing anything fundamental, which would make it reasonable
to back-patch.
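
In SQL terms, the proposed skip condition amounts to a per-table
check of roughly this shape (a sketch of the heuristic only; the real
test would live in C, inside vacuum()):

# vac_update_datfrozenxid can only make progress if some table we
# just vacuumed was one of the database-wide laggards:
psql -h localhost -U postgres -p 15432 -d db01 -c "
    SELECT (c.relfrozenxid = d.datfrozenxid
            OR c.relminmxid = d.datminmxid) AS may_advance
    FROM pg_class c, pg_database d
    WHERE d.datname = current_database()
      AND c.oid = 'schema001.table001'::regclass;"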

            regards, tom lane



On Fri, Dec 16, 2022 at 6:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I have to imagine that the vast majority of individual calls to
> > vac_update_datfrozenxid have just about zero chance of updating
> > datfrozenxid or datminmxid as things stand.
>
> That is a really good point.  How about teaching VACUUM to track
> the oldest original relfrozenxid and relminmxid among the table(s)
> it processed, and skip vac_update_datfrozenxid unless at least one
> of those matches the database's values?  For extra credit, also
> skip if we didn't successfully advance the source rel's value.

Hmm. I think that that would probably work.

It would certainly work on 15+, because there tends to be "natural
diversity" among the relfrozenxid values seen for each table, due to
the "track oldest extant XID" work; we no longer see many tables that
all have the same relfrozenxid, that advance in lockstep. But even
that factor probably doesn't matter, since we only need one "laggard
relfrozenxid" static table for the scheme to work and work well. That
is probably a safe bet on all versions, though I'd have to check to be
sure.

> This might lead to a fix that solves the OP's problem while not
> changing anything fundamental, which would make it reasonable
> to back-patch.

That's a big plus. This is a nasty regression. I wouldn't call it a
must-fix, but it's bad enough to be worth fixing if we can come up
with a reasonably non-invasive approach.

-- 
Peter Geoghegan



Peter Geoghegan <pg@bowt.ie> writes:
> On Fri, Dec 16, 2022 at 6:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That is a really good point.  How about teaching VACUUM to track
>> the oldest original relfrozenxid and relminmxid among the table(s)
>> it processed, and skip vac_update_datfrozenxid unless at least one
>> of those matches the database's values?  For extra credit, also
>> skip if we didn't successfully advance the source rel's value.

> Hmm. I think that that would probably work.

> It would certainly work on 15+, because there tends to be "natural
> diversity" among the relfrozenxid values seen for each table, due to
> the "track oldest extant XID" work; we no longer see many tables that
> all have the same relfrozenxid, that advance in lockstep. But even
> that factor probably doesn't matter, since we only need one "laggard
> relfrozenxid" static table for the scheme to work and work well. That
> is probably a safe bet on all versions, though I'd have to check to be
> sure.

Oh, I see your point: if a whole lot of tables have the same relfrozenxid
and it matches datfrozenxid, this won't help.  Still, we can hope that
that's an uncommon situation.  If we postulate somebody trying to use
scheduled "vacuumdb -z" in place of autovacuum, they shouldn't really have
that situation.  Successively vacuuming many tables should normally
result in the tables' relfrozenxids not being all the same, unless they
were unlucky enough to have a very long-running transaction holding back
the global xmin horizon the whole time.

            regards, tom lane



On Thu, Dec 15, 2022 at 01:56:30PM -0500, Tom Lane wrote:
> * To fix vacuumdb properly, it might be enough to get it to
> batch VACUUMs, say by naming up to 1000 tables per command
> instead of just one.  I'm not sure how that would interact
> with its parallelization logic, though.  It's not really
> solving the O(N^2) issue either, just pushing it further out.

I have been thinking about this part, and using a hardcoded rule for
the batches would be tricky.  The list of relations returned by the
scan of pg_class is ordered by relpages, so depending on the
distribution of the table sizes (a few large tables and a lot of
small ones, an exponential distribution of sizes, etc.), we may
finish with more downsides than upsides in some cases, even if we
use a linear rule based on the number of relations, or even if we
distribute the relations across the slots in a round-robin fashion,
for example.

In order to control all that, rather than a hardcoded rule, could it
be as simple as introducing an option like vacuumdb --batch=N
defaulting to 1 to let users control the number of relations grouped
in a single command with a round robin distribution for each slot?
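
For concreteness, the invocation would read something like this
(--batch is a hypothetical flag here, not something vacuumdb actually
accepts):

# Hypothetical: group up to 100 relations per VACUUM command,
# distributed round-robin across the 4 parallel slots:
vacuumdb -a -z --jobs=4 --batch=100 -h localhost -p 15432 -U postgres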
--
Michael


> In order to control all that, rather than a hardcoded rule, could it
> be as simple as introducing an option like vacuumdb --batch=N
> defaulting to 1 to let users control the number of relations grouped
> in a single command with a round robin distribution for each slot?

My first reaction to that is: Is it possible to explain to a DBA what N should be for a particular cluster?


On Thu, Dec 15, 2022 at 08:39:54PM -0800, Peter Geoghegan wrote:
> On Thu, Dec 15, 2022 at 1:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I could get behind manual VACUUM not invoking vac_update_datfrozenxid
>> by default, perhaps.  But if it can never call it, then that is a
>> fairly important bit of housekeeping that is unreachable except by
>> autovacuum.  No doubt the people who turn off autovacuum are benighted,
>> but they're still out there.
> 
> I wouldn't mind adding another option for this to VACUUM. We already
> have a couple of VACUUM options that are only really needed as escape
> hatches, or perhaps as testing tools used by individual Postgres
> hackers. Another one doesn't seem too bad. The VACUUM command should
> eventually become totally niche, so I'm not too concerned about going
> overboard here.

Perhaps there could also be an update-datfrozenxid function that vacuumdb
calls when finished with a database.  Even if vacuum becomes smarter about
calling vac_update_datfrozenxid, this might still be worth doing.
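
That is, something along these lines, if such a function existed (the
function name below is invented purely for illustration):

for db in `seq -w 1 16`; do
    # After vacuuming db$db table by table, update datfrozenxid once
    # at the end; pg_update_datfrozenxid() is hypothetical:
    psql -h localhost -U postgres -p 15432 -d db$db \
        -c 'SELECT pg_update_datfrozenxid();'
done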

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Sat, Dec 17, 2022 at 06:23:27PM -0800, Christophe Pettus wrote:
> My first reaction to that is: Is it possible to explain to a DBA
> what N should be for a particular cluster?

Assuming that we can come up with a rather straightforward yet
portable rule for distributing the relations across the slots, like
the one I mentioned above (which is not the best thing depending on
the sizes and the number of tables), explaining what N should be
would still be quite tricky IMO.
--
Michael

On Sun, Dec 18, 2022 at 11:21:47AM +0900, Michael Paquier wrote:
> On Thu, Dec 15, 2022 at 01:56:30PM -0500, Tom Lane wrote:
> > * To fix vacuumdb properly, it might be enough to get it to
> > batch VACUUMs, say by naming up to 1000 tables per command
> > instead of just one.  I'm not sure how that would interact
> > with its parallelization logic, though.  It's not really
> > solving the O(N^2) issue either, just pushing it further out.
> 
> I have been thinking about this part, and using a hardcoded rule for
> the batches would be tricky.  The list of relations returned by the
> scan of pg_class is ordered by relpages, so depending on the
> distribution of the table sizes (a few large tables and a lot of
> small ones, an exponential distribution of sizes, etc.), we may
> finish with more downsides than upsides in some cases, even if we
> use a linear rule based on the number of relations, or even if we
> distribute the relations across the slots in a round-robin fashion,
> for example.

I've always found it weird that it uses "ORDER BY relpages".

I'd prefer it to use ORDER BY age(relfrozenxid) or
GREATEST(age(relfrozenxid), age(relminmxid)), at least if you specify
one of the --min-*age params.  Or something less hardcoded and
unconfigurable.
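
As a standalone query, that ordering would look something like this
(a sketch; mxid_age() rather than age() for relminmxid, since age()
only takes XIDs):

# Vacuum the most wraparound-endangered tables first:
psql -h localhost -U postgres -p 15432 -d db01 -c "
    SELECT c.oid::regclass
    FROM pg_class c
    WHERE c.relkind IN ('r', 'm')
    ORDER BY GREATEST(age(c.relfrozenxid), mxid_age(c.relminmxid)) DESC;"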

-- 
Justin