Re: Statistics Import and Export - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Statistics Import and Export
Msg-id: urk5wcvgvnqzmvdvbe6b2gthhf3pphfqmbwculciy3lm4kvnay@avvgvhxo544p
In response to: Re: Statistics Import and Export (Corey Huinker <corey.huinker@gmail.com>)

Hi,

On 2025-03-06 12:16:44 -0500, Corey Huinker wrote:
> >
> > To be honest, I am a bit surprised that we decided to enable this by
> > default. It's not obvious to me that statistics should be regarded as
> > part of the database in the same way that table definitions or table
> > data are. That said, I'm not overwhelmingly opposed to that choice.
> > However, even if it's the right choice in theory, we should maybe
> > rethink if it's going to be too slow or use too much memory.
> >
>
> I'm strongly in favor of the choice to make it the default. This reduces
> the impact of a customer footgun wherein heavy workloads are applied to a
> freshly upgraded database before analyze/vacuumdb have had a chance to do
> their magic [1].

To be clear, I think this is a very important improvement that most people
should use. I just don't think it's quite there yet.


> It seems to me that we're fretting over seconds when the feature is
> potentially saving the customer hours of reduced availability if not
> outright downtime.

FWIW, I care about the performance for two reasons:

1) It's a difference of seconds in the regression database, which has a few
   hundred tables with few columns, very little data, and thus small stats. In
   a database with many tables and columns with complicated datatypes, the
   difference will be far larger.

   And in contrast to analyzing the database in parallel, the pg_dump/restore
   work to restore stats afaict happens single-threaded for each database.


2) The changes initially substantially increased the time a test cycle takes
   for me locally. I run the tests 10s to 100s of times a day, and that really
   adds up.

   002_pg_upgrade is the test that dominates the overall test time for me, so
   it getting slower by a good bit means the overall test time increased.

   1fd1bd87101^:
   total test time:             1m27.010s
   002_pg_upgrade alone:        1m6.309s

   1fd1bd87101:
   total test time:             1m45.945s
   002_pg_upgrade alone:        1m24.597s

   master at 0f21db36d66:
   total test time:             1m34.576s
   002_pg_upgrade alone:        1m12.550s

   It clearly got a lot better since 1fd1bd87101, but it's still ~9% slower
   than before...


I care about the memory usage effects because I've seen plenty of systems where
pg_statistic is many gigabytes (after TOAST compression!), and I am really
worried that pg_dump holding all the serialized strings in memory will cause a
lot of previously working pg_dump invocations and pg_upgrades to fail. That'd
also be a really bad experience.


The more I think about it, the less correct it seems to me to have the
statements that restore statistics tracked via ArchiveOpts->createStmt.  We use
that for DDL, but this really is data, not DDL.  Because we store it in
->createStmt, it's held in memory for the entire runtime of pg_dump, which
means the peak memory usage will inherently be quite high.
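
To make the contrast concrete, here's a rough sketch of the two patterns,
paraphrasing the ArchiveEntry()/ARCHIVE_OPTS() interface from memory (the tag,
section and variable names are illustrative, not quotes from the tree):

    /* stats today: the full SQL payload hangs off the TOC entry, so it
     * stays in pg_dump's memory until the archive is written out */
    ArchiveEntry(fout, nilCatalogId, createDumpId(),
                 ARCHIVE_OPTS(.tag = "relstats",
                              .description = "STATISTICS DATA",
                              .section = SECTION_POST_DATA,
                              .createStmt = stats_sql->data)); /* held in memory */

    /* table data, by contrast, registers a callback; the COPY stream is
     * only produced when the archiver actually writes this entry */
    ArchiveEntry(fout, tdinfo->dobj.catId, tdinfo->dobj.dumpId,
                 ARCHIVE_OPTS(.tag = tbinfo->dobj.name,
                              .description = "TABLE DATA",
                              .section = SECTION_DATA,
                              .copyStmt = copyStmt,
                              .dumpFn = dumpTableData_copy, /* runs at write time */
                              .dumpArg = tdinfo));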

I think the stats need to be handled much more like we handle the actual table
data, which are obviously *not* stored in memory for the whole run of pg_dump.
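
Concretely that'd mean generating the stats SQL lazily from a data-dumper
callback, so the string only exists while that one TOC entry is being written.
A minimal sketch of that shape, with hypothetical names
(dumpRelationStats_dumper, rsinfo) and the fetch/build logic elided:

    /* hypothetical shape of the fix, not committed code */
    static int
    dumpRelationStats_dumper(Archive *fout, const void *userArg)
    {
        const RelStatsInfo *rsinfo = (const RelStatsInfo *) userArg;
        PQExpBuffer sql = createPQExpBuffer();

        /* ... query pg_class/pg_statistic for rsinfo's relation here and
         * build the pg_restore_relation_stats() /
         * pg_restore_attribute_stats() calls into "sql" ... */

        archputs(sql->data, fout);   /* write it out ... */
        destroyPQExpBuffer(sql);     /* ... and free it right away */
        return 1;
    }

    ArchiveEntry(fout, nilCatalogId, createDumpId(),
                 ARCHIVE_OPTS(.tag = "relstats",
                              .description = "STATISTICS DATA",
                              .section = SECTION_POST_DATA,
                              .dumpFn = dumpRelationStats_dumper,
                              .dumpArg = rsinfo));

That way peak memory is bounded by the largest single relation's stats rather
than by the sum across all relations in the database.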

Greetings,

Andres Freund


