Re: Statistics Import and Export - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Statistics Import and Export
Date:
Msg-id: urk5wcvgvnqzmvdvbe6b2gthhf3pphfqmbwculciy3lm4kvnay@avvgvhxo544p
In response to: Re: Statistics Import and Export (Corey Huinker <corey.huinker@gmail.com>)
Responses:
  Re: Statistics Import and Export
  Re: Statistics Import and Export
  Re: Statistics Import and Export
List: pgsql-hackers
Hi,

On 2025-03-06 12:16:44 -0500, Corey Huinker wrote:
> >
> > To be honest, I am a bit surprised that we decided to enable this by
> > default. It's not obvious to me that statistics should be regarded as
> > part of the database in the same way that table definitions or table
> > data are. That said, I'm not overwhelmingly opposed to that choice.
> > However, even if it's the right choice in theory, we should maybe
> > rethink if it's going to be too slow or use too much memory.
>
> I'm strongly in favor of the choice to make it default. This is reducing
> the impact of a post-upgrade customer footgun wherein heavy workloads are
> applied to a database post-upgrade but before analyze/vacuumdb have had a
> chance to do their magic [1].

To be clear, I think this is a very important improvement that most people
should use. I just don't think it's quite there yet.

> It seems to me that we're fretting over seconds when the feature is
> potentially saving the customer hours of reduced availability if not
> outright downtime.

FWIW, I care about the performance for two reasons:

1) It's a difference of seconds in the regression database, which has a few
   hundred tables with few columns, very little data and thus small stats.
   In a database with many tables and columns with complicated datatypes,
   the difference will be far larger. And in contrast to analyzing the
   database in parallel, the pg_dump/restore work to restore stats afaict
   happens single-threaded for each database.

2) The changes initially substantially increased the time a test cycle
   takes for me locally. I run the tests 10s to 100s of times a day, so
   that really adds up. 002_pg_upgrade is the test that dominates the
   overall test time for me, so it getting slower by a good bit means the
   overall test time increased.

   1fd1bd87101^:
     total test time: 1m27.010s
     002_pg_upgrade alone: 1m6.309s

   1fd1bd87101:
     total test time: 1m45.945s
     002_pg_upgrade alone: 1m24.597s

   master at 0f21db36d66:
     total test time: 1m34.576s
     002_pg_upgrade alone: 1m12.550s

   It clearly got a lot better since 1fd1bd87101, but it's still ~9% slower
   than before (1m34.576s vs. 1m27.010s)...

I care about the memory usage effects because I've seen plenty of systems
where pg_statistic is many gigabytes (after toast compression!), and I am
really worried that pg_dump keeping all the serialized stats strings in
memory will cause a lot of previously working pg_dump invocations and
pg_upgrades to fail. That'd also be a really bad experience.

The more I think about it, the less correct it seems to me to have the
statement that restores statistics tracked via ArchiveOpts->createStmt. We
use that for DDL, but this really is data, not DDL. Because we store it in
->createStmt, it's kept in memory for the runtime of pg_dump, which means
the peak memory usage will inherently be quite high.

I think the stats need to be handled much more like we handle the actual
table data, which is obviously *not* stored in memory for the whole run of
pg_dump.

Greetings,

Andres Freund
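
[For illustration, a minimal sketch of the distinction drawn above, using
pg_dump's real ArchiveEntry()/ARCHIVE_OPTS/WriteData() interfaces. The
statistics-specific callback, the RelStatsInfo type, and the variable names
are hypothetical, not code from the actual patch.]

    /* DDL-style entry: the full restore statement is built up front and
     * owned by the TOC entry via ->createStmt, so it stays in memory for
     * pg_dump's entire run. */
    ArchiveEntry(fout, nilCatalogId, createDumpId(),
                 ARCHIVE_OPTS(.tag = "STATISTICS DATA",
                              .description = "STATISTICS DATA",
                              .section = SECTION_DATA,
                              .createStmt = stats_sql->data));

    /* Data-style entry: only a callback is registered up front; the
     * payload is generated and written when the entry is actually dumped,
     * the way TABLE DATA entries stream their rows. */
    static int
    dumpRelationStats_cb(Archive *fout, const void *userArg)  /* hypothetical */
    {
        const RelStatsInfo *info = (const RelStatsInfo *) userArg;
        PQExpBuffer sql = createPQExpBuffer();

        /* ... build the pg_restore_relation_stats() call for this
         * relation into "sql" ... */

        WriteData(fout, sql->data, sql->len);   /* streamed, not retained */
        destroyPQExpBuffer(sql);
        return 1;
    }

    ArchiveEntry(fout, nilCatalogId, createDumpId(),
                 ARCHIVE_OPTS(.tag = "STATISTICS DATA",
                              .description = "STATISTICS DATA",
                              .section = SECTION_DATA,
                              .createStmt = "",   /* nothing kept in memory */
                              .dumpFn = dumpRelationStats_cb,
                              .dumpArg = info));

[With the second form, peak memory is bounded by one relation's stats at a
time rather than the sum of all serialized statistics, which is the
behavior the mail argues for.]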