On Thu, Mar 06, 2025 at 01:04:55PM -0500, Andres Freund wrote:
> To be clear, I think this is a very important improvement that most people
> should use.
+1
> I just don't think it's quite there yet.
I agree that we should continue working on the performance/memory stuff.
> 1) It's a difference of seconds in the regression database, which has a few
> hundred tables, few columns, very little data and thus small stats. In a
> database with a lot of tables and columns with complicated datatypes the
> difference will be far larger.
>
> And in contrast to analyzing the database in parallel, the pg_dump/restore
> work to restore stats afaict happens single-threaded for each database.
Yeah, I did a lot of work in v18 to rein in pg_dump --binary-upgrade
runtime, and I'm a bit worried that this will undo much of that. It's
obviously going to increase runtime by some amount, which is acceptable,
but the increase needs to stay within reason. I'm optimistic we can get
there for v18 by reducing the number of queries pg_dump issues.
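
To illustrate the sort of thing I have in mind (a rough sketch, not what
the patch actually does, and the column list is just for illustration):
instead of one statistics query per relation, fetch the pg_statistic rows
for a whole batch of relation OIDs in a single round trip, e.g.:

/*
 * Sketch only: batched statistics fetch over libpq.  Assumes "conn" is
 * an established connection; error handling is minimal.
 */
#include <stdio.h>
#include <libpq-fe.h>

static void
dump_stats_for_batch(PGconn *conn, const char *relid_array)
{
    /* relid_array is the text form of an oid[], e.g. "{16384,16390}" */
    const char *params[1] = {relid_array};
    PGresult   *res;

    res = PQexecParams(conn,
                       "SELECT s.starelid, a.attname, s.stanullfrac, "
                       "       s.stawidth, s.stadistinct "
                       "FROM pg_catalog.pg_statistic s "
                       "JOIN pg_catalog.pg_attribute a "
                       "  ON a.attrelid = s.starelid AND a.attnum = s.staattnum "
                       "WHERE s.starelid = ANY($1::pg_catalog.oid[]) "
                       "ORDER BY s.starelid, s.staattnum",
                       1, NULL, params, NULL, NULL, 0);

    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "stats query failed: %s", PQerrorMessage(conn));
        PQclear(res);
        return;
    }

    /* one pass over the result covers every relation in the batch */
    for (int i = 0; i < PQntuples(res); i++)
        printf("relid=%s attname=%s nullfrac=%s width=%s ndistinct=%s\n",
               PQgetvalue(res, i, 0), PQgetvalue(res, i, 1),
               PQgetvalue(res, i, 2), PQgetvalue(res, i, 3),
               PQgetvalue(res, i, 4));

    PQclear(res);
}

One query per batch instead of one per table should keep the added
--binary-upgrade runtime manageable even with a large number of relations.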
> I care about the memory usage effects because I've seen plenty systems where
> pg_statistics is many gigabytes (after toast compression!), and I am really
> worried that pg_dump having all the serialized strings in memory will cause a
> lot of previously working pg_dump invocations and pg_upgrades to fail. That'd
> also be a really bad experience.
I think it is entirely warranted to consider these cases. IME, databases
with "a million tables" or "a million sequences" are far more common than
you might think.
--
nathan