From: Andres Freund
Subject: Re: [Feature Request] INSERT FROZEN to Optimize Large Cold Data Imports and Migrations
In response to: [Feature Request] INSERT FROZEN to Optimize Large Cold Data Imports and Migrations (Sébastien <bokanist@gmail.com>)
List: pgsql-hackers

Hi,

On 2025-02-13 10:52:31 +0100, Sébastien wrote:
> Introduce an INSERT FROZEN feature to bypass vacuum processing for
> large-scale cold data imports, reducing the impact on system performance
> post-import. For large imports, migrations and major version upgrades.
> Business Use-case:
> 
> When importing massive datasets (e.g., 6-10TB) into PostgreSQL on a heavily
> loaded server, we observed that the system struggled significantly weeks
> later when the autovacuum process had to freeze all the imported data
> pages. This led to severe performance degradation, requiring manual
> intervention to prioritize vacuum jobs to complete them as quickly as
> possible.

What version of postgres was this?  What batch sizes do you need to support?
I.e. is all of this data inserted at once, or in steps?

As already discussed, it seems unlikely that we'll ever support INSERT FROZEN,
due to the potential of causing concurrent queries to give bogus answers:
frozen tuples are treated as visible to every snapshot, so a concurrent scan
could see rows whose inserting transaction had not yet committed. But there's
actually a lot we can do to improve this short of INSERT FROZEN.

The reason I asked for the version is that the behaviour would likely have
been worse before autovacuum_vacuum_insert_scale_factor existed, for example.
We are also working on improvements around that in 18, ensuring that the gap
between insert-triggered vacuums does not grow forever.
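
FWIW, that behaviour (available since v13) can already be tuned today,
globally or per table; the table name below is just an example:

    -- postgresql.conf defaults (v13+)
    autovacuum_vacuum_insert_threshold = 1000
    autovacuum_vacuum_insert_scale_factor = 0.2

    -- per table, to trigger insert-driven vacuums sooner after bulk loads
    ALTER TABLE bulk_import
      SET (autovacuum_vacuum_insert_scale_factor = 0.05,
           autovacuum_vacuum_insert_threshold = 100000);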

Several recent releases have also improved the situation in other ways, e.g.
by making vacuuming faster and by avoiding redundant work in more cases (by
advancing relfrozenxid more aggressively).
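
If you want to see how much freezing debt a table has accumulated, checking
the age of relfrozenxid is a quick way, e.g.:

    SELECT relname, age(relfrozenxid) AS xid_age
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY age(relfrozenxid) DESC
    LIMIT 10;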

We've also been talking about performing freezing during e.g. checkpoints, if
possible.

If you're inserting all the data in a single transaction, however, most of
this will be hard to improve: while that long-running transaction is open, we
can't do anything that requires knowing the transaction has finished. OTOH,
if it really is a single transaction, you could already use COPY FREEZE.
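
To be explicit, FREEZE requires that the table was created or truncated in
the same transaction; roughly like this (table and file names made up):

    BEGIN;
    CREATE TABLE cold_data (id bigint, payload text);
    -- or, for a pre-existing table: TRUNCATE cold_data;
    COPY cold_data FROM '/tmp/cold_data.csv' WITH (FORMAT csv, FREEZE);
    COMMIT;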


A somewhat related issue is that bulk INSERTs, in contrast to COPY, currently
do not use the bulk-insert logic, leading INSERT to emit a lot more WAL than
loading the same data via COPY.
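
I.e. loading the same rows these two ways currently produces quite different
WAL volumes (schematic, same hypothetical table as above):

    -- multi-row INSERT: does not use the bulk-insert path, more WAL
    INSERT INTO cold_data VALUES (1, 'a'), (2, 'b'), (3, 'c');

    -- COPY of the same data: goes through the bulk-insert machinery
    COPY cold_data FROM '/tmp/cold_data.csv' WITH (FORMAT csv);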


> This issue is particularly critical during database *migrations* or *version
> upgrades*, where a full data reload is often necessary. Each time a major
> PostgreSQL upgrade occurs, users must reimport large datasets, leading to
> the same problem of vacuum storms post-import. An INSERT FROZEN feature
> would allow importing data that is known to be immutable, preventing
> unnecessary vacuum overhead and reducing system strain.

What are you using for such upgrades or migrations? I'd not expect INSERT to
be used, given the overhead it has compared to COPY.
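
FWIW, pg_dump's default formats already restore table data via COPY (it's
only --inserts/--column-inserts that fall back to row-by-row INSERTs), so a
dump/restore cycle along these lines avoids that overhead (database names
are made up):

    pg_dump -Fc -f olddb.dump olddb
    pg_restore -d newdb -j 8 olddb.dump   # parallel restore, data via COPY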

Greetings,

Andres Freund


