Re: Add 64-bit XIDs into PostgreSQL 15 - Mailing list pgsql-hackers

From Chris Travers
Subject Re: Add 64-bit XIDs into PostgreSQL 15
Date
Msg-id CAEq-hvsYDXzk4fHR3W7jbaXpqOQaor028tMiJvz955Xcp=Z89Q@mail.gmail.com
In response to Re: Add 64-bit XIDs into PostgreSQL 15  (Pavel Borisov <pashkin.elfe@gmail.com>)
Responses Re: Add 64-bit XIDs into PostgreSQL 15
List pgsql-hackers


On Mon, Nov 21, 2022 at 10:40 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>> I have a very serious concern about the current patch set, as someone who has faced transaction ID wraparound in the past.
>>
>> I can start by saying I think it would be helpful (if the other issues are approached reasonably) to have 64-bit XIDs, but there is an important piece of context in preventing XID wraparounds that seems missing from this patch, unless I missed something.
>>
>> XID wraparound is a symptom, not an underlying problem.  It usually occurs when autovacuum or other vacuum strategies have unexpected stalls and therefore fail to work as expected.  Shifting to 64-bit XIDs dramatically changes the sorts of problems these stalls pose to operational teams -- you can find you are running out of storage rather than facing an imminent database shutdown.  Worse, this patch delays the problem until some (possibly far later!) time, when vacuum will take far longer to finish and the options for resolving the problem are diminished.  As a result I am concerned that merely changing XIDs from 32-bit to 64-bit will lead to a smaller number of far more serious outages.
>>
>> What would make a big difference from my perspective would be to combine this with a complementary warning system: warn the administrator once the number of XIDs consumed since the last vacuum crosses a configurable threshold.  We could default that threshold to two billion, so operational warnings would arrive not much later than they do now.
>>
>> Otherwise I can imagine cases where, instead of 30 hours to vacuum a table, it takes 300 hours on a database that is short on space.  I would not want to be facing such a situation.

> Hi, Chris!
> I had a similar stance when I started working on this patch. Of course, it seemed horrible just to postpone the consequences of inadequate monitoring, overly long-running transactions that prevent aggressive autovacuum, etc. So I can understand your point.
>
> With time I've come to a somewhat different view of this feature, i.e.:
>
> 1. It's important to set up monitoring, a cut-off for long transactions, etc. correctly anyway. It's not the responsibility of anti-wraparound vacuum to report inadequate monitoring. Furthermore, in real life, if it ever has to kick in to prevent 32-bit wraparound, it is already too late, and it causes a lot of downtime at an unexpected moment. (A rough analogy is a machine running at 120 mph that turns every control off and applies full brakes just because the coolant is low; of course there might have been a warning earlier, but still.)
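For context, one minimal way to watch for such long-running transactions, using only the standard pg_stat_activity view (an illustrative sketch, not something the patch adds; the column aliases are just examples), is:

-- Sessions whose snapshots hold back vacuum's cleanup horizon,
-- ordered by how many XIDs have passed since each snapshot was taken.
SELECT pid,
       age(backend_xmin)   AS xmin_age,
       now() - xact_start  AS xact_duration,
       state,
       left(query, 60)     AS query_head
FROM   pg_stat_activity
WHERE  backend_xmin IS NOT NULL
ORDER  BY age(backend_xmin) DESC;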

So I disagree with you on a few critical points here.

Right now the way things work is:
1.  Database starts throwing warnings that xid wraparound is approaching
2.  Database-owning team initiates an emergency response, which may involve downtime or degraded service as a result
3.  People get frustrated with PostgreSQL because this is a reliability problem.

What I am worried about is:
1.  Database is running out of space
2.  Database-owning team initiates an emergency response and takes more downtime to get back into a good spot
3.  People get frustrated with PostgreSQL because this is a reliability problem.

If that's the way we go, I don't think we've solved that much.  And as humans we bias our judgments towards newsworthy events, so rarer but more severe problems loom larger in perception than routine, less severe ones.  So I think our image as a reliable database would suffer.

An ideal resolution from my perspective would be:
1.  Database starts throwing warnings that xid lag has reached severely abnormal levels
2.  Database-owning team initiates an effort to correct this, and does not incur downtime or degraded service as a result
3.  People do not get frustrated because this is not a reliability problem anymore.

Now, 64-bit XIDs are necessary to get us there, but they are not sufficient.  One also needs to fix the way we handle this sort of problem.  There is existing logic to warn when we are approaching XID wraparound.  This should be changed to check how many XIDs we have used rather than how many remain, with a sensible default (optionally configurable).

I agree it is not vacuum's responsibility.  Avoiding the more serious problems that could arise from this change is the job of the warnings we already have; those should be adjusted rather than dropped.
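For concreteness, that kind of check can be sketched against the existing catalogs; the two-billion figure below is just the example default suggested earlier in the thread, not an existing setting, and the aliases are illustrative:

-- Number of XIDs between each database's datfrozenxid and the next XID,
-- i.e. roughly how many XIDs have been consumed since the last full freeze,
-- flagged against a hypothetical warning threshold of 2 billion.
SELECT datname,
       age(datfrozenxid)               AS xids_consumed,
       age(datfrozenxid) > 2000000000  AS past_warning_threshold
FROM   pg_database
ORDER  BY age(datfrozenxid) DESC;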


> 2. The checks and handlers for an event that is never expected within the cluster's lifetime (~200 years at a constant rate of 1e6 TPS) can simply be dropped. Of course, we still need automatic routine maintenance such as truncating SLRU buffers (though, e.g., at a much bigger interval if we have plenty of disk space). But I consider that we need not care what happens to the cluster after more than 200 years (it will have been migrated many times before then, for many reasons not related to Postgres, even for the most conservative owners). So the radical proposal is to drop 64-bit wraparound handling altogether. The more moderate one is just not to worry much that after 200 years there would be more hassle than there would be next month if we haven't set everything up correctly. Next month's pain will be more significant even if it teaches the DBA something.

> Big thanks for your view on the general implementation of this feature, anyway.
>
> Kind regards,
> Pavel Borisov.
> Supabase
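As a back-of-the-envelope scale check on the lifetime argument above (simple arithmetic, not a figure from the thread), here is how long it would take to consume even half of the 64-bit XID space at a sustained 10^6 TPS, written as a plain SQL calculation:

-- Years to burn through 2^63 XIDs at 1,000,000 transactions per second.
SELECT (2::numeric ^ 63) / 1000000 / (60 * 60 * 24 * 365) AS years_to_wraparound;
-- => roughly 290,000 years, i.e. far beyond any realistic cluster lifetime.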

