Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae - Mailing list pgsql-bugs
From | Matthias van de Meent |
---|---|
Subject | Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae |
Date | |
Msg-id | CAEze2WjMTh4KS0=QEQB-Jq+tDLPR+0+zVBMfVwSPK5A=WZa95Q@mail.gmail.com |
In response to | Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-bugs |
On Thu, 21 Mar 2024 at 17:15, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Mar 3, 2024 at 7:07 PM Noah Misch <noah@leadboat.com> wrote:
> > I figure Matthias's upthread theory is more likely than not to hold. If it
> > does hold, commit 1ccc1e05ae created a new corruption route. Hence, I'm
> > adding a v17 open item for commit 1ccc1e05ae.
>
> I need some help understanding what's going on here. I became aware of
> this thread because I took a look at the open items list.
>
> This email seems to have branched off of the thread for bug #17257,
> reported 2021-10-29. The antecedent of "Matthias's upthread theory" is
> unclear to me. These emails seem like the most relevant ones:
>
> https://www.postgresql.org/message-id/CAEze2Wj7O5tnM_U151Baxr5ObTJafwH%3D71_JEmgJV%2B6eBgjL7g%40mail.gmail.com
> https://www.postgresql.org/message-id/CAEze2WhxhEQEx%2Bc%2BCXoDpQs1H1HgkYUK4BW-hFw5_eQxuVWqRw%40mail.gmail.com
> https://www.postgresql.org/message-id/20240106202413.e5%40rfd.leadboat.com
>
> But I'm having a hard time piecing it all together. The general
> picture seems to be that pruning and vacuum disagree about whether a
> particular tuple is prunable; before 1ccc1e05ae, that caused the retry
> loop in heap_page_prune() to retry forever. Now, it causes
> relfrozenxid to be set to too new a value, which is a data-corruption
> scenario. If that's right, I'm slightly miffed to find this being
> labeled as an open item, since that makes it seem like 1ccc1e05ae
> didn't create any new problem but only caused existing defects in the
> GlobalVisTest machinery to have different consequences. Perhaps it's
> all for the best, though. It's kind of embarrassing that we haven't
> fixed whatever the problem is here yet.
>
> But what exactly is the problem, and what's the fix? In the first of
> the emails linked above, Matthias argues that the problem is that
> GlobalVisState->maybe_needed can move backward. Peter Geoghegan seems
> to agree with that here:
>
> https://www.postgresql.org/message-id/CAH2-Wzk_L7Z7LREHTtg5vY08eeWdnHO70m98eWx4U1uwvW%3D0sA%40mail.gmail.com
>
> And Peter seems to have been trying to make sense of Andres's remarks
> here, which I think are saying the same thing:
>
> https://www.postgresql.org/message-id/20210616192202.6q63mu66h4uyn343%40alap3.anarazel.de
>
> So it seems like Matthias, Peter, and Andres all agree that
> GlobalVisState->maybe_needed going backward is bad and causes this
> problem. Unfortunately, I don't understand the mechanism.

There are 2 mechanisms I know of which allow this value to go backwards:

1. Replication slots that connect may set their backend's xmin to an
   xmin < GlobalXmin. This is known and has been documented, and was
   considered OK when this was discussed on the list previously.
2. The commit abort path has a short window in which the backend's
   xmin is unset and does not mirror the xmin of registered snapshots.
   This is what I described in [0], and may be the worst (?) offender.

> --
> Robert Haas
> EDB: http://www.enterprisedb.com

[0] https://www.postgresql.org/message-id/CAEze2Wj%2BV0kTx86xB_YbyaqTr5hnE_igdWAwuhSyjXBYscf5-Q%40mail.gmail.com
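
To picture the second mechanism, here is a toy model in Python. It is not PostgreSQL code: the `horizon` and `simulate` names, the XID values, and the assumption that the horizon is simply the minimum of the xmins currently advertised by backends are all made up for illustration, and XID wraparound is ignored. It only shows how a horizon computed this way can advance while one backend's xmin is briefly unset and then move backward once that xmin is advertised again.

```python
from typing import Optional


def horizon(backend_xmins: list[Optional[int]], next_xid: int) -> int:
    """Return the oldest advertised xmin, or next_xid if none is advertised.

    None stands in for a backend whose xmin is currently unset.
    """
    advertised = [xmin for xmin in backend_xmins if xmin is not None]
    return min(advertised, default=next_xid)


def simulate() -> list[int]:
    """Observe the horizon before, during, and after the unset window."""
    next_xid = 200
    backends: list[Optional[int]] = [100, 150]  # backend 0 needs XID 100

    observed = [horizon(backends, next_xid)]

    # Hypothetical window: backend 0's xmin is unset even though it
    # still has a registered snapshot that needs XID 100.
    backends[0] = None
    observed.append(horizon(backends, next_xid))

    # The backend advertises its xmin again; the horizon moves backward.
    backends[0] = 100
    observed.append(horizon(backends, next_xid))

    return observed


if __name__ == "__main__":
    print(simulate())
```

In this model, anything that acted on the horizon observed in the middle step (for example, by treating XIDs below it as no longer needed) would disagree with a later observer that sees the older value again.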