Thread: Re: PITR promote bug: Checkpointer writes to older timeline

Re: PITR promote bug: Checkpointer writes to older timeline

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> I have been working on that over the last couple of days, and applied
> a fix down to 10.  One thing that I did not like in the test was the
> use of compare() to check if the contents of the WAL segment before
> and after the timeline jump remained the same as this would have been
> unstable with any concurrent activity.  Instead, I have added a phase
> at the end of the test with an extra checkpoint and recovery triggered
> once, which is enough to reproduce the PANIC reported at the top of
> the thread.

Buildfarm member hornet just reported a failure in this test:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2021-06-27%2013%3A40%3A57

the critical bit being

2021-06-27 17:35:46.504 UTC [11862234:1] [unknown] LOG:  connection received: host=[local]
2021-06-27 17:35:46.505 UTC [18350260:12] LOG:  recovering prepared transaction 734 from shared memory
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin)", File: "procarray.c", Line: 2492,
PID:11862234) 
2021-06-27 17:35:46.511 UTC [14876838:4] LOG:  database system is ready to accept connections

It's not clear whether this is a problem with the test case or an
actual server bug, but I'm leaning to the latter theory.  My gut
feel is it's some problem in the "snapshot scalability" work.  It
doesn't look the same as the known open issue, but maybe related?

            regards, tom lane



Re: PITR promote bug: Checkpointer writes to older timeline

From
Tom Lane
Date:
I wrote:
> Buildfarm member hornet just reported a failure in this test:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2021-06-27%2013%3A40%3A57
> It's not clear whether this is a problem with the test case or an
> actual server bug, but I'm leaning to the latter theory.  My gut
> feel is it's some problem in the "snapshot scalability" work.  It
> doesn't look the same as the known open issue, but maybe related?

Hmm, the plot thickens.  I scraped the buildfarm logs for similar-looking
assertion failures back to last August, when the snapshot scalability
patches went in.  The first such failure is not until 2021-03-24
(see attachment), and they all look to be triggered by
023_pitr_prepared_xact.pl.  It sure looks like recovering a prepared
transaction creates a transient state in which a new backend will
compute a broken snapshot.

            regards, tom lane

    sysname    | branch |      snapshot       |     stage     |
                                                     l
                                             

---------------+--------+---------------------+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 calliphoridae | HEAD   | 2021-03-24 06:50:09 | recoveryCheck | TRAP:
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin,RecentXmin)", File:
"/mnt/resource/andres/bf/calliphoridae/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c",Line: 2463, PID:
3890215)
 francolin     | HEAD   | 2021-03-29 16:21:58 | recoveryCheck | TRAP:
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin,RecentXmin)", File:
"/mnt/resource/andres/bf/francolin/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c",Line: 2463, PID:
1861665)
 moonjelly     | HEAD   | 2021-04-01 15:25:38 | recoveryCheck | TRAP:
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin,RecentXmin)", File: "procarray.c", Line: 2463, PID:
2345153)
 francolin     | HEAD   | 2021-04-07 12:30:08 | recoveryCheck | TRAP:
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin,RecentXmin)", File:
"/mnt/resource/andres/bf/francolin/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c",Line: 2468, PID:
3257637)
 fairywren     | HEAD   | 2021-04-20 03:04:04 | recoveryCheck | TRAP:
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin,RecentXmin)", File:
"C:/tools/msys64/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c",Line: 2094, PID:
94824)
 mantid        | HEAD   | 2021-04-25 10:07:06 | recoveryCheck | TRAP:
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin,RecentXmin)", File: "procarray.c", Line: 2094, PID:
2820886)
 thorntail     | HEAD   | 2021-04-29 07:18:09 | recoveryCheck | TRAP:
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin,RecentXmin)", File:
"/home/nm/farm/sparc64_deb10_gcc_64_ubsan/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/procarray.c",Line: 2094,
PID:3099560) 
 mantid        | HEAD   | 2021-05-03 13:07:06 | recoveryCheck | TRAP:
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin,RecentXmin)", File: "procarray.c", Line: 2094, PID:
1163004)
 mantid        | HEAD   | 2021-05-10 01:07:07 | recoveryCheck | TRAP:
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin,RecentXmin)", File: "procarray.c", Line: 2468, PID:
2812704)
 hornet        | HEAD   | 2021-06-27 13:40:57 | recoveryCheck | TRAP:
FailedAssertion("TransactionIdPrecedesOrEquals(TransactionXmin,RecentXmin)", File: "procarray.c", Line: 2492, PID:
11862234)
(10 rows)


Re: PITR promote bug: Checkpointer writes to older timeline

From
Tom Lane
Date:
I wrote:
> It sure looks like recovering a prepared
> transaction creates a transient state in which a new backend will
> compute a broken snapshot.

Oh, after further digging this is the same issue discussed here:

https://www.postgresql.org/message-id/flat/20210422203603.fdnh3fu2mmfp2iov%40alap3.anarazel.de

            regards, tom lane