Re: pgsql: Improve runtime and output of tests for replication slots checkp - Mailing list pgsql-committers
From:           Alexander Korotkov
Subject:        Re: pgsql: Improve runtime and output of tests for replication slots checkp
Date:
Msg-id:         CAPpHfdtmLNpswWzk2qQ7-nMBLod2GrRWwMvfb0hc7tQfDD--ZQ@mail.gmail.com
In response to: Re: pgsql: Improve runtime and output of tests for replication slots checkp (Alexander Korotkov <aekorotkov@gmail.com>)
List:           pgsql-committers
On Sat, Jun 21, 2025 at 1:40 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> On Sat, Jun 21, 2025 at 1:25 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I wrote:
> > > But in the buildfarm failures I don't see any 'checkpoint complete'
> > > before the shutdown.
> >
> > Ooops, I lied: we have at least one case where the checkpoint does
> > finish but then it hangs up anyway:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=melonworm&dt=2025-06-20%2019%3A59%3A31&stg=recovery-check
> >
> > 2025-06-20 20:05:32.548 UTC [3171474][client backend][61/2:0] LOG: statement: select injection_points_wakeup('checkpoint-before-old-wal-removal')
> > 2025-06-20 20:05:32.550 UTC [3162113][checkpointer][:0] LOG: checkpoint complete: wrote 1 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.001 s, total=1.507 s; sync files=1, longest=0.001 s, average=0.001 s; distance=327688 kB, estimate=327688 kB; lsn=0/290020C0, redo lsn=0/29002068
> > 2025-06-20 20:05:32.551 UTC [3171474][client backend][:0] LOG: disconnection: session time: 0:00:00.073 user=bf database=postgres host=[local]
> > 2025-06-20 20:05:32.552 UTC [3170563][client backend][:0] LOG: disconnection: session time: 0:00:01.538 user=bf database=postgres host=[local]
> > 2025-06-20 20:05:32.564 UTC [3163446][client backend][7/146:0] LOG: statement: SELECT 1
> > 2025-06-20 20:05:32.564 UTC [3161873][postmaster][:0] LOG: received immediate shutdown request
> > 2025-06-20 20:05:32.604 UTC [3161873][postmaster][:0] LOG: database system is shut down
> >
> > I'm still confused about whether the 046 script intends to
> > sometimes test cases where the checkpoint doesn't have time to
> > complete. But it seems that whatever is going on here
> > is a bit subtle and platform-dependent.
>
> I think this indicates unfinished intention to wait for checkpoint
> completion. But I think both cases (checkpoint finished and
> unfinished) should work correctly. So, I believe there is a backend
> problem. I'm trying to reproduce this locally. Sorry for the
> confusion.
I managed to reproduce the bug locally. I've got the following backtrace:
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x000000019e6fd3c8 libsystem_kernel.dylib`__semwait_signal + 8
frame #1: 0x000000019e5de568 libsystem_c.dylib`nanosleep + 220
frame #2: 0x0000000103009628 postgres`pg_usleep(microsec=1000) at pgsleep.c:50:10
frame #3: 0x00000001027bd248 postgres`read_local_xlog_page_guts(state=0x000000011383ed58, targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360, cur_page="\U00000018\xd1\U00000005", wait_for_wal=true) at xlogutils.c:950:4
frame #4: 0x00000001027bd130 postgres`read_local_xlog_page(state=0x000000011383ed58, targetPagePtr=687882240, reqLen=277, targetRecPtr=687874360, cur_page="\U00000018\xd1\U00000005") at xlogutils.c:848:9
frame #5: 0x00000001027ae184 postgres`ReadPageInternal(state=0x000000011383ed58, pageptr=687882240, reqLen=277) at xlogreader.c:1061:12
frame #6: 0x00000001027acd14 postgres`XLogDecodeNextRecord(state=0x000000011383ed58, nonblocking=false) at xlogreader.c:727:14
frame #7: 0x00000001027ac6bc postgres`XLogReadAhead(state=0x000000011383ed58, nonblocking=false) at xlogreader.c:974:11
frame #8: 0x00000001027ac5c0 postgres`XLogReadRecord(state=0x000000011383ed58, errormsg=0x000000016d798dd0) at xlogreader.c:405:3
frame #9: 0x0000000102c1d1d0 postgres`pg_logical_slot_get_changes_guts(fcinfo=0x0000000113830700, confirm=true, binary=false) at logicalfuncs.c:257:13
frame #10: 0x0000000102c1ca00 postgres`pg_logical_slot_get_changes(fcinfo=0x0000000113830700) at logicalfuncs.c:333:9
frame #11: 0x00000001029e1adc postgres`ExecMakeTableFunctionResult(setexpr=0x0000000113825440, econtext=0x00000001138252e0, argContext=0x0000000113830600, expectedDesc=0x000000011382ecb0, randomAccess=false) at execSRF.c:234:13
While looking at frame 5, I found the following code being executed:
    targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
    targetRecOff = RecPtr % XLOG_BLCKSZ;

    /*
     * Read the page containing the record into state->readBuf. Request enough
     * byte to cover the whole record header, or at least the part of it that
     * fits on the same page.
     */
    readOff = ReadPageInternal(state, targetPagePtr,
                               Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
And I see the following variable values.
(lldb) p/x targetPagePtr
(XLogRecPtr) 0x0000000029004000
(lldb) p/x RecPtr
(XLogRecPtr) 0x0000000029002138
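For reference, a minimal standalone check of that arithmetic (assuming the default
XLOG_BLCKSZ of 8192 bytes; this is only an illustrative sketch, not PostgreSQL code)
gives targetPagePtr = 0x29002000 for this RecPtr, one block earlier than the
0x29004000 observed above:

    #include <stdio.h>
    #include <stdint.h>

    #define XLOG_BLCKSZ 8192    /* assuming the default WAL block size */

    int
    main(void)
    {
        /* values taken from the lldb session above */
        uint64_t    RecPtr = 0x29002138ULL;
        uint64_t    targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);

        /* prints: expected targetPagePtr = 0x29002000 */
        printf("expected targetPagePtr = 0x%llX\n",
               (unsigned long long) targetPagePtr);
        return 0;
    }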
I can hardly understand how this is possible, given that it was compiled with "-O0". I'm planning to continue investigating this tomorrow.
------
Regards,
Alexander Korotkov
Supabase