Re: Changing the state of data checksums in a running cluster - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: Changing the state of data checksums in a running cluster
Msg-id b6b0637d-3baf-4a4d-a3b7-9b3558a88d40@gmail.com
In response to Re: Changing the state of data checksums in a running cluster  (Daniel Gustafsson <daniel@yesql.se>)
Responses Re: Changing the state of data checksums in a running cluster
List pgsql-hackers
Hello Daniel,

04.04.2026 00:46, Daniel Gustafsson wrote:
After many more runs on CI I ended up pushing this version, and I see BF
members being angry due to the test not waiting for the launcher to exit.  I am
working on a fix right now.

Maybe this is already known or even expected, but I'd still like to let
you know that, starting from f19c0ecca, I'm observing checksum errors in a
running instance. To catch the errors sooner, I've modified PageIsVerified()
to raise PANIC instead of WARNING/LOG on a checksum failure:
@@ -158,7 +158,7 @@ PageIsVerified(PageData *page, BlockNumber blkno, int flags, bool *checksum_fail
     if (checksum_failure)
     {
         if ((flags & (PIV_LOG_WARNING | PIV_LOG_LOG)) != 0)
-            ereport(flags & PIV_LOG_WARNING ? WARNING : LOG,
+            ereport(PANIC,
                     (errcode(ERRCODE_DATA_CORRUPTED),
                      errmsg("page verification failed, calculated checksum %u but expected %u%s",
                             checksum, p->pd_checksum,

And I'm getting, e.g.:
2026-04-06 18:09:12.077 EEST|postgres|regress_215|69d3cc86.3bfbdc|PANIC:  page verification failed, calculated checksum 40178 but expected 50558, buffer will be zeroed
2026-04-06 18:09:12.077 EEST|postgres|regress_215|69d3cc86.3bfbdc|STATEMENT:  update information_schema.sql_features set
...

Core was generated by `postgres: postgres regress_215 127.0.0.1(42448) UPDATE        '.
Program terminated with signal SIGABRT, Aborted.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x0000796d0004527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x0000796d000288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x000055fe3f92c855 in errfinish (filename=filename@entry=0x55fe3fa54bad "bufpage.c", lineno=lineno@entry=161, funcname=funcname@entry=0x55fe3fb70ff8 <__func__.6> "PageIsVerified") at elog.c:620
#6  0x000055fe3f7c2415 in PageIsVerified (page=page@entry=0x796cf6884000 "", blkno=blkno@entry=0, flags=10, checksum_failure_p=checksum_failure_p@entry=0x7ffd2c524bef) at bufpage.c:161
#7  0x000055fe3f78a93d in buffer_readv_complete_one (zeroed_buffer=<synthetic pointer>, ignored_checksum=<synthetic pointer>, failed_checksum=0x7ffd2c524bef, buffer_invalid=<synthetic pointer>, is_temp=false, failed=false, flags=9 '\t', buffer=15424, buf_off=0 '\000', td=0x796cfc69c2d8) at bufmgr.c:8593
#8  buffer_readv_complete (is_temp=false, cb_data=<optimized out>, prior_result=..., ioh=<optimized out>) at bufmgr.c:8724
#9  shared_buffer_readv_complete (ioh=<optimized out>, prior_result=..., cb_data=<optimized out>) at bufmgr.c:8883
#10 0x000055fe3f77ec61 in pgaio_io_call_complete_shared (ioh=ioh@entry=0x796cfc69c260) at aio_callback.c:258
#11 0x000055fe3f77d4f6 in pgaio_io_process_completion (ioh=ioh@entry=0x796cfc69c260, result=<optimized out>) at aio.c:540
#12 0x000055fe3f77fe42 in pgaio_io_perform_synchronously (ioh=ioh@entry=0x796cfc69c260) at aio_io.c:146
#13 0x000055fe3f77e121 in pgaio_io_stage (ioh=ioh@entry=0x796cfc69c260, op=op@entry=PGAIO_OP_READV) at aio.c:476
#14 0x000055fe3f77fd6d in pgaio_io_start_readv (ioh=ioh@entry=0x796cfc69c260, fd=166, iovcnt=iovcnt@entry=1, offset=offset@entry=0) at aio_io.c:87
#15 0x000055fe3f795bae in FileStartReadV (ioh=ioh@entry=0x796cfc69c260, file=<optimized out>, iovcnt=iovcnt@entry=1, offset=offset@entry=0, wait_event_info=wait_event_info@entry=167772183) at fd.c:2225
#16 0x000055fe3f7c648b in mdstartreadv (ioh=0x796cfc69c260, reln=0x55fe73aeba98, forknum=VISIBILITYMAP_FORKNUM, blocknum=0, buffers=<optimized out>, nblocks=1) at md.c:1041
#17 0x000055fe3f7c809c in smgrstartreadv (ioh=ioh@entry=0x796cfc69c260, reln=<optimized out>, forknum=forknum@entry=VISIBILITYMAP_FORKNUM, blocknum=blocknum@entry=0, buffers=buffers@entry=0x7ffd2c524e70, nblocks=nblocks@entry=1) at smgr.c:758
#18 0x000055fe3f78a1c7 in AsyncReadBuffers (operation=operation@entry=0x7ffd2c5253a0, nblocks_progress=nblocks_progress@entry=0x7ffd2c52530c) at bufmgr.c:2144
#19 0x000055fe3f78ce19 in StartReadBuffersImpl (allow_forwarding=false, flags=9, nblocks=0x7ffd2c52530c, blockNum=0, buffers=0x7ffd2c52539c, operation=0x7ffd2c5253a0) at bufmgr.c:1548
#20 StartReadBuffer (operation=operation@entry=0x7ffd2c5253a0, buffer=buffer@entry=0x7ffd2c52539c, blocknum=blocknum@entry=0, flags=9) at bufmgr.c:1636
#21 0x000055fe3f78d870 in ReadBuffer_common (strategy=0x0, mode=RBM_ZERO_ON_ERROR, blockNum=0, forkNum=VISIBILITYMAP_FORKNUM, smgr_persistence=0 '\000', smgr=0x55fe73aeba98, rel=0x796d006a31a8) at bufmgr.c:1358
#22 ReadBufferExtended (reln=reln@entry=0x796d006a31a8, forkNum=forkNum@entry=VISIBILITYMAP_FORKNUM, blockNum=blockNum@entry=0, mode=mode@entry=RBM_ZERO_ON_ERROR, strategy=strategy@entry=0x0) at bufmgr.c:945
#23 0x000055fe3f3d7e00 in vm_readbuf (rel=rel@entry=0x796d006a31a8, blkno=blkno@entry=0, extend=extend@entry=true) at visibilitymap.c:577
#24 0x000055fe3f3d7fda in visibilitymap_pin (rel=rel@entry=0x796d006a31a8, heapBlk=<optimized out>, vmbuf=vmbuf@entry=0x55fe73ed2b18) at visibilitymap.c:216
#25 0x000055fe3f3d1f7a in heap_page_prune_opt (relation=0x796d006a31a8, buffer=buffer@entry=15403, vmbuffer=vmbuffer@entry=0x55fe73ed2b18, rel_read_only=false) at pruneheap.c:339
#26 0x000055fe3f3c1dcf in heap_prepare_pagescan (sscan=sscan@entry=0x55fe73ed2a88) at heapam.c:636
#27 0x000055fe3f3c242f in heapgettup_pagemode (scan=scan@entry=0x55fe73ed2a88, dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:1111
#28 0x000055fe3f3c27ab in heap_getnextslot (sscan=0x55fe73ed2a88, direction=<optimized out>, slot=0x55fe73ed13a8) at heapam.c:1467
#29 0x000055fe3f5e8d62 in table_scan_getnextslot (sscan=<optimized out>, direction=direction@entry=ForwardScanDirection, slot=slot@entry=0x55fe73ed13a8) at ../../../src/include/access/tableam.h:1099
#30 0x000055fe3f5e939e in SeqNext (node=0x55fe73ed1188) at nodeSeqscan.c:83
#31 ExecScanFetch (recheckMtd=0x55fe3f5e8d2e <SeqRecheck>, accessMtd=0x55fe3f5e8c9c <SeqNext>, epqstate=0x0, node=0x55fe73ed1188) at ../../../src/include/executor/execScan.h:135
#32 ExecScanExtended (projInfo=0x55fe73ed19d8, qual=0x0, epqstate=0x0, recheckMtd=0x55fe3f5e8d2e <SeqRecheck>, accessMtd=0x55fe3f5e8c9c <SeqNext>, node=0x55fe73ed1188) at ../../../src/include/executor/execScan.h:196
#33 ExecSeqScanWithProject (pstate=<optimized out>) at nodeSeqscan.c:164
#34 0x000055fe3f5b65f9 in ExecProcNodeFirst (node=0x55fe73ed1188) at execProcnode.c:470
#35 0x000055fe3f5dfd3b in ExecProcNode (node=node@entry=0x55fe73ed1188) at ../../../src/include/executor/executor.h:320
...


2026-04-06 18:09:12.289 EEST|postgres|regress_147|69d3cc44.3bfaaa|PANIC:  page verification failed, calculated checksum 8769 but expected 0
2026-04-06 18:09:12.289 EEST|postgres|regress_147|69d3cc44.3bfaaa|STATEMENT:  insert into information_schema.sql_features values (   
...

Core was generated by `postgres: postgres regress_147 127.0.0.1(35968) INSERT        '.
Program terminated with signal SIGABRT, Aborted.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x0000796d0004527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x0000796d000288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x000055fe3f92c855 in errfinish (filename=filename@entry=0x55fe3fa54bad "bufpage.c", lineno=lineno@entry=161, funcname=funcname@entry=0x55fe3fb70ff8 <__func__.6> "PageIsVerified") at elog.c:620
#6  0x000055fe3f7c2415 in PageIsVerified (page=page@entry=0x796cf0c80000 "", blkno=blkno@entry=21, flags=2, checksum_failure_p=checksum_failure_p@entry=0x7ffd2c526b1f) at bufpage.c:161
#7  0x000055fe3f78a93d in buffer_readv_complete_one (zeroed_buffer=<synthetic pointer>, ignored_checksum=<synthetic pointer>, failed_checksum=0x7ffd2c526b1f, buffer_invalid=<synthetic pointer>, is_temp=false, failed=false, flags=8 '\b', buffer=3646, buf_off=0 '\000', td=0x796cfc63e7c8) at bufmgr.c:8593
#8  buffer_readv_complete (is_temp=false, cb_data=<optimized out>, prior_result=..., ioh=<optimized out>) at bufmgr.c:8724
#9  shared_buffer_readv_complete (ioh=<optimized out>, prior_result=..., cb_data=<optimized out>) at bufmgr.c:8883
#10 0x000055fe3f77ec61 in pgaio_io_call_complete_shared (ioh=ioh@entry=0x796cfc63e750) at aio_callback.c:258
#11 0x000055fe3f77d4f6 in pgaio_io_process_completion (ioh=ioh@entry=0x796cfc63e750, result=<optimized out>) at aio.c:540
#12 0x000055fe3f77fe42 in pgaio_io_perform_synchronously (ioh=ioh@entry=0x796cfc63e750) at aio_io.c:146
#13 0x000055fe3f77e121 in pgaio_io_stage (ioh=ioh@entry=0x796cfc63e750, op=op@entry=PGAIO_OP_READV) at aio.c:476
#14 0x000055fe3f77fd6d in pgaio_io_start_readv (ioh=ioh@entry=0x796cfc63e750, fd=199, iovcnt=iovcnt@entry=1, offset=offset@entry=172032) at aio_io.c:87
#15 0x000055fe3f795bae in FileStartReadV (ioh=ioh@entry=0x796cfc63e750, file=<optimized out>, iovcnt=iovcnt@entry=1, offset=offset@entry=172032, wait_event_info=wait_event_info@entry=167772183) at fd.c:2225
#16 0x000055fe3f7c648b in mdstartreadv (ioh=0x796cfc63e750, reln=0x55fe73b833b8, forknum=MAIN_FORKNUM, blocknum=21, buffers=<optimized out>, nblocks=1) at md.c:1041
#17 0x000055fe3f7c809c in smgrstartreadv (ioh=ioh@entry=0x796cfc63e750, reln=<optimized out>, forknum=forknum@entry=MAIN_FORKNUM, blocknum=blocknum@entry=21, buffers=buffers@entry=0x7ffd2c526da0, nblocks=nblocks@entry=1) at smgr.c:758
#18 0x000055fe3f78a1c7 in AsyncReadBuffers (operation=operation@entry=0x7ffd2c5272d0, nblocks_progress=nblocks_progress@entry=0x7ffd2c52723c) at bufmgr.c:2144
#19 0x000055fe3f78ce19 in StartReadBuffersImpl (allow_forwarding=false, flags=8, nblocks=0x7ffd2c52723c, blockNum=21, buffers=0x7ffd2c5272cc, operation=0x7ffd2c5272d0) at bufmgr.c:1548
#20 StartReadBuffer (operation=operation@entry=0x7ffd2c5272d0, buffer=buffer@entry=0x7ffd2c5272cc, blocknum=blocknum@entry=21, flags=8) at bufmgr.c:1636
#21 0x000055fe3f78d870 in ReadBuffer_common (strategy=0x0, mode=RBM_NORMAL, blockNum=21, forkNum=MAIN_FORKNUM, smgr_persistence=0 '\000', smgr=0x55fe73b833b8, rel=0x796d0066aa18) at bufmgr.c:1358
#22 ReadBufferExtended (reln=0x796d0066aa18, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=21, mode=mode@entry=RBM_NORMAL, strategy=strategy@entry=0x0) at bufmgr.c:945
#23 0x000055fe3f3ce074 in ReadBufferBI (relation=relation@entry=0x796d0066aa18, targetBlock=targetBlock@entry=21, mode=mode@entry=RBM_NORMAL, bistate=bistate@entry=0x0) at hio.c:93
#24 0x000055fe3f3cea30 in RelationGetBufferForTuple (relation=relation@entry=0x796d0066aa18, len=24, otherBuffer=otherBuffer@entry=0, options=options@entry=0, bistate=bistate@entry=0x0, vmbuffer=vmbuffer@entry=0x7ffd2c527468, vmbuffer_other=0x0, num_pages=1) at hio.c:617
#25 0x000055fe3f3bcb50 in heap_insert (relation=relation@entry=0x796d0066aa18, tup=tup@entry=0x55fe73be9638, cid=cid@entry=0, options=options@entry=0, bistate=bistate@entry=0x0) at heapam.c:2179
#26 0x000055fe3f3c7c82 in heapam_tuple_insert (relation=0x796d0066aa18, slot=0x55fe73be9528, cid=0, options=0, bistate=0x0) at heapam_handler.c:267
#27 0x000055fe3f5e2da2 in table_tuple_insert (bistate=0x0, options=0, cid=<optimized out>, slot=0x55fe73be9528, rel=0x796d0066aa18) at ../../../src/include/access/tableam.h:1456
#28 ExecInsert (context=context@entry=0x7ffd2c527620, resultRelInfo=resultRelInfo@entry=0x55fe737c8b00, slot=0x55fe73be9528, canSetTag=true, inserted_tuple=inserted_tuple@entry=0x0, insert_destrel=insert_destrel@entry=0x0) at nodeModifyTable.c:1272
#29 0x000055fe3f5e5542 in ExecModifyTable (pstate=0x55fe737c88f0) at nodeModifyTable.c:4933
#30 0x000055fe3f5b65f9 in ExecProcNodeFirst (node=0x55fe737c88f0) at execProcnode.c:470
...

I can reproduce it rather easily (within 30 minutes) by running 600 instances
of "sqlsmith --max-queries=1000", each against its own empty database, on
my workstation with a Ryzen 7900. I think I can put together a self-contained
repro if needed. If you need more information or diagnostics, I'd be glad
to help.

Best regards,
Alexander
