Thread: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564
v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564
From
Justin Pryzby
Date:
I hit this assertion while pg_restoring data into a v16 instance. postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64 wal_level=minimal and pg_dump --single-transaction both seem to be required to hit the issue. $ /usr/pgsql-16/bin/postgres -D ./pg16test -c maintenance_work_mem=1GB -c max_wal_size=16GB -c wal_level=minimal -c max_wal_senders=0-c port=5678 -c logging_collector=no & $ time sudo -u postgres /usr/pgsql-16/bin/pg_restore -d postgres -p 5678 --single-transaction --no-tablespace ./curtables TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564 Core was generated by `postgres: postgres postgres [local] COMMIT '. Program terminated with signal 6, Aborted. #0 0x00007f28b8bd5387 in raise () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64 (gdb) bt #0 0x00007f28b8bd5387 in raise () from /lib64/libc.so.6 #1 0x00007f28b8bd6a78 in abort () from /lib64/libc.so.6 #2 0x00000000009bc8c9 in ExceptionalCondition (conditionName=conditionName@entry=0xa373e1 "size > SizeOfXLogRecord", fileName=fileName@entry=0xa31b13"xlog.c", lineNumber=lineNumber@entry=1055) at assert.c:66 #3 0x000000000057b049 in ReserveXLogInsertLocation (PrevPtr=0x2e3d750, EndPos=<synthetic pointer>, StartPos=<synthetic pointer>,size=24) at xlog.c:1055 #4 XLogInsertRecord (rdata=rdata@entry=0xf187a0 <hdr_rdt>, fpw_lsn=fpw_lsn@entry=0, flags=<optimized out>, num_fpi=num_fpi@entry=0,topxid_included=topxid_included@entry=false) at xlog.c:844 #5 0x000000000058210c in XLogInsert (rmid=rmid@entry=0 '\000', info=info@entry=176 '\260') at xloginsert.c:510 #6 0x0000000000582b09 in log_newpage_range (rel=rel@entry=0x2e1f628, forknum=forknum@entry=FSM_FORKNUM, startblk=startblk@entry=0,endblk=endblk@entry=3, page_std=page_std@entry=false) at xloginsert.c:1317 #7 0x00000000005d02f9 in smgrDoPendingSyncs (isCommit=isCommit@entry=true, isParallelWorker=isParallelWorker@entry=false)at storage.c:837 #8 0x0000000000571637 in CommitTransaction () at xact.c:2225 #9 0x0000000000572b25 in CommitTransactionCommand () at xact.c:3201 #10 0x000000000086afc7 in finish_xact_command () at postgres.c:2782 #11 0x000000000086d7e1 in exec_simple_query (query_string=0x2dec4f8 "COMMIT") at postgres.c:1307
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564
From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes: > I hit this assertion while pg_restoring data into a v16 instance. > postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64 > wal_level=minimal and pg_dump --single-transaction both seem to be > required to hit the issue. Hmm. I wonder if log_newpages() is confused here: XLogEnsureRecordSpace(XLR_MAX_BLOCK_ID - 1, 0); Why is XLR_MAX_BLOCK_ID - 1 enough, rather than XLR_MAX_BLOCK_ID? regards, tom lane
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564
From
Matthias van de Meent
Date:
On Mon, 17 Apr 2023 at 17:53, Justin Pryzby <pryzby@telsasoft.com> wrote: > > I hit this assertion while pg_restoring data into a v16 instance. > postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64 > > wal_level=minimal and pg_dump --single-transaction both seem to be > required to hit the issue. > > $ /usr/pgsql-16/bin/postgres -D ./pg16test -c maintenance_work_mem=1GB -c max_wal_size=16GB -c wal_level=minimal -c max_wal_senders=0-c port=5678 -c logging_collector=no & > > $ time sudo -u postgres /usr/pgsql-16/bin/pg_restore -d postgres -p 5678 --single-transaction --no-tablespace ./curtables > > TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564 > > Core was generated by `postgres: postgres postgres [local] COMMIT '. > Program terminated with signal 6, Aborted. > #0 0x00007f28b8bd5387 in raise () from /lib64/libc.so.6 > Missing separate debuginfos, use: debuginfo-install postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64 > (gdb) bt > #0 0x00007f28b8bd5387 in raise () from /lib64/libc.so.6 > #1 0x00007f28b8bd6a78 in abort () from /lib64/libc.so.6 > #2 0x00000000009bc8c9 in ExceptionalCondition (conditionName=conditionName@entry=0xa373e1 "size > SizeOfXLogRecord", fileName=fileName@entry=0xa31b13"xlog.c", lineNumber=lineNumber@entry=1055) at assert.c:66 > #3 0x000000000057b049 in ReserveXLogInsertLocation (PrevPtr=0x2e3d750, EndPos=<synthetic pointer>, StartPos=<syntheticpointer>, size=24) at xlog.c:1055 > #4 XLogInsertRecord (rdata=rdata@entry=0xf187a0 <hdr_rdt>, fpw_lsn=fpw_lsn@entry=0, flags=<optimized out>, num_fpi=num_fpi@entry=0,topxid_included=topxid_included@entry=false) at xlog.c:844 > #5 0x000000000058210c in XLogInsert (rmid=rmid@entry=0 '\000', info=info@entry=176 '\260') at xloginsert.c:510 > #6 0x0000000000582b09 in log_newpage_range (rel=rel@entry=0x2e1f628, forknum=forknum@entry=FSM_FORKNUM, startblk=startblk@entry=0,endblk=endblk@entry=3, page_std=page_std@entry=false) at xloginsert.c:1317 Looking at log_newpage_range, it seems like we're always trying to log a record if startblk < endblk; but don't register the PageIsNew() buffers in the range. That means that if the last buffers in the range are new, this can result in no buffers being registered in the last iteration of the main loop (if the number of non-new buffers in the range is 0 (mod 32)). A change like attached should fix the issue; or alternatively we could force log the last (new) buffer when we detect this edge case. Kind regards, Matthias van de Meent
Attachment
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564
From
Tom Lane
Date:
I wrote: > Hmm. I wonder if log_newpages() is confused here: > XLogEnsureRecordSpace(XLR_MAX_BLOCK_ID - 1, 0); Oh, no, it's simpler than that: log_newpage_range is trying to log zero page images, and ReserveXLogInsertLocation doesn't like that because every WAL record should contain some data. Will fix, thanks for report. regards, tom lane
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564
From
Tom Lane
Date:
Matthias van de Meent <boekewurm+postgres@gmail.com> writes: > Looking at log_newpage_range, it seems like we're always trying to log > a record if startblk < endblk; but don't register the PageIsNew() > buffers in the range. That means that if the last buffers in the range > are new, this can result in no buffers being registered in the last > iteration of the main loop (if the number of non-new buffers in the > range is 0 (mod 32)). Yeah, I just came to the same conclusion. One thing I don't understand yet: log_newpage_range is old (it looks like this back to v12), and that Assert is older, so why doesn't this reproduce further back? Maybe the state where all the pages are new didn't happen before? Is that telling us there's a bug somewhere else? Seems like a job for git bisect. To be clear: log_newpage_range is certainly broken, and your fix looks appropriate. I'm just wondering what else we need to learn here. regards, tom lane
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564
From
Tom Lane
Date:
I wrote: > Yeah, I just came to the same conclusion. One thing I don't understand > yet: log_newpage_range is old (it looks like this back to v12), and > that Assert is older, so why doesn't this reproduce further back? > Maybe the state where all the pages are new didn't happen before? Bingo: bisecting shows the failure started at commit 3d6a98457d8e21d85bed86cfd3e1d1df1b260721 Author: Andres Freund <andres@anarazel.de> Date: Wed Apr 5 08:19:39 2023 -0700 Don't initialize page in {vm,fsm}_extend(), not needed So previously, log_newpage_range could only have failed in very unlikely circumstances, whereas now it's not hard to hit when committing a table creation. I wonder what other bugs may be lurking. I'll patch it back to v12 anyway, since that function is clearly wrong in isolation. regards, tom lane
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564
From
Andres Freund
Date:
Hi, On 2023-04-17 12:13:41 -0400, Tom Lane wrote: > Matthias van de Meent <boekewurm+postgres@gmail.com> writes: > > Looking at log_newpage_range, it seems like we're always trying to log > > a record if startblk < endblk; but don't register the PageIsNew() > > buffers in the range. That means that if the last buffers in the range > > are new, this can result in no buffers being registered in the last > > iteration of the main loop (if the number of non-new buffers in the > > range is 0 (mod 32)). > > Yeah, I just came to the same conclusion. One thing I don't understand > yet: log_newpage_range is old (it looks like this back to v12), and > that Assert is older, so why doesn't this reproduce further back? > Maybe the state where all the pages are new didn't happen before? > Is that telling us there's a bug somewhere else? Seems like a job > for git bisect. One plausible explanation is that bulk relation extension has made it more likely to encounter this scenario. We had some bulk extension code before, but it was triggered purely based on contention - quite unlikely in simple test scenarios - but now we also bulk extend if we know that we'll insert multiple pages (when coming from heap_multi_insert(), with sufficient data). Greetings, Andres Freund
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564
From
Andres Freund
Date:
Hi, On 2023-04-17 13:50:30 -0400, Tom Lane wrote: > I wrote: > > Yeah, I just came to the same conclusion. One thing I don't understand > > yet: log_newpage_range is old (it looks like this back to v12), and > > that Assert is older, so why doesn't this reproduce further back? > > Maybe the state where all the pages are new didn't happen before? > > Bingo: bisecting shows the failure started at > > commit 3d6a98457d8e21d85bed86cfd3e1d1df1b260721 > Author: Andres Freund <andres@anarazel.de> > Date: Wed Apr 5 08:19:39 2023 -0700 > > Don't initialize page in {vm,fsm}_extend(), not needed > > So previously, log_newpage_range could only have failed in very > unlikely circumstances, whereas now it's not hard to hit when > committing a table creation. I wonder what other bugs may be > lurking. Oh, interesting. We haven't initialized the extra pages added by RelationAddExtraBlocks() (in <= 15) for quite a while now, so I'm a bit surprised it causes more issues for the VM / FSM. I guess it's that it's quite common in real workloads to contend on the extension lock and add extra blocks, but not in simple single-threaded tests? Greetings, Andres Freund
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564
From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes: > On 2023-04-17 13:50:30 -0400, Tom Lane wrote: >> So previously, log_newpage_range could only have failed in very >> unlikely circumstances, whereas now it's not hard to hit when >> committing a table creation. I wonder what other bugs may be >> lurking. > Oh, interesting. We haven't initialized the extra pages added by > RelationAddExtraBlocks() (in <= 15) for quite a while now, so I'm a bit > surprised it causes more issues for the VM / FSM. I guess it's that it's quite > common in real workloads to contend on the extension lock and add extra > blocks, but not in simple single-threaded tests? I haven't tried hard to run it to ground, but maybe log_newpage_range isn't used in that code path? Seems like we'd have detected this before now if the case were reachable without any crash involved. regards, tom lane
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564
From
Justin Pryzby
Date:
On Mon, Apr 17, 2023 at 01:50:30PM -0400, Tom Lane wrote: > I wrote: > > Yeah, I just came to the same conclusion. One thing I don't understand > > yet: log_newpage_range is old (it looks like this back to v12), and > > that Assert is older, so why doesn't this reproduce further back? > > Maybe the state where all the pages are new didn't happen before? > > Bingo: bisecting shows the failure started at Just curious: what "test" did you use to bisect with ?
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564
From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes: > On Mon, Apr 17, 2023 at 01:50:30PM -0400, Tom Lane wrote: >> Bingo: bisecting shows the failure started at > Just curious: what "test" did you use to bisect with ? The test case I used looked like start postmaster with -c wal_level=minimal -c max_wal_senders=0 make installcheck-parallel psql -d regression -c "do 'begin for i in 1..1000 loop execute ''create table lots''||i||'' as select * from onek''; endloop; end';" pg_dump -Fc -Z0 regression >~/regression.dump createdb r2 pg_restore -d r2 --single-transaction --no-tablespace ~/regression.dump Dumping the regression database as-is didn't reproduce it for me, but after I added a bunch more tables it did reproduce. (I added the -Z0 bit after some of the bisection test points hit the interval where somebody had broken pg_dump's compression features. It didn't seem relevant to the problem so I just disabled that.) regards, tom lane