Re: refactoring relation extension and BufferAlloc(), faster COPY - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: refactoring relation extension and BufferAlloc(), faster COPY
Date
Msg-id 0b5eb82b-cb99-e0a4-b932-3dc60e2e3926@gmail.com
Whole thread Raw
In response to Re: refactoring relation extension and BufferAlloc(), faster COPY  (Andres Freund <andres@anarazel.de>)
Responses Re: refactoring relation extension and BufferAlloc(), faster COPY
List pgsql-hackers
Hi Andres,

07.04.2023 11:39, Andres Freund wrote:
> Hi,
>
> On 2023-04-06 18:15:14 -0700, Andres Freund wrote:
>> I think it might be worth having a C test for some of the bufmgr.c API. Things
>> like testing that retrying a failed relation extension works the second time
>> round.
> A few hours after this I hit a stupid copy-pasto (21d7c05a5cf) that would
> hopefully have been uncovered by such a test...

A few days later I've found a new defect introduced with 31966b151.
The following script:
echo "
CREATE TABLE tbl(id int);
INSERT INTO tbl(id) SELECT i FROM generate_series(1, 1000) i;
DELETE FROM tbl;
CHECKPOINT;
" | psql -q

sleep 2

grep -C2 'automatic vacuum of table ".*.tbl"' server.log

tf=$(psql -Aqt -c "SELECT format('%s/%s', pg_database.oid, relfilenode) FROM pg_database, pg_class WHERE datname = 
current_database() AND relname = 'tbl'")
ls -l "$PGDB/base/$tf"

pg_ctl -D $PGDB stop -m immediate
pg_ctl -D $PGDB -l server.log start

with the autovacuum enabled as follows:
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_naptime = 1

gives:
2023-04-11 20:56:56.261 MSK [675708] LOG:  checkpoint starting: immediate force wait
2023-04-11 20:56:56.324 MSK [675708] LOG:  checkpoint complete: wrote 900 buffers (5.5%); 0 WAL file(s) added, 0 
removed, 0 recycled; write=0.016 s, sync=0.034 s, total=0.063 s; sync files=252, longest=0.017 s, average=0.001 s; 
distance=4162 kB, estimate=4162 kB; lsn=0/1898588, redo lsn=0/1898550
2023-04-11 20:56:57.558 MSK [676060] LOG:  automatic vacuum of table "testdb.public.tbl": index scans: 0
         pages: 5 removed, 0 remain, 5 scanned (100.00% of total)
         tuples: 1000 removed, 0 remain, 0 are dead but not yet removable
-rw------- 1 law law 0 апр 11 20:56 .../tmpdb/base/16384/16385
waiting for server to shut down.... done
server stopped
waiting for server to start.... stopped waiting
pg_ctl: could not start server
Examine the log output.

server stops with the following stack trace:
Core was generated by `postgres: startup recovering 000000010000000000000001                         '.
Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/790626' in core file too small.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140209454906240, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f850ec53476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f850ec397f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000557950889c0b in ExceptionalCondition (
     conditionName=0x557950a67680 "mode == RBM_NORMAL || mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_ON_ERROR",
     fileName=0x557950a673e8 "bufmgr.c", lineNumber=1008) at assert.c:66
#6  0x000055795064f2d0 in ReadBuffer_common (smgr=0x557952739f38, relpersistence=112 'p', forkNum=MAIN_FORKNUM,
     blockNum=4294967295, mode=RBM_ZERO_AND_CLEANUP_LOCK, strategy=0x0, hit=0x7fff22dd648f) at bufmgr.c:1008
#7  0x000055795064ebe7 in ReadBufferWithoutRelcache (rlocator=..., forkNum=MAIN_FORKNUM, blockNum=4294967295,
     mode=RBM_ZERO_AND_CLEANUP_LOCK, strategy=0x0, permanent=true) at bufmgr.c:800
#8  0x000055795021c0fa in XLogReadBufferExtended (rlocator=..., forknum=MAIN_FORKNUM, blkno=0,
     mode=RBM_ZERO_AND_CLEANUP_LOCK, recent_buffer=0) at xlogutils.c:536
#9  0x000055795021bd92 in XLogReadBufferForRedoExtended (record=0x5579526c4998, block_id=0 '\000', mode=RBM_NORMAL,
     get_cleanup_lock=true, buf=0x7fff22dd6598) at xlogutils.c:391
#10 0x00005579501783b1 in heap_xlog_prune (record=0x5579526c4998) at heapam.c:8726
#11 0x000055795017b7db in heap2_redo (record=0x5579526c4998) at heapam.c:9960
#12 0x0000557950215b34 in ApplyWalRecord (xlogreader=0x5579526c4998, record=0x7f85053d0120, replayTLI=0x7fff22dd6720)
     at xlogrecovery.c:1915
#13 0x0000557950215611 in PerformWalRecovery () at xlogrecovery.c:1746
#14 0x0000557950201ce3 in StartupXLOG () at xlog.c:5433
#15 0x00005579505cb6d2 in StartupProcessMain () at startup.c:267
#16 0x00005579505be9f7 in AuxiliaryProcessMain (auxtype=StartupProcess) at auxprocess.c:141
#17 0x00005579505ca2b5 in StartChildProcess (type=StartupProcess) at postmaster.c:5369
#18 0x00005579505c5224 in PostmasterMain (argc=3, argv=0x5579526c3e70) at postmaster.c:1455
#19 0x000055795047a97d in main (argc=3, argv=0x5579526c3e70) at main.c:200

As I can see, autovacuum removes pages from the table file, and this causes
the crash while replaying the record:
rmgr: Heap2       len (rec/tot):     60/   988, tx:          0, lsn: 0/01898600, prev 0/01898588, desc: PRUNE 
snapshotConflictHorizon 732 nredirected 0 ndead 226, blkref #0: rel 1663/16384/16385 blk 0 FPW

Best regards,
Alexander



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: fairywren exiting in ecpg
Next
From: Andrew Dunstan
Date:
Subject: Re: [PATCH] Use role name "system_user" instead of "user" for unsafe_tests