Re: MultiXact\SLRU buffers configuration - Mailing list pgsql-hackers
From | Thom Brown
Subject | Re: MultiXact\SLRU buffers configuration
Date |
Msg-id | CAA-aLv7AcTh+O9rnoPVbck=Fw8atMOYrUttk5G9ctFhXOsqQbw@mail.gmail.com
In response to | Re: MultiXact\SLRU buffers configuration (Thom Brown <thom@linux.com>)
List | pgsql-hackers
On Tue, 29 Oct 2024 at 16:38, Thom Brown <thom@linux.com> wrote:
>
> On Fri, 23 Aug 2024 at 01:29, Michael Paquier <michael@paquier.xyz> wrote:
>>
>> On Thu, Aug 22, 2024 at 10:36:38AM -0400, Alvaro Herrera wrote:
>> > On 2024-Aug-22, Michael Paquier wrote:
>> >> I'm not sure that we need to get down to that until somebody has a
>> >> case where they want to rely on stats of injection points for their
>> >> stuff. At this stage, I only want the stats to be enabled to provide
>> >> automated checks for the custom pgstats APIs, so disabling it by
>> >> default and enabling it only in the stats test of the module
>> >> injection_points sounds kind of enough to me for now.
>> >
>> > Oh! I thought the stats were useful by themselves.
>>
>> Yep, currently they're not, but I don't want to discard that they'll
>> never be, either. Perhaps there would be a case where somebody would
>> like to run a callback N times and trigger a condition? That's
>> something where the stats could be useful, but I don't have a specific
>> case for that now. I'm just imagining possibilities.
>
>
> I believe I am seeing the problem being discussed occurring on a production system running 15.6, causing ever-increasing replay lag on the standby, until I cancel the offending process on the standby and force it to process its interrupts.
>
> Here's the backtrace before I do that:
>
> #0  0x00007f4503b81876 in select () from /lib64/libc.so.6
> #1  0x0000558b0956891a in pg_usleep (microsec=microsec@entry=1000) at pgsleep.c:56
> #2  0x0000558b0917e01a in GetMultiXactIdMembers (from_pgupgrade=false, onlyLock=<optimized out>, members=0x7ffcd2a9f1e0, multi=109187502) at multixact.c:1392
> #3  GetMultiXactIdMembers (multi=109187502, members=members@entry=0x7ffcd2a9f1e0, from_pgupgrade=from_pgupgrade@entry=false, onlyLock=onlyLock@entry=false) at multixact.c:1224
> #4  0x0000558b0913de15 in MultiXactIdGetUpdateXid (xmax=<optimized out>, t_infomask=<optimized out>) at heapam.c:6924
> #5  0x0000558b09146028 in HeapTupleGetUpdateXid (tuple=tuple@entry=0x7f440d428308) at heapam.c:6965
> #6  0x0000558b0914c02f in HeapTupleSatisfiesMVCC (htup=0x558b0b7cbf20, htup=0x558b0b7cbf20, buffer=8053429, snapshot=0x558b0b63a2d8) at heapam_visibility.c:1089
> #7  HeapTupleSatisfiesVisibility (tup=tup@entry=0x7ffcd2a9f2b0, snapshot=snapshot@entry=0x558b0b63a2d8, buffer=buffer@entry=8053429) at heapam_visibility.c:1771
> #8  0x0000558b0913e819 in heapgetpage (sscan=sscan@entry=0x558b0b7ccfa0, page=page@entry=115) at heapam.c:468
> #9  0x0000558b0913eb7e in heapgettup_pagemode (scan=scan@entry=0x558b0b7ccfa0, dir=ForwardScanDirection, nkeys=0, key=0x0) at heapam.c:1120
> #10 0x0000558b0913fb5e in heap_getnextslot (sscan=0x558b0b7ccfa0, direction=<optimized out>, slot=0x558b0b7cc000) at heapam.c:1352
> #11 0x0000558b092c0e7a in table_scan_getnextslot (slot=0x558b0b7cc000, direction=ForwardScanDirection, sscan=<optimized out>) at ../../../src/include/access/tableam.h:1046
> #12 SeqNext (node=0x558b0b7cbe10) at nodeSeqscan.c:80
> #13 0x0000558b0929b9bf in ExecScan (node=0x558b0b7cbe10, accessMtd=0x558b092c0df0 <SeqNext>, recheckMtd=0x558b092c0dc0 <SeqRecheck>) at execScan.c:198
> #14 0x0000558b09292cb2 in ExecProcNode (node=0x558b0b7cbe10) at ../../../src/include/executor/executor.h:262
> #15 ExecutePlan (execute_once=<optimized out>, dest=0x558b0bca1350, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x558b0b7cbe10, estate=0x558b0b7cbbe8) at execMain.c:1636
> #16 standard_ExecutorRun (queryDesc=0x558b0b8c9798, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:363
> #17 0x00007f44f64d43c5 in pgss_ExecutorRun (queryDesc=0x558b0b8c9798, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:1010
> #18 0x0000558b093fda0f in PortalRunSelect (portal=portal@entry=0x558b0b6ba458, forward=forward@entry=true, count=0, count@entry=9223372036854775807, dest=dest@entry=0x558b0bca1350) at pquery.c:924
> #19 0x0000558b093fedb8 in PortalRun (portal=portal@entry=0x558b0b6ba458, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x558b0bca1350, altdest=altdest@entry=0x558b0bca1350, qc=0x7ffcd2a9f7a0) at pquery.c:768
> #20 0x0000558b093fb243 in exec_simple_query (query_string=0x558b0b6170c8 "<redacted>") at postgres.c:1250
> #21 0x0000558b093fd412 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4598
> #22 0x0000558b0937e170 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4514
> #23 BackendStartup (port=<optimized out>) at postmaster.c:4242
> #24 ServerLoop () at postmaster.c:1809
> #25 0x0000558b0937f147 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x558b0b5d0a30) at postmaster.c:1481
> #26 0x0000558b09100a2c in main (argc=5, argv=0x558b0b5d0a30) at main.c:202
>
> This occurred twice, meaning 2 processes needed terminating.

Taking a look at what's happening under the hood, it seems to be getting stuck here:

    if (nextMXOffset == 0)
    {
        /* Corner case 2: next multixact is still being filled in */
        LWLockRelease(MultiXactOffsetSLRULock);
        CHECK_FOR_INTERRUPTS();
        pg_usleep(1000L);
        goto retry;
    }

It clearly checks for interrupts, but when I saw this issue happen, it wasn't interruptible.

I looked around for similar reports, and I found a fork of Postgres reporting the same issue, but with more info than I thought to get at the time. They determined that the root cause involved a deadlock where the replay thread is stuck due to an out-of-order multixact playback sequence: the standby process waits indefinitely for a page pin to be released before proceeding, but the required multixact ID hasn't yet been replayed due to log ordering.

What they did to "fix" the issue in their version was to add an error when the reading thread cannot retrieve the next multixact ID, allowing it to exit instead of becoming indefinitely stuck. This took the form of only allowing it to loop 100 times before exiting.

Thom
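P.S. To make the shape of that change easier to picture, here's a rough sketch of how the retry loop quoted above could be capped. This is only my guess at what such a fix might look like, not the fork's actual patch; the "retries" counter, the cap of 100, and the error wording are all assumptions:

    /*
     * Sketch only, not the fork's actual change.  "retries" would be a new
     * local variable, initialised to 0 near the top of
     * GetMultiXactIdMembers(); the cap of 100 comes from their description,
     * and the message text is made up.
     */
    if (nextMXOffset == 0)
    {
        /* Corner case 2: next multixact is still being filled in */
        LWLockRelease(MultiXactOffsetSLRULock);
        CHECK_FOR_INTERRUPTS();

        if (++retries >= 100)
            ereport(ERROR,
                    (errcode(ERRCODE_INTERNAL_ERROR),
                     errmsg("offset for multixact %u is still unset after %d retries",
                            multi, retries)));

        pg_usleep(1000L);
        goto retry;
    }

Erroring out like that at least leaves the backend in a state it can recover from, rather than sleeping forever, though it obviously doesn't address the underlying replay-ordering problem.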