Re: WIP: WAL prefetch (another approach) - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: WIP: WAL prefetch (another approach) |
Date | |
Msg-id | 111657.1650910309@sss.pgh.pa.us Whole thread Raw |
In response to | Re: WIP: WAL prefetch (another approach) (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: WIP: WAL prefetch (another approach)
|
List | pgsql-hackers |
I believe that the WAL prefetch patch probably accounts for the intermittent errors that buildfarm member topminnow has shown since it went in, eg [1]: diff -U3 /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out --- /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out 2022-04-10 03:05:15.972622440 +0200 +++ /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out 2022-04-25 05:09:49.861642059 +0200 @@ -34,11 +34,7 @@ (1 row) SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal(:'wal_lsn1'); - ok ----- - t -(1 row) - +ERROR: could not read WAL at 0/1903E40 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats(:'wal_lsn1', :'wal_lsn2'); ok ---- @@ -46,11 +42,7 @@ (1 row) SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal(:'wal_lsn1'); - ok ----- - t -(1 row) - +ERROR: could not read WAL at 0/1903E40 -- =================================================================== -- Test for filtering out WAL records of a particular table -- =================================================================== I've reproduced this manually on that machine, and confirmed that the proximate cause is that XLogNextRecord() is returning NULL because state->decode_queue_head == NULL, without bothering to provide an errormsg (which doesn't seem very well thought out in itself). I obtained the contents of the xlogreader struct at failure: (gdb) p *xlogreader $1 = {routine = {page_read = 0x594270 <read_local_xlog_page_no_wait>, segment_open = 0x593b44 <wal_segment_open>, segment_close = 0x593d38 <wal_segment_close>}, system_identifier = 0, private_data = 0x0, ReadRecPtr = 26230672, EndRecPtr = 26230752, abortedRecPtr = 26230752, missingContrecPtr = 26230784, overwrittenRecPtr = 0, DecodeRecPtr = 26230672, NextRecPtr = 26230752, PrevRecPtr = 0, record = 0x0, decode_buffer = 0xf25428 "\240", decode_buffer_size = 65536, free_decode_buffer = true, decode_buffer_head = 0xf25428 "\240", decode_buffer_tail = 0xf25428 "\240", decode_queue_head = 0x0, decode_queue_tail = 0x0, readBuf = 0xf173f0 "\020\321\005", readLen = 0, segcxt = { ws_dir = '\000' <repeats 1023 times>, ws_segsize = 16777216}, seg = { ws_file = 25, ws_segno = 0, ws_tli = 1}, segoff = 0, latestPagePtr = 26222592, latestPageTLI = 1, currRecPtr = 26230752, currTLI = 1, currTLIValidUntil = 0, nextTLI = 0, readRecordBuf = 0xf1b3f8 "<", readRecordBufSize = 40960, errormsg_buf = 0xef3270 "", errormsg_deferred = false, nonblocking = false} I don't have an intuition about where to look beyond that, any suggestions? What I do know so far is that while the failure reproduces fairly reliably under "make check" (more than half the time, which squares with topminnow's history), it doesn't reproduce at all under "make installcheck" (after removing NO_INSTALLCHECK), which seems odd. Maybe it's dependent on how much WAL history the installation has accumulated? It could be that this is a bug in pg_walinspect or a fault in its test case; hard to tell since that got committed at about the same time as the prefetch changes. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2022-04-25%2001%3A48%3A47
pgsql-hackers by date: