Thread: BUG #17928: Standby fails to decode WAL on termination of primary
The following bug has been logged on the website:

Bug reference:      17928
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 15.2
Operating system:   Ubuntu 22.04
Description:

When a primary server is stopped, standby might fail to replay WAL with the memory allocation error, for example:

2023-05-11 08:42:38.254 MSK [1706654] LOG: recovery restart point at 7/12012E88
2023-05-11 08:42:38.254 MSK [1706654] DETAIL: Last completed transaction was at log time 2023-05-11 08:42:38.245399+03.
2023-05-11 08:42:38.420 MSK [1706654] LOG: restartpoint starting: wal
2023-05-11 08:42:38.451 MSK [1725929] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
	This probably means the server terminated abnormally before or while processing the request.
2023-05-11 08:42:38.468 MSK [1706656] FATAL: invalid memory alloc request size 2021163525
2023-05-11 08:42:38.469 MSK [1706643] LOG: startup process (PID 1706656) exited with exit code 1
2023-05-11 08:42:38.469 MSK [1706643] LOG: terminating any other active server processes
2023-05-11 08:42:38.469 MSK [1706643] LOG: shutting down due to startup process failure
2023-05-11 08:42:38.470 MSK [1706643] LOG: database system is shut down

The call stack of the error:
...
#6  0x000055cfabab8e97 in palloc_extended (size=2021163525, flags=2) at mcxt.c:1143
#7  0x000055cfab456ae0 in XLogReadRecordAlloc (state=0x55cfacef4a08, xl_tot_len=2021161080, allow_oversized=true) at xlogreader.c:524
#8  0x000055cfab456dee in XLogDecodeNextRecord (state=0x55cfacef4a08, nonblocking=false) at xlogreader.c:689
#9  0x000055cfab4575c0 in XLogReadAhead (state=0x55cfacef4a08, nonblocking=false) at xlogreader.c:960
#10 0x000055cfab45519d in XLogPrefetcherNextBlock (pgsr_private=94350447959632, lsn=0x55cfacf4bc58) at xlogprefetcher.c:496
#11 0x000055cfab454b1c in lrq_prefetch (lrq=0x55cfacf4bb08) at xlogprefetcher.c:256
#12 0x000055cfab454cdf in lrq_complete_lsn (lrq=0x55cfacf4bb08, lsn=2976923632) at xlogprefetcher.c:294
#13 0x000055cfab455df2 in XLogPrefetcherReadRecord (prefetcher=0x55cfacef5e50, errmsg=0x7ffc2e410458) at xlogprefetcher.c:1039
#14 0x000055cfab45f136 in ReadRecord (xlogprefetcher=0x55cfacef5e50, emode=15, fetching_ckpt=false, replayTLI=1) at xlogrecovery.c:3047
#15 0x000055cfab45c922 in PerformWalRecovery () at xlogrecovery.c:1754
#16 0x000055cfab448869 in StartupXLOG () at xlog.c:5300
#17 0x000055cfab7d151a in StartupProcessMain () at startup.c:267
#18 0x000055cfab7c3f80 in AuxiliaryProcessMain (auxtype=StartupProcess) at auxprocess.c:141
#19 0x000055cfab7cfb4b in StartChildProcess (type=StartupProcess) at postmaster.c:5429
#20 0x000055cfab7ca5c0 in PostmasterMain (argc=4, argv=0x55cfacef3e40) at postmaster.c:1470
#21 0x000055cfab6be302 in main (argc=4, argv=0x55cfacef3e40) at main.c:202

(gdb) frame 8
#8  0x000055cfab456dee in XLogDecodeNextRecord (state=0x55cfacef4a08, nonblocking=false) at xlogreader.c:689
689		decoded = XLogReadRecordAlloc(state,
(gdb) print/x *record
$5 = {xl_tot_len = 0x78787878, xl_xid = 0x78787878, xl_prev = 0x7878787878787878, xl_info = 0x0, xl_rmid = 0x10, xl_crc = 0x0}
(gdb) print/x state->NextRecPtr
$6 = 0xb1703ff0

hexdump -C .../standby_data/pgdata/pg_wal/0000000100000000000000B1
00703fe0 00 00 04 40 00 00 0c 56 00 00 00 00 00 00 00 00 ...@...V........
00703ff0 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
00704000 10 d1 05 00 01 00 00 00 00 40 70 30 00 00 00 00 .........@p0....
00704010 db 00 00 00 00 00 00 00 78 78 78 78 78 78 78 78 ........xxxxxxxx
00704020 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
*
007040f0 09 00 00 00 00 00 00 00 33 02 00 00 1b 0d 00 00 ........3.......

hexdump -C .../primary_data/pgdata/pg_wal/0000000100000000000000B1
00703fe0 00 00 04 40 00 00 0c 56 00 00 00 00 00 00 00 00 ...@...V........
00703ff0 33 02 00 00 54 2a 00 00 e0 3e 70 b1 00 00 00 00 3...T*...>p.....
00704000 10 d1 05 00 01 00 00 00 00 40 70 b1 00 00 00 00 .........@p.....
00704010 23 02 00 00 00 00 00 00 80 0a 00 00 55 0f 07 2c #...........U..,
00704020 00 60 02 02 7f 06 00 00 05 00 00 00 f7 85 01 00 .`..............
00704030 00 00 00 00 ff 03 02 00 02 08 18 00 01 00 00 00 ................
00704040 e0 07 00 00 78 78 78 78 78 78 78 78 78 78 78 78 ....xxxxxxxxxxxx
00704050 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 78 xxxxxxxxxxxxxxxx
*
00704230 78 78 78 78 78 78 78 78 01 00 00 00 00 00 00 00 xxxxxxxx........

(In this case, rows with 500 'x' were inserted when the primary was stopped.)

`git bisect` for this behavior blames 3f1ce9734 (where XLogDecodeNextRecord() -> XLogReadRecordAlloc() call was introduced).

A reproducer for the anomaly to follow.
11.05.2023 11:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      17928
> ...
> `git bisect` for this behavior blames 3f1ce9734 (where
> XLogDecodeNextRecord() -> XLogReadRecordAlloc() call was introduced).
>
> A reproducer for the anomaly to follow.

The TAP test that demonstrates the issue is attached. To catch the failure faster, I place it in multiple directories src/test/recoveryX/t, add minimal Makefiles, and run (on tmpfs):

for ((i=1;i<=10;i++)); do echo "iteration $i"; NO_TEMP_INSTALL=1 parallel --halt now,fail=1 -j7 --linebuffer --tag make -s check -C src/test/{} ::: recovery1 recovery2 recovery3 recovery4 recovery5 recovery6 recovery7 || break; done

iteration 1
recovery1   +++ tap check in src/test/recovery1 +++
recovery2   +++ tap check in src/test/recovery2 +++
recovery3   +++ tap check in src/test/recovery3 +++
recovery4   +++ tap check in src/test/recovery4 +++
recovery5   +++ tap check in src/test/recovery5 +++
recovery6   +++ tap check in src/test/recovery6 +++
recovery7   +++ tap check in src/test/recovery7 +++
...
recovery5   # Restarting primary instance (49)
recovery3   # Restarting primary instance (49)
recovery7   # Restarting primary instance (49)
recovery2   Bailout called. Further testing stopped: pg_ctl stop failed
recovery2   FAILED--Further testing stopped: pg_ctl stop failed
recovery2   make: *** [Makefile:6: check] Error 255
parallel: This job failed:
make -s check -C src/test/recovery2

tail src/test/recovery2/tmp_check/log/099_restart_with_stanby_standby.log
2023-05-11 20:19:22.247 MSK [2046385] DETAIL: End of WAL reached on timeline 1 at 3/64BDFF8.
2023-05-11 20:19:22.247 MSK [2046385] FATAL: could not send end-of-streaming message to primary: server closed the connection unexpectedly
	This probably means the server terminated abnormally before or while processing the request.
no COPY in progress
2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1
2023-05-11 20:19:22.248 MSK [2037114] LOG: terminating any other active server processes
2023-05-11 20:19:22.248 MSK [2037114] LOG: shutting down due to startup process failure
2023-05-11 20:19:22.249 MSK [2037114] LOG: database system is shut down

Best regards,
Alexander
Attachment
On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> 2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
> 2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1

Thanks Alexander. Looking into this. I think it is probably something like: recycled standby pages are not zeroed (something we already needed to do something about[1]), and when we read a recycled garbage size (like your "xxxx") at the end of a page at an offset where we don't have a full record header on one page, we skip the ValidXLogRecordHeader() call (and always did), but the check in allocate_recordbuf() which previously handled that "gracefully" (well, it would try to allocate up to 1GB bogusly, but it wouldn't try to allocate more than that and ereport) is a bit too late. I probably need to add an earlier not-too-big validation. Thinking.

[1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
> On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > 2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
> > 2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1
>
> Thanks Alexander. Looking into this. I think it is probably something like: recycled standby pages are not zeroed (something we already needed to do something about[1]), and when we read a recycled garbage size (like your "xxxx") at the end of a page at an offset where we don't have a full record header on one page, we skip the ValidXLogRecordHeader() call (and always did), but the check in allocate_recordbuf() which previously handled that "gracefully" (well, it would try to allocate up to 1GB bogusly, but it wouldn't try to allocate more than that and ereport) is a bit too late. I probably need to add an earlier not-too-big validation. Thinking.

I agree about an earlier not-too-big validation. Like the attached? I haven't tested it with Alexander's recipe or pondered it thoroughly.

> [1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de

Regarding [1], is it still worth zeroing recycled pages on standbys and/or reading the whole header before allocating xl_tot_len? (Are there benefits other than avoiding a 1G backend allocation or 4G frontend allocation, or is that benefit worth the cycles?)
Attachment
At Mon, 10 Jul 2023 13:00:12 -0700, Noah Misch <noah@leadboat.com> wrote in
> On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
> > On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > > 2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
> > > 2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1
> >
> > Thanks Alexander. Looking into this. I think it is probably something like: recycled standby pages are not zeroed (something we already needed to do something about[1]), and when we read a recycled garbage size (like your "xxxx") at the end of a page at an offset where we don't have a full record header on one page, we skip the ValidXLogRecordHeader() call (and always did), but the check in allocate_recordbuf() which previously handled that "gracefully" (well, it would try to allocate up to 1GB bogusly, but it wouldn't try to allocate more than that and ereport) is a bit too late. I probably need to add an earlier not-too-big validation. Thinking.
>
> I agree about an earlier not-too-big validation. Like the attached? I haven't tested it with Alexander's recipe or pondered it thoroughly.

I like the patch for its code clean up, but I'm afraid that it removes the existing record length check when reading continued pages. However, I'm unsure about merely adding a check for too-long records, due to the potential for requesting an excessively large amount of memory, even if it will be released shortly.

> > [1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
>
> Regarding [1], is it still worth zeroing recycled pages on standbys and/or reading the whole header before allocating xl_tot_len? (Are there benefits other than avoiding a 1G backend allocation or 4G frontend allocation, or is that benefit worth the cycles?)

I believe reading the whole header is the most sensible approach as it can prevent unnecessary memory requests. Another potential solution (or hack) for this specific case is to let XLogWalRcvFlush write a finalizing ((uint32)0) when dying is true. This stabilizes the behavior to "invalid record length.. got 0" when running the TAP test.

regards.

Kyotaro Horiguchi
NTT Open Source Software Center
Attachment
On Tue, Jul 11, 2023 at 03:54:14PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 10 Jul 2023 13:00:12 -0700, Noah Misch <noah@leadboat.com> wrote in
> > On Mon, May 15, 2023 at 03:38:17PM +1200, Thomas Munro wrote:
> > > On Fri, May 12, 2023 at 6:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> > > > 2023-05-11 20:19:22.248 MSK [2037134] FATAL: invalid memory alloc request size 2021163525
> > > > 2023-05-11 20:19:22.248 MSK [2037114] LOG: startup process (PID 2037134) exited with exit code 1

On my system, testing unpatched REL_15_2, the test reached its first failure at minute 53 and its second failure at minute 189. Perhaps a faster, deterministic test would be feasible:

- Use pg_logical_emit_message to fill a few segments with 0xFF.
- CHECKPOINT the primary, so the standby recycles segments.
- One more pg_logical_emit_message, computing the length from pg_current_wal_insert_lsn() such that new message crosses a segment boundary and ends 4 bytes before the end of a page.
- Stop the primary.
- If the bug is present, the standby will exit.

> > > Thanks Alexander. Looking into this. I think it is probably something like: recycled standby pages are not zeroed (something we already needed to do something about[1]), and when we read a recycled garbage size (like your "xxxx") at the end of a page at an offset where we don't have a full record header on one page, we skip the ValidXLogRecordHeader() call (and always did), but the check in allocate_recordbuf() which previously handled that "gracefully" (well, it would try to allocate up to 1GB bogusly, but it wouldn't try to allocate more than that and ereport) is a bit too late. I probably need to add an earlier not-too-big validation. Thinking.
> >
> > I agree about an earlier not-too-big validation. Like the attached? I haven't tested it with Alexander's recipe or pondered it thoroughly.
>
> I like the patch for its code clean up, but I'm afraid that it removes the existing record length check when reading continued pages.

The removed check would be repetitive in all cases known to me, but I wouldn't mind keeping the check instead.

> However, I'm unsure about merely adding a check for too-long records, due to the potential for requesting an excessively large amount of memory, even if it will be released shortly.
>
> > > [1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
> >
> > Regarding [1], is it still worth zeroing recycled pages on standbys and/or reading the whole header before allocating xl_tot_len? (Are there benefits other than avoiding a 1G backend allocation or 4G frontend allocation, or is that benefit worth the cycles?)
>
> I believe reading the whole header is the most sensible approach as it can prevent unnecessary memory requests. Another potential solution (or hack) for this specific case is to let XLogWalRcvFlush write a finalizing ((uint32)0) when dying is true. This stabilizes the behavior to "invalid record length.. got 0" when running the TAP test.

> --- a/src/backend/replication/walreceiver.c
> +++ b/src/backend/replication/walreceiver.c
> @@ -989,6 +989,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
>  {
>  	Assert(tli != 0);
>
> +	if (dying)
> +	{
> +		int			buflen = sizeof(uint32);
> +		char		buf[sizeof(uint32)];
> +
> +		memset(buf, 0, buflen);
> +		XLogWalRcvWrite(buf, buflen, LogstreamResult.Write, tli);
> +	}
> +
>  	if (LogstreamResult.Flush < LogstreamResult.Write)
>  	{
>  		WalRcvData *walrcv = WalRcv;

That's inexpensive in cycles, and it should make the failures in question quite rare. If we do that instead of pre-zeroing, I think we'd still want to be ready for invalid lengths, for the case of unclean standby shutdown.
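For illustration, the outlined steps could translate into a TAP-style fragment along these lines. This is a rough sketch only, not the test that was later attached or committed; the node setup, payload sizes, and the number of filler messages are assumptions:

use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

# Primary with a streaming standby, as in other recovery TAP tests.
my $primary = PostgreSQL::Test::Cluster->new('primary');
$primary->init(allows_streaming => 1);
$primary->start;
$primary->backup('bkp');
my $standby = PostgreSQL::Test::Cluster->new('standby');
$standby->init_from_backup($primary, 'bkp', has_streaming => 1);
$standby->start;

# Fill a few segments with large messages, then checkpoint so that the
# standby ends up recycling segments (the sizes here are guesses).
$primary->safe_psql('postgres',
	q{SELECT pg_logical_emit_message(false, 'fill', repeat('X', 1024 * 1024))})
  for 1 .. 64;
$primary->safe_psql('postgres', 'CHECKPOINT');

# One more message, sized so that its record runs into the next (recycled)
# segment; real code would compute this from pg_current_wal_insert_lsn()
# instead of hardcoding a payload size.
$primary->safe_psql('postgres',
	q{SELECT pg_logical_emit_message(false, 'tail', repeat('X', 900000))});
$primary->wait_for_catchup($standby);

# Stop the primary; with the bug, the standby's startup process dies with
# "invalid memory alloc request size".
$primary->stop;
ok($standby->safe_psql('postgres', 'SELECT 1') eq '1', 'standby survived');
done_testing();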
At Sun, 16 Jul 2023 17:49:05 -0700, Noah Misch <noah@leadboat.com> wrote in > On Tue, Jul 11, 2023 at 03:54:14PM +0900, Kyotaro Horiguchi wrote: > > I like the patch for its code clean up, but I'm afraid that it removes > > the existing record length check when reading continued > > pages. > > The removed check would be repetitive in all cases known to me, but I wouldn't > mind keeping the check instead. Ah. I got it. No need to do that again. > > I believe reading the whole header is the most sensible approach as it > > can prevent unnecessary memory requests. Another potential solution > > (or hack) for this specific case is to let XLogWalRcvFlush write a > > finalizing ((uint32)0) when dying is true. This stabilizes the > > behavior to "invalid record length.. got 0" when running the TAP test. ... > That's inexpensive in cycles, and it should make the failures in question > quite rare. If we do that instead of pre-zeroing, I think we'd still want to > be ready for invalid lengths, for the case of unclean standby shutdown. Agreed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Sun, Jul 16, 2023 at 05:49:05PM -0700, Noah Misch wrote:
> On my system, testing unpatched REL_15_2, the test reached its first failure at minute 53 and its second failure at minute 189. Perhaps a faster, deterministic test would be feasible:

It took less than five minutes for me to reproduce the failure using the test case sent by Alexander on HEAD:
2023-08-10 15:26:37.401 JST [11239] FATAL: invalid memory alloc request size 2021163525
2023-08-10 15:26:37.405 JST [11235] LOG: startup process (PID 11239) exited with exit code 1

I don't see it after two hours of tests with the patch.

> - Use pg_logical_emit_message to fill a few segments with 0xFF.
> - CHECKPOINT the primary, so the standby recycles segments.
> - One more pg_logical_emit_message, computing the length from pg_current_wal_insert_lsn() such that new message crosses a segment boundary and ends 4 bytes before the end of a page.
> - Stop the primary.
> - If the bug is present, the standby will exit.

Good idea to pollute the data with recycled segments. Using a minimal WAL segment size would help here as well in keeping a test cheap, and two segments should be enough. The alignment calculations and the header size can be known, but the standby records are an issue for the predictability of the test when it comes to adjusting the length of the logical message depending on the 8k WAL page, no?

>> However, I'm unsure about merely adding a check for too-long records, due to the potential for requesting an excessively large amount of memory, even if it will be released shortly.
>>
>>>> [1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
>>>
>>> Regarding [1], is it still worth zeroing recycled pages on standbys and/or reading the whole header before allocating xl_tot_len? (Are there benefits other than avoiding a 1G backend allocation or 4G frontend allocation, or is that benefit worth the cycles?)
>>
>> I believe reading the whole header is the most sensible approach as it can prevent unnecessary memory requests. Another potential solution (or hack) for this specific case is to let XLogWalRcvFlush write a finalizing ((uint32)0) when dying is true. This stabilizes the behavior to "invalid record length.. got 0" when running the TAP test.

FWIW, I came back to this thread while tweaking the error reporting of xlogreader.c for the sake of this thread, and this proposal is an improvement to be able to make a distinction between an OOM and an incorrect record:
https://www.postgresql.org/message-id/ZMh/WV+CuknqePQQ@paquier.xyz

Anyway, agreed that it's an improvement to remove this check out of allocate_recordbuf(). Noah, are you planning to work more on that?

>> --- a/src/backend/replication/walreceiver.c
>> +++ b/src/backend/replication/walreceiver.c
>> @@ -989,6 +989,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
>>  {
>>  	Assert(tli != 0);
>>
>> +	if (dying)
>> +	{
>> +		int			buflen = sizeof(uint32);
>> +		char		buf[sizeof(uint32)];
>> +
>> +		memset(buf, 0, buflen);
>> +		XLogWalRcvWrite(buf, buflen, LogstreamResult.Write, tli);
>> +	}
>> +
>>  	if (LogstreamResult.Flush < LogstreamResult.Write)
>>  	{
>>  		WalRcvData *walrcv = WalRcv;
>
> That's inexpensive in cycles, and it should make the failures in question quite rare. If we do that instead of pre-zeroing, I think we'd still want to be ready for invalid lengths, for the case of unclean standby shutdown.

Tweaking the WAL receiver to do more zeroing has been discussed in the past, but as far as I recall we were not completely sure whether it's completely free, either. It seems to me that this should have a discussion on its own, in a new thread. Not sure if we should worry about archiving, actually, even if the segments should be padded with zeros beforehand.

+    gotheader = targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord ? true : false;

Perhaps this could use more parentheses around the expression checked, for readability.
--
Michael
Attachment
On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
> On Sun, Jul 16, 2023 at 05:49:05PM -0700, Noah Misch wrote:
>> - Use pg_logical_emit_message to fill a few segments with 0xFF.
>> - CHECKPOINT the primary, so the standby recycles segments.
>> - One more pg_logical_emit_message, computing the length from pg_current_wal_insert_lsn() such that new message crosses a segment boundary and ends 4 bytes before the end of a page.
>> - Stop the primary.
>> - If the bug is present, the standby will exit.
>
> Good idea to pollute the data with recycled segments. Using a minimal WAL segment size would help here as well in keeping a test cheap, and two segments should be enough. The alignment calculations and the header size can be known, but the standby records are an issue for the predictability of the test when it comes to adjust the length of the logical message depending on the 8k WAL page, no?

Actually, for this one, I think that I have a simpler idea to make it deterministic. Once we have inserted a record at the page limit on the primary, we can:
- Stop the standby
- Stop the primary
- Rewrite by ourselves a few bytes in the last segment on the standby to emulate a recycled segment portion, based on the end LSN of the logical message record, retrieved either with pg_walinspect or pg_waldump.
- Start the standby, which would replay up to the previous record at the page limit.
- The standby would just be in a state where it waits for the missing records from the primary and keeps looking at streaming, but it should not fail startup.
--
Michael
Attachment
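A minimal sketch of the byte-rewriting step described above, assuming the target segment and offset have already been derived from the end LSN of the last record (via pg_waldump or pg_walinspect); the segment name and offset in the example call are placeholders:

# Overwrite a few bytes of a standby WAL segment with 0xFF to emulate
# leftover contents of a recycled segment.  The node must be stopped.
sub scribble_garbage
{
	my ($node, $segment, $offset, $len) = @_;
	my $path = $node->data_dir . '/pg_wal/' . $segment;

	open my $fh, '+<', $path or die "could not open $path: $!";
	binmode $fh;
	seek($fh, $offset, 0) or die "could not seek in $path: $!";
	print $fh "\xFF" x $len;
	close $fh or die "could not close $path: $!";
	return;
}

# Hypothetical usage, with a made-up segment name and offset:
# scribble_garbage($standby, '000000010000000000000003', 0x3FF0, 16);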
On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
> On Sun, Jul 16, 2023 at 05:49:05PM -0700, Noah Misch wrote:
> > On my system, testing unpatched REL_15_2, the test reached its first failure at minute 53 and its second failure at minute 189. Perhaps a faster, deterministic test would be feasible:
>
> It took less than five minutes for me to reproduce the failure using the test case sent by Alexander on HEAD:
> 2023-08-10 15:26:37.401 JST [11239] FATAL: invalid memory alloc request size 2021163525
> 2023-08-10 15:26:37.405 JST [11235] LOG: startup process (PID 11239) exited with exit code 1
>
> I don't see it after two hours of tests with the patch.

That is good.

> > - Use pg_logical_emit_message to fill a few segments with 0xFF.
> > - CHECKPOINT the primary, so the standby recycles segments.
> > - One more pg_logical_emit_message, computing the length from pg_current_wal_insert_lsn() such that new message crosses a segment boundary and ends 4 bytes before the end of a page.
> > - Stop the primary.
> > - If the bug is present, the standby will exit.
>
> Good idea to pollute the data with recycled segments. Using a minimal WAL segment size would help here as well in keeping a test cheap, and two segments should be enough. The alignment calculations and the header size can be known, but the standby records are an issue for the predictability of the test when it comes to adjust the length of the logical message depending on the 8k WAL page, no?

Could be. I expect there would be challenges translating that outline into a real test, but I don't know if that will be a major one. The test doesn't need to be 100% deterministic. If it fails 25% of the time and is not the slowest test in the recovery suite, I'd find that good enough.

> >>>> [1] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg@alap3.anarazel.de
> >>>
> >>> Regarding [1], is it still worth zeroing recycled pages on standbys and/or reading the whole header before allocating xl_tot_len? (Are there benefits other than avoiding a 1G backend allocation or 4G frontend allocation, or is that benefit worth the cycles?)
> >>
> >> I believe reading the whole header is the most sensible approach as it can prevent unnecessary memory requests. Another potential solution (or hack) for this specific case is to let XLogWalRcvFlush write a finalizing ((uint32)0) when dying is true. This stabilizes the behavior to "invalid record length.. got 0" when running the TAP test.
>
> FWIW, I came back to this thread while tweaking the error reporting of xlogreader.c for the sake of this thread and this proposal is an improvement to be able to make a distinction between an OOM and an incorrect record:
> https://www.postgresql.org/message-id/ZMh/WV+CuknqePQQ@paquier.xyz
>
> Anyway, agreed that it's an improvement to remove this check out of allocate_recordbuf(). Noah, are you planning to work more on that?

I can push xl_tot_len-validate-v1.patch, particularly given the testing result you reported today. I'm content for my part to stop there.

> >> --- a/src/backend/replication/walreceiver.c
> >> +++ b/src/backend/replication/walreceiver.c
> >> @@ -989,6 +989,15 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
> >>  {
> >>  	Assert(tli != 0);
> >>
> >> +	if (dying)
> >> +	{
> >> +		int			buflen = sizeof(uint32);
> >> +		char		buf[sizeof(uint32)];
> >> +
> >> +		memset(buf, 0, buflen);
> >> +		XLogWalRcvWrite(buf, buflen, LogstreamResult.Write, tli);
> >> +	}
> >> +
> >>  	if (LogstreamResult.Flush < LogstreamResult.Write)
> >>  	{
> >>  		WalRcvData *walrcv = WalRcv;
> >
> > That's inexpensive in cycles, and it should make the failures in question quite rare. If we do that instead of pre-zeroing, I think we'd still want to be ready for invalid lengths, for the case of unclean standby shutdown.
>
> Tweaking the WAL receiver to do more zeroing has been discussed in the past, but as far as I recall we were not completely sure whether it's completely free, either. It seems to me that this should have a discussion on its own, in a new thread. Not sure if we should worry about archiving, actually, even if the segments should be padded with zeros beforehand.

Okay.

> +    gotheader = targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord ? true : false;

Okay.
On Thu, Aug 10, 2023 at 07:58:08PM -0700, Noah Misch wrote: > On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote: >> Good idea to pollute the data with recycled segments. Using a minimal >> WAL segment size whould help here as well in keeping a test cheap, and >> two segments should be enough. The alignment calculations and the >> header size can be known, but the standby records are an issue for the >> predictability of the test when it comes to adjust the length of the >> logical message depending on the 8k WAL page, no? > > Could be. I expect there would be challenges translating that outline into a > real test, but I don't know if that will be a major one. The test doesn't > need to be 100% deterministic. If it fails 25% of the time and is not the > slowest test in the recovery suite, I'd find that good enough. FWIW, I'm having a pretty hard time to get something close enough to a page border in a reliable way. Perhaps using a larger series of records and select only one would be more reliable.. Need to test that a bit more. >> FWIW, I came back to this thread while tweaking the error reporting of >> xlogreader.c for the sake of this thread and this proposal is an >> improvement to be able to make a distinction between an OOM and an >> incorrect record: >> https://www.postgresql.org/message-id/ZMh/WV+CuknqePQQ@paquier.xyz >> >> Anyway, agreed that it's an improvement to remove this check out of >> allocate_recordbuf(). Noah, are you planning to work more on that? > > I can push xl_tot_len-validate-v1.patch, particularly given the testing result > you reported today. I'm content for my part to stop there. Okay, fine by me. That's going to help with what I am doing in the other thread as I'd need to make a better difference between the OOM and the invalid cases for the allocation path. You are planning for a backpatch to take care of the inconsistency, right? The report has been on 15~ where the prefetching was introduced. I'd be OK to just do that and not mess up with the stable branches more than necessary (aka ~14) if nobody complains, especially REL_11_STABLE planned to be EOL'd in the next minor cycle. -- Michael
Attachment
On Fri, Aug 11, 2023 at 03:08:26PM +0900, Michael Paquier wrote: > On Thu, Aug 10, 2023 at 07:58:08PM -0700, Noah Misch wrote: > > On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote: > >> Good idea to pollute the data with recycled segments. Using a minimal > >> WAL segment size whould help here as well in keeping a test cheap, and > >> two segments should be enough. The alignment calculations and the > >> header size can be known, but the standby records are an issue for the > >> predictability of the test when it comes to adjust the length of the > >> logical message depending on the 8k WAL page, no? > > > > Could be. I expect there would be challenges translating that outline into a > > real test, but I don't know if that will be a major one. The test doesn't > > need to be 100% deterministic. If it fails 25% of the time and is not the > > slowest test in the recovery suite, I'd find that good enough. > > FWIW, I'm having a pretty hard time to get something close enough to a > page border in a reliable way. Perhaps using a larger series of > records and select only one would be more reliable.. Need to test > that a bit more. Interesting. So pg_logical_emit_message(false, 'X', repeat('X', n)) doesn't get close enough, but s/n/n+1/ spills to the next page? If so, I did not anticipate that. > >> FWIW, I came back to this thread while tweaking the error reporting of > >> xlogreader.c for the sake of this thread and this proposal is an > >> improvement to be able to make a distinction between an OOM and an > >> incorrect record: > >> https://www.postgresql.org/message-id/ZMh/WV+CuknqePQQ@paquier.xyz > >> > >> Anyway, agreed that it's an improvement to remove this check out of > >> allocate_recordbuf(). Noah, are you planning to work more on that? > > > > I can push xl_tot_len-validate-v1.patch, particularly given the testing result > > you reported today. I'm content for my part to stop there. > > Okay, fine by me. That's going to help with what I am doing in the > other thread as I'd need to make a better difference between the OOM > and the invalid cases for the allocation path. > > You are planning for a backpatch to take care of the inconsistency, > right? The report has been on 15~ where the prefetching was > introduced. I'd be OK to just do that and not mess up with the stable > branches more than necessary (aka ~14) if nobody complains, especially > REL_11_STABLE planned to be EOL'd in the next minor cycle. I recall earlier messages theorizing that it was just harder to hit in v14, so I'm disinclined to stop at v15. I think the main choice is whether to stop at v11 (normal choice) or v12 (worry about breaking the last v11 point release). I don't have a strong opinion between those.
On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote: > On Fri, Aug 11, 2023 at 03:08:26PM +0900, Michael Paquier wrote: > > You are planning for a backpatch to take care of the inconsistency, > > right? The report has been on 15~ where the prefetching was > > introduced. I'd be OK to just do that and not mess up with the stable > > branches more than necessary (aka ~14) if nobody complains, especially > > REL_11_STABLE planned to be EOL'd in the next minor cycle. > > I recall earlier messages theorizing that it was just harder to hit in v14, so > I'm disinclined to stop at v15. I think the main choice is whether to stop at > v11 (normal choice) or v12 (worry about breaking the last v11 point release). > I don't have a strong opinion between those. Thanks for working on this. I wonder if we need a more explicit way to construct pages with the right bits to reach interesting test cases and get full enough coverage... (Cf throwing SQL at the WAL to see if it sticks.)
On Sat, Aug 12, 2023 at 11:56:45AM +1200, Thomas Munro wrote: > On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote: >> On Fri, Aug 11, 2023 at 03:08:26PM +0900, Michael Paquier wrote: >> I recall earlier messages theorizing that it was just harder to hit in v14, so >> I'm disinclined to stop at v15. I think the main choice is whether to stop at >> v11 (normal choice) or v12 (worry about breaking the last v11 point release). >> I don't have a strong opinion between those. Okay. I wouldn't be inclined to patch v11 for that, FWIW, as this code path is touched by recovery and more. At least it does not seem worth taking any risk compared to the potential gain. > Thanks for working on this. > > I wonder if we need a more explicit way to construct pages with the > right bits to reach interesting test cases and get full enough > coverage... (Cf throwing SQL at the WAL to see if it sticks.) You mean SQL functions that write an arbitrary set of bytes at a given LSN, to trigger some expected behavior on a follow-up crash recovery? -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > On Sat, Aug 12, 2023 at 11:56:45AM +1200, Thomas Munro wrote: >> On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote: >>> I recall earlier messages theorizing that it was just harder to hit in v14, so >>> I'm disinclined to stop at v15. I think the main choice is whether to stop at >>> v11 (normal choice) or v12 (worry about breaking the last v11 point release). >>> I don't have a strong opinion between those. > Okay. I wouldn't be inclined to patch v11 for that, FWIW, as this > code path is touched by recovery and more. At least it does not seem > worth taking any risk compared to the potential gain. That was my gut reaction too -- risk/reward seems not great for the last v11 release. (I've been burned a couple of times now by putting can't-fix-it-anymore bugs into the final release of a branch, so maybe I'm just being overly paranoid. But it's something to worry about.) regards, tom lane
On Fri, Aug 11, 2023 at 08:32:00PM -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Sat, Aug 12, 2023 at 11:56:45AM +1200, Thomas Munro wrote:
> >> On Sat, Aug 12, 2023 at 2:00 AM Noah Misch <noah@leadboat.com> wrote:
> >>> I recall earlier messages theorizing that it was just harder to hit in v14, so I'm disinclined to stop at v15. I think the main choice is whether to stop at v11 (normal choice) or v12 (worry about breaking the last v11 point release). I don't have a strong opinion between those.
>
> > Okay. I wouldn't be inclined to patch v11 for that, FWIW, as this code path is touched by recovery and more. At least it does not seem worth taking any risk compared to the potential gain.
>
> That was my gut reaction too -- risk/reward seems not great for the last v11 release. (I've been burned a couple of times now by putting can't-fix-it-anymore bugs into the final release of a branch, so maybe I'm just being overly paranoid. But it's something to worry about.)

Okay, v12+ would be fine with me.

Alas, a closer look at v1 found two defects, one with severity far greater than the bug v1 aimed to fix:

=====
1. For oversized records, it changed startup FATAL to silent data loss

This affects only supported branches (v15-). In v16+, commit 8fcb32d introduced and checked XLogRecordMaxSize. But in v15 with patch v1, redo of pg_logical_emit_message(false, '_', repeat('x', 1024 * 1024 * 1024 - 1000)) just ends at the oversized record:

2523581 2023-08-12 17:17:39.512 GMT LOG: redo starts at 0/41F155B8
2523581 2023-08-12 17:17:39.512 GMT LOG: record length 1073740879 at 0/41F155F0 too long
2523581 2023-08-12 17:17:39.512 GMT LOG: redo done at 0/41F155B8 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s

Any user could call pg_logical_emit_message() to silently terminate the WAL stream, which is far worse than the original bug. So far, I'm seeing one way to clearly fix $SUBJECT without that harm. When a record header spans a page boundary, read the next page to reassemble the header. If !ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL. Otherwise, read the whole record in chunks, calculating CRC. If CRC is invalid, treat as end of WAL. Otherwise, ereport(FATAL) for the oversized record. A side benefit would be avoiding useless large allocations (1GB backend, 4GB frontend) as discussed upthread. May as well do the xl_rmid and xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1 allocations. Thoughts? For invalid-length records in v16+, since every such record is end-of-wal or corruption, those versions could skip the CRC.

As an alternative, zeroing recycled pages could remove "most of" the spurious errors without adding silent data loss. I considered another alternative of back-patching XLogRecordMaxSize and proceeding with a patch like v1, but that wouldn't help with records in existing WAL archives. Of course, we could also choose to leave $SUBJECT unfixed in v15-.

=====
2. DecodeXLogRecordRequiredSpace() is the wrong ceiling

v1 treats DecodeXLogRecordRequiredSpace(L) (=L+2445 here) as the max needed, but allocate_recordbuf(state, L) allocates up to L+8192. If the uninitialized length happened to contain an integer between about 1<<30-8192 and 1<<30-2445, one still got e.g. "FATAL: invalid memory alloc request size 1073741824".

Since v16+ has XLogRecordMaxSize, testing XLogRecordMaxSize avoids the problem and should suffice:

===
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -640,7 +640,6 @@ restart:
 							  (uint32) SizeOfXLogRecord, total_len);
 		goto err;
 	}
-#ifndef FRONTEND
 	/*
 	 * We may be looking at a random uint32 read from a recycled segment.  For
@@ -651,13 +650,12 @@ restart:
 	 * the allocation may succeed but record checks are going to fail so this
 	 * would be short-lived.
 	 */
-	if (!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len)))
+	if (total_len > XLogRecordMaxSize)
 	{
 		report_invalid_record(state,
 							  "record length %u at %X/%X too long",
 							  total_len, LSN_FORMAT_ARGS(RecPtr));
 		goto err;
 	}
-#endif
 	/*
 	 * If the whole record header is on this page, validate it immediately.
===

allocate_recordbuf()'s addend is not critical; it's just to "avoid useless small increases". v15- should avoid the problem like this:

===
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 969bcc3..1bac303 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -201,4 +201,8 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 	newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
 	newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
+#ifndef FRONTEND
+	if (!AllocSizeIsValid(newSize))
+		newSize = reclength;
+#endif
 	if (state->readRecordBuf)
===

In v15, one can reach this case with pg_logical_emit_message(false, '_', repeat('x', 1024 * 1024 * 1024 - 4000)). On v16+, one might reach this with unlucky trailing garbage, but XLogRecordMaxSize prevents reaching it through pg_logical_emit_message().

Thanks,
nm
On Sun, Aug 13, 2023 at 9:13 AM Noah Misch <noah@leadboat.com> wrote: > Any user could call pg_logical_emit_message() to silently terminate the WAL > stream, which is far worse than the original bug. So far, I'm seeing one way > to clearly fix $SUBJECT without that harm. When a record header spans a page > boundary, read the next page to reassemble the header. If > !ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL. > Otherwise, read the whole record in chunks, calculating CRC. If CRC is > invalid, treat as end of WAL. Otherwise, ereport(FATAL) for the oversized > record. A side benefit would be avoiding useless large allocations (1GB > backend, 4GB frontend) as discussed upthread. May as well do the xl_rmid and > xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1 > allocations. Thoughts? For invalid-length records in v16+, since every such > record is end-of-wal or corruption, those versions could skip the CRC. That sounds quite strong. But... why couldn't the existing xlp_rem_len cross-check protect us from this failure mode? If we could defer the allocation until after that check (and the usual ValidXLogRecordHeader() check), I think we'd know that we're really looking at a size that was actually written in both pages (which must also have survived xlp_pageaddr check), no?
On Sun, Aug 13, 2023 at 03:12:34PM +1200, Thomas Munro wrote: > On Sun, Aug 13, 2023 at 9:13 AM Noah Misch <noah@leadboat.com> wrote: > > Any user could call pg_logical_emit_message() to silently terminate the WAL > > stream, which is far worse than the original bug. So far, I'm seeing one way > > to clearly fix $SUBJECT without that harm. When a record header spans a page > > boundary, read the next page to reassemble the header. If > > !ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL. > > Otherwise, read the whole record in chunks, calculating CRC. If CRC is > > invalid, treat as end of WAL. Otherwise, ereport(FATAL) for the oversized > > record. A side benefit would be avoiding useless large allocations (1GB > > backend, 4GB frontend) as discussed upthread. May as well do the xl_rmid and > > xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1 > > allocations. Thoughts? For invalid-length records in v16+, since every such > > record is end-of-wal or corruption, those versions could skip the CRC. > > That sounds quite strong. But... why couldn't the existing > xlp_rem_len cross-check protect us from this failure mode? If we > could defer the allocation until after that check (and the usual > ValidXLogRecordHeader() check), I think we'd know that we're really > looking at a size that was actually written in both pages (which must > also have survived xlp_pageaddr check), no? Hmm, I think that is right. A coincidental match of the 32-bit CRC is more probable than having all those fields appear valid by coincidence, especially xlp_pageaddr.
On Sat, Aug 12, 2023 at 08:30:34PM -0700, Noah Misch wrote: > On Sun, Aug 13, 2023 at 03:12:34PM +1200, Thomas Munro wrote: >> On Sun, Aug 13, 2023 at 9:13 AM Noah Misch <noah@leadboat.com> wrote: >>> Any user could call pg_logical_emit_message() to silently terminate the WAL >>> stream, which is far worse than the original bug. So far, I'm seeing one way >>> to clearly fix $SUBJECT without that harm. When a record header spans a page >>> boundary, read the next page to reassemble the header. If >>> !ValidXLogRecordHeader() (invalid xl_rmid or xl_prev), treat as end of WAL. >>> Otherwise, read the whole record in chunks, calculating CRC. If CRC is >>> invalid, treat as end of WAL. Otherwise, ereport(FATAL) for the oversized >>> record. A side benefit would be avoiding useless large allocations (1GB >>> backend, 4GB frontend) as discussed upthread. May as well do the xl_rmid and >>> xl_prev checks in all branches, to avoid needless XLogRecordMaxSize-1 >>> allocations. Thoughts? For invalid-length records in v16+, since every such >>> record is end-of-wal or corruption, those versions could skip the CRC. >> >> That sounds quite strong. But... why couldn't the existing >> xlp_rem_len cross-check protect us from this failure mode? If we >> could defer the allocation until after that check (and the usual >> ValidXLogRecordHeader() check), I think we'd know that we're really >> looking at a size that was actually written in both pages (which must >> also have survived xlp_pageaddr check), no? +1 for relying on xlp_rem_len for that. > Hmm, I think that is right. A coincidental match of the 32-bit CRC is more > probable than having all those fields appear valid by coincidence, especially > xlp_pageaddr. Hmm. [.. thinks ..] It seems to me that we should try to also finish the header validation before attempting XLogReadRecordAlloc() on the total_len as well? It looks like the end result would be to move the first ReadPageInternal done for the header with all its cross-page checks before XLogReadRecordAlloc(). That should remove the need of having gotheader from v1. -- Michael
Attachment
On Mon, Aug 14, 2023 at 2:08 PM Michael Paquier <michael@paquier.xyz> wrote: > It seems to me that we should try to also finish the header validation > before attempting XLogReadRecordAlloc() on the total_len as well? It > looks like the end result would be to move the first ReadPageInternal > done for the header with all its cross-page checks before > XLogReadRecordAlloc(). That should remove the need of having > gotheader from v1. Yeah. Trying out an idea for how to untangle that...
On Mon, Aug 14, 2023 at 3:56 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Mon, Aug 14, 2023 at 2:08 PM Michael Paquier <michael@paquier.xyz> wrote: > > It seems to me that we should try to also finish the header validation > > before attempting XLogReadRecordAlloc() on the total_len as well? It > > looks like the end result would be to move the first ReadPageInternal > > done for the header with all its cross-page checks before > > XLogReadRecordAlloc(). That should remove the need of having > > gotheader from v1. > > Yeah. Trying out an idea for how to untangle that... Something like this... patches for 12 and master attached. Needs a lot more testing. I really want to figure out how to get deterministic tests...
Attachment
On Mon, Aug 14, 2023 at 10:28:54PM +1200, Thomas Munro wrote:
> Something like this... patches for 12 and master attached. Needs a lot more testing. I really want to figure out how to get deterministic tests...

Interesting, thanks for the quick patch!

    /*
     * Try to find space to decode this record, if we can do so without
     * calling palloc.  If we can't, we'll try again below after we've
     * validated that total_len isn't garbage bytes from a recycled WAL page.
     */
    decoded = XLogReadRecordAlloc(state,
                                  total_len,
                                  false /* allow_oversized */ );

The patch relies on total_len before the header validation is completed, meaning that this value of total_len could be invalid because it comes from the partially-read header.. Oh wait, that's actually OK because an oversized allocation is not allowed yet and the decode buffer would not be able to store more than what it can? Perhaps this comment should be updated to tell something like, adding that total_len can be garbage, but we don't care because we don't allocate anything that the decode buffer cannot hold.

 #ifndef FRONTEND
-	/*
-	 * Note that in much unlucky circumstances, the random data read from a
-	 * recycled segment can cause this routine to be called with a size
-	 * causing a hard failure at allocation.  For a standby, this would cause
-	 * the instance to stop suddenly with a hard failure, preventing it to
-	 * retry fetching WAL from one of its sources which could allow it to move
-	 * on with replay without a manual restart.  If the data comes from a past
-	 * recycled segment and is still valid, then the allocation may succeed
-	 * but record checks are going to fail so this would be short-lived.  If
-	 * the allocation fails because of a memory shortage, then this is not a
-	 * hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM.
-	 */
 	if (!AllocSizeIsValid(newSize))
 		return false;

Wouldn't it be OK to drop entirely this check? I'm fine to keep it as a safety measure, but it should not be necessary now that it gets called only once the header is validated?

Regarding the tests, I have been using this formula to produce the number of bytes until the next page boundary:

select setting::int - ((pg_current_wal_lsn() - '0/0') % setting::int)
  from pg_settings where name = 'wal_block_size';

Using pg_logical_emit_message() non-transactional with an empty set of strings generates records of 56 bytes, so I was thinking about the following:
- Generate a bunch of small records with pg_logical_emit_message(), or records based on the threshold with the previous formula.
- Loop until we reach a page limit, at 24 bytes (?).
- Generate one last record to cut through.
- Stop the node in immediate mode.
- Write some garbage bytes on the last page generated to emulate the recycled contents and an allocation.
- Start the node, which should be able to startup.

With wal_level = minimal, autovacuum = off and a large checkpoint_timeout, any other records are minimized. That's fancy, though.

Do you think that this could work reliably?
--
Michael
Attachment
On Tue, Aug 15, 2023 at 2:05 PM Michael Paquier <michael@paquier.xyz> wrote: > /* > * Try to find space to decode this record, if we can do so without > * calling palloc. If we can't, we'll try again below after we've > * validated that total_len isn't garbage bytes from a recycled WAL page. > */ > decoded = XLogReadRecordAlloc(state, > total_len, > false /* allow_oversized */ ); > > The patch relies on total_len before the header validation is > completed, meaning that this value of total_len could be invalid > because it comes from the partially-read header.. Oh wait, that's > actually OK because an oversized allocation is not allowed yet and the > decode buffer would not be able to store more than what it can? > Perhaps this comment should be updated to tell something like, adding > that total_len can be garbage, but we don't care because we don't > allocate anything that the decode buffer cannot hold. Yeah. > #ifndef FRONTEND > > - /* > - * Note that in much unlucky circumstances, the random data read from a > - * recycled segment can cause this routine to be called with a size > - * causing a hard failure at allocation. For a standby, this would cause > - * the instance to stop suddenly with a hard failure, preventing it to > - * retry fetching WAL from one of its sources which could allow it to move > - * on with replay without a manual restart. If the data comes from a past > - * recycled segment and is still valid, then the allocation may succeed > - * but record checks are going to fail so this would be short-lived. If > - * the allocation fails because of a memory shortage, then this is not a > - * hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM. > - */ > if (!AllocSizeIsValid(newSize)) > return false; > > Wouldn't it be OK to drop entirely this check? I'm fine to keep it as > a safety measure, but it should not be necessary now that it gets > called only once the header is validated? Yeah, now we're in "shouldn't happen" territory, and I'm not sure. > Regarding the tests, I have been using this formula to produce the > number of bytes until the next page boundary: > select setting::int - ((pg_current_wal_lsn() - '0/0') % setting::int) > from pg_settings where name = 'wal_block_size'; > > Using pg_logical_emit_message() non-transactional with an empty set of > strings generates records of 56 bytes, so I was thinking about the > following: > - Generate a bunch of small records with pg_logical_emit_message(), or > records based on the threshold with the previous formula. > - Loop until we reach a page limit, at 24 bytes (?). > - Generate one last record to cut through. > - Stop the node in immediate mode. > - Write some garbage bytes on the last page generated to emulate the > recycled contents and an allocation > - Start the node, which should be able to startup. > With wal_level = minimal, autovacuum = off and a large > checkpoint_timeout, any other records are minimized. That's fancy, > though. > > Do you think that this could work reliably? Yeah, I think that sounds quite promising, and funnily enough I was just now working on some Perl code that appends controlled junk to the WAL in a test like that so we can try to hit all the error paths...
On Tue, Aug 15, 2023 at 02:36:10PM +1200, Thomas Munro wrote:
> Yeah, I think that sounds quite promising, and funnily enough I was just now working on some Perl code that appends controlled junk to the WAL in a test like that so we can try to hit all the error paths...

I am pretty sure that adding some junk in a controlled manner is the only cheap and rather-reliable way to get something.. Not sure if that will help, but what I was playing with some stuff in the lines of:

-- Store the length up to page boundary.
select setting::int - ((pg_current_wal_insert_lsn() - '0/0') % setting::int)
  as boundary from pg_settings where name = 'wal_block_size' \gset
-- Generate record up to boundary (56 bytes for base size of the record,
-- stop at 12 bytes before the end of the page.
select pg_logical_emit_message(false, '', repeat('a', :boundary - 56 - 12));

Then by injecting some FF's on the last page written and forcing replay I am able to force some of the error code paths, so I guess that's what you were basically doing?
--
Michael
Attachment
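Wrapped in a TAP helper, the same calibration might look roughly like this. This is a hypothetical sketch; the 56-byte base record size and the gap are simply the figures used in the psql snippet above:

# Emit one non-transactional logical message whose record should end
# $gap bytes before the next WAL page boundary, following the SQL above.
sub emit_message_near_page_boundary
{
	my ($node, $gap) = @_;

	my $to_boundary = $node->safe_psql('postgres',
		qq{SELECT setting::int - ((pg_current_wal_insert_lsn() - '0/0') % setting::int)
		   FROM pg_settings WHERE name = 'wal_block_size'});

	# 56 bytes is the observed size of an empty message record.
	my $payload = $to_boundary - 56 - $gap;
	die 'not enough room left on this page' if $payload < 0;

	return $node->safe_psql('postgres',
		"SELECT pg_logical_emit_message(false, '', repeat('a', $payload))");
}

# e.g. emit_message_near_page_boundary($primary, 12);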
On Tue, Aug 15, 2023 at 12:00:30PM +0900, Michael Paquier wrote:
> Not sure if that will help, but what I was playing with some stuff in the lines of:
> -- Store the length up to page boundary.
> select setting::int - ((pg_current_wal_insert_lsn() - '0/0') % setting::int)
>   as boundary from pg_settings where name = 'wal_block_size' \gset
> -- Generate record up to boundary (56 bytes for base size of the record,
> -- stop at 12 bytes before the end of the page.
> select pg_logical_emit_message(false, '', repeat('a', :boundary - 56 - 12));
>
> Then by injecting some FF's on the last page written and forcing replay I am able to force some of the error code paths, so I guess that's what you were basically doing?

I've been spending some extra time on this one and hacked a TAP test that reliably reproduces the original issue, using a message similar to what I mentioned in my previous messages. I guess that we could use something like that:

2023-08-15 15:07:03.790 JST [8729] LOG: redo starts at 0/14EA428
2023-08-15 15:07:03.790 JST [8729] FATAL: invalid memory alloc request size 4294969740
2023-08-15 15:07:03.791 JST [8726] LOG: startup process (PID 8729) exited with exit code 1

The proposed patches pass the test, HEAD does not. We may want to do more with page boundaries, and more error patterns, but the idea looks worth exploring more. At least this can be used to validate patches.

I've noticed while hacking the test that we don't do a XLogFlush() after inserting the message's record, so we may lose it on crash. That makes the test unstable except if an extra record is added after the logical messages. The attached patch forces that for the sake of the test, but I'm spawning a different thread as losing this data looks like a bug to me.
--
Michael
Attachment
On Tue, Aug 15, 2023 at 6:11 PM Michael Paquier <michael@paquier.xyz> wrote:
> I've been spending some extra time on this one and hacked a TAP test that reliably reproduces the original issue, using a message similar to what I mentioned in my previous messages.

Nice. I hacked on this idea for quite a long time yesterday and today and came up with a set of tests for the main end-of-WAL conditions:

▶ 1/1 - xl_tot_len zero OK
▶ 1/1 - xl_tot_len short OK
▶ 1/1 - xl_prev bad OK
▶ 1/1 - xl_crc bad OK
▶ 1/1 - xlp_magic zero OK
▶ 1/1 - xlp_magic bad OK
▶ 1/1 - xlp_pageaddr bad OK
▶ 1/1 - xlp_info bad OK
▶ 1/1 - xlp_info lacks XLP_FIRST_IS_CONTRECORD OK
▶ 1/1 - xlp_rem_len bad OK
▶ 1/1 - xlp_magic zero (split record header) OK
▶ 1/1 - xlp_pageaddr bad (split record header) OK
▶ 1/1 - xlp_rem_len bad (split record header) OK
1/1 postgresql:recovery / recovery/038_end_of_wal OK 5.79s 13 subtests passed

It took me a while to come up with a workable way to get into the record-header-splitting zone. Based on some of your clues about flushing, I eventually realised I needed transactional messages, and I built a kind of self-calibrating Rube Goldberg function around that. It's terrible, and I'm sure we can do better.

I wonder what people think about putting internal details of the WAL format into a Perl test like this. Obviously it requires maintenance, since it knows the size and layout of a few things. I guess it'd be allowed to fish a couple of those numbers out of the source.

Work in progress... I'm sure more useful checks could be added. One thing that occurred to me while thinking about all this is that the 'treat malloc failure as end of WAL' thing you highlighted in another thread is indeed completely bananas -- I didn't go digging, but perhaps it was an earlier solution to the very same garbage xl_tot_len problem, before 70b4f82a4b5 and now this xl_rem_len-based solution?
Attachment
On Wed, Aug 16, 2023 at 03:29:49PM +1200, Thomas Munro wrote:
> I hacked on this idea for quite a long time yesterday and today and
> came up with a set of tests for the main end-of-WAL conditions:
>
> ▶ 1/1 - xl_tot_len zero OK
> ▶ 1/1 - xl_tot_len short OK
> ▶ 1/1 - xl_prev bad OK
> ▶ 1/1 - xl_crc bad OK
> ▶ 1/1 - xlp_magic zero OK
> ▶ 1/1 - xlp_magic bad OK
> ▶ 1/1 - xlp_pageaddr bad OK
> ▶ 1/1 - xlp_info bad OK
> ▶ 1/1 - xlp_info lacks XLP_FIRST_IS_CONTRECORD OK
> ▶ 1/1 - xlp_rem_len bad OK
> ▶ 1/1 - xlp_magic zero (split record header) OK
> ▶ 1/1 - xlp_pageaddr bad (split record header) OK
> ▶ 1/1 - xlp_rem_len bad (split record header) OK
> 1/1 postgresql:recovery / recovery/038_end_of_wal OK 5.79s 13 subtests passed

Nice.

> It took me a while to come up with a workable way to get into the
> record-header-splitting zone.  Based on some of your clues about
> flushing, I eventually realised I needed transactional messages, and I
> built a kind of self-calibrating Rube Goldberg function around that.
> It's terrible, and I'm sure we can do better.

So that's advance_to_record_splitting_zone(), with the idea to loop
across multiple records.  The use of transactional messages ensures the
flush and the stability of the test when the server is stopped, but it
makes the records a bit longer than the non-transactional messages.

> I wonder what people think about putting internal details of the WAL
> format into a Perl test like this.  Obviously it requires maintenance,
> since it knows the size and layout of a few things.  I guess it'd be
> allowed to fish a couple of those numbers out of the source.

The format of the WAL header does not change a lot across the years, so
I'd be OK with that.  If that's an issue, the scope of the test could
always be restricted a bit.

> Work in progress...  I'm sure more useful checks could be added.  One
> thing that occurred to me while thinking about all this is that the
> 'treat malloc failure as end of WAL' thing you highlighted in another
> thread is indeed completely bananas -- I didn't go digging, but
> perhaps it was an earlier solution to the very same garbage xl_tot_len
> problem, before 70b4f82a4b5 and now this xl_rem_len-based solution?

It is bananas.  Still, the problem of the other thread is similar to
what's here, and different at the same time because it can happen with
a valid record if the host is under memory pressure, and we should
still allow a standby to loop and continue in this case.  Perhaps also
a startup process doing only crash recovery.  This thread's problem is
dependent on this one, which is why this one needs to happen first: we
need to clarify the boundary between a real OOM and garbage indicating
the end of WAL.

+my $RECORD_HEADER_SIZE = 24;
+my $XLP_PAGE_MAGIC = 0xd113;
+my $XLP_FIRST_IS_CONTRECORD = 0x1;

This could use something like check_pg_config() to find the header
contents for the last two, but there's no easy way to extract the size
of the header struct, is there?  Or perhaps Perl has a tool for that..
Everything I can read on the matter suggests using pack() to make the
size estimations predictable.

+    my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
+    while ($page_offset >= $WAL_BLOCK_SIZE - 2000)
+    {
+        $end_lsn = emit_message($node, 2000)

The hardcoded 2000 feels magic here, but I get that you just want to
get away from the page border, as well.  I'd be OK with more comments.
+sub lsn_to_segment_and_offset
+{
+    my $lsn = shift;
+    return ($lsn / $WAL_SEGMENT_SIZE, $lsn % $WAL_SEGMENT_SIZE);
+}

Now the backend also has pg_walfile_name_offset().  I tend to rely on
the backend logic as much as I can for tests.

+sub start_of_page

Not sure this one is needed, only being called in start_of_next_page().

+sub get_insert_lsn

Name is confusing with Cluster::lsn('insert') available, perhaps this
should be named get_insert_lsn_in_bytes.

As a whole, I find that pretty cool.  Not sure if I would backpatch the
test, so my opinion would be to use that on HEAD and let it stabilize
there.
--
Michael
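For reference, leaning on the backend as suggested would look roughly like this (assuming $lsn holds a textual LSN such as '0/16B6120'; not code from the posted patch):

# Let the server translate an LSN into a WAL file name and byte offset
# instead of re-implementing the segment arithmetic in Perl.
my ($walfile, $offset) = split /\|/,
  $node->safe_psql('postgres',
    "SELECT file_name, file_offset FROM pg_walfile_name_offset('$lsn')");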
Attachment
On Wed, Aug 16, 2023 at 03:17:58PM +0900, Michael Paquier wrote: > As a whole, I find that pretty cool. Not sure if I would backpatch > the test, so my opinion would be to use that on HEAD and let it > stabilize there. Thomas, is there anything I can do to help with this patch and make it move on? -- Michael
Attachment
On Fri, Aug 18, 2023 at 12:20 PM Michael Paquier <michael@paquier.xyz> wrote: > On Wed, Aug 16, 2023 at 03:17:58PM +0900, Michael Paquier wrote: > > As a whole, I find that pretty cool. Not sure if I would backpatch > > the test, so my opinion would be to use that on HEAD and let it > > stabilize there. > > Thomas, is there anything I can do to help with this patch and make it > move on? Give me a couple of days and I'll look into how back-patchable the tests can be made, and see what else we can test. Perhaps it's not strictly necessary to back-patch the fix further than 15, but I think we should definitely consider it, and I don't like the idea of not having the tests accompanying the change. If you have any ideas about how to write a more efficient version of advance_to_record_splitting_zone() (or I guess that should really be advance_to_record_header_splitting_zone()), and generally how to make the perl better, and how to get those constants we need from the source or binaries, then I'm all ears.
On Fri, Aug 18, 2023 at 02:30:31PM +1200, Thomas Munro wrote:
> Give me a couple of days and I'll look into how back-patchable the
> tests can be made, and see what else we can test.  Perhaps it's not
> strictly necessary to back-patch the fix further than 15, but I think
> we should definitely consider it, and I don't like the idea of not
> having the tests accompanying the change.

Okay, cool.

> If you have any ideas about how to write a more efficient version of
> advance_to_record_splitting_zone() (or I guess that should really be
> advance_to_record_header_splitting_zone()), and generally how to make
> the perl better,

Yeah, I think that there are ways to reduce the number of records
generated.

> and how to get those constants we need from the
> source or binaries, then I'm all ears.

A new Perl routine that runs pg_config --includedir and scans a given
header with a regexp would do the job for the constants.  If you're
interested, I can code that up.
--
Michael
Attachment
On Fri, Aug 18, 2023 at 01:55:32PM +0900, Michael Paquier wrote:
> A new Perl routine that runs pg_config --includedir and scans a given
> header with a regexp would do the job for the constants.  If you're
> interested, I can code that up.

Please find attached a patch, applying on top of the TAP test, that
implements logic able to extract the wanted values from header files.
This is Utils::scan_header(), for example:

my @scan_result = scan_header('server/access/xlog_internal.h',
    '#define\s+XLOG_PAGE_MAGIC\s+(\w+)');
my $XLP_PAGE_MAGIC = hex($scan_result[0]);

This uses group capture so that it is possible to get more than one
field.

Guessing the size of the C structure is possible, but I've found these
options to be a bit awkward implementation-wise:
- Config would not help for the Postgres-specific declarations, like
  TimeLineId, etc.
- I saw something that was specific to GCC..  cannot put my finger on
  it now.

Another possibility would be to store the size in a different variable
in xlog_internal.h, coupled with a StaticAssertDecl() to make sure that
it matches the real size.  While it has the same maintenance cost as
RECORD_HEADER_SIZE in the test itself, that catches a mismatching size
at compilation time rather than at test time.

I am not sure that I'll be able to do more on this topic this week, at
least that's some progress.
--
Michael
Attachment
On Mon, Aug 21, 2023 at 08:32:39AM +0900, Michael Paquier wrote:
> I am not sure that I'll be able to do more on this topic this week, at
> least that's some progress.

Some time later..  I have spent more time and thought on the whole test
suite, finishing with the attached 0003 that applies on top of your own
patches.  I am really looking forward to making this whole logic more
robust, so that WAL replay itself can be made more robust for the
OOM/end-of-WAL detection on HEAD for standbys and crash recovery.

While looking at the whole, I have considered a few things that may
make the test cleaner, like:
- Calculating the segment name and its offset from the end_lsn of a
  record directly from the backend, but it felt inelegant to pass the
  couple ($segment, $offset) through more subroutine layers rather than
  just an LSN, so guessing the segment number and the offset while the
  cluster is offline is OK by me.
- The TLI can be queried from the server rather than hardcoded.
- I've been thinking about bundling the tests of each sub-section in
  their own subroutine, but that felt a bit awkward, particularly for
  the part where we need a correct $prev_lsn in the record header
  written to enforce other code paths.
- The test needs better documentation.  One of the things I kept
  staring at is cross-checking pack() and the dependency on the C
  structures, so I have added more details there, explaining more of
  the whys and the hows.

I have also looked again at the C code for a few hours, and still got
the impression that this is rather solid.  There are two things that
may be better:
- Document at the top of allocate_recordbuf() that this should never be
  called with a length coming from a header until it is validated.
- Removing AllocSizeIsValid() for the non-FRONTEND path should be OK.

What do you think?
--
Michael
Attachment
Hello

I get an error from the new test due to the new scan_header function in
Test/Utils.pm.  I applied all 3 patches and use:

./configure --prefix=/home/melkij/tmp/pgdev/inst --enable-cassert --enable-debug CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer" --enable-tap-tests
make -s world
make check-world

Then I see:

Test Summary Report
-------------------
t/038_end_of_wal.pl (Wstat: 512 (exited 2) Tests: 0 Failed: 0)
  Non-zero exit status: 2
  Parse errors: No plan found in TAP output

The corresponding tmp_check/log/regress_log_038_end_of_wal shows the error:

No such file or directory at /home/melkij/tmp/srcs/postgresql/src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Utils.pm line 725.

pg_config in the scan_header function returns
/home/melkij/tmp/srcs/postgresql/tmp_install/home/melkij/tmp/pgdev/inst/include,
then the function tries to concatenate server/access/xlog_internal.h,
while the required file is in postgresql/server/access/xlog_internal.h
under this directory.

If I change the test like this for both scan_header calls, then it
succeeds:

- my @scan_result = scan_header('server/access/xlog_internal.h',
+ my @scan_result = scan_header('postgresql/server/access/xlog_internal.h',

Is there something wrong with my setup?

regards, Sergei
On Sat, Sep 02, 2023 at 02:51:48PM +0300, Sergei Kornilov wrote:
> The corresponding tmp_check/log/regress_log_038_end_of_wal shows the error:
>
> No such file or directory at /home/melkij/tmp/srcs/postgresql/src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Utils.pm line 725.
>
> pg_config in the scan_header function returns
> /home/melkij/tmp/srcs/postgresql/tmp_install/home/melkij/tmp/pgdev/inst/include,
> then the function tries to concatenate
> server/access/xlog_internal.h, while the required file is in
> postgresql/server/access/xlog_internal.h under this directory.
>
> If I change the test like this for both scan_header calls, then it
> succeeds:
>
> - my @scan_result = scan_header('server/access/xlog_internal.h',
> + my @scan_result = scan_header('postgresql/server/access/xlog_internal.h',
>
> Is there something wrong with my setup?

I think that's just a mistake I made, as there should not even be a
need to specify "server/" for the header path.  If you switch
scan_header() in Utils.pm to use --includedir-server rather than
--includedir, does it work correctly for you?
--
Michael
Attachment
On Sun, Sep 03, 2023 at 08:42:55AM +0900, Michael Paquier wrote: > I think that's just a mistake I did, as there should not even be a > need for having to specify "server/" for the header path. If you > switch scan_header() in Utils.pm to use --includedir-server rather > than --includedir, does it work correctly for you? In short, I mean something like the v4 attached. Feel free to use it for your tests. -- Michael
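Roughly, the routine under discussion boils down to something like the following sketch (assuming pg_config is on PATH; the attached version may differ in details):

# Locate a server header via pg_config --includedir-server and return the
# capture groups of the first line matching the given regexp, so that a TAP
# test can fish constants such as XLOG_PAGE_MAGIC out of the source.
sub scan_server_header_sketch
{
    my ($header_path, $regexp) = @_;

    my $incdir = `pg_config --includedir-server`;
    chomp $incdir;

    open my $fh, '<', "$incdir/$header_path"
      or die "could not open $header_path: $!";
    while (my $line = <$fh>)
    {
        if (my @match = $line =~ /$regexp/)
        {
            close $fh;
            return @match;
        }
    }
    close $fh;
    die "could not find match in header $header_path";
}

# For example:
#   my ($magic) = scan_server_header_sketch('access/xlog_internal.h',
#       '#define\s+XLOG_PAGE_MAGIC\s+(\w+)');
#   my $XLP_PAGE_MAGIC = hex($magic);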
Attachment
Hello Thank you, this version passes check-world for me. Looks good (but I'm certainly not a big expert on xlogreader), a solid set of tests for different cases. regards, Sergei
Thanks for looking/testing, Sergei.

Thanks for the changes, Michael, these all look good.  I've squashed
them and added you as co-author.  A couple more small comment/text
changes:

1.  In the place where we fail to allocate memory for an oversized
record, I copied the comment about treating that as a "bogus data"
condition.  I suspect that we will soon be converting that to a FATAL
error[1], and that'll need to be done in both places.

2.  In this version of the commit message I said we'd only back-patch
to 15 for now.  After sleeping on this for a week, I realised that the
reason I keep vacillating on that point is that I am not sure what your
plan is for the malloc-failure-means-end-of-wal policy ([1], ancient
code from 0ffe11abd3a).  If we're going to fix that in master only but
let sleeping dogs lie in the back-branches, then it becomes less
important to go back further than 15 with THIS patch.  But if you want
to be able to distinguish garbage from out-of-memory, and thereby
end-of-wal from a FATAL please-insert-more-RAM condition, I think you'd
really need this industrial strength validation in all affected
branches, and I'd have more work to do, right?  The weak validation we
are fixing here is the *real* underlying problem going back many years,
right?

I also wondered about strengthening the validation of various things
like redo begin/end LSNs etc in these tests.  But we can always
continue to improve all this later...

Here also is a version for 15 (and a CI run[2]), since we tweaked many
of the error messages between 15 and 16.

[1] https://www.postgresql.org/message-id/flat/ZMh/WV%2BCuknqePQQ%40paquier.xyz
[2] https://cirrus-ci.com/task/4533280897236992 (failed on some
unrelated pgbench test, reported in another thread)
Attachment
On Mon, Sep 04, 2023 at 03:20:31PM +1200, Thomas Munro wrote:
> 1.  In the place where we fail to allocate memory for an oversized
> record, I copied the comment about treating that as a "bogus data"
> condition.  I suspect that we will soon be converting that to a FATAL
> error[1], and that'll need to be done in both places.

You mean for the two callers of XLogReadRecordAlloc(), even for the
case where !allow_oversized?  Using a FATAL on non-FRONTEND would be
the quickest fix, indeed, but there are arguments for standbys where we
could let these continue, as well.  That would be an improvement over
the always-FATAL on OOM, of course.

> 2.  In this version of the commit message I said we'd only back-patch
> to 15 for now.  After sleeping on this for a week, I realised that the
> reason I keep vacillating on that point is that I am not sure what
> your plan is for the malloc-failure-means-end-of-wal policy ([1],
> ancient code from 0ffe11abd3a).  If we're going to fix that in master
> only but let sleeping dogs lie in the back-branches, then it becomes
> less important to go back further than 15 with THIS patch.

For HEAD, my plan was to give more control to the callers of the WAL
reader, which is not something that could happen in back-branches
because it requires a few structural changes in the reporting the
reader is able to do.  That's the standby vs crash-recovery behavior,
where we can make the startup process not FATAL all the time because it
should just loop forever if it cannot process a record.  It does not
fix the root issue discussed here, of course, but it makes the OOM
handling more friendly in some replay cases.

> But if you
> want to be able to distinguish garbage from out-of-memory, and thereby
> end-of-wal from a FATAL please-insert-more-RAM condition, I think
> you'd really need this industrial strength validation in all affected
> branches, and I'd have more work to do, right?  The weak validation we
> are fixing here is the *real* underlying problem going back many
> years, right?

Getting the same validation checks for all the branches would be nice.
FATAL-ing on OOM to force recovery to happen again is a better option
than assuming that it is the end of recovery.  I am OK to provide
patches for all the branches for the sake of this thread, if that
helps.  Switching to a hard FATAL on OOM for the WAL reader in the
backend is backpatchable, but I'd rather consider that on a different
thread once the better checks for the record header are in place.

> I also wondered about strengthening the validation of various things
> like redo begin/end LSNs etc in these tests.  But we can always
> continue to improve all this later...

Agreed.
--
Michael
Attachment
On Mon, Sep 4, 2023 at 3:54 PM Michael Paquier <michael@paquier.xyz> wrote: > On Mon, Sep 04, 2023 at 03:20:31PM +1200, Thomas Munro wrote: > > 1. In the place where we fail to allocate memory for an oversized > > record, I copied the comment about treating that as a "bogus data" > > condition. I suspect that we will soon be converting that to a FATAL > > error[1], and that'll need to be done in both places. > > You mean for the two callers of XLogReadRecordAlloc(), even for the > case where !allow_oversized? Using a FATAL on non-FRONTEND would be > the quickest fix, indeed, but there are argument for standbys where we > could let these continue, as well. That would be an improvement over > the always-FATAL on OOM, of course. I just mean the two places where "bogus data" is mentioned in that v5 patch. > > But if you > > want to be able to distinguish garbage from out-of-memory, and thereby > > end-of-wal from a FATAL please-insert-more-RAM condition, I think > > you'd really need this industrial strength validation in all affected > > branches, and I'd have more work to do, right? The weak validation we > > are fixing here is the *real* underlying problem going back many > > years, right? > > Getting the same validation checks for all the branches would be nice. > FATAL-ing on OOM to force recovery to happen again is a better option > than assuming that it is the end of recovery. I am OK to provide > patches for all the branches for the sake of this thread, if that > helps. Switching to a hard FATAL on OOM for the WAL reader in the > backend is backpatchable, but I'd rather consider that on a different > thread once the better checks for the record header are in place. OK, so it sounds like you want to go back to 12. Let me see if I can get this TAP test to work in 12... more tomorrow.
On Mon, Sep 04, 2023 at 07:17:21PM +1200, Thomas Munro wrote: > I just mean the two places where "bogus data" is mentioned in that v5 patch. Got it, thanks. > OK, so it sounds like you want to go back to 12. Let me see if I can > get this TAP test to work in 12... more tomorrow. If you're restricted on time, I am OK to deep-dive into these branches as well ;) -- Michael
Attachment
On Mon, Sep 4, 2023 at 7:28 PM Michael Paquier <michael@paquier.xyz> wrote: > On Mon, Sep 04, 2023 at 07:17:21PM +1200, Thomas Munro wrote: > > OK, so it sounds like you want to go back to 12. Let me see if I can > > get this TAP test to work in 12... more tomorrow. > > If you're restricted on time, I am OK to deep-dive into these branches > as well ;) I am behind on several things, so if you want to, please go ahead! I already posted a 15 version and a master version (which should also apply and work on 16), so we'd need 12, 13, 14.
On Mon, Sep 04, 2023 at 07:43:59PM +1200, Thomas Munro wrote:
> I am behind on several things, so if you want to, please go ahead!  I
> already posted a 15 version and a master version (which should also
> apply and work on 16), so we'd need 12, 13, 14.

And here you go for all three branches, with all the tests passing.
This change is complicated enough that it would be better to get more
eyes on the three patches for these specific branches.  Now, this is
much simpler since we don't have to worry about the prefetching.

For these versions, it struck me that we *require* 6860198 so that the
header validation can happen correctly across page reads when doing
crash recovery.  I have added this change in the attached patches for
simplicity, but 6860198 should be applied first on 12~14 as an
independent change.  I can do that as a first step, if you are OK with
that.

I haven't taken the time to rewrite the commit message yet for the main
patch ;p
--
Michael
Attachment
On Tue, Sep 5, 2023 at 3:19 PM Michael Paquier <michael@paquier.xyz> wrote: > And here you go for all the three branches, with all the tests > passing. This change is complicated enough that it would be better to > get more eyes on the three patches for these specific branches. Now, > this is much simpler since we don't have to worry about the > prefetching. Thanks! Yeah. > For these versions, it struck me that we *require* 6860198 so as the > header validation is able to correctly happen across page reads when > doing crash recovery. I have added this change in the patches > attached for simplicity, but 6860198 should be applied first on 12~14 > as an independent change. I can do that as a first step, if you are > OK with that. Ahhhh, that explains why back-patching was unexpectedly unsuccessful when I tried it. I have added Fujii-san to the CC list. I don't see any problem with that plan but it would be good to hear from him.
On Tue, Sep 05, 2023 at 03:58:42PM +1200, Thomas Munro wrote: > Ahhhh, that explains why back-patching was unexpectedly unsuccessful > when I tried it. I have added Fujii-san to the CC list. I don't see > any problem with that plan but it would be good to hear from him. Right. Let's wait for a bit and see if there are any comments on this point. With 16.0 in sight next week, I'd like to propose that we move on with this issue once next week's tag is pushed. -- Michael
Attachment
On Tue, Sep 05, 2023 at 01:13:16PM +0900, Michael Paquier wrote: > Right. Let's wait for a bit and see if there are any comments on this > point. With 16.0 in sight next week, I'd like to propose that we move > on with this issue once next week's tag is pushed. Hearing nothing, I have applied the backpatch of 0668719801 to 12~14. Please find attached the remaining patches for all the stable branches. Thomas, would you prefer handling that yourself once 16.0 is tagged with a last round of checks? Or should I? (Still need to edit the commit message a bit..) -- Michael
Attachment
- v6-0001-Don-t-trust-unvalidated-xl_tot_len.patch
- v6-0001-Don-t-trust-unvalidated-xl_tot_len.patch.16
- v6-0001-Don-t-trust-unvalidated-xl_tot_len.patch.15
- v6-0001-Don-t-trust-unvalidated-xl_tot_len.patch.14
- v6-0001-Don-t-trust-unvalidated-xl_tot_len.patch.13
- v6-0001-Don-t-trust-unvalidated-xl_tot_len.patch.12
- signature.asc
On Tue, Sep 12, 2023 at 1:01 PM Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Sep 05, 2023 at 01:13:16PM +0900, Michael Paquier wrote: > > Right. Let's wait for a bit and see if there are any comments on this > > point. With 16.0 in sight next week, I'd like to propose that we move > > on with this issue once next week's tag is pushed. > > Hearing nothing, I have applied the backpatch of 0668719801 to 12~14. > Please find attached the remaining patches for all the stable > branches. Thanks! > Thomas, would you prefer handling that yourself once 16.0 is tagged > with a last round of checks? Or should I? (Still need to edit the > commit message a bit..) I'll do that (and update the message). Thanks for all the help!
On Tue, Sep 12, 2023 at 02:08:09PM +1200, Thomas Munro wrote: > I'll do that (and update the message). Thanks for all the help! Cool, thanks! -- Michael
Attachment
On Tue, Sep 12, 2023 at 2:12 PM Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Sep 12, 2023 at 02:08:09PM +1200, Thomas Munro wrote: > > I'll do that (and update the message). Thanks for all the help! > > Cool, thanks! I rebased (bumped the test number 038 -> 039), updated the commit message[1], tested on a big endian machine (AIX/POWER), but then I found there is something not quite right or stable with the WAL positioning logic. On one of my machines, but not another, the back-patch to 12 fails in one of the tests unless I hack it to insert a bit more WAL, and I think that's telling me that Perl code to control the insertion point is not quite bullet proof enough yet. I'll need to dig into this some more on Monday. [1] https://github.com/macdice/postgres/commits/fix-12
On Sat, Sep 16, 2023 at 10:07:30AM +1200, Thomas Munro wrote: > I rebased (bumped the test number 038 -> 039), updated the commit > message[1], tested on a big endian machine (AIX/POWER), but then I > found there is something not quite right or stable with the WAL > positioning logic. On one of my machines, but not another, the > back-patch to 12 fails in one of the tests unless I hack it to insert > a bit more WAL, and I think that's telling me that Perl code to > control the insertion point is not quite bullet proof enough yet. > I'll need to dig into this some more on Monday. > > [1] https://github.com/macdice/postgres/commits/fix-12 Hmm. What was the test that failed? -- Michael
Attachment
On Sat, Sep 16, 2023 at 12:03 PM Michael Paquier <michael@paquier.xyz> wrote:
> > [1] https://github.com/macdice/postgres/commits/fix-12
>
> Hmm.  What was the test that failed?

$ make -s -C src/test/recovery/ check PROVE_TESTS=t/039*
t/039_end_of_wal.pl .. 4/?
#   Failed test 'xlp_magic zero'
#   at t/039_end_of_wal.pl line 312.
not ok 5 - xlp_magic zero

Where the log should say "invalid magic number 0000" I see:

2023-09-16 12:13:07.331 NZST [156812] LOG:  record with incorrect prev-link 0/16B60C0 at 0/16B6120

It has to do with initial WAL position after initdb, because I get this
only on Debian, on REL_12_STABLE (with the commit listed above on my
public fix-12 branch) and only with --with-icu, but not without it, and
I can't repro it on my other local OSes.
On Sat, Sep 16, 2023 at 12:20:36PM +1200, Thomas Munro wrote:
> It has to do with initial WAL position after initdb, because I get
> this only on Debian, on REL_12_STABLE (with the commit listed above on
> my public fix-12 branch) and only with --with-icu, but not without it,
> and I can't repro it on my other local OSes.

That's close to my dev environment, with SID and x64, and the only
difference between your branch and my local branch is a rename of the
TAP file, yet I cannot see the problem.

Some random thoughts: perhaps reducing $page_threshold would help, with
more records inserted to get to the limit of a page, reducing noise?
Or perhaps skipping a few pages would?
--
Michael
Attachment
Hello,

16.09.2023 03:20, Thomas Munro wrote:
> On Sat, Sep 16, 2023 at 12:03 PM Michael Paquier <michael@paquier.xyz> wrote:
>>> [1] https://github.com/macdice/postgres/commits/fix-12
>> Hmm.  What was the test that failed?
> $ make -s -C src/test/recovery/ check PROVE_TESTS=t/039*
> t/039_end_of_wal.pl .. 4/?
> # Failed test 'xlp_magic zero'
> # at t/039_end_of_wal.pl line 312.
>
> not ok 5 - xlp_magic zero
>
> Where the log should say "invalid magic number 0000" I see:
>
> 2023-09-16 12:13:07.331 NZST [156812] LOG:  record with incorrect
> prev-link 0/16B60C0 at 0/16B6120
>
> It has to do with initial WAL position after initdb, because I get
> this only on Debian, on REL_12_STABLE (with the commit listed above on
> my public fix-12 branch) and only with --with-icu, but not without it,
> and I can't repro it on my other local OSes.

I tried to reproduce the failure on Debian 9, 10, 11, but have not
succeeded yet.  Though I got another error on Debian 9:

t/039_end_of_wal.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
No subtests run

cat src/test/recovery/tmp_check/log/regress_log_039_end_of_wal
could not find match in header access/xlog_internal.h

It looks like the construction "@{^CAPTURE}" used in scan_server_header()
is not supported by Perl 5.24, which is included in Debian stretch:
https://perldoc.perl.org/variables/@%7B%5ECAPTURE%7D
I replaced it with
@match = ($1);
and that worked for me.

Also, I observed that "wal_log_hints = on" in extra.config, which I use
via "TEMP_CONFIG=extra.config make check-world", makes the test fail too,
though check-world passes fine without the new test.  Maybe that's not an
issue, and probably there are other parameters which might affect this
test, but I'm somewhat confused by the fact that only this test breaks
with it.

Best regards,
Alexander
On Sat, Sep 16, 2023 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> It looks like the construction "@{^CAPTURE}" used in scan_server_header()
> is not supported by Perl 5.24, which is included in Debian stretch:
> https://perldoc.perl.org/variables/@%7B%5ECAPTURE%7D
> I replaced it with
> @match = ($1);
> and that worked for me.

Thanks for catching that!  I see that @{^CAPTURE} was added in Perl
5.25.7, but we document Perl 5.14 or later as the requirement currently,
and for REL_12_STABLE it's Perl 5.8.3.  That would certainly have turned
a few build farm fossils red.

Your replacement only returns the first captured group.  I don't know
Perl, but I think we can just assign the whole array to @match directly,
to maintain the semantics that Michael wanted, like so:

-        if ($line =~ /^$regexp/)
+        if (@match = $line =~ /^$regexp/)
         {
-            # Found match, so store all the results.
-            @match = @{^CAPTURE};
             last;
         }

Better Perl welcome.  Looking into your observation about wal_log_hints...
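A quick note on why the replacement works: in list context a match returns all of its capture groups (or the empty list on failure), so the assignment both fills @match and serves as the condition, on any Perl old enough to lack @{^CAPTURE}.  A tiny sketch:

# List-context match: captures come back directly, no @{^CAPTURE} needed.
my @match = '#define XLOG_PAGE_MAGIC 0xd113' =~ /#define\s+(\w+)\s+(\w+)/;
# @match is now ('XLOG_PAGE_MAGIC', '0xd113'); it would be () on no match.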
On Sat, Sep 16, 2023 at 6:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: > Also, I observed that "wal_log_hints = on" in extra.config, which I use via > "TEMP_CONFIG=extra.config make check-world" makes the test fail too, though > check-world passes fine without the new test. > Maybe that's not an issue, and probably there are other parameters, which > might affect this test, but I'm somewhat confused by the fact that only this > test breaks with it. The problem was that the Perl functions advance_XXX() would sometimes see an "unflushed" insert position caused by hint records and other incidental stuff, and then we'd write our contrived header at a disconnected location after a bunch of (unwritten) zeroes. The solution in this version is to start each test with emit_message($node, 0), which flushes a small record.
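For context, the helper in question amounts to something like the sketch below (not necessarily the exact code in the patches): a transactional pg_logical_emit_message(), whose commit forces the WAL out to disk even when the payload size is zero.

# Emit a transactional logical message of the given size and return a
# server-reported LSN as a byte count.  Because the message is part of a
# committed transaction, nothing is left sitting unflushed in wal_buffers
# when the node is stopped afterwards.
sub emit_message
{
    my ($node, $size) = @_;
    return int(
        $node->safe_psql(
            'postgres',
            "SELECT pg_logical_emit_message(true, '', repeat('a', $size)) - '0/0'"
        ));
}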
Attachment
- v7-0001-Don-t-trust-unvalidated-xl_tot_len.patch
- v7-0001-Don-t-trust-unvalidated-xl_tot_len.patch.16
- v7-0001-Don-t-trust-unvalidated-xl_tot_len.patch.15
- v7-0001-Don-t-trust-unvalidated-xl_tot_len.patch.14
- v7-0001-Don-t-trust-unvalidated-xl_tot_len.patch.13
- v7-0001-Don-t-trust-unvalidated-xl_tot_len.patch.12
On Tue, Sep 19, 2023 at 09:25:34PM +1200, Thomas Munro wrote: > The problem was that the Perl functions advance_XXX() would sometimes > see an "unflushed" insert position caused by hint records and other > incidental stuff, and then we'd write our contrived header at a > disconnected location after a bunch of (unwritten) zeroes. The > solution in this version is to start each test with > emit_message($node, 0), which flushes a small record. Makes sense. Does this also address the other issue you've noticed on v12 when ICU is enabled (the one with the initial LSN changed after initdb)? -- Michael
Attachment
On Tue, Sep 19, 2023 at 9:53 PM Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Sep 19, 2023 at 09:25:34PM +1200, Thomas Munro wrote: > > The problem was that the Perl functions advance_XXX() would sometimes > > see an "unflushed" insert position caused by hint records and other > > incidental stuff, and then we'd write our contrived header at a > > disconnected location after a bunch of (unwritten) zeroes. The > > solution in this version is to start each test with > > emit_message($node, 0), which flushes a small record. > > Makes sense. Does this also address the other issue you've noticed on > v12 when ICU is enabled (the one with the initial LSN changed after > initdb)? Yep.
19.09.2023 13:09, Thomas Munro wrote:
> On Tue, Sep 19, 2023 at 9:53 PM Michael Paquier <michael@paquier.xyz> wrote:
>> On Tue, Sep 19, 2023 at 09:25:34PM +1200, Thomas Munro wrote:
>>> The problem was that the Perl functions advance_XXX() would sometimes
>>> see an "unflushed" insert position caused by hint records and other
>>> incidental stuff, and then we'd write our contrived header at a
>>> disconnected location after a bunch of (unwritten) zeroes.  The
>>> solution in this version is to start each test with
>>> emit_message($node, 0), which flushes a small record.
Thanks for the fixes!
I've tested this patch version on Debian 9 and it works fine now.
But as to wal_log_hints, I see that v7 patches work for REL_12_STABLE and
REL_13_STABLE, but not for other branches, though v6 patches definitely
failed on REL_12_STABLE and REL_13_STABLE too. So the behavior improved,
but not for all versions, for some reason (I haven't looked deeper yet).
Also I've noticed a few minor discrepancies in the patches.
A commit message in patch.15 contains the sentence "We should fix ...",
not "We could fix ...", as in commit messages for other versions.
Also, patch.12 contains a duplicated "Reviewed-by" string.
And maybe "xlp_prev" is a typo?
Best regards,
Alexander
On Wed, Sep 20, 2023 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: > I've tested this patch version on Debian 9 and it works fine now. Progress... > But as to wal_log_hints, I see that v7 patches work for REL_12_STABLE and > REL_13_STABLE, but not for other branches, though v6 patches definitely > failed on REL_12_STABLE and REL_13_STABLE too. So the behavior improved, > but not for all versions, for some reason (I haven't looked deeper yet). Ahh, yeah, there was a second problem screwing up the LSN accounting. I was trying to write correct xl_prev links, but also emitting optional padding records to get into the right position (depending on initial conditions varying between branches etc), and I forgot about the extra COMMIT records that emit_message() generates. Which, I guess, comes back to Michael's observation that this would all be a bit easier if we had a way to emit and flush a single record... The solution in this version is to call get_insert_len() instead of using the result of emit_message() for the values returned by the advance_XXX() functions. The result of emit_message() is actually the LSN of the following COMMIT record so can't be used directly for building xl_prev chains. > Also I've noticed a few minor discrepancies in the patches. > A commit message in patch.15 contains the sentence "We should fix ...", > not "We could fix ...", as in commit messages for other versions. > Also, patch.12 contains a duplicated "Reviewed-by" string. > And may be "xlp_prev" is a typo? Fixed.
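The replacement helper amounts to asking the server for its current insert position in bytes, along these lines (a sketch; the name is one of the variants mentioned earlier in the thread):

# Return the current WAL insert position as a byte count, so that xl_prev
# chains can be built without relying on emit_message()'s return value.
sub get_insert_lsn_in_bytes
{
    my $node = shift;
    return int($node->safe_psql('postgres',
        "SELECT pg_current_wal_insert_lsn() - '0/0'"));
}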
Attachment
- v8-0001-Don-t-trust-unvalidated-xl_tot_len.patch
- v8-0001-Don-t-trust-unvalidated-xl_tot_len.patch.16
- v8-0001-Don-t-trust-unvalidated-xl_tot_len.patch.15
- v8-0001-Don-t-trust-unvalidated-xl_tot_len.patch.14
- v8-0001-Don-t-trust-unvalidated-xl_tot_len.patch.13
- v8-0001-Don-t-trust-unvalidated-xl_tot_len.patch.12
On Wed, Sep 20, 2023 at 10:51:12AM +1200, Thomas Munro wrote:
> Ahh, yeah, there was a second problem screwing up the LSN accounting.
> I was trying to write correct xl_prev links, but also emitting
> optional padding records to get into the right position (depending on
> initial conditions varying between branches etc), and I forgot about
> the extra COMMIT records that emit_message() generates.  Which, I
> guess, comes back to Michael's observation that this would all be a
> bit easier if we had a way to emit and flush a single record...

Unfortunately this part is not going to be backpatched based on the
current consensus, so this is going to need an alternate and/or fluffy
solution to force the flushes at given points:
https://www.postgresql.org/message-id/20230816003333.7hn2rx5m2l7una3d@awork3.anarazel.de

The test could be largely simplified on HEAD once the flush option is
available.  I'm just waiting a bit more on the other thread, keeping an
eye on the temperature, but it looks like nobody would complain about
making that optional, with the default being what we did previously, at
least..

> The solution in this version is to call get_insert_len() instead of
> using the result of emit_message() for the values returned by the
> advance_XXX() functions.  The result of emit_message() is actually the
> LSN of the following COMMIT record so can't be used directly for
> building xl_prev chains.

And by doing so the test paths enter the inner loops generating the
records once again, so we only get out once we know that we are at the
wanted border, which is enough to stabilize the tests.  Smart.
--
Michael
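If that flush option does land, the test-side usage could be as small as the hypothetical call below (the fourth argument is the optional flush under discussion on the other thread, not something that exists in the released branches):

# Hypothetical: a non-transactional message that is nevertheless durable on
# its own, with no COMMIT record needed afterwards to force a flush.
$node->safe_psql('postgres',
    "SELECT pg_logical_emit_message(false, '', repeat('a', 100), true)");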
Attachment
20.09.2023 01:51, Thomas Munro wrote: > The solution in this version is to call get_insert_len() instead of > using the result of emit_message() for the values returned by the > advance_XXX() functions. The result of emit_message() is actually the > LSN of the following COMMIT record so can't be used directly for > building xl_prev chains. > >> And may be "xlp_prev" is a typo? > Fixed. I've retested this version (patches for all branches, plus wal_log_hints, plus Valgrind, and also retested code changes with my 099_restart_with_stanby.pl) and found no issues. IMHO, this patch is ready to float off. Thank you! Best regards, Alexander
A thought: commit 8fcb32db prevented us from logging messages that are too big to be decoded, but it wasn't back-patched. I think that means that in older branches, there is a behaviour change unrelated to the "garbage bytes" problem discussed in this thread, and separate also from the out-of-memory problem. If someone generates a record too big to decode, say with pg_logical_emit_message(), we will fail differently. Before this patch set, we'd bogusly detect end-of-WAL, and after this patch we'd fail to palloc and recovery would bogusly fail. Which outcome is more bogus is hard to answer, and clearly we should prevent it upstream, but didn't for technical reasons. Do you agree that that is a separate topic that doesn't prevent us from committing this fix?
On Fri, Sep 22, 2023 at 03:11:54PM +1200, Thomas Munro wrote:
> A thought: commit 8fcb32db prevented us from logging messages that are
> too big to be decoded, but it wasn't back-patched.

Yes, there was one part done in ffd1b6bb6f8 that required ABI breakage,
as well.  There is an argument against a backpatch, as a set of records
in a given range may fail to replay while they were allowed before (I
forget the exact math, but I recall that this was for records larger
than XLogRecordMaxSize, still lower than the max allocation mark).

> I think that means
> that in older branches, there is a behaviour change unrelated to the
> "garbage bytes" problem discussed in this thread, and separate also
> from the out-of-memory problem.  If someone generates a record too big
> to decode, say with pg_logical_emit_message(), we will fail
> differently.  Before this patch set, we'd bogusly detect end-of-WAL,
> and after this patch we'd fail to palloc and recovery would bogusly
> fail.  Which outcome is more bogus is hard to answer, and clearly we
> should prevent it upstream, but didn't for technical reasons.  Do you
> agree that that is a separate topic that doesn't prevent us from
> committing this fix?

I don't see why it's a problem on HEAD: a startup process reacts the
same way for the end of WAL or an OOM.  If we were to FATAL hard on all
stable branches for non-FRONTEND on OOM, which is something we'll have
to do anyway, then this patch set improves the situation because we
would fail in what I'd see as the better way in this case: with an OOM.
--
Michael
Attachment
On Fri, Sep 22, 2023 at 4:02 PM Michael Paquier <michael@paquier.xyz> wrote: > On Fri, Sep 22, 2023 at 03:11:54PM +1200, Thomas Munro wrote: > > A thought: commit 8fcb32db prevented us from logging messages that are > > too big to be decoded, but it wasn't back-patched. > > Yes, there was one part done in ffd1b6bb6f8 that has required ABI > breakages, as well. There is an argument against a backpatch as a > set of records in a given range may fail to replay while they were > allowed before (I forgot the exact math, but I recall that this was > for records larger than XLogRecordMaxSize, still lower than the max > allocation mark.). Hmm, OK, tricky choices. Anyway, I guess the next thing to straighten out in this area is the OOM policy. Pushed. Thanks Alexander and everyone for the report, testing and help! Time to watch the build farm to see if that Perl turns out to be robust enough...
On Sat, Sep 23, 2023 at 11:35:05AM +1200, Thomas Munro wrote:
> Hmm, OK, tricky choices.  Anyway, I guess the next thing to straighten
> out in this area is the OOM policy.
>
> Pushed.  Thanks Alexander and everyone for the report, testing and
> help!  Time to watch the build farm to see if that Perl turns out to
> be robust enough...

Looks pretty stable to me.  I can see one thing in wrasse related to
pack():
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2023-09-22%2023%3A23%3A23

Invalid type 'Q' in pack at t/039_end_of_wal.pl line 139.

I don't have the room to look at that now, perhaps a bit later today
depending on the flow of the wind.
--
Michael
Attachment
On Sat, Sep 23, 2023 at 12:05 PM Michael Paquier <michael@paquier.xyz> wrote: > Looks pretty stable to me. I can see one thing in wrasse related to > pack(): > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2023-09-22%2023%3A23%3A23 > Invalid type 'Q' in pack at t/039_end_of_wal.pl line 139. Oh, so although we were testing on 32 bit systems via CI (task "test_world_32" on "Linux - Debian Bullseye - Meson"), in that case the perl binary is still 64 bit. Apparently 32 bit perl binary (at least of a certain vintage) doesn't like 'Q'.
Thomas Munro <thomas.munro@gmail.com> writes: > Oh, so although we were testing on 32 bit systems via CI (task > "test_world_32" on "Linux - Debian Bullseye - Meson"), in that case > the perl binary is still 64 bit. Apparently 32 bit perl binary (at > least of a certain vintage) doesn't like 'Q'. We discovered this once before, cf 945d2cb7d. regards, tom lane
On Sat, Sep 23, 2023 at 12:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > Oh, so although we were testing on 32 bit systems via CI (task > > "test_world_32" on "Linux - Debian Bullseye - Meson"), in that case > > the perl binary is still 64 bit. Apparently 32 bit perl binary (at > > least of a certain vintage) doesn't like 'Q'. > > We discovered this once before, cf 945d2cb7d. Here is my proposed fix. I couldn't quite use that trick, but we can safely assume that the higher order bits of any LSN generated by this test are zero. Patch tested on x86 and big-endian POWER. If no one has a more idiomatic perl trick for this, I'll push this soon.
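For what it's worth, one way to express that without the 'Q' template is to split the LSN into two native 32-bit words and order them by host endianness, along these lines (a sketch under the same assumption that the high half is zero for the LSNs this test generates; not necessarily the committed change):

use Config;

# Pack a 64-bit LSN as two 32-bit words so that 32-bit Perls lacking the
# 'Q' pack template can still build the header bytes.  Word order follows
# the host byte order, since the server reads WAL in its native layout.
sub pack_lsn
{
    my $lsn = shift;
    my $lo = $lsn % 2**32;
    my $hi = int($lsn / 2**32);    # expected to be zero in this test
    return $Config{byteorder} =~ /^1234/
      ? pack('LL', $lo, $hi)       # little-endian host
      : pack('LL', $hi, $lo);      # big-endian host
}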
Attachment
On Sat, Sep 23, 2023 at 12:05 PM Michael Paquier <michael@paquier.xyz> wrote: > Looks pretty stable to me. I can see one thing in wrasse related to > pack(): Hmm, copperhead (riscv) showed an unusual failure, a segfault in suspiciously nearby code. I don't immediately know what that's about, let's see if we get more clues...
On Sat, Sep 23, 2023 at 1:07 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Here is my proposed fix. I couldn't quite use that trick, but we can > safely assume that the higher order bits of any LSN generated by this > test are zero. Patch tested on x86 and big-endian POWER. If no one > has a more idiomatic perl trick for this, I'll push this soon. And done, after running through pgperltidy.
On Sat, Sep 23, 2023 at 02:02:02PM +1200, Thomas Munro wrote:
> Hmm, copperhead (riscv) showed an unusual failure, a segfault in
> suspiciously nearby code.  I don't immediately know what that's about,
> let's see if we get more clues...

Yep, something is going on with the prefetching code:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-09-22%2023%3A16%3A33

Using host libthread_db library "/lib/riscv64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: paris: startup recovering 000000030000000000000003 '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  pg_comp_crc32c_sb8 (crc=1613114916, crc@entry=4294967295, data=data@entry=0x2af9e00d48, len=<optimized out>) at pg_crc32c_sb8.c:56
56              uint32 a = *p4++ ^ crc;
#0  pg_comp_crc32c_sb8 (crc=1613114916, crc@entry=4294967295, data=data@entry=0x2af9e00d48, len=<optimized out>) at pg_crc32c_sb8.c:56
#1  0x0000002ad59a1536 in ValidXLogRecord (state=0x2af9db1fc0, record=0x2af9e00d30, recptr=50520048) at xlogreader.c:1195
#2  0x0000002ad59a285a in XLogDecodeNextRecord (state=state@entry=0x2af9db1fc0, nonblocking=<optimized out>) at xlogreader.c:842
#3  0x0000002ad59a28c0 in XLogReadAhead (state=0x2af9db1fc0, nonblocking=nonblocking@entry=false) at xlogreader.c:969
#4  0x0000002ad59a0996 in XLogPrefetcherNextBlock (pgsr_private=184580836680, lsn=0x2af9e14618) at xlogprefetcher.c:496
#5  0x0000002ad59a11c8 in lrq_prefetch (lrq=<optimized out>) at xlogprefetcher.c:256
#6  lrq_complete_lsn (lsn=<optimized out>, lrq=0x2af9e145c8) at xlogprefetcher.c:294
#7  XLogPrefetcherReadRecord (prefetcher=prefetcher@entry=0x2af9e00d48, errmsg=errmsg@entry=0x3fec9a2bf0) at xlogprefetcher.c:1041

The stack may point at a different issue, but perhaps this is a case
where we now return XLREAD_SUCCESS where previously we had XLREAD_FAIL,
causing this code to think that the block is valid when it is not?
--
Michael
Attachment
On Sat, Sep 23, 2023 at 4:44 PM Michael Paquier <michael@paquier.xyz> wrote: > The stack may point out at a different issue, but perhaps this is a > matter where we're returning now XLREAD_SUCCESS where previously we > had XLREAD_FAIL, causing this code to fail thinking that the block was > valid while it's not? "grison" has a little more detail -- we see pg_comp_crc32c_sb8(len=4294636456). I'm wondering how to reproduce this, but among the questions that jump out I have: why was it ever OK that we load record->xl_tot_len into total_len, perform header validation, determine that total_len < len (= this record is all on one page, no reassembly loop needed, so now we're in the single-page branch), then call ReadPageInternal() again, then call ValidXLogRecord() which internally loads record->xl_tot_len *again*? ReadPageInternal() might have changed xl_tot_len, no? That seems to be a possible pathway to reading past the end of the buffer in the CRC check, no? If that value didn't change underneath us, I think we'd need an explanation for how we finished up in the single-page branch at xlogreader.c:842 with a large xl_tot_len, which I'm not seeing yet, though it might take more coffee. (Possibly supporting the re-read theory is the fact that it's only happening on a few very slow computers, though I have no idea why it would only happen on master [so far at least].)
On Sun, Sep 24, 2023 at 09:48:42AM +1300, Thomas Munro wrote:
> "grison" has a little more detail -- we see
> pg_comp_crc32c_sb8(len=4294636456).  I'm wondering how to reproduce
> this, but among the questions that jump out I have: why was it ever OK
> that we load record->xl_tot_len into total_len, perform header
> validation, determine that total_len < len (= this record is all on
> one page, no reassembly loop needed, so now we're in the single-page
> branch), then call ReadPageInternal() again, then call
> ValidXLogRecord() which internally loads record->xl_tot_len *again*?
> ReadPageInternal() might have changed xl_tot_len, no?  That seems to
> be a possible pathway to reading past the end of the buffer in the CRC
> check, no?
>
> If that value didn't change underneath us, I think we'd need an
> explanation for how we finished up in the single-page branch at
> xlogreader.c:842 with a large xl_tot_len, which I'm not seeing yet,
> though it might take more coffee.  (Possibly supporting the re-read
> theory is the fact that it's only happening on a few very slow
> computers, though I have no idea why it would only happen on master
> [so far at least].)

Hmm, it looks pretty clear that this is a HEAD-only thing as the
buildfarm shows and as you say, and my primary suspect here would be
71e4cc6b8ec6, I think.  Any race condition underneath it would be
easier to see on slower machines.  So it seems quite possible that this
has messed up the page insertion logic.
--
Michael
Attachment
On Sun, Sep 24, 2023 at 12:23 PM Michael Paquier <michael@paquier.xyz> wrote: > Hmm, it looks pretty clear that this is a HEAD-only thing as the > buildfarm shows and as you say, and my primary suspect here would be > 71e4cc6b8ec6, I think. Any race condition underneath it would be > easier to see on slower machines. So it's likely possible that this > has messed up the page insertion logic. I see there was a failure on 16 on the very slow AIX box, and I have access so looking into that...
On Mon, Sep 25, 2023 at 09:02:35AM +1300, Thomas Munro wrote:
> I see there was a failure on 16 on the very slow AIX box, and I have
> access so looking into that...

Lucky you, if I may say ;)

A bunch of architectures that are not Intel are failing.  Here is a
summary based on the buildfarm reports:
topminnow, mips64el with gcc 4.9.2
mereswine, ARMv7 with gcc 10.2.1
sungazer, ppc64 with gcc 8.3.0
frogfish, mips64el with gcc 4.6.3
mamba, macppc with gcc 10.4.0
gull, ARMv7 with clang 13.0.0
grison, ARMv7 with gcc 4.6.3
copperhead, riscv64 with gcc 10.X

The only thing close to that I have close by is tanager on ARMv7 (it
has not reported to the buildfarm for a few weeks as it has overheated
because of the summer here, but I've put it back online now).  However,
it has passed a few hundred cycles with both gcc and clang yesterday,
on top of having a clean buildfarm run.

With sungazer now failing on REL_16_STABLE, it feels to me that we are
actually looking at two bugs?  One on HEAD, and one in the stable
branches?  For HEAD and the 2PC failure, the records up to PREPARE
TRANSACTION should be replayed by the standby getting promoted, but I'd
rather dig into that with a host that's able to report the failure.
copperhead seems to be one of the failing hosts that's able to compile
things quickly.

Tom, Noah, or copperhead's owner, could it be possible to get access to
one of the failing hosts for more investigation?  I would not do more
than compile the code and check the state of the 2PC test for this
promotion failure.
--
Michael
Attachment
On Mon, Sep 25, 2023 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Mon, Sep 25, 2023 at 09:02:35AM +1300, Thomas Munro wrote:
> > I see there was a failure on 16 on the very slow AIX box, and I have
> > access so looking into that...
>
> Lucky you, if I may say ;)

FTR anyone involved with an open source project can get an account on
the GCC compile farm machines.  That particular machine is so
overloaded that it's practically unusable (~8 hours to run the test,
hard to run vi etc).

> A bunch of architectures that are not Intel are failing.  Here is a
> summary based on the buildfarm reports:
> topminnow, mips64el with gcc 4.9.2
> mereswine, ARMv7 with gcc 10.2.1
> sungazer, ppc64 with gcc 8.3.0
> frogfish, mips64el with gcc 4.6.3
> mamba, macppc with gcc 10.4.0
> gull, ARMv7 with clang 13.0.0
> grison, ARMv7 with gcc 4.6.3
> copperhead, riscv64 with gcc 10.X
>
> The only thing close to that I have close by is tanager on ARMv7 (it
> has not reported to the buildfarm for a few weeks as it has overheated
> because of the summer here, but I've put it back online now).
> However, it has passed a few hundred cycles with both gcc and clang
> yesterday, on top of having a clean buildfarm run.

One thing that the failing systems have in common is that they are
extremely slow: 3 to 8 hours to complete the tests.  turaco is an armv7
system that doesn't fail, but it's much faster.  At a guess, probably
something like an armv8 CPU that is just running 32 bit armv7 software,
not a real old school armv7 chip.  Which gives me the idea to try these
tests under qemu...

> With sungazer now failing on REL_16_STABLE, it feels to me that we are
> actually looking at two bugs?  One on HEAD, and one in stable
> branches?  For HEAD and the 2PC failure, the records up to PREPARE
> TRANSACTION should be replayed by the standby getting promoted, but
> I'd rather dig into that with a host that's able to report the
> failure.

Oh, right, yeah that is quite different and could even be unrelated.
On Mon, Sep 25, 2023 at 01:18:56PM +1300, Thomas Munro wrote: > On Mon, Sep 25, 2023 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Sep 25, 2023 at 09:02:35AM +1300, Thomas Munro wrote: > > > I see there was a failure on 16 on the very slow AIX box, and I have > > > access so looking into that... > > > > Lucky you, if I may say ;) > > FTR anyone involved with an open source project can get an account on > the GCC compile farm machines. That particular machine is so > overloaded that it's practically unusable (~8 hours to run the test, > hard to run vi etc). topminnow (gcc23) and frogfish (gcc22) are also part of the compile farm. They're faster than sungazer (gcc111) for src/test/recovery.
Thomas Munro <thomas.munro@gmail.com> writes: > On Mon, Sep 25, 2023 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote: >> A bunch of architectures that are not Intel are failing. Here is a >> summary based on the buildfarm reports: >> mamba, macppc with gcc 10.4.0 That one is mine, so let me know if there's something particular you'd like me to check. regards, tom lane
On Sun, Sep 24, 2023 at 11:07:08PM -0400, Tom Lane wrote: > Thomas Munro <thomas.munro@gmail.com> writes: >> On Mon, Sep 25, 2023 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote: >>> A bunch of architectures that are not Intel are failing. Here is a >>> summary based on the buildfarm reports: >>> mamba, macppc with gcc 10.4.0 > > That one is mine, so let me know if there's something particular > you'd like me to check. That would help, thanks! Could it be possible to check if the attached patch applied on HEAD influences the fate of the recovery test 009_twophase.pl? That's one thing I am really doubting about, and we've not done much since REL_16_STABLE has been forked.. (I'm still trying to get an environment where this can be reproduced, but qemu takes time.) -- Michael
Attachment
On Mon, Sep 25, 2023 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > On Mon, Sep 25, 2023 at 12:58 PM Michael Paquier <michael@paquier.xyz> wrote: > >> A bunch of architectures that are not Intel are failing. Here is a > >> summary based on the buildfarm reports: > >> mamba, macppc with gcc 10.4.0 > > That one is mine, so let me know if there's something particular > you'd like me to check. Thanks! But after spending the best part of a day installing and compiling stuff at snail's pace, I have just this moment managed to reproduce this reliably on an emulator running Debian armhf (= armv7). So hopefully I can track the issue down from here without bothering build farm owners.
Thomas Munro <thomas.munro@gmail.com> writes:
> On Mon, Sep 25, 2023 at 4:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That one is mine, so let me know if there's something particular
>> you'd like me to check.

> Thanks!  But after spending the best part of a day installing and
> compiling stuff at snail's pace, I have just this moment managed to
> reproduce this reliably on an emulator running Debian armhf (= armv7).
> So hopefully I can track the issue down from here without bothering
> build farm owners.

FWIW, here's what I see on mamba:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  pg_comp_crc32c_sb8 (crc=3197562372, crc@entry=4294967295, data=data@entry=0xfda6e030, len=<optimized out>) at pg_crc32c_sb8.c:56
56              uint32 a = *p4++ ^ crc;
(gdb) bt
#0  pg_comp_crc32c_sb8 (crc=3197562372, crc@entry=4294967295, data=data@entry=0xfda6e030, len=<optimized out>) at pg_crc32c_sb8.c:56
#1  0x0195ae5c in ValidXLogRecord (state=0xfde5cc88, record=0xfda6e018, recptr=<optimized out>) at xlogreader.c:1195
#2  0x0195cf18 in XLogDecodeNextRecord (state=state@entry=0xfde5cc88, nonblocking=<optimized out>) at xlogreader.c:842
#3  0x0195d0c8 in XLogReadAhead (state=0xfde5cc88, nonblocking=nonblocking@entry=false) at xlogreader.c:969
#4  0x01959760 in XLogPrefetcherNextBlock (pgsr_private=4259700408, lsn=0xfda6b8a0) at xlogprefetcher.c:496
#5  0x0195a72c in lrq_prefetch (lrq=<optimized out>) at xlogprefetcher.c:256
#6  lrq_complete_lsn (lsn=50520048, lrq=0xfda6b858) at xlogprefetcher.c:294
#7  XLogPrefetcherReadRecord (prefetcher=prefetcher@entry=0xfde5deb8, errmsg=errmsg@entry=0xffffdd88) at xlogprefetcher.c:1041
#8  0x01960054 in ReadRecord (xlogprefetcher=0xfde5deb8, emode=emode@entry=15, fetching_ckpt=fetching_ckpt@entry=false, replayTLI=replayTLI@entry=3) at xlogrecovery.c:3067
#9  0x01963188 in PerformWalRecovery () at xlogrecovery.c:1756
#10 0x01952594 in StartupXLOG () at xlog.c:5470
#11 0x01c26e3c in StartupProcessMain () at startup.c:267
#12 0x01c1b504 in AuxiliaryProcessMain (auxtype=auxtype@entry=StartupProcess) at auxprocess.c:141
#13 0x01c2276c in StartChildProcess (type=StartupProcess) at postmaster.c:5347
#14 0x01c260ec in PostmasterMain (argc=argc@entry=4, argv=argv@entry=0xffffe1e4) at postmaster.c:1457
#15 0x01ee39d8 in main (argc=4, argv=0xffffe1e4) at main.c:198
(gdb) f 1
#1  0x0195ae5c in ValidXLogRecord (state=0xfde5cc88, record=0xfda6e018, recptr=<optimized out>) at xlogreader.c:1195
1195            COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record->xl_tot_len - SizeOfXLogRecord);
(gdb) p *record
$3 = {xl_tot_len = 0, xl_xid = 0, xl_prev = 0, xl_info = 0 '\000', xl_rmid = 0 '\000', xl_crc = 0}

Sure looks like ValidXLogRecord is assuming that record->xl_tot_len can
be trusted without reservation.

			regards, tom lane
On Mon, Sep 25, 2023 at 5:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Sure looks like ValidXLogRecord is assuming that record->xl_tot_len
> can be trusted without reservation.

I think there is something at least theoretically a bit fishy about the
ancient code that re-reads the page, doesn't consider short reads, and
assumes that record->xl_tot_len hasn't changed since last time it loaded
and validated it.  But I think that's also very unlikely to go wrong, and
it's not the problem here.  The problem is that bae868ca removed something
we still need:

-       /* XXX: more validation should be done here */
-       if (total_len < SizeOfXLogRecord)
-       {
-               report_invalid_record(state,
-                                     "invalid record length at %X/%X: expected at least %u, got %u",
-                                     LSN_FORMAT_ARGS(RecPtr),
-                                     (uint32) SizeOfXLogRecord, total_len);
-               goto err;
-       }
+       /* We'll validate the header once we have the next page. */
        gotheader = false;

If you happen to run into zeroes where an xl_tot_len is wanted right at the
end of a page (or any value not big enough to get you to the next page),
we fall through to the single-page branch and go directly to the CRC check,
but then ValidXLogRecord() subtracts SizeOfXLogRecord and gets a crazily
large length.  The CRC implementations used on modern systems happen to use
pointer arithmetic, so the loop terminates immediately without accessing any
memory, which is why nothing was obviously wrong on most systems.  The
_sb8.c implementation used for older ARM, MIPS etc. uses a length-based
loop, and reads off into deep space.

Draft patch attached, including a new test for 039_end_of_wal.pl that fails
on all systems without the above code.
Attachment
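To make the difference between the two loop shapes described above concrete,
here is a simplified sketch (not the actual pg_crc32c code; the CRC step is
replaced by a dummy update).  Assuming the underflowed length is large enough
that data + len wraps past the top of the address space, the pointer-style
loop's end pointer lands below the start and the body never runs, while the
count-style loop keeps dereferencing until it walks off mapped memory, which
is the SIGSEGV seen on mamba:

#include <stddef.h>
#include <stdint.h>

/*
 * Pointer-style loop, roughly the shape used on modern builds: a huge 'len'
 * makes 'pend' wrap below 'p' (strictly undefined behaviour, but in practice
 * it wraps), the condition is false at once, and nothing is read.
 */
static uint32_t
crc_pointer_style(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;
    const unsigned char *pend = p + len;

    while (p + 8 <= pend)
    {
        crc ^= (uint32_t) *(const uint64_t *) p;    /* stand-in for the real CRC step */
        p += 8;
    }
    return crc;
}

/*
 * Count-style loop, roughly the shape of the _sb8.c fallback: a huge 'len'
 * keeps the loop reading 8 bytes at a time far past the buffer.
 */
static uint32_t
crc_count_style(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len >= 8)
    {
        crc ^= (uint32_t) *(const uint64_t *) p;    /* reads off into deep space */
        p += 8;
        len -= 8;
    }
    return crc;
}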
Michael Paquier <michael@paquier.xyz> writes:
> On Sun, Sep 24, 2023 at 11:07:08PM -0400, Tom Lane wrote:
>> That one is mine, so let me know if there's something particular
>> you'd like me to check.

> That would help, thanks!  Could you check whether the attached patch,
> applied on HEAD, changes the outcome of the recovery test
> 009_twophase.pl?  That is the one change I am really suspicious of,
> and we have not changed much since REL_16_STABLE was forked.

Applying this does *not* fix it.  Stack trace looks the same
as in my previous message.

                        regards, tom lane
Thomas Munro <thomas.munro@gmail.com> writes:
> If you happen to run into zeroes where an xl_tot_len is wanted right at the
> end of a page (or any value not big enough to get you to the next page),
> we fall through to the single-page branch and go directly to the CRC check,
> but then ValidXLogRecord() subtracts SizeOfXLogRecord and gets a crazily
> large length.  The CRC implementations used on modern systems happen to use
> pointer arithmetic, so the loop terminates immediately without accessing any
> memory, which is why nothing was obviously wrong on most systems.  The
> _sb8.c implementation used for older ARM, MIPS etc. uses a length-based
> loop, and reads off into deep space.

Ah-hah.  Maybe there should be an Assert in the modern CRC code path
that detects the bogus arguments?

> Draft patch attached, including a new test for 039_end_of_wal.pl that fails
> on all systems without the above code.

Both the 009 and 039 tests pass on mamba with this patch.  I did not
read the code, but testing-wise it seems good.

                        regards, tom lane
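One way such an assertion might look, sticking with the simplified
pointer-style sketch shown earlier (purely hypothetical, for illustration;
not necessarily the form that was actually committed):

#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical guard in a pointer-style CRC routine: reject a length whose
 * computed end pointer lands below the start, which is what an underflowed
 * xl_tot_len produces.
 */
static inline void
check_crc_args(const void *data, size_t len)
{
    const unsigned char *p = data;
    const unsigned char *pend = p + len;

    assert(pend >= p);
}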
On Mon, Sep 25, 2023 at 01:13:37AM -0400, Tom Lane wrote:
> Applying this does *not* fix it.  Stack trace looks the same
> as in my previous message.

Okay.  Thanks anyway for the test!
--
Michael
Attachment
On Mon, Sep 25, 2023 at 6:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ah-hah.  Maybe there should be an Assert in the modern CRC code path
> that detects the bogus arguments?

Yeah.  Assertion added, and pushed.  Thanks for confirming on mamba,
and thanks also to Alexander for an off-list test report.
On Tue, Sep 26, 2023 at 11:14:52AM +1300, Thomas Munro wrote:
> Yeah.  Assertion added, and pushed.  Thanks for confirming on mamba,
> and thanks also to Alexander for an off-list test report.

Yeah++, thanks!
--
Michael