Thread: WAL record CRC calculated incorrectly because of underlying buffer modification
WAL record CRC calculated incorrectly because of underlying buffer modification
From
Alexander Lakhin
Date:
Hello hackers, I've investigated a recent buildfarm failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-05-02%2006%3A40%3A36 where the test failed due to a CRC error: 2024-05-02 17:08:18.401 ACST [3406:7] LOG: incorrect resource manager data checksum in record at 0/F14D7A60 (Chipmunk produced similar errors as well: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-08-25%2019%3A40%3A11 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2024-03-22%2003%3A20%3A39 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-08-19%2006%3A43%3A20 ) and discovered that XLogRecordAssemble() calculates CRC over a buffer, that might be modified by another process. With the attached patch applied, the following test run: echo " autovacuum_naptime = 1 autovacuum_vacuum_threshold = 1 wal_consistency_checking = all " >/tmp/temp.config for ((i=1;i<=100;i++)); do echo "iteration $i"; TEMP_CONFIG=/tmp/temp.config TESTS="test_setup hash_index" make check-tests -s || break; done fails for me on iterations 7, 10, 17: ok 1 - test_setup 2557 ms not ok 2 - hash_index 24719 ms # (test process exited with exit code 2) postmaster.log contains: 2024-05-10 12:46:44.320 UTC checkpointer[1881151] LOG: checkpoint starting: immediate force wait 2024-05-10 12:46:44.365 UTC checkpointer[1881151] LOG: checkpoint complete: wrote 41 buffers (0.3%); 0 WAL file(s) added, 0 removed, 26 recycled; write=0.001 s, sync=0.001 s, total=0.046 s; sync files=0, longest=0.000 s, average=0.000 s; distance=439134 kB, estimate=527137 kB; lsn=0/3CE131F0, redo lsn=0/3CE13198 TRAP: failed Assert("memcmp(block1_ptr, block1_copy, block1_len) == 0"), File: "xloginsert.c", Line: 949, PID: 1881271 ExceptionalCondition at assert.c:52:13 XLogRecordAssemble at xloginsert.c:953:1 XLogInsert at xloginsert.c:520:9 hashbucketcleanup at hash.c:844:14 hashbulkdelete at hash.c:558:3 index_bulk_delete at indexam.c:760:1 vac_bulkdel_one_index at vacuum.c:2498:10 lazy_vacuum_one_index at vacuumlazy.c:2443:10 lazy_vacuum_all_indexes at vacuumlazy.c:2026:26 lazy_vacuum at vacuumlazy.c:1944:10 lazy_scan_heap at vacuumlazy.c:1050:3 heap_vacuum_rel at vacuumlazy.c:503:2 vacuum_rel at vacuum.c:2214:2 vacuum at vacuum.c:622:8 autovacuum_do_vac_analyze at autovacuum.c:3102:2 do_autovacuum at autovacuum.c:2425:23 AutoVacWorkerMain at autovacuum.c:1569:3 pgarch_die at pgarch.c:846:1 StartChildProcess at postmaster.c:3929:5 StartAutovacuumWorker at postmaster.c:3997:12 process_pm_pmsignal at postmaster.c:3809:3 ServerLoop at postmaster.c:1667:5 BackgroundWorkerInitializeConnection at postmaster.c:4156:1 main at main.c:184:3 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fc71a8d7e40] postgres: autovacuum worker regression(_start+0x25)[0x556a8631a5e5] 2024-05-10 12:46:45.038 UTC checkpointer[1881151] LOG: checkpoint starting: immediate force wait 2024-05-10 12:46:45.965 UTC autovacuum worker[1881275] LOG: automatic analyze of table "regression.pg_catalog.pg_attribute" avg read rate: 0.000 MB/s, avg write rate: 5.409 MB/s buffer usage: 1094 hits, 0 misses, 27 dirtied system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.03 s 2024-05-10 12:46:46.892 UTC postmaster[1881150] LOG: server process (PID 1881271) was terminated by signal 6: Aborted 2024-05-10 12:46:46.892 UTC postmaster[1881150] DETAIL: Failed process was running: autovacuum: VACUUM ANALYZE public.hash_cleanup_heap (This can be reproduced with 027_stream_regress, of course, but it would take more time.) Best regards, Alexander
Attachment
Re: WAL record CRC calculated incorrectly because of underlying buffer modification
From
Andres Freund
Date:
Hi, On 2024-05-10 16:00:01 +0300, Alexander Lakhin wrote: > and discovered that XLogRecordAssemble() calculates CRC over a buffer, > that might be modified by another process. If, with "might", you mean that it's legitimate for that buffer to be modified, I don't think so. The bug is that something is modifying the buffer despite it being exclusively locked. I.e. what we likely have here is a bug somewhere in the hash index code. Greetings, Andres Freund
Re: WAL record CRC calculated incorrectly because of underlying buffer modification
From
Thomas Munro
Date:
On Sat, May 11, 2024 at 3:57 AM Andres Freund <andres@anarazel.de> wrote: > On 2024-05-10 16:00:01 +0300, Alexander Lakhin wrote: > > and discovered that XLogRecordAssemble() calculates CRC over a buffer, > > that might be modified by another process. > > If, with "might", you mean that it's legitimate for that buffer to be > modified, I don't think so. The bug is that something is modifying the buffer > despite it being exclusively locked. > > I.e. what we likely have here is a bug somewhere in the hash index code. I don't have a good grip on the higher level locking protocols of hash.c, but one microscopic thing jumps out: /* * bucket buffer was not changed, but still needs to be * registered to ensure that we can acquire a cleanup lock on * it during replay. */ if (!xlrec.is_primary_bucket_page) { uint8 flags = REGBUF_STANDARD | REGBUF_NO_IMAGE | REGBUF_NO_CHANGE; XLogRegisterBuffer(0, bucket_buf, flags); } That registers a buffer that is pinned but not content-locked, and we tell xloginsert.c not to copy its image into the WAL, but it does it anyway because: /* * If needs_backup is true or WAL checking is enabled for current * resource manager, log a full-page write for the current block. */ include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0; So I guess it copies the image on dodo, which has: 'PG_TEST_EXTRA' => 'ssl ldap kerberos wal_consistency_checking libpq_encryption xid_wraparound' Perhaps a no-image, no-change registered buffer should not be including an image, even for XLR_CHECK_CONSISTENCY? It's actually useless for consistency checking too I guess, this issue aside, because it doesn't change anything so there is nothing to check.
Re: WAL record CRC calculated incorrectly because of underlying buffer modification
From
Alexander Lakhin
Date:
Hello Thomas and Andres, Thank you for looking at this! 11.05.2024 06:26, Thomas Munro wrote: > On Sat, May 11, 2024 at 3:57 AM Andres Freund <andres@anarazel.de> wrote: >> On 2024-05-10 16:00:01 +0300, Alexander Lakhin wrote: >>> and discovered that XLogRecordAssemble() calculates CRC over a buffer, >>> that might be modified by another process. >> If, with "might", you mean that it's legitimate for that buffer to be >> modified, I don't think so. The bug is that something is modifying the buffer >> despite it being exclusively locked. >> >> I.e. what we likely have here is a bug somewhere in the hash index code. > I don't have a good grip on the higher level locking protocols of > hash.c, but one microscopic thing jumps out: > > /* > * bucket buffer was not changed, but still needs to be > * registered to ensure that we can acquire a cleanup lock on > * it during replay. > */ > if (!xlrec.is_primary_bucket_page) > { > uint8 flags = REGBUF_STANDARD | > REGBUF_NO_IMAGE | REGBUF_NO_CHANGE; > > XLogRegisterBuffer(0, bucket_buf, flags); > } > > That registers a buffer that is pinned but not content-locked, and we > tell xloginsert.c not to copy its image into the WAL, but it does it > anyway because: > > /* > * If needs_backup is true or WAL checking is enabled for current > * resource manager, log a full-page write for the current block. > */ > include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0; > > So I guess it copies the image on dodo, which has: > > 'PG_TEST_EXTRA' => 'ssl ldap > kerberos wal_consistency_checking libpq_encryption xid_wraparound' > > Perhaps a no-image, no-change registered buffer should not be > including an image, even for XLR_CHECK_CONSISTENCY? It's actually > useless for consistency checking too I guess, this issue aside, > because it doesn't change anything so there is nothing to check. Yes, I think something wrong is here. I've reduced the reproducer to: cat << 'EOF' | psql CREATE TABLE hash_cleanup_heap(keycol INT); CREATE INDEX hash_cleanup_index on hash_cleanup_heap USING HASH (keycol); BEGIN; INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i; ROLLBACK; EOF cat << 'EOF' | psql & INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i; DROP TABLE hash_cleanup_heap; EOF cat << 'EOF' | psql & SELECT pg_sleep(random() / 20); VACUUM hash_cleanup_heap; EOF wait grep 'TRAP:' server.log ("wal_consistency_checking = all" and the xloginsert patch are still required) and with additional logging I see: 2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|LOG: !!!hashbucketcleanup| scan page buf: 1832 2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|CONTEXT: while vacuuming index "hash_cleanup_index" of relation "public.hash_cleanup_heap" 2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|STATEMENT: VACUUM hash_cleanup_heap; 2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96de|LOG: !!!_hash_doinsert| _hash_getbucketbuf_from_hashkey: 1822 2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96de|STATEMENT: INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i; 2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|LOG: !!!hashbucketcleanup| xlrec.is_primary_bucket_page: 0, buf: 1832, bucket_buf: 1822 2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|CONTEXT: while vacuuming index "hash_cleanup_index" of relation "public.hash_cleanup_heap" 2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|STATEMENT: VACUUM hash_cleanup_heap; 2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96de|LOG: !!!_hash_doinsert| _hash_relbuf(rel, 1822) 2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96de|STATEMENT: INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i; TRAP: failed Assert("memcmp(block1_ptr, block1_copy, block1_len) == 0"), File: "xloginsert.c", Line: 949, PID: 2070237 Best regards, Alexander
Re: WAL record CRC calculated incorrectly because of underlying buffer modification
From
Thomas Munro
Date:
On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: > 11.05.2024 06:26, Thomas Munro wrote: > > Perhaps a no-image, no-change registered buffer should not be > > including an image, even for XLR_CHECK_CONSISTENCY? It's actually > > useless for consistency checking too I guess, this issue aside, > > because it doesn't change anything so there is nothing to check. > > Yes, I think something wrong is here. I've reduced the reproducer to: Does it reproduce if you do this? - include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0; + include_image = needs_backup || + ((info & XLR_CHECK_CONSISTENCY) != 0 && + (regbuf->flags & REGBUF_NO_CHANGE) == 0); Unfortunately the back branches don't have that new flag from 00d7fb5e so, even if this is the right direction (not sure, I don't understand this clean registered buffer trick) then ... but wait, why are there are no failures like this in the back branches (yet at least)? Does your reproducer work for 16? I wonder if something relevant changed recently, like f56a9def. CC'ing Michael and Amit K for info.
Re: WAL record CRC calculated incorrectly because of underlying buffer modification
From
Alexander Lakhin
Date:
11.05.2024 07:25, Thomas Munro wrote: > On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: >> 11.05.2024 06:26, Thomas Munro wrote: >>> Perhaps a no-image, no-change registered buffer should not be >>> including an image, even for XLR_CHECK_CONSISTENCY? It's actually >>> useless for consistency checking too I guess, this issue aside, >>> because it doesn't change anything so there is nothing to check. >> Yes, I think something wrong is here. I've reduced the reproducer to: > Does it reproduce if you do this? > > - include_image = needs_backup || (info & > XLR_CHECK_CONSISTENCY) != 0; > + include_image = needs_backup || > + ((info & XLR_CHECK_CONSISTENCY) != 0 && > + (regbuf->flags & REGBUF_NO_CHANGE) == 0); No, it doesn't (at least with the latter, more targeted reproducer). > Unfortunately the back branches don't have that new flag from 00d7fb5e > so, even if this is the right direction (not sure, I don't understand > this clean registered buffer trick) then ... but wait, why are there > are no failures like this in the back branches (yet at least)? Does > your reproducer work for 16? I wonder if something relevant changed > recently, like f56a9def. CC'ing Michael and Amit K for info. Maybe it's hard to hit (autovacuum needs to process the index page in a narrow time frame), but locally I could reproduce the issue even on ac27c74de(~1 too) from 2018-09-06 (I tried several last commits touching hash indexes, didn't dig deeper). Best regards, Alexander
Re: WAL record CRC calculated incorrectly because of underlying buffer modification
From
Thomas Munro
Date:
On Sat, May 11, 2024 at 5:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: > 11.05.2024 07:25, Thomas Munro wrote: > > On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: > >> 11.05.2024 06:26, Thomas Munro wrote: > >>> Perhaps a no-image, no-change registered buffer should not be > >>> including an image, even for XLR_CHECK_CONSISTENCY? It's actually > >>> useless for consistency checking too I guess, this issue aside, > >>> because it doesn't change anything so there is nothing to check. > >> Yes, I think something wrong is here. I've reduced the reproducer to: > > Does it reproduce if you do this? > > > > - include_image = needs_backup || (info & > > XLR_CHECK_CONSISTENCY) != 0; > > + include_image = needs_backup || > > + ((info & XLR_CHECK_CONSISTENCY) != 0 && > > + (regbuf->flags & REGBUF_NO_CHANGE) == 0); > > No, it doesn't (at least with the latter, more targeted reproducer). OK so that seems like a candidate fix, but ... > > Unfortunately the back branches don't have that new flag from 00d7fb5e > > so, even if this is the right direction (not sure, I don't understand > > this clean registered buffer trick) then ... but wait, why are there > > are no failures like this in the back branches (yet at least)? Does > > your reproducer work for 16? I wonder if something relevant changed > > recently, like f56a9def. CC'ing Michael and Amit K for info. > > Maybe it's hard to hit (autovacuum needs to process the index page in a > narrow time frame), but locally I could reproduce the issue even on > ac27c74de(~1 too) from 2018-09-06 (I tried several last commits touching > hash indexes, didn't dig deeper). ... we'd need to figure out how to fix this in the back-branches too. One idea would be to back-patch REGBUF_NO_CHANGE, and another might be to deduce that case from other variables. Let me CC a couple more people from this thread, which most recently hacked on this stuff, to see if they have insights: https://www.postgresql.org/message-id/flat/d2c31606e6bb9b83a02ed4835d65191b38d4ba12.camel%40j-davis.com
Re: WAL record CRC calculated incorrectly because of underlying buffer modification
From
Jeff Davis
Date:
On Mon, 2024-05-13 at 11:15 +1200, Thomas Munro wrote: > > > > > Perhaps a no-image, no-change registered buffer should not be > > > > > including an image, even for XLR_CHECK_CONSISTENCY? It's > > > > > actually > > > > > useless for consistency checking too I guess, this issue > > > > > aside, > > > > > because it doesn't change anything so there is nothing to > > > > > check. I'm not convinced by that reasoning. Can't it check that nothing has changed? > > > > > > Does it reproduce if you do this? > > > > > > - include_image = needs_backup || (info & > > > XLR_CHECK_CONSISTENCY) != 0; > > > + include_image = needs_backup || > > > + ((info & XLR_CHECK_CONSISTENCY) != 0 && > > > + (regbuf->flags & REGBUF_NO_CHANGE) == > > > 0); > > > > No, it doesn't (at least with the latter, more targeted > > reproducer). > > OK so that seems like a candidate fix, but ... ... > ... we'd need to figure out how to fix this in the back-branches too. > One idea would be to back-patch REGBUF_NO_CHANGE, and another might > be > to deduce that case from other variables. Let me CC a couple more > people from this thread, which most recently hacked on this stuff, to > see if they have insights: Starting from the beginning, XLogRecordAssemble() calculates the CRC of the record (technically just the non-header portions, but that's not important here), including any backup blocks. Later, CopyXLogRecordToWAL() copies that data into the actual xlog buffers. If the data changes between those two steps, the CRC will be bad. For most callers, the contents are exclusive-locked, so there's no problem. For checksums, the data is copied out of shared memory into a stack variable first, so no concurrent activity can change it. For hash indexes, it tries to protect itself by passing REGBUF_NO_IMAGE. There are two problems: 1. That implies another invariant that we aren't checking for: that REGBUF_NO_CHANGE must be accompanied by REGBUF_NO_IMAGE. That doesn't seem to be true for all callers, see XLogRegisterBuffer(1, wbuf, wbuf_flags) in _hash_freeovflpage(). 2. As you point out, REGBUF_NO_IMAGE still allows an image to be taken if XLR_CHECK_CONSISTENCY is set, so we need to figure out what to do there. Can we take a step back and think harder about what hash indexes are doing and if there's a better way? Maybe hash indexes need to take a copy of the page, like in XLogSaveBufferForHint()? I'd prefer that we find a way to get rid of REGBUF_NO_CHANGE and make all callers follow the rules than to find new special cases that depend on REGBUF_NO_CHANGE. See these messages here: https://www.postgresql.org/message-id/b1f2d0f230d60fa8df33bb0b2af3236fde9d750d.camel%40j-davis.com https://www.postgresql.org/message-id/CA%2BTgmoY%2BdagCyrMKau7UQeQU6w4LuVEu%2ByjsmJBoXKAo6XbUUA%40mail.gmail.com In other words, we added REGBUF_NO_CHANGE for the call sites (only hash indexes) that don't follow the rules and where it wasn't easy to make them follow the rules. Now that we know of a concrete problem with the design, there's more upside to fixing it properly. Regards, Jeff Davis
Re: WAL record CRC calculated incorrectly because of underlying buffer modification
From
Michael Paquier
Date:
On Thu, May 16, 2024 at 03:54:52PM -0700, Jeff Davis wrote: > On Mon, 2024-05-13 at 11:15 +1200, Thomas Munro wrote: >>>>>> Perhaps a no-image, no-change registered buffer should not be >>>>>> including an image, even for XLR_CHECK_CONSISTENCY? It's >>>>>> actually >>>>>> useless for consistency checking too I guess, this issue >>>>>> aside, >>>>>> because it doesn't change anything so there is nothing to >>>>>> check. > > I'm not convinced by that reasoning. Can't it check that nothing has > changed? That's something I've done four weeks ago in the hash replay code path, and having the image with XLR_CHECK_CONSISTENCY even if REGBUF_NO_CHANGE was necessary because replay was setting up a LSN on a REGBUF_NO_CHANGE page it should not have touched. > I'd prefer that we find a way to get rid of REGBUF_NO_CHANGE and make > all callers follow the rules than to find new special cases that depend > on REGBUF_NO_CHANGE. See these messages here: > > https://www.postgresql.org/message-id/b1f2d0f230d60fa8df33bb0b2af3236fde9d750d.camel%40j-davis.com > > https://www.postgresql.org/message-id/CA%2BTgmoY%2BdagCyrMKau7UQeQU6w4LuVEu%2ByjsmJBoXKAo6XbUUA%40mail.gmail.com > > In other words, we added REGBUF_NO_CHANGE for the call sites (only hash > indexes) that don't follow the rules and where it wasn't easy to make > them follow the rules. Now that we know of a concrete problem with the > design, there's more upside to fixing it properly. Yeah, agreed that getting rid of REGBUF_NO_CHANGE would be nice in the final picture. It still strikes me as a weird concept that WAL replay for hash indexes logs full pages just to be able to lock them at replay based on what's in the records. :/ -- Michael
Attachment
Re: WAL record CRC calculated incorrectly because of underlying buffer modification
From
Jeff Davis
Date:
On Fri, 2024-05-17 at 10:12 +0900, Michael Paquier wrote: > That's something I've done four weeks ago in the hash replay code > path, and having the image with XLR_CHECK_CONSISTENCY even if > REGBUF_NO_CHANGE was necessary because replay was setting up a LSN on > a REGBUF_NO_CHANGE page it should not have touched. Then the candidate fix to selectively break XLR_CHECK_CONSISTENCY is not acceptable. > > Yeah, agreed that getting rid of REGBUF_NO_CHANGE would be nice in > the > final picture. It still strikes me as a weird concept that WAL > replay > for hash indexes logs full pages just to be able to lock them at > replay based on what's in the records. :/ I'm still not entirely clear on why hash indexes can't just follow the rules and exclusive lock the buffer and dirty it. Presumably performance would suffer, but I asked that question previously and didn't get an answer: https://www.postgresql.org/message-id/CA%2BTgmoY%2BdagCyrMKau7UQeQU6w4LuVEu%2ByjsmJBoXKAo6XbUUA%40mail.gmail.com And if that does affect performance, what about following the same protocol as XLogSaveBufferForHint()? Regards, Jeff Davis
Re: WAL record CRC calculated incorrectly because of underlying buffer modification
From
Robert Haas
Date:
On Fri, May 17, 2024 at 10:56 AM Jeff Davis <pgsql@j-davis.com> wrote: > I'm still not entirely clear on why hash indexes can't just follow the > rules and exclusive lock the buffer and dirty it. Presumably > performance would suffer, but I asked that question previously and > didn't get an answer: > > https://www.postgresql.org/message-id/CA%2BTgmoY%2BdagCyrMKau7UQeQU6w4LuVEu%2ByjsmJBoXKAo6XbUUA%40mail.gmail.com In my defense, the last time I worked on hash indexes was 7 years ago. If this question had come up within a year or two of that work, I probably would have both (a) had a much clearer idea of what the answer was and (b) felt obliged to drop everything and go research it if I didn't. But at this point, I feel like it's fair for me to tell you what I know and leave it to you to do further research if you feel like that's warranted. I know that we're each responsible for what we commit, but I don't really think that should extend to having to prioritize answering a hypothetical question ("what would happen if X thing worked like Y instead of the way it does?") about an area I haven't touched in long enough for every release that doesn't contain those commits to be out of support. If you feel otherwise, let's have that argument, but I have a feeling that it may be more that you're hoping I have some kind of oracular powers which, in reality, I lack. -- Robert Haas EDB: http://www.enterprisedb.com