Thread: WAL record CRC calculated incorrectly because of underlying buffer modification

Hello hackers,

I've investigated a recent buildfarm failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dodo&dt=2024-05-02%2006%3A40%3A36

where the test failed due to a CRC error:
2024-05-02 17:08:18.401 ACST [3406:7] LOG:  incorrect resource manager data checksum in record at 0/F14D7A60

(Chipmunk produced similar errors as well:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-08-25%2019%3A40%3A11
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2024-03-22%2003%3A20%3A39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2023-08-19%2006%3A43%3A20
)

and discovered that XLogRecordAssemble() calculates CRC over a buffer,
that might be modified by another process.

With the attached patch applied, the following test run:
echo "
autovacuum_naptime = 1
autovacuum_vacuum_threshold = 1

wal_consistency_checking = all
" >/tmp/temp.config

for ((i=1;i<=100;i++)); do echo "iteration $i"; TEMP_CONFIG=/tmp/temp.config TESTS="test_setup hash_index" make 
check-tests -s || break; done

fails for me on iterations 7, 10, 17:
ok 1         - test_setup                               2557 ms
not ok 2     - hash_index                              24719 ms
# (test process exited with exit code 2)

postmaster.log contains:
2024-05-10 12:46:44.320 UTC checkpointer[1881151] LOG:  checkpoint starting: immediate force wait
2024-05-10 12:46:44.365 UTC checkpointer[1881151] LOG:  checkpoint complete: wrote 41 buffers (0.3%); 0 WAL file(s) 
added, 0 removed, 26 recycled; write=0.001 s, sync=0.001 s, total=0.046 s; sync files=0, longest=0.000 s, average=0.000

s; distance=439134 kB, estimate=527137 kB; lsn=0/3CE131F0, redo lsn=0/3CE13198
TRAP: failed Assert("memcmp(block1_ptr, block1_copy, block1_len) == 0"), File: "xloginsert.c", Line: 949, PID: 1881271
ExceptionalCondition at assert.c:52:13
XLogRecordAssemble at xloginsert.c:953:1
XLogInsert at xloginsert.c:520:9
hashbucketcleanup at hash.c:844:14
hashbulkdelete at hash.c:558:3
index_bulk_delete at indexam.c:760:1
vac_bulkdel_one_index at vacuum.c:2498:10
lazy_vacuum_one_index at vacuumlazy.c:2443:10
lazy_vacuum_all_indexes at vacuumlazy.c:2026:26
lazy_vacuum at vacuumlazy.c:1944:10
lazy_scan_heap at vacuumlazy.c:1050:3
heap_vacuum_rel at vacuumlazy.c:503:2
vacuum_rel at vacuum.c:2214:2
vacuum at vacuum.c:622:8
autovacuum_do_vac_analyze at autovacuum.c:3102:2
do_autovacuum at autovacuum.c:2425:23
AutoVacWorkerMain at autovacuum.c:1569:3
pgarch_die at pgarch.c:846:1
StartChildProcess at postmaster.c:3929:5
StartAutovacuumWorker at postmaster.c:3997:12
process_pm_pmsignal at postmaster.c:3809:3
ServerLoop at postmaster.c:1667:5
BackgroundWorkerInitializeConnection at postmaster.c:4156:1
main at main.c:184:3
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fc71a8d7e40]
postgres: autovacuum worker regression(_start+0x25)[0x556a8631a5e5]
2024-05-10 12:46:45.038 UTC checkpointer[1881151] LOG:  checkpoint starting: immediate force wait
2024-05-10 12:46:45.965 UTC autovacuum worker[1881275] LOG: automatic analyze of table
"regression.pg_catalog.pg_attribute"
         avg read rate: 0.000 MB/s, avg write rate: 5.409 MB/s
         buffer usage: 1094 hits, 0 misses, 27 dirtied
         system usage: CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.03 s
2024-05-10 12:46:46.892 UTC postmaster[1881150] LOG:  server process (PID 1881271) was terminated by signal 6: Aborted
2024-05-10 12:46:46.892 UTC postmaster[1881150] DETAIL:  Failed process was running: autovacuum: VACUUM ANALYZE 
public.hash_cleanup_heap

(This can be reproduced with 027_stream_regress, of course, but it would
take more time.)

Best regards,
Alexander
Attachment
Hi,

On 2024-05-10 16:00:01 +0300, Alexander Lakhin wrote:
> and discovered that XLogRecordAssemble() calculates CRC over a buffer,
> that might be modified by another process.

If, with "might", you mean that it's legitimate for that buffer to be
modified, I don't think so.  The bug is that something is modifying the buffer
despite it being exclusively locked.

I.e. what we likely have here is a bug somewhere in the hash index code.

Greetings,

Andres Freund



On Sat, May 11, 2024 at 3:57 AM Andres Freund <andres@anarazel.de> wrote:
> On 2024-05-10 16:00:01 +0300, Alexander Lakhin wrote:
> > and discovered that XLogRecordAssemble() calculates CRC over a buffer,
> > that might be modified by another process.
>
> If, with "might", you mean that it's legitimate for that buffer to be
> modified, I don't think so.  The bug is that something is modifying the buffer
> despite it being exclusively locked.
>
> I.e. what we likely have here is a bug somewhere in the hash index code.

I don't have a good grip on the higher level locking protocols of
hash.c, but one microscopic thing jumps out:

                /*
                 * bucket buffer was not changed, but still needs to be
                 * registered to ensure that we can acquire a cleanup lock on
                 * it during replay.
                 */
                if (!xlrec.is_primary_bucket_page)
                {
                    uint8        flags = REGBUF_STANDARD |
REGBUF_NO_IMAGE | REGBUF_NO_CHANGE;

                    XLogRegisterBuffer(0, bucket_buf, flags);
                }

That registers a buffer that is pinned but not content-locked, and we
tell xloginsert.c not to copy its image into the WAL, but it does it
anyway because:

        /*
         * If needs_backup is true or WAL checking is enabled for current
         * resource manager, log a full-page write for the current block.
         */
        include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;

So I guess it copies the image on dodo, which has:

                                    'PG_TEST_EXTRA' => 'ssl ldap
kerberos wal_consistency_checking libpq_encryption xid_wraparound'

Perhaps a no-image, no-change registered buffer should not be
including an image, even for XLR_CHECK_CONSISTENCY?  It's actually
useless for consistency checking too I guess, this issue aside,
because it doesn't change anything so there is nothing to check.



Hello Thomas and Andres,

Thank you for looking at this!

11.05.2024 06:26, Thomas Munro wrote:
> On Sat, May 11, 2024 at 3:57 AM Andres Freund <andres@anarazel.de> wrote:
>> On 2024-05-10 16:00:01 +0300, Alexander Lakhin wrote:
>>> and discovered that XLogRecordAssemble() calculates CRC over a buffer,
>>> that might be modified by another process.
>> If, with "might", you mean that it's legitimate for that buffer to be
>> modified, I don't think so.  The bug is that something is modifying the buffer
>> despite it being exclusively locked.
>>
>> I.e. what we likely have here is a bug somewhere in the hash index code.
> I don't have a good grip on the higher level locking protocols of
> hash.c, but one microscopic thing jumps out:
>
>                  /*
>                   * bucket buffer was not changed, but still needs to be
>                   * registered to ensure that we can acquire a cleanup lock on
>                   * it during replay.
>                   */
>                  if (!xlrec.is_primary_bucket_page)
>                  {
>                      uint8        flags = REGBUF_STANDARD |
> REGBUF_NO_IMAGE | REGBUF_NO_CHANGE;
>
>                      XLogRegisterBuffer(0, bucket_buf, flags);
>                  }
>
> That registers a buffer that is pinned but not content-locked, and we
> tell xloginsert.c not to copy its image into the WAL, but it does it
> anyway because:
>
>          /*
>           * If needs_backup is true or WAL checking is enabled for current
>           * resource manager, log a full-page write for the current block.
>           */
>          include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;
>
> So I guess it copies the image on dodo, which has:
>
>                                      'PG_TEST_EXTRA' => 'ssl ldap
> kerberos wal_consistency_checking libpq_encryption xid_wraparound'
>
> Perhaps a no-image, no-change registered buffer should not be
> including an image, even for XLR_CHECK_CONSISTENCY?  It's actually
> useless for consistency checking too I guess, this issue aside,
> because it doesn't change anything so there is nothing to check.

Yes, I think something wrong is here. I've reduced the reproducer to:
cat << 'EOF' | psql
CREATE TABLE hash_cleanup_heap(keycol INT);
CREATE INDEX hash_cleanup_index on hash_cleanup_heap USING HASH (keycol);

BEGIN;
INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;
ROLLBACK;
EOF

cat << 'EOF' | psql &
INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 500) as i;

DROP TABLE hash_cleanup_heap;
EOF

cat << 'EOF' | psql &
SELECT pg_sleep(random() / 20);
VACUUM hash_cleanup_heap;
EOF
wait
grep 'TRAP:' server.log

("wal_consistency_checking = all" and the xloginsert patch are still required)

and with additional logging I see:
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|LOG: !!!hashbucketcleanup| scan page buf: 1832
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|CONTEXT: while vacuuming index "hash_cleanup_index" of 
relation "public.hash_cleanup_heap"
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|STATEMENT:  VACUUM hash_cleanup_heap;
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96de|LOG: !!!_hash_doinsert| _hash_getbucketbuf_from_hashkey:
1822
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96de|STATEMENT:  INSERT INTO hash_cleanup_heap SELECT 1 FROM 
generate_series(1, 500) as i;
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|LOG: !!!hashbucketcleanup| xlrec.is_primary_bucket_page: 0,

buf: 1832, bucket_buf: 1822
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|CONTEXT: while vacuuming index "hash_cleanup_index" of 
relation "public.hash_cleanup_heap"
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96dd|STATEMENT:  VACUUM hash_cleanup_heap;
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96de|LOG: !!!_hash_doinsert| _hash_relbuf(rel, 1822)
2024-05-11 03:45:23.424 UTC|law|regression|663ee9d3.1f96de|STATEMENT:  INSERT INTO hash_cleanup_heap SELECT 1 FROM 
generate_series(1, 500) as i;
TRAP: failed Assert("memcmp(block1_ptr, block1_copy, block1_len) == 0"), File: "xloginsert.c", Line: 949, PID: 2070237

Best regards,
Alexander



On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> 11.05.2024 06:26, Thomas Munro wrote:
> > Perhaps a no-image, no-change registered buffer should not be
> > including an image, even for XLR_CHECK_CONSISTENCY?  It's actually
> > useless for consistency checking too I guess, this issue aside,
> > because it doesn't change anything so there is nothing to check.
>
> Yes, I think something wrong is here. I've reduced the reproducer to:

Does it reproduce if you do this?

-               include_image = needs_backup || (info &
XLR_CHECK_CONSISTENCY) != 0;
+               include_image = needs_backup ||
+                       ((info & XLR_CHECK_CONSISTENCY) != 0 &&
+                        (regbuf->flags & REGBUF_NO_CHANGE) == 0);

Unfortunately the back branches don't have that new flag from 00d7fb5e
so, even if this is the right direction (not sure, I don't understand
this clean registered buffer trick) then ... but wait, why are there
are no failures like this in the back branches (yet at least)?  Does
your reproducer work for 16?  I wonder if something relevant changed
recently, like f56a9def.  CC'ing Michael and Amit K for info.



11.05.2024 07:25, Thomas Munro wrote:
> On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>> 11.05.2024 06:26, Thomas Munro wrote:
>>> Perhaps a no-image, no-change registered buffer should not be
>>> including an image, even for XLR_CHECK_CONSISTENCY?  It's actually
>>> useless for consistency checking too I guess, this issue aside,
>>> because it doesn't change anything so there is nothing to check.
>> Yes, I think something wrong is here. I've reduced the reproducer to:
> Does it reproduce if you do this?
>
> -               include_image = needs_backup || (info &
> XLR_CHECK_CONSISTENCY) != 0;
> +               include_image = needs_backup ||
> +                       ((info & XLR_CHECK_CONSISTENCY) != 0 &&
> +                        (regbuf->flags & REGBUF_NO_CHANGE) == 0);

No, it doesn't (at least with the latter, more targeted reproducer).

> Unfortunately the back branches don't have that new flag from 00d7fb5e
> so, even if this is the right direction (not sure, I don't understand
> this clean registered buffer trick) then ... but wait, why are there
> are no failures like this in the back branches (yet at least)?  Does
> your reproducer work for 16?  I wonder if something relevant changed
> recently, like f56a9def.  CC'ing Michael and Amit K for info.

Maybe it's hard to hit (autovacuum needs to process the index page in a
narrow time frame), but locally I could reproduce the issue even on
ac27c74de(~1 too) from 2018-09-06 (I tried several last commits touching
hash indexes, didn't dig deeper).

Best regards,
Alexander



On Sat, May 11, 2024 at 5:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> 11.05.2024 07:25, Thomas Munro wrote:
> > On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> >> 11.05.2024 06:26, Thomas Munro wrote:
> >>> Perhaps a no-image, no-change registered buffer should not be
> >>> including an image, even for XLR_CHECK_CONSISTENCY?  It's actually
> >>> useless for consistency checking too I guess, this issue aside,
> >>> because it doesn't change anything so there is nothing to check.

> >> Yes, I think something wrong is here. I've reduced the reproducer to:

> > Does it reproduce if you do this?
> >
> > -               include_image = needs_backup || (info &
> > XLR_CHECK_CONSISTENCY) != 0;
> > +               include_image = needs_backup ||
> > +                       ((info & XLR_CHECK_CONSISTENCY) != 0 &&
> > +                        (regbuf->flags & REGBUF_NO_CHANGE) == 0);
>
> No, it doesn't (at least with the latter, more targeted reproducer).

OK so that seems like a candidate fix, but ...

> > Unfortunately the back branches don't have that new flag from 00d7fb5e
> > so, even if this is the right direction (not sure, I don't understand
> > this clean registered buffer trick) then ... but wait, why are there
> > are no failures like this in the back branches (yet at least)?  Does
> > your reproducer work for 16?  I wonder if something relevant changed
> > recently, like f56a9def.  CC'ing Michael and Amit K for info.
>
> Maybe it's hard to hit (autovacuum needs to process the index page in a
> narrow time frame), but locally I could reproduce the issue even on
> ac27c74de(~1 too) from 2018-09-06 (I tried several last commits touching
> hash indexes, didn't dig deeper).

... we'd need to figure out how to fix this in the back-branches too.
One idea would be to back-patch REGBUF_NO_CHANGE, and another might be
to deduce that case from other variables.  Let me CC a couple more
people from this thread, which most recently hacked on this stuff, to
see if they have insights:

https://www.postgresql.org/message-id/flat/d2c31606e6bb9b83a02ed4835d65191b38d4ba12.camel%40j-davis.com



On Mon, 2024-05-13 at 11:15 +1200, Thomas Munro wrote:
> > > > > Perhaps a no-image, no-change registered buffer should not be
> > > > > including an image, even for XLR_CHECK_CONSISTENCY?  It's
> > > > > actually
> > > > > useless for consistency checking too I guess, this issue
> > > > > aside,
> > > > > because it doesn't change anything so there is nothing to
> > > > > check.

I'm not convinced by that reasoning. Can't it check that nothing has
changed?

> > >
> > > Does it reproduce if you do this?
> > >
> > > -               include_image = needs_backup || (info &
> > > XLR_CHECK_CONSISTENCY) != 0;
> > > +               include_image = needs_backup ||
> > > +                       ((info & XLR_CHECK_CONSISTENCY) != 0 &&
> > > +                        (regbuf->flags & REGBUF_NO_CHANGE) ==
> > > 0);
> >
> > No, it doesn't (at least with the latter, more targeted
> > reproducer).
>
> OK so that seems like a candidate fix, but ...

...

> ... we'd need to figure out how to fix this in the back-branches too.
> One idea would be to back-patch REGBUF_NO_CHANGE, and another might
> be
> to deduce that case from other variables.  Let me CC a couple more
> people from this thread, which most recently hacked on this stuff, to
> see if they have insights:


Starting from the beginning, XLogRecordAssemble() calculates the CRC of
the record (technically just the non-header portions, but that's not
important here), including any backup blocks. Later,
CopyXLogRecordToWAL() copies that data into the actual xlog buffers. If
the data changes between those two steps, the CRC will be bad.

For most callers, the contents are exclusive-locked, so there's no
problem. For checksums, the data is copied out of shared memory into a
stack variable first, so no concurrent activity can change it. For hash
indexes, it tries to protect itself by passing REGBUF_NO_IMAGE.

There are two problems:

1. That implies another invariant that we aren't checking for: that
REGBUF_NO_CHANGE must be accompanied by REGBUF_NO_IMAGE. That doesn't
seem to be true for all callers, see XLogRegisterBuffer(1, wbuf,
wbuf_flags) in _hash_freeovflpage().

2. As you point out, REGBUF_NO_IMAGE still allows an image to be taken
if XLR_CHECK_CONSISTENCY is set, so we need to figure out what to do
there.

Can we take a step back and think harder about what hash indexes are
doing and if there's a better way? Maybe hash indexes need to take a
copy of the page, like in XLogSaveBufferForHint()?

I'd prefer that we find a way to get rid of REGBUF_NO_CHANGE and make
all callers follow the rules than to find new special cases that depend
on REGBUF_NO_CHANGE. See these messages here:

https://www.postgresql.org/message-id/b1f2d0f230d60fa8df33bb0b2af3236fde9d750d.camel%40j-davis.com

https://www.postgresql.org/message-id/CA%2BTgmoY%2BdagCyrMKau7UQeQU6w4LuVEu%2ByjsmJBoXKAo6XbUUA%40mail.gmail.com

In other words, we added REGBUF_NO_CHANGE for the call sites (only hash
indexes) that don't follow the rules and where it wasn't easy to make
them follow the rules. Now that we know of a concrete problem with the
design, there's more upside to fixing it properly.

Regards,
    Jeff Davis




On Thu, May 16, 2024 at 03:54:52PM -0700, Jeff Davis wrote:
> On Mon, 2024-05-13 at 11:15 +1200, Thomas Munro wrote:
>>>>>> Perhaps a no-image, no-change registered buffer should not be
>>>>>> including an image, even for XLR_CHECK_CONSISTENCY?  It's
>>>>>> actually
>>>>>> useless for consistency checking too I guess, this issue
>>>>>> aside,
>>>>>> because it doesn't change anything so there is nothing to
>>>>>> check.
>
> I'm not convinced by that reasoning. Can't it check that nothing has
> changed?

That's something I've done four weeks ago in the hash replay code
path, and having the image with XLR_CHECK_CONSISTENCY even if
REGBUF_NO_CHANGE was necessary because replay was setting up a LSN on
a REGBUF_NO_CHANGE page it should not have touched.

> I'd prefer that we find a way to get rid of REGBUF_NO_CHANGE and make
> all callers follow the rules than to find new special cases that depend
> on REGBUF_NO_CHANGE. See these messages here:
>
> https://www.postgresql.org/message-id/b1f2d0f230d60fa8df33bb0b2af3236fde9d750d.camel%40j-davis.com
>
> https://www.postgresql.org/message-id/CA%2BTgmoY%2BdagCyrMKau7UQeQU6w4LuVEu%2ByjsmJBoXKAo6XbUUA%40mail.gmail.com
>
> In other words, we added REGBUF_NO_CHANGE for the call sites (only hash
> indexes) that don't follow the rules and where it wasn't easy to make
> them follow the rules. Now that we know of a concrete problem with the
> design, there's more upside to fixing it properly.

Yeah, agreed that getting rid of REGBUF_NO_CHANGE would be nice in the
final picture.  It still strikes me as a weird concept that WAL replay
for hash indexes logs full pages just to be able to lock them at
replay based on what's in the records.  :/
--
Michael

Attachment
On Fri, 2024-05-17 at 10:12 +0900, Michael Paquier wrote:
> That's something I've done four weeks ago in the hash replay code
> path, and having the image with XLR_CHECK_CONSISTENCY even if
> REGBUF_NO_CHANGE was necessary because replay was setting up a LSN on
> a REGBUF_NO_CHANGE page it should not have touched.

Then the candidate fix to selectively break XLR_CHECK_CONSISTENCY is
not acceptable.

>
> Yeah, agreed that getting rid of REGBUF_NO_CHANGE would be nice in
> the
> final picture.  It still strikes me as a weird concept that WAL
> replay
> for hash indexes logs full pages just to be able to lock them at
> replay based on what's in the records.  :/

I'm still not entirely clear on why hash indexes can't just follow the
rules and exclusive lock the buffer and dirty it. Presumably
performance would suffer, but I asked that question previously and
didn't get an answer:

https://www.postgresql.org/message-id/CA%2BTgmoY%2BdagCyrMKau7UQeQU6w4LuVEu%2ByjsmJBoXKAo6XbUUA%40mail.gmail.com

And if that does affect performance, what about following the same
protocol as XLogSaveBufferForHint()?

Regards,
    Jeff Davis




On Fri, May 17, 2024 at 10:56 AM Jeff Davis <pgsql@j-davis.com> wrote:
> I'm still not entirely clear on why hash indexes can't just follow the
> rules and exclusive lock the buffer and dirty it. Presumably
> performance would suffer, but I asked that question previously and
> didn't get an answer:
>
> https://www.postgresql.org/message-id/CA%2BTgmoY%2BdagCyrMKau7UQeQU6w4LuVEu%2ByjsmJBoXKAo6XbUUA%40mail.gmail.com

In my defense, the last time I worked on hash indexes was 7 years ago.
If this question had come up within a year or two of that work, I
probably would have both (a) had a much clearer idea of what the
answer was and (b) felt obliged to drop everything and go research it
if I didn't. But at this point, I feel like it's fair for me to tell
you what I know and leave it to you to do further research if you feel
like that's warranted. I know that we're each responsible for what we
commit, but I don't really think that should extend to having to
prioritize answering a hypothetical question ("what would happen if X
thing worked like Y instead of the way it does?") about an area I
haven't touched in long enough for every release that doesn't contain
those commits to be out of support. If you feel otherwise, let's have
that argument, but I have a feeling that it may be more that you're
hoping I have some kind of oracular powers which, in reality, I lack.

--
Robert Haas
EDB: http://www.enterprisedb.com