Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum" - Mailing list pgsql-bugs

From Alexander Lakhin
Subject Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
Msg-id 11523fe8-7614-9d57-1ad5-c12a4c4ec9cf@gmail.com
In response to Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-bugs

Hello,

22.06.2021 16:00, Thomas Munro wrote:
> On Tue, Jun 22, 2021 at 9:30 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Hmm, the simplest explanation would be that the read() or write() on the
>> relmapper file is not atomic. We assume that it is, and don't use a lock
>> in load_relmap_file() because of that. Is there anything unusual about
>> the filesystem, mount options or the kernel you're using? I could not
>> reproduce this on my laptop. Does the attached patch fix it for you?
> I have managed to reproduce this twice on a laptop running Linux
> 5.10.0-2-amd64, after trying many things for several hours.  Both
> times I was using ext4 in a loopback file (underlying is xfs, I had no
> luck there hence hunch that I should try ext4, may not be significant
> though) with fsync=off (ditto).
I'm sorry, I forgot that I had set "fsync=off" in my postgresql.conf (to
avoid an NVMe-specific slowdown on fsyncs).
It really does matter: with fsync=on, the demo script passes 20
iterations successfully.
I can reproduce the issue on Ubuntu 20.04 with kernel 5.9.15, ext4
(without any specific options) on NVMe storage, and a Ryzen 3700X.
It was first encountered on Debian 10 with kernel 4.19.0, ext4 on
software RAID, also built on NVMe storage, and a Xeon 5220.

The attached patch fixes it for me (with fsync=off). Three runs of 20
iterations each completed without the error (without the patch, I get the
error on the first iteration).
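
To illustrate the hypothesis quoted above (a reader observing a
half-written relmapper file), here is a minimal Python sketch. It is not
PostgreSQL code: the 512-byte size, the fill bytes, and the helper names
build_map and checksum_ok are assumptions made for the example, and
zlib.crc32 merely stands in for the checksum the real file uses. The torn
read is simulated deterministically instead of racing two processes.

```python
import struct
import zlib

# Assumed layout for this sketch only: payload followed by a 4-byte CRC.
MAP_SIZE = 512
PAYLOAD_SIZE = MAP_SIZE - 4


def build_map(fill: int) -> bytes:
    """Return a fake map file: constant payload plus its CRC (hypothetical helper)."""
    payload = bytes([fill]) * PAYLOAD_SIZE
    return payload + struct.pack("<I", zlib.crc32(payload))


def checksum_ok(buf: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the stored value."""
    payload = buf[:-4]
    (stored,) = struct.unpack("<I", buf[-4:])
    return zlib.crc32(payload) == stored


old = build_map(0x11)  # file contents before the update
new = build_map(0x22)  # file contents after the update

# Simulated torn read: the reader sees the first half of the new contents
# and the second half (including the CRC) of the old contents, as could
# happen if an unlocked read() overlapped with a write() in progress.
torn = new[: MAP_SIZE // 2] + old[MAP_SIZE // 2 :]

print(checksum_ok(old), checksum_ok(new), checksum_ok(torn))
```

Both complete versions verify, while the mixed buffer does not, which is
the kind of failure that would surface as an "incorrect checksum" error
if reads and writes of the file were not atomic with respect to each other.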

Best regards,
Alexander


