Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum" - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
Msg-id YNKBazxEayjtyb1x@paquier.xyz
In response to Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On Tue, Jun 22, 2021 at 10:11:06AM -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> Your analysis seems right to me.  We have to worry about both things:
>> atomicity of writes on power failure (assumed to be sector-level,
>> hence our 512 byte struct -- all good), and atomicity of concurrent
>> reads and writes (we can't assume anything at all, so r/w locking is
>> the simplest way to get a consistent read).  Shouldn't relmap_redo()
>> also acquire the lock exclusively?

You mean anything that calls write_relmap_file(), right?

> Shouldn't we instead file a kernel bug report?  I seem to recall that
> POSIX guarantees atomicity of these things up to some operation size.
> Or is that just for pipe I/O?

Even if this is recognized as a kernel bug, it seems to me that we'd
better add the extra lock anyway, to protect instances that may run
into this issue in the future, no?  Just to be on the safe side.

> If we can't assume atomicity of relmapper file I/O, I wonder about
> pg_control as well.  But on the whole, what I'm smelling is a moderately
> recently introduced kernel bug.  We've been doing this this way for
> years and heard no previous reports.

True.  PG_CONTROL_MAX_SAFE_SIZE relies on that.  Now, the only processes
updating the control file are the startup process and the checkpointer,
so it is less prone to conflicts than the relation map file in the
problem reported here, and the code takes ControlFileLock where necessary.
--
Michael
