
From: Thomas Munro
Subject: Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
Date:
Msg-id: CA+hUKGLbK6j-jxf=2odz2kuEEwcRxjJiko=4uMtXzktQ4KwzaA@mail.gmail.com
In response to: Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum" (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-bugs
> We haven't heard of broken control files from the field, so that doesn't
> seem to be a problem in practice, at least not yet. Still, I would sleep
> better if the control file had more redundancy. For example, have two
> copies of it on disk. At startup, read both copies, and if they're both
> valid, ignore the one with older timestamp. When updating it, write over
> the older copy. That way, if you crash in the middle of updating it, the
> old copy is still intact.
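
(For concreteness, I read the startup side of that scheme as roughly
the sketch below, with made-up names and a plain update counter
standing in for the timestamp; this isn't code from any patch, just an
illustration of the idea.)

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint64_t    update_counter; /* the newer copy has the higher value */
    uint64_t    payload;        /* stand-in for the real control data */
    uint32_t    crc;            /* checksum over the fields above */
} ControlCopy;

/* toy checksum stand-in; the real code would use pg_crc32c */
static uint32_t
toy_checksum(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t    h = 2166136261u;

    while (len-- > 0)
        h = (h ^ *p++) * 16777619u;
    return h;
}

static bool
copy_is_valid(const ControlCopy *c)
{
    return c->crc == toy_checksum(c, offsetof(ControlCopy, crc));
}

/*
 * At startup, pick the copy to trust: both valid -> the newer one wins;
 * one valid -> use it; neither -> give up (-1).  Updates then always
 * overwrite the slot that was NOT chosen here.
 */
static int
choose_control_copy(const ControlCopy copies[2])
{
    bool        ok0 = copy_is_valid(&copies[0]);
    bool        ok1 = copy_is_valid(&copies[1]);

    if (ok0 && ok1)
        return copies[0].update_counter >= copies[1].update_counter ? 0 : 1;
    if (ok0)
        return 0;
    if (ok1)
        return 1;
    return -1;
}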

Seems like a good idea.  I somehow doubt that accessing pmem through
old-school read()/write() interfaces is the future of databases, but
ideally this should work correctly, and the dependency on atomic
sector writes is indeed unnecessary if we are prepared to jump through
more hoops in just a couple of places.  There may also be other
benefits.  In hindsight, it's a bit strange that we don't have
explicit documentation of this requirement.  There is some related
(and rather dated) discussion of sectors in wal.sgml, but nothing that
says we need 512-byte atomic sectors for correct operation, unless
I've managed to miss it (even though it's well known among people who
read the source code).

I experimented with a slightly different approach, attached, and a TAP
test to exercise it.  Instead of alternating between two copies, I
tried writing out both copies every time with a synchronisation
barrier in between (the same double-write principle some other
database uses to deal with torn data pages).  I think it's mostly
equivalent to your scheme, though the updates are of course slower.  I
was thinking that there may be other benefits to having two copies of
the "current" version around, for resilience (though perhaps they
should go in separate files, which is not done here), and maybe it's
better to avoid having to invent a timestamp scheme.  Or maybe the two
ideas should be combined: when both CRC checks pass, you could still
be more careful about which copy you choose than I have been here.  Or
maybe trying to be resilient against handwavy unknown forms of
corruption is a waste of time.  I'm not proposing anything here; I was
just trying out ideas for discussion.
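
To make the write ordering concrete, the double-write approach
described above boils down to something like the sketch below (the
file name, layout and raw pwrite()/fsync() calls are illustrative
only, not what the attached patch actually does to pg_control):

#include <fcntl.h>
#include <unistd.h>

#define COPY_SIZE 512           /* one copy per (assumed-atomic) sector */

/*
 * Write the same image into both slots of the file, with a
 * synchronisation barrier in between: if a crash happens while slot 0
 * is being written, slot 1 still holds the previous version; if it
 * happens while slot 1 is being written, slot 0 already holds the new
 * version.  Either way at least one intact, CRC-checkable copy survives.
 */
static int
write_control_copies(int fd, const char *image)
{
    /* first copy at offset 0 */
    if (pwrite(fd, image, COPY_SIZE, 0) != COPY_SIZE)
        return -1;

    /* barrier: copy 1 must be durable before copy 2 is touched */
    if (fsync(fd) != 0)
        return -1;

    /* second copy in the next sector */
    if (pwrite(fd, image, COPY_SIZE, COPY_SIZE) != COPY_SIZE)
        return -1;

    return fsync(fd);
}

int
main(void)
{
    char        image[COPY_SIZE] = {0};     /* stand-in for a CRC-protected control record */
    int         fd = open("control_demo", O_RDWR | O_CREAT, 0600);

    if (fd < 0 || write_control_copies(fd, image) != 0)
        return 1;
    return close(fd);
}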

