Thread: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"

The following bug has been logged on the website:

Bug reference:      17064
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 14beta1
Operating system:   Ubuntu 20.04
Description:

The following script:
===
for i in `seq 100`; do
createdb db$i
done

# Based on the contents of the regression test "vacuum"
echo "
CREATE TABLE pvactst (i INT);
INSERT INTO pvactst SELECT i FROM generate_series(1,10000) i;
DELETE FROM pvactst;
VACUUM pvactst;
DROP TABLE pvactst;

VACUUM FULL pg_database;
" >/tmp/vacuum.sql

for n in `seq 10`; do
  echo "iteration $n"
  for i in `seq 100`; do
    ( { for f in `seq 100`; do cat /tmp/vacuum.sql; done } | psql -d db$i ) >psql-$i.log 2>&1 &
  done
  wait
  grep -C5 FATAL psql*.log && break;
done
===
detects sporadic FATAL errors:
iteration 1
psql-56.log-DROP TABLE
psql-56.log-VACUUM
psql-56.log-CREATE TABLE
psql-56.log-INSERT 0 10000
psql-56.log-DELETE 10000
psql-56.log:FATAL:  relation mapping file "global/pg_filenode.map" contains incorrect checksum
psql-56.log-server closed the connection unexpectedly
psql-56.log-    This probably means the server terminated abnormally
psql-56.log-    before or while processing the request.
psql-56.log-connection to server was lost

(Discovered when running multiple installchecks against a single instance.
Apart from this failure, installcheck x 100 (with several tests excluded)
is rather stable.)

Reproduced on REL9_6_STABLE .. master.


On 18/06/2021 18:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      17064
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 14beta1
> Operating system:   Ubuntu 20.04
> Description:
> 
> The following script:
> ===
> for i in `seq 100`; do
> createdb db$i
> done
> 
> # Based on the contents of the regression test "vacuum"
> echo "
> CREATE TABLE pvactst (i INT);
> INSERT INTO pvactst SELECT i FROM generate_series(1,10000) i;
> DELETE FROM pvactst;
> VACUUM pvactst;
> DROP TABLE pvactst;
> 
> VACUUM FULL pg_database;
> " >/tmp/vacuum.sql
> 
> for n in `seq 10`; do
>    echo "iteration $n"
>    for i in `seq 100`; do
>      ( { for f in `seq 100`; do cat /tmp/vacuum.sql; done } | psql -d db$i ) >psql-$i.log 2>&1 &
>    done
>    wait
>    grep -C5 FATAL psql*.log && break;
> done
> ===
> detects sporadic FATAL errors:
> iteration 1
> psql-56.log-DROP TABLE
> psql-56.log-VACUUM
> psql-56.log-CREATE TABLE
> psql-56.log-INSERT 0 10000
> psql-56.log-DELETE 10000
> psql-56.log:FATAL:  relation mapping file "global/pg_filenode.map" contains incorrect checksum
> psql-56.log-server closed the connection unexpectedly
> psql-56.log-    This probably means the server terminated abnormally
> psql-56.log-    before or while processing the request.
> psql-56.log-connection to server was lost

Hmm, the simplest explanation would be that the read() or write() on the 
relmapper file is not atomic. We assume that it is, and don't use a lock 
in load_relmap_file() because of that. Is there anything unusual about 
the filesystem, mount options or the kernel you're using? I could not 
reproduce this on my laptop. Does the attached patch fix it for you?

If that's the cause, it is easy to fix by taking the RelationMappingLock 
in load_relmap_file(), like in the attached patch. But if the write is 
not atomic, you might have a bigger problem: we also rely on the 
atomicity when writing the pg_control file. If that becomes corrupt 
because of a partial write, the server won't start up. If it's just a 
race condition between the read/write, or only the read() is not atomic, 
maybe pg_control is OK, but I'd like to understand the issue better 
before just adding a lock to load_relmap_file().
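
For illustration, here is a minimal standalone sketch of the locking
pattern in question (not the attached patch): a pthread read/write lock
stands in for the RelationMappingLock LWLock, the reader takes it in
shared mode around the read, and the writer takes it in exclusive mode
around the rewrite, so a reader always sees a complete old or a complete
new image. File name, sizes and loop counts are made up for the example.
===
/* rwlock-sketch.c: shared/exclusive locking around a 512-byte "map file"
 * so a reader never observes a mix of old and new contents.
 * Build with: cc -pthread rwlock-sketch.c */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAPSIZE 512

static pthread_rwlock_t map_lock = PTHREAD_RWLOCK_INITIALIZER;
static int	map_fd;

static void *
reader(void *arg)
{
	char		buf[MAPSIZE];

	(void) arg;
	for (int i = 0; i < 100000; i++)
	{
		pthread_rwlock_rdlock(&map_lock);	/* shared, like LW_SHARED */
		if (pread(map_fd, buf, MAPSIZE, 0) == MAPSIZE &&
			memchr(buf, buf[0] == 'A' ? 'B' : 'A', MAPSIZE) != NULL)
			printf("torn read despite locking?!\n");
		pthread_rwlock_unlock(&map_lock);
	}
	return NULL;
}

int
main(void)
{
	pthread_t	tid;
	char		buf[MAPSIZE];

	map_fd = open("map-file", O_RDWR | O_CREAT, 0644);
	if (map_fd < 0)
		return 1;
	memset(buf, 'A', MAPSIZE);
	pwrite(map_fd, buf, MAPSIZE, 0);

	pthread_create(&tid, NULL, reader, NULL);
	for (int i = 0; i < 100000; i++)
	{
		memset(buf, (i % 2) ? 'B' : 'A', MAPSIZE);
		pthread_rwlock_wrlock(&map_lock);	/* exclusive, like LW_EXCLUSIVE */
		pwrite(map_fd, buf, MAPSIZE, 0);
		pthread_rwlock_unlock(&map_lock);
	}
	pthread_join(tid, NULL);
	return 0;
}
===
It only demonstrates the pattern; it doesn't reproduce the cross-process
kernel behaviour under discussion.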

- Heikki

Attachment
On Tue, Jun 22, 2021 at 9:30 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm, the simplest explanation would be that the read() or write() on the
> relmapper file is not atomic. We assume that it is, and don't use a lock
> in load_relmap_file() because of that. Is there anything unusual about
> the filesystem, mount options or the kernel you're using? I could not
> reproduce this on my laptop. Does the attached patch fix it for you?

I have managed to reproduce this twice on a laptop running Linux
5.10.0-2-amd64, after trying many things for several hours.  Both
times I was using ext4 in a loopback file (the underlying filesystem is
xfs, where I had no luck, hence the hunch that I should try ext4; that
may not be significant, though) with fsync=off (ditto).

> If that's the cause, it is easy to fix by taking the RelationMappingLock
> in load_relmap_file(), like in the attached patch. But if the write is
> not atomic, you might have a bigger problem: we also rely on the
> atomicity when writing the pg_control file. If that becomes corrupt
> because of a partial write, the server won't start up. If it's just a
> race condition between the read/write, or only the read() is not atomic,
> maybe pg_control is OK, but I'd like to understand the issue better
> before just adding a lock to load_relmap_file().

Your analysis seems right to me.  We have to worry about both things:
atomicity of writes on power failure (assumed to be sector-level,
hence our 512 byte struct -- all good), and atomicity of concurrent
reads and writes (we can't assume anything at all, so r/w locking is
the simplest way to get a consistent read).  Shouldn't relmap_redo()
also acquire the lock exclusively?



Thomas Munro <thomas.munro@gmail.com> writes:
> Your analysis seems right to me.  We have to worry about both things:
> atomicity of writes on power failure (assumed to be sector-level,
> hence our 512 byte struct -- all good), and atomicity of concurrent
> reads and writes (we can't assume anything at all, so r/w locking is
> the simplest way to get a consistent read).  Shouldn't relmap_redo()
> also acquire the lock exclusively?

Shouldn't we instead file a kernel bug report?  I seem to recall that
POSIX guarantees atomicity of these things up to some operation size.
Or is that just for pipe I/O?

If we can't assume atomicity of relmapper file I/O, I wonder about
pg_control as well.  But on the whole, what I'm smelling is a moderately
recently introduced kernel bug.  We've been doing things this way for
years and heard no previous reports.

            regards, tom lane



Hello,
22.06.2021 16:00, Thomas Munro wrote:
> On Tue, Jun 22, 2021 at 9:30 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Hmm, the simplest explanation would be that the read() or write() on the
>> relmapper file is not atomic. We assume that it is, and don't use a lock
>> in load_relmap_file() because of that. Is there anything unusual about
>> the filesystem, mount options or the kernel you're using? I could not
>> reproduce this on my laptop. Does the attached patch fix it for you?
> I have managed to reproduce this twice on a laptop running Linux
> 5.10.0-2-amd64, after trying many things for several hours.  Both
> times I was using ext4 in a loopback file (underlying is xfs, I had no
> luck there hence hunch that I should try ext4, may not be significant
> though) with fsync=off (ditto).
I'm sorry, I forgot that I had set "fsync=off" in my postgresql.conf (to
avoid NVMe-specific slowdown on fsyncs).
It really does matter. With fsync=on the demo script passes 20
iterations successfully.
I reproduce the issue on Ubuntu 20.04 with kernel 5.9.15, ext4
(without any specific options) on NVMe storage, and a Ryzen 3700X.
It was first encountered on Debian 10 with kernel 4.19.0, ext4 on
software RAID built on NVMe storage too, and a Xeon 5220.

The attached patch fixes it for me (with fsync=off). 3 runs by 20
iterations completed without the error (without the patch I get the
error on the first iteration).

Best regards,
Alexander



On Tue, Jun 22, 2021 at 10:11:06AM -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> Your analysis seems right to me.  We have to worry about both things:
>> atomicity of writes on power failure (assumed to be sector-level,
>> hence our 512 byte struct -- all good), and atomicity of concurrent
>> reads and writes (we can't assume anything at all, so r/w locking is
>> the simplest way to get a consistent read).  Shouldn't relmap_redo()
>> also acquire the lock exclusively?

You mean anything calling write_relmap_file(), right?

> Shouldn't we instead file a kernel bug report?  I seem to recall that
> POSIX guarantees atomicity of these things up to some operation size.
> Or is that just for pipe I/O?

Even if this gets recognized as a kernel bug, it seems to me that we'd
better cope with an extra lock for instances that may run into this
issue in the future anyway, no?  Just to be on the safe side.

> If we can't assume atomicity of relmapper file I/O, I wonder about
> pg_control as well.  But on the whole, what I'm smelling is a moderately
> recently introduced kernel bug.  We've been doing things this way for
> years and heard no previous reports.

True.  PG_CONTROL_MAX_SAFE_SIZE relies on that.  Now, the only things
updating the control file are the startup process and the checkpointer
so that's less prone to conflicts contrary to the reported problem
here, and the code takes a ControlFileLock where necessary.
--
Michael

Attachment
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Jun 22, 2021 at 10:11:06AM -0400, Tom Lane wrote:
>> Shouldn't we instead file a kernel bug report?  I seem to recall that
>> POSIX guarantees atomicity of these things up to some operation size.
>> Or is that just for pipe I/O?

> Even if this gets recognized as a kernel bug, it seems to me that we'd
> better cope with an extra lock for instances that may run into this
> issue in the future anyway, no?  Just to be on the safe side.

Based on the evidence at hand, we can't really say whether the writes
are non-atomic.  If that's the case then we have much worse problems
than just needing to add a lock.  Thus, I think we need to get some
kernel people involved.

            regards, tom lane



On Wed, Jun 23, 2021 at 2:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Your analysis seems right to me.  We have to worry about both things:
> > atomicity of writes on power failure (assumed to be sector-level,
> > hence our 512 byte struct -- all good), and atomicity of concurrent
> > reads and writes (we can't assume anything at all, so r/w locking is
> > the simplest way to get a consistent read).  Shouldn't relmap_redo()
> > also acquire the lock exclusively?
>
> Shouldn't we instead file a kernel bug report?  I seem to recall that
> POSIX guarantees atomicity of these things up to some operation size.
> Or is that just for pipe I/O?

The spec doesn't cover us according to some opinions, at least:

https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic

But at the same time, the behaviour seems quite surprising given the
parameters involved and how at least I thought this stuff worked in
practice (ie what the rules about the visibility of writes that
precede reads imply for the unspoken locking rule that must be the
obvious reasonable implementation, and the reality of the inode-level
read/write locking plainly visible in the source).  It's possible that
it's not working as designed in some weird edge case.  I guess the
next thing to do is write a minimal repro and find an expert to ask
about what it's supposed to do.



On Wed, Jun 23, 2021 at 12:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> repro

Here's a quick-and-dirty repro -- just touch test-file and then run
it.  It fails about 0.1% of its read loops on ext4, for me, but
outputs nothing on other file systems and OSes I tried (XFS, FBSD UFS,
APFS).  Seems a bit like there is a tendency for 64 byte (cache line?)
granularity in the mashed data it prints out.
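
For reference, here is a guess at the shape of such a reproducer (not the
attached program): a parent process keeps overwriting the same 512 bytes
with all-'1's and then all-'2's while a forked child pread()s them and
reports any buffer containing a mix of both values, which should never be
visible if the read and write are atomic with respect to each other.  The
file name and loop counts are made up.
===
/* torn-read.c: overwrite 512 bytes from one process while another reads
 * them, and report any read that returns a mix of old and new bytes.
 * Usage: ./torn-read <test-file> */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SZ 512

int
main(int argc, char **argv)
{
	int			fd;
	char		buf[SZ];
	pid_t		pid;

	if (argc != 2)
	{
		fprintf(stderr, "usage: %s <test-file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR | O_CREAT, 0644);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}
	memset(buf, '1', SZ);
	pwrite(fd, buf, SZ, 0);		/* start with a consistent image */

	pid = fork();
	if (pid == 0)
	{
		/* child: read the same 512 bytes and look for mixed contents */
		for (long i = 0; i < 1000000; i++)
		{
			if (pread(fd, buf, SZ, 0) != SZ)
				continue;
			if (memchr(buf, '1', SZ) && memchr(buf, '2', SZ))
				printf("torn read at iteration %ld: %.*s\n", i, SZ, buf);
		}
		_exit(0);
	}

	/* parent: keep rewriting the whole buffer until the child is done */
	for (long i = 0; waitpid(pid, NULL, WNOHANG) == 0; i++)
	{
		memset(buf, (i % 2) ? '1' : '2', SZ);
		pwrite(fd, buf, SZ, 0);
	}
	return 0;
}
===
On a filesystem that serializes concurrent reads and writes on the inode,
the child should print nothing.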

Attachment
23.06.2021 05:29, Thomas Munro wrote:
> On Wed, Jun 23, 2021 at 12:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>> repro
> Here's a quick-and-dirty repro -- just touch test-file and then run
> it.  It fails about 0.1% of its read loops on ext4, for me, but
> outputs nothing on other file systems and OSes I tried (XFS, FBSD UFS,
> APFS).  Seems a bit like there is a tendency for 64 byte (cache line?)
> granularity in the mashed data it prints out.
Yes, this repro fails for me too (on ext4).

11111111111111111111111111111111222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222

11111111111111111111111111111111111111111111111111111111111111111111111111111111222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222221111111111111111111111111111111122222222222222222222222222222222111111111111111111111111111111112222222222222222222222222222222211111111111111111111111111111111222222222222222222222222222222222222222222222222

11111111111111111111111111111111111111111111111111111111111111112222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222

But the granularity is 16 bytes here. (I've seen a fail count of 19763 out
of 1000000 loops.)

Best regards,
Alexander




On 23/06/2021 03:50, Thomas Munro wrote:
> On Wed, Jun 23, 2021 at 2:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thomas Munro <thomas.munro@gmail.com> writes:
>>> Your analysis seems right to me.  We have to worry about both things:
>>> atomicity of writes on power failure (assumed to be sector-level,
>>> hence our 512 byte struct -- all good), and atomicity of concurrent
>>> reads and writes (we can't assume anything at all, so r/w locking is
>>> the simplest way to get a consistent read).  Shouldn't relmap_redo()
>>> also acquire the lock exclusively?
>>
>> Shouldn't we instead file a kernel bug report?  I seem to recall that
>> POSIX guarantees atomicity of these things up to some operation size.
>> Or is that just for pipe I/O?
> 
> The spec doesn't cover us according to some opinions, at least:
> 
> https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic
> 
> But at the same time, the behaviour seems quite surprising given the
> parameters involved and how at least I thought this stuff worked in
> practice (ie what the rules about the visibility of writes that
> precede reads imply for the unspoken locking rule that must be the
> obvious reasonable implementation, and the reality of the inode-level
> read/write locking plainly visible in the source).  It's possible that
> it's not working as designed in some weird edge case.  I guess the
> next thing to do is write a minimal repro and find an expert to ask
> about what it's supposed to do.

That would be nice. At this point, though, I'm convinced that POSIX
doesn't give the guarantees we want, or even if it does, there are a lot
of systems out there that don't respect that. Do we rely on that anywhere
else than in load_relmap_file()? I don't think we do. Let's just add the
lock there.

Now, that leaves the question with pg_control. That's a different 
situation. It doesn't rely on read() and write() being atomic across 
processes, but on a 512 sector write not being torn on power failure. 
How strong is that guarantee? It used to be common wisdom with hard 
drives, and it was carried over to SSDs although I'm not sure if it was 
ever strictly speaking guaranteed. What about the new kid on the block: 
Persistent Memory? I found this article: 
https://lwn.net/Articles/686150/. So at hardware level, Persistent 
Memory only guarantees atomicity at cache line level (64 bytes). To 
provide the traditional 512 byte sector atomicity, there's a feature in 
Linux called BTT. Perhaps we should add a note to the docs that you 
should enable that.

We haven't heard of broken control files from the field, so that doesn't 
seem to be a problem in practice, at least not yet. Still, I would sleep 
better if the control file had more redundancy. For example, have two 
copies of it on disk. At startup, read both copies, and if they're both 
valid, ignore the one with older timestamp. When updating it, write over 
the older copy. That way, if you crash in the middle of updating it, the 
old copy is still intact.
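
To sketch the shape of that scheme (an illustration only: a sequence
number stands in for the timestamp, a toy checksum stands in for the real
CRC, and the two copies live as two fixed 512-byte slots in one file):
validate both slots at startup, take the valid one with the higher
sequence number, and overwrite the other slot on update, so a crash
mid-write can only damage the copy we were not relying on.
===
/* two-copy-control.c: keep two alternating copies of a small control
 * structure so a torn write can only hit the stale one. */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

typedef struct ControlCopy
{
	uint64_t	seqno;			/* which update wrote this copy */
	char		payload[496];	/* the actual control data */
	uint32_t	checksum;		/* covers seqno + payload */
} ControlCopy;

/* toy checksum, standing in for a real CRC */
static uint32_t
sum(const ControlCopy *c)
{
	const unsigned char *p = (const unsigned char *) c;
	uint32_t	s = 0;

	for (size_t i = 0; i < offsetof(ControlCopy, checksum); i++)
		s = s * 31 + p[i];
	return s;
}

/* Read both slots; return the slot of the newer valid copy, or -1. */
static int
read_control(int fd, ControlCopy *out)
{
	ControlCopy copy[2];
	int			best = -1;

	for (int i = 0; i < 2; i++)
	{
		if (pread(fd, &copy[i], sizeof(ControlCopy),
				  (off_t) i * sizeof(ControlCopy)) != sizeof(ControlCopy))
			continue;
		if (copy[i].checksum != sum(&copy[i]))
			continue;			/* torn or corrupt: ignore this copy */
		if (best < 0 || copy[i].seqno > copy[best].seqno)
			best = i;
	}
	if (best >= 0)
		*out = copy[best];
	return best;
}

/* Update by overwriting the slot that does NOT hold the current copy. */
static int
write_control(int fd, ControlCopy *cur, int cur_slot)
{
	int			slot = (cur_slot == 0) ? 1 : 0;

	cur->seqno++;
	cur->checksum = sum(cur);
	if (pwrite(fd, cur, sizeof(ControlCopy),
			   (off_t) slot * sizeof(ControlCopy)) != sizeof(ControlCopy))
		return -1;
	return fsync(fd) == 0 ? slot : -1;
}

int
main(void)
{
	int			fd = open("control-file", O_RDWR | O_CREAT, 0644);
	ControlCopy c;
	int			slot = read_control(fd, &c);

	if (slot < 0)				/* nothing valid yet: start from scratch */
	{
		memset(&c, 0, sizeof(c));
		slot = 1;				/* so the first update lands in slot 0 */
	}
	snprintf(c.payload, sizeof(c.payload), "checkpoint %llu",
			 (unsigned long long) (c.seqno + 1));
	slot = write_control(fd, &c, slot);
	printf("current copy now in slot %d\n", slot);
	return 0;
}
===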

- Heikki



On Wed, Jun 23, 2021 at 7:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Let's just add the lock there.

+1, no doubt about that.

> Now, that leaves the question with pg_control. That's a different
> situation. It doesn't rely on read() and write() being atomic across
> processes, but on a 512 sector write not being torn on power failure.
> How strong is that guarantee? It used to be common wisdom with hard
> drives, and it was carried over to SSDs although I'm not sure if it was
> ever strictly speaking guaranteed. ...

Right, it's always been tacit, no standard relevant to userspace
mentions any of this AFAIK.

> ... What about the new kid on the block:
> Persistent Memory? I found this article:
> https://lwn.net/Articles/686150/. So at hardware level, Persistent
> Memory only guarantees atomicity at cache line level (64 bytes). To
> provide the traditional 512 byte sector atomicity, there's a feature in
> Linux called BTT. Perhaps we should add a note to the docs that you
> should enable that.

Right, also called sector mode.  I don't know enough about that to
comment really, but... if my google-fu is serving me, you can't
actually use interesting sector sizes like 8KB (you have to choose 512
or 4096 bytes), so you'll have to pay for *two* synthetic atomic page
schemes: BTT and our full page writes.  That makes me wonder... if you
need to leave full page writes on anyway, maybe it would be a better
trade-off to do double writes of our special atomic files (relmapper
files and control file) so that we could safely turn BTT off and avoid
double-taxation for relation data.  Just a thought.  No pmem
experience here, I could be way off.

> We haven't heard of broken control files from the field, so that doesn't
> seem to be a problem in practice, at least not yet. Still, I would sleep
> better if the control file had more redundancy. For example, have two
> copies of it on disk. At startup, read both copies, and if they're both
> valid, ignore the one with older timestamp. When updating it, write over
> the older copy. That way, if you crash in the middle of updating it, the
> old copy is still intact.

+1, with a flush in between so that only one can be borked no matter
how the storage works.  It is interesting how few reports there are on
the mailing list of control file CRC check failures though, if I'm
searching for the right thing[1].

[1] https://www.postgresql.org/search/?m=1&q=calculated+CRC+checksum+does+not+match+value+stored+in+file&l=&d=-1&s=r



On 23/06/2021 12:45, Thomas Munro wrote:
> On Wed, Jun 23, 2021 at 7:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Let's just add the lock there.
> 
> +1, no doubt about that.

Committed that. Thanks for the report, Alexander!

>> ... What about the new kid on the block:
>> Persistent Memory? I found this article:
>> https://lwn.net/Articles/686150/. So at hardware level, Persistent
>> Memory only guarantees atomicity at cache line level (64 bytes). To
>> provide the traditional 512 byte sector atomicity, there's a feature in
>> Linux called BTT. Perhaps we should add a note to the docs that you
>> should enable that.
> 
> Right, also called sector mode.  I don't know enough about that to
> comment really, but... if my google-fu is serving me, you can't
> actually use interesting sector sizes like 8KB (you have to choose 512
> or 4096 bytes), so you'll have to pay for *two* synthetic atomic page
> schemes: BTT and our full page writes.  That makes me wonder... if you
> need to leave full page writes on anyway, maybe it would be a better
> trade-off to do double writes of our special atomic files (relmapper
> files and control file) so that we could safely turn BTT off and avoid
> double-taxation for relation data.  Just a thought.  No pmem
> experience here, I could be way off.

Yeah, you wouldn't want to turn on BTT for anything other than the
pg_control file. That's the only place where we rely on sector
atomicity, I believe. For everything else, it just adds overhead. Not
sure how much overhead; maybe it doesn't matter in practice.

>> We haven't heard of broken control files from the field, so that doesn't
>> seem to be a problem in practice, at least not yet. Still, I would sleep
>> better if the control file had more redundancy. For example, have two
>> copies of it on disk. At startup, read both copies, and if they're both
>> valid, ignore the one with older timestamp. When updating it, write over
>> the older copy. That way, if you crash in the middle of updating it, the
>> old copy is still intact.
> 
> +1, with a flush in between so that only one can be borked no matter
> how the storage works.  It is interesting how few reports there are on
> the mailing list of control file CRC check failures though, if I'm
> searching for the right thing[1].
> 
> [1] https://www.postgresql.org/search/?m=1&q=calculated+CRC+checksum+does+not+match+value+stored+in+file&l=&d=-1&s=r

If anyone wants to write a patch for that, I'd be happy to review it. And
if anyone has access to a system with pmem hardware, it would be
interesting to try to reproduce a torn sector and broken control file by
pulling the power plug.

- Heikki



On Thu, Jun 24, 2021 at 7:52 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 23/06/2021 12:45, Thomas Munro wrote:
> > On Wed, Jun 23, 2021 at 7:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >> Let's just add the lock there.
> >
> > +1, no doubt about that.
>
> Committed that. Thanks for the report, Alexander!

I think you missed relmap_redo (including a misleading comment).

> If anyone wants to write a patch for that, I'd be happy to review it. And
> if anyone has access to a system with pmem hardware, it would be
> interesting to try to reproduce a torn sector and broken control file by
> pulling the power plug.

I have been working on a kind of experimental file system for
simulating torn sectors (and other interesting file system phenomena)
as part of some work on recovery scenario testing; it's not quite ready
to share yet, but it can simulate that exact failure...



On 24/06/2021 11:06, Thomas Munro wrote:
> On Thu, Jun 24, 2021 at 7:52 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> On 23/06/2021 12:45, Thomas Munro wrote:
>>> On Wed, Jun 23, 2021 at 7:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>>> Let's just add the lock there.
>>>
>>> +1, no doubt about that.
>>
>> Committed that. Thanks for the report, Alexander!
> 
> I think you missed relmap_redo (including a misleading comment).

Fixed, thanks!

>> If anyone wants to write a patch for that, I'd be happy to review it. And
>> if anyone has access to a system with pmem hardware, it would be
>> interesting to try to reproduce a torn sector and broken control file by
>> pulling the power plug.
> 
> I have been working on a kind of experimental file system for
> simulating torn sectors (and other interesting file system phenomena)
> as part of some work on recovery scenario testing; it's not quite ready
> to share yet, but it can simulate that exact failure...

Cool! We know what happens if the pg_control file is torn, though. What
I'd like to know is whether it can happen in practice with pmem, and how
likely it is. Mostly out of curiosity; I think we have already established
that it can happen, and it would be nice to protect against it in any
case, even if it's rare.

- Heikki



> We haven't heard of broken control files from the field, so that doesn't
> seem to be a problem in practice, at least not yet. Still, I would sleep
> better if the control file had more redundancy. For example, have two
> copies of it on disk. At startup, read both copies, and if they're both
> valid, ignore the one with older timestamp. When updating it, write over
> the older copy. That way, if you crash in the middle of updating it, the
> old copy is still intact.

Seems like a good idea.  I somehow doubt that accessing pmem through
old school read()/write() interfaces is the future of databases, but
ideally this should work correctly, and the dependency is indeed
unnecessary if we are prepared to jump through more hoops in just a
couple of places.  There may also be other benefits.  In hindsight,
it's a bit strange that we don't have explicit documentation of this
requirement.  There is some related (and rather dated) discussion of
sectors in wal.sgml but nothing to say that we need 512 byte atomic
sectors for correct operation, unless I've managed to miss it (even
though it's well known among people who read the source code).

I experimented with a slightly different approach, attached, and a TAP
test to exercise it.  Instead of alternating between two copies, I
tried writing out both copies every time with a synchronisation
barrier in between (the same double-write principle some other
database uses to deal with torn data pages).  I think it's mostly
equivalent to your scheme, though the updates are of course slower.  I
was thinking that there may be other benefits to having two copies of
the "current" version around, for resilience (though perhaps they
should be in separate files, not done here), and maybe it's better to
avoid having to invent a timestamp scheme.  Or maybe the two ideas
should be combined: when both CRC checks pass, you could still be more
careful which one you choose than I have been here.  Or maybe trying
to be resilient against handwavy unknown forms of corruption is a
waste of time.  I'm not proposing anything here; I was just trying out
ideas, for discussion.
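
For what it's worth, the update path in that double-write variant has
roughly this shape (a sketch only, not the attached patch; the file name
and helper are invented): write copy A, force it to storage, then write
copy B, so that at any instant at most one of the two on-disk copies can
be torn.
===
/* double-write.c: write the same 512-byte image to two slots with a
 * sync barrier in between, so at most one copy can be torn at any time. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define COPYSIZE 512

static int
write_both_copies(int fd, const char *image)
{
	if (pwrite(fd, image, COPYSIZE, 0) != COPYSIZE)
		return -1;
	if (fsync(fd) != 0)			/* barrier: copy A is durable before B is touched */
		return -1;
	if (pwrite(fd, image, COPYSIZE, COPYSIZE) != COPYSIZE)
		return -1;
	return fsync(fd);
}

int
main(void)
{
	char		image[COPYSIZE];
	int			fd = open("control-file", O_RDWR | O_CREAT, 0644);

	if (fd < 0)
		return 1;
	memset(image, 0, COPYSIZE);
	strcpy(image, "control data v1");
	return write_both_copies(fd, image) != 0;
}
===
The reader side can then accept whichever copy passes its CRC check, along
the lines of the two-copy sketch upthread.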

Attachment