Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum" - Mailing list pgsql-bugs

From Heikki Linnakangas
Subject Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
Date
Msg-id ac119d1e-05d1-f050-b92a-0a524d68b848@iki.fi
In response to BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"  (PG Bug reporting form <noreply@postgresql.org>)
Responses Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-bugs
On 18/06/2021 18:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      17064
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 14beta1
> Operating system:   Ubuntu 20.04
> Description:
> 
> The following script:
> ===
> for i in `seq 100`; do
> createdb db$i
> done
> 
> # Based on the contents of the regression test "vacuum"
> echo "
> CREATE TABLE pvactst (i INT);
> INSERT INTO pvactst SELECT i FROM generate_series(1,10000) i;
> DELETE FROM pvactst;
> VACUUM pvactst;
> DROP TABLE pvactst;
> 
> VACUUM FULL pg_database;
> " >/tmp/vacuum.sql
> 
> for n in `seq 10`; do
>    echo "iteration $n"
>    for i in `seq 100`; do
>      ( { for f in `seq 100`; do cat /tmp/vacuum.sql; done } | psql -d db$i ) >> psql-$i.log 2>&1 &
>    done
>    wait
>    grep -C5 FATAL psql*.log && break;
> done
> ===
> detects sporadic FATAL errors:
> iteration 1
> psql-56.log-DROP TABLE
> psql-56.log-VACUUM
> psql-56.log-CREATE TABLE
> psql-56.log-INSERT 0 10000
> psql-56.log-DELETE 10000
> psql-56.log:FATAL:  relation mapping file "global/pg_filenode.map" contains incorrect checksum
> psql-56.log-server closed the connection unexpectedly
> psql-56.log-    This probably means the server terminated abnormally
> psql-56.log-    before or while processing the request.
> psql-56.log-connection to server was lost

Hmm, the simplest explanation would be that the read() or write() on the 
relmapper file is not atomic. We assume that it is, and don't use a lock 
in load_relmap_file() because of that. Is there anything unusual about 
the filesystem, mount options or the kernel you're using? I could not 
reproduce this on my laptop. Does the attached patch fix it for you?
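
To illustrate how a non-atomic read or write could surface as an "incorrect checksum" error, here is a minimal sketch. It is not PostgreSQL code: the 512-byte layout, the names build_image and checksum_ok, and the use of zlib.crc32 are assumptions made for the example (PostgreSQL has its own CRC routines and file format). A reader that sees the payload of a newer image combined with the checksum of an older one fails verification.

```python
import struct
import zlib

PAYLOAD_SIZE = 508  # hypothetical payload size; the 4-byte CRC makes 512 bytes


def build_image(version: int) -> bytes:
    """Build a fixed-size file image: a payload followed by its CRC."""
    # Only the first 4 bytes carry data here; the rest is zero padding.
    payload = struct.pack("<I", version).ljust(PAYLOAD_SIZE, b"\0")
    return payload + struct.pack("<I", zlib.crc32(payload))


def checksum_ok(image: bytes) -> bool:
    """Return True if the trailing CRC matches the payload."""
    payload = image[:-4]
    (stored,) = struct.unpack("<I", image[-4:])
    return zlib.crc32(payload) == stored


old_image = build_image(1)
new_image = build_image(2)

# A torn read: the payload comes from the new image, but the trailing
# checksum bytes still belong to the old one.
torn_image = new_image[:-4] + old_image[-4:]

print(checksum_ok(old_image), checksum_ok(new_image), checksum_ok(torn_image))
```

Both complete images verify, while the mixed one does not, which is the kind of failure a reader would report if it observed a write in progress.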

If that's the cause, it is easy to fix by taking the RelationMappingLock 
in load_relmap_file(), like in the attached patch. But if the write is 
not atomic, you might have a bigger problem: we also rely on the 
atomicity when writing the pg_control file. If that becomes corrupt 
because of a partial write, the server won't start up. If it's just a 
race condition between the read/write, or only the read() is not atomic, 
maybe pg_control is OK, but I'd like to understand the issue better 
before just adding a lock to load_relmap_file().
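
The locking idea above can be sketched as follows. This is only an illustration, not the attached patch: a threading.Lock stands in for RelationMappingLock, a bytearray stands in for the map file, and write_map and load_map are hypothetical names. The writer deliberately updates the "file" in two steps to imitate a non-atomic write; because the reader takes the same lock, it never sees the intermediate state.

```python
import threading
import zlib

# Stand-in for RelationMappingLock (an assumption for this sketch).
relation_mapping_lock = threading.Lock()


def make_image(version: int) -> bytes:
    """Return a 512-byte image: 508 payload bytes plus a 4-byte CRC."""
    payload = version.to_bytes(4, "little") * 127
    return payload + zlib.crc32(payload).to_bytes(4, "little")


# Stand-in for the on-disk map file.
map_file = bytearray(make_image(0))


def write_map(version: int) -> None:
    """Replace the image in two steps, holding the lock throughout."""
    image = make_image(version)
    with relation_mapping_lock:
        map_file[:256] = image[:256]   # first half of the write
        map_file[256:] = image[256:]   # second half of the write


def load_map() -> bool:
    """Read the image under the lock and verify its checksum."""
    with relation_mapping_lock:
        image = bytes(map_file)
    return zlib.crc32(image[:-4]) == int.from_bytes(image[-4:], "little")


def run(readers: int = 4, rounds: int = 200) -> int:
    """Run one writer and several readers; return the checksum failure count."""
    failures = 0
    count_lock = threading.Lock()

    def reader() -> None:
        nonlocal failures
        for _ in range(rounds):
            if not load_map():
                with count_lock:
                    failures += 1

    def writer() -> None:
        for version in range(1, rounds + 1):
            write_map(version)

    threads = [threading.Thread(target=reader) for _ in range(readers)]
    threads.append(threading.Thread(target=writer))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failures
```

Calling run() should report zero failures, since every read is serialized against the two-step write. Note that this only covers readers and writers that take the lock; it says nothing about a file such as pg_control being left half-written after a crash, which is the separate concern raised above.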

- Heikki

Attachment
