Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum" - Mailing list pgsql-bugs

From	Heikki Linnakangas
Subject	Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
Date	June 22, 2021 09:30:38
Msg-id	ac119d1e-05d1-f050-b92a-0a524d68b848@iki.fi
In response to	BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum" (PG Bug reporting form <noreply@postgresql.org>)
Responses	Re: BUG #17064: Parallel VACUUM operations cause the error "global/pg_filenode.map contains incorrect checksum"
List	pgsql-bugs

On 18/06/2021 18:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      17064
> Logged by:          Alexander Lakhin
> Email address:      exclusion@gmail.com
> PostgreSQL version: 14beta1
> Operating system:   Ubuntu 20.04
> Description:
> 
> The following script:
> ===
> for i in `seq 100`; do
> createdb db$i
> done
> 
> # Based on the contents of the regression test "vacuum"
> echo "
> CREATE TABLE pvactst (i INT);
> INSERT INTO pvactst SELECT i FROM generate_series(1,10000) i;
> DELETE FROM pvactst;
> VACUUM pvactst;
> DROP TABLE pvactst;
> 
> VACUUM FULL pg_database;
> " >/tmp/vacuum.sql
> 
> for n in `seq 10`; do
>    echo "iteration $n"
>    for i in `seq 100`; do
>      ( { for f in `seq 100`; do cat /tmp/vacuum.sql; done } | psql -d db$i ) >> psql-$i.log 2>&1 &
>    done
>    wait
>    grep -C5 FATAL psql*.log && break;
> done
> ===
> detects sporadic FATAL errors:
> iteration 1
> psql-56.log-DROP TABLE
> psql-56.log-VACUUM
> psql-56.log-CREATE TABLE
> psql-56.log-INSERT 0 10000
> psql-56.log-DELETE 10000
> psql-56.log:FATAL:  relation mapping file "global/pg_filenode.map" contains incorrect checksum
> psql-56.log-server closed the connection unexpectedly
> psql-56.log-    This probably means the server terminated abnormally
> psql-56.log-    before or while processing the request.
> psql-56.log-connection to server was lost

Hmm, the simplest explanation would be that the read() or write() on the 
relmapper file is not atomic. We assume that it is, and don't use a lock 
in load_relmap_file() because of that. Is there anything unusual about 
the filesystem, mount options or the kernel you're using? I could not 
reproduce this on my laptop. Does the attached patch fix it for you?
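
As a minimal illustration of why a non-atomic read or write would surface as a checksum error, the following C sketch builds two self-consistent images of a toy map file and then simulates a reader that sees part of one image and the rest of the other. Everything here is hypothetical: ToyMapFile, the 64-byte payload, and the FNV-1a checksum are stand-ins chosen for brevity, not the relmapper's actual layout or CRC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the relation map file: payload plus checksum. */
#define TOY_PAYLOAD_LEN 64

typedef struct ToyMapFile
{
    uint8_t  payload[TOY_PAYLOAD_LEN];
    uint32_t checksum;              /* covers the payload only */
} ToyMapFile;

/* FNV-1a, used here only as a simple stand-in for the real CRC. */
static uint32_t
toy_checksum(const uint8_t *data, size_t len)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++)
    {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

/* Build a complete, self-consistent image whose payload is all 'fill'. */
static ToyMapFile
toy_make_image(uint8_t fill)
{
    ToyMapFile f;

    memset(&f, 0, sizeof(f));
    memset(f.payload, fill, sizeof(f.payload));
    f.checksum = toy_checksum(f.payload, sizeof(f.payload));
    return f;
}

/* Returns 1 if the stored checksum matches the payload, 0 otherwise. */
static int
toy_image_is_valid(const ToyMapFile *f)
{
    return toy_checksum(f->payload, sizeof(f->payload)) == f->checksum;
}

/*
 * Simulate a reader that sees the first 'split' bytes of the new image and
 * the remaining bytes of the old one, as could happen if a read and a write
 * overlapped without being atomic.
 */
static ToyMapFile
toy_torn_read(const ToyMapFile *old_img, const ToyMapFile *new_img, size_t split)
{
    ToyMapFile seen = *old_img;

    assert(split <= sizeof(ToyMapFile));
    memcpy(&seen, new_img, split);
    return seen;
}
```

With images built by toy_make_image(0x11) and toy_make_image(0x22), a split of 0 or sizeof(ToyMapFile) yields a valid image (entirely old or entirely new), while a split of 1 yields a mixed image for which toy_image_is_valid() returns 0.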

If that's the cause, it is easy to fix by taking the RelationMappingLock 
in load_relmap_file(), like in the attached patch. But if the write is 
not atomic, you might have a bigger problem: we also rely on the 
atomicity when writing the pg_control file. If that becomes corrupt 
because of a partial write, the server won't start up. If it's just a 
race condition between the read/write, or only the read() is not atomic, 
maybe pg_control is OK, but I'd like to understand the issue better 
before just adding a lock to load_relmap_file().
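
The locking approach can be sketched in isolation as follows. This is not the attached patch: flock() on the file stands in for RelationMappingLock (a lightweight lock inside the server), DEMO_MAP_LEN, demo_write_map() and demo_read_map() are hypothetical names, and the sketch assumes every writer takes the exclusive lock for the duration of its write. The point is only that a reader holding a shared lock cannot overlap a writer, whether or not the underlying read() and write() are atomic.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical fixed size of the whole file image. */
#define DEMO_MAP_LEN 512

/*
 * Writer: replace the whole image while holding an exclusive lock.
 * Returns 0 on success, -1 on any failure.
 */
static int
demo_write_map(const char *path, const uint8_t *image)
{
    int fd;
    int rc = -1;

    assert(path != NULL && image != NULL);
    fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) == 0)
    {
        if (write(fd, image, DEMO_MAP_LEN) == DEMO_MAP_LEN)
            rc = 0;
        flock(fd, LOCK_UN);
    }
    close(fd);
    return rc;
}

/*
 * Reader: load the whole image while holding a shared lock, so the read
 * cannot interleave with a writer that holds the exclusive lock.
 * Returns 0 on success, -1 on any failure (including a short read).
 */
static int
demo_read_map(const char *path, uint8_t *image)
{
    int fd;
    int rc = -1;

    assert(path != NULL && image != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_SH) == 0)
    {
        if (read(fd, image, DEMO_MAP_LEN) == DEMO_MAP_LEN)
            rc = 0;
        flock(fd, LOCK_UN);
    }
    close(fd);
    return rc;
}
```

A lock of this kind addresses only the race between a reader and a writer. It would not help with a partial write left on disk, which is the pg_control concern above.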

- Heikki

Attachment

lock-load_relmap_file-1.patch
