Hi,
On 2022-02-10 14:26:59 -0800, Andres Freund wrote:
> On 2022-02-11 09:10:38 +1300, Thomas Munro wrote:
> > It seems like I should go ahead and do that today, and we can study
> > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?
>
> Yes.
I wrote a test to show the problem. While looking at the code I unfortunately
found that CREATE DATABASE ... OID isn't the only problem. Two consecutive
ALTER DATABASE ... SET TABLESPACE also can cause corruption.
The test doesn't work on < 15 (due to PostgresNode renaming and use of
allow_in_place_tablespaces). But manually it's unfortunately quite
reproducible :(
Start a server with
shared_buffers = 1MB
autovacuum=off
bgwriter_lru_maxpages=1
bgwriter_delay=10000
these are just to give more time / to require smaller tables.
Start two psql sessions
s1: \c postgres
s1: CREATE DATABASE conflict_db;
s1: CREATE TABLESPACE test_tablespace LOCATION '/tmp/blarg';
s1: CREATE EXTENSION pg_prewarm;
s2: \c conflict_db
s2: CREATE TABLE large(id serial primary key, dataa text, datab text);
s2: INSERT INTO large(dataa, datab) SELECT g.i::text, 0 FROM generate_series(1, 4000) g(i);
s2: UPDATE large SET datab = 1;
-- prevent AtEOXact_SMgr
s1: BEGIN;
-- Fill shared_buffers with other contents. This needs to write out the prior dirty contents
-- leading to opening smgr rels / file descriptors
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;
s2: \c postgres
-- unlinks the files
s2: ALTER DATABASE conflict_db SET TABLESPACE test_tablespace;
-- now new files with the same relfilenode exist
s2: ALTER DATABASE conflict_db SET TABLESPACE pg_default;
s2: \c conflict_db
-- dirty buffers again
s2: UPDATE large SET datab = 2;
-- this again writes out the dirty buffers. But because nothing forced the smgr handles to be closed,
-- fds pointing to the *old* file contents are used. I.e. we'll write to the wrong file
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;
-- verify that everything is [not] ok
s2: SELECT datab, count(*) FROM large GROUP BY 1 ORDER BY 1 LIMIT 10;
┌───────┬───────┐
│ datab │ count │
├───────┼───────┤
│ 1 │ 2117 │
│ 2 │ 67 │
└───────┴───────┘
(2 rows)
oops.
I don't yet have a good idea how to tackle this in the backbranches.
I've started to work on a few debugging aids to find problem like
these. Attached are two WIP patches:
1) use fstat(...).st_nlink == 0 to detect dangerous actions on fds with
deleted files. That seems to work (almost) everywhere. Optionally, on
linux, use readlink(/proc/$pid/fd/$fd) to get the filename.
2) On linux, iterate over PGPROC to get a list of pids. Then iterate over
/proc/$pid/fd/ to check for deleted files. This only can be done after
we've just made sure there's no fds open to deleted files.
Greetings,
Andres Freund