wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To: - Mailing list pgsql-hackers

From Andres Freund
Subject wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
Date
Msg-id 20220209220004.kb3dgtn2x2k2gtdm@alap3.anarazel.de
Whole thread Raw
Responses Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
List pgsql-hackers
Hi,

I was working on rebasing the AIO branch. Tests started to fail after, but it
turns out that the problem exists independent of AIO.

The problem starts with

commit aa01051418f10afbdfa781b8dc109615ca785ff9
Author: Robert Haas <rhaas@postgresql.org>
Date:   2022-01-24 14:23:15 -0500

    pg_upgrade: Preserve database OIDs.

    Commit 9a974cbcba005256a19991203583a94b4f9a21a9 arranged to preserve
    relfilenodes and tablespace OIDs. For similar reasons, also arrange
    to preserve database OIDs.

after this commit the database contents for databases that exist before
pg_upgrade runs, and the databases pg_upgrade (re-)creates can have the same
oid. Consequently the RelFileNode of files in the "pre-existing" database and
the re-created database will be the same.

That's a problem: There is no reliable invalidation mechanism forcing all
processes to clear SMgrRelationHash when a database is dropped /
created. Which means that backends can end up using file descriptors for the
old, dropped, database, when accessing contents in the new database.  This is
most likely to happen to bgwriter, but could be any backend (backends in other
databases can end up opening files in another database to write out dirty
victim buffers during buffer replacement).

That's of course dangerous, because we can end up reading the wrong data (by
reading from the old file), or loosing writes (by writing to the old file).


Now, this isn't a problem that's entirely new - but previously it requried
relfilenodes to be recycled due to wraparound or such. The timeframes for that
are long enough that it's very likely that *something* triggers smgr
invalidation processing. E.g. has

        if (FirstCallSinceLastCheckpoint())
        {
            /*
             * After any checkpoint, close all smgr files.  This is so we
             * won't hang onto smgr references to deleted files indefinitely.
             */
            smgrcloseall();
        }

in its mainloop. Previously for the problem to occur there would have to be
"oid wraparound relfilenode recycling" within a single checkpoint window -
quite unlikely. But now a bgwriter cycle just needs to span a drop/create
database, easier to hit.


I think the most realistic way to address this is the mechanism is to use the
mechanism from
https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com

If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.

We perhaps could avoid all relfilenode recycling, issues on the fd level, by
adding a barrier whenever relfilenode allocation wraps around. But I'm not
sure how easy that is to detect.


I think we might actually be able to remove the checkpoint from dropdb() by
using barriers?


I think we need to add some debugging code to detect "fd to deleted file of
same name" style problems. Especially during regression tests they're quite
hard to notice, because most of the contents read/written are identical across
recycling.

On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
== 0.

b) might also work on a few other operating systems, but I have not verified
that. a) has the advantage of providing a filename, which makes it a lot
easier to understand the problem.  It'd probably make sense to use b) on all
the operating systems it works on, and then a) on linux for a more useful
error message.

The checks for this would need to be in FileRead/FileWrite/... and direct
calls to pread etc. I think that's an acceptable cost.   I have a patch for
this, but it's currently based on the AIO tree. if others agree it'd be useful
to have, I'll adapt it to work on master.

Comments?

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Joshua Brindle
Date:
Subject: Re: [PATCH v2] use has_privs_for_role for predefined roles
Next
From: Aliaksandr Kalenik
Date:
Subject: Re: [PATCH] nodeindexscan with reorder memory leak