wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To: - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To: |
Date | |
Msg-id | 20220209220004.kb3dgtn2x2k2gtdm@alap3.anarazel.de Whole thread Raw |
Responses |
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To: |
List | pgsql-hackers |
Hi, I was working on rebasing the AIO branch. Tests started to fail after, but it turns out that the problem exists independent of AIO. The problem starts with commit aa01051418f10afbdfa781b8dc109615ca785ff9 Author: Robert Haas <rhaas@postgresql.org> Date: 2022-01-24 14:23:15 -0500 pg_upgrade: Preserve database OIDs. Commit 9a974cbcba005256a19991203583a94b4f9a21a9 arranged to preserve relfilenodes and tablespace OIDs. For similar reasons, also arrange to preserve database OIDs. after this commit the database contents for databases that exist before pg_upgrade runs, and the databases pg_upgrade (re-)creates can have the same oid. Consequently the RelFileNode of files in the "pre-existing" database and the re-created database will be the same. That's a problem: There is no reliable invalidation mechanism forcing all processes to clear SMgrRelationHash when a database is dropped / created. Which means that backends can end up using file descriptors for the old, dropped, database, when accessing contents in the new database. This is most likely to happen to bgwriter, but could be any backend (backends in other databases can end up opening files in another database to write out dirty victim buffers during buffer replacement). That's of course dangerous, because we can end up reading the wrong data (by reading from the old file), or loosing writes (by writing to the old file). Now, this isn't a problem that's entirely new - but previously it requried relfilenodes to be recycled due to wraparound or such. The timeframes for that are long enough that it's very likely that *something* triggers smgr invalidation processing. E.g. has if (FirstCallSinceLastCheckpoint()) { /* * After any checkpoint, close all smgr files. This is so we * won't hang onto smgr references to deleted files indefinitely. */ smgrcloseall(); } in its mainloop. Previously for the problem to occur there would have to be "oid wraparound relfilenode recycling" within a single checkpoint window - quite unlikely. But now a bgwriter cycle just needs to span a drop/create database, easier to hit. I think the most realistic way to address this is the mechanism is to use the mechanism from https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(), XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed. We perhaps could avoid all relfilenode recycling, issues on the fd level, by adding a barrier whenever relfilenode allocation wraps around. But I'm not sure how easy that is to detect. I think we might actually be able to remove the checkpoint from dropdb() by using barriers? I think we need to add some debugging code to detect "fd to deleted file of same name" style problems. Especially during regression tests they're quite hard to notice, because most of the contents read/written are identical across recycling. On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink == 0. b) might also work on a few other operating systems, but I have not verified that. a) has the advantage of providing a filename, which makes it a lot easier to understand the problem. It'd probably make sense to use b) on all the operating systems it works on, and then a) on linux for a more useful error message. The checks for this would need to be in FileRead/FileWrite/... and direct calls to pread etc. I think that's an acceptable cost. I have a patch for this, but it's currently based on the AIO tree. if others agree it'd be useful to have, I'll adapt it to work on master. Comments? Greetings, Andres Freund
pgsql-hackers by date: