Thread: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Andres Freund
Date:
Hi,

I was working on rebasing the AIO branch. Tests started to fail afterwards, but it turns out that the problem exists independent of AIO. The problem starts with

commit aa01051418f10afbdfa781b8dc109615ca785ff9
Author: Robert Haas <rhaas@postgresql.org>
Date:   2022-01-24 14:23:15 -0500

    pg_upgrade: Preserve database OIDs.

    Commit 9a974cbcba005256a19991203583a94b4f9a21a9 arranged to preserve
    relfilenodes and tablespace OIDs. For similar reasons, also arrange to
    preserve database OIDs.

After this commit, the databases that exist before pg_upgrade runs and the databases pg_upgrade (re-)creates can have the same OID. Consequently the RelFileNode of files in the "pre-existing" database and the re-created database will be the same.

That's a problem: There is no reliable invalidation mechanism forcing all processes to clear SMgrRelationHash when a database is dropped / created. Which means that backends can end up using file descriptors for the old, dropped database when accessing contents in the new database. This is most likely to happen to bgwriter, but it could be any backend (backends in other databases can end up opening files in another database to write out dirty victim buffers during buffer replacement).

That's of course dangerous, because we can end up reading the wrong data (by reading from the old file), or losing writes (by writing to the old file).

Now, this isn't a problem that's entirely new - but previously it required relfilenodes to be recycled due to wraparound or such. The timeframes for that are long enough that it's very likely that *something* triggers smgr invalidation processing. E.g. bgwriter has

    if (FirstCallSinceLastCheckpoint())
    {
        /*
         * After any checkpoint, close all smgr files. This is so we
         * won't hang onto smgr references to deleted files indefinitely.
         */
        smgrcloseall();
    }

in its main loop. Previously, for the problem to occur, there would have to be "oid wraparound relfilenode recycling" within a single checkpoint window - quite unlikely. But now a bgwriter cycle just needs to span a drop/create database, which is much easier to hit.

I think the most realistic way to address this is to use the mechanism from
https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com

If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(), XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.

We perhaps could avoid all relfilenode recycling issues on the fd level by adding a barrier whenever relfilenode allocation wraps around. But I'm not sure how easy that is to detect.

I think we might actually be able to remove the checkpoint from dropdb() by using barriers?

I think we need to add some debugging code to detect "fd to deleted file of same name" style problems. Especially during regression tests they're quite hard to notice, because most of the contents read/written are identical across recycling.

On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink == 0. b) might also work on a few other operating systems, but I have not verified that. a) has the advantage of providing a filename, which makes it a lot easier to understand the problem. It'd probably make sense to use b) on all the operating systems it works on, and then a) on linux for a more useful error message.

The checks for this would need to be in FileRead/FileWrite/... and direct calls to pread etc. I think that's an acceptable cost.
I have a patch for this, but it's currently based on the AIO tree. If others agree it'd be useful to have, I'll adapt it to work on master.

Comments?

Greetings, Andres Freund
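For concreteness, here is a minimal, self-contained sketch of the kind of check described above. The WIP patch posted later in this thread uses the name AssertFileNotDeleted(); the body below is only illustrative (it uses fprintf rather than elog so it can stand alone) and is not the patch itself.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/stat.h>

    /*
     * Complain if an open fd refers to a file that has since been unlinked.
     * Portable part: fstat() reporting st_nlink == 0.  Linux-only part:
     * readlink("/proc/self/fd/N") to recover the old file name, which the
     * kernel reports with a " (deleted)" suffix.
     */
    static void
    AssertFileNotDeleted(int fd, const char *context)
    {
        struct stat st;

        if (fstat(fd, &st) != 0 || st.st_nlink > 0)
            return;             /* file still linked (or fstat failed) */

    #if defined(__linux__)
        {
            char        procpath[64];
            char        target[1024];
            ssize_t     len;

            snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
            len = readlink(procpath, target, sizeof(target) - 1);
            if (len > 0)
            {
                target[len] = '\0';
                fprintf(stderr, "%s: fd %d refers to deleted file \"%s\"\n",
                        context, fd, target);
                return;
            }
        }
    #endif
        fprintf(stderr, "%s: fd %d refers to a deleted file\n", context, fd);
    }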
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Justin Pryzby
Date:
On Wed, Feb 09, 2022 at 02:00:04PM -0800, Andres Freund wrote:
> On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
> a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
> == 0.

You could also stat() the file in /proc/self/fd/N and compare st_ino. It "looks" like a symlink (in which case that wouldn't work) but it's actually a Very Special File. You can even recover deleted, still-opened files that way..

PS. I didn't know pg_upgrade knew about Reply-To ;)
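A rough illustration of that comparison idea, assuming Linux-style /proc; fstat() on the fd resolves to the same inode as stat() on the /proc/self/fd/N magic link, and the helper name and the "expected path" argument are only for illustration:

    #include <stdbool.h>
    #include <sys/stat.h>

    /* Does the open fd still refer to the file currently at "path"? */
    static bool
    fd_matches_path(int fd, const char *path)
    {
        struct stat fd_st;
        struct stat path_st;

        if (fstat(fd, &fd_st) != 0)
            return false;
        if (stat(path, &path_st) != 0)
            return false;       /* path is gone or inaccessible */

        return fd_st.st_dev == path_st.st_dev &&
               fd_st.st_ino == path_st.st_ino;
    }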
Hi,

On 2022-02-09 16:42:30 -0600, Justin Pryzby wrote:
> On Wed, Feb 09, 2022 at 02:00:04PM -0800, Andres Freund wrote:
> > On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
> > a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
> > == 0.
>
> You could also stat() the file in /proc/self/fd/N and compare st_ino. It
> "looks" like a symlink (in which case that wouldn't work) but it's actually a
> Very Special File. You can even recover deleted, still-opened files that way..

Yea, the readlink() thing above relies on /proc/self/fd/$fd being a "Very Special File". In most places we'd not have convenient access to an inode / filename to compare it to. I don't think there's any additional information we could gain anyway, compared to looking at st_nlink == 0 and then doing a readlink() to get the filename?

> PS. I didn't know pg_upgrade knew about Reply-To ;)

Ugh, formatting fail...

Greetings, Andres Freund
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Robert Haas
Date:
On Wed, Feb 9, 2022 at 5:00 PM Andres Freund <andres@anarazel.de> wrote:
> The problem starts with
>
> commit aa01051418f10afbdfa781b8dc109615ca785ff9
> Author: Robert Haas <rhaas@postgresql.org>
> Date: 2022-01-24 14:23:15 -0500
>
> pg_upgrade: Preserve database OIDs.

Well, that's sad.

> I think the most realistic way to address this is to use the mechanism from
> https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com

I agree. While I feel sort of bad about missing this issue in review, I also feel like it's pretty surprising that there isn't something plugging this hole already. It feels unexpected that our FD management layer might hand you an FD that references the previous file that had the same name instead of the one that exists right now, and I don't think it's too much of a stretch of the imagination to suppose that this won't be the last problem we have if we don't fix it.

I also agree that the mechanism proposed in that thread is the best way to fix the problem. The sinval mechanism won't work, since there's nothing to ensure that the invalidation messages are processed in a sufficiently timely fashion, and there's no obvious way to get around that problem. But the ProcSignalBarrier mechanism is not intertwined with heavyweight locking in the way that sinval is, so it should be a good fit for this problem.

The main question in my mind is who is going to actually make that happen. It was your idea (I think), Thomas coded it, and my commit made it a live problem. So who's going to get something committed here?

> If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
> XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.

What about movedb()?

> We perhaps could avoid all relfilenode recycling issues on the fd level by
> adding a barrier whenever relfilenode allocation wraps around. But I'm not
> sure how easy that is to detect.

I definitely think it would be nice to cover this issue not only for database OIDs and tablespace OIDs but also for relfilenode OIDs. Otherwise we're right on the cusp of having the same bug in that case. pg_upgrade doesn't drop and recreate tables the way it does databases, but if somebody made it do that in the future, would they think about this? I bet not.

Dilip's been working on your idea of making relfilenode into a 56-bit quantity. That would make the problem of detecting wraparound go away. In the meantime, I suppose we could think of trying to do something here:

        /* wraparound, or first post-initdb assignment, in normal mode */
        ShmemVariableCache->nextOid = FirstNormalObjectId;
        ShmemVariableCache->oidCount = 0;

That's a bit sketch, because this is the OID counter, which isn't only used for relfilenodes. I'm not sure whether there would be problems waiting for a barrier here - I think we might be holding some lwlocks in some code paths.

> I think we might actually be able to remove the checkpoint from dropdb() by
> using barriers?

That looks like it might work. We'd need to be sure that the checkpointer would both close any open FDs and also absorb fsync requests before acknowledging the barrier.

> I think we need to add some debugging code to detect "fd to deleted file of
> same name" style problems. Especially during regression tests they're quite
> hard to notice, because most of the contents read/written are identical across
> recycling.

Maybe. If we just plug the holes here, I am not sure we need permanent instrumentation.
But I'm also not violently opposed to it. -- Robert Haas EDB: http://www.enterprisedb.com
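To make the wraparound idea concrete, here is a rough sketch of emitting and waiting for a barrier at the point where GetNewObjectId() wraps the counter. It assumes the PROCSIGNAL_BARRIER_SMGRRELEASE barrier type from the patch linked earlier, and it deliberately glosses over the lwlock question raised above (OidGenLock is held at this point, so a real patch would likely need to restructure this):

    /* In GetNewObjectId(), with OidGenLock held exclusively: */
    if (ShmemVariableCache->nextOid < ((Oid) FirstNormalObjectId))
    {
        uint64      barrier_gen;

        /* wraparound, or first post-initdb assignment, in normal mode */
        ShmemVariableCache->nextOid = FirstNormalObjectId;
        ShmemVariableCache->oidCount = 0;

        /*
         * Force every backend to close smgr-level file descriptors before
         * any recycled relfilenode can be handed out again.  Waiting while
         * holding OidGenLock is exactly the concern discussed above.
         */
        barrier_gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE);
        WaitForProcSignalBarrier(barrier_gen);
    }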
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Thomas Munro
Date:
On Fri, Feb 11, 2022 at 7:50 AM Robert Haas <robertmhaas@gmail.com> wrote: > The main question in my mind is who is going to actually make that > happen. It was your idea (I think), Thomas coded it, and my commit > made it a live problem. So who's going to get something committed > here? I was about to commit that, because the original Windows problem it solved is showing up occasionally in CI failures (that is, it already solves a live problem, albeit a different and non-data-corrupting one): https://www.postgresql.org/message-id/CA%2BhUKGJp-m8uAD_wS7%2BdkTgif013SNBSoJujWxvRUzZ1nkoUyA%40mail.gmail.com It seems like I should go ahead and do that today, and we can study further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Robert Haas
Date:
On Thu, Feb 10, 2022 at 3:11 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Feb 11, 2022 at 7:50 AM Robert Haas <robertmhaas@gmail.com> wrote: > > The main question in my mind is who is going to actually make that > > happen. It was your idea (I think), Thomas coded it, and my commit > > made it a live problem. So who's going to get something committed > > here? > > I was about to commit that, because the original Windows problem it > solved is showing up occasionally in CI failures (that is, it already > solves a live problem, albeit a different and non-data-corrupting > one): > > https://www.postgresql.org/message-id/CA%2BhUKGJp-m8uAD_wS7%2BdkTgif013SNBSoJujWxvRUzZ1nkoUyA%40mail.gmail.com > > It seems like I should go ahead and do that today, and we can study > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work? A bird in the hand is worth two in the bush. Unless it's a vicious, hand-biting bird. In other words, if you think what you've got is better than what we have now and won't break anything worse than it is today, +1 for committing, and more improvements can come when we get enough round tuits. -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Andres Freund
Date:
Hi,

On 2022-02-10 13:49:50 -0500, Robert Haas wrote:
> I agree. While I feel sort of bad about missing this issue in review,
> I also feel like it's pretty surprising that there isn't something
> plugging this hole already. It feels unexpected that our FD management
> layer might hand you an FD that references the previous file that had
> the same name instead of the one that exists right now, and I don't
> think it's too much of a stretch of the imagination to suppose that
> this won't be the last problem we have if we don't fix it.

Arguably it's not really the FD layer that's the issue - it's on the smgr level. But that's semantics...

> The main question in my mind is who is going to actually make that
> happen. It was your idea (I think), Thomas coded it, and my commit
> made it a live problem. So who's going to get something committed
> here?

I think it'd make sense for Thomas to commit the current patch for the tablespace issues first. And then we can go from there?

> > If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
> > XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.
>
> What about movedb()?

My first reaction was that it should be fine, because RelFileNode.spcNode would differ. But of course that doesn't protect against moving a database back and forth. So yes, you're right, it's needed there too.

Are there other commands that could be problematic? ALTER TABLE SET TABLESPACE should be fine, that allocates a new relfilenode.

> I definitely think it would be nice to cover this issue not only for
> database OIDs and tablespace OIDs but also for relfilenode OIDs.
> Otherwise we're right on the cusp of having the same bug in that case.
> pg_upgrade doesn't drop and recreate tables the way it does databases,
> but if somebody made it do that in the future, would they think about
> this? I bet not.

And even just plugging the wraparound aspect would be great. It's not actually *that* hard to wrap oids around, because of toast.

> Dilip's been working on your idea of making relfilenode into a 56-bit
> quantity. That would make the problem of detecting wraparound go away.
> In the meantime, I suppose we could think of trying to do something
> here:
>
>         /* wraparound, or first post-initdb assignment, in normal mode */
>         ShmemVariableCache->nextOid = FirstNormalObjectId;
>         ShmemVariableCache->oidCount = 0;
>
> That's a bit sketch, because this is the OID counter, which isn't only
> used for relfilenodes.

I can see a few ways to deal with that:

1) Split nextOid into two, nextRelFileNode and nextOid. Have the wraparound behaviour only in the nextRelFileNode case. That'd be a nice improvement, but not entirely trivial, because we currently use GetNewRelFileNode() for oid assignment as well (c.f. heap_create_with_catalog()).

2) Keep track of the last assigned relfilenode in ShmemVariableCache, and wait for a barrier in GetNewRelFileNode() if the assigned oid is wrapped around. While there would be a chance of "missing" a ->nextOid wraparound, e.g. because of >2**32 toast oids being allocated, I think that could only happen if there were no "wrapped around relfilenodes" allocated, so that should be ok.

3) Any form of oid wraparound might cause problems, not just relfilenode issues. So having a place to emit barriers on oid wraparound might be a good idea.

One thing I wonder is how to make sure the path would be tested. Perhaps we could emit the anti-wraparound barriers more frequently when assertions are enabled?
> I'm not sure whether there would be problems waiting for a barrier here - I
> think we might be holding some lwlocks in some code paths.

It seems like a bad idea to call GetNewObjectId() with lwlocks held. From what I can see the two callers of GetNewObjectId() both have loops that can go on for a long time after a wraparound.

> > I think we need to add some debugging code to detect "fd to deleted file of
> > same name" style problems. Especially during regression tests they're quite
> > hard to notice, because most of the contents read/written are identical across
> > recycling.
>
> Maybe. If we just plug the holes here, I am not sure we need permanent
> instrumentation. But I'm also not violently opposed to it.

The question is how we can find all the places where we need to add barrier emits & waits. I e.g. didn't initially realize that there's wraparound danger in WAL replay as well.

Greetings, Andres Freund
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Andres Freund
Date:
Hi, On 2022-02-11 09:10:38 +1300, Thomas Munro wrote: > I was about to commit that, because the original Windows problem it > solved is showing up occasionally in CI failures (that is, it already > solves a live problem, albeit a different and non-data-corrupting > one): +1 > It seems like I should go ahead and do that today, and we can study > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work? Yes. I wonder whether we really should make the barriers be conditional on defined(WIN32) || defined(USE_ASSERT_CHECKING) as done in that patch, even leaving wraparound dangers aside. On !windows we still have the issues of the space for the dropped / moved files not being released if there are processes having them open. That can be a lot of space if there's long-lived connections in a cluster that doesn't fit into s_b (because processes will have random fds open for writing back dirty buffers). And we don't truncate the files before unlinking when done as part of a DROP DATABASE... But that's something we can fine-tune later as well... Greetings, Andres Freund
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Andres Freund
Date:
Hi,

On 2022-02-10 14:26:59 -0800, Andres Freund wrote:
> On 2022-02-11 09:10:38 +1300, Thomas Munro wrote:
> > It seems like I should go ahead and do that today, and we can study
> > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?
>
> Yes.

I wrote a test to show the problem. While looking at the code I unfortunately found that CREATE DATABASE ... OID isn't the only problem. Two consecutive ALTER DATABASE ... SET TABLESPACE commands can also cause corruption.

The test doesn't work on < 15 (due to PostgresNode renaming and use of allow_in_place_tablespaces). But manually it's unfortunately quite reproducible :(

Start a server with

    shared_buffers = 1MB
    autovacuum = off
    bgwriter_lru_maxpages = 1
    bgwriter_delay = 10000

these are just to give more time / to require smaller tables. Start two psql sessions:

    s1: \c postgres
    s1: CREATE DATABASE conflict_db;
    s1: CREATE TABLESPACE test_tablespace LOCATION '/tmp/blarg';
    s1: CREATE EXTENSION pg_prewarm;

    s2: \c conflict_db
    s2: CREATE TABLE large(id serial primary key, dataa text, datab text);
    s2: INSERT INTO large(dataa, datab) SELECT g.i::text, 0 FROM generate_series(1, 4000) g(i);
    s2: UPDATE large SET datab = 1;

    -- prevent AtEOXact_SMgr
    s1: BEGIN;
    -- Fill shared_buffers with other contents. This needs to write out the prior dirty contents,
    -- leading to opening smgr rels / file descriptors
    s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

    s2: \c postgres
    -- unlinks the files
    s2: ALTER DATABASE conflict_db SET TABLESPACE test_tablespace;
    -- now new files with the same relfilenode exist
    s2: ALTER DATABASE conflict_db SET TABLESPACE pg_default;
    s2: \c conflict_db
    -- dirty buffers again
    s2: UPDATE large SET datab = 2;

    -- this again writes out the dirty buffers. But because nothing forced the smgr handles to be closed,
    -- fds pointing to the *old* file contents are used. I.e. we'll write to the wrong file
    s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

    -- verify that everything is [not] ok
    s2: SELECT datab, count(*) FROM large GROUP BY 1 ORDER BY 1 LIMIT 10;

    ┌───────┬───────┐
    │ datab │ count │
    ├───────┼───────┤
    │     1 │  2117 │
    │     2 │    67 │
    └───────┴───────┘
    (2 rows)

oops.

I don't yet have a good idea how to tackle this in the backbranches.

I've started to work on a few debugging aids to find problems like these. Attached are two WIP patches:

1) use fstat(...).st_nlink == 0 to detect dangerous actions on fds with deleted files. That seems to work (almost) everywhere. Optionally, on linux, use readlink(/proc/$pid/fd/$fd) to get the filename.

2) On linux, iterate over PGPROC to get a list of pids. Then iterate over /proc/$pid/fd/ to check for deleted files. This only can be done after we've just made sure there's no fds open to deleted files.

Greetings, Andres Freund
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Andres Freund
Date:
Hi,

On 2022-02-22 01:11:21 -0800, Andres Freund wrote:
> I've started to work on a few debugging aids to find problems like
> these. Attached are two WIP patches:

Forgot to attach. Also importantly includes a tap test for several of these issues.

Greetings, Andres Freund
Attachment
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Robert Haas
Date:
On Tue, Feb 22, 2022 at 4:40 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-02-22 01:11:21 -0800, Andres Freund wrote:
> > I've started to work on a few debugging aids to find problems like
> > these. Attached are two WIP patches:
>
> Forgot to attach. Also importantly includes a tap test for several of these
> issues.

Hi,

Just a few very preliminary comments:

- I am having some trouble understanding clearly what 0001 is doing. I'll try to study it further.

- 0002 seems like it's generally a good idea. I haven't yet dug into why the call sites for AssertFileNotDeleted() are where they are, or whether that's a complete set of places to check.

- In general, 0003 makes a lot of sense to me.

+       /*
+        * Finally tell the kernel to write the data to storage. Don't smgr if
+        * previously closed, otherwise we could end up evading fd-reuse
+        * protection.
+        */

- I think the above hunk is missing a word, because it uses smgr as a verb. I also think that it's easy to imagine this explanation being insufficient for some future hacker to understand the issue.

- While 0004 seems useful for testing, it's an awfully big hammer. I'm not sure we should be thinking about committing something like that, or at least not as a default part of the build. But ... maybe?

-- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Andres Freund
Date:
Hi,

On 2022-03-02 14:52:01 -0500, Robert Haas wrote:
> - I am having some trouble understanding clearly what 0001 is doing.
> I'll try to study it further.

It tests for the various scenarios I could think of that could lead to FD reuse, to state the obvious ;). Anything particularly unclear?

> - In general, 0003 makes a lot of sense to me.
>
> +       /*
> +        * Finally tell the kernel to write the data to storage. Don't smgr if
> +        * previously closed, otherwise we could end up evading fd-reuse
> +        * protection.
> +        */
>
> - I think the above hunk is missing a word, because it uses smgr as a
> verb. I also think that it's easy to imagine this explanation being
> insufficient for some future hacker to understand the issue.

Yea, it's definitely not sufficient or grammatically correct. I think I wanted to send something out, but was very tired by that point..

> - While 0004 seems useful for testing, it's an awfully big hammer. I'm
> not sure we should be thinking about committing something like that,
> or at least not as a default part of the build. But ... maybe?

Agreed. I think it's racy anyway, because further files could get deleted (e.g. segments > 0) after the barrier has been processed.

What I am stuck on is what we can do for the released branches. Data corruption after two consecutive ALTER DATABASE SET TABLESPACEs seems like something we need to address.

Greetings, Andres Freund
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Robert Haas
Date:
On Wed, Mar 2, 2022 at 3:00 PM Andres Freund <andres@anarazel.de> wrote: > What I am stuck on is what we can do for the released branches. Data > corruption after two consecutive ALTER DATABASE SET TABLESPACEs seems like > something we need to address. I think we should consider back-porting the ProcSignalBarrier stuff eventually. I realize that it's not really battle-tested yet and I am not saying we should do it right now, but I think that if we get these changes into v15, back-porting it in let's say May of next year could be a reasonable thing to do. Sure, there is some risk there, but on the other hand, coming up with completely different fixes for the back-branches is not risk-free either, nor is it clear that there is any alternative fix that is nearly as good. In the long run, I am fairly convinced that ProcSignalBarrier is the way forward not only for this purpose but for other things as well, and everybody's got to get on the train or be left behind. Also, I am aware of multiple instances where the project waited a long, long time to fix bugs because we didn't have a back-patchable fix. I disagree with that on principle. A master-only fix now is better than a back-patchable fix two or three years from now. Of course a back-patchable fix now is better still, but we have to pick from the options we have, not the ones we'd like to have. <digressing a bit> It seems to me that if we were going to try to construct an alternative fix for the back-branches, it would have to be something that didn't involve a new invalidation mechanism -- because the ProcSignalBarrier stuff is an invalidation mechanism in effect, and I feel that it can't be better to invent two new invalidation mechanisms rather than one. And the only idea I have is trying to detect a dangerous sequence of operations and just outright block it. We have some cases sort of like that already - e.g. you can't prepare a transaction if it's done certain things. But, the existing precedents that occur to me are, I think, all cases where all of the related actions are being performed in the same backend. It doesn't sound crazy to me to have some rule like "you can't ALTER TABLESPACE on the same tablespace in the same backend twice in a row without an intervening checkpoint", or whatever, and install the book-keeping to enforce that. But I don't think anything like that can work, both because the two ALTER TABLESPACE commands could be performed in different sessions, and also because an intervening checkpoint is no guarantee of safety anyway, IIUC. So I'm just not really seeing a reasonable strategy that isn't basically the barrier stuff. -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Robert Haas
Date:
On Wed, Mar 2, 2022 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-03-02 14:52:01 -0500, Robert Haas wrote:
> > - I am having some trouble understanding clearly what 0001 is doing.
> > I'll try to study it further.
>
> It tests for the various scenarios I could think of that could lead to FD
> reuse, to state the obvious ;). Anything particularly unclear?

As I understand it, the basic idea of this test is:

1. create a database called 'conflict_db' containing a table called 'large'
2. also create a table called 'replace_sb' in the postgres db.
3. have a long running transaction open in 'postgres' database that repeatedly accesses the 'replace_sb' table to evict the 'conflict_db' table, sometimes causing write-outs
4. try to get it to write out the data to a stale file descriptor by fiddling with various things, and then detect the corruption that results
5. use both a primary and a standby not because they are intrinsically necessary but because you want to test that both cases work

As to (1) and (2), I note that 'large' is not a hugely informative table name, especially when it's the smaller of the two test tables. And also, the names are all over the place. The table names don't have much to do with each other (large, replace_sb) and they're completely unrelated to the database names (postgres, conflict_db). And then you have Perl functions whose names also don't make it obvious what we're talking about. I can't remember that verify() is the one that accesses conflict.db large while cause_eviction() is the one that accesses postgres.replace_sb for more than like 15 seconds. It'd make it a lot easier if the table names, database names, and Perl function names had some common elements.

As to (5), the identifiers are just primary and standby, without mentioning the database name, which adds to the confusion, and there are no comments explaining why we have two nodes.

Also, some of the comments seem like they might be in the wrong place:

+# Create template database with a table that we'll update, to trigger dirty
+# rows. Using a template database + preexisting rows makes it a bit easier to
+# reproduce, because there's no cache invalidations generated.

Right after this comment, we create a table and then a template database just after, so I think we're not avoiding any invalidations at this point. We're avoiding them at later points where the comments aren't talking about this.

I think it would be helpful to find a way to divide the test case up into sections that are separated from each other visually, and explain the purpose of each test a little more in a comment. For example I'm struggling to understand the exact purpose of this bit:

+$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
+
+verify($node_primary, $node_standby, 3,
+       "restored contents as expected");

I'm all for having test coverage of VACUUM FULL, but it isn't clear to me what this does that is related to FD reuse. Similarly for the later ones. I generally grasp that you are trying to make sure that things are OK with respect to FD reuse in various scenarios, but I couldn't explain to you what we'd be losing in terms of test coverage if we removed this line:

+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");

I am not sure how much all of this potential cleanup matters because I'm pretty skeptical that this would be stable in the buildfarm. It relies on a lot of assumptions about timing and row sizes and details of when invalidations are sent that feel like they might not be entirely predictable at the level you would need to have this consistently pass everywhere. Perhaps I am too pessimistic?

-- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Andres Freund
Date:
Hi, On 2022-03-03 13:11:17 -0500, Robert Haas wrote: > On Wed, Mar 2, 2022 at 3:00 PM Andres Freund <andres@anarazel.de> wrote: > > On 2022-03-02 14:52:01 -0500, Robert Haas wrote: > > > - I am having some trouble understanding clearly what 0001 is doing. > > > I'll try to study it further. > > > > It tests for the various scenarios I could think of that could lead to FD > > reuse, to state the obvious ;). Anything particularly unclear. > > As I understand it, the basic idea of this test is: > > 1. create a database called 'conflict_db' containing a table called 'large' > 2. also create a table called 'replace_sb' in the postgres db. > 3. have a long running transaction open in 'postgres' database that > repeatedly accesses the 'replace_sb' table to evict the 'conflict_db' > table, sometimes causing write-outs > 4. try to get it to write out the data to a stale file descriptor by > fiddling with various things, and then detect the corruption that > results > 5. use both a primary and a standby not because they are intrinsically > necessary but because you want to test that both cases work Right. I wasn't proposing to commit the test as-is, to be clear. It was meant as a demonstration of the types of problems I can see, and what's needed to fix them... > I can't remember that verify() is the one that accesses conflict.db large > while cause_eviction() is the one that accesses postgres.replace_sb for more > than like 15 seconds. For more than 15seconds? The whole test runs in a few seconds for me... > As to (5), the identifiers are just primary and standby, without > mentioning the database name, which adds to the confusion, and there > are no comments explaining why we have two nodes. I don't follow this one - physical rep doesn't do anything below the database level? > I think it would be helpful to find a way to divide the test case up > into sections that are separated from each other visually, and explain > the purpose of each test a little more in a comment. For example I'm > struggling to understand the exact purpose of this bit: > > +$node_primary->safe_psql('conflict_db', "VACUUM FULL large;"); > +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;"); > + > +verify($node_primary, $node_standby, 3, > + "restored contents as expected"); > > I'm all for having test coverage of VACUUM FULL, but it isn't clear to > me what this does that is related to FD reuse. It's just trying to clean up prior damage, so the test continues to later ones. Definitely not needed once there's a fix. But it's useful while trying to make the test actually detect various corruptions. > Similarly for the later ones. I generally grasp that you are trying to > make sure that things are OK with respect to FD reuse in various > scenarios, but I couldn't explain to you what we'd be losing in terms > of test coverage if we removed this line: > > +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;"); > > I am not sure how much all of this potential cleanup matters because > I'm pretty skeptical that this would be stable in the buildfarm. Which "potential cleanup" are you referring to? > It relies on a lot of assumptions about timing and row sizes and details of > when invalidations are sent that feel like they might not be entirely > predictable at the level you would need to have this consistently pass > everywhere. Perhaps I am too pessimistic? 
I don't think the test passing should be dependent on row size, invalidations etc to a significant degree (unless the tables are too small to reach s_b etc)? The exact symptoms of failures are highly unstable, but obviously we'd fix them in-tree before committing a test. But maybe I'm missing a dependency? FWIW, the test did pass on freebsd, linux, macos and windows with the fix. After a few iterations of improving the fix ;) Greetings, Andres Freund
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Robert Haas
Date:
On Thu, Mar 3, 2022 at 1:28 PM Andres Freund <andres@anarazel.de> wrote: > > I can't remember that verify() is the one that accesses conflict.db large > > while cause_eviction() is the one that accesses postgres.replace_sb for more > > than like 15 seconds. > > For more than 15seconds? The whole test runs in a few seconds for me... I'm not talking about running the test. I'm talking about reading it and trying to keep straight what is happening. The details of which function is accessing which database keep going out of my head. Quickly. Maybe because I'm dumb, but I think some better naming could help, just in case other people are dumb, too. > > As to (5), the identifiers are just primary and standby, without > > mentioning the database name, which adds to the confusion, and there > > are no comments explaining why we have two nodes. > > I don't follow this one - physical rep doesn't do anything below the database > level? Right but ... those handles are connected to a particular DB on that node, not just the node in general. > > I think it would be helpful to find a way to divide the test case up > > into sections that are separated from each other visually, and explain > > the purpose of each test a little more in a comment. For example I'm > > struggling to understand the exact purpose of this bit: > > > > +$node_primary->safe_psql('conflict_db', "VACUUM FULL large;"); > > +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;"); > > + > > +verify($node_primary, $node_standby, 3, > > + "restored contents as expected"); > > > > I'm all for having test coverage of VACUUM FULL, but it isn't clear to > > me what this does that is related to FD reuse. > > It's just trying to clean up prior damage, so the test continues to later > ones. Definitely not needed once there's a fix. But it's useful while trying > to make the test actually detect various corruptions. Ah, OK, I definitely did not understand that before. > > Similarly for the later ones. I generally grasp that you are trying to > > make sure that things are OK with respect to FD reuse in various > > scenarios, but I couldn't explain to you what we'd be losing in terms > > of test coverage if we removed this line: > > > > +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;"); > > > > I am not sure how much all of this potential cleanup matters because > > I'm pretty skeptical that this would be stable in the buildfarm. > > Which "potential cleanup" are you referring to? I mean, whatever improvements you might consider making based on my comments. > > It relies on a lot of assumptions about timing and row sizes and details of > > when invalidations are sent that feel like they might not be entirely > > predictable at the level you would need to have this consistently pass > > everywhere. Perhaps I am too pessimistic? > > I don't think the test passing should be dependent on row size, invalidations > etc to a significant degree (unless the tables are too small to reach s_b > etc)? The exact symptoms of failures are highly unstable, but obviously we'd > fix them in-tree before committing a test. But maybe I'm missing a dependency? > > FWIW, the test did pass on freebsd, linux, macos and windows with the > fix. After a few iterations of improving the fix ;) Hmm. Well, I admit that's already more than I would have expected.... -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Thomas Munro
Date:
Some thoughts:

The v1-0003 patch introduced smgropen_cond() to avoid the problem of IssuePendingWritebacks(), which does desynchronised smgropen() calls and could open files after the barrier but just before they are unlinked. Makes sense, but...

1. For that to actually work, we'd better call smgrcloseall() when PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls closeAllVds(). Here's a patch for that. But...

2. The reason I used closeAllVds() in 4eb21763 is because I didn't want random CHECK_FOR_INTERRUPTS() calls to break code like this, from RelationCopyStorage():

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        /* If we got a cancel signal during the copy of the data, quit */
        CHECK_FOR_INTERRUPTS();

        smgrread(src, forkNum, blkno, buf.data);

Perhaps we could change RelationCopyStorage() to take Relation, and use CreateFakeRelCacheEntry() in various places to satisfy that, and also extend that mechanism to work with temporary tables. Here's an initial sketch of that idea, to see what problems it crashes into... Fake Relations are rather unpleasant though; I wonder if there is an SMgrRelationHandle concept that could make this all nicer. There is possibly also some cleanup missing to avoid an SMgrRelation that points to a defunct fake Relation on non-local exit (?).
Attachment
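For readers without the attachment, a rough illustration of the fake-relcache direction sketched above. The Relation-taking RelationCopyStorage() signature is hypothetical (today it takes SMgrRelations); CreateFakeRelCacheEntry()/FreeFakeRelcacheEntry() are the existing recovery helpers:

    /*
     * Hypothetical caller: wrap bare RelFileNodes in fake relcache entries so
     * that the copy loop can re-open storage through the relcache after a
     * PROCSIGNAL_BARRIER_SMGRRELEASE, instead of holding raw SMgrRelation
     * pointers across CHECK_FOR_INTERRUPTS().
     */
    Relation    src_rel = CreateFakeRelCacheEntry(src_rnode);
    Relation    dst_rel = CreateFakeRelCacheEntry(dst_rnode);

    /* hypothetical signature: Relation instead of SMgrRelation arguments */
    RelationCopyStorage(src_rel, dst_rel, forkNum, relpersistence);

    FreeFakeRelcacheEntry(src_rel);
    FreeFakeRelcacheEntry(dst_rel);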
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Robert Haas
Date:
On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> The v1-0003 patch introduced smgropen_cond() to avoid the problem of
> IssuePendingWritebacks(), which does desynchronised smgropen() calls
> and could open files after the barrier but just before they are
> unlinked. Makes sense, but...
>
> 1. For that to actually work, we'd better call smgrcloseall() when
> PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls
> closeAllVds(). Here's a patch for that. But...

What if we instead changed our approach to fixing IssuePendingWritebacks()? Design sketch:

1. We add a new flag, bool smgr_zonk, to SmgrRelationData. Initially it's false.
2. Receipt of PROCSIGNAL_BARRIER_SMGRRELEASE sets smgr_zonk to true.
3. If you want, you can set it back to false any time you access the smgr while holding a relation lock.
4. IssuePendingWritebacks() uses smgropen_cond(), as today, but that function now refuses either to create a new smgr or to return one where smgr_zonk is true.

I think this would be sufficient to guarantee that we never try to write back to a relfilenode if we haven't had a lock on it since the last PROCSIGNAL_BARRIER_SMGRRELEASE, which I think is the invariant we need here?

My thinking here is that it's a lot easier to get away with marking the content of a data structure invalid than it is to get away with invalidating pointers to that data structure. If we can tinker with the design so that the SMgrRelationData doesn't actually go away but just gets a little bit more thoroughly invalidated, we could avoid having to audit the entire code base for code that keeps smgr handles around longer than would be possible with the design you demonstrate here.

-- Robert Haas EDB: http://www.enterprisedb.com
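A minimal sketch of what points 1, 2 and 4 could look like in smgr.c; the smgr_zonk field is only the proposal above, and the lookup shown for smgropen_cond() is simplified (the real v1-0003 function may differ):

    /* 1. proposed flag in SMgrRelationData */
    bool        smgr_zonk;      /* fds may predate the last SMGRRELEASE barrier */

    /* 2. in the PROCSIGNAL_BARRIER_SMGRRELEASE handler: mark every entry */
    static void
    ZonkAllSMgrRelations(void)
    {
        HASH_SEQ_STATUS status;
        SMgrRelation reln;

        if (SMgrRelationHash == NULL)
            return;
        hash_seq_init(&status, SMgrRelationHash);
        while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
            reln->smgr_zonk = true;
    }

    /* 4. smgropen_cond(): only return an entry we've "seen" since the barrier */
    SMgrRelation
    smgropen_cond(RelFileNodeBackend rnode)
    {
        SMgrRelation reln;

        reln = (SMgrRelation) hash_search(SMgrRelationHash, &rnode,
                                          HASH_FIND, NULL);
        if (reln == NULL || reln->smgr_zonk)
            return NULL;        /* caller must skip the writeback */
        return reln;
    }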
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Thomas Munro
Date:
On Sat, Apr 2, 2022 at 2:52 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > The v1-0003 patch introduced smgropen_cond() to avoid the problem of > > IssuePendingWritebacks(), which does desynchronised smgropen() calls > > and could open files after the barrier but just before they are > > unlinked. Makes sense, but... > > > > 1. For that to actually work, we'd better call smgrcloseall() when > > PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls > > closeAllVds(). Here's a patch for that. But... > > What if we instead changed our approach to fixing > IssuePendingWritebacks()? Design sketch: > > 1. We add a new flag, bool smgr_zonk, to SmgrRelationData. Initially it's false. > 2. Receipt of PROCSIGNAL_BARRIER_SMGRRELEASE sets smgr_zonk to true. > 3. If you want, you can set it back to false any time you access the > smgr while holding a relation lock. > 4. IssuePendingWritebacks() uses smgropen_cond(), as today, but that > function now refuses either to create a new smgr or to return one > where smgr_zonk is true. > > I think this would be sufficient to guarantee that we never try to > write back to a relfilenode if we haven't had a lock on it since the > last PROCSIGNAL_BARRIER_SMGRRELEASE, which I think is the invariant we > need here? > > My thinking here is that it's a lot easier to get away with marking > the content of a data structure invalid than it is to get away with > invalidating pointers to that data structure. If we can tinker with > the design so that the SMgrRelationData doesn't actually go away but > just gets a little bit more thoroughly invalidated, we could avoid > having to audit the entire code base for code that keeps smgr handles > around longer than would be possible with the design you demonstrate > here. Another idea would be to call a new function DropPendingWritebacks(), and also tell all the SMgrRelation objects to close all their internal state (ie the fds + per-segment objects) but not free the main SMgrRelationData object, and for good measure also invalidate the small amount of cached state (which hasn't been mentioned in this thread, but it kinda bothers me that that state currently survives, so it was one unspoken reason I liked the smgrcloseall() idea). Since DropPendingWritebacks() is in a rather different module, perhaps if we were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to something else, because it's not really a SMGR-only thing anymore.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Thomas Munro
Date:
On Sat, Apr 2, 2022 at 10:03 AM Thomas Munro <thomas.munro@gmail.com> wrote: > Another idea would be to call a new function DropPendingWritebacks(), > and also tell all the SMgrRelation objects to close all their internal > state (ie the fds + per-segment objects) but not free the main > SMgrRelationData object, and for good measure also invalidate the > small amount of cached state (which hasn't been mentioned in this > thread, but it kinda bothers me that that state currently survives, so > it was one unspoken reason I liked the smgrcloseall() idea). Since > DropPendingWritebacks() is in a rather different module, perhaps if we > were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to > something else, because it's not really a SMGR-only thing anymore. Here's a sketch of that idea.
Attachment
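For readers without the attachment, roughly the shape such a "release" operation could take per the description above; the smgrrelease()/smgrreleaseall() names match those used later in this thread, but the bodies here are illustrative, not the attached patch:

    /* Close the fds and per-segment state, invalidate cached values, but
     * keep the SMgrRelationData itself (and any pointers to it) valid. */
    void
    smgrrelease(SMgrRelation reln)
    {
        ForkNumber  forknum;

        for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
        {
            smgrsw[reln->smgr_which].smgr_close(reln, forknum);
            reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
        }
        reln->smgr_targblock = InvalidBlockNumber;
    }

    /* Called from the PROCSIGNAL_BARRIER_SMGRRELEASE handler. */
    void
    smgrreleaseall(void)
    {
        HASH_SEQ_STATUS status;
        SMgrRelation reln;

        if (SMgrRelationHash == NULL)
            return;
        hash_seq_init(&status, SMgrRelationHash);
        while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
            smgrrelease(reln);
    }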
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Robert Haas
Date:
On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Another idea would be to call a new function DropPendingWritebacks(), > and also tell all the SMgrRelation objects to close all their internal > state (ie the fds + per-segment objects) but not free the main > SMgrRelationData object, and for good measure also invalidate the > small amount of cached state (which hasn't been mentioned in this > thread, but it kinda bothers me that that state currently survives, so > it was one unspoken reason I liked the smgrcloseall() idea). Since > DropPendingWritebacks() is in a rather different module, perhaps if we > were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to > something else, because it's not really a SMGR-only thing anymore. I'm not sure that it really matters, but with the idea that I proposed it's possible to "save" a pending writeback, if we notice that we're accessing the relation with a proper lock after the barrier arrives and before we actually try to do the writeback. With this approach we throw them out immediately, so they're just gone. I don't mind that because I don't think this will happen often enough to matter, and I don't think it will be very expensive when it does, but it's something to think about. -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Thomas Munro
Date:
On Tue, Apr 5, 2022 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Another idea would be to call a new function DropPendingWritebacks(),
> > and also tell all the SMgrRelation objects to close all their internal
> > state (ie the fds + per-segment objects) but not free the main
> > SMgrRelationData object, and for good measure also invalidate the
> > small amount of cached state (which hasn't been mentioned in this
> > thread, but it kinda bothers me that that state currently survives, so
> > it was one unspoken reason I liked the smgrcloseall() idea). Since
> > DropPendingWritebacks() is in a rather different module, perhaps if we
> > were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
> > something else, because it's not really a SMGR-only thing anymore.
>
> I'm not sure that it really matters, but with the idea that I proposed
> it's possible to "save" a pending writeback, if we notice that we're
> accessing the relation with a proper lock after the barrier arrives
> and before we actually try to do the writeback. With this approach we
> throw them out immediately, so they're just gone. I don't mind that
> because I don't think this will happen often enough to matter, and I
> don't think it will be very expensive when it does, but it's something
> to think about.

The checkpointer never takes heavyweight locks, so the opportunity you're describing can't arise.

This stuff happens entirely within the scope of a call to BufferSync(), which is called by CheckPointBuffers(). BufferSync() does WritebackContextInit() to set up the object that accumulates the writeback requests, and then runs for a while performing a checkpoint (usually spread out over time), and then at the end does IssuePendingWritebacks(). A CFI() can be processed any time up until the IssuePendingWritebacks(), but not during IssuePendingWritebacks(). That's the size of the gap we need to cover.

I still like the patch I posted most recently. Note that it's not quite as I described above: there is no DropPendingWritebacks(), because that turned out to be impossible to implement in a way that the barrier handler could call -- it needs access to the writeback context. Instead I had to expose a counter so that IssuePendingWritebacks() could detect whether any smgrreleaseall() calls had happened while the writeback context was being filled up with writeback requests, by comparing with the level recorded therein.
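Sketched out, the counter mechanism described in the last paragraph could look something like this; the field and variable names are invented for illustration and may not match the attached patch, while WritebackContextInit() and IssuePendingWritebacks() are the existing bufmgr.c functions:

    /* bumped once per smgrreleaseall(), i.e. per SMGRRELEASE barrier */
    static uint64 smgr_release_count = 0;

    void
    WritebackContextInit(WritebackContext *context, int *max_pending)
    {
        /* ... existing initialization ... */
        context->release_count = smgr_release_count;    /* remember the level */
    }

    void
    IssuePendingWritebacks(WritebackContext *context)
    {
        /*
         * If any smgrreleaseall() happened while requests were accumulating,
         * the relfilenodes (and fds) behind them may have been recycled, so
         * forget the whole batch rather than risk reopening files.
         */
        if (context->release_count != smgr_release_count)
        {
            context->nr_pending = 0;
            return;
        }
        /* ... existing writeback issuing ... */
    }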
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Thomas Munro
Date:
On Tue, Apr 5, 2022 at 10:24 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Tue, Apr 5, 2022 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote: > > I'm not sure that it really matters, but with the idea that I proposed > > it's possible to "save" a pending writeback, if we notice that we're > > accessing the relation with a proper lock after the barrier arrives > > and before we actually try to do the writeback. With this approach we > > throw them out immediately, so they're just gone. I don't mind that > > because I don't think this will happen often enough to matter, and I > > don't think it will be very expensive when it does, but it's something > > to think about. > > The checkpointer never takes heavyweight locks, so the opportunity > you're describing can't arise. <thinks harder> Hmm, oh, you probably meant the buffer interlocking in SyncOneBuffer(). It's true that my most recent patch throws away more requests than it could, by doing the level check at the end of the loop over all buffers instead of adding some kind of DropPendingWritebacks() in the barrier handler. I guess I could find a way to improve that, basically checking the level more often instead of at the end, but I don't know if it's worth it; we're still throwing out an arbitrary percentage of writeback requests.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Robert Haas
Date:
On Mon, Apr 4, 2022 at 10:20 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > The checkpointer never takes heavyweight locks, so the opportunity > > you're describing can't arise. > > <thinks harder> Hmm, oh, you probably meant the buffer interlocking > in SyncOneBuffer(). It's true that my most recent patch throws away > more requests than it could, by doing the level check at the end of > the loop over all buffers instead of adding some kind of > DropPendingWritebacks() in the barrier handler. I guess I could find > a way to improve that, basically checking the level more often instead > of at the end, but I don't know if it's worth it; we're still throwing > out an arbitrary percentage of writeback requests. Doesn't every backend have its own set of pending writebacks? BufferAlloc() calls ScheduleBufferTagForWriteback(&BackendWritebackContext, ...)? -- Robert Haas EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
From
Thomas Munro
Date:
On Wed, Apr 6, 2022 at 5:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 4, 2022 at 10:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > <thinks harder> Hmm, oh, you probably meant the buffer interlocking
> > in SyncOneBuffer(). It's true that my most recent patch throws away
> > more requests than it could, by doing the level check at the end of
> > the loop over all buffers instead of adding some kind of
> > DropPendingWritebacks() in the barrier handler. I guess I could find
> > a way to improve that, basically checking the level more often instead
> > of at the end, but I don't know if it's worth it; we're still throwing
> > out an arbitrary percentage of writeback requests.
>
> Doesn't every backend have its own set of pending writebacks?
> BufferAlloc() calls
> ScheduleBufferTagForWriteback(&BackendWritebackContext, ...)?

Yeah. Alright, for my exaggeration above, s/can't arise/probably won't often arise/, since when regular backends do this it's for random dirty buffers, not necessarily stuff they hold a relation lock on.

I think you are right that if we find ourselves calling smgrwrite() on another buffer for that relfilenode a bit later, then it's safe to "unzonk" the relation. In the worst case, IssuePendingWritebacks() might finish up calling sync_file_range() on a file that we originally wrote to for generation 1 of that relfilenode, even though we now have a file descriptor for generation 2 of that relfilenode that was reopened by a later smgrwrite(). That would be acceptable, because a bogus sync_file_range() call would be harmless. The key point here is that we absolutely must avoid re-opening files without careful interlocking, because that could lead to later *writes* for generation 2 going to the defunct generation 1 file that we opened just as it was being unlinked.

(For completeness: we know of another way for smgrwrite() to write to the wrong generation of a recycled relfilenode, but that's hopefully very unlikely and will hopefully be addressed by adding more bits and killing off OID-wraparound[1]. In this thread we're concerned only with these weird "explicit OID recycling" cases we're trying to fix with PROCSIGNAL_BARRIER_SMGRRELEASE sledgehammer-based cache invalidation.)

My new attempt, attached, is closer to what Andres proposed, except at the level of md.c segment objects instead of the level of SMgrRelation objects. This avoids the dangling SMgrRelation pointer problem discussed earlier, and is also a little like your "zonk" idea, except instead of a zonk flag, the SMgrRelation object is reset to a state where it knows nothing at all except its relfilenode. Any access through smgrXXX() functions *except* smgrwriteback() will rebuild the state, just as you would clear the hypothetical zonk flag.

So, to summarise the new patch that I'm attaching to this email as 0001:

1. When PROCSIGNAL_BARRIER_SMGRRELEASE is handled, we call smgrreleaseall(), which calls smgrrelease() on all SMgrRelation objects, telling them to close all their files.

2. All SMgrRelation objects are still valid, and any pointers to them that were acquired before a CFI() and then used in an smgrXXX() call afterwards will still work (the example I know of being RelationCopyStorage()[2]).

3. mdwriteback() internally uses a new "behavior" flag EXTENSION_DONT_OPEN when getting its hands on the internal segment object, which says that we should just give up immediately if we don't already have the file open (concretely: if PROCSIGNAL_BARRIER_SMGRRELEASE came along between our recent smgrwrite() call and the writeback machinery's smgrwriteback() call, it'll do nothing at all).

Someone might say that it's weird that smgrwriteback() has that behaviour internally, and we might eventually want to make it more explicit by adding a "behavior" argument to the function itself, so that it's the caller that controls it. It didn't seem worth it for now though; the whole thing is a hint to the operating system anyway.

However it seems that I have something wrong, because CI is failing on Windows; I ran out of time for looking into that today, but wanted to post what I have so far since I know we have an open item or two to close here ASAP...

Patches 0002-0004 are Andres's, with minor tweaks:

v3-0002-Fix-old-fd-issues-using-global-barriers-everywher.patch:

It seems useless to even keep the macro USE_BARRIER_SMGRRELEASE if we're going to define it always, so I removed it. I guess it's useful to be able to disable that logic easily to see that the assertion in the other patch fails, but you can do that by commenting out a line in ProcessBarrierSmgrRelease(). I'm still slightly confused about whether we need an *extra* global barrier in DROP TABLESPACE, not just if destroy_tablespace_directory() failed.

v3-0003-WIP-test-for-file-reuse-dangers-around-database-a.patch

Was missing:

-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/pg_prewarm

v3-0004-WIP-AssertFileNotDeleted-fd.patch

+ * XXX: Figure out which operating systems this works on.

Do we want this to be wrapped in some kind of macro that vanishes into thin air unless you enable it with USE_UNLINKED_FILE_CHECKING or something? Then we wouldn't care so much if it doesn't work on some unusual system.

Bikeshed mode: I would prefer "not unlinked", since that has a more specific meaning than "not deleted".

[1] https://www.postgresql.org/message-id/flat/CA%2BTgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg%40mail.gmail.com#1070c79256f2330ec52f063cdbe2add0
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BBfaZ2puQDYV6h2oSWO2QW21_JOXZpha65gWRcmGNCZA%40mail.gmail.com#5deeb3d8adae4daf0da1f09e509eef56
Attachment
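The mdwriteback() part of point 3 could look roughly like this, simplified; the exact shape in the attached 0001 may differ, but EXTENSION_DONT_OPEN and _mdfd_getseg() are the md.c-level pieces being described, and seekpos/nflush stand in for the surrounding (elided) segment-range computation:

    /* Inside mdwriteback(), for each segment touched by the request: */
    MdfdVec    *v;

    /* Only use an already-open segment; never (re)open files here. */
    v = _mdfd_getseg(reln, forknum, blocknum, true /* skipFsync */ ,
                     EXTENSION_DONT_OPEN);
    if (v == NULL)
        return;                 /* a barrier closed the file: skip the hint,
                                 * it is only advice to the kernel anyway */

    FileWriteback(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * nflush,
                  WAIT_EVENT_DATA_FILE_FLUSH);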
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Robert Haas
Date:
On Fri, Apr 22, 2022 at 3:38 AM Thomas Munro <thomas.munro@gmail.com> wrote: > So, to summarise the new patch that I'm attaching to this email as 0001: This all makes sense to me, and I didn't see anything obviously wrong looking through the patch, either. > However it seems that I have something wrong, because CI is failing on > Windows; I ran out of time for looking into that today, but wanted to > post what I have so far since I know we have an open item or two to > close here ASAP... :-( > Patches 0002-0004 are Andres's, with minor tweaks: > > v3-0002-Fix-old-fd-issues-using-global-barriers-everywher.patch: > > I'm still slightly confused about whether we need an *extra* global > barrier in DROP TABLESPACE, not just if destroy_tablespace_directory() > failed. Andres wrote this code, but it looks correct to me. Currently, the reason why we use USE_BARRIER_SMGRRELEASE is that we want to make sure that we don't fail to remove a tablespace because some other backend might have files open. However, it might also be that no other backend has files open, and in that case we don't need to do anything, so the current placement of the call is correct. With this patch, though, we want to make sure that no FD that is open before we start dropping files remains open after we drop files - and that means we need to force files to be closed before we even try to remove files the first time. It seems to me that Andres's patch (your 0002) doesn't add a second call - it moves the existing one earlier. And that seems right: no new files should be getting opened once we force them closed the first time, I hope anyway. -- Robert Haas EDB: http://www.enterprisedb.com
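Spelled out, the ordering Robert describes is "barrier first, then unlink", along these lines (a hypothetical helper written only for illustration; the real code is inline in DropTableSpace() in tablespace.c and its error handling differs):

static void
drop_tablespace_files(Oid tablespaceoid)
{
	/*
	 * Step 1: make every process in the cluster (bgwriter, checkpointer,
	 * backends attached to other databases) close its cached smgr file
	 * descriptors before we touch anything on disk.
	 */
	WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

	/*
	 * Step 2: only now try to remove the files.  No fd opened before the
	 * barrier can still be open, so no later write can land in a file we
	 * are about to unlink.
	 */
	if (!destroy_tablespace_directories(tablespaceoid, false))
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("tablespace with OID %u is not empty", tablespaceoid)));
}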
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Thomas Munro
Date:
On Wed, May 4, 2022 at 6:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 22, 2022 at 3:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > So, to summarise the new patch that I'm attaching to this email as 0001:
>
> This all makes sense to me, and I didn't see anything obviously wrong
> looking through the patch, either.

Thanks.

> > However it seems that I have something wrong, because CI is failing on
> > Windows; I ran out of time for looking into that today, but wanted to
> > post what I have so far since I know we have an open item or two to
> > close here ASAP...
>
> :-(

It passes sometimes and fails sometimes. Here's the weird failure I need to debug:

https://api.cirrus-ci.com/v1/artifact/task/6033765456674816/log/src/test/recovery/tmp_check/log/regress_log_032_relfilenode_reuse

Right at the end, it says:

Warning: unable to close filehandle GEN26 properly: Bad file descriptor during global destruction.
Warning: unable to close filehandle GEN21 properly: Bad file descriptor during global destruction.
Warning: unable to close filehandle GEN6 properly: Bad file descriptor during global destruction.

I don't know what it means (Windows' fd->handle mapping table got corrupted?) or even which program is printing it (you'd think maybe perl, but how could that be affected by anything I did in postgres.exe? And if it's not perl, why is it always at the end like that?). Hrmph.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Thomas Munro
Date:
On Wed, May 4, 2022 at 7:44 AM Thomas Munro <thomas.munro@gmail.com> wrote: > It passes sometimes and fails sometimes. Here's the weird failure I > need to debug: > > https://api.cirrus-ci.com/v1/artifact/task/6033765456674816/log/src/test/recovery/tmp_check/log/regress_log_032_relfilenode_reuse > > Right at the end, it says: > > Warning: unable to close filehandle GEN26 properly: Bad file > descriptor during global destruction. > Warning: unable to close filehandle GEN21 properly: Bad file > descriptor during global destruction. > Warning: unable to close filehandle GEN6 properly: Bad file descriptor > during global destruction. > > I don't know what it means (Windows' fd->handle mapping table got > corrupted?) or even which program is printing it (you'd think maybe > perl? but how could that be affected by anything I did in > postgres.exe, but if it's not perl why is it always at the end like > that?). Hrmph. Got some off-list clues: that's just distracting Perl cleanup noise after something else went wrong (thanks Robert), and now I'm testing a theory from Andres that we're missing a barrier on the redo side when replaying XLOG_DBASE_CREATE_FILE_COPY. More soon.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Thomas Munro
Date:
On Wed, May 4, 2022 at 8:53 AM Thomas Munro <thomas.munro@gmail.com> wrote: > Got some off-list clues: that's just distracting Perl cleanup noise > after something else went wrong (thanks Robert), and now I'm testing a > theory from Andres that we're missing a barrier on the redo side when > replaying XLOG_DBASE_CREATE_FILE_COPY. More soon. Yeah, looks like that was the explanation. Presumably in older releases recovery can fail with EACCES here, whereas since commit e2f0f8ed we get ENOENT, because someone's got an unlinked file open and ReadDir() can still see it. (I've wondered before if ReadDir() should also hide zombie Windows directory entries, but that's kinda independent and would only get us one step further; a later rmdir() would still fail.) Adding the barrier fixes the problem. Assuming no objections or CI failures show up, I'll consider pushing the first two patches tomorrow.
Attachment
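For anyone following along, the redo-side change amounts to issuing the same barrier before touching the target directory, roughly like this (a sketch of the shape only, assuming the current dbase_redo() structure; not the committed hunk):

/*
 * In dbase_redo(), while replaying XLOG_DBASE_CREATE_FILE_COPY, before the
 * existing cleanup/copydir() steps run:
 *
 * Force other processes to close any fds they still hold for a previous
 * database that used the same OID.  Otherwise, on Windows, unlinked-but-
 * still-open files can linger where ReadDir() sees them, and removing the
 * old directory contents fails during recovery.
 */
WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));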
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Thomas Munro
Date:
On Wed, May 4, 2022 at 2:23 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Assuming no > objections or CI failures show up, I'll consider pushing the first two > patches tomorrow. Done. Time to watch the build farm. It's possible that these changes will produce some blowback, now that we're using PROCSIGNAL_BARRIER_SMGRRELEASE on common platforms. Obviously the earlier work started down this path already, but that was Windows-only, so it probably didn't get much attention from our mostly Unix crowd. For example, if there are bugs in the procsignal barrier system, or if there are places that don't process interrupts at all or promptly, we might hear complaints about that. That bug surface includes, for example, background workers created by extensions. An example on my mind is a place where we hold interrupts while cleaning up temporary files (= a loop of arbitrary size with filesystem ops that might hang on slow storage), so a concurrent DROP TABLESPACE will have to wait, which is kinda strange because it would in fact be perfectly safe to process this particular "interrupt". In that case we really just don't want the kinds of interrupts that perform non-local exits and prevent our cleanup work from running to completion, but we don't have a way to say that. I think we'll probably also want to invent a way to report which backend is holding up progress, since without that it's practically impossible for an end user to understand why their command is hanging.
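The temporary-file case is the familiar pattern sketched below (a generic illustration, not a quote of any particular function; tmpdirpath stands in for whatever directory is being cleaned). While interrupts are held, a pending PROCSIGNAL_BARRIER_SMGRRELEASE can't be absorbed, so the DROP TABLESPACE sits in WaitForProcSignalBarrier() until the whole loop finishes.

	DIR		   *dir;
	struct dirent *de;

	HOLD_INTERRUPTS();

	dir = AllocateDir(tmpdirpath);
	while ((de = ReadDirExtended(dir, tmpdirpath, LOG)) != NULL)
	{
		char		path[MAXPGPATH * 2];

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(path, sizeof(path), "%s/%s", tmpdirpath, de->d_name);
		unlink(path);			/* arbitrarily many of these, possibly slow */
	}
	FreeDir(dir);

	RESUME_INTERRUPTS();
	/* Only from here can CHECK_FOR_INTERRUPTS() absorb the pending barrier. */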
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Thomas Munro
Date:
On Sat, May 7, 2022 at 4:52 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Done. Time to watch the build farm. So far "grison" failed. I think it's probably just that the test forgot to wait for replay of CREATE EXTENSION before using pg_prewarm on the standby, hence "ERROR: function pg_prewarm(oid) does not exist at character 12". I'll wait for more animals to report before I try to fix that tomorrow.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Thomas Munro
Date:
On Sat, May 7, 2022 at 4:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I think we'll probably also want to invent a way
> to report which backend is holding up progress, since without that
> it's practically impossible for an end user to understand why their
> command is hanging.

Simple idea: how about logging the PID of processes that block progress for too long? In the attached, I arbitrarily picked 5 seconds as the wait time between LOG messages. Also, DEBUG1 messages let you see the processing speed on eg build farm animals. Thoughts?

To test this, kill -STOP a random backend, and then try an ALTER DATABASE SET TABLESPACE in another backend. Example output:

DEBUG: waiting for all backends to process ProcSignalBarrier generation 1
LOG: still waiting for pid 1651417 to accept ProcSignalBarrier
STATEMENT: alter database mydb set tablespace ts1;
LOG: still waiting for pid 1651417 to accept ProcSignalBarrier
STATEMENT: alter database mydb set tablespace ts1;
LOG: still waiting for pid 1651417 to accept ProcSignalBarrier
STATEMENT: alter database mydb set tablespace ts1;
LOG: still waiting for pid 1651417 to accept ProcSignalBarrier
STATEMENT: alter database mydb set tablespace ts1;

... then kill -CONT:

DEBUG: finished waiting for all backends to process ProcSignalBarrier generation 1

Another thought is that it might be nice to be able to test with a dummy PSB that doesn't actually do anything. You could use it to see how fast your system processes it, while doing various other things, and to find/debug problems in other code that fails to handle interrupts correctly. Here's an attempt at that. I guess it could go into a src/test/modules/something instead of core, but on the other hand the PSB itself has to be in core anyway, so maybe not. Thoughts? No documentation yet, just seeing if people think this is worth having... better names/ideas welcome. To test this, just SELECT pg_test_procsignal_barrier().
Attachment
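Mechanically, the 5-second reporting can be pictured as a bounded condition-variable sleep in WaitForProcSignalBarrier()'s per-slot wait loop, something like the sketch below (my illustration using the existing ConditionVariable API; the attached patch's exact structure and wording may differ):

	uint64		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);

	while (oldval < generation)
	{
		/* Sleep at most 5s, then name the process we're still waiting for. */
		if (ConditionVariableTimedSleep(&slot->pss_barrierCV, 5000,
										WAIT_EVENT_PROC_SIGNAL_BARRIER))
			ereport(LOG,
					(errmsg("still waiting for pid %d to accept ProcSignalBarrier",
							(int) slot->pss_pid)));
		oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
	}
	ConditionVariableCancelSleep();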
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Robert Haas
Date:
On Sun, May 8, 2022 at 7:30 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Simple idea: how about logging the PID of processes that block
> progress for too long? In the attached, I arbitrarily picked 5
> seconds as the wait time between LOG messages. Also, DEBUG1 messages
> let you see the processing speed on eg build farm animals. Thoughts?
>
> To test this, kill -STOP a random backend, and then try an ALTER
> DATABASE SET TABLESPACE in another backend. Example output:
>
> DEBUG: waiting for all backends to process ProcSignalBarrier generation 1
> LOG: still waiting for pid 1651417 to accept ProcSignalBarrier
> STATEMENT: alter database mydb set tablespace ts1;
> LOG: still waiting for pid 1651417 to accept ProcSignalBarrier
> STATEMENT: alter database mydb set tablespace ts1;
> LOG: still waiting for pid 1651417 to accept ProcSignalBarrier
> STATEMENT: alter database mydb set tablespace ts1;
> LOG: still waiting for pid 1651417 to accept ProcSignalBarrier
> STATEMENT: alter database mydb set tablespace ts1;

This is a very good idea.

> Another thought is that it might be nice to be able to test with a
> dummy PSB that doesn't actually do anything. You could use it to see
> how fast your system processes it, while doing various other things,
> and to find/debug problems in other code that fails to handle
> interrupts correctly. Here's an attempt at that. I guess it could go
> into a src/test/modules/something instead of core, but on the other
> hand the PSB itself has to be in core anyway, so maybe not. Thoughts?
> No documentation yet, just seeing if people think this is worth
> having... better names/ideas welcome.

I did this at one point, but I wasn't convinced it was going to find enough bugs to be worth committing. It's OK if you're convinced of things that didn't convince me, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Thomas Munro
Date:
On Tue, May 10, 2022 at 1:07 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, May 8, 2022 at 7:30 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > LOG: still waiting for pid 1651417 to accept ProcSignalBarrier > > STATEMENT: alter database mydb set tablespace ts1; > This is a very good idea. OK, I pushed this, after making the ereport call look a bit more like others that talk about backend PIDs. > > Another thought is that it might be nice to be able to test with a > > dummy PSB that doesn't actually do anything. You could use it to see > > how fast your system processes it, while doing various other things, > > and to find/debug problems in other code that fails to handle > > interrupts correctly. Here's an attempt at that. I guess it could go > > into a src/test/modules/something instead of core, but on the other > > hand the PSB itself has to be in core anyway, so maybe not. Thoughts? > > No documentation yet, just seeing if people think this is worth > > having... better names/ideas welcome. > > I did this at one point, but I wasn't convinced it was going to find > enough bugs to be worth committing. It's OK if you're convinced of > things that didn't convince me, though. I'll leave this here for now.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Thomas Munro
Date:
On Sat, May 7, 2022 at 9:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> So far "grison" failed. I think it's probably just that the test
> forgot to wait for replay of CREATE EXTENSION before using pg_prewarm
> on the standby, hence "ERROR: function pg_prewarm(oid) does not exist
> at character 12". I'll wait for more animals to report before I try
> to fix that tomorrow.

That one was addressed by commit a22652e. Unfortunately two new kinds of failure showed up.

Chipmunk, another little early-model Raspberry Pi:

error running SQL: 'psql:<stdin>:1: ERROR: source database "conflict_db_template" is being accessed by other users
DETAIL: There is 1 other session using the database.'
while running 'psql -XAtq -d port=57394 host=/tmp/luyJopPv9L dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CREATE DATABASE conflict_db TEMPLATE conflict_db_template OID = 50001;' at /home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 1836.

I think that might imply that when you do two $node_primary->safe_psql() calls in a row, the backend running the second one might still see the ghost of the first backend in the database, even though the first psql process has exited. Hmmm.

Skink, the valgrind animal, also failed. After the first 8 tests, it times out:

[07:18:26.237](14.827s) ok 8 - standby: post move contents as expected
[07:18:42.877](16.641s) Bail out! aborting wait: program timed out
stream contents: >><<
pattern searched for: (?^m:warmed_buffers)

That'd be in the perl routine cause_eviction(), while waiting for a pg_prewarm query to return, but I'm not yet sure what's going on (I have to suspect it's in the perl scripting rather than anything in the server, given the location of the failure). Will try to repro locally.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Thomas Munro
Date:
On Thu, May 12, 2022 at 3:13 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Chipmunk, another little early model Raspberry Pi: > > error running SQL: 'psql:<stdin>:1: ERROR: source database > "conflict_db_template" is being accessed by other users > DETAIL: There is 1 other session using the database.' Oh, for this one I think it may just be that the autovacuum worker with PID 23757 took longer to exit than the 5 seconds CountOtherDBBackends() is prepared to wait, after sending it SIGTERM.
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Thomas Munro
Date:
On Thu, May 12, 2022 at 4:57 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, May 12, 2022 at 3:13 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > error running SQL: 'psql:<stdin>:1: ERROR: source database > > "conflict_db_template" is being accessed by other users > > DETAIL: There is 1 other session using the database.' > > Oh, for this one I think it may just be that the autovacuum worker > with PID 23757 took longer to exit than the 5 seconds > CountOtherDBBackends() is prepared to wait, after sending it SIGTERM. In this test, autovacuum_naptime is set to 1s (per Andres, AV was implicated when he first saw the problem with pg_upgrade, hence desire to crank it up). That's not necessary: commenting out the active line in ProcessBarrierSmgrRelease() shows that the tests reliably reproduce data corruption without it. Let's just take that out. As for skink failing, the timeout was hard coded 300s for the whole test, but apparently that wasn't enough under valgrind. Let's use the standard PostgreSQL::Test::Utils::timeout_default (180s usually), but reset it for each query we send. See attached.
Attachment
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Robert Haas
Date:
On Thu, May 12, 2022 at 10:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> As for skink failing, the timeout was hard coded 300s for the whole
> test, but apparently that wasn't enough under valgrind. Let's use the
> standard PostgreSQL::Test::Utils::timeout_default (180s usually), but
> reset it for each query we send.

@@ -202,6 +198,9 @@ sub send_query_and_wait
 	my ($psql, $query, $untl) = @_;
 	my $ret;
 
+	$psql_timeout->reset();
+	$psql_timeout->start();
+
 	# send query
 	$$psql{stdin} .= $query;
 	$$psql{stdin} .= "\n";

This seems fine, but I think you should add a non-trivial comment about it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes
From
Thomas Munro
Date:
On Sat, May 14, 2022 at 3:33 AM Robert Haas <robertmhaas@gmail.com> wrote: > This seems fine, but I think you should add a non-trivial comment about it. Thanks for looking. Done, and pushed. Let's see if 180s per query is enough...