Thread: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

From: Andres Freund
Hi,

I was working on rebasing the AIO branch. Tests started to fail after the rebase,
but it turns out that the problem exists independently of AIO.

The problem starts with

commit aa01051418f10afbdfa781b8dc109615ca785ff9
Author: Robert Haas <rhaas@postgresql.org>
Date:   2022-01-24 14:23:15 -0500

    pg_upgrade: Preserve database OIDs.

    Commit 9a974cbcba005256a19991203583a94b4f9a21a9 arranged to preserve
    relfilenodes and tablespace OIDs. For similar reasons, also arrange
    to preserve database OIDs.

After this commit, a database that existed before pg_upgrade ran and the
database pg_upgrade (re-)creates can have the same OID. Consequently the
RelFileNode of files in the "pre-existing" database and in the re-created
database will be the same.

That's a problem: There is no reliable invalidation mechanism forcing all
processes to clear SMgrRelationHash when a database is dropped /
created. Which means that backends can end up using file descriptors for the
old, dropped, database, when accessing contents in the new database.  This is
most likely to happen to bgwriter, but could be any backend (backends in other
databases can end up opening files in another database to write out dirty
victim buffers during buffer replacement).

That's of course dangerous, because we can end up reading the wrong data (by
reading from the old file), or losing writes (by writing to the old file).


Now, this isn't a problem that's entirely new - but previously it required
relfilenodes to be recycled due to wraparound or such. The timeframes for that
are long enough that it's very likely that *something* triggers smgr
invalidation processing. E.g. the bgwriter has

        if (FirstCallSinceLastCheckpoint())
        {
            /*
             * After any checkpoint, close all smgr files.  This is so we
             * won't hang onto smgr references to deleted files indefinitely.
             */
            smgrcloseall();
        }

in its main loop. Previously, for the problem to occur there would have to be
"oid wraparound relfilenode recycling" within a single checkpoint window -
quite unlikely. But now a bgwriter cycle just needs to span a drop/create
database, which is much easier to hit.


I think the most realistic way to address this is to use the mechanism from
https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com

If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.

We perhaps could avoid all relfilenode recycling issues on the fd level by
adding a barrier whenever relfilenode allocation wraps around. But I'm not
sure how easy that is to detect.


I think we might actually be able to remove the checkpoint from dropdb() by
using barriers?


I think we need to add some debugging code to detect "fd to deleted file of
same name" style problems. Especially during regression tests they're quite
hard to notice, because most of the contents read/written are identical across
recycling.

On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
== 0.

b) might also work on a few other operating systems, but I have not verified
that. a) has the advantage of providing a filename, which makes it a lot
easier to understand the problem.  It'd probably make sense to use b) on all
the operating systems it works on, and then a) on linux for a more useful
error message.
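
For concreteness, here's a minimal standalone sketch of checks a) and b) (not
the actual patch; check_fd_not_deleted() and the messages are invented for
illustration):

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch only: report when an fd refers to a file that has been unlinked. */
static void
check_fd_not_deleted(int fd)
{
	struct stat st;

	/* b) portable-ish: an unlinked-but-still-open file has no links left */
	if (fstat(fd, &st) == 0 && st.st_nlink == 0)
	{
#if defined(__linux__)
		/* a) on Linux, recover the original path for a more useful message */
		char		linkpath[64];
		char		target[1024];
		ssize_t		len;

		snprintf(linkpath, sizeof(linkpath), "/proc/self/fd/%d", fd);
		len = readlink(linkpath, target, sizeof(target) - 1);
		if (len > 0)
		{
			target[len] = '\0';
			fprintf(stderr, "fd %d refers to unlinked file: %s\n", fd, target);
			return;
		}
#endif
		fprintf(stderr, "fd %d refers to an unlinked file (st_nlink == 0)\n", fd);
	}
}

int
main(void)
{
	FILE	   *f = tmpfile();	/* tmpfile()'s backing file is already unlinked */

	if (f == NULL)
		return 1;
	check_fd_not_deleted(fileno(f));
	fclose(f);
	return 0;
}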

The checks for this would need to be in FileRead/FileWrite/... and direct
calls to pread etc. I think that's an acceptable cost.   I have a patch for
this, but it's currently based on the AIO tree. If others agree it'd be useful
to have, I'll adapt it to work on master.

Comments?

Greetings,

Andres Freund



Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

From: Justin Pryzby
On Wed, Feb 09, 2022 at 02:00:04PM -0800, Andres Freund wrote:
> On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
> a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
> == 0.

You could also stat() the file in /proc/self/fd/N and compare st_ino.  It
"looks" like a symlink (in which case that wouldn't work) but it's actually a
Very Special File.  You can even recover deleted, still-open files that way.
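
A hedged sketch of that idea, assuming the caller knows which path the fd is
supposed to refer to (which, as noted downthread, isn't always convenient);
fd_matches_path() is an invented name, and fstat() on the fd is equivalent to
stat() through /proc/self/fd/N:

#include <stdbool.h>
#include <sys/stat.h>

/*
 * Sketch only: does fd still refer to the file currently named by path?
 * If the device/inode pair differs, the fd points at an old, unlinked
 * file that used to have that name.
 */
static bool
fd_matches_path(int fd, const char *path)
{
	struct stat fd_st;
	struct stat path_st;

	if (fstat(fd, &fd_st) != 0 || stat(path, &path_st) != 0)
		return false;			/* can't tell; treat as a mismatch */

	return fd_st.st_dev == path_st.st_dev && fd_st.st_ino == path_st.st_ino;
}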

PS. I didn't know pg_upgrade knew about Reply-To ;)



Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes

From: Andres Freund
Hi,

On 2022-02-09 16:42:30 -0600, Justin Pryzby wrote:
> On Wed, Feb 09, 2022 at 02:00:04PM -0800, Andres Freund wrote:
> > On linux we can do so by a) checking if readlink(/proc/self/fd/$fd) points to
> > a filename ending in " (deleted)", b) doing fstat(fd) and checking if st_nlink
> > == 0.
> 
> You could also stat() the file in proc/self/fd/N and compare st_ino.  It
> "looks" like a symlink (in which case that wouldn't work) but it's actually a
> Very Special File.  You can even recover deleted, still-opened files that way..

Yea, the readlink() thing above relies on /proc/self/fd/$fd being a
"Very Special File".

In most places we'd not have convenient access to an inode / filename to
compare it to. I don't think there's any additional information we could gain
anyway, compared to looking at st_nlink == 0 and then doing a readlink() to
get the filename?


> PS. I didn't know pg_upgrade knew about Reply-To ;)

Ugh, formatting fail...

Greetings,

Andres Freund



On Wed, Feb 9, 2022 at 5:00 PM Andres Freund <andres@anarazel.de> wrote:
> The problem starts with
>
> commit aa01051418f10afbdfa781b8dc109615ca785ff9
> Author: Robert Haas <rhaas@postgresql.org>
> Date:   2022-01-24 14:23:15 -0500
>
>     pg_upgrade: Preserve database OIDs.

Well, that's sad.

> I think the most realistic way to address this is to use the mechanism from
> https://postgr.es/m/CA%2BhUKGLdemy2gBm80kz20GTe6hNVwoErE8KwcJk6-U56oStjtg%40mail.gmail.com

I agree. While I feel sort of bad about missing this issue in review,
I also feel like it's pretty surprising that there isn't something
plugging this hole already. It feels unexpected that our FD management
layer might hand you an FD that references the previous file that had
the same name instead of the one that exists right now, and I don't
think it's too much of a stretch of the imagination to suppose that
this won't be the last problem we have if we don't fix it. I also
agree that the mechanism proposed in that thread is the best way to
fix the problem. The sinval mechanism won't work, since there's
nothing to ensure that the invalidation messages are processed in a
sufficiently timely fashion, and there's no obvious way to get around
that problem. But the ProcSignalBarrier mechanism is not intertwined
with heavyweight locking in the way that sinval is, so it should be a
good fit for this problem.

The main question in my mind is who is going to actually make that
happen. It was your idea (I think), Thomas coded it, and my commit
made it a live problem. So who's going to get something committed
here?

> If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
> XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.

What about movedb()?

> We perhaps could avoid all relfilenode recycling, issues on the fd level, by
> adding a barrier whenever relfilenode allocation wraps around. But I'm not
> sure how easy that is to detect.

I definitely think it would be nice to cover this issue not only for
database OIDs and tablespace OIDs but also for relfilenode OIDs.
Otherwise we're right on the cusp of having the same bug in that case.
pg_upgrade doesn't drop and recreate tables the way it does databases,
but if somebody made it do that in the future, would they think about
this? I bet not.

Dilip's been working on your idea of making relfilenode into a 56-bit
quantity. That would make the problem of detecting wraparound go away.
In the meantime, I suppose we could think of trying to do something
here:

                        /* wraparound, or first post-initdb assignment, in normal mode */
                        ShmemVariableCache->nextOid = FirstNormalObjectId;
                        ShmemVariableCache->oidCount = 0;

That's a bit sketch, because this is the OID counter, which isn't only
used for relfilenodes. I'm not sure whether there would be problems
waiting for a barrier here - I think we might be holding some lwlocks
in some code paths.

> I think we might actually be able to remove the checkpoint from dropdb() by
> using barriers?

That looks like it might work. We'd need to be sure that the
checkpointer would both close any open FDs and also absorb fsync
requests before acknowledging the barrier.

> I think we need to add some debugging code to detect "fd to deleted file of
> same name" style problems. Especially during regression tests they're quite
> hard to notice, because most of the contents read/written are identical across
> recycling.

Maybe. If we just plug the holes here, I am not sure we need permanent
instrumentation. But I'm also not violently opposed to it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

From: Thomas Munro
On Fri, Feb 11, 2022 at 7:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> The main question in my mind is who is going to actually make that
> happen. It was your idea (I think), Thomas coded it, and my commit
> made it a live problem. So who's going to get something committed
> here?

I was about to commit that, because the original Windows problem it
solved is showing up occasionally in CI failures (that is, it already
solves a live problem, albeit a different and non-data-corrupting
one):

https://www.postgresql.org/message-id/CA%2BhUKGJp-m8uAD_wS7%2BdkTgif013SNBSoJujWxvRUzZ1nkoUyA%40mail.gmail.com

It seems like I should go ahead and do that today, and we can study
further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?



On Thu, Feb 10, 2022 at 3:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Feb 11, 2022 at 7:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > The main question in my mind is who is going to actually make that
> > happen. It was your idea (I think), Thomas coded it, and my commit
> > made it a live problem. So who's going to get something committed
> > here?
>
> I was about to commit that, because the original Windows problem it
> solved is showing up occasionally in CI failures (that is, it already
> solves a live problem, albeit a different and non-data-corrupting
> one):
>
> https://www.postgresql.org/message-id/CA%2BhUKGJp-m8uAD_wS7%2BdkTgif013SNBSoJujWxvRUzZ1nkoUyA%40mail.gmail.com
>
> It seems like I should go ahead and do that today, and we can study
> further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?

A bird in the hand is worth two in the bush. Unless it's a vicious,
hand-biting bird.

In other words, if you think what you've got is better than what we
have now and won't break anything worse than it is today, +1 for
committing, and more improvements can come when we get enough round
tuits.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

From: Andres Freund
Hi,

On 2022-02-10 13:49:50 -0500, Robert Haas wrote:
> I agree. While I feel sort of bad about missing this issue in review,
> I also feel like it's pretty surprising that there isn't something
> plugging this hole already. It feels unexpected that our FD management
> layer might hand you an FD that references the previous file that had
> the same name instead of the one that exists right now, and I don't
> think it's too much of a stretch of the imagination to suppose that
> this won't be the last problem we have if we don't fix it.

Arguably it's not really the FD layer that's the issue - it's on the smgr
level. But that's semantics...


> The main question in my mind is who is going to actually make that
> happen. It was your idea (I think), Thomas coded it, and my commit
> made it a live problem. So who's going to get something committed
> here?

I think it'd make sense for Thomas to commit the current patch for the
tablespace issues first. And then we can go from there?


> > If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
> > XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.
> 
> What about movedb()?

My first reaction was that it should be fine, because RelFileNode.spcNode
would differ. But of course that doesn't protect against moving a database
back and forth. So yes, you're right, it's needed there too.


Are there other commands that could be problematic? ALTER TABLE SET TABLESPACE
should be fine, that allocates a new relfilenode.


> I definitely think it would be nice to cover this issue not only for
> database OIDs and tablespace OIDs but also for relfilenode OIDs.
> Otherwise we're right on the cusp of having the same bug in that case.
> pg_upgrade doesn't drop and recreate tables the way it does databases,
> but if somebody made it do that in the future, would they think about
> this? I bet not.

And even just plugging the wraparound aspect would be great. It's not actually
*that* hard to wrap oids around, because of toast.


> Dilip's been working on your idea of making relfilenode into a 56-bit
> quantity. That would make the problem of detecting wraparound go away.
> In the meantime, I suppose we could do think of trying to do something
> here:
> 
>                         /* wraparound, or first post-initdb assignment, in normal mode */
>                         ShmemVariableCache->nextOid = FirstNormalObjectId;
>                         ShmemVariableCache->oidCount = 0;
> 
> That's a bit sketch, because this is the OID counter, which isn't only
> used for relfilenodes.

I can see a few ways to deal with that:

1) Split nextOid into two, nextRelFileNode and nextOid. Have the wraparound
   behaviour only in the nextRelFileNode case.  That'd be a nice improvement,
   but not entirely trivial, because we currently use GetNewRelFileNode() for
   oid assignment as well (c.f. heap_create_with_catalog()).

2) Keep track of the last assigned relfilenode in ShmemVariableCache, and wait
   for a barrier in GetNewRelFileNode() if the assigned oid is wrapped
   around.  While there would be a chance of "missing" a ->nextOid wraparound,
   e.g. because of >2**32 toast oids being allocated, I think that could only
   happen if there were no "wrapped around relfilenodes" allocated, so that
   should be ok.

3) Any form of oid wraparound might cause problems, not just relfilenode
   issues. So having a place to emit barriers on oid wraparound might be a
   good idea.


One thing I wonder is how to make sure the path would be tested. Perhaps we
could emit the anti-wraparound barriers more frequently when assertions are
enabled?


> I'm not sure whether there would be problems waiting for a barrier here - I
> think we might be holding some lwlocks in some code paths.

It seems like a bad idea to call GetNewObjectId() with lwlocks held. From what
I can see the two callers of GetNewObjectId() both have loops that can go on
for a long time after a wraparound.



> > I think we need to add some debugging code to detect "fd to deleted file of
> > same name" style problems. Especially during regression tests they're quite
> > hard to notice, because most of the contents read/written are identical across
> > recycling.
> 
> Maybe. If we just plug the holes here, I am not sure we need permanent
> instrumentation. But I'm also not violently opposed to it.

The question is how we can find all the places where we need to add barrier
emit & wait calls. E.g., I didn't initially realize that there's wraparound
danger in WAL replay as well.

Greetings,

Andres Freund



Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

From: Andres Freund
Hi,

On 2022-02-11 09:10:38 +1300, Thomas Munro wrote:
> I was about to commit that, because the original Windows problem it
> solved is showing up occasionally in CI failures (that is, it already
> solves a live problem, albeit a different and non-data-corrupting
> one):

+1

> It seems like I should go ahead and do that today, and we can study
> further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?

Yes.

I wonder whether we really should make the barriers be conditional on
defined(WIN32) || defined(USE_ASSERT_CHECKING) as done in that patch, even
leaving wraparound dangers aside. On !windows we still have the issue of the
space for the dropped / moved files not being released if there are processes
holding them open. That can be a lot of space if there are long-lived connections
in a cluster that doesn't fit into s_b (because processes will have random fds
open for writing back dirty buffers). And we don't truncate the files before
unlinking when done as part of a DROP DATABASE...
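
To illustrate that point with a standalone toy (not PostgreSQL code; the path
and sizes are arbitrary): the blocks of an unlinked file stay allocated until
the last fd on it is closed.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
	const char *path = "/tmp/unlinked_space_demo";
	static char buf[1024 * 1024];
	int			fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
	struct stat st;

	if (fd < 0)
		return 1;
	memset(buf, 'x', sizeof(buf));
	for (int i = 0; i < 64; i++)	/* write 64MB */
		(void) write(fd, buf, sizeof(buf));
	fsync(fd);

	unlink(path);				/* the name is gone ... */
	fstat(fd, &st);
	printf("st_nlink=%ju st_blocks=%jd -> space still allocated\n",
		   (uintmax_t) st.st_nlink, (intmax_t) st.st_blocks);

	close(fd);					/* ... the space is only reclaimed here */
	return 0;
}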

But that's something we can fine-tune later as well...

Greetings,

Andres Freund



Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

From: Andres Freund
Hi,

On 2022-02-10 14:26:59 -0800, Andres Freund wrote:
> On 2022-02-11 09:10:38 +1300, Thomas Munro wrote:
> > It seems like I should go ahead and do that today, and we can study
> > further uses for PROCSIGNAL_BARRIER_SMGRRELEASE in follow-on work?
> 
> Yes.

I wrote a test to show the problem. While looking at the code I unfortunately
found that CREATE DATABASE ... OID isn't the only problem. Two consecutive
ALTER DATABASE ... SET TABLESPACE also can cause corruption.

The test doesn't work on < 15 (due to PostgresNode renaming and use of
allow_in_place_tablespaces). But manually it's unfortunately quite
reproducible :(

Start a server with

shared_buffers = 1MB
autovacuum=off
bgwriter_lru_maxpages=1
bgwriter_delay=10000

these are just to give more time / to require smaller tables.


Start two psql sessions

s1: \c postgres
s1: CREATE DATABASE conflict_db;
s1: CREATE TABLESPACE test_tablespace LOCATION '/tmp/blarg';
s1: CREATE EXTENSION pg_prewarm;

s2: \c conflict_db
s2: CREATE TABLE large(id serial primary key, dataa text, datab text);
s2: INSERT INTO large(dataa, datab) SELECT g.i::text, 0 FROM generate_series(1, 4000) g(i);
s2: UPDATE large SET datab = 1;

-- prevent AtEOXact_SMgr
s1: BEGIN;

-- Fill shared_buffers with other contents. This needs to write out the prior dirty contents
-- leading to opening smgr rels / file descriptors
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

s2: \c postgres
-- unlinks the files
s2: ALTER DATABASE conflict_db SET TABLESPACE test_tablespace;
-- now new files with the same relfilenode exist
s2: ALTER DATABASE conflict_db SET TABLESPACE pg_default;
s2: \c conflict_db
-- dirty buffers again
s2: UPDATE large SET datab = 2;

-- this again writes out the dirty buffers. But because nothing forced the smgr handles to be closed,
-- fds pointing to the *old* file contents are used. I.e. we'll write to the wrong file
s1: SELECT SUM(pg_prewarm(oid)) warmed_buffers FROM pg_class WHERE pg_relation_filenode(oid) != 0;

-- verify that everything is [not] ok
s2: SELECT datab, count(*) FROM large GROUP BY 1 ORDER BY 1 LIMIT 10;

┌───────┬───────┐
│ datab │ count │
├───────┼───────┤
│ 1     │  2117 │
│ 2     │    67 │
└───────┴───────┘
(2 rows)


oops.


I don't yet have a good idea how to tackle this in the backbranches.



I've started to work on a few debugging aids to find problems like
these. Attached are two WIP patches:

1) use fstat(...).st_nlink == 0 to detect dangerous actions on fds with
   deleted files. That seems to work (almost) everywhere. Optionally, on
   linux, use readlink(/proc/$pid/fd/$fd) to get the filename.
2) On linux, iterate over PGPROC to get a list of pids. Then iterate over
   /proc/$pid/fd/ to check for deleted files. This can only be done right after
   we've made sure there are no fds open to deleted files.
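
A rough standalone sketch of aid 2), just to make the mechanism concrete (not
the actual WIP patch; in the real thing the pids would come from iterating
over PGPROC, and report_deleted_fds() is an invented name):

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch only: report fds of a process that point at unlinked files. */
static void
report_deleted_fds(pid_t pid)
{
	char		dirpath[64];
	char		linkpath[128];
	char		target[1024];
	DIR		   *dir;
	struct dirent *de;

	snprintf(dirpath, sizeof(dirpath), "/proc/%d/fd", (int) pid);
	dir = opendir(dirpath);
	if (dir == NULL)
		return;

	while ((de = readdir(dir)) != NULL)
	{
		ssize_t		len;
		size_t		suffixlen = strlen(" (deleted)");

		if (de->d_name[0] == '.')
			continue;
		snprintf(linkpath, sizeof(linkpath), "%s/%s", dirpath, de->d_name);
		len = readlink(linkpath, target, sizeof(target) - 1);
		if (len <= 0)
			continue;
		target[len] = '\0';
		if (strlen(target) > suffixlen &&
			strcmp(target + strlen(target) - suffixlen, " (deleted)") == 0)
			fprintf(stderr, "pid %d: fd %s still open on %s\n",
					(int) pid, de->d_name, target);
	}
	closedir(dir);
}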

Greetings,

Andres Freund



Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

From: Andres Freund
Hi,

On 2022-02-22 01:11:21 -0800, Andres Freund wrote:
> I've started to work on a few debugging aids to find problem like
> these. Attached are two WIP patches:

Forgot to attach. Importantly, it also includes a TAP test for several of these
issues.

Greetings,

Andres Freund

Attachment
On Tue, Feb 22, 2022 at 4:40 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-02-22 01:11:21 -0800, Andres Freund wrote:
> > I've started to work on a few debugging aids to find problem like
> > these. Attached are two WIP patches:
>
> Forgot to attach. Also importantly includes a tap test for several of these
> issues

Hi,

Just a few very preliminary comments:

- I am having some trouble understanding clearly what 0001 is doing.
I'll try to study it further.

- 0002 seems like it's generally a good idea. I haven't yet dug into
why the call sites for AssertFileNotDeleted() are where they are, or
whether that's a complete set of places to check.

- In general, 0003 makes a lot of sense to me.

+               /*
+                * Finally tell the kernel to write the data to storage. Don't smgr if
+                * previously closed, otherwise we could end up evading fd-reuse
+                * protection.
+                */

- I think the above hunk is missing a word, because it uses smgr as a
verb. I also think that it's easy to imagine this explanation being
insufficient for some future hacker to understand the issue.

- While 0004 seems useful for testing, it's an awfully big hammer. I'm
not sure we should be thinking about committing something like that,
or at least not as a default part of the build. But ... maybe?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Hi,

On 2022-03-02 14:52:01 -0500, Robert Haas wrote:
> - I am having some trouble understanding clearly what 0001 is doing.
> I'll try to study it further.

It tests for the various scenarios I could think of that could lead to FD
reuse, to state the obvious ;). Anything particularly unclear?


> - In general, 0003 makes a lot of sense to me.
> 
> +               /*
> +                * Finally tell the kernel to write the data to storage. Don't smgr if
> +                * previously closed, otherwise we could end up evading fd-reuse
> +                * protection.
> +                */
> 
> - I think the above hunk is missing a word, because it uses smgr as a
> verb. I also think that it's easy to imagine this explanation being
> insufficient for some future hacker to understand the issue.

Yea, it's definitely not sufficient or grammatically correct. I think I wanted
to send something out, but was very tired by that point...


> - While 0004 seems useful for testing, it's an awfully big hammer. I'm
> not sure we should be thinking about committing something like that,
> or at least not as a default part of the build. But ... maybe?

Agreed. I think it's racy anyway, because further files could get deleted
(e.g. segments > 0) after the barrier has been processed.


What I am stuck on is what we can do for the released branches. Data
corruption after two consecutive ALTER DATABASE SET TABLESPACEs seems like
something we need to address.

Greetings,

Andres Freund



On Wed, Mar 2, 2022 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:
> What I am stuck on is what we can do for the released branches. Data
> corruption after two consecutive ALTER DATABASE SET TABLESPACEs seems like
> something we need to address.

I think we should consider back-porting the ProcSignalBarrier stuff
eventually. I realize that it's not really battle-tested yet and I am
not saying we should do it right now, but I think that if we get these
changes into v15, back-porting it in let's say May of next year could
be a reasonable thing to do. Sure, there is some risk there, but on
the other hand, coming up with completely different fixes for the
back-branches is not risk-free either, nor is it clear that there is
any alternative fix that is nearly as good. In the long run, I am
fairly convinced that ProcSignalBarrier is the way forward not only
for this purpose but for other things as well, and everybody's got to
get on the train or be left behind.

Also, I am aware of multiple instances where the project waited a
long, long time to fix bugs because we didn't have a back-patchable
fix. I disagree with that on principle. A master-only fix now is
better than a back-patchable fix two or three years from now. Of
course a back-patchable fix now is better still, but we have to pick
from the options we have, not the ones we'd like to have.

<digressing a bit>

It seems to me that if we were going to try to construct an
alternative fix for the back-branches, it would have to be something
that didn't involve a new invalidation mechanism -- because the
ProcSignalBarrier stuff is an invalidation mechanism in effect, and I
feel that it can't be better to invent two new invalidation mechanisms
rather than one. And the only idea I have is trying to detect a
dangerous sequence of operations and just outright block it. We have
some cases sort of like that already - e.g. you can't prepare a
transaction if it's done certain things. But, the existing precedents
that occur to me are, I think, all cases where all of the related
actions are being performed in the same backend. It doesn't sound
crazy to me to have some rule like "you can't ALTER TABLESPACE on the
same tablespace in the same backend twice in a row without an
intervening checkpoint", or whatever, and install the book-keeping to
enforce that. But I don't think anything like that can work, both
because the two ALTER TABLESPACE commands could be performed in
different sessions, and also because an intervening checkpoint is no
guarantee of safety anyway, IIUC. So I'm just not really seeing a
reasonable strategy that isn't basically the barrier stuff.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Wed, Mar 2, 2022 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-03-02 14:52:01 -0500, Robert Haas wrote:
> > - I am having some trouble understanding clearly what 0001 is doing.
> > I'll try to study it further.
>
> It tests for the various scenarios I could think of that could lead to FD
> reuse, to state the obvious ;). Anything particularly unclear.

As I understand it, the basic idea of this test is:

1. create a database called 'conflict_db' containing a table called 'large'
2. also create a table called 'replace_sb' in the postgres db.
3. have a long running transaction open in 'postgres' database that
repeatedly accesses the 'replace_sb' table to evict the 'conflict_db'
table, sometimes causing write-outs
4. try to get it to write out the data to a stale file descriptor by
fiddling with various things, and then detect the corruption that
results
5. use both a primary and a standby not because they are intrinsically
necessary but because you want to test that both cases work

As to (1) and (2), I note that 'large' is not a hugely informative
table name, especially when it's the smaller of the two test tables.
And also, the names are all over the place. The table names don't have
much to do with each other (large, replace_sb) and they're completely
unrelated to the database names (postgres, conflict_db). And then you
have Perl functions whose names also don't make it obvious what we're
talking about. I can't remember that verify() is the one that accesses
conflict.db large while cause_eviction() is the one that accesses
postgres.replace_sb for more than like 15 seconds. It'd make it a lot
easier if the table names, database names, and Perl function names had
some common elements.

As to (5), the identifiers are just primary and standby, without
mentioning the database name, which adds to the confusion, and there
are no comments explaining why we have two nodes.

Also, some of the comments seem like they might be in the wrong place:

+# Create template database with a table that we'll update, to trigger dirty
+# rows. Using a template database + preexisting rows makes it a bit easier to
+# reproduce, because there's no cache invalidations generated.

Right after this comment, we create a table and then a template
database just after, so I think we're not avoiding any invalidations
at this point. We're avoiding them at later points where the comments
aren't talking about this.

I think it would be helpful to find a way to divide the test case up
into sections that are separated from each other visually, and explain
the purpose of each test a little more in a comment. For example I'm
struggling to understand the exact purpose of this bit:

+$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
+
+verify($node_primary, $node_standby, 3,
+          "restored contents as expected");

I'm all for having test coverage of VACUUM FULL, but it isn't clear to
me what this does that is related to FD reuse.

Similarly for the later ones. I generally grasp that you are trying to
make sure that things are OK with respect to FD reuse in various
scenarios, but I couldn't explain to you what we'd be losing in terms
of test coverage if we removed this line:

+$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");

I am not sure how much all of this potential cleanup matters because
I'm pretty skeptical that this would be stable in the buildfarm. It
relies on a lot of assumptions about timing and row sizes and details
of when invalidations are sent that feel like they might not be
entirely predictable at the level you would need to have this
consistently pass everywhere. Perhaps I am too pessimistic?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Hi,

On 2022-03-03 13:11:17 -0500, Robert Haas wrote:
> On Wed, Mar 2, 2022 at 3:00 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2022-03-02 14:52:01 -0500, Robert Haas wrote:
> > > - I am having some trouble understanding clearly what 0001 is doing.
> > > I'll try to study it further.
> >
> > It tests for the various scenarios I could think of that could lead to FD
> > reuse, to state the obvious ;). Anything particularly unclear.
> 
> As I understand it, the basic idea of this test is:
> 
> 1. create a database called 'conflict_db' containing a table called 'large'
> 2. also create a table called 'replace_sb' in the postgres db.
> 3. have a long running transaction open in 'postgres' database that
> repeatedly accesses the 'replace_sb' table to evict the 'conflict_db'
> table, sometimes causing write-outs
> 4. try to get it to write out the data to a stale file descriptor by
> fiddling with various things, and then detect the corruption that
> results
> 5. use both a primary and a standby not because they are intrinsically
> necessary but because you want to test that both cases work

Right.


I wasn't proposing to commit the test as-is, to be clear. It was meant as a
demonstration of the types of problems I can see, and what's needed to fix
them...


> I can't remember that verify() is the one that accesses conflict.db large
> while cause_eviction() is the one that accesses postgres.replace_sb for more
> than like 15 seconds.

For more than 15 seconds? The whole test runs in a few seconds for me...


> As to (5), the identifiers are just primary and standby, without
> mentioning the database name, which adds to the confusion, and there
> are no comments explaining why we have two nodes.

I don't follow this one - physical rep doesn't do anything below the database
level?



> I think it would be helpful to find a way to divide the test case up
> into sections that are separated from each other visually, and explain
> the purpose of each test a little more in a comment. For example I'm
> struggling to understand the exact purpose of this bit:
> 
> +$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
> +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
> +
> +verify($node_primary, $node_standby, 3,
> +          "restored contents as expected");
> 
> I'm all for having test coverage of VACUUM FULL, but it isn't clear to
> me what this does that is related to FD reuse.

It's just trying to clean up prior damage, so the test continues to later
ones. Definitely not needed once there's a fix. But it's useful while trying
to make the test actually detect various corruptions.


> Similarly for the later ones. I generally grasp that you are trying to
> make sure that things are OK with respect to FD reuse in various
> scenarios, but I couldn't explain to you what we'd be losing in terms
> of test coverage if we removed this line:
> 
> +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");
> 
> I am not sure how much all of this potential cleanup matters because
> I'm pretty skeptical that this would be stable in the buildfarm.

Which "potential cleanup" are you referring to?


> It relies on a lot of assumptions about timing and row sizes and details of
> when invalidations are sent that feel like they might not be entirely
> predictable at the level you would need to have this consistently pass
> everywhere. Perhaps I am too pessimistic?

I don't think the test passing should be dependent on row size, invalidations
etc to a significant degree (unless the tables are too small to reach s_b
etc)? The exact symptoms of failures are highly unstable, but obviously we'd
fix them in-tree before committing a test. But maybe I'm missing a dependency?

FWIW, the test did pass on freebsd, linux, macos and windows with the
fix. After a few iterations of improving the fix ;)

Greetings,

Andres Freund



On Thu, Mar 3, 2022 at 1:28 PM Andres Freund <andres@anarazel.de> wrote:
> > I can't remember that verify() is the one that accesses conflict.db large
> > while cause_eviction() is the one that accesses postgres.replace_sb for more
> > than like 15 seconds.
>
> For more than 15 seconds? The whole test runs in a few seconds for me...

I'm not talking about running the test. I'm talking about reading it
and trying to keep straight what is happening. The details of which
function is accessing which database keep going out of my head.
Quickly. Maybe because I'm dumb, but I think some better naming could
help, just in case other people are dumb, too.

> > As to (5), the identifiers are just primary and standby, without
> > mentioning the database name, which adds to the confusion, and there
> > are no comments explaining why we have two nodes.
>
> I don't follow this one - physical rep doesn't do anything below the database
> level?

Right but ... those handles are connected to a particular DB on that
node, not just the node in general.

> > I think it would be helpful to find a way to divide the test case up
> > into sections that are separated from each other visually, and explain
> > the purpose of each test a little more in a comment. For example I'm
> > struggling to understand the exact purpose of this bit:
> >
> > +$node_primary->safe_psql('conflict_db', "VACUUM FULL large;");
> > +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 3;");
> > +
> > +verify($node_primary, $node_standby, 3,
> > +          "restored contents as expected");
> >
> > I'm all for having test coverage of VACUUM FULL, but it isn't clear to
> > me what this does that is related to FD reuse.
>
> It's just trying to clean up prior damage, so the test continues to later
> ones. Definitely not needed once there's a fix. But it's useful while trying
> to make the test actually detect various corruptions.

Ah, OK, I definitely did not understand that before.

> > Similarly for the later ones. I generally grasp that you are trying to
> > make sure that things are OK with respect to FD reuse in various
> > scenarios, but I couldn't explain to you what we'd be losing in terms
> > of test coverage if we removed this line:
> >
> > +$node_primary->safe_psql('conflict_db', "UPDATE large SET datab = 7;");
> >
> > I am not sure how much all of this potential cleanup matters because
> > I'm pretty skeptical that this would be stable in the buildfarm.
>
> Which "potential cleanup" are you referring to?

I mean, whatever improvements you might consider making based on my comments.

> > It relies on a lot of assumptions about timing and row sizes and details of
> > when invalidations are sent that feel like they might not be entirely
> > predictable at the level you would need to have this consistently pass
> > everywhere. Perhaps I am too pessimistic?
>
> I don't think the test passing should be dependent on row size, invalidations
> etc to a significant degree (unless the tables are too small to reach s_b
> etc)? The exact symptoms of failures are highly unstable, but obviously we'd
> fix them in-tree before committing a test. But maybe I'm missing a dependency?
>
> FWIW, the test did pass on freebsd, linux, macos and windows with the
> fix. After a few iterations of improving the fix ;)

Hmm. Well, I admit that's already more than I would have expected....

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Some thoughts:

The v1-0003 patch introduced smgropen_cond() to avoid the problem of
IssuePendingWritebacks(), which does desynchronised smgropen() calls
and could open files after the barrier but just before they are
unlinked.  Makes sense, but...

1.  For that to actually work, we'd better call smgrcloseall() when
PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls
closeAllVds().  Here's a patch for that.  But...

2.  The reason I used closeAllVds() in 4eb21763 is because I didn't
want random CHECK_FOR_INTERRUPTS() calls to break code like this, from
RelationCopyStorage():

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        /* If we got a cancel signal during the copy of the data, quit */
        CHECK_FOR_INTERRUPTS();

        smgrread(src, forkNum, blkno, buf.data);

Perhaps we could change RelationCopyStorage() to take Relation, and
use CreateFakeRelCacheEntry() in various places to satisfy that, and
also extend that mechanism to work with temporary tables.  Here's an
initial sketch of that idea, to see what problems it crashes into...
Fake Relations are rather unpleasant though; I wonder if there is an
SMgrRelationHandle concept that could make this all nicer.  There is
possibly also some cleanup missing to avoid an SMgrRelation that
points to a defunct fake Relation on non-local exit (?).

Attachment
On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> The v1-0003 patch introduced smgropen_cond() to avoid the problem of
> IssuePendingWritebacks(), which does desynchronised smgropen() calls
> and could open files after the barrier but just before they are
> unlinked.  Makes sense, but...
>
> 1.  For that to actually work, we'd better call smgrcloseall() when
> PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls
> closeAllVds().  Here's a patch for that.  But...

What if we instead changed our approach to fixing
IssuePendingWritebacks()?  Design sketch:

1. We add a new flag, bool smgr_zonk, to SmgrRelationData. Initially it's false.
2. Receipt of PROCSIGNAL_BARRIER_SMGRRELEASE sets smgr_zonk to true.
3. If you want, you can set it back to false any time you access the
smgr while holding a relation lock.
4. IssuePendingWritebacks() uses smgropen_cond(), as today, but that
function now refuses either to create a new smgr or to return one
where smgr_zonk is true.

I think this would be sufficient to guarantee that we never try to
write back to a relfilenode if we haven't had a lock on it since the
last PROCSIGNAL_BARRIER_SMGRRELEASE, which I think is the invariant we
need here?

My thinking here is that it's a lot easier to get away with marking
the content of a data structure invalid than it is to get away with
invalidating pointers to that data structure. If we can tinker with
the design so that the SMgrRelationData doesn't actually go away but
just gets a little bit more thoroughly invalidated, we could avoid
having to audit the entire code base for code that keeps smgr handles
around longer than would be possible with the design you demonstrate
here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Sat, Apr 2, 2022 at 2:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 1, 2022 at 2:04 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > The v1-0003 patch introduced smgropen_cond() to avoid the problem of
> > IssuePendingWritebacks(), which does desynchronised smgropen() calls
> > and could open files after the barrier but just before they are
> > unlinked.  Makes sense, but...
> >
> > 1.  For that to actually work, we'd better call smgrcloseall() when
> > PROCSIGNAL_BARRIER_SMGRRELEASE is handled; currently it only calls
> > closeAllVds().  Here's a patch for that.  But...
>
> What if we instead changed our approach to fixing
> IssuePendingWritebacks()?  Design sketch:
>
> 1. We add a new flag, bool smgr_zonk, to SmgrRelationData. Initially it's false.
> 2. Receipt of PROCSIGNAL_BARRIER_SMGRRELEASE sets smgr_zonk to true.
> 3. If you want, you can set it back to false any time you access the
> smgr while holding a relation lock.
> 4. IssuePendingWritebacks() uses smgropen_cond(), as today, but that
> function now refuses either to create a new smgr or to return one
> where smgr_zonk is true.
>
> I think this would be sufficient to guarantee that we never try to
> write back to a relfilenode if we haven't had a lock on it since the
> last PROCSIGNAL_BARRIER_SMGRRELEASE, which I think is the invariant we
> need here?
>
> My thinking here is that it's a lot easier to get away with marking
> the content of a data structure invalid than it is to get away with
> invalidating pointers to that data structure. If we can tinker with
> the design so that the SMgrRelationData doesn't actually go away but
> just gets a little bit more thoroughly invalidated, we could avoid
> having to audit the entire code base for code that keeps smgr handles
> around longer than would be possible with the design you demonstrate
> here.

Another idea would be to call a new function DropPendingWritebacks(),
and also tell all the SMgrRelation objects to close all their internal
state (ie the fds + per-segment objects) but not free the main
SMgrRelationData object, and for good measure also invalidate the
small amount of cached state (which hasn't been mentioned in this
thread, but it kinda bothers me that that state currently survives, so
it was one unspoken reason I liked the smgrcloseall() idea).  Since
DropPendingWritebacks() is in a rather different module, perhaps if we
were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
something else, because it's not really a SMGR-only thing anymore.



On Sat, Apr 2, 2022 at 10:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Another idea would be to call a new function DropPendingWritebacks(),
> and also tell all the SMgrRelation objects to close all their internal
> state (ie the fds + per-segment objects) but not free the main
> SMgrRelationData object, and for good measure also invalidate the
> small amount of cached state (which hasn't been mentioned in this
> thread, but it kinda bothers me that that state currently survives, so
> it was one unspoken reason I liked the smgrcloseall() idea).  Since
> DropPendingWritebacks() is in a rather different module, perhaps if we
> were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
> something else, because it's not really a SMGR-only thing anymore.

Here's a sketch of that idea.

Attachment
On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Another idea would be to call a new function DropPendingWritebacks(),
> and also tell all the SMgrRelation objects to close all their internal
> state (ie the fds + per-segment objects) but not free the main
> SMgrRelationData object, and for good measure also invalidate the
> small amount of cached state (which hasn't been mentioned in this
> thread, but it kinda bothers me that that state currently survives, so
> it was one unspoken reason I liked the smgrcloseall() idea).  Since
> DropPendingWritebacks() is in a rather different module, perhaps if we
> were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
> something else, because it's not really a SMGR-only thing anymore.

I'm not sure that it really matters, but with the idea that I proposed
it's possible to "save" a pending writeback, if we notice that we're
accessing the relation with a proper lock after the barrier arrives
and before we actually try to do the writeback. With this approach we
throw them out immediately, so they're just gone. I don't mind that
because I don't think this will happen often enough to matter, and I
don't think it will be very expensive when it does, but it's something
to think about.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Tue, Apr 5, 2022 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 1, 2022 at 5:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Another idea would be to call a new function DropPendingWritebacks(),
> > and also tell all the SMgrRelation objects to close all their internal
> > state (ie the fds + per-segment objects) but not free the main
> > SMgrRelationData object, and for good measure also invalidate the
> > small amount of cached state (which hasn't been mentioned in this
> > thread, but it kinda bothers me that that state currently survives, so
> > it was one unspoken reason I liked the smgrcloseall() idea).  Since
> > DropPendingWritebacks() is in a rather different module, perhaps if we
> > were to do that we'd want to rename PROCSIGNAL_BARRIER_SMGRRELEASE to
> > something else, because it's not really a SMGR-only thing anymore.
>
> I'm not sure that it really matters, but with the idea that I proposed
> it's possible to "save" a pending writeback, if we notice that we're
> accessing the relation with a proper lock after the barrier arrives
> and before we actually try to do the writeback. With this approach we
> throw them out immediately, so they're just gone. I don't mind that
> because I don't think this will happen often enough to matter, and I
> don't think it will be very expensive when it does, but it's something
> to think about.

The checkpointer never takes heavyweight locks, so the opportunity
you're describing can't arise.

This stuff happens entirely within the scope of a call to
BufferSync(), which is called by CheckPointBuffers().  BufferSync()
does WritebackContextInit() to set up the object that accumulates the
writeback requests, and then runs for a while performing a checkpoint
(usually spread out over time), and then at the end does
IssuePendingWritebacks().  A CFI() can be processed any time up until
the IssuePendingWritebacks(), but not during IssuePendingWritebacks().
That's the size of the gap we need to cover.

I still like the patch I posted most recently.  Note that it's not
quite as I described above: there is no DropPendingWritebacks(),
because that turned out to be impossible to implement in a way that
the barrier handler could call -- it needs access to the writeback
context.  Instead I had to expose a counter so that
IssuePendingWritebacks() could detect whether any smgrreleaseall()
calls had happened while the writeback context was being filled up
with writeback requests, by comparing with the level recorded therein.
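
To make that counter comparison concrete, here's a toy standalone model (all
names here are invented; the real logic lives in the WritebackContext /
smgrreleaseall() code):

#include <stdio.h>

static unsigned release_counter;	/* bumped by the "smgrreleaseall" model */

typedef struct ToyWritebackContext
{
	unsigned	counter_at_init;	/* release_counter when requests started */
	int			nrequests;
} ToyWritebackContext;

static void
toy_writeback_init(ToyWritebackContext *ctx)
{
	ctx->counter_at_init = release_counter;
	ctx->nrequests = 0;
}

static void
toy_smgrreleaseall(void)
{
	release_counter++;			/* files were closed; old requests are stale */
}

static void
toy_issue_pending_writebacks(ToyWritebackContext *ctx)
{
	if (ctx->counter_at_init != release_counter)
		printf("dropping %d stale writeback requests\n", ctx->nrequests);
	else
		printf("issuing %d writeback requests\n", ctx->nrequests);
	ctx->nrequests = 0;
}

int
main(void)
{
	ToyWritebackContext ctx;

	toy_writeback_init(&ctx);
	ctx.nrequests = 3;			/* accumulated during the checkpoint */
	toy_smgrreleaseall();		/* a barrier arrives mid-checkpoint */
	toy_issue_pending_writebacks(&ctx);	/* -> the requests are discarded */
	return 0;
}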



On Tue, Apr 5, 2022 at 10:24 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Apr 5, 2022 at 2:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > I'm not sure that it really matters, but with the idea that I proposed
> > it's possible to "save" a pending writeback, if we notice that we're
> > accessing the relation with a proper lock after the barrier arrives
> > and before we actually try to do the writeback. With this approach we
> > throw them out immediately, so they're just gone. I don't mind that
> > because I don't think this will happen often enough to matter, and I
> > don't think it will be very expensive when it does, but it's something
> > to think about.
>
> The checkpointer never takes heavyweight locks, so the opportunity
> you're describing can't arise.

<thinks harder>  Hmm, oh, you probably meant the buffer interlocking
in SyncOneBuffer().  It's true that my most recent patch throws away
more requests than it could, by doing the level check at the end of
the loop over all buffers instead of adding some kind of
DropPendingWritebacks() in the barrier handler.  I guess I could find
a way to improve that, basically checking the level more often instead
of at the end, but I don't know if it's worth it; we're still throwing
out an arbitrary percentage of writeback requests.



On Mon, Apr 4, 2022 at 10:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > The checkpointer never takes heavyweight locks, so the opportunity
> > you're describing can't arise.
>
> <thinks harder>  Hmm, oh, you probably meant the buffer interlocking
> in SyncOneBuffer().  It's true that my most recent patch throws away
> more requests than it could, by doing the level check at the end of
> the loop over all buffers instead of adding some kind of
> DropPendingWritebacks() in the barrier handler.  I guess I could find
> a way to improve that, basically checking the level more often instead
> of at the end, but I don't know if it's worth it; we're still throwing
> out an arbitrary percentage of writeback requests.

Doesn't every backend have its own set of pending writebacks?
BufferAlloc() calls
ScheduleBufferTagForWriteback(&BackendWritebackContext, ...)?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Wed, Apr 6, 2022 at 5:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 4, 2022 at 10:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > The checkpointer never takes heavyweight locks, so the opportunity
> > > you're describing can't arise.
> >
> > <thinks harder>  Hmm, oh, you probably meant the buffer interlocking
> > in SyncOneBuffer().  It's true that my most recent patch throws away
> > more requests than it could, by doing the level check at the end of
> > the loop over all buffers instead of adding some kind of
> > DropPendingWritebacks() in the barrier handler.  I guess I could find
> > a way to improve that, basically checking the level more often instead
> > of at the end, but I don't know if it's worth it; we're still throwing
> > out an arbitrary percentage of writeback requests.
>
> Doesn't every backend have its own set of pending writebacks?
> BufferAlloc() calls
> ScheduleBufferTagForWriteback(&BackendWritebackContext, ...)?

Yeah.  Alright, for my exaggeration above, s/can't arise/probably
won't often arise/, since when regular backends do this it's for
random dirty buffers, not necessarily stuff they hold a relation lock
on.

I think you are right that if we find ourselves calling smgrwrite() on
another buffer for that relfilenode a bit later, then it's safe to
"unzonk" the relation.  In the worst case, IssuePendingWritebacks()
might finish up calling sync_file_range() on a file that we originally
wrote to for generation 1 of that relfilenode, even though we now have
a file descriptor for generation 2 of that relfilenode that was
reopened by a later smgrwrite().  That would be acceptable, because a
bogus sync_file_range() call would be harmless.  The key point here is
that we absolutely must avoid re-opening files without careful
interlocking, because that could lead to later *writes* for generation
2 going to the defunct generation 1 file that we opened just as it was
being unlinked.

(For completeness:  we know of another way for smgrwrite() to write to
the wrong generation of a recycled relfilenode, but that's hopefully
very unlikely and will hopefully be addressed by adding more bits and
killing off OID-wraparound[1].  In this thread we're concerned only
with these weird "explicit OID reycling" cases we're trying to fix
with PROCSIGNAL_BARRIER_SMGRRELEASE sledgehammer-based cache
invalidation.)

My new attempt, attached, is closer to what Andres proposed, except at
the level of md.c segment objects instead of the level of SMgrRelation
objects.  This avoids the dangling SMgrRelation pointer problem
discussed earlier, and is also a little like your "zonk" idea, except
instead of a zonk flag, the SMgrRelation object is reset to a state
where it knows nothing at all except its relfilenode.  Any access
through smgrXXX() functions *except* smgrwriteback() will rebuild the
state, just as you would clear the hypothetical zonk flag.

So, to summarise the new patch that I'm attaching to this email as 0001:

1.  When PROCSIGNAL_BARRIER_SMGRRELEASE is handled, we call
smgrreleaseall(), which calls smgrrelease() on all SMgrRelation
objects, telling them to close all their files.

2.  All SMgrRelation objects are still valid, and any pointers to them
that were acquired before a CFI() and then used in an smgrXXX() call
afterwards will still work (the example I know of being
RelationCopyStorage()[2]).

3.  mdwriteback() internally uses a new "behavior" flag
EXTENSION_DONT_OPEN when getting its hands on the internal segment
object, which says that we should just give up immediately if we don't
already have the file open (concretely: if
PROCSIGNAL_BARRIER_SMGRRELEASE came along between our recent
smgrwrite() call and the writeback machinery's smgrwriteback() call,
it'll do nothing at all).

Someone might say that it's weird that smgrwriteback() has that
behaviour internally, and we might eventually want to make it more
explicit by adding a "behavior" argument to the function itself, so
that it's the caller that controls it.  It didn't seem worth it for
now though; the whole thing is a hint to the operating system anyway.

However it seems that I have something wrong, because CI is failing on
Windows; I ran out of time for looking into that today, but wanted to
post what I have so far since I know we have an open item or two to
close here ASAP...

Patches 0002-0004 are Andres's, with minor tweaks:

v3-0002-Fix-old-fd-issues-using-global-barriers-everywher.patch:

It seems useless to even keep the macro USE_BARRIER_SMGRRELEASE if
we're going to define it always, so I removed it.  I guess it's useful
to be able to disable that logic easily to see that the assertion in
the other patch fails, but you can do that by commenting out a line in
ProcessBarrierSmgrRelease().

I'm still slightly confused about whether we need an *extra* global
barrier in DROP TABLESPACE, not just if destroy_tablespace_directory()
failed.

v3-0003-WIP-test-for-file-reuse-dangers-around-database-a.patch

Was missing:

-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/pg_prewarm

v3-0004-WIP-AssertFileNotDeleted-fd.patch

+        * XXX: Figure out which operating systems this works on.

Do we want this to be wrapped in some kind of macros that vanishes
into thin air unless you enable it with USE_UNLINKED_FILE_CHECKING or
something?  Then we wouldn't care so much if it doesn't work on some
unusual system.  Bikeshed mode: I would prefer "not unlinked", since
that has a more specific meaning than "not deleted".
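
Something along these lines, say (a sketch only; USE_UNLINKED_FILE_CHECKING
and AssertFileNotUnlinked are hypothetical names):

/* Sketch only: both the symbol and the macro name are hypothetical. */
#include <assert.h>
#include <sys/stat.h>

#ifdef USE_UNLINKED_FILE_CHECKING
#define AssertFileNotUnlinked(fd) \
	do { \
		struct stat assert_st_; \
		assert(fstat((fd), &assert_st_) == 0 && assert_st_.st_nlink > 0); \
	} while (0)
#else
#define AssertFileNotUnlinked(fd) ((void) 0)
#endif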

[1]
https://www.postgresql.org/message-id/flat/CA%2BTgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg%40mail.gmail.com#1070c79256f2330ec52f063cdbe2add0
[2]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BBfaZ2puQDYV6h2oSWO2QW21_JOXZpha65gWRcmGNCZA%40mail.gmail.com#5deeb3d8adae4daf0da1f09e509eef56

Attachment
On Fri, Apr 22, 2022 at 3:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> So, to summarise the new patch that I'm attaching to this email as 0001:

This all makes sense to me, and I didn't see anything obviously wrong
looking through the patch, either.

> However it seems that I have something wrong, because CI is failing on
> Windows; I ran out of time for looking into that today, but wanted to
> post what I have so far since I know we have an open item or two to
> close here ASAP...

:-(

> Patches 0002-0004 are Andres's, with minor tweaks:
>
> v3-0002-Fix-old-fd-issues-using-global-barriers-everywher.patch:
>
> I'm still slightly confused about whether we need an *extra* global
> barrier in DROP TABLESPACE, not just if destroy_tablespace_directory()
> failed.

Andres wrote this code, but it looks correct to me. Currently, the
reason why we use USE_BARRIER_SMGRRELEASE is that we want to make sure
that we don't fail to remove a tablespace because some other backend
might have files open. However, it might also be that no other backend
has files open, and in that case we don't need to do anything, so the
current placement of the call is correct. With this patch, though, we
want to make sure that no FD that is open before we start dropping
files remains open after we drop files - and that means we need to
force files to be closed before we even try to remove files the first
time. It seems to me that Andres's patch (your 0002) doesn't add a
second call - it moves the existing one earlier. And that seems right:
no new files should be getting opened once we force them closed the
first time, I hope anyway.
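
In other words, as I read the patch, DropTableSpace() ends up shaped
roughly like this (a schematic, not the literal patch text; variable
names and error details are approximate):

    /*
     * Force every backend to close its smgr files now, so that no fd that
     * was open before we started dropping can survive into (or past) the
     * unlink phase.  This is the old Windows-only barrier, moved out of the
     * failure branch below and made unconditional.
     */
    WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

    if (!destroy_tablespace_directories(tablespaceoid, false))
    {
        /*
         * Not all files deleted?  Some may be scheduled for deletion at the
         * next checkpoint, so force one and then retry.
         */
        RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

        if (!destroy_tablespace_directories(tablespaceoid, false))
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_IN_USE),
                     errmsg("tablespace \"%s\" is not empty", tablespacename)));
    }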

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Wed, May 4, 2022 at 6:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 22, 2022 at 3:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > So, to summarise the new patch that I'm attaching to this email as 0001:
>
> This all makes sense to me, and I didn't see anything obviously wrong
> looking through the patch, either.

Thanks.

> > However it seems that I have something wrong, because CI is failing on
> > Windows; I ran out of time for looking into that today, but wanted to
> > post what I have so far since I know we have an open item or two to
> > close here ASAP...
>
> :-(

It passes sometimes and fails sometimes.  Here's the weird failure I
need to debug:


https://api.cirrus-ci.com/v1/artifact/task/6033765456674816/log/src/test/recovery/tmp_check/log/regress_log_032_relfilenode_reuse

Right at the end, it says:

Warning: unable to close filehandle GEN26 properly: Bad file
descriptor during global destruction.
Warning: unable to close filehandle GEN21 properly: Bad file
descriptor during global destruction.
Warning: unable to close filehandle GEN6 properly: Bad file descriptor
during global destruction.

I don't know what it means (Windows' fd->handle mapping table got
corrupted?) or even which program is printing it (you'd think maybe
perl?  but how could that be affected by anything I did in
postgres.exe, but if it's not perl why is it always at the end like
that?).  Hrmph.



On Wed, May 4, 2022 at 7:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> It passes sometimes and fails sometimes.  Here's the weird failure I
> need to debug:
>
>
https://api.cirrus-ci.com/v1/artifact/task/6033765456674816/log/src/test/recovery/tmp_check/log/regress_log_032_relfilenode_reuse
>
> Right at the end, it says:
>
> Warning: unable to close filehandle GEN26 properly: Bad file
> descriptor during global destruction.
> Warning: unable to close filehandle GEN21 properly: Bad file
> descriptor during global destruction.
> Warning: unable to close filehandle GEN6 properly: Bad file descriptor
> during global destruction.
>
> I don't know what it means (Windows' fd->handle mapping table got
> corrupted?) or even which program is printing it (you'd think maybe
> perl?  but how could that be affected by anything I did in
> postgres.exe, but if it's not perl why is it always at the end like
> that?).  Hrmph.

Got some off-list clues: that's just distracting Perl cleanup noise
after something else went wrong (thanks Robert), and now I'm testing a
theory from Andres that we're missing a barrier on the redo side when
replaying XLOG_DBASE_CREATE_FILE_COPY.  More soon.



On Wed, May 4, 2022 at 8:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Got some off-list clues: that's just distracting Perl cleanup noise
> after something else went wrong (thanks Robert), and now I'm testing a
> theory from Andres that we're missing a barrier on the redo side when
> replaying XLOG_DBASE_CREATE_FILE_COPY.  More soon.

Yeah, looks like that was the explanation.  Presumably in older
releases, recovery can fail with EACCES here, and since commit
e2f0f8ed we get ENOENT, because someone's got an unlinked file open,
and ReadDir() can still see it.  (I've wondered before if ReadDir()
should also hide zombie Windows directory entries, but that's kinda
independent and would only get us one step further; a later rmdir()
would still fail.)  Adding the barrier fixes the problem.  Assuming no
objections or CI failures show up, I'll consider pushing the first two
patches tomorrow.
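
For the archives, the redo-side change amounts to emitting the barrier
in the XLOG_DBASE_CREATE_FILE_COPY branch of dbase_redo() before the
copy runs, roughly like this (abbreviated sketch; the surrounding
variables are the ones already in that function):

    /* ... after removing any leftover target directory ... */

    /*
     * Force dirty buffers out to disk, to ensure the source database is
     * up-to-date for the copy.
     */
    FlushDatabaseBuffers(xlrec->src_db_id);

    /*
     * Make every backend close its smgr fds.  Otherwise a backend can still
     * hold an fd for a file of the dropped database that reused this OID,
     * and on Windows that unlinked-but-open file lingers in the directory
     * listing, so the copy (or a later rmdir()) fails.
     */
    WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

    copydir(src_path, dst_path, false);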

Attachment
On Wed, May 4, 2022 at 2:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Assuming no
> objections or CI failures show up, I'll consider pushing the first two
> patches tomorrow.

Done.  Time to watch the build farm.

It's possible that these changes will produce some blowback, now that
we're using PROCSIGNAL_BARRIER_SMGRRELEASE on common platforms.
Obviously the earlier work started down this path already, but that
was Windows-only, so it probably didn't get much attention from our
mostly Unix crowd.

For example, if there are bugs in the procsignal barrier system, or if
there are places that don't process interrupts at all or promptly, we
might hear complaints about that.  That bug surface includes, for
example, background workers created by extensions.  An example on my
mind is a place where we hold interrupts while cleaning up temporary
files (= a loop of arbitrary size with filesystem ops that might hang
on slow storage; see the sketch below), so a concurrent DROP TABLESPACE will have to wait,
which is kinda strange because it would in fact be perfectly safe to
process this particular "interrupt".  In that case we really just
don't want the kinds of interrupts that perform non-local exits and
prevent our cleanup work from running to completion, but we don't have
a way to say that.  I think we'll probably also want to invent a way
to report which backend is holding up progress, since without that
it's practically impossible for an end user to understand why their
command is hanging.
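
To sketch the temporary-file example (purely illustrative; the
variables here are invented):

    HOLD_INTERRUPTS();

    /*
     * A loop of arbitrary size doing filesystem work.  While interrupts
     * are held, a pending PROCSIGNAL_BARRIER_SMGRRELEASE can't be absorbed,
     * so a concurrent DROP TABLESPACE sits in WaitForProcSignalBarrier()
     * until we're done, even though closing our smgr fds here would have
     * been perfectly safe.
     */
    for (int i = 0; i < ntempfiles; i++)
        unlink(tempfilepaths[i]);   /* may be slow on bad storage */

    RESUME_INTERRUPTS();

    CHECK_FOR_INTERRUPTS();         /* only now can the barrier be absorbed */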



On Sat, May 7, 2022 at 4:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Done.  Time to watch the build farm.

So far "grison" failed.  I think it's probably just that the test
forgot to wait for replay of CREATE EXTENSION before using pg_prewarm
on the standby, hence "ERROR:  function pg_prewarm(oid) does not exist
at character 12".  I'll wait for more animals to report before I try
to fix that tomorrow.



On Sat, May 7, 2022 at 4:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I think we'll probably also want to invent a way
> to report which backend is holding up progress, since without that
> it's practically impossible for an end user to understand why their
> command is hanging.

Simple idea: how about logging the PID of processes that block
progress for too long?  In the attached, I arbitrarily picked 5
seconds as the wait time between LOG messages.  Also, DEBUG1 messages
let you see the processing speed on eg build farm animals.  Thoughts?

To test this, kill -STOP a random backend, and then try an ALTER
DATABASE SET TABLESPACE in another backend.  Example output:

DEBUG:  waiting for all backends to process ProcSignalBarrier generation 1
LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
STATEMENT:  alter database mydb set tablespace ts1;
LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
STATEMENT:  alter database mydb set tablespace ts1;
LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
STATEMENT:  alter database mydb set tablespace ts1;
LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
STATEMENT:  alter database mydb set tablespace ts1;

... then kill -CONT:

DEBUG:  finished waiting for all backends to process ProcSignalBarrier
generation 1
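
The change itself is small: the per-slot wait loop in
WaitForProcSignalBarrier() now sleeps on the slot's condition variable
with a timeout and logs each time it expires, roughly like this
(paraphrased from the attached patch, so details may differ):

    uint64      oldval;

    oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
    while (oldval < generation)
    {
        if (ConditionVariableTimedSleep(&slot->pss_barrierCV,
                                        5000,   /* ms between LOG lines */
                                        WAIT_EVENT_PROC_SIGNAL_BARRIER))
            ereport(LOG,
                    (errmsg("still waiting for pid %d to accept ProcSignalBarrier",
                            (int) slot->pss_pid)));
        oldval = pg_atomic_read_u64(&slot->pss_barrierGeneration);
    }
    ConditionVariableCancelSleep();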

Another thought is that it might be nice to be able to test with a
dummy PSB that doesn't actually do anything.  You could use it to see
how fast your system processes it, while doing various other things,
and to find/debug problems in other code that fails to handle
interrupts correctly.  Here's an attempt at that.  I guess it could go
into a src/test/modules/something instead of core, but on the other
hand the PSB itself has to be in core anyway, so maybe not.  Thoughts?
 No documentation yet, just seeing if people think this is worth
having... better names/ideas welcome.

To test this, just SELECT pg_test_procsignal_barrier().
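
In case it helps review: the core of it is just a new barrier type
whose per-backend handler does nothing, plus a SQL-callable wrapper
along these lines (sketch only, written module-style here;
PROCSIGNAL_BARRIER_TEST is a placeholder name):

    #include "postgres.h"

    #include "fmgr.h"
    #include "storage/procsignal.h"

    PG_FUNCTION_INFO_V1(pg_test_procsignal_barrier);

    Datum
    pg_test_procsignal_barrier(PG_FUNCTION_ARGS)
    {
        /*
         * Emit a barrier whose handler is a no-op, then wait for every
         * backend to absorb it.  Useful for measuring how quickly the
         * system processes barriers and for flushing out code paths that
         * don't handle interrupts promptly.
         */
        WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_TEST));

        PG_RETURN_VOID();
    }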

Attachment
On Sun, May 8, 2022 at 7:30 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Simple idea: how about logging the PID of processes that block
> progress for too long?  In the attached, I arbitrarily picked 5
> seconds as the wait time between LOG messages.  Also, DEBUG1 messages
> let you see the processing speed on eg build farm animals.  Thoughts?
>
> To test this, kill -STOP a random backend, and then try an ALTER
> DATABASE SET TABLESPACE in another backend.  Example output:
>
> DEBUG:  waiting for all backends to process ProcSignalBarrier generation 1
> LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
> STATEMENT:  alter database mydb set tablespace ts1;
> LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
> STATEMENT:  alter database mydb set tablespace ts1;
> LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
> STATEMENT:  alter database mydb set tablespace ts1;
> LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
> STATEMENT:  alter database mydb set tablespace ts1;

This is a very good idea.

> Another thought is that it might be nice to be able to test with a
> dummy PSB that doesn't actually do anything.  You could use it to see
> how fast your system processes it, while doing various other things,
> and to find/debug problems in other code that fails to handle
> interrupts correctly.  Here's an attempt at that.  I guess it could go
> into a src/test/modules/something instead of core, but on the other
> hand the PSB itself has to be in core anyway, so maybe not.  Thoughts?
>  No documentation yet, just seeing if people think this is worth
> having... better names/ideas welcome.

I did this at one point, but I wasn't convinced it was going to find
enough bugs to be worth committing. It's OK if you're convinced of
things that didn't convince me, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Tue, May 10, 2022 at 1:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, May 8, 2022 at 7:30 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > LOG:  still waiting for pid 1651417 to accept ProcSignalBarrier
> > STATEMENT:  alter database mydb set tablespace ts1;

> This is a very good idea.

OK, I pushed this, after making the ereport call look a bit more like
others that talk about backend PIDs.

> > Another thought is that it might be nice to be able to test with a
> > dummy PSB that doesn't actually do anything.  You could use it to see
> > how fast your system processes it, while doing various other things,
> > and to find/debug problems in other code that fails to handle
> > interrupts correctly.  Here's an attempt at that.  I guess it could go
> > into a src/test/modules/something instead of core, but on the other
> > hand the PSB itself has to be in core anyway, so maybe not.  Thoughts?
> >  No documentation yet, just seeing if people think this is worth
> > having... better names/ideas welcome.
>
> I did this at one point, but I wasn't convinced it was going to find
> enough bugs to be worth committing. It's OK if you're convinced of
> things that didn't convince me, though.

I'll leave this here for now.



On Sat, May 7, 2022 at 9:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> So far "grison" failed.  I think it's probably just that the test
> forgot to wait for replay of CREATE EXTENSION before using pg_prewarm
> on the standby, hence "ERROR:  function pg_prewarm(oid) does not exist
> at character 12".  I'll wait for more animals to report before I try
> to fix that tomorrow.

That one was addressed by commit a22652e.  Unfortunately two new kinds
of failure showed up:

Chipmunk, another little early model Raspberry Pi:

error running SQL: 'psql:<stdin>:1: ERROR:  source database
"conflict_db_template" is being accessed by other users
DETAIL:  There is 1 other session using the database.'
while running 'psql -XAtq -d port=57394 host=/tmp/luyJopPv9L
dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CREATE DATABASE
conflict_db TEMPLATE conflict_db_template OID = 50001;' at
/home/pgbfarm/buildroot/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm
line 1836.

I think that might imply that when you do two
$node_primary->safe_psql() calls in a row, the backend running the
second one might still see the ghost of the first backend in the
database, even though the first psql process has exited.  Hmmm.

Skink, the valgrind animal, also failed.  After the first 8 tests, it times out:

[07:18:26.237](14.827s) ok 8 - standby: post move contents as expected
[07:18:42.877](16.641s) Bail out!  aborting wait: program timed out
stream contents: >><<
pattern searched for: (?^m:warmed_buffers)

That'd be in the perl routine cause_eviction(), while waiting for a
pg_prewarm query to return, but I'm not yet sure what's going on (I
have to suspect it's in the perl scripting rather than anything in the
server, given the location of the failure).  Will try to repro
locally.



On Thu, May 12, 2022 at 3:13 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Chipmunk, another little early model Raspberry Pi:
>
> error running SQL: 'psql:<stdin>:1: ERROR:  source database
> "conflict_db_template" is being accessed by other users
> DETAIL:  There is 1 other session using the database.'

Oh, for this one I think it may just be that the autovacuum worker
with PID 23757 took longer to exit than the 5 seconds
CountOtherDBBackends() is prepared to wait, after sending it SIGTERM.
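
For anyone wondering where the five seconds comes from, the wait in
CountOtherDBBackends() is shaped roughly like this (paraphrased; the
helper below stands in for the real ProcArray scan):

    for (int tries = 0; tries < 50; tries++)    /* 50 x 100ms = 5 seconds */
    {
        /* hypothetical stand-in for the real shared-memory scan */
        if (!OtherDBBackendsExist(databaseId))
            return false;       /* all clear, CREATE DATABASE can proceed */

        /* any autovacuum workers found get SIGTERM (once each) */

        pg_usleep(100 * 1000L); /* sleep 100ms and check again */
    }

    return true;                /* caller reports the database is still in use */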



On Thu, May 12, 2022 at 4:57 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, May 12, 2022 at 3:13 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > error running SQL: 'psql:<stdin>:1: ERROR:  source database
> > "conflict_db_template" is being accessed by other users
> > DETAIL:  There is 1 other session using the database.'
>
> Oh, for this one I think it may just be that the autovacuum worker
> with PID 23757 took longer to exit than the 5 seconds
> CountOtherDBBackends() is prepared to wait, after sending it SIGTERM.

In this test, autovacuum_naptime is set to 1s (per Andres, AV was
implicated when he first saw the problem with pg_upgrade, hence the desire
to crank it up).  That's not necessary: commenting out the active line
in ProcessBarrierSmgrRelease() shows that the tests reliably reproduce
data corruption without it.  Let's just take that out.

As for skink failing, the timeout was hard-coded at 300s for the whole
test, but apparently that wasn't enough under valgrind.  Let's use the
standard PostgreSQL::Test::Utils::timeout_default (180s usually), but
reset it for each query we send.

See attached.

Attachment
On Thu, May 12, 2022 at 10:20 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> As for skink failing, the timeout was hard-coded at 300s for the whole
> test, but apparently that wasn't enough under valgrind.  Let's use the
> standard PostgreSQL::Test::Utils::timeout_default (180s usually), but
> reset it for each query we send.

@@ -202,6 +198,9 @@ sub send_query_and_wait
  my ($psql, $query, $untl) = @_;
  my $ret;

+ $psql_timeout->reset();
+ $psql_timeout->start();
+
  # send query
  $$psql{stdin} .= $query;
  $$psql{stdin} .= "\n";

This seems fine, but I think you should add a non-trivial comment about it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



On Sat, May 14, 2022 at 3:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> This seems fine, but I think you should add a non-trivial comment about it.

Thanks for looking.  Done, and pushed.  Let's see if 180s per query is enough...