Home > mailing lists

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To: - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:
Date	February 10, 2022 22:14:08
Msg-id	20220210221408.b3joflmr4x5opipn@alap3.anarazel.de Whole thread Raw
In response to	Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To: (Robert Haas <robertmhaas@gmail.com>)
List	pgsql-hackers

Tree view

Hi,

On 2022-02-10 13:49:50 -0500, Robert Haas wrote:
> I agree. While I feel sort of bad about missing this issue in review,
> I also feel like it's pretty surprising that there isn't something
> plugging this hole already. It feels unexpected that our FD management
> layer might hand you an FD that references the previous file that had
> the same name instead of the one that exists right now, and I don't
> think it's too much of a stretch of the imagination to suppose that
> this won't be the last problem we have if we don't fix it.

Arguably it's not really the FD layer that's the issue - it's on the smgr
level. But that's semantics...

> The main question in my mind is who is going to actually make that
> happen. It was your idea (I think), Thomas coded it, and my commit
> made it a live problem. So who's going to get something committed
> here?

I think it'd make sense for Thomas to commit the current patch for the
tablespace issues first. And then we can go from there?

> > If we add such barriers to dropdb(), XLOG_DBASE_DROP redo, DropTableSpace(),
> > XLOG_TBLSPC_DROP redo, I think all the "new" ways to hit this are foreclosed.
> 
> What about movedb()?

My first reaction was that it should be fine, because RelFileNode.spcNode
would differ. But of course that doesn't protect against moving a database
back and forth. So yes, you're right, it's needed there too.

Are there other commands that could be problematic? ALTER TABLE SET TABLESPACE
should be fine, that allocates a new relfilenode.

> I definitely think it would be nice to cover this issue not only for
> database OIDs and tablespace OIDs but also for relfilenode OIDs.
> Otherwise we're right on the cusp of having the same bug in that case.
> pg_upgrade doesn't drop and recreate tables the way it does databases,
> but if somebody made it do that in the future, would they think about
> this? I bet not.

And even just plugging the wraparound aspect would be great. It's not actually
*that* hard to wrap oids around, because of toast.

> Dilip's been working on your idea of making relfilenode into a 56-bit
> quantity. That would make the problem of detecting wraparound go away.
> In the meantime, I suppose we could do think of trying to do something
> here:
> 
>                         /* wraparound, or first post-initdb
> assignment, in normal mode */
>                         ShmemVariableCache->nextOid = FirstNormalObjectId;
>                         ShmemVariableCache->oidCount = 0;
> 
> That's a bit sketch, because this is the OID counter, which isn't only
> used for relfilenodes.

I can see a few ways to deal with that:

1) Split nextOid into two, nextRelFileNode and nextOid. Have the wraparound
   behaviour only in the nextRelFileNode case.  That'd be a nice improvement,
   but not entirely trivial, because we currently use GetNewRelFileNode() for
   oid assignment as well (c.f. heap_create_with_catalog()).

2) Keep track of the last assigned relfilenode in ShmemVariableCache, and wait
   for a barrier in GetNewRelFileNode() if the assigned oid is wrapped
   around.  While there would be a chance of "missing" a ->nextOid wraparound,
   e.g. because of >2**32 toast oids being allocated, I think that could only
   happen if there were no "wrapped around relfilenodes" allocated, so that
   should be ok.

3) Any form of oid wraparound might cause problems, not just relfilenode
   issues. So having a place to emit barriers on oid wraparound might be a
   good idea.

One thing I wonder is how to make sure the path would be tested. Perhaps we
could emit the anti-wraparound barriers more frequently when assertions are
enabled?

> I'm not sure whether there would be problems waiting for a barrier here - I
> think we might be holding some lwlocks in some code paths.

It seems like a bad idea to call GetNewObjectId() with lwlocks held. From what
I can see the two callers of GetNewObjectId() both have loops that can go on
for a long time after a wraparound.

> > I think we need to add some debugging code to detect "fd to deleted file of
> > same name" style problems. Especially during regression tests they're quite
> > hard to notice, because most of the contents read/written are identical across
> > recycling.
> 
> Maybe. If we just plug the holes here, I am not sure we need permanent
> instrumentation. But I'm also not violently opposed to it.

The question is how we can find all the places we need to add barriers emit &
wait to. I e.g. didn't initially didn't realize that there's wraparound danger
in WAL replay as well.

Greetings,

Andres Freund

pgsql-hackers by date:

From: Thomas Munro
Date: 10 February 2022, 22:12:49
Subject: Re: Windows now has fdatasync()

From: Andres Freund
Date: 10 February 2022, 22:26:59
Subject: Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:

Re: wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To: - Mailing list pgsql-hackers

Previous

Next