Thread: Re: [BUGS] BUG #11867: Strange behaviour with composite types after resetting database tablespace

marc@bloodnok.com writes:
> I have a script (below) that sets and resets the tablespace for a database
> and drops and recreates a composite type.  The script fails when trying to
> re-create the previously dropped composite type because the type has
> apparently been magically resuscitated by changing the tablespace for the
> database.  I assume this is a bug.  

<spock>Fascinating.</spock>

I believe what is happening is:

* You create the composite type.  This results in entries in the
database's pg_class and pg_type tables.  These entries are in blocks
of those tables that are in shared buffers, with buffer tags (RelFileNodes)
that mention the database's original tablespace (tbs3).

* You move the database to another tablespace.  We faithfully flush
the dirty catalog blocks to disk and then copy them somewhere else
on disk.

* You drop the composite type.  This entails reading the relevant
blocks of the database's pg_class and pg_type tables in from their
new tablespace and marking the catalog rows deleted.  This happens in
shared buffers whose tags mention the database's new
tablespace (pg_default, in this example).

* You move the database back to its original tablespace.  We faithfully
flush the dirty catalog blocks to disk and then copy them back to the
original on-disk location.

However: at no point in this sequence did we flush the original catalog
blocks out of shared buffers.  Therefore, a read of the database's
pg_class or pg_type at this point is likely to find the previous states of
those pages (pre-composite-type-drop) as valid, matching entries, so that
we skip reading the up-to-date page contents back from disk.
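
(To spell out why the stale pages match again: a shared buffer is looked
up by a tag that includes the tablespace OID.  Roughly, from
src/include/storage/relfilenode.h and src/include/storage/buf_internals.h:

    typedef struct RelFileNode
    {
        Oid         spcNode;        /* tablespace OID */
        Oid         dbNode;         /* database OID */
        Oid         relNode;        /* relation filenode */
    } RelFileNode;

    typedef struct buftag
    {
        RelFileNode rnode;          /* physical relation identifier */
        ForkNumber  forkNum;        /* which fork of the relation */
        BlockNumber blockNum;       /* block number within the fork */
    } BufferTag;

Once the database is back in tbs3, the leftover buffers carry exactly the
same spcNode, dbNode, relNode and block numbers a reader asks for, so the
hash lookup succeeds and the up-to-date on-disk copy is never read.)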

A quick fix for this would be to issue DropDatabaseBuffers during
movedb(), though I'm not sure offhand where in that sequence is the
safest place for it.  We could be more restrictive and only drop
buffers that belong to the source tablespace, but I'm not sure it's
worth any extra code to do so.
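
For concreteness, the sort of thing I have in mind (untested, and the
exact spot within movedb() as well as the local variable name are
assumptions, not a worked-out patch):

    /* In movedb(), src/backend/commands/dbcommands.c: once we've made
     * sure no one else is using the database, and before copying its
     * files to the new tablespace, throw away every shared buffer
     * belonging to it so that no stale page can be rediscovered after
     * a later move back. */
    DropDatabaseBuffers(db_id);     /* OID of the database being moved */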

> This is not mission-critical for me but I'd be grateful for suggestions for
> work-arounds.

I don't see any workaround that's much easier than fixing the bug.
But what's your use case that involves flipping databases from one
tablespace to another and then back again within the life expectancy of
unused shared-buffers pages?  It doesn't seem terribly surprising that
nobody noticed this before ...
        regards, tom lane



On Mon, 2014-11-03 at 22:08 -0500, Tom Lane wrote:
> <spock>Fascinating.</spock>

:-)

> I believe what is happening is:

[ . . . ]

> > This is not mission-critical for me but I'd be grateful for suggestions for
> > work-arounds.
>
> I don't see any workaround that's much easier than fixing the bug.
> But what's your use case that involves flipping databases from one
> tablespace to another and then back again within the life expectancy of
> unused shared-buffers pages?  It doesn't seem terribly surprising that
> nobody noticed this before ...

It's a completely unreal use case.  I am working on a diff tool, and
this was found as part of my test suite: build db x, drop db x, build
db y, apply diff y->x, compare with original x, apply diff x->y, compare
with original y.

So this is a fairly minor inconvenience for me.

__
Marc

I wrote:
> However: at no point in this sequence did we flush the original catalog
> blocks out of shared buffers.  Therefore, a read of the database's
> pg_class or pg_type at this point is likely to find the previous states of
> those pages (pre-composite-type-drop) as valid, matching entries, so that
> we skip reading the up-to-date page contents back from disk.

> A quick fix for this would be to issue DropDatabaseBuffers during
> movedb()

BTW, I wondered whether the exact same hazard didn't exist for ALTER TABLE
SET TABLESPACE.  At first glance it looks bad, because ATExecSetTableSpace
doesn't forcibly drop old shared buffers for the moved table either.
Closer examination says we're safe though:

* The buffers will be dropped during smgrdounlink at end of transaction,
so you could only be at risk if you moved a table, modified it, and moved
it back in a single transaction.

* A new relfilenode will be assigned during each movement.

* When assigning a relfilenode during the move-back, we'd be certain to
choose a relfilenode different from the original one, because the old
relation still exists on-disk at this point.

So it's safe as things stand; but this seems a bit, um, rickety.  Should
we insert a DropRelFileNodesAllBuffers call into ATExecSetTableSpace to
make it safer?  It's kind of annoying to have to scan the buffer pool
twice, but I'm afraid that sometime in the future somebody will try to get
clever about optimizing away the end-of-transaction buffer pool scan, and
break this case.  Or maybe I'm just worrying too much.
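
For concreteness, what I'm imagining is roughly this at the end of
ATExecSetTableSpace() in tablecmds.c (untested sketch; the local variable
names are invented, and the exact call signature should be double-checked
against bufmgr.c):

    RelFileNodeBackend  oldbuf;

    /* Explicitly discard any buffers still tagged with the relation's
     * pre-move relfilenode, rather than relying on the
     * end-of-transaction smgrdounlink scan to get rid of them. */
    oldbuf.node = oldFileNode;          /* the pre-move RelFileNode */
    oldbuf.backend = rel->rd_backend;   /* InvalidBackendId for plain tables */
    DropRelFileNodesAllBuffers(&oldbuf, 1);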
        regards, tom lane



On 11/4/14, 10:45 AM, Tom Lane wrote:
> So it's safe as things stand; but this seems a bit, um, rickety.  Should
> we insert a DropRelFileNodesAllBuffers call into ATExecSetTableSpace to
> make it safer?  It's kind of annoying to have to scan the buffer pool
> twice, but I'm afraid that sometime in the future somebody will try to get
> clever about optimizing away the end-of-transaction buffer pool scan, and
> break this case.  Or maybe I'm just worrying too much.

We could possibly remember what relations have been moved to different tablespaces in a transaction and avoid the call
in that case, but that seems rather silly.

If there's any non-trivial actual data involved then presumably the buffer scan won't be noticed amidst all the I/O, so
this would only be an issue if you're playing games with mostly empty tables, and only then if you've got really large
shared buffers.  I can't actually come up with any real use case for that.

So +1 for doing the safe thing...
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



On Tue, Nov 4, 2014 at 11:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> However: at no point in this sequence did we flush the original catalog
>> blocks out of shared buffers.  Therefore, a read of the database's
>> pg_class or pg_type at this point is likely to find the previous states of
>> those pages (pre-composite-type-drop) as valid, matching entries, so that
>> we skip reading the up-to-date page contents back from disk.
>
>> A quick fix for this would be to issue DropDatabaseBuffers during
>> movedb()
>
> BTW, I wondered whether the exact same hazard didn't exist for ALTER TABLE
> SET TABLESPACE.  At first glance it looks bad, because ATExecSetTableSpace
> doesn't forcibly drop old shared buffers for the moved table either.
> Closer examination says we're safe though:
>
> * The buffers will be dropped during smgrdounlink at end of transaction,
> so you could only be at risk if you moved a table, modified it, and moved
> it back in a single transaction.
>
> * A new relfilenode will be assigned during each movement.
>
> * When assigning a relfilenode during the move-back, we'd be certain to
> choose a relfilenode different from the original one, because the old
> relation still exists on-disk at this point.
>
> So it's safe as things stand; but this seems a bit, um, rickety.  Should
> we insert a DropRelFileNodesAllBuffers call into ATExecSetTableSpace to
> make it safer?  It's kind of annoying to have to scan the buffer pool
> twice, but I'm afraid that sometime in the future somebody will try to get
> clever about optimizing away the end-of-transaction buffer pool scan, and
> break this case.  Or maybe I'm just worrying too much.

I think you're worrying too much.  Adding a comment to the code that
drives the end-of-transaction buffer pool scan to warn future
optimizers not to be too clever might be warranted; adding run-time
overhead to avoid a bug we don't have seems excessive.
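
Something along these lines next to the buffer-drop in smgrdounlink()
might do it (wording and exact placement just a suggestion):

    /*
     * Note: ALTER TABLE SET TABLESPACE currently relies on this
     * end-of-transaction unlink path to flush out buffers still tagged
     * with the relation's old relfilenode.  Don't optimize this scan
     * away without giving ATExecSetTableSpace an explicit
     * DropRelFileNodesAllBuffers call of its own.
     */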

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company