Thread: Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?

I wrote:
> ... what this sounds like is a
> problem with somebody fetching temporary-table blocks into shared memory
> (where they should never be), and then things going wrong after the
> owning backend drops the temp table (without having cleared out shared
> buffers, which it won't do because it doesn't think it needs to).  Can
> you say what was the exact command(s) you were using with pgstattuple?

A quick look at contrib/pgstattuple shows that it makes no effort
whatsoever to avoid reading temp tables belonging to other sessions.
So even if that wasn't Stuart's problem (and I'll bet it was), this
is quite broken.

There is no way that pgstattuple can compute valid stats for temp
tables of other sessions; it doesn't have access to pages in the other
sessions' temp buffers.  It seems that the alternatives we have are
to make it throw error, or to silently return zeroes (or perhaps
nulls?).  Neither one is tremendously appetizing.  The former would
be especially unhelpful if someone tried to write a query to apply
pgstattuple across all pg_class entries, which I kinda suspect is
what Stuart did.

Opinions?
        regards, tom lane


Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?

From
Stuart Bishop
Date:
On Tue, Mar 31, 2009 at 11:20 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> A quick look at contrib/pgstattuple shows that it makes no effort
> whatsoever to avoid reading temp tables belonging to other sessions.
> So even if that wasn't Stuart's problem (and I'll bet it was), this
> is quite broken.
>
> There is no way that pgstattuple can compute valid stats for temp
> tables of other sessions; it doesn't have access to pages in the other
> sessions' temp buffers.  It seems that the alternatives we have are
> to make it throw error, or to silently return zeroes (or perhaps
> nulls?).  Neither one is tremendously appetizing.  The former would
> be especially unhelpful if someone tried to write a query to apply
> pgstattuple across all pg_class entries, which I kinda suspect is
> what Stuart did.

This is exactly what happened, and temporary tables belonging to other
sessions where fed to pgstattuple.


--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/


Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> A quick look at contrib/pgstattuple shows that it makes no effort
> whatsoever to avoid reading temp tables belonging to other sessions.

contrib/pageinspect has the same bug. Not surprising as it was largely 
inspired by pgstattuple.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?

From
Heikki Linnakangas
Date:
Stuart Bishop wrote:
> On Tue, Mar 31, 2009 at 11:20 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
>> A quick look at contrib/pgstattuple shows that it makes no effort
>> whatsoever to avoid reading temp tables belonging to other sessions.
>> So even if that wasn't Stuart's problem (and I'll bet it was), this
>> is quite broken.
>>
>> There is no way that pgstattuple can compute valid stats for temp
>> tables of other sessions; it doesn't have access to pages in the other
>> sessions' temp buffers.  It seems that the alternatives we have are
>> to make it throw error, or to silently return zeroes (or perhaps
>> nulls?).  Neither one is tremendously appetizing.  The former would
>> be especially unhelpful if someone tried to write a query to apply
>> pgstattuple across all pg_class entries, which I kinda suspect is
>> what Stuart did.
> 
> This is exactly what happened, and temporary tables belonging to other
> sessions where fed to pgstattuple.

+1 for throwing an error. That's what we do for views, composite types, 
and GIN indexes as well. If you want to write a query to call 
pgstattuple for all tables in pg_class, you'll need to exclude all those 
cases anyway. To exclude temp tables of other sessions, you'll need to 
add "AND pg_is_other_temp_schema(relnamespace)".

I'm ok with returning NULLs as well, but returning zeroes doesn't feel 
right.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


On Tue, Mar 31, 2009 at 2:20 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

>> This is exactly what happened, and temporary tables belonging to other
>> sessions where fed to pgstattuple.
>
> +1 for throwing an error. That's what we do for views, composite types, and
> GIN indexes as well. If you want to write a query to call pgstattuple for
> all tables in pg_class, you'll need to exclude all those cases anyway. To
> exclude temp tables of other sessions, you'll need to add "AND
> pg_is_other_temp_schema(relnamespace)".

I would have expected an exception to be raised personally.

> I'm ok with returning NULLs as well, but returning zeroes doesn't feel
> right.


-- 
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/


Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> A quick look at contrib/pgstattuple shows that it makes no effort
>> whatsoever to avoid reading temp tables belonging to other sessions.

> contrib/pageinspect has the same bug. Not surprising as it was largely 
> inspired by pgstattuple.

Given the seriousness of the consequences (forced database shutdown is
no fun), I wonder whether we should install some low-level defense
against this type of problem; ie teach ReadBuffer to throw error if
asked to read a block from someone else's temp table.

This isn't entirely trivial because it's presently expensive to
determine whether a table is someone else's temp table: it takes a
system catalog lookup.  I'm not even sure that it'd be safe to have
the relcache do it and cache the result --- it could lead to infinite
recursion.  (At the very least this would promote pg_namespace into
the set of critical relcache entries.)

The solution that seems most practical to me is to add a bool column
to pg_class indicating "this is a temp table".  Then, if that flag
is set but it's not our own temp table (which we can tell easily),
refuse to read.  However, a patch of that size would take a little
while to develop, and I'm not entirely sure it's worth the trouble.
I can't remember having seen bugs of this type before.

Comments?
        regards, tom lane


Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?

From
Alvaro Herrera
Date:
Tom Lane wrote:

> The solution that seems most practical to me is to add a bool column
> to pg_class indicating "this is a temp table".  Then, if that flag
> is set but it's not our own temp table (which we can tell easily),
> refuse to read.  However, a patch of that size would take a little
> while to develop, and I'm not entirely sure it's worth the trouble.
> I can't remember having seen bugs of this type before.

If we had had this defense in place, it would have been obvious that
reindex and cluster were buggy.  The code to skip temp tables was not
there from the beginning.

(We already have rel->rd_istemp, but it's not what we need here.)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> I can't remember having seen bugs of this type before.

> If we had had this defense in place, it would have been obvious that
> reindex and cluster were buggy.  The code to skip temp tables was not
> there from the beginning.

I thought my memory was probably failing me (excuse: no caffeine yet).

> (We already have rel->rd_istemp, but it's not what we need here.)

Yeah.  I was considering converting that into a three-state flag, but
it might be simpler to remove it altogether and look to the new pg_class
field; only after we've gone down the path into localbuf.c would we
check relnamespace == our temp namespace before permitting a read or write.

Barring objections, I'll go make that happen.  (And fix the contrib bugs
too, but not till after ... I'll need a test case ;-))
        regards, tom lane


Re: [GENERAL] pgstattuple triggered checkpoint failure and database outage?

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> A quick look at contrib/pgstattuple shows that it makes no effort
>>> whatsoever to avoid reading temp tables belonging to other sessions.
> 
>> contrib/pageinspect has the same bug. Not surprising as it was largely 
>> inspired by pgstattuple.
> 
> Given the seriousness of the consequences (forced database shutdown is
> no fun), I wonder whether we should install some low-level defense
> against this type of problem; ie teach ReadBuffer to throw error if
> asked to read a block from someone else's temp table.

That would be nice.

> This isn't entirely trivial because it's presently expensive to
> determine whether a table is someone else's temp table: it takes a
> system catalog lookup.   I'm not even sure that it'd be safe to have
> the relcache do it and cache the result --- it could lead to infinite
> recursion.  (At the very least this would promote pg_namespace into
> the set of critical relcache entries.)

You could hard code that PG_CATALOG_NAMESPACE is not a temp namespace. I 
believe that would stop the recursion. Would that avoid promoting 
pg_namespace to critical status, too?

> The solution that seems most practical to me is to add a bool column
> to pg_class indicating "this is a temp table".  Then, if that flag
> is set but it's not our own temp table (which we can tell easily),
> refuse to read.  However, a patch of that size would take a little
> while to develop, and I'm not entirely sure it's worth the trouble.
> I can't remember having seen bugs of this type before.

In addition to the one Alvaro mentioned, I recall having problems with 
this when working on the patch to allow temporary file access with two 
phase commit in autumn.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> This isn't entirely trivial because it's presently expensive to
>> determine whether a table is someone else's temp table: it takes a
>> system catalog lookup.   I'm not even sure that it'd be safe to have
>> the relcache do it and cache the result --- it could lead to infinite
>> recursion.  (At the very least this would promote pg_namespace into
>> the set of critical relcache entries.)

> You could hard code that PG_CATALOG_NAMESPACE is not a temp namespace. I 
> believe that would stop the recursion.

True.

> Would that avoid promoting 
> pg_namespace to critical status, too?

Not sure.  It'd still be something you have to access while loading
most relcache entries.

On balance the extra pg_class column seems like a more robust and useful
solution.  I've wished for a clean way to see temp-table-ness at the SQL
level before.  And we've already changed pg_class's rowtype for 8.4, so
it shouldn't pose any additional hardship from an application
compatibility standpoint.
        regards, tom lane


I wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> (We already have rel->rd_istemp, but it's not what we need here.)

> Yeah.  I was considering converting that into a three-state flag, but
> it might be simpler to remove it altogether and look to the new pg_class
> field; only after we've gone down the path into localbuf.c would we
> check relnamespace == our temp namespace before permitting a read or write.

I looked at the uses of this flag a bit further.  It seems that
rd_istemp is being used with three subtly different meanings:
* do we need to WAL-log operations on this relation* do we need to fsync this relation during checkpoint* do we use
sharedor local buffers for this relation
 

I thought briefly about trying to make these shades of meaning explicit,
eg by introducing macros like RelationUsesWal(rel).  It doesn't seem
quite worth the trouble though.  There would be a lot of places to
change, and if we wanted to take it completely seriously, we'd have to
pass more than one argument to, for example, smgrtruncate() (when it
passes isTemp to DropRelFileNodeBuffers, the bufmgr is wanting to know
about the buffer classification; but mdtruncate wants to know about
fsync behavior).

For the buffer-classification meaning we want to consider local and
nonlocal temp tables the same, so that control passes down to localbuf.c
where we can put the error test.  For the other two meanings it probably
doesn't matter, since in principle we shouldn't get as far as making
such decisions for a nonlocal temp table --- but if we did, we'd want
to treat local and nonlocal temp tables alike, ie, no WAL or fsync.

In short, there are only a *very* small number of places, codewise,
where we want to distinguish local from nonlocal temp tables.  The test
to be added to localbuf.c is one, and RELATION_IS_LOCAL() is another
(it must not succeed for nonlocals), and right now I don't see any
others.

So my inclination is to leave rd_istemp as a boolean field of
RelationData so as to minimize textual code changes, but redefine it as
a copy of pg_class.relistemp (so that it will now be true for both local
and nonlocal temp tables).

I'm also thinking of adding a bool field rd_islocaltemp to be true only
for local temp tables.  This isn't really necessary in terms of code
cleanliness, but it will cache the result of isTempOrToastNamespace so
that we don't have to repeat that call every time through localbuf.c.
(Although it's not *that* expensive, so maybe this is overkill...)

Thoughts, better ideas?
        regards, tom lane